What’s in My Pocket? Read it now! (or Read It Later)

Introduction

You know what’s awesome? Pocket.

I mean, sure, it’s not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It’s pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. Being a big fan of reading (and also of procrastination), I was really glad to discover this application. Now I can still catch up on the latest Lifehacker even if I am on the subway and don’t have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get a hold of your data. The website has an export function which dumps everything you’ve ever added to your reading list into a single HTML file.

Having the URL of every article you’ve ever read in Pocket is handy, as you can revisit all the articles you’ve saved. But there’s more to it than that. The HTML export also contains the time each article was added (as a UNIX epoch timestamp). Combine this with an XML or JSON dump from the API, and now we’ve got some data to work with.
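For anyone who wants to do the same, pulling the export into something workable takes only a few lines of R. This is just a sketch: it assumes each saved article appears in the export as an anchor (a) tag with the URL in an href attribute and the added time in a time_added attribute, so check it against your own file.

# Rough sketch: read the Pocket HTML export into a data frame of URLs and added times
library(XML)

doc   <- htmlParse("ril_export.html")   # whatever the export page gives you
links <- getNodeSet(doc, "//a")

pocket <- data.frame(
  url   = sapply(links, xmlGetAttr, "href"),
  added = as.POSIXct(as.numeric(sapply(links, xmlGetAttr, "time_added")),
                     origin = "1970-01-01", tz = "America/Toronto"),
  stringsAsFactors = FALSE
)
head(pocket)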

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 – 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionality, I wrote a simple Python script using webarticle2text, which is available on github. This script downloaded all the text from each article URL and appended it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).
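The text extraction itself is webarticle2text’s department, so I won’t reproduce that here, but the bookkeeping around it is simple. A rough R equivalent of the word counting and ‘TLD’ extraction (really just the host part of each URL, which is what the by-site breakdowns below use), reusing the pocket data frame from the sketch above:

# Crude word count: split whatever text came back for an article on whitespace
word_count <- function(txt) length(strsplit(txt, "\\s+")[[1]])

# "TLD" extraction: strip the scheme, the path, and a leading www. from each URL
host_of <- function(url) {
  host <- sub("^https?://", "", url)   # drop http:// or https://
  host <- sub("/.*$", "", host)        # drop everything after the first slash
  sub("^www\\.", "", host)             # drop a leading www.
}

pocket$domain <- sapply(pocket$url, host_of)
sort(table(pocket$domain), decreasing = TRUE)[1:10]   # top 10 sites saved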

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:

And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:

Head and shoulders above all other websites, at nearly half of all articles saved, is Psychology Today. I would just like to be on the record as saying: don’t hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women, but I find the majority of its articles quite interesting (as evidenced by the number of articles I have read). Perhaps other men are not that interested in the goings-on in their own and other people’s heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (it detects if the browser is a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men’s magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two timestamps: time added and time ‘updated’. Looking at the data, I believe ‘updated’ covers multiple actions on an article, like marking it as read, adding tags, re-downloading, et cetera. It would be ideal to have the date/time when each article was marked as read, as then further interesting analysis could be done: for example, looking at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.

The most salient features are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done: on my commute to and from work on public transit.

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours of the evening. Although I sometimes do read articles (usually through the browser) at these hours, I think this may correspond to things like adding tags, or to a delay in syncing between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can’t see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.

We can also look at the daily volume of articles added:

This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where a large number are. Looking at the distribution of the number of articles added daily, we see an exponential-type distribution:
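For reference, the aggregation behind these last two charts is only a few lines of R; a sketch, again using the pocket data frame from earlier (note that days on which nothing was added are simply absent from the counts):

# Count how many articles were added on each calendar day
daily <- as.data.frame(table(as.Date(pocket$added, tz = "America/Toronto")))
names(daily) <- c("date", "n")
daily$date <- as.Date(daily$date)

# Daily volume over time, then the distribution of the daily counts
plot(daily$date, daily$n, type = "h", xlab = "Date", ylab = "Articles added")
hist(daily$n, breaks = 30, xlab = "Articles added per day", main = "")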

Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I’ve been reading:

Hmmmmm.

Well, that doesn’t look quite right. Did I really read an article 40,000 words long? That’s about 64 pages, isn’t it? Looking at the URLs for the articles with tens of thousands of words, I could see that those articles were the result of malfunctions in the Pocket article parser, the webarticle2text script, or both. For example, the 40,000-word article was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensical:

This is a little more like what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes at the low end also correspond to failures of the article parsers.

Now what about the text content of the articles? I really do enjoy a good wordcloud; however, I know that some people tend to look down upon them, because there are alternative ways of depicting the same data which are more informative. But as I said, I do enjoy them, as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:

As you can see, there are issues with the download script, as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for “don’t” and “are”, or possibly “you’re”). But it appears that my recreational reading matches the most common subjects of its main sources. The majority of my reading was from Psychology Today, and so the number one word we see is “people”. I also read a lot of articles from men’s magazines, and so we see words which I suspect primarily come from there (“women”, “social”, “sex”, “job”), as well as from the psych articles.
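A rough sketch of how the word counts (and the wordcloud below) can be produced with the tm and wordcloud packages in R, with the conglomerate text file as input:

# Build a corpus from the conglomerate text file, clean it up, remove stop
# words, and count term frequencies. as.matrix() can get memory-hungry on a
# big corpus; slam::row_sums is the thriftier route.
library(tm)
library(wordcloud)

txt  <- readLines("articles.txt")
corp <- Corpus(VectorSource(txt))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))

tdm  <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

head(freq, 25)                                   # the top 25 words
wordcloud(names(freq), freq, max.words = 100)    # and the pretty version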

And now the pretty visualization:

Seeing the content of what I read depicted this way has made me have some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I’m glad data is in there as a ‘big word’ (just above ‘person’), though maybe not as big as some of the others. I’ve just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I’ll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: “The more that you read, the more things you will know. The more that you learn, the more places you’ll go!”

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily of exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What’s In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text

omg lol brb txt l8r – Text Message Analysis, 2011-2012

Introduction

I will confess, I don’t really like texting. I communicate through text messages, because it does afford many conveniences, and occupies a sort of middle ground between actual conversation and email, but that doesn’t mean that I like it.

Even though I would say I text a fair bit, more than some other Luddites I know, I’m not a serial texter. I’m not like one of these 14-year-old girls who sends thousands of text messages a day (about what, exactly?).

I recall reading about one such girl in the UK who sent in excess of 100,000 text messages one month. Unfortunately her poor parents received a rather hefty phone bill, as she did this without knowing she did not have an unlimited texting plan. But seriously, what the hell did she write? Even if she only wrote one word per text message, 100,000 words is ~200 pages of text. She typed all that out on a mobile phone keyboard (or even worse, a touch screen)? That would be a sizeable book.

If you do the math it’s even crazier in terms of time. There are only 24 hours in the day, so assuming little Miss Teen Texter of the Year did not sleep, she still would have had to send those 100,000 messages in a 24 * 30 = 720 hour period, which averages out to about one message every 25 seconds. I think by that point there is really no value added to the conversations you are having. I’m pretty sure I have friends I haven’t said 100,000 words to over all the time that we’ve known each other.

But I digress.

Background

Actually getting all the data out turned out to be much easier than I anticipated. There exists an Android App which will not only back up all your texts (with the option of emailing it to you), but conveniently does so in an XML file with human-readable dates and a provided stylesheet (!). Import the XML file into Excel or other software and boom! You’ve got time series data for every single text message you’ve ever sent.
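Reading the backup into R rather than Excel is not much harder; here is a sketch, with the caveat that the attribute names used (date in epoch milliseconds, type of 1 for received and 2 for sent, body for the message text) should be double-checked against the stylesheet that ships with the export:

# Read the SMS Backup & Restore XML and flatten it into a data frame
library(XML)

doc <- xmlParse("sms_backup.xml")
sms <- getNodeSet(doc, "//sms")

texts <- data.frame(
  when      = as.POSIXct(as.numeric(sapply(sms, xmlGetAttr, "date")) / 1000,
                         origin = "1970-01-01", tz = "America/Toronto"),
  direction = ifelse(sapply(sms, xmlGetAttr, "type") == "2", "Sent", "Received"),
  length    = nchar(sapply(sms, xmlGetAttr, "body")),
  stringsAsFactors = FALSE
)

table(texts$direction)    # the sent/received split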

My data set spans the time from when I first started using an Android phone (July 2011) up to approximately the present, when I last created the backup (August 13th).

In total over this time period (405 days) I sent 3655 messages (~46.8%) and received 4151 (~53.2%) for a grand total of 7806 messages. This averages out to approximately 19 messages per day in total, or a little over one message per waking hour. As I said, I’m not a serial texter. Also I should probably work on responding to messages.

Analysis

First we can get a ‘bird’s eye view’ of the data by plotting a colour-coded data point for each message, with time of day on the y-axis and the date on the x-axis:


Looks like the majority of my texting occurs between the hours of 8 AM and midnight, which is not surprising. As was established in my earlier post on my sleeping patterns, I do enjoy the night life, as you can see from the intermittent activity outside of these hours (midnight to 4 AM). As Dr. Wolfram commented in his personal analytics post, it was interesting to look at the plot and think ‘What does this feature correspond to?’ and then go back and say ‘Ah, I remember that day!’.
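For the curious, the scatterplot above is nothing fancy. Something along these lines, using the texts data frame from the sketch above, gets you most of the way there (the colours are arbitrary):

# One point per message: date on the x-axis, (decimal) hour of day on the y-axis
texts$day  <- as.Date(texts$when, tz = "America/Toronto")
texts$hour <- as.numeric(format(texts$when, "%H")) +
              as.numeric(format(texts$when, "%M")) / 60

plot(texts$day, texts$hour,
     col = ifelse(texts$direction == "Sent", "red", "blue"),
     pch = 20, cex = 0.4, xlab = "Date", ylab = "Hour of day")
legend("topleft", legend = c("Sent", "Received"),
       col = c("red", "blue"), pch = 20)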

It’s also interesting to see the back and forth nature of the messaging. As I mentioned before, the split in Sent and Received is almost 50/50. This is not surprising – we humans call these ‘conversations’.

We can cross-tabulate the data to produce a graph of the total daily volume in SMS: 

Interesting to note here the spiking phenomenon, in what appears to be a somewhat periodic fashion. This corresponds to the fact that there are some days where I do a lot of texting (i.e. carry on several day-long conversations) contrasted with days where I might have one smaller conversation, or just send one message or so to confirm something (‘We still going to the restaurant at 8?’ – ‘Yup, you know it’ – ‘Cool. I’m going to eat more crab than they hauled in on the latest episode of Deadliest Catch!’).

I appeared to be texting more back in the Fall, and my overall volume of text diminished slightly into the New Year. Looking back at some of the spikes, some corresponded to noteworthy events (birthday, Christmas, New Year’s), whereas others did not. For example, the largest spike, which occurred on September 3rd, just happened to be a day where I had a lot of conversations at once not related to anything in particular.

Lastly, through the magic of a Tableau dashboard (pa-zow!) we can combine these two interactive graphs for some data visualization goodness:


Next we make a histogram of the data to look at the distribution of the daily message volume. The spiking behaviour and variation in volume previously evident can be seen in the tail of the histogram dropping off exponentially:

Note that this is the density in black, not a fitted theoretical distribution

The daily volume follows what appears to be an exponential-type distribution (log-normal?). This is really neat to see, as I did not know what to expect (when in doubt, guess Gaussian), but it is not entirely shocking – other communication phenomena, such as phone calls, have been shown to follow a Poisson process. Someone correct me if I am way out of line here.
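For reference, the histogram and density overlay are only a few lines; a sketch (the black line is an empirical kernel density from density(), not a fitted distribution):

# Messages per day, histogram on the density scale with a kernel density overlaid
volume <- as.data.frame(table(as.Date(texts$when, tz = "America/Toronto")))
names(volume) <- c("day", "n")

hist(volume$n, breaks = 30, freq = FALSE, xlab = "Messages per day", main = "")
lines(density(volume$n), lwd = 2)    # the black density line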

Lastly we can analyze the volume of text messages per day of the week, by making a box plot:

Something’s not quite right here…

As we saw in the histogram, the data are of an exponential nature. Correcting the y-axis in this regard, the box plot looks a little more how one would expect:

Ahhhh.

We can see that overall there tends to be a greater volume of texts Thursday to Sunday. Hmmm, can you guess why this is? 🙂

This can be further broken down with a heat map of the total hourly volume per day of week:

This is way easier to make in Tableau than in R.

As seen previously in the scatterplot, the majority of messages are concentrated between the hours of 8 (here it looks more like 10) to midnight. In line with the boxplot just above, most of that traffic is towards the weekend. In particular, the majority of the messages were mid-to-late afternoon on Fridays.

We have thus far mainly been looking at my text messages as time series data. What about the content of the texts I send and receive?

Let’s compare the distribution of message lengths, sent versus received. Since there are an unequal number of Sent and Received messages, I stuck with a density plot:

Line graphs are pretty.

Interestingly, again, the data are distributed in an exponential fashion.

You can see distinctive humps at the 160-character mark. This is due to longer messages being broken up into multiple messages under the maximum length. Some carriers (or phones?) don’t break up the messages, and so there are a small number of messages longer than the ‘official’ limit.

Comparing the blue and red lines, you can see that in general I tend to be wordier than my friends and acquaintances.
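The density plot itself is short; another sketch built on the texts data frame from earlier:

# Message length in characters, one density curve per direction
sent <- texts$length[texts$direction == "Sent"]
recd <- texts$length[texts$direction == "Received"]

plot(density(sent), col = "red", lwd = 2,
     xlab = "Message length (characters)", main = "")
lines(density(recd), col = "blue", lwd = 2)
abline(v = 160, lty = 2)    # the nominal single-SMS limit
legend("topright", legend = c("Sent", "Received"), col = c("red", "blue"), lwd = 2)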

Lastly, we can look at the written content. I do enjoy a good wordcloud, so we can plunk the message contents into R and create one:

Names blurred to protect the innocent (except me!).

What can we gather from this representation of the text? Well, nothing I didn’t already know…. my phone isn’t exactly a work Blackberry.

Conclusions

  • Majority of text message volume is between 10 AM and midnight
  • Text messages split approximately 50/50 between sent and received due to conversations
  • Daily volume is distributed in an exponential fashion (Poisson?)
  • Majority of volume is towards the end of the week, especially Friday afternoon
  • I should be less wordy (isn’t that the point of the medium?)
  • Everybody’s working for the weekend

References & Resources

SMS Backup and Restore @ Google Play
https://play.google.com/store/apps/details?id=com.riteshsahu.SMSBackupRestore&hl=en

Tableau Public
http://www.tableausoftware.com/public/community

Zzzzzz….. – Quantified Self Toronto #14

Sleep is another one of those things like diet, where I feel if you asked anyone if they wanted to improve that area of their life most would say yes.

I remember hearing a quote that sleep is like sex; no one is quite sure how much everyone else is getting, but they are pretty sure it is more than them. Or wait, I think that was salary. With sleep it is more like – no one is quite sure how much they should be getting, but they sure as hell wish they were getting a lot more.

A lot of research has been done on the topic and it seems like the key takeaway from it is always the same: we are not getting enough sleep and this is a problem.

I know that I am a busy guy, that I am young, and that I go out on the weekends, so I know for a fact that my sleep is ‘bad’. But I was curious as to how ‘bad’ it actually is. I started tracking my sleep in April to find out, and also to see if there were any interesting patterns in it of which I was not aware.
 
I spoke about it again at Quantified Self Toronto #14 on August 7th (I previously spoke at #12 on June 7th). I gave an overview of my sleep-tracking activities and my simple examination of the data I had gathered. Here is the gist of my talk, as I remember it.

Hi everyone, I’m Myles Harrison and this is my second time speaking at Quantified Self Toronto, and the title of my second presentation is ‘Zzzzzzzz….’. 

I started tracking how much I was sleeping per night starting in April of this year, to find out just how good or bad my sleep is, and also to see if there are any patterns in my sleep cycle.

Now I want to tell you that the first thing I thought of when I started putting this slide deck together was Star Trek. I remember there was an episode of Star Trek called ‘Deja Q’. Q is an omnipotent being from another dimension who torments the crew of the Enterprise for his own amusement, and in this particular episode he becomes mortal. In one part of the episode he is captured and kept in a cell onboard the ship, and he describes a terrible physical experience he has:

Q
I have been entirely preoccupied by a most frightening experience of my own. A couple of hours ago, I started realizing this body was no longer functioning properly… I felt weak, the life oozing out of me… I could no longer stand… and then I lost consciousness…

PICARD
You fell asleep.

Q
It’s terrifying…. how can you stand it day after day?

PICARD
One gets used to it…


And this is kind of how I have always felt about sleep: I may not like it, there are many other things I’d rather be doing during all those hours, however it’s a necessary evil, and you get used to it. If I could be like Kramer on Seinfeld and try to get by on ‘Da Vinci Sleep’, I probably would. However for me, and for most of the rest of us, that is not a reasonable possibility.

So now we come to the question of ‘how much sleep do we really need?’. Obviously there is a hell of a lot of research which has been done on sleep, and if you ask most people how much sleep they need to get every night, they will tell you something like ‘6-8 hours’. I believe that number comes from this chart, which is from the National Sleep Foundation in the States. Here they give the figure of 7-9 hours of sleep for an adult; however, this is an average. If you read some of the literature you will find, unsurprisingly, that the amount of sleep needed depends on a lot of physiological factors and so varies from person to person. Some lucky people are perfectly capable of functioning normally during the day on only 3 or 4 hours of sleep a night, whereas some other unlucky people really need about 10 to 12 hours of sleep a night to feel fully rested. I highly doubt these unlucky folks regularly get that much sleep a night, as most of us have to get up in the morning for this thing called ‘work’. So yes, these are the extremes, but they serve to illustrate the fact that this 6-8 (or 7-9) hours per night figure is an average and is not for everyone.

Also I found a report compiled by Statistics Canada in 2005 which says that the average Canadian sleeps about 8 and a half hours a night, usually starting at about 11 PM. Additionally, most Canadians get about 20 extra minutes of sleep on weekend nights as they don’t have to go to work in the morning and so can hit the snooze button.

So knowing this, now I can look at my own sleep and say, how am I doing and where do I fit in?


So as I said, I have been recording my sleep since early April up until today. In terms of data collection, I simply made note of the approximate time I went to bed and the approximate time at which I woke up the following morning, and recorded these values in a spreadsheet. Note that I counted only continuous night-time sleep, and so the data do not include sleep during the day, such as napping [Note: this is the same as the data collected by StatsCan for the 2005 report]. Also, as a side interest, I kept a simple yes/no record of whether or not I had consumed any alcohol that evening, counting as a yes any evening on which I had a drink after 5 PM.
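Getting from those two clock times to hours of sleep is the only slightly fiddly part, since bedtimes regularly fall on either side of midnight. A sketch of how it can be done in R; the column names (date, bed, wake, alcohol) are just one way to lay out the spreadsheet:

# Compute hours slept from one row per night: the evening's date, the time I
# went to bed ("23:30" or "01:15"), and the time I woke up ("07:40")
sleep <- read.csv("sleep.csv", stringsAsFactors = FALSE)

bed  <- as.POSIXct(paste(sleep$date, sleep$bed),  format = "%Y-%m-%d %H:%M")
wake <- as.POSIXct(paste(sleep$date, sleep$wake), format = "%Y-%m-%d %H:%M")

# The wake-up time is always 'after' the bedtime; if it parses as earlier
# (e.g. bed at 23:30, wake at 07:40 on the same nominal date), push it ahead a day
wake <- wake + ifelse(wake <= bed, 24 * 3600, 0)

sleep$hours <- as.numeric(difftime(wake, bed, units = "hours"))
summary(sleep$hours)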

On to the data. Now we can answer the question ‘What does my sleep look like?’ and the answer is this:

There does not appear to be any particular rhyme or reason to my sleep pattern. Looking at the graph, we can conclude that I am still living like a university student. There are some nights where I got a lot of sleep (sometimes in excess of 11 or 12 hours) and there are other nights where I got very, very little sleep (such as one particular night in June where I got no sleep at all, but that is another story). The only thing of note I can really pick out of this graph is that following nights, or sequences of nights, where I got very little sleep or went to bed very late, there is usually a night where I got a very large amount of sleep. Interestingly, this night sometimes does not come until several days later, but this may be due to the constraints of the work week.

So despite the large amount of variability in my sleep we can still look at it and do some simple descriptive statistics and see if we can pull any meaningful patterns out of it. This is a histogram of the number of hours of sleep I got each night.

Despite all the variability in the data from what we saw earlier, it looks like the amount of sleep I get is still somewhat normally distributed. It looks like I am still getting about 7 hours of sleep on average, which actually really surprised me and in my opinion is quite good, all things considered and given the chaotic nature of my personal life. [Note: the actual value is 6.943 hrs for the mean, 7 for the median with a standard deviation of 1.82 hours]. 

So we can ask the question, ‘Is my amount of nightly sleep normally distributed?’. Well, at first glance it sure appears like it might be. So we can compare to what the theoretical values should be, and this certainly seems to be the case, though using a histogram is maybe not the best way as it will depend on how you choose your bin sizes.


We can also look at what is called a Q-Q plot, which plots the sample quantiles against the theoretical quantiles; if the two distributions are the same, then the points should lie along the straight line. They do lie along it well, with maybe a few up near the top straying away… so perhaps it is a skew-normal distribution or something like that, but we can still safely say that the amount of sleep I get at night is approximately normally distributed.
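Both checks are essentially one-liners in R (sleep$hours being the nightly totals):

# Histogram of nightly hours with the matching normal curve overlaid...
hist(sleep$hours, breaks = 15, freq = FALSE, xlab = "Hours of sleep", main = "")
curve(dnorm(x, mean = mean(sleep$hours), sd = sd(sleep$hours)), add = TRUE, lwd = 2)

# ...and the Q-Q plot against the theoretical normal quantiles
qqnorm(sleep$hours)
qqline(sleep$hours)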


Okay, so that is looking at all the data, but now we can also look at the data over the course of the week, as things like the work week and weekend may have an effect on how many hours of sleep I get.

So here is a boxplot of the number of hours of sleep I got for each day of the week and we can see some interesting things here.

Most notably, Wednesday and Saturday appear to be the ‘worst’ nights of the week for sleep. Saturday is understandable, as I tend to go out on Saturday nights, so the large amount of variability in the number of hours and the low median value are to be expected; however, I am unsure as to why Wednesday has fewer hours than the other days (although I do go out some Wednesday nights). Tuesdays and Thursdays appear to be best, both in terms of variability and the median amount, these being mid-week days where presumably my sleep cycle has become regular during the work week (despite the occasional bad Wednesday?).
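For reference, the by-day breakdown is just a boxplot over a weekday factor, ordered so that Monday comes first:

# Hours of sleep by day of week
sleep$weekday <- factor(weekdays(as.Date(sleep$date)),
                        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                   "Friday", "Saturday", "Sunday"))
boxplot(hours ~ weekday, data = sleep, ylab = "Hours of sleep")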

We can also examine when I fell asleep over the course of the week. Wait, that sounds bad, like I am sleeping at my desk at work. What I mean is we can also examine what time I went to bed each night over the course of the week:

Again we can see some interesting things. First of all, it is easy to note that on average I am not asleep before 1 AM! Secondly we can see that I get to sleep latest on Saturday nights (as this is the weekend) and that there is a large amount of variability in the hour I fall asleep on Fridays. But again we see that in terms of getting to bed earliest, Wednesday and Saturday are my ‘worst’ days, in addition to being the days when I get the least amount of sleep on average. Hmmmmmm….! Could there be some sort of relationship here?

So we can create a scatterplot and see if there exists a relationship between the hour at which I get to bed and the number of hours of sleep I get. And when we do this we can see that there appears to be [surprise, surprise!] a negative correlation between the hour at which I get to sleep and the number of hours of sleep I get.

And we can hack a trend line through there to verify this:

> tl1 <- lm(sleep$hours ~ starthrs)
> summary(tl1)

Call:
lm(formula = sleep$hours ~ starthrs)

Residuals:
    Min      1Q  Median      3Q     Max
-7.9234 -0.6745 -0.0081  0.5569  4.8669

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  9.78363    0.43696  22.390  < 2e-16 ***
starthrs    -0.62007    0.09009  -6.883 3.56e-10 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.533 on 112 degrees of freedom
Multiple R-squared: 0.2973,    Adjusted R-squared: 0.291
F-statistic: 47.38 on 1 and 112 DF,  p-value: 3.563e-10

So there is a highly statistically significant relationship between how late I get to sleep and the number of hours of sleep I get. For those of you who are interested, the p-value is very small (on the order of e-10). However, you can see that the goodness of fit is not that great, as the R-squared is only about 0.3. This means that how late I go to bed explains only part of the variation in how much sleep I get, and there are presumably other factors at play, though I could not immediately think of what they might be. I am open to other suggestions and interpretations if you have any.

Also, I got to thinking that this is the relationship between how late I get to sleep and how much sleep I get for all the data. Like a lot of people, I have a 9 to 5, and so I do not have much choice about when I can get up in the morning. Therefore I would expect that this trend is largely dependent upon the data from the days during the work week.

So I thought I would do the same examination only for the days of the week where the following day I do not have to be up by a certain hour, that is, Friday and Saturday nights. And we can create the same plot, and:

We can see that, despite there being less data, there still exists the negative relationship.

> tl2 <- lm(wkend$hours ~ hrs)
> summary(tl2)

Call:
lm(formula = wkend$hours ~ hrs)

Residuals:
    Min      1Q  Median      3Q     Max
-5.4288 -0.4578  0.0871  0.5536  4.4300

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  12.1081     0.9665  12.528 1.89e-13 ***
hrs          -0.8718     0.1669  -5.224 1.24e-05 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.737 on 30 degrees of freedom
Multiple R-squared: 0.4764,    Adjusted R-squared: 0.4589
F-statistic: 27.29 on 1 and 30 DF,  p-value: 1.236e-05

So it appears that on the days on which I could sleep in and make up the hours of sleep I am losing by going to bed later, I am not necessarily doing so. Just because I can sleep in until a ridiculously late hour doesn’t necessarily mean that my body lets me do so. This came as a bit of a surprise to me, as I thought that if I didn’t have to be up at a particular hour in the morning, I would simply sleep more to make up for the sleep I lost. An interesting insight – even though I can sleep in and make up for lost hours, that doesn’t necessarily mean I will.

So basically I just need to get to sleep earlier. Also, I am reminded of what my Dad always used to say to me when I was a kid, ‘An hour of sleep before midnight is worth two afterwards.’

Lastly, as I said, I did keep track of which nights I had consumed any alcohol in the evening to see what impact, if any, this was having on the quality and duration of my sleep. For this I just did a simple box plot of all the data and we can see that having a drink does mean I get less sleep overall.


Though this is a very simple overview, it is consistent with what you can read in the research on alcohol consumption and sleep. The belief that having a drink before bed will help you sleep better is a myth: alcohol changes physiological processes in the body which are necessary for a good night’s sleep, and so disrupts it.


So those were the conclusions I drew from tracking my sleep and doing this simple analysis of it. In terms of future directions, I could also further quantify my tracking of my sleep. I have simply measured the amount of sleep I have been getting, going with the assumption that getting close to the recommended amount of time is better. I could further quantify things by rating how rested I feel when I awake (or during the day) or rating how I felt the quality of rest I got was, on a scale of 1-10.

I could also measure other factors, such as eating and exercise, and the times these things occur, and how they play into the amount and quality of the sleep I get.

Lastly, though I did have a simple yes/no measurement for whether or not I had consumed alcohol each evening, I did not quantify this. In the future I could also measure caffeine consumption, as this is known to be another important factor affecting sleep and restfulness.

That concludes my presentation, I hope I kept you awake. I thank you for your time, and for listening. If you have any questions I would be happy to answer them.

References & Resources 

National Sleep Foundation
http://www.sleepfoundation.org/

Who gets any sleep these days? Sleep patterns of Canadians (Statistics Canada)
http://www.statcan.gc.ca/pub/11-008-x/2008001/article/10553-eng.htm

The Harvard Medical School’s Guide to A Good Night’s Sleep
http://books.google.ca/books?id=VsOWD6J5JQ0C&lpg=PP1&pg=PP1#v=onepage&q&f=false

Quantified Self Toronto
http://quantifiedself.ca/

50 Shades of Grey Wordcloud

Sometimes you just want to see what all the fuss is about. File this under the ‘because I can’ category: I proudly (?) present – a wordcloud produced from the text of E. L. James’ “50 Shades of Grey”.

For a book which is getting all this press about being full of explicit sexuality, the data are not what you expect. Obviously the main characters’ names feature prominently, but if you ask me this visualization shows that this is another romance novel much like any other.

Sure, you probably wouldn’t expect to see the words ‘dominant’ (left, next to grey) and ‘submissive’ (right, next to don’t) in some other books of this type. But look at some of the other words which are largest besides the names of the main characters – eyes, head, hands, hair, voice, smile. Obviously, it’s not just about the sex.

Produced in R using the excellent tm and wordcloud packages.

11 Million Yellow Slips – City of Toronto Parking Tickets, 2008-2011

Introduction

I don’t know about you, but I really hate getting parking tickets. Sometimes I feel like it’s all just a giant cash grab. Really? I can’t park there between the hours of 11 and 3, but every other time is okay? Well, why the hell not?

But ah, such is life. Rules must be in place to keep civil order, keep the engines of city life running and prevent total chaos in the downtown core. However knowing this does not make coming out to the street to find that bright yellow slip of paper under your windshield wiper any easier.

Like everything else in the universe, parking tickets are a source of data. The great people at Open Data Toronto (@Open_TO) have provided all the data from every parking ticket issued in Toronto from 2008 to the end of last year.

So, let us dive in and have a look. We might just discover why we keep getting all these tickets, or at least ease the collective pain a little by realizing how many others are sharing in it.

Background

The data set is an anonymized record of every parking ticket issued in the city of Toronto from the period 01/01/2008 – 12/31/2011. The fields provided are: the anonymized ticket #, date of infraction, infraction code, description, fine amount, time of infraction, and location (address).

The data set and more information can be found in Open Data Toronto’s data catalogue here.

Originally I had this brilliant idea to geocode every data point and then create an awesome heat map of the geographical distribution of parking tickets issued. However, given that there are ~11 million records and the Google Maps API has a limit of 2,500 geocoding requests per day, even if I was completely diligent and performed the task daily it would still take approximately 4,400 days, or about 12 years, to complete. And no, I am not paying to use the API for Business (which, at a limit of 100,000 requests per day, would still take ~3.5 months).

If anyone knows a way around this, please drop me an email and fill me in.

Otherwise, you can check out prior art. Patrick Cain at Global News created an awesome interactive map of aggregated parking ticket data from 2010 for locations in the city where over 500 tickets were issued. This turns out to be mainly hospitals, and unsurprisingly, tickets are clustered in the downtown core. Mr. Cain did a similar analysis while at the Toronto Star back in 2009, using data from the previous year.

I just don’t like throwing out data points.

Analysis

Parking Infractions by Type 
First we consider the parking tickets for the period by infraction type. A simple bar chart outlines the most common parking ticket types:

We will consider those codes which stick out most on the bar chart (the top 10):

> sort(codeTable, decreasing=TRUE)[1:11]
    005     029     210     003     207     009     002     008     006     015
2336433 1822690 1366945 1354671  933478  718692  496283  443706  369079 173078

Putting that into more human-readable format, the most commonly issued types of parking infractions were:

1. 005 – Park on Highway at Prohibited Time of Day
2. 029 – Park Prohibited Place/Time – No Permit
3. 210 – Park Fail to Display Receipt
4. 003 – Park on Private Property w/o Consent
5. 207 – Park w/o ticket from machine
6. 009 – Stop on Highway at Prohibited Time/Day
7. 002 – Park Longer than 3 Hours
8. 008 – Vehicle Standing Prohibited Time/Day
9. 006 – Park on Highway – Excess of Permitted Time
10. 015 – Park within 3M of Fire Hydrant

In case you were wondering, the most expensive tickets (in the range of 100’s of dollars, the max being $450 [!!] ) are all related to handicapped parking spaces.

Time Distribution of Parking Infractions
Let us now consider the parking ticket information with regard to time. First and foremost, we consider the ticket data as a simple time series and plot the data for exploratory purposes:

Cool.

Most strikingly, there are clearly defined dips in the total number of tickets over the holiday season each year. There also appears to be some kind of periodic variation in the number of tickets issued over time (the downward spikes). A good first guess would be that this is related to the day of the week, due to the cycle of the work week and the related volume of cars parked and vehicles in the city, et cetera.

Quickly whipping up a box plot of the data, we can see that a significantly smaller proportion of the tickets are issued on Sunday. Also, for some reason, there are many outliers on the low end. I suspect these fall in the aforementioned dips around the holiday season, though I did not investigate this.
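For reference, both the time series and the weekday boxplot fall out of a simple daily aggregation. A sketch; the column name date_of_infraction (in yyyymmdd form) is a guess at the header in the Open Data extract, so adjust to match the actual file, and note that the full four years is ~11 million rows, so the read takes a while:

# Tickets per calendar day, then that daily count broken out by day of week
tickets <- read.csv("parking_tickets.csv", stringsAsFactors = FALSE)
tickets$date <- as.Date(as.character(tickets$date_of_infraction), "%Y%m%d")

daily <- as.data.frame(table(tickets$date))
names(daily) <- c("date", "n")
daily$date <- as.Date(daily$date)

plot(daily$date, daily$n, type = "l", xlab = "Date", ylab = "Tickets issued")

daily$weekday <- factor(weekdays(daily$date),
                        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                   "Friday", "Saturday", "Sunday"))
boxplot(n ~ weekday, data = daily, ylab = "Tickets issued")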

Conclusions

Performing a quick analysis of many different aspects of the data was not as easy as I had hoped, given the size of the data set. Still, it is interesting to see the most common types of violations and the distribution of the majority of the parking tickets with respect to time.

Interesting general points of note:

  • The most common parking infractions are wrong place / wrong time, followed by various types of failing to display a permit / buy a ticket
  • Significantly reduced number of parking violations during the Christmas holiday season
  • More tickets issued during the work week

For Part II, I plan to create some heat maps / 2D histograms of the ticket data with respect to time, and I may yet create a geospatial representation of the data, albeit in aggregated form.

My bookshelf

I’d like to start with something small, and simple. The thing about analyzing the data of your own life is that you are the only one doing the research, so you also have to collect all of the data yourself. This takes effort and, if you’d like to build a large enough data set to do some really interesting (and valid) analysis, time.

So I thought I’d start small. And simple. So I thought, what is an easily available source of data in my life to do some preliminary work? The answer was right next to me as I sat at my desk.

I am not a bibliophile by any stretch of the imagination, as I try to make good use of the public library when I can. I’d prefer to avoid spending copiously on books which will be read once and then collect dust. I have, over time however, amassed a small collection which is currently surpassing the capacity of my tiny IKEA bookcase.

I catalogued all the books in my collection and kept track of a few simple characteristics: number of pages, list price, publication year, binding, type (fiction, non-fiction or reference), subject, and whether or not I had read the book from cover-to-cover (“Completed”).

At the time of cataloguing I had a total of 60 books on my bookshelf. Summary of data:

> source("books.R")
[1] "Reading books.csv"
> summary(books)
     Pages       
 Min.   :  63.0  
 1st Qu.: 209.5  
 Median : 260.0  
 Mean   : 386.1  
 3rd Qu.: 434.0  
 Max.   :1694.0  
      Binding        Year               Type              Subject  
 Hardcover:21   Min.   :1921   Fiction    :15   Math          :12  
 Softcover:39   1st Qu.:1995   Non-fiction:34   Communications: 7  
                Median :2002   Reference  :11   Humour        : 6  
                Mean   :1994                    Coffee Table  : 5  
                3rd Qu.:2006                    Classics      : 4  
                Max.   :2011                    Sci-Fi        : 4  
                                                (Other)       :22  
      Price        Completed
 Min.   :  1.00   -:16      
 1st Qu.: 16.45   N:13      
 Median : 20.49   Y:31      
 Mean   : 35.41             
 3rd Qu.: 30.37             
 Max.   :155.90             

Some of this information is a bit easier to interpret if provided in visual form (click to enlarge):


Looking at the charts we can see that I’m not really into novels, and that almost 1/5th of my library is reference books – due mainly to textbooks from university I still have kicking around. For about 1/3rd of the books which are intended to be read cover-to-cover I have not done so (“Not Applicable” refers to books like coffee-table and reference books which are not intended to be read in their entirety).

Breaking it down further we look at the division by subject/topic:

Interestingly enough, the topics in my book collection are varied (apparently I am well-read?), with the largest chunks being made up by math (both pop-science and textbooks) and communications (professional development reading in the last year).

Let’s take a look at the relationship between the list price of books and other factors.

As expected, there does not appear to be any particular relationship between the publication year of the book and the list price. The outliers near the top of the price range are the textbooks and those on the very far left of publication date are Kafka.

A more likely relationship would be that between a book’s length and its price, as larger books are typically more costly. Having a look at the data for all the books it appears this could be the case:

We can coarsely fit a trendline to the data:
> price <- books$Price
> pages <- books$Pages
> page_price_line <- lm(price ~ pages)
> summary(page_price_line)

Call:
lm(formula = price ~ pages)

Residuals:
    Min      1Q  Median      3Q     Max
-56.620 -13.948  -6.641  -1.508 109.802

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  9.95801    6.49793   1.532    0.131   
pages        0.06592    0.01294   5.096 3.97e-06 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.19 on 58 degrees of freedom
Multiple R-squared: 0.3092,    Adjusted R-squared: 0.2973
F-statistic: 25.96 on 1 and 58 DF,  p-value: 3.971e-06
  
 

Our p-value is super small; however, our goodness of fit (R-squared) is not great. There appears to be some sort of clustering going on here, as the larger values (both in price and pages) are more dispersed. We re-examine the plot and divide by binding type:

The softcovers make up the majority of the tightly clustered values and the values for the hardcovers seem to be more spread out. The dashed line is the linear fit for the hardcovers and the solid line for the soft. However the small number (n=21) and dispersion of the points for the former make even doing this questionable. That point aside, we can see on the whole that hardcovers appear to be more expensive, as one would expect. This is illustrated in the box plot below:
 

However, there are a lot of outlying points on the plot. Looking at the scatterplot again, we divide by book type and the picture becomes clearer:

It is clear that the reference books make up the majority of the extreme values, away from those clustered in the lower regions of the plot, and thus they could be treated separately.
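That version of the plot is only a few lines as well; a sketch, assuming Type was read in as a factor (as in the summary above):

# Price vs. pages, marking each point by book type
plot(books$Pages, books$Price, col = as.integer(books$Type), pch = 19,
     xlab = "Pages", ylab = "List price ($)")
legend("topleft", legend = levels(books$Type),
       col = seq_along(levels(books$Type)), pch = 19)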

Closing notes:

  • I did not realize how many non-fiction / general interest / popular reading books have subtitles (e.g. Zero – The Biography of A Dangerous Idea) until cataloguing the ones I own. I suppose this is to make them seem more interesting, with the hope that people browsing at bookstores will read the blurb on the back and be enticed to purchase the book.
  • Page numbering appears to be completely arbitrary. When I could I used the last page of each book which had a page number listed. Some books have the last page in the book numbered, others have the last full page of text numbered, and still others the last written page before supplementary material at the back (index, appendix, etc.) numbered. The first numbered page also varies, accounting for things like the table of contents, introduction, prologue, copyright notices and the like.
  • Textbooks are expensive. Unreasonably so.
  • Amazon has metadata for each book which you can see under “Details” when you view it (I had to look up some things like price when it was not listed on the book. In these cases, I used Amazon’s “list price”, the crossed out value at the top of the page for a book). I imagine there is an enormous trove of data which would lend itself to much more interesting and detailed analysis than I could perform here.