What’s in My Inbox? Data Analysis of Outlook

Introduction

Email is the bane of our modern existence.

Which of us hasn’t had a long, convoluted, back-and-forth email thread going on for days (if not weeks) in order to settle an issue that could have been resolved with a simple 5-minute conversation?

With some colleagues of mine, email has become so overwhelming (or their attempts to organize it so futile) that it brings to my mind Orwell’s workers at the Ministry of Truth in 1984 and their pneumatic tubes and memory holes – if the message you want is not in the top 1% (or 0.01%) of your inbox and you don’t know how to use search effectively, then for all intents and purposes it might as well be gone (see also: Snapchat).

Much has been written on the subject of why exactly we send and receive so much of it, how to best organize it, and whether or not it is, in fact, even an effective method of communication.

At one time even Gmail and the concept of labels was revolutionary – and it has done some good in organizing the ever-increasing deluge that is email for the majority of people. Other attempts have sprung up to tame the beast and make sense of such a flood of communication – most notably in my mind Inbox Zero, the simply-titled smartphone app Mailbox, and MIT’s recent data visualization project Immersion.

But email, with all its systemic flaws, misuse, and annoyances, is definitely here for good, no question. What a world we live in.

But I digress.

Background

I had originally hoped to export everything from Gmail and do a very thorough analysis of all my personal email. Though this is now a lot easier than it used to be, I got frustrated at the time trying to write a Python script and moved on to other projects.
But then I thought, hey, why not do the same thing for my work email? I recently discovered that it’s quite easy to export email from Outlook (as I detailed last time) so that brings us to this post.
I was somewhat disappointed that Outlook can only export a folder at a time (which does not include special folders such as search folders or ‘All Mail’) – I organize my mail into folders and wanted an export of all of it.
That being said, the bulk probably does remain in my inbox (4,217 items in my inbox resulted in a CSV that was ~15 MB) and we can still get a rough look using what’s available. The data cover the period from February 27th, 2013 to November 16th, 2013.

Email by Contact
First let’s look at the top 15 contacts by total number of emails. Here are some pretty simple graphs summarizing that data, first by category of contact:

In the top 15, the split between co-workers/colleagues and management is pretty even. I received about 5 times as much email from coworkers and managers as from stakeholders (but then again a lot of the latter ended up sorted into folders, so the true count is probably higher). Still, I don’t directly interact with stakeholders as much as some others do, and tend to work with teams or my immediate manager. Also, calls are usually better.

Here you can see that I interacted with my immediate colleague and manager the most, then other management, and the remainder further down the line are a mix which includes email to myself and from office operations. Also of note – I don’t actually receive that much email (I’m more of an “in the weeds” type of guy) or, as I said, much of it has gone into the appropriate folders.
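For anyone wanting to reproduce this kind of by-contact summary, a minimal R sketch would look something like the one below. Note that this is only a sketch – the data frame and the ‘From’ column are names I’ve made up for illustration, so check the actual column headers in your own export.

# Sketch: count emails by sender and plot the top 15
# 'inbox.csv' and the 'From' column are illustrative names
mail <- read.csv("inbox.csv", stringsAsFactors = FALSE)

by.contact <- sort(table(mail$From), decreasing = TRUE)
top15 <- head(by.contact, 15)

par(mar = c(10, 4, 2, 1))   # extra bottom margin for long contact names
barplot(top15, las = 2, ylab = "Number of emails")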

Time-Series Analysis
The above graphs show a very simplistic and high-level view of what proportion of email I was receiving from whom (with a suitable level of anonymity, I hope). More interesting is a quick and simple analysis of patterns over time in the volume of email I received – and I’m pretty sure you already have an idea of what some of these patterns might be.

When doing data analysis, I always feel it is important to first visualize as much of the data as practically possible – in order to get “a feel” for the data and avoid making erroneous conclusions without having this overall familiarity (as I noted in an earlier post). If a picture is worth a thousand words then a good data visualization is worth a thousand keystrokes and mouse clicks.

Below is a simple scatter plot of all the emails received by day, with the time of day on the y-axis:


This scatterplot is perhaps not immediately particularly illuminating; however, it already shows us a few things worth noting:

  • the majority of emails appear in a band approximately between 8 AM and 5 PM
  • there is increased density of email in the period between the end of July and early October, after which there is a sparse interval until late October / early November
  • there appears to be some kind of periodic nature to the volume of daily emails, giving a “strip-like” appearance (three guesses what that periodic nature is…)

We can look into this further by considering the daily volume of emails, as below. The black line is a 7 day moving average:
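For the curious, the daily counts and the moving average can be produced with a few lines of R along these lines – a sketch only, which assumes the received date/time has been read into a POSIXct column I’m calling ‘received’:

# Sketch: daily email counts with a 7-day moving average overlaid
# 'mail$received' is an illustrative POSIXct column name
daily <- as.data.frame(table(as.Date(mail$received)))
names(daily) <- c("date", "count")
daily$date <- as.Date(daily$date)

# centred 7-day moving average (days with no email are dropped by table(), so this is approximate)
daily$ma7 <- stats::filter(daily$count, rep(1/7, 7), sides = 2)

plot(daily$date, daily$count, type = "h", xlab = "Date", ylab = "Emails received per day")
lines(daily$date, daily$ma7, lwd = 2)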

We can see the patterns noted above – the increase in daily volume after 7/27 and the marked decrease mid-October. Though I wracked my brain and looked thoroughly, I couldn’t find a specific reason why there was an increase over the summer – this was just a busy time for projects (and probably not for myself sorting email). The marked decrease in October corresponds to a period of bench time, which you can see was rather short-lived.

As I noted previously in analyzing communications data, the distribution of this type of data is heavily skewed and usually follows something like a log-normal distribution. As such, a moving average is not the greatest measure of central tendency, but it is a decent approximation for our purposes. Still, I find the graph a little more digestible when depicted with a logarithmic y-axis, as below:

Lastly we consider the periodic nature of the emails which is noted in the initial scatterplot. We can look for patterns by making a standard heatmap with the weekday as the column and hour of day as the row, as below:
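For reference, here is a rough sketch of how the underlying counts for such a heatmap can be built in R (the chart itself could just as easily be done in Excel or Tableau; again, ‘received’ is an illustrative column name):

# Sketch: count of emails by hour of day and day of week
mail$weekday <- factor(weekdays(mail$received),
                       levels = c("Monday", "Tuesday", "Wednesday",
                                  "Thursday", "Friday", "Saturday", "Sunday"))
mail$hour <- as.integer(format(mail$received, "%H"))

counts <- table(mail$hour, mail$weekday)

# quick base-graphics heatmap, rows and columns kept in their natural order
heatmap(as.matrix(counts), Rowv = NA, Colv = NA, scale = "none",
        xlab = "Day of week", ylab = "Hour of day")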

You can clearly see that the majority of work email occurs between the hours of 9 and 5 (shocking!). However, some other interesting points of note are the bulk of email in the mornings at the beginning of the week, the fall-off after 5 PM at the end of the week (Thursday & Friday), and the messages received Saturday morning. Again, I don’t really receive that much email, or have spirited a lot of it away into folders as I noted at the beginning of the article (this analysis does not include things like automated emails, reports, etc.).

Email Size & Attachments
Looking at file attachments, I believe the data are more skewed than the rest, as the clean-up of large emails is a semi-regular task for the office worker (not many have the luxury of unlimited email inbox capacity – even executives), so I would expect values on the high end to have largely been removed. Nevertheless it still provides a rough approximation of how email sizes are distributed and what proportion have attachments included.

First we look at the overall proportion of email left in my inbox which has attachments – of the 4,217 emails, 2,914 did not have an attachment (69.1%) and 1,303 did (30.9%).

Examining the size of emails (which includes the attachments) in a histogram, we see a familiar-looking distribution, which here I have further expanded by making it into a Pareto chart (note that the scale on the left y-axis is logarithmic):

Here we can see that of what was left in my inbox, all messages were about 8 MB in size or less, with the vast majority being 250 KB or less. In fact, 99% of the email was less than 1,750 KB, and 99.9% was less than 6 MB.
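Those percentile figures are easy to check directly; a quick sketch, assuming the message sizes in KB sit in a numeric vector I’m calling ‘size.kb’:

# Sketch: size percentiles and the share of mail at or under 250 KB
quantile(size.kb, probs = c(0.99, 0.999))   # ~1,750 KB and ~6 MB in my data
mean(size.kb <= 250)                        # proportion of messages 250 KB or smaller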

Conclusion

This was a very quick analysis of what was in my inbox; however, we saw some interesting points of note, some of which confirm what one would expect – in particular:
  • the vast majority of email is received between the hours of 9 and 5, Monday to Friday
  • the majority of email I received was from the two managers & colleagues I work most closely with
  • approximately 3 out of 10 emails I received had attachments
  • the distribution of email sizes is highly skewed (roughly log-normal)
If I wanted to take this analysis further, I could also look at the trending by contact and do some content analysis (the latter not being done here for obvious reasons, of course).
This was an interesting exercise because it made me mindful again of what everyday analytics is all about – analyzing rich data sets we are producing all the time, but of which we are not always aware.

References and Resources

Inbox Zero
http://inboxzero.com/

Mailbox
http://www.mailboxapp.com/

Immersion
https://immersion.media.mit.edu/

Data Mining Email to Discover Organizational Networks and Emergent Communities in Work Flows

How to Export Your Outlook Inbox to CSV for Data Analysis

So one of my colleagues at work showed me this cool script he wrote in Visual Basic to pull all the data from Outlook for analysis.

Cool, I thought – I’d like to do that, but don’t want to muck about in VB.

Well, I was surprised to discover that Outlook has the ability to export email to CSV built in! Follow the simple steps below (here demonstrated in Outlook 2010) and you can analyze your emails yourself and do some cool quantified-self-type analysis.

How to Export Outlook Email to CSV (from Outlook)

1. Open Outlook and click File then Options to bring up the options dialog:
2. Select Advanced, then click the Export button:
3. Click Export to a file and then the Next button:
4. Select Comma Separated Values (Windows) and click Next.
5. Unless you want to export a different folder, select Inbox and click Next.
6. Browse to a folder and/or type a filename for your export.
7. Choose Map Custom Fields… if you want to customize which fields to export. Otherwise click the Finish button.
8. Sit tight while Outlook does its thing.
You should now have a CSV file of your inbox data!

How to Export Outlook Email to CSV (from Access)

This is all very well and good, but unfortunately exporting to CSV from Outlook does not provide the option for date and time as fields to be included, which makes it useless if you’d like to do time series (or other temporal) analysis.
To get the date and time data you can pull data from Outlook into Access and then export it as noted in this metafilter thread.
Import from Outlook into Access
1. Fire up Access and create a new database. Select External Data, More… and then Outlook Folder.
2. Select Import the source data into a new table in the current database and click OK.


3. Select the email account and folder you’d like to import and click Next 
4. Change the field settings if you’d like. Otherwise accept the defaults by clicking Next


5. Let Access add the primary key or not (you don’t need it). Click Next 


6. Click Finish and wait. When the process is done you should have a new table called ‘Inbox’.



Export Data from Access to a CSV
1. Make sure the Inbox table is selected and click External Data then Text File.
2. Pick or type a filename and click OK


3. Select Delimited and click Next.
4. Select Comma as the delimiter and tick the box which says Include Field Names on First Row. Click Next.
5. Pick or type a filename and click Finish


You should now have your Inbox data exported as CSV (including time / date data!) and ready for analysis. Of course you can repeat this process and append to the Access database folder by folder to analyze all the mail you have in Outlook.

What’s in my Pocket? (Part II) – Analysis of Pocket App Article Tagging

Introduction

You know what’s still awesome? Pocket.

As I noted in an earlier post (oh god, was that really more than a year ago?!) I started using the Pocket application, previously known as Read It Later, in July of 2011 and it has changed my reading behavior ever since.

Lately I’ve been thinking a lot about quantified self and how I’m not really tracking anything anymore. Something which was noted at one of the Meetups is that data collection is really the hurdle: like anything in life – voting, marketing, dating, whatever – you have to make it easy otherwise most people probably won’t bother to do it. I’m pretty sure there’s a psychological term for this – something involving the word ‘threshold’.

That’s where smartphones come in. Some people have privacy concerns about having all their data in the cloud (obviously I don’t, as I’m willingly putting myself on display on the blog here) but that aside, one of the cool things about smartphone apps is that you are passively creating lots of data. Over time this results in a data set about you. And if you know how to pull that data you can analyze it (and hence yourself). I did this previously, for instance with my text messages and also with data from Pocket collected up to that time.

So let’s give it a go again, but this time with a different focus for the analysis.

Background

This time I wasn’t so interested in when I read articles and from where, but more in the types of articles I was reading. In the earlier analysis, I summarized what I was reading by the top-level domain of the site – and what resulted was a high-level overview of my online reading behavior.
Pocket added the ability for you to tag your articles. The tags are similar to labels in Gmail and so the relationships can be many to one. This provides a way for you to categorize your reading list (and archive) by category, and for the purposes of this analysis here, to analyze them accordingly.
First and foremost, we need the data (again). Unfortunately over the course of the development of the Pocket application, the amount of data you can get easily via export (without using the API) has diminished. Originally the export was available both as XML or JSON, but unfortunately those are now no longer available.
However, you can still export your reading list as an HTML file, which contains attributes in the link elements for the time the article was added and the tags it has attached.

Basically the export is quasi-XML, so it’s a simple matter of writing some R code using the XML library to get the data into a format we can work with (CSV):
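The actual script isn’t reproduced here, but a minimal sketch of the idea looks something like the following. It assumes the export file is called ril_export.html and that each article is an <a> element carrying time_added and tags attributes – check your own export, as the file name and attribute names may differ:

# Sketch: parse the Pocket HTML export and flatten it to a CSV with
# one binary column per tag (file and attribute names are assumptions)
library(XML)

doc <- htmlParse("ril_export.html")
links <- getNodeSet(doc, "//a")

get.attr <- function(node, name) {
  val <- xmlGetAttr(node, name)
  if (is.null(val)) "" else val
}

articles <- data.frame(
  url   = sapply(links, get.attr, "href"),
  added = as.POSIXct(as.numeric(sapply(links, get.attr, "time_added")),
                     origin = "1970-01-01"),
  tags  = sapply(links, get.attr, "tags"),
  stringsAsFactors = FALSE
)

# one 0/1 column per tag name
all.tags <- unique(unlist(strsplit(articles$tags[articles$tags != ""], ",")))
tag.list <- strsplit(articles$tags, ",")
for (t in all.tags) {
  articles[[t]] <- as.integer(sapply(tag.list, function(ts) t %in% ts))
}

write.csv(articles, "pocket_articles.csv", row.names = FALSE)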

Here I extract the attributes and also create a column for each tag name with a binary value for if the article had that tag (one of my associates at work would call this a ‘classifier’, though it’s not the data science-y kind). Because I wrote this in a general enough fashion, you should be able to run the code on your own Pocket export and get the same results.
Now that we have some data we can plunk it into Excel and do some data visualization.

Analysis

First we examine the state of articles over time – what is the proportion of articles added over time which were tagged versus not?

Tagged vs. Untagged

You can see that initially I resisted tagging articles, but starting in November I adopted it and began tagging almost all articles added. And because stacked area graphs are not especially good data visualization, here is a line graph of the number of articles tagged per month:

Which better shows that I gradually adopted tagging from October into November. Another thing to note from this graph is that my Pocket usage peaked between November of last year and May of this year, after which the number of articles added on a monthly basis decreases significantly (hence the previous graph being proportional).

Next we examine the number of articles by subject area. I’ve collected them into more-or-less meaningful groups and will explain the different tags as we go along. Note the changing scale on the y-axes for these graphs, as the absolute number of articles varies greatly by category.

Psych & Other Soft Topics
As I noted previously in the other post, when starting to use Pocket I initially read a very large number of psych articles.

I also read a fair number of “personal development” articles (read: self-helpish – mainly from The Art of Manliness) which has decreased greatly as of late. The purple are articles on communications, the light blue “parapsych”, which is my catchall for new-agey articles relating to things like the zodiac, astrology, mentalism, mythology, etc. (I know it’s all nonsense, but hey it’s good conversation for dinner parties and the next category).

The big spike recently was from a cool site I found with lots of articles on the zodiac (see: The Barnum Effect). Most of these later got deleted.

Dating & Sex
Now that I have your attention… what, you don’t read articles on sex? The Globe and Mail’s Life section has a surprising number of them. Also if you read men’s magazines online there are a lot, most of which are actually pretty awful. You can see too that articles on dating made up a large proportion of my reading back in the fall, also from those types of sites (which thankfully I now visit far less frequently).

News, etc.
This next graph is actually a bit busy for my liking, but I found this data set somewhat challenging to visualize overall, given the number of categories and how they change in time.

News is just that. Tech is mostly the internet and gadgets. Jobs is anything career-related. Finance is both in the news (macro) and personal. Marketing is a newcomer.

Web & Data

The data tag relates to anything data-centric – as of late more applied to big data, data science and analytics. Interestingly my reading on web analytics preceded my new career in it (January 2013), just like my readings in marketing did – which is kind of cool. It also goes to show that if you read enough about analytics in general you’ll eventually read about web analytics.

Data visualization is a tag I created recently so has very few articles – many of which I would have previously tagged with ‘data’.

Life & Humanities

If that other graph was a little too busy this one is definitely so, but I’m not going to bother to break it out into more graphs now. Articles on style are of occasional interest, and travel has become a recent one. ‘Living’ refers mainly to articles on city life (mostly from The Globe as well as the odd one from blogto).

Work
And finally some newcomers, making up the minority, related to work:

SEO is search engine optimization and dev refers to development, web and otherwise.

Gee that was fun, and kind of enlightening. But tagging in Pocket is like in Gmail – it is not one-to-one but many-to-one. So next I thought to try to answer the question: which tags are most related? That is, which tags are most commonly applied to articles together?

To do this we again turn to R and the following code snippet, on top of that previous, does the trick:
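The snippet isn’t shown here, but building on the frame sketched earlier (with all.tags holding the names of the binary tag columns), the gist of it would be something like:

# Sketch: drop untagged articles, then correlate every pair of tag columns
tag.mat <- articles[rowSums(articles[, all.tags]) > 0, all.tags]

# Pearson correlation; for 0/1 variables this reduces to the phi coefficient
tag.cor <- cor(tag.mat)

# quick base-graphics heatmap of the 30 x 30 correlation matrix
heatmap(tag.cor, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)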

All this does is remove the untagged articles from the tag frame and then run a correlation between each column of the tag matrix. I’m no expert on exotic correlation coefficients, so I simply used the standard (Pearson’s). In the case of simple binary variables (true / false such as here), the internet informs me that this reduces to the phi coefficient.

Given there are 30 unique tags, this creates a 30 x 30 matrix, which is visualized below as a heatmap:

Redder is negative, greener is positive. I neglected to add a legend here as when not using ggplot or a custom function it is kind of a pain, but some interesting relationships can still immediately be seen. Most notably food and health articles are the most strongly positively correlated while data and psych articles are most strongly negatively correlated.

Other interesting relationships are that psych articles are negatively correlated with jobs, tech and web analytics (surprise, surprise) and positively correlated with communications, personal development and sex; news is positively correlated with finance, science and tech.

Conclusion

All in all this was a fun exercise and I also learned some things about my reading habits which I already suspected – the amount I read (or at least save to read later) has changed over time as well as the sorts of topics I read about. Also some types of topics are far more likely to go together than others.
If I had a lot more time I could see taking this code and standing it up into some sort of generalized analytics web service (perhaps using Shiny if I was being really lazy) for Pocket users, if there was sufficient interest in that sort of thing.
Though it was still relatively easy to get the data out, I do wish that the XML/JSON export would be restored to provide easier access, for people who want their data but are not necessarily developers. Not being a developer, my attempts to use the new API for extraction purposes were somewhat frustrating (and ultimately unsuccessful).

Though apps often make our lives easier with passive data collection, all this information being “in the cloud” does raise questions of data ownership (and governance) and I do wish more companies, large and small, would make it easier for us to get a hold of our data when we want it.

Because at the end of the day, it is ultimately our data that we are producing – and it’s the things it can tell us about ourselves that makes it valuable to us.

Resources

Pocket – Export Reading List to HTML
Pocket – Developer API
Phi Coefficient
The Barnum (Forer) Effect
code on github

Fine Cuppa Joe: 96 Days and 162 Cups of Coffee

Introduction

Let’s get one thing straight: I love me some coffee.
Some people would disagree with me on this, but coffee is really important. Really, really important, and not just to me. Not just because companies like Starbucks and Second Cup and Caribou and Timothy’s and Tim Hortons make it their business, but for another reason.
As far as I know, there are only three legal, socially acceptable drugs: alcohol, nicotine, and caffeine (and some would argue that the first two are not always socially acceptable). Coffee is really important because coffee is the most common, effective and ubiquitous source of delivery for that third drug – and one which is acceptable and ubiquitous not only socially, but also in the world of business.
I remember a long time ago there was a big blackout. I remember that afterwards people pointed out how such a widespread outage was caused by such a small point of failure – they said things like ‘This just goes to show how fragile our infrastructure is! If the terrorists want to win, all they have to do is take out one circuit breaker here or there and all of North America will collapse!’

Ha ha ha, yeah.
But I’d argue that if you really wanted all of North American society to shut down, you could really hit us where it hurts, take away something from us without which we are completely and totally hopeless – cut off our supply of coffee. Think about it! The widespread effects of everyone across all walks of life and all industries suddenly going cold turkey on coffee would be far more damaging in the long run than any little blackout. Run for the hills, the great Tim Hortons riots of 2013 have erupted and apparently the Mayans only missed the date of The Apocalypse by a small margin!
Or at least I think so. Or at least I think the idea is entertaining, though I probably largely got the idea from this Dilbert comic (which I find funnier and more spot-on than most).
But I digress.

Background

Like I said, I love me some coffee (it says so in my Twitter profile), and I’m no stranger to quantified self either, so I thought it would be interesting to apply it and answer the question “Exactly how much coffee am I drinking?” amongst others.
I kept track of my coffee consumption over the period spanning November 30, 2012 to March 5, 2013. I recorded when I had coffee, where it was from, what size, and how I took it. It wasn’t until almost the end of January that I realized I could also be keeping track of how long it took me to consume each cup of coffee, so I started doing that as well. Every time I do something like this I forget then remember how important it is to think about data collection before you set off on your merry way (like for example with the post on my commute).
As well as keeping track of the amount of coffee I drank in terms of cups, I converted the cups to volume by multiplying by the following approximate values:
  • Mug / Small / Short – 240 ml
  • Medium / Tall – 350 ml
  • Large / Grande – 470 ml
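As a rough sketch, that conversion takes only a couple of lines in R (the ‘coffee’ data frame and its ‘size’ and ‘date’ columns are names I’m making up for illustration):

# Sketch: map recorded cup sizes to approximate volumes and total them per day
vol <- c("Mug" = 240, "Small" = 240, "Short" = 240,
         "Medium" = 350, "Tall" = 350,
         "Large" = 470, "Grande" = 470)   # ml, as listed above

coffee$ml <- vol[as.character(coffee$size)]
daily.ml <- tapply(coffee$ml, coffee$date, sum)   # total ml consumed per day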

Analysis

First and foremost, we examine where I consumed the majority of coffee from over the 3 month period. Starbucks is the clear winner and apparently I almost never go to Second Cup.
bar chart of coffee consumption by location
Second was at work (which is not really a fair comparison, as it’s actually Starbucks coffee anyways). Third is at Dad’s place, almost all of which is due to my being home over the holidays.
Next we look at the time of day when the coffees were consumed. I am going to use this as an illustrative example of why it is important to visualize data.
First consider a histogram for the entire time period of when all the java was imbibed:
histogram of coffee consumption by hour of day
You can see there are peaks at the hours of 10 AM and also at 2 PM. However, is this telling the whole story? Let’s look at all the data plotted out by time of day:
scatterplot of coffee consumption by date and time of day
Having the data plotted out, you can see there is a distinct shift in the hours of the day when I was drinking coffee around the beginning of January. The earliest cup of the day moves from around 9 AM to around 8, and the latest from the evening (around 8 PM) to the late afternoon (3-4 PM). Well, what happened to constitute this shift in the time of my daily java consumption? Simple – I got a new job.
You can see this shift if we overplot histograms for the hour of day before and after the change:
combined histogram of coffee consumption by hour of day
You can see that the distribution of my coffee consumption is different after I started the new gig – my initial morning coffees occur earlier (in the hours of 7-8 AM instead of 9 or later). You wouldn’t have known that just from looking at the other histogram – so you can see why it’s important to look at all the data before jumping ahead into any analysis!
Using the ml values for the different sizes as mentioned above, we can calculate the amount consumed per day in ml, to visualize my total coffee consumption over time by volume:
cumulative coffee consumption by date
You can see that my coffee consumption is fairly consistent over time. Over the whole time period of 96 days I drank approximately 50 L of java which comes out to about 520 ml a day (or about 1.5 Talls from Starbucks). 
We can see this by adding a trend line, which fits amazingly well – the slope is ~0.52 and the R-squared is ~0.998:
cumulative coffee consumption by date (with trend line)
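For anyone curious, fitting such a trend line is only a few lines in R – a sketch, where ‘day’ (days since the start of tracking) and ‘cum.l’ (cumulative litres) are illustrative names, and the slope and R-squared quoted above come from my own data:

# Sketch: straight-line fit to cumulative coffee volume over time
plot(day, cum.l, type = "l", xlab = "Day", ylab = "Cumulative coffee (L)")

fit <- lm(cum.l ~ day)
abline(fit, lty = 2)

coef(fit)                 # slope ~0.52 L/day
summary(fit)$r.squared    # ~0.998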
So the answer to the question from the beginning (“Exactly how much coffee am I drinking?”) is: not as much as I thought – only about 1-2 cups a day. 
When am I drinking it? The peak times of day changed a little bit, but mostly early in the morning and in the mid-afternoon (which I imagine is fairly typical).
How does my daily consumption look over the time period in question? Remarkably consistent.
And just in case you were wondering, out of the 162 cups of coffee I drank over the 3 months, 160 were black.

Conclusions

  • Majority of coffee bought from Starbucks
  • Marked shift in time of day when coffees were consumed due to change in employment
  • Regular / daily rate of consumption about 520 ml and consistent over period of examination
  • I’ll take mine black, thanks

The Hour of Hell of Every Morning – Commute Analysis, April to October 2012

Introduction

So a little while ago I quit my job.

Well, actually, that sounds really negative. I’m told that when you are discussing large changes in your life, like finding a new career, relationship, or brand of diet soda, it’s important to frame things positively.

So let me rephrase that – I’ve left the job I previously held to pursue other directions. Why? Because I have to do what I love. I have to move forward. And I have to work with data. It’s what I want, what I’m good at, and what I was meant to do.

So onward and upward to bigger, brighter and better things.

But I digress. The point is that my morning commute has changed.

Background

I really enjoyed this old post at Omninerd, about commute tracking activities and an attempt to use some data analysis to beat traffic mathematically. So I thought, hey, I’m commuting every day, and there’s a lot of data being generated there – why not collect some of it and analyze it too?

The difference here being that I was commuting with public transit instead of driving. So yes, the title is a bit dramatic (it’s an hour of hell in traffic for some people, I actually quite enjoy taking the TTC).

When I initially started collecting the data, I had intended to time both my commute to and from work. Unfortunately, I discovered that, due to having a busy personal and professional life outside of the 9 to 5, there was little point in tracking my commute at the end of the work day, as I was very rarely going straight home (I was ending up with a very sparse data set). I suppose this was one point of insight into my life before even doing any analysis in this experiment.

So I just collected data on the way to work in the morning.

Without going into the personal details of my life in depth, my commute went something like this:

  • walk from home to station
  • take streetcar from station west to next station
  • take subway north to station near place of work
  • walk from subway platform to place of work

Punching the route into Google Maps, it tells me the entire distance is 11.5 km. As we’ll see from the data, my travel time was pretty consistent and on average took about 40 minutes every morning (I knew this even before beginning the data collection). So my speed with all three modes of transportation averages out to ~17.25 km/hr. That probably doesn’t seem that fast, but if you’ve ever driven in Toronto traffic, trust me, it is.

In terms of the methodology for data collection, I simply used the stopwatch on my phone, starting it when I left my doorstep and stopping it when reaching the revolving doors by the elevators at work.

So all told, I kept track of the date, starting time and commute length (and therefore end time). As with many things in life, hindsight is 20/20, and looking back I realized I could have collected the data in a more detailed fashion by breaking it up for each leg of the journey.

This occurred to me towards the end of the experiment, and so I did this for a day. Though you can’t do much data analysis with just this one day, it gives a general idea of the typical structure of my commute:

Okay, that’s fun and all, but that’s really an oversimplification as the journey is broken up into distinct legs. So I made this graphic which shows the breakdown for the trip and makes it look more like a journey. The activity / transport type is colour-coded the same as the pie chart above. The circles are sized proportionally to the time spent, as are the lines between each section.

There should be another line coming from the last circle, but it looks better this way.

Alternatively the visualization can be made more informative by leaving the circles sized by time and changing the curve lengths to represent the distance of each leg travelled. Then the distance for the waiting periods is zero and the graphic looks quite different:

I really didn’t think the walk from my house was that long in comparison to the streetcar. Surprising.

Cool, no? And there’s an infinite number of other ways you could go about representing that data, but we’re getting into the realm of information design here. So let’s have a look at the data set.

Analysis

So first and foremost, we ask the question, is there a relationship between the starting time of my morning commute and the length of that commute? That is to say, does how early I leave to go to work in the morning impact how long it takes me to get to work, regardless of which day it is?
Before even looking at the data this is an interesting question to consider, as you could assume (I would venture to say know for a fact) that departure time is an important factor for a driving commute as the speed of one’s morning commute is directly impacted by congestion, which is relative to the number of people commuting at any given time.
However, I was taking public transit and I’m fairly certain congestion doesn’t affect it as much. Plus I headed in the opposite direction of most (away from the downtown core). So is there a relationship here?
Looking at this graph we can see a couple things. First of all, there doesn’t appear to be a salient relationship between the commute start time and duration. Some economists are perfectly happy to run a regression and slam a trend line through a big cloud of data points, but I’m not going to do that here. Maybe if there were a lot of points I’d consider it.

The other reason I’m not going to do that is that you can see from looking at this graph that the data are unevenly distributed. There are more large values and outliers in the middle, but that’s only because the majority of my commutes started between ~8:15 and ~9:20, so that’s where most of the data lie.

You can see this if we look at the distribution of starting hour:

I’ve included a density plot as well so I don’t have to worry about bin-sizing issues, though it should be noted that in this case it gives the impression of continuity when there isn’t any. It does help illustrate the earlier point however, about the distribution of starting times. If I were a statistician (which I’m not) I would comment on the distribution being symmetrical (i.e. is not skewed) and on its kurtosis.

The distribution of commute duration, on the other hand, is skewed:

I didn’t have any morning where the combination of my walking and the TTC could get me to North York in less than a half hour.

Next we look at commute duration and starting hour over time. The black line is a 5-day moving average.

Other than several days near the beginning of the experiment in which I left for work extra early, the average start time for the morning trip did not change greatly over the course of the months. It looks like there might be some kind of pattern in the commute duration though, with those peaks?

We can investigate if this is the case by comparing the commute duration per day of week:

There seems to be slightly more variation in the commute duration on Monday, and it takes a bit longer on Thursdays? But look at the y-axis. These aren’t big differences, we’re talking about a matter of several minutes here. The breakdown for when I leave each day isn’t particularly earth-shattering either:

Normally, I’d leave it at that, but are these differences significant? We can do a one-way ANOVA and check:

> aov1 = aov(starthour ~ weekday, data=commute)
> aov2 = aov(time ~ weekday, data=commute)
> summary(aov1)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4  0.456  0.1140     0.7  0.593
Residuals   118 19.212  0.1628              
> summary(aov2)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4   86.4   21.59   1.296  0.275
Residuals   118 1965.4   16.66              

This requires making a lot of assumptions about the data, but assuming they’re true, these results tell us there aren’t statistically significant differences in either the average commute start time or the average commute duration per weekday.

That is to say, on average, it took about the same amount of time per day to get to work and I left around the same time.

This is in stark contrast to the way people talk around the water cooler when they’re discussing their commutes. I’ve never done any data analysis on a morning drive myself (or seen any, other than the post at Omninerd), but there are likely more clearly defined weekly patterns in your average driving commute than what we saw here with public transit.

Conclusions

There’s a couple ways you can look at this.
You could say there were no earth-shattering conclusions as a result of the experiment.
Or you could say that, other than the occasional outlier (of the “Attention All Passengers on the Yonge-University-Spadina line” variety) the TTC is remarkably consistent over the course of the week, as is my average departure time (which is astounding given my sleeping patterns).
It’s all about perspective. So onward and upward, until next time.

Resources

How to Beat Traffic Mathematically

TTC Trip Planner
myTTC (independently built by an acquaintance of mine – check out more of his cool work at branigan.ca)
FlowingData: Commute times in your area, mapped [US only]

Quantified Self Toronto #15 – Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared about what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slidedeck form.

Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and so much interesting discussion about communications was had, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis which could have been done, and which, in most cases, I had overlooked. These suggestions to dig deeper included analysis of:

  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact,  and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it’s what you want to know and what you want to say that help determine how to best look at it.

It’s late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.

Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. More can be found out about them on their awesome blogs, or by visiting quantifiedself.ca

What’s in My Pocket? Read it now! (or Read It Later)

Introduction

You know what’s awesome? Pocket.

I mean, sure, it’s not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It’s pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I used to primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. Being a big fan of reading (and also procrastination) this was a really great application for me to discover, and I’m quite glad I did. Now I can still catch up on the latest Lifehacker even if I am on the subway and don’t have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get a hold of your data. The website has an export function which allows you to dump all your data for everything you’ve ever added to your reading list into HTML.

Having the URL of every article you’ve ever read in Pocket is handy, as you can revisit all the articles you’ve saved. But there’s more to it than that. The HTML export also contains the time each article was added (in UNIX epoch). Combine this with an XML or JSON dump from the API, and now we’ve got some data to work with.

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 – 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionality, I wrote a simple Python script using webarticle2text, which is available on github. This script downloaded all the text from each article URL and continually added it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:

And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:
Head and shoulders above all other websites, at nearly half of all articles saved, is Psychology Today. I would just like to be on the record as saying – don’t hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women, however I find the majority of articles to be quite interesting (as evidenced by the number of articles I have read). Perhaps other men are not that interested in the goings-on in their own and other people’s heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (it detects if the browser is a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men’s magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two categories: time added and time ‘updated’. Looking at the data, I believe this ‘updated’ definition applies to multiple actions on the article, like marking as read, adding tags, re-downloading, et cetera. It would be ideal to actually have the date/time when the article was marked as read, as then further interesting analysis could be done. For example, looking at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.

The most salient features which immediately stand out are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done, on my commute to and from work on public transit.

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours in the evening. Although I sometimes do read articles (usually through the browser) in these hours, I think this may correspond to things like adding tags or also a delay in synching between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can’t see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.

We can also look at the daily volume of articles added:

This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where there are a large number. Looking at the distribution of the number of articles added daily, we see an exponential type distribution:

Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I’ve been reading:

Hmmmmm.

Well, that doesn’t look quite right. Did I really read an article 40,000 words long? That’s about 64 pages, isn’t it? Looking at the URLs for the articles with tens of thousands of words, I could see that those entries were the result of malfunctions of the Pocket article parser, the webarticle2text script, or both. For example, the 40,000 word article was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensical:

This is a little more what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes in the low end also correspond to failures of the article parsers.

Now what about the text content of the articles? I really do enjoy a good wordcloud; however, I know that some people tend to look down upon them. This is because there are alternate ways of depicting the same data which are more informative. However, as I said, I do enjoy them as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:

As you can see, there are issues with the download script as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for “don’t” and “are”, or possibly “you’re”). But it appears that my recreational reading corresponds to the most common subjects of its main sources. The majority of my reading was from Psychology Today and so the number one word we see is “people”. I also read a lot of articles from men’s magazines, and so we see words which I suspect primarily come from there (“women”, “social”, “sex”, “job”), as well as from the psych articles.

And now the pretty visualization:

Seeing the content of what I read depicted this way has made me have some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I’m glad data is in there as a ‘big word’ (just above ‘person’), though maybe not as big as some of the others. I’ve just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I’ll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: “The more that you read, the more things you will know. The more that you learn, the more places you’ll go!”

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily of exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What’s In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text

omg lol brb txt l8r – Text Message Analysis, 2011-2012

Introduction

I will confess, I don’t really like texting. I communicate through text messages, because it does afford many conveniences, and occupies a sort of middle ground between actual conversation and email, but that doesn’t mean that I like it.

Even though I would say I text a fair bit, more than some other Luddites I know, I’m not a serial texter. I’m not like one of those 14-year-old girls who send thousands of text messages a day (about what, exactly?).

I recall reading about one such girl in the UK who sent in excess of 100,000 text messages one month. Unfortunately her poor parents received a rather hefty phone bill, as she did this without knowing she did not have an unlimited texting plan. But seriously, what the hell did she write? Even if she only wrote one word per text message, 100,000 words is ~200 pages of text. She typed all that out on a mobile phone keyboard (or even worse, a touch screen)? That would be a sizeable book.

If you do the math it’s even crazier in terms of time. There are only 24 hours in the day, so assuming little Miss Teen Texter of the Year did not sleep, she still would have to send 100,000 messages in a 24 * 30 = 720 hour period, which averages out to about one message every 25 seconds. I think by that point there is really no value added to the conversations you are having. I’m pretty sure I have friends I haven’t said 100,000 words to over all the time that we’ve known each other.

But I digress.

Background

Actually getting all the data out turned out to be much easier than I anticipated. There exists an Android App which will not only back up all your texts (with the option of emailing it to you), but conveniently does so in an XML file with human-readable dates and a provided stylesheet (!). Import the XML file into Excel or other software and boom! You’ve got time series data for every single text message you’ve ever sent.

My data set spans the time from when I first started using an Android phone (July 2011) up to approximately the present, when I last created the backup (August 13th).

In total over this time period (405 days) I sent 3655 messages (~46.8%) and received 4151 (~53.2%) for a grand total of 7806 messages. This averages out to approximately 19 messages / day in total, or about 0.8 messages per hour. As I said, I’m not a serial texter. Also I should probably work on responding to messages.

Analysis

First we can get a ‘bird’s eye view’ of the data by plotting a colour-coded data point for each message, with time of day on the y-axis and the date on the x-axis:


Looks like the majority of my texting occurs between the hours of 8 AM to midnight, which is not surprising. As was established in my earlier post on my sleeping patterns, I do enjoy the night life, as you can see from the intermittent activity in the range outside of these hours (midnight to 4 AM). As Dr. Wolfram commented in his personal analytics posting, it was interesting to look at the plot and think ‘What does this feature correspond to?’ then go back and say ‘Ah, I remember that day!’.

It’s also interesting to see the back and forth nature of the messaging. As I mentioned before, the split in Sent and Received is almost 50/50. This is not surprising – we humans call these ‘conversations’.

We can cross-tabulate the data to produce a graph of the total daily volume of SMS:

Interesting to note here the spiking phenomenon, in what appears to be a somewhat periodic fashion. This corresponds to the fact that there are some days where I do a lot of texting (i.e. carry on several day-long conversations) contrasted with days where I might have one smaller conversation, or just send one message or so to confirm something (‘We still going to the restaurant at 8?’ – ‘Yup, you know it’ – ‘Cool. I’m going to eat more crab than they hauled in on the latest episode of Deadliest Catch!’).

I appeared to be texting more back in the Fall, and my overall volume of text diminished slightly into the New Year. Looking back at some of the spikes, some corresponded to noteworthy events (birthday, Christmas, New Year’s), whereas others did not. For example, the largest spike, which occurred on September 3rd, just happened to be a day where I had a lot of conversations at once not related to anything in particular.

Lastly, through the magic of a Tableau dashboard (pa-zow!) we can combine these two interactive graphs for some data visualization goodness:


Next we make a histogram of the data to look at the distribution of the daily message volume. The spiking behaviour and variation in volume previously evident can be seen in the tail of the histogram dropping off exponentially:

Note that it is the density in black, not a fitted theoretical distribution.

The daily volume follows what appears to be an exponential-type distribution (log-normal?). This is really neat to see, as I did not know what to expect (when in doubt, guess Gaussian), but it is not entirely shocking – other communication phenomena have been shown to be Poisson processes (e.g. phone calls). Someone correct me if I am way out of line here.

Lastly we can analyze the volume of text messages per day of the week, by making a box plot:

Something’s not quite right here…

As we saw in the histogram, the data are of an exponential nature. Correcting the y-axis in this regard, the box plot looks a little more how one would expect:

Ahhhh.

We can see that overall there tends to be a greater volume of texts Thursday to Sunday. Hmmm, can you guess why this is? :)

This can be further broken down with a heat map of the total hourly volume per day of week:

This is way easier to make in Tableau than in R.

As seen previously in the scatterplot, the majority of messages are concentrated between the hours of 8 (here it looks more like 10) to midnight. In line with the boxplot just above, most of that traffic is towards the weekend. In particular, the majority of the messages were mid-to-late afternoon on Fridays.

We have thus far mainly been looking at my text messages as time series data. What about the content of the texts I send and receive?

Let’s compare the distribution of message lengths, sent versus received. Since there are an unequal number of Sent and Received messages, I stuck with a density plot:
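For reference, a sketch of that comparison in R with ggplot2, assuming a data frame I’m calling ‘sms’ with a numeric ‘length’ column (characters per message) and a ‘type’ column marked Sent or Received:

# Sketch: density of message length, sent vs. received
library(ggplot2)

ggplot(sms, aes(x = length, colour = type)) +
  geom_density() +
  labs(x = "Message length (characters)", y = "Density")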

Line graphs are pretty.

Interestingly, again, the data are distributed in an exponential fashion.

You can see distinctive humps at the 160 character mark. This is due to longer messages being broken down into multiple messages under the max length. Some carriers (or phones?) don’t break up the messages, and so there are a small number of messages longer than the ‘official’ limit.

Comparing the blue and red lines, you can see that in general I tend to be wordier than my friends and acquaintances.

Lastly, we can look at the written content. I do enjoy a good wordcloud, so we can plunk the message contents into R and create one:
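A minimal sketch of this with the tm and wordcloud packages, assuming the message bodies sit in a character vector I’m calling ‘msg.text’:

# Sketch: basic word cloud of message content after removing stop words
library(tm)
library(wordcloud)

corpus <- Corpus(VectorSource(msg.text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

wordcloud(corpus, max.words = 100, random.order = FALSE)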

Names blurred to protect the innocent (except me!).

What can we gather from this representation of the text? Well, nothing I didn’t already know…. my phone isn’t exactly a work Blackberry.

Conclusions

  • Majority of text message volume is between 10 AM to midnight
  • Text messages split approximately 50/50 between sent and received due to conversations
  • Daily volume is distributed in an exponential fashion (Poisson?)
  • Majority of volume is towards the end of the week, especially Friday afternoon
  • I should be less wordy (isn’t that the point of the medium?)
  • Everybody’s working for the weekend

References & Resources

SMS Backup and Restore @ Google Play
https://play.google.com/store/apps/details?id=com.riteshsahu.SMSBackupRestore&hl=en

Tableau Public
http://www.tableausoftware.com/public/community

Zzzzzz….. – Quantified Self Toronto #14

Sleep is another one of those things like diet, where I feel if you asked anyone if they wanted to improve that area of their life most would say yes.

I remember hearing a quote that sleep is like sex; no one is quite sure how much everyone else is getting, but they are pretty sure it is more than them. Or wait, I think that was salary. With sleep it is more like – no one is quite sure how much they should be getting, but they sure as hell wish they were getting a lot more.

A lot of research has been done on the topic and it seems like the key takeaway from it is always the same: we are not getting enough sleep and this is a problem.

I know that I am a busy guy, that I am young, and that I go out on the weekends, so I know for a fact that my sleep is ‘bad’. But I was curious as to how ‘bad’ it actually is. I started tracking my sleep in April to find out, and also to see if there were any interesting patterns in it of which I was not aware.
 
I spoke about it again at Quantified Self Toronto (#14) on August 7th (I previously spoke at #12 on June 7th). I gave an overview of my sleep-tracking activities and my simple examination of the data I had gathered. Here is the gist of my talk, as I remember it.

Hi everyone, I’m Myles Harrison and this is my second time speaking at Quantified Self Toronto, and the title of my second presentation is ‘Zzzzzzzz….’. 

I started tracking how much I was sleeping per night starting in April of this year, to find out just how good or bad my sleep is, and also to see if there are any patterns in my sleep cycle.

Now I want to tell you that the first thing I thought of when I started putting this slide deck together was Star Trek. I remember there was an episode of Star Trek called ‘Deja Q’. Q is an omnipotent being from another dimension who torments the crew of the Enterprise for his own amusement, and in this particular episode he becomes mortal. In one part of the episode he is captured and kept in a cell onboard the ship, and he describes a terrible physical experience he has:

Q
I have been entirely preoccupied by a most frightening experience of my own. A couple of hours ago, I started realizing this body was no longer functioning properly… I felt weak, the life oozing out of me… I could no longer stand… and then I lost consciousness…

PICARD
You fell asleep.

Q
It’s terrifying…. how can you stand it day after day?

PICARD
One gets used to it…


And this is kind of how I have always felt about sleep: I may not like it, there are many other things I’d rather be doing during all those hours, however it’s a necessary evil, and you get used to it. If I could be like Kramer on Seinfeld and try to get by on ‘Da Vinci Sleep’, I probably would. However for me, and for most of the rest of us, that is not a reasonable possibility.

So now we come to the question of ‘how much sleep do we really need?’. Obviously there is a hell of a lot of research which has been done on sleep, and if you ask most people how much sleep they need to get every night, they will tell you something like ‘6-8 hours’. I believe that number comes from this chart, which is from the National Sleep Foundation in the States. Here they give the figure of 7-9 hours of sleep for an adult, however this is an average. If you read some of the literature you will find, unsurprisingly, that the amount of sleep needed depends on a lot of physiological factors and so varies from person to person. Some lucky people are perfectly capable of functioning normally during the day on only 3 or 4 hours of sleep a night, whereas some other unlucky people really need about 10 to 12 hours of sleep a night to feel fully rested. I highly doubt these unlucky folks regularly get that much sleep a night, as most of us have to get up in the morning for this thing called ‘work’. So yes, these are the extremes, but they serve to illustrate the fact that this 6-8 (or 7-9) hours per night figure is an average and is not for everyone.

Also I found a report compiled by Statistics Canada in 2005 which says that the average Canadian sleeps about 8 and a half hours a night, usually starting at about 11 PM. Additionally, most Canadians get about 20 extra minutes of sleep on weekend nights as they don’t have to go to work in the morning and so can hit the snooze button.

So knowing this, now I can look at my own sleep and say, how am I doing and where do I fit in?


So as I said, I have been recording my sleep since early April up until today. In terms of data collection, I simply made note of the approximate time I went to bed and the approximate time at which I woke up the following morning, and recorded these values in a spreadsheet. Note that I counted only continuous night-time sleep, so the data do not include sleep during the day or things like napping [Note: this is the same as the data collected by StatsCan for the 2005 report]. Also, as a side interest, I kept a simple yes/no record of whether or not I had consumed any alcohol that evening, counting as a yes any evening on which I had a drink after 5 PM.

On to the data. Now we can answer the question ‘What does my sleep look like?’ and the answer is this:

There does not appear to be any particular rhyme or reason to my sleep pattern. Looking at the graph, we can conclude that I am still living like a university student. There are some nights where I got a lot of sleep (sometimes in excess of 11 or 12 hours) and there are other nights where I got very, very little (such as one particular night in June where I got no sleep at all, but that is another story). The only thing of note I can really pick out of this graph is that following nights, or sequences of nights, where I got very little sleep or went to bed very late, there is usually a night where I got a very large amount of sleep. Interestingly this night is sometimes not until several days later, but that may be due to the constraints of the work week.

So despite the large amount of variability in my sleep we can still look at it and do some simple descriptive statistics and see if we can pull any meaningful patterns out of it. This is a histogram of the number of hours of sleep I got each night.

Despite all the variability in the data from what we saw earlier, it looks like the amount of sleep I get is still somewhat normally distributed. It looks like I am still getting about 7 hours of sleep on average, which actually really surprised me and in my opinion is quite good, all things considered and given the chaotic nature of my personal life. [Note: the actual value is 6.943 hrs for the mean, 7 for the median with a standard deviation of 1.82 hours]. 

So we can ask the question, ‘Is my amount of nightly sleep normally distributed?’. Well, at first glance it sure appears like it might be. So we can compare to what the theoretical values should be, and this certainly seems to be the case, though using a histogram is maybe not the best way as it will depend on how you choose your bin sizes.
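That comparison is easy enough in base R; a quick sketch using the sleep data frame that appears in the regression output further down:

# Histogram of nightly hours of sleep with the matching normal curve overlaid
hist(sleep$hours, breaks = 15, freq = FALSE,
     main = "Hours of sleep per night", xlab = "Hours")
curve(dnorm(x, mean = mean(sleep$hours), sd = sd(sleep$hours)),
      add = TRUE, lwd = 2)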


We can also look at what is called a Q-Q plot which plots the values against the theoretical values, and if the two distributions are the same then the values should lie along that straight line. They do lie along it well, with maybe a few up near the top there straying away… so perhaps it is a skew-normal distribution or something like that, but we can still safely say that the amount of sleep I get at night is approximately normally distributed.
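The Q-Q plot is only a couple of lines in R:

# Normal Q-Q plot of nightly hours of sleep, with the reference line
qqnorm(sleep$hours)
qqline(sleep$hours)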


Okay, so that is looking at all the data, but now we can also look at the data over the course of the week, as things like the work week and weekend may have an effect on how many hours of sleep I get.

So here is a boxplot of the number of hours of sleep I got for each day of the week and we can see some interesting things here.

Most notably, Wednesday and Saturday appear to be the ‘worst’ nights of the week for me for sleep. Saturday is understandable, as I tend to go out on Saturday nights, and so the large amount of variability in the number of hours and the low median value are to be expected; however, I am unsure as to why Wednesday has fewer hours than the other days (although I do go out some Wednesday nights). Tuesdays and Thursdays appear to be best both in terms of variability and the median amount, these days being mid-week, where presumably my sleep cycle has become regular during the work week (despite the occasional bad Wednesday?).
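A sketch of how such a box plot could be made, assuming a hypothetical date column in the sleep data frame:

# Hours of sleep by day of week
sleep$weekday <- factor(weekdays(as.Date(sleep$date)),
                        levels = c("Monday", "Tuesday", "Wednesday",
                                   "Thursday", "Friday", "Saturday", "Sunday"))
boxplot(hours ~ weekday, data = sleep,
        xlab = "Day of week", ylab = "Hours of sleep")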

We can also examine when I fell asleep over the course of the week. Wait, that sounds bad, like I am sleeping at my desk at work. What I mean is we can also examine what time I went to bed each night over the course of the week:

Again we can see some interesting things. First of all, it is easy to note that on average I am not asleep before 1 AM! Secondly we can see that I get to sleep latest on Saturday nights (as this is the weekend) and that there is a large amount of variability in the hour I fall asleep on Fridays. But again we see that in terms of getting to bed earliest, Wednesday and Saturday are my ‘worst’ days, in addition to being the days when I get the least amount of sleep on average. Hmmmmmm….! Could there be some sort of relationship here?

So we can create a scatterplot and see if there exists a relationship between the hour at which I get to bed and the number of hours of sleep I get. And when we do this we can see that there appears to be [surprise, surprise!] a negative correlation between the hour at which I get to sleep and the number of hours of sleep I get.
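Here is a minimal sketch of that scatterplot; starthrs is assumed to be a numeric vector of bedtimes (it also shows up in the model output below):

# Hours of sleep vs. hour I went to bed
plot(starthrs, sleep$hours,
     xlab = "Hour went to bed", ylab = "Hours of sleep")
# the trend line from the fit below can then be added with abline(tl1)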

And we can hack a trend line through there to verify this:

> tl1 <- lm(sleep$hours ~ starthrs)
> summary(tl1)

Call:
lm(formula = sleep$hours ~ starthrs)

Residuals:
    Min      1Q  Median      3Q     Max
-7.9234 -0.6745 -0.0081  0.5569  4.8669

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  9.78363    0.43696  22.390  < 2e-16 ***
starthrs    -0.62007    0.09009  -6.883 3.56e-10 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.533 on 112 degrees of freedom
Multiple R-squared: 0.2973,    Adjusted R-squared: 0.291
F-statistic: 47.38 on 1 and 112 DF,  p-value: 3.563e-10

So there is a highly statistically significant relationship between how late I get to sleep and the number of hours of sleep I get. For those of you who are interested, the p-value is very small (on the order of e-10). However, the goodness of fit is not that great, as the R-squared is only about 0.3. This means that how late I get to sleep explains only part of the variation in how much sleep I get, and there must be other factors at play; however, I could not immediately think of anything. I am open to other suggestions and interpretations if you have any.

Also I got to thinking that this is the relationship between how late I get to sleep and how much sleep I get for all the data. Like a lot of people, I have a 9 to 5, and so I do not have much choice about when I can get up in the morning. Therefore I would expect that this trend is largely dependent upon the data from the days during the work week.

So I thought I would do the same examination only for the days of the week where the following day I do not have to be up by a certain hour, that is, Friday and Saturday nights. And we can create the same plot, and:

We can see that, despite there being less data, there still exists the negative relationship.
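For reference, wkend in the fit below is presumably just the Friday and Saturday night subset of the sleep data; a hypothetical sketch:

# Subset to nights where I don't have to be up the next morning
wkend <- subset(sleep, weekday %in% c("Friday", "Saturday"))
hrs   <- wkend$starthrs   # bedtimes for those nights only (hypothetical column)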

> tl2 <- lm(wkend$hours ~ hrs)
> summary(tl2)

Call:
lm(formula = wkend$hours ~ hrs)

Residuals:
    Min      1Q  Median      3Q     Max
-5.4288 -0.4578  0.0871  0.5536  4.4300

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  12.1081     0.9665  12.528 1.89e-13 ***
hrs          -0.8718     0.1669  -5.224 1.24e-05 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.737 on 30 degrees of freedom
Multiple R-squared: 0.4764,    Adjusted R-squared: 0.4589
F-statistic: 27.29 on 1 and 30 DF,  p-value: 1.236e-05

So it appears that on the days when I could sleep in and make up the hours of sleep I lose by going to bed later, I am not necessarily doing so. Just because I can sleep in until a ridiculously late hour doesn’t necessarily mean that my body lets me. This came as a bit of a surprise, as I had thought that if I didn’t have to be up at a particular hour in the morning, I would simply sleep more to make up for the sleep I lost.

So basically I just need to get to sleep earlier. Also, I am reminded of what my Dad always used to say to me when I was a kid, ‘An hour of sleep before midnight is worth two afterwards.’

Lastly, as I said, I did keep track of which nights I had consumed any alcohol in the evening to see what impact, if any, this was having on the quality and duration of my sleep. For this I just did a simple box plot of all the data and we can see that having a drink does mean I get less sleep overall.


Though this is a very simple overview, it is consistent with the research that has been done on alcohol consumption and sleep. The belief that having a drink before bed will help you sleep better is a myth: alcohol changes physiological processes in the body which are necessary for a good night’s sleep, and so disrupts it.
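The box plot itself is the same idea as before, assuming a hypothetical yes/no alcohol column in the sleep data:

# Hours of sleep by whether I had a drink that evening
boxplot(hours ~ alcohol, data = sleep,
        xlab = "Alcohol that evening?", ylab = "Hours of sleep")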


So those were the conclusions I drew from tracking my sleep and doing this simple analysis of it. In terms of future directions, I could further quantify my tracking. So far I have simply measured the amount of sleep I have been getting, going with the assumption that getting close to the recommended amount of time is better. I could go further by rating how rested I feel when I wake up (or during the day), or rating the quality of the rest I felt I got, on a scale of 1-10.

I could also measure other factors, such as eating and exercise, and the times these things occur, and how they play into the amount and quality of the sleep I get.

Lastly, though I did have a simple yes/no measurement for whether or not I had consumed alcohol each evening, I did not quantify the amount. In the future I could measure caffeine consumption as well, as this is known to be another important factor affecting sleep and restfulness.

That concludes my presentation, I hope I kept you awake. I thank you for your time, and for listening. If you have any questions I would be happy to answer them.

References & Resources 

National Sleep Foundation
http://www.sleepfoundation.org/

Who gets any sleep these days? Sleep patterns of Canadians (Statistics Canada)
http://www.statcan.gc.ca/pub/11-008-x/2008001/article/10553-eng.htm

The Harvard Medical School’s Guide to A Good Night’s Sleep
http://books.google.ca/books?id=VsOWD6J5JQ0C&lpg=PP1&pg=PP1#v=onepage&q&f=false

Quantified Self Toronto
http://quantifiedself.ca/