In Critique of Slopegraphs

I've been doing more research into less common types of data visualization techniques recently, and was reading up on slopegraphs.

Andy Kirk wrote a piece praising slopegraphs last December, which very nicely goes over the construction of a slopegraph with some example data. However, I've seen some poor examples of slopegraphs elsewhere on the web, and thought I'd put in my two cents.

Introductory remarks

I tend to think of slopegraphs as a very boiled-down version of a normal line chart, in which you have only two values for your independent variable and strip away all the non-data ink. This works because once you label all the individual components, you no longer need the legend or the axes, so all that cruft can be taken away. Here's a before-and-after example, using the soccer data from Andy's post.
First as a line graph:
Hmm, that's not very enlightening, is it? There are so many values for the categorical variable (team) that the graph requires a plethora of colours in the legend, and a considerable amount of back-and-forth to interpret. Contrast this with the slopegraph, which is much easier to interpret as the individual values can be read off, and which also ditches the non-data ink of the axes:

Here it is much easier to read off values for the individual teams, the graph feels less cluttered, and more data have been encoded, both in colour (orange for a decrease between the two years, blue for an increase) and in the thickness of the lines (thicker lines for a change of > 25%).

Pros and Cons

In my opinion, the slopegraph should be viewed as an extension of the line graph, and so even though traditional chart elements like the y-axis have been stripped away, consistency should be kept with the regular conventions of data visualization.
In the above example, Andy has correctly honoured vertical position, so that each team appears on either side of the graph at the correct height according to the number of points it has. This is the same as one of Dr. Tufte's original graphs (from The Visual Display of Quantitative Information), which follows the same practice and which I quite like:
Brilliant. However, when you no longer honour vertical position to encode value, you lose the ability to truly compare across the categorical variable, which I tend to disagree with. This is usually done for legibility's sake (to "uncrowd" the graph when there are a lot of lines); however, I feel it could still be avoided in most cases. See below for an example.

Here the vertical position is not honoured, as some values which are smaller appear above those which are larger, so that the lines do not cross and the graph is uncluttered.

Also, it should be noted that in this case there are more than two values in the independent variable. As long as the scale in the vertical direction is still consistent, the changes in quantity can still be compared by the slope of the lines, even if the exact values cannot be compared, because vertical position no longer corresponds directly to quantity.

Either way, this type of slopegraph is closer to a group of sparklines (as Tufte originally noted), as it allows comparison of the changes in the dependent variable across values of the independent variable for each value of the categorical variable, but not of the exact quantities.

Where things really start to fall apart, though, is when slopegraphs are used to connect values from two different variables. Charlie Park has some examples of this in his blog post on the subject, such as the one from Ben Fry below:

So here's the question: what, exactly, does the slope of the different lines correspond to? The variable on the left is win-loss record and on the right is total salary. Park correctly notes that in this case the slopegraph is an extension of a parallel coordinates graph, which requires some further discussion.
A parallel coordinates graph is all well and good for doing exploratory data analysis and finding patterns in data with a large number of variables. However, I would avoid graphs like the one above in general - because the variables on the left and the right are not the same, the slope of the line is essentially meaningless.
In the case of the baseball data, why not just display the information in a regular scatterplot, as below? Simple and clear. You can then encode the additional information using colour and size if desired and make a bubble chart.

Was the disproportionately large payroll of the Yankees as obvious in the previous visualization? Maybe, but not as saliently. The relative size of the payroll was encoded in the thickness of the line, but quantity is not interpreted as quickly or accurately when encoded using area/thickness as when encoded using position. Also, because the previous data were ranked (vertical position did not portray quantity), the much smaller number of wins by Kansas City relative to the other teams was not as apparent as it is here.

Fry notes that he chose not to use a scatterplot as he wanted ranking for both quantities, which I suppose is the advantage of the original treatment, and something which is not depicted in the alternative I've presented. Park also correctly notes, in the examples in his post, that different visualizations draw the eye to different features of the data, and that some people have more difficulty interpreting a visualization like a bubble chart than a slopegraph. Still, I remain a skeptical functionalist as far as visualization is concerned, and prefer the treatment above to the former.
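For the curious, the key scaling decision behind such a bubble chart is worth spelling out: the marker radius should grow with the square root of the encoded quantity, so that area (which is what the eye reads) is proportional to it. A minimal Python sketch, with invented team names and figures rather than Fry's data:

```python
import math

# Hypothetical (team, wins, payroll in $M) values -- invented for illustration
teams = [("Team A", 95, 200.0), ("Team B", 90, 120.0), ("Team C", 65, 50.0)]

def bubble_radii(values, max_radius=30.0):
    """Scale marker radii with the square root of the value, so that
    bubble AREA is proportional to the quantity being encoded."""
    top = max(values)
    return [max_radius * math.sqrt(v / top) for v in values]

radii = bubble_radii([payroll for _, _, payroll in teams])
for (name, wins, payroll), r in zip(teams, radii):
    print(f"{name}: {wins} wins, ${payroll}M payroll, marker radius {r:.1f}")
```

A payroll four times smaller gets a radius only twice as small, so the areas compare honestly; scaling radius linearly instead would exaggerate the differences.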

Alternatives

I've presented some criticism of slopegraphs here, but are there alternatives? Yes. In addition to the above, let's explore some others, using the data from the soccer example.

Really what we are interested in is the change in the quantity over the two values of the independent variable (year). So we can instead look at that quantity (change between the two years), and visualize it as a bar graph with a baseline of zero. Here the bars are again coloured by whether the change is positive or negative.

This is fine; however, we've lost the information that was encoded in the thickness of the lines. We can encode it instead using the lightness (intensity) of the bars: dark for > 25% change, light for the others:

Hmm, not bad. However, we've still lost the information about the absolute number of points in each year. So let's make that the value along the horizontal axis instead.

Okay fine, now the length of the bars corresponds to the magnitude of the change in points across the two years, with positive changes being coloured blue and negative orange, and the shading corresponding to whether the change was greater or less than 25%.

However, even if I added a legend and told you what the colours correspond to, it's pretty common for people to think of things as progressing from left to right (at least in Western cultures). The graph is difficult to interpret because for the orange bars the score for the first year is on the right, whereas for the blue bars it's on the left. That is to say, we have the absolute values, but the direction of the change is not depicted well. Changing the bars to arrows solves this, as below:

Now we have the absolute values of the points in each year for each team, and the direction of the change is displayed better than just with colour. Adding the gridlines allows the viewer to read off the individual values of points more easily. Lastly, we encode the other categorical variable of interest (change greater/less than 25%) as the thickness of the line.

Like so. After creating the above independently, I discovered that visualization consultant Naomi Robbins had already written about this type of chart on Forbes, as an alternative to using multiple pie charts. Jon Peltier also has an excellent in-depth description of how to make these types of charts in Excel, as well as another alternative to slopegraphs, using a dot plot.
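For concreteness, the encoding rules running through these alternatives (colour by the sign of the change, shading by whether its magnitude exceeds 25%) take only a few lines to express. This is a Python illustration with invented point totals, not the original charting code:

```python
def encode_change(points_year1, points_year2):
    """Map a change in points to the colour/shading rules used above:
    blue for an increase, orange for a decrease, and a darker shade
    when the magnitude of the percent change exceeds 25%."""
    change = points_year2 - points_year1
    pct_change = 100.0 * change / points_year1
    colour = "blue" if change >= 0 else "orange"
    shade = "dark" if abs(pct_change) > 25 else "light"
    return change, round(pct_change, 1), colour, shade

# Hypothetical team totals for two seasons
for team, y1, y2 in [("Team A", 80, 64), ("Team B", 52, 71)]:
    print(team, encode_change(y1, y2))
```

The same function drives the bars, the arrows, or the slopegraph lines; only the geometry changes between the chart types.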

Of course, providing the usual fixings for a graph, such as a legend, title and proper axis labels, would complete the above, which brings me to my last point. Though I think it's a good alternative to slopegraphs, it can in no way compete in simplicity with Dr. Tufte's example of a slopegraph, as that had zero non-data ink. And, of course, this type of graph will not work when there are more than two values in the independent variable across which to compare.

Closing Remarks

It is easy to tell who the true thought leaders in data visualization are, because they often take it upon themselves to find special cases where people struggle or visualize data poorly, and then invent new visualization types to fill the need (Tufte with the slopegraph; Few with the bullet graph, to supplant the god-awful gauges on dashboards).
As I discussed, there are certain cases where slopegraphs should not be used, and where I feel you would be better served by other types of graphs; in particular, cases where the slopegraph is a variation of the parallel coordinates chart rather than the line graph, or where quantity is not encoded in vertical position and comparing quantities for each value of the independent variable is important.

That being said, it is (as always) very important when making choices regarding data visualization to consider the pros and cons of different visualization types, the properties of the data you are trying to communicate, and, of course, the target audience.

Judiciously used, slopegraphs provide a highly efficient way in terms of data-ink ratio to visualize change in quantity across a categorical variable with a large number of values. Their appeal lies both in this and their elegant simplicity.

References & Resources

Slopegraphs discussion on Edward Tufte forum
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk
In Praise of Slopegraphs, by Andy Kirk
Edward Tufte's "Slopegraphs" by Charlie Park
http://charliepark.org/slopegraphs/
Peltier Tech: How to Make Arrow Charts in Excel
http://peltiertech.com/WordPress/arrow-charts-in-excel/

Salary vs. Performance of MLB Teams by Ben Fry
http://fathom.info/salaryper/

salary vs performance scatterplot (Tableau Public)

Creepypasta – Votes vs. Rating (& learning ggplot2)

Excel:

R, base package:

R, ggplot:

Am I overfitting? Probably.

Code:
More fun stuff to come....

References

Source data at Creepypasta.com:

Code on gist:
http://gist.github.com/mylesmharrison/8886272

Creepypasta - in the list of internet phenomena (Wikipedia):
http://en.wikipedia.org/wiki/Creepypasta#Other_phenomena

Looking for Your Lens: 3 Tips on How to Be a Great Analyst

The other day as I was walking to work, all of a sudden, "pop!" one of the lenses in my glasses decided to free itself from the prison of its metal frame and take flight.

Well, damn.

The sidewalk was wet and partially covered in snow, and also with little islands of ice here and there. Finding a transparent piece of glass was not going to be easy.

So there I was, wandering about a small patch of sidewalk next to Toronto City Hall, squatting on my haunches, peering down at the sidewalk and awkwardly searching for my special missing piece of glass. I was not optimistic about my chances.

Most people walked on by and paid me no notice, but one kind soul, a woman with black, curly hair, stopped to help.

"Did you lose something?" she asked.

"Yeah," I said, defeated, and held up my half empty black frames.

"Can I help?" she kindly offered. "I'm good at this sort of thing."

"I guess," I said, having already given up on restoring my headgear to completeness.

We scoured the sidewalk while urban passersby gave the occasional puzzled look, hurrying along.

"Ah!" she said, and, amazingly, picked up the lens she had located. It had been hiding on a small patch of snow near a planter.

"WOW!" I was genuinely impressed. "You are good at this. Thanks so much."

"No problem," she said. "Have a great day!" And then she promptly disappeared down the street, leaving me standing there on the sidewalk, bewildered.

That single small episode, a tiny vignette of a single life in a giant city amongst millions of others, was quite profound for me. This was because it got me thinking about two things: one, the kindness of strangers, and the other, of course, what I am always thinking about - the business of doing analysis.

Because as it turns out, those few statements that kind stranger made are equally important in being a great analyst.

"Did you lose something?"

A problem that a lot of analysts deal with on a regular basis is one of communication. The business, the stakeholder, the client, whoever it may be, comes to the analyst for help. They want to find out something about their business because they have data, and it's the job of the analyst to turn that data (information) into insights (knowledge).
But here's the problem - you can't find something if you don't know what you're looking for.
Just as the kind passerby wouldn't have been able to help me find my missing lens if she didn't know what to look for, if you don't know what kinds of insights you want to pull out of the data you have, then you won't be able to find what you're looking for either.
"We want to know how our people are connecting with our brand."
It is the job of the analyst to turn these (often vague) desires of the business into specific questions that can be answered by analyzing data.
What people? (everyone, purchasers only, Boomers, Gen X, Gen Y, single mothers between the ages of 22 and 32 in urban centers?) What does connecting with the brand mean? (viewing an ad, purchase, visits to the website, app downloads, posts on social media, all of the above?)
So remember that a very large part of the job of the analyst is communication - not just about data - but working with others to determine exactly what it is they want to know. Once you know that, you can determine how to best do analysis to find the answers that are being sought after - hiding in plain sight in the data, like a piece of glass on a snowy patch of sidewalk.

"Can I help?"

Here's something I think a lot of analytical-type thinkers (this author included) often need to be reminded of: you can't know everything. Even if you really, really want to. I'm sorry, but you just can't.
And that's why once you know what it is you're looking for, and what you need, you'll need to ask for help (and that's okay - that's why we have meetings!). Sometimes the mere process of tracking down the data is a considerable task in itself. Sometimes no one really has a great overall understanding of how a really large, complicated system works - that kind of knowledge is often very distributed. These sorts of situations may require the help of many others in your company (or at another business, vendor or client) who all have varying knowledge bases and skill sets.
It's the job of the analyst to connect with the people they need to, get the data they need, and do the analysis to find the answers that are desired. Also, if you're a good analyst, you'll probably provide some context around the impact (i.e. business implications) of your answer, and around which parties would need to be involved to take the most beneficial actions as a result.
So even if you're a data rock star, don't ever be afraid to ask for help; and conversely, don't hesitate to point others to the people who can help them, too.

"I'm good at this sort of thing."

Getting the analysis done requires not only not being afraid of asking for help, but also knowing the strengths and weaknesses of yourself, your team, and any others you may be working with.
It's hard, but in my opinion, it takes a bigger person to be honest and admit when they are out of their depth than to say they can do something they clearly cannot.
When you're out of your depth you have three options, which are really just three different ways of finishing the statement "I'm not an expert." And they go something like this: I'm not an expert....
  1.  "... so I'm not going to do it because: I don't know how / wouldn't be able to figure it out / it's not in my job description."
  2.  "... but I can: learn quickly / give it a try / do my best / become one in 5 days."
  3.  "... but I know <colleague> is and could: provide context to the problem / definitely help do it / teach us how."
And the difference between answer #1 and the last two is what separates the office drones from the thought leaders, the reporting monkeys from the truly great analysts, and the unsuccessful from the successful in the world of data.
As I noted in the section above, you should never be afraid to ask for help, because there are going to be others out there that are better at things than you, and if you're good you'll recognize this fact and both of you will benefit. Hey, you might even learn something too, so next time you will be the expert.
Just remember that you can do analysis without crunching every number personally. You can work in data science without building the predictive model all by yourself. And you can work with data without writing every line of code alone. No analyst is an island.

"No problem! Have a great day!"

I hope that my little story and these points will help, or at least help you think, about the business of working with data and doing analysis, and what it means to be a great analyst.

This last point is perhaps equally important as, or even more important than, the others - always be kind to the people you work with; always make it look easy, no matter how hard it was; and always be happy to help. That, above all, is what will make you a truly great analyst.

The Mathematics of Wind Chill

Introduction

Holy crap, it was cold out.

If you haven't been reading the news, don't live in the American Midwest or Canada, or do and didn't go outside the last couple of weeks (for which I don't blame you), there was some mighty cold weather lately due to something called a polar vortex.

Meteorologists stated that a lot of people (those in younger generations) would never have experienced anything like this before - cold from the freezing temperatures and high winds the likes of which parts of the US and Canada haven't seen for 40 years.

It was really cold. So cold that weird stuff happened, including the ground exploding here in Ontario due to frost quakes, or cryoseisms, as they are technically known (or as my sister suggested they should be called, "frosted quakes" - get it?)

When there is all this talk of a polar vortex, all I could think of was a particularly ridiculous TV-movie that came out lately, and that this is our Northern equivalent, which probably looked something like this artist's depiction below:

Scientific depiction of polar vortex phenomena (not to scale)

But I digress. The real point is that all this cold weather got me thinking about windchill - what is it exactly? How is it determined? Let's do some everyday analysis.

Background

Wind chill hasn't always been calculated the same way, and there is some controversy about exactly how scientific its calculation is.
Wind chill depends upon only two variables - air temperature and wind speed - and the formula was derived not from physical models of the atmosphere but from experiments with human participants under simulated laboratory conditions.
Also, the old formula was replaced in 2001 by a new one, with Canada largely leading the effort, since there was some concern that the old formula gave values that were too low, and that people would consequently think they could safely withstand colder temperatures than they actually could.
The old formula had strange units, but I found a page at Carleton University which provides it in degrees Fahrenheit, so we can compare the old and new systems directly.

Analysis

Since the wind chill index is a function of two variables (a surface), we can calculate it over vectors of temperatures and wind speeds in R and visually depict the results as an image (filled contour):
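The original R code isn't reproduced here, but the calculation is easy to sketch. Below is a Python version: the first function is the 2001 NWS/Environment Canada formula, while the second is a commonly quoted Fahrenheit form of the pre-2001 index, which may differ slightly from the version on the Carleton page:

```python
def wind_chill_new(T, V):
    """The 2001 NWS / Environment Canada formula (T in deg F, V in mph)."""
    return 35.74 + 0.6215 * T - 35.75 * V ** 0.16 + 0.4275 * T * V ** 0.16

def wind_chill_old(T, V):
    """A commonly quoted Fahrenheit form of the pre-2001 index."""
    return 0.0817 * (3.71 * V ** 0.5 + 5.81 - 0.25 * V) * (T - 91.4) + 91.4

# Evaluate both over a grid (the index is only defined for winds above ~5 mph)
temps = list(range(-40, 21, 10))   # deg F
winds = list(range(10, 41, 5))     # mph
difference = [[wind_chill_new(t, v) - wind_chill_old(t, v) for v in winds]
              for t in temps]

print(f"0 F at 15 mph: old {wind_chill_old(0, 15):.0f} F, "
      f"new {wind_chill_new(0, 15):.0f} F")   # -> old -31 F, new -19 F
```

The `difference` grid is exactly what a filled contour of the gap between the two systems would depict: positive in the bitter cold at high winds, where the new formula reads warmer than the old.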

Which results in the following plots:

And the absolute difference between the two:

For low wind speeds (around 5 mph - wind chill is only defined when wind speed is greater than 5 mph) you can see that the new system is colder, but for wind speeds greater than 10 mph the opposite is true, especially so in the bitter, bitter cold (high winds and very cold temperatures). This is in line with the desire to correct the old system for giving values which were felt to be too low.

If you're a really visual person, here is the last contour plot as a surface:

Which, despite some of the limitations of 3-D visualization, shows the non-linear nature of the two systems and the difference between them.

Conclusion

This was a pretty interesting exercise and shows again how mathematics permeates many of our everyday notions - even if we're not necessarily aware of it being the case. 
For me the takeaway here is that wind chill is not an exact metric based on the physical laws of the atmosphere, but instead a more subjective one based upon people's reaction to cold and wind (an inanimate object cannot "feel" wind chill).
Despite the difficulty of the problem of trying to exactly quantify how much colder the blustery arctic winds make it feel outside, saying "-32F with the wind chill" will still always be better than saying "dude, it's really really cold outside."
Either way, be sure to wear a hat.

References & Resources

Windchill (at Wikipedia)
National Weather Service - Windchill Calculator
National Weather Service - Windchill Terms & Definitions 
Environment Canada - Canada's Windchill Index

Snapchat Database Leak – Visualized

Introduction

In case you don't read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users' details leaked due to a security vulnerability which was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users' details with the intent to "raise public awareness on how reckless many internet companies are with user information."

Which is awesome - kudos to those guys, both for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check whether your account is in the DB here.

However, if you're a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not OMG am I safe?! but let's do some analysis!

Analysis

Area Code
As I have noted in a previous musing, it's difficult to do any sort of in-depth analysis if your data have limited dimensionality - here there are only 3 fields: the phone number with the last two digits obscured, the username, and the area.
Fortunately, because some of the data here are geographic, we can do some cool visualization with mapping. First we look at the high-level view, by state. California had the most accounts compromised overall, with just shy of 1.4M details leaked. New York State was next at just over a million.
Because the accounts weren't spread evenly across the states, below is a more detailed view by area code. You can see that it's mainly Southern California and the Bay Area where the accounts are concentrated.
Usernames
Well, that covers the geographic component, which leaves only the usernames and phone numbers. I'm not going to look into the phone numbers (I mean, what can you really do, other than look at the distribution of digits - about which I have a strong hypothesis already).
Looking at the number of accounts which include numerals versus those that do not, the split is fairly even - 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:
The remainder of the tail is not present, and I assume it would fill in if there were more data. I had the axis stretch to 30 for perspective, as there was one username in the file of length 29.
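The username summaries above boil down to a digit check and a length tally. Here is a minimal sketch with made-up usernames standing in for the leaked file (which obviously isn't reproduced here):

```python
import re
from collections import Counter

# Hypothetical sample of usernames -- invented for illustration
usernames = ["johnsmith", "jane_doe99", "snapfan2013", "xx_cooldude_xx", "a" * 29]

# Split into those containing a numeral and those that do not
has_digit = sum(1 for u in usernames if re.search(r"\d", u))
print(f"{has_digit}/{len(usernames)} usernames contain a digit")

# Tally username lengths -- the raw counts behind the length histogram
length_counts = Counter(len(u) for u in usernames)
print(sorted(length_counts.items()))
```

Run over the full file, the digit check gives the ~56/44 split reported above, and the length tally is what gets plotted as the (roughly skew-normal) distribution.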

Conclusion

If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. Natural phenomena follow, or nearly follow, theoretical distributions amazingly closely
I'm not in the leak, so I'm not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when we use smartphone apps, and of ensuring there is some measure of care and accountability on the creators' end.

Update:
Snapchat has released a new statement promising an update to the app which makes the compromised feature optional, increased security around the API, and working with security experts in a more open fashion.

What’s in My Inbox? Data Analysis of Outlook

Introduction

Email is the bane of our modern existence.

Which of us hasn't had a long, convoluted, back-and-forth email thread going on for days (if not weeks) in order to settle an issue that could have been resolved with a simple 5-minute conversation?

With some colleagues of mine, email has become so overwhelming (or their attempts to organize it so futile) that it brings to my mind Orwell's workers at the Ministry of Truth in 1984 and their pneumatic tubes and memory holes - if the message you want is not in the top 1% (or 0.01%) of your inbox and you don't know how to use search effectively, then for all intents and purposes it might as well be gone (see also: Snapchat).

Much has been written on the subject of why exactly we send and receive so much of it, how to best organize it, and whether or not it is, in fact, even an effective method of communication.

At one time even Gmail and the concept of labels were revolutionary - and they have done some good in organizing the ever-increasing deluge that is email for the majority of people. Other attempts have sprung up to tame the beast and make sense of such a flood of communication - most notably, in my mind, Inbox Zero, the simply-titled smartphone app Mailbox, and MIT's recent data visualization project Immersion.

But email, with all its systemic flaws, misuse, and annoyances, is definitely here for good, no question. What a world we live in.

But I digress.

Background

I had originally hoped to export everything from Gmail and do a very thorough analysis of all my personal email. Though this is now a lot easier than it used to be, I got frustrated at the time trying to write a Python script and moved on to other projects.
But then I thought, hey, why not do the same thing for my work email? I recently discovered that it's quite easy to export email from Outlook (as I detailed last time) so that brings us to this post.
I was somewhat disappointed that Outlook can only export a folder at a time (which does not include special folders such as search folders or 'All Mail') - I organize my mail into folders and wanted an export of all of it.
That being said, the bulk probably does remain in my inbox (4,217 items in my inbox resulted in a CSV that was ~15 MB) and we can still get a rough look using what's available. The data cover the period from February 27th, 2013 to November 16th, 2013.

Email by Contact
First let's look at the top 15 contacts by total number of emails. Here are some pretty simple graphs summarizing that data, first by category of contact:

In the top 15, the split between co-workers/colleagues and management is pretty even. I received about 5 times as much email from coworkers and managers as from stakeholders (but then again, a lot of the latter ended up sorted into folders, so the true count is probably higher). Still, I don't directly interact with stakeholders as much as some others do, and tend to work with teams or my immediate manager. Also, calls are usually better.

Here you can see that I interacted primarily with my immediate colleague and manager the most, then other management, and the remainder further down the line are a mix which includes email to myself and from office operations. Also of note - I don't actually receive that much email (I'm more of a "in the weeds" type of guy) or, as I said, much has gone into the appropriate folders.

Time-Series Analysis
The above graphs show a very simplistic and high-level view of what proportion of email I was receiving from whom (with a suitable level of anonymity, I hope). More interesting is a quick and simple analysis of patterns over time in the volume of email I received - and I'm pretty sure you already have an idea of what some of these patterns might be.

When doing data analysis, I always feel it is important to first visualize as much of the data as practically possible - in order to get "a feel" for the data and avoid making erroneous conclusions without having this overall familiarity (as I noted in an earlier post). If a picture is worth a thousand words, then a good data visualization is worth a thousand keystrokes and mouse clicks.

Below is a simple scatterplot of all the emails received by day, with the time of day on the y-axis:


This scatterplot is perhaps not immediately particularly illuminating; however, it already shows us a few things worth noting:

  • the majority of emails appear in a band approximately between 8 AM and 5 PM
  • there is increased density of email in the period between the end of July and early October, after which there is a sparse interval until mid-month / early November
  • there appears to be some kind of periodic nature to the volume of daily emails, giving a "strip-like" appearance (three guesses what that periodic nature is...)

We can look into this further by considering the daily volume of emails, as below. The black line is a 7-day moving average:

We can see the patterns noted above - the increase in daily volume after 7/27 and the marked decrease mid-October. Though I wracked my brain and looked thoroughly, I couldn't find a specific reason why there was an increase over the summer - this was just a busy time for projects (and probably not for myself sorting email). The marked decrease in October corresponds to a period of bench time, which you can see was rather short-lived.

As I noted previously in analyzing communications data, the distribution of this type of information is exponential in nature and usually follows a log-normal distribution. As such, a moving average is not the greatest measure of central tendency - but a decent approximation for our purposes. Still, I find the graph a little more digestible when depicted with a logarithmic y-axis, as below:
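The 7-day moving average used in these plots is just a trailing window mean over the daily counts. A minimal sketch (the daily volumes here are invented; the real ones come from the exported CSV):

```python
def moving_average(series, window=7):
    """Trailing moving average; returns None until a full window is available."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

# Made-up daily email volumes
daily_counts = [12, 30, 25, 3, 1, 20, 28, 35, 22]
print(moving_average(daily_counts))
```

Because the underlying distribution is so skewed, a trailing median computed the same way would be a more robust smoother, but the mean is the conventional choice and close enough here.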

Lastly we consider the periodic nature of the emails which is noted in the initial scatterplot. We can look for patterns by making a standard heatmap with the weekday as the column and hour of day as the row, as below:

You can clearly see that the majority of work email occurs between the hours of 9 to 5 (shocking!). Some other interesting points of note are the bulk of email in the mornings at the beginning of the week, the fall-off after 5 PM at the end of the week (Thursday & Friday), and the messages received Saturday morning. Again, I don't really receive that much email, or have spirited a lot of it away into folders, as I noted at the beginning of the article (this analysis does not include things like automated emails and reports, etc.).
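The aggregation behind such a heatmap is a simple tally into (weekday, hour) cells. A sketch with a few invented timestamps standing in for the Outlook export:

```python
from collections import Counter
from datetime import datetime

# Hypothetical received timestamps -- in practice these come from the exported CSV
received = [
    datetime(2013, 7, 29, 9, 15),   # a Monday morning
    datetime(2013, 7, 29, 9, 45),
    datetime(2013, 8, 1, 17, 30),   # a Thursday after 5 PM
]

# Tally emails into (weekday, hour) cells -- the same aggregation as the heatmap
cells = Counter((ts.strftime("%A"), ts.hour) for ts in received)
for (weekday, hour), n in sorted(cells.items()):
    print(f"{weekday} {hour:02d}:00  {n}")
```

Each cell count then maps to a colour intensity, with weekday as the column and hour of day as the row.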

Email Size & Attachments
Looking at file attachments, I believe the data are more skewed than the rest, as the clean-up of large emails is a semi-regular task for the office worker (not many have the luxury of unlimited email inbox capacity - even executives), so I would expect values on the high end to have largely been removed. Nevertheless it still provides a rough approximation of how email sizes are distributed and what proportion have attachments included.

First we look at the overall proportion of email left in my inbox that has attachments - of the 4,217 emails, 2,914 did not have an attachment (69.1%) and 1,303 did (30.9%).

Examining the size of emails (which includes the attachments) in a histogram, we see a familiar-looking distribution, which I have further expanded here by making it into a Pareto chart (note that the scale on the left y-axis is logarithmic):

Here we can see that of what was left in my inbox, all messages were about 8 MB in size or less, with the vast majority being 250 KB or less. In fact 99% of the email was less than 1,750 KB, and 99.9% less than 6 MB.
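The cumulative line of a Pareto chart like this comes from binning the sizes and accumulating the counts; a small numpy sketch on made-up sizes:

```python
import numpy as np

# Made-up message sizes in KB (illustrative, not the actual inbox)
sizes = np.array([12, 45, 80, 120, 250, 300, 900, 1600, 5800, 8000])

# Histogram bars: counts per size bin (edges in KB)
bins = [0, 250, 1000, 2000, 8192]
counts, _ = np.histogram(sizes, bins=bins)

# Cumulative percentage line of the Pareto chart
cum_pct = np.cumsum(counts) / counts.sum() * 100
print(cum_pct.tolist())  # → [40.0, 70.0, 80.0, 100.0]
```

Reading the cumulative line at a given size bin gives statements like "99% of the email was less than 1,750 KB" directly.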

Conclusion

This was a very quick analysis of what was in my inbox, but we saw some interesting points of note, some of which confirm what one would expect - in particular:
  • vast majority of email is received between the hours of 9-5 Monday to Friday
  • the majority of email I received was from the two managers & colleagues I work closest with
  • approximately 3 out of 10 emails I received had attachments
  • the distribution of email sizes is heavily skewed, roughly log-normal in nature
If I wanted to take this analysis further, we could also look at the trending by contact and also do some content analysis (the latter not being done here for obvious reasons, of course).
This was an interesting exercise because it made me mindful again of what everyday analytics is all about - analyzing rich data sets we are producing all the time, but of which we are not always aware.

References and Resources

Inbox Zero
http://inboxzero.com/

Mailbox
http://www.mailboxapp.com/

Immersion
https://immersion.media.mit.edu/

Data Mining Email to Discover Organizational Networks and Emergent Communities in Work Flows

How to Export Your Outlook Inbox to CSV for Data Analysis

So one of my colleagues at work showed me this cool script he wrote in Visual Basic to pull all the data from Outlook for analysis.

Cool, I thought - I'd like to do that, but don't want to muck about in VB.

Well, I was surprised to discover that Outlook has the ability to export email to CSV built in! Follow the simple steps below (demonstrated here in Outlook 2010) and you can analyze your emails yourself and do some cool quantified-self-type analysis.

How to Export Outlook Email to CSV (from Outlook)

1. Open Outlook and click File then Options to bring up the options dialog:
2. Select Advanced, then click the Export button:
3. Click Export to a file and then the Next button:
4. Select Comma Separated Values (Windows) and click Next.
5. Unless you want to export a different folder, select Inbox and click Next.
6. Browse to a folder and/or type a filename for your export.
7.  Choose Map Custom Fields... if you want to customize which fields to export. Otherwise click the Finish button.
8. Sit tight while Outlook does its thing.
You should now have a CSV file of your inbox data!

How to Export Outlook Email to CSV (from Access)

This is all very well and good, but unfortunately exporting to CSV from Outlook does not provide the option for date and time as fields to be included, which makes it useless if you'd like to do time series (or other temporal) analysis.
To get the date and time data you can pull data from Outlook into Access and then export it as noted in this metafilter thread.
Import from Outlook into Access
1. Fire up Access and create a new database. Select External Data, More... and then Outlook Folder.
2. Select Import the source data into a new table in the current database and click OK.


3. Select the email account and folder you'd like to import and click Next 
4. Change the field settings if you'd like. Otherwise accept the defaults by clicking Next


5. Let Access add the primary key or not (you don't need it). Click Next 


6. Click Finish and wait. When the process is done you should have a new table called 'Inbox'.



Export Data from Access to a CSV
1. Make sure the Inbox table is selected and click External Data then Text File.
2. Pick or type a filename and click OK


3. Select Delimited and click Next.
4. Select Comma as the delimiter and tick the box which says Include Field Names on First Row. Click Next.
5. Pick or type a filename and click Finish


You should now have your Inbox data exported as CSV (including time / date data!) and ready for analysis. Of course you can repeat this process and append to the Access database folder by folder to analyze all the mail you have in Outlook.

What’s in my Pocket? (Part II) – Analysis of Pocket App Article Tagging

Introduction

You know what's still awesome? Pocket.

As I noted in an earlier post (oh god, was that really more than a year ago?!) I started using the Pocket application, previously known as Read It Later, in July of 2011 and it has changed my reading behavior ever since.

Lately I've been thinking a lot about quantified self and how I'm not really tracking anything anymore. Something which was noted at one of the Meetups is that data collection is really the hurdle: like anything in life - voting, marketing, dating, whatever - you have to make it easy otherwise most people probably won't bother to do it. I'm pretty sure there's a psychological term for this - something involving the word 'threshold'.

That's where smartphones come in. Some people have privacy concerns about having all their data in the cloud (obviously I don't, as I'm willingly putting myself on display in the blog here) but that aside, one of the cool things about smartphone apps is that you are passively creating lots of data. Over time this results in a data set about you. And if you know how to pull that data you can analyze it (and hence yourself). I did this previously, for instance with my text messages and also with data from Pocket collected up to that time.

So let's give it a go again, but this time with a different focus for the analysis.

Background

This time I wasn't so interested in when I read articles and from where, but more so in the types of articles I was reading. In the earlier analysis, I summarized what I was reading by the top-level domain of the site - what resulted was a high-level overview of my online reading behavior.
Pocket added the ability for you to tag your articles. The tags are similar to labels in Gmail and so the relationships can be many to one. This provides a way for you to categorize your reading list (and archive) by category, and for the purposes of this analysis here, to analyze them accordingly.
First and foremost, we need the data (again). Unfortunately, over the course of the development of the Pocket application, the amount of data you can easily get via export (without using the API) has diminished. Originally the export was available as either XML or JSON, but those formats are no longer offered.
However, you can still export your reading list as an HTML file, which contains attributes in the link elements for the time the article was added and the tags it has attached.

Basically the export is quasi-XML, so it's a simple matter of writing some R code using the XML library to get the data into a format we can work with (CSV):

Here I extract the attributes and also create a column for each tag name with a binary value indicating whether the article had that tag (one of my associates at work would call this a 'classifier', though it's not the data science-y kind). Because I wrote this in a general enough fashion, you should be able to run the code on your own Pocket export and get the same results.
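The actual snippet is R (using the XML library), but the idea translates directly; here's a rough Python equivalent run on a made-up two-link sample in the shape of the export (the time_added and tags attributes are the ones described above):

```python
from html.parser import HTMLParser

# A two-link sample in the shape of Pocket's HTML export
# (made-up URLs and timestamps)
sample = """
<ul>
<li><a href="http://example.com/a" time_added="1380000000" tags="data,psych">A</a></li>
<li><a href="http://example.com/b" time_added="1380100000" tags="">B</a></li>
</ul>
"""

class PocketParser(HTMLParser):
    """Collect href, time_added and the tag list from each link."""
    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            tags = [t for t in d.get("tags", "").split(",") if t]
            self.rows.append({"url": d["href"],
                              "time_added": int(d["time_added"]),
                              "tags": tags})

parser = PocketParser()
parser.feed(sample)

# One binary indicator column per tag, as described above
all_tags = sorted({t for row in parser.rows for t in row["tags"]})
matrix = [[int(t in row["tags"]) for t in all_tags] for row in parser.rows]
print(all_tags, matrix)
```

From here the rows can be written out as CSV and the binary tag matrix used for the correlation analysis further on.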
Now that we have some data we can plunk it into Excel and do some data visualization.

Analysis

First we examine the state of articles over time - what is the proportion of articles added over time which were tagged versus not?

Tagged vs. Untagged

You can see that initially I resisted tagging articles, but starting in November I adopted the practice and began tagging almost all articles added. And because stacked area graphs are not especially good data visualization, here is a line graph of the number of articles tagged per month:

This better shows that I gradually adopted tagging from October into November. Another thing to note from this graph is that my Pocket usage peaked between November of last year and May of this year, after which the number of articles added on a monthly basis decreases significantly (hence the previous graph being proportional).

Next we examine the number of articles by subject area. I've collected them into more-or-less meaningful groups and will explain the different tags as we go along. Note the changing scale on the y-axes for these graphs, as the absolute number of articles varies greatly by category.

Psych & Other Soft Topics
As I noted previously in the other post, when starting to use Pocket I initially read a very large number of psych articles.

I also read a fair number of "personal development" articles (read: self-helpish - mainly from The Art of Manliness) which has decreased greatly as of late. The purple are articles on communications, the light blue "parapsych", which is my catchall for new-agey articles relating to things like the zodiac, astrology, mentalism, mythology, etc. (I know it's all nonsense, but hey it's good conversation for dinner parties and the next category).

The big spike was due to a cool site I found recently with lots of articles on the zodiac (see: The Barnum Effect). Most of these were later deleted.

Dating & Sex
Now that I have your attention... what, you don't read articles on sex? The Globe and Mail's Life section has a surprising number of them. Also, if you read men's magazines online there are a lot, most of which are actually pretty awful. You can see too that articles on dating made up a large proportion of my reading back in the fall, also from those types of sites (which thankfully I now visit far less frequently).

News, etc.
This next graph is actually a bit busy for my liking, but I found this data set somewhat challenging to visualize overall, given the number of categories and how they change in time.

News is just that. Tech mostly the internet and gadgets. Jobs is anything career related. Finance is both in the news (macro) and personal. Marketing is a newcomer.

Web & Data

The data tag relates to anything data-centric - as of late more applied to big data, data science and analytics. Interestingly my reading on web analytics preceded my new career in it (January 2013), just like my readings in marketing did - which is kind of cool. It also goes to show that if you read enough about analytics in general you'll eventually read about web analytics.

Data visualization is a tag I created recently so has very few articles - many of which I would have previously tagged with 'data'.

Life & Humanities

If that other graph was a little too busy this one is definitely so, but I'm not going to bother to break it out into more graphs now. Articles on style are of occasional interest, and travel has become a recent one. 'Living' refers mainly to articles on city life (mostly from The Globe as well as the odd one from blogto).

Work
And finally some new-comers, making up the minority, related to work:

SEO is search engine optimization and dev refers to development, web and otherwise.

Gee that was fun, and kind of enlightening. But tagging in Pocket is like in Gmail - it is not one-to-one but many-to-one. So next I thought to try to answer the question: which tags are most related? That is, which tags are most commonly applied to articles together?

To do this we again turn to R; the following code snippet, on top of the previous one, does the trick:

All this does is remove the untagged articles from the tag frame and then run a correlation between each column of the tag matrix. I'm no expert on exotic correlation coefficients, so I simply used the standard one (Pearson's). In the case of simple binary variables (true/false, such as here), the internet informs me that this reduces to the phi coefficient.
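The post's snippet is R, but the logic is easy to sketch in Python: drop the untagged rows, then correlate the columns pairwise (for 0/1 columns Pearson's r is the phi coefficient). The matrix below is illustrative, not the real 30-tag data:

```python
import numpy as np

# Toy binary tag matrix: rows are articles, columns are tags
tags = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
])

# Drop articles with no tags at all, then column-wise Pearson
# correlation (phi coefficient for binary data)
tagged = tags[tags.sum(axis=1) > 0]
corr = np.corrcoef(tagged, rowvar=False)
print(corr.round(2))
```

In this toy example the first two tags never co-occur, so their phi coefficient comes out at -1; the real 30 x 30 matrix is what gets rendered as the heatmap below.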

Given there are 30 unique tags, this creates a 30 x 30 matrix, which is visualized below as a heatmap:

Redder is negative, greener is positive. I neglected to add a legend here as when not using ggplot or a custom function it is kind of a pain, but some interesting relationships can still immediately be seen. Most notably food and health articles are the most strongly positively correlated while data and psych articles are most strongly negatively correlated.

Other interesting relationships are that psych articles are negatively correlated with jobs, tech and web analytics (surprise, surprise) and positively correlated with communications, personal development and sex; news is positively correlated with finance, science and tech.

Conclusion

All in all this was a fun exercise and I also learned some things about my reading habits which I already suspected - the amount I read (or at least save to read later) has changed over time as well as the sorts of topics I read about. Also some types of topics are far more likely to go together than others.
If I had a lot more time I could see taking this code and standing it up into some sort of generalized analytics web service (perhaps using Shiny if I was being really lazy) for Pocket users, if there was sufficient interest in that sort of thing.
Though it was still relatively easy to get the data out, I do wish the XML/JSON export would be restored to provide easier access for people who want their data but are not necessarily developers. As I am not a developer myself, my attempts to use the new API for extraction purposes were somewhat frustrating (and ultimately unsuccessful).

Though apps often make our lives easier with passive data collection, all this information being "in the cloud" does raise questions of data ownership (and governance) and I do wish more companies, large and small, would make it easier for us to get a hold of our data when we want it.

Because at the end of the day, it is ultimately our data that we are producing - and it's the things it can tell us about ourselves that makes it valuable to us.

Resources

Pocket - Export Reading List to HTML
Pocket - Developer API
Phi Coefficient
The Barnum (Forer) Effect
code on github

Bananagrams!!!

It was nice to be home with the family for Thanksgiving, and to finally take some time off.

A fun little activity which took up a lot of our time over the past weekend was Bananagrams, which, if you don't already know, is sort of like a more action-packed version of Scrabble without the board.

Being the type of guy that I am, I started to think about the distribution of letters in the game. A little Googling led to some prior art for this post.

The author did something neat (which I wouldn't have thought of) by making a sort of bar chart using the game pieces. Strangely though, they chose not to graph the different distributions of letters in Bananagrams and Scrabble but instead listed them in a table.

So, assuming the data from the post are accurate, here is a quick breakdown of said distributions below. As an added bonus, I've also included that trendy digital game that everyone plays on Facebook and their iDevices:

Bar graph of letter frequencies of Scrabble, Bananagrams and Words with Friends

Looking at the graph, it's clear that Bananagrams has more tiles than the other games (the total counts are 144, 104 and 100 for Bananagrams, Words with Friends and Scrabble respectively) and notably it also does not have blank tiles, of which the other games have 2 each. Besides the obvious prevalence of vowels in all 3 games, T, S, R, N, L and D also have high occurrence.

We can also compare the relative frequencies of the different letters in each game with respect to Scrabble. Here I took the letter frequency for each game (as a percent) then divided it by the frequency of the same letter in Scrabble. The results are below:
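In other words, for each letter: (count in game / total tiles in game) divided by (count in Scrabble / 100), as a percent. A quick sketch using a few of the letter counts (E, H and Z; the counts are taken from the published distributions being compared, so treat them as assumptions):

```python
# Letter counts for a few letters (assumed from the published
# distributions: Scrabble 100 tiles, WWF 104, Bananagrams 144)
scrabble = {"E": 12, "H": 2, "Z": 1}
wwf = {"E": 13, "H": 4, "Z": 1}
bananagrams = {"E": 18, "H": 3, "Z": 2}

def rel_freq(game, total_game, base=scrabble, total_base=100):
    # Per-letter frequency (as % of tiles) relative to Scrabble
    return {k: (game[k] / total_game) / (base[k] / total_base) * 100
            for k in base}

print(rel_freq(wwf, 104))          # H comes out near 192%
print(rel_freq(bananagrams, 144))  # Z comes out near 139%
```

The ~192% figure for H in Words with Friends discussed below falls straight out of this calculation.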

Bar chart of Bananagrams and Words with Friends letter frequencies relative to Scrabble

Here it is interesting to note that the relative frequency of H in Words with Friends is nearly double that in Scrabble (~192%). Also D, S and T have greater relative frequencies. The remaining letters are fairly consistent with the exception of I and N which are notably less frequent.

Bananagrams relative letter frequency is fairly consistent overall, with the exception of J, K, Q, X, and Z, which are all around the 140% mark. I guess the creator of the game felt there weren't enough of the "difficult" letters in Scrabble.

There's more analysis that could be done here (looking at the number of points per letter in WWF & Scrabble versus their relative frequency immediately comes to mind) but that should do for now. Hope you found this post "a-peeling".

Analysis of the TTC Open Data – Ridership & Revenue 2009-2012

Introduction

I would say that the relationship between the citizens of Toronto and public transit is a complicated one. Some people love it. Other people hate it and can't stop complaining about how bad it is. The TTC want to raise fare prices. Or they don't. It's complicated.
I personally can't say anything negative about the TTC. Running a business is difficult, and managing a complicated beast like Toronto's public system (and trying to keep it profitable while keeping customers happy) cannot be easy. So I feel for them. 
I rely extensively on public transit - in fact, I used to ride it every day to get to work. All things considered, for what you're paying, this way of getting around the city is a hell of a good deal (if you ask me) compared to the insanity that is driving in Toronto.
The TTC's ridership and revenue figures are available as part of the (awesome) Toronto Open Data initiative for accountability and transparency. As I noted previously, I think the business of keeping track of things like how many people ride public transit every day must be a difficult one, so you have to appreciate having this data, even if it is likely more of an approximation and is in a highly summarized format.
There are larger sources of open data related to the TTC which would probably be a lot cooler to work with (as my acquaintance Mr. Branigan has done) but things have been busy at work lately, so we'll stick to this little analysis exercise.

Background

The data set comprises numbers for: average weekly ridership (in 000's), annual ridership (peak and off-peak), monthly & budgeted monthly ridership (in 000's), and monthly revenue, actual and budgeted (in millions $). More info here [XLS doc].

Analysis

First we consider the simplest data: the annual peak and off-peak ridership. Looking at this simple line graph you can see that off-peak ridership has increased more than peak ridership since 2009 - peak and off-peak ridership increasing by 4.59% and 12.78% respectively. Total ridership over the period has increased by 9.08%.

Below we plot the average weekday ridership by month. As you can see, this reflects the increasing demand on the TTC system we saw summarized yearly above. Unfortunately Google Docs doesn't have trendlines built-in like Excel (hint hint, Google), but unsurprisingly if you add a regression line the trend is highly significant ( > 99.9%) and the slope gives an increase of approximately 415 weekday passengers per month on average.
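For the curious, the slope of such a trendline is just a least-squares fit; a minimal numpy sketch on a synthetic, noiseless series (illustrative, not the TTC numbers):

```python
import numpy as np

# Synthetic monthly series (in 000's) with a built-in trend of
# 0.415 (i.e. ~415 riders) per month -- illustrative only
months = np.arange(48)            # Jan 2009 .. Dec 2012
ridership = 1500 + 0.415 * months

# Least-squares trendline; the slope is the average increase per month
slope, intercept = np.polyfit(months, ridership, 1)
print(round(slope, 3))  # → 0.415
```

On real (noisy) data the fit also yields residuals from which the significance of the trend can be assessed.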

Next we come to the ridership by month. If you look at the plot over the period of time, you can see that there is a distinct periodic behavior:

Taking the monthly averages we can better see the periodicity - there are peaks in March, June & September, and a mini-peak in the last month of the year:

This is also present in both the revenue (as one would expect) and the monthly budget (which means that the TTC is aware of it). As to why this is the case, I can't immediately discern, though I am curious to know the answer. This is where it would be great to have some finer grained data (daily or hourly) or data related to geographic area or per station to look for interesting outliers and patterns.

Alternatively if we look at the monthly averages over the years of average weekday ridership (an average of averages, I am aware - but the best we can do given the data we have), you can see that there is a different periodic behavior, with a distinct downturn over the summer, reaching a low in August which then recovers in September to the maximum. This is interesting and I'm not exactly sure what to make of it, so I will do what I normally do which is attribute it to students.

Lastly, we come to the matter of the financials. As I said the monthly revenue and budget for the TTC follow the same periodic pattern as the ridership, and on the plus side, with increased ridership, there is increased revenue. Taking the arithmetic difference of the budgeted (targeted) revenue from actual, you can see that over time there is a decrease in this quantity:
Again if you do a linear regression this is highly significant ( > 99.9%). Does this mean that the TTC is becoming less profitable over time? Maybe. Or perhaps they are just getting better at setting their targets? I acknowledge that I'm not an economist, and what's been done here is likely a gross oversimplification of the financials of something as massive as the TTC.

That being said, the city itself acknowledges [warning - large PDF] that while the total cost per hour for an in-service transit vehicle has decreased, the operating cost has increased, which they attribute to increases in wages and fuel prices. Operating public transit is also more expensive here in TO than other cities in the province, apparently, because we have things like streetcars and the subway, whereas most other cities only have buses. Either way, as I said before, it's complicated.

Conclusion

I always enjoy working with open data and I definitely appreciate the city's initiative to be more transparent and accountable by providing the data for public use.
This was an interesting little analysis and visualization exercise and some of the key points to take away are that, over the period in question:
  • Off-peak usage of the TTC is increasing at a greater rate than peak usage
  • Usage as a whole is increasing, with about 415 more weekday riders per month on average, and growth of ~9% from 2009 to 2012
  • Periodic behavior in the actual ridership per month over the course of the year
  • Different periodicity in average weekday ridership per month, with a peak in September
It would be really interesting to investigate the patterns in the data in finer detail, which hopefully should be possible in the future if more granular time-series, geographic, and categorical data become available. I may also consider digging into some of the larger data sets, which have been used by others to produce beautiful visualizations such as this one.

I, for one, continue to appreciate the convenience of public transit here in Toronto and wish the folks running it the best of luck with their future initiatives.

References & Resources

TTC Ridership - Ridership Numbers and Revenues Summary (at Toronto Open Data Portal)

Toronto Progress Portal - 2011 Performance Measurement and Benchmarking Report