Perception in Data Visualization – A Quick 7 Question Test

When most people think of data, they probably think of a dry, technical analysis, without a lot of creativity or freedom. Quite to the contrary, data visualization encompasses choices of design, creative freedom, and also (perhaps most interestingly) elements of cognitive psychology, particularly related to the science of visual perception and information processing.

If you read any good text on dataviz, like Tufte, Few, or Cairo, you will, at some point, come across a discussion of the cognitive aspects of data visualization (the latter two devoting entire chapters to this topic). This will likely include a discussion of the most elemental ways to encode information visually, and their respective accuracies when quantity is interpreted from them, usually referencing the work of Cleveland & McGill [PDF].

Mulling over the veracity of my brief mention of the visual ways of encoding quantity in my recent talk, and also recently re-reading Nathan Yau’s discussion of the aforementioned paper, I got to thinking about just how much the accuracy of interpretation might differ between the different encodings.

I am not a psychologist or qualitative researcher, but given the above I quickly put together a simple test of 7 questions in Google Docs to examine the accuracy of interpreting proportional quantities when they are encoded visually, and I humbly request the favour of your participation. If there are enough responses I will put together what analysis is possible in a future post (using the appropriate visualization techniques, of course).

Apologies in advance for the grade-school wording of the questions, but I wanted to be as clear as possible to ensure consistency in the results. Thanks so much in advance for contributing! Click below for the quiz:


EDIT: The quiz will now be up indefinitely on this page.

colorRampPaletteAlpha() and addalpha() – helper functions for adding transparency to colors in R

colorRampPalette is a very useful function in R for creating color vectors to use as a palette, or to pass as an argument to a plotting function; however, it has a weakness: it disregards the alpha channel of the colors passed to it when creating the new vector.

I have also found that working with the alpha channel in R is not always the easiest, but it is something that scientists and analysts may often have to do – when overplotting, for example.

To address this I’ve quickly written the helper functions addalpha and colorRampPaletteAlpha: the former makes it easier to apply a scalar or vector of alpha values to a vector of colors, and the latter is a wrapper for colorRampPalette which preserves the alpha channel of the colors provided.
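For the curious, here is a minimal sketch of how helpers like these might be written using base R’s col2rgb() and rgb(); it is an illustration only, and the actual implementations (linked below) may differ:

# Minimal sketch only; the real addalpha() and colorRampPaletteAlpha() on GitHub may differ
addalpha_sketch <- function(colors, alpha = 1.0) {
  r <- col2rgb(colors, alpha = TRUE) / 255
  # recycle alpha across the colors and overwrite the alpha channel
  r[4, ] <- rep(alpha, length.out = ncol(r))
  rgb(r[1, ], r[2, ], r[3, ], r[4, ])
}

colorRampPaletteAlpha_sketch <- function(colors, n, interpolate = "linear") {
  # interpolate the RGB channels as colorRampPalette normally would...
  cr <- colorRampPalette(colors, interpolate = interpolate)(n)
  # ...then interpolate the alpha channel separately and re-apply it
  a  <- col2rgb(colors, alpha = TRUE)[4, ] / 255
  la <- if (interpolate == "spline") spline(a, n = n)$y else approx(a, n = n)$y
  addalpha_sketch(cr, pmax(0, pmin(1, la)))
}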

Using the two functions in combination it is easy to produce plots with variable transparency such as in the figure below:


The code is on github.

I’ve also written examples of usage, which includes the figure above.

# addalpha() and colorRampPaletteAlpha() usage examples
# Myles Harrison
# www.everydayanalytics.ca

library(MASS)
library(RColorBrewer)
# Source the colorRampAlpha file
source('colorRampPaletteAlpha.R')

# addalpha()
# ----------
# scalars:
col1 <- "red"
col2 <- rgb(1,0,0)
addalpha(col1, 0.8)
addalpha(col2, 0.8)

# scalar alpha with vector of colors:
col3 <- c("red", "green", "blue", "yellow")
addalpha(col3, 0.8)
plot(rnorm(1000), col=addalpha(brewer.pal(11,'RdYlGn'), 0.5), pch=16)

# alpha and colors vector:
alpha <- seq.int(0, 1, length.out=4)
addalpha(col3, alpha)

# Simple example
x <- seq.int(0, 2*pi, length=1000)
y <- sin(x)
plot(x, y, col=addalpha(rep("red", 1000), abs(sin(y))))

# with RColorBrewer
x <- seq.int(0, 1, length.out=100)
z <- outer(x,x)
c1 <- colorRampPalette(brewer.pal(11, 'Spectral'))(100)
c2 <- addalpha(c1,x)
par(mfrow=c(1,2))
image(x,x,z,col=c1)
image(x,x,z,col=c2)

# colorRampPaletteAlpha()
# Create normally distributed data
x <- rnorm(1000)
y <- rnorm(1000)
k <- kde2d(x,y,n=250)

# Sample colors with alpha channel
col1 <- addalpha("red", 0.5)
col2 <-"green"
col3 <-addalpha("blue", 0.2)
cols <- c(col1,col2,col3)

# colorRampPalette ditches the alpha channel
# colorRampPaletteAlpha does not
cr1 <- colorRampPalette(cols)(32)
cr2 <- colorRampPaletteAlpha(cols, 32)

par(mfrow=c(1,2))
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr1, add=T)
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr2, add=T)

# Linear vs. spline interpolation
cr1 <- colorRampPaletteAlpha(cols, 32, interpolate='linear') # default
cr2 <- colorRampPaletteAlpha(cols, 32, interpolate='spline')
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr1, add=T)
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr2, add=T)

Hopefully other R programmers who work extensively with color and transparency will find these functions useful.

Toronto Data Science Group – A Survey of Data Visualization Techniques and Practice

Recently I spoke at the Toronto Data Science group. The folks at Mozilla were kind enough to record it and put it on Air, so here it is for your viewing pleasure (and critique):


Overall it was quite well received. Aside from the usual omg does my voice really sound like that?? (which is to be expected), a couple of thoughts on the business of giving presentations were quite salient here:

  • Talk slower and enunciate
  • Gesture, but not too much
  • Tailor sizing and colouring of visuals, depending on projection & audience size

I’ve reproduced the code used to create the figures made in R (including the bubble chart example, with code and data from FlowingData), which regrettably I neglected to save at the time. Here it is in a gist:

The visuals are also available on Slideshare.

Lessons learned: talk slower, always save your code, and Google stuff before starting – because somebody’s probably already done it before you.

In Critique of Slopegraphs

I’ve been doing more research into less common types of data visualization techniques recently, and was reading up on slopegraphs.

Andy Kirk wrote a piece praising slopegraphs last December, which goes over the construction of a slopegraph with some example data very nicely. However, I’ve seen some bad examples of data visualization across the web using them, so I thought I’d put in my two cents.

Introductory remarks

I tend to think of slopegraphs as a very boiled-down version of a normal line chart, in which you have only two values for your independent variable and strip away all the non-data ink. This works because once you label all the individual components, you can take away all the cruft: you don’t need the legend or axes anymore, do you? Here is a before-and-after example below, using the soccer data from Andy’s post.
First as a line graph:
Hmm, that’s not very enlightening, is it? There are so many values for the categorical variable (team) that the graph requires a plethora of colours in the legend, and a considerable amount of back-and-forth to interpret. Contrast this with the slopegraph, which is much easier to interpret as the individual values can be read off, and which also ditches the non-data ink of the axes:

Here it is much easier to read off values for the individual teams, it feels less cluttered, and more data have been encoded both in colour (orange for a decrease between the two years, and blue for an increase) as well as the thickness of the lines (thicker lines for change of > 25%).

Pros and Cons

In my opinion, the slope graph should be viewed as an extension of the line graph, and so even though traditional chart elements like the y-axis have been stripped away, consistency should be kept with the regular conventions of data visualization.
In the above example, Andy has correctly honoured vertical position, so that each team appears on either side of the graph at the correct height according to the number of points it has. This matches one of Dr. Tufte’s original graphs (from The Visual Display of Quantitative Information), which follows the same practice and which I quite like:
Brilliant. However, when you no longer honour the vertical position to encode value, you lose the ability to truly compare across the categorical variable, a practice I tend to disagree with. This is usually done for legibility’s sake (to “uncrowd” the graph when there are a lot of lines); however, I feel it could still be avoided in most cases. See below for an example.

Here the vertical position is not honoured, as some values which are smaller appear above those which are larger, so that the lines do not cross and the graph is uncluttered.

Also, it should be noted that in this case there are more than two values in the independent variable. As long as the scale in the vertical direction is still consistent, the changes in quantity can still be compared by the slope of the lines, even if the exact values cannot be compared because the vertical position no longer corresponds directly to quantity.

Either way, this type of slopegraph is closer to a group of sparklines (as Tufte originally noted), as it allows comparison of the changes in the dependent variable across values of the independent variable for each value of the categorical variable, but not of the exact quantities.

Where things really start to fall apart though, is when slope graphs are used to connect values from two different variables. Charlie Park has some examples of this on his blog post on the subject, such as the one from Ben Fry below:

So here’s the question – what, exactly, does the slope of the different lines correspond to? The variable on the left is win-loss record and on the right is total salary. Park correctly notes that in this case the slopegraph is an extension of a parallel coordinates graph, which requires some further discussion.
A parallel coordinates graph is all very well and good for doing exploratory data analysis and finding patterns in data with a large number of variables. However, I would avoid graphs like the one above in general: because the variables on the left and the right are not the same, the slope of the line is essentially meaningless.
In this case of the baseball data, why not just display the information in a regular scatterplot, as below? Simple and clear. You can then include the additional information using colour and size respectively if desired and make a bubble chart.

Was the disproportionately large payroll of the Yankees as obvious in the previous visualization? Maybe, but not as saliently. The relative size of the payroll was encoded in the thickness of the line, but quantity is not interpreted as quickly and accurately when encoded using area/thickness as it is when using position. Also, because the previous data were ranked (vertical position did not portray quantity), the much smaller number of wins by Kansas relative to the other teams was not as apparent as it is here.

Fry notes that he chose not to use a scatterplot as he wanted ranking for both quantities, which I suppose is the advantage of the original treatment, and something which is not depicted in the alternative I’ve presented. Also, Park correctly notes in the examples on his post that different visualizations draw the eye to different features of the data, and some people have more difficulty interpreting a visualization like a bubble chart than a slopegraph. Still, I remain a skeptical functionalist as far as visualization is concerned, and prefer the treatment above to the former.

Alternatives

I’ve presented some criticism of slopegraphs here, but are there alternatives? Yes. In addition to the above, let’s explore some others, using the data from the soccer example.

Really what we are interested in is the change in the quantity over the two values of the independent variable (year). So we can instead look at that quantity (change between the two years), and visualize it as a bar graph with a baseline of zero. Here the bars are again coloured by whether the change is positive or negative.

This is fine; however we lost the information encoded in the thickness of the lines. We can encode that using the lightness (intensity) of the different bars. Dark for > 25% change, light for the others:

Hmm, not bad. However we’ve still lost the information about the absolute value of points each year. So let’s make that the value along the horizontal axis instead.

Okay fine, now the length of the bars corresponds to the magnitude of the change in points across the two years, with positive changes being coloured blue and negative orange, and the shading corresponding to whether the change was greater or less than 25%.

However, even if I put in a legend and told you what the colours correspond to, it’s pretty common for people to think of things as progressing from left to right (at least in Western cultures). The graph is difficult to interpret because for bars in orange the score for the first year is on the right, whereas for those in blue it’s on the left. That is to say, we have the absolute values, but the direction of the change is not depicted well. Changing the bars to arrows solves this, as below:

Now we have the absolute values of the points in each year for each team, and the direction of the change is displayed better than just with colour. Adding the gridlines allows the viewer to read off the individual values of points more easily. Lastly, we encode the other categorical variable of interest (change greater/less than 25%) as the thickness of the line.

Like so. After creating the above independently, I discovered visualization consultant Naomi Robbins had already written about this type of chart on Forbes, as an alternative to using multiple pie charts. Jon Peltier also has an excellent in-depth description of how to make these types of charts in Excel, as well as showing another alternative to slopegraphs, using a dot plot.
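For reference, here is a rough base R sketch of an arrow chart along these lines; the team names and point totals are made up for illustration (the real soccer data are in Andy Kirk’s post):

# Arrow chart sketch with made-up data (not the actual soccer numbers)
teams <- c("Team A", "Team B", "Team C", "Team D")
y1 <- c(55, 70, 62, 48)    # points in year 1 (made up)
y2 <- c(72, 60, 64, 35)    # points in year 2 (made up)

change    <- y2 - y1
arrow_col <- ifelse(change >= 0, "steelblue", "darkorange")   # colour by direction
arrow_lwd <- ifelse(abs(change) / y1 > 0.25, 3, 1)            # thicker if >25% change

plot(NULL, xlim = range(c(y1, y2)), ylim = c(0.5, length(teams) + 0.5),
     xlab = "Points", ylab = "", yaxt = "n", panel.first = grid(ny = NA))
axis(2, at = seq_along(teams), labels = teams, las = 1)
arrows(y1, seq_along(teams), y2, seq_along(teams),
       col = arrow_col, lwd = arrow_lwd, length = 0.1)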

Of course, providing the usual fixings for a graph such as a legend, title and proper axis labels would complete the above, which brings me to my last point. Though I think it’s a good alternative to slopegraphs, it can in no way compete in simplicity with Dr. Tufte’s example of a slopegraph, as that had zero non-data ink. And, of course, this type of graph will not work when there are more than two values in the independent variable to compare across.

Closing Remarks

It is easy to tell who the true thought leaders in data visualization are, because they often take it upon themselves to find special cases where people struggle or visualize data poorly, and then invent new visualization types to fill the need (Tufte with the slopegraph, and Few with the bullet graph to supplant god-awful gauges on dashboards).
As I discussed, there are certain cases when slopegraphs should not be used, and where I feel you would be better served by other types of graphs; in particular, cases where the slopegraph is a variation of the parallel coordinates chart rather than the line graph, or where quantity is not encoded in vertical position and comparing quantities for each value of the independent variable is important.

That being said, it is (as always) very important when making choices regarding data visualization to consider the pros and cons of different visualization types, the properties of the data you are trying to communicate, and, of course, the target audience.

Judiciously used, slopegraphs provide a highly efficient way in terms of data-ink ratio to visualize change in quantity across a categorical variable with a large number of values. Their appeal lies both in this and their elegant simplicity.

References & Resources

Slopegraphs discussion on Edward Tufte forum
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk
In Praise of Slopegraphs, by Andy Kirk
Edward Tufte’s “Slopegraphs” by Charlie Park
http://charliepark.org/slopegraphs/
Peltier Tech: How to Make Arrow Charts in Excel
http://peltiertech.com/WordPress/arrow-charts-in-excel/

Salary vs. Performance of MLB Teams by Ben Fry
http://fathom.info/salaryper/

salary vs performance scatterplot (Tableau Public)

Creepypasta – Votes vs. Rating (& learning ggplot2)

Excel:

R, base package:

R, ggplot:
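(The original chart was embedded in the post; as a rough illustration, a ggplot2 call along these lines could produce a votes vs. rating scatter. The filename and the column names votes and rating are assumptions, not the actual fields in the source data.)

# Illustrative sketch only; filename and column names are assumed
library(ggplot2)
cp <- read.csv("creepypasta.csv", stringsAsFactors = FALSE)
ggplot(cp, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.4) +
  geom_smooth()    # add a smoother to look at the trend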

Am I overfitting? Probably.

Code:
More fun stuff to come….

References

Source data at Creepypasta.com:

Code on gist:
http://gist.github.com/mylesmharrison/8886272

Creepypasta – in List of Internet phenomena (Wikipedia):
http://en.wikipedia.org/wiki/Creepypasta#Other_phenomena

Looking for Your Lens: 3 Tips on How to Be a Great Analyst

The other day as I was walking to work, all of a sudden, “pop!” one of the lenses in my glasses decided to free itself from the prison of its metal frame and take flight.

Well, damn.

The sidewalk was wet and partially covered in snow, and also with little islands of ice here and there. Finding a transparent piece of glass was not going to be easy.

So there I was, wandering about a small patch of sidewalk next to Toronto City Hall, squatting on my haunches, peering down at the sidewalk and awkwardly searching for my special missing piece of glass. I was not optimistic about my chances.

Most people walked on by and paid me no notice, but one kind soul, a woman with black, curly hair, stopped to help.

“Did you lose something?” she asked.

“Yeah,” I said, defeated, and held up my half empty black frames.

“Can I help?” she kindly offered. “I’m good at this sort of thing.”

“I guess,” I said, having already given up on restoring my headgear to completeness.

We scoured the sidewalk while urban passersby gave the occasional puzzled look, hurrying along.

“Ah!” she said, and, amazingly, picked up my lens. It had been hiding on a small patch of snow near a planter.

“WOW!” I was genuinely impressed. “You are good at this. Thanks so much.”

“No problem,” she said. “Have a great day!” and then promptly disappeared down the street, leaving me standing there on the sidewalk, bewildered.

That single small episode, a tiny vignette of a single life in a giant city amongst millions of others, was quite profound for me. This was because it got me thinking about two things: one, the kindness of strangers, and the other, of course, what I am always thinking about – the business of doing analysis.

Because as it turns out, those few statements that kind stranger made are equally important in being a great analyst.

“Did you lose something?”

A problem that a lot of analysts deal with on a regular basis is one of communication. The business, the stakeholder, the client, whoever it may be, comes to the analyst for help. They want to find out something about their business because they have data, and it’s the job of the analyst to turn that data (information) into insights (knowledge).
But here’s the problem – you can’t find something if you don’t know what you’re looking for.
Just as the kind passerby wouldn’t have been able to help me find my missing lens if she didn’t know what to look for, if you don’t know what kinds of insights you want to pull out of the data you have, then you won’t be able to find what you’re looking for either.
“We want to know how our people are connecting with our brand.”
It is the job of the analyst to turn these (often vague) desires of the business into specific questions that can be answered by analyzing data.
What people? (everyone, purchasers only, Boomers, Gen X, Gen Y, single mothers between the ages of 22 and 32 in urban centers?) What does connecting with the brand mean? (viewing an ad, purchase, visits to the website, app downloads, posts on social media, all of the above?)
So remember that a very large part of the job of the analyst is communication – not just about data – but working with others to determine exactly what it is they want to know. Once you know that, you can determine how to best do analysis to find the answers that are being sought after – hiding in plain sight in the data, like a piece of glass on a snowy patch of sidewalk.

“Can I help?”

Here’s something I think that a lot of analytical-type thinkers (this author included) often need to be reminded of: you can’t know everything. Even if you really, really want to. I’m sorry but you just can’t.
And that’s why once you know what it is you’re looking for, and what you need, you’ll need to ask for help (and that’s okay, that’s why we have meetings!). Sometimes the mere process of tracking down the data is a considerable task in itself. Sometimes no one really has a great overall understanding of how a really large, complicated system works – that kind of knowledge is often very distributed. These sorts of situations may require the help of many others in your company (or another business, vendor or client) who all have varying knowledge bases and skill sets.
It’s the job of the analyst to connect with the people they need to, get the data that they need, and do analysis to find the answers which are desired. Also, if you’re a good analyst, you’ll probably provide some kind of context around the impact (i.e. business implications) of your answer, and what parties would need to be involved to take the most beneficial actions as a result.
So even if you’re a data rock star don’t ever be afraid to ask for help; and conversely don’t hesitate to let others know who should help them too.

“I’m good at this sort of thing.”

Getting the analysis done requires not only not being afraid of asking for help, but also knowing the strengths and weaknesses of yourself, your team, and any others you may be working with.
It’s hard, but in my opinion, it takes a bigger person to be honest and admit when they are out of their depth than to say they can do something they clearly cannot.
When you’re out of your depth you have three options, which are really just three different ways of finishing the statement I’m not an expert. And they go something like this: I’m not an expert….
  1.  “… so I’m not going to do it because: I don’t know how / wouldn’t be able to figure it out / it’s not in my job description.”
  2.  “… but I can: learn quickly / give it a try / do my best / become one in 5 days.”
  3.  “… but I know <colleague> is and could: provide context to the problem / definitely help do it / teach us how.”
And the difference between answer #1 and the last two is what separates the office drones from the thought leaders, the reporting monkeys from the truly great analysts, and the unsuccessful from the successful in the world of data.
As I noted in the section above, you should never be afraid to ask for help, because there are going to be others out there that are better at things than you, and if you’re good you’ll recognize this fact and both of you will benefit. Hey, you might even learn something too, so next time you will be the expert.
Just remember that you can do analysis without crunching every number personally. You can work in data science without building the predictive model all by yourself. And you can work with data without writing every line of code alone. No analyst is an island.

“No problem! Have a great day!”

I hope that my little story and these points will help, or at least help you think, about the business of working with data and doing analysis, and what it means to be a great analyst.

This last point is perhaps equally, or even more, important than the others – always be kind to the people you work with; always make it look easy, no matter how hard it was; and always be happy to help. That, above all, is what will make you a truly great analyst.

The Mathematics of Wind Chill

Introduction

Holy crap, it was cold out.

If you haven’t been reading the news, don’t live in the American Midwest or Canada, or do and didn’t go outside the last couple of weeks (for which I don’t blame you), there was some mighty cold weather lately due to something called a polar vortex.

Meteorologists stated that a lot of people (those in younger generations) would never have experienced anything like this before – cold from the freezing temperatures and high winds the likes of which parts of the US and Canada haven’t seen for 40 years.

It was really cold. So cold that weird stuff happened, including the ground exploding here in Ontario due to frost quakes, or cryoseisms, as they are technically known (or as my sister suggested they should be called, “frosted quakes” – get it?)

With all this talk of a polar vortex, all I could think of was a particularly ridiculous TV-movie that came out lately, and that this is our Northern equivalent, which probably looked something like the artist’s depiction below:

Scientific depiction of polar vortex phenomena (not to scale)

But I digress. The real point is that all this cold weather got me thinking about windchill – what is it exactly? How is it determined? Let’s do some everyday analysis.

Background

Wind chill hasn’t always been calculated the same way, and there is some controversy over exactly how scientific its calculation is.
Wind chill depends upon only two variables – air temperature and wind speed – and the formula was derived not from physical models of the atmosphere but from trials with participants in simulated laboratory conditions.
Also, the old formula was replaced in 2001 by a new one, with Canada largely leading the effort, since there was some concern that the old formula gave values that were too low and that people would think they could safely withstand colder temperatures than they actually could.
The old formula had strange units, but I found a page at Carleton University which provides it in degrees Fahrenheit, so we can compare the old and new systems directly.

Analysis

Since the wind chill index is a function of two variables (a surface), we can calculate it using vectors in R and visually depict the results as an image (filled contour), as in the code below:
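(The original code was embedded as a gist; below is a rough sketch of the kind of calculation involved. The new formula is the published 2001 NWS wind chill equation; the old formula shown is one commonly cited Fahrenheit form of the pre-2001 index, which may differ slightly from the version used in the original post.)

# Wind chill as a function of temperature (F) and wind speed (mph)
temp <- seq(-45, 40, length.out = 200)   # air temperature (F)
wind <- seq(5, 60, length.out = 200)     # wind speed (mph)

# New (2001) NWS wind chill formula
wc_new <- outer(temp, wind, function(t, v)
  35.74 + 0.6215 * t - 35.75 * v^0.16 + 0.4275 * t * v^0.16)

# Old (pre-2001) index, one commonly cited F/mph form (may differ from the post's version)
wc_old <- outer(temp, wind, function(t, v)
  0.0817 * (3.71 * sqrt(v) + 5.81 - 0.25 * v) * (t - 91.4) + 91.4)

filled.contour(temp, wind, wc_old, xlab = "Temperature (F)",
               ylab = "Wind speed (mph)", main = "Old wind chill index")
filled.contour(temp, wind, wc_new, xlab = "Temperature (F)",
               ylab = "Wind speed (mph)", main = "New (2001) wind chill index")
filled.contour(temp, wind, abs(wc_new - wc_old), xlab = "Temperature (F)",
               ylab = "Wind speed (mph)", main = "Absolute difference")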

Which results in the following plots:

And the absolute difference between the two:

For low wind speeds (around 5 mph – wind chill is only defined when wind speed is greater than 5 mph) you can see that the new system is colder, but for wind speeds greater than 10 mph the opposite is true, especially so in the bitter, bitter cold (high winds and very cold temperatures). This is in line with the desire to correct the old system for giving values which were felt to be too low.

If you’re a really visual person, here is the last contour plot as a surface:
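(Continuing the sketch above, something like persp() can render that difference as a surface:)

# The difference between the two systems, rendered as a 3-D surface
persp(temp, wind, abs(wc_new - wc_old), theta = 40, phi = 25,
      col = "lightblue", shade = 0.5, xlab = "Temperature (F)",
      ylab = "Wind speed (mph)", zlab = "Absolute difference (F)")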

Which, despite some of the limitations of 3-D visualization, shows the non-linear nature of the two systems and the difference between them.

Conclusion

This was a pretty interesting exercise and shows again how mathematics permeates many of our everyday notions – even if we’re not necessarily aware of it being the case. 
For me the takeaway here is that wind chill is not an exact metric based on the physical laws of the atmosphere, but instead a more subjective one based upon people’s reaction to cold and wind (an inanimate object cannot “feel” wind chill).
Despite the difficulty of the problem of trying to exactly quantify how much colder the blustery arctic winds make it feel outside, saying “-32F with the wind chill” will still always be better than saying “dude, it’s really really cold outside.”
Either way, be sure to wear a hat.

References & Resources

Windchill (at Wikipedia)
National Weather Service – Windchill Calculator
National Weather Service – Windchill Terms & Definitions 
Environment Canada – Canada’s Windchill Index

Snapchat Database Leak – Visualized

Introduction

In case you don’t read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users’ details leaked due to a security flaw which was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users’ details with the intent to “raise public awareness on how reckless many internet companies are with user information.”

Which is awesome – kudos to those guys, once for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and twice for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check if your account is in the DB here.

However, if you’re a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not OMG am I safe?! but let’s do some analysis!

Analysis

Area Code
As I have noted in a previous musing, it’s difficult to do any sort of in-depth analysis if you have limited dimensionality of your data – here only 3 fields – the phone number with last two digits obscured, the username, and the area.
Fortunately, because some of the data here are geographic, we can do some cool visualization with mapping. First we look at the high-level view, by state and then by area code within those states. California had the most accounts compromised overall, with just shy of 1.4 M details leaked. New York State was next at just over a million.
Because the accounts weren’t spread evenly across the states, below is a more detailed view by area code. You can see that it’s mainly Southern California and the Bay Area where the accounts are concentrated.
Usernames
Well, that covers the geographic component, which leaves only the usernames and phone numbers. I’m not going to look into the phone numbers (I mean, what really can you do, other than look at the distribution of numbers – which I have a strong hypothesis about already).
Looking at the number of accounts which include numerals versus those that do not, the split is fairly even – 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:
The remainder of the tail is not present, which I assume would fill in if there were more data. I had the axis stretch to 30 for perspective as there was one username in the file of length 29.
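(As a rough illustration, the username checks above amount to something like the following in R; the data frame leak and its username column are hypothetical stand-ins for the leaked file.)

# Hypothetical data frame 'leak' with a 'username' column
has_digit <- grepl("[0-9]", leak$username)
table(has_digit)                         # usernames with vs. without numerals
any(grepl("^[0-9]+$", leak$username))    # TRUE would mean purely numeric usernames exist

len <- nchar(leak$username)
hist(len, breaks = 0:30, xlim = c(0, 30), xlab = "Username length (characters)",
     main = "Distribution of Snapchat username lengths")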

Conclusion

If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. Natural phenomena follow, or nearly follow, theoretical distributions amazingly closely
I’m not in the leak, so I’m not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when using smartphone apps, and ensuring there is some measure of care and accountability on the creators’ end.

Update:
Snapchat has released a new statement promising an update to the app which makes the compromised feature optional, increased security around the API, and more open collaboration with security experts.

What’s in My Inbox? Data Analysis of Outlook

Introduction

Email is the bane of our modern existence.

Who among us hasn’t had a long, convoluted, back-and-forth email thread going on for days (if not weeks) in order to settle an issue which could have been resolved with a simple 5 minute conversation?

With some colleagues of mine, email has become so overwhelming (or their attempts to organize it so futile) that it brings to my mind Orwell’s workers at the Ministry of Truth in 1984 and their pneumatic tubes and memory holes – if the message you want is not in the top 1% (or 0.01%) of your inbox and you don’t know how to use search effectively, then for all intents and purposes it might as well be gone (see also: Snapchat).

Much has been written on the subject of why exactly we send and receive so much of it, how to best organize it, and whether or not it is, in fact, even an effective method of communication.

At one time even Gmail and the concept of labels were revolutionary – and they have done some good in organizing the ever-increasing deluge that is email for the majority of people. Other attempts have sprung up to tame the beast and make sense of such a flood of communication – most notably, in my mind, Inbox Zero, the simply-titled smartphone app Mailbox, and MIT’s recent data visualization project Immersion.

But email, with all its systemic flaws, misuse, and annoyances, is definitely here for good, no question. What a world we live in.

But I digress.

Background

I had originally hoped to export everything from Gmail and do a very thorough analysis of all my personal email. Though this is now a lot easier than it used to be, I got frustrated at the time trying to write a Python script and moved on to other projects.
But then I thought, hey, why not do the same thing for my work email? I recently discovered that it’s quite easy to export email from Outlook (as I detailed last time) so that brings us to this post.
I was somewhat disappointed that Outlook can only export a folder at a time (which does not include special folders such as search folders or ‘All Mail’) – I organize my mail into folders and wanted an export of all of it.
That being said, the bulk probably does remain in my inbox (4,217 items in my inbox resulted in a CSV that was ~15 MB) and we can still get a rough look using what’s available. The data cover the period from February 27th, 2013 to Nov 16th, 2013.

Email by Contact
First let’s look at the top 15 contacts by total number of emails. Here are some pretty simple graphs summarizing that data, first by category of contact:

In the top 15, the split between co-workers/colleagues and management is pretty even. I received about 5 times as much email from coworkers and managers as from stakeholders (but then again a lot of the latter ended up sorted into folders, so this count is probably higher). Still, I don’t directly interact with stakeholders as much as some others do, and tend to work with teams or my immediate manager. Also, calls are usually better.

Here you can see that I interacted with my immediate colleague and manager the most, then other management, and the remainder further down the line are a mix which includes email to myself and from office operations. Also of note – I don’t actually receive that much email (I’m more of an “in the weeds” type of guy) or, as I said, much has gone into the appropriate folders.

Time-Series Analysis
The above graphs show a very simplistic and high-level view of what proportion of email I was receiving from whom (with a suitable level of anonymity, I hope). More interesting is a quick and simple analysis of patterns over time in the volume of email I received – and I’m pretty sure you already have an idea of what some of these patterns might be.

When doing data analysis, I always feel it is important to first visualize as much of the data as practically possible – in order to get “a feel” for the data and avoid making erroneous conclusions without having this overall familiarity (as I noted in an earlier post). If a picture is worth a thousand words, then a good data visualization is worth a thousand keystrokes and mouse clicks.

Below is a simple scatter plot all the emails received by day, with the time of day on the y-axis:


This scatterplot is perhaps not immediately particularly illuminating; however, it already shows us a few things worth noting:

  • the majority of emails appear in a band approximately between 8 AM and 5 PM
  • there is increased density of email in the period between the end of July and early October, after which there is a sparse interval until mid-month / early November
  • there appears to be some kind of periodic nature to the volume of daily emails, giving a “strip-like” appearance (three guesses what that periodic nature is…)

We can look into this further by considering the daily volume of emails, as below. The black line is a 7 day moving average:
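(As a rough illustration, the daily counts and moving average might be computed along these lines; the filename inbox.csv and the Received column are assumptions about the export, not the exact field names Outlook/Access produce.)

# Assumed filename and column name; adjust to match your export
emails <- read.csv("inbox.csv", stringsAsFactors = FALSE)
emails$date <- as.Date(emails$Received, format = "%m/%d/%Y %H:%M:%S")

days  <- seq(min(emails$date), max(emails$date), by = "day")
daily <- as.numeric(table(emails$date)[as.character(days)])
daily[is.na(daily)] <- 0                                 # days with no email

ma7 <- stats::filter(daily, rep(1 / 7, 7), sides = 2)    # 7-day moving average

plot(days, daily, type = "h", xlab = "Date", ylab = "Emails received per day")
lines(days, ma7, lwd = 2)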

We can see the patterns noted above – the increase in daily volume after 7/27 and the marked decrease mid-October. Though I wracked my brain and looked thoroughly, I couldn’t find a specific reason why there was an increase over the summer – this was just a busy time for projects (and probably not for myself sorting email). The marked decrease in October corresponds to a period of bench time, which you can see was rather short-lived.

As I noted previously in analyzing communications data, the distribution of this type of information is exponential in nature and usually follows a log-normal distribution. As such, a moving average is not the greatest measure of central tendency – but a decent approximation for our purposes. Still, I find the graph a little more digestible when depicted with a logarithmic y-axis, as below:

Lastly we consider the periodic nature of the emails which is noted in the initial scatterplot. We can look for patterns by making a standard heatmap with the weekday as the column and hour of day as the row, as below:
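(Continuing the sketch above, the weekday-by-hour counts behind a heatmap like this can be tabulated as follows, again assuming a Received timestamp column and English weekday names:)

# Tabulate emails by hour of day and weekday
when    <- as.POSIXct(emails$Received, format = "%m/%d/%Y %H:%M:%S")
weekday <- factor(weekdays(when), levels = c("Monday", "Tuesday", "Wednesday",
                                             "Thursday", "Friday", "Saturday", "Sunday"))
hour    <- factor(format(when, "%H"), levels = sprintf("%02d", 0:23))

heat <- unclass(table(hour, weekday))    # 24 x 7 matrix of counts
image(0:23, 1:7, heat, xlab = "Hour of day", ylab = "", axes = FALSE)
axis(1)
axis(2, at = 1:7, labels = levels(weekday), las = 1)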

You can clearly see that the majority of work email occurs between the hours of 9 to 5 (shocking!). However, some other interesting points of note are the bulk of email in the mornings at the beginning of the week, the fall-off after 5 PM at the end of the week (Thursday & Friday), and the messages received Saturday morning. Again, I don’t really receive that much email, or have spirited a lot of it away into folders as I noted at the beginning of the article (this analysis does not include things like automated emails and reports, etc.)

Email Size & Attachments
Looking at file attachments, I believe the data are more skewed than the rest, as the clean-up of large emails is a semi-regular task for the office worker (not many have the luxury of an unlimited email inbox capacity – even executives), so I would expect values on the high end to have largely been removed. Nevertheless, it still provides a rough approximation of how email sizes are distributed and what proportion have attachments included.

First we look at the overall proportion of email left in my inbox which has attachments – of the 4,217 emails, 2,914 did not have an attachment (69.1%) and 1,303 did (30.9%).

Examining the size of emails (which includes the attachments) in a histogram, we see a familiar looking distribution, which here I have further expanded by making it into a Pareto chart. (note that the scale on the left y-axis is logarithmic):
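(A simplified sketch of this kind of histogram-plus-cumulative view, without the log axis, might look like the following; the Size column, in bytes, is an assumption about the export.)

# Histogram of email sizes with a cumulative proportion overlay (Pareto-style)
size_kb <- emails$Size / 1024            # column name and units assumed
h <- hist(size_kb, breaks = 50, xlab = "Email size (KB)",
          main = "Distribution of email sizes")
cum <- cumsum(h$counts) / sum(h$counts)  # cumulative proportion of messages
lines(h$mids, cum * max(h$counts), lwd = 2)
axis(4, at = seq(0, max(h$counts), length.out = 5),
     labels = paste0(seq(0, 100, 25), "%"))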

Here we can see that of what was left in my inbox, all messages were about 8 MB in size or less, with the vast majority being 250K or less. In fact 99% of the email was less than 1750KB, and 99.9% less than 6MB.

Conclusion

This was a very quick analysis of what was in my inbox, however we saw some interesting points of note, some of which confirm what one would expect – in particular:
  • vast majority of email is received between the hours of 9-5 Monday to Friday
  • majority of email I received was between the two managers & colleagues I work closest with
  • approximately 3 out of 10 emails I received had attachments
  • the distribution of email sizes is logarithmic in nature
If I wanted to take this analysis further, we could also look at the trending by contact and also do some content analysis (the latter not being done here for obvious reasons, of course).
This was an interesting exercise because it made me mindful again of what everyday analytics is all about – analyzing rich data sets we are producing all the time, but of which we are not always aware.

References and Resources

Inbox Zero
http://inboxzero.com/

Mailbox
http://www.mailboxapp.com/

Immersion
https://immersion.media.mit.edu/

Data Mining Email to Discover Organizational Networks and Emergent Communities in Work Flows

How to Export Your Outlook Inbox to CSV for Data Analysis

So one of my colleagues at work showed me this cool script he wrote in Visual Basic to pull all the data from Outlook for analysis.

Cool, I thought – I’d like to do that, but don’t want to muck about in VB.

Well, I was surprised to discover that Outlook has the ability to export email to CSV built in! Follow the simple steps below (here demonstrated in Outlook 2010) and you can analyze your emails yourself and do some cool quantified-self type analysis.

How to Export Outlook Email to CSV (from Outlook)

1. Open Outlook and click File then Options to bring up the options dialog:
2. Select Advanced, then click the Export button:
3. Click Export to a file and then the Next button:
4. Select Comma Separated Values (Windows) and click Next.
5. Unless you want to export a different folder, select Inbox and click Next.
6. Browse to a folder and/or type a filename for your export.
7. Choose Map Custom Fields… if you want to customize which fields to export. Otherwise click the Finish button.
8. Sit tight while Outlook does its thing.
You should now have a CSV file of your inbox data!

How to Export Outlook Email to CSV (from Access)

This is all very well and good, but unfortunately exporting to CSV from Outlook does not provide the option to include date and time as fields, which makes it useless if you’d like to do time series (or other temporal) analysis.
To get the date and time data you can pull data from Outlook into Access and then export it as noted in this metafilter thread.
Import from Outlook into Access
1. Fire up Access and create a new database. Select External Data, More.. and then Outlook Folder.
2. Select Import the source data into a new table in the current database and click OK


3. Select the email account and folder you’d like to import and click Next 
4. Change the field settings if you’d like. Otherwise accept the defaults by clicking Next


5. Let Access add the primary key or not (you don’t need it). Click Next 


6. Click Finish and wait. When the process is done you should have a new table called ‘Inbox’.



Export Data from Access to a CSV
1. Make sure the Inbox table is selected and click External Data then Text File.
2. Pick or type a filename and click OK


3. Select Delimited and click Next
4. Select Comma as the delimiter and tick the box which says Include Field Names on First Row. Click Next.
5. Pick or type a filename and click Finish


You should now have your Inbox data exported as CSV (including time / date data!) and ready for analysis. Of course you can repeat this process and append to the Access database folder by folder to analyze all the mail you have in Outlook.
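If you want to pull the export straight into R, a minimal sketch looks something like this; the filename and the Received field name depend on your export and the field mapping you chose:

# Read the exported CSV and parse the received date/time (format may vary by locale)
emails <- read.csv("inbox_export.csv", stringsAsFactors = FALSE)
str(emails)    # check which fields made it into the export
emails$Received <- as.POSIXct(emails$Received, format = "%m/%d/%Y %H:%M:%S")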