How to Think Like an Analyst

So I was talking to my aunt a couple of weekends ago. She explained that though she was happy for me and the work that I do, she didn’t understand any of it. I tried my best to explain in general terms what web analytics, and analytics as a whole, is all about.

Our conversation continued, and I further offered that though she may not understand exactly what it is I do, she could understand the spirit in which it is done – the way to think about analysis.

Not everyone is cut out to be an analyst. There are those who have always been good with numbers, and there are those who describe themselves as ‘the one who was always bad at math in high school’.

And that’s fine. Like I said, not everyone is cut out to be an analyst, not everyone wants to be, and not everyone can be. However, it is a firm belief of mine that everyone, everyone can think like an analyst.

And I’ll show you how.

The Questions You Need to Ask

True, you may not have the skill set necessary to be an analyst – you may, in fact, be one of those who was bad at math in high school, and when people mention spreadsheets you think of bedding, not computer software.

But that doesn’t mean you can’t think like an analyst.

Part of being a good analyst is not just being able to do analysis, but being able to ask the right questions which lead to it.

All good analysis starts with a question. So all you have to do is ask the right questions.

And, in this humble author’s opinion, those two questions are how and what.

Question 1 – How (many)?

This is the simple question, and is one of measurement and descriptive statistics.

Thinking quantitatively is a key part of thinking like an analyst.

If you learn to think in this way you will find that ordinary, everyday situations can become part of ordinary, everyday analytics.

For instance, any time you are at some sort of gathering of people or social function you can think like an analyst by asking yourself the question – how many?

How many men are there in the room? How many women? How many are there proportionally?
How many people at the party are wearing glasses? How many are not?
How many people at the networking mixer are eating and drinking? How many are just eating? Just drinking?
How many people at the dinner party decided to have the chicken? How many did not? How many finished all their food and how many left food behind? How many plates did each person have?

And so on.
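If you want to make the habit concrete, here is what those questions look like in R. This is a minimal sketch with a completely made-up guest list, just for illustration:

# Made-up guest list for a party -- purely illustrative data
guests <- data.frame(
  gender  = c("M", "F", "F", "M", "F", "M", "M", "F", "F", "M"),
  glasses = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

table(guests$gender)                  # how many men? how many women?
prop.table(table(guests$gender))      # how many proportionally?
table(guests$gender, guests$glasses)  # how many wearing glasses, by gender?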

But as I said, the question of how many is simply one of describing the state of affairs. To really think like an analyst you also need to ask the second question.

Question 2 – What (is the relationship between….)?

The second question helps you to think like an analyst and go beyond simply describing things quantitatively and start thinking about possible relationships.

Here, to illustrate how thinking like an analyst is subject-matter independent, we can pick a topic, any topic. So let’s go with….. peanut butter. I like peanut butter.

The second question is one I lately find myself asking all the time, about almost everything (whether I like it or not). And that very important question you can ask yourself is – what is the relationship between……?

Pick properties of, or related to, your subject of analysis – some of which you may compare across or between, and others which may be measured. In technical terms these are known as dimensions and measures, respectively.

For example, using our randomly chosen topic of peanut butter, first we brainstorm all the things we could possibly think of related to peanut butter.

Type (chunky or smooth), brand, container (size, type, colour), price, sales, consumption, nutritional content, location, time…

And so on. Let’s stop there.

Then we ask the question: what is the relationship between a and b? Here a is one of the things we brainstormed as a category (a dimension), and b is one of the things we brainstormed as a measurement (a measure).

What is the relationship between the type of peanut butter and its nutritional content? (That is, how is chunky peanut butter different from smooth peanut butter in terms of calories and fat?)

What is the relationship between the brand of peanut butter and its sales? (That is, how do the total sales of different peanut butter brands compare? You could also add time and location dimensions – how do sales between brands compare this year? Last year? Worldwide? In Canada vs. the US? Per store in Ontario?)

What is the relationship between the container size and location? (That is, do different countries have different-sized containers for peanut butter? What is the average container size per country? In each region? Or look at location in store – are all the containers in the same aisle, or are the different sizes in different places (e.g. the bulk food aisle)? How is the distribution of container sizes broken down across different stores across the country?)

And so forth. As you can see, there are so many questions you can ask by combining properties of a topic of interest in this way. And these are only questions with two properties – many more questions of greater complexity could be generated by combining multiple properties (e.g. What is the relationship between peanut butter sales and consumption and the brand and type?)
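To make the pattern concrete, here is how such a question translates into a grouped aggregation in R – a minimal sketch using an invented peanut butter data set (the column names and numbers are hypothetical):

# Invented data: brand and type are dimensions, sales is a measure
pb <- data.frame(
  brand = c("A", "A", "B", "B", "C", "C"),
  type  = c("chunky", "smooth", "chunky", "smooth", "chunky", "smooth"),
  sales = c(120, 150, 90, 200, 60, 80)
)

# What is the relationship between brand and sales?
aggregate(sales ~ brand, data = pb, FUN = sum)

# Adding a second dimension: brand and type vs. sales
aggregate(sales ~ brand + type, data = pb, FUN = sum)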

The Hard Question

There is one final question which I did not mention, which, if you really want to think like an analyst, is the most important question of all. In fact, I would go further and say that even if you are not thinking like an analyst, this is the most important question of all. And that ultimate question is why.

The question of why is the most important question, the hardest question, the question which drives all of the analysis that analysts the world over do.

Why.

Why has our new marketing initiative not resulted in increased sales in the third quarter? Why is the sky blue? Why does Amazon send me so many emails related to Home and Garden products? Why can’t I sleep at night? Why are there three million kinds of laundry detergent but only two kinds of baking powder? Why? Why? Why.

This is the question which drives all investigation, which drives all measurement, which drives all analysis.

And this is the question that, whether you want to think like an analyst or not, you should always be asking yourself.

Top 10 Super Bowl XLVII Commercials in Social TV (Respin)

So the Super Bowl is kind of a big deal.

Not just because there’s a lot of football. And not just because it’s a great excuse to get together with friends and drink a whole lot of beer and eat unhealthy foods. And not because it’s a good excuse to shout at your new 72″ flatscreen with home theater surround that you bought at Best Buy just for your Super Bowl party and are going to try to return the next day even though you’re pretty sure now that they don’t let you do that any more.

The Super Bowl is a big deal for marketers. For creatives. For ‘social media gurus’. Because there are a lot of eyeballs watching those commercials. In fact, I’m pretty sure there are people going to Super Bowl parties who don’t even like football and are just there for the commercials – that is, if they’ve not decided to catch all the best ones after the fact on YouTube.

And also, you know, because if you’re putting down $6 million for a minute of commercial airtime, you want to make sure that those dollars are well spent.

So Bluefin Labs has been generating a lot of buzz lately, as they were recently acquired by Twitter. TV is big, social media is big, so Social TV analytics must be even bigger, right? Right?

Anyhow, Bluefin showed up recently in my Twitter feed for a different reason: their report on the Top 10 Super Bowl XLVII commercials in Social TV that they did for AdAge.

The report’s pretty and all, but a little too pretty for my liking, so I thought I’d respin some of it.

Breakdown by Gender:

Super Bowl XLVII Commercial Social Mentions by Gender

You can see that the male / female split is fairly even overall, with the exception of the NFL Network’s ad and, to a lesser extent, the ad for Fast & Furious 6, which were more heavily mentioned proportionally by males. The Budweiser, Calvin Klein and Taco Bell spots had greater percentages of women commenting.

Sentiment

The Taco Bell, Dodge and Budweiser ads had the most mentions with positive sentiment. The NFL ad had a very large proportion of neutral comments (74%), more so than any other ad. The Go Daddy ad had the most negative mentions, for good reason – it’s gross and just kind of weird. It wouldn’t be the Super Bowl if Go Daddy didn’t air a commercial of questionable taste though, right?

Super Bowl XLVII Commercial Sentiment Breakdown by Gender
Super Bowl XLVII Commercial Sentiment Breakdown by Gender (Proportional)

Lastly, I am going to go against the grain here and say that the next big thing in football is most definitely going to be Leon Sandcastle.

Finer Points Regarding Data Visualization Choices

The human mind is limited.

We can only process so much information at one time. Numerals are text which communicate quantity. However, unlike other text, it’s a lot harder to read a whole bunch of numbers and get a high-level understanding of what is being communicated. There are sentences of numbers and quantities (these are called equations, though not everyone is as literate in them). However, simply looking at a pile of data and understanding the ‘big picture’ is not something most people can do. This is especially true as the amount of information becomes larger than a table with a few categories and values.

If you’re a market research, business, data, financial, or (insert other prefix here) analyst, part of your job is taking a lot of information and making sense of that information, so that other people don’t have to. Let’s face it – your Senior Manager or The VP doesn’t have time to wade through all the data – that’s why they hired you.

Ever since Descartes’ epiphany (and even before that) people have been realizing that there are other, more effective ways to communicate information than having to look at all the details. You can communicate the shape of the data without knowing exactly how many Twitter followers were gained each day. You can see what the data look like without having to know the exact dollar value for sales each and every day. You can feel what the data are like, and get an intuitive understanding of what’s going on, without having to look at all the raw information.

Enter data visualization.

Like any practice, data visualization (the depiction of quantitative relationships visually) can be done poorly or done well. I’m sure you’ve seen examples of the former, whether it be in a presentation or other report, or perhaps floating around the Internet. And the latter, like so many good things, is not always so plentiful, nor so appreciated. Here I present some finer points between data visualization choices, in the hope that you will always find yourself depicting data well.

Pie (and Doughnut) Chart

Ah, the pie chart. The go-to the world over when most people seek to communicate data, and one both loved and loathed by many.

The pie chart should be used to compare quantities of different categories where the proportion of the whole is important, not the absolute values (though these can be added with labelling as well). It’s important that the number of categories being compared remains small – depending on the values, the readability of the chart decreases greatly as the number of categories increases. You can see this below. The second example is a case where an alternate representation should be considered, as the chart’s readability and usefulness are lower given the larger number of proportions being compared:

Doughnut charts are the same as pie charts but with a hole in the center. They may be used in the place of multiple pie charts by nesting the rings:

Hmm.

Though again, as the number of quantities being compared increases, the readability and visual utility generally decrease, and you are better served by a bar chart in these cases. There is also the issue that the area of each annulus will be different for the same angle, depending upon which ring it is in.

With circular charts it is best to avoid legends, as this causes the eye to flit back and forth between the different segments and the legend. However, when abiding by this practice for doughnut charts, labelling becomes a problem, as you can see above.

Tufte contends that a bar chart will always serve better than a pie chart (though some others disagree). The issue is that there is some debate about the way the human mind processes comparisons with angular representations versus those depicted linearly or by area. I tend to agree, and find the chart below much better data visualization than the one we saw previously:

Isn’t that much better?

From a practical perspective – a pie chart is useful because of its simplicity and familiarity, and is a way to communicate proportion of quantities when the number of categories being compared is small.

Bonus question:
Q. When is it a good idea to use a 3-D pie chart?
A. Never. Only as an example of bad data visualization!
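If you want to see the difference for yourself, here is a minimal base R sketch (with invented shares) that draws the same data both ways:

# The same invented proportions as a pie chart and as a bar chart
shares <- c(A = 23, B = 21, C = 20, D = 19, E = 17)

par(mfrow = c(1, 2))
pie(shares, main = "Pie chart")
barplot(sort(shares, decreasing = TRUE), main = "Bar chart")

Notice how much easier it is to read the ranking off the bars than off the wedges.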

Bar Charts

Bar charts are used to depict the values of a quantity or quantities across categories. For example, to depict sales by department, or per product type.

This type of chart can be (and is) used to depict values over time; however, said chunks of time should be discrete (e.g. quarters, years) and small in number. When a comparison is to be done over time and the number of periods / data points is larger, it is better visualized using a line chart.

As the number of categories becomes large, an alternative to the usual arrangement (‘column’ chart) is to arrange the categories vertically and the bars horizontally. Note this is best done only for categorical / nominal data, as data with an implied order (ordinal, interval, or ratio type data) should be displayed left-to-right in increasing order, to be consistent with reading left to right.

Bar charts may also be stacked in order to depict both the values between categories as well as the total across them. If the absolute values are not important, then stacked bar charts may be used in this way in the place of several pie charts, with all bars having a maximum height of 100%:
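For the curious, here is a minimal base R sketch of such a 100% stacked bar chart, using prop.table on an invented matrix of counts:

# Invented counts: rows are responses, columns are groups
counts <- matrix(c(30, 50, 20,
                   45, 35, 20,
                   25, 40, 35),
                 nrow = 3,
                 dimnames = list(c("Yes", "No", "Undecided"),
                                 c("Group 1", "Group 2", "Group 3")))

# Scale each column to proportions so that every bar totals 100%
barplot(prop.table(counts, margin = 2) * 100,
        legend.text = TRUE, ylab = "Percent of responses")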

Stephen Few contends that this still makes it difficult to compare proportions, similar to the problem with pie charts, and has other suggestions [PDF], though I think it is fine on some occasions, depending on the nature of the data being depicted.

When creating bar charts it is important to always start the y-axis from zero so as not to produce a misleading graph.

A column chart may also be combined with a line graph of the cumulative total across categories in a type of combo chart known as a Pareto chart.

Scatterplot (and Bubble Graphs)

Scatterplots are used to depict a relationship between two quantitative variables. The value pairs for the variables are plotted against each other, as below:

When used to depict relationships occurring over time, we instead use a special type of scatterplot known as a line graph (next section).

A bubble chart is a type of scatterplot used to compare relationships between three variables, where the points are sized by area according to the value of a third variable. Care should be taken to ensure that the points are sized correctly in this type of chart, so as not to incorrectly depict the relative proportion of quantities.
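In practice, sizing by area means scaling the radius by the square root of the value. A minimal base R sketch, with invented numbers:

# Invented (x, y, value) triples -- the sizing is the point, not the data
x <- c(1, 2, 3, 4)
y <- c(2, 4, 3, 5)
v <- c(10, 40, 90, 160)

# Radius proportional to sqrt(v), so bubble *area* is proportional to v
symbols(x, y, circles = sqrt(v), inches = 0.35, bg = "steelblue")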

Relationships between four variables may also be visualized by colouring each point according to the value of a fourth variable, though this may be a lot of information to depict all at once, depending upon the nature of the data. When animated to include a fifth variable (usually time), it is known as a motion chart, perhaps most famously demonstrated in Hans Rosling’s TED Talk, which has become somewhat of a legend.

Line Graphs

Line graphs are usually used to depict quantities changing over time. They may also be used to depict relationships between two (numeric) quantities when there is continuity in both.

For example, it makes sense to compare sales over time with a line graph, as time is a numerical quantity that varies continuously:

However, it would not make sense to use a line graph to compare sales across departments, as that is categorical / nominal data. Note that there is one exception to this rule, and that is the aforementioned Pareto chart.

Omitting the points on the line graph and using a smooth curve instead of line segments creates an impression of more data being plotted, and hence a greater continuity. Compare the plot above with the one below:

So, practically speaking, save the smooth line graphs for when you have a lot of data and the points would just be visual clutter; otherwise it’s best to overplot the points, to be clear about what quantities are being communicated.
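Here is a quick R sketch of the difference, using a year of invented monthly figures:

# A year of invented monthly sales figures
sales <- c(12, 15, 14, 18, 21, 19, 23, 25, 24, 28, 27, 30)

par(mfrow = c(1, 2))
plot(sales, type = "o", main = "Points overplotted")  # clear about the data
plot(spline(seq_along(sales), sales, n = 200),
     type = "l", main = "Smoothed")                   # implies more continuity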

Also note that unlike a bar chart, it is acceptable to have a non-zero starting point for the y-axis of a line graph as the change in values is being depicted, not their absolute values.

Now Go Be Great!

This is just a sample of some of the finer differences between the choices for visualizing data. There are of course many more ways to depict data, and, I would argue, the possibilities for data visualization are limited only by the imagination of the visualizer. However, when sticking with the tried, true and familiar, keep these points in mind to be great at what you do and get your point across quantitatively and visually.

Go, visualize the data, and be amazing!

What The Smeg? Some Text Analysis of the Red Dwarf Scripts

Introduction

Just as Pocket fundamentally changed my reading behaviour, I am finding that now having Netflix (and even before that, other downloadable or streaming digital content) is really changing my behaviour as far as television is concerned.

Where watching TV used to be an affair of browsing through 500 channels and complaining there was nothing on, now with the advent of on-demand digital services there is choice. Instead of flipping through hundreds of channels (is that a linear search or a random walk?), most of which have nothing whatsoever that interests you, now you can search for exactly the show you are looking for and watch it when you want. Without commercials.

Wait, what? That’s amazing! No wonder people are ‘cutting the cord’ and media corporations are concerned about the future of their business model.

True, you can still browse. People complain that the selection on Netflix is bad for Canada, but for 8 dollars a month, really it’s pretty good what you’re getting. And given the…. eclectic nature of the selection I sometimes find myself watching something I would never think to look for directly, or give a second chance if I just caught 5 minutes of the middle of it on cable.

Such is the case with Red Dwarf. Red Dwarf is one of those shows that gained a cult following, and, despite its many flaws, for me has a certain charm and some great moments. This despite my not being able to understand all of the jokes (or dialogue!) as it is a show from the BBC.

The point is that before Netflix, I probably wouldn’t have come across something like this, and I definitely wouldn’t have watched all of it, if that option weren’t so easily laid out.

So I watched a lot of this show and got to thinking, why not take this as an opportunity to do some more everyday analytics?

Background

If you’re not familiar with the show or a fan, I’ll briefly summarize here so you’re not totally lost.

The series centers around Dave Lister, an underachieving chicken-soup vending machine repairman aboard the intergalactic mining ship Red Dwarf. Lister inadvertently becomes the last human being alive after being put into stasis for 3 million years by the ship’s computer, Holly, following a radiation leak aboard the ship. The remainder of the ship’s crew are Arnold J. Rimmer, a hologram of Lister’s now-deceased bunkmate and superior officer; The Cat, a humanoid evolved from Lister’s pet cat; Kryten, a neurotic sanitation droid; and later Kristine Kochanski, a love interest who gets brought back to life from another dimension.

Conveniently, the Red Dwarf scripts are available online, transcribed by dedicated fans of the program. This just goes to show that the series truly does have a cult following, when there are fans who love the show so much as to sit and transcribe episodes just for its own sake! But then again, I am doing data analysis and visualization on that same show….

Analysis

Of the ten seasons and 61 episodes of the series, the data set covers Seasons 1-8 and comprises 51 of those 52 episodes (S08E03 – Back In The Red (Part III) is missing).

I did some text analysis of the data with the tm package for R.
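The full code and data are on GitHub (see Resources below), but the typical tm workflow looks roughly like this – a minimal sketch, assuming the episode scripts have been read into a character vector called scripts:

library(tm)

# scripts: a character vector with one episode script per element (assumed)
corpus <- Corpus(VectorSource(scripts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Term-document matrix: one row per term, one column per episode
tdm <- TermDocumentMatrix(corpus)

# Mentions of a character per episode, e.g. Lister
as.matrix(tdm)["lister", ]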

First we can see the prevalence of different characters within the show over the course of the series. I’ve omitted the x-axis labels as they made the chart appear cluttered; you can see them by interacting with the chart.

Lister and Rimmer, the two main characters, have the highest number of mentions overall. Kryten appears in the eponymous S02E01 and is then later introduced as one of the core characters at the beginning of Season 3. The Cat remains fairly constant throughout the whole series, as he appears or speaks mainly for comedic value. In S01E06, Rimmer makes a duplicate of himself, which explains the high number of lines by his character and mentions of his name in the script. You can see he disappears after Episode 2 of Season 7, in which his character is written out, until re-appearing in Season 8 (he appears in S07E05, as there is an episode dedicated to the rest of the crew reminiscing about him).

Holly, the ship’s computer, appears consistently at the beginning of the program until disappearing with the Red Dwarf towards the beginning of Season 6. He is later reintroduced when it returns at the beginning of Season 8.

Lister wants to bring back Kochanski as a hologram in S01E03, and she also appears in S02E04, as it is a time travel episode. She is introduced as one of the core cast members in Episode 3 of Season 7 and continues to be so until the end of the series.

Ace is Rimmer’s macho alter-ego from another dimension. He appears a couple of times in the series before S07E02, in which he is used as a plot device to write Rimmer out of the show for that season.

Appearance and mentions of other crew members of the Dwarf correspond to the beginning of the series and the end (Season 8) when they are reintroduced. The Captain, Hollister, appears much more frequently towards the end of the show.

Robots appear mainly as one-offs who are the focus of a single episode. The exceptions are the Scutters (Red Dwarf’s utility droids), whose appearances coincide with the parts of the show where the Dwarf exists, and the simulants, which are mentioned occasionally as villains / plot devices. The toaster and the snarky dispensing machine also appear towards the beginning and end, with the former also having speaking parts in S04E04.

As mentioned before, the Dwarf gets destroyed towards the end of Season 5 and is not reintroduced until the beginning of Season 8. During this time, the crew live in one of the ship’s shuttlecraft, Starbug. You can also see that Starbug is mentioned more frequently in episodes when the crew go on excursions (e.g. Season 3, Episodes 1 and 2).

One of the recurring themes of the show is how much Lister really enjoys Indian food, particularly chicken vindaloo. That and how he’d much rather just drink beer at the pub than do anything. S04E02 (spike 1) features a monster, a Chicken Vindaloo man (don’t ask), and the whole premise of S07E01 (spike 2) is Lister wanting to go back in time to get poppadoms.

Thought this would be fun. Space is a consistent theme of the show, obviously. S07E01 is a time travel episode, and the episodes with Pete (Season 8, 6-7) at the end feature a time-altering device.

Conclusions

I recall talking to an associate of mine who recounted his experiences in a data analysis and programming workshop where the data set used was the Enron emails. As he quite rightly pointed out, he knew nothing about the Enron emails, so doing the analysis was difficult – he wasn’t quite sure what he was looking at, or what he should be expecting. He said he later used the Seinfeld scripts as a starting point, as this was at least something he was familiar with.

And that’s an excellent point. You don’t necessarily need to be a subject matter expert to be an analyst, but it sure helps to have some idea of what exactly you are analyzing. Also, I would think there’s a higher probability that you care about what you are trying to analyze if you know something about it.

On that note, it was enjoyable to analyze the scripts in this manner, and see something so familiar as a television show visualized as data like any other. I think the major themes and changes in the plotlines of the show were well represented in this way.

In terms of future directions, I tried looking at the correlation between terms using the findAssocs() function but got strange results, which I believe is due to the small number of documents. At a later point I’d like to do that properly, with a larger number of documents (perhaps tweets). Also this would work better if synonym replacement for the characters was handled in the original corpus, instead of ad-hoc and after the fact (see code).
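For reference, the call in question looks like this (using the term-document matrix from the sketch above; the term and the correlation threshold are just examples):

# Terms correlated with "vindaloo" across episodes, at correlation >= 0.7
findAssocs(tdm, "vindaloo", 0.7)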

Lastly, another thing I took away from all this is that cult TV shows have very, very devoted fan-bases. Probably due to its systemic bias, there is an awful lot about Red Dwarf on Wikipedia, and elsewhere on the internet.

Resources

code and data on github
https://github.com/mylesmharrison/reddwarf

Red Dwarf Scripts (Lady of the Cake)

Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA, without a hint of skepticism or irony) people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can go around really calling themselves a data scientist because no one even really knows what data science is yet, there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve done the first real work which I think falls under the category of data science.

I’m no Python guru, but I threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time; that’s for a later article. For now, here’s one of my exploratory plots, of the content rating by date. Already we can gather from this that, at least at Newgrounds, 4-and-a-half stars equals perfection.

Sure feels like science.

The Hour of Hell of Every Morning – Commute Analysis, April to October 2012

Introduction

So a little while ago I quit my job.

Well, actually, that sounds really negative. I’m told that when you are discussing large changes in your life, like finding a new career, relationship, or brand of diet soda, it’s important to frame things positively.

So let me rephrase that – I’ve left the job I previously held to pursue other directions. Why? Because I have to do what I love. I have to move forward. And I have to work with data. It’s what I want, what I’m good at, and what I was meant to do.

So onward and upward to bigger, brighter and better things.

But I digress. The point is that my morning commute has changed.

Background

I really enjoyed this old post at Omninerd, about commute tracking activities and an attempt to use some data analysis to beat traffic mathematically. So I thought, hey, I’m commuting every day, and there’s a lot of data being generated there – why not collect some of it and analyze it too?

The difference here being that I was commuting with public transit instead of driving. So yes, the title is a bit dramatic (it’s an hour of hell in traffic for some people, I actually quite enjoy taking the TTC).

When I initially started collecting the data, I had intended to time both my commute to and from work. Unfortunately, I discovered that, due to having a busy personal and professional life outside of the 9 to 5, there was little point in tracking my commute at the end of the work day, as I was very rarely going straight home (I was ending up with a very sparse data set). I suppose this was one point of insight into my life before even doing any analysis in this experiment.

So I just collected data on the way to work in the morning.

Without going into the personal details of my life in depth, my commute went something like this:

  • walk from home to station
  • take streetcar from station west to next station
  • take subway north to station near place of work
  • walk from subway platform to place of work

Punching the route into Google Maps tells me the entire distance is 11.5 km. As we’ll see from the data, my travel time was pretty consistent and on average took about 40 minutes every morning (I knew this even before beginning the data collection). So my speed with all three modes of transportation averages out to 11.5 km / (40/60 h) ≈ 17.25 km/hr. That probably doesn’t seem that fast, but if you’ve ever driven in Toronto traffic, trust me, it is.

In terms of the methodology for data collection, I simply used the stopwatch on my phone, starting it when I left my doorstep and stopping it when reaching the revolving doors by the elevators at work.

So all told, I kept track of the date, starting time and commute length (and therefore end time). As with many things in life, hindsight is 20/20, and looking back I realized I could have collected the data in a more detailed fashion by breaking it up for each leg of the journey.

This occurred to me towards the end of the experiment, and so I did this for a day. Though you can’t do much data analysis with just this one day, it gives a general idea of the typical structure of my commute:

Okay, that’s fun and all, but that’s really an oversimplification as the journey is broken up into distinct legs. So I made this graphic which shows the breakdown for the trip and makes it look more like a journey. The activity / transport type is colour-coded the same as the pie chart above. The circles are sized proportionally to the time spent, as are the lines between each section.

There should be another line coming from the last circle, but it looks better this way.

Alternatively the visualization can be made more informative by leaving the circles sized by time and changing the curve lengths to represent the distance of each leg travelled. Then the distance for the waiting periods is zero and the graphic looks quite different:

I really didn’t think the walk from my house was that long in comparison to the streetcar. Surprising.

Cool, no? And there’s an infinite number of other ways you could go about representing that data, but we’re getting into the realm of information design here. So let’s have a look at the data set.

Analysis

So first and foremost, we ask the question: is there a relationship between the starting time of my morning commute and the length of that commute? That is to say, does how early I leave to go to work in the morning impact how long it takes me to get to work, regardless of which day it is?

Before even looking at the data this is an interesting question to consider, as you could assume (I would venture to say know for a fact) that departure time is an important factor for a driving commute, as the speed of one’s morning commute is directly impacted by congestion, which is relative to the number of people commuting at any given time.

However, I was taking public transit, and I’m fairly certain congestion doesn’t affect it as much. Plus I headed in the opposite direction of most commuters (away from the downtown core). So is there a relationship here?

Looking at this graph we can see a couple of things. First of all, there doesn’t appear to be a salient relationship between the commute start time and duration. Some economists are perfectly happy to run a regression and slam a trend line through a big cloud of data points, but I’m not going to do that here. Maybe if there were a lot of points I’d consider it.

The other reason I’m not going to do that is that you can see from looking at this graph that the data are unevenly distributed. There are more large values and outliers in the middle, but that’s only because the majority of my commutes started between ~8:15 and ~9:20, so that’s where most of the data lie.

You can see this if we look at the distribution of starting hour:

I’ve included a density plot as well so I don’t have to worry about bin-sizing issues, though it should be noted that in this case it gives the impression of continuity when there isn’t any. It does help illustrate the earlier point, however, about the distribution of starting times. If I were a statistician (which I’m not) I would comment on the distribution being symmetrical (i.e. not skewed) and on its kurtosis.
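For the curious, the histogram-plus-density combination is simple to produce in base R – a sketch, assuming the same column names as in the ANOVA code further down:

# Histogram of commute start hour with a kernel density overlay
hist(commute$starthour, freq = FALSE, xlab = "Start hour",
     main = "Distribution of commute start time")
lines(density(commute$starthour), lwd = 2)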

The distribution of commute duration, on the other hand, is skewed:

I didn’t have any morning where the combination of my walking and the TTC could get me to North York in less than a half hour.

Next we look at commute duration and starting hour over time. The black line is a 5-day moving average.
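A moving average like this is easy to compute with stats::filter – a minimal sketch, again assuming the commute data frame used in the ANOVA below:

# 5-day centred moving average of commute duration
ma5 <- stats::filter(commute$time, rep(1/5, 5), sides = 2)

plot(commute$time, type = "l", col = "grey60",
     xlab = "Commute number", ylab = "Commute duration (min)")
lines(ma5, lwd = 2)  # the black moving-average line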

Other than several days near the beginning of the experiment in which I left for work extra early, the average start time for the morning trip did not change greatly over the course of the months. It looks like there might be some kind of pattern in the commute duration though, with that peaking?

We can investigate if this is the case by comparing the commute duration per day of week:

There seems to be slightly more variation in the commute duration on Mondays, and it takes a bit longer on Thursdays? But look at the y-axis. These aren’t big differences; we’re talking about a matter of several minutes here. The breakdown for when I leave each day isn’t particularly earth-shattering either:

Normally, I’d leave it at that, but are these differences significant? We can do a one-way ANOVA and check:

> aov1 <- aov(starthour ~ weekday, data = commute)
> aov2 <- aov(time ~ weekday, data = commute)
> summary(aov1)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4  0.456  0.1140     0.7  0.593
Residuals   118 19.212  0.1628
> summary(aov2)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4   86.4   21.59   1.296  0.275
Residuals   118 1965.4   16.66

This requires making a lot of assumptions about the data, but assuming they’re true, these results tell us there aren’t statistically significant differences in either the average commute start time or the average commute duration per weekday.

That is to say, on average, it took about the same amount of time per day to get to work and I left around the same time.

This is in stark contrast to what people talk about around the water cooler when they’re discussing their commute. I’ve never done any data analysis on a morning drive myself (or seen any, other than the post at Omninerd), but there are likely more clearly defined weekly patterns to your average driving commute than what we saw here with public transit.

Conclusions

There are a couple of ways you can look at this.

You could say there were no earth-shattering conclusions as a result of the experiment.

Or you could say that, other than the occasional outlier (of the “Attention All Passengers on the Yonge-University-Spadina line” variety), the TTC is remarkably consistent over the course of the week, as is my average departure time (which is astounding given my sleeping patterns).

It’s all about perspective. So onward and upward, until next time.

Resources

How to Beat Traffic Mathematically

TTC Trip Planner
myTTC (independently built by an acquaintance of mine – check out more of his cool work at branigan.ca)
FlowingData: Commute times in your area, mapped [US only]

OECD Data Visualization Challenge: My Entry

The people behind Visualising are doing some great things. As well as providing access to open data sets, an embeddable data visualization player, and a massive gallery of data visualizations, they are building an online community of data visualization enthusiasts (and professionals) with their challenges and events.

In addition, those behind it (Seed and GE) are also connecting people in the real world with their data visualization marathons for students, which are looking to be the dataviz equivalent of the ever-popular hackathons held around the world. As far as I know no one else is really doing this sort of thing, with a couple of notable exceptions – for example Datakind and their Data Dive events (these are not strictly visualization-focused, however).

Okay, enough copious hyperlinking.

The latest challenge was to visualize the return on education around the world using some educational indicators from the OECD, and I thought I’d give it a go in Tableau Public.

For my visualization I chose to highlight the differences in the return on education not only between nations, but also the gender-based differences for each country.

I incorporated some other data from the OECD portal on GDP and public spending on education, and so the countries included are those with data present in all three sets.

The World Map shows the countries, coloured by GDP. The bar chart to the right depicts the public spending on education, both tertiary (blue) and non-tertiary (orange), as a percentage of GDP.

The scatterplots contrast both the gender-specific benefit-cost ratios per country, as well as between public (circles) and private (squares) benefit, and between the levels of education. A point higher up on the plots and to the left has a greater benefit-cost ratio (BCR) than a point lower and to the right, which represents a worse investment. The points are sized by internal rate-of-return (ROR).

All in all it was fun and not only did I learn a lot more about using Tableau, it gave me a lot of food for thought about how to best depict data visually as well.

Top 5 Tips for Communicating Data

Properly communicating a message with data is not always easy.

If it were, everyone could do it, and there wouldn’t be questions at the end of presentations, discussions around the best way to tweak a scatterplot, or results to a Google Images search for chartjunk.

Much has been written on the subject of how to properly communicate data, and there’s a real art and science to it. Many fail to appreciate this, which can result in confusion – about the message trying to be conveyed, the salience of various features of the data being presented, or why the information is important.

There’s a lot to be said on the subject, but keep these 5 tips for communicating data in mind, and when you have a data-driven message to get across they will help you do so with clarity and precision.

1. Plan: Know What You Want to Say

Just like you wouldn’t expect an author to write a book without a plot, or an entrepreneur to launch a new venture without a business plan, you can’t expect to march blindly into creating a report or article using data without knowing what you want to say.

Sometimes all the analysis will have already been done, and so you just need to think about how to best present it to get your message across. What variables and relationships are most important? What is the best way to depict them? Why oh why am I using aquamarine in this bar chart?

Other times figuring out your exact message will come together with the analysis, and so you would instead start with a question you want to answer, like “How effective has our new marketing initiative been over the last quarter?” or “How has the size of the middle class in Canada changed over the last 15 years?”

2. Prepare: Be Ready

As I reflected upon in a previous post, sometimes people fail to recognize that just getting the information and putting it in the proper shape is a part of the process that should not be overlooked.

Before you even begin to think about communicating your message, you need to make sure you have the data available and in a format (or formats) that you can comfortably work with. You should also consider what data are most important and how to treat them accordingly, and if any other sets should also be included (see Tip #3).

On this same note, before launching into the analysis or creation of the end product (article, report, slidedeck, etc.) it is important to think about if you are ready in terms of tools. What software packages or analysis environments will be used for the data analysis? What applications will be used to create the end product, whatever it may be?

3. Frame: Context is Key

Another important tip to remember is to properly frame your message by placing the data in context.

Failure to follow this tip results in simply serving up information – data are being presented but there is no message being communicated. Context answers the questions “Why is this important?” and “How is this related to x, y, and z?”

Placing the data in context allows the audience to see how it relates to other data, and why it matters. Do not forget about context, or you will have people asking why they should care about what you are trying to communicate.

4. Simplify: Less is More

Let me be incredibly clear about this: more is not always better. If you want to get a message across, simpler is better. Incredibly complicated relationships can be discussed, depicted, and dissected, but that doesn’t mean that your article, slide or infographic needs to look like a spreadsheet application threw up all over it.

Keep the amount of information that your audience has to process at a time (per slide, paragraph, or figure) small. Relationships and changes should be clearly depicted, and key differences highlighted with differences in colour or shape. The amount of text on graphs should be kept to a minimum, and if this is not possible, then perhaps the information needs to be presented in a different way.

The last thing you want to do is muddle your message with information overload and end up confusing your audience.

5. Engage: It’s Useless If No One Knows It Exists

In the world of business, when creating a report or presenting some data, the audience is often predefined. You create a slidedeck to present to the VP and if your data are communicated properly (because you’ve followed Tips 1-4, wink wink) then all is well and you’re on your way to the top. You email the report and it gets delivered to the client and your dazzling data analysis skills make them an even greater believer in your product. And so on.

In other cases though, like when writing a blog post or news article, your audience may not be picked out for you and so it’s also your job to engage them. All your dazzling data analysis and beautiful visual work will contribute nothing if no eyeballs are laid upon it. For this reason, another tip to remember is to engage interested parties, either directly or indirectly through channels such as social media.

What Are You Waiting For?

So there are your Top 5 Tips for Communicating Data. Like I said, it’s not always easy. Keep these tips in mind, and you’ll ask yourself the right questions before you give all the answers.

Go. Explore the data, and be great. Happy communicating.

Quantified Self Toronto #15 – Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slidedeck form.

Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and so much interesting discussion about communications was had, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis which could have been done, and which, in most cases, I had overlooked. These suggestions to dig deeper included analysis of:

  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact, and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it’s what you want to know and what you want to say that help determine how to best look at it.

It’s late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.

Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. More can be found out about them on their awesome blogs, or by visiting quantifiedself.ca