I’m Dreaming of a White Christmas

I’m heading home for the holidays soon.

It’s been unseasonably warm this winter, at least here in Ontario – so much so that squirrels in Ottawa are getting fat. I wanted to put together a really cool post predicting the chance of a white Christmas using lots of historical climate data, but it turns out Environment Canada has already crunched the numbers on something like that. We can just slam this into Google Fusion Tables and get some nice visualizations of simple data.

Map


It seems everything above a certain latitude has a much higher chance of having a white Christmas in recent times than areas closer to the American border and on the coasts, which I’m going to guess is likely due to how cold those areas get on average during the winter. Sadly Toronto has less than a coin-flip’s chance of a white Christmas in recent times, with only a 40% chance of snow on the ground come the holiday.

Chart

But just because there’s snow on the ground doesn’t necessarily mean that your yuletide weather is worthy of a Christmas storybook or holiday movie. Environment Canada also has a definition for what they call a “Perfect Christmas”: 2 cm or more of snow on the ground and snowfall at some point during the day. Which Canadian cities had the most of these beautiful Christmases in the past?

Interestingly Ontario, Quebec and Atlantic Canada are better represented here, which I imagine has something to do with how much precipitation they get due to proximity to bodies of water, but hey, I’m not a meteorologist.
A white Christmas would be great this year, but I’m not holding my breath. Either way it will be good to sit by the fire with an eggnog and not think about data for a while. Happy Holidays!

Visual Analytics of Every PS4 and XBox One Game by Install Size

Introduction

I have a startling confession to make: sometimes I just want things to be simple. Sometimes I just want things to be easier. All this talk about “Big Data”, predictive modeling, machine learning, and all those associated bits and pieces that go along with data science can be a bit mentally exhausting. I want to take a step back and work with a smaller dataset, something simple, something easy, something everyone can relate to – after all, that’s what this blog started out being about.

A while back, someone posted on Slashdot that the folks over at Finder.com had put together data sets of the install size of every PS4 and Xbox One game released to date. Being a console owner myself – I’m a PS4 guy, but no fanboy or hardcore gamer by any means – I thought this would be a fun and rather unique data set to play around with, one that would fit well within the category of ‘everyday analytics’. So let’s take a look, shall we?

Background

Very little background required here – the dataset comprises the title, release date, release type (major or indie), console (PS4 or Xbox One), and size in GiB of all games released as of September 10th, 2015. For this post we will ignore the time-related dimension and look only at the quantity of interest: install size.

Analysis

Okay, if I gave this data to your average Excel jockey what’s the first thing they’d do? A high level summary of the data broken apart by categorical variables and summarized by quantitative? You got it!
We can see that far more PS4 games have been released than Xbox One (462 vs. 336), and that the relative proportions of major versus indie releases are reversed between the two platforms.
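A tally like that is only a few lines of Python. Here’s a minimal sketch; the rows are made up to stand in for the real Finder.com data set:

```python
from collections import Counter, defaultdict

# Made-up rows standing in for the real data: (title, console, type, GiB)
games = [
    ("Game A", "PS4", "Major", 40.2),
    ("Game B", "PS4", "Indie", 1.1),
    ("Game C", "PS4", "Indie", 0.8),
    ("Game D", "Xbox One", "Major", 45.0),
    ("Game E", "Xbox One", "Indie", 2.3),
]

# Release counts per console
per_console = Counter(console for _, console, _, _ in games)

# Average install size per (console, release type) group
sizes = defaultdict(list)
for _, console, rtype, gib in games:
    sizes[(console, rtype)].append(gib)
avg_size = {group: sum(v) / len(v) for group, v in sizes.items()}
```

The same grouping done on the real data produces the counts and averages shown in the charts.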

A small aside here on data visualization: it’s worth noting that the above is a good way to go for making a bar chart from a functional perspective. Since there are data labels and the y-axis metric is in the title, we can ditch the axis and maximize the data-ink ratio (well, data-pixel anyhow). I’ve also avoided using a stacked bar chart as interpretation of absolute values tends to suffer when not read from the same baseline. I’m okay with doing it for relative proportions though – as in the below, which further illustrates the difference in release type proportion between the two consoles:

Finally, how does the install size differ between the consoles and game types? If I’m an average analyst and just trying to get a grip on the data, I’d take an average to start:
We can see (unsurprisingly, if you know anything about console games) that major releases tend to be much larger in size than indie. Also in both cases, Xbox install sizes are larger on average: about 1.7x for indie titles and 1.25x for major.

Okay, that’s interesting. But if you’re like me, you’ll be thinking about how 99% of the phenomena in the universe are distributed by a power law or have some kind of non-Gaussian distribution, and so averages are actually not always such a great way to summarize data. Is this the case for our install size data set?

Yes, it is. We can see here in this combination histogram / cumulative distribution (in the form of a Pareto chart) that the games follow a power law, with approximately 55 and 65 percent being < 5 GiB for PS4 and Xbox games respectively.
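The cumulative share plotted in the Pareto chart reduces to a simple threshold count. A sketch, with invented install sizes:

```python
def share_below(sizes_gib, threshold_gib):
    """Fraction of install sizes below a threshold, as plotted cumulatively."""
    return sum(size < threshold_gib for size in sizes_gib) / len(sizes_gib)

# Invented install sizes in GiB, skewed like the real data
ps4_sizes = [0.3, 0.9, 1.4, 2.2, 3.8, 4.6, 11.0, 24.5, 39.9, 47.2]
under_5 = share_below(ps4_sizes, 5)  # share of titles smaller than 5 GiB
```

Sweeping the threshold over the histogram bin edges gives the full cumulative curve.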

But is this entirely due to the indie games having small sizes? Might the major releases be centered around some average or median size?

No, we can see that even when broken apart by type of release, the power-law-like distribution for install sizes persists. I compared the averages to the medians and found the former to still be decent representations of central tendency, not too affected by outliers.

Finally we can look at the distribution of the install sizes by using another type of visualization suited for this task, the boxplot. While it is at least possible to jury-rig up a boxplot in Excel (see this excellent how-to over at Peltier Tech) Google Sheets doesn’t give us as much to work with, but I did my best (the data label is at the maximum point, and the value is the difference between the max and Q3):

The plots show that install sizes are generally greater for Xbox One vs. PS4, and that the difference (and skew) appears to be a bit more pronounced for indie games versus major releases, as we saw in the previous figures.

Okay, that’s all very interesting, but what about games that are available for both consoles? Are the install sizes generally the same or do they differ?

Difference in Install Size by Console Type
Because we’ve seen that the Xbox install sizes are generally larger than PlayStation’s, here I take the PS4 size as the baseline for games which appear on both (that is, differences are of the form Xbox size – PS4 size).

Of the 618 unique titles in the whole data set (798 titles if you double count across platform), 179 (~29%) were available on both – so roughly only a third of games are released for both major consoles.
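Matching titles across the two platforms and taking the Xbox-minus-PS4 difference can be sketched like so (the titles and sizes here are invented):

```python
# Invented per-platform lookups: title -> install size in GiB
ps4 = {"Title A": 10.0, "Title B": 2.0, "Title C": 35.0}
xbox = {"Title A": 12.5, "Title B": 1.5, "Title D": 50.0}

# Titles available on both consoles
both = sorted(ps4.keys() & xbox.keys())

# Differences of the form Xbox size - PS4 size, with PS4 as the baseline
diff_gib = {title: xbox[title] - ps4[title] for title in both}
diff_pct = {title: 100 * diff_gib[title] / ps4[title] for title in both}
larger_on_xbox = sum(d > 0 for d in diff_gib.values())
```

The same intersection on the real data yields the 179 cross-platform titles and the 152 that are larger on Microsoft’s console.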

Let’s take a look at the difference in install sizes – do those games which appear for both reflect what we saw earlier?

Yes, for both categories of game the majority are larger on Xbox than PS4 (none were the same size). Overall about 85% of the games were larger on Microsoft’s console (152/179).

Okay, but how much larger? Are we talking twice as large? Five times larger? Because the size of the games varies widely (especially between the release types) I opted to go for percentages here:

Unsurprisingly, indie games tend to have proportionally larger differences on average, because they’re generally much smaller in size than major releases. We can see they are nearly twice as large on Xbox vs. PS4, while major releases are about 1.25 times as large. When games are larger on PS4, the disparity is not as big, and the pattern across release types is the same (though keep in mind the number of games here is a lot smaller than for the former group).

Finally, just to ground this a bit more I thought I’d look at the top 10 games in each release type where the absolute differences are the largest. As I said before, here the difference is Xbox size minus PS4:

For major releases, the worst offender for being larger on PS4 is Batman: Arkham Knight (~6.1 GiB difference), while on the opposite end, The Elder Scrolls Online is ~19 GiB larger on Xbox. Wow.

For indies, we can see the absolute difference is a lot smaller for those games bigger on PS4, with Octodad having the largest difference of ~1.4 GiB (56% of its PS4 size). Warframe is 19.6 GiB bigger on Xbox than PS4, or 503% larger (!!)

Finally, I’ve visualized all the data together for you so you can explore it yourself. Below is a bubble chart of the Xbox install size plotted against PS4, coloured by release type, where the size of each point represents the absolute value of the percentage difference between the platforms (with the PS4 size taken to be the baseline). So points above the diagonal are larger for Xbox than PS4, and points below the diagonal are larger for PS4 than Xbox. Also note that the scale is log-log. You can see that most of the major releases are pretty close to each other in size, as they nearly lie on the y=x line.

Conclusion

It’s been nice to get back into the swing of things and do a little simple data visualization, as well as play with a data set that falls into the ‘everyday analytics’ category.
And, as a result, we’ve learned:

  • Xbox games generally tend to have larger install sizes than PS4 ones, even for the same title
  • Game install sizes follow a power law, just like most everything else in the universe (or maybe just 80% of it)
  • What the heck a GiB is
Until next time then, don’t fail to keep looking for the simple beauty in data all around you.

References & Resources

Complete List of Xbox One Install Sizes:
Complete List of PlayStation 4 Install Sizes:
Compiled data set (Google Sheets):
Excel Box and Whisker Diagrams (Box Plots) @ Peltier Tech:

Data Visualization Fundamentals with Skittles

So I have a shocking confession to make: I love Skittles.

This post is not sponsored, endorsed, compensated, or paid for in any way, shape or form, by Skittles Candy. I’m not particular – I like other types of candy that are similar – you know, those ones that are chocolate covered in a hard shell, whether they be the kind where you eat the red ones last or not.

Anyhow, I got to thinking about how, abstractly, each individual candy can be viewed like a pixel of a different colour. So you can make art using candy, just like artists make a mosaic. There are lots of examples of this on the internet already: in fact, Skittles has done print advertising this way.

But…. each individual candy can also represent something else: a unit of measurement. I thought it would be cool to go through some data visualization fundamentals using the candy in this way. So let’s dive in.

Data Visualization using only 1 bag of Skittles

So, what would your average first grader do with a bag of Skittles if you asked them to sort it? Probably something like below, the physical equivalent of a bubble chart depicting the quantities of each colour by area, assuming each Skittle is approximately the same size.

A perhaps more useful way to do the same would be to organize each colour in rows, with each row a set number (like tally marks). Here it’s not only easy to see the relative proportions of the different colours in the bag, but also count them as each row and group is a set number (5 & 10, respectively). This is equivalent to a pictogram, with each Skittle representing, well, 1 Skittle:

It’s not a big stretch of the imagination to collapse those groups together into groups of a set height. So here we have a proportional bar chart, where the length of each bar represents the percentage of the bag that is each colour. Note that because I didn’t slice Skittles in half, the physical analogue is not exactly the same as what you’d put down on paper or in Excel (there is one additional unit for yellow and orange):

And, as I both often have to remind people of this rule, and also observe many people not following it, it is best practice to sort the bars in descending order for maximum clarity / comparative value (assuming there is not another more important ordering):

And, if we want to transform our proportional bar chart into one comparing absolute quantities, it is not a giant stretch of the imagination to break apart the different bars so they are only one ‘pixel’ high:

Here it’s much easier to get an idea of the absolute number of each colour in the bag, but harder to tally the numbers exactly – for that we’d need to add an axis or data labels.

More bags please

Okay, I have another shocking confession to make: I lied. I really like Skittles. So I actually bought a whole bunch of bags.

So let’s look at some more visualization fundamentals, which require comparing not only across a categorical variable (colour) but also between groups.

Here is the equivalent to our first graph from before, only showing the different numbers of Skittles in each bag. You can see there’s actually a fair amount of variance; the smallest bag had 89 pieces of candy, whereas the largest had 110.

Now let’s make a bubble graph which not only compares the sizes between the different bags, but also their makeups by colour. The end result is actually closer to a collection of pie charts:

We can also group by colour only to see the overall makeup for the whole group of bags. Whereas orange dominated in the first bag we looked at, you can see here that orange and yellow are approximately at parity overall.

Now let’s look at the tally mark / pictograph method. Here each row represents a bag:

You can see there’s a fair bit of variance in the different colours. I also tried rearranging things so the result was less like a pictograph and more like a treemap:

Really the best way to compare would be a bar graph. Here’s a stacked bar graph. I didn’t bother sorting by length, because at this point I was pretty tired of shuffling Skittles around:

To get a better idea of the different makeups of each bag by colour, we can break this out into a grouped bar graph, first by bag, then by colour:
And, of course, we can reverse the order if we want to more directly compare the colour makeups. The columns are in numerical order by bag. And just for fun, we’ll make this one a column chart:
There. That’s better! Clearly Bag 1 was an outlier as far as the number of purple went, and Bag 3 had a lot of yellow. 

Concluding Remark

I thought it’d be cool to mix things up a bit and try doing some data visualization using a physical medium. The end result ended up being something more like an exercise for an elementary school mathematics class (indeed, there are many examples of this online), but I think it still drives home some of the fundamental strengths and weaknesses of different visualization types, as well as showing how they can be depicted using different media.
If you’re really interested, you can download the data yourself and slice and dice visualizations to your heart’s content. And I’m sure if you bought enough bags of Skittles you could learn something of a statistical nature about their manufacturing and packaging process – but perhaps that’s for a different day. Until then I’ll just enjoy good candy and data visualization.

What’s in My Inbox? Data Analysis of Outlook

Introduction

Email is the bane of our modern existence.

Which of us hasn’t had a long, convoluted, back-and-forth email thread going on for days (if not weeks) in order to settle an issue that could have been resolved with a simple 5-minute conversation?

With some colleagues of mine, email has become so overwhelming (or their attempts to organize it so futile) that it brings to my mind Orwell’s workers at the Ministry of Truth in 1984 and their pneumatic tubes and memory holes – if the message you want is not in the top 1% (or 0.01%) of your inbox and you don’t know how to use search effectively, then for all intents and purposes it might as well be gone (see also: Snapchat).

Much has been written on the subject of why exactly we send and receive so much of it, how to best organize it, and whether or not it is, in fact, even an effective method of communication.

At one time even Gmail and the concept of labels was revolutionary – and it has done some good in organizing the ever-increasing deluge that is email for the majority of people. Other attempts have sprung up to tame the beast and make sense of such a flood of communication – most notably in my mind Inbox Zero, the simply-titled smartphone app Mailbox, and MIT’s recent data visualization project Immersion.

But email, with all its systemic flaws, misuse, and annoyances, is definitely here for good, no question. What a world we live in.

But I digress.

Background

I had originally hoped to export everything from Gmail and do a very thorough analysis of all my personal email. Though this is now a lot easier than it used to be, I got frustrated at the time trying to write a Python script and moved on to other projects.
But then I thought, hey, why not do the same thing for my work email? I recently discovered that it’s quite easy to export email from Outlook (as I detailed last time) so that brings us to this post.
I was somewhat disappointed that Outlook can only export a folder at a time (which does not include special folders such as search folders or ‘All Mail’) – I organize my mail into folders and wanted an export of all of it.
That being said, the bulk probably does remain in my inbox (4,217 items in my inbox resulted in a CSV that was ~15 MB) and we can still get a rough look using what’s available. The data cover the period from February 27th, 2013 to November 16th, 2013.

Email by Contact
First let’s look at the top 15 contacts by total number of emails. Here are some pretty simple graphs summarizing that data, first by category of contact:

In the top 15, the split between co-workers/colleagues and management is pretty even. I received about 5 times as much email from coworkers and managers as from stakeholders (but then again a lot of the latter ended up sorted into folders, so this count is probably higher in reality). Still, I don’t directly interact with stakeholders as much as some others do, and tend to work with teams or my immediate manager. Also, calls are usually better.

Here you can see that I interacted primarily with my immediate colleague and manager the most, then other management, and the remainder further down the line are a mix which includes email to myself and from office operations. Also of note – I don’t actually receive that much email (I’m more of a “in the weeds” type of guy) or, as I said, much has gone into the appropriate folders.

Time-Series Analysis
The above graphs show a very simplistic and high-level view of what proportion of email I was receiving from whom (with a suitable level of anonymity, I hope). More interesting is a quick and simple analysis of patterns over time in the volume of email I received – and I’m pretty sure you already have an idea of what some of these patterns might be.

When doing data analysis, I always feel it is important to first visualize as much of the data as practically possible – in order to get “a feel” for the data and avoid making erroneous conclusions without having this overall familiarity (as I noted in an earlier post). If a picture is worth a thousand words, then a good data visualization is worth a thousand keystrokes and mouse clicks.

Below is a simple scatter plot of all the emails received by day, with the time of day on the y-axis:


This scatterplot is perhaps not immediately illuminating; however, it already shows us a few things worth noting:

  • the majority of emails appear in a band approximately between 8 AM and 5 PM
  • there is increased density of email in the period between the end of July and early October, after which there is a sparse interval until mid-month / early November
  • there appears to be some kind of periodic nature to the volume of daily emails, giving a “strip-like” appearance (three guesses what that periodic nature is…)

We can look into this further by considering the daily volume of emails, as below. The black line is a 7 day moving average:

We can see the patterns noted above – the increase in daily volume after 7/27 and the marked decrease mid-October. Though I wracked my brain and looked thoroughly, I couldn’t find a specific reason why there was an increase over the summer – this was just a busy time for projects (and probably not for myself sorting email). The marked decrease in October corresponds to a period of bench time, which you can see was rather short-lived.

As I noted previously in analyzing communications data, the distribution of this type of information is exponential in nature and usually follows a log-normal distribution. As such, a moving average is not the greatest measure of central tendency – but a decent approximation for our purposes. Still, I find the graph a little more digestible when depicted with a logarithmic y-axis, as below:
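The 7-day moving average drawn as the black line is just a trailing window over the daily counts; a minimal version:

```python
def moving_average(values, window=7):
    """Trailing moving average; early points use however many values exist."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

daily_counts = [12, 30, 25, 8, 0, 0, 18, 22, 31]  # made-up daily email volumes
trend = moving_average(daily_counts)
```

Run over the real daily volumes, this produces the smoothed line plotted above (and, on a log axis, the more digestible version).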

Lastly we consider the periodic nature of the emails which is noted in the initial scatterplot. We can look for patterns by making a standard heatmap with the weekday as the column and hour of day as the row, as below:

You can clearly see that the majority of work email occurs between the hours of 9 to 5 (shocking!). However, some other interesting points of note are the bulk of email in the mornings at the beginning of the week, the fall-off after 5 PM at the end of the week (Thursday & Friday), and the messages received Saturday morning. Again, I don’t really receive that much email, or have spirited a lot of it away into folders as I noted at the beginning of the article (this analysis does not include things like automated emails and reports, etc.).
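A heatmap like this is driven by nothing more than counts keyed by (weekday, hour). With timestamps parsed from the export, it could be tallied like so (the timestamps are invented):

```python
from collections import Counter
from datetime import datetime

# A few invented received-timestamps standing in for the Outlook export
received = [
    datetime(2013, 3, 4, 9, 15),   # a Monday morning
    datetime(2013, 3, 4, 9, 45),
    datetime(2013, 3, 8, 17, 30),  # a Friday after 5
]

# Cell counts for the heatmap: weekday 0 = Monday
cells = Counter((ts.weekday(), ts.hour) for ts in received)
```

Each (weekday, hour) key maps to one cell of the grid; the count is the cell’s intensity.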

Email Size & Attachments
Looking at file attachments, I believe the data are more skewed than the rest, as the clean-up of large emails is a semi-regular task for the office worker (as not many have the luxury of unlimited email inbox capacity – even executives), so I would expect values on the high end to have largely been removed. Nevertheless it still provides a rough approximation of how email sizes are distributed and what proportion have attachments included.

First we look at the overall proportion of email left in my inbox which has attachments – of the 4,217 emails, 2,914 did not have an attachment (69.1%) and 1,303 did (30.9%).

Examining the size of emails (which includes the attachments) in a histogram, we see a familiar looking distribution, which here I have further expanded by making it into a Pareto chart. (note that the scale on the left y-axis is logarithmic):

Here we can see that of what was left in my inbox, all messages were about 8 MB in size or less, with the vast majority being 250K or less. In fact 99% of the email was less than 1750KB, and 99.9% less than 6MB.
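Those percentile claims come straight from the empirical distribution; a small helper, with invented message sizes:

```python
def share_at_or_below(sizes_kb, threshold_kb):
    """Percent of messages whose size is at or below the given threshold."""
    return 100 * sum(size <= threshold_kb for size in sizes_kb) / len(sizes_kb)

# Invented message sizes in KB, skewed like the real inbox
sizes_kb = [15, 40, 90, 140, 220, 250, 400, 900, 1700, 8200]
pct_small = share_at_or_below(sizes_kb, 250)
```

Evaluating this at 1,750 KB and 6 MB on the real data gives the 99% and 99.9% figures quoted above.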

Conclusion

This was a very quick analysis of what was in my inbox, however we saw some interesting points of note, some of which confirm what one would expect – in particular:
  • the vast majority of email is received between the hours of 9-5, Monday to Friday
  • the majority of email I received was from the two managers & colleagues I work closest with
  • approximately 3 out of 10 emails I received had attachments
  • the distribution of email sizes is log-normal in nature
If I wanted to take this analysis further, we could also look at the trending by contact and also do some content analysis (the latter not being done here for obvious reasons, of course).
This was an interesting exercise because it made me mindful again of what everyday analytics is all about – analyzing rich data sets we are producing all the time, but of which we are not always aware.

References and Resources

Inbox Zero
http://inboxzero.com/

Mailbox
http://www.mailboxapp.com/

Immersion
https://immersion.media.mit.edu/

Data Mining Email to Discover Organizational Networks and Emergent Communities in Work Flows

Bananagrams!!!

It was nice to be home with the family for Thanksgiving, and to finally take some time off.

A fun little activity which took up a lot of our time over the past weekend was Bananagrams, which, if you don’t already know, is sort of like a more action-packed version of Scrabble without the board.

Being the type of guy that I am, I started to think about the distribution of letters in the game. A little Googling led to some prior art to this post.

The author did something neat (which I wouldn’t have thought of) by making a sort of bar chart using the game pieces. Strangely though, they chose not to graph the different distributions of letters in Bananagrams and Scrabble but instead listed them in a table.

So, assuming the data from the post are accurate, here is a quick breakdown of said distributions below. As an added bonus, I’ve also included that trendy digital game that everyone plays on Facebook and their iDevices:

Bar graph of letter frequencies of Scrabble, Bananagrams and Words with Friends

Looking at the graph, it’s clear that Bananagrams has more tiles than the other games (the total counts are 144, 104 and 100 for Bananagrams, Words with Friends and Scrabble respectively) and notably it also does not have blank tiles, of which the other games have 2 each. Besides the obvious prevalence of vowels in all 3 games, T, S, R, N, L and D also have high occurrence.

We can also compare the relative frequencies of the different letters in each game with respect to Scrabble. Here I took the letter frequency for each game (as a percent) then divided it by the frequency of the same letter in Scrabble. The results are below:

Bar chart of Bananagrams and Words with Friends letter frequencies relative to Scrabble

Here it is interesting to note that the relative frequency of H in Words with Friends is nearly double that in Scrabble (~192%). Also D, S and T have greater relative frequencies. The remaining letters are fairly consistent with the exception of I and N which are notably less frequent.

Bananagrams relative letter frequency is fairly consistent overall, with the exception of J, K, Q, X, and Z, which are around the 140% mark. I guess the creator of the game felt there weren’t enough of the “difficult” letters in Scrabble.
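The relative frequencies are just ratios of tile shares. For example, H in Words with Friends versus Scrabble, using the totals from the post (104 and 100 tiles) and the Scrabble count of 2 H tiles; the Words with Friends count of 4 is inferred from the ~192% figure:

```python
def relative_frequency(count, total, base_count, base_total):
    """Letter share in one game as a percent of its share in a baseline game."""
    return 100 * (count / total) / (base_count / base_total)

# H in Words with Friends relative to Scrabble
h_relative = relative_frequency(4, 104, 2, 100)  # ~192%
```

Repeating this for every letter in each game produces the relative-frequency chart above.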

There’s more analysis that could be done here (looking at the number of points per letter in WWF & Scrabble versus their relative frequency immediately comes to mind) but that should do for now. Hope you found this post “a-peeling”.

xkcd: Visualized

Introduction

It’s been said that the ideal job is one you love enough to do for free but are good enough at that people will pay you for it. That if you do what you love no matter what others may say, and if you work at it hard enough, and long enough, eventually people will recognize it and you’ll be a success.

Such is the case with Randall Munroe, because any nerd worth their salt knows what xkcd is.

What started as simply a hobby and posting some sketches online turned into a cornerstone of internet popular culture, with a cult following amongst geekdom, the technically savvy, and more.

Though I would say that it’s gone beyond that now, and even those less nerdy and techie know what xkcd means – it’s become such a key part of internet popular culture. Indeed, Mr. Munroe’s work carries real sway due to the sheer number of people who know and love it, and content on the site has resulted in changes being made on some of the biggest sites on the Internet – take, for example, Google adding a comment read-aloud feature in 2008, quite possibly because of a certain comic.

As another nerdy / tech / data citizen of the internet who knows, loves and follows xkcd, I thought I could pay tribute to it with its own everyday analysis.

Background

Initially, I thought I would have to go about doing it the hard way again. I’ve done some web scraping before with Python and thought this would be the same using the (awesome) Beautiful Soup package.

But Randall, being the tech-savvy (and Creative Commons abiding) guy that he is, was nice enough to provide an API to return all the comic metadata in JSON format (thanks Randall!).

That being said, it was straightforward to write some Python with urllib2 to download the data and then get going on the analysis.
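The fetch itself is only a few lines; here’s a sketch using Python 3’s urllib.request in place of urllib2:

```python
import json
from urllib.request import urlopen

def comic_metadata(num):
    """Fetch the JSON metadata for comic #num from the xkcd API."""
    with urlopen(f"https://xkcd.com/{num}/info.0.json") as resp:
        return json.load(resp)

def fields_of_interest(meta):
    """Keep just the fields used in this post."""
    return {
        "num": meta["num"],
        "title": meta["title"],
        "date": f"{meta['year']}-{int(meta['month']):02d}-{int(meta['day']):02d}",
        "img": meta["img"],
        "alt": meta["alt"],
    }
```

Looping num from 1 to 1204 (skipping #404, which famously doesn’t exist) rebuilds the whole data set.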

Of course, after doing all that I realized that someone else was nice enough to have already written the equivalent code in R to access the data. D’oh! Oh well. Lesson learned – Google stuff first, code later.

But it was important to write that code in Python as I used the Python Imaging Library (PIL) (also awesome… thanks mysterious, shadowy developers at Pythonware/Secret Labs AB) to extract metadata from the comic images.

The data includes the 1204 comics from the very beginning (#1, Barrel – Part 1 posted on Jan 1, 2006) to #1204, Detail, posted on April 26, 2013.

As well as the data provided via the JSON (comic #, url, title, date, transcript and alt text) I pulled out additional fields using the Python Imaging Library (file format, filesize, dimensions, aspect ratio and luminosity). I also wanted to calculate hue, however, regrettably this is a somewhat more complicated process which my image processing chops were not immediately up to, and so I deferred on this point.
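The PIL side amounts to opening each downloaded image and reading a handful of properties. A sketch using the modern Pillow fork, where luminosity is taken as the mean pixel value after greyscale conversion (my assumption for how to measure it):

```python
from PIL import Image, ImageStat

def image_metadata(img):
    """Format, mode, dimensions, aspect ratio and mean luminosity (0-255)."""
    width, height = img.size
    return {
        "format": img.format,   # "PNG", "JPEG" or "GIF"; None for in-memory images
        "mode": img.mode,       # "L", "P", "RGB", "LA", "RGBA"
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "luminosity": ImageStat.Stat(img.convert("L")).mean[0],
    }

# Usage on a downloaded comic: image_metadata(Image.open("barrel.png"))
```

File size comes from the file on disk (os.path.getsize) rather than from PIL.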

Analysis

File type
Ring chart of xkcd comics by file type / Bar chart of xkcd comics by file type


You can see that out of the 1204 comics, 1073 (~89.19%) were in PNG format, 128 (~10.64%) were in JPEG, and only 2 (~0.17%) were in GIF (#961, Eternal Flame and #1116, Stoplight). This, of course, is because those two are the only comics which are animated.

Looking at the filetype over time below, you can see that initially xkcd was primarily composed of JPEG images (mostly because they were scanned sketches) and this quickly changed over time to being almost exclusively PNG with the exception of the two aforementioned animated GIFs. The lone outlying JPEG near 600 is Alternative Energy Revolution (#556).

strip chart of xkcd comics by file type
Image Mode
Next we can look at the image mode of all the xkcd images. For a little context, the image modes are roughly as follows:
  • L – 8 bit black & white
  • P – 8 bit colour
  • RGB – colour
  • LA, RGBA – black & white with alpha channel (transparency), colour with alpha channel

The breakdown for all the comics is depicted below.

ring chart of xkcd comics by image mode / bar chart of xkcd comics by image mode

You can see that the majority are image mode L (847, ~70.41%), followed by 346 in RGB (~28.76%); a tiny number are in P (8, ~0.7%), with the remaining two in L and RGB modes with an alpha channel (LA & RGBA).

Any readers will know that the bulk of xkcd comics are simple black-and-white images with stick figures and you can see this reflected in the almost ¾ to ¼ ratio of monochrome to coloured images.

The two images with alpha channel are Helping (#383) and Click and Drag (#1110), most likely because of the soft image effect and interactivity, respectively.

Looking at the image mode over time, we can see that like the filetype, almost all of the images were initially in RGB mode as they were scans. After this period, the coloured comics are fairly evenly interspersed with the more common black and white images.

strip chart of xkcd comics by image mode
Luminosity

You can see in the figure on the left that given the black-and-white nature of xkcd the luminosity of each image is usually quite high (the maximum is 255). We can see the distribution better summarized on the right in a histogram:

scatterplot of luminosity of xkcd comics / histogram of luminosity of xkcd comics

Luminosity was the only quality of the images which showed significant change over the years that Randall has created the comic. Doing an analysis of variance, we can see there is a statistically significant year-on-year difference in the average comic brightness (at the > 99% confidence level):

> aov(data$lumen ~ data$year)
Call:
   aov(formula = data$lumen ~ data$year)

Terms:
                data$year Residuals
Sum of Squares     5762.0  829314.2
Deg. of Freedom         1      1201

Residual standard error: 26.27774
Estimated effects may be unbalanced

> summary(aov(data$lumen ~ data$year))
              Df Sum Sq Mean Sq F value  Pr(>F)
data$year      1   5762    5762   8.344 0.00394 **
Residuals   1201 829314     691
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

True, there are currently fewer data points for 2013; however, even excluding this year the result remains significant at the 99% level.
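Since year enters the R model above as a numeric predictor (note the single degree of freedom), the same test can be reproduced in Python as the F-test on the slope of a simple linear regression. The data below are simulated stand-ins, not the real luminosity values:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data: one mean-luminosity value per comic, with its year.
rng = np.random.default_rng(42)
years = rng.integers(2006, 2014, size=1200)
lumen = 230 - 1.0 * (years - 2006) + rng.normal(0, 26, size=1200)

# R's aov(lumen ~ year) with a *numeric* year is a 1-df regression F-test,
# so its p-value equals that of the slope in a simple linear regression.
res = stats.linregress(years, lumen)
print(f"slope = {res.slope:.2f} per year, p = {res.pvalue:.4g}")
```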

The average luminosity decreases year over year, as the downward trend in the plot below shows:

line plot of average luminosity of xkcd per year

Image Dimensions
Next we look at the sizes of each comic. xkcd ranges from very tall comic-book style strips to ultra-simplistic single small images driving the whole point or punch line home in one frame.

scatterplot of height vs. width for xkcd comics

Looking at the height of each comic versus the width, you can see that there appear to be several standard widths at which Randall produces the comic (not so with heights). These standard widths are 400, 500, 600, 640, and 740.

distribution of image heights of xkcd comic

We can see these reflected in the distribution of all image widths: 740 is by far the most common comic width. There is no such pattern in the image heights, which appear to follow a more logarithmic-like distribution.

histogram of width of xkcd comicshistogram of height of xkcd comics

Interestingly, the ‘canonical’ widths are not constant over time – there were several widths used frequently near the beginning, after which the more common standard of 740px took over. This may be due to the large number of scanned images near the beginning, as I imagine scanning an A4 sheet of paper would often result in the same image resolutions.

scatterplot of width of xkcd comics

The one lone outlier on the high end of image width is 780px wide and is #1193, Externalities.

Looking at the aspect ratio of the comics over time, you can see that there appear to be two classes of comics – a larger number (about 60%) which are tightly clustered around an even 1:1 aspect ratio, and a second class more evenly distributed with aspect ratios of 2 and above. There are also small peaks around 1.5 and 1.75.

scatterplot of aspect ratio of xkcd comicshistogram of aspect ratio of xkcd comics

In case you were wondering, the comic with an aspect ratio of ~8 is Tags (#1144) and the tallest comic proportionally is Future Timeline (#887).
Filesize

As well as examining the resolution (dimensions) of the comic images we can also examine the distribution of the images by their filesize.

distribution of file size of xkcd comics

You can see that the majority of the images are below 100K in size – in general the xkcd comics are quite small as the majority are simple PNGs displaying very little visual information.

We can also look at the comic size (area in square pixels) versus the filesize:

scatterplot of file size versus image size of xkcd comicsscatterplot of file size versus image size of xkcd comics (with trend line)

There is clearly a relationship here, as illustrated on the log-log plot on the right with the trend line. Of course, I am just stating the obvious – this relationship is not unique to the comics and exists as a property of the image formats in general.

If we separated out the images by file type (JPEG and PNG) I believe we would see different numbers for the relationship as a result of the particularities of the image compression techniques.
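For what it's worth, the trend line on a log-log plot like this is just a straight-line fit in log space, and fitting it per filetype would give each format its own scaling exponent – a sketch with NumPy, where the per-format arrays are hypothetical:

```python
import numpy as np

def loglog_fit(areas, filesizes):
    """Fit log10(filesize) = a * log10(area) + b; the exponent a
    describes how file size scales with pixel count."""
    x = np.log10(np.asarray(areas, dtype=float))
    y = np.log10(np.asarray(filesizes, dtype=float))
    a, b = np.polyfit(x, y, 1)
    return a, b

# a_png, _ = loglog_fit(png_areas, png_sizes)    # hypothetical per-format arrays
# a_jpg, _ = loglog_fit(jpg_areas, jpg_sizes)
```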

Conclusions

I have this theory that how funny a joke is to someone who gets it is inversely proportional to the number of other people who would get it. That is to say, the more esoteric and niche the comedy is, the funnier and more appealing it is to those who actually get the punch line. It’s a feeling of being special – a feeling that someone else understands and that the joke was made just for you, and others like you, and that you’re not alone in thinking that comics involving Avogadro’s number, Kurt Godel or Turing Completeness can be hilarious.

As an analyst who has come out of the school of mathematics, and continually been immersed in the world of technology, it is reassuring to read something like xkcd and feel like you’re not the only one who thinks matters of data, math, science, and technology can be funny, along with all the other quirkiness and craziness in life which Randall so aptly (and sometimes poignantly) portrays.

That being said, Randall’s one dedicated guy who has done some awesome work for the digitally connected social world of science, technology, and internet geekdom, and now we know how much he likes using 740px as the canvas width, and that xkcd is gradually using less white pixels over the years.

And let’s hope there will be many more years to come.

Resources

xkcd
xkcd – JSON API
xkcd – Wikipedia
code on github

Toronto Licensed Cats & Dogs 2012 Data Visualization

It’s raining cats and dogs! No, I lied, it’s not.

But I wanted to do some more data viz and work with some more open data.

So for this quick plot I present, Cat and Dog Licenses in the City of Toronto for 2012, visualized!


Above in the top pane is the number of licensed cats and dogs per postal code (or Forward Sortation Area, FSA). I really would like to have produced a filled map (choropleth) of the different postal code areas; however, Tableau unfortunately does not have Canadian postal code boundaries, just lat/lon, and getting geographic data in is a bit of an arduous process.

I needed something to plot given that I just had counts of cat and dog licenses per FSA, so I threw up a scatterplot, and there is amazing correlation! Surprise, surprise – this is most likely a lurking third variable, and I bet that if you found a map of (human) population density by postal code you’d see why the two quantities are so closely related. Or perhaps not – this is just my assumption – maybe some areas of the GTA are better about getting their pets licensed, or simply have more cats and dogs. Interesting food for thought.


Above is the number of licenses per breed type. Note that the scale is logarithmic for both as the “hairs” (domestic shorthair, domestic mediumhair and domestic longhair) dominate for cats and I wanted to keep the two graphs consistent.

The graphs are searchable by keyword, try it out!

Also, I find it shocking that the second most popular breed of dog was the Shih Tzu and the fourth most common type of cat was the Siamese – really?

Resources

Toronto Licensed Cat & Dog Reports (at Toronto Open Data Portal)

Toronto Animal Services
http://www.toronto.ca/animal_services/

Top 10 Super Bowl XLVII Commercials in Social TV (Respin)

So the Super Bowl is kind of a big deal.

Not just because there’s a lot of football. And not just because it’s a great excuse to get together with friends and drink a whole lot of beer and eat unhealthy foods. And not because it’s a good excuse to shout at your new 72″ flatscreen with home theater surround that you bought at Best Buy just for your Super Bowl party and are going to try to return the next day even though you’re pretty sure now that they don’t let you do that any more.

The Super Bowl is a big deal for marketers. For creatives. For ‘social media gurus’. Because there’s a lot of eyeballs watching those commercials. In fact, I’m pretty sure there’s people going to Super Bowl parties who don’t even like football and are just there for the commercials, that is if they’ve not decided to catch all the best ones after the fact on YouTube.

And also, you know, because if you’re putting down $6 million for a minute of commercial airtime, you want to make sure that those dollars are well spent.

So Bluefin Labs has been generating a lot of buzz lately, as they were acquired by Twitter. TV is big, social media is big, so Social TV analytics must be even bigger, right? Right?

Anyhow Bluefin showed up recently in my Twitter feed for a different reason: their report on the Top 10 Super Bowl XLVII commercials in Social TV that they did for AdAge.

The report’s pretty and all, but a little too pretty for my liking, so I thought I’d respin some of it.

Breakdown by Gender:

Superbowl XLVII Commercial Social Mentions by Gender

You can see that the male / female split is fairly even overall, with the exception of the NFL Network’s ad and to a lesser extent the ad for Fast & Furious 6 which were more heavily mentioned proportionally by males. The Budweiser, Calvin Klein and Taco Bell spots had greater percentages of women commenting.

Sentiment

The Taco Bell, Dodge and Budweiser ads had the most mentions with positive sentiment. The NFL ad had a very large proportion of neutral comments (74%), more so than any other ad. The Go Daddy ad had the most negative mentions, for good reason – it’s gross and just kind of weird. It wouldn’t be the Super Bowl if Go Daddy didn’t air a commercial of questionable taste though, right?
Superbowl XLVII Commercial Sentiment Breakdown by Gender

Superbowl XLVII Commercial Sentiment Breakdown by Gender (Proportional)

Lastly, I am going to go against the grain here and say that the next big thing in football is most definitely going to be Leon Sandcastle.

Finer Points Regarding Data Visualization Choices

The human mind is limited.

We can only process so much information at one time. Numerals are text which communicate quantity. However, unlike other text, it’s a lot harder to read a whole bunch of numbers and get a high-level understanding of what is being communicated. There are sentences of numbers and quantities (these are called equations, but not everyone is as literate in them) however simply looking at a pile of data and having an understanding of the ‘big picture’ is not something most people can do. This is especially true as the amount of information becomes larger than a table with a few categories and values.

If you’re a market research, business, data, financial, or (insert other prefix here) analyst, part of your job is taking a lot of information and making sense of that information, so that other people don’t have to. Let’s face it – your Senior Manager or The VP doesn’t have time to wade through all the data – that’s why they hired you.

Ever since Descartes’ epiphany (and even before that) people have been realizing that there are other, more effective ways to communicate information than having to look at all the details. You can communicate the shape of the data without knowing exactly how many Twitter followers were gained each day. You can see what the data look like without having to know the exact dollar value for sales each and every day. You can feel what the data are like, and get an intuitive understanding of what’s going on, without having to look at all the raw information.

Enter data visualization.

Like any practice, data visualization – depicting quantitative relationships visually – can be done poorly or can be done well. I’m sure you’ve seen examples of the former, whether it be in a presentation or other report, or perhaps floating around the Internet. And the latter, like so many good things, is not always so plentiful, nor appreciated. Here I present some finer points between data visualization choices, in the hope that you will always find yourself depicting data well.

Pie (and Doughnut) Chart

Ah, the pie chart. The go-to the world over when most people seek to communicate data, and one both loved and loathed by many.
The pie chart should be used to compare quantities of different categories where the proportion of the whole is important, not the absolute values (though these can be added with labelling as well). It’s important that the number of categories being compared remain small – depending on the values, the readability of the chart decreases greatly as the number of categories increases. You can see this below. The second example is a case where an alternate representation should be considered, as the chart’s readability and usefulness is lower given the larger number of proportions being compared:

Doughnut charts are the same as pie charts but with a hole in the center. They may be used in the place of multiple pie charts by nesting the rings:

Hmm.

Though again, as the number of quantities being compared increases, the readability and visual utility generally decrease, and you are better served by a bar chart in these cases. There is also the issue that the area of each annulus will be different for the same angle, depending upon which ring it is in.

With circular charts it is best to avoid legends, as they cause the eye to flit back and forth between the segments and the legend; however, when abiding by this practice for doughnut charts, labeling becomes a problem, as you can see above.

Tufte contends that a bar chart will always serve better than a pie chart (though some others disagree). The issue is that there is some debate about the way the human mind processes comparisons with angular representations versus those depicted linearly or by area. I tend to agree and find the chart below much better data visualization than the one we saw previously:

Isn’t that much better?

From a practical perspective, a pie chart is useful because of its simplicity and familiarity, and is a way to communicate proportion of quantities when the number of categories being compared is small.

Bonus question:
Q. When is it a good idea to use a 3-D pie chart?
A. Never. Only as an example of bad data visualization!
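To make the pie-versus-bar point concrete, here is a minimal matplotlib sketch (the category names and counts are invented) that draws the bar-chart alternative with categories sorted so the ranking reads at a glance:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Invented category counts that would crowd a pie chart
shares = {"Alpha": 23, "Bravo": 19, "Charlie": 14, "Delta": 12,
          "Echo": 10, "Foxtrot": 9, "Golf": 7, "Hotel": 6}

def sorted_bar(data, out_path="categories.png"):
    """Draw a horizontal bar chart sorted by value – bar lengths are
    judged far more accurately than pie-slice angles."""
    items = sorted(data.items(), key=lambda kv: kv[1])  # smallest at bottom
    labels, values = zip(*items)
    fig, ax = plt.subplots()
    ax.barh(labels, values)
    ax.set_xlabel("Count")
    fig.tight_layout()
    fig.savefig(out_path)
    return out_path

sorted_bar(shares)
```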

Bar Charts

Bar charts are used to depict the values of a quantity or quantities across categories. For example, to depict sales by department, or per product type.
This type of chart can be (and is) used to depict values over time; however, the time periods should be discrete (e.g. quarters, years) and few in number. When a comparison is to be done over time and the number of periods / data points is larger, it is better visualized using a line chart.

As the number of categories becomes large, an alternative to the usual arrangement (‘column’ chart) is to arrange the categories vertically and bars horizontally. Note this is best done only for categorical / nominal data as data with an implied order (ordinal, interval, or ratio type data) should be displayed left-to-right in increasing order to be consistent with reading left to right.
Bar charts may also be stacked in order to depict both the values between categories as well as the total across them. If the absolute values are not important, then stacked bar charts may be used in this way in the place of several pie charts, with all bars having a maximum height of 100%:

Stephen Few contends that this still makes it difficult to compare proportions, similar to the problem with pie charts, and has other suggestions [PDF], though I think it is fine on some occasions, depending on the nature of the data being depicted.

When creating bar charts it is important to always start the y-axis from zero so as not to produce a misleading graph.

A column chart may also be combined with a line graph of the cumulative total across categories in a type of combo chart known as a Pareto chart.

Scatterplot (and Bubble Graphs)

Scatterplots are used to depict a relationship between two quantitative variables. The value pairs for the variables are plotted against each other, as below:

When used to depict relationships occurring over time, we instead use a special type of scatterplot known as a line graph (next section).

A bubble chart is a type of scatterplot used to compare relationships between three variables, where the points are sized by area according to a third value. Care should be taken to ensure that the points are sized correctly in this type of chart, so as not to incorrectly depict the relative proportion of quantities.
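One concrete pitfall worth sketching: matplotlib's scatter() takes its s argument as a marker area (in points²), so sizing by area means passing values scaled linearly – scaling the radius by the value instead inflates large bubbles quadratically. A small helper (the scale factor is arbitrary):

```python
import numpy as np

def bubble_sizes(values, max_area=600.0):
    """Marker areas for a bubble chart, scaled linearly so the largest
    value gets max_area points^2.  Passing these to scatter(s=...) sizes
    bubbles by *area*; scaling the radius instead (s ~ value**2) would
    exaggerate large values."""
    v = np.asarray(values, dtype=float)
    return max_area * v / v.max()

# plt.scatter(x, y, s=bubble_sizes(z))  # x, y, z: hypothetical data arrays
```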

Relationships between four variables may also be visualized by colouring each point according to the value of a fourth variable, though this may be a lot of information to depict all at once, depending upon the nature of the data. When animated to include a fifth variable (usually time) it is known as a motion chart, perhaps most famously demonstrated in Hans Rosling’s landmark TED Talk.

Line Graphs

Line graphs are usually used to depict quantities changing over time. They may also be used to depict relationships between two (numeric) quantities when there is continuity in both.

For example, it makes sense to compare sales over time with a line graph, as time is a numerical quantity that varies continuously:

However it would not make sense to use a line graph to compare sales across departments as that is categorical / nominal. Note that there is one exception to this rule and that is the aforementioned Pareto chart.

Omitting the points on a line graph and using a smooth curve instead of line segments creates an impression of more data being plotted, and hence a greater continuity. Compare the plot above with the one below:

So, practically speaking, save the smooth line graphs for when you have a lot of data and the points would just be visual clutter; otherwise it’s best to plot the points as well, to be clear about what quantities are being communicated.

Also note that unlike a bar chart, it is acceptable to have a non-zero starting point for the y-axis of a line graph as the change in values is being depicted, not their absolute values.

Now Go Be Great!

This is just a sample of some of the finer differences between the choices for visualizing data. There are of course many more ways to depict data, and I would argue that the possibilities for data visualization are limited only by the imagination of the visualizer. However, when sticking with the tried, true and familiar, keep these points in mind to be great at what you do and get your point across quantitatively and visually.
Go, visualize the data, and be amazing!

What’s in My Pocket? Read it now! (or Read It Later)

Introduction

You know what’s awesome? Pocket.

I mean, sure, it’s not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It’s pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I used to primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. Being a big fan of reading (and also procrastination) this was a really great application for me to discover, and I’m quite glad I did. Now I can still catch up on the latest Lifehacker even if I am on the subway and don’t have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get a hold of your data. The website has an export function which allows you to dump all your data for everything you’ve ever added to your reading list into HTML.

Having the URL of every article you’ve ever read in Pocket is handy, as you can revisit all the articles you’ve saved. But there’s more to it than that. The HTML export also contains the time each article was added (as a UNIX epoch timestamp). Combine this with an XML or JSON dump from the API, and now we’ve got some data to work with.
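Parsing the export is straightforward with the standard library; in exports from this era each saved link carried its timestamp as a time_added attribute on the anchor tag (treat that attribute name as an assumption if the format has since changed):

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

class PocketExportParser(HTMLParser):
    """Pull (url, time_added) pairs out of a Pocket HTML export.
    Assumes each saved article is an <a href=... time_added=...> tag,
    with time_added as a UNIX epoch timestamp."""
    def __init__(self):
        super().__init__()
        self.articles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            if "href" in a and "time_added" in a:
                added = datetime.fromtimestamp(int(a["time_added"]),
                                               tz=timezone.utc)
                self.articles.append((a["href"], added))

parser = PocketExportParser()
parser.feed('<a href="http://example.com/article" time_added="1348061095">An article</a>')
```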

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 – 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionality, I wrote a simple Python script using webarticle2text, which is available on GitHub. This script downloaded all the text from each article URL and appended it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:

And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:

Head and shoulders above all other websites at nearly half of all articles saved is Psychology Today. I would just like to be on the record as saying – don’t hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women; however, I find the majority of articles to be quite interesting (as evidenced by the number I have read). Perhaps other men are not that interested in the goings-on in their own and other people’s heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (it detects if the browser is a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men’s magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two timestamps: time added and time ‘updated’. Looking at the data, I believe ‘updated’ applies to multiple actions on an article, like marking as read, adding tags, re-downloading, et cetera. It would be ideal to actually have the date/time when the article was marked as read, as then further interesting analysis could be done – for example, looking at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.

The most salient features which immediately stand out are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done, on my commute to and from work on public transit.

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours of the evening. Although I sometimes do read articles (usually through the browser) in these hours, I think this may correspond to things like adding tags, or a delay in syncing between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can’t see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.

We can also look at the daily volume of articles added:

This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where there are a large number. Looking at the distribution of the number of articles added daily, we see an exponential-type distribution:
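The daily-volume series behind plots like these can be built straight from the epoch timestamps in the export – a minimal sketch:

```python
from collections import Counter
from datetime import datetime, timezone

def daily_counts(epoch_timestamps):
    """Articles added per calendar day (UTC), from UNIX epoch timestamps
    such as the time-added values in the Pocket export."""
    days = (datetime.fromtimestamp(t, tz=timezone.utc).date()
            for t in epoch_timestamps)
    return Counter(days)
```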

Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I’ve been reading:

Hmmmmm.

Well, that doesn’t look quite right. Did I really read an article 40,000 words long? That’s about 64 pages, isn’t it? Looking at the URLs for the articles with tens of thousands of words, I could see that those articles were either malfunctions of the Pocket article parser, the webarticle2text script, or both. For example, the 40,000-word “article” was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensical:

This is a little more like what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes at the low end also correspond to failures of the article parsers.

Now what about the text content of the articles? I really do enjoy a good wordcloud, however, I know that some people tend to look down upon them, because there are alternate ways of depicting the same data which are more informative. Still, as I said, I do enjoy them, as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:

As you can see, there are issues with the download script, as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for “don’t” and “are”, or possibly “you’re”). But it appears that my recreational reading corresponds to the most common subjects of its main sources. The majority of my reading was from Psychology Today, and so the number one word we see is “people”. I also read a lot of articles from men’s magazines, and so we see words which I suspect primarily come from there (“women”, “social”, “sex”, “job”), as well as from the psych articles.
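A top-N word tally like this takes only a few lines once the text is in hand; the stop-word list below is a deliberately tiny stand-in (a real pass would use a fuller list, e.g. NLTK's):

```python
import re
from collections import Counter

# Tiny illustrative stop-word list – a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "you", "for", "on", "with", "as", "are", "was"}

def top_words(text, n=25):
    """The n most common non-stop-words in a blob of article text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS).most_common(n)
```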

And now the pretty visualization:

Seeing the content of what I read depicted this way has made me have some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I’m glad data is in there as a ‘big word’ (just above ‘person’), though maybe not as big as some of the others. I’ve just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I’ll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: “The more that you read, the more things you will know. The more that you learn, the more places you’ll go!”

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily of exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What’s In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text