I’m Dreaming of a White Christmas

I’m heading home for the holidays soon.

It’s been unseasonably warm this winter, at least here in Ontario, so much so that squirrels in Ottawa are getting fat. I wanted to put together a really cool post predicting the chance of a white Christmas using lots of historical climate data, but it turns out Environment Canada has already put together something like that by crunching some numbers. We can just slam this into Google Fusion Tables and get some nice visualizations of simple data.


It seems that places above a certain latitude have had a much higher chance of a white Christmas in recent times than those closer to the American border and on the coast, which I’m going to guess is due to how cold it gets in those areas on average during the winter. Sadly, Toronto has less than a coin-flip’s chance of a white Christmas in recent times, with only a 40% chance of snow on the ground come the holiday.


But just because there’s snow on the ground doesn’t necessarily mean that your yuletide weather is worthy of a Christmas storybook or holiday movie. Environment Canada also has a definition for what they call a “Perfect Christmas”: 2 cm or more of snow on the ground and snowfall at some point during the day. Which Canadian cities had the most of these beautiful Christmases in the past?

Interestingly Ontario, Quebec and Atlantic Canada are better represented here, which I imagine has something to do with how much precipitation they get due to proximity to bodies of water, but hey, I’m not a meteorologist.
A white Christmas would be great this year, but I’m not holding my breath. Either way it will be good to sit by the fire with an eggnog and not think about data for a while. Happy Holidays!

Visual Analytics of Every PS4 and Xbox One Game by Install Size


I have a startling confession to make: sometimes I just want things to be simple. Sometimes I just want things to be easier. All this talk about “Big Data”, predictive modeling, machine learning, and all those associated bits and pieces that go along with data science can be a bit mentally exhausting. I want to take a step back and work with a smaller dataset, something simple, something easy, something everyone can relate to – after all, that’s what this blog started out being about.

A while back, someone posted on Slashdot that the folks over at Finder.com had put together data sets of the install size of every PS4 and Xbox One game released to date. Being a console owner myself – I’m a PS4 guy, but no fanboy or hardcore gamer by any means – I thought this would be a fun and rather unique data set to play around with, one that would fit well within the category of ‘everyday analytics’. So let’s take a look, shall we?


Very little background required here – the dataset comprises the title, release date, release type (major or indie), console (PS4 or Xbox One), and size in GiB of all games released as of September 10th, 2015. For this post we will ignore the time-related dimension and look only at the quantity of interest: install size.


Okay, if I gave this data to your average Excel jockey what’s the first thing they’d do? A high level summary of the data broken apart by categorical variables and summarized by quantitative? You got it!
We can see that far more PS4 games have been released than Xbox One games (462 vs. 336), and that the split between indie and major releases is reversed from one platform to the other.

A small aside here on data visualization: it’s worth noting that the above is a good way to go for making a bar chart from a functional perspective. Since there are data labels and the y-axis metric is in the title, we can ditch the axis and maximize the data-ink ratio (well, data-pixel anyhow). I’ve also avoided using a stacked bar chart as interpretation of absolute values tends to suffer when not read from the same baseline. I’m okay with doing it for relative proportions though – as in the below, which further illustrates the difference in release type proportion between the two consoles:

Finally, how does the install size differ between the consoles and game types? If I’m an average analyst and just trying to get a grip on the data, I’d take an average to start:
We can see (unsurprisingly, if you know anything about console games) that major releases tend to be much larger in size than indie. Also in both cases, Xbox install sizes are larger on average: about 1.7x for indie titles and 1.25x for major.
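For anyone who wants to reproduce that kind of summary outside a spreadsheet, here is a minimal sketch of the averaging step. The rows and the `mean_size_by_group` helper are made up for illustration and are not values from the Finder.com data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (console, release type, install size in GiB).
games = [
    ("PS4", "indie", 1.0),
    ("PS4", "indie", 3.0),
    ("PS4", "major", 20.0),
    ("PS4", "major", 40.0),
    ("Xbox One", "indie", 2.0),
    ("Xbox One", "indie", 5.0),
    ("Xbox One", "major", 30.0),
    ("Xbox One", "major", 45.0),
]

def mean_size_by_group(rows):
    """Return {(console, release_type): mean install size in GiB}."""
    groups = defaultdict(list)
    for console, release_type, size in rows:
        groups[(console, release_type)].append(size)
    return {key: mean(sizes) for key, sizes in groups.items()}

averages = mean_size_by_group(games)
for (console, release_type), avg in sorted(averages.items()):
    print(f"{console:9s} {release_type:6s} {avg:5.1f} GiB")
```

This is the same thing a pivot table does: one bucket per console and release type, one average per bucket.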

Okay, that’s interesting. But if you’re like me, you’ll be thinking about how 99% of the phenomena in the universe are distributed by a power law or have some kind of non-Gaussian based distribution, and so averages are actually not always such a great way to summarize data. Is this the case for our install size data set?

Yes, it is. We can see here in this combination histogram / cumulative distribution (in the form of a Pareto chart) that the games follow a power law, with approximately 55 and 65 percent being < 5 GiB for PS4 and Xbox games respectively.

But is this entirely due to the indie games having small sizes? Might the major releases be centered around some average or median size?

No, we can see that even when broken apart by type of release the power-law-like distribution for install sizes persists. I compared the averages to the medians and found them to still be decent representations of central tendency and not too affected by outliers.
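The two checks in this section, the share of games under a size threshold and the mean-versus-median comparison, are easy to sketch in code. The sizes below are invented to be right-skewed and are not the real install sizes; `share_below` is a hypothetical helper name.

```python
from statistics import mean, median

def share_below(sizes, threshold):
    """Fraction of install sizes strictly below the threshold (in GiB)."""
    return sum(1 for s in sizes if s < threshold) / len(sizes)

# Made-up, right-skewed install sizes in GiB (not the real data).
sizes = [0.5, 1.2, 2.0, 3.1, 4.4, 4.9, 7.5, 12.0, 25.0, 48.0]

under_5 = share_below(sizes, 5)
avg = mean(sizes)
mid = median(sizes)

print(f"under 5 GiB: {under_5:.0%}")
print(f"mean: {avg:.2f} GiB, median: {mid:.2f} GiB")
```

In a toy sample this skewed, the mean sits well above the median. The closer the two are in your own data, the more comfortable you can be reporting a plain average.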

Finally we can look at the distribution of the install sizes by using another type of visualization suited for this task, the boxplot. While it is at least possible to jury-rig a boxplot in Excel (see this excellent how-to over at Peltier Tech), Google Sheets doesn’t give us as much to work with, but I did my best (the data label is at the maximum point, and the value is the difference between the max and Q3):

The plots show that install sizes are generally greater for Xbox One vs. PS4, and that the difference (and skew) appears to be a bit more pronounced for indie games versus major releases, as we saw in the previous figures.
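If you would rather compute the boxplot ingredients directly, here is a rough sketch of the numbers behind each box, including the max-minus-Q3 value used for the data label. The sizes are invented, `box_stats` is a hypothetical helper, and the `inclusive` quartile method is my own choice; other quartile conventions give slightly different values.

```python
from statistics import quantiles

def box_stats(sizes):
    """Five-number summary plus the max - Q3 segment used as the label."""
    data = sorted(sizes)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": med,
        "q3": q3,
        "max": data[-1],
        "max_minus_q3": data[-1] - q3,
    }

# Made-up install sizes in GiB (not the real data).
sizes = [0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 40.0]

stats = box_stats(sizes)
for name, value in stats.items():
    print(f"{name:13s} {value:5.1f}")
```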

Okay, that’s all very interesting, but what about games that are available for both consoles? Are the install sizes generally the same or do they differ?

Difference in Install Size by Console Type
Because we’ve seen that the Xbox install sizes are generally larger than PlayStation, here I take the PS4 size to be the baseline for games which appear on both (that is, differences are of the form Xbox size – PS4 size).

Of the 618 unique titles in the whole data set (798 titles if you double count across platform), 179 (~29%) were available on both – so fewer than a third of games are released for both major consoles.
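Finding the overlap is a set operation at heart. A minimal sketch, with placeholder titles standing in for the real lists; only the 179 and 618 in the last lines come from the counts above.

```python
# Placeholder titles; the real lists have hundreds of entries each.
ps4_titles = {"Game A", "Game B", "Game C", "Game D"}
xbox_titles = {"Game C", "Game D", "Game E"}

both = ps4_titles & xbox_titles      # released on both consoles
unique = ps4_titles | xbox_titles    # every distinct title
share_both = len(both) / len(unique)

print(f"{len(both)} of {len(unique)} unique titles on both ({share_both:.0%})")

# The same ratio using the counts reported in the post.
reported_share = 179 / 618
print(f"reported: {reported_share:.1%}")
```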

Let’s take a look at the difference in install sizes – do those games which appear for both reflect what we saw earlier?

Yes, for both categories of game the majority are larger on Xbox than PS4 (none were the same size). Overall about 85% of the games were larger on Microsoft’s console (152/179).

Okay, but how much larger? Are we talking twice as large? Five times larger? Because the size of the games varies widely (especially between the release types) I opted to go for percentages here:

Unsurprisingly, on average indie games tend to have larger differences proportionally, because they’re generally much smaller in size than major releases. We can see they are nearly twice as large on Xbox vs. PS4, while major releases are about one and a quarter times as large. When games are larger on PS4, there’s not as big a disparity, and the pattern across release types is the same (though keep in mind the number of games here is a lot smaller than for the former).
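To be explicit about how the percentages are formed, with the PS4 size as the baseline: the difference is (Xbox size – PS4 size) divided by the PS4 size. A small sketch with invented sizes; `pct_difference` is a hypothetical helper name.

```python
def pct_difference(xbox_gib, ps4_gib):
    """Percentage difference in install size, taking the PS4 size as baseline."""
    return (xbox_gib - ps4_gib) / ps4_gib * 100.0

# Invented sizes in GiB, chosen only to illustrate the arithmetic.
indie = pct_difference(xbox_gib=2.0, ps4_gib=1.0)    # twice as large on Xbox
major = pct_difference(xbox_gib=50.0, ps4_gib=40.0)  # 1.25x as large on Xbox
smaller = pct_difference(xbox_gib=3.0, ps4_gib=4.0)  # larger on PS4

print(f"indie: {indie:+.0f}%  major: {major:+.0f}%  smaller on Xbox: {smaller:+.0f}%")
```

A positive value means the game is larger on Xbox, a negative one means it is larger on PS4. Note that "twice as large" corresponds to a +100% difference, not +200%.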

Finally, just to ground this a bit more I thought I’d look at the top 10 games in each release type where the absolute differences are the largest. As I said before, here the difference is Xbox size minus PS4:

For major releases, the worst offender for being larger on PS4 is Batman: Arkham Knight (~6.1 GiB difference) while on the opposite end, The Elder Scrolls Online has a ~19 GiB difference. Wow.

For indies, we can see the absolute difference is a lot smaller for those games bigger on PS4, with Octodad having the largest difference of ~1.4 GiB (56% of its PS4 size). Warframe is 19.6 GiB bigger on Xbox than PS4, or 503% larger (!!).

Finally, I’ve visualized all the data together for you so you can explore it yourself. Below is a bubble chart of the Xbox install size plotted against PS4, coloured by release type, where the size of each point represents the absolute value of the percentage difference between the platforms (with the PS4 size taken to be the baseline). So points above the diagonal are larger for Xbox than PS4, and points below the diagonal are larger for PS4 than Xbox. Also note that the scale is log-log. You can see that most of the major releases are pretty close to each other in size, as they nearly lie on the y=x line.


It’s been nice to get back into the swing of things and do a little simple data visualization, as well as play with a data set that falls into the ‘everyday analytics’ category.
And, as a result, we’ve learned:

  • Xbox games generally tend to have larger install sizes than PS4 ones, even for the same title
  • Game install sizes follow a power law, just like most everything else in the universe (or maybe just 80% of it)
  • What the heck a GiB is
Until next time then, don’t fail to keep looking for the simple beauty in data all around you.

References & Resources

Complete List of Xbox One Install Sizes
Complete List of PlayStation 4 Install Sizes
Compiled data set (Google Sheets)
Excel Box and Whisker Diagrams (Box Plots) @ Peltier Tech

Toronto Cats and Dogs II – Top 25 Names of 2014

I was quite surprised by the relative popularity of my previous analysis of the data for Licensed Cats & Dogs in Toronto for 2011, given how simple it was to put together.

I was browsing the Open Data Portal recently and noticed that there was a new data set for pets: the top 25 names for both dogs and cats. I thought this could lend itself to some quick, easy visualization and be a neat little addition to the previous post.

First we simply visualize the raw counts of the Top 25 names against each other. Interestingly, the top 2 names for both dogs and cats are apparently the same: Charlie and Max.

Next let’s take a look at the distribution of these top 25 names for each type of pet by how long they are, which just involves calculating the name length and then pooling the counts:

You can see that, proportionally, the top dog names are a bit shorter (the distribution is positively / right-skewed) compared to the cat names (slightly negatively / left-skewed). Also note that both are centered around names of length 5, and that there is one cat name of length 8 (Princess).
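The pooling step is just "take the length of each name, then add up the counts that share a length". A quick sketch; Charlie, Max and Princess appear in the data, but every other name and all of the counts here are invented for illustration.

```python
from collections import Counter

def counts_by_name_length(name_counts):
    """Pool licence counts by the number of characters in the name."""
    pooled = Counter()
    for name, count in name_counts.items():
        pooled[len(name)] += count
    return dict(sorted(pooled.items()))

# Illustrative counts only (not the Toronto Open Data figures).
cat_names = {"Charlie": 30, "Max": 25, "Princess": 10, "Tiger": 15, "Bella": 20}

by_length = counts_by_name_length(cat_names)
for length, count in by_length.items():
    print(f"length {length}: {count}")
```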

Looking at the dog names, do you notice something interesting about them? A particular feature present in nearly all? I did. Nearly every one of the top 25 dog names ends in a vowel. We can see this by visualizing the proportion of the counts for each type of pet by whether the name ends in a vowel or consonant:

Which, to me, seems to indicate that more dogs than cats tend to have “cutesy” names, usually ending in ‘y’.
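The vowel-or-consonant split can be sketched the same way. Treating ‘y’ as a vowel is an assumption on my part (it matches the “cutesy” names above), and apart from Charlie and Max the names and all of the counts are invented.

```python
def share_ending_in_vowel(name_counts, vowels="aeiouy"):
    """Share of the total count belonging to names whose last letter is a vowel.

    Counting 'y' as a vowel is an assumption made for this sketch.
    """
    total = sum(name_counts.values())
    vowel_total = sum(
        count for name, count in name_counts.items()
        if name[-1].lower() in vowels
    )
    return vowel_total / total

# Illustrative counts only (not the Toronto Open Data figures).
dog_names = {"Charlie": 40, "Max": 35, "Buddy": 25}

vowel_share = share_ending_in_vowel(dog_names)
print(f"ending in a vowel: {vowel_share:.0%}")
```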

Fun stuff, but one thing really bothers me… no “Fido” or “Boots”? I guess some once-popular names have gone to the dogs.

References & Resources

Licensed Dog and Cat Names (Toronto Open Data)

Perception in Data Visualization – A Quick 7 Question Test

When most people think of data, they probably think of a dry, technical analysis, without a lot of creativity or freedom. Quite to the contrary, data visualization encompasses choices of design, creative freedom, and also (perhaps most interestingly) elements of cognitive psychology, particularly related to the science of visual perception and information processing.

If you read any good text on dataviz, like Tufte, Few, or Cairo, you will, at some point, come across a discussion of the cognitive aspects of data visualization (the latter two devoting entire chapters to this topic). This will likely include a discussion of the most elemental ways to encode information visually, and their respective accuracies when quantity is interpreted from them, usually referencing the work of Cleveland & McGill [PDF].

Mulling over the veracity of my brief mention of the visual ways of encoding quantity in my recent talk, and also recently re-reading Nathan Yau’s discussion of the aforementioned paper, I got to thinking about just how different the accuracy of interpretation between the different encodings might be.

I am not a psychologist or qualitative researcher, but given the above I quickly put together a simple test of 7 questions in Google Docs to examine the accuracy of interpreting proportional quantities when encoded visually, and I humbly request the favour of your participation. If there are enough responses I will put together what analysis is possible in a future post (using the appropriate visualization techniques, of course).

Apologies in advance for the grade-school wording of the questions, but I wanted to be as clear as possible to ensure consistency in the results. Thanks so much in advance for contributing! Click below for the quiz:

EDIT: The quiz will now be up indefinitely on this page.

Analysis of the TTC Open Data – Ridership & Revenue 2009-2012


I would say that the relationship between the citizens of Toronto and public transit is a complicated one. Some people love it. Other people hate it and can’t stop complaining about how bad it is. The TTC want to raise fare prices. Or they don’t. It’s complicated.
I personally can’t say anything negative about the TTC. Running a business is difficult, and managing a complicated beast like Toronto’s public system (and trying to keep it profitable while keeping customers happy) cannot be easy. So I feel for them. 
I rely extensively on public transit – in fact, I used to ride it every day to get to work. All things considered, for what you’re paying, this way of getting around the city is a hell of a good deal (if you ask me) compared to the insanity that is driving in Toronto.
The TTC’s ridership and revenue figures are available as part of the (awesome) Toronto Open Data initiative for accountability and transparency. As I noted previously, I think the business of keeping track of things like how many people ride public transit every day must be a difficult one, so you have to appreciate having this data, even if it is likely more of an approximation and is in a highly summarized format.
There are larger sources of open data related to the TTC which would probably be a lot cooler to work with (as my acquaintance Mr. Branigan has done) but things have been busy at work lately, so we’ll stick to this little analysis exercise.


The data set comprises numbers for: average weekday ridership (in 000’s), annual ridership (peak and off-peak), monthly & budgeted monthly ridership (in 000’s), and monthly revenue, actual and budgeted (in millions $). More info here [XLS doc].


First we consider the simplest data and that is the peak and off-peak ridership. Looking at this simple line-graph you can see that the off-peak ridership has increased more than peak ridership since 2009 – peak and off-peak ridership increasing by 4.59% and 12.78% respectively. Total ridership over the period has increased by 9.08%.

Below we plot the average weekday ridership by month. As you can see, this reflects the increasing demand on the TTC system we saw summarized yearly above. Unfortunately Google Docs doesn’t have trendlines built-in like Excel (hint hint, Google), but unsurprisingly if you add a regression line the trend is highly significant ( > 99.9%) and the slope gives an increase of approximately 415 weekday passengers per month on average.
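Since the spreadsheet won’t draw the trendline for us, here is a minimal sketch of fitting one by ordinary least squares against the month index. The ridership values are invented and perfectly linear so the result is easy to check, `fit_line` is a hypothetical helper, and the sketch only recovers the slope and intercept; it does not compute the significance level.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). The slope is the change per month.
    """
    n = len(ys)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Invented average weekday ridership by month, in thousands.
ridership = [1500.0, 1500.4, 1500.8, 1501.2, 1501.6, 1502.0]

slope, intercept = fit_line(ridership)
print(f"about {slope * 1000:.0f} more weekday riders per month")
```

Because the data are in thousands, the slope has to be multiplied by 1,000 to read it as riders per month.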

Next we come to the ridership by month. If you look at the plot over the period of time, you can see that there is a distinct periodic behavior:

Taking the monthly averages we can better see the periodicity – there are peaks in March, June & September, and a mini-peak in the last month of the year:

This is also present in both the revenue (as one would expect) and the monthly budget (which means that the TTC is aware of it). As to why this is the case, I can’t immediately discern, though I am curious to know the answer. This is where it would be great to have some finer grained data (daily or hourly) or data related to geographic area or per station to look for interesting outliers and patterns.
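The monthly averaging used to bring out the periodicity amounts to grouping the values by calendar month across the years and taking the mean of each group. A small sketch with invented numbers; `average_by_month` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

def average_by_month(records):
    """Average values across years for each calendar month.

    `records` maps (year, month) to a value, e.g. monthly ridership.
    """
    by_month = defaultdict(list)
    for (year, month), value in records.items():
        by_month[month].append(value)
    return {month: mean(values) for month, values in sorted(by_month.items())}

# Invented monthly ridership in millions (not the TTC figures).
records = {
    (2009, 1): 38.0, (2010, 1): 40.0,
    (2009, 3): 44.0, (2010, 3): 46.0,
    (2009, 8): 36.0, (2010, 8): 37.0,
}

monthly = average_by_month(records)
for month, value in monthly.items():
    print(f"month {month:2d}: {value:.1f}")
```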

Alternatively if we look at the monthly averages over the years of average weekday ridership (an average of averages, I am aware – but the best we can do given the data we have), you can see that there is a different periodic behavior, with a distinct downturn over the summer, reaching a low in August which then recovers in September to the maximum. This is interesting and I’m not exactly sure what to make of it, so I will do what I normally do which is attribute it to students.

Lastly, we come to the matter of the financials. As I said the monthly revenue and budget for the TTC follow the same periodic pattern as the ridership, and on the plus side, with increased ridership, there is increased revenue. Taking the arithmetic difference of the budgeted (targeted) revenue from actual, you can see that over time there is a decrease in this quantity:
Again if you do a linear regression this is highly significant ( > 99.9%). Does this mean that the TTC is becoming less profitable over time? Maybe. Or perhaps they are just getting better at setting their targets? I acknowledge that I’m not an economist, and what’s been done here is likely a gross oversimplification of the financials of something as massive as the TTC.

That being said, the city itself acknowledges [warning – large PDF] that while the total cost per hour for an in-service transit vehicle has decreased, the operating cost has increased, which they attribute to increases in wages and fuel prices. Operating public transit is also more expensive here in TO than other cities in the province, apparently, because we have things like streetcars and the subway, whereas most other cities only have buses. Either way, as I said before, it’s complicated.


I always enjoy working with open data and I definitely appreciate the city’s initiative to be more transparent and accountable by providing the data for public use.
This was an interesting little analysis and visualization exercise and some of the key points to take away are that, over the period in question:
  • Off-peak usage of the TTC is increasing at a greater rate than peak usage
  • Usage as a whole is increasing, with about 415 more weekday riders per month on average, and a growth of ~9% from 2009 – 2012
  • Periodic behavior in the actual ridership per month over the course of the year
  • Different periodicity in average weekday ridership per month, with a peak in September
It would be really interesting to investigate the patterns in the data in finer detail, which hopefully should be possible in the future if more granular time-series, geographic, and categorical data become available. I may also consider digging into some of the larger data sets, which have been used by others to produce beautiful visualizations such as this one.

I, for one, continue to appreciate the convenience of public transit here in Toronto and wish the folks running it the best of luck with their future initiatives.

References & Resources

TTC Ridership – Ridership Numbers and Revenues Summary (at Toronto Open Data Portal)

Toronto Progress Portal – 2011 Performance Measurement and Benchmarking Report

Let’s Go To The Ex!

I went to The Ex (that’s the Canadian National Exhibition for those of you not ‘in the know’) on Saturday. I enjoy stepping out of the ordinary from time to time and carnivals / fairs / midways / exhibitions etc. are always a great way to do that.

As far as exhibitions go, I believe the CNE is one of the more venerable – it’s been around since 1879 and attracts over 1.3 million visitors every year.

Looking at the website before I went, I saw that they had a nice summary of all the ride height requirements and number of tickets required. I thought perhaps the data could stand to be presented in a more visual form.

First, how about the number of tickets required for the different midways? All of the rides on the ‘Kiddie’ Midway require four tickets, except for one (The Wacky Worm Coaster). The Adult Midway rides are split about 50/50 for five or six tickets, except for one (Sky Ride) which only requires four.

With tickets being $1.50 each, or $1 if you buy them in sets of 22 or 55, that makes the ride price range $6-9 or $4-6. Assuming you buy the $1 tickets, the average price of an adult ride is $5.42 and the average price of a child ride $4.04.
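The price arithmetic is simple enough to sketch. The ticket prices and the 4-to-6 ticket range come from above; the list of per-ride ticket counts is invented, so the average it prints is not the $5.42 or $4.04 figure.

```python
from statistics import mean

SINGLE_PRICE = 1.50  # dollars per ticket bought individually
BULK_PRICE = 1.00    # dollars per ticket in packs of 22 or 55

def ride_cost(tickets, price_per_ticket):
    """Dollar cost of one ride needing `tickets` tickets."""
    return tickets * price_per_ticket

def average_ride_price(ticket_counts, price_per_ticket):
    """Average dollar cost over a list of per-ride ticket requirements."""
    return mean(ticket_counts) * price_per_ticket

for tickets in (4, 6):
    single = ride_cost(tickets, SINGLE_PRICE)
    bulk = ride_cost(tickets, BULK_PRICE)
    print(f"{tickets} tickets: ${single:.2f} single, ${bulk:.2f} bulk")

# Invented ticket requirements for five rides.
example_rides = [5, 6, 6, 5, 4]
print(f"average: ${average_ride_price(example_rides, BULK_PRICE):.2f}")
```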

The rides also have height requirements. Note that I’ve simplified things by taking the max height for cases where shorter/younger kids can ride supervised with an adult. Here’s a breakdown of the percentage of the rides in each midway type children can ride, given their height:

Google Docs does not allow non-stacked stepped area charts, so line graph it is.

And here’s the same breakdown with percentage of the total rides (both midways combined), coloured by type. This is a better way to represent the information, as it shows the discrete nature of the height requirement:

Basically if your child is over 4′ they are good for about 80% of all the rides at the CNE.
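The step-like shape of that chart comes from a simple count: for a given child height, what share of rides have a minimum height at or below it? A rough sketch, with invented height requirements in inches; `share_rideable` is a hypothetical helper name.

```python
def share_rideable(child_height_in, ride_min_heights_in):
    """Fraction of rides whose minimum height the child meets (inches)."""
    allowed = sum(1 for h in ride_min_heights_in if child_height_in >= h)
    return allowed / len(ride_min_heights_in)

# Invented minimum heights for ten rides (not the CNE's actual list).
min_heights = [36, 36, 42, 42, 48, 48, 48, 48, 52, 54]

for height in (36, 42, 48, 54):
    print(f'{height}": {share_rideable(height, min_heights):.0%} of rides')
```

The share only changes when the child's height crosses one of the posted minimums, which is why a stepped chart suits this data better than a smooth line.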

Something else to consider – how to get the maximum value for your tickets with none left over, given that they are sold in packs of 22 and 55? I would say go with the $36 all-you-can-ride option. Also, how minuscule are your actual odds of winning those carnival games? Because I want a giant purple plush gorilla.

See you next year!