Are the dice in Mario Party fair?

Over the holidays I was playing a lot of games with friends and family as one does, and one of those games was Super Mario Party for Nintendo Switch.

Now what's interesting about this game is that, in addition to requiring dice rolls like any other board game, it lets you choose between different dice. Depending on your character (or on the 'allies' you acquire when you team up with other playable characters, which gives you the option to use their dice in addition to a bonus), you can roll a character-specific die which is unique and has different values than a standard one.

Super Mario Party, with Mario holding his custom dice

So, being the guy that I am, this got me wondering: are all the different dice for the different characters 'fair'? If your goal is to traverse the maximum number of spaces (as it often is), are any of the dice better to use on average than the others?
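
One simple way to frame "better on average" is to compare the expected value of each die, that is, the mean of its faces when every face is equally likely. The sketch below shows the idea in Python. The face values are made-up placeholders for illustration, not the actual in-game dice, and `expected_roll` is a hypothetical helper name.

```python
from fractions import Fraction

def expected_roll(faces):
    """Mean number of spaces per roll, assuming each face is equally likely."""
    return Fraction(sum(faces), len(faces))

# Hypothetical dice for illustration only; these are NOT the real in-game values.
dice = {
    "standard": [1, 2, 3, 4, 5, 6],
    "steady (hypothetical)": [3, 3, 3, 3, 4, 4],
    "risky (hypothetical)": [0, 0, 0, 7, 7, 7],
}

for name, faces in dice.items():
    print(name, float(expected_roll(faces)))
```

Two dice can share the same average and still behave very differently from roll to roll, so the mean is only a first cut at "fairness".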


When to Use Sequential and Diverging Palettes


I wanted to take some time to talk about an important rule for the use of colour in data visualization.
The more I've worked in visualization, the more I have come to feel that one of the most overlooked and under-discussed facets (especially for novices) is the use of colour. A major pet peeve of mine, and a mistake I see all too often, is the use of a diverging palette instead of a sequential one or vice-versa.
So what is the difference between a sequential and diverging palette, and when is it correct to use each? The answer is one that arises very often in visualization: it all depends on the data, and what you're trying to show.

Sequential vs. Diverging Palettes

First of all, let's define what we are discussing here. 
Sequential Palettes
A sequential palette ranges between two colours (typically having one "main" colour) ranging from white or a lighter shade to a darker one, by varying one or more of the parameters in the HSV/HSL colour space (usually only saturation or value/luminosity, or both). 
For me, at least, varying hue means going between two very distinct colours and is usually not good practice if your data vary linearly, as it is much closer to a diverging palette, which we will discuss next. There are other reasons why this is bad visualization practice, and, of course, exceptions to this rule, which we will discuss later in the post.
A sequential palette (generated in R)
Diverging Palettes
In contrast to a sequential palette, a diverging palette ranges between three or more colours with the different colours being quite distinct (usually having different hues). 
While technically a diverging palette could have as many colours as you'd like (such as the rainbow palette which is the default in some visualization tools, like MATLAB), diverging palettes usually range only between two contrasting colours at either end, with a neutral colour or white in the middle separating the two.
A diverging palette (generated in R)
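
For readers who want to experiment, here is a minimal sketch of both palette types in Python, using only the standard library. The function names, the saturation of 0.8 and the lightness defaults are arbitrary choices for illustration, and this is not how the images in this post were generated (those came from R).

```python
import colorsys

def to_hex(r, g, b):
    """Convert RGB components in [0, 1] to a hex colour string."""
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))

def sequential_palette(hue, n=7, light=0.95, dark=0.25):
    """n colours of a single hue, running from a light tint to a dark shade."""
    steps = [light + (dark - light) * i / (n - 1) for i in range(n)]
    return [to_hex(*colorsys.hls_to_rgb(hue, lightness, 0.8)) for lightness in steps]

def diverging_palette(hue_low, hue_high, n=7):
    """Two hues that each fade to white in the middle (n should be odd)."""
    half = n // 2
    low = sequential_palette(hue_low, half + 1, light=1.0, dark=0.35)    # white -> dark
    high = sequential_palette(hue_high, half + 1, light=1.0, dark=0.35)  # white -> dark
    return low[::-1] + high[1:]  # dark low hue -> white -> dark high hue

print(sequential_palette(1 / 3))      # hue 1/3 is green
print(diverging_palette(0.0, 1 / 3))  # red -> white -> green
```

The sequential version changes only lightness, while the diverging version changes hue as well, with the neutral colour sitting exactly at the centre of the list.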

When to Use Which

So now that we've defined the two different palette types of interest, when is it appropriate and inappropriate to use them?

The rule for the use of diverging palettes is very simple: they should only be used when there is a value of importance around which the data are to be compared.

This central value is typically zero, with negative values corresponding to one hue and positive the other, though this could also be done for any other value, for example, comparing numbers around a measure of central tendency or reference value.

A Simple Example
For example, looking at the Superstore dataset in Tableau, a visualizer might be tempted to make a map such as the one below, with colour encoding the number of sales in each city:

Here points on the map correspond to the cities and are sized by total number of sales and coloured by total sales in dollars. Looks good, right? The cities with the highest sales clearly stick out in the green against the dark red?

Well, yes, but do you see a problem? Look at the generated palette:

The scale ranges from the minimum sales in dollars ($4.21) to max (~$155K), so we cover the whole range of the data. But what about the midpoint? It's just the dead center point between the two, which doesn't correspond to anything meaningful in the data - so why would the hue change from red to green at that point?

This is a case which is better suited to a sequential palette, since all the values are positive and we're not highlighting a meaningful value which the range of data falls around. A better choice would be a sequential palette, as below:

Here, the range is fully covered and there is no midpoint, and the palette ranges from light green to dark. The extreme values still stand out in dark green; however, there is no well-defined center where the hue arbitrarily changes, so this is a better choice.

There are other ways we could improve this visualization's encoding of quantity as colour, for one, by using endpoints that would be more meaningful to business users instead of just the range of the data (say, $0 to $150K+), and another which we will discuss later.

Taking a look at the two palettes together, it's clearer which is a better choice for encoding the always positive value of the metric sales dollars across its range:

Going Further
Okay, so when would we want to use a diverging palette? As per the rule, if there was a meaningful midpoint or other important value you wanted to contrast the data around.

For example, in our Superstore data, sales dollars are always positive, but profit can be positive or negative, so it is appropriate to use a diverging palette in this case, with one hue corresponding to negative values and another to positive, and the neutral colour in the middle occurring at zero:

Here it is very clear which values fall at the extremes of the range, but also which are closer to the meaningful midpoint (zero): that one city in Montana is in the negative, and the others don't seem to be very profitable either; we can tell they are close to zero by how washed out their colours are.

Tableau is smart enough to know to set the midpoint at zero for our diverging palette. Again, you could tinker with the range to make the end-points more meaningful (e.g. round values), as well as varying the range: sometimes a symmetrical range for a diverging palette is easier to interpret from a numerical standpoint, though of course you have to keep in mind how this is going to impact the perceptual salience of the colour values for the corresponding data.

So could we use a diverging palette for the always positive sales data? Sure. There just needs to be a point around which we are comparing the values. For example, I happen to know that the median sales per city over the time period in question is $495.82 - this would be a meaningful value to use for the midpoint of a diverging palette, and we can redo our original sales map as such:

Now we have a better version of our original sales map, where the cities coloured in red are below the median value per city, and those coloured in green are above. Much better!
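
To make the idea of a palette centred on a chosen value concrete, here is a small sketch in Python. It maps a value to a position between -1 and +1, with 0 at the midpoint, scaling each side linearly. The function name `diverging_position` is hypothetical, the numbers are the ones quoted in this post (the maximum is approximate), and this is only an illustration of the idea, not a description of how Tableau computes its colours.

```python
def diverging_position(value, vmin, center, vmax):
    """Map value to [-1, 1]: -1 at vmin, 0 at center, +1 at vmax (piecewise linear)."""
    if not vmin < center < vmax:
        raise ValueError("expected vmin < center < vmax")
    value = min(max(value, vmin), vmax)  # clamp to the data range
    if value < center:
        return (value - center) / (center - vmin)
    return (value - center) / (vmax - center)

# Figures quoted in the post; the maximum is approximate ("~$155K").
VMIN, MEDIAN, VMAX = 4.21, 495.82, 155000.0

for sales in (4.21, 250.0, 495.82, 10000.0, 155000.0):
    print(sales, round(diverging_position(sales, VMIN, MEDIAN, VMAX), 3))
```

Note that with linear scaling a city with $10,000 in sales sits only about 6% of the way from the median to the maximum, which hints at why the distribution of the data matters when judging a palette like this.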

But now something strange seems to be going on with the palette - what's that all about?

No Simple Answers
So what is going on with the palette in the last map from our example above? And what of my promise to discuss other ways the palette scaling can be improved, and of exceptions to the rule of not using differing hues in a continuous scale?

Well, the reason that the map looks good above but the scale looks wrong has to do with how the data are distributed: the distribution of sales by city is not normal, but follows a power law, with most of the data falling in the low end, so our palette looks the same when the colours are scaled linearly with the data:

One way to fix this is to transform the data by taking the log, and seeing that the resulting palette looks more like we'd expect:

Though of course now the range is between transformed values. It's interesting to note that in this case the midpoint comes out being nearly correct automatically (2.907 vs. log10(495.82) ~= 2.695).
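
The two numbers being compared can be reproduced with a few lines of Python, assuming the logarithm is base 10 and using the range quoted earlier in the post (the maximum of roughly $155K is approximate, so the result is too).

```python
import math

vmin, vmax = 4.21, 155000.0  # range quoted in the post; the maximum is approximate
median = 495.82              # median sales per city, as quoted in the post

log_mid = (math.log10(vmin) + math.log10(vmax)) / 2  # centre of the log-scaled range
log_median = math.log10(median)

print(round(log_mid, 3))
print(round(log_median, 3))
print(round(10 ** log_mid, 2))  # the log-scale midpoint expressed back in dollars
```

The midpoint of a log-scaled range is the geometric mean of its endpoints, which is why it lands much closer to the median than the arithmetic midpoint does for skewed data like this.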

Further complicating all this is the fact that human perception of colour is not linear, but follows something like the Weber-Fechner law, depending on the property in question. Robert Simmon wrote about this in an excellent series of posts while he was at NASA, which is definitely worth a read (and multiple re-reads).

There he also notes an exception to my statement that you shouldn't use continuous palettes with different hues, as sometimes even that can be appropriate, as he notes in the section on figure-ground when talking about earth surface temperature.


So there you have it. Once again: use diverging palettes only when there is a meaningful point around which you want to contrast the other values in your data.

Remember, it all depends on the data. What is the ideal palette for a given data set, and how should you choose it? That's not an easy question to answer, one always left up to the visualization practitioner, which only comes with the knowledge of proper visualization technique and the theoretical foundations that form it.

There are no right or wrong answers, only better or worse choices. It's all about the details.

References and Resources

Subtleties of Colour (by Robert Simmon)
Understanding Sequential and Diverging Palettes in Tableau
How to Choose Colours for Maps and Heatmaps

How Often Does Friday the 13th Happen?


So yesterday was Friday the 13th.

I hadn't even thought anything of it until someone mentioned it to me. They also pointed out that there are two Friday the 13ths this year: the one that occurred yesterday, and there will be another one in October.

This got me to thinking: how often does Friday the 13th usually occur? What's the most number of times it can occur in a year?

Sounds like questions for a nice little piece of everyday analytics.


A simple Google search revealed a list of all the Friday the 13ths from August 2010 up until the end of 2050. It was a simple matter to plunk that into Excel and throw together some simple graphs.
So to answer the first question, how often does Friday the 13th usually occur?
It looks like the maximum number of times it can occur per year is 3 (those are the years Jason must have a heyday and things are really bad at Camp Crystal Lake) and the minimum is 1. So my hypothesis is:
a. it's not possible to have a year where a Friday the 13th doesn't occur, and 
b. Friday the 13th can't occur more than 3 times in a year, due to the way the Gregorian calendar works.
Of course, this is not proof, just evidence, as we are only looking at a small slice of data.
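
The two hypotheses can also be checked by brute force rather than from a downloaded list. The sketch below is in Python rather than the Excel workflow used for this post, and `friday_13ths` is a hypothetical helper name. It relies on the fact that the Gregorian calendar repeats exactly every 400 years, so counting over one full cycle covers every possible kind of year.

```python
from collections import Counter
from datetime import date

def friday_13ths(year):
    """Return the months (1-12) of the given year whose 13th falls on a Friday."""
    return [m for m in range(1, 13) if date(year, m, 13).weekday() == 4]  # Monday is 0

# One full 400-year Gregorian cycle
per_year = {y: len(friday_13ths(y)) for y in range(2001, 2401)}

print(min(per_year.values()), max(per_year.values()))
print(Counter(per_year.values()))  # how many years have 1, 2 or 3 of them
```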
So what is the distribution of the number of unlucky days per year?
The largest share of the years in the period have only one (18, or ~44%), but not by much, as nearly the same number have 2 (17, or ~42%). Far fewer have 3 F13ths, only 6 (~15%). Again, this could just be an artifact of the interval of time chosen, but it gives a good idea of what to expect overall.
Are certain months favoured at all, though? Does Jason's favourite day occur more frequently in certain months?
Actually it doesn't really appear so - they look to be spread pretty evenly across the months and we will see why this is the case below.
So, what if we want even more detail? When we ask how frequently Friday the 13th occurs, we might also mean: how long is it between each occurrence? Well, that's something we can plot over the 41-year period just by doing a simple subtraction and plotting the result.
Clearly, there is periodicity and some kind of cycle to the occurrence of Friday the 13th, as we see repeated peaks at what looks like 420 days and also at around 30 days on the low end. This is not surprising, if you think about how the calendar works, leap years, etc. 
If we pivot on the number of days and plot the result, we don't even get a distribution that is spread out evenly or anything like that; there are only 7 distinct intervals between Friday the 13ths during the period examined:
So basically, depending on the year, the shortest time between successive Friday the 13ths will be 28 days, and the greatest will be 427 (about a year and two months), but usually it is somewhere in-between at around either three, six, or eight months. It's also worth noting that every interval is divisible by seven; this should not be surprising at all either, for obvious reasons.
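
The "simple subtraction" is easy to reproduce. The sketch below, in Python rather than Excel, builds the list of dates for 2010 through 2050 with the standard `datetime` module and takes the difference between neighbours. `all_friday_13ths` is a hypothetical helper name.

```python
from datetime import date

def all_friday_13ths(first_year, last_year):
    """Every Friday the 13th from first_year to last_year inclusive, in order."""
    return [date(y, m, 13)
            for y in range(first_year, last_year + 1)
            for m in range(1, 13)
            if date(y, m, 13).weekday() == 4]  # Monday is 0, so Friday is 4

dates = all_friday_13ths(2010, 2050)
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

print(sorted(set(gaps)))               # the distinct intervals, in days
print(min(gaps), max(gaps))            # shortest and longest interval
print(all(g % 7 == 0 for g in gaps))   # every gap is a whole number of weeks
```

Every gap is a multiple of seven simply because both ends of each interval fall on the same day of the week.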


Overall, a neat little bit of simple analysis. Of course, this is just how I typically think about things, by looking at data first. I know that in this case, the occurrence of things like Friday the 13th (or say, holidays that fall on a certain day of the week or the like) is related to the properties of the Gregorian calendar and follows a pattern that you could write specific rules around if you took the time to sit down and work it all out (which is exactly what some Wikipedians have done in the article on Friday the 13th).
I'm not superstitious, but now I know when those unlucky days are coming up and so do you... and when it's time to have a movie marathon with everyone's favourite horror villain who wears a hockey mask.

Top 100 CEOs in Canada by Salary 2008-2015, Visualized

I thought it'd been a while since I'd done some good visualization work with Tableau, and noticed that this report from the Canadian Centre for Policy Alternatives was garnering a lot of attention in the news.

However, most of the articles about the report did not have any graphs and simply restated data from it in narrative to put it in context, and I found the visualizations within the report itself to be a little lacking in detail. It wasn't a huge amount of work to extract the data from the report and quickly throw it into Tableau, and put together a cohesive picture using the Stories feature (best viewed on Desktop at 1024x768 and above).

See below for the details; it's pretty staggering, even for some of the bottom earners. To put things in context, the top earner had $183M a year all-in, which, if you work 45 hours a week and only take two weeks of vacation per year, translates to about $81,000 an hour.
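
The hourly figure is just a back-of-envelope division, sketched here in Python with the numbers quoted above. The variable names are made up for illustration.

```python
annual_pay = 183_000_000  # top earner's all-in pay, as quoted above
hours_per_week = 45
weeks_worked = 52 - 2     # two weeks of vacation per year

hourly = annual_pay / (hours_per_week * weeks_worked)
print(round(hourly))      # dollars per hour worked
```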

Geez, looks like I need to get into a new line of work.

Plotting Choropleths from Shapefiles in R with ggmap – Toronto Neighbourhoods by Population


So, I'm not really a geographer. But any good analyst worth their salt will eventually have to do some kind of mapping or spatial visualization. Mapping is not really a forte of mine, though I have played around with it some in the past.
I was working with some shapefile data a while ago and thought about how it's funny that so much of spatial data is dominated by a format that is basically proprietary. I looked around for some good tutorials on using shapefile data in R, and even so it took me a while to figure it out, longer than I would have thought.
So I thought I'd put together a simple example of making nice choropleths using R and ggmap. Let's do it using some nice shapefile data of my favourite city in the world courtesy of the good folks at Toronto's Open Data initiative.


We're going to plot the shapefile data of Toronto's neighbourhoods boundaries in R and mash it up with demographic data per neighbourhood from Wellbeing Toronto.
We'll need a few spatial plotting packages in R (ggmap, rgeos, maptools).
Also the shapefile originally threw some kind of weird error when I originally tried to load it into R, but it was nothing loading it into QGIS once and resaving it wouldn't fix. The working version is available on the github page for this post.


First let's just load in the shapefile and plot the raw boundary data using maptools. What do we get?
# Read the neighborhood shapefile data and plot
library(maptools)
shpfile <- "NEIGHBORHOODS_WGS84_2.shp"
sh <- readShapePoly(shpfile)
plot(sh)
This just yields the raw polygons themselves. Any good Torontonian would recognize these shapes. There's some maps like these with words squished into the polygons hanging in lots of print shops on Queen Street. Also as someone pointed out to me, most T-dotters think of the grid of downtown streets as running directly North-South and East-West but it actually sits on an angle.

Okay, that's a good start. Now we're going to include the neighbourhood population from the demographic data file by attaching it to the dataframe within the shapefile object. We do this using the merge function. Basically this is like an SQL join. Also I need to convert the neighbourhood number to an integer first so things work, because R is treating it as a string.

# Add demographic data
# The neighbourhood ID is a string - change it to an integer
# (going via as.character guards against the column being a factor)
sh@data$AREA_S_CD <- as.numeric(as.character(sh@data$AREA_S_CD))

# Read in the demographic data and merge on Neighbourhood Id
demo <- read.csv(file="WB-Demographics.csv", header=T)
sh2 <- merge(sh, demo, by.x='AREA_S_CD', by.y='Neighbourhood.Id')
Next we'll create a nice white to red colour palette using the colorRampPalette function, and then we have to scale the population data so it ranges from 1 to the max palette value and store that in a variable. Here I've arbitrarily chosen 128. Finally we call plot and pass that vector of colours into the col parameter:
# Set the palette
p <- colorRampPalette(c("white", "red"))(128)

# Scale the total population to the palette
pop <- sh2@data$Total.Population
# Scale to an index from 1 to 128, then look up the colour for each polygon
cols <- p[round((pop - min(pop))/diff(range(pop))*127+1)]
plot(sh2, col=cols)
And here's the glorious result!

Cool. You can see that the population is greater for some of the larger neighbourhoods, notably on the east end and The Waterfront Communities (i.e. condoland)
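
For readers more at home in Python, the linear scaling used for the base plot (population mapped onto palette positions 1 to 128) can be sketched as below. This is only an illustration of the arithmetic: `palette_index` is a hypothetical helper, the rounding to a whole index is an added assumption, and the population figures are invented rather than taken from the Wellbeing Toronto data.

```python
def palette_index(value, vmin, vmax, n_colours=128):
    """Scale value linearly onto 1..n_colours and round to a whole index (1-based, as in R)."""
    scaled = (value - vmin) / (vmax - vmin) * (n_colours - 1) + 1
    return int(round(scaled))

# Hypothetical neighbourhood populations, for illustration only
populations = [6500, 12000, 27000, 53000]
lo, hi = min(populations), max(populations)

print([palette_index(p, lo, hi) for p in populations])
```

The smallest value always lands on the first colour and the largest on the last, so a single very large neighbourhood will compress everything else toward the light end of the palette.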

I'm not crazy about this white-red palette so let's use RColorBrewer's spectral which is one of my faves:

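# RColorBrewer, spectral
library(RColorBrewer)
p <- colorRampPalette(brewer.pal(11, 'Spectral'))(128)
# Look the colours up again using the new palette
cols <- p[round((pop - min(pop))/diff(range(pop))*127+1)]
plot(sh2, col=cols)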

There, that's better. The dark red neighborhood is Woburn. But we still don't have a legend so this choropleth isn't really telling us anything particularly helpful. And it'd be nice to have the polygons overplotted onto map tiles. So let's use ggmap!


In order to use ggmap we have to decompose the shapefile of polygons into something ggmap can understand (a dataframe). We do this using the fortify command. Then we use ggmap's very handy qmap function which we can just pass a search term to like we would Google Maps, and it fetches the tiles for us automatically and then we overplot the data using standard calls to geom_polygon just like you would in other visualizations using ggplot.

The first polygon call is for the filled shapes and the second is to plot the black borders.

library(ggmap)

# Decompose the polygons into a dataframe
points <- fortify(sh, region = 'AREA_S_CD')

# Plot the neighborhoods
# (alpha is a constant here, so it is set outside aes() rather than mapped)
toronto <- qmap("Toronto, Ontario", zoom=10)
toronto + geom_polygon(aes(x=long, y=lat, group=group), data=points, fill='white', alpha=0.25) +
  geom_polygon(aes(x=long, y=lat, group=group), data=points, color='black', fill=NA)

Now we merge the demographic data just like we did before, and ggplot takes care of the scaling and legends for us. It's also super easy to use different palettes by using scale_fill_gradient and scale_fill_distiller for ramp palettes and RColorBrewer palettes respectively.

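# Merge the fortified shapefile data with the demographic data, using the neighborhood ID
points2 <- merge(points, demo, by.x='id', by.y='Neighbourhood.Id', all.x=TRUE)
# merge() can reorder rows, so restore the vertex order within each polygon
points2 <- points2[order(points2$order), ]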

# Plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_gradient(low='white', high='red')

# Spectral plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

So there you have it! Hopefully this will be useful for other R users wishing to make nice maps in R using shapefiles, or those who would like to explore using ggmap.

References & Resources

Neighbourhood boundaries at Toronto Open Data:
Demographic data from Well-being Toronto:

I’m Dreaming of a White Christmas

I'm heading home for the holidays soon.

It's been unseasonably warm this winter, at least here in Ontario, so much so that squirrels in Ottawa are getting fat. I wanted to put together a really cool post predicting the chance of a white Christmas using lots of historical climate data, but it turns out Environment Canada has already put together something like that by crunching some numbers. We can just slam this into Google Fusion tables and get some nice visualizations of simple data.


It seems everything above a certain latitude has a much higher chance of having a white Christmas in recent times than places closer to the American border and on the coast, which I'm going to guess is likely due to how cold it gets in those areas on average during the winter. Sadly Toronto has less than a coin-flip's chance of a white Christmas in recent times, with only a 40% chance of snow on the ground come the holiday.


But just because there's snow on the ground doesn't necessarily mean that your yuletide weather is worthy of a Christmas storybook or holiday movie. Environment Canada also has a definition for what they call a "Perfect Christmas": 2 cm or more of snow on the ground and snowfall at some point during the day. Which Canadian cities had the most of these beautiful Christmases in the past?

Interestingly Ontario, Quebec and Atlantic Canada are better represented here, which I imagine has something to do with how much precipitation they get due to proximity to bodies of water, but hey, I'm not a meteorologist.
A white Christmas would be great this year, but I'm not holding my breath. Either way it will be good to sit by the fire with an eggnog and not think about data for a while. Happy Holidays!

I Heart Sushi



I like sushi.

I've been trying to eat a bit better lately though (aren't we all?) and so got to wondering: just how bad for you is sushi exactly? What are some of the better nutritional choices I can make when I go out to eat at my favorite Japanese(ish) place? What on the menu should I definitely avoid?

And then I got to thinking, like I normally think about the world, that hey, it's all just data, and I remembered how I could just take some nutritional information as raw data, as I did ages ago for Mickey D's, and see if anything interesting pops out. Plus this seemed like as good an excuse as any to do some work with the good old data analysis and visualization stack for Python, and IPython notebooks, instead of my usual go-to tool of R.

So let's have a look, shall we?


As always, the first step is getting the data, which is sometimes the most difficult step. Here the menu I chose to use was that from Sushi Stop (I am in no way affiliated nor associated with said brand, nor am I endorsing it), where the nutritional information unfortunately was only available as a PDF, as is often the case.

This is a hurdle data analysts, but more often I think, research analysts and data journalists, can often run into. Fortunately there are tools at our disposal to deal with this kind of thing, so not to worry. Using the awesome Tabula and a little bit of ad hoc cleaning from the command line, it was a simple matter of extracting the data from the PDF and into a convenient CSV. Boom, and we're ready to go.


The data comprises 335 unique items in 17 different categories with 15 different nutritional variables. Let's dig in.


First we include the usual suspects in the python data analysis stack (numpy, matplotlib and pandas), then read the data into a dataframe using pandas.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
data = pd.read_csv("tabula-nutritional-information.csv", delimiter=",")

Okay, what are we working with here? Let's take a look:

In [3]:
print(data.columns)
print(len(data.columns))
data.head()
Index([u'category', u'item', u'serving_size', u'calories', u'fat', u'saturated_fat', u'trans_fat', u'cholesterol', u'sodium', u'carbohydrates', u'fibre', u'sugar', u'protein', u'vitamin_a', u'vitamin_c', u'calcium', u'iron'], dtype='object')
category item serving_size calories fat saturated_fat trans_fat cholesterol sodium carbohydrates fibre sugar protein vitamin_a vitamin_c calcium iron
0 APPETIZERS & SALADS Shrimp Tempura 60 180 8.0 0.0 0 40 125 18 0 0 8 0 0 0 0
1 APPETIZERS & SALADS Three salads 120 130 3.5 0.0 0 60 790 13 4 8 8 2 6 40 8
2 APPETIZERS & SALADS Wakame 125 110 2.0 0.0 0 0 1650 13 4 9 0 0 0 110 0
3 APPETIZERS & SALADS Miso soup 255 70 3.0 0.5 0 0 810 8 1 1 6 0 0 20 25
4 APPETIZERS & SALADS Grilled salmon salad 276 260 19.0 2.5 0 30 340 12 3 6 12 80 80 8 8

5 rows × 17 columns

Let's look at the distribution of the different variables. You can see that most are heavily skewed or follow power law / log-normal type distributions, as most things in nature do. Interestingly there is a little blip there in the serving sizes around 600, which we'll see later is the ramen soups.

In [4]:
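# Have a look: one histogram per numeric column
plt.figure(0, figsize=(25,12), dpi=80)
for i in range(2,len(data.columns)):
    fig = plt.subplot(2,8,i)
    # The plotting call is missing from the original listing; a histogram is assumed
    plt.hist(data[data.columns[i]].dropna())
    plt.title(data.columns[i], fontsize=25)
    # fig.tick_params(axis='both', which='major', labelsize=15)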

Let's do something really simple, and without looking at any of the other nutrients just look at the caloric density of the foods. We can find this by dividing the number of calories in each item by the serving size. We'll just look at the top 10 worst offenders or so:

In [5]:
data['density']= data['calories']/data['serving_size']
# DataFrame.sort() was removed from pandas; sort_values() is the current equivalent
data[['item','category','density']].sort_values('density', ascending=False).head(12)
     item                                category             density
314  Yin Yang Sauce                      EXTRAS               5.000000
311  Ma! Sauce                           EXTRAS               4.375000
75   Akanasu (brown rice)                HOSOMAKI             3.119266
0    Shrimp Tempura                      APPETIZERS & SALADS  3.000000
312  Spicy Light Mayo                    EXTRAS               2.916667
74   Akanasu                             HOSOMAKI             2.844037
67   Akanasu avocado (brown rice)        HOSOMAKI             2.684564
260  Teriyaki Bomb ‐ brown rice (1 pc)   TEMARI               2.539683
262  Teriyaki Bomb ‐ brown rice (4 pcs)  TEMARI               2.539683
66   Akanasu avocado                     HOSOMAKI             2.483221
201  Inferno Roll (brown rice)           SUMOMAKI             2.395210
259  Teriyaki Bomb (1 pc)                TEMARI               2.380952

12 rows × 3 columns

The most calorically dense thing is Yin Yang Sauce, which as far as I could ascertain was just mayonnaise and something else put in a yin-yang shape on a plate. Excluding the other sauces (I assume Ma! also includes mayo), the other most calorically dense foods are the variations of the Akanasu roll (sun-dried tomato pesto, light cream cheese, sesame), shrimp tempura (deep fried, so not surprising) and teriyaki bombs, which are basically seafood, cheese and mayo smushed into a ball, deep fried and covered with sauce (guh!). I guess sun-dried tomato pesto has a lot of calories. Wait a second, does brown rice have more calories than white? Oh right, sushi is made with sticky rice, and yes, yes it does. Huh, today I learned.
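
As a quick sanity check on the density calculation, here is the same division done by hand in plain Python (no pandas) on the five rows displayed earlier. Only the item name, serving size and calories are copied over; the variable names are made up for illustration.

```python
# (item, serving size in g, calories), copied from the five rows shown earlier
rows = [
    ("Shrimp Tempura", 60, 180),
    ("Three salads", 120, 130),
    ("Wakame", 125, 110),
    ("Miso soup", 255, 70),
    ("Grilled salmon salad", 276, 260),
]

# Caloric density is calories per gram of serving
density = {item: calories / serving for item, serving, calories in rows}

for item, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<22}{:.3f}".format(item, d))
```

Shrimp Tempura works out to 180 / 60 = 3.0 calories per gram, which agrees with its entry in the table above.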

We can get a more visual overview of the entire menu by plotting the two quantities together. Since density is calories divided by serving size, we put calories on the y-axis and serving size on the x. Here we colour by category and get a neat little scatterplot.

In [7]:
# Get the unique categories
categories = np.unique(data['category'])

# Get the colors for the unique categories
# ('spectral' was renamed 'nipy_spectral' in newer matplotlib releases)
cm = plt.get_cmap('nipy_spectral')
cols = cm(np.linspace(0, 1, len(categories)))

# Iterate over the categories and plot
# (the Python 2 .decode() call on the label is dropped; str labels work as-is in Python 3)
for category, col in zip(categories, cols):
    d = data[data['category']==category]
    plt.scatter(d['serving_size'], d['calories'], s=75, color=col, label=category)
plt.xlabel('Serving Size (g)', size=15)
plt.ylabel('Calories', size=15)
plt.title('Serving Size vs. Calories', size=18)

legend = plt.legend(title='Category', loc='center left', bbox_to_anchor=(1.01, 0.5),
                    ncol=1, fancybox=True, shadow=True, scatterpoints=1)

You can see that the nigiri & sashimi generally have smaller serving sizes and so fewer calories. The ramen soup is in a category all its own with much larger serving sizes than the other items, as I mentioned before and we saw in the histograms. The other rolls are kind of in the middle. The combos, small ramen soups and some of the appetizers and salads also sit away from the 'main body' of the rest of the menu.

Points which lie further above the line y=x have higher caloric density. The top items we picked out above had the highest raw values, and we can probably guess where they are in the graph (the sauces are the vertical blue line near the bottom left, and the Akanasu are probably those pairs of dark green dots to the right), but there are other categories which are probably worse overall, like the cluster of red which is sushi pizza. Which category of the menu has the highest caloric density (and so is likely best avoided) overall?

In [8]:
# Find most caloric dense categories on average
density = data[['category','density']]
grouped = density.groupby('category')
# sort_values() replaces the removed DataFrame.sort()
grouped.mean().sort_values('density', ascending=False).head()
EXTRAS 2.421875
SUSHI PIZZA 2.099515
TEMARI 1.807691
HAKO 1.583009

5 rows × 1 columns

As expected, we see that other than the extras (sauces) which have very small serving sizes, on average the sushi pizzas are the most calorically dense group of items on the menu, followed by crispy rolls. The data confirm: deep fried = more calories.

What if we were only concerned with fat (as many weight-conscious people dining out are)? Let's take a look at the different categories with a little more depth than just a simple average:

In [9]:
# Boxplot of fat content
fat = data[['category','fat']]
grouped = fat.groupby('category')

# Sort the categories by median fat content
df2 = pd.DataFrame({col:vals['fat'] for col,vals in grouped})
meds = df2.median().sort_values()
df2 = df2[meds.index]

# Plot
fatplot = df2.boxplot(vert=False)

While the combos and the appetizers and salads have very wide ranges in their fat content, we see again that the sushi pizza and crispy rolls have the most fat collectively and so are best avoided.

Now another thing people are often worried about when they are trying to eat well is the amount of sodium they take in. So let's repeat our previous approach in visually examining caloric density, only this time plot it as one metric on the x-axis and look at where different items on the menu sit with regards to their salt content.

In [10]:
fig = plt.figure(figsize=(12,8))
plt.ylim(-50, 2000)
# Reuses the categories and cols defined for the earlier scatterplot
for category, col in zip(categories, cols):
    d = data[data['category']==category]
    plt.scatter(d['density'], d['sodium'], s=75, color=col, label=category)
plt.xlabel('Caloric density (calories/g)', size=15)
plt.ylabel('Sodium (mg)', size=15)
plt.title('Sodium vs. Caloric Density', size=18)

legend = plt.legend(title='Category', loc='center left', bbox_to_anchor=(1.01, 0.5),
                    ncol=1, fancybox=True, shadow=True, scatterpoints=1)

Here we can see that while the extras (sauces) are very calorically dense, you're probably not going to take in a crazy amount of salt unless you go really heavy on them (bottom right). If we're really worried about salt the ramen soups should be avoided, as most of them have very high sodium content (straight line of light green near the left), some north of 1500 mg, which is the daily recommended intake from Health Canada for people aged 14-50. There are also some of the other items we've seen before not looking so good (sushi pizza). Some of the temari (like the teriyaki bombs) and sumomaki ('regular' white-on-the-outside maki rolls) should be avoided too? But which ones?
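
To put the 1500 mg figure in perspective, the sketch below expresses sodium as a share of that daily amount, again in plain Python and using only the five rows displayed near the top of the post. The constant and variable names are made up for illustration.

```python
SODIUM_LIMIT_MG = 1500  # daily figure cited in the post

# (item, sodium in mg), copied from the five rows shown earlier
rows = [("Shrimp Tempura", 125), ("Three salads", 790), ("Wakame", 1650),
        ("Miso soup", 810), ("Grilled salmon salad", 340)]

for item, sodium in sorted(rows, key=lambda r: r[1], reverse=True):
    share = sodium / SODIUM_LIMIT_MG
    flag = "over the daily figure" if sodium > SODIUM_LIMIT_MG else ""
    print("{:<22}{:>5} mg  {:>4.0%}  {}".format(item, sodium, share, flag))

# Items that exceed the daily figure in a single serving
over = [item for item, sodium in rows if sodium > SODIUM_LIMIT_MG]
print(over)
```

Even in this tiny sample one appetizer exceeds the daily figure on its own, and the miso soup accounts for more than half of it.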

A plot like this is pretty crowded, I'll admit, so it is really better explored interactively, and we can do that using the very cool (and very much under development) mpld3 package, which combines the convenience of matplotlib with the power of D3.

Below is the same scatterplot, only interactive, so you can mouse over and see what each individual point is. The items to be most avoided (top right in grey and orange), are indeed the teriyaki bombs, as well as the inferno roll (tempura, light cream cheese, sun-dried tomato pesto, red and orange masago, green onion, spicy light mayo, spicy sauce, sesame) as we saw before. Apparently that sun-dried tomato pesto is best taken in moderation.

The Akanasu rolls are the horizontal line of 4 green points close by. Your best bet is probably just to stick to the nigiri and sashimi, and maybe some of the regular maki rolls closer to the bottom left corner.

In [11]:
import mpld3
fig, ax = plt.subplots(figsize=(12,8))
N = 100

for category, col in zip(categories, cols):
    d = data[data['category']==category]
    scatter = ax.scatter(d['density'], d['sodium'], s=40, c=col, label=category.decode('ascii', 'ignore'))
    # Attach each item's name as a hover tooltip on its point
    labels = list(d['item'])
    tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
    mpld3.plugins.connect(fig, tooltip)



Well, there we have it folks. A simple look at the data tells us some common-sense things we probably already knew:

  • Deep fried foods will make you fat
  • Mayo will make you fat
  • Soup at Japanese restaurants is very salty
  • Sashimi is healthy if you go easy on the soy

And surprisingly, one thing I would not have thought: that sun-dried tomato pesto is apparently really bad for you if you're eating conscientiously.

That's all for now. See you next time and enjoy the raw fish.

References and Resources


Sushi Stop - Nutritional Information (PDF)

Food & Nutrition - Sodium in Canada (Health Canada)

code & data on github

Good Data Visualization Should Be Boring

So I'm going to make a statement that I'm sure some people are going to disagree with: good data visualization should be boring.

Well, at least kind of boring.

I've had a lot of conversations with a lot of people over the last few years or so about data visualization: why it's important, what constitutes good and bad, and examples of its application in both problematic and very effective ways.

A salient point someone made to me once is that part of the problem with the practice of data visualization is that it isn't viewed as a standalone discipline; it's simply done, in high school math classes, university courses, or even in the workplace by professionals, and usually assumed that people will just pick it up without discussion around it and its proper application.

I think this is gradually starting to change: with all the talk (or hype, depending on your point of view) around “Big Data”, analytics is becoming more mainstream, and data visualization along with it. I also think dataviz is beginning, gradually, very gradually, to be viewed as a standalone discipline, with courses now being offered in it, and the “data visualization evangelism” of academics such as Edward Tufte and Alberto Cairo and the work of practitioners like Stephen Few and Mike Bostock helping to raise awareness of what's doing it wrong and what's doing it right. This, along with others creating visualizations which go viral or delivering inspirational TED talks, is doing a lot for visualization as a practice.

The thing I found when I first started to get into dataviz is that even if you're good with data that doesn't necessarily mean you're good at visualizing it. This is because, in addition to working with data, doing proper visualization involves questions of design and also the psychology of perception.

Less is More

I'm a minimalist, and therefore take what I call a functionalist perspective of data visualization. That is to say, the purpose of visualization is to represent the data as effectively as possible, so that the audience can understand it as quickly and easily as possible.

As such, I feel that good data visualization should be somewhat dull, or at least somewhat dry; in terms of depicting information and people perceiving it, it is usually the case that simpler is better. This is illustrated in principles like Tufte's data-ink ratio.

So, look at the charts below. Which is more visually appealing to you? Which is simpler? Which one depicts the quantities such that you are able to interpret them the most quickly, accurately and with the most clarity?

If you're like me, you'll say the one on the right, which is a better visualization, even though it may not be as visually appealing to some. Most often you're better served by a simpler, cleaner visualization (or perhaps several of them) than a lot of complexity and visual noise that doesn't add to the reader's understanding.

Never say always

That being said, as I mentioned, choices around data visualization are ultimately ones of design. I do believe that there are some hard and fast rules that should never be broken (e.g. always start the y-axis at 0 for bar charts of strictly positive values, don't represent data with the same units on a secondary y-axis, never use a line chart for categorical data); however, I also believe there are some that are more flexible, depending on what you want to accomplish and who your audience is. Should you never, ever, use a pie chart? No. Some people are more comfortable with pie charts just from their familiarity with them. Is a bar chart a better choice in terms of representing the data? Yes. But that doesn't mean there aren't exceptions (just don't make a 3D one).

The same individual who made the observation about dataviz not being taught also pointed out to me another factor that can influence design choices: what she called chart fatigue. Is the bar chart the best way to plot a single metric across a categorical variable? Almost always, yes. But show a room full of businesspeople bar chart after bar chart after bar chart and anyone can tell you that they're all going to start to look the same, and interpretation of them is going to suffer as a result. Plus you're probably going to lose the interest of your audience.

Practice makes perfect

In conclusion, I think that awareness of data visualization is only going to get better as companies (and the average consumer) become more “data savvy”. It is my sincere hope that people will give more and more emphasis, not only to the importance of visualization as a tool, but also to the design choices around it, and what constitutes good and bad depictions of data.

For now, just remember that data visualization is ultimately all about communicating and having your reader understand, not necessarily wowing them (though both together are not impossible). And sometimes, that means boring is better.

Visual Analytics of Every PS4 and Xbox One Game by Install Size


I have a startling confession to make: sometimes I just want things to be simple. Sometimes I just want things to be easier. All this talk about "Big Data", predictive modeling, machine learning, and all those associated bits and pieces that go along with data science can be a bit mentally exhausting. I want to take a step back and work with a smaller dataset, something simple, something easy, something everyone can relate to - after all, that's what this blog started out being about.

A while back, someone posted on Slashdot that the folks over at had put together data sets of the install size of every PS4 and Xbox One game released to date. Being a console owner myself (I'm a PS4 guy, but no fanboy or hardcore gamer by any means), I thought this would be a fun and rather unique data set to play around with, one that would fit well within the category of 'everyday analytics'. So let's take a look, shall we?


Very little background required here - the dataset comprises the title, release date, release type (major or indie), console (PS4 or Xbox One), and size in GiB of all games released as of September 10th, 2015. For this post we will ignore the time-related dimension and look only at the quantity of interest: install size.


Okay, if I gave this data to your average Excel jockey what's the first thing they'd do? A high level summary of the data broken apart by categorical variables and summarized by quantitative? You got it!
We can see that far more PS4 games have been released than Xbox One games (462 vs. 336), and that the relative proportions of the release types are reversed between the two platforms.

A small aside here on data visualization: it's worth noting that the above is a good way to go for making a bar chart from a functional perspective. Since there are data labels and the y-axis metric is in the title, we can ditch the axis and maximize the data-ink ratio (well, data-pixel anyhow). I've also avoided using a stacked bar chart as interpretation of absolute values tends to suffer when not read from the same baseline. I'm okay with doing it for relative proportions though - as in the below, which further illustrates the difference in release type proportion between the two consoles:

Finally, how does the install size differ between the consoles and game types? If I'm an average analyst and just trying to get a grip on the data, I'd take an average to start:
We can see (unsurprisingly, if you know anything about console games) that major releases tend to be much larger in size than indie. Also in both cases, Xbox install sizes are larger on average: about 1.7x for indie titles and 1.25x for major.

Okay, that's interesting. But if you're like me, you'll be thinking about how 99% of the phenomena in the universe are distributed by a power law or have some kind of non-Gaussian based distribution, and so averages are actually not always such a great way to summarize data. Is this the case for our install size data set?

Yes, it is. We can see here in this combination histogram and cumulative distribution (in the form of a Pareto chart) that the games follow a power law, with approximately 55 and 65 percent being under 5 GiB for PS4 and Xbox games respectively.
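The "percent under 5 GiB" figures are just one point read off the cumulative curve. A minimal sketch of that calculation is below; this is not the spreadsheet logic behind the chart, and the function name and sample sizes are made up for illustration.

```python
def share_below(sizes_gib, threshold_gib):
    """Return the fraction of install sizes strictly below the threshold."""
    if not sizes_gib:
        raise ValueError("need at least one install size")
    below = sum(1 for size in sizes_gib if size < threshold_gib)
    return below / len(sizes_gib)


# Hypothetical install sizes in GiB, for illustration only (not the real data).
sample_sizes = [0.4, 1.2, 2.5, 3.9, 4.8, 7.0, 18.5, 42.0]

print(f"{share_below(sample_sizes, 5.0):.1%} of the sample is under 5 GiB")
```

Evaluating the same function at every bin edge gives the cumulative line of a Pareto-style chart.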

But is this entirely due to the indie games having small sizes? Might the major releases be centered around some average or median size?

No, we can see that even when broken apart by type of release, the power-law-like distribution of install sizes persists. I compared the averages to the medians and found them to still be decent representations of central tendency, not too affected by outliers.
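For anyone wanting to run the same sanity check, here is a rough sketch of comparing a mean to a median using Python's standard library. The sample values are invented to show what a single large outlier does, and the helper name is hypothetical.

```python
from statistics import mean, median


def summarize(sizes_gib):
    """Return the mean, the median, and their ratio for a list of sizes."""
    avg = mean(sizes_gib)
    mid = median(sizes_gib)
    return {"mean": avg, "median": mid, "mean_over_median": avg / mid}


# Hypothetical sizes in GiB: four small games and one very large one.
sample = [1.0, 2.0, 3.0, 4.0, 40.0]

summary = summarize(sample)
print(summary["mean"], summary["median"])
```

A ratio well above 1 suggests that a few large values are pulling the mean up, in which case the median is the safer summary.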

Finally we can look at the distribution of the install sizes by using another type of visualization suited for this task, the boxplot. While it is at least possible to jury-rig a boxplot in Excel (see this excellent how-to over at Peltier Tech), Google Sheets doesn't give us as much to work with, but I did my best (the data label is at the maximum point, and the value is the difference between the max and Q3):

The plots show that install sizes are generally greater for Xbox One vs. PS4, and that the difference (and skew) appears to be a bit more pronounced for indie games versus major releases, as we saw in the previous figures.

Okay, that's all very interesting, but what about games that are available for both consoles? Are the install sizes generally the same or do they differ?

Difference in Install Size by Console Type
Because we've seen that the Xbox install sizes are generally larger than PlayStation ones, here I take the PS4 size to be the baseline for games which appear on both (that is, differences are of the form Xbox size - PS4 size).

Of the 618 unique titles in the whole data set (798 titles if you double count across platforms), 179 (~29%) were available on both, so a little under a third of games are released for both major consoles.
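Counting cross-platform titles comes down to a set intersection. The sketch below uses placeholder title names (not from the data set) and assumes titles are spelled identically on both lists; the last line simply redoes the arithmetic quoted above.

```python
def overlap_stats(first, second):
    """Return (titles on both, unique titles overall, share on both)."""
    both = first & second
    unique = first | second
    return len(both), len(unique), len(both) / len(unique)


# Placeholder titles for illustration only.
ps4_titles = {"Game A", "Game B", "Game C", "Game D"}
xbox_titles = {"Game C", "Game D", "Game E"}

print(overlap_stats(ps4_titles, xbox_titles))

# The share quoted in the post: 179 shared titles out of 618 unique ones.
quoted_share = 179 / 618
print(f"{quoted_share:.1%}")
```

In practice the titles would probably need some normalizing first (case, punctuation, edition suffixes) before the two sets line up.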

Let's take a look at the difference in install sizes - do those games which appear for both reflect what we saw earlier?

Yes, for both categories of game the majority are larger on Xbox than PS4 (none were the same size). Overall about 85% of the games were larger on Microsoft's console (152/179).

Okay, but how much larger? Are we talking twice as large? Five times larger? Because the size of the games varies widely (especially between the release types) I opted to go for percentages here:

Unsurprisingly, on average indie games tend to have larger differences proportionally, because they're generally much smaller in size than major releases. We can see they are nearly twice as large on Xbox vs. PS4, while major releases are about one and a quarter times as large. When games are larger on PS4, the disparity is not as big, and the pattern across release types is the same (though keep in mind the number of games here is a lot smaller than for the former).
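To make the percentage convention concrete, here is a small sketch of the difference with the PS4 size as the baseline, as described above. The function name and the example sizes are hypothetical.

```python
def pct_difference(xbox_gib, ps4_gib):
    """Percentage difference of the Xbox size relative to the PS4 baseline."""
    if ps4_gib <= 0:
        raise ValueError("PS4 baseline size must be positive")
    return (xbox_gib - ps4_gib) / ps4_gib * 100.0


# Hypothetical sizes in GiB, for illustration only.
print(pct_difference(10.0, 5.0))  # positive: larger on Xbox
print(pct_difference(4.0, 5.0))   # negative: larger on PS4
```

One consequence of fixing the baseline is that the scale is asymmetric: a game can be hundreds of percent larger on Xbox, but it can never be more than 100% smaller.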

Finally, just to ground this a bit more I thought I'd look at the top 10 games in each release type where the absolute differences are the largest. As I said before, here the difference is Xbox size minus PS4:

For major releases, the worst offender for being larger on PS4 is Batman: Arkham Knight (~6.1 GiB difference) while on the opposite end, The Elder Scrolls Online has a ~19 GiB difference. Wow.

For indies, we can see the absolute difference is a lot smaller for those games bigger on PS4, with Octodad having the largest difference of ~1.4 GiB (56% of its PS4 size). Warframe is 19.6 GiB bigger on Xbox than PS4, or 503% larger (!!)

Finally, I've visualized all the data together for you so you can explore it yourself. Below is a bubble chart of the Xbox install size plotted against PS4, coloured by release type, where the size of each point represents the absolute value of the percentage difference between the platforms (with the PS4 size taken to be the baseline). So points above the diagonal are larger for Xbox than PS4, and points below the diagonal are larger for PS4 than Xbox. Also note that the scale is log-log. You can see that most of the major releases are pretty close to each other in size, as they nearly lie on the y=x line.


It's been nice to get back into the swing of things and do a little simple data visualization, as well as play with a data set that falls into the 'everyday analytics' category.
And, as a result, we've learned:

  • Xbox games generally tend to have larger install sizes than PS4 ones, even for the same title
  • Game install sizes follow a power law, just like most everything else in the universe (or maybe just 80% of it)
  • What the heck a GiB is
Until next time then, don't fail to keep looking for the simple beauty in data all around you.

References & Resources

Complete List of Xbox One Install Sizes:
Complete List of PlayStation 4 Install Sizes:
Compiled data set (Google Sheets):
Excel Box and Whisker Diagrams (Box Plots) @ Peltier Tech:

Visualization and Analysis of Reddit’s “The Button” Data


People are weird. And if there's anything that's greater collective proof of this fact than Reddit, you'd be hard pressed to find it.

I tend to put reddit in the same bucket as companies like Google, Amazon and Netflix, where they have enough money, or freedom, or both, to say something like "wouldn't it be cool if....?" and then they do it simply because they can.

Enter "the button" (/r/thebutton), reddit's great social experiment that appeared on April Fool's Day of this year. An enticing blue rectangle with a timer that counts down from 60 to zero that's reset when the button is pushed, with no explanation as to what happens when the time is allowed to run out. Sound familiar? The catch here being that it was an experience shared by anyone who visited the site, and each user also only got one press (though many made attempts to game the system, at least initially).

Finally, the timer reached zero, the last button press being at 2015-06-05 21:49:53.069000UTC, and the game (rather anti-climactically I might offer) ended.

What does this have to do with people being weird? Well, an entire mythology was built up around the button, amongst other things. Okay, maybe interesting is a better word. And maybe we're just talking about your average redditor.

Either way, what interests me is that when the experiment ended, all the data were made available. So let's have a look shall we?


The dataset consists of just four fields:

  • press time, the date and time the button was pressed
  • flair, the flair the user was assigned, based on what the timer read when they pushed the button
  • css, the flair class given to the user
  • outage press, a Boolean indicator of whether the press occurred during a site outage
The data span a time period from 2015-04-01 16:10:04.468000 to 2015-06-05 21:49:53.069000, with a total of 1,008,316 rows (unique presses).
I found the css was missing for some rows, and there was a lot of "non presser" flair (users who were not eligible to press the button as their account was created after the event started). For these I used a "missing" value of -1 for the number of seconds remaining when the button was pushed; otherwise it could be stripped from the css field.


With this data set, we're looking at a pretty straightforward categorical time series.
Overall Activity in Time
First we can just look at the total number of button presses, regardless of what the clock said (when they occurred in the countdown) by plotting the raw number of presses per day:

Hmmm... you can see there is a massive spike at the beginning of the graph and there are far fewer presses for the rest of the duration of the experiment. In fact, nearly 32% of all clicks occurred in the first day, and over half (51.3%) in the first two days.
I think this has something to do with both the initial interest in the experiment when it first was announced, and also with the fact that the higher the counter is kept at, the more people can press the button in the same time period (more on this later).
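The "share in the first day or two" figures are cumulative proportions of the daily counts. A minimal sketch, with invented daily counts rather than the real ones and a hypothetical helper name:

```python
def cumulative_share(daily_counts, first_n_days):
    """Fraction of all presses that fall within the first N days."""
    total = sum(daily_counts)
    if total == 0:
        raise ValueError("no presses to summarize")
    return sum(daily_counts[:first_n_days]) / total


# Hypothetical presses per day, front-loaded like the real series.
daily = [32, 19, 9, 8, 7, 25]

print(cumulative_share(daily, 1))
print(cumulative_share(daily, 2))
```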
Perhaps a logarithmic graph for the y-axis would be more suitable?
That's better. We can see the big drop-off in the first two days or so, and also that the little blip around the 18th of May is more apparent. This is likely tied to one of several technical glitches which are noted in the button wiki.

For a more granular look, let's do the hourly presses as well (with a log scale):

Cool. The spike on the 18th seems to be mainly around one hour with about a thousand presses, and we can see too that perhaps there's some kind of periodic behavior in the data on an hourly basis? If we exclude some of the earlier data we can also go back to not using a log scale for the y-axis:

Let's look more into the hours of the day when the button presses occur. We can create a simple bar plot of the count of button presses by hour overall:

You can see that the most presses occurred around 5 PM, and then there is a drop-off after that, with the lows being in the morning hours between about 7 and noon. Note that all the timestamps for the button pushes are in Universal Time. Unfortunately there is no geo data, but assuming most redditors who pushed the button are within the continental United States (a rather fair assumption), the high between 5 and 7 PM UTC would fall around the middle of the day local time (11 AM to 1 PM in the Mountain time zone, for example, and a couple of hours later on the East Coast), so around your lunch hour at work.
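Since the local-time reading depends on which time zone you assume, here is a quick sketch of the conversion using fixed UTC offsets. The offsets are the US daylight-saving ones, which were in effect for the whole experiment (April to early June); the function name and the choice of date are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# US daylight-time offsets from UTC, in hours.
US_DAYLIGHT_OFFSETS = {
    "Eastern (EDT)": -4,
    "Central (CDT)": -5,
    "Mountain (MDT)": -6,
    "Pacific (PDT)": -7,
}


def utc_hour_to_local(hour_utc, offset_hours):
    """Convert an hour of the day in UTC to the hour at a fixed UTC offset."""
    stamp = datetime(2015, 4, 1, hour_utc, tzinfo=timezone.utc)
    local = stamp.astimezone(timezone(timedelta(hours=offset_hours)))
    return local.hour


for zone, offset in US_DAYLIGHT_OFFSETS.items():
    start = utc_hour_to_local(17, offset)
    end = utc_hour_to_local(19, offset)
    print(f"{zone}: 17:00-19:00 UTC is {start}:00-{end}:00 local")
```

Without geo data there's no way to weight these zones by where the pressers actually were, so the lunch-hour reading stays a rough one.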

But wait, that was just the overall sum of hours over the whole time period. Is there a daily pattern? What about by hour and day of week? Are most redditors pushing the button on the weekend or are they doing it at work (or during school)? We should look into this in more detail.

Hmm, nope! The majority of the clicks occurred Wednesday-Thursday night. But as we know from the previous graphs, the vast majority also occurred within the first two days, which happened to be a Wednesday and Thursday. So the figures above aren't really that insightful, and perhaps it would make more sense to look at the trending in time across both day and hour? That would give us the figure as below:

As we saw before, there is a huge number of clicks in the first few days (the first few hours even) so even with log scaling it's hard to pick out a clear pattern. But most of the presses appear to be present in the bands after 15:00 and before 07:00. You can see the clicks around the outage on the 18th of May were in the same high period, around 18:00 and into the next day.

Maybe alternate colouring would help?

That's better. Also if we exclude the flurry of activity in the first few days or so, we can drop the logarithmic scaling and see the other data in more detail:

To get a more normalized view, we can also look at the percentage of daily clicks per hour for each day, which yields a much more interesting view, and really shows the gap in the middle and the outage on the 18th:
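The normalization here is to divide each hour's count by that day's total, so every day sums to 100%. The plots in this post were made in R; the following is only a rough Python sketch of the same idea, with a hypothetical function name and made-up timestamps.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_share_by_day(timestamps):
    """Map each date to {hour: share of that day's presses}."""
    counts = defaultdict(Counter)
    for ts in timestamps:
        counts[ts.date()][ts.hour] += 1

    shares = {}
    for day, by_hour in counts.items():
        total = sum(by_hour.values())
        shares[day] = {hour: n / total for hour, n in by_hour.items()}
    return shares


# Made-up press times, for illustration only.
presses = [
    datetime(2015, 4, 1, 16, 10),
    datetime(2015, 4, 1, 16, 45),
    datetime(2015, 4, 1, 17, 5),
    datetime(2015, 4, 1, 23, 59),
    datetime(2015, 4, 2, 0, 1),
]

shares = hourly_share_by_day(presses)
for day, by_hour in sorted(shares.items()):
    print(day, by_hour)
```

Because each day is scaled separately, a quiet day late in the experiment gets the same visual weight as the first day, which is why the hourly pattern shows up more clearly.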

Activity by Seconds Remaining
So far we've only looked at the button press activity by the counts in time. What about the time remaining for the presses? That's what determined each individual reddit user's flair, and was the basis for all the discussion around the button.

The reddit code granted flairs which were specific to the time remaining when the button was pushed.  For example, if there were 34 seconds remaining, then the css would be "34s", so it was easy to strip these and convert into numeric data. There were also those that did not press the button who were given the "non presser" flair (6957 rows, ~0.69%), as well as a small number of entries missing flair (67, <0.01%), which I gave the placeholder value of -1.
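As a sketch of that clean-up step, the function below turns text like "34s" into a number and falls back to the -1 placeholder for anything else. It assumes the seconds always appear as digits followed by "s", as in the example above; the function name is hypothetical and this is not the R code used for the analysis.

```python
def seconds_remaining(flair_text):
    """Parse text such as "34s" into 34; return -1 when no seconds are present."""
    if flair_text is None:
        return -1
    text = flair_text.strip()
    if text.endswith("s") and text[:-1].isdecimal():
        return int(text[:-1])
    return -1


for raw in ["34s", "60s", "non presser", "", None]:
    print(repr(raw), "->", seconds_remaining(raw))
```

Using -1 keeps the column numeric, but those rows need to be filtered out before computing any summary of seconds remaining.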

The remaining flair classes served as a bucketing which functioned very much like a histogram:

Color        | Have they pressed? | Can they press? | Timer number when pressed
Grey/Gray    | N                  | Y               | NA
Purple       | Y                  | N               | 60.00 ~ 51.01
Blue         | Y                  | N               | 51.00 ~ 41.01
Green        | Y                  | N               | 41.00 ~ 31.01
Yellow       | Y                  | N               | 31.00 ~ 21.01
Orange       | Y                  | N               | 21.00 ~ 11.01
Red          | Y                  | N               | 11.00 ~ 00.00
Silver/White | N                  | N               | NA
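The presser rows of the table amount to a lookup from seconds remaining to colour. Here is a minimal sketch of that bucketing, taking the band edges exactly as listed (so a press at 51.00 counts as blue, not purple). How reddit itself handled the exact boundaries is an assumption on my part, and the names are hypothetical.

```python
# (exclusive lower bound in seconds, colour), checked from the top band down.
FLAIR_BANDS = [
    (51.0, "purple"),
    (41.0, "blue"),
    (31.0, "green"),
    (21.0, "yellow"),
    (11.0, "orange"),
]


def flair_colour(seconds):
    """Return the flair colour for a press made with `seconds` left on the timer."""
    if not 0 <= seconds <= 60:
        raise ValueError("seconds remaining must be between 0 and 60")
    for lower_exclusive, colour in FLAIR_BANDS:
        if seconds > lower_exclusive:
            return colour
    return "red"


for s in [60, 51, 45, 11.01, 11, 0]:
    print(s, flair_colour(s))
```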

We can see this if we plot a histogram of the button presses by using the CSS class which gives the more granular seconds remaining, and use breaks the same as above:

We can see there is a much greater proportion of those who pressed with 51-60s left, and there is falloff from there (power law). This is in line with what we saw in the time series graphs: the more the button was pressed, the more presses could occur in a given interval of time, and so we expect that most of those presses occurred during the peak activity at the beginning of the experiment (which we'll soon examine).

What differs from the button wiki documentation above is the "cheater" class, which was given to those who tried to game the system by doing things like disconnecting their internet and pressing the button multiple times (as far as I can tell). You can see that plotting a bar graph is similar to the above histogram with the difference being contained in the "cheater" class:

Furthermore, looking over the time period, how are the presses distributed in each class? What about in the cheater class? We can plot a more granular histogram:

Here we can more clearly see the exponential nature of the distribution, as well as little 'bumps' around the 10, 20, 30 and 45 second marks. Unfortunately this doesn't tell us anything about the cheater class as it still has valid second values. So let's do a boxplot by css class as well, showing both the classes (buckets) as well as their distributions:

Obviously each class has to fit into a certain range given their definition, but we can see some are more skewed than others (e.g. class for 51-60s is highly negatively skewed, whereas the class for 41-50 has median around 45). Also we can see that the majority of the cheater class is right near the 60 mark.

If we want to be fancier we can also plot the boxplot using just the points themselves and adding jitter:

This shows the skew of the distributions per class/bucket (focus around "round" times like 10, 30, 45s, etc.) as before, as well as how the vast majority of the cheater class appears to be at the 59s mark.

Presses by seconds remaining and in time
Lastly we can combine the analyses above and look at how the quantity and proportion of button presses varies in time by the class and number of seconds remaining.

First we can look at the raw count of presses per css type per day as a line graph. Note again the scale on the y-axis is logarithmic:

This is a bit noisy, but we can see that the press-6 class (presses with 51-60s remaining) dominates at the beginning, then tapers off toward the end. Presses in the 0-10 class did not appear until after April 15, then eventually overtook the quicker presses, as would have to be the case in order for the timer to run out. The cheater class starts very high with the press-6 class, then drops off significantly and continues to decrease. I would have liked to break this out into small multiples for more clarity, but it's not the easiest to do using ggplot.

Another way to look at it would be to look at the percent of presses by class per day. I've written previously about how stacked area graphs are not your friend, but in this case it's actually not too bad (plus I wanted to learn how to do it in ggplot). If anything it shows the increase in presses in the 51-60 range right after the outage on May 18, and the increase in the 0-10 range toward the end (green):

This is all very well and good, but let's get more granular. We can easily visualize the data more granularly using heatmaps with the second values taken from the user flair to get a much more detailed picture. First we'll look at a heatmap of this by hour over the time period:

Again, the scaling is logarithmic for the counts (here the fill colour). We can see some interesting patterns emerging, but it's a little too sparse as there are a lot of hours without presses for a particular second value. Let's really get granular and use all the data on the per second level!

On the left is the data for the whole period with a logarithmic scale, whereas the figure on the right excludes some of the earlier data and uses a linear scale. We can see the beginning peak activity in the upper left-hand corner, and then these interesting bands around the 5, 10, 20, 30, and 45 marks forming and gaining strength over time (particularly toward the end). Interestingly, in addition to the resurgence in near-instantaneous presses after the outage around May 18, there was also a hotspot of presses around the 45s mark close to the end of April. Alternate colouring below:

Finally, we can divide by the number of presses per day and calculate the percent each number of seconds remaining made up over the time period. That gives the figures below:

Here the flurry of activity at the beginning continues to be prominent, but the bands also stand out a little more on a daily basis. We can also see how the proportion of clicks for the smaller number of seconds remaining continues to increase until finally the timer is allowed to run out.


The button experiment is over. In the end there was no momentous meaning to it all, no grand scheme or plan, no hatch exploding into the jungle, just an announcement that the thread would be archived. Again, somewhat anti-climactic.
But it was an interesting experiment. It was also an interesting data set, because by the nature of the experiment the amount of data that could exist in a given interval of time depended on the timer itself: the higher it was kept, the more presses could occur.
And I think it really says something about what the internet allows us to do (both in terms of creating something simply for the sake of it, and collecting and analyzing data), and also about people's desire to find patterns and create meaning in things, no matter what they are. If you'd asked me, I never would have guessed religions would have sprung up around something as simple as pushing a button. But then again, religions have sprung up around stranger things.
You can read and discuss in the button aftermath thread, and if you want to have a go at it yourself, the code and data are below. Until next time I'll just keep pressing on.

References & Resources

the button press data (from reddit's github)

R code for plots