Daily Case Counts Are Not Meaningful

In Part 2 of my posts on COVID-19 in Ontario, I said that for Part 3 we’d be taking a look at the updated case statuses, as well as hospitalizations. However, I’d like to put that on hold for a moment to instead address something which I think matters far more, and which worryingly continues to occur.

I will also preface this post with the same caveats and disclaimers as for my other analyses on this topic related to health and disease:

  • I am not an epidemiologist, nor am I a subject matter expert on disease or public health policy
  • All handling of the data / code / statistics, interpretations thereof, and thoughts expressed are my own and only my own
  • This post may contain errors or omissions given the above which are only my own

I intentionally chose a rather inflammatory title for this post because I feel this point needs to be made, and made strongly. As opposed to my usual writing style, I will not have an introduction and background, but instead state the bottom line up front:

Looking at daily case counts for COVID-19 alone is, at best, uninformed and naïve, and at worst, highly misleading. 

In fact, I will illustrate that:

  • Apparent exponential growth in positive cases could be explained by the growth in testing a population with a set amount of disease present
  • What might appear to be large daily changes in the absolute number of cases can be duplicated as nothing more than statistical noise due to sampling
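
Both points can be sketched with a small simulation. This is only a minimal illustration under made-up numbers, not an analysis of the Ontario data: the 2% prevalence, the 20% daily growth in testing, the flat 5,000 tests per day, and all names below are assumptions chosen for the example.

```python
import random


def simulate_positive_counts(prevalence, tests_per_day, seed=0):
    """Return the number of positive tests for each day.

    Each test is an independent draw that comes back positive with
    probability `prevalence` (an assumed, constant level of disease).
    """
    rng = random.Random(seed)
    return [
        sum(rng.random() < prevalence for _ in range(n_tests))
        for n_tests in tests_per_day
    ]


# Assumed values for illustration only; none of these come from real data.
PREVALENCE = 0.02
DAYS = 20

# Scenario 1: testing grows 20% per day while the level of disease is fixed.
tests_growing = [round(100 * 1.2 ** day) for day in range(DAYS)]
cases_growing = simulate_positive_counts(PREVALENCE, tests_growing)

# Scenario 2: testing is flat, so day-to-day changes are sampling noise only.
tests_flat = [5000] * DAYS
cases_flat = simulate_positive_counts(PREVALENCE, tests_flat, seed=1)

for day in range(DAYS):
    print(day, tests_growing[day], cases_growing[day], cases_flat[day])
```

In the first scenario the positive counts climb along with the number of tests even though the prevalence never changes. In the second, the counts fluctuate around 100 per day (5,000 × 0.02) purely by chance.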


Analysis and Visualization of the Ontario COVID-19 Data (Part 2)


In Part 1, I looked at the trends and did some visual analysis of the COVID-19 case status data for Ontario. This also required some rather involved data acquisition, as the information on COVID for Ontario was only available on a daily basis on the Ministry of Health’s official COVID page.

Since then, the data has finally been made open and is now available on the Ontario Open Data page:

Status of cases in Ontario: https://data.ontario.ca/dataset/status-of-covid-19-cases-in-ontario
Confirmed Positive cases: https://data.ontario.ca/dataset/confirmed-positive-cases-of-covid-19-in-ontario

The second of these contains important information, such as geography and demographics, that I lamented I could not locate previously.

So here for Part 2, we will dive into the confirmed cases data to see if there are any simple insights we can uncover about those positive cases of COVID in Ontario.

The same disclaimers from Part 1, and then some, also hold for this post (I am not an epidemiologist, interpretations of the data and opinions are my own, the post may contain errors, I am not associated with the Ministry of Health, etc.)


Tracking and Visualization of Ontario COVID-19 Case Statuses

  • I am not an epidemiologist, nor am I a subject matter expert on disease or public health policy
  • All handling of the data, interpretations thereof, and thoughts expressed are my own and only my own
  • Interpretation and handling of the data may contain errors or omissions given the above


Everyone’s talking about coronavirus. It’s hard not to think about and hard not to read about on a daily basis. In fact, it’s almost impossible not to: it’s top of mind for everyone, the only thought permeating our collective consciousness at this moment in time.

One result of everything that’s happening right now is that there’s a lot of data flying around out there, and a lot of articles being written, and a lot of analysis and visualization work being done, as everyone is trying to make sense of this whole business as it unfolds in real-time.

So I thought it was time to finally break down and weigh in myself. And then I thought twice, and thought it would be better not just to do that, but instead to try to make a contribution in some way.

I’m not a whole department, and I don’t think I can build anything like what Johns Hopkins CSSE has. Nor am I the CDC. They have produced some good visualizations around COVID as well. So I thought I’d tackle something simpler and closer to home. Something I keep reading about, which is how Toronto and Ontario are handling the situation.


How Does SPF Work?

So I went on vacation recently, which was nice.

One of the conversations that came up, which I’m sure does for many folks on vacation, was around the application of sunscreen. How often should you re-apply? How long will SPF 50 last vs. SPF 15? And then as we were talking, an even more fundamental question arose – what the hell is SPF anyway, and how does it work?

I’d always assumed in the past, as I assume many other people do, that it was a linear scale, so that SPF 60 was 4x ‘as good’ as SPF 15. Someone in our group also said that the number was supposed to be a measure of duration for sun exposure. So, for SPF 60 you could go in direct sunlight for an hour longer than you would normally without burning, whereas for SPF 15 it’d only be a quarter of that.

As it turns out, neither of these things is true.


Are the dice in Mario Party fair?

Over the holidays I was playing a lot of games with friends and family as one does, and one of those games was Super Mario Party for Nintendo Switch.

Now what’s interesting about this game is that, in addition to requiring dice rolls like any other board game, depending upon your character (or various ‘allies’ you can acquire when you team up with other playable characters and get the option to use their dice in addition to a bonus) you can choose to use different character-specific dice which are unique and have different values than a standard one.

Super Mario Party, with Mario holding his custom dice

So, being the guy that I am, this got me wondering: are all the different dice for the different characters ‘fair’? If your goal is to traverse the maximum number of spaces (as it often is), are any of the dice better to use on average than the others?
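
One way to frame “better on average” is to compare the expected value (and spread) of each die. The sketch below does this for a standard six-sided die and one custom die; the custom faces are hypothetical, made up purely for illustration, and are not taken from the game.

```python
from fractions import Fraction


def expected_value(faces):
    """Mean roll of a die, assuming every face is equally likely."""
    return Fraction(sum(faces), len(faces))


def variance(faces):
    """Variance of a roll: E[X^2] - (E[X])^2."""
    mean_of_squares = Fraction(sum(f * f for f in faces), len(faces))
    return mean_of_squares - expected_value(faces) ** 2


standard_die = [1, 2, 3, 4, 5, 6]
# Hypothetical custom die, invented for this example only.
hypothetical_die = [1, 3, 3, 3, 5, 6]

for name, faces in [("standard", standard_die), ("hypothetical", hypothetical_die)]:
    print(name, float(expected_value(faces)), float(variance(faces)))
```

Two dice can share the same average roll while differing in how variable they are, so the mean alone may not settle which die is preferable in a given situation.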


Are We Solving The Wrong Problems With Machine Learning?

Let’s talk about corn.

Corn and how it gets from growing in fields onto your table.

Below is a video of a corn harvesting machine:

And here is a video of people gathering corn:

So, I hear you say, what all does this have to do with machine learning?

A lot, as it so happens.


Training an RNN on the Archer Scripts


So all the hype these days is around “AI”, as opposed to “machine learning” (though I’ve yet to hear an exact distinction between the two), and one of the tools that seems to get talked about most is Google’s TensorFlow.
I wanted to play around with TensorFlow and RNNs a little bit, since they’re not the type of machine learning I’m most familiar with, and with a low investment of time see what kind of outputs I could come up with.


A little digging and I came across this tutorial, which is a pretty good brief introduction to RNNs, uses Keras, and computes things character-wise.
This in turn led me to word-rnn-tensorflow, which, expanding on the work of others, uses a word-based model (instead of a character-based one).
I wasn’t about to spend my whole weekend rebuilding RNNs from scratch (no sense reinventing the wheel), so I just thought it’d be interesting to play around a little with this one, and perhaps give it a more interesting dataset. Shakespeare is ok, but why not something a little more culturally relevant… like I dunno, say the scripts from a certain cartoon featuring a dysfunctional foul-mouthed spy agency?


When to Use Sequential and Diverging Palettes


I wanted to take some time to talk about an important rule for the use of colour in data visualization. 
The more I’ve worked in visualization, the more I have come to feel that one of the most overlooked and under-discussed facets (especially for novices) is the use of colour. A major pet peeve of mine, and a mistake I see all too often, is the use of a diverging palette instead of a sequential one or vice-versa. 
So what is the difference between a sequential and diverging palette, and when is it correct to use each? The answer is one that arises very often in visualization: it all depends on the data, and what you’re trying to show.

Sequential vs. Diverging Palettes

First of all, let’s define what we are discussing here. 
Sequential Palettes
A sequential palette ranges between two shades of (typically) one “main” colour, going from white or a lighter shade to a darker one, by varying one or more of the parameters in the HSV/HSL colour space (usually saturation, value/lightness, or both). 
For me, at least, varying hue means going between two very distinct colours and is usually not good practice if your data vary linearly, as the result is much closer to a diverging palette, which we will discuss next. There are other reasons why this is bad visualization practice and, of course, exceptions to this rule, which we will discuss later in the post.
A sequential palette (generated in R)
Diverging Palettes
In contrast to a sequential palette, a diverging palette ranges between three or more colours with the different colours being quite distinct (usually having different hues). 
While technically a diverging palette could have as many colours as you’d like (such as the rainbow palette, which is the default in some visualization tools like MATLAB), diverging palettes usually range only between two contrasting colours at either end, with a neutral colour or white in the middle separating the two.
A diverging palette (generated in R)

When to Use Which

So now that we’ve defined the two different palette types of interest, when is it appropriate and inappropriate to use them?

The rule for the use of diverging palettes is very simple: they should only be used when there is a value of importance around which the data are to be compared.

This central value is typically zero, with negative values corresponding to one hue and positive the other, though this could also be done for any other value, for example, comparing numbers around a measure of central tendency or reference value.
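
To make the distinction concrete, here is a minimal sketch of how each palette type might map a data value to a colour. It is not how Tableau or R implement their palettes; the function names, the simple linear RGB interpolation, and the specific colours are all assumptions made for illustration.

```python
def lerp_colour(c0, c1, t):
    """Linearly interpolate between two RGB tuples; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def sequential(value, vmin, vmax, light=(237, 248, 233), dark=(0, 109, 44)):
    """One hue, light to dark, across the whole range of the data."""
    return lerp_colour(light, dark, (value - vmin) / (vmax - vmin))


def diverging(value, vmin, vmax, midpoint=0.0,
              low=(202, 0, 32), neutral=(247, 247, 247), high=(5, 113, 176)):
    """Two hues that fade to a neutral colour at a meaningful midpoint."""
    if not vmin < midpoint < vmax:
        raise ValueError("midpoint must lie strictly between vmin and vmax")
    if value < midpoint:
        return lerp_colour(neutral, low, (midpoint - value) / (midpoint - vmin))
    return lerp_colour(neutral, high, (value - midpoint) / (vmax - midpoint))


# Always-positive metric: a sequential palette, no midpoint needed.
print(sequential(75_000, 0, 150_000))
# Metric that can be negative or positive: diverging around zero.
print(diverging(-40, -100, 50), diverging(0, -100, 50), diverging(25, -100, 50))
```

Note that the diverging function scales each side of the midpoint separately, so the neutral colour always lands exactly on the chosen central value even when the range is not symmetrical around it.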

A Simple Example
For example, looking at the Superstore dataset in Tableau, a visualizer might be tempted to make a map such as the one below, with colour encoding the total sales in each city:

Here points on the map correspond to the cities and are sized by total number of sales and coloured by total sales in dollars. Looks good, right? The cities with the highest sales clearly stick out in the green against the dark red?

Well, yes, but do you see a problem? Look at the generated palette:

The scale ranges from the minimum sales in dollars ($4.21) to max (~$155K), so we cover the whole range of the data. But what about the midpoint? It’s just the dead center point between the two, which doesn’t correspond to anything meaningful in the data – so why would the hue change from red to green at that point?

This is a case which is better suited to a sequential palette, since all the values are positive and we’re not highlighting a meaningful value which the range of data falls around, as below:

Here, the range is fully covered and there is no midpoint, and the palette ranges from light green to dark. The extreme values still stand out in dark green, however there is no well-defined center where the hue arbitrarily changes, so this is a better choice.

There are other ways we could improve this visualization’s encoding of quantity as colour, for one, by using endpoints that would be more meaningful to business users instead of just the range of the data (say, $0 to $150K+), and another which we will discuss later.

Taking a look at the two palettes together, it’s clearer which is a better choice for encoding the always positive value of the metric sales dollars across its range:

Going Further
Okay, so when would we want to use a diverging palette? As per the rule, if there was a meaningful midpoint or other important value you wanted to contrast the data around.

For example, in our Superstore data, sales dollars are always positive, but profit can be positive or negative, so it is appropriate to use a diverging palette in this case, with one hue corresponding to negative values and another to positive, and the neutral colour in the middle occurring at zero:

Here it is very clear which values fall at the extremes of the range, but also which are closer to the meaningful midpoint (zero): that one city in Montana is in the negative, and the others don’t seem to be very profitable either; we can tell they are close to zero by how washed out their colours are.

Tableau is smart enough to know to set the midpoint at zero for our diverging palette. Again, you could tinker with the range to make the end-points more meaningful (e.g. round values), as well as varying the range: sometimes a symmetrical range for a diverging palette is easier to interpret from a numerical standpoint, though of course you have to keep in mind how this is going to affect the perceptual salience of the colour values for the corresponding data.

So could we use a diverging palette for the always positive sales data? Sure. There just needs to be a point around which we are comparing the values. For example, I happen to know that the median sales per city over the time period in question is $495.82 – this would be a meaningful value to use for the midpoint of a diverging palette, and we can redo our original sales map as such:

Now we have a better version of our original sales map, where the cities coloured in red are below the median value per city, and those coloured in green are above. Much better!

But now something strange seems to be going on with the palette – what’s that all about?

No Simple Answers
So what is going on with the palette in the last map from our example above? And what of my promise to discuss other ways the palette scaling can be improved, and of exceptions to the rule of not using differing hues in a continuous scale?

Well, the reason that the map looks good above but the scale looks wrong has to do with how the data are distributed: the distribution of sales by city is not normal, but follows a power law, with most of the data falling in the low end, so our palette looks the same when the colours are scaled linearly with the data:

One way to fix this is to transform the data by taking the log, after which the resulting palette looks more like we’d expect:

Though of course now the range is between transformed values. It’s interesting to note that in this case the midpoint comes out being nearly correct automatically (2.907 vs. log10(495.82) ~= 2.695).
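
The numbers quoted above can be checked with a few lines of arithmetic. This sketch assumes base-10 logs and uses the approximate maximum of $155K from the earlier palette, so the result is only approximate; under those assumptions, the automatic midpoint appears consistent with being the center of the log-transformed range.

```python
import math

# Values taken from the text; the maximum is approximate (~$155K).
min_sales = 4.21
max_sales = 155_000
median_sales = 495.82

log_min = math.log10(min_sales)
log_max = math.log10(max_sales)

# Center of the transformed range, i.e. where a default midpoint would fall.
log_midpoint = (log_min + log_max) / 2
# Where the meaningful value (the median) falls on the transformed scale.
log_median = math.log10(median_sales)

print(f"midpoint of log range: {log_midpoint:.3f}")
print(f"log10 of median:       {log_median:.3f}")
```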

Further complicating all this is the fact that human perception of colour is not linear, but follows something like the Weber-Fechner law depending on the various properties. Robert Simmon wrote on this in an excellent series of posts while he was at NASA, which is definitely worth a read (and multiple re-reads).

There he also describes an exception to my statement that you shouldn’t use continuous palettes with different hues, as sometimes even that can be appropriate; see the section on figure-ground, where he discusses earth surface temperature.


So there you have it. Once again: use diverging palettes only when there is a meaningful point around which you want to contrast the other values in your data.

Remember, it all depends on the data. What is the ideal palette for a given data set, and how should you choose it? That’s not an easy question to answer; it is always left up to the visualization practitioner, and answering it well only comes with knowledge of proper visualization technique and the theoretical foundations behind it.

There are no right or wrong answers, only better or worse choices. It’s all about the details.

References and Resources

Subtleties of Colour (by Robert Simmon)
Understanding Sequential and Diverging Palettes in Tableau
How to Choose Colours for Maps and Heatmaps

How Often Does Friday the 13th Happen?


So yesterday was Friday the 13th.

I hadn’t even thought anything of it until someone mentioned it to me. They also pointed out that there are two Friday the 13ths this year: the one that occurred yesterday, and there will be another one in October.

This got me thinking: how often does Friday the 13th usually occur? What’s the greatest number of times it can occur in a year?

Sounds like questions for a nice little piece of everyday analytics.


A simple Google search revealed a list of all the Friday the 13ths from August 2010 up until the end of 2050 over at timeanddate.com. It was a simple matter to plunk that into Excel and throw together some simple graphs.
So to answer the first question, how often does Friday the 13th usually occur?
It looks like the maximum number of times it can occur per year is 3 (those are the years Jason must have a heyday and things are really bad at Camp Crystal Lake) and the minimum is 1. So my hypothesis is:
a. it’s not possible to have a year where a Friday the 13th doesn’t occur, and 
b. Friday the 13th can’t occur more than 3 times in a year, due to the way the Gregorian calendar works.
Of course, this is not proof, just evidence, as we are only looking at a small slice of data.
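
The same counts can be reproduced without a downloaded list, using Python’s standard `datetime` module. This is just a sketch: it uses whole calendar years from 2010 to 2050 as a stand-in for the window described above, and like the list, it is evidence for the hypothesis over that window rather than a proof.

```python
from collections import Counter
from datetime import date

START_YEAR, END_YEAR = 2010, 2050  # whole years; an assumed stand-in window

# date.weekday() returns 0 for Monday, so Friday is 4.
friday_13ths = [
    date(year, month, 13)
    for year in range(START_YEAR, END_YEAR + 1)
    for month in range(1, 13)
    if date(year, month, 13).weekday() == 4
]

# Number of Friday the 13ths in each year (a missing year would mean zero).
per_year = Counter(d.year for d in friday_13ths)
# How many years have 1, 2, or 3 of them.
distribution = Counter(per_year.values())

print("fewest in a year:", min(per_year.values()))
print("most in a year:  ", max(per_year.values()))
print("years by count:  ", sorted(distribution.items()))
```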
So what is the distribution of the number of unlucky days per year?
The largest share of the years in the period have only one (18, or ~44%) but not by much, as nearly the same number have 2 (17, or ~41%). Far fewer have 3 F13ths, only 6 (~15%). Again, this could just be an artifact of the interval of time chosen, but gives a good idea of what to expect overall.
Are certain months favoured at all, though? Does Jason’s favourite day occur more frequently in certain months?
Actually it doesn’t really appear so – they look to be spread pretty evenly across the months and we will see why this is the case below.
So, what if we want even more detail? When we ask how frequently Friday the 13th occurs, what if we mean how long it is between each occurrence? Well, that’s something we can plot over the 41-year period just by doing a simple subtraction and plotting the result.
Clearly, there is periodicity and some kind of cycle to the occurrence of Friday the 13th, as we see repeated peaks at what looks like 420 days and also at around 30 days on the low end. This is not surprising, if you think about how the calendar works, leap years, etc. 
If we pivot on the number of days and plot the result, we don’t even get a distribution that is spread out evenly or anything like that; there are only 7 distinct intervals between Friday the 13ths during the period examined:
So basically, depending on the year, the shortest time between successive Friday the 13ths will be 28 days, and the greatest will be 427 (about a year and two months), but usually it is somewhere in-between at around either three, six, or eight months. It’s also worth noting that every interval is divisible by seven; this should not be surprising at all either, for obvious reasons.
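
The “simple subtraction” can be sketched in Python as well. Again this assumes whole calendar years from 2010 to 2050 rather than the exact window of the original list, so the set of intervals it prints may not match the list exactly.

```python
from datetime import date

# All Friday the 13ths in chronological order (weekday() == 4 is Friday).
friday_13ths = [
    date(year, month, 13)
    for year in range(2010, 2051)
    for month in range(1, 13)
    if date(year, month, 13).weekday() == 4
]

# Days between each Friday the 13th and the next one.
gaps = [(later - earlier).days
        for earlier, later in zip(friday_13ths, friday_13ths[1:])]

print("distinct intervals (days):", sorted(set(gaps)))
print("shortest:", min(gaps), "longest:", max(gaps))
print("all divisible by 7:", all(gap % 7 == 0 for gap in gaps))
```

The divisibility by seven follows directly from the definition: two dates that fall on the same day of the week are always a whole number of weeks apart.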


Overall, a neat little bit of simple analysis. Of course, this is just how I typically think about things, by looking at data first. I know that in this case, the occurrence of things like Friday the 13th (or say, holidays that fall on a certain day of the week or the like) is related to the properties of the Gregorian calendar and follows a pattern that you could write specific rules around if you took the time to sit down and work it all out (which is exactly what some Wikipedians have done in the article on Friday the 13th).
I’m not superstitious, but now I know when those unlucky days are coming up and so do you… and when it’s time to have a movie marathon with everyone’s favourite horror villain who wears a hockey mask.

Top 100 CEOs in Canada by Salary 2008-2015, Visualized

I thought it’d been a while since I’d done some good visualization work with Tableau, and noticed that this report from the Canadian Centre for Policy Alternatives was garnering a lot of attention in the news.

However, most of the articles about the report did not have any graphs and simply restated data from it in narrative to put it in context, and I found the visualizations within the report itself to be a little lacking in detail. It wasn’t a huge amount of work to extract the data from the report and quickly throw it into Tableau, and put together a cohesive picture using the Stories feature (best viewed on Desktop at 1024×768 and above).

See below for the details; it’s pretty staggering, even for some of the bottom earners. To put things in context, the top earner made $183M a year all-in, which, if you work 45 hours a week and only take two weeks of vacation per year, translates to about $81,000 an hour.
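
For anyone who wants to check the hourly figure, the arithmetic can be sketched as follows, assuming a 52-week year and taking the $183M, 45 hours a week, and two weeks of vacation from the text above.

```python
total_compensation = 183_000_000  # dollars per year, all-in (from the report)
hours_per_week = 45
weeks_worked = 52 - 2             # assumes a 52-week year, two weeks off

hours_per_year = hours_per_week * weeks_worked       # 2,250 hours
hourly_rate = total_compensation / hours_per_year

print(f"${hourly_rate:,.0f} per hour")
```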

Geez, looks like I need to get into a new line of work.