How Often Does Friday the 13th Happen?

Background

So yesterday was Friday the 13th.

I hadn’t even thought anything of it until someone mentioned it to me. They also pointed out that there are two Friday the 13ths this year: the one that occurred yesterday, and there will be another one in October.

This got me thinking: how often does Friday the 13th usually occur? What’s the greatest number of times it can occur in a year?

Sounds like questions for a nice little piece of everyday analytics.

Analysis

A simple Google search turned up a list of all the Friday the 13ths from August 2010 until the end of 2050 over at timeanddate.com. It was a simple matter to plunk that into Excel and throw together some simple graphs.
So to answer the first question, how often does Friday the 13th usually occur?
It looks like the maximum number of times it can occur per year is 3 (those are the years Jason must have a field day and things are really bad at Camp Crystal Lake) and the minimum is 1. So my hypothesis is:
a. it’s not possible to have a year where a Friday the 13th doesn’t occur, and 
b. Friday the 13th can’t occur more than 3 times in a year, due to the way the Gregorian calendar works.
Of course, this is not proof, just evidence, as we are only looking at a small slice of data.
So what is the distribution of the number of unlucky days per year?
The largest share of the years in the period have only one (18, or ~44%), but not by much, as nearly the same number have two (17, or ~41%). Far fewer have three F13ths: only 6 (~15%). Again, this could just be an artifact of the interval of time chosen, but it gives a good idea of what to expect overall.
Are certain months favoured at all, though? Does Jason’s favourite day occur more frequently in certain months?
Actually it doesn’t really appear so – they look to be spread pretty evenly across the months and we will see why this is the case below.
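The per-year and per-month counts can also be generated directly rather than copied from a list. What follows is only a minimal sketch in Python, not how the graphs in this post were made (those came from Excel); the function name `friday_13ths` and the choice of whole years 2010 to 2050 are assumptions for illustration.

```python
from collections import Counter
from datetime import date


def friday_13ths(first_year, last_year):
    """Return every Friday the 13th from first_year to last_year inclusive."""
    return [
        date(year, month, 13)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
        if date(year, month, 13).weekday() == 4  # Monday is 0, so Friday is 4
    ]


dates = friday_13ths(2010, 2050)

# Number of Friday the 13ths in each year
per_year = Counter(d.year for d in dates)

# How many years have one, two, or three of them
years_by_count = Counter(per_year.values())

# How many fall in each month (1 = January)
per_month = Counter(d.month for d in dates)

print(sorted(years_by_count.items()))
print(sorted(per_month.items()))
```

Widening the year range is an easy way to check whether the 1-to-3 pattern holds outside this particular slice of time.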
So, what if we want even more detail? When we ask how frequently Friday the 13th occurs, what we really mean is: how long is it between each occurrence? Well, that’s something we can plot over the 41-year period just by doing a simple subtraction and plotting the result.
Clearly, there is periodicity and some kind of cycle to the occurrence of Friday the 13th, as we see repeated peaks at what looks like 420 days and also at around 30 days on the low end. This is not surprising, if you think about how the calendar works, leap years, etc. 
If we pivot on the number of days and plot the result, we don’t even get a distribution that is spread out evenly or anything like that; there are only 7 distinct intervals between Friday the 13ths during the period examined:
So basically, depending on the year, the shortest time between successive Friday the 13ths will be 28 days, and the greatest will be 427 (about a year and two months), but usually it is somewhere in between, at around three, six, or eight months. It’s also worth noting that every interval is divisible by seven; this should not be surprising, since any two Fridays are a whole number of weeks apart.
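The "simple subtraction" is easy to sketch in Python as well. This is an illustration under my own assumptions (whole years 2010 to 2050, dates generated rather than taken from the timeanddate.com list), not the spreadsheet used for the plots.

```python
from collections import Counter
from datetime import date

# All Friday the 13ths from 2010 to 2050, in chronological order
dates = [
    date(year, month, 13)
    for year in range(2010, 2051)
    for month in range(1, 13)
    if date(year, month, 13).weekday() == 4  # Monday is 0, so Friday is 4
]

# Days between each occurrence and the next one
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

# How often each distinct gap occurs
gap_counts = Counter(gaps)

for days, times in sorted(gap_counts.items()):
    print(f"{days} days ({days // 7} weeks): {times} times")
```

Printing each gap in weeks makes the divisible-by-seven pattern visible at a glance.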

Conclusion

Overall, a neat little bit of simple analysis. Of course, this is just how I typically think about things, by looking at data first. I know that in this case, the occurrence of things like Friday the 13th (or say, holidays that fall on a certain day of the week or the like) is related to the properties of the Gregorian calendar and follows a pattern that you could write specific rules around if you took the time to sit down and work it all out (which is exactly what some Wikipedians have done in the article on Friday the 13th).
I’m not superstitious, but now I know when those unlucky days are coming up and so do you… and when it’s time to have a movie marathon with everyone’s favourite horror villain who wears a hockey mask.

I’m Dreaming of a White Christmas

I’m heading home for the holidays soon.

It’s been unseasonably warm this winter, at least here in Ontario, so much so that squirrels in Ottawa are getting fat. I wanted to put together a really cool post predicting the chance of a white Christmas using lots of historical climate data, but it turns out Environment Canada has already put together something like that by crunching some numbers. We can just slam this into Google Fusion tables and get some nice visualizations of simple data.

Map


It seems everything above a certain latitude has a much higher chance of having a white Christmas in recent times than places closer to the American border and on the coast, which I’m going to guess is likely due to how cold it gets in those areas on average during the winter. Sadly, Toronto has less than a coin-flip’s chance of a white Christmas in recent times, with only a 40% chance of snow on the ground come the holiday.

Chart

But just because there’s snow on the ground doesn’t necessarily mean that your yuletide weather is worthy of a Christmas storybook or holiday movie. Environment Canada also has a definition for what they call a “Perfect Christmas”: 2 cm or more of snow on the ground and snowfall at some point during the day. Which Canadian cities had the most of these beautiful Christmases in the past?

Interestingly Ontario, Quebec and Atlantic Canada are better represented here, which I imagine has something to do with how much precipitation they get due to proximity to bodies of water, but hey, I’m not a meteorologist.
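For the curious, the “Perfect Christmas” rule is simple enough to write down as a check. The sketch below is only an illustration: the function name, the tuple layout, and the observations are all made up, and treating “snowfall at some point during the day” as any amount above zero is my assumption.

```python
def is_perfect_christmas(snow_on_ground_cm, snowfall_cm):
    """2 cm or more of snow on the ground, plus some snowfall during the day."""
    return snow_on_ground_cm >= 2 and snowfall_cm > 0


# Hypothetical December 25 observations for one city:
# (year, snow on ground in cm, snowfall that day in cm)
observations = [
    (2008, 12, 0.4),
    (2009, 0, 0),
    (2010, 3, 0),
    (2011, 5, 1.2),
]

perfect_years = [
    year
    for year, ground, fall in observations
    if is_perfect_christmas(ground, fall)
]

# Share of observed years that qualify as "perfect"
share = len(perfect_years) / len(observations)
print(perfect_years, share)
```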
A white Christmas would be great this year, but I’m not holding my breath. Either way it will be good to sit by the fire with an eggnog and not think about data for a while. Happy Holidays!

Toronto Cats and Dogs II – Top 25 Names of 2014

I was quite surprised by the relative popularity of my previous analysis of the data for Licensed Cats & Dogs in Toronto for 2011, given how simple it was to put together.

I was browsing the Open Data Portal recently and noticed that there was a new data set for pets: the top 25 names for both dogs and cats. I thought this could lend itself to some quick, easy visualization and be a neat little addition to the previous post.

First we simply visualize the raw counts of the Top 25 names against each other. Interestingly, the top 2 names for both dogs and cats are apparently the same: Charlie and Max.

Next let’s take a look at the distribution of these top 25 names for each type of pet by how long they are, which just involves calculating the name length and then pooling the counts:

You can see that, proportionally, the top dog names are a bit shorter (the distribution is positively / right-skewed) compared to the cat names (slightly negatively / left-skewed). Also note that both are centered around names of length 5, and that there is one cat name of length 8 (Princess).

Looking at the dog names, do you notice something interesting about them? A particular feature present in nearly all? I did. Nearly every one of the top 25 dog names ends in a vowel. We can see this by visualizing the proportion of the counts for each type of pet by whether the name ends in a vowel or consonant:

Which, to me, seems to indicate that dogs tend to have “cutesy” names, usually ending in ‘y’, more often than cats do.
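Both calculations (pooling counts by name length, and splitting by the last letter) are only a few lines of Python. This is a rough sketch with made-up counts, not the workbook behind the charts; apart from Charlie and Max the names are placeholders, and counting ‘y’ as a vowel is an assumption made to match the grouping described above.

```python
from collections import Counter

# Hypothetical (name, count) pairs; real values come from the open data set
dog_names = [("Charlie", 120), ("Max", 110), ("Buddy", 90), ("Bella", 85)]

VOWELS = set("aeiouy")  # assumption: 'y' counts as a vowel here


def length_distribution(names):
    """Pool the counts by name length."""
    pooled = Counter()
    for name, count in names:
        pooled[len(name)] += count
    return pooled


def vowel_ending_share(names):
    """Fraction of all counted pets whose name ends in a vowel."""
    total = sum(count for _, count in names)
    vowel_total = sum(
        count for name, count in names if name[-1].lower() in VOWELS
    )
    return vowel_total / total


lengths = length_distribution(dog_names)
share = vowel_ending_share(dog_names)
print(sorted(lengths.items()), round(share, 2))
```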

Fun stuff, but one thing really bothers me… no “Fido” or “Boots”? I guess some once popular names have gone to the dogs.

References & Resources

Licensed Dog and Cat Names (Toronto Open Data)

Creepypasta – Votes vs. Rating (& learning ggplot2)

Excel:

R, base package:

R, ggplot:

Am I overfitting? Probably.

Code:
More fun stuff to come….

References

Source data at Creepypasta.com:

Code on gist:
http://gist.github.com/mylesmharrison/8886272

Creepypasta –  in list of internet phenomena (Wikipedia):
http://en.wikipedia.org/wiki/Creepypasta#Other_phenomena

Snapchat Database Leak – Visualized

Introduction

In case you don’t read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users’ details leaked due to a security flaw which was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users’ details with the intent to “raise public awareness on how reckless many internet companies are with user information.”

Which is awesome – kudos to those guys, once for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and twice for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check if your account is in the DB here.

However, if you’re a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not OMG am I safe?! but let’s do some analysis!

Analysis

Area Code
As I have noted in a previous musing, it’s difficult to do any sort of in-depth analysis if you have limited dimensionality of your data – here only 3 fields – the phone number with last two digits obscured, the username, and the area.
Fortunately, because some of the data here is geographic, we can do some cool visualization with mapping. First we look at the high-level view by state. California had the most accounts compromised overall, with just shy of 1.4M details leaked. New York State was next at just over a million.
Because the accounts weren’t spread evenly across the states, below is a more detailed view by area code. You can see that it’s mainly Southern California and the Bay Area where the accounts are concentrated.
Usernames
Well, that covers the geographic component, which leaves only the usernames and phone numbers. I’m not going to look into the phone numbers (I mean, what can you really do, other than look at the distribution of numbers – which I have a strong hypothesis about already).
Looking at the number of accounts which include numerals versus those that do not, the split is fairly even – 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:
The remainder of the tail is not present, which I assume would fill in if there were more data. I had the axis stretch to 30 for perspective as there was one username in the file of length 29.
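The username checks above boil down to two questions per name: does it contain a digit, and how long is it? Here is a small sketch of that logic in Python. The usernames are invented stand-ins (nothing from the leaked file), and the variable names are my own.

```python
from collections import Counter

# Hypothetical usernames, standing in for the username column of the file
usernames = ["sunny_day", "mike1987", "a.long.username", "kat22", "jordan"]


def has_digit(name):
    """True if any character in the name is a numeral."""
    return any(ch.isdigit() for ch in name)


with_digits = [u for u in usernames if has_digit(u)]
without_digits = [u for u in usernames if not has_digit(u)]

# Names made up of numerals only
purely_numeric = [u for u in usernames if u.isdigit()]

# Number of usernames at each length, ready for a histogram
length_counts = Counter(len(u) for u in usernames)

share_with_digits = len(with_digits) / len(usernames)
print(share_with_digits, sorted(length_counts.items()))
```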

Conclusion

If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. It is amazing how closely natural phenomena follow, or nearly follow, theoretical distributions
I’m not in the leak, so I’m not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when using smartphone apps, and ensuring there is some measure of care and accountability on the creators’ end.

Update:
Snapchat has released a new statement promising an update to the app which makes the compromised feature optional, increased security around the API, and working with security experts in a more open fashion.

Bananagrams!!!

It was nice to be home with the family for Thanksgiving, and to finally take some time off.

A fun little activity which took up a lot of our time over the past weekend was Bananagrams, which, if you don’t already know, is sort of like a more action-packed version of Scrabble without the board.

Being the type of guy that I am, I started to think about the distribution of letters in the game. A little Googling led to some prior art to this post.

The author did something neat (which I wouldn’t have thought of) by making a sort of bar chart using the game pieces. Strangely though, they chose not to graph the different distributions of letters in Bananagrams and Scrabble but instead listed them in a table.

So, assuming the data from the post are accurate, here is a quick breakdown of said distributions below. As an added bonus, I’ve also included that trendy digital game that everyone plays on Facebook and their iDevices:

Bar graph of letter frequencies of Scrabble, Bananagrams and Words with Friends

Looking at the graph, it’s clear that Bananagrams has more tiles than the other games (the total counts are 144, 104 and 100 for Bananagrams, Words with Friends and Scrabble respectively) and notably also does not have blank tiles, of which the other games have 2 each. Besides the obvious prevalence of vowels in all 3 games, T, S, R, N, L and D also have high occurrence.

We can also compare the relative frequencies of the different letters in each game with respect to Scrabble. Here I took the letter frequency for each game (as a percent) then divided it by the frequency of the same letter in Scrabble. The results are below:

Bar chart of Bananagrams and Words with Friends letter frequencies relative to Scrabble

Here it is interesting to note that the relative frequency of H in Words with Friends is nearly double that in Scrabble (~192%). Also D, S and T have greater relative frequencies. The remaining letters are fairly consistent with the exception of I and N which are notably less frequent.

Bananagrams’ relative letter frequency is fairly consistent overall, with the exception of J, K, Q, X, and Z, which are around the 140% mark. I guess the creator of the game felt there weren’t enough of the “difficult” letters in Scrabble.
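To make the relative-frequency calculation concrete, here is a short sketch for two letters. The tile totals are the ones quoted above; the per-letter tile counts are my assumption based on the commonly published distributions, so check them against the source table before relying on them.

```python
# Tile totals as quoted in the post (blanks included)
TOTAL_TILES = {"Scrabble": 100, "Words with Friends": 104, "Bananagrams": 144}

# Assumed tile counts for a couple of letters; verify against the source table
TILE_COUNTS = {
    "Scrabble": {"H": 2, "J": 1},
    "Words with Friends": {"H": 4, "J": 1},
    "Bananagrams": {"H": 3, "J": 2},
}


def relative_to_scrabble(game, letter):
    """Letter frequency in `game` as a percentage of its Scrabble frequency."""
    freq = TILE_COUNTS[game][letter] / TOTAL_TILES[game]
    base = TILE_COUNTS["Scrabble"][letter] / TOTAL_TILES["Scrabble"]
    return 100 * freq / base


h_wwf = relative_to_scrabble("Words with Friends", "H")
j_bananagrams = relative_to_scrabble("Bananagrams", "J")
print(f"H in Words with Friends: {h_wwf:.0f}% of Scrabble")
print(f"J in Bananagrams: {j_bananagrams:.0f}% of Scrabble")
```

With these counts, H in Words with Friends comes out at about 192% and J in Bananagrams at about 139%, in line with the figures read off the chart.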

There’s more analysis that could be done here (looking at the number of points per letter in WWF & Scrabble versus their relative frequency immediately comes to mind) but that should do for now. Hope you found this post “a-peeling”.

Toronto Licensed Cats & Dogs 2012 Data Visualization

It’s raining cats and dogs! No, I lied, it’s not.

But I wanted to do some more data viz and work with some more open data.

So for this quick plot I present, Cat and Dog Licenses in the City of Toronto for 2012, visualized!


Above in the top pane is the number of licensed cats and dogs per postal code (or Forward Sortation Area, FSA). I really would like to have produced a filled map (choropleth) with the different postal code areas; however, Tableau unfortunately does not have Canadian postal code boundaries, just lat/lon, and getting geographic data in is a bit of an arduous process.

I needed something to plot given that I just had counts of cat and dog licenses per FSA, so threw up a scatter and there is amazing correlation! Surprise, surprise – this is just the third variable, and I bet that if you found a map of (human) population density by postal code you’d see why the two quantities are so closely related. Or perhaps not – this is just my assumption – maybe some areas of the GTA are better about getting their pets licensed or have more cats and dogs. Interesting food for thought.
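If you wanted to put a number on that scatter, the usual choice is the Pearson correlation coefficient. The sketch below computes it by hand in Python; the license counts are invented for illustration (the real ones live in the open data set), and the scatter in this post was eyeballed in Tableau, not computed this way.

```python
from math import sqrt

# Hypothetical license counts per FSA: (cats, dogs)
licenses = {
    "M4E": (100, 250),
    "M6P": (200, 420),
    "M4K": (300, 610),
    "M1B": (400, 800),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


cats = [c for c, _ in licenses.values()]
dogs = [d for _, d in licenses.values()]

r = pearson(cats, dogs)
print(f"r = {r:.3f}")
```

A value near 1 only says the two counts rise together; it says nothing about whether population is the variable driving both.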


Above is the number of licenses per breed type. Note that the scale is logarithmic for both as the “hairs” (domestic shorthair, domestic mediumhair and domestic longhair) dominate for cats and I wanted to keep the two graphs consistent.

The graphs are searchable by keyword, try it out!

Also, I find it shocking that the second most popular breed of dog was Shih Tzu and the fourth most popular type of cat was Siamese – really?

Resources

Toronto Licensed Cat & Dog Reports (at Toronto Open Data Portal)

Toronto Animal Services
http://www.toronto.ca/animal_services/

Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA, without a hint of skepticism or irony), people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can go around really calling themselves a data scientist because no one even really knows what data science is yet, there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve done the first real work which I think falls under the category of data science.

I’m no Python guru, but threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time; that’s for a later article. For now, here’s one of my exploratory plots, of the content rating by date. Already we can gather from this that, at least at Newgrounds, 4-and-a-half stars equals perfection.

Sure feels like science.

OECD Data Visualization Challenge: My Entry

The people behind Visualising are doing some great things. As well as providing access to open data sets, an embeddable data visualization player, and a massive gallery of data visualizations, they are building an online community of data visualization enthusiasts (and professionals) with their challenges and events.

In addition, those behind it (Seed and GE) are also connecting people in the real world with their data visualization marathons for students, which are looking to be the dataviz equivalent of the ever-popular hackathons held around the world. As far as I know, no one else is really doing this sort of thing, with a couple of notable exceptions – for example, Datakind and their Data Dive events (these are not strictly visualization-focused, however).

Okay, enough copious hyperlinking.

The latest challenge was to visualize the return on education around the world using some educational indicators from the OECD, and I thought I’d give it a go in Tableau Public.

For my visualization I chose to highlight the differences in the return on education not only between nations, but also the gender-based differences for each country.

I incorporated some other data from the OECD portal on GDP and public spending on education, and so the countries included are those with data present in all three sets.

The World Map shows the countries, coloured by GDP. The bar chart to the right depicts the public spending on education, both tertiary (blue) and non-tertiary (orange), as a percentage of GDP.

The scatterplots contrast both the gender-specific benefit-cost ratios per country, as well as between public (circles) and private (squares) benefit, and between the levels of education. A point higher up on the plots and to the left has a greater benefit-cost ratio (BCR) than a point lower and to the right, which represents a worse investment. The points are sized by internal rate-of-return (ROR).

All in all it was fun and not only did I learn a lot more about using Tableau, it gave me a lot of food for thought about how to best depict data visually as well.