Don’t Do Journey: Karaoke and a Data Analysis Musing

“DON’T DO JOURNEY!!” The look of terror and disbelief in her eyes was both sudden and palpable.

What can I say? People feel very strongly about karaoke. Ever since this joy/terror was gifted/unleashed upon the world, it seems that there is no shortage of people who have very strong feelings about it.

It’s kind of a love/hate relationship. People love it. Or they hate it. Or they love to hate it. Or they hate the fact that they love it. Either way, it’s kind of surprising how polarizing it can be.

There’s a place here in Toronto that’s quite popular for it. Well, actually I don’t know how popular it is, but they do have it five nights a week. As I was looking at their website one day, I had one of these oh, neat moments – the contents of their entire karaoke songbook, a list of all 32,636 songs, is available in PDF format.

Slam that into a PDF to CSV converter…. tidy up a little, and we’ve got data!

So which artists have the most songs available to sing at the Fox, if you happen to be feeling courageous enough? The Top 10:

Hail to The King, baby.

Traditional? Standard? What the heck? I’ve never even heard of those artists! Are those some 70’s rock bands like The Eagles or…. oh, right. That makes sense. Really, traditional and standard should be the same category.

After traditional songs, no one can touch The King, followed by Ol’ Blue Eyes with about half as many songs. Just in case you were wondering, the next 10 spots after Celine Dion are a lot of country followed by The Stones.
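For the curious, the tally behind a top-10 list like this can be sketched in a few lines of Python. This is only an illustration, not the method actually used for the chart: the rows below are made-up stand-ins for the converted songbook, and the column names (Artist, Title, Song ID) are assumed from the three fields the list contains.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the converted songbook CSV.
SAMPLE_CSV = """Artist,Title,Song ID
Elvis Presley,Hound Dog,1001
Elvis Presley,Jailhouse Rock,1002
Frank Sinatra,My Way,1003
Traditional,Auld Lang Syne,1004
Traditional,Amazing Grace,1005
Traditional,Danny Boy,1006
"""


def top_artists(csv_text, n=10):
    """Return the n artists with the most songs as (artist, count) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["Artist"].strip() for row in reader)
    return counts.most_common(n)


for artist, count in top_artists(SAMPLE_CSV):
    print(f"{artist}: {count}")
```

With the real file you would pass in its contents instead of SAMPLE_CSV; the counting itself stays the same.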

And that, unfortunately, is it. Which brings us to my musing on data analysis.

On a very simplistic high level, you could say that there are 3 steps to data analysis:

1. Get the data
2. Make with the analysis
3. Write up report/article/paper/post for management/news outlet/academic journal/blog

And like I said, that is a massive oversimplification. Because really, you can break each step into many sub-steps, which don’t necessarily flow in order and could be iterative. For example, Step 1:

1a. Get the data
1b. Decide if there are any other data you need
1c. Get that data 
1d. Clean and process data in usable format
1e. ….

Et cetera. My roommate and I were having a discussion on these matters, and he quite astutely pointed out that many people take Step 1 for granted. Worse yet, some don’t appreciate that there is more to Step 1 than 1a.

And that is why this is another short post with only one graph. Because there’s only so much analysis you can do with Artist, Title and Song ID. There are options for pulling in a whole bunch more data: Gracenote (but they appear to be a bit stingy with their API), freedb, MusicBrainz, and Discogs. But I’m not going to set up a local SQL server or write a bunch of code right now, though it would be interesting to see an in-depth analysis taking into consideration things like song length, year, genre, and lyric content, to name a few.

As my roommate and I were talking, he pointed out that if you had a karaoke machine (actually I think it’s computers with iTunes now) which kept track of all the songs picked, there’d be something more interesting to analyze: What is the distribution of the popularity of songs? How frequently are different songs of different genres and years picked?

We agreed that it’s most likely exponential (as many things are) – Don’t Stop Believin’ probably gets picked almost once a night, but there are likely many, many other songs that have never been (and probably never will be) picked. And lastly, I’m always left wondering, how many singers are actually in tune for more than half the song?

FBI iPhone Leak Breakdown

Don’t know if you heard, but something that is making the news today is that hacker group AntiSec purportedly gained control of an FBI agent’s laptop and got a hold of 12 million UDIDs which were apparently being tracked.

A UDID is Apple’s unique identifier for each of its ‘iDevices’, and if known could be used to get a lot of personally identifiable information about the owner of each product.

The hackers released the data on pastebin here. In the interests of protecting the privacy of the users, they removed all said personally identifiable information from the data. This is kind of a shame in a way, as it would have been interesting to do an analysis of the geographic distribution of the devices which were (allegedly) being tracked, amongst other things. I suppose they released the data for more (allegedly) altruistic purposes – i.e. to let people find out if the FBI was tracking them, not to have the data analyzed.

The one useful column that was left was the device type. Surprisingly, the majority of devices were iPads. Of course, this could just be unique to the million and one records of the 12 million which the group chose to release.

Breakdown:
iPhone: 345,384 (34.5%)
iPad: 589,720 (59%)
iPod touch: 63,724 (6.4%)
Undetermined: 1,173 (0.1%)
Total: 1,000,001
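
The percentages above are easy to double-check. Here is a minimal Python sketch that recomputes them from the counts in the breakdown; only the counts come from the data, the variable names are mine.

```python
# Device counts as listed in the breakdown above.
counts = {
    "iPhone": 345_384,
    "iPad": 589_720,
    "iPod touch": 63_724,
    "Undetermined": 1_173,
}

total = sum(counts.values())

# Share of each device type as a percentage of all released records.
shares = {device: 100 * n / total for device, n in counts.items()}

for device, share in shares.items():
    print(f"{device}: {counts[device]:,} ({share:.1f}%)")
print(f"Total: {total:,}")
```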

Forgive me Edward Tufte, for using a pie chart.

Let’s Go To The Ex!

I went to The Ex (that’s the Canadian National Exhibition for those of you not ‘in the know’) on Saturday. I enjoy stepping out of the ordinary from time to time and carnivals / fairs / midways / exhibitions etc. are always a great way to do that.

As far as exhibitions go, I believe the CNE is one of the more venerable – it’s been around since 1879 and attracts over 1.3 million visitors every year.

Looking at the website before I went, I saw that they had a nice summary of all the ride height requirements and number of tickets required. I thought perhaps the data could stand to be presented in a more visual form.

First, how about the number of tickets required for the different midways? All of the rides on the ‘Kiddie’ Midway require four tickets, except for one (The Wacky Worm Coaster). The Adult Midway rides are split about 50/50 for five or six tickets, except for one (Sky Ride) which only requires four.

With tickets being $1.50 each, or $1 if you buy them in sets of 22 or 55, that makes the ride price range $6-9 or $4-6. Assuming you buy the $1 tickets, the average price of an adult ride is $5.42 and the average price of a child ride $4.04.
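
The price ranges follow directly from the ticket counts. As a rough Python sketch (the constant and function names are just for illustration):

```python
SINGLE_PRICE = 1.50  # dollars per ticket when bought individually
BULK_PRICE = 1.00    # dollars per ticket when bought in sets of 22 or 55

# Rides need four, five or six tickets.
TICKETS_PER_RIDE = (4, 5, 6)


def ride_costs(price_per_ticket):
    """Dollar cost of a ride for each possible ticket requirement."""
    return [tickets * price_per_ticket for tickets in TICKETS_PER_RIDE]


single = ride_costs(SINGLE_PRICE)
bulk = ride_costs(BULK_PRICE)

print(f"Single tickets: ${min(single):.2f} to ${max(single):.2f} per ride")
print(f"Bulk tickets:   ${min(bulk):.2f} to ${max(bulk):.2f} per ride")
```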

The rides also have height requirements. Note that I’ve simplified things by taking the max height for cases where shorter/younger kids can ride supervised with an adult. Here’s a breakdown of the percentage of the rides in each midway type children can ride, given their height:

Google Docs does not allow non-stacked stepped area charts, so line graph it is.

And here’s the same breakdown with percentage of the total rides (both midways combined), coloured by type. This is a better way to represent the information, as it shows the discrete nature of the height requirement:

Basically if your child is over 4′ they are good for about 80% of all the rides at the CNE.

Something else to consider – how do you get maximum value for your tickets with none left over, given that they are sold in packs of 22 and 55? I would say go with the $36 all-you-can-ride option. Also, how minuscule are your actual odds of winning those carnival games? Because I want a giant purple plush gorilla.
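
If you did want to solve the no-tickets-left-over puzzle, a brute-force search is plenty. The Python sketch below assumes every ride costs four, five or six tickets (the cost of the one kiddie exception is not given above, so it is ignored) and lists every mix of rides that uses a pack exactly.

```python
def exact_combos(pack_size, ride_costs=(4, 5, 6)):
    """List every (n4, n5, n6) ride count that uses exactly pack_size tickets."""
    four, five, six = ride_costs
    combos = []
    for n6 in range(pack_size // six + 1):
        for n5 in range((pack_size - six * n6) // five + 1):
            remaining = pack_size - six * n6 - five * n5
            # Whatever is left must be spent entirely on four-ticket rides.
            if remaining % four == 0:
                combos.append((remaining // four, n5, n6))
    return combos


for pack in (22, 55):
    print(f"Pack of {pack}: {len(exact_combos(pack))} ways to spend it exactly")
```

For the pack of 22, for example, three four-ticket rides plus two five-ticket rides comes out even.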

See you next year!

11 Million Yellow Slips – City of Toronto Parking Tickets, 2008-2011

Introduction

I don’t know about you, but I really hate getting parking tickets. Sometimes I feel like it’s all just a giant cash grab. Really? I can’t park there between the hours of 11 and 3, but every other time is okay? Well, why the hell not?

But ah, such is life. Rules must be in place to keep civil order, keep the engines of city life running and prevent total chaos in the downtown core. However, knowing this does not make coming out to the street to find that bright yellow slip of paper under your windshield wiper any easier.

Like everything else in the universe, parking tickets are a source of data. The great people at Open Data Toronto (@Open_TO) have provided all the data from every parking ticket issued in Toronto from 2008 to the end of last year.

So, let us dive in and have a look. We might just discover why we keep getting all these tickets, or at least ease the collective pain a little in realizing how many others are sharing in it.

Background

The data set is an anonymized record of every parking ticket issued in the city of Toronto from the period 01/01/2008 – 12/31/2011. The fields provided are: the anonymized ticket #, date of infraction, infraction code, description, fine amount, time of infraction, and location (address).

The data set and more information can be found in Open Data Toronto’s data catalogue here.

Originally I had this brilliant idea to geocode every data point, and then create an awesome heat map of the geographical distribution of parking tickets issued. However, given that there are ~11 million records and the Google Maps API has a limit of 2,500 geocoding requests per day, even if I were completely diligent and performed the task daily it would still take approximately 4,400 days, or about 12 years, to complete. And no, I am not paying to use the API for Business (which at a limit of 100,000 requests per day would still take ~3.5 months).

If anyone knows a way around this, please drop me an email and fill me in.

Otherwise, you can check out prior art. Patrick Cain at Global News created an awesome interactive map of aggregated parking ticket data from 2010 for locations in the city where over 500 tickets were issued. This turns out to be mainly hospitals, and unsurprisingly, tickets are clustered in the downtown core. Mr. Cain did a similar analysis while at the Toronto Star back in 2009, using data from the previous year.

I just don’t like throwing out data points.
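
For the record, the back-of-envelope geocoding estimate works out like this. A small Python sketch, using the round figure of 11 million records and the two daily request limits mentioned above:

```python
import math

RECORDS = 11_000_000        # approximate number of tickets in the data set
FREE_LIMIT = 2_500          # geocoding requests per day, free tier
BUSINESS_LIMIT = 100_000    # geocoding requests per day, API for Business


def days_needed(records, requests_per_day):
    """Whole days required to geocode every record at a given daily limit."""
    return math.ceil(records / requests_per_day)


free_days = days_needed(RECORDS, FREE_LIMIT)
business_days = days_needed(RECORDS, BUSINESS_LIMIT)

print(f"Free tier: {free_days} days (about {free_days / 365:.0f} years)")
print(f"Business:  {business_days} days (about {business_days / 30:.1f} months)")
```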

Analysis

Parking Infractions by Type 
Next we consider the parking tickets for the period by infraction type. A simple bar chart outlines the most common parking ticket types:

We will consider those codes which stick out most on the bar chart (the top 10):

> sort(codeTable, decreasing=TRUE)[1:10]
    005     029     210     003     207     009     002     008     006     015
2336433 1822690 1366945 1354671  933478  718692  496283  443706  369079 173078

Putting that into more human-readable format, the most commonly issued types of parking infractions were:

1. 005 – Park on Highway at Prohibited Time of Day
2. 029 – Park Prohibited Place/Time – No Permit
3. 210 – Park Fail to Display Receipt
4. 003 – Park on Private Property w/o Consent
5. 207 – Park w/o ticket from machine
6. 009 – Stop on Highway at Prohibited Time/Day
7. 002 – Park Longer than 3 Hours
8. 008 – Vehicle Standing Prohibited Time/Day
9. 006 – Park on Highway – Excess of Permitted Time
10. 015 – Park within 3M of Fire Hydrant

In case you were wondering, the most expensive tickets (in the range of 100’s of dollars, the max being $450 [!!] ) are all related to handicapped parking spaces.

Time Distribution of Parking Infractions
Let us now consider the parking ticket information with regard to time. First and foremost, we consider the ticket data as a simple time series and plot the data for exploratory purposes:

Cool.

Most strikingly, there are clearly defined dips in the total number of tickets over the holiday season each year. There also appears to be some kind of periodic variation in the number of tickets issued over time (the downward spikes). A good first guess would be that this is likely related to the day of the week, due to the cycle of the work week related to the volume of cars parked, vehicles in the city, et cetera.

Quickly whipping up a box plot for the data, we can see that a significantly smaller proportion of the tickets are issued on Sunday. Also, for some reason there are many outliers on the low end. I suspect these are in the aforementioned dips around the holiday season, though I did not investigate this.
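
The grouping behind a day-of-week box plot can be sketched as follows. This is Python rather than the R used for the frequency table above, and the daily totals are invented purely for illustration; the idea is just to bucket each day's ticket count by weekday and compare the medians.

```python
from collections import defaultdict
from datetime import date
from statistics import median

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical daily ticket totals (not real figures from the data set).
daily_totals = {
    date(2011, 12, 11): 3100,  # Sunday
    date(2011, 12, 12): 8200,  # Monday
    date(2011, 12, 13): 8500,  # Tuesday
    date(2011, 12, 18): 2900,  # Sunday
    date(2011, 12, 19): 7900,  # Monday
    date(2011, 12, 20): 8300,  # Tuesday
}


def median_by_weekday(totals):
    """Group daily totals by weekday name and return each group's median."""
    grouped = defaultdict(list)
    for day, count in totals.items():
        grouped[DAYS[day.weekday()]].append(count)
    return {name: median(grouped[name]) for name in DAYS if name in grouped}


for name, value in median_by_weekday(daily_totals).items():
    print(f"{name}: median {value:.0f} tickets per day")
```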

Conclusions

Performing a quick analysis of many different aspects of the data was not as easy as I had hoped, given the size of the set. Still, it is interesting to see the most common types of violations and the distribution of the majority of the parking tickets with respect to time.

Interesting general points of note:

  • The most common parking infractions are wrong place / wrong time, followed by various types of failing to display a permit / buy a ticket
  • Significantly reduced number of parking violations during the Christmas holiday season
  • More tickets issued during the work week

For Part II, I plan to create some heat maps / 2D histograms of the ticket data with respect to time, and I may yet create a geospatial representation of the data, albeit in aggregated form.

I’m Lovin’ It? – A Nutritional Analysis of McDonald’s

Introduction

The other day I ate at McDonald’s.

I am not particularly proud of this fact. But some days, you are just too tired, too lazy, or too hung-over to bother throwing something together in the kitchen and you just think, “Whatever, I’m getting a Big Mac.”

As I was standing in line, ready to be served a free smile, I saw that someone had put up on the wall the nutritional information poster. From far away I saw the little columns of data, all in neatly organized tabular form, and a light went on over my head. I got excited like the nerd I am. Look! Out in the real world! Neatly organized data just ready to be analyzed! Amazing!

So, of course, after finishing my combo #1 and doing my part to contribute to the destruction of the rain forest, the world’s increasingly worrying garbage problem, and the continuing erosion of my state of health, I rushed right home to download the nutritional information from Ronald McDonald’s website and dive into the data.

Analysis

First of all, I would just like to apologize in advance for
a) using a spreadsheet application, and
b) using bar charts

Forgive me, but we’re not doing any particularly heavy lifting here. And hey, at least it wasn’t in that one piece of software that everybody hates.

Also, by way of a disclaimer, I am not a nutritionist and the content of this article is in no way associated with McDonald’s restaurants or Health Canada.

Sandwiches
First things first. Surprisingly, the largest and fattiest of the items on the board is (what I consider to be) one of the “fringe” menu items: the Angus Deluxe Burger. Seriously, does anybody really ever order this thing? Wasn’t it just something the guys in the marketing department came up with to recover market share from Harvey’s? But I digress.

Weighing in at just a gram shy of 300 grams, 47 of which come from fat (17 of them saturated), this is probably not something you should eat every day, given that it has 780 calories. Using a ballpark figure of 2000 calories a day for a healthy adult, eating just the burger alone would make up almost 40% of your daily caloric intake.

 
Unsurprisingly, the value menu burgers are not as bad in terms of calories and fat, due to their smaller size. This is also the case for the chicken snack wraps and fajita. The McBistro sandwiches, though they are chicken, are on par with the other larger burgers (Big Mac and Big Xtra) in terms of serving size and fat content, so as far as McD’s is concerned choosing a chicken sandwich is not really a healthier option over beef (this is also the case for the caloric content).

As the document on the McDonald’s website is a little dated, some newer, more popular menu items are missing from the data set. However, these are available in the web site’s nutritional calculator (which unfortunately is in Flash). FYI, the Double Big Mac has 700 calories and weighs 268 grams, 40 of which come from fat (17 saturated). Close, but still not as bad as the Angus Deluxe.

In terms of sodium and cholesterol, again our friend the Angus burger is the worst offender, this time the Angus with Bacon & Cheese, having both the most sodium and cholesterol of any burger on the menu. With a whopping 1990 mg of sodium, or approximately 80% of Health Canada’s recommended daily intake, that’s a salty burger. Here a couple of the smaller burgers are quite bad, the Double Cheeseburger and Quarter Pounder with Cheese both having marginally more sodium than the Big Mac as well as more cholesterol. Best stick with the snack wraps or the other value menu burgers.

Fries
Compared to the burgers, the fries don’t even really seem all that bad. Still, if you order a large, you’re getting over 40% of your recommended daily fat intake. I realize I’m using different units than before here, so for your reference the large fries have 560 calories, 27 grams of fat and 430 mg of sodium.

Soft Drinks
If you are trying to be health-conscious, the worst drinks you could possibly order at McDonald’s are the milkshakes. Our big winner in the drinks department is the large Chocolate Banana Triple Thick Milkshake®. With a serving size of 698g (~1.5 lbs), this delicious shake has over 1000 calories and nearly 30 grams of fat. In fact the milkshakes are, without question, the most caloric of all the drinks available, and are only exceeded in sugar content by some of the large soft drinks.

In terms of watching the calories and sugar, diet drinks are your friend as they have zero calories and no sugar. Below is the caloric and sugar content of the drinks available, sorted in ascending order of caloric content.

 

Salads
And now the big question – McDonald’s salads: a more conscientious choice, or another nutritional offender masquerading as a healthy alternative?

There are quite healthy alternatives in the salad department. Assuming you’re not going to order the Side Garden Salad (which I assume is just lettuce, looking at its caloric and fat content) the Spicy Thai Salad and Spicy Thai with Grilled Chicken are actually quite reasonable, though the latter has a large amount of sodium (520 mg), and all the Thai and Tuscan salads have a lot of sugar (19 and 16 grams of sugar respectively).

However, all these values are referring to the salads sans dressing. If you’re like me (and most other human beings) you probably put dressing on your salad.

The Spicy Thai Salad with the Asian Sesame Dressing added might still be considered within the realm of the healthy – totaling 250 calories and 11 grams of fat. However, keep in mind that would also have 530 mg of sodium (about a quarter of the recommended daily intake) and 29 grams of sugar. Not exactly health food, but not the worst thing you could order.

And for the love of god, just don’t order any old salad at McD’s and think you are making a healthy alternative choice. The Mighty Caesar with Crispy Chicken and Caesar dressing has more fat than a Big Mac combo with medium fries and a Coke (54 g vs. 46 g) and nearly as much sodium (1240 mg vs. 1300 mg), over half the daily recommended intake.

Conclusions

Doing this brief, simple examination of the McDonald’s menu will definitely help me be more mindful about the food the next time I choose to eat there. However, in terms of take-aways, there is nothing here really too surprising – we can see that McDonald’s food is, in general, very high in calories, fat, sugar and sodium. This is probably not a surprise for most, as many continue to eat it while being aware of these facts, myself included.

Still, it is somewhat shocking to see it all quantified and laid out in this fashion. A Big Mac meal with a medium fries and medium coke, for instance, has 1120 calories, 46 grams of fat, 1300 mg of sodium and 65 grams of sugar. Yikes. Assuming our 2000 calorie diet, that’s over half the day’s calories in one meal, as well as 71% and 54% of the recommended daily values for fat and sodium respectively. I will probably think twice in the future before I order that again.
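
If you want to check percentages like these yourself, the arithmetic is simple. In the Python sketch below, the 2000 calorie figure is the ballpark used throughout this post; the 65 g fat and 2400 mg sodium daily values are my assumption, chosen because they reproduce the 71% and 54% quoted above, so verify them against the Health Canada references before relying on them.

```python
# Daily reference amounts. Calories is the ballpark used in this post;
# the fat and sodium values are assumptions (see the note above).
DAILY_VALUES = {"calories": 2000, "fat_g": 65, "sodium_mg": 2400}

# Big Mac meal with medium fries and a medium Coke, as quoted above.
big_mac_meal = {"calories": 1120, "fat_g": 46, "sodium_mg": 1300}


def percent_daily_value(item):
    """Percentage of each daily reference amount the item accounts for."""
    return {
        nutrient: round(100 * item[nutrient] / DAILY_VALUES[nutrient])
        for nutrient in DAILY_VALUES
    }


for nutrient, pct in percent_daily_value(big_mac_meal).items():
    print(f"{nutrient}: {pct}% of the daily value")
```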

If you are trying to be health-conscious and still choose to eat underneath the golden arches, based on what we have seen here, some pointers are:

  • Avoid the Angus Burgers
  • Order a smaller burger (except the double cheese), snack wrap or fajita
  • Avoid the milkshakes
  • Drink diet soft drinks
  • Some salads are acceptable, Caesar dressing is to be avoided

References / Resources

McDonald’s Nutritional Information
http://www1.mcdonalds.ca/NutritionCalculator/NutritionFactsEN.pdf

McDonald’s Canada Nutritional Calculator
http://www.mcdonalds.ca/ca/en/food/nutrition_calculator.html

The Daily % Value (Health Canada)
http://www.hc-sc.gc.ca/fn-an/label-etiquet/nutrition/cons/dv-vq/info-eng.php

Dietary Reference Intake Tables, 2005 (Health Canada)
http://www.hc-sc.gc.ca/fn-an/nutrition/reference/table/index-eng.php

LibreOffice Calc
http://www.libreoffice.org/