Google Domestic Trends

Google’s mission is to organize all the world’s information and make it universally accessible and useful. In following their mission, the company has produced some amazing tools which allow any internet user to do some data visualization without so much as having to open a spreadsheet.

One of these tools which I stumbled across the other day (which apparently has has existed for some time) is Google Domestic Trends.

I was previously aware of Google Trends, which allows a user to compare the popularity of different search terms, whether if be for serious reasons (e.g. Android vs. iPhone) or say, for something less serious. In Domestic Trends, Google has aggregated relevant search terms across different sectors of the economy, with the results presumably providing insight into market trends by sector (or at least the popularity of those market sectors with respect to time).

I am not an economist, but data are data, so here goes with the pithy commentary and observations.

Air Travel
It’s seasonal, unsurprisingly. Looks like there might be some deals over the holidays I was unaware of. Or that might be a really bad time to buy tickets.


Auto Buyers
As Google notes on the Domestic Trends frontpage, July 2009 was when the U.S. Government instituted its “Cash for Clunkers” program. However, it was also when Toyota recalled almost half a million vehicles due to defective airbags. Oh yeah, and that spike in 2005 is related to the outrageous change in the gas prices of the time.


New record. I’m glad I rent.


Computers and Electronics
Seriously, who buys desktops anymore?


Credit Cards
A poignant portrait of the changing state of the American economy and personal debt.


Durable Goods
Merry Christmas honey, I got you a Rhoomba.


School’s out for summer.


I want to say that the little spike later in 2011 has nothing to due with employment and is due to Mr. Jobs retiring, however then I would expect a much larger one to be in October.


Mobile and Wireless
The iPhone was revealed to the public on January 9th, 2007 and went on sale in June of the same year. The iPhone 3G and 3GS came out in June and July of 2008 and 2009 respectively. The 4S was released in October 2011. Not sure about mid-2010. The Blackberry Torch came out in June but that would hardly warrant what we see here.


Rental & Real Estate
Apparently it is quite seasonal. Peaks drop off around late July and early August. Students, I would guess.


We’ve seen this before. No surprises here.


I know the word you’re thinking of. It’s on the tip of your tongue and it starts with ‘R’.


See also: Google NGram Experiments.

rhok (n’ roll)

This past weekend was rhok Toronto which was a fun, exhausting, educational, and all around amazing weekend which I was honoured to be involved in.

The team I was fortunate enough to be a part of produced a prototype web-service to promote fair housing, and improve the ease of the submission process for investigations into housing by-law violations. An added bonus was that this resulted in this nice visualization of more City of Toronto data.

You can learn more about rhok here.

11 Million Yellow Slips – City of Toronto Parking Tickets, 2008-2011


I don’t know about you, but I really hate getting parking tickets. Sometimes I feel like it’s all just a giant cash grab. Really? I can’t park there between the hours of 11 and 3, but every other time is okay? Well, why the hell not?

But ah, such is life. Rules must be in place to keep civil order, keep the engines of city life running and prevent total chaos in the downtown core. However knowing this does not make coming out to the street to find that bright yellow slip of paper under your windshield wiper any easier.

Like everything else in the universe, parking tickets are a source of data. The great people at Open Data Toronto (@Open_TO) have provided all the data from every parking ticket issued in Toronto from 2008 to the end of last year.

So, let us dive in and have a look. We might just discover why we keeping getting all these tickets, or at least ease the collective pain a little in realizing how many others are sharing in it.


The data set is an anonymized record of every parking ticket issued in the city of Toronto from the period 01/01/2008 – 12/31/2011. The fields provided are: the anonymized ticket #, date of infraction, infraction code, description, fine amount, time of infraction, and location (address).

The data set and more information can be found in Open Data Toronto’s data catalogue here.

Originally I had this brilliant idea to geocode every data point, and then create an awesome heat map of the geographical distribution of parking tickets issued. However, given the fact that there are ~11 million records and the Google Maps API has a daily limit of 2,500 geocoding requests per day, even if I was completely diligent and performed the task daily it would still take approximately 4400 days or about 12 years to complete. And no, I am not paying to use the API for Business (which at a limit of 100,000 requests per day would still take ~3.5 months).

If anyone knows a way around this, please drop me an email and fill me in.

Otherwise, you can check out prior art. Patrick Cain at Global News created an awesome interactive map of aggregated parking ticket data from 2010 for locations in the city where over 500 tickets were issued. This turns out to be mainly hospitals, and unsurprisingly, tickets are clustered in the downtown core. Mr. Cain did a similar analysis while at the Toronto Star back in 2009, using data from the previous year.

I just don’t like throwing out data points.


Parking Infractions by Type 
Next we consider the parking tickets for the period by infraction type. A simple bar chart outlines the most common parking ticket types:

We will consider those codes which stick out most on the bar chart (the top 10):

> sort(codeTable, decreasing=TRUE)[1:11]
    005     029     210     003     207     009     002     008     006     015
2336433 1822690 1366945 1354671  933478  718692  496283  443706  369079 173078

Putting that into more human-readable format, the most commonly issued types of parking infractions were:

1. 005 – Park on Highway at Prohibited Time of Day
2. 029 – Park Prohibited Place/Time – No Permit
3. 210 – Park Fail to Display Receipt
4. 003 – Park on Private Property w/o Consent
5. 207 – Park w/o ticket from machine
6. 009 – Stop on Highway at Prohibited Time/Day
7. 002 – Park Longer than 3 Hours
8. 008 – Vehicle Standing Prohibited Time/Day
9. 006 – Park on Highway – Excess of Permitted Time
10. 015 – Park within 3M of Fire Hydrant

In case you were wondering, the most expensive tickets (in the range of 100’s of dollars, the max being $450 [!!] ) are all related to handicapped parking spaces.

Time Distribution of Parking Infractions
Let us now consider the parking ticket information with regards to time. First and foremost, we consider the ticket data as a simple tim
e series and plot the data for the exploratory purposes:


Most strikingly, there are clearly defined dips in the total number of tickets over the holiday season each year. There also appears to be some kind of periodic variation in the number of tickets issued over time (the downward spikes). A good first guess would be that this is likely related to the day of the week, due to the cycle of the work week related to the volume of cars parked, vehicles in the city, et cetera.

Quickly whipping up a box plot up for the data, we can see that a significantly less proportion of the tickets are issued on Sunday. Also for some reason plotting there are many outliers on the low end. I suspect these are in the aforementioned dips around the holiday season though I did not investigate this.


Performing a quick analysis of many different aspects of the data was not as easy as I had hoped, given the size of the set. Still, it is interesting to see the most common types of violations and the distribution of the majority of the parking tickets with respect to time.

Interesting general points of note:

  • The most common parking infractions are wrong place / wrong time, followed by various types of failing to display a permit / buy a ticket
  • Significantly reduced number of parking violations during the Christmas holiday season
  • More tickets issued during the work week

For Part II, I plan to create some heat maps / 2D histograms of the ticket data with respect to time, and I may yet create a geospatial representation of the data, albeit in aggregated form.

I’m Lovin’ It? – A Nutritional Analysis of McDonald’s


The other day I ate at McDonald’s.

I am not particularly proud of this fact. But some days, you are just too tired, too lazy, or too hung-over to bother throwing something together in the kitchen and you just think, “Whatever, I’m getting a Big Mac.”

As I was standing in line, ready to be served a free smile, I saw that someone had put up on the wall the nutritional information poster. From far away I saw the little columns of data, all in neatly organized tabular form, and a light went on over my head. I got excited like the nerd I am. Look! Out in the real world! Neatly organized data just ready to be analyzed! Amazing!

So, of course, after finishing my combo #1 and doing my part to contribute to the destruction of the rain forest, the world’s increasingly worrying garbage problem, and the continuing erosion of my state of health, I rushed right home to download the nutritional information from Ronald McDonald’s website and dive into the data.


First of all, I would just like to apologize in advance for
a) using a spreadsheet application, and
b) using bar charts

Forgive me, but we’re not doing any particularly heavy lifting here. And hey, at least it wasn’t in that one piece of software that everybody hates.

Also, by way of a disclaimer, I am not a nutritionist and the content of this article is in no way associated with McDonald’s restaurants or Health Canada.

First things first. Surprisingly, the largest and fattiest of the items on the board is (what I consider to be) one the “fringe” menu items: the Angus Deluxe Burger. Seriously, does anybody really ever order this thing? Wasn’t it just something the guys in the marketing department came up to recover market share from Harvey’s? But I digress.

Weighing in at just a gram shy of 300, 47 of which come from fat (of which 17 are saturated) this is probably not something you should eat every day, given that it has 780 calories. Using a ballpark figure of 2000 calories a day for a healthy adult, eating just the burger alone would make up almost 40% of your daily caloric intake.

Unsurprisingly, the value menu burgers are not as bad in terms of calories and fat, due to their smaller size. This is also the case for the chicken snack wraps and fajita. The McBistro sandwiches, though they are chicken, are on par with the other larger burgers (Big Mac and Big Xtra) in terms of serving size and fat content, so as far as McD’s is concerned choosing a chicken sandwich is not really a healthier option over beef (this is also the case for the caloric content).

As the document on the McDonald’s website is a little dated, some newer, more popular menu items are missing from the data set. However these are available in the web site’s nutritional calculator (which unfortunately is in Flash). FYI the Double Big Mac has 700 calories and weighs 268 grams, 40 of which come from fat (17 saturated). Close, but still not as bad as the Angus Deluxe.

In terms of sodium and cholesterol, again our friend the Angus burger is the worst offender, this time the Angus with Bacon & Cheese, having both the most sodium and cholesterol of any burger on the menu. With a whopping 1990 mg of sodium, or approximately 80% of Health Canada’s recommended daily intake, that’s a salty burger. Here a couple of the smaller burgers are quite bad, the Double Cheeseburger and Quarter Pounder with Cheese both having marginally more sodium than the Big Mac as well as more cholesterol. Best stick with the snack wraps or the other value menu burgers.

Compared to the burgers, the fries don’t even really seem all that bad. Still, if you order a large, you’re getting over 40% of your recommended daily fat intake. I realize I’m using different units than before here, so for your reference the large fries have 560 calories, 27 grams of fat and 430 mg of sodium.

Soft Drinks
If you are trying to be health-conscious, the worst drinks you could possibly order at McDonald’s are the milkshakes. Our big winner in the drinks department is the large Chocolate Banana Triple Thick Milkshake®. With a serving size of 698g (~1.5 lbs), this delicious shake has over 1000 calories and nearly 30 grams of fat. In fact the milkshakes are, without question, the most caloric of all the drinks available, and are only exceeded in sugar content by some of the large soft drinks.

In terms of watching the calories and sugar, diet drinks are your friend as they have zero calories and no sugar. Below is the caloric and sugar content of the drinks available, sorted in ascending order of caloric content.


And now the big question – McDonald’s salads: a more conscientious choice, or another nutritional offender masquerading as a healthy alternative?

There are quite healthy alternatives in the salad department. Assuming you’re not going to order the Side Garden Salad (which I assume is just lettuce, looking at its caloric and fat content) the Spicy Thai Salad and Spicy Thai with Grilled Chicken are actually quite reasonable, though the latter has a large amount of sodium (520 mg), and all the Thai and Tuscan salads have a lot of sugar (19 and 16 grams of sugar respectively).

However, all these values are referring to the salads sans dressing. If you’re like me (and most other human beings) you probably put dressing on your salad.

The Spicy Thai Salad with the Asian Sesame Dressing added might still be considered within the realm of the healthy – totaling 250 calories and 11 grams of fat. However, keep in mind that would also have 530 mg of sodium (about a quarter of the recommended daily intake) and 29 grams of sugar. Not exactly health food, but not the worst thing you could order.

And for the love of god, just don’t order any old salad at McD’s and think you are making a healthy alternative choice. The Mighty Caesar with Crispy Chicken and Caesar dressing has more fat than a Big Mac combo with medium fries and a Coke (54 g vs. 46 g) and nearly as much sodium (1240 mg vs. 1300 mg), over half the daily recommended intake.


Doing this brief simple examination of the McDonald’s menu will definitely help me be more mindful about the food the next time I choose to eat there. However in terms of of take-aways, there is nothing here really too surprising – we can see that McDonald’s food is, in general, very high in calories, fat, sugar and sodium. This is probably not a surprise for most, as many continue to eat it while being aware of these facts, myself included.

Still, it is somewhat shocking to see it all quantified and laid out in this fashion. A Big Mac meal with a medium fries and medium coke, for instance, has 1120 calories, 46 grams of fat, 1300 mg of sodium and 65 grams of sugar. Yikes. Assuming our 2000 calorie diet, that’s over half the day’s calories in one meal, as well as 71% and 54% of the recommended daily values for fat and sodium respectively. I will probably think twice in the future before I order that again.

If you are trying to be health-conscious and still choose to eat underneath the golden arches, based on what we have seen here, some pointers are:

  • Avoid the Angus Burgers
  • Order a smaller burger (except the double cheese), snack wrap or fajita
  • Avoid the milkshakes
  • Drink diet soft drinks
  • Some salads are acceptable, Caesar dressing is to be avoided

References / Resources

McDonald’s Nutritional Information

McDonald’s Canada Nutritional Calculator

The Daily % Value (Health Canada)

Dietary Reference Intake Tables, 2005 (Health Canada)

LibreOffice Calc

My bookshelf

I’d like to start with something small, and simple. The thing about analyzing the data of your own life is that you are the only one doing the research, so you also have to collect all of the data yourself. This takes effort; and, if you’d like to build a large enough data set to do some really interesting (and valid) analysis, time.

So I thought I’d start small. And simple. So I thought, what is an easily available source of data in my life to do some preliminary work? The answer was right next to me as I sat at my desk.

I am not a bibliophile by any stretch of the imagination, as I try to make good use of the public library when I can. I’d prefer to avoid spending copiously on books which will be read once and then collect dust. I have, over time however, amassed a small collection which is currently surpassing the capacity of my tiny IKEA bookcase.

I catalogued all the books in my collection and kept track of a few simple characteristics: number of pages, list price, publication year, binding, type (fiction, non-fiction or reference), subject, and whether or not I had read the book from cover-to-cover (“Completed”).

At the time of cataloguing I had a total of 60 books on my bookshelf. Summary of data:

> source(“books.R”)
[1] “Reading books.csv”
> summary(books)


Min.   :  63.0 
1st Qu.: 209.5 
Median : 260.0 
Mean   : 386.1 
3rd Qu.: 434.0 Max.   :1694.0 
      Binding        Year               Type              Subject 
 Hardcover:21   Min.   :1921   Fiction    :15   Math          :12 
 Softcover:39   1st Qu.:1995   Non-fiction:34   Communications: 7 
                Median :2002   Reference  :11   Humour        : 6 
                Mean   :1994                    Coffee Table  : 5 
                3rd Qu.:2006                    Classics      : 4 
                Max.   :2011                    Sci-Fi        : 4 
                                                (Other)       :22 
     Price        Completed
 Min.   :  1.00   -:16    
 1st Qu.: 16.45   N:13    
 Median : 20.49   Y:31    
 Mean   : 35.41           
 3rd Qu.: 30.37           
 Max.   :155.90           

Some of this information is a bit easier to interpret if provided in visual form (click to enlarge):

Looking at the charts we can see that I’m not really into novels, and that almost 1/5th of my library is reference books – due mainly to textbooks from university I still have kicking around. For about 1/3rd of the books which are intended to be read cover-to-cover I have not done so (“Not Applicable” refers to books like coffee-table and reference books which are not intended to be read in their entirety).

Breaking it down further we look at the division by subject/topic:

Interestingly enough, the topics in my book collection are varied (apparently I am well-read?), with the largest chunks being made up by math (both pop-science and textbooks) and communications (professional development reading in the last year).

Let’s take a look at the relationship between the list price of books and other factors.

As expected, there does not appear to be any particular relationship between the publication year of the book and the list price. The outliers near the top of the price range are the textbooks and those on the very far left of publication date are Kafka.

A more likely relationship would be that between a book’s length and its price, as larger books are typically more costly. Having a look at the data for all the books it appears this could be the case:

We can coarsely fit a trendline to the data:
> price <- books$Price
> pages <- books$Pages
> page_price_line <- lm(price ~ pages)
> summary(page_price_line)

lm(formula = price ~ pages)

    Min      1Q  Median      3Q     Max
-56.620 -13.948  -6.641  -1.508 109.802

            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  9.95801    6.49793   1.532    0.131   
pages        0.06592    0.01294   5.096 3.97e-06 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.19 on 58 degrees of freedom
Multiple R-squared: 0.3092,    Adjusted R-squared: 0.2973
F-statistic: 25.96 on 1 and 58 DF,  p-value: 3.971e-06

Our p-value is super small however our goodness of fit (R-squared) is not great. There appears to be some sort of clustering going on here as the larger values (both in price and pages) are more dispersed. We re-examine the plot and divide by binding type:

The softcovers make up the majority of the tightly clustered values and the values for the hardcovers seem to be more spread out. The dashed line is the linear fit for the hardcovers and the solid line for the soft. However the small number (n=21) and dispersion of the points for the former make even doing this questionable. That point aside, we can see on the whole that hardcovers appear to be more expensive, as one would expect. This is illustrated in the box plot below:

However there a lot of outlying points on the plot. Looking at the scatterplot again we divide by book type and the picture becomes clearer:

It is clear the reference books make up the majority of the extreme values away from those clustered in the lower regions of the plot and thus could be treated separately.

Closing notes:

  • I did not realize how many non-fiction / general interest / popular reading books have subtitles (e.g. Zero – The Biography of A Dangerous Idea) until cataloguing the ones I own. I suppose this is to make them seem more interesting, with the hopes that people browsing at bookstores to read the blurb on the back and be enticed to purchase the book.
  • Page numbering appears to be completely arbitrary. When I could I used the last page of each book which had a page number listed. Some books have the last page in the book numbered, others have the last full page of text numbered, and still others the last written page before supplementary material at the back (index, appendix, etc.) numbered. The first numbered page also varies, accounting for things like the table of contents, introduction, prologue, copyright notices and the like.
  • Textbooks are expensive. Unreasonably so.
  • Amazon has metadata for each book which you can see under “Details” when you view it (I had to look up some things like price when it was not listed on the book. In these cases, I used Amazon’s “list price”, the crossed out value at the top of the page for a book). I imagine there is an enormous trove of data which would lend itself to much more interesting and detailed analysis than I could perform here.