Plotting Choropleths from Shapefiles in R with ggmap – Toronto Neighbourhoods by Population


So, I’m not really a geographer. But any analyst worth their salt will eventually have to do some kind of mapping or spatial visualization. Mapping is not really a forte of mine, though I have played around with it some in the past.
I was working with some shapefile data a while ago and thought about how it’s funny that so much spatial data is dominated by a format that is basically proprietary. I looked around for some good tutorials on using shapefile data in R, and even then it took me longer to figure out than I would have thought.
So I thought I’d put together a simple example of making nice choropleths using R and ggmap. Let’s do it using some nice shapefile data of my favourite city in the world courtesy of the good folks at Toronto’s Open Data initiative.


We’re going to plot the shapefile data of Toronto’s neighbourhood boundaries in R and mash it up with demographic data per neighbourhood from Wellbeing Toronto.
We’ll need a few spatial plotting packages in R (ggmap, rgeos, maptools).
Also, the shapefile threw some kind of weird error when I originally tried to load it into R, but it was nothing that loading it into QGIS once and resaving it couldn’t fix. The working version is available on the github page for this post.
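If you don’t have the packages installed already, something like the following should get you set up (RColorBrewer is also used further down for the Spectral palette):

# Install the packages first if needed:
# install.packages(c("maptools", "rgeos", "ggmap", "RColorBrewer"))
library(maptools)     # readShapePoly(), for reading the shapefile
library(rgeos)        # geometry operations, used when fortifying the polygons
library(ggmap)        # qmap() for map tiles; loads ggplot2 as well
library(RColorBrewer) # brewer.pal(), for the Spectral palette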


First let’s just load in the shapefile and plot the raw boundary data using maptools. What do we get?
# Read the neighborhood shapefile data and plot
shpfile <- "NEIGHBORHOODS_WGS84_2.shp"
sh <- readShapePoly(shpfile)
plot(sh)
This just yields the raw polygons themselves. Any good Torontonian would recognize these shapes. There are maps like these, with words squished into the polygons, hanging in lots of print shops on Queen Street. Also, as someone pointed out to me, most T-dotters think of the grid of downtown streets as running directly North-South and East-West, but it actually sits on an angle.

Okay, that’s a good start. Now we’re going to include the neighbourhood population from the demographic data file by attaching it to the dataframe within the shapefile object. We do this using the merge function; basically this is like an SQL join. I also need to convert the neighbourhood number to an integer first so the join works, because R is treating it as a string.

# Add demographic data
# The neighbourhood ID is a string - change it to an integer
sh@data$AREA_S_CD <- as.numeric(sh@data$AREA_S_CD)

# Read in the demographic data and merge on Neighbourhood Id
demo <- read.csv(file="WB-Demographics.csv", header=T)
sh2 <- merge(sh, demo, by.x='AREA_S_CD', by.y='Neighbourhood.Id')
Next we’ll create a nice white-to-red colour palette using the colorRampPalette function. Then we scale the population data so it maps to palette indices ranging from 1 to the number of colours in the palette (here I’ve arbitrarily chosen 128) and store those indices in a variable. Finally we call plot, passing the palette colours selected by those indices to the col parameter:
# Set the palette
p <- colorRampPalette(c("white", "red"))(128)

# Scale the total population to the palette
pop <- sh2@data$Total.Population
cols <- round((pop - min(pop))/diff(range(pop))*127) + 1
plot(sh2, col = p[cols])
And here’s the glorious result!

Cool. You can see that the population is greater for some of the larger neighbourhoods, notably on the east end and in The Waterfront Communities (i.e. condoland).

I’m not crazy about this white-red palette, so let’s use RColorBrewer’s Spectral palette, which is one of my faves:

# RColorBrewer, Spectral (reversed so high population maps to dark red)
p <- colorRampPalette(rev(brewer.pal(11, 'Spectral')))(128)
plot(sh2, col = p[cols])

There, that’s better. The dark red neighbourhood is Woburn. But we still don’t have a legend, so this choropleth isn’t really telling us anything particularly helpful. And it’d be nice to have the polygons overplotted onto map tiles. So let’s use ggmap!


In order to use ggmap, we have to decompose the shapefile of polygons into something ggmap can understand: a dataframe. We do this using the fortify command. Then we use ggmap’s very handy qmap function, to which we can just pass a search term as we would with Google Maps; it fetches the map tiles for us automatically. Finally we overplot the data using standard calls to geom_polygon, just as you would in other visualizations using ggplot.

The first polygon call is for the filled shapes and the second is to plot the black borders.

points <- fortify(sh, region = 'AREA_S_CD')

# Plot the neighborhoods
toronto <- qmap("Toronto, Ontario", zoom=10)
# Note: a constant alpha belongs outside aes(), so it's applied literally
toronto + geom_polygon(aes(x=long, y=lat, group=group), data=points, fill='white', alpha=0.25) +
  geom_polygon(aes(x=long, y=lat, group=group), data=points, color='black', fill=NA)

Now we merge in the demographic data just as we did before, and ggplot takes care of the scaling and legends for us. It’s also super easy to use different palettes: scale_fill_gradient for ramp palettes, and scale_fill_distiller for RColorBrewer palettes.

# Merge the fortified shapefile data with the demographic data, using the neighbourhood ID
points2 <- merge(points, demo, by.x='id', by.y='Neighbourhood.Id', all.x=TRUE)

# merge() reorders the rows, which can scramble the polygon drawing order - restore it
points2 <- points2[order(points2$order), ]

# Plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_gradient(low='white', high='red')

# Spectral plot
toronto + geom_polygon(aes(x=long, y=lat, group=group, fill=Total.Population), data=points2, color='black', alpha=0.5) +
  scale_fill_distiller(palette='Spectral')

So there you have it! Hopefully this will be useful for other R users wishing to make nice maps in R using shapefiles, or those who would like to explore using ggmap.

References & Resources

Neighbourhood boundaries at Toronto Open Data:
Demographic data from Wellbeing Toronto:

I’m Dreaming of a White Christmas

I’m heading home for the holidays soon.

It’s been unseasonably warm this winter, at least here in Ontario, so much so that squirrels in Ottawa are getting fat. I wanted to put together a really cool post predicting the chance of a white Christmas using lots of historical climate data, but it turns out Environment Canada has already crunched the numbers on something like that. We can just slam their figures into Google Fusion Tables and get some nice visualizations of simple data.


It seems everywhere above a certain latitude has a much higher chance of having a white Christmas in recent times than places closer to the American border and on the coasts, which I’m going to guess is likely due to how cold it gets in those areas on average during the winter. Sadly, Toronto has less than a coin-flip’s chance of a white Christmas in recent times, with only a 40% chance of snow on the ground come the holiday.


But just because there’s snow on the ground doesn’t necessarily mean that your yuletide weather is worthy of a Christmas storybook or holiday movie. Environment Canada also has a definition for what they call a “Perfect Christmas”: 2 cm or more of snow on the ground and snowfall at some point during the day. Which Canadian cities had the most of these beautiful Christmases in the past?

Interestingly, Ontario, Quebec and Atlantic Canada are better represented here, which I imagine has something to do with how much precipitation they get due to their proximity to bodies of water, but hey, I’m not a meteorologist.
A white Christmas would be great this year, but I’m not holding my breath. Either way it will be good to sit by the fire with an eggnog and not think about data for a while. Happy Holidays!

Snapchat Database Leak – Visualized


In case you don’t read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users’ details leaked due to a security flaw that was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users’ details with the intent to “raise public awareness on how reckless many internet companies are with user information.”

Which is awesome – kudos to those guys, first for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and second for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check if your account is in the DB here.

However, if you’re a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not OMG am I safe?! but let’s do some analysis!


Area Code
As I have noted in a previous musing, it’s difficult to do any sort of in-depth analysis when your data has limited dimensionality – here only 3 fields: the phone number with the last two digits obscured, the username, and the area.
Fortunately, because some of the data here is geographic, we can do some cool visualization with mapping. First we look at the high-level view: the number of compromised accounts by state. California had the most accounts compromised overall, with just shy of 1.4M details leaked. New York State was next at just over a million.
Because the accounts weren’t spread evenly across the states, below is a more detailed view by area code. You can see that it’s mainly Southern California and the Bay Area where the accounts are concentrated.
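Not that you need it, but here’s a minimal sketch of how that grouping could be done in R. The filename and the phone/username column names are hypothetical stand-ins, since I’m assuming the leak has been read into a data frame of character columns:

# A minimal sketch - 'snapchat_leak.csv' is a hypothetical filename, and
# the column names 'phone' and 'username' are assumptions about the layout
accounts <- read.csv("snapchat_leak.csv", colClasses = "character")

# Take the area code as the first three digits of the phone string
accounts$area_code <- substr(accounts$phone, 1, 3)

# Top ten area codes by number of leaked accounts
head(sort(table(accounts$area_code), decreasing = TRUE), 10)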
Well, that covers the geographic component. Which leaves only the usernames and phone numbers. I’m not going to look into the phone numbers (I mean, what can you really do, other than look at the distribution of numbers – which I have a strong hypothesis about already).
Looking at the number of accounts which include numerals versus those that do not, the split is fairly even – 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:
The remainder of the tail is not present, which I assume would fill in if there were more data. I had the axis stretch to 30 for perspective, as there was one username in the file of length 29.
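For the curious, those username breakdowns come down to a few lines of R, continuing with the hypothetical accounts data frame from the sketch above:

# Usernames containing at least one digit vs. not
has_digit <- grepl("[0-9]", accounts$username)
table(has_digit)  # counts with and without numerals
mean(has_digit)   # proportion containing numerals

# Check for purely numeric usernames
sum(grepl("^[0-9]+$", accounts$username))

# Distribution of username lengths, with the axis stretched to 30
hist(nchar(accounts$username), breaks = 0:30,
     xlab = "Username length", main = "Snapchat username lengths")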


If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. Natural phenomena follow, or nearly follow, theoretical distributions amazingly closely
I’m not in the leak, so I’m not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when using smartphone apps, and ensuring there is some measure of care and accountability on the creators’ end.

Snapchat has released a new statement promising an update to the app that makes the compromised feature optional, increased security around the API, and more open collaboration with security experts.

Toronto Licensed Cats & Dogs 2012 Data Visualization

It’s raining cats and dogs! No, I lied, it’s not.

But I wanted to do some more data viz and work with some more open data.

So for this quick plot I present, Cat and Dog Licenses in the City of Toronto for 2012, visualized!

Above in the top pane is the number of licensed cats and dogs per postal code (or Forward Sortation Area, FSA). I really would have liked to produce a filled map (choropleth) of the different postal code areas; however, Tableau unfortunately does not have Canadian postal code boundaries, just lat/lon points, and getting geographic data in is a bit of an arduous process.

I needed something to plot given that I just had counts of cat and dog licenses per FSA, so I threw up a scatter plot, and there is amazing correlation! Surprise, surprise: this is just a third variable at work, and I bet that if you found a map of (human) population density by postal code you’d see why the two quantities are so closely related. Or perhaps not – this is just my assumption – maybe some areas of the GTA are better about getting their pets licensed, or simply have more cats and dogs. Interesting food for thought.
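If you did get your hands on population figures per FSA, checking that hunch would only take a couple of lines of R. A sketch, where the fsa data frame and its cats, dogs and population columns are hypothetical stand-ins:

# A sketch with a hypothetical data frame 'fsa' holding licence counts
# and human population per forward sortation area
cor(fsa$cats, fsa$dogs)  # raw correlation between cat and dog licences

# If population drives both counts, the correlation between the residuals
# (after regressing each count on population) should be much weaker
cor(resid(lm(cats ~ population, data = fsa)),
    resid(lm(dogs ~ population, data = fsa)))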

Above is the number of licenses per breed type. Note that the scale is logarithmic for both as the “hairs” (domestic shorthair, domestic mediumhair and domestic longhair) dominate for cats and I wanted to keep the two graphs consistent.

The graphs are searchable by keyword, try it out!

Also, I find it shocking that the second most popular breed of dog was the Shih Tzu and the fourth most common type of cat was the Siamese – really?


Toronto Licensed Cat & Dog Reports (at Toronto Open Data Portal)

Toronto Animal Services

Tableau A-Go-Go: Signalized Intersection Traffic and Pedestrian Volume (Toronto Open Data)

First go at creating something usable with Tableau Public. There are no search suggestions in the text box filter, but you can type the name of a street and see just those points (e.g. Yonge). Kind of cool.

You can find the original data set here.

Prior art here and here.

P.S. Tableau Maps are not the same as Google Maps. Hold shift and click and drag to pan.

rhok (n’ roll)

This past weekend was rhok Toronto: a fun, exhausting, educational, and all-around amazing weekend that I was honoured to be involved in.

The team I was fortunate enough to be a part of produced a prototype web service to promote fair housing and simplify the submission process for investigations into housing by-law violations. An added bonus was that this work resulted in a nice visualization of more City of Toronto data.

You can learn more about rhok here.