Plotting Choropleths from Shapefiles in R with ggmap – Toronto Neighbourhoods by Population

Introduction

So, I’m not really a geographer. But any analyst worth their salt will eventually have to do some kind of mapping or spatial visualization. It’s not really a forte of mine, though I have played around with it some in the past.
I was working with some shapefile data a while ago and thought about how it’s funny that so much spatial data is dominated by a format that is basically proprietary. I looked around for some good tutorials on using shapefile data in R, and even so it took me longer to figure out than I would have thought.
So I thought I’d put together a simple example of making nice choropleths using R and ggmap. Let’s do it using some nice shapefile data of my favourite city in the world courtesy of the good folks at Toronto’s Open Data initiative.

Background

We’re going to plot the shapefile data of Toronto’s neighbourhoods boundaries in R and mash it up with demographic data per neighbourhood from Wellbeing Toronto.
We’ll need a few spatial plotting packages in R (ggmap, rgeos, maptools).
Also, the shapefile threw some kind of weird error when I originally tried to load it into R, but it was nothing that loading it into QGIS once and resaving couldn’t fix. The working version is available on the github page for this post.

Analysis

First let’s just load in the shapefile and plot the raw boundary data using maptools. What do we get?
# Read the neighbourhood shapefile data and plot
library(maptools)

shpfile <- "NEIGHBORHOODS_WGS84_2.shp"
sh <- readShapePoly(shpfile)
plot(sh)
This just yields the raw polygons themselves. Any good Torontonian would recognize these shapes. There are maps like these, with words squished into the polygons, hanging in lots of print shops on Queen Street. Also, as someone pointed out to me, most T-dotters think of the grid of downtown streets as running directly north-south and east-west, but it actually sits at an angle.

Okay, that’s a good start. Now we’re going to include the neighbourhood population from the demographic data file by attaching it to the dataframe within the shapefile object. We do this using the merge function; basically this is like an SQL join. I also need to convert the neighbourhood number to a numeric type first so things work, because R is treating it as a string.

# Add demographic data
# The neighbourhood ID is read in as a string - convert it to numeric
sh@data$AREA_S_CD <- as.numeric(sh@data$AREA_S_CD)

# Read in the demographic data and merge on Neighbourhood Id
demo <- read.csv(file="WB-Demographics.csv", header=T)
sh2 <- merge(sh, demo, by.x='AREA_S_CD', by.y='Neighbourhood.Id')
Next we’ll create a nice white-to-red colour palette using the colorRampPalette function, and then we have to scale the population data so it ranges from 1 to the number of palette colours and store that in a variable. Here I’ve arbitrarily chosen a palette of 128 colours. Finally we call plot and pass that vector of palette indices into the col parameter:
# Set the palette
p <- colorRampPalette(c("white", "red"))(128)
palette(p)

# Scale the total population to the palette
pop <- sh2@data$Total.Population
cols <- (pop - min(pop))/diff(range(pop))*127+1
plot(sh, col=cols)
And here’s the glorious result!

Cool. You can see that the population is greater for some of the larger neighbourhoods, notably on the east end and in The Waterfront Communities (i.e. condoland).

I’m not crazy about this white-red palette, so let’s use RColorBrewer’s Spectral palette, which is one of my faves:

# RColorBrewer, Spectral
library(RColorBrewer)
p <- colorRampPalette(brewer.pal(11, 'Spectral'))(128)
palette(rev(p))
plot(sh2, col=cols)

There, that’s better. The dark red neighborhood is Woburn. But we still don’t have a legend so this choropleth isn’t really telling us anything particularly helpful. And it’d be nice to have the polygons overplotted onto map tiles. So let’s use ggmap!


ggmap

In order to use ggmap we have to decompose the shapefile of polygons into something ggmap can understand (a dataframe). We do this using the fortify command. Then we use ggmap’s very handy qmap function, to which we can just pass a search term like we would in Google Maps; it fetches the map tiles for us automatically, and we overplot the data using standard calls to geom_polygon, just as in other visualizations using ggplot.

The first polygon call is for the filled shapes and the second is to plot the black borders.

# GGPLOT
library(ggmap)

points <- fortify(sh, region = 'AREA_S_CD')

# Plot the neighbourhoods
toronto <- qmap("Toronto, Ontario", zoom=10)
toronto + geom_polygon(aes(x=long, y=lat, group=group, alpha=0.25), data=points, fill='white') +
  geom_polygon(aes(x=long, y=lat, group=group), data=points, color='black', fill=NA)
Voila!

Now we merge the demographic data just like we did before, and ggplot takes care of the scaling and legends for us. It’s also super easy to use different palettes by using scale_fill_gradient and scale_fill_distiller for ramp palettes and RColorBrewer palettes respectively.

# Merge the fortified shapefile data with the demographic data, using the neighbourhood ID
points2 <- merge(points, demo, by.x='id', by.y='Neighbourhood.Id', all.x=TRUE)

# Plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_gradient(low='white', high='red')

# Spectral plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

So there you have it! Hopefully this will be useful for other R users wishing to make nice maps from shapefiles, or for those who would like to explore using ggmap.

References & Resources

Neighbourhood boundaries at Toronto Open Data:
Demographic data from Well-being Toronto:

Visual Analytics of Every PS4 and XBox One Game by Install Size

Introduction

I have a startling confession to make: sometimes I just want things to be simple. Sometimes I just want things to be easier. All this talk about “Big Data”, predictive modeling, machine learning, and all those associated bits and pieces that go along with data science can be a bit mentally exhausting. I want to take a step back and work with a smaller dataset, something simple, something easy, something everyone can relate to – after all, that’s what this blog started out being about.

A while back, someone posted on Slashdot that the folks over at Finder.com had put together data sets of the install size of every PS4 and Xbox One game released to date. Being a console owner myself – I’m a PS4 guy, but no fanboy or hardcore gamer by any means – I thought this would be a fun and rather unique data set to play around with, one that would fit well within the category of ‘everyday analytics’. So let’s take a look, shall we?

Background

Very little background required here – the dataset comprises the title, release date, release type (major or indie), console (PS4 or Xbox One), and size in GiB of all games released as of September 10th, 2015. For this post we will ignore the time-related dimension and look only at the quantity of interest: install size.

Analysis

Okay, if I gave this data to your average Excel jockey what’s the first thing they’d do? A high-level summary of the data, broken apart by the categorical variables and summarized by the quantitative one? You got it!
We can see that far more PS4 games have been released than Xbox One games (462 vs. 336), and that the relative proportions of major and indie releases are reversed between the two platforms.

A small aside here on data visualization: it’s worth noting that the above is a good way to go for making a bar chart from a functional perspective. Since there are data labels and the y-axis metric is in the title, we can ditch the axis and maximize the data-ink ratio (well, data-pixel anyhow). I’ve also avoided using a stacked bar chart as interpretation of absolute values tends to suffer when not read from the same baseline. I’m okay with doing it for relative proportions though – as in the below, which further illustrates the difference in release type proportion between the two consoles:

Finally, how does the install size differ between the consoles and game types? If I’m an average analyst and just trying to get a grip on the data, I’d take an average to start:
We can see (unsurprisingly, if you know anything about console games) that major releases tend to be much larger in size than indie. Also in both cases, Xbox install sizes are larger on average: about 1.7x for indie titles and 1.25x for major.
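Since all of this was done in a spreadsheet there’s no code to show, but for the R-inclined, here is a rough sketch of the same summaries, assuming the compiled data set has been exported to a CSV with hypothetical column names Title, Console, Release.Type and Size.GiB:

# Read in the compiled install size data (hypothetical file and column names)
games <- read.csv("console_install_sizes.csv", header=T)

# Number of releases by console and release type
table(games$Console, games$Release.Type)

# Mean install size (GiB) by console and release type
aggregate(Size.GiB ~ Console + Release.Type, data=games, FUN=mean)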

Okay, that’s interesting. But if you’re like me, you’ll be thinking about how 99% of the phenomena in the universe follow a power law or some other kind of non-Gaussian distribution, and so averages are actually not always such a great way to summarize data. Is this the case for our install size data set?

Yes, it is. We can see in this combination histogram / cumulative distribution (in the form of a Pareto chart) that the games follow a power law, with approximately 55 and 65 percent being < 5 GiB for PS4 and Xbox games respectively.
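As a quick sketch of how this could be checked in R, continuing with the hypothetical games data frame from above (the console labels "PS4" and "Xbox One" are likewise assumptions about how the data are coded):

ps4  <- games$Size.GiB[games$Console == "PS4"]
xbox <- games$Size.GiB[games$Console == "Xbox One"]

# Fraction of titles under 5 GiB on each console
mean(ps4 < 5)
mean(xbox < 5)

# Empirical cumulative distributions of install size
plot(ecdf(ps4), main="Install size CDF", xlab="Size (GiB)")
lines(ecdf(xbox), col="red")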

But is this entirely due to the indie games having small sizes? Might the major releases be centered around some average or median size?

No; we can see that even when broken apart by type of release, the power-law-like distribution of install sizes persists. I compared the averages to the medians and found them to still be decent representations of central tendency, not too affected by outliers.

Finally we can look at the distribution of the install sizes by using another type of visualization suited for this task, the boxplot. While it is at least possible to jury-rig a boxplot in Excel (see the excellent how-to over at Peltier Tech), Google Sheets doesn’t give us as much to work with, but I did my best (the data label is at the maximum point, and its value is the difference between the max and Q3):

The plots show that install sizes are generally greater for Xbox One vs. PS4, and that the difference (and skew) appears to be a bit more pronounced for indie games versus major releases, as we saw in the previous figures.
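For reference, the same sort of comparison is a one-liner with base R graphics – again just a sketch using the hypothetical games data frame and column names from above:

# Boxplots of install size by console and release type (log scale)
boxplot(Size.GiB ~ Console + Release.Type, data=games,
        ylab="Install size (GiB)", log="y", las=2)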

Okay, that’s all very interesting, but what about games that are available for both consoles? Are the install sizes generally the same or do they differ?

Difference in Install Size by Console Type
Because we’ve seen that the Xbox install sizes are generally larger than the PlayStation ones, here I take the PS4 size to be the baseline for games which appear on both (that is, differences are of the form Xbox Size – PS4 Size).

Of the 618 unique titles in the whole data set (798 titles if you double count across platform), 179 (~29%) were available on both – so roughly only a third of games are released for both major consoles.

Let’s take a look at the difference in install sizes – do those games which appear for both reflect what we saw earlier?

Yes, for both categories of game the majority are larger on Xbox than PS4 (none were the same size). Overall about 85% of the games were larger on Microsoft’s console (152/179).
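For anyone following along in R, here is a sketch of how this comparison might be reproduced, continuing with the hypothetical games data frame from above:

# Join the titles that appear on both consoles
ps4_games  <- games[games$Console == "PS4",
                    c("Title", "Release.Type", "Size.GiB")]
xbox_games <- games[games$Console == "Xbox One", c("Title", "Size.GiB")]
both <- merge(ps4_games, xbox_games, by="Title", suffixes=c(".ps4", ".xbox"))

# Differences in the Xbox-minus-PS4 direction
both$diff.gib <- both$Size.GiB.xbox - both$Size.GiB.ps4

nrow(both)                # number of titles available on both consoles
mean(both$diff.gib > 0)   # share of those titles that are larger on Xbox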

Okay, but how much larger? Are we talking twice as large? Five times larger? Because the size of the games varies widely (especially between the release types) I opted to go for percentages here:

Unsurprisingly, indie games tend to have larger differences proportionally on average, because they’re generally much smaller in size than major releases. We can see they are nearly twice as large on Xbox vs. PS4, while major releases are about one and a quarter times as large. When games are larger on PS4, the disparity is not as big, and the pattern across release types is the same (though keep in mind the number of games here is a lot smaller than for the former).

Finally, just to ground this a bit more I thought I’d look at the top 10 games in each release type where the absolute differences are the largest. As I said before, here the difference is Xbox size minus PS4:

For major releases, the worst offender for being larger on PS4 is Batman: Arkham Knight (~6.1 GiB difference), while on the opposite end, The Elder Scrolls Online has a ~19 GiB difference. Wow.

For indies, we can see the absolute difference is a lot smaller for those games bigger on PS4, with Octodad having the largest difference of ~1.4 GiB (56% of its PS4 size). Warframe is 19.6 GiB bigger on Xbox than PS4, or 503% larger (!!)

Finally, I’ve visualized all the data together for you so you can explore it yourself. Below is a bubble chart of the Xbox install size plotted against PS4, coloured by release type, where the size of each point represents the absolute value of the percentage difference between the platforms (with the PS4 size taken to be the baseline). So points above the diagonal are larger for Xbox than PS4, and points below the diagonal are larger for PS4 than Xbox. Also note that the scale is log-log. You can see that most of the major releases are pretty close to each other in size, as they nearly lie on the y=x line.
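As a rough static approximation (the interactive chart itself was not made this way), a ggplot2 sketch of the same view, using the both data frame from the earlier sketch, might look like:

library(ggplot2)

# Percentage difference relative to the PS4 size
both$diff.pct <- both$diff.gib / both$Size.GiB.ps4 * 100

ggplot(both, aes(x=Size.GiB.ps4, y=Size.GiB.xbox,
                 colour=Release.Type, size=abs(diff.pct))) +
  geom_point(alpha=0.6) +
  geom_abline(slope=1, intercept=0, linetype="dashed") +   # the y = x line
  scale_x_log10() + scale_y_log10() +
  labs(x="PS4 install size (GiB)", y="Xbox One install size (GiB)",
       size="|% difference|", colour="Release type")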

Conclusion

It’s been nice to get back into the swing of things and do a little simple data visualization, as well as play with a data set that falls into the ‘everyday analytics’ category.
And, as a result, we’ve learned:

  • Xbox games generally tend to have larger install sizes than PS4 ones, even for the same title
  • Game install sizes follow a power law, just like most everything else in the universe (or maybe just 80% of it)
  • What the heck a GiB is
Until next time then, don’t fail to keep looking for the simple beauty in data all around you.

References & Resources

Complete List of Xbox One Install Sizes:
Complete List of PlayStation 4 Install Sizes:
Compiled data set (Google Sheets):
Excel Box and Whisker Diagrams (Box Plots) @ Peltier Tech:

What a Gas! The Falling Price of Oil and Ontario Gasoline Prices

Introduction

In case you’ve been living under a rock, there’s been a lot of chatter in the financial world of late about the price of oil going down. Way, way, down. So much so that the Bank of Canada cut interest rates. What crazy times are these we live in? I thought gas was only going to get more and more expensive until the end of time, or until everyone resorted to driving solar-powered cars.
I’m no economist (or commodities guy) and frankly, a lot of it seems like black magic and voodoo to me, but I thought I’d take a look at the data to see just how things have changed. Plus it’ll be an opportunity to muck about with some time series analysis in R.

Your average economists.

Background

Not much background is really needed other than what I mentioned in the introduction, but, well, we do need some data to work with.
Ontario Gas Prices for a number of cities (as well as the province averages) can be found at the Ontario Ministry of Energy. Unfortunately they’re year-by-year CSVs with a lot of different information row-wise (Weekly, Month, YTD).
No biggie, a simple bash script will take care of downloading all the data with wget and parsing and concatenating the data with other *nix tools:
# Download data
for i in $(seq 1990 2014)
do wget http://www.energy.gov.on.ca/fuelupload/ONTREG$i.csv
done

# Retain the header
head -n 2 ONTREG1990.csv | sed 1d > ONTREG_merged.csv

# Loop over the files and use sed to extract the relevant lines
for i in $(seq 1990 2014)
do
tail -n 15 ONTREG$i.csv | sed 13,15d | sed 's/./-01-'$i',/4' >> ONTREG_merged.csv
done
Great! Now we have all monthly Ontario Gas Price data from 1990-2014 inclusive in one big file.

The WTI data I got from The Federal Reserve Bank of St. Louis, and the forecasts from the US Energy Information Administration.

Analysis

First, a pretty plot of the data:

Wow, that really is a cliff, isn’t it? The average Ontario Gas Price hasn’t been as low as it was in Dec 2014 since the fall of 2010 (Sep 2010, $1.01) and the WTI not since about May of that year.

Now to the fun stuff. Let’s read the data into R and do some hacky time series analysis.

# ts() is in base R (the stats package); we just need the forecast package
library(forecast)

# Read in the Ontario Gas Price Data
data <- read.csv(file="ONTREG_merged.csv", header=T, sep=",")

# Read in the WTI oil price data
WTI_data <- read.csv(file='DCOILWTICO.csv',header=F, col.names=c("Date", "Value"))

# Create a time series object for the WTI and Ontario Avg
WTI <- ts(data=WTI_data$Value, frequency=12, start=c(1990,1), end=c(2014,12))
ON <- ts(data=data$ON.Avg, frequency=12, start=c(1990,1), end=c(2014,12))

# Plot and compare
combined <- cbind(WTI, ON)
plot(combined)

We get a comparison plot of the data:

And we can look at the Ontario Gas Price as a function of the WTI. Linear is on the left, log-log on the right.

Next we build lm model objects and look at the diagnostics. I’ll spare the details, but I feel better about the log-log, so we’ll go with that.

# Create linear models (normal and log-log)
l1 <- lm(ON ~ WTI, data=combined)
l2 <- lm(log(ON) ~ log(WTI), data=combined)

# Compare relative performance
summary(l1)
summary(l2)
plot(l1)
plot(l2)

# Plot
plot(ON ~ WTI, data=combined, pch=16, cex=0.3)
abline(l1)
plot(log(ON) ~ log(WTI), data=combined, pch=16, cex=0.3)
abline(l2)

Lastly, we read in the forecast WTI data and use it to forecast the Ontario Gas price using our second model:
# Read in WTI forecast data
WTI_forecast <- read.csv(file="WTI_forecast.csv", header=F, sep=",", col.names=c("Date", "Value"))

# Forecast Ontario Gas Price
fit <- forecast(l2, newdata=data.frame(WTI=WTI_forecast$Value))

# Unlog
fit$mean <- exp(fit$mean)
fit$lower <- exp(fit$lower)
fit$upper <- exp(fit$upper)
fit$x <- exp(fit$x)

# Plot
plot(fit, ylab='Ontario Average Gas Price (cents/L)')

And there you have it! Putting the forecast data (blue line) and the WTI forecast back into our original graph, we can compare the two:

It’s that easy, right? That’s all there is to it?

Conclusion

Sadly, no. I’ve done some very quick work here and demonstrated some of the types of tools that are available in R, but “real” time-series analysis is a practice which requires much more care and nuance. 
For example, regressing one time series on another assumes that the variables are stationary (i.e. have constant mean and variance) and that the errors are not autocorrelated, properties which are rarely true for real-world series. Using methods such as the above on non-stationary time series can result in what is known as “spurious regression” – finding relationships between variables which don’t really exist, even though the results have a high R-squared and apparently significant p-values.

In these cases, testing the stationarity assumption and then massaging the data (differencing and deseasonalizing) is usually required beforehand to handle the problem, or other models with less strict assumptions than linear regression are more appropriate. For more on this, see the excellent book “Forecasting: Principles and Practice” by Rob J. Hyndman and George Athanasopoulos.
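As a minimal sketch of the kind of check described above – assuming the tseries package, which provides the Augmented Dickey-Fuller unit root test – one might do something like:

library(tseries)

# Null hypothesis of adf.test: the series has a unit root (is non-stationary)
adf.test(ON)
adf.test(WTI)

# If the levels are non-stationary, difference (here on the log scale)
# and re-test before fitting any regression
adf.test(diff(log(ON)))
adf.test(diff(log(WTI)))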

The take-away, as always, is that real life is complicated, and so analysis requires care. Predicting the future has never been, and never will be, easy; either way, I just hope gas stays cheap.

References & Resources

code & data on github:

Fuel Price Data: Regular Unleaded Gasoline 2014 (Ontario Ministry of Energy):

http://www.energy.gov.on.ca/en/fuel-prices/fuel-price-data/?fuel=REG&yr=2015

Crude Oil Prices: West Texas Intermediate (WTI) (Federal Reserve Bank of St. Louis)

Short-Term Energy Outlook (U.S. Energy Information Administration)

Forecasting: Principles and Practice (OTexts)

Toronto Cats and Dogs II – Top 25 Names of 2014

I was quite surprised by the relative popularity of my previous analysis of the data for Licensed Cats & Dogs in Toronto for 2011, given how simple it was to put together.

I was browsing the Open Data Portal recently and noticed that there was a new data set for pets: the top 25 names for both dogs and cats. I thought this could lend itself to some quick, easy visualization and be a neat little addition to the previous post.

First we simply visualize the raw counts of the Top 25 names against each other. Interestingly, the top 2 names for both dogs and cats are apparently the same: Charlie and Max.

Next let’s take a look at the distribution of these top 25 names for each type of pet by how long they are, which just involves calculating the name length and then pooling the counts:

You can see that, proportionally, the top dog names are a bit shorter (the distribution is positively / right-skewed) compared to the cat names (slightly negatively / left-skewed). Also note that both are centered around names of length 5, and the one cat name of length 8 (Princess).
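For the curious, a quick sketch of the calculation in R, assuming the data have been saved to a CSV with hypothetical columns Name, Type and Count:

pets <- read.csv("top_25_pet_names.csv", header=T)

# Calculate the length of each name
pets$Name.Length <- nchar(as.character(pets$Name))

# Pool the licence counts by name length for each type of pet
xtabs(Count ~ Name.Length + Type, data=pets)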

Looking at the dog names, do you notice something interesting about them? A particular feature present in nearly all? I did. Nearly every one of the top 25 dog names ends in a vowel. We can see this by visualizing the proportion of the counts for each type of pet by whether the name ends in a vowel or consonant:

Which, to me, seems to indicate that dogs tend to have more “cutesy” names, usually ending in ‘y’, than cats do.
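And the vowel-versus-consonant split, continuing from the hypothetical pets data frame above (counting a trailing ‘y’ as a vowel here):

# Flag names ending in a vowel (including 'y')
pets$Ends.In.Vowel <- grepl("[aeiouy]$", tolower(pets$Name))

# Proportion of licences for each pet type whose name ends in a vowel
prop.table(xtabs(Count ~ Ends.In.Vowel + Type, data=pets), margin=2)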

Fun stuff, but one thing really bothers me… no “Fido” or “Boots”? I guess some once popular names have gone to the dogs.

References & Resources

Licensed Dog and Cat Names (Toronto Open Data)

Heatmap of Toronto Traffic Signals using RGoogleMaps

A little while back there was an article in blogTO about how a reddit user had used data from Toronto’s Open Data initiative to produce a rather cool-looking map of the locations of all the traffic signals here in the city.

It’s neat because as the author on blogTO notes, it is recognizable as Toronto without any other geographic data being plotted – the structure of the city comes out in the data alone.

Still, I thought it’d be interesting to see it as a geographic heat map, and it was also a good excuse to fool around with mapping using RgoogleMaps.

The finished product is below:

Despite my best efforts with transparency (using my helper function), it’s difficult for anything but the city core to really come out in the intensity map.
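The helper itself isn’t shown in the code below, but a minimal sketch of what an addalpha-style function might look like (the actual colorRampPaletteAlpha.R may be implemented differently) is:

# Take a vector of colours and a vector of alpha values in [0, 1] (same length
# as the colours, or length 1) and return the colours with alpha applied
addalpha <- function(cols, alpha) {
  rgb_vals <- col2rgb(cols) / 255
  rgb(rgb_vals[1, ], rgb_vals[2, ], rgb_vals[3, ], alpha=alpha)
}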

The image without the Google maps tile, and the coordinates rotated, shows the density a little better in the green-yellow areas:

And it’s also straightforward to produce a duplication of the original black and white figure:

The R code is below. The density estimation uses the trusty kde2d function from the MASS library, and a rotation is applied for the latter two figures so that the grid of Toronto’s streets faces ‘up’, as in the original map.

# Toronto Traffic Signals Heat Map
# Myles Harrison
# http://www.everydayanalytics.ca
# Data from Toronto Open Data Portal:
# http://www.toronto.ca/open

library(MASS)
library(RgoogleMaps)
library(RColorBrewer)
source('colorRampPaletteAlpha.R')

# Read in the data
data <- read.csv(file="traffic_signals.csv", skip=1, header=T, stringsAsFactors=F)
# Keep the lon and lat data
rawdata <- data.frame(as.numeric(data$Longitude), as.numeric(data$Latitude))
names(rawdata) <- c("lon", "lat")
data <- as.matrix(rawdata)

# Rotate the lat-lon coordinates using a rotation matrix
# Trial and error led to pi/15.0 (12 degrees)
theta = pi/15.0
m = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow=2)
data <- as.matrix(data) %*% m

# Reproduce William's original map
par(bg='black')
plot(data, cex=0.1, col="white", pch=16)

# Create heatmap with kde2d and overplot
k <- kde2d(data[,1], data[,2], n=500)
# Intensity from green to red
cols <- rev(colorRampPalette(brewer.pal(8, 'RdYlGn'))(100))
par(bg='white')
image(k, col=cols, xaxt='n', yaxt='n')
points(data, cex=0.1, pch=16)

# Mapping via RgoogleMaps
# Find map center and get map
center <- rev(sapply(rawdata, mean))
map <- GetMap(center=center, zoom=11)
# Translate original data
coords <- LatLon2XY.centered(map, rawdata$lat, rawdata$lon, 11)
coords <- data.frame(coords)

# Rerun heatmap
k2 <- kde2d(coords$newX, coords$newY, n=500)

# Create exponential transparency vector and add
alpha <- seq.int(0.5, 0.95, length.out=100)
alpha <- exp(alpha^6-1)
cols2 <- addalpha(cols, alpha)

# Plot
PlotOnStaticMap(map)
image(k2, col=cols2, add=T)
points(coords$newX, coords$newY, pch=16, cex=0.3)

This is a neat little start, and you can see how this type of thing could easily be extended to create a generalized mapping tool, stood up as a web service for example (they’re out there). Case in point: Google Fusion Tables. I’m unsure what algorithm they use, but I find it less satisfying; it looks like some kind of simple point blending:

As always, all the code is on github.

Interactive Visualization: Explore the 2014 “Sunshine List”

The “Sunshine List”, a list of Ontario public servants who make more than $100,000 a year, was brought in under the Harris government in 1996, in an effort to increase transparency about just how much the top-paid public servants were earning.

Now the list is released each year by the Ontario Ministry of Finance.

However, there has been some frustration that the data are not in the most easily accessible format (HTML & PDF? Really guys?).

Stuart A. Thompson was kind enough to provide the data in an easier to digest format (Nick Ragaz also provided it in CSV), as well as producing a tool for exploring it on The Globe and Mail.

I thought it’d be great to get a more visual exploration of the data at fine granularity, so have produced this interactive visualization below.

You can filter the data by searching by employer, name, or position. You can also filter the list by selecting points or groups of points on the scatterplot on the left, and highlight groups of points (or individual employees) by selecting components in the bar graph at the right. The bar graph on the right can also be expanded and collapsed to view the aggregate salary and benefits by employer, or to view the quantities for individual employees.

Hovering over data points on either graph will display the related data in a tooltip – I’ve found this is handy for looking at individual points of interest on the scatterplot. Zoom and explore to find interesting patterns and individuals. Give it a try!


I’ve plotted benefit against salary, with the latter on a logarithmic axis so that the data are more easily visualized and explored (note that I am in no way suggesting that benefits are a function of salary).
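For a rough static approximation of this view (the interactive tool itself was not built this way), an R/ggplot2 sketch might look like the following, where the file name and the Salary and Benefits column names are assumptions:

library(ggplot2)

sunshine <- read.csv("sunshine_list_2014.csv", header=T)

ggplot(sunshine, aes(x=Salary, y=Benefits)) +
  geom_point(alpha=0.3, size=0.8) +
  scale_x_log10() +
  labs(x="Salary (log scale)", y="Taxable benefits")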

Using combinations of all these possible interactions mentioned above you can do some interesting visual analysis: for instance, how do the top salaries and benefits earned by Constables across police departments in Ontario differ (seriously, take a look)? What are the relative pay and benefit levels of professors at Ontario Universities on the list? How much does Rob Ford make?

Something interesting I’ve already noticed is that for many employers there are long horizontal bands where employees’ salaries vary but they are fixed into the same benefit buckets. Others have a different relationship, for example, the benefit / salary ratios of those at my alma mater vs those of employees of the City of Toronto:

Hope this tool will be interesting and useful for those interested in the list. As always feedback is welcome in the comments.

Analysis of the TTC Open Data – Ridership & Revenue 2009-2012

Introduction

I would say that the relationship between the citizens of Toronto and public transit is a complicated one. Some people love it. Other people hate it and can’t stop complaining about how bad it is. The TTC want to raise fare prices. Or they don’t. It’s complicated.
I personally can’t say anything negative about the TTC. Running a business is difficult, and managing a complicated beast like Toronto’s public system (and trying to keep it profitable while keeping customers happy) cannot be easy. So I feel for them. 
I rely extensively on public transit – in fact, I used to ride it every day to get to work. All things considered, for what you’re paying, this way of getting around the city is a hell of a good deal (if you ask me) compared to the insanity that is driving in Toronto.
The TTC’s ridership and revenue figures are available as part of the (awesome) Toronto Open Data initiative for accountability and transparency. As I noted previously, I think the business of keeping track of things like how many people ride public transit every day must be a difficult one, so you have to appreciate having this data, even if it is likely more of an approximation and is in a highly summarized format.
There are larger sources of open data related to the TTC which would probably be a lot cooler to work with (as my acquaintance Mr. Branigan has done) but things have been busy at work lately, so we’ll stick to this little analysis exercise.

Background

The data set comprises numbers for: average weekly ridership (in 000’s), annual ridership (peak and off-peak), monthly & budgeted monthly ridership (in 000’s), and monthly revenue, actual and budgeted (in millions $). More info here [XLS doc].

Analysis

First we consider the simplest data: the annual peak and off-peak ridership. Looking at this simple line graph you can see that the off-peak ridership has increased more than peak ridership since 2009 – peak and off-peak ridership increasing by 4.59% and 12.78% respectively. Total ridership over the period has increased by 9.08%.

Below we plot the average weekday ridership by month. As you can see, this reflects the increasing demand on the TTC system we saw summarized yearly above. Unfortunately Google Docs doesn’t have trendlines built-in like Excel (hint hint, Google), but unsurprisingly if you add a regression line the trend is highly significant ( > 99.9%) and the slope gives an increase of approximately 415 weekday passengers per month on average.
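For what it’s worth, the regression itself is trivial in R – a sketch, assuming the monthly figures have been exported to a CSV with hypothetical columns Month and Avg.Weekday.Ridership (in 000’s):

ttc <- read.csv("ttc_weekday_ridership.csv", header=T)

# Regress average weekday ridership on a simple month index
ttc$index <- seq_len(nrow(ttc))
trend <- lm(Avg.Weekday.Ridership ~ index, data=ttc)

summary(trend)                 # significance of the trend
coef(trend)["index"] * 1000    # slope converted to riders per month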

Next we come to the ridership by month. If you look at the plot over the period of time, you can see that there is a distinct periodic behavior:

Taking the monthly averages we can better see the periodicity – there are peaks in March, June & September, and a mini-peak in the last month of the year:

This is also present in both the revenue (as one would expect) and the monthly budget (which means that the TTC is aware of it). Why this is the case, I can’t immediately discern, though I am curious to know the answer. This is where it would be great to have some finer-grained data (daily or hourly), or data per geographic area or per station, to look for interesting outliers and patterns.

Alternatively if we look at the monthly averages over the years of average weekday ridership (an average of averages, I am aware – but the best we can do given the data we have), you can see that there is a different periodic behavior, with a distinct downturn over the summer, reaching a low in August which then recovers in September to the maximum. This is interesting and I’m not exactly sure what to make of it, so I will do what I normally do which is attribute it to students.

Lastly, we come to the matter of the financials. As I said the monthly revenue and budget for the TTC follow the same periodic pattern as the ridership, and on the plus side, with increased ridership, there is increased revenue. Taking the arithmetic difference of the budgeted (targeted) revenue from actual, you can see that over time there is a decrease in this quantity:
Again if you do a linear regression this is highly significant ( > 99.9%). Does this mean that the TTC is becoming less profitable over time? Maybe. Or perhaps they are just getting better at setting their targets? I acknowledge that I’m not an economist, and what’s been done here is likely a gross oversimplification of the financials of something as massive as the TTC.

That being said, the city itself acknowledges [warning – large PDF] that while the total cost per hour for an in-service transit vehicle has decreased, the operating cost has increased, which they attribute to increases in wages and fuel prices. Operating public transit is also apparently more expensive here in TO than in other cities in the province, because we have things like streetcars and the subway, whereas most other cities only have buses. Either way, as I said before, it’s complicated.

Conclusion

I always enjoy working with open data and I definitely appreciate the city’s initiative to be more transparent and accountable by providing the data for public use.
This was an interesting little analysis and visualization exercise and some of the key points to take away are that, over the period in question:
  • Off-peak usage of the TTC is increasing at a greater rate than peak usage
  • Usage as a whole is increasing, with about 415 more weekday riders per month on average, and a growth of ~9% from 2009 – 2012
  • Periodic behavior in the actual ridership per month over the course of the year
  • Different periodicity in average weekday ridership per month, with a peak in September
It would be really interesting to investigate the patterns in the data in finer detail, which hopefully should be possible in the future if more granular time-series, geographic, and categorical data become available. I may also consider digging into some of the larger data sets, which have been used by others to produce beautiful visualizations such as this one.

I, for one, continue to appreciate the convenience of public transit here in Toronto and wish the folks running it the best of luck with their future initiatives.

References & Resources

TTC Ridership – Ridership Numbers and Revenues Summary (at Toronto Open Data Portal)

Toronto Progress Portal – 2011 Performance Measurement and Benchmarking Report

Toronto Licensed Cats & Dogs 2012 Data Visualization

It’s raining cats and dogs! No, I lied, it’s not.

But I wanted to do so more data viz and work with some more open data.

So for this quick plot I present: Cat and Dog Licenses in the City of Toronto for 2012, visualized!


Above in the top pane is the number of licensed cats and dogs per postal code (or Forward Sortation Area, FSA). I really would like to have produced a filled map (choropleth) with the different postal code areas; however, Tableau unfortunately does not have Canadian postal code boundaries, just lat/lon points, and getting geographic data in is a bit of an arduous process.

I needed something to plot given that I just had counts of cat and dog licenses per FSA, so I threw up a scatterplot, and there is amazing correlation! Surprise, surprise – this is just a third variable at work, and I bet that if you found a map of (human) population density by postal code you’d see why the two quantities are so closely related. Or perhaps not – this is just my assumption – maybe some areas of the GTA are better about getting their pets licensed or have more cats and dogs. Interesting food for thought.


Above is the number of licenses per breed type. Note that the scale is logarithmic for both as the “hairs” (domestic shorthair, domestic mediumhair and domestic longhair) dominate for cats and I wanted to keep the two graphs consistent.

The graphs are searchable by keyword, try it out!

Also I find it shocking that the second most popular breed of dog was Shih Tzu and the fourth most popular type of cat was Siamese – really?

Resources

Toronto Licensed Cat & Dog Reports (at Toronto Open Data Portal)

Toronto Animal Services
http://www.toronto.ca/animal_services/

OECD Data Visualization Challenge: My Entry

The people behind Visualising are doing some great things. As well as providing access to open data sets, an embeddable data visualization player, and a massive gallery of data visualizations, they are building an online community of data visualization enthusiasts (and professionals) with their challenges and events.

In addition, those behind it (Seed and GE) are also connecting people in the real world with their data visualization marathons for students, which are looking to be the dataviz equivalent of the ever-popular hackathons held around the world. As far as I know no one else is really doing this sort of thing, with a couple of notable exceptions – for example DataKind and their DataDive events (though these are not strictly visualization-focused).

Okay, enough copious hyperlinking.

The latest challenge was to visualize the return on education around the world using some educational indicators from the OECD, and I thought I’d give it a go in Tableau Public.

For my visualization I chose to highlight the differences in the return on education not only between nations, but also the gender-based differences for each country.

I incorporated some other data from the OECD portal on GDP and public spending on education, and so the countries included are those with data present in all three sets.

The World Map shows the countries, coloured by GDP. The bar chart to the right depicts the public spending on education, both tertiary (blue) and non-tertiary (orange), as a percentage of GDP.

The scatterplots contrast both the gender-specific benefit-cost ratios per country, as well as between public (circles) and private (squares) benefit, and between the levels of education. A point higher up on the plots and to the left has a greater benefit-cost ratio (BCR) than a point lower and to the right, which represents a worse investment. The points are sized by internal rate-of-return (ROR).

All in all it was fun and not only did I learn a lot more about using Tableau, it gave me a lot of food for thought about how to best depict data visually as well.

Tableau A-Go-Go: Signalized Intersection Traffic and Pedestrian Volume (Toronto Open Data)

My first go at creating something usable with Tableau Public. There are no search suggestions in the text box filter, but you can type the name of a street and just see those points (e.g. Yonge). Kind of cool.

You can find the original data set here.

Prior art here and here.

P.S. Tableau Maps are not the same as Google Maps. Hold shift and click and drag to pan.