Mapping the TTC Lines with R and Leaflet

It’s been quite a while since I’ve written a post, but lately I’ve become really interested in mapping and so have been checking out different tools for the job, one of which is Leaflet. This is a case where, thanks to a well-written R package, it’s easy to create interactive web maps directly from R, without even knowing any JavaScript!

I had three requirements for myself:

  1. Write code that created an interactive web map using Leaflet
  2. Use Shapefile data about the City of Toronto
  3. Allow anyone to run it on their machine, without having to download or extract data

I decided to use shapefile data on the TTC, available from Toronto’s Open Data portal. Point #3 required a little research, as the shapefile itself was buried within a zip, but it’s fairly straightforward to write R code to download and unpack zip files into a temporary directory.

The code is below, followed by the result. Not bad for only 10 or 15 lines of code!


# MAPPING THE TORONTO SUBWAY LINES USING R & Leaflet
# --------------------------------------------------
#
# Myles M. Harrison
# http://www.everydayanalytics.ca

#install.packages('leaflet')
#install.packages('maptools')
library(leaflet)
library(htmlwidgets)
library(maptools)

# Data from Toronto's Open Data portal: http://www.toronto.ca/open

# Download the zip file to a temp directory and read in the shapefile
data_url <- "http://opendata.toronto.ca/gcc/TTC_subway%20lines_wgs84.zip"
cur_dir <- getwd()
temp_dir <- tempdir()
setwd(temp_dir)
download.file(data_url, "subway_wgs84.zip")
unzip("subway_wgs84.zip")
sh <- readShapeLines("subway_wgs84.shp")
unlink(dir(temp_dir))
setwd(cur_dir)

# Create a categorical coloring function
linecolor <- colorFactor(rainbow(16), sh@data$SBWAY_NAME)

# Plot using leaflet
m <- leaflet(sh) %>%
  addTiles() %>%
  addPolylines(popup = as.character(sh@data$SBWAY_NAME), color = linecolor(sh@data$SBWAY_NAME)) %>%
  addLegend(colors = linecolor(sh@data$SBWAY_NAME), labels = sh@data$SBWAY_NAME)

m

# Save the output
saveWidget(m, file="TTC_leaflet_map.html")

Plotting Choropleths from Shapefiles in R with ggmap – Toronto Neighbourhoods by Population

Introduction

So, I’m not really a geographer. But any good analyst worth their salt will eventually have to do some kind of mapping or spatial visualization. Mapping is not really a forte of mine, though I have played around with it some in the past.
I was working with some shapefile data a while ago and thought about how it’s funny that so much spatial data is dominated by a format that is basically proprietary. I looked around for some good tutorials on using shapefile data in R, and even so it took me longer to figure out than I would have thought.
So I thought I’d put together a simple example of making nice choropleths using R and ggmap. Let’s do it using some nice shapefile data of my favourite city in the world courtesy of the good folks at Toronto’s Open Data initiative.

Background

We’re going to plot the shapefile data of Toronto’s neighbourhoods boundaries in R and mash it up with demographic data per neighbourhood from Wellbeing Toronto.
We’ll need a few spatial plotting packages in R (ggmap, rgeos, maptools).
Also, the shapefile threw some kind of weird error when I originally tried to load it into R, but it was nothing that loading it into QGIS once and re-saving couldn’t fix. The working version is available on the GitHub page for this post.

Analysis

First let’s just load in the shapefile and plot the raw boundary data using maptools. What do we get?
# Load the spatial packages, read in the neighbourhood shapefile data and plot
library(maptools)
library(rgeos)
shpfile <- "NEIGHBORHOODS_WGS84_2.shp"
sh <- readShapePoly(shpfile)
plot(sh)
This just yields the raw polygons themselves. Any good Torontonian would recognize these shapes. There are some maps like these with words squished into the polygons hanging in lots of print shops on Queen Street. Also, as someone pointed out to me, most T-dotters think of the grid of downtown streets as running directly north-south and east-west, but it actually sits on an angle.

Okay, that’s a good start. Now we’re going to include the neighbourhood population from the demographic data file by attaching it to the dataframe within the shapefile object. We do this using the merge function; basically this is like an SQL join. Also, I need to convert the neighbourhood number to an integer first so things work, because R is treating it as a string.

# Add demographic data
# The neighbourhood ID is a string - change it to an integer
sh@data$AREA_S_CD <- as.numeric(sh@data$AREA_S_CD)

# Read in the demographic data and merge on Neighbourhood Id
demo <- read.csv(file="WB-Demographics.csv", header=T)
sh2 <- merge(sh, demo, by.x='AREA_S_CD', by.y='Neighbourhood.Id')
Next we’ll create a nice white to red colour palette using the colorRampPalette function, and then we have to scale the population data so it ranges from 1 to the max palette value and store that in a variable. Here I’ve arbitrarily chosen 128. Finally we call plot and pass that vector of colours into the col parameter:
# Set the palette
p <- colorRampPalette(c("white", "red"))(128)
palette(p)

# Scale the total population to the palette
pop <- sh2@data$Total.Population
cols <- (pop - min(pop))/diff(range(pop))*127+1
plot(sh, col=cols)
And here’s the glorious result!

Cool. You can see that the population is greater for some of the larger neighbourhoods, notably in the east end and The Waterfront Communities (i.e. condoland).

I’m not crazy about this white-red palette, so let’s use RColorBrewer’s ‘Spectral’ palette, which is one of my faves:

# RColorBrewer, Spectral
library(RColorBrewer)
p <- colorRampPalette(brewer.pal(11, 'Spectral'))(128)
palette(rev(p))
plot(sh2, col=cols)

There, that’s better. The dark red neighborhood is Woburn. But we still don’t have a legend so this choropleth isn’t really telling us anything particularly helpful. And it’d be nice to have the polygons overplotted onto map tiles. So let’s use ggmap!


ggmap

In order to use ggmap we have to decompose the shapefile of polygons into something ggmap can understand (a dataframe). We do this using the fortify command. Then we use ggmap’s very handy qmap function, to which we can just pass a search term like we would in Google Maps; it fetches the map tiles for us automatically. Finally we overplot the data using standard calls to geom_polygon, just like you would in other visualizations using ggplot.

The first polygon call is for the filled shapes and the second is to plot the black borders.

# GGPLOT
library(ggmap)
points <- fortify(sh, region = 'AREA_S_CD')

# Plot the neighborhoods
toronto <- qmap("Toronto, Ontario", zoom=10)
toronto + geom_polygon(aes(x=long, y=lat, group=group), data=points, fill='white', alpha=0.25) +
  geom_polygon(aes(x=long, y=lat, group=group), data=points, color='black', fill=NA)
Voila!

Now we merge the demographic data just like we did before, and ggplot takes care of the scaling and legends for us. It’s also super easy to use different palettes by using scale_fill_gradient and scale_fill_distiller for ramp palettes and RColorBrewer palettes respectively.

# Merge the fortified shapefile data with the demographic data, using the neighbourhood ID
points2 <- merge(points, demo, by.x='id', by.y='Neighbourhood.Id', all.x=TRUE)

# Plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_gradient(low='white', high='red')

# Spectral plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

So there you have it! Hopefully this will be useful for other R users wishing to make nice maps in R using shapefiles, or those who would like to explore using ggmap.

References & Resources

Neighbourhood boundaries at Toronto Open Data:
Demographic data from Well-being Toronto:

Toronto Data Science Meetup – Machine Learning for Humans

A little while ago I spoke again at the Toronto Data Science Group, and gave a presentation I called “Machine Learning for Humans”:

I had originally intended to cover a wide variety of general “gotchas” around the practical applications of machine learning; however, with half an hour there’s really only so much you can cover.

The talk ended up being more of an overview of binary classification, as well as some anecdotes around mistakes in using machine learning I’ve actually seen in the field, including:

  • Not doing any model evaluation at all
  • Doing model evaluation but without cross-validation
  • Not knowing what the cold start problem is and how to avoid it with a recommender system

All in all, it was received very well despite being a review for a lot of people in the room. As usual, I took away some learnings around presenting:
  • Always lowball for time (the presentation was rushed despite my blistering pace)
  • Never try to use fancy fonts in PowerPoint and expect them to carry over – it never works (copy and paste them in as an image instead once you’ve got the final presentation)

Dan Thierl of Rubikloud gave a really informative and candid talk about what product management at a data science startup can look like. In particular, I was struck by his honesty around the challenges faced (both from a technical standpoint and with clients), how quickly you have to move / pivot, and how some clients are just looking for simple solutions (Can you help us dashboard?) and are perhaps not at a level of maturity to want or fully utilize a data science solution.
All in all, another great meetup that prompted some really interesting discussion afterward. I look forward to the next one. I’ve added the presentation to the speaking section.

Toronto Cats and Dogs II – Top 25 Names of 2014

I was quite surprised by the relative popularity of my previous analysis of the data for Licensed Cats & Dogs in Toronto for 2011, given how simple it was to put together.

I was browsing the Open Data Portal recently and noticed that there was a new data set for pets: the top 25 names for both dogs and cats. I thought this could lend itself to some quick, easy visualization and be a neat little addition to the previous post.

First we simply visualize the raw counts of the Top 25 names against each other. Interestingly, the top 2 names for both dogs and cats are apparently the same: Charlie and Max.

Next let’s take a look at the distribution of these top 25 names for each type of pet by how long they are, which just involves calculating the name length and then pooling the counts:

You can see that, proportionally, the top dog names are a bit shorter (the distribution is positively / right-skewed) compared to the cat names (slightly negatively / left-skewed). Also note that both are centered around names of length 5, and the one cat name of length 8 (Princess).

Looking at the dog names, do you notice something interesting about them? A particular feature present in nearly all? I did. Nearly every one of the top 25 dog names ends in a vowel. We can see this by visualizing the proportion of the counts for each type of pet by whether the name ends in a vowel or consonant:

To me, this seems to indicate that dogs tend to have more “cutesy” names, usually ending in ‘y’, than cats do.
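
For anyone curious to replicate this, the whole exercise boils down to a few lines of R. The pets data frame below is made up purely for illustration (the real numbers come from the licensed pet names data set), and note I’m counting ‘y’ as a vowel:

# 'pets' is a made-up stand-in for the real top-25 name counts
pets <- data.frame(
  type  = c("dog", "dog", "dog", "cat", "cat", "cat"),
  name  = c("Charlie", "Max", "Buddy", "Charlie", "Max", "Princess"),
  count = c(25, 24, 20, 19, 18, 5),
  stringsAsFactors = FALSE
)

# Name length, with counts pooled by pet type
pets$len <- nchar(pets$name)
xtabs(count ~ type + len, data = pets)

# Does the name end in a vowel? (counting 'y' as a vowel here)
pets$ends_in_vowel <- grepl("[aeiouy]$", tolower(pets$name))
prop.table(xtabs(count ~ type + ends_in_vowel, data = pets), margin = 1)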

Fun stuff, but one thing really bothers me… no “Fido” or “Boots”? I guess some once popular names have gone to the dogs.

References & Resources

Licensed Dog and Cat Names (Toronto Open Data)

Analysis of the TTC Open Data – Ridership & Revenue 2009-2012

Introduction

I would say that the relationship between the citizens of Toronto and public transit is a complicated one. Some people love it. Other people hate it and can’t stop complaining about how bad it is. The TTC want to raise fare prices. Or they don’t. It’s complicated.
I personally can’t say anything negative about the TTC. Running a business is difficult, and managing a complicated beast like Toronto’s public system (and trying to keep it profitable while keeping customers happy) cannot be easy. So I feel for them. 
I rely extensively on public transit – in fact, I used to ride it every day to get to work. All things considered, for what you’re paying, this way of getting around the city is a hell of a good deal (if you ask me) compared to the insanity that is driving in Toronto.
The TTC’s ridership and revenue figures are available as part of the (awesome) Toronto Open Data initiative for accountability and transparency. As I noted previously, I think the business of keeping track of things like how many people ride public transit every day must be a difficult one, so you have to appreciate having this data, even if it is likely more of an approximation and is in a highly summarized format.
There are larger sources of open data related to the TTC which would probably be a lot cooler to work with (as my acquaintance Mr. Branigan has done) but things have been busy at work lately, so we’ll stick to this little analysis exercise.

Background

The data set comprises numbers for: average weekly ridership (in 000’s), annual ridership (peak and off-peak), monthly & budgeted monthly ridership (in 000’s), and monthly revenue, actual and budgeted (in millions $). More info here [XLS doc].

Analysis

First we consider the simplest data: the annual peak and off-peak ridership. Looking at this simple line graph you can see that off-peak ridership has increased more than peak ridership since 2009 – peak and off-peak ridership increased by 4.59% and 12.78% respectively. Total ridership over the period increased by 9.08%.

Below we plot the average weekday ridership by month. As you can see, this reflects the increasing demand on the TTC system we saw summarized yearly above. Unfortunately Google Docs doesn’t have trendlines built-in like Excel (hint hint, Google), but unsurprisingly if you add a regression line the trend is highly significant ( > 99.9%) and the slope gives an increase of approximately 415 weekday passengers per month on average.
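
The charts here were made in Google Docs, but the same trend check is easy to run in R with a simple linear model. The ridership data frame below is simulated purely to show the mechanics; the real numbers live in the TTC spreadsheet:

# Simulated stand-in for average weekday ridership by month, 2009-2012
set.seed(1)
ridership <- data.frame(
  month_index        = 1:48,
  avg_weekday_riders = 1500000 + 415 * (1:48) + rnorm(48, sd = 20000)
)

# Fit a linear trend: the slope is the average change in weekday riders per month
fit_riders <- lm(avg_weekday_riders ~ month_index, data = ridership)
summary(fit_riders)
coef(fit_riders)["month_index"]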

Next we come to the ridership by month. If you look at the plot over the period of time, you can see that there is a distinct periodic behavior:

Taking the monthly averages we can better see the periodicity – there are peaks in March, June & September, and a mini-peak in the last month of the year:

This is also present in both the revenue (as one would expect) and the monthly budget (which means that the TTC is aware of it). As to why this is the case, I can’t immediately discern, though I am curious to know the answer. This is where it would be great to have some finer grained data (daily or hourly) or data related to geographic area or per station to look for interesting outliers and patterns.

Alternatively if we look at the monthly averages over the years of average weekday ridership (an average of averages, I am aware – but the best we can do given the data we have), you can see that there is a different periodic behavior, with a distinct downturn over the summer, reaching a low in August which then recovers in September to the maximum. This is interesting and I’m not exactly sure what to make of it, so I will do what I normally do which is attribute it to students.

Lastly, we come to the matter of the financials. As I said the monthly revenue and budget for the TTC follow the same periodic pattern as the ridership, and on the plus side, with increased ridership, there is increased revenue. Taking the arithmetic difference of the budgeted (targeted) revenue from actual, you can see that over time there is a decrease in this quantity:
Again if you do a linear regression this is highly significant ( > 99.9%). Does this mean that the TTC is becoming less profitable over time? Maybe. Or perhaps they are just getting better at setting their targets? I acknowledge that I’m not an economist, and what’s been done here is likely a gross oversimplification of the financials of something as massive as the TTC.

That being said, the city itself acknowledges [warning – large PDF] that while the total cost per hour for an in-service transit vehicle has decreased, the operating cost has increased, which they attribute to increases in wages and fuel prices. Operating public transit is also more expensive here in TO than other cities in the province, apparently, because we have things like streetcars and the subway, whereas most other cities only have buses. Either way, as I said before, it’s complicated.

Conclusion

I always enjoy working with open data and I definitely appreciate the city’s initiative to be more transparent and accountable by providing the data for public use.
This was an interesting little analysis and visualization exercise and some of the key points to take away are that, over the period in question:
  • Off-peak usage of the TTC is increasing at a greater rate than peak usage
  • Usage as a whole is increasing, with about 415 more weekday riders per month on average, and growth of ~9% from 2009 to 2012
  • There is periodic behavior in the actual ridership per month over the course of the year
  • There is a different periodicity in average weekday ridership per month, with a peak in September
It would be really interesting to investigate the patterns in the data in finer detail, which hopefully should be possible in the future if more granular time-series, geographic, and categorical data become available. I may also consider digging into some of the larger data sets, which have been used by others to produce beautiful visualizations such as this one.

I, for one, continue to appreciate the convenience of public transit here in Toronto and wish the folks running it the best of luck with their future initiatives.

References & Resources

TTC Ridership – Ridership Numbers and Revenues Summary (at Toronto Open Data Portal)

Toronto Progress Portal – 2011 Performance Measurement and Benchmarking Report

The heat is on…. or is it? Trend Analysis of Toronto Climate Data

The following is a guest post from Joel Harrison, PhD, consulting Aquatic Scientist.

For a luddite like me, this is a big step – posting something on the inter-web.  I’m not on Facebook.  I don’t know what Twitter is.  Hell, I don’t even own a smartphone.  But, I’ve been a devoted follower of Myles’ blog for some time, and he was kind enough to let his fellow-geek-of-a-brother contribute something to everyday analytics, so who was I to pass up such an opportunity?

The impetus for my choice of analysis was this:  in celebration of Earth Day, my colleagues and I watched a film about global climate change, which was a nice excuse to eat pizza and slouch in an office chair while sipping Dr. Pepper instead of doing other, presumably useful, things.  Anyway, a good chunk of the film centred on the evidence for anthropogenic greenhouse gas emissions altering the global climate system.

While I’ve seen lots of evidence for recent increases in air temperature in the mid-latitude areas of the planet, there’s nothing quite so convincing as doing your own analysis.  So, I downloaded climate data from Environment Canada and did my own climate change analysis.  I’m an aquatic scientist, not a climate scientist, so if I’ve made any egregious mistakes here, perhaps someone well-versed in climatology will show me the error of my ways, and I’ll learn something.  Anyway, here we go.

Let’s start with mean monthly temperatures from daily means (the average, for each month of the year, of the daily mean temperatures) for the city of Toronto, for which a fairly good record exists (1940 to 2012).  Here’s what the data look like:

So, you can see the clear trend in the data, can’t you?  Trend analysis is a tricky undertaking for a number of reasons, one of which is that variation can exist on a number of temporal scales.  We’re looking at temperatures here, so obviously we would expect significant seasonality in the data, and we are not disappointed:

One method of controlling for the variation related to seasonality is to ‘deseasonalize’ the data by subtracting the monthly medians from each datum.  Let’s look at a boxplot of the deseasonalized data (in part to ensure I’ve done the deseasonalizing correctly!):
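
For anyone wanting to try this at home, here is a rough sketch of the deseasonalizing step in R, using a simulated stand-in for the Environment Canada data (a data frame clim with one row per month and columns year, month and temp):

# Simulated stand-in for the Toronto mean monthly temperature record
clim <- expand.grid(month = 1:12, year = 1940:2012)
clim$temp <- -5 + 12 * sin((clim$month - 4) * pi / 6) +
  0.022 * (clim$year - 1940) + rnorm(nrow(clim), sd = 2)

# Deseasonalize by subtracting each month's median from that month's observations
clim$deseasonalized <- clim$temp - ave(clim$temp, clim$month, FUN = median)

# Boxplot of the deseasonalized data by month - the medians should now all sit near zero
boxplot(deseasonalized ~ month, data = clim)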

Whew, looks good, my R skills are not completely lacking, apparently.  Here are what the deseasonalized data look like, as a time series plot:

Things were clear as mud when we originally viewed the time series plot of all of the data, but after removing the variation related to seasonality, a pattern has emerged:  an increase in temperature from 1940 to 1950, relatively stable temperatures from 1950 to 1960, then a decrease in temperature from 1960 to 1970, and a fairly consistent increase from 1970 to 2012.  Viewing the annual mean temperatures makes this pattern even more conspicuous:

Hold on, you say, why bother going to the trouble of deseasonalizing the data when you could just calculate annual means and perform linear regression to test for a trend?  This is an intuitively attractive way to proceed, but the problem is that if, say, temperatures were getting colder over time during the summer months, but proportionately warmer during the winter, the annual mean temperature would not change over time; the two opposing trends would in effect cancel each other out.  Apparently that is not the case here, as the deseasonalized data and the annual means show a similar pattern, but caution must be exercised for this reason (especially when you have little theoretical understanding of the phenomenon which you are investigating!).

So, this is all nice and good from a data visualization standpoint, but we need to perform some statistics in order to quantify the rate of change, and to decide if the change is significant in the statistical sense.  Below are the results from linear regression analyses of temperature vs. year using the original monthly means, the deseasonalized data, and the annual means.

Dependent (Response) Variable          n     slope    R²      p-value
Monthly Mean Temperature               876   0.022    0.001   0.17
Deseasonalized Monthly Temperatures    876   0.022    0.05    5.82 x 10⁻¹²
Annual Mean Temperature                73    0.022    0.20    4.65 x 10⁻⁵

All 3 analyses yielded a slope of 0.022 °C/yr, which works out to a total increase of about 1.54°C over the roughly 70 years analysed.  The regression based on monthly mean temperatures had a very low goodness of fit (R² = 0.001) and was not significant at the conventional cut-off level of p < 0.05.  This is not surprising given the scatter we observed in the data due to seasonality.  What is therefore also not a surprise is that the deseasonalized data had much better goodness of fit (R² = 0.05), as did the annual mean temperatures (R² = 0.20).  The much higher level of statistical significance of the regression on the deseasonalized data than on the annual means is likely a function of the higher power of the analysis (i.e., 876 data points vs. only 73).

Before we get too carried away here interpreting these results, is there anything we’re forgetting?  Right, those annoying underlying assumptions of the statistical test we just used.  According to Zar (1999), for simple linear regression these are:

  1. For any value of X there exists in the population a normal distribution of Y values.  This also means that, for each value of X there exists in the population a normal distribution of Ɛ’s.
  2. Must assume homogeneity of variances; that is, the variances of these population distributions of Y values (and of Ɛ’s) must all be equal to one another.
  3. The actual relationship is linear.
  4. The values of Y are to have come at random from the sampled population and are to be independent of one another.
  5. The measurements of X are obtained without error.  This…requirement…is typically impossible; so what we are doing in practice is assuming that the errors in the X data are negligible, or at least are small compared with the measurement errors in Y.

Hmm, this suddenly became a lot more complicated.   Let’s check the validity of these assumptions for the regression of the deseasonalized monthly temperatures vs. year.  Well, we can safely say that number 5 is not a concern, i.e., that the dates were measured without error, but what about the others?  Arguably, the data are not actually linear, because of the fall in temperature between 1960 and 1970, so this is something of a concern.  The Shapiro-Wilk test tells us that the residuals are not significantly non-normal (assumption 1) but just barely (p = 0.056).  We can visualize this via a Q-Q (Quantile-Quantile) plot of the residuals:

For the most part the data fall right on the line, but a few points fall below and above the line at the extremes, suggestive of a somewhat ‘heavy tailed’ distribution.  Additionally, let’s inspect the histogram:

Again, there is some slight deviation from normality, as evidenced by the distance of the first and last bars from the rest, but it’s pretty minor.  So, there is some evidence of non-normality, but it appears negligible based on visual inspection of the Q-Q plot and histogram, and it is not statistically significant according to the Shapiro-Wilk test.  So, we’re good as far as normality goes.  Check.
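
For those following along in R, these normality checks look roughly like this, continuing with the simulated clim data from the sketch above:

# Fit the regression of deseasonalized temperature on (decimal) year
clim$dec_year <- clim$year + (clim$month - 0.5) / 12
fit <- lm(deseasonalized ~ dec_year, data = clim)

# Shapiro-Wilk test for non-normality of the residuals
shapiro.test(residuals(fit))

# Q-Q plot and histogram of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))
hist(residuals(fit), breaks = 30)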

What about assumption 2, homogeneity of variances?  This is typically assessed by plotting the residuals against the fitted values, like so:

There does not appear to be a systematic change in the magnitude of the residuals as a function of the predicted values, or at least nothing overly worrisome, so we’re good here, too.

Last, but certainly not least, do our data represent independent measurements?  This last assumption is frequently a problem in trend analysis.  While each temperature was presumably measured on a different day, in the statistical sense this does not necessarily imply that the measurements are not autocorrelated.   Several years of data could be influenced by an external factor which influences temperature over a multi-year timescale (El Niño?) which would cause the data from sequential years to be strongly correlated.  Such temporal autocorrelation (serial dependence) can be visualized using an autocorrelation function (ACF):

The plot tells us that at a variety of lag periods (differences between years) the level of autocorrelation is significant (i.e., the ACF is above the blue line).  The Durbin-Watson test confirms that the overall level of autocorrelation in the residuals is highly significant (p = 4.04 x 10⁻¹³).
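
Again, roughly, on the simulated data (dwtest() comes from the lmtest package, one common implementation of the Durbin-Watson test):

# install.packages('lmtest')
library(lmtest)

# Autocorrelation function of the residuals, then the Durbin-Watson test
acf(residuals(fit))
dwtest(fit)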

So, strictly speaking, linear regression is not appropriate for our data due to the presence of nonlinearity and serial correlation, which violate two of the five assumptions of linear regression analysis.  Now, don’t get me wrong, people violate these assumptions all the time.  Hell, you may have already violated them earlier today if you’re anything like I was in my early days of grad school.  But, as I said, this is my first blog post ever, and I don’t want to come across as some sloppy, apathetic, slap-dash, get-away-with-whatever-the-peer-reviewers-don’t-call-me-out-on type scientist – so let’s shoot for real statistical rigour here!

Fortunately, this is not too onerous a task, as there is a test that was tailor-made for trend analysis, and doesn’t have the somewhat strict requirements of linear regression.  Enter the Hirsch-Slack Test, a variation of the Seasonal Kendall Trend Test, which corrects for both seasonality and temporal autocorrelation.  I could get into more explanation as to how the test works, but this post is getting to be a little long, and hopefully you trust me by now.  So, drum roll please….

The Hirsch-Slack test gives very similar results to those obtained using linear regression; it indicates a highly significant (p = 1.48 x 10⁻⁴) increasing trend in temperature (0.020°C/yr), which is very close to the slope of 0.022°C/yr obtained by linear regression.
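
I won’t reproduce the exact implementation here, but as a rough, readily available stand-in, the Seasonal Mann-Kendall test from the Kendall package can be run on the monthly series; note that it accounts for seasonality but not for the Hirsch-Slack correction for serial correlation:

# install.packages('Kendall')
library(Kendall)

# Seasonal Kendall test on the (simulated) monthly temperature series
# (handles seasonality, but not the serial-correlation correction of Hirsch & Slack 1984)
temp_ts <- ts(clim$temp, start = c(1940, 1), frequency = 12)
SeasonalMannKendall(temp_ts)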

So, no matter which way you slice it, there was a significant increase in Toronto’s temperature over the past 70 years.  I’m curious about what caused the dip in temperature between ~1960 and ~1970, and have a feeling it may reflect changes in aerosols and other aspects of air quality related to urbanization, but don’t feel comfortable speculating too much.  Perhaps it reflects some regional or global variation related to volcanic activity or something, I really have no idea.  Obviously, if we’d performed the analysis on the years 1970 to 2010 the slope (i.e., rate of temperature increase) would have been much higher than for the entire period of record.

I was also curious if Toronto was a good model for the rest of Canada given that it is a large, rapidly growing city, and changes in temperature there could have been related to urban factors, such as the changes in air quality I already speculated about.  For this reason, I performed the same analysis on data from rural Coldwater (near where Myles and I grew up) and obtained very similar results, which suggests the trend is not unique to the city of Toronto.

In case you’re wondering, the vast majority (98%) of Canadians believe the global climate is changing, according to a recent poll by Insightrix Research (but note that far fewer believe that human activity is solely to blame.)  So, perhaps the results of this analysis won’t be a surprise to very many people, but I did find it satisfying to perform the analysis myself, and with local data.

Well, that’s all for now – time to brace ourselves for the coming heat of summer.  I think I need a nice, cold beer.

References & Resources

Zar, J.H. (1999) Biostatistical Analysis, 4th ed. Upper Saddle River, New Jersey: Prentice Hall.
http://books.google.com/books/about/Biostatistical_analysis.html?id=LCRFAQAAIAAJ

Hirsch, R.M. & Slack, J.R. (1984). A Nonparametric Trend Test for Seasonal Data With Serial Dependence. Water Resources Research 20(6), 727-732. doi: 10.1029/WR020i006p00727
http://onlinelibrary.wiley.com/doi/10.1029/WR020i006p00727/abstract

National Post: Climate Change is real, Canadians say, but they can’t agree on the cause
http://news.nationalpost.com/2012/08/16/climate-change-is-real-canadians-say-while-disagreeing-on-the-causes

Climate Data at Canadian National Climate Data and Information Archive
http://climate.weatheroffice.gc.ca/climateData/canada_e.html

Joel Harrison, PhD, Aquatic Scientist
http://www.environmentalsciences.ca/newsite/staff/#harrison

Toronto Licensed Cats & Dogs 2012 Data Visualization

It’s raining cats and dogs! No, I lied, it’s not.

But I wanted to do so more data viz and work with some more open data.

So for this quick plot I present, Cat and Dog Licenses in the City of Toronto for 2012, visualized!


Above in the top pane is the number of licensed cats and dogs per postal code (or Forward Sortation Area, FSA). I would really have liked to produce a filled map (choropleth) of the different postal code areas; however, Tableau unfortunately does not have Canadian postal code boundaries, just lat/lon, and getting geographic data in is a bit of an arduous process.

I needed something to plot given that I just had counts of cat and dog licenses per FSA, so I threw up a scatter plot, and there is an amazing correlation! Surprise, surprise – this is just a third-variable effect, and I bet that if you found a map of (human) population density by postal code you’d see why the two quantities are so closely related. Or perhaps not – this is just my assumption – maybe some areas of the GTA are better about getting their pets licensed or have more cats and dogs. Interesting food for thought.


Above is the number of licenses per breed type. Note that the scale is logarithmic for both as the “hairs” (domestic shorthair, domestic mediumhair and domestic longhair) dominate for cats and I wanted to keep the two graphs consistent.

The graphs are searchable by keyword, try it out!

Also, I find it shocking that the second most popular breed of dog was Shih Tzu and the fourth most popular type of cat was Siamese – really?

Resources

Toronto Licensed Cat & Dog Reports (at Toronto Open Data Portal)

Toronto Animal Services
http://www.toronto.ca/animal_services/

The Hour of Hell of Every Morning – Commute Analysis, April to October 2012

Introduction

So a little while ago I quit my job.

Well, actually, that sounds really negative. I’m told that when you are discussing large changes in your life, like finding a new career, relationship, or brand of diet soda, it’s important to frame things positively.

So let me rephrase that – I’ve left the job I previously held to pursue other directions. Why? Because I have to do what I love. I have to move forward. And I have to work with data. It’s what I want, what I’m good at, and what I was meant to do.

So onward and upward to bigger, brighter and better things.

But I digress. The point is that my morning commute has changed.

Background

I really enjoyed this old post at Omninerd, about commute tracking activities and an attempt to use some data analysis to beat traffic mathematically. So I thought, hey, I’m commuting every day, and there’s a lot of data being generated there – why not collect some of it and analyze it too?

The difference here being that I was commuting with public transit instead of driving. So yes, the title is a bit dramatic (it’s an hour of hell in traffic for some people, I actually quite enjoy taking the TTC).

When I initially started collecting the data, I had intended to time both my commute to and from work. Unfortunately, I discovered that, due to having a busy personal and professional life outside of the 9 to 5, there was little point in tracking my commute at the end of the work day, as I was very rarely going straight home (I was ending up with a very sparse data set). I suppose this was one point of insight into my life before even doing any analysis in this experiment.

So I just collected data on the way to work in the morning.

Without going into the personal details of my life in depth, my commute went something like this:

  • walk from home to station
  • take streetcar from station west to next station
  • take subway north to station near place of work
  • walk from subway platform to place of work

Punching the route into Google Maps, it tells me the entire distance is 11.5 km. As we’ll see from the data, my travel time was pretty consistent and on average took about 40 minutes every morning (I knew this even before beginning the data collection). So my speed with all three modes of transportation averages out to ~17.25 km/hr. That probably doesn’t seem that fast, but if you’ve ever driven in Toronto traffic, trust me, it is.

In terms of the methodology for data collection, I simply used the stopwatch on my phone, starting it when I left my doorstep and stopping it when reaching the revolving doors by the elevators at work.

So all told, I kept track of the date, starting time and commute length (and therefore end time). As with many things in life, hindsight is 20/20, and looking back I realized I could have collected the data in a more detailed fashion by breaking it up for each leg of the journey.

This occurred to me towards the end of the experiment, and so I did this for a day. Though you can’t do much data analysis with just this one day, it gives a general idea of the typical structure of my commute:

Okay, that’s fun and all, but that’s really an oversimplification as the journey is broken up into distinct legs. So I made this graphic which shows the breakdown for the trip and makes it look more like a journey. The activity / transport type is colour-coded the same as the pie chart above. The circles are sized proportionally to the time spent, as are the lines between each section.

There should be another line coming from the last circle, but it looks better this way.

Alternatively the visualization can be made more informative by leaving the circles sized by time and changing the curve lengths to represent the distance of each leg travelled. Then the distance for the waiting periods is zero and the graphic looks quite different:

I really didn’t think the walk from my house was that long in comparison to the streetcar. Surprising.

Cool, no? And there’s an infinite number of other ways you could go about representing that data, but we’re getting into the realm of information design here. So let’s have a look at the data set.

Analysis

So first and foremost, we ask the question, is there a relationship between the starting time of my morning commute and the length of that commute? That is to say, does how early I leave to go to work in the morning impact how long it takes me to get to work, regardless of which day it is?
Before even looking at the data this is an interesting question to consider, as you could assume (I would venture to say know for a fact) that departure time is an important factor for a driving commute as the speed of one’s morning commute is directly impacted by congestion, which is relative to the number of people commuting at any given time.
However, I was taking public transit and I’m fairly certain congestion doesn’t affect it as much. Plus I headed in the opposite direction of most (away from the downtown core). So is there a relationship here?
Looking at this graph we can see a couple things. First of all, there doesn’t appear to be a salient relationship between the commute start time and duration. Some economists are perfectly happy to run a regression and slam a trend line through a big cloud of data points, but I’m not going to do that here. Maybe if there were a lot of points I’d consider it.

The other reason I’m not going to do that is that you can see from looking at this graph that the data are unevenly distributed. There are more of the larger values and outliers in the middle, but that’s only because the majority of my commutes started between ~8:15 and ~9:20, so that’s where most of the data lie.

You can see this if we look at the distribution of starting hour:

I’ve included a density plot as well so I don’t have to worry about bin-sizing issues, though it should be noted that in this case it gives the impression of continuity when there isn’t any. It does help illustrate the earlier point, however, about the distribution of starting times. If I were a statistician (which I’m not) I would comment on the distribution being symmetrical (i.e. not skewed) and on its kurtosis.

The distribution of commute duration, on the other hand, is skewed:

I didn’t have any morning where the combination of my walking and the TTC could get me to North York in less than a half hour.

Next we look at commute duration and starting hour over time. The black line is a 5-day moving average.

Other than several days near the beginning of the experiment in which I left for work extra early, the average start time for the morning trip did not change greatly over the course of the months. It does look like there might be some kind of pattern in the commute duration though, with some periodic peaking?
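
As an aside, the 5-day moving average line is easy to compute in R with a centred filter; the commute_sim data frame below is simulated just to show the mechanics:

# Simulated stand-in for the commute durations (in minutes), one row per workday
commute_sim <- data.frame(time = rnorm(123, mean = 40, sd = 4))

# Centred 5-day moving average (NA at either end)
commute_sim$ma5 <- as.numeric(stats::filter(commute_sim$time, rep(1/5, 5), sides = 2))

plot(commute_sim$time, type = "l", col = "grey", xlab = "Workday", ylab = "Commute duration (min)")
lines(commute_sim$ma5, lwd = 2)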

We can investigate if this is the case by comparing the commute duration per day of week:

There seems to be slightly more variation in the commute duration on Monday, and it takes a bit longer on Thursdays? But look at the y-axis. These aren’t big differences, we’re talking about a matter of several minutes here. The breakdown for when I leave each day isn’t particularly earth-shattering either:

Normally, I’d leave it at that, but are these differences significant? We can do a one-way ANOVA and check:

> aov1 <- aov(starthour ~ weekday, data=commute)
> aov2 <- aov(time ~ weekday, data=commute)
> summary(aov1)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4  0.456  0.1140     0.7  0.593
Residuals   118 19.212  0.1628
> summary(aov2)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4   86.4   21.59   1.296  0.275
Residuals   118 1965.4   16.66

This requires making a lot of assumptions about the data, but assuming they’re true, these results tell us there aren’t statistically significant differences in either the average commute start time or the average commute duration per weekday.

That is to say, on average, it took about the same amount of time per day to get to work and I left around the same time.

This is in stark contrast to what people talk about around the water cooler when they’re discussing their commute. I’ve never done any data analysis on a morning drive myself (or seen any, other than the post at Omninerd), but there are likely more clearly defined weekly patterns in your average driving commute than what we saw here with public transit.

Conclusions

There are a couple of ways you can look at this.
You could say there were no earth-shattering conclusions as a result of the experiment.
Or you could say that, other than the occasional outlier (of the “Attention All Passengers on the Yonge-University-Spadina line” variety) the TTC is remarkably consistent over the course of the week, as is my average departure time (which is astounding given my sleeping patterns).
It’s all about perspective. So onward and upward, until next time.

Resources

How to Beat Traffic Mathematically

TTC Trip Planner
myTTC (independently built by an acquaintance of mine – check out more of his cool work at branigan.ca):
FlowingData: Commute times in your area, mapped [US only]

Quantified Self Toronto #15 – Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slidedeck form.

Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and so much interesting discussion about communications was had, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis which could have been done, and which, in most cases, I had overlooked. These suggestions to dig deeper included analysis of:

  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact, and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it’s what you want to know and what you want to say that help determine how to best look at it.

It’s late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.

Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. More can be found out about them on their awesome blogs, or by visiting quantifiedself.ca

Let’s Go To The Ex!

I went to The Ex (that’s the Canadian National Exhibition for those of you not ‘in the know’) on Saturday. I enjoy stepping out of the ordinary from time to time and carnivals / fairs / midways / exhibitions etc. are always a great way to do that.

As far as exhibitions go, I believe the CNE is one of the more venerable – it’s been around since 1879 and attracts over 1.3 million visitors every year.

Looking at the website before I went, I saw that they had a nice summary of all the ride height requirements and number of tickets required. I thought perhaps the data could stand to be presented in a more visual form.

First, how about the number of tickets required for the different midways? All of the rides on the ‘Kiddie’ Midway require four tickets, except for one (The Wacky Worm Coaster). The Adult Midway rides are split about 50/50 for five or six tickets, except for one (Sky Ride) which only requires four.

With tickets being $1.50 each, or $1 if you buy them in sets of 22 or 55, that makes the ride price range $6-9 or $4-6. Assuming you buy the $1 tickets, the average price of an adult ride is $5.42 and the average price of a child ride $4.04.
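
If you want to check the arithmetic, it looks something like this in R; the ticket counts per ride here are rough guesses based on the description above, not the actual CNE listing:

# Made-up ticket counts per ride, roughly matching the description above
adult_tickets  <- c(4, rep(5, 9), rep(6, 9))   # Sky Ride at 4, the rest split about 50/50
kiddie_tickets <- c(rep(4, 25), 5)             # all at 4 except one (the Wacky Worm, say)

ticket_price <- 1.00                 # per ticket, bought in bulk
mean(adult_tickets)  * ticket_price  # average adult ride price
mean(kiddie_tickets) * ticket_price  # average kiddie ride price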

The rides also have height requirements. Note that I’ve simplified things by taking the max height for cases where shorter/younger kids can ride supervised with an adult. Here’s a breakdown of the percentage of the rides in each midway type children can ride, given their height:

Google Docs does not allow non-stacked stepped area charts, so line graph it is.

And here’s the same breakdown with percentage of the total rides (both midways combined), coloured by type. This is a better way to represent the information, as it shows the discrete nature of the height requirement:

Basically if your child is over 4′ they are good for about 80% of all the rides at the CNE.

Something else to consider – how to get the maximum value for your tickets with none left over, given that they are sold in packs of 22 and 55? I would say go with the $36 all-you-can-ride option. Also, how minuscule are your actual odds of winning those carnival games? Because I want a giant purple plush gorilla.

See you next year!