Mapping the TTC Lines with R and Leaflet

It’s been quite a while since I’ve written a post, but as of late I’ve become really interested in mapping and so have been checking out different tools for doing this, one of which is Leaflet. This is a case where, thanks to a well-written R package, it’s easy to create interactive web maps directly from R, without knowing any JavaScript!

I had three requirements for myself:

  1. Write code that created an interactive web map using Leaflet
  2. Use Shapefile data about the City of Toronto
  3. Allow anyone to run it on their machine, without having to download or extract data

I decided to use shapefile data on the TTC, available from Toronto’s Open Data portal. Point #3 required a little research, as the shapefile itself was buried within a zip, but it’s fairly straightforward to write R code to download and unpack zip files into a temporary directory.

The code is below, followed by the result. Not bad for only 10 or 15 lines!


# MAPPING THE TORONTO SUBWAY LINES USING R & Leaflet
# --------------------------------------------------
#
# Myles M. Harrison
# http://www.everydayanalytics.ca

#install.packages('leaflet')
#install.packages('maptools')
library(leaflet)
library(htmlwidgets)
library(maptools)

# Data from Toronto's Open Data portal: http://www.toronto.ca/open

# Download the zip file to a temporary directory, unpack it and read in the shapefile
data_url <- "http://opendata.toronto.ca/gcc/TTC_subway%20lines_wgs84.zip"
cur_dir <- getwd()
temp_dir <- tempdir()
setwd(temp_dir)
download.file(data_url, "subway_wgs84.zip")
unzip("subway_wgs84.zip")
sh <- readShapeLines("subway_wgs84.shp")
unlink(dir(temp_dir))
setwd(cur_dir)

# Create a categorical coloring function
linecolor <- colorFactor(rainbow(16), sh@data$SBWAY_NAME)

# Plot using leaflet
m <- leaflet(sh) %>%
addTiles() %>%
addPolylines(popup = paste0(as.character(sh@data$SBWAY_NAME)), color=linecolor(sh@data$SBWAY_NAME)) %>%
addLegend(colors=linecolor(sh@data$SBWAY_NAME), labels=sh@data$SBWAY_NAME)

m

# Save the output
saveWidget(m, file="TTC_leaflet_map.html")

Plotting Choropleths from Shapefiles in R with ggmap – Toronto Neighbourhoods by Population

Introduction

So, I’m not really a geographer. But any good analyst worth their salt will eventually have to do some kind of mapping or spatial visualization. Mapping is not really a forte of mine, though I have played around with it some in the past.
I was working with some shapefile data a while ago and thought about how it’s funny that so much spatial data is dominated by a format that is basically proprietary. I looked around for some good tutorials on using shapefile data in R, and even so it took me a while to figure it out, longer than I would have thought.
So I thought I’d put together a simple example of making nice choropleths using R and ggmap. Let’s do it using some nice shapefile data of my favourite city in the world courtesy of the good folks at Toronto’s Open Data initiative.

Background

We’re going to plot the shapefile data of Toronto’s neighbourhoods boundaries in R and mash it up with demographic data per neighbourhood from Wellbeing Toronto.
We’ll need a few spatial plotting packages in R (ggmap, rgeos, maptools).
Also, the shapefile threw some kind of weird error when I originally tried to load it into R, but it was nothing that loading it into QGIS once and resaving it couldn’t fix. The working version is available on the github page for this post.

Analysis

First let’s just load in the shapefile and plot the raw boundary data using maptools. What do we get?
# Read the neighborhood shapefile data and plot
shpfile <- "NEIGHBORHOODS_WGS84_2.shp"
sh <- readShapePoly(shpfile)
plot(sh)
This just yields the raw polygons themselves. Any good Torontonian would recognize these shapes. There are maps like these, with words squished into the polygons, hanging in lots of print shops on Queen Street. Also, as someone pointed out to me, most T-dotters think of the grid of downtown streets as running directly North-South and East-West, but it actually sits on an angle.

Okay, that’s a good start. Now we’re going to include the neighbourhood population from the demographic data file by attaching it to the dataframe within the shapefile object. We do this using the merge function; basically this is like an SQL join. Also, I need to convert the neighbourhood number to a numeric value first so things work, because R is treating it as a string.

# Add demographic data
# The neighbourhood ID is a string - convert it to numeric
sh@data$AREA_S_CD <- as.numeric(sh@data$AREA_S_CD)

# Read in the demographic data and merge on Neighbourhood Id
demo <- read.csv(file="WB-Demographics.csv", header=T)
sh2 <- merge(sh, demo, by.x='AREA_S_CD', by.y='Neighbourhood.Id')
Next we’ll create a nice white to red colour palette using the colorRampPalette function, and then we have to scale the population data so it ranges from 1 to the max palette value and store that in a variable. Here I’ve arbitrarily chosen 128. Finally we call plot and pass that vector of colours into the col parameter:
# Set the palette
p <- colorRampPalette(c("white", "red"))(128)
palette(p)

# Scale the total population to the palette
pop <- sh2@data$Total.Population
cols <- (pop - min(pop))/diff(range(pop))*127+1
plot(sh, col=cols)
And here’s the glorious result!

Cool. You can see that the population is greater for some of the larger neighbourhoods, notably on the east end and The Waterfront Communities (i.e. condoland)

I’m not crazy about this white-red palette so let’s use RColorBrewer’s spectral which is one of my faves:

# RColorBrewer spectral palette
library(RColorBrewer)
p <- colorRampPalette(brewer.pal(11, 'Spectral'))(128)
palette(rev(p))
plot(sh2, col=cols)

There, that’s better. The dark red neighborhood is Woburn. But we still don’t have a legend so this choropleth isn’t really telling us anything particularly helpful. And it’d be nice to have the polygons overplotted onto map tiles. So let’s use ggmap!


ggmap

In order to use ggmap we have to decompose the shapefile of polygons into something ggmap can understand (a dataframe). We do this using the fortify function. Then we use ggmap’s very handy qmap function, to which we can just pass a search term as we would in Google Maps; it fetches the map tiles for us automatically, and we overplot the data using standard calls to geom_polygon, just as you would in other visualizations using ggplot.

The first polygon call is for the filled shapes and the second is to plot the black borders.

#GGPLOT 
points <- fortify(sh, region = 'AREA_S_CD')

# Plot the neighborhoods
toronto <- qmap("Toronto, Ontario", zoom=10)
toronto + geom_polygon(aes(x=long, y=lat, group=group, alpha=0.25), data=points, fill='white') +
  geom_polygon(aes(x=long, y=lat, group=group), data=points, color='black', fill=NA)
Voila!

Now we merge the demographic data just like we did before, and ggplot takes care of the scaling and legends for us. It’s also super easy to use different palettes by using scale_fill_gradient and scale_fill_distiller for ramp palettes and RColorBrewer palettes respectively.

# merge the shapefile data with the social housing data, using the neighborhood ID
points2 <- merge(points, demo, by.x='id', by.y='Neighbourhood.Id', all.x=TRUE)

# Plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_gradient(low='white', high='red')

# Spectral plot
toronto + geom_polygon(aes(x=long,y=lat, group=group, fill=Total.Population), data=points2, color='black') +
scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

So there you have it! Hopefully this will be useful for other R users wishing to make nice maps in R using shapefiles, or those who would like to explore using ggmap.

References & Resources

Neighbourhood boundaries at Toronto Open Data:
Demographic data from Well-being Toronto:

Visualization and Analysis of Reddit’s “The Button” Data

Introduction

People are weird. And if there’s anything that’s greater collective proof of this fact than Reddit, you’d be hard pressed to find it.

I tend to put reddit in the same bucket as companies like Google, Amazon and Netflix, where they have enough money, or freedom, or both, to say something like “wouldn’t it be cool if….?” and then they do it simply because they can.

Enter “the button” (/r/thebutton), reddit’s great social experiment that appeared on April Fool’s Day of this year: an enticing blue rectangle with a timer that counts down from 60 to zero and resets whenever the button is pushed, with no explanation as to what happens if the time is allowed to run out. Sound familiar? The catch here was that it was an experience shared by everyone who visited the site, and each user only got one press (though many made attempts to game the system, at least initially).

Finally, the timer reached zero, the last button press being at 2015-06-05 21:49:53.069000 UTC, and the game (rather anti-climactically, I might offer) ended.

What does this have to do with people being weird? Well, an entire mythology was built up around the button, amongst other things. Okay, maybe interesting is a better word. And maybe we’re just talking about your average redditor.

Either way, what interests me is that when the experiment ended, all the data were made available. So let’s have a look shall we?

Background

The dataset consists of just four fields:

  1. press time, the date and time the button was pressed
  2. flair, the flair the user was assigned based on what the timer read when they pushed the button
  3. css, the flair class given to the user
  4. outage press, a Boolean indicator of whether the press occurred during a site outage
The data span a time period from 2015-04-01 16:10:04.468000 to 2015-06-05 21:49:53.069000, with a total of 1,008,316 rows (unique presses).
I found there was css missing for some rows, and a lot of “non presser” flair (users who were not eligible to press the button as their account was created after the event started). For these I used a “missing” value of -1 for the number of seconds remaining when the button was pushed; otherwise the number of seconds could be stripped from the css field.
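As a rough sketch of that preparation step (the column names here, like press_time and css, are illustrative assumptions and not necessarily those in the published CSV):

# Read the button press data and strip the seconds remaining from the css field
presses <- read.csv("thebutton_presses.csv", stringsAsFactors=FALSE)

# css values look like "34s" for presses; anything non-numeric ("non presser",
# missing) gets the placeholder value of -1
secs <- suppressWarnings(as.numeric(sub("s$", "", presses$css)))
presses$seconds_remaining <- ifelse(is.na(secs), -1, secs)

# Parse the press timestamps (all in UTC)
presses$press_time <- as.POSIXct(presses$press_time, format="%Y-%m-%d %H:%M:%OS", tz="UTC")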

Analysis

With this data set, we’re looking at a pretty straightforward categorical time series.
Overall Activity in Time
First we can just look at the total number of button presses, regardless of what the clock said (when they occurred in the countdown) by plotting the raw number of presses per day:
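A minimal ggplot2 sketch of this kind of figure (not the original plotting code), using the presses data frame from the sketch above:

# Count and plot the number of presses per day
library(ggplot2)
presses$day <- as.Date(presses$press_time)
daily <- as.data.frame(table(day=presses$day))
daily$day <- as.Date(as.character(daily$day))
ggplot(daily, aes(x=day, y=Freq)) +
  geom_line() +
  labs(x="Date", y="Button presses per day")
# For the log-scaled version discussed below, add scale_y_log10()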

Hmmm… you can see there is a massive spike at the beginning of the graph and there are far, far fewer presses for the rest of the duration of the experiment. In fact, nearly 32% of all clicks occurred in the first day, and over half (51.3%) in the first two days. 
I think this has something to do both with the initial interest in the experiment when it was first announced, and with the fact that the higher the counter was kept, the more people could press the button in the same time period (more on this later).
Perhaps a logarithmic graph for the y-axis would be more suitable?
That’s better. We can see the big drop-off in the first two days or so, and the little blip around the 18th of May is also more apparent. This is likely tied to one of several technical glitches which are noted in the button wiki.

For a more granular look, let’s do the hourly presses as well (with a log scale):

Cool. The spike on the 18th seems to be mainly around one hour with about a thousand presses, and we can see too that perhaps there’s some kind of periodic behavior in the data on an hourly basis? If we exclude some of the earlier data we can also go back to not using a log scale for the y-axis:

Let’s look more into the hours of the day when the button presses occur. We can create a simple bar plot of the count of button presses by hour overall:

You can see that the vast majority occurred around 5 PM and then there is a drop-off after that, with the lows being in the morning hours between about 7 and noon. Note that all the timestamps for the button pushes are in Universal Time. Unfortunately there is no geo data, but assuming most redditors who pushed the button are within the continental United States (a rather fair assumption) the high between 5-7 PM would be 11 AM to 1 PM (so, around your lunch hour at work).

But wait, that was just the overall sum of hours over the whole time period. Is there a daily pattern? What about by hour and day of week? Are most redditors pushing the button on the weekend or are they doing it at work (or during school)? We should look into this in more detail.

Hmm, nope! The majority of the clicks occurred Wednesday-Thursday night. But as we know from the previous graphs, the vast majority also occurred within the first two days, which happened to be a Wednesday and Thursday. So the figures above aren’t really that insightful, and perhaps it would make more sense to look at the trending in time across both day and hour? That would give us the figure as below:

As we saw before, there is a huge amount of clicks in the first few days (the first few hours even) so even with log scaling it’s hard to pick out a clear pattern. But most of the presses appear to be present in the bands after 15:00 and before 07:00. You can see the clicks around the outage on the 18th of May were in the same high period, around 18:00 and into the next day.

Maybe alternate colouring would help?

That’s better. Also if we exclude the flurry of activity in the first few days or so, we can drop the logarithmic scaling and see the other data in more detail:

To get a more normalized view, we can also look at the percentage of daily clicks per hour for each day, which yields a much more interesting view, and really shows the gap in the middle and the outage on the 18th:

Activity by Seconds Remaining
So far we’ve only looked at the button press activity by the counts in time. What about the time remaining for the presses? That’s what determined each individual reddit user’s flair, and was the basis for all the discussion around the button.

The reddit code granted flairs which were specific to the time remaining when the button was pushed.  For example, if there were 34 seconds remaining, then the css would be “34s”, so it was easy to strip these and convert into numeric data. There were also those that did not press the button who were given the “non presser” flair (6957 rows, ~0.69%), as well as a small number of entries missing flair (67, <0.01%), which I gave the placeholder value of -1.

The remaining flair classes served as a bucketing which functioned very much like a histogram:

Color          Have they pressed?   Can they press?   Timer number when pressed
Grey/Gray      N                    Y                 NA
Purple         Y                    N                 60.00 ~ 51.01
Blue           Y                    N                 51.00 ~ 41.01
Green          Y                    N                 41.00 ~ 31.01
Yellow         Y                    N                 31.00 ~ 21.01
Orange         Y                    N                 21.00 ~ 11.01
Red            Y                    N                 11.00 ~ 00.00
Silver/White   N                    N                 NA

We can see this if we plot a histogram of the button presses using the seconds remaining taken from the CSS class (which gives the more granular value), with the same breaks as above:
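A quick sketch of such a histogram, using the seconds_remaining column from the earlier data-preparation sketch:

# Histogram of presses bucketed the same way as the flair classes
# (the -1 placeholder rows are excluded)
pressed <- presses$seconds_remaining[presses$seconds_remaining >= 0]
hist(pressed, breaks=c(0, 11, 21, 31, 41, 51, 60),
     xlab="Seconds remaining when pressed", main="Button presses by flair bucket")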

We can see there is a much greater proportion of those who pressed with 51-60s left, and a falloff from there (power law). This is in line with what we saw in the time series graphs: the more the button was pressed, the more presses could occur in a given interval of time, and so we expect that most of those presses occurred during the peak activity at the beginning of the experiment (which we’ll soon examine).

What’s different from the button wiki documentation above is the “cheater” class, which was given to those who tried to game the system by doing things like disconnecting their internet and pressing the button multiple times (as far as I can tell). You can see that plotting a bar graph gives something similar to the above histogram, with the difference being contained in the “cheater” class:

Furthermore, looking over the time period, how are the presses distributed in each class? What about in the cheater class? We can plot a more granular histogram:

Here we can more clearly see the exponential nature of the distribution, as well as little ‘bumps’ around the 10, 20, 30 and 45 second marks. Unfortunately this doesn’t tell us anything about the cheater class as it still has valid second values. So let’s do a boxplot by css class as well, showing both the classes (buckets) as well as their distributions:

Obviously each class has to fit into a certain range given its definition, but we can see some are more skewed than others (e.g. the class for 51-60s is highly negatively skewed, whereas the class for 41-50s has a median around 45). Also we can see that the majority of the cheater class is right near the 60 mark.

If we want to be fancier we can also plot the boxplot using just the points themselves and adding jitter:

This shows the skew of the distributions per class/bucket (focused around “round” times like 10, 30, 45s, etc.) as before, as well as how the vast majority of the cheater class appears to be at the 59s mark.

Presses by seconds remaining and in time
Lastly we can combine the analyses above and look at how the quantity and proportion of button presses varies in time by the class and number of seconds remaining.

First we can look at the raw count of presses per css type per day as a line graph. Note again the scale on the y-axis is logarithmic:

This is a bit noisy, but we can see that the press-6 class (presses with 51-60s remaining) dominates at the beginning, then tapers off toward the end. Presses in the 0-10 class did not appear until after April 15, then eventually overtook the quicker presses, as would have to be the case in order for the timer to run out. The cheater class starts very high with the press-6 class, then drops off significantly and continues to decrease. I would have liked to break this out into small multiples for more clarity, but it’s not the easiest thing to do using ggplot.

Another way to look at it would be the percent of presses by class per day. I’ve written previously about how stacked area graphs are not your friend, but in this case it’s actually not too bad (plus I wanted to learn how to do it in ggplot). If anything it shows the increase in presses in the 51-60 range right after the outage on May 18, and the increase in the 0-10 range toward the end (green):
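One way such a stacked area chart could be built is sketched below; it assumes a hypothetical css_class column holding the bucketed class labels (e.g. "press-6", "cheater"), which is not part of the data-preparation sketch above:

# Percent of daily presses by flair class as a stacked area chart
library(dplyr)
library(ggplot2)
by_class <- presses %>%
  count(day, css_class) %>%
  group_by(day) %>%
  mutate(pct = n / sum(n))

ggplot(by_class, aes(x=day, y=pct, fill=css_class)) +
  geom_area() +
  labs(x="Date", y="Percent of daily presses", fill="Flair class")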

This is all very well and good, but let’s get more granular. We can easily visualize the data more granularly using heatmaps with the second values taken from the user flair to get a much more detailed picture. First we’ll look at a heatmap of this by hour over the time period:
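A rough geom_tile sketch of such an hourly heatmap, again using the columns from the earlier sketches:

# Heatmap of presses by hour and seconds remaining, with a log-scaled fill
library(ggplot2)
presses$hour <- as.POSIXct(format(presses$press_time, "%Y-%m-%d %H:00:00"), tz="UTC")
heat <- as.data.frame(table(hour=presses$hour, seconds=presses$seconds_remaining))
heat <- subset(heat, Freq > 0)
heat$hour <- as.POSIXct(as.character(heat$hour), tz="UTC")
heat$seconds <- as.numeric(as.character(heat$seconds))

ggplot(heat, aes(x=hour, y=seconds, fill=log10(Freq))) +
  geom_tile() +
  labs(x="Time", y="Seconds remaining", fill="log10(presses)")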

Again, the scaling is logarithmic for the counts (here the fill colour). We can see some interesting patterns emerging, but it’s a little too sparse as there are a lot of hours without presses for a particular second value. Let’s really get granular and use all the data on the per second level!

On the left is the data for the whole period with a logarithmic scale, whereas the figure on the right excludes some of the earlier data and uses a linear scale. We can see the peak activity at the beginning in the upper left-hand corner, and then these interesting bands around the 5, 10, 20, 30, and 45 second marks forming and gaining strength over time (particularly toward the end). Interestingly, in addition to the resurgence in near-instantaneous presses after the outage around May 18, there was also a hotspot of presses around the 45s mark close to the end of April. Alternate colouring below:

Finally, we can divide by the number of presses per day and calculate the percent each number of seconds remaining made up over the time period. That gives the figures below:

Here the flurry of activity at the beginning continues to be prominent, but the bands also stand out a little more on a daily basis. We can also see how the proportion of clicks for the smaller number of seconds remaining continues to increase until finally the timer is allowed to run out.

Conclusion

The button experiment is over. In the end there was no momentous meaning to it all, no grand scheme or plan, no hatch exploding into the jungle, just an announcement that the thread would be archived. Again, somewhat anti-climactic.
But it was an interesting experiment, and an interesting data set, given that by its very nature the amount of data that could exist in a given interval of time depended on how the button was being pressed.
And I think it really says something about what the internet allows us to do (both in terms of creating something simply for the sake of it, and collecting and analyzing data), and also about people’s desire to find patterns and create meaning in things, no matter what they are. If you’d asked me, I never would have guessed religions would have sprung up around something as simple as pushing a button. But then again, religions have sprung up around stranger things.
You can read and discuss in the button aftermath thread, and if you want to have a go at it yourself, the code and data are below. Until next time I’ll just keep pressing on.

References & Resources

the button press data (from reddit’s github)

R code for plots

/r/thebutton

What a Gas! The Falling Price of Oil and Ontario Gasoline Prices

Introduction

In case you’ve been living under a rock, there’s been a lot of chatter in the financial world of late about the price of oil going down. Way, way, down. So much so that the Bank of Canada cut interest rates. What crazy times are these we live in? I thought gas was only going to get more and more expensive until the end of time, or until everyone resorted to driving solar-powered cars.
I’m no economist (or commodities guy) and frankly, a lot of it seems like black magic and voodoo to me, but I thought I’d take a look at the data to see just how things have changed. Plus it’ll be an opportunity to muck about with some time series analysis in R.

Your average economists.

Background

Not much background is really needed other than what I mentioned in the introduction, but, well, we do need some data to work with.
Ontario Gas Prices for a number of cities (as well as the province averages) can be found at the Ontario Ministry of Energy. Unfortunately they’re year-by-year CSVs with a lot of different information row-wise (Weekly, Month, YTD).
No biggie, a simple bash script will take care of downloading all the data with wget and parsing and concatenating the data with other *nix tools:
# Download data
for i in $(seq 1990 2014)
do wget http://www.energy.gov.on.ca/fuelupload/ONTREG$i.csv
done

# Retain the header
head -n 2 ONTREG1990.csv | sed 1d > ONTREG_merged.csv

# Loop over the files and use sed to extract the relevant lines
for i in $(seq 1990 2014)
do
tail -n 15 ONTREG$i.csv | sed 13,15d | sed 's/./-01-'$i',/4' >> ONTREG_merged.csv
done
Great! Now we have all monthly Ontario Gas Price data from 1990-2014 inclusive in one big file.

The WTI data I got from The Federal Reserve Bank of St. Louis, and the forecasts from the US Energy Information Administration.

Analysis

First, a pretty plot of the data:

Wow, that really is a cliff, isn’t it? The average Ontario Gas Price hasn’t been as low as it was in Dec 2014 since the fall of 2010 (Sep 2010, $1.01) and the WTI not since about May of that year.

Now to the fun stuff. Let’s read the data into R and do some hacky time series analysis.

# ts() lives in base R's stats package, so only the forecast package needs loading
library(forecast)

# Read in the Ontario Gas Price Data
data <- read.csv(file="ONTREG_merged.csv", header=T, sep=",")

# Read in the WTI oil price data
WTI_data <- read.csv(file='DCOILWTICO.csv',header=F, col.names=c("Date", "Value"))

# Create a time series object for the WTI and Ontario Avg
WTI <- ts(data=WTI_data$Value, frequency=12, start=c(1990,1), end=c(2014,12))
ON <- ts(data=data$ON.Avg, frequency=12, start=c(1990,1), end=c(2014,12))

# Plot and compare
combined <- cbind(WTI, ON)
plot(combined)

We get a comparison plot of the data:

And we can look at the Ontario Gas Price as a function of the WTI. Linear is on the left, log-log on the right.

Next we build lm model objects and look at the diagnostics. I’ll spare the details, but I feel better about the log-log, so we’ll go with that.

# Create linear models (normal and log-log)
l1 <- lm(ON ~ WTI, data=combined)
l2 <- lm(log(ON) ~ log(WTI), data=combined)

# Compare relative performance
summary(l1)
summary(l2)
plot(l1)
plot(l2)

# Plot
plot(ON ~ WTI, data=combined, pch=16, cex=0.3)
abline(l1)
plot(log(ON) ~ log(WTI), data=combined, pch=16, cex=0.3)
abline(l2)

Lastly, we read in the forecast WTI data and use it to forecast the Ontario Gas price using our second model:
# Read in WTI forecast data
WTI_forecast <- read.csv(file="WTI_forecast.csv", header=F, sep=",", col.names=c("Date", "Value"))

# Forecast Ontario Gas Price
fit <- forecast(l2, newdata=data.frame(WTI=WTI_forecast$Value))

# Unlog
fit$mean <- exp(fit$mean)
fit$lower <- exp(fit$lower)
fit$upper <- exp(fit$upper)
fit$x <- exp(fit$x)

# Plot
plot(fit, ylab='Ontario Average Gas Price (cents/L)')

And there you have it! Putting the forecast data (blue line) and the WTI forecast back into our original graph, we can compare the two:

It’s that easy, right? That’s all there is to it?

Conclusion

Sadly, no. I’ve done some very quick work here and demonstrated some of the types of tools that are available in R, but “real” time-series analysis is a practice which requires much more care and nuance. 
For example, regressing one time series on another implicitly assumes the series are stationary (i.e. have constant mean and variance over time) and that the errors are not auto-correlated, properties which are almost never true in the real world. Using methods such as the above on non-stationary time series can result in what is known as “spurious regression” – finding relationships between variables which don’t really exist, even though the results show high R-squared values and small p-values.

In these cases, testing the stationarity assumption and then massaging the data (differencing and deseasonalization) is usually required beforehand to handle the problem, or other models which do not have assumptions as strict as linear regression’s are more appropriate. For more on this, see the excellent book “Forecasting: Principles and Practice” by Rob J. Hyndman and George Athanasopoulos.
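As a brief sketch of where one might start with those checks, using the ON series defined earlier (ndiffs() is from the forecast package, acf() and diff() are base R):

# Inspect autocorrelation and estimate the differencing needed for stationarity
library(forecast)
acf(ON)              # strong, slowly-decaying autocorrelation suggests non-stationarity
ndiffs(ON)           # estimated number of differences required

# One common massaging step: difference the log series at the seasonal lag
ON_adj <- diff(log(ON), lag=12)
plot(ON_adj)
acf(ON_adj)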

The take-away, as always, is that real life is complicated, and so analysis requires care. Predicting the future has never been, and never will be easy; either way, I just hope gas stays cheap.

References & Resources

code & data on github:

Fuel Price Data: Regular Unleaded Gasoline 2014 (Ontario Ministry of Energy):

http://www.energy.gov.on.ca/en/fuel-prices/fuel-price-data/?fuel=REG&yr=2015

Crude Oil Prices: West Texas Intermediate (WTI) (Federal Reserve Bank of St. Louis)

Short-Term Energy Outlook (U.S. Energy Information Administration)

Forecasting: Principles and Practice (OTexts)

The Mandelbrot Set in R

Introduction

I was having a conversation with someone I know about weather forecasts the other day and it went something like this:
“Yes, their forecasts are really not very good. They really need to work on improving them.”
“Well, yes, I think they’re okay, considering what they’re trying to do is impossible.”
The thought that a relatively simple set of equations can give rise to infinitely complex behavior is something which has intrigue and appeal beyond the mathematically minded and academic. Fractals became a part of pop culture, and the idea of chaos is not unknown to those outside mathematical research, as it was popularized by the term “The Butterfly Effect” (though unfortunately I can’t say the movie starring Ashton Kutcher was anything to write home about).
When I was finishing high school and starting my undergraduate degree I got very interested in fractals and chaos. What really spurred this interest was James Gleick’s Chaos, which outlined the history of the “mathematics of chaos” in a highly readable, narrative way. So much so that I later went on to take a pure mathematics course in chaos during my degree, and wrote a program in MATLAB for exploring the Mandelbrot set.
So I thought I’d give it a shot in R.

Background

Unquestionably the most well known fractal of all is the Mandelbrot fractal. Without going into laborious mathematical detail, the Mandelbrot set stems from using complex numbers, which are numbers of the form z = x + yi. The number i, the “imaginary unit”, is the special quantity such that i² = −1. As such, complex numbers are unique in that multiplying two of them together can result in much different behavior than with numbers on the real number line, since multiplication of complex numbers rotates as well as scales.
The Mandelbrot set is the set of numbers produced by the following:
1. Initialize a set of complex numbers z in the complex plane, all starting at z = 0.
2. Iterate the formula

    z_(n+1) = z_n² + c

where c is the complex number corresponding to each point in the complex plane.
3. The Mandelbrot set is the set of points c for which z remains bounded for all n.
And mathematically it can be shown that if under iteration |z| ever exceeds 2, then c is not in the set.
In practice one keeps track of the number of iterations it takes z to diverge for a given c, and then colours the points according to their “speed of divergence”.

Execution

In R this is straightforward code. Pick some parameters about the space (range and resolution for x and y, number of iterations, etc.) and then iterate. Naively, the code could be executed as below:
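As a minimal sketch of what the naive approach looks like (the function name, default ranges and resolution here are illustrative assumptions, not the original code):

# Naive approach: nested for loops over every c in a grid on the complex plane,
# iterating z <- z^2 + c and recording how quickly each point diverges
naive_mandelbrot <- function(xmin=-2, xmax=1, ymin=-1.5, ymax=1.5,
                             res=500, iter=50) {
  x <- seq(xmin, xmax, length.out=res)
  y <- seq(ymin, ymax, length.out=res)
  m <- matrix(0, nrow=res, ncol=res)    # iteration count at divergence (0 = bounded)
  for (i in 1:res) {
    for (j in 1:res) {
      c <- complex(real=x[i], imaginary=y[j])
      z <- 0+0i
      for (k in 1:iter) {
        z <- z^2 + c
        if (Mod(z) > 2) { m[i, j] <- k; break }
      }
    }
  }
  m
}

image(naive_mandelbrot(), col=rev(rainbow(50)))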
However, if you know anything about writing good code in R, you know that for loops are bad, and it’s better to rely upon the highly optimized underpinnings of the language by using array-based functions like which and lapply.  So a more efficient version of the same code is below:
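And a corresponding sketch of a vectorized version, iterating the whole grid at once as a complex matrix and using which() to track the points that have not yet diverged (again, names and parameters are illustrative assumptions):

# Vectorized approach: update every still-bounded point in one step per iteration
vect_mandelbrot <- function(xmin=-2, xmax=1, ymin=-1.5, ymax=1.5,
                            res=500, iter=50) {
  cgrid <- outer(seq(xmin, xmax, length.out=res),
                 1i * seq(ymin, ymax, length.out=res), "+")
  z <- matrix(0+0i, nrow=res, ncol=res)
  m <- matrix(0, nrow=res, ncol=res)
  active <- seq_along(cgrid)            # linear indices of points still bounded
  for (k in 1:iter) {
    z[active] <- z[active]^2 + cgrid[active]
    escaped <- active[which(Mod(z[active]) > 2)]
    m[escaped] <- k
    active <- setdiff(active, escaped)
    if (length(active) == 0) break
  }
  m
}

image(vect_mandelbrot(), col=rev(rainbow(50)))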

We can then plunk these into functions, as sketched above, where the different criteria for the rendered fractal (range, resolution, number of iterations) are parameters.
Let’s compare the runtime between the two, shall we? For the base settings, how does the naive version compare to the vectorized one?
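A sketch of what such a comparison could look like, timing the two sketch functions above with system.time():

# Time the naive and vectorized versions on the same settings
compare_runtimes <- function() {
  print(system.time(naive_mandelbrot()))
  print(system.time(vect_mandelbrot()))
}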
> compare_runtimes()
   user  system elapsed 
 37.713   0.064  37.773 
   user  system elapsed 
  0.482   0.094   0.576 
The results speak for themselves: a ~65x speedup from using R’s vectorized indexing functions instead of for loops! This definitely speaks to the importance of writing optimized code that takes advantage of R’s vectorized functions.
You can tweak the different parameters for the function to return different parts of the complex space. Below are some pretty example plots of what the script can do with different parameters:

Conclusion

What does all this have to do with my conversation about the weather? Why, everything, of course! It’s where chaos theory came from. Enjoy the pretty pictures. It was fun getting R to churn out some nice fractals and good to take a trip down memory lane.

References and Resources

The Mandelbrot Set: 
code on github:

Twitter Pop-up Analytics

Introduction

So I’ve been thinking a lot lately. Well, that’s always true. I should say, I’ve been thinking a lot lately about the blog. When I started this blog I was very much into the whole quantified self thing, because it was new to me, I liked the data collection and analysis aspect, and I had a lot of time to play around with these little side projects.
When I started the blog I called it “everyday analytics” because that’s what I saw it always being; analysis of data on topics that were part of everyday life, the ordinary viewed under the analytical lens, things that everyone can relate to. You can see this in my original about page for the blog which has remained the same since inception.
I was also thinking about how my interest in data analysis, visualization and analytics has matured, and how that’s not really the case so much anymore. The content of everyday analytics has become a lot less everyday. Analyzing the relative nutritional value of different items on the McDonald’s menu (yeesh, looking back now those graphs are pretty bad) is very much something to which most everyone could relate. 2-D Histograms in R? PCA and K-means clustering? Not so much.
So, along this line of thinking, I thought it was high time to get back into the original spirit of the site when it was started, and do some quick quantified-self type analysis about something everyone can relate to – nothing fancy.
Let’s look at my Twitter feed.

Background

It wasn’t always easy to get data out of Twitter. If you look back at how Twitter’s API has changed over the years, there has been considerable uproar about the restrictions they’ve made in updates; however, they’re entitled to do so, as they hold the keys to the kingdom after all (it is their product). In fact, I thought it’d be easiest to do this analysis just using the twitteR package, but it appears to be broken since Twitter made said updates to their API.

Luckily I am not a developer. My data needs are simple for some ad hoc analysis. All I need is the data pulled and I am ready to go. Twitter now makes this easy for anyone to do – just go to your settings page:

And then select the ‘Download archive’ button under ‘Your Twitter Archive’ (here it is a prompt to resend mine, as I took the screenshot after):

And boom! A CSV of all your tweets is in your inbox ready for analysis. After all this talk about working with “Big Data” and trawling through large datasets, it’s nice to take a breather and work with something small and simple.

Analysis

So, as I said, nothing fancy here – I just wrote some intentionally hacky R code to do some “pop-up” analytics given Twitter’s output CSV. Why did I do it this way, which results in 1990ish-looking graphs, instead of doing it in Excel and making it all pretty? Why, for you, of course. Reproducibility. You can take my same R code and run it on your Twitter archive (which is probably a lot larger and more interesting than mine) and get the same graphs.
The data set comprises 328 tweets sent by myself between 2012-06-03 and 2014-10-02. The fields I examined were the datetime field (time parting analysis), the tweet source and the text / content.
Time Parting
First let’s look at the time trending of my tweeting behaviour:
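A rough sketch of how this could be put together from the archive CSV (the column name timestamp below is an assumption about the archive format):

# Read the Twitter archive and count tweets sent per month
tweets <- read.csv("tweets.csv", stringsAsFactors=FALSE)
tweets$date <- as.Date(substr(tweets$timestamp, 1, 10))

monthly <- table(format(tweets$date, "%Y-%m"))
barplot(monthly, las=2, ylab="Tweets sent per month")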
We can see there is some kind of periodicity, with peaks and valleys in how many tweets I send. The sharp decline near the end is because there are only two days of data for October. Also, compared to your average Twitter user, I’d say I don’t tweet a lot, generally only once every two days or so on average:
> summary(as.vector(monthly))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   12.00   11.31   15.00   21.00 

Let’s take a look and see if there is any rhyme or reason to these peaks and valleys:

Looking at the total counts per month, it looks like I’ve tweeted less often in March, July and December for whatever reason (for all of this, pardon my eyeballing..)

What about by day of week?

Looks like I’ve tweeted quite a bit more on Tuesday, and markedly less on the weekend. Now, how does that look over the course of the day?

My peak tweeting time seems to be around 4 PM. Apparently I have sent tweets even in the wee hours of the morning – this was a surprise to me. I took a stab at making a heatmap, but it was quite sparse; however the 4-6 PM peak does persist across the days of the week.

Tweets by Source
Okay, that was interesting. Where am I tweeting from?

Looks like the majority of my tweets are actually sent from the desktop site, followed by my phone, and then sharing from other sites. I attribute this to the fact that I mainly use Twitter to share articles, which isn’t easy to do on my smartphone.

Content Analysis
Ah, now on to the interesting stuff! What’s actually in those tweets?
First let’s look at the length of my tweets in a simple histogram:
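For instance, a quick one-liner sketch (assuming a text column in the archive CSV, as with timestamp above):

# Distribution of tweet lengths in characters
hist(nchar(tweets$text), breaks=20, xlab="Tweet length (characters)", main="")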
Looks like generally my tweets are above 70 characters or so, with a large peak close to the absolute limit of 140 characters.
Okay, but what am I actually tweeting about? Using the very awesome tm package it’s easy to do some simple text mining and pull out both the top frequent terms and the hashtags.
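A minimal sketch of that kind of term and hashtag counting, using tm's standard corpus functions plus a regular expression for the hashtags:

# Top terms via the tm package
library(tm)
corpus <- Corpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(corpus)
head(sort(rowSums(as.matrix(tdm)), decreasing=TRUE), 20)

# Hashtags pulled from the raw text (before punctuation removal strips the '#')
hashtags <- unlist(regmatches(tweets$text, gregexpr("#\\w+", tweets$text)))
head(sort(table(tolower(hashtags)), decreasing=TRUE), 10)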
So apparently I tweet a lot about data, analysis, Toronto and visualization. To anyone who’s read the blog this shouldn’t be overly surprising. Also you can see I pass along articles and interact with others, as “via” and “thanks” are in there too. Too bad about that garbage ampersand.
Overwhelmingly the top hashtag I use is #dataviz, followed of course by #rstats. Again, for anyone who knows me (or has seen one of my talks) this should not come as a surprise. You can also see my use of Toronto Open Data in the #opendata and #dataeh hashtags.

Conclusion

That’s all for now. As I said, this was just a fun exercise to write some quick, easy R code to do some simple personal analytics on a small dataset. On the plus side the code is generalized, so I invite you to take it and look at your own twitter archive.
Or, you could pull all of someone else’s tweets, but that would, of course, require a little more work.

References

code at github
Twitter Help Center: Downloading Your Archive
The R Text Mining (tm) package at CRAN
twitteR package at CRAN

5 Ways to Do 2D Histograms in R

Introduction

Lately I was trying to put together some 2D histograms in R and found that there are many ways to do it, with directions on how to do so scattered across the internet in blogs, forums and of course, Stackoverflow.

As such I thought I’d give each a go and put them all together here for easy reference, while also highlighting their differences.

For those not “in the know”, a 2D histogram is an extension of the regular old histogram, showing the distribution of values in a data set across the range of two quantitative variables. It can be considered a special case of the heat map, where the intensity values are just the count of observations in the data set within a particular area of the 2D space (bucket or bin).

So, quickly, here are 5 ways to make 2D histograms in R, plus one additional figure which is pretty neat.

First and foremost I get the palette looking all pretty using RColorBrewer, and then chuck some normally distributed data into a data frame (because I’m lazy). Also one scatterplot to justify the use of histograms.

# Color housekeeping
library(RColorBrewer)
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
r <- rf(32)

# Create normally distributed data for plotting
x <- rnorm(mean=1.5, 5000)
y <- rnorm(mean=1.6, 5000)
df <- data.frame(x,y)

# Plot
plot(df, pch=16, col='black', cex=0.5)

Option 1: hexbin

The hexbin package slices the space into 2D hexagons and then counts the number of points in each hexagon. The nice thing about hexbin is that it provides a legend for you, which adding manually in R is always a pain. The default invocation produces a pretty sparse-looking monochrome figure. Adding the colramp parameter with a suitable vector produced from colorRampPalette makes things nicer. The legend placement is a bit strange – I adjusted it after the fact, though you could just as well do so in the R code.
##### OPTION 1: hexbin from package 'hexbin' #######
library(hexbin)
# Create hexbin object and plot
h <- hexbin(df)
plot(h)
plot(h, colramp=rf)

Using the hexbinplot function provides greater flexibility, allowing specification of endpoints for the bin counting, and also allowing the provision of a transformation function. Here I did log scaling. Also it appears to handle the legend placement better; no adjustment was required for these figures.

# hexbinplot function allows greater flexibility
hexbinplot(y~x, data=df, colramp=rf)
# Setting max and mins
hexbinplot(y~x, data=df, colramp=rf, mincnt=2, maxcnt=60)

# Scaling of legend - must provide both trans and inv functions
hexbinplot(y~x, data=df, colramp=rf, trans=log, inv=exp)

Option 2: hist2d

Another simple way to get a quick 2D histogram is to use the hist2d function from the gplots package. Again, the default invocation leaves a lot to be desired:
##### OPTION 2: hist2d from package 'gplots' #######
library(gplots)

# Default call
h2 <- hist2d(df)
Setting the colours and making the bin sizing coarser yields a more desirable result. We can also scale so that the intensity is logarithmic as before.
# Coarser binsizing and add colouring
h2 <- hist2d(df, nbins=25, col=r)

# Scaling with log as before
h2 <- hist2d(df, nbins=25, col=r, FUN=function(x) log(length(x)))

Option 3: stat_bin2d from ggplot2

And of course, where would a good R article be without reference to the ggplot way to do things? Here we can use the stat_bin2d function, either added to a ggplot object or as a type of geometry in the call to qplot.
##### OPTION 3: stat_bin2d from package 'ggplot' #######
library(ggplot2)

# Default call (as object)
p <- ggplot(df, aes(x,y))
h3 <- p + stat_bin2d()
h3

# Default call (using qplot)
qplot(x,y,data=df, geom='bin2d')
Again, we probably want to adjust the bin sizes to a desired number, and also ensure that ggplot uses the colours we created before. The latter is done by adding the scale_fill_gradientn function with our colour vector as the colours argument. Log scaling is also easy to add using the trans parameter.
# Add colouring and change bins
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r)
h3

# Log scaling
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r, trans="log")
h3

Option 4: kde2d

Option #4 is to do kernel density estimation using kde2d from the MASS library. Here we are actually starting to stray from discrete bucketing of histograms to true density estimation, as this function does interpolation.
The default invocation uses n = 25 which is actually what we’ve been going with in this case. You can then plot the output using image().

Setting n higher does interpolation and we are into the realm of kernel density estimation, as you can set your “bin size” lower than how your data actually appear. Hadley Wickham notes that in R there are over 20 packages [PDF] with which to do density estimation so we’ll keep that to a separate discussion.

##### OPTION 4: kde2d from package 'MASS' #######
# Not a true heatmap as interpolated (kernel density estimation)
library(MASS)

# Default call
k <- kde2d(df$x, df$y)
image(k, col=r)

# Adjust binning (interpolate - can be computationally intensive for large datasets)
k <- kde2d(df$x, df$y, n=200)
image(k, col=r)

Option 5: The Hard Way

Lastly, an intrepid R user was nice enough to show on Stackoverflow how to do it “the hard way” using base packages.
##### OPTION 5: The Hard Way (DIY) #######
# http://stackoverflow.com/questions/18089752/r-generate-2d-histogram-from-raw-data
nbins <- 25
x.bin <- seq(floor(min(df[,1])), ceiling(max(df[,1])), length=nbins)
y.bin <- seq(floor(min(df[,2])), ceiling(max(df[,2])), length=nbins)

freq <- as.data.frame(table(findInterval(df[,1], x.bin),findInterval(df[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])

freq2D <- diag(nbins)*0
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]

# Normal
image(x.bin, y.bin, freq2D, col=r)

# Log
image(x.bin, y.bin, log(freq2D), col=r)
Not the way I would do it, given all the other options available; however, if you want things “just so”, maybe it’s for you.

Bonus Figure

Lastly I thought I would include this one very cool figure from Computational Actuarial Science with R, which is not often seen: a 2D histogram with regular 1D histograms bordering it, showing the density across each dimension.
##### Addendum: 2D Histogram + 1D on sides (from Computational ActSci w R) #######
#http://books.google.ca/books?id=YWcLBAAAQBAJ&pg=PA60&lpg=PA60&dq=kde2d+log&source=bl&ots=7AB-RAoMqY&sig=gFaHSoQCoGMXrR9BTaLOdCs198U&hl=en&sa=X&ei=8mQDVPqtMsi4ggSRnILQDw&redir_esc=y#v=onepage&q=kde2d%20log&f=false

h1 <- hist(df$x, breaks=25, plot=F)
h2 <- hist(df$y, breaks=25, plot=F)
top <- max(h1$counts, h2$counts)
k <- kde2d(df$x, df$y, n=25)

# margins
oldpar <- par()
par(mar=c(3,3,1,1))
layout(matrix(c(2,0,1,3),2,2,byrow=T),c(3,1), c(1,3))
image(k, col=r) #plot the image
par(mar=c(0,2,1,0))
barplot(h1$counts, axes=F, ylim=c(0, top), space=0, col='red')
par(mar=c(2,0,0.5,1))
barplot(h2$counts, axes=F, xlim=c(0, top), space=0, col='red', horiz=T)

Conclusion

So there you have it! 5 ways to create 2D histograms in R, plus some additional code to create a really snappy looking figure which incorporates the regular variety. I leave it to you to write (or find) some good code for creating legends for those functions which do not include them. Hopefully other R users will find this a helpful reference.

References

code on github
R generate 2D histogram from raw data (Stackoverflow)
Computational Actuarial Science with R (Google Books)
Wickham, Hadley. Density Estimation in R [PDF]

PCA and K-means Clustering of Delta Aircraft

Introduction

I work in consulting. If you’re a consultant at a certain type of company, agency, organization, consultancy, whatever, this can sometimes mean travelling a lot.
Many business travelers ‘in the know’ have heard the old joke that if you want to stay at any type of hotel anywhere in the world and get a great rate, all you have to do is say that you work for IBM.

The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney’s character in Up in the Air were made (ironically I started to read that book, then left it on a plane in a seatback pocket), requests about favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI I’ve got more than most people, but a lot less than the entrepreneur I sat next to one time, who claimed to have close to 3 million).

But I digress.

Background

The point is that, as I said, I spent quite a bit of time travelling for work last year. Apparently the story with frequent flier miles is that it’s best just to pick one airline and stick with it – and this also worked out well, as most companies, including my employer, have preferred airlines, so you often don’t have much of a choice in the matter.

In my case this means flying Delta.

So I happened to notice in one of my many visits to Delta’s website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta’s fleet, coming at it with ‘fresh eyes’? Which planes are similar? Which are dissimilar?

Aircraft data card from Delta.com

The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, as well as categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.

Analysis

As this a data set of many variables (33) I thought this would be an interesting opportunity to practice using a dimensionality reduction method to make the information easier to visualize and analyze.
First let’s just look at the intermediary quantitative variables related to the aircraft physical characteristics: cruising speed, total accommodation, and other quantities like length and wingspan. These variables are about in the middle of the data frame, so we can visualize all of them at once using a scatterplot matrix, which is the default for R’s output if plot() is called on a dataframe.
data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1)

# scatterplot matrix of intermediary (size/non-categorical) variables
plot(data[,16:22])

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.

The exception here is the variable right in the middle, which is the number of engines. There is one lone outlier [Boeing 747-400 (74S)] which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which more strongly discern the aircraft from one another than this one does.

How do we easier visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.

Principal Components Analysis

Next let’s say I know nothing about dimensionality reduction techniques and just naively apply principal components to the data in R:

# Naively apply principal components analysis to raw data and plot
pc <- princomp(data)
plot(pc)

Taking that approach we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principal component is just the range in miles.

# First component dominates greatly. What are the loadings?
summary(pc) # 1 component has > 99% variance
loadings(pc) # Can see all variance is in the range in miles

Importance of components:
                             Comp.1       Comp.2       Comp.3       Comp.4
Standard deviation     2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01
Proportion of Variance    0.9987016 9.337038e-04 1.613651e-04 9.993131e-05
Cumulative Proportion     0.9987016 9.996353e-01 9.997966e-01 9.998966e-01
            

                         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Seat.Width..Club.                                    -0.144 -0.110              
Seat.Pitch..Club.                                    -0.327 -0.248         0.189
Seat..Club.                                                                     
Seat.Width..First.Class.                0.250        -0.160        -0.156  0.136
Seat.Pitch..First.Class.                0.515 -0.110 -0.386  0.112 -0.130  0.183
Seats..First.Class.                     0.258 -0.124 -0.307 -0.109  0.160  0.149
Seat.Width..Business.                  -0.154  0.142 -0.108                     
Seat.Pitch..Business.                  -0.514  0.446 -0.298  0.154 -0.172  0.379
Seats..Business.                       -0.225  0.187                            
Seat.Width..Eco.Comfort.                                     0.285 -0.224       
Seat.Pitch..Eco.Comfort.                0.159                0.544 -0.442       
Seats..Eco.Comfort.                                          0.200 -0.160       
Seat.Width..Economy.                                  0.125  0.110              
Seat.Pitch..Economy.                                  0.227  0.190        -0.130
Seats..Economy.                  0.597        -0.136  0.345 -0.165         0.168
Accommodation                    0.697               -0.104                0.233
Cruising.Speed..mph.                    0.463  0.809  0.289 -0.144  0.115       
Range..miles.             0.999                                                 
Engines                                                                         
Wingspan..ft.                    0.215         0.103 -0.316 -0.357 -0.466 -0.665
Tail.Height..ft.                                     -0.100        -0.187       
Length..ft.                      0.275         0.118 -0.318  0.467  0.582 -0.418
Wifi                                                                            
Video                                                                           
Power                                                                           
Satellite                                                                       
Flat.bed                                                                        
Sleeper                                                                         
Club                                                                            
First.Class                                                                     
Business                                                                        
Eco.Comfort                                                                     

Economy                                         

This is because the scale of the different variables in the data set varies widely; we can see this by plotting the variance of the different columns in the data frame (regular scaling on the left, logarithmic on the right):

# verify by plotting variance of columns
mar <- par()$mar
par(mar=mar+c(0,5,0,0))
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8)
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x')
par(mar=mar)

We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.

# Scale
data2 <- data.frame(scale(data))
# Verify variance is uniform
plot(sapply(data2, var))
After applying the scale() function the variance is now constant across variables

Now we can apply principal components to the scaled data. Note that this can also be done automatically in call to the prcomp() function by setting the parameter scale=TRUE. Now we see a result which is more along the lines of something we would expect:

# Proceed with principal components
pc <- princomp(data2)
plot(pc)
plot(pc, type='l')
summary(pc) # 4 components is both 'elbow' and explains >85% variance

Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type, two of which I’ve read about are:

  1. Pick the number of components which explain 85% or greater of the variation
  2. Use the ‘elbow’ method of the scree plot (on right)
Here we are fortunate in that these two are the same, so we will retain the first four principal components. We put these into a new data frame and plot.
# Get principal component vectors using prcomp instead of princomp
pc <- prcomp(data2)

# First four principal components
comp <- data.frame(pc$x[,1:4])
# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))

So what we are looking at here are twelve 2-D projections of data which live in a 4-D space. You can see there’s a clear outlier in all the dimensions, as well as some bunching together in the different projections.

Normally, I am a staunch opponent of 3D visualization, as I’ve spoken strongly about previously. The one exception to this rule is when the visualization is interactive, which allows the user to explore the space and not lose meaning due to three dimensions being collapsed into a 2D image. Plus, in this particular case, it’s a good excuse to use the very cool, very awesome rgl package.
Click on the images to view the interactive 3D versions (requires a modern browser). You can better see in the 3D projections that the data are confined mainly to one plane on the left (components 1-3), with the exception of the outlier, and that there is also bunching in the other dimensions (components 1, 3, 4 on the right).
library(rgl)
# Multi 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3)
plot3d(comp$PC1, comp$PC3, comp$PC4)
So, now that we’ve simplified the complex data set into a lower dimensional space we can visualize and work with, how do we find patterns in the data, in our case, the aircraft which are most similar? We can use a simple unsupervised machine learning technique like clustering.
Cluster Analysis
 
Here because I’m not a data scientist extraordinaire, I’ll stick to the simplest technique and do a simple k-means – this is pretty straightforward to do in R.
First, how do we determine the number of clusters? The simplest method is to look at the within-groups sum of squares and pick the ‘elbow’ in the plot, similar to the scree plot we did for the PCA previously. Here I used the code from R in Action:
# Determine number of clusters ('mydata' here is our data frame of principal component scores)
mydata <- comp
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
However, it should be noted that it is very important to set the nstart and iter.max parameters (I’ve found 25 and 1000, respectively, to be okay values to use), which the example in Quick-R fails to do; otherwise you can get very different results each time you run the algorithm, as below.
Clustering without the nstart parameter can lead to variable results for each run
Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot
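
For reference, the same within-groups sum of squares calculation with those parameters set might look like the following sketch:

# Within-groups sum of squares with nstart and iter.max set (a sketch)
wss <- (nrow(comp)-1)*sum(apply(comp, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(comp, centers=i,
                                     nstart=25, iter.max=1000)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")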
Here we can see that the “elbow” in the scree plot is at k=4, so we apply the k-means clustering function with k=4 and plot.
# From scree plot elbow occurs at k = 4
# Apply k-means with k=4
k <- kmeans(comp, 4, nstart=25, iter.max=1000)
library(RColorBrewer)
library(scales)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$clust, pch=16)
We can see that the one outlier is in its own cluster, there are three or four aircraft in another, and the remainder are split into two clusters of greater size. We visualize in 3D below, as before (click for the interactive versions):
# 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3, col=k$clust)
plot3d(comp$PC1, comp$PC3, comp$PC4, col=k$clust)
We look at the exact clusters below, in order of increasing size:
# Cluster sizes
sort(table(k$clust))
clust <- names(sort(table(k$clust)))

# First cluster
row.names(data[k$clust==clust[1],])
# Second Cluster
row.names(data[k$clust==clust[2],])
# Third Cluster
row.names(data[k$clust==clust[3],])
# Fourth Cluster
row.names(data[k$clust==clust[4],])
[1] "Airbus A319 VIP"

[1] "CRJ 100/200 Pinnacle/SkyWest" "CRJ 100/200 ExpressJet"
[3] "E120"                         "ERJ-145"

[1] "Airbus A330-200"          "Airbus A330-200 (3L2)"
[3] "Airbus A330-200 (3L3)"    "Airbus A330-300"
[5] "Boeing 747-400 (74S)"     "Boeing 757-200 (75E)"
[7] "Boeing 757-200 (75X)"     "Boeing 767-300 (76G)"
[9] "Boeing 767-300 (76L)"     "Boeing 767-300 (76T)"
[11] "Boeing 767-300 (76Z V.1)" "Boeing 767-300 (76Z V.2)"
[13] "Boeing 767-400 (76D)"     "Boeing 777-200ER"
[15] "Boeing 777-200LR"

[1] "Airbus A319"            "Airbus A320"            "Airbus A320 32-R"
[4] "Boeing 717"             "Boeing 737-700 (73W)"   "Boeing 737-800 (738)"
[7] "Boeing 737-800 (73H)"   "Boeing 737-900ER (739)" "Boeing 757-200 (75A)"
[10] "Boeing 757-200 (75M)"   "Boeing 757-200 (75N)"   "Boeing 757-200 (757)"
[13] "Boeing 757-200 (75V)"   "Boeing 757-300"         "Boeing 767-300 (76P)"
[16] "Boeing 767-300 (76Q)"   "Boeing 767-300 (76U)"   "CRJ 700"
[19] "CRJ 900"                "E170"                   "E175"
[22] "MD-88"                  "MD-90"                  "MD-DC9-50"

The first cluster contains a single aircraft, the Airbus A319 VIP. This plane is on its own and rightly so – it is not part of Delta’s regular fleet but one of Airbus’ corporate jets. This is a plane for people with money, for private charter. It includes “club seats” around tables for working (or not). Below is a picture of the inside of the A319 VIP:

Ahhh, that's the way to fly (some day, some day…). This is apparently the plane professional sports teams and the American military often charter to fly – this article in the Sydney Morning Herald has more details.

The second cluster contains four aircraft – the two CRJ 100/200's and the Embraer E120 and ERJ-145. These are the smallest passenger aircraft, with the smallest accommodations – 28 for the E120 and 50 for the remaining craft. As such, there is only economy seating in these planes, which is what distinguishes them from the remainder of the fleet. The E120 also has the distinction of being the only plane in the fleet with turboprops. Photos below.

Top: CRJ100/200. Bottom left: Embraer E120. Bottom right: Embraer ERJ-145.

I’ve flown many times in the venerable CRJ 100/200 series planes, in which I can assure you there is only economy seating, and which I like to affectionately refer to as “little metal tubes of suffering.”

The other two clusters comprise the remainder of the fleet, the planes with which most commercial air travellers are familiar – your Boeing 7-whatever-7’s and other Airbus and McDonnell-Douglas planes.

These are split into two clusters, which seem to again divide the planes approximately by size (both physical and accommodation), though there is crossover in the Boeing craft.

# Compare accommodation by cluster in boxplot
boxplot(data$Accommodation ~ k$cluster,
        xlab='Cluster', ylab='Accommodation',
        main='Plane Accommodation by Cluster')
# Compare presence of seat classes in largest clusters
data[k$clust==clust[3],30:33]
data[k$clust==clust[4],30:33]
First.Class Business Eco.Comfort Economy
Airbus A330-200 0 1 1 1
Airbus A330-200 (3L2) 0 1 1 1
Airbus A330-200 (3L3) 0 1 1 1
Airbus A330-300 0 1 1 1
Boeing 747-400 (74S) 0 1 1 1
Boeing 757-200 (75E) 0 1 1 1
Boeing 757-200 (75X) 0 1 1 1
Boeing 767-300 (76G) 0 1 1 1
Boeing 767-300 (76L) 0 1 1 1
Boeing 767-300 (76T) 0 1 1 1
Boeing 767-300 (76Z V.1) 0 1 1 1
Boeing 767-300 (76Z V.2) 0 1 1 1
Boeing 767-400 (76D) 0 1 1 1
Boeing 777-200ER 0 1 1 1
Boeing 777-200LR 0 1 1 1
First.Class Business Eco.Comfort Economy
Airbus A319 1 0 1 1
Airbus A320 1 0 1 1
Airbus A320 32-R 1 0 1 1
Boeing 717 1 0 1 1
Boeing 737-700 (73W) 1 0 1 1
Boeing 737-800 (738) 1 0 1 1
Boeing 737-800 (73H) 1 0 1 1
Boeing 737-900ER (739) 1 0 1 1
Boeing 757-200 (75A) 1 0 1 1
Boeing 757-200 (75M) 1 0 1 1
Boeing 757-200 (75N) 1 0 1 1
Boeing 757-200 (757) 1 0 1 1
Boeing 757-200 (75V) 1 0 1 1
Boeing 757-300 1 0 1 1
Boeing 767-300 (76P) 1 0 1 1
Boeing 767-300 (76Q) 1 0 1 1
Boeing 767-300 (76U) 0 1 1 1
CRJ 700 1 0 1 1
CRJ 900 1 0 1 1
E170 1 0 1 1
E175 1 0 1 1
MD-88 1 0 1 1
MD-90 1 0 1 1
MD-DC9-50 1 0 1 1

Looking at the raw data, the difference I can ascertain between the largest two clusters is that all the aircraft in one have first class seating, whereas all the planes in the other have business class instead [the one exception being the Boeing 767-300 (76U)].
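
A quick way to summarize this difference across all four clusters is to average the class-presence indicator columns (columns 30 to 33 above) within each cluster; a minimal sketch:

# Average the class-presence indicators by cluster (a sketch; columns 30:33 are
# the First.Class, Business, Eco.Comfort and Economy indicators shown above)
aggregate(data[, 30:33], by=list(cluster=k$clust), FUN=mean)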

Conclusions

This was a little analysis which not only allowed me to explore my interest in commercial aircraft, but was also educational about the finer points of what to look out for when using more advanced data science techniques like principal components, clustering and advanced visualization.
All in all, the techniques did a pretty admirable job of separating out the different types of aircraft into distinct categories. However, I believe the way I structured the data may have biased it towards categorizing the aircraft by seating class, as that quality was over-represented in the data set compared to other variables, appearing both in quantitative variables (seat pitch & width, number of seats in class) and in categorical ones (class presence). So really the different seating classes were represented in triplicate within the data set compared to other variables, which is why the methods separated the aircraft in this way.

If I did this again, I would structure the data differently and see what relationships such an analysis could draw out using only select parts of the data (e.g. aircraft measurements only). The interesting lesson here is that when using techniques like dimensionality reduction and clustering, it is not only important to be mindful of applying them correctly, but also of what variables are in your data set and how they are represented.

For now I’ll just keep on flying, collecting the miles, and counting down the days until I finally get that seat in first class.

References & Resources

Delta Fleet at Delta.com
Principal Components Analysis (Wikipedia):
http://en.wikipedia.org/wiki/Principal_components_analysis

The Little Book of R for Multivariate Analysis
Quick R: Cluster Analysis
Plane Luxury: how US sports stars fly (Sydney Morning Herald)

Heatmap of Toronto Traffic Signals using RGoogleMaps

A little while back there was an article in blogTO about how a reddit user had used data from Toronto's Open Data initiative to produce a rather cool-looking map of the locations of all the traffic signals here in the city.

It's neat because, as the author on blogTO notes, it is recognizable as Toronto without any other geographic data being plotted – the structure of the city comes out in the data alone.

Still, I thought it'd be interesting to see it as a geographic heat map, and it was also a good excuse to fool around with mapping using RgoogleMaps.

The finished product is below:

Despite my best efforts with transparency (using my helper function), it’s difficult for anything but the city core to really come out in the intensity map.

The image without the Google maps tile, and the coordinates rotated, shows the density a little better in the green-yellow areas:

And it’s also straightforward to produce a duplication of the original black and white figure:

The R code is below. Interpolation is done using the trusty kde2d function from the MASS library, and a rotation is applied for the latter two figures so that the grid of Toronto's streets faces 'up', as in the original map.

# Toronto Traffic Signals Heat Map
# Myles Harrison
# http://www.everydayanalytics.ca
# Data from Toronto Open Data Portal:
# http://www.toronto.ca/open

library(MASS)
library(RgoogleMaps)
library(RColorBrewer)
source('colorRampPaletteAlpha.R')

# Read in the data
data <- read.csv(file="traffic_signals.csv", skip=1, header=T, stringsAsFactors=F)
# Keep the lon and lat data
rawdata <- data.frame(as.numeric(data$Longitude), as.numeric(data$Latitude))
names(rawdata) <- c("lon", "lat")
data <- as.matrix(rawdata)

# Rotate the lat-lon coordinates using a rotation matrix
# Trial and error led to theta = pi/15.0 (12 degrees)
theta = pi/15.0
m = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow=2)
data <- as.matrix(data) %*% m

# Reproduce William's original map
par(bg='black')
plot(data, cex=0.1, col="white", pch=16)

# Create heatmap with kde2d and overplot
k <- kde2d(data[,1], data[,2], n=500)
# Intensity from green to red
cols <- rev(colorRampPalette(brewer.pal(8, 'RdYlGn'))(100))
par(bg='white')
image(k, col=cols, xaxt='n', yaxt='n')
points(data, cex=0.1, pch=16)

# Mapping via RgoogleMaps
# Find map center and get map
center <- rev(sapply(rawdata, mean))
map <- GetMap(center=center, zoom=11)
# Translate original data
coords <- LatLon2XY.centered(map, rawdata$lat, rawdata$lon, 11)
coords <- data.frame(coords)

# Rerun heatmap
k2 <- kde2d(coords$newX, coords$newY, n=500)

# Create exponential transparency vector and add
alpha <- seq.int(0.5, 0.95, length.out=100)
alpha <- exp(alpha^6-1)
cols2 <- addalpha(cols, alpha)

# Plot
PlotOnStaticMap(map)
image(k2, col=cols2, add=T)
points(coords$newX, coords$newY, pch=16, cex=0.3)

This is a neat little start, and you can see how this type of thing could easily be extended to create a generalized mapping tool, stood up as a web service for example (they're out there). Case in point: Google Fusion Tables. I'm unsure as to what algorithm they use, but I find the result less satisfying; it looks like some kind of simple point blending:

As always, all the code is on github.

colorRampPaletteAlpha() and addalpha() – helper functions for adding transparency to colors in R

colorRampPalette is a very useful function in R for creating color vectors to use as a palette, or to pass as an argument to a plotting function; however, one weakness is that it disregards the alpha channel of the colors passed to it when creating the new vector.

I have also found that working with the alpha channel in R is not always the easiest, but is something that scientists and analysts may often have to do – when overplotting, for example.

To address this I've quickly written the helper functions addalpha and colorRampPaletteAlpha: the former makes it easier to apply a scalar or vector as the alpha channel of a vector of colors, and the latter is a wrapper for colorRampPalette which preserves the alpha channel of the colors provided.
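
For the curious, a hypothetical sketch of how an addalpha-style helper could be implemented is shown below; the actual function on github may differ.

# A hypothetical sketch of an addalpha-style helper (the github version may differ):
# convert the colors to RGB, overwrite the alpha channel, and rebuild the color strings
addalphaSketch <- function(cols, alpha = 1.0) {
  r <- col2rgb(cols, alpha = TRUE) / 255    # 4 x n matrix: red, green, blue, alpha in [0, 1]
  r[4, ] <- alpha                           # recycle the supplied alpha value(s) across colors
  rgb(r[1, ], r[2, ], r[3, ], r[4, ])
}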

Using the two functions in combination it is easy to produce plots with variable transparency such as in the figure below:


The code is on github.

I've also written examples of usage, which include the figure above.

# addalpha() and colorRampPaletteAlpha() usage examples
# Myles Harrison
# www.everydayanalytics.ca

library(MASS)
library(RColorBrewer)
# Source the colorRampAlpha file
source('colorRampPaletteAlpha.R')

# addalpha()
# ----------
# scalars:
col1 <- "red"
col2 <- rgb(1,0,0)
addalpha(col1, 0.8)
addalpha(col2, 0.8)

# scalar alpha with vector of colors:
col3 <- c("red", "green", "blue", "yellow")
addalpha(col3, 0.8)
plot(rnorm(1000), col=addalpha(brewer.pal(11,'RdYlGn'), 0.5), pch=16)

# alpha and colors vector:
alpha <- seq.int(0, 1, length.out=4)
addalpha(col3, alpha)

# Simple example
x <- seq.int(0, 2*pi, length=1000)
y <- sin(x)
plot(x, y, col=addalpha(rep("red", 1000), abs(sin(y))))

# with RColorBrewer
x <- seq.int(0, 1, length.out=100)
z <- outer(x,x)
c1 <- colorRampPalette(brewer.pal(11, 'Spectral'))(100)
c2 <- addalpha(c1,x)
par(mfrow=c(1,2))
image(x,x,z,col=c1)
image(x,x,z,col=c2)

# colorRampPaletteAlpha()
# Create normally distributed data
x <- rnorm(1000)
y <- rnorm(1000)
k <- kde2d(x,y,n=250)

# Sample colors with alpha channel
col1 <- addalpha("red", 0.5)
col2 <-"green"
col3 <-addalpha("blue", 0.2)
cols <- c(col1,col2,col3)

# colorRampPalette ditches the alpha channel
# colorRampPaletteAlpha does not
cr1 <- colorRampPalette(cols)(32)
cr2 <- colorRampPaletteAlpha(cols, 32)

par(mfrow=c(1,2))
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr1, add=T)
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr2, add=T)

# Linear vs. spline interpolation
cr1 <- colorRampPaletteAlpha(cols, 32, interpolate='linear') # default
cr2 <- colorRampPaletteAlpha(cols, 32, interpolate='spline')
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr1, add=T)
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr2, add=T)

Hopefully other R programmers who work extensively with color and transparency will find these functions useful.