PCA and K-means Clustering of Delta Aircraft

Introduction

I work in consulting. If you’re a consultant at a certain type of company, agency, organization, consultancy, whatever, this can sometimes mean travelling a lot.
Many business travelers ‘in the know’ have heard the old joke that if you want to stay at any type of hotel anywhere in the world and get a great rate, all you have to do is say that you work for IBM.

The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney's character in Up in the Air were made (ironically, I started to read that book, then left it in a seatback pocket on a plane), as were requests for favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI I've got more than most people, but a lot less than the entrepreneur I sat next to one time, who claimed to have close to 3 million).

But I digress.

Background

The point is that, as I said, I spent quite a bit of time travelling for work last year. Apparently the story with frequent flier miles is that it's best just to pick one airline and stick with it – this also worked out well, as most companies, including my employer, have preferred airlines, so you often don't have much of a choice in the matter.

In my case this means flying Delta.

So I happened to notice in one of my many visits to Delta’s website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta’s fleet, coming at it with ‘fresh eyes’? Which planes are similar? Which are dissimilar?

Aircraft data card from Delta.com

The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, as well as categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.
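The prepared CSV used in the analysis below already has these columns as 0/1; purely as an illustration of that kind of conversion, here is a sketch assuming hypothetical raw columns containing "yes"/"no" strings (the file name and raw values are assumptions):
# Hypothetical example: convert yes/no columns to 1/0 (file name and values are assumed)
raw <- read.csv("delta_raw.csv", stringsAsFactors=FALSE)
binary.cols <- c("Wifi", "Video", "Power", "Satellite", "Flat.bed", "Sleeper")
raw[binary.cols] <- lapply(raw[binary.cols], function(x) as.integer(tolower(x) == "yes"))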

Analysis

As this is a data set with many variables (33), I thought it would be an interesting opportunity to practice using a dimensionality reduction method to make the information easier to visualize and analyze.
First let's look at just the intermediary quantitative variables related to the aircraft's physical characteristics: cruising speed, total accommodation, and other quantities like length and wingspan. These variables sit roughly in the middle of the data frame, so we can visualize all of them at once using a scatterplot matrix, which is R's default output when plot() is called on a data frame.
data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1)

# scatterplot matrix of intermediary (size/non-categorical) variables
plot(data[,16:22])

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.

The exception here is the variable right in the middle, which is the number of engines. There is one lone outlier [the Boeing 747-400 (74S)], which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which distinguish the aircraft from one another much more strongly.

How do we more easily visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.

Principal Components Analysis

Next let's say I know nothing about dimensionality reduction techniques and just naively apply principal components to the data in R:

# Naively apply principal components analysis to raw data and plot
pc <- princomp(data)
plot(pc)

Taking that approach, we can see that the first principal component has a standard deviation of around 2260 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principal component is just the range in miles.

# First component dominates greatly. What are the loadings?
summary(pc) # 1 component has > 99% variance
loadings(pc) # Can see all variance is in the range in miles

Importance of components:
                             Comp.1       Comp.2       Comp.3       Comp.4
Standard deviation     2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01
Proportion of Variance    0.9987016 9.337038e-04 1.613651e-04 9.993131e-05
Cumulative Proportion     0.9987016 9.996353e-01 9.997966e-01 9.998966e-01
            

                         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Seat.Width..Club.                                    -0.144 -0.110              
Seat.Pitch..Club.                                    -0.327 -0.248         0.189
Seat..Club.                                                                     
Seat.Width..First.Class.                0.250        -0.160        -0.156  0.136
Seat.Pitch..First.Class.                0.515 -0.110 -0.386  0.112 -0.130  0.183
Seats..First.Class.                     0.258 -0.124 -0.307 -0.109  0.160  0.149
Seat.Width..Business.                  -0.154  0.142 -0.108                     
Seat.Pitch..Business.                  -0.514  0.446 -0.298  0.154 -0.172  0.379
Seats..Business.                       -0.225  0.187                            
Seat.Width..Eco.Comfort.                                     0.285 -0.224       
Seat.Pitch..Eco.Comfort.                0.159                0.544 -0.442       
Seats..Eco.Comfort.                                          0.200 -0.160       
Seat.Width..Economy.                                  0.125  0.110              
Seat.Pitch..Economy.                                  0.227  0.190        -0.130
Seats..Economy.                  0.597        -0.136  0.345 -0.165         0.168
Accommodation                    0.697               -0.104                0.233
Cruising.Speed..mph.                    0.463  0.809  0.289 -0.144  0.115       
Range..miles.             0.999                                                 
Engines                                                                         
Wingspan..ft.                    0.215         0.103 -0.316 -0.357 -0.466 -0.665
Tail.Height..ft.                                     -0.100        -0.187       
Length..ft.                      0.275         0.118 -0.318  0.467  0.582 -0.418
Wifi                                                                            
Video                                                                           
Power                                                                           
Satellite                                                                       
Flat.bed                                                                        
Sleeper                                                                         
Club                                                                            
First.Class                                                                     
Business                                                                        
Eco.Comfort
Economy

This is because the scale of the different variables in the data set varies widely; we can see this by plotting the variance of the different columns in the data frame (linear scale on the left, logarithmic on the right):

# verify by plotting variance of columns (linear scale on left, log scale on right)
mar <- par()$mar
par(mar=mar+c(0,5,0,0), mfrow=c(1,2))
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8)
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x')
par(mar=mar, mfrow=c(1,1))

We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.

# Scale
data2 <- data.frame(scale(data))
# Verify variance is uniform
plot(sapply(data2, var))
After applying the scale() function the variance is now constant across variables

Now we can apply principal components to the scaled data. Note that this can also be done automatically in a call to the prcomp() function by setting the parameter scale=TRUE. Now we see a result which is more along the lines of what we would expect:

# Proceed with principal components
pc <- princomp(data2)
plot(pc)
plot(pc, type='l')
summary(pc) # 4 components is both 'elbow' and explains >85% variance
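As noted above, the scaling step can also be folded into the PCA call itself, and the cumulative proportions can be used to check the variance-explained rule of thumb discussed below. A minimal sketch, equivalent to the scaled princomp() call above apart from a minor difference in the variance divisor (the variable names here are just for illustration):
# Equivalent approach: let prcomp() handle the scaling, then check cumulative variance
pc_alt <- prcomp(data, scale.=TRUE)
var_explained <- cumsum(pc_alt$sdev^2) / sum(pc_alt$sdev^2)
which(var_explained >= 0.85)[1]  # smallest number of components explaining at least 85%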

Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type, two of which I’ve read about are:

  1. Pick the number of components which explain 85% or greater of the variation
  2. Use the ‘elbow’ method of the scree plot (on right)
Here we are fortunate in that these two agree, so we will retain the first four principal components. We put these into a new data frame and plot.
# Get principal component vectors using prcomp instead of princomp
pc <- prcomp(data2)

# First four principal components
comp <- data.frame(pc$x[,1:4])
# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))

So what we are looking at here are twelve 2-D projections of data which live in a 4-D space. You can see there's a clear outlier in all the dimensions, as well as some bunching together in the different projections.

Normally, I am a staunch opponent of 3D visualization, as I’ve spoken strongly about previously. The one exception to this rule is when the visualization is interactive, which allows the user to explore the space and not lose meaning due to three dimensions being collapsed into a 2D image. Plus, in this particular case, it’s a good excuse to use the very cool, very awesome rgl package.
Click on the images to view the interactive 3D versions (requires a modern browser). You can see better in the 3D projections that the data are confined mainly to one plane on the left (components 1-3), with the exception of the outlier, and that there is also bunching in the other dimensions (components 1, 3 and 4 on the right).
library(rgl)
# Multi 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3)
plot3d(comp$PC1, comp$PC3, comp$PC4)
So, now that we’ve simplified the complex data set into a lower dimensional space we can visualize and work with, how do we find patterns in the data, in our case, the aircraft which are most similar? We can use a simple unsupervised machine learning technique like clustering.
Cluster Analysis
 
Here, because I'm not a data scientist extraordinaire, I'll stick to the simplest technique and do a simple k-means clustering – this is pretty straightforward to do in R.
First, how do we determine the number of clusters? The simplest method is to look at the within-groups sum of squares and pick the 'elbow' in the plot, similar to the scree plot we made for the PCA previously. Here I used the code from R in Action:
# Determine number of clusters
mydata <- comp   # the principal component scores retained above
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
                                     centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
However, it should be noted that it is very important to set the nstart and iter.max parameters (I've found 25 and 1000, respectively, to be okay values to use), which the example in Quick-R fails to do; otherwise you can get very different results each time you run the algorithm, as below.
Clustering without the nstart parameter can lead to variable results for each run
Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot
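For reference, a version of the same loop with those parameters set (a sketch using the comp data frame from above):
# Within-groups sum of squares again, with nstart and iter.max set for stable results
wss <- (nrow(comp)-1)*sum(apply(comp,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(comp, centers=i,
                                     nstart=25, iter.max=1000)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")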
So here we can see that the “elbow” in the scree plot is at k=4, so we apply the k-means clustering function with k = 4 and plot.
# From scree plot elbow occurs at k = 4
# Apply k-means with k=4
k <- kmeans(comp, 4, nstart=25, iter.max=1000)
library(RColorBrewer)
library(scales)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$clust, pch=16)
We can see that the one outlier is in its own cluster, there are three or four aircraft in another, and the remainder are split into two clusters of greater size. We visualize in 3D below, as before (click for interactive versions):
# 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3, col=k$clust)
plot3d(comp$PC1, comp$PC3, comp$PC4, col=k$clust)
We look at the exact clusters below, in order of increasing size:
# Cluster sizes
sort(table(k$clust))
clust <- names(sort(table(k$clust)))

# First cluster
row.names(data[k$clust==clust[1],])
# Second Cluster
row.names(data[k$clust==clust[2],])
# Third Cluster
row.names(data[k$clust==clust[3],])
# Fourth Cluster
row.names(data[k$clust==clust[4],])
[1] “Airbus A319 VIP”

[1] “CRJ 100/200 Pinnacle/SkyWest” “CRJ 100/200 ExpressJet”
[3] “E120”                         “ERJ-145”

[1] “Airbus A330-200”          “Airbus A330-200 (3L2)”
[3] “Airbus A330-200 (3L3)”    “Airbus A330-300”
[5] “Boeing 747-400 (74S)”     “Boeing 757-200 (75E)”
[7] “Boeing 757-200 (75X)”     “Boeing 767-300 (76G)”
[9] “Boeing 767-300 (76L)”     “Boeing 767-300 (76T)”
[11] “Boeing 767-300 (76Z V.1)” “Boeing 767-300 (76Z V.2)”
[13] “Boeing 767-400 (76D)”     “Boeing 777-200ER”
[15] “Boeing 777-200LR”

[1] “Airbus A319”            “Airbus A320”            “Airbus A320 32-R”
[4] “Boeing 717”             “Boeing 737-700 (73W)”   “Boeing 737-800 (738)”
[7] “Boeing 737-800 (73H)”   “Boeing 737-900ER (739)” “Boeing 757-200 (75A)”
[10] “Boeing 757-200 (75M)”   “Boeing 757-200 (75N)”   “Boeing 757-200 (757)”
[13] “Boeing 757-200 (75V)”   “Boeing 757-300”         “Boeing 767-300 (76P)”
[16] “Boeing 767-300 (76Q)”   “Boeing 767-300 (76U)”   “CRJ 700”
[19] “CRJ 900”                “E170”                   “E175”
[22] “MD-88”                  “MD-90”                  “MD-DC9-50”

The first cluster contains a single aircraft, the Airbus A319 VIP. This plane is on its own and rightly so – it is not part of Delta’s regular fleet but one of Airbus’ corporate jets. This is a plane for people with money, for private charter. It includes “club seats” around tables for working (or not). Below is a picture of the inside of the A319 VIP:

Ahhh, that's the way to fly (some day, some day…). This is apparently the plane that professional sports teams and the American military often charter – this article in the Sydney Morning Herald has more details.

The second cluster contains four aircraft – the two CRJ 100/200’s and the Embraer E120 and ERJ-145. These are the smallest passenger aircraft, with the smallest accommodations – 28 for the E120 and 50 for the remaining craft. As such, there is only economy seating in these planes which is what distinguishes them from the remainder of the fleet. The E120 also has the distinction of being the only plane in the fleet with turboprops. Photos below.

Top: CRJ100/200. Bottom left: Embraer E120. Bottom right: Embraer ERJ-145.

I’ve flown many times in the venerable CRJ 100/200 series planes, in which I can assure you there is only economy seating, and which I like to affectionately refer to as “little metal tubes of suffering.”

The other two clusters comprise the remainder of the fleet, the planes with which most commercial air travellers are familiar – your Boeing 7-whatever-7’s and other Airbus and McDonnell-Douglas planes.

These are split into two clusters, which seem to again divide the planes approximately by size (both physical and accommodation), though there is crossover in the Boeing craft.

# Compare accommodation by cluster in boxplot
boxplot(data$Accommodation ~ k$cluster,
        xlab='Cluster', ylab='Accommodation',
        main='Plane Accommodation by Cluster')
# Compare presence of seat classes in largest clusters
data[k$clust==clust[3],30:33]
data[k$clust==clust[4],30:33]
First.Class Business Eco.Comfort Economy
Airbus A330-200 0 1 1 1
Airbus A330-200 (3L2) 0 1 1 1
Airbus A330-200 (3L3) 0 1 1 1
Airbus A330-300 0 1 1 1
Boeing 747-400 (74S) 0 1 1 1
Boeing 757-200 (75E) 0 1 1 1
Boeing 757-200 (75X) 0 1 1 1
Boeing 767-300 (76G) 0 1 1 1
Boeing 767-300 (76L) 0 1 1 1
Boeing 767-300 (76T) 0 1 1 1
Boeing 767-300 (76Z V.1) 0 1 1 1
Boeing 767-300 (76Z V.2) 0 1 1 1
Boeing 767-400 (76D) 0 1 1 1
Boeing 777-200ER 0 1 1 1
Boeing 777-200LR 0 1 1 1
First.Class Business Eco.Comfort Economy
Airbus A319 1 0 1 1
Airbus A320 1 0 1 1
Airbus A320 32-R 1 0 1 1
Boeing 717 1 0 1 1
Boeing 737-700 (73W) 1 0 1 1
Boeing 737-800 (738) 1 0 1 1
Boeing 737-800 (73H) 1 0 1 1
Boeing 737-900ER (739) 1 0 1 1
Boeing 757-200 (75A) 1 0 1 1
Boeing 757-200 (75M) 1 0 1 1
Boeing 757-200 (75N) 1 0 1 1
Boeing 757-200 (757) 1 0 1 1
Boeing 757-200 (75V) 1 0 1 1
Boeing 757-300 1 0 1 1
Boeing 767-300 (76P) 1 0 1 1
Boeing 767-300 (76Q) 1 0 1 1
Boeing 767-300 (76U) 0 1 1 1
CRJ 700 1 0 1 1
CRJ 900 1 0 1 1
E170 1 0 1 1
E175 1 0 1 1
MD-88 1 0 1 1
MD-90 1 0 1 1
MD-DC9-50 1 0 1 1

Looking at the raw data, the difference I can ascertain between the largest two clusters is that all the aircraft in the one have first class seating, whereas all the planes in the other have business class instead [the one exception being the Boeing 767-300 (76U)].

Conclusions

This was a little analysis which not only allowed me to explore my interest in commercial aircraft, but was also educational about the finer points of what to look out for when using more advanced data science techniques like principal components, clustering and advanced visualization.
All in all, the techniques did a pretty admirable job of separating the different types of aircraft into distinct categories. However, I believe the way I structured the data may have biased it towards categorizing the aircraft by seating class, as that quality was over-represented in the data set compared to other variables, appearing both in quantitative variables (seat pitch & width, number of seats in each class) and categorical ones (class presence). So really the different seating classes were represented in triplicate within the data set compared to other variables, which is why the methods separated the aircraft in this way.

If I did this again, I would structure the data differently and see what relationships the analysis could draw out using only select parts of the data (e.g. aircraft measurements only). The interesting lesson here is that when using techniques like dimensionality reduction and clustering, it is important to be mindful not only of applying them correctly, but also of what variables are in your data set and how they are represented.

For now I’ll just keep on flying, collecting the miles, and counting down the days until I finally get that seat in first class.

References & Resources

Delta Fleet at Delta.com
Principal Components Analysis (Wikipedia):
http://en.wikipedia.org/wiki/Principal_components_analysis

The Little Book of R for Multivariate Analysis
Quick-R: Cluster Analysis
Plane Luxury: how US sports stars fly (Sydney Morning Herald)

Big Data Week Toronto 2014 Recap – Meetup #3: Big Data Visualization

This past week was Big Data Week for those of you that don’t know, a week of talks and events held worldwide to “unite the global data communities through series of events and meetups”.

Viafoura put on the events this year for Toronto and was kind enough to extend an invitation to me to be one of the speakers, talking on data visualization and how it relates to all this "Big Data" stuff.

Paul spoke about detecting fraud online using visualization and data science techniques. Something I often think about when presenting is how to make your message clear and connect with both the least technical people in the audience (who, quite often, have attended strictly out of curiosity) and the most knowledgeable and technically-minded people present.

I was really impressed with Paul’s visual explanation of the Jaccard coefficient. Not everyone understands set theory, however almost everyone will understand a Venn diagram if you put it in front of them.

So to explain the Jaccard index as a measure of the similarity of two sets when giving a presentation, which is better? You could put the definition up on a slide:

$J(A,B) = \frac{|A \cap B|}{|A \cup B|}$
which is fine for the mathematically-minded in your audience but would probably lose a lot of others. Instead, you could use a visualization like this figure Paul included:
The two depict the same quantity, but the latter is far more accessible to a wide audience. Great stuff.
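For what it's worth, the quantity itself is a one-liner in R; a minimal sketch (the function name and example sets are just for illustration):
# Jaccard index of two sets: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
jaccard(c("a", "b", "c"), c("b", "c", "d"))   # 2 shared elements out of 4 total = 0.5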
I spoke on "Practical Visualizations for Visualizing Big Data", which included some fundamentals (thinking about data and perception in visualization / visual encoding), the challenges the three "V"s of Big Data present when doing visualization and analysis, and some thoughts on how to address them.
This prompted some interesting discussions afterward. I found most people were much more interested in the fundamentals part – how to do visualization effectively, what constitutes a visualization, and the perceptual elements of dataviz – and less in the data science aspects of the talk.
Overall it was a great evening and I was happy to get up and talk visualization again. Thanks to the guys from Viafoura for putting this on and inviting me, and to the folks at the Ryerson DMZ for hosting.
Mini-gallery culled from Twitter below:

Interactive Visualization: Explore the 2014 “Sunshine List”

The "Sunshine List", a list of Ontario public servants who make more than $100,000 a year, was brought in under the Harris government in 1996, in an effort to increase transparency about just how much the top-paid public servants were earning.

Now the list is released each year by the Ontario Ministry of Finance.

However, there has been some frustration that the data are not in the most easily accessible format (HTML & PDF? Really guys?).

Stuart A. Thompson was kind enough to provide the data in an easier to digest format (Nick Ragaz also provided it in CSV), as well as producing a tool for exploring it on The Globe and Mail.

I thought it’d be great to get a more visual exploration of the data at fine granularity, so have produced this interactive visualization below.

You can filter the data by searching by employer, name, or position. You can also filter the list by selecting points or groups of points on the scatterplot on the left, and highlight groups of points (or individual employees) by selecting components in the bar graph at the right. The bar graph on the right can also be expanded and collapsed to view the aggregate salary and benefits by employer, or to view the quantities for individual employees.

Hovering over data points on either graph will display the related data in a tooltip – I’ve found this is handy for looking at individual points of interest on the scatterplot. Zoom and explore to find interesting patterns and individuals. Give it a try!


I've plotted benefits against salary, with the latter on a logarithmic axis so that the data are more easily visualized and explored (note that I am in no way suggesting that benefits are a function of salary).

Using combinations of all these possible interactions mentioned above you can do some interesting visual analysis: for instance, how do the top salaries and benefits earned by Constables across police departments in Ontario differ (seriously, take a look)? What are the relative pay and benefit levels of professors at Ontario Universities on the list? How much does Rob Ford make?

Something interesting I’ve already noticed is that for many employers there are long horizontal bands where employees’ salaries vary but they are fixed into the same benefit buckets. Others have a different relationship, for example, the benefit / salary ratios of those at my alma mater vs those of employees of the City of Toronto:

Hope this tool will be interesting and useful for those interested in the list. As always feedback is welcome in the comments.

Perception in Data Visualization – A Quick 7 Question Test

When most people think of data, they probably think of a dry, technical analysis, without a lot of creativity or freedom. Quite to the contrary, data visualization encompasses choices of design, creative freedom, and also (perhaps most interestingly) elements of cognitive psychology, particularly related to the science of visual perception and information processing.

If you read any good text on dataviz, like Tufte, Few, or Cairo, you will, at some point, come across a discussion of the cognitive aspects of data visualization (the latter two devoting entire chapters to this topic). This will likely include a discussion of the most elemental ways to encode information visually, and their respective accuracies when quantity is interpreted from them, usually referencing the work of Cleveland & McGill [PDF].

Mulling over the veracity of my brief mention of the visual ways of encoding quantity in my recent talk, and also recently re-reading Nathan Yau’s discussion of the aforementioned paper, I got to thinking about just how different the accuracy of interpretation between the different encodings might be.

I am not a psychologist or qualitative researcher, but given the above I quickly put together a simple test of 7 questions in Google Docs to examine the accuracy of interpreting proportional quantities when encoded visually, and I humbly request the favour of your participation. If there are enough responses I will put together what analysis is possible in a future post (using the appropriate visualization techniques, of course).

Apologies in advance for the grade-school wording of the questions, but I wanted to be as clear as possible to ensure consistency in the results. Thanks so much in advance for contributing! Click below for the quiz:


EDIT: The quiz will now be up indefinitely on this page.

colorRampPaletteAlpha() and addalpha() – helper functions for adding transparency to colors in R

colorRampPalette is a very useful function in R for creating color vectors to use as a palette, or to pass as an argument to a plotting function; however, a weakness is that it disregards the alpha channel of the colors passed to it when creating the new vector.

I have also found that working with the alpha channel in R is not always the easiest, but is something that scientists and analysts may often have to do – when overplotting, for example.

To address this I've quickly written the helper functions addalpha and colorRampPaletteAlpha: the former makes it easier to apply a scalar or vector of alpha values to a vector of colors, and the latter is a wrapper for colorRampPalette which preserves the alpha channel of the colors provided.
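The actual implementations are in the GitHub repository linked below; purely as a rough illustration of the idea, a helper like addalpha could be sketched along these lines (a simplified, hypothetical version, not the published code):
# Rough sketch only: apply an alpha value (scalar or vector) to a vector of colors
addalpha_sketch <- function(cols, alpha=1.0) {
  m <- col2rgb(cols, alpha=TRUE) / 255  # 4 x n matrix: red, green, blue, alpha in [0, 1]
  m[4,] <- alpha                        # overwrite the alpha channel (recycled if scalar)
  rgb(m[1,], m[2,], m[3,], m[4,])       # rebuild as hex color strings
}
addalpha_sketch(c("red", "blue"), 0.5)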

Using the two functions in combination it is easy to produce plots with variable transparency such as in the figure below:


The code is on github.

I’ve also written examples of usage, which includes the figure above.

# addalpha() and colorRampPaletteAlpha() usage examples
# Myles Harrison
# www.everydayanalytics.ca

library(MASS)
library(RColorBrewer)
# Source the colorRampAlpha file
source ('colorRampPaletteAlpha.R')

# addalpha()
# ----------
# scalars:
col1 <- "red"
col2 <- rgb(1,0,0)
addalpha(col1, 0.8)
addalpha(col2, 0.8)

# scalar alpha with vector of colors:
col3 <- c("red", "green", "blue", "yellow")
addalpha(col3, 0.8)
plot(rnorm(1000), col=addalpha(brewer.pal(11,'RdYlGn'), 0.5), pch=16)

# alpha and colors vector:
alpha <- seq.int(0, 1, length.out=4)
addalpha(col3, alpha)

# Simple example
x <- seq.int(0, 2*pi, length=1000)
y <- sin(x)
plot(x, y, col=addalpha(rep("red", 1000), abs(sin(y))))

# with RColorBrewer
x <- seq.int(0, 1, length.out=100)
z <- outer(x,x)
c1 <- colorRampPalette(brewer.pal(11, 'Spectral'))(100)
c2 <- addalpha(c1,x)
par(mfrow=c(1,2))
image(x,x,z,col=c1)
image(x,x,z,col=c2)

# colorRampPaletteAlpha()
# Create normally distributed data
x <- rnorm(1000)
y <- rnorm(1000)
k <- kde2d(x,y,n=250)

# Sample colors with alpha channel
col1 <- addalpha("red", 0.5)
col2 <-"green"
col3 <-addalpha("blue", 0.2)
cols <- c(col1,col2,col3)

# colorRampPalette ditches the alpha channel
# colorRampPaletteAlpha does not
cr1 <- colorRampPalette(cols)(32)
cr2 <- colorRampPaletteAlpha(cols, 32)

par(mfrow=c(1,2))
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr1, add=T)
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr2, add=T)

# Linear vs. spline interpolation
cr1 <- colorRampPaletteAlpha(cols, 32, interpolate='linear') # default
cr2 <- colorRampPaletteAlpha(cols, 32, interpolate='spline')
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr1, add=T)
plot(x, y, pch=16, cex=0.3)
image(k$x,k$y,k$z,col=cr2, add=T)

Hopefully other R programmers who work extensively with color and transparency will find these functions useful.

Toronto Data Science Group – A Survey of Data Visualization Techniques and Practice

Recently I spoke at the Toronto Data Science group. The folks at Mozilla were kind enough to record it and put it on Air, so here it is for your viewing pleasure (and critique):


Overall it was quite well received. Aside from the usual omg does my voice really sound like that?? which is to be expected, a couple of thoughts on the business of giving presentations were quite salient here:

  • Talk slower and enunciate
  • Gesture, but not too much
  • Tailor sizing and colouring of visuals, depending on projection & audience size

I’ve reproduced the code which was used to create the figures made in R (including the bubble chart example, with code and data from FlowingData), which regrettably at the time I neglected to save. Here it is in a gist:

The visuals are also available on Slideshare.

Lessons learned: talk slower, always save your code, and Google stuff before starting – because somebody’s probably already done it before you.

In Critique of Slopegraphs

I’ve been doing more research into less common types of data visualization techniques recently, and was reading up on slopegraphs.

Andy Kirk wrote a piece praising slopegraphs last December, which goes over the construction of a slopegraph with some example data very nicely. However, I've seen some bad examples of slopegraph use elsewhere across the web, and thought I'd put in my two cents.

Introductory remarks

I tend to think of slopegraphs as a very boiled-down version of a normal line chart, in which you have only two values for your independent variable and strip away all the non-data ink. This works because if you label all the individual components, you can take away all the cruft – you don't need the legend or axes anymore, do you? Here's an example of the before and after below, using the soccer data from Andy's post.
First as a line graph:
Hmm, that's not very enlightening, is it? There are so many values for the categorical variable (team) that the graph requires a plethora of colours in the legend, and a considerable amount of back-and-forth to interpret. Contrast this with the slopegraph, which is much easier to interpret as the individual values can be read off, and which also ditches the non-data ink of the axes:

Here it is much easier to read off values for the individual teams, it feels less cluttered, and more data have been encoded both in colour (orange for a decrease between the two years, and blue for an increase) as well as the thickness of the lines (thicker lines for change of > 25%).

Pros and Cons

In my opinion, the slopegraph should be viewed as an extension of the line graph, and so even though traditional chart elements like the y-axis have been stripped away, consistency should be kept with the regular conventions of data visualization.
In the above example, Andy has correctly honoured vertical position, so that each team appears on either side of the graph at the height corresponding to its number of points. This is the same as one of Dr. Tufte's original graphs (from The Visual Display of Quantitative Information), which follows the same practice and which I quite like:
Brilliant. However, when you no longer honour vertical position to encode value, you lose the ability to truly compare across the categorical variable, which I tend to disagree with. This is usually done for legibility's sake (to "uncrowd" the graph when there are a lot of lines); however, I feel it could still be avoided in most cases. See below for an example.

Here the vertical position is not honoured, as some values which are smaller appear above those which are larger, so that the lines do not cross and the graph is uncluttered.

Also it should be noted in this case there is more than one value in the independent variable. As long as the scale in the vertical direction is still consistent, the changes in quantity can still be compared by the slope of the lines, even if the exact values cannot be compared because the vertical position no longer corresponds directly to quantity.

Either way, this type of slopegraph is closer to a group of sparklines (as Tufte originally noted), as it allows comparison of the changes in the dependent variable across values of the independent for each value of the categorical variable, but not the exact quantities.

Where things really start to fall apart though, is when slope graphs are used to connect values from two different variables. Charlie Park has some examples of this on his blog post on the subject, such as the one from Ben Fry below:

So here's the question – what, exactly, does the slope of the different lines correspond to? The variable on the left is win-loss record and on the right is total salary. Park correctly notes that in this case the slopegraph is an extension of a parallel coordinates graph, which requires some further discussion.
A parallel coordinates graph is all very well and good for doing exploratory data analysis and finding patterns in data with a large number of variables. However, I would avoid graphs like the one above in general – because the variables on the left and the right are not the same, the slope of the line is essentially meaningless.
In the case of the baseball data, why not just display the information in a regular scatterplot, as below? Simple and clear. You can then include the additional information using colour and size respectively, if desired, and make a bubble chart.

Was the disproportionately large payroll of the Yankees as obvious in the previous visualization? Maybe, but not as saliently. The relative size of the payroll was encoded in the thickness of the line, but quantity is not interpreted as quickly and accurately when encoded using area/thickness as when using position. Also, because the previous data were ranked (vertical position did not portray quantity), the much smaller number of wins by Kansas City relative to the other teams was not as apparent as it is here.

Fry notes that he chose not to use a scatterplot as he wanted ranking for both quantities, which I suppose is the advantage of the original treatment, and something which is not depicted in the alternative I’ve presented. Also Park correctly notes in the examples on his post that different visualizations draw the eye to different features of the data, and some people have more difficulty interpreting a visualization like a bubble chart than slopegraph. Still, I remain a skeptical functionalist as far as visualization is concerned, and prefer the treatment above to the former.

Alternatives

I’ve presented some criticism of the slopegraphs here, but are there alternatives? Yes. In addition to the above, let’s explore some others, using the data from the soccer example.

Really what we are interested in is the change in the quantity over the two values of the independent variable (year). So we can instead look at that quantity (change between the two years), and visualize it as a bar graph with a baseline of zero. Here the bars are again coloured by whether the change is positive or negative.

This is fine; however we lost the information encoded in the thickness of the lines. We can encode that using the lightness (intensity) of the different bars. Dark for > 25% change, light for the others:

Hmm, not bad. However we’ve still lost the information about the absolute value of points each year. So let’s make that the value along the horizontal axis instead.

Okay fine, now the length of the bars corresponds to the magnitude of the change in points across the two years, with positive changes being coloured blue and negative orange, and the shading corresponding to whether the change was greater or less than 25%.

However, even if I put a legend and told you what the colours correspond to, it’s pretty common for people to think of things as progressing from left to right (at least in Western cultures). The graph is difficult to interpret because for bars in orange the score for the first year is on the right, whereas for those in blue it’s on the left. That is to say, we have the absolute values, but direction of the change is not depicted well. Changing the bars to arrows solves this, as below:

Now we have the absolute values of the points in each year for each team, and the direction of the change is displayed better than just with colour. Adding the gridlines allows the viewer to read off the individual values of points more easily. Lastly, we encode the other categorical variable of interest (change greater/less than 25%) as the thickness of the line.

Like so. After creating the above independently, I discovered visualization consultant Naomi Robbins had already written about this type of chart on Forbes, as an alternative to using multiple pie charts. Jon Peltier also has an excellent in-depth description how to make these types of charts in Excel, as well as showing another alternative visualization option to slope graphs, using a dot plot.
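For R users, a rough equivalent of the arrow chart can be put together with base graphics; a minimal sketch using made-up point totals (not the actual soccer data from Andy's post):
# Hypothetical arrow chart: change in points from year 1 to year 2 (made-up data)
teams  <- c("Team A", "Team B", "Team C", "Team D")
pts_y1 <- c(80, 65, 55, 48)
pts_y2 <- c(68, 72, 71, 45)
change <- pts_y2 - pts_y1
cols   <- ifelse(change >= 0, "steelblue", "darkorange")   # blue = increase, orange = decrease
lwds   <- ifelse(abs(change)/pts_y1 > 0.25, 3, 1)          # thicker arrows for > 25% change
plot(NULL, xlim=range(c(pts_y1, pts_y2)), ylim=c(0.5, length(teams)+0.5),
     yaxt="n", xlab="Points", ylab="", main="Change in points, year 1 to year 2")
abline(v=pretty(range(c(pts_y1, pts_y2))), col="grey90")   # light gridlines
arrows(pts_y1, seq_along(teams), pts_y2, seq_along(teams),
       length=0.1, col=cols, lwd=lwds)
axis(2, at=seq_along(teams), labels=teams, las=1)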

Of course, providing the usual fixings for a graph such as a legend, title and proper axis labels would complete the above, which brings me to my last point. Though I think it's a good alternative to slopegraphs, it can in no way compete in simplicity with Dr. Tufte's example of a slopegraph, as that had essentially zero non-data ink. And, of course, this type of graph will not work when there are more than two values of the independent variable to compare across.

Closing Remarks

It is easy to tell who the true thought leaders in data visualization are, because they often take it upon themselves to find the special cases where people struggle or visualize data poorly, and then invent new visualization types to fill the need (Tufte with the slopegraph, and Few with the bullet graph, invented to supplant god-awful gauges on dashboards).
As I discussed, there are certain cases when slopegraphs should not be used, where I feel you would be better served by other types of graphs; in particular, cases where the slopegraph is a variation of the parallel coordinates chart rather than the line graph, or where quantity is not encoded in vertical position and comparing quantities for each value of the independent variable is important.

That being said, it is (as always) very important when making choices regarding data visualization to consider the pros and cons of different visualization types, the properties of the data you are trying to communicate, and, of course, the target audience.

Judiciously used, slopegraphs provide a highly efficient way in terms of data-ink ratio to visualize change in quantity across a categorical variable with a large number of values. Their appeal lies both in this and their elegant simplicity.

References & Resources

Slopegraphs discussion on Edward Tufte forum
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk
In Praise of Slopegraphs, by Andy Kirk
Edward Tufte’s “Slopegraphs” by Charlie Park
http://charliepark.org/slopegraphs/
Peltier Tech: How to Make Arrow Charts in Excel
http://peltiertech.com/WordPress/arrow-charts-in-excel/

Salary vs. Performance of MLB Teams by Ben Fry
http://fathom.info/salaryper/

salary vs performance scatterplot (Tableau Public)

The Mathematics of Wind Chill

Introduction

Holy crap, it was cold out.

If you haven’t been reading the news, don’t live in the American Midwest or Canada, or do and didn’t go outside the last couple weeks (for which I don’t blame you) there was some mighty cold weather lately due to something called a polar vortex.

Meteorologists stated that a lot of people (those in younger generations) would never have experienced anything like this before – cold from the freezing temperatures and high winds the likes of which parts of the US and Canada haven’t seen for 40 years.

It was really cold. So cold that weird stuff happened, including the ground exploding here in Ontario due to frost quakes, or cryoseisms, as they are technically known (or as my sister suggested they should be called, “frosted quakes” – get it?)

With all this talk of a polar vortex, all I could think of was a particularly ridiculous TV movie that came out lately, and that this was our northern equivalent, which probably looked something like the artist's depiction below:

Scientific depiction of polar vortex phenomena (not to scale)

But I digress. The real point is that all this cold weather got me thinking about windchill – what is it exactly? How is it determined? Let’s do some everyday analysis.

Background

Wind chill hasn't always been calculated the same way, and there is some controversy about exactly how scientific the calculation is.
Wind chill depends upon only two variables – air temperature and wind speed – and the formula was derived not from physical models of the atmosphere but from the responses of participants in simulated laboratory conditions.
Also, the old formula was replaced in 2001 by a new one, with Canada largely leading the effort, since there was some concern that the old formula gave values that were too low and that people would think they could safely withstand colder temperatures than they actually could.
The old formula had strange units, but I found a page at Carleton University which provides it in degrees Fahrenheit, so we can compare the old and new systems directly.

Analysis

Since the wind chill index is a function of two variables (a surface), we can calculate it over a grid of values in R and visually depict the results as an image (filled contour), as in the code below:
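A minimal sketch of such a calculation, assuming the standard 2001 NWS formula and the commonly cited Fahrenheit form of the pre-2001 formula (the exact constants in the code originally used may differ):
# Wind chill as a function of air temperature (deg F) and wind speed (mph)
temp <- seq(-40, 40, by=1)   # air temperature, degrees F
wind <- seq(5, 60, by=1)     # wind speed, mph
# 2001 NWS formula
new_wc <- function(t, v) 35.74 + 0.6215*t - 35.75*v^0.16 + 0.4275*t*v^0.16
# Commonly cited pre-2001 formula in Fahrenheit (assumed form)
old_wc <- function(t, v) 0.0817*(3.71*sqrt(v) + 5.81 - 0.25*v)*(t - 91.4) + 91.4
z_new <- outer(temp, wind, new_wc)
z_old <- outer(temp, wind, old_wc)
filled.contour(temp, wind, z_new, xlab="Temperature (F)", ylab="Wind speed (mph)",
               main="Wind chill (2001 formula)")
filled.contour(temp, wind, z_old, xlab="Temperature (F)", ylab="Wind speed (mph)",
               main="Wind chill (pre-2001 formula)")
filled.contour(temp, wind, abs(z_new - z_old), xlab="Temperature (F)",
               ylab="Wind speed (mph)", main="Absolute difference")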

Which results in the following plots:

And the absolute difference between the two:

For low wind speeds (around 5 mph – wind chill is only defined when the wind speed is greater than 5 mph) you can see that the new system is colder, but for wind speeds greater than 10 mph the opposite is true, especially so in the bitter, bitter cold (high winds and very cold temperatures). This is in line with the desire to correct the old system for giving values which were felt to be too low.

If you're a really visual person, here is the last contour plot as a surface:

Which, despite some of the limitations of 3-D visualization, shows the non-linear nature of the two systems and the difference between them.

Conclusion

This was a pretty interesting exercise and shows again how mathematics permeates many of our everyday notions – even if we’re not necessarily aware of it being the case. 
For me the takeaway here is that wind chill is not an exact metric based on the physical laws of the atmosphere, but instead a more subjective one based upon people’s reaction to cold and wind (an inanimate object cannot “feel” wind chill).
Despite the difficulty of the problem of trying to exactly quantify how much colder the blustery arctic winds make it feel outside, saying “-32F with the wind chill” will still always be better than saying “dude, it’s really really cold outside.”
Either way, be sure to wear a hat.

References & Resources

Windchill (at Wikipedia)
National Weather Service – Windchill Calculator
National Weather Service – Windchill Terms & Definitions 
Environment Canada – Canada’s Windchill Index

Snapchat Database Leak – Visualized

Introduction

In case you don't read anything online, or live under a rock, the internet is all atwitter (get it?) with the recent news that Snapchat has had 4.6 million users' details leaked due to a security flaw that was exploited.

The irony here is that Snapchat was warned of the vulnerability by Gibson Security, but was rather dismissive and flippant and has now had this blow up in their faces (as it rightly should, given their response). It appears there may be very real consequences of this to the (overblown) perceived value of the company, yet another wildly popular startup with no revenue model. I bet that offer from Facebook is looking pretty good right about now.

Anyhow, a group of concerned hackers gave Snapchat what-for by exploiting the hole, and released a list of 4.6 million (4,609,621 to be exact) users' details with the intent to "raise public awareness on how reckless many internet companies are with user information."

Which is awesome – kudos to those guys, once for being whitehat (they obscured two digits of each phone number to preserve some anonymity) and twice for keeping companies with large amounts of user data accountable. Gibsonsec has provided a tool so you can check if your account is in the DB here.

However, if you’re a datahead like me, when you hear that there is a file out there with 4.6M user accounts in it, your first thought is not OMG am I safe?! but let’s do some analysis!

Analysis

Area Code
As I have noted in a previous musing, it's difficult to do any sort of in-depth analysis if you have limited dimensionality in your data – here there are only 3 fields: the phone number with the last two digits obscured, the username, and the area.
Fortunately, because some of the data are geographic, we can do some cool visualization with mapping. First we look at the high-level view, by state and by area code within each state. California had the most accounts compromised overall, with just shy of 1.4 million details leaked. New York State was next at just over a million.
Because the accounts weren’t spread evenly across the states, below is a more detailed view by area code. You can see that it’s mainly Southern California and the Bay Area where the accounts are concentrated.
Usernames
Well, that covers the geographic component, which leaves only the usernames and phone numbers. I'm not going to look into the phone numbers (I mean, what can you really do, other than look at the distribution of numbers – about which I have a strong hypothesis already).
Looking at the number of accounts which include numerals versus those that do not, the split is fairly even – 2,586,281 (~56.1%) do not contain numbers and the remaining 2,023,340 (~43.9%) do. There are no purely numeric usernames.
Looking at the distribution of the length of Snapchat usernames below, we see what appears to be a skew-normal distribution centered around 9.5 characters or so:
The remainder of the tail is not present, which I assume would fill in if there were more data. I had the axis stretch to 30 for perspective as there was one username in the file of length 29.
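For the curious, the username checks amount to a few lines of R; a sketch assuming the leaked file has been read into a data frame with a column named username (the file and column names are hypothetical):
# Hypothetical sketch of the username analysis
leak <- read.csv("snapchat_leak.csv", stringsAsFactors=FALSE)   # assumed filename
has_digit <- grepl("[0-9]", leak$username)
table(has_digit)                          # usernames with vs. without numerals
sum(grepl("^[0-9]+$", leak$username))     # purely numeric usernames
hist(nchar(leak$username), breaks=0:30,
     xlab="Username length (characters)", main="Distribution of username lengths")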

Conclusion

If this analysis has shown anything, it has reassured me that:
  1. You are very likely not in the leak unless you live in California or New York City
  2. Natural phenomena follow, or nearly follow, theoretical distributions amazingly closely
I’m not in the leak, so I’m not concerned. But once again, this stresses the importance of being mindful of where our personal data are going when using smartphone apps, and ensuring there is some measure of care and accountability on the creators’ end.

Update:
Snapchat has released a new statement promising an update to the app which makes the compromised feature optional, increased security around the API, and a commitment to working with security experts in a more open fashion.

What’s in My Inbox? Data Analysis of Outlook

Introduction

Email is the bane of our modern existence.

Who of us hasn’t had a long, convoluted, back-and-forth email thread going on for days (if not weeks) in order to settle an issue which could have been resolved with a simple 5 minute conversation?

With some colleagues of mine, email has become so overwhelming (or their attempts to organize it so futile) that it brings to my mind Orwell’s workers at the Ministry of Truth in 1984 and their pneumatic tubes and memory holes – if the message you want is not in the top 1% (or 0.01%) of your inbox and you don’t know how to use search effectively, then for all intents and purposes it might as well be gone (see also: Snapchat).

Much has been written on the subject of why exactly we send and receive so much of it, how to best organize it, and whether or not it is, in fact, even an effective method of communication.

At one time even Gmail and the concept of labels were revolutionary – and they have done some good in organizing the ever-increasing deluge that is email for the majority of people. Other attempts have sprung up to tame the beast and make sense of such a flood of communication – most notably in my mind Inbox Zero, the simply-titled smartphone app Mailbox, and MIT's recent data visualization project Immersion.

But email, with all its systemic flaws, misuse, and annoyances, is definitely here for good, no question. What a world we live in.

But I digress.

Background

I had originally hoped to export everything from Gmail and do a very thorough analysis of all my personal email. Though this is now a lot easier than it used to be, I got frustrated at the time trying to write a Python script and moved on to other projects.
But then I thought, hey, why not do the same thing for my work email? I recently discovered that it’s quite easy to export email from Outlook (as I detailed last time) so that brings us to this post.
I was somewhat disappointed that Outlook can only export a folder at a time (which does not include special folders such as search folders or ‘All Mail’) – I organize my mail into folders and wanted an export of all of it.
That being said, the bulk probably does remain in my inbox (4,217 items in my inbox resulted in a CSV that was ~15 MB) and we can still get a rough look using what's available. The data cover the period from February 27th, 2013 to November 16th, 2013.

Email by Contact
First let’s  look at the top 15 contacts by total number of emails. Here are some pretty simple graphs summarizing that data, first by category of contact:

In the top 15, the split between co-workers/colleagues and management is pretty even. I received about 5 times as much email from coworkers and managers as from stakeholders (but then again, a lot of the latter ended up sorted into folders, so the true count is probably higher). Still, I don't directly interact with stakeholders as much as some others do, and tend to work with teams or my immediate manager. Also, calls are usually better.

Here you can see that I interacted most with my immediate colleague and manager, then other management, and the remainder further down the line are a mix which includes email to myself and from office operations. Also of note – I don't actually receive that much email (I'm more of an "in the weeds" type of guy), or, as I said, much of it has gone into the appropriate folders.
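A sketch of how the per-contact counts might be tallied, assuming the Outlook export has been read into a data frame mail with a sender column from (the data frame and column names are hypothetical):
# Count emails per sender and take the top 15
top15 <- sort(table(mail$from), decreasing=TRUE)[1:15]
barplot(top15, horiz=TRUE, las=1, cex.names=0.7, xlab="Number of emails")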

Time-Series Analysis
The above graphs show a very simplistic and high-level view of what proportion of email I was receiving from whom (with a suitable level of anonymity, I hope). More interesting is a quick and simple analysis of the patterns over time in the volume of email I received – and I'm pretty sure you already have an idea of what some of these patterns might be.

When doing data analysis, I always feel it is important to first visualize as much of the data as practically possible, in order to get "a feel" for the data and avoid making erroneous conclusions without that overall familiarity (as I noted in an earlier post). If a picture is worth a thousand words, then a good data visualization is worth a thousand keystrokes and mouse clicks.

Below is a simple scatter plot of all the emails received by day, with the time of day on the y-axis:


This scatterplot is perhaps not immediately particularly illuminating; however, it already shows us a few things worth noting:

  • the majority of emails appear in a band approximately between 8 AM and 5 PM
  • there is increased density of email in the period between the end of July and early October, after which there is a sparse interval until mid-month / early November
  • there appears to be some kind of periodic nature to the volume of daily emails, giving a “strip-like” appearance (three guesses what that periodic nature is…)

We can look into this further by considering the daily volume of emails, as below. The black line is a 7 day moving average:
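A sketch of how the daily series and moving average might be computed, using the same assumed mail data frame with a parsed POSIXct column received:
# Daily email counts with a 7-day moving average
daily <- as.data.frame(table(as.Date(mail$received)))
names(daily) <- c("date", "n")
ma7 <- stats::filter(daily$n, rep(1/7, 7), sides=2)   # centered 7-day moving average
plot(as.Date(daily$date), daily$n, type="h", xlab="Date", ylab="Emails per day")
lines(as.Date(daily$date), ma7, lwd=2)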

We can see the patterns noted above – the increase in daily volume after 7/27 and the marked decrease mid-October. Though I wracked my brain and looked thoroughly, I couldn’t find a specific reason why there was an increase over the summer – this was just a busy time for projects (and probably not for myself sorting email). The marked decrease in October corresponds to a period of bench time, which you can see was rather short-lived.

As I noted previously in analyzing communications data, the distribution of this type of information is exponential in nature and usually follows a log-normal distribution. As such, a moving average is not the greatest measure of central tendency – but a decent approximation for our purposes. Still, I find the graph a little more digestible when depicted with a logarithmic y-axis, as below:

Lastly we consider the periodic nature of the emails which is noted in the initial scatterplot. We can look for patterns by making a standard heatmap with the weekday as the column and hour of day as the row, as below:
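The heatmap itself can be built by tabulating hour of day against weekday, again using the assumed mail data frame; a rough sketch:
# Heatmap of email volume by weekday and hour of day
wd <- factor(weekdays(mail$received),
             levels=c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"))
hr <- factor(format(mail$received, "%H"), levels=sprintf("%02d", 0:23))
counts <- unclass(table(hr, wd))   # plain 24 x 7 matrix of counts
image(0:23, 1:7, counts, xlab="Hour of day", ylab="", yaxt="n",
      col=heat.colors(12))
axis(2, at=1:7, labels=colnames(counts), las=1)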

You can clearly see that the majority of work email occurs between the hours of 9 to 5 (shocking!). However, some other interesting points of note are the bulk of email in the mornings at the beginning of the week, the fall-off after 5 PM at the end of the week (Thursday & Friday), and the messages received Saturday morning. Again, I don't really receive that much email, or have spirited a lot of it away into folders, as I noted at the beginning of the article (this analysis does not include things like automated emails and reports, etc.).

Email Size & Attachments
Looking at file attachments, I believe the data are more skewed than the rest, as cleaning up large emails is a semi-regular task for the office worker (not many have the luxury of unlimited inbox capacity – even executives), so I would expect that values on the high end have largely been removed. Nevertheless, it still provides a rough approximation of how email sizes are distributed and what proportion have attachments.

First we look at the overall proportion of email left in my inbox which has attachments – of the 4,217 emails, 2,914 did not have an attachment (69.1%) and 1,303 did (30.9%).

Examining the size of emails (which includes the attachments) in a histogram, we see a familiar-looking distribution, which here I have further expanded by making it into a Pareto chart (note that the scale on the left y-axis is logarithmic):

Here we can see that of what was left in my inbox, all messages were about 8 MB in size or less, with the vast majority being 250K or less. In fact 99% of the email was less than 1750KB, and 99.9% less than 6MB.
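A sketch of the corresponding size summary, assuming a size column in kilobytes named size_kb in the same assumed data frame (a hypothetical column name):
# Upper percentiles and histogram of email sizes
quantile(mail$size_kb, c(0.5, 0.99, 0.999))   # median, 99th and 99.9th percentile sizes
hist(mail$size_kb, breaks=50, xlab="Email size (KB)",
     main="Distribution of email sizes")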

Conclusion

This was a very quick analysis of what was in my inbox; however, we saw some interesting points, some of which confirm what one would expect – in particular:
  • the vast majority of email is received between the hours of 9-5, Monday to Friday
  • the majority of email I received came from the two managers & colleagues I work most closely with
  • approximately 3 out of 10 emails I received had attachments
  • the distribution of email sizes is logarithmic in nature
If I wanted to take this analysis further, we could also look at the trending by contact and also do some content analysis (the latter not being done here for obvious reasons, of course).
This was an interesting exercise because it made me mindful again of what everyday analytics is all about – analyzing rich data sets we are producing all the time, but of which we are not always aware.

References and Resources

Inbox Zero
http://inboxzero.com/

Mailbox
http://www.mailboxapp.com/

Immersion
https://immersion.media.mit.edu/

Data Mining Email to Discover Organizational Networks and Emergent Communities in Work Flows