PCA and K-means Clustering of Delta Aircraft

Introduction

I work in consulting. If you’re a consultant at a certain type of company, agency, organization, consultancy, whatever, this can sometimes mean travelling a lot.
Many business travelers ‘in the know’ have heard the old joke that if you want to stay at any type of hotel anywhere in the world and get a great rate, all you have to do is say that you work for IBM.

The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney’s character in Up in the Air were made (ironically, I started to read that book, then left it in a seatback pocket on a plane), along with requests for favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI, I’ve got more than most people, but a lot fewer than the entrepreneur I sat next to one time, who claimed to have close to 3 million).

But I digress.

Background

The point is that, as I said, I spent quite a bit of time travelling for work last year. Apparently the story with frequent flier miles is that it’s best just to pick one airline and stick with it – and this also worked out well, as most companies, including my employer, have preferred airlines, so you often don’t have much of a choice in the matter.

In my case this means flying Delta.

So I happened to notice in one of my many visits to Delta’s website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta’s fleet, coming at it with ‘fresh eyes’? Which planes are similar? Which are dissimilar?

Aircraft data card from Delta.com

The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, and categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of 1 or 0, for yes or no respectively.
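
If the binary columns had come in as yes/no text rather than 1/0 (the posted delta.csv already has the conversion applied), the recoding is straightforward in R – a minimal sketch, assuming a hypothetical data frame called raw with the same column names that appear later in the loadings:

# Hypothetical recoding of yes/no text columns to 1/0 (delta.csv is already numeric)
binary_cols <- c("Wifi", "Video", "Power", "Satellite", "Flat.bed", "Sleeper",
                 "Club", "First.Class", "Business", "Eco.Comfort", "Economy")
raw[binary_cols] <- lapply(raw[binary_cols], function(x) as.integer(x == "yes"))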

Analysis

As this is a data set with many variables (33), I thought it would be an interesting opportunity to practice using a dimensionality reduction method to make the information easier to visualize and analyze.

First let’s just look at the intermediary quantitative variables related to the aircraft’s physical characteristics: cruising speed, total accommodation, and other quantities like length and wingspan. These variables sit roughly in the middle of the data frame (columns 16 to 22), so we can visualize all of them at once using a scatterplot matrix, which is what R produces by default when plot() is called on a data frame.
data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1)

# scatterplot matrix of intermediary (size/non-categorical) variables
plot(data[,16:22])

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.
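
To put rough numbers on those relationships, we can also check the correlation matrix for the same columns:

# Pairwise correlations between the physical-size variables plotted above
round(cor(data[,16:22]), 2)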

The exception here is the variable right in the middle, which is the number of engines. There is one lone outlier [Boeing 747-400 (74S)] which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall see as the analysis progresses that this is not really important, as there are other variables which more strongly distinguish the aircraft from one another.

How do we more easily visualize a high-dimensional data set like this one? By using a dimensionality reduction technique such as principal components analysis.

Principal Components Analysis

Next let’s say I know nothing about dimensionality reduction techniques and just naively apply principal components to the data in R:

# Naively apply principal components analysis to raw data and plot
pc <- princomp(data)
plot(pc)

Taking that approach, we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principal component is just the range in miles.

# First component dominates greatly. What are the loadings?
summary(pc) # 1 component has > 99% variance
loadings(pc) # Can see all variance is in the range in miles

Importance of components:
                             Comp.1       Comp.2       Comp.3       Comp.4
Standard deviation     2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01
Proportion of Variance    0.9987016 9.337038e-04 1.613651e-04 9.993131e-05
Cumulative Proportion     0.9987016 9.996353e-01 9.997966e-01 9.998966e-01
            

                         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Seat.Width..Club.                                    -0.144 -0.110              
Seat.Pitch..Club.                                    -0.327 -0.248         0.189
Seat..Club.                                                                     
Seat.Width..First.Class.                0.250        -0.160        -0.156  0.136
Seat.Pitch..First.Class.                0.515 -0.110 -0.386  0.112 -0.130  0.183
Seats..First.Class.                     0.258 -0.124 -0.307 -0.109  0.160  0.149
Seat.Width..Business.                  -0.154  0.142 -0.108                     
Seat.Pitch..Business.                  -0.514  0.446 -0.298  0.154 -0.172  0.379
Seats..Business.                       -0.225  0.187                            
Seat.Width..Eco.Comfort.                                     0.285 -0.224       
Seat.Pitch..Eco.Comfort.                0.159                0.544 -0.442       
Seats..Eco.Comfort.                                          0.200 -0.160       
Seat.Width..Economy.                                  0.125  0.110              
Seat.Pitch..Economy.                                  0.227  0.190        -0.130
Seats..Economy.                  0.597        -0.136  0.345 -0.165         0.168
Accommodation                    0.697               -0.104                0.233
Cruising.Speed..mph.                    0.463  0.809  0.289 -0.144  0.115       
Range..miles.             0.999                                                 
Engines                                                                         
Wingspan..ft.                    0.215         0.103 -0.316 -0.357 -0.466 -0.665
Tail.Height..ft.                                     -0.100        -0.187       
Length..ft.                      0.275         0.118 -0.318  0.467  0.582 -0.418
Wifi                                                                            
Video                                                                           
Power                                                                           
Satellite                                                                       
Flat.bed                                                                        
Sleeper                                                                         
Club                                                                            
First.Class                                                                     
Business                                                                        
Eco.Comfort
Economy

This is because the scales of the different variables in the data set differ enormously; we can see this by plotting the variance of each column in the data frame (linear scale on the left, logarithmic on the right):

# verify by plotting variance of columns
mar <- par()$mar
par(mar=mar+c(0,5,0,0))
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8)
barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x')
par(mar=mar)

We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.

# Scale
data2 <- data.frame(scale(data))
# Verify variance is uniform
plot(sapply(data2, var))

After applying the scale() function, the variance is now constant across variables.

Now we can apply principal components to the scaled data. Note that this can also be done automatically in a call to the prcomp() function by setting the parameter scale. = TRUE (note the trailing dot in the argument name). This time we see a result more along the lines of what we would expect:

# Proceed with principal components
pc <- princomp(data2)
plot(pc)
plot(pc, type='l')
summary(pc) # 4 components is both 'elbow' and explains >85% variance
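
As an aside, the scaling step can also be folded directly into the PCA call, as noted above; a minimal, essentially equivalent sketch using prcomp(), which handles the centering and scaling itself:

# Equivalent approach: let prcomp() center and scale the raw data in the call
pc_alt <- prcomp(data, center=TRUE, scale.=TRUE)
plot(pc_alt)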

Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type; two that I’ve read about are:

  1. Pick the number of components which explain 85% or greater of the variation
  2. Use the ‘elbow’ method of the scree plot (on the right)

Here we are fortunate in that these two coincide, so we will retain the first four principal components. We put these into a new data frame and plot.
# Get principal component vectors using prcomp instead of princomp
pc <- prcomp(data2)

# First four principal components
comp <- data.frame(pc$x[,1:4])
# Plot
plot(comp, pch=16, col=rgb(0,0,0,0.5))

So what we are looking at here are twelve 2-D projections of data which live in a 4-D space. You can see there’s a clear outlier in all the dimensions, as well as some bunching together in the different projections.
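
If you want to put a name to that outlier, the row names carry through from the original data frame into the component scores, so a quick check like the following (assuming, as it appears here, that the extreme point lies furthest out along the first component) will print it:

# Name the observation furthest from the origin along the first principal component
rownames(comp)[which.max(abs(comp$PC1))]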

Normally, I am a staunch opponent of 3D visualization, as I’ve spoken strongly about previously. The one exception to this rule is when the visualization is interactive, which allows the user to explore the space and not lose meaning due to three dimensions being collapsed into a 2D image. Plus, in this particular case, it’s a good excuse to use the very cool, very awesome rgl package.

Click on the images to view the interactive 3D versions (requires a modern browser). You can see better in the 3D projections that the data are confined mainly to one plane on the left (components 1-3), with the exception of the outlier, and that there is also bunching in the other dimensions (components 1, 3 and 4 on the right).
library(rgl)
# Multi 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3)
plot3d(comp$PC1, comp$PC3, comp$PC4)

So, now that we’ve simplified this complex data set down to a lower-dimensional space we can visualize and work with, how do we find patterns in the data – in our case, the aircraft which are most similar to one another? We can use a simple unsupervised machine learning technique like clustering.
Cluster Analysis
 
Here, because I’m not a data scientist extraordinaire, I’ll stick to the simplest technique and do a simple k-means – this is pretty straightforward to do in R.

First, how do we determine the number of clusters? The simplest method is to look at the within-groups sum of squares and pick the ‘elbow’ in the plot, similar to the scree plot we examined for the PCA above. Here I used the code from R in Action:
# Determine number of clusters by plotting the within-groups sum of squares
wss <- (nrow(comp)-1)*sum(apply(comp, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(comp, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
However, it should be noted that it is very important to set the nstart and iter.max parameters (I’ve found 25 and 1000, respectively, to be reasonable values), which the example in Quick-R fails to do; otherwise you can get very different results each time you run the algorithm, as shown below. A corrected version of the loop follows the figures.
Clustering without the nstart parameter can lead to variable results for each run
Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot
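For reference, here is the same within-groups sum of squares loop with those two parameters set, as described above:
# Same loop as before, but with nstart and iter.max set so that each
# k-means run is stable and the resulting scree plot is reproducible
wss <- (nrow(comp)-1)*sum(apply(comp, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(comp, centers=i,
                                     nstart=25, iter.max=1000)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")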
So here we can see that the “elbow” in the scree plot is at k=4, so we apply the k-means clustering function with k = 4 and plot.
# From scree plot elbow occurs at k = 4
# Apply k-means with k=4
k <- kmeans(comp, 4, nstart=25, iter.max=1000)
library(RColorBrewer)
library(scales)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$clust, pch=16)
We can see that the one outlier is in its own cluster, there are three or four aircraft in another, and the remainder are split into two clusters of greater size. We visualize in 3D below, as before (click for interactive versions):
# 3D plot
plot3d(comp$PC1, comp$PC2, comp$PC3, col=k$clust)
plot3d(comp$PC1, comp$PC3, comp$PC4, col=k$clust)
We look at the exact clusters below, in order of increasing size:
# Cluster sizes
sort(table(k$clust))
clust <- names(sort(table(k$clust)))

# First cluster
row.names(data[k$clust==clust[1],])
# Second Cluster
row.names(data[k$clust==clust[2],])
# Third Cluster
row.names(data[k$clust==clust[3],])
# Fourth Cluster
row.names(data[k$clust==clust[4],])
[1] "Airbus A319 VIP"

[1] "CRJ 100/200 Pinnacle/SkyWest" "CRJ 100/200 ExpressJet"
[3] "E120"                         "ERJ-145"

[1] "Airbus A330-200"          "Airbus A330-200 (3L2)"
[3] "Airbus A330-200 (3L3)"    "Airbus A330-300"
[5] "Boeing 747-400 (74S)"     "Boeing 757-200 (75E)"
[7] "Boeing 757-200 (75X)"     "Boeing 767-300 (76G)"
[9] "Boeing 767-300 (76L)"     "Boeing 767-300 (76T)"
[11] "Boeing 767-300 (76Z V.1)" "Boeing 767-300 (76Z V.2)"
[13] "Boeing 767-400 (76D)"     "Boeing 777-200ER"
[15] "Boeing 777-200LR"

[1] "Airbus A319"            "Airbus A320"            "Airbus A320 32-R"
[4] "Boeing 717"             "Boeing 737-700 (73W)"   "Boeing 737-800 (738)"
[7] "Boeing 737-800 (73H)"   "Boeing 737-900ER (739)" "Boeing 757-200 (75A)"
[10] "Boeing 757-200 (75M)"   "Boeing 757-200 (75N)"   "Boeing 757-200 (757)"
[13] "Boeing 757-200 (75V)"   "Boeing 757-300"         "Boeing 767-300 (76P)"
[16] "Boeing 767-300 (76Q)"   "Boeing 767-300 (76U)"   "CRJ 700"
[19] "CRJ 900"                "E170"                   "E175"
[22] "MD-88"                  "MD-90"                  "MD-DC9-50"

The first cluster contains a single aircraft, the Airbus A319 VIP. This plane is on its own and rightly so – it is not part of Delta’s regular fleet but one of Airbus’ corporate jets. This is a plane for people with money, for private charter. It includes “club seats” around tables for working (or not). Below is a picture of the inside of the A319 VIP:

Ahhh, that’s the way to fly (some day, some day…). This is apparently the plane that professional sports teams and the American military often charter – this article in the Sydney Morning Herald has more details.

The second cluster contains four aircraft – the two CRJ 100/200s and the Embraer E120 and ERJ-145. These are the smallest passenger aircraft, with the smallest accommodations – 28 for the E120 and 50 for the remaining craft. As such, there is only economy seating in these planes, which is what distinguishes them from the remainder of the fleet. The E120 also has the distinction of being the only plane in the fleet with turboprops. Photos below.

Top: CRJ100/200. Bottom left: Embraer E120. Bottom right: Embraer ERJ-145.

I’ve flown many times in the venerable CRJ 100/200 series planes, in which I can assure you there is only economy seating, and which I like to affectionately refer to as “little metal tubes of suffering.”

The other two clusters comprise the remainder of the fleet, the planes with which most commercial air travellers are familiar – your Boeing 7-whatever-7’s and other Airbus and McDonnell-Douglas planes.

These are split into two clusters, which seem to again divide the planes approximately by size (both physical and accommodation), though there is crossover in the Boeing craft.

# Compare accommodation by cluster in boxplot
boxplot(data$Accommodation ~ k$cluster,
        xlab='Cluster', ylab='Accommodation',
        main='Plane Accommodation by Cluster')
# Compare presence of seat classes in largest clusters
data[k$clust==clust[3],30:33]
data[k$clust==clust[4],30:33]
First.Class Business Eco.Comfort Economy
Airbus A330-200 0 1 1 1
Airbus A330-200 (3L2) 0 1 1 1
Airbus A330-200 (3L3) 0 1 1 1
Airbus A330-300 0 1 1 1
Boeing 747-400 (74S) 0 1 1 1
Boeing 757-200 (75E) 0 1 1 1
Boeing 757-200 (75X) 0 1 1 1
Boeing 767-300 (76G) 0 1 1 1
Boeing 767-300 (76L) 0 1 1 1
Boeing 767-300 (76T) 0 1 1 1
Boeing 767-300 (76Z V.1) 0 1 1 1
Boeing 767-300 (76Z V.2) 0 1 1 1
Boeing 767-400 (76D) 0 1 1 1
Boeing 777-200ER 0 1 1 1
Boeing 777-200LR 0 1 1 1
First.Class Business Eco.Comfort Economy
Airbus A319 1 0 1 1
Airbus A320 1 0 1 1
Airbus A320 32-R 1 0 1 1
Boeing 717 1 0 1 1
Boeing 737-700 (73W) 1 0 1 1
Boeing 737-800 (738) 1 0 1 1
Boeing 737-800 (73H) 1 0 1 1
Boeing 737-900ER (739) 1 0 1 1
Boeing 757-200 (75A) 1 0 1 1
Boeing 757-200 (75M) 1 0 1 1
Boeing 757-200 (75N) 1 0 1 1
Boeing 757-200 (757) 1 0 1 1
Boeing 757-200 (75V) 1 0 1 1
Boeing 757-300 1 0 1 1
Boeing 767-300 (76P) 1 0 1 1
Boeing 767-300 (76Q) 1 0 1 1
Boeing 767-300 (76U) 0 1 1 1
CRJ 700 1 0 1 1
CRJ 900 1 0 1 1
E170 1 0 1 1
E175 1 0 1 1
MD-88 1 0 1 1
MD-90 1 0 1 1
MD-DC9-50 1 0 1 1

Looking at the raw data, the main difference I can ascertain between the two largest clusters is that all the aircraft in one have first class seating, whereas all the planes in the other have business class instead [the one exception being the Boeing 767-300 (76U)].
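
A quick cross-tabulation of cluster membership against the class-presence columns makes the same point without scanning the rows by eye:

# Cross-tabulate cluster assignment against the first class and business class indicators
table(k$cluster, data$First.Class)
table(k$cluster, data$Business)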

Conclusions

This little analysis not only allowed me to explore my interest in commercial aircraft, but was also educational about the finer points of what to look out for when using more advanced data science techniques like principal components, clustering and advanced visualization.

All in all, the techniques did a pretty admirable job of separating the different types of aircraft into distinct categories. However, I believe the way I structured the data may have biased the analysis towards categorizing the aircraft by seating class, as that quality was over-represented relative to the other variables: it appears both in quantitative form (seat pitch and width, number of seats in each class) and in categorical form (class presence). So the different seating classes were effectively represented in triplicate within the data set, which is why the methods separated the aircraft in this way.

If I did this again, I would structure the data differently and see what relationships such an analysis could draw out using only select parts of the data (e.g. the aircraft measurements only). The interesting lesson here is that when using techniques like dimensionality reduction and clustering, it is important to be mindful not only of applying them correctly, but also of what variables are in your data set and how they are represented.
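
As a rough sketch of what that follow-up might look like – rerunning the PCA on just the physical measurement columns used in the scatterplot matrix at the start, with no claims here about what it would show:

# Hypothetical follow-up: PCA on the physical measurements only (columns 16-22)
pc_size <- prcomp(data[,16:22], center=TRUE, scale.=TRUE)
summary(pc_size)
biplot(pc_size)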

For now I’ll just keep on flying, collecting the miles, and counting down the days until I finally get that seat in first class.

References & Resources

Delta Fleet at Delta.com
Principal Components Analysis (Wikipedia): http://en.wikipedia.org/wiki/Principal_components_analysis
The Little Book of R for Multivariate Analysis
Quick-R: Cluster Analysis
Plane Luxury: how US sports stars fly (Sydney Morning Herald)

Big Data Week Toronto 2014 Recap – Meetup #3: Big Data Visualization

This past week was Big Data Week, for those of you that don’t know – a week of talks and events held worldwide to “unite the global data communities through series of events and meetups”.

Viafoura put on the events for Toronto this year and was kind enough to invite me to be one of the speakers, talking on data visualization and how it relates to all this “Big Data” stuff.

Paul spoke about detecting fraud online using visualization and data science techniques. Something I often think about when presenting is how to make your message clear and connect with both the least technical people in the audience (who, quite often, have attended strictly out of curiosity) and the most knowledgeable and technically-minded people present.

I was really impressed with Paul’s visual explanation of the Jaccard coefficient. Not everyone understands set theory; however, almost everyone will understand a Venn diagram if you put it in front of them.

So, to explain the Jaccard index as a measure of the overlap between two sets when giving a presentation, which is better? You could put the definition up on a slide:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}
which is fine for the mathematically-minded in your audience but would probably lose a lot of others. Instead, you could use a visualization like this figure Paul included:
The two depict the same quantity, but the latter is far more accessible to a wide audience. Great stuff.
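For the programmatically-inclined, the same quantity is a one-liner; a small sketch in R with made-up sets:
# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
jaccard(c("a", "b", "c", "d"), c("c", "d", "e"))  # 2 shared of 5 distinct = 0.4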
I spoke on “Practical Visualizations for Visualizing Big Data” which included some fundamentals (thinking about data and perception in visualization / visual encoding) and the challenges the three “V”s of Big Data present when doing visualization and analysis, and some thoughts on how to address them.
This prompted some interesting discussions afterward; I found most people were much more interested in the fundamentals part – how to do visualization effectively, what constitutes a visualization, and the perceptual elements of dataviz – and less in the data science aspects of the talk.
Overall it was a great evening and I was happy to get up and talk visualization again. Thanks to the guys from Viafoura for putting this on and inviting me, and to the folks at the Ryerson DMZ for hosting.
Mini-gallery culled from Twitter below:

The Top 0.1% – Keeping Pace with the Data Explosion?

I’ve been thinking a lot lately, as I do, about data.

When you work with something so closely, it is hard not to have the way you think about what you work with impact the way you think about other things in other parts of your life.

For instance, cooks don’t think about food the same way once they’ve worked in the kitchen; bankers don’t think about money the same way once they’ve seen a vault full of it; and analysts don’t think about data the same way, once they start constantly analyzing it.

The difference being, of course, that not everything is related to food or money, but everything is data if you know how to think like an analyst.

I remember when I was in elementary school as a young child and a friend of mine described to me the things of which he was afraid. We sat in the field behind the school and stared down at the gravel on the track.

“Think about counting every single pebble of gravel on the track,” he said. “That’s the sort of thing that really scares me.” He did seem genuinely concerned. “Or think about counting every grain of sand on a beach, and then think about how many beaches there are in the whole world, and counting every grain of sand on every single one of those beaches. That’s the sort of thing that frightens me.”

The thing that scared my childhood friend was infinity; or perhaps not the infinite, but just very very large numbers – the quantities of the magnitude relating to that thing everyone is talking about these days called Big Data.

And that’s the sort of thing I’ve been thinking about lately.

I don’t remember the exact figure, but if you scour the internet to read up on our information age, and in particular our age of “Big Data”, you will find statements similar to the one below:

…. there has been more information created in the past year than there was in all of recorded history before it.


Which brings me to my point about the Top 0.1%.

Or, perhaps, to be more fair, probably something more like the Top 0.01%.

There is so much information out there. Every day around the world, every milliliter of gas pumped, every point-of-sale transaction, and every mouse click on millions of websites is being recorded, creating more data.

Our capacity to record and store information has exploded exponentially.

But, perhaps somewhat critically, our ability to work with it and analyze it has not.

In the same way that The Ingenuity Gap talks about how the complexity of problems facing society is ever increasing but our ability to implement solutions is not matching that pace, we might be in danger of similarly finding the amount of information being recorded and stored in our modern world is exponentially increasing but our ability to manage and analyze it is not. Not only from a technological perspective, but also from a human perspective – there is only so much information one person can handle working with and keep “in mind”.

I know that many other people are thinking this way as well. This is my crude, data-centric take on what people have been talking about since the 1970’s – information overload. And I know that many other authors have touched on this point recently as it is of legitimate concern; for instance – acknowledging that the skill set needed to work with and make sense of these data sets of exponentially increasing size is so specialized that data scientists will not scale.

Will our technology and ability to manage data be able to keep up with the ever increasing explosion of it? Will software and platforms develop and match pace such that those able to make sense of these large data sets are not just a select group of specialists? How will the analyst of tomorrow handle working with and thinking about the analysis of such massive and diverse data sets?

Only time will answer these questions, but the one thing that seems certain is that the data deluge will only continue as storage becomes ever cheaper and of greater density while the volume, velocity and variety of data collected worldwide continues to explode.

In other words: there’s more where that came from.

Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA, without a hint of skepticism or irony) people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can go around really calling themselves a data scientist because no one even really knows what data science is yet, there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve now done my first real piece of work which I think falls under the category of data science.

I’m no Python guru, but I threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time; that’s for a later article. For now, here’s one of my exploratory plots, of content rating by date. Already we can gather from this that, at least at Newgrounds, four-and-a-half stars equals perfection.

Sure feels like science.