#### Introduction

The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney’s character in Up in the Air were made (ironically I started to read that book, then left it on a plane in a seatback pocket), requests about favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI I’ve got more than most people, but a lot less than the entrepreneur I sat next to one time, who claimed to have close to 3 million).

But I digress.

#### Background

In my case this means flying Delta.

So I happened to notice in one of my many visits to Delta’s website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta’s fleet, coming at it with ‘fresh eyes’? Which planes are similar? Which are dissimilar?

*Aircraft data card from Delta.com*

The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, as well as categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.

#### Analysis

data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1) # scatterplot matrix of intermediary (size/non-categorical) variables plot(data[,16:22])

We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.

The exception here is the variable right in the middle which is the number of engines. There is one lone outlier [Boeing 747-400 (74S)] which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall as the analysis progresses that this is not really important, as there are other variables which more strongly discern the aircraft from one another than this.

How do we easier visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.

*Principal Components Analysis*

Next let’s say I know nothing about dimensionality reduction techniques and just naively apply principle components to the data in R:

# Naively apply principal components analysis to raw data and plot pc <- princomp(data) plot(pc)

Taking that approach we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principle component is just the range in miles.

# First component dominates greatly. What are the loadings? summary(pc) # 1 component has > 99% variance loadings(pc) # Can see all variance is in the range in miles

Importance of components:

Comp.1 Comp.2 Comp.3 Comp.4

Standard deviation 2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01

Proportion of Variance 0.9987016 9.337038e-04 1.613651e-04 9.993131e-05

Cumulative Proportion 0.9987016 9.996353e-01 9.997966e-01 9.998966e-01

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8

Seat.Width..Club. -0.144 -0.110

Seat.Pitch..Club. -0.327 -0.248 0.189

Seat..Club.

Seat.Width..First.Class. 0.250 -0.160 -0.156 0.136

Seat.Pitch..First.Class. 0.515 -0.110 -0.386 0.112 -0.130 0.183

Seats..First.Class. 0.258 -0.124 -0.307 -0.109 0.160 0.149

Seat.Width..Business. -0.154 0.142 -0.108

Seat.Pitch..Business. -0.514 0.446 -0.298 0.154 -0.172 0.379

Seats..Business. -0.225 0.187

Seat.Width..Eco.Comfort. 0.285 -0.224

Seat.Pitch..Eco.Comfort. 0.159 0.544 -0.442

Seats..Eco.Comfort. 0.200 -0.160

Seat.Width..Economy. 0.125 0.110

Seat.Pitch..Economy. 0.227 0.190 -0.130

Seats..Economy. 0.597 -0.136 0.345 -0.165 0.168

Accommodation 0.697 -0.104 0.233

Cruising.Speed..mph. 0.463 0.809 0.289 -0.144 0.115

Range..miles. 0.999

Engines

Wingspan..ft. 0.215 0.103 -0.316 -0.357 -0.466 -0.665

Tail.Height..ft. -0.100 -0.187

Length..ft. 0.275 0.118 -0.318 0.467 0.582 -0.418

Wifi

Video

Power

Satellite

Flat.bed

Sleeper

Club

First.Class

Business

Eco.Comfort

Economy

This is because the scale of the different variables in the data set is quite variable; we can see this by plotting the variance of the different columns in the data frame (regular scaling on the left, logarithmic on the right):

# verify by plotting variance of columns mar <- par()$mar par(mar=mar+c(0,5,0,0)) barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8) barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x') par(mar=mar)

We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.

# Scale data2 <- data.frame(scale(data)) # Verify variance is uniform plot(sapply(data2, var))

After applying the scale() function the variance is now constant across variables |

Now we can apply principal components to the scaled data. Note that this can also be done automatically in call to the prcomp() function by setting the parameter scale=TRUE. Now we see a result which is more along the lines of something we would expect:

# Proceed with principal components pc <- princomp(data2) plot(pc) plot(pc, type='l') summary(pc) # 4 components is both 'elbow' and explains >85% variance

Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type, two of which I’ve read about are:

- Pick the number of components which explain 85% or greater of the variation
- Use the ‘elbow’ method of the scree plot (on right)

# Get principal component vectors using prcomp instead of princomp pc <- prcomp(data2) # First for principal components comp <- data.frame(pc$x[,1:4]) # Plot plot(comp, pch=16, col=rgb(0,0,0,0.5))

So what were are looking at here are twelve 2-D projections of data which are in a 4-D space. You can see there’s a clear outlier in all the dimensions, as well as some bunching together in the different projections.

library(rgl) # Multi 3D plot plot3d(comp$PC1, comp$PC2, comp$PC3) plot3d(comp$PC1, comp$PC3, comp$PC4)

*Cluster Analysis*

# Determine number of clusters wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss) plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

Clustering without the nstart parameter can lead to variable results for each run |

Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot |

# From scree plot elbow occurs at k = 4 # Apply k-means with k=4 k <- kmeans(comp, 4, nstart=25, iter.max=1000) library(RColorBrewer) library(scales) palette(alpha(brewer.pal(9,'Set1'), 0.5)) plot(comp, col=k$clust, pch=16)

# 3D plot plot3d(comp$PC1, comp$PC2, comp$PC3, col=k$clust) plot3d(comp$PC1, comp$PC3, comp$PC4, col=k$clust)

# Cluster sizes sort(table(k$clust)) clust <- names(sort(table(k$clust))) # First cluster row.names(data[k$clust==clust[1],]) # Second Cluster row.names(data[k$clust==clust[2],]) # Third Cluster row.names(data[k$clust==clust[3],]) # Fourth Cluster row.names(data[k$clust==clust[4],])

[1] “CRJ 100/200 Pinnacle/SkyWest” “CRJ 100/200 ExpressJet”

[3] “E120” “ERJ-145”

[1] “Airbus A330-200” “Airbus A330-200 (3L2)”

[3] “Airbus A330-200 (3L3)” “Airbus A330-300”

[5] “Boeing 747-400 (74S)” “Boeing 757-200 (75E)”

[7] “Boeing 757-200 (75X)” “Boeing 767-300 (76G)”

[9] “Boeing 767-300 (76L)” “Boeing 767-300 (76T)”

[11] “Boeing 767-300 (76Z V.1)” “Boeing 767-300 (76Z V.2)”

[13] “Boeing 767-400 (76D)” “Boeing 777-200ER”

[15] “Boeing 777-200LR”

[1] “Airbus A319” “Airbus A320” “Airbus A320 32-R”

[4] “Boeing 717” “Boeing 737-700 (73W)” “Boeing 737-800 (738)”

[7] “Boeing 737-800 (73H)” “Boeing 737-900ER (739)” “Boeing 757-200 (75A)”

[10] “Boeing 757-200 (75M)” “Boeing 757-200 (75N)” “Boeing 757-200 (757)”

[13] “Boeing 757-200 (75V)” “Boeing 757-300” “Boeing 767-300 (76P)”

[16] “Boeing 767-300 (76Q)” “Boeing 767-300 (76U)” “CRJ 700”

[19] “CRJ 900” “E170” “E175”

[22] “MD-88” “MD-90” “MD-DC9-50”

Ahhh, that’s the way fly (some day, some day…). This is apparently the plane professional sports teams and the American military often charter to fly – this article in the Sydney Morning Herald has more details.

Top: CRJ100/200. Bottom left: Embraer E120. Bottom right: Embraer ERJ-145. |

I’ve flown many times in the venerable CRJ 100/200 series planes, in which I can assure you there is only economy seating, and which I like to affectionately refer to as “little metal tubes of suffering.”

These are split into two clusters, which seem to again divide the planes approximately by size (both physical and accommodation), though there is crossover in the Boeing craft.

# Compare accommodation by cluster in boxplot boxplot(data$Accommodation ~ k$cluster, xlab='Cluster', ylab='Accommodation', main='Plane Accommodation by Cluster')

# Compare presence of seat classes in largest clusters data[k$clust==clust[3],30:33] data[k$clust==clust[4],30:33]

First.Class | Business | Eco.Comfort | Economy | |

Airbus A330-200 | 0 | 1 | 1 | 1 |

Airbus A330-200 (3L2) | 0 | 1 | 1 | 1 |

Airbus A330-200 (3L3) | 0 | 1 | 1 | 1 |

Airbus A330-300 | 0 | 1 | 1 | 1 |

Boeing 747-400 (74S) | 0 | 1 | 1 | 1 |

Boeing 757-200 (75E) | 0 | 1 | 1 | 1 |

Boeing 757-200 (75X) | 0 | 1 | 1 | 1 |

Boeing 767-300 (76G) | 0 | 1 | 1 | 1 |

Boeing 767-300 (76L) | 0 | 1 | 1 | 1 |

Boeing 767-300 (76T) | 0 | 1 | 1 | 1 |

Boeing 767-300 (76Z V.1) | 0 | 1 | 1 | 1 |

Boeing 767-300 (76Z V.2) | 0 | 1 | 1 | 1 |

Boeing 767-400 (76D) | 0 | 1 | 1 | 1 |

Boeing 777-200ER | 0 | 1 | 1 | 1 |

Boeing 777-200LR | 0 | 1 | 1 | 1 |

First.Class | Business | Eco.Comfort | Economy | |

Airbus A319 | 1 | 0 | 1 | 1 |

Airbus A320 | 1 | 0 | 1 | 1 |

Airbus A320 32-R | 1 | 0 | 1 | 1 |

Boeing 717 | 1 | 0 | 1 | 1 |

Boeing 737-700 (73W) | 1 | 0 | 1 | 1 |

Boeing 737-800 (738) | 1 | 0 | 1 | 1 |

Boeing 737-800 (73H) | 1 | 0 | 1 | 1 |

Boeing 737-900ER (739) | 1 | 0 | 1 | 1 |

Boeing 757-200 (75A) | 1 | 0 | 1 | 1 |

Boeing 757-200 (75M) | 1 | 0 | 1 | 1 |

Boeing 757-200 (75N) | 1 | 0 | 1 | 1 |

Boeing 757-200 (757) | 1 | 0 | 1 | 1 |

Boeing 757-200 (75V) | 1 | 0 | 1 | 1 |

Boeing 757-300 | 1 | 0 | 1 | 1 |

Boeing 767-300 (76P) | 1 | 0 | 1 | 1 |

Boeing 767-300 (76Q) | 1 | 0 | 1 | 1 |

Boeing 767-300 (76U) | 0 | 1 | 1 | 1 |

CRJ 700 | 1 | 0 | 1 | 1 |

CRJ 900 | 1 | 0 | 1 | 1 |

E170 | 1 | 0 | 1 | 1 |

E175 | 1 | 0 | 1 | 1 |

MD-88 | 1 | 0 | 1 | 1 |

MD-90 | 1 | 0 | 1 | 1 |

MD-DC9-50 | 1 | 0 | 1 | 1 |

Looking at the raw data, the difference I can ascertain between the largest two clusters is that all the aircraft in the one have first class seating, whereas all the planes in the other have business class instead [the one exception being the Boeing 767-300 (76U)].

#### Conclusions

If I did this again, I would structure the data differently and see what relationships such analysis could draw out using only select parts of the data (*e.g. *aircraft measurements only). The interesting lesson here is that it when using techniques like dimensionality reduction and clustering it is not only important to be mindful of applying them correctly, but also what variables are in your data set and how they are represented.

#### References & Resources

*http://en.wikipedia.org/wiki/Principal_components_analysis*

You should update the analysis to show what will happen should Delta add the A380 to its fleet. This article got me thinking about that:

http://www.businessweek.com/articles/2014-06-23/can-the-airbus-a380-crack-a-u-dot-s-dot-fleet-yes-and-heres-why#r=hpt-fs

How interesting! The aviation industry is so vast, and there are so many different areas you could analyze. I really enjoyed reading your findings. Thanks so much for sharing this!

Thank you! And you are, of course, most welcome. Glad you enjoyed the post.

Excellent! I cloned your repo and followed along in R. Simply excellent, thank you. You've saved me oodles of time in figuring out all those packages also. And rgl is cool too 🙂

Do you know if it would've been advisable to remove some of the colinear variables before performing PCA? For example, there are several variables that grow with the physical size of the plane and one of these would've captured enough of the information. Same thing with 'luxury' attributes. When I did 'loadings(pc)' on the version of the pc variable as defined in line 33, I couldn't see anything that explained the PC ellipsoid, except perhaps that the colinear variables seemed to have similar weights. Perhaps I'll do it and see what happens

I too am a consultant and so often the customer wants to know 'what's going on' in their data, such that the minute I map data from a space they get to a some derived space sends them into conniptions. So I'd be interested in knowing how to make the principal components explainable. Often for the case of making a good prediction, no one cares. But in trying to get the customer to sign off on my work, they do care.

thanks again

Glad you enjoyed it, Tim. Happy to help.

My understanding of PCA is that it accounts for collinearity by picking a combinations of variables for each principal component such that the maximum amount of variance in the data is explained by them and they are orthogonal. In fact, if you read the definition of PCA, you'll see that dealing with correlated and collinear variables is really much of the motivation behind the method. There's also some discussion on your concerns here, but let me know if you find any more documentation or a reference that suggests otherwise.

I will freely admit that there are flaws in my analysis (which others have pointed out). As PCA picks components that explain the maximum amount of variability, it should have been no surprise that the seat class variables became the determining factor. Using dummy variables is appropriate for methods like linear regression but not for PCA, as the dummy variables will always take maximum values (0 or 1) whereas the scaled numerical variables will lie in the range, a fact which I did not consider at the time but which other readers have since pointed out.

I have understood the importance of scaling the features before PCA. However it is not clear to me if it is necessary to standardize again before running the K-Means algorithm to have the same variance over the principal components.

Do you have any suggestion?