Year: 2014
Toronto Cats and Dogs II – Top 25 Names of 2014
I was quite surprised by the relative popularity of my previous analysis of the data for Licensed Cats & Dogs in Toronto for 2011, given how simple it was to put together.
I was browsing the Open Data Portal recently and noticed that there was a new data set for pets: the top 25 names for both dogs and cats. I thought this could lend itself to some quick, easy visualization and be a neat little addition to the previous post.
First we simply visualize the raw counts of the Top 25 names against each other. Interestingly, the top 2 names for both dogs and cats are apparently the same: Charlie and Max.
Next let’s take a look at the distribution of these top 25 names for each type of pet by how long they are, which just involves calculating the name length and then pooling the counts:
You can see that, proportionally the top dog names are a bit shorter (distribution is positively / right-skewed) compared to the cat names (slightly negatively / left skewed). Also note both are centered around names of length 5, and the one cat name of length 8 (Princess).
Looking at the dog names, do you notice something interesting about them? A particular feature present in nearly all? I did. Nearly every one of the top 25 dog names ends in a vowel. We can see this by visualizing the proportion of the counts for each type of pet by whether the name ends in a vowel or consonant:
Which to me, seems to indicate that more dogs tend to have “cutesy” names, usually ending in ‘y’, than cats.
Fun stuff, but one thing really bothers me… no “Fido” or “Boots”? I guess some once popular names have gone to the dogs.
References & Resources
Twitter Pop-up Analytics
Introduction
Background
Luckily I am not a developer. My data needs are simple for some ad hoc analysis. All I need is the data pulled and I am ready to go. Twitter now makes this easy now for anyone to do, just go to your settings page:
And then select the ‘Download archive’ button under ‘Your Twitter Archive’ (here it is a prompt to resend mine, as I took the screenshot after):
And boom! A CSV of all your tweets is in your inbox ready for analysis. After all this talk about working with “Big Data” and trawling through large datasets, it’s nice to take a breather a work with something small and simple.
Analysis
Conclusion
References
5 Ways to Do 2D Histograms in R
Introduction
Lately I was trying to put together some 2D histograms in R and found that there are many ways to do it, with directions on how to do so scattered across the internet in blogs, forums and of course, Stackoverflow.
As such I thought I’d give each a go and also put all of them together here for easy reference while also highlighting their difference.
For those not “in the know” a 2D histogram is an extensions of the regular old histogram, showing the distribution of values in a data set across the range of two quantitative variables. It can be considered a special case of the heat map, where the intensity values are just the count of observations in the data set within a particular area of the 2D space (bucket or bin).
So, quickly, here are 5 ways to make 2D histograms in R, plus one additional figure which is pretty neat.
First and foremost I get the palette looking all pretty using RColorBrewer, and then chuck some normally distributed data into a data frame (because I’m lazy). Also one scatterplot to justify the use of histograms.
# Color housekeeping
library(RColorBrewer)
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
r <- rf(32)
# Create normally distributed data for plotting
x <- rnorm(mean=1.5, 5000)
y <- rnorm(mean=1.6, 5000)
df <- data.frame(x,y)
# Plot
plot(df, pch=16, col='black', cex=0.5)
Option 1: hexbin
##### OPTION 1: hexbin from package 'hexbin' #######
library(hexbin)
# Create hexbin object and plot
h <- hexbin(df)
plot(h)
plot(h, colramp=rf)
Using the hexbinplot function provides greater flexibility, allowing specification of endpoints for the bin counting, and also allowing the provision of a transformation function. Here I did log scaling. Also it appears to handle the legend placement better; no adjustment was required for these figures.
# hexbinplot function allows greater flexibility
hexbinplot(y~x, data=df, colramp=rf)
# Setting max and mins
hexbinplot(y~x, data=df, colramp=rf, mincnt=2, maxcnt=60)
# Scaling of legend - must provide both trans and inv functions
hexbinplot(y~x, data=df, colramp=rf, trans=log, inv=exp)
Option 2: hist2d
##### OPTION 2: hist2d from package 'gplots' #######
library(gplots)
# Default call
h2 <- hist2d(df)
# Coarser binsizing and add colouring
h2 <- hist2d(df, nbins=25, col=r)
# Scaling with log as before
h2 <- hist2d(df, nbins=25, col=r, FUN=function(x) log(length(x)))
Option 3: stat_2dbin from ggplot
##### OPTION 3: stat_bin2d from package 'ggplot' #######
library(ggplot2)
# Default call (as object)
p <- ggplot(df, aes(x,y))
h3 <- p + stat_bin2d()
h3
# Default call (using qplot)
qplot(x,y,data=df, geom='bin2d')
# Add colouring and change bins
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r)
h3
# Log scaling
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r, trans="log")
h3
Option 4: kde2d
Setting n higher does interpolation and we are into the realm of kernel density estimation, as you can set your “bin size” lower than how your data actually appear. Hadley Wickham notes that in R there are over 20 packages [PDF] with which to do density estimation so we’ll keep that to a separate discussion.
##### OPTION 4: kde2d from package 'MASS' #######
# Not a true heatmap as interpolated (kernel density estimation)
library(MASS)
# Default call
k <- kde2d(df$x, df$y)
image(k, col=r)
# Adjust binning (interpolate - can be computationally intensive for large datasets)
k <- kde2d(df$x, df$y, n=200)
image(k, col=r)
Option 5: The Hard Way
##### OPTION 5: The Hard Way (DIY) #######
# http://stackoverflow.com/questions/18089752/r-generate-2d-histogram-from-raw-data
nbins <- 25
x.bin <- seq(floor(min(df[,1])), ceiling(max(df[,1])), length=nbins)
y.bin <- seq(floor(min(df[,2])), ceiling(max(df[,2])), length=nbins)
freq <- as.data.frame(table(findInterval(df[,1], x.bin),findInterval(df[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])
freq2D <- diag(nbins)*0
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]
# Normal
image(x.bin, y.bin, freq2D, col=r)
# Log
image(x.bin, y.bin, log(freq2D), col=r)
Bonus Figure
##### Addendum: 2D Histogram + 1D on sides (from Computational ActSci w R) #######
#http://books.google.ca/books?id=YWcLBAAAQBAJ&pg=PA60&lpg=PA60&dq=kde2d+log&source=bl&ots=7AB-RAoMqY&sig=gFaHSoQCoGMXrR9BTaLOdCs198U&hl=en&sa=X&ei=8mQDVPqtMsi4ggSRnILQDw&redir_esc=y#v=onepage&q=kde2d%20log&f=false
h1 <- hist(df$x, breaks=25, plot=F)
h2 <- hist(df$y, breaks=25, plot=F)
top <- max(h1$counts, h2$counts)
k <- kde2d(df$x, df$y, n=25)
# margins
oldpar <- par()
par(mar=c(3,3,1,1))
layout(matrix(c(2,0,1,3),2,2,byrow=T),c(3,1), c(1,3))
image(k, col=r) #plot the image
par(mar=c(0,2,1,0))
barplot(h1$counts, axes=F, ylim=c(0, top), space=0, col='red')
par(mar=c(2,0,0.5,1))
barplot(h2$counts, axes=F, xlim=c(0, top), space=0, col='red', horiz=T)
Conclusion
References
Stacked Area Graphs Are Not Your Friend
Stacked area graphs are not your friend. Seriously. I want to make this abundantly clear.
I’m going to expound on some of the work of Stephen Few here and lay out what stacked area graphs are, why they are a poor type of data visualization, and what are some good alternatives.
What is a stacked area graph?
Pretty, no? |
Shortcomings of Stacked Area Graphs
For instance, in the example graph I produced above, it can be easy to think there are very well-defined peaks in all the series around Jan 9 and Jan 22. This is because of the effect just mentioned. Look at the same graph if I selectively shuffle the order of the stacking of the areas:
While we still see those peaks at the times mentioned because those are the peaks for the total, but look at the series for Group D (in purple). Do you still feel the same about how it fluctuates between the dates of the 8th and the 22nd as you did before, in the first figure?
Because of the inclination to interpret the top of the area as quantity, interpreting the trend in the different areas of a stacked area graph is usually quite difficult.
Alternative Approaches
When the graph gets a bit noisy like this it might also be a good idea to thin the lines.
Okay, that’s better. But as the number of values of the categorical variable increases the graph is going to get increasingly noisy. What do we do in those cases?
Well, as I often have to remind myself, nowhere does it say that you have to tell your story all in one graph. There’s nothing stopping us from breaking up this one graph into smaller individual graphs, one for each and also the total. The disadvantage here is that it’s not as easy to compare between the different groups, however we can make it easier by using the same axis scaling for the graphs for each individual group.
Here there were an odd number of graphs so I chose to keep the graph for the total larger (giving it emphasis) and maintain their original aspect ratios. You could just as easily make a panel of 6 with equal sizes if you had a different number of graphs, or put them all in tall or wide graphic in a column or row.
Also, now that each individual graph depicts the value for a different group, we don’t need the colours on the figures on the right anymore; that information is in each individual plot title. So we can ditch the color. I’ll keep the total black to differentiate between the total and the value for individual group.
As the number of values for the categorical variable gets very large you go from multiple figures into true small multiple (trellis plot) territory, like in the figure below:
Another option, if you have the benefit of more dynamic visualization tools available, would be to use interactivity and gray out series in the background, such as in this amazing visualization of housing prices from the New York Times:
Click me for dataviz goodness. |
Concluding Remarks
In my opinion, and my experience working with data visualization, you are almost always better served by the simpler, more minimalistic types of visualizations (the fundamental three being the bar chart, line graph and scatterplot) than more complicated ones. This has been an example of that, as stacked area graphs are really just a combination of area graphs, which are, in turn, an extension of the line graph.
References
PCA and K-means Clustering of Delta Aircraft
Introduction
The point is that my line of business requires travel, and sometimes that is a lot of the time, like say almost all of last year. Inevitable comparisons to George Clooney’s character in Up in the Air were made (ironically I started to read that book, then left it on a plane in a seatback pocket), requests about favours involving duty free, and of course many observations and gently probing questions about frequent flier miles (FYI I’ve got more than most people, but a lot less than the entrepreneur I sat next to one time, who claimed to have close to 3 million).
But I digress.
Background
In my case this means flying Delta.
So I happened to notice in one of my many visits to Delta’s website that they have data on all of their aircraft in a certain site section. I thought this would be an interesting data set on which to do some analysis, as it has both quantitative and qualitative information and is relatively complex. What can we say about the different aircraft in Delta’s fleet, coming at it with ‘fresh eyes’? Which planes are similar? Which are dissimilar?
Aircraft data card from Delta.com
The data set comprises 33 variables on 44 aircraft taken from Delta.com, including both quantitative measures on attributes like cruising speed, accommodation and range in miles, as well as categorical data on, say, whether a particular aircraft has Wi-Fi or video. These binary categorical variables were transformed into quantitative variables by assigning them values of either 1 or 0, for yes or no respectively.
Analysis
data <- read.csv(file="delta.csv", header=T, sep=",", row.names=1) # scatterplot matrix of intermediary (size/non-categorical) variables plot(data[,16:22])
We can see that there are pretty strong positive correlations between all these variables, as all of them are related to the aircraft’s overall size. Remarkably there is an almost perfectly linear relationship between wingspan and tail height, which perhaps is related to some principle of aeronautical engineering of which I am unaware.
The exception here is the variable right in the middle which is the number of engines. There is one lone outlier [Boeing 747-400 (74S)] which has four, while all the other aircraft have two. In this way the engines variable is really more like a categorical variable, but we shall as the analysis progresses that this is not really important, as there are other variables which more strongly discern the aircraft from one another than this.
How do we easier visualize a high-dimensional data set like this one? By using a dimensionality reduction technique like principal components analysis.
Principal Components Analysis
Next let’s say I know nothing about dimensionality reduction techniques and just naively apply principle components to the data in R:
# Naively apply principal components analysis to raw data and plot pc <- princomp(data) plot(pc)
Taking that approach we can see that the first principal component has a standard deviation of around 2200 and accounts for over 99.8% of the variance in the data. Looking at the first column of loadings, we see that the first principle component is just the range in miles.
# First component dominates greatly. What are the loadings? summary(pc) # 1 component has > 99% variance loadings(pc) # Can see all variance is in the range in miles
Importance of components:
Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Comp.1 Â Â Â Comp.2 Â Â Â Comp.3 Â Â Â Comp.4
Standard deviation Â Â 2259.2372556 6.907940e+01 2.871764e+01 2.259929e+01
Proportion of Variance Â Â 0.9987016 9.337038e-04 1.613651e-04 9.993131e-05
Cumulative Proportion Â Â 0.9987016 9.996353e-01 9.997966e-01 9.998966e-01
Â Â Â Â Â Â Â
Â Â Â Â Â Â Â Â Â Â Â Â Â Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
Seat.Width..Club. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â -0.144 -0.110 Â Â Â Â Â Â Â
Seat.Pitch..Club. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â -0.327 -0.248 Â Â Â Â 0.189
Seat..Club. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Seat.Width..First.Class. Â Â Â Â Â Â Â Â 0.250 Â Â Â Â -0.160 Â Â Â Â -0.156 Â 0.136
Seat.Pitch..First.Class. Â Â Â Â Â Â Â Â 0.515 -0.110 -0.386 Â 0.112 -0.130 Â 0.183
Seats..First.Class. Â Â Â Â Â Â Â Â Â Â 0.258 -0.124 -0.307 -0.109 Â 0.160 Â 0.149
Seat.Width..Business. Â Â Â Â Â Â Â Â Â -0.154 Â 0.142 -0.108 Â Â Â Â Â Â Â Â Â Â Â
Seat.Pitch..Business. Â Â Â Â Â Â Â Â Â -0.514 Â 0.446 -0.298 Â 0.154 -0.172 Â 0.379
Seats..Business. Â Â Â Â Â Â Â Â Â Â Â -0.225 Â 0.187 Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Seat.Width..Eco.Comfort. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â 0.285 -0.224 Â Â Â Â
Seat.Pitch..Eco.Comfort. Â Â Â Â Â Â Â Â 0.159 Â Â Â Â Â Â Â Â 0.544 -0.442 Â Â Â Â
Seats..Eco.Comfort. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â 0.200 -0.160 Â Â Â Â
Seat.Width..Economy. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â 0.125 Â 0.110 Â Â Â Â Â Â Â
Seat.Pitch..Economy. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â 0.227 Â 0.190 Â Â Â Â -0.130
Seats..Economy. Â Â Â Â Â Â Â Â Â 0.597 Â Â Â Â -0.136 Â 0.345 -0.165 Â Â Â Â 0.168
Accommodation Â Â Â Â Â Â Â Â Â Â 0.697 Â Â Â Â Â Â Â -0.104 Â Â Â Â Â Â Â Â 0.233
Cruising.Speed..mph. Â Â Â Â Â Â Â Â Â Â 0.463 Â 0.809 Â 0.289 -0.144 Â 0.115 Â Â Â Â
Range..miles. Â Â Â Â Â Â 0.999 Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Engines Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Wingspan..ft. Â Â Â Â Â Â Â Â Â Â 0.215 Â Â Â Â 0.103 -0.316 -0.357 -0.466 -0.665
Tail.Height..ft. Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â -0.100 Â Â Â Â -0.187 Â Â Â Â
Length..ft. Â Â Â Â Â Â Â Â Â Â Â 0.275 Â Â Â Â 0.118 -0.318 Â 0.467 Â 0.582 -0.418
Wifi Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Video Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Power Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Satellite Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Flat.bed Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Sleeper Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Club Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
First.Class Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Business Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Eco.Comfort Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
Economy Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â
This is because the scale of the different variables in the data set is quite variable; we can see this by plotting the variance of the different columns in the data frame (regular scaling on the left, logarithmic on the right):
# verify by plotting variance of columns mar <- par()$mar par(mar=mar+c(0,5,0,0)) barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8) barplot(sapply(data, var), horiz=T, las=1, cex.names=0.8, log='x') par(mar=mar)
We correct for this by scaling the data using the scale() function. We can then verify that the variances across the different variables are equal so that when we apply principal components one variable does not dominate.
# Scale data2 <- data.frame(scale(data)) # Verify variance is uniform plot(sapply(data2, var))
After applying the scale() function the variance is now constant across variables |
Now we can apply principal components to the scaled data. Note that this can also be done automatically in call to the prcomp() function by setting the parameter scale=TRUE. Now we see a result which is more along the lines of something we would expect:
# Proceed with principal components pc <- princomp(data2) plot(pc) plot(pc, type='l') summary(pc) # 4 components is both 'elbow' and explains >85% variance
Great, so now we’re in business. There are various rules of thumb for selecting the number of principal components to retain in an analysis of this type, two of which I’ve read about are:
- Pick the number of components which explain 85% or greater of the variation
- Use the ‘elbow’ method of the scree plot (on right)
# Get principal component vectors using prcomp instead of princomp pc <- prcomp(data2) # First for principal components comp <- data.frame(pc$x[,1:4]) # Plot plot(comp, pch=16, col=rgb(0,0,0,0.5))
So what were are looking at here are twelve 2-D projections of data which are in a 4-D space. You can see there’s a clear outlier in all the dimensions, as well as some bunching together in the different projections.
library(rgl) # Multi 3D plot plot3d(comp$PC1, comp$PC2, comp$PC3) plot3d(comp$PC1, comp$PC3, comp$PC4)
# Determine number of clusters wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss) plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
Clustering without the nstart parameter can lead to variable results for each run |
Clustering with the nstart and iter.max parameters leads to consistent results, allowing proper interpretation of the scree plot |
# From scree plot elbow occurs at k = 4 # Apply k-means with k=4 k <- kmeans(comp, 4, nstart=25, iter.max=1000) library(RColorBrewer) library(scales) palette(alpha(brewer.pal(9,'Set1'), 0.5)) plot(comp, col=k$clust, pch=16)
# 3D plot plot3d(comp$PC1, comp$PC2, comp$PC3, col=k$clust) plot3d(comp$PC1, comp$PC3, comp$PC4, col=k$clust)
# Cluster sizes sort(table(k$clust)) clust <- names(sort(table(k$clust))) # First cluster row.names(data[k$clust==clust[1],]) # Second Cluster row.names(data[k$clust==clust[2],]) # Third Cluster row.names(data[k$clust==clust[3],]) # Fourth Cluster row.names(data[k$clust==clust[4],])
[1] “CRJ 100/200 Pinnacle/SkyWest” “CRJ 100/200 ExpressJet”
[3] “E120” Â Â Â Â Â Â Â Â Â Â Â Â “ERJ-145”
[1] “Airbus A330-200” Â Â Â Â Â “Airbus A330-200 (3L2)”
[3] “Airbus A330-200 (3L3)” Â Â “Airbus A330-300”
[5] “Boeing 747-400 (74S)” Â Â “Boeing 757-200 (75E)”
[7] “Boeing 757-200 (75X)” Â Â “Boeing 767-300 (76G)”
[9] “Boeing 767-300 (76L)” Â Â “Boeing 767-300 (76T)”
[11] “Boeing 767-300 (76Z V.1)” “Boeing 767-300 (76Z V.2)”
[13] “Boeing 767-400 (76D)” Â Â “Boeing 777-200ER”
[15] “Boeing 777-200LR”
[1] “Airbus A319” Â Â Â Â Â Â “Airbus A320” Â Â Â Â Â Â “Airbus A320 32-R”
[4] “Boeing 717” Â Â Â Â Â Â “Boeing 737-700 (73W)” Â “Boeing 737-800 (738)”
[7] “Boeing 737-800 (73H)” Â “Boeing 737-900ER (739)” “Boeing 757-200 (75A)”
[10] “Boeing 757-200 (75M)” Â “Boeing 757-200 (75N)” Â “Boeing 757-200 (757)”
[13] “Boeing 757-200 (75V)” Â “Boeing 757-300” Â Â Â Â “Boeing 767-300 (76P)”
[16] “Boeing 767-300 (76Q)” Â “Boeing 767-300 (76U)” Â “CRJ 700”
[19] “CRJ 900” Â Â Â Â Â Â Â Â “E170” Â Â Â Â Â Â Â Â Â “E175”
[22] “MD-88” Â Â Â Â Â Â Â Â Â “MD-90” Â Â Â Â Â Â Â Â Â “MD-DC9-50”
Ahhh, that’s the way fly (some day, some day…). This is apparently the plane professional sports teams and the American military often charter to fly – this articleÂ in the Sydney Morning Herald has more details.
Top: CRJ100/200. Bottom left: Embraer E120. Bottom right: Embraer ERJ-145. |
I’ve flown many times in the venerable CRJ 100/200 series planes, in which I can assure you there is only economy seating, and which I like to affectionately refer to as “little metal tubes of suffering.”
These are split into two clusters, which seem to again divide the planes approximately by size (both physical and accommodation), though there is crossover in the Boeing craft.
# Compare accommodation by cluster in boxplot boxplot(data$Accommodation ~ k$cluster, xlab='Cluster', ylab='Accommodation', main='Plane Accommodation by Cluster')
# Compare presence of seat classes in largest clusters data[k$clust==clust[3],30:33] data[k$clust==clust[4],30:33]
First.Class | Business | Eco.Comfort | Economy | |
Airbus A330-200 | 0 | 1 | 1 | 1 |
Airbus A330-200 (3L2) | 0 | 1 | 1 | 1 |
Airbus A330-200 (3L3) | 0 | 1 | 1 | 1 |
Airbus A330-300 | 0 | 1 | 1 | 1 |
Boeing 747-400 (74S) | 0 | 1 | 1 | 1 |
Boeing 757-200 (75E) | 0 | 1 | 1 | 1 |
Boeing 757-200 (75X) | 0 | 1 | 1 | 1 |
Boeing 767-300 (76G) | 0 | 1 | 1 | 1 |
Boeing 767-300 (76L) | 0 | 1 | 1 | 1 |
Boeing 767-300 (76T) | 0 | 1 | 1 | 1 |
Boeing 767-300 (76Z V.1) | 0 | 1 | 1 | 1 |
Boeing 767-300 (76Z V.2) | 0 | 1 | 1 | 1 |
Boeing 767-400 (76D) | 0 | 1 | 1 | 1 |
Boeing 777-200ER | 0 | 1 | 1 | 1 |
Boeing 777-200LR | 0 | 1 | 1 | 1 |
First.Class | Business | Eco.Comfort | Economy | |
Airbus A319 | 1 | 0 | 1 | 1 |
Airbus A320 | 1 | 0 | 1 | 1 |
Airbus A320 32-R | 1 | 0 | 1 | 1 |
Boeing 717 | 1 | 0 | 1 | 1 |
Boeing 737-700 (73W) | 1 | 0 | 1 | 1 |
Boeing 737-800 (738) | 1 | 0 | 1 | 1 |
Boeing 737-800 (73H) | 1 | 0 | 1 | 1 |
Boeing 737-900ER (739) | 1 | 0 | 1 | 1 |
Boeing 757-200 (75A) | 1 | 0 | 1 | 1 |
Boeing 757-200 (75M) | 1 | 0 | 1 | 1 |
Boeing 757-200 (75N) | 1 | 0 | 1 | 1 |
Boeing 757-200 (757) | 1 | 0 | 1 | 1 |
Boeing 757-200 (75V) | 1 | 0 | 1 | 1 |
Boeing 757-300 | 1 | 0 | 1 | 1 |
Boeing 767-300 (76P) | 1 | 0 | 1 | 1 |
Boeing 767-300 (76Q) | 1 | 0 | 1 | 1 |
Boeing 767-300 (76U) | 0 | 1 | 1 | 1 |
CRJ 700 | 1 | 0 | 1 | 1 |
CRJ 900 | 1 | 0 | 1 | 1 |
E170 | 1 | 0 | 1 | 1 |
E175 | 1 | 0 | 1 | 1 |
MD-88 | 1 | 0 | 1 | 1 |
MD-90 | 1 | 0 | 1 | 1 |
MD-DC9-50 | 1 | 0 | 1 | 1 |
Looking at the raw data, the difference I can ascertain between the largest two clusters is that all the aircraft in the one have first class seating, whereas all the planes in the other have business class instead [the one exception being the Boeing 767-300 (76U)].
Conclusions
If I did this again, I would structure the data differently and see what relationships such analysis could draw out using only select parts of the data (e.g. aircraft measurements only). The interesting lesson here is that it when using techniques like dimensionality reduction and clustering it is not only important to be mindful of applying them correctly, but also what variables are in your data set and how they are represented.
References & Resources
http://en.wikipedia.org/wiki/Principal_components_analysis
Big Data Week Toronto 2014 Recap – Meetup #3: Big Data Visualization
This past week was Big Data Week for those of you that don’t know, a week of talks and events held worldwide to “unite the global data communities through series of events and meetups”.
Viafoura put on the events this year for Toronto and was kind enough to extend an invitation to myself to be one of the speakers talking on data visualization and how that relates to all this “Big Data” stuff.
Paul spoke detecting fraud online using visualization and data science techniques. Something I often think about when presenting is how to make your message clear and connect with both the least technical people in the audience (who, quite often, have attended strictly out of curiosity) and the most knowledgeable and technically-minded people present.
I was really impressed with Paul’s visual explanation of the Jaccard coefficient. Not everyone understands set theory, however almost everyone will understand a Venn diagram if you put it in front of them.
So to explain the Jaccard index as a measure of mutual information when giving a presentation, which is better? You could put the definition up on a slide:
Heatmap of Toronto Traffic Signals using RGoogleMaps
A little while back there was an article in blogTO about how a reddit user had used data from Toronto’s Open Data initiative to produce a rather cool-looking map of all the locations of all the traffic signals here in the city.
It’s neat because as the author on blogTO notes, it is recognizable as Toronto without any other geographic data being plotted – the structure of the city comes out in the data alone.
Still, I thought it’d be interesting to see as a geographic heat map, and also a good excuse to fool around with mapping using Rgooglemaps.
The finished product below:
Despite my best efforts with transparency (using my helper function), it’s difficult for anything but the city core to really come out in the intensity map.
The image without the Google maps tile, and the coordinates rotated, shows the density a little better in the green-yellow areas:
And it’s also straightforward to produce a duplication of the original black and white figure:
The R code is below. Interpolation is using the trusty kde2d function from the MASS library and a rotation is applied for the latter two figures, so that the grid of Toronto’s streets faces ‘up’ as in the original map.
# Toronto Traffic Signals Heat Map
# Myles Harrison
# https://www.everydayanalytics.ca
# Data from Toronto Open Data Portal:
# http://www.toronto.ca/open
library(MASS)
library(RgoogleMaps)
library(RColorBrewer)
source('colorRampPaletteAlpha.R')
# Read in the data
data <- read.csv(file="traffic_signals.csv", skip=1, header=T, stringsAsFactors=F)
# Keep the lon and lat data
rawdata <- data.frame(as.numeric(data$Longitude), as.numeric(data$Latitude))
names(rawdata) <- c("lon", "lat")
data <- as.matrix(rawdata)
# Rotate the lat-lon coordinates using a rotation matrix
# Trial and error lead to pi/15.0 = 12 degrees
theta = pi/15.0
m = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow=2)
data <- as.matrix(data) %*% m
# Reproduce William's original map
par(bg='black')
plot(data, cex=0.1, col="white", pch=16)
# Create heatmap with kde2d and overplot
k <- kde2d(data[,1], data[,2], n=500)
# Intensity from green to red
cols <- rev(colorRampPalette(brewer.pal(8, 'RdYlGn'))(100))
par(bg='white')
image(k, col=cols, xaxt='n', yaxt='n')
points(data, cex=0.1, pch=16)
# Mapping via RgoogleMaps
# Find map center and get map
center <- rev(sapply(rawdata, mean))
map <- GetMap(center=center, zoom=11)
# Translate original data
coords <- LatLon2XY.centered(map, rawdata$lat, rawdata$lon, 11)
coords <- data.frame(coords)
# Rerun heatmap
k2 <- kde2d(coords$newX, coords$newY, n=500)
# Create exponential transparency vector and add
alpha <- seq.int(0.5, 0.95, length.out=100)
alpha <- exp(alpha^6-1)
cols2 <- addalpha(cols, alpha)
# Plot
PlotOnStaticMap(map)
image(k2, col=cols2, add=T)
points(coords$newX, coords$newY, pch=16, cex=0.3)
This a neat little start and you can see how this type of thing could easily be extended to create a generalized mapping tool, stood up as a web service for example (they’re out there). Case in point: Google Fusion Tables. I’m unsure as to what algorithm they use but I find it less satisfying, looks like some kind of simple point blending:
As always, all the code is on github.
Interactive Visualization: Explore the 2014 “Sunshine List”
The “Sunshine List”, a list of Ontario’s public service who make more than $100,000 a year, was brought in under the Harris government in 1996, in an effort to increase transparency about just how much the top paid public servants were earning.
Now the list is released each year by the Ontario Ministry of Finance.
However, there has been some frustration that the data are not in the most easily accessible format (HTML & PDF? Really guys?).
Stuart A. Thompson was kind enough to provide the data in an easier to digest format (Nick Ragaz also provided it in CSV), as well as producing a tool for exploring it on The Globe and Mail.
I thought it’d be great to get a more visual exploration of the data at fine granularity, so have produced this interactive visualization below.
You can filter the data by searching by employer, name, or position. You can also filter the list by selecting points or groups of points on the scatterplot on the left, and highlight groups of points (or individual employees) by selecting components in the bar graph at the right. The bar graph on the right can also be expanded and collapsed to view the aggregate salary and benefits by employer, or to view the quantities for individual employees.
Hovering over data points on either graph will display the related data in a tooltip – I’ve found this is handy for looking at individual points of interest on the scatterplot. Zoom and explore to find interesting patterns and individuals. Give it a try!
I’ve plotted benefit against salary with the latter having a logarithmic axis so that the data are easier visualized and explored (note that I am in no way suggesting that benefits are a function of salary).
Using combinations of all these possible interactions mentioned above you can do some interesting visual analysis: for instance, how do the top salaries and benefits earned by Constables across police departments in Ontario differ (seriously, take a look)? What are the relative pay and benefit levels of professors at Ontario Universities on the list? How much does Rob Ford make?
Something interesting I’ve already noticed is that for many employers there are long horizontal bands where employees’ salaries vary but they are fixed into the same benefit buckets. Others have a different relationship, for example, the benefit / salary ratios of those at my alma mater vs those of employees of the City of Toronto:
Hope this tool will be interesting and useful for those interested in the list. As always feedback is welcome in the comments.
Perception in Data Visualization – A Quick 7 Question Test
When most people think of data, they probably think of a dry, technical analysis, without a lot of creativity or freedom. Quite to the contrary, data visualization encompasses choices of design, creative freedom, and also (perhaps most interestingly) elements of cognitive psychology, particularly related to the science of visual perception and information processing.
If you read any good text on dataviz, like Tufte, Few, or Cairo, you will, at some point, come across a discussion of the cognitive aspects of data visualization (the latter two devoting entire chapters to this topic). This will likely include a discussion of the most elemental ways to encode information visually, and their respective accuracies when quantity is interpreted from them, usually referencing the work of Cleveland & McGill [PDF].
Mulling over the veracity of my brief mention of the visual ways of encoding quantity in my recent talk, and also recently re-reading Nathan Yau’s discussion of the aforementioned paper, I got to thinking about just how different the accuracy of interpretation between the different encodings might be.
I am not a psychologist or qualitative researcher, but given the above quickly put together a simple test of 7 questions in Google Docs, to examine the accuracy of interpreting proportional quantities when encoded visually; and I humbly request the favour of your participation. If there are enough responses I will put together what analysis is possible in a future post (using the appropriate visualization techniques, of course).
Apologies in advance for the grade-school wording of the questions, but I wanted to be as clear as possible to ensure consistency in the results. Thanks so much in advance for contributing! Click below for the quiz:
EDIT: The quiz will now be up indefinitely on this page.