Introduction
Lately I was trying to put together some 2D histograms in R and found that there are many ways to do it, with directions on how to do so scattered across the internet in blogs, forums and of course, Stackoverflow.
As such I thought I’d give each a go and also put all of them together here for easy reference while also highlighting their difference.
For those not “in the know” a 2D histogram is an extensions of the regular old histogram, showing the distribution of values in a data set across the range of two quantitative variables. It can be considered a special case of the heat map, where the intensity values are just the count of observations in the data set within a particular area of the 2D space (bucket or bin).
So, quickly, here are 5 ways to make 2D histograms in R, plus one additional figure which is pretty neat.
First and foremost I get the palette looking all pretty using RColorBrewer, and then chuck some normally distributed data into a data frame (because I’m lazy). Also one scatterplot to justify the use of histograms.
# Color housekeeping
library(RColorBrewer)
rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
r <- rf(32)
# Create normally distributed data for plotting
x <- rnorm(mean=1.5, 5000)
y <- rnorm(mean=1.6, 5000)
df <- data.frame(x,y)
# Plot
plot(df, pch=16, col='black', cex=0.5)
Option 1: hexbin
##### OPTION 1: hexbin from package 'hexbin' #######
library(hexbin)
# Create hexbin object and plot
h <- hexbin(df)
plot(h)
plot(h, colramp=rf)
Using the hexbinplot function provides greater flexibility, allowing specification of endpoints for the bin counting, and also allowing the provision of a transformation function. Here I did log scaling. Also it appears to handle the legend placement better; no adjustment was required for these figures.
# hexbinplot function allows greater flexibility
hexbinplot(y~x, data=df, colramp=rf)
# Setting max and mins
hexbinplot(y~x, data=df, colramp=rf, mincnt=2, maxcnt=60)
# Scaling of legend - must provide both trans and inv functions
hexbinplot(y~x, data=df, colramp=rf, trans=log, inv=exp)
Option 2: hist2d
##### OPTION 2: hist2d from package 'gplots' #######
library(gplots)
# Default call
h2 <- hist2d(df)
# Coarser binsizing and add colouring
h2 <- hist2d(df, nbins=25, col=r)
# Scaling with log as before
h2 <- hist2d(df, nbins=25, col=r, FUN=function(x) log(length(x)))
Option 3: stat_2dbin from ggplot
##### OPTION 3: stat_bin2d from package 'ggplot' #######
library(ggplot2)
# Default call (as object)
p <- ggplot(df, aes(x,y))
h3 <- p + stat_bin2d()
h3
# Default call (using qplot)
qplot(x,y,data=df, geom='bin2d')
# Add colouring and change bins
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r)
h3
# Log scaling
h3 <- p + stat_bin2d(bins=25) + scale_fill_gradientn(colours=r, trans="log")
h3
Option 4: kde2d
Setting n higher does interpolation and we are into the realm of kernel density estimation, as you can set your “bin size” lower than how your data actually appear. Hadley Wickham notes that in R there are over 20 packages [PDF] with which to do density estimation so we’ll keep that to a separate discussion.
##### OPTION 4: kde2d from package 'MASS' #######
# Not a true heatmap as interpolated (kernel density estimation)
library(MASS)
# Default call
k <- kde2d(df$x, df$y)
image(k, col=r)
# Adjust binning (interpolate - can be computationally intensive for large datasets)
k <- kde2d(df$x, df$y, n=200)
image(k, col=r)
Option 5: The Hard Way
##### OPTION 5: The Hard Way (DIY) #######
# http://stackoverflow.com/questions/18089752/r-generate-2d-histogram-from-raw-data
nbins <- 25
x.bin <- seq(floor(min(df[,1])), ceiling(max(df[,1])), length=nbins)
y.bin <- seq(floor(min(df[,2])), ceiling(max(df[,2])), length=nbins)
freq <- as.data.frame(table(findInterval(df[,1], x.bin),findInterval(df[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])
freq2D <- diag(nbins)*0
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]
# Normal
image(x.bin, y.bin, freq2D, col=r)
# Log
image(x.bin, y.bin, log(freq2D), col=r)
Bonus Figure
##### Addendum: 2D Histogram + 1D on sides (from Computational ActSci w R) #######
#http://books.google.ca/books?id=YWcLBAAAQBAJ&pg=PA60&lpg=PA60&dq=kde2d+log&source=bl&ots=7AB-RAoMqY&sig=gFaHSoQCoGMXrR9BTaLOdCs198U&hl=en&sa=X&ei=8mQDVPqtMsi4ggSRnILQDw&redir_esc=y#v=onepage&q=kde2d%20log&f=false
h1 <- hist(df$x, breaks=25, plot=F)
h2 <- hist(df$y, breaks=25, plot=F)
top <- max(h1$counts, h2$counts)
k <- kde2d(df$x, df$y, n=25)
# margins
oldpar <- par()
par(mar=c(3,3,1,1))
layout(matrix(c(2,0,1,3),2,2,byrow=T),c(3,1), c(1,3))
image(k, col=r) #plot the image
par(mar=c(0,2,1,0))
barplot(h1$counts, axes=F, ylim=c(0, top), space=0, col='red')
par(mar=c(2,0,0.5,1))
barplot(h2$counts, axes=F, xlim=c(0, top), space=0, col='red', horiz=T)
Thank you for the overview. Of course R ist great and you are not restricted to the 5 options specified in the blog. The contour plot may not be as visually compelling as your option #4, but it is printable in b/w-publications, it is often easier to read the numbers, it is readable for the color blind and it is less arbitrary then color scales tend to be. That is for larger amounts.
Even if it is not strictly a histogram I feel, that good old sunflower plot should be mentioned here, as it gives a good overview of the counts of 2D data. Doing the binning yourself you can easily chance it into a histogram.
I am confident, that there is a wealth of other options to display 2D histogram data. Have you looked into pseudo-3D plots like persp and scatter3D / scatterplot3D and the like? Usually "flat" plot will be superior, but not necessary always.
ggplot2 also has stat_binhex:
p = ggplot(df, aes(x,y))
p + stat_binhex(bins=25) + scale_fill_gradientn(colours=r)
But they are not histogram. They are scatterplot. Am I wrong?
Of course! And you are absolutely correct regarding contour plots, given what we know about perceiving quantity encoded as colour. I was not familiar with sunflower plots, thank you for teaching me something new. There is also the option to use the style parameter in the hexbin function to produced similar output, with style="lattice" or style="nested.centroids"
I am generally opposed to 3D visualization (unless it is interactive) as flattening a 3D image into 2D usually results in poor or difficult visualization or both. In my opinion it can be good in certain situations depending on the properties of the data, or used in a supplementary fashion.
And yes, of course, there exists a wealth of options to do data visualization (using R or not); it is up to the one doing analysis to use them appropriately and effectively.
Ah! Can't believe I missed that! Thank you.
The result is quite similar to hexbin.
As I understand it, a scatterplot plots the data itself, a histogramm first bins and counts the data before plotting. Unless the binning is in a lot of shingles and the counting is weighted, then it becomes a density plot. So hexbinning forms histogramms that look very similar to scatterplots.
Cheers,
Bernhard
I have to admit, that even though I consider the sunflower plot a usefull plot of it's own right, I have yet to see one in a published paper. People seem to prefer jittered plots even for discrete data. They need even less explanation. The one thing I dislike in them is the random factor.
Thank you very much for linking the "style=lattice" hexbin plot. That is what you thought me today. Looking forward to using that soon.
Cheers,
Bernhard