It’s been said that the ideal job is one you love enough to do for free but are good enough at that people will pay you for it; that if you do what you love no matter what others may say, and if you work at it hard enough, and long enough, eventually people will recognize it and you’ll be a success.
What started as simply a hobby and posting some sketches online turned into a cornerstone of internet popular culture, with a cult following amongst geekdom, the technically savvy, and more.
Though I would say that it’s gone beyond that now: even those less nerdy and techie know what xkcd means. Indeed, Mr. Munroe’s work has sway due to the sheer number of people who know and love it, and content on the site has resulted in changes being made on some of the biggest sites on the Internet. Take, for example, Google adding a read-aloud feature for comments in 2008, quite possibly because of a certain comic.
As another nerdy / tech / data citizen of the internet who knows, loves and follows xkcd, I thought I could pay tribute to it with its own everyday analysis.
Initially, I thought I would have to go about doing it the hard way again. I’ve done some web scraping before with Python and thought this would be the same using the (awesome) Beautiful Soup package.
That being said, it was straightforward to write some Python with urllib2 to download the data and then get going on the analysis.
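For anyone who wants to try this themselves, here is a minimal sketch of the download step. It is not the code used for this post: it uses Python 3’s urllib.request rather than urllib2, and the helper names (comic_json_url, parse_comic, fetch_comic) are made up for illustration. It relies on xkcd’s per-comic JSON endpoint at /<number>/info.0.json.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://xkcd.com"


def comic_json_url(number):
    """Return the JSON metadata URL for one comic number."""
    return f"{BASE_URL}/{number}/info.0.json"


def parse_comic(raw):
    """Keep only the fields used in this analysis from a raw JSON payload."""
    record = json.loads(raw)
    return {
        "num": record["num"],
        "img": record["img"],
        "title": record["title"],
        # The endpoint returns year, month and day as separate values.
        "date": (int(record["year"]), int(record["month"]), int(record["day"])),
        "transcript": record.get("transcript", ""),
        "alt": record.get("alt", ""),
    }


def fetch_comic(number, opener=urlopen):
    """Download and parse the metadata for one comic (makes a network request)."""
    with opener(comic_json_url(number)) as response:
        return parse_comic(response.read().decode("utf-8"))
```

Calling fetch_comic(556) should return a dictionary for that comic. Looping over a range of numbers would need some error handling for numbers that do not resolve to a comic, and ideally a polite delay between requests.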
Of course, after doing all that I realized that someone else was nice enough to have already written the equivalent code in R to access the data. D’oh! Oh well. Lesson learned – Google stuff first, code later.
But it was important to write that code in Python as I used the Python Imaging Library (PIL) (also awesome… thanks mysterious, shadowy developers at Pythonware/Secret Labs AB) to extract metadata from the comic images.
As well as the data provided via the JSON (comic #, url, title, date, transcript and alt text), I pulled out additional fields using the Python Imaging Library (file format, filesize, dimensions, aspect ratio and luminosity). I also wanted to calculate hue; regrettably, this is a somewhat more complicated process which my image processing chops were not immediately up to, so I deferred on this point.
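A rough sketch of how those image fields could be pulled out is below. It assumes the Pillow fork of PIL is installed, and it makes one notable assumption: luminosity is taken to be the mean pixel value after converting the image to 8-bit greyscale (0 is black, 255 is white). The post does not spell out the exact definition used, and image_metadata is a hypothetical helper name.

```python
import os

from PIL import Image, ImageStat


def image_metadata(path):
    """Collect basic file and pixel statistics for one comic image."""
    with Image.open(path) as img:
        width, height = img.size
        # Assumption: luminosity = mean grey level of the image (0-255).
        # For an animated GIF this only looks at the first frame.
        luminosity = ImageStat.Stat(img.convert("L")).mean[0]
        return {
            "format": img.format,        # e.g. "PNG", "JPEG", "GIF"
            "mode": img.mode,            # e.g. "L", "P", "RGB", "RGBA"
            "filesize": os.path.getsize(path),  # bytes on disk
            "width": width,
            "height": height,
            "aspect_ratio": width / height,
            "luminosity": luminosity,
        }
```

Running this over every downloaded image and collecting the dictionaries into a table gives one row per comic, which is the shape the rest of the analysis needs.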
You can see that out of the 1203 comics analyzed, 1073 (~89.19%) were in PNG format, 128 (~10.64%) were in JPEG and only 2 (#961, Eternal Flame and #1116, Stoplight) (~0.17%) were in GIF. This is, of course, because the latter are the only two comics which are animated.
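As a quick sanity check on those figures, the shares can be recomputed from the three counts quoted above. Note that the counts sum to 1203, and the quoted percentages are consistent with that total.

```python
# Counts by file format, as quoted in the text above.
format_counts = {"PNG": 1073, "JPEG": 128, "GIF": 2}

total = sum(format_counts.values())

# Percentage share of each format, rounded to two decimals.
shares = {fmt: round(100 * n / total, 2) for fmt, n in format_counts.items()}

for fmt, pct in shares.items():
    print(f"{fmt}: {format_counts[fmt]} of {total} ({pct}%)")
```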
Looking at the filetype over time below, you can see that initially xkcd was primarily composed of JPEG images (mostly because they were scanned sketches) and this quickly changed over time to being almost exclusively PNG with the exception of the two aforementioned animated GIFs. The lone outlying JPEG near 600 is Alternative Energy Revolution (#556).
The image modes reported by PIL are:
- L - 8-bit black & white
- P - 8-bit colour (palette-mapped)
- RGB - colour
- LA, RGBA - black & white with alpha channel (transparency), colour with alpha channel
You can see that the majority are in image mode L (847, ~70.41%), followed by 346 in RGB (~28.76%) and a tiny number in P (8, ~0.7%), with the remaining two in L and RGB modes with an alpha channel (LA and RGBA).
Any reader will know that the bulk of xkcd comics are simple black-and-white images with stick figures, and you can see this reflected in the almost ¾ to ¼ ratio of monochrome to coloured images.
Looking at the image mode over time, we can see that like the filetype, almost all of the images were initially in RGB mode as they were scans. After this period, the coloured comics are fairly evenly interspersed with the more common black and white images.
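Luminosity was the only quality of the images which changed significantly over the years that Randall has created the comic. An analysis of variance of luminosity against year shows a statistically significant effect of year on the average comic brightness (p ≈ 0.004, so significant at the 99% level). Since year enters the model with a single degree of freedom, this is effectively a test for a linear trend across years: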
> aov(data$lumen ~ data$year)
Call:
   aov(formula = data$lumen ~ data$year)
Terms:
                data$year Residuals
Sum of Squares     5762.0  829314.2
Deg. of Freedom         1      1201
Residual standard error: 26.27774
Estimated effects may be unbalanced
> summary(aov(data$lumen ~ data$year))
              Df Sum Sq Mean Sq F value  Pr(>F)
data$year      1   5762    5762   8.344 0.00394 **
Residuals   1201 829314     691
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
True, there are currently fewer data points for 2013; however, the result is still significant at the 99% level when that year is excluded.
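For readers less familiar with ANOVA tables, the F value and residual standard error can be recomputed by hand from the sums of squares and degrees of freedom in the R output above. The short sketch below only redoes that arithmetic; it does not refit the model, and it does not compute the p-value (which needs the F distribution).

```python
import math

# Values copied from the aov() output above.
ss_year, df_year = 5762.0, 1
ss_resid, df_resid = 829314.2, 1201

# A mean square is a sum of squares divided by its degrees of freedom.
ms_year = ss_year / df_year
ms_resid = ss_resid / df_resid

# The F statistic is the ratio of the two mean squares.
f_value = ms_year / ms_resid

# The residual standard error is the square root of the residual mean square.
resid_se = math.sqrt(ms_resid)

print(f"F = {f_value:.3f}, residual standard error = {resid_se:.5f}")
```

Up to rounding in the printed sums of squares, these should agree with the F value and residual standard error reported by R.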
The average luminosity decreases by year, and this is seen in the plot below which shows a downward trend:
Next we look at the sizes of each comic. xkcd ranges from very tall comic-book style strips to ultra-simplistic single small images driving the whole point or punch line home in one frame.
Looking at the height of each comic versus the width, you can see that there appear to be several standard widths at which Randall produces the comic (not so with heights). These standard widths are 400, 500, 600, 640, and 740 pixels.
We can see these reflected in the distribution of all image widths, where 740 is by far the most common comic width. There is no such pattern in the image heights, which appear to have a more logarithmic-like distribution.
Interestingly, the ‘canonical’ widths are not constant over time: there were several widths which were used frequently near the beginning, after which the more common standard of 740px was used. This may be due to the large number of scanned images near the beginning, as I imagine scanning an A4 sheet of paper would often result in the same image resolutions.
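Spotting the standard widths does not need a plot; a simple frequency count does the job. Here is a small sketch, where common_widths is a hypothetical helper and sample_widths is made-up illustrative data rather than the real comic widths.

```python
from collections import Counter


def common_widths(widths, top=5):
    """Return the `top` most frequent widths as (width, count) pairs."""
    return Counter(widths).most_common(top)


# Made-up sample; the real input would be one width per comic.
sample_widths = [740, 740, 740, 640, 600, 740, 500, 400, 640, 740]

for width, count in common_widths(sample_widths, top=3):
    print(f"{width}px: {count} comics")
```

Applying the same count to the heights would show whether any values repeat often enough to count as standard.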
The one lone outlier on the high end of image width is 780px wide and is #1193, Externalities.
As well as examining the resolution (dimensions) of the comic images we can also examine the distribution of the images by their filesize.
You can see that the majority of the images are below 100 KB in size. In general the xkcd comics are quite small, as the majority are simple PNGs displaying very little visual information.
We can also look at the comic size (area in square pixels) versus the filesize:
There is clearly a relationship here, as illustrated on the log-log plot on the right with the trend line. Of course, I am just stating the obvious: this relationship is not unique to the comics and is a property of image formats in general.
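A straight trend line on a log-log plot corresponds to a power law, filesize ≈ c × area^k. One common way to get such a line is ordinary least squares on the logged values. The sketch below assumes that approach, which is not necessarily how the trend line in the plot was produced, and fit_power_law is a hypothetical helper name.

```python
import math


def fit_power_law(areas, filesizes):
    """Least-squares line through (log area, log filesize).

    Returns (k, c) such that filesize is approximately c * area ** k.
    All inputs must be positive, since logarithms are taken.
    """
    xs = [math.log(a) for a in areas]
    ys = [math.log(s) for s in filesizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard simple-regression formulas on the logged data.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)
```

An exponent k below 1 would mean filesize grows more slowly than pixel area, which is what you would expect if larger comics are mostly more white space.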
As an analyst who has come out of the school of mathematics, and continually been immersed in the world of technology, it is reassuring to read something like xkcd and feel like you’re not the only one who thinks matters of data, math, science, and technology can be funny, along with all the other quirkiness and craziness in life which Randall so aptly (and sometimes poignantly) portrays.
That being said, Randall’s one dedicated guy who has done some awesome work for the digitally connected social world of science, technology, and internet geekdom, and now we know how much he likes using 740px as the canvas width, and that xkcd has gradually been using fewer white pixels over the years.