xkcd: Visualized

Introduction

It’s been said that the ideal job is one you love enough to do for free but are good enough at that people will pay you for it. That if you do what you love no matter what others may say, and if you work at it hard enough, and long enough, eventually people will recognize it and you’ll be a success.

Such is the case with Randall Munroe. Because any nerd worth their salt knows what xkcd is.

What started as simply a hobby, posting some sketches online, turned into a cornerstone of internet popular culture, with a cult following amongst geekdom, the technically savvy, and more.

Though I would say that it’s gone beyond that now, and even those less nerdy and techie know what xkcd means – it’s become such a key part of internet popular culture. Indeed, Mr. Munroe’s work has sway due to the sheer number of people who know and love it, and content on the site has resulted in changes being made on some of the biggest sites on the Internet – take, for example, Google adding a comment read-aloud feature in 2008, quite possibly because of a certain comic.

As another nerdy / tech / data citizen of the internet who knows, loves and follows xkcd, I thought I could pay tribute to it with its own everyday analysis.

Background

Initially, I thought I would have to go about doing it the hard way again. I’ve done some web scraping before with Python and thought this would be the same using the (awesome) Beautiful Soup package.

But Randall, being the tech-savvy (and Creative Commons abiding) guy that he is, was nice enough to provide an API to return all the comic metadata in JSON format (thanks Randall!).

That being said, it was straightforward to write some Python with urllib2 to download the data and then get going on the analysis.
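As a rough illustration of that download step, here is a minimal sketch. It is not the original script: it uses Python 3's urllib.request rather than urllib2, and the helper names (comic_url, parse_comic, fetch_comic) are made up for this example. The URL pattern and field names follow the public xkcd JSON interface.

```python
import json
from urllib.request import urlopen

# URL pattern of the xkcd JSON interface (one document per comic number).
API_TEMPLATE = "https://xkcd.com/{num}/info.0.json"


def comic_url(num):
    """Build the metadata URL for a given comic number."""
    return API_TEMPLATE.format(num=num)


def parse_comic(raw_json):
    """Turn one JSON document into a flat record (hypothetical layout)."""
    record = json.loads(raw_json)
    # year, month and day arrive as strings, so convert before formatting.
    date = "{:04d}-{:02d}-{:02d}".format(
        int(record["year"]), int(record["month"]), int(record["day"])
    )
    return {
        "num": record["num"],
        "title": record["title"],
        "date": date,
        "img": record["img"],
        "alt": record.get("alt", ""),
        "transcript": record.get("transcript", ""),
    }


def fetch_comic(num, timeout=10):
    """Download and parse the metadata for one comic (needs network access)."""
    with urlopen(comic_url(num), timeout=timeout) as response:
        return parse_comic(response.read().decode("utf-8"))
```

Calling fetch_comic(n) in a loop over the comic numbers, and skipping any number that returns an HTTP error, would be enough to build the full table.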

Of course, after doing all that I realized that someone else was nice enough to have already written the equivalent code in R to access the data. D’oh! Oh well. Lesson learned – Google stuff first, code later.

But it was important to write that code in Python as I used the Python Imaging Library (PIL) (also awesome… thanks mysterious, shadowy developers at Pythonware/Secret Labs AB) to extract metadata from the comic images.

The data includes the 1204 comics from the very beginning (#1, Barrel – Part 1 posted on Jan 1, 2006) to #1204, Detail, posted on April 26, 2013.

As well as the data provided via the JSON (comic #, url, title, date, transcript and alt text) I pulled out additional fields using the Python Imaging Library (file format, filesize, dimensions, aspect ratio and luminosity). I also wanted to calculate hue; regrettably, this is a somewhat more complicated process which my image processing chops were not immediately up to, and so I deferred on this point.
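The extraction of those extra fields might look something like the sketch below. Several things here are assumptions rather than the original code: the function names, the definition of aspect ratio as width divided by height, and the definition of luminosity as the mean pixel value (0 to 255) after converting the image to 8-bit greyscale. The Pillow import sits inside image_metadata so the pure-Python helpers work without it.

```python
import os


def aspect_ratio(width, height):
    """Width divided by height (assumed definition; > 1 means a wide comic)."""
    return width / height


def luma(r, g, b):
    """Greyscale value of one RGB pixel using the ITU-R 601 weights."""
    return (299 * r + 587 * g + 114 * b) / 1000


def mean_luminosity(grey_pixels):
    """Average of 8-bit greyscale pixel values (0 = black, 255 = white)."""
    pixels = list(grey_pixels)
    return sum(pixels) / len(pixels)


def image_metadata(path):
    """Collect per-image fields from a file on disk (requires Pillow)."""
    from PIL import Image, ImageStat  # third-party: pip install Pillow

    with Image.open(path) as img:
        width, height = img.size
        grey = img.convert("L")  # 8-bit greyscale copy of the image
        return {
            "format": img.format,        # e.g. "PNG", "JPEG", "GIF"
            "mode": img.mode,            # e.g. "L", "P", "RGB", "RGBA"
            "filesize": os.path.getsize(path),
            "width": width,
            "height": height,
            "aspect_ratio": aspect_ratio(width, height),
            "luminosity": ImageStat.Stat(grey).mean[0],
        }
```

A mostly white comic with thin black lines lands close to 255 on this scale, which fits the high luminosity values reported later in the post.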

Analysis

File type
Ring chart of xkcd comics by file type
Bar chart of xkcd comics by file type


You can see that out of the 1204 comics, 1073 (~89.19%) were in PNG format, 128 (~10.64%) were in JPEG and only 2 (#961, Eternal Flame and #1116, Stoplight) (~0.17%) were in GIF. This is, of course, because the latter are the only two comics which are animated.
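For anyone checking the arithmetic, the quoted percentages can be reproduced from the three counts. Note that the counts sum to 1203 rather than 1204, and the percentages only match with 1203 as the denominator. The ANOVA output further down (1 + 1201 degrees of freedom) points to 1203 observations as well. The text does not say why one comic number is missing from the data. The function name below is just for this sketch.

```python
def shares(counts):
    """Percentage share of each category, rounded to two decimals."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts as quoted in the text above.
file_types = {"PNG": 1073, "JPEG": 128, "GIF": 2}

print(sum(file_types.values()))  # number of images actually counted
print(shares(file_types))
```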

Looking at the filetype over time below, you can see that initially xkcd was primarily composed of JPEG images (mostly because they were scanned sketches) and this quickly changed over time to being almost exclusively PNG with the exception of the two aforementioned animated GIFs. The lone outlying JPEG near 600 is Alternative Energy Revolution (#556).

strip chart of xkcd comics by file type

Image Mode
Next we can look at the image mode of all the xkcd images. For a little context, the image modes are roughly as follows:
  • L – 8 bit black & white
  • P – 8 bit colour
  • RGB – colour
  • LA, RGBA – black & white with alpha channel (transparency), colour with alpha channel

The breakdown for all the comics is depicted below.

ring chart of xkcd comics by image mode
bar chart of xkcd comics by image mode

You can see that the majority are in image mode L (847, ~70.41%), followed by 346 in RGB (~28.76%) and a tiny number in P (8, ~0.7%), with the remaining two in L and RGB modes with alpha channel (LA & RGBA).

Any readers will know that the bulk of xkcd comics are simple black-and-white images with stick figures and you can see this reflected in the almost ¾ to ¼ ratio of monochrome to coloured images.

The two images with alpha channel are Helping (#383) and Click and Drag (#1110), most likely because of the soft image effect and interactivity, respectively.

Looking at the image mode over time, we can see that like the filetype, almost all of the images were initially in RGB mode as they were scans. After this period, the coloured comics are fairly evenly interspersed with the more common black and white images.

strip chart of xkcd comics by image mode

Luminosity

You can see in the figure on the left that given the black-and-white nature of xkcd the luminosity of each image is usually quite high (the maximum is 255). We can see the distribution better summarized on the right in a histogram:

scatterplot of luminosity of xkcd comics
histogram of luminosity of xkcd comics

Luminosity was the only quality of the images which changed significantly over the years that Randall has created the comic. An analysis of variance shows a statistically significant year-on-year difference in the average comic brightness (p ≈ 0.004, i.e. significant at better than the 99% level):

> aov(data$lumen ~ data$year)
Call:
   aov(formula = data$lumen ~ data$year)

Terms:
                data$year Residuals
Sum of Squares     5762.0  829314.2
Deg. of Freedom         1      1201

Residual standard error: 26.27774
Estimated effects may be unbalanced
> summary(aov(data$lumen ~ data$year))
              Df Sum Sq Mean Sq F value  Pr(>F)
data$year      1   5762    5762   8.344 0.00394 **
Residuals   1201 829314     691

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

True, there are currently fewer data points for 2013; however, repeating the analysis without that year still gives a result that is significant at the 99% level.
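The numbers in that output can be checked by hand. With a single degree of freedom for data$year, R has treated year as a numeric variable, so the F value is the one for the slope of a simple linear regression of luminosity on year. The sketch below only recomputes the F statistic and residual standard error from the printed sums of squares. The function names are made up for this example, and it does not compute the p-value, which needs the F distribution.

```python
import math


def f_statistic(ss_effect, df_effect, ss_resid, df_resid):
    """F = mean square of the effect / mean square of the residuals."""
    ms_effect = ss_effect / df_effect
    ms_resid = ss_resid / df_resid
    return ms_effect / ms_resid


def residual_standard_error(ss_resid, df_resid):
    """Square root of the residual mean square."""
    return math.sqrt(ss_resid / df_resid)


# Values copied from the aov() output above.
f_value = f_statistic(5762.0, 1, 829314.2, 1201)
rse = residual_standard_error(829314.2, 1201)
print(round(f_value, 3), round(rse, 3))
```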

The average luminosity decreases by year, and this is seen in the plot below which shows a downward trend:

line plot of average luminosity of xkcd per year

Image Dimensions
Next we look at the sizes of each comic. xkcd ranges from very tall comic-book style strips to ultra-simplistic single small images driving the whole point or punch line home in one frame.

scatterplot of height vs. width for xkcd comics

Looking at the height of each comic versus the width, you can see that there appear to be several standard widths which Randall produces the comic at (not so with heights). These standard widths are 400, 500, 600, 640, and 740 pixels.

distribution of image heights of xkcd comic

We can see these reflected in the distribution of all image widths: 740 is by far the most common comic width. There is no such pattern in the image heights, which appear to have a more logarithmic-like distribution.

histogram of width of xkcd comics
histogram of height of xkcd comics

Interestingly, the ‘canonical’ widths are not constant over time – there were several widths which were used frequently near the beginning, after which the more common standard of 740px was used. This may be due to the large number of scanned images near the beginning, as I imagine scanning an A4 sheet of paper would often result in the same image resolutions.

scatterplot of width of xkcd comics

The one lone outlier on the high end of image width is 780px wide and is #1193, Externalities.

Looking at the aspect ratio of the comics over time, you can see that there appear to be two classes of comics – a larger number (about 60%) which are more tightly clustered around an even 1:1 aspect ratio, and then a second class more evenly distributed with aspect ratio 2 and above. There are also small peaks around 1.5 and 1.75.

scatterplot of aspect ratio of xkcd comics
histogram of aspect ratio of xkcd comics

In case you were wondering, the comic with an aspect ratio of ~8 is Tags (#1144) and the tallest comic proportionally is Future Timeline (#887).

Filesize

As well as examining the resolution (dimensions) of the comic images we can also examine the distribution of the images by their filesize.

distribution of file size of xkcd comics

You can see that the majority of the images are below 100K in size – in general the xkcd comics are quite small as the majority are simple PNGs displaying very little visual information.

We can also look at the comic size (area in square pixels) versus the filesize:

scatterplot of file size versus image size of xkcd comics
scatterplot of file size versus image size of xkcd comics (with trend line)

There is clearly a relationship here, as illustrated on the log-log plot on the right with the trend line. Of course, I am just stating the obvious – this relationship is not unique to the comics and exists as a property of the image formats in general.
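A straight trend line on a log-log plot corresponds to a power law, filesize ≈ c · area^b. The post does not say how the trend line was fitted. The sketch below assumes ordinary least squares on the base-10 logarithms of both variables, and the function name is hypothetical.

```python
import math


def fit_power_law(areas, filesizes):
    """Least-squares line through (log10 area, log10 filesize).

    Returns (slope, intercept), so that
    filesize is approximately 10**intercept * area**slope.
    """
    xs = [math.log10(a) for a in areas]
    ys = [math.log10(s) for s in filesizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Running the same fit separately on the PNG and JPEG subsets would give the per-format slopes and intercepts.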

If we separated out the images by file type (JPEG and PNG) I believe we would see different numbers for the relationship as a result of the particularities of the image compression techniques.

Conclusions

I have this theory that how funny a joke is to someone who gets it is inversely proportional to the number of other people who would get it. That is to say, the more esoteric and niche the comedy is, the funnier and more appealing it is to those who actually get the punch line. It’s a feeling of being special – a feeling that someone else understands and that the joke was made just for you, and others like you, and that you’re not alone in thinking that comics involving Avogadro’s number, Kurt Gödel or Turing completeness can be hilarious.

As an analyst who has come out of the school of mathematics, and continually been immersed in the world of technology, it is reassuring to read something like xkcd and feel like you’re not the only one who thinks matters of data, math, science, and technology can be funny, along with all the other quirkiness and craziness in life which Randall so aptly (and sometimes poignantly) portrays.

That being said, Randall’s one dedicated guy who has done some awesome work for the digitally connected social world of science, technology, and internet geekdom, and now we know how much he likes using 740px as the canvas width, and that xkcd is gradually using fewer white pixels over the years.

And let’s hope there will be many more years to come.

Resources

xkcd
xkcd – JSON API
xkcd – Wikipedia
code on github

Top 10 Super Bowl XLVII Commercials in Social TV (Respin)

So the Super Bowl is kind of a big deal.

Not just because there’s a lot of football. And not just because it’s a great excuse to get together with friends and drink a whole lot of beer and eat unhealthy foods. And not because it’s a good excuse to shout at your new 72″ flatscreen with home theater surround that you bought at Best Buy just for your Super Bowl party and are going to try to return the next day even though you’re pretty sure now that they don’t let you do that any more.

The Super Bowl is a big deal for marketers. For creatives. For ‘social media gurus’. Because there’s a lot of eyeballs watching those commercials. In fact, I’m pretty sure there’s people going to Super Bowl parties who don’t even like football and are just there for the commercials, that is if they’ve not decided to catch all the best ones after the fact on YouTube.

And also, you know, because if you’re putting down $6 million for a minute of commercial airtime, you want to make sure that those dollars are well spent.

So Bluefin Labs is generating a lot of buzz lately as they were acquired by Twitter. TV is big, social media is big, so Social TV analytics must be even bigger, right? Right?

Anyhow Bluefin showed up recently in my Twitter feed for a different reason: their report on the Top 10 Super Bowl XLVII commercials in Social TV that they did for AdAge.

The report’s pretty and all, but a little too pretty for my liking, so I thought I’d respin some of it.

Breakdown by Gender:

Superbowl XLVII Commercial Social Mentions by Gender

You can see that the male / female split is fairly even overall, with the exception of the NFL Network’s ad and to a lesser extent the ad for Fast & Furious 6 which were more heavily mentioned proportionally by males. The Budweiser, Calvin Klein and Taco Bell spots had greater percentages of women commenting.

Sentiment

The Taco Bell, Dodge and Budweiser ads had the most mentions with positive sentiment. The NFL ad had a very large share of neutral comments (74%), proportionally more so than any other ad. The Go Daddy ad had the most negative mentions, for good reason – it’s gross and just kind of weird. It wouldn’t be the Super Bowl if Go Daddy didn’t air a commercial of questionable taste though, right?
Superbowl XLVII Commercial Sentiment Breakdown by Gender
Superbowl XLVII Commercial Sentiment Breakdown by Gender (Proportional)
Lastly, I am going to go against the grain here and say that the next big thing in football is most definitely going to be Leon Sandcastle.

Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA, without a hint of skepticism or irony) people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can go around really calling themselves a data scientist because no one even really knows what data science is yet, there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve done the first real work which I think falls under the category of data science.

I’m no Python guru, but threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time; that’s for a later article. For now, here’s one of my exploratory plots, of the content rating by date. Already we can gather from this that, at least at Newgrounds, 4-and-a-half stars equals perfection.

Sure feels like science.

FBI iPhone Leak Breakdown

Don’t know if you heard, but something that is making the news today is that hacker group AntiSec purportedly gained control of an FBI agent’s laptop and got a hold of 12 million UDIDs which were apparently being tracked.

A UDID is Apple’s unique identifier for each of its ‘iDevices’, and if known could be used to get a lot of personally identifiable information about the owner of each product.

The hackers released the data on pastebin here. In the interests of protecting the privacy of the users, they removed all said personally identifiable information from the data. This is kind of a shame in a way, as it would have been interesting to do an analysis of the geographic distribution of the devices which were (allegedly) being tracked, amongst other things. I suppose they released the data for more (allegedly) altruistic purposes – i.e. to let people find out if the FBI was tracking them, not to have the data analyzed.

The one useful column that was left was the device type. Surprisingly, the majority of devices were iPads. Of course, this could just be unique to the million and one records of the 12 million which the group chose to release.

Breakdown:
iPhone: 345,384 (34.5%)
iPad: 589,720 (59%)
iPod touch: 63,724 (6.4%)
Undetermined: 1,173 (0.1%)
Total: 1,000,001

Forgive me Edward Tufte, for using a pie chart.

Google Domestic Trends

Google’s mission is to organize all the world’s information and make it universally accessible and useful. In following their mission, the company has produced some amazing tools which allow any internet user to do some data visualization without so much as having to open a spreadsheet.

One of these tools which I stumbled across the other day (which apparently has existed for some time) is Google Domestic Trends.

I was previously aware of Google Trends, which allows a user to compare the popularity of different search terms, whether it be for serious reasons (e.g. Android vs. iPhone) or say, for something less serious. In Domestic Trends, Google has aggregated relevant search terms across different sectors of the economy, with the results presumably providing insight into market trends by sector (or at least the popularity of those market sectors with respect to time).

I am not an economist, but data are data, so here goes with the pithy commentary and observations.

Air Travel
It’s seasonal, unsurprisingly. Looks like there might be some deals over the holidays I was unaware of. Or that might be a really bad time to buy tickets.

Link

Auto Buyers
As Google notes on the Domestic Trends frontpage, July 2009 was when the U.S. Government instituted its “Cash for Clunkers” program. However, it was also when Toyota recalled almost half a million vehicles due to defective airbags. Oh yeah, and that spike in 2005 is related to the outrageous change in the gas prices of the time.

Link

Bankruptcy
New record. I’m glad I rent.

Link

Computers and Electronics
Seriously, who buys desktops anymore?

Link

Credit Cards
A poignant portrait of the changing state of the American economy and personal debt.

Link

Durable Goods
Merry Christmas honey, I got you a Roomba.

Link

Education
School’s out for summer.

Link

Jobs
I want to say that the little spike later in 2011 has nothing to do with employment and is due to Mr. Jobs retiring; however, I would then expect a much larger one in October.

Link

Mobile and Wireless
The iPhone was revealed to the public on January 9th, 2007 and went on sale in June of the same year. The iPhone 3G and 3GS came out in July 2008 and June 2009 respectively. The 4S was released in October 2011. Not sure about mid-2010. The Blackberry Torch came out in June but that would hardly warrant what we see here.

Link

Rental & Real Estate
Apparently it is quite seasonal. Peaks drop off around late July and early August. Students, I would guess.

Link

Shopping
We’ve seen this before. No surprises here.

Link

Unemployment
I know the word you’re thinking of. It’s on the tip of your tongue and it starts with ‘R’.

Link

See also: Google NGram Experiments.

rhok (n’ roll)

This past weekend was rhok Toronto, a fun, exhausting, educational, and all-around amazing weekend which I was honoured to be involved in.

The team I was fortunate enough to be a part of produced a prototype web-service to promote fair housing, and improve the ease of the submission process for investigations into housing by-law violations. An added bonus was that this resulted in this nice visualization of more City of Toronto data.

You can learn more about rhok here.