I Heart Sushi

Introduction

I like sushi.

I’ve been trying to eat a bit better lately though (aren’t we all?) and so got to wondering: just how bad for you is sushi exactly? What are some of the better nutritional choices I can make when I go out to eat at my favorite Japanese(ish) place? What on the menu should I definitely avoid?

And then I got thinking the way I normally think about the world: hey, it’s all just data. I remembered that I could take some nutritional information as raw data, as I’ve previously done ages ago for Mickey D’s, and see if anything interesting pops out. Plus this seemed like as good an excuse as any to do some work with the good old data analysis and visualization stack for Python, and IPython notebooks, instead of my usual go-to tool of R.

So let’s have a look, shall we?

Background

As always, the first step is getting the data; sometimes the most difficult step. The menu I chose to use was that of Sushi Stop (I am in no way affiliated or associated with said brand, nor am I endorsing it), where the nutritional information unfortunately was only available as a PDF, as is often the case.

This is a hurdle that data analysts, and more often I think research analysts and data journalists, can run into. Fortunately there are tools at our disposal to deal with this kind of thing, so not to worry. Using the awesome Tabula and a little bit of ad hoc cleaning from the command line, it was a simple matter of extracting the data from the PDF into a convenient CSV. Boom, and we’re ready to go.

Tabula

The data comprises 335 unique items in 17 different categories with 15 different nutritional variables. Let’s dig in.

Analysis

First we include the usual suspects in the python data analysis stack (numpy, matplotlib and pandas), then read the data into a dataframe using pandas.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
data = pd.read_csv("tabula-nutritional-information.csv", delimiter=",")

Okay, what are we working with here? Let’s take a look:

In [3]:
print data.columns
print len(data.columns)
data.head()
Index([u'category', u'item', u'serving_size', u'calories', u'fat', u'saturated_fat', u'trans_fat', u'cholesterol', u'sodium', u'carbohydrates', u'fibre', u'sugar', u'protein', u'vitamin_a', u'vitamin_c', u'calcium', u'iron'], dtype='object')
17
Out[3]:
category item serving_size calories fat saturated_fat trans_fat cholesterol sodium carbohydrates fibre sugar protein vitamin_a vitamin_c calcium iron
0 APPETIZERS & SALADS Shrimp Tempura 60 180 8.0 0.0 0 40 125 18 0 0 8 0 0 0 0
1 APPETIZERS & SALADS Three salads 120 130 3.5 0.0 0 60 790 13 4 8 8 2 6 40 8
2 APPETIZERS & SALADS Wakame 125 110 2.0 0.0 0 0 1650 13 4 9 0 0 0 110 0
3 APPETIZERS & SALADS Miso soup 255 70 3.0 0.5 0 0 810 8 1 1 6 0 0 20 25
4 APPETIZERS & SALADS Grilled salmon salad 276 260 19.0 2.5 0 30 340 12 3 6 12 80 80 8 8

5 rows × 17 columns

Let’s look at the distribution of the different variables. You can see that most are heavily skewed or follow power-law / log-normal type distributions, as most things in nature do. Interestingly, there is a little blip in the serving sizes around 600 which we’ll see later is the ramen soups.

In [4]:
# Have a look
plt.figure(0, figsize=(25,12), dpi=80)
for i in range(2, len(data.columns)):
    fig = plt.subplot(2, 8, i)
    plt.title(data.columns[i], fontsize=25)
    plt.hist(data[data.columns[i]])
    # fig.tick_params(axis='both', which='major', labelsize=15)
plt.tight_layout()

Let’s do something really simple, and without looking at any of the other nutrients just look at the caloric density of the foods. We can find this by dividing the number of calories in each item by the serving size. We’ll just look at the top 10 worst offenders or so:

In [5]:
data['density']= data['calories']/data['serving_size']
data[['item','category','density']].sort('density', ascending=False).head(12)
Out[5]:
item category density
314 Yin Yang Sauce EXTRAS 5.000000
311 Ma! Sauce EXTRAS 4.375000
75 Akanasu (brown rice) HOSOMAKI 3.119266
0 Shrimp Tempura APPETIZERS & SALADS 3.000000
312 Spicy Light Mayo EXTRAS 2.916667
74 Akanasu HOSOMAKI 2.844037
67 Akanasu avocado (brown rice) HOSOMAKI 2.684564
260 Teriyaki Bomb ‐ brown rice (1 pc) TEMARI 2.539683
262 Teriyaki Bomb ‐ brown rice (4 pcs) TEMARI 2.539683
66 Akanasu avocado HOSOMAKI 2.483221
201 Inferno Roll (brown rice) SUMOMAKI 2.395210
259 Teriyaki Bomb (1 pc) TEMARI 2.380952

12 rows × 3 columns

The most calorically dense thing is Yin Yang Sauce, which as far as I could ascertain is just mayonnaise and something else put in a yin-yang shape on a plate. Excluding the other sauces (I assume Ma! also includes mayo), the most calorically dense foods are the variations of the Akanasu roll (sun-dried tomato pesto, light cream cheese, sesame), shrimp tempura (deep fried, so not surprising) and teriyaki bombs, which are basically seafood, cheese and mayo smushed into a ball, deep fried and covered with sauce (guh!). I guess sun-dried tomato pesto has a lot of calories. Wait a second, does brown rice have more calories than white? Oh right, sushi is made with sticky rice, and yes, yes it does. Huh, today I learned.

We can get a more visual overview of the entire menu by plotting the two quantities against each other: calories on the y-axis, serving size on the x-axis (so caloric density is the slope of a line from the origin to each point). Here we colour by category and get a neat little scatterplot.

In [7]:
# Get the unique categories
categories = np.unique(data['category'])

# Get the colors for the unique categories
cm = plt.get_cmap('spectral')
cols = cm(np.linspace(0, 1, len(categories)))

# Iterate over the categories and plot
plt.figure(figsize=(12,8))
for category, col in zip(categories, cols):
    d = data[data['category']==category]
    plt.scatter(d['serving_size'], d['calories'], s=75, c=col, label=category.decode('ascii', 'ignore'))
plt.xlabel('Serving Size (g)', size=15)
plt.ylabel('Calories', size=15)
plt.title('Serving Size vs. Calories', size=18)

legend = plt.legend(title='Category', loc='center left', bbox_to_anchor=(1.01, 0.5),
ncol=1, fancybox=True, shadow=True, scatterpoints=1)
legend.get_title().set_fontsize(15)

You can see that the nigiri & sashimi generally have smaller serving sizes and so fewer calories. The ramen soups are in a category all their own, with much larger serving sizes than the other items, as I mentioned before and we saw in the histograms. The other rolls are kind of in the middle. The combos, small ramen soups and some of the appetizers and salads also sit away from the ‘main body’ of the rest of the menu.

Points which lie further above the line y = x have higher caloric density, and you can see that even though the top items we picked out above had the highest raw values and we can probably guess where they are in the graph (the sauces are the vertical blue line near the bottom left, and the Akanasu are probably those pairs of dark green dots to the right), there are other categories which are probably worse overall, like the cluster of red which is sushi pizza. Which category of the menu has the highest caloric density (and so is likely best avoided) overall?

In [8]:
# Find most caloric dense categories on average
density = data[['category','density']]
grouped = density.groupby('category')
grouped.agg(np.average).sort('density', ascending=False).head()
Out[8]:
density
category
EXTRAS 2.421875
SUSHI PIZZA 2.099515
CRISPY ROLLS 1.969304
TEMARI 1.807691
HAKO 1.583009

5 rows × 1 columns

As expected, we see that other than the extras (sauces) which have very small serving sizes, on average the sushi pizzas are the most calorically dense group of items on the menu, followed by crispy rolls. The data confirm: deep fried = more calories.

What if we were only concerned with fat (as many weight-conscious people dining out are)? Let’s take a look at the different categories with a little more depth than just a simple average:

In [9]:
# Boxplot of fat content
fat = data[['category','fat']]
grouped = fat.groupby('category')

# Sort
df2 = pd.DataFrame({col:vals['fat'] for col,vals in grouped})
meds = df2.median()
meds.sort(ascending=True)
df2 = df2[meds.index]

# Plot
plt.figure(figsize=(12,8))
fatplot = df2.boxplot(vert=False)

While the combos and the appetizers and salads have very wide ranges in their fat content, we see again that the sushi pizza and crispy rolls have the most fat collectively and so are best avoided.

Now another thing people are often worried about when they are trying to eat well is the amount of sodium they take in. So let’s repeat our previous approach in visually examining caloric density, only this time plot it as one metric on the x-axis and look at where different items on the menu sit with regards to their salt content.

In [10]:
fig = plt.figure(figsize=(12,8))
plt.xlim(0,6)
plt.ylim(-50, 2000)
for category, col in zip(categories, cols):
    d = data[data['category']==category]
    plt.scatter(d['density'], d['sodium'], s=75, c=col, label=category.decode('ascii', 'ignore'))
plt.xlabel('Caloric density (calories/g)', size=15)
plt.ylabel('Sodium (mg)', size=15)
plt.title('Sodium vs. Caloric Density', size=18)

legend = plt.legend(title='Category', loc='center left', bbox_to_anchor=(1.01, 0.5),
ncol=1, fancybox=True, shadow=True, scatterpoints=1)
legend.get_title().set_fontsize(15)

Here we can see that while the extras (sauces) are very calorically dense, you’re probably not going to take in a crazy amount of salt unless you go really heavy on them (bottom right). If we’re really worried about salt, the ramen soups should be avoided, as most of them have very high sodium content (the straight line of light green near the left), some north of 1500 mg, which is the recommended daily intake set by Health Canada for adults aged 14-50. There are also some of the other items we’ve seen before not looking so good (sushi pizza). Some of the temari (like the teriyaki bombs) and sumomaki (‘regular’ white-on-the-outside maki rolls) should probably be avoided too. But which ones?

A plot like this is pretty crowded, I’ll admit, so it’s really better explored interactively, and we can do that using the very cool (and very much under development) MPLD3 package, which combines the convenience of matplotlib with the power of D3.

Below is the same scatterplot, only interactive, so you can mouse over and see what each individual point is. The items to be most avoided (top right in grey and orange), are indeed the teriyaki bombs, as well as the inferno roll (tempura, light cream cheese, sun-dried tomato pesto, red and orange masago, green onion, spicy light mayo, spicy sauce, sesame) as we saw before. Apparently that sun-dried tomato pesto is best taken in moderation.

The Akanasu rolls are the horizontal line of 4 green points close by. Your best bet is probably just to stick to the nigiri and sashimi, and maybe some of the regular maki rolls closer to the bottom left corner.

In [11]:
import mpld3
fig, ax = plt.subplots(figsize=(12,8))
ax.set_xlim(0,6)
ax.set_ylim(-50,2000)
N = 100

for category, col in zip(categories, cols):
    d = data[data['category']==category]
    scatter = ax.scatter(d['density'], d['sodium'], s=40, c=col, label=category.decode('ascii', 'ignore'))
    labels = list(d['item'])
    tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
    mpld3.plugins.connect(fig, tooltip)


mpld3.display()
Out[11]:

Conclusion

Well, there we have it folks. A simple look at the data tells us some common-sense things we probably already knew:

  • Deep fried foods will make you fat
  • Mayo will make you fat
  • Soup at Japanese restaurants is very salty
  • Sashimi is healthy if you go easy on the soy

And surprisingly, one thing I would not have thought: that sun-dried tomato pesto is apparently really bad for you if you’re eating conscientiously.

That’s all for now. See you next time and enjoy the raw fish.

References and Resources

Tabula
http://tabula.technology/

Sushi Stop – Nutritional Information (PDF)
http://www.sushishop.com/themes/web/assets/files/nutritional-information-en.pdf

Food & Nutrition – Sodium in Canada (Health Canada)
http://www.hc-sc.gc.ca/fn-an/nutrition/sodium/index-eng.php

code & data on github
https://github.com/mylesmharrison/i_heart_sushi/

Everything in Its Right Place: Visualization and Content Analysis of Radiohead Lyrics

Introduction

I am not a huge Radiohead fan.

To be honest, the Radiohead I know and love and remember is that which was a rock band without a lot of ‘experimental’ tracks – a band you discovered on Big Shiny Tunes 2, or because your friends told you about it, or because it was playing in the background of a bar you were at sometime in the 90’s.

But I really do like their music, I’ve become familiar with more of it and overall it does possess a certain unique character in its entirety. Their range is so diverse and has changed so much over the years that it would be really hard not to find at least one track that someone will like. In this way they are very much like the Beatles, I suppose.

I was interested in doing some more content analysis type work and text mining in R, so I thought I’d try song lyrics and Radiohead immediately came to mind.

Background

In order to first do the analysis, we need all the data (shocking, I know). Somewhat surprisingly, putting ‘radiohead data‘ into Google comes up with little except for many, many links to the video and project for House of Cards which was made using LIDAR technology and had the data set publicly released.
So once again we are in this situation where we are responsible for not only analyzing all the data and communicating the findings, but also getting it as well. Such is the life of an analyst, everyday and otherwise (see my previous musing on this point).
The lyrics data was taken from the listing of Radiohead lyrics at Green Plastic Radiohead.

Normally it would be simply a matter of throwing something together in Python using Beautiful Soup as I have done previously. Unfortunately, due to the way these particular pages were coded, that proved to be a bit more difficult than expected.

As a result the extraction process ended up being a convoluted ad-hoc data wrangling exercise involving the use of wget, sed and Beautiful Soup – a process which was neither enjoyable nor something I would care to repeat.
In retrospect, two points:

1. Getting the data is not always easy. Sometimes sitting down beforehand and looking at where you are getting it from, the format it is in and how best to go about getting it into the format you need will save you a lot of wasted time and frustration in the long run. Ask questions before you begin – what format is the data in now? What format do I need (or would I like) it to be in to do the analysis? What steps are required to get from one to the other (i.e. what is the data transformation or mapping process)? That being said, my methods got me where I needed to be; however, there were most likely easier, more straightforward approaches which would have saved a lot of frustration on my part.

2. If you’re going to code a website, use a sane page structure and give important page elements ids. Make it easy on your other developers (and the rest of the world in general) by labeling your <div> containers and other elements with ids (which are unique!!) or at least classes. Otherwise how are people going to scrape all your data and steal it for their own ends? I joke… kind of. In this case my frustrations actually stemmed mainly from some questionable code for a cache-buster. But even once I got past that, the contents of the main page containers were somewhat inconsistent. Such is life, and the internet.
The remaining data – album and track lengths – were taken from the Wikipedia pages for each album and later merged with the calculations done on the text data in R.
Okay, enough whinging – we have the data – let’s check it out.

Analysis

I stuck with what I consider to be the ‘canonical’ Radiohead albums – that is, the big releases you’ve probably heard about even if you’re, like me, not a hardcore Radiohead fan – 8 albums in total (Pablo Honey, The Bends, OK Computer, Kid A, Amnesiac, Hail to the Thief, In Rainbows, and The King of Limbs).
Unstructured (and non-quantitative) data always lends itself to more interesting analysis – with something like text, how do we analyze it? How do we quantify it? Let’s start with the easily quantifiable parts and go from there.
Track Length
Below is a boxplot of the track lengths per album, with the points overplotted.

Distribution of Radiohead track lengths by album
Interestingly, Pablo Honey and Kid A have the largest ranges of track length (from 2:12 to 4:40 and 3:42 to 7:01 respectively) – if you ignore the single tracks around 2 minutes on Amnesiac and Hail to the Thief, the variance of their track lengths is more in line with all the other albums. Ignoring the single outlier, The King of Limbs appears to be special given its narrow range of track lengths.
Word Count
Next we look at the number of words (lyrics) per album:
Distribution of number of words per Radiohead album

There is a large range of word counts, from the two truly instrumental tracks (Treefingers on Kid A and Hunting Bears on Amnesiac) to the wordier tracks (Dollars and Cents and A Wolf at the Door). Pablo Honey almost looks like it has two categories of songs – with a split around the 80 word mark.

Okay, interesting and all, but again these are small amounts of data and only so much can be drawn out as such.

Going forward we examine two calculated quantities.

Calculated Quantities – Lexical Density and ‘Lyrical Density’

In the realm of content analysis there is a measure known as lexical density which is a measure of the number of content words as a proportion of the total number of words – a value which ranges from 0 to 100. In general, the greater the lexical density of a text, the more content heavy it is and more ‘unpacking’ it takes to understand – texts with low lexical density are easier to understand.

According to Wikipedia the formula is as follows:

Ld = (Nlex / N) × 100

where Ld is the analysed text’s lexical density, Nlex is the number of lexical word tokens (nouns, adjectives, verbs, adverbs) in the analysed text, and N is the number of all tokens (total number of words) in the analysed text.

Now, I am not a linguist, however it sounds like this is just the ratio of words which are not stopwords to the total number – or could at least be approximated by it. That’s what I went with in the calculations in R using the tm package (because I’m not going to write a package to calculate lexical density by myself).

On a related note, I completely made up a quantity which I am calling ‘lyrical density’ which is much easier to calculate and understand – this is just the number of lyrics per song over the track length, and is measured in words per second. An instrumental track would have lyrical density of zero, and a song with one word per second for the whole track would have a lyrical density of 1.
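To make the two quantities concrete, here is a rough sketch of the calculations in Python (the actual numbers for this post were crunched in R with the tm package; the stopword list below is just an illustrative stand-in, and the sample track length is arbitrary):

STOPWORDS = {"the", "a", "an", "and", "or", "but", "of", "in", "on", "to",
             "is", "are", "was", "were", "be", "i", "you", "it", "that", "this"}

def lexical_density(lyrics):
    # proportion of content (non-stopword) words to total words, from 0 to 100
    words = lyrics.lower().split()
    if not words:
        return 0.0          # instrumental track
    content = [w for w in words if w not in STOPWORDS]
    return 100.0 * len(content) / len(words)

def lyrical_density(lyrics, track_length_seconds):
    # number of words divided by track length, in words per second
    return len(lyrics.split()) / float(track_length_seconds)

sample = "karma police arrest this man he talks in maths"
print(lexical_density(sample))
print(lyrical_density(sample, 255))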

Lexical Density

Distribution of lexical density of Radiohead songs by album
Looking at the calculated lexical density per album, we can see that the majority of songs have a lexical density between about 30 and 70. The two instrumental songs have a lexical density of 0 (as they have no words), and the distribution appears most even on OK Computer. The most content-word-heavy song, I Will (No Man’s Land), is on Hail to the Thief.
If you could imagine extending the number of songs Radiohead has written to infinity, you might get a density function something like the one below, with the bulk of songs having a density between 30 and 70 (which I imagine is a reasonable range for any text) and a little bump at 0 for their instrumental songs:
Histogram of lexical density of Radiohead tracks with overplotted density function
Lyrical Density
Next we come to my calculated quantity, lyrical density – or the number of words per second on each track.
Distribution of lyrical density of Radiohead tracks by album

Interestingly, there are outlying tracks near the high end where the proportion of words to the song length is greater than 1 (Fitter Happier, A Wolf at the Door, and Faust Arp). Fitter Happier shouldn’t even really count, as it is really an instrumental track with a synthesized voice dubbed overtop. If you listen to A Wolf at the Door it is clear why the lyrical density is so high – Thom is practically rapping at points. Otherwise Kid A and The King of Limbs seem to have less quickly sung lyrics than the other albums on average.

Lexical Density + Lyrical Density
Putting it all together, we can examine the quantities for all of the Radiohead songs in one data visualization. You can examine different albums by clicking the color legend at the right, and compare multiple albums by holding CTRL and clicking more than one.


The songs are colour-coded by album. The points are plotted by lexical density along the y-axis against lyrical density along the x-axis, and sized by the total number of words in the song. As such, the position of a point in the plot gives an idea of the rate of lyrical content in the track – a song like I Might Be Wrong is fitting a lot fewer content words into a song at a slower rate than a track like A Wolf at the Door, which is packed much tighter with both lyrics and meaning.

Conclusion

This was an interesting project and it was fascinating to take something everyday like song lyrics and analyze them as data (though some Radiohead fans might argue that there is nothing ‘everyday’ about Radiohead lyrics).
All in all, I feel that a lot of the analysis has to be taken with a grain of salt (or a shaker or two), given the size of the data set (n = 89). 
That being said, I still feel it is proof positive that you can take something typically thought of as very artistic and qualitative, like a song, and classify it in a meaningful way in quantitative fashion. I had never listened to the song Fitter Happier, yet it is a clear outlier in several measures – and listening to the song I discovered why: it is a track with a robot-like voiceover, not containing sung lyrics at all.
A more interesting and ambitious project would be to take a much larger data set, where the measures examined here would be more reliable given the large n, and look at things such as trends in time (the evolution of American rock lyrics) or by genre / style of music. This sort of thing exists out there already to an extent, for example, in work done with The Million Song Data Set which I came across in some of my Google searches I made for this project.
But as I said, this would be a large and ambitious amount of work, which is perhaps more suited for something like a research paper or thesis – I am just one (everyday) analyst. 

References & Resources

Radiohead Lyrics at Green Plastic Radiohead
The Million Song Data Set
Measuring the Evolution of Contemporary Western Popular Music [PDF]
Radiohead “House of Cards” by Aaron Koblin
code, data & plots on github

xkcd: Visualized

Introduction

It’s been said that the ideal job is one you love enough to do for free but are good enough at that people will pay you for it. That if you do what you love no matter what others may say, and if you work at it hard enough, and long enough, eventually people will recognize it and you’ll be a success.

Such is the case with Randall Munroe. Because any nerd worth their salt knows what xkcd is.

What started as simply a hobby and posting some sketches online turned into a cornerstone of internet popular culture, with a cult following amongst geekdom, the technically savvy, and more.

Though I would say that it’s gone beyond that now, and even those less nerdy and techie know what xkcd means – it’s become such a key part of internet popular culture. Indeed, Mr. Munroe’s work carries real sway due to the sheer number of people who know and love it, and content on the site has resulted in changes being made on some of the biggest sites on the Internet – take, for example, Google adding a comment read-aloud feature in 2008, quite possibly because of a certain comic.

As another nerdy / tech / data citizen of the internet who knows, loves and follows xkcd, I thought I could pay tribute to it with its own everyday analysis.

Background

Initially, I thought I would have to go about doing it the hard way again. I’ve done some web scraping before with Python and thought this would be the same using the (awesome) Beautiful Soup package.

But Randall, being the tech-savvy (and Creative Commons abiding) guy that he is, was nice enough to provide an API to return all the comic metadata in JSON format (thanks Randall!).

That being said, it was straightforward to write some Python with urllib2 to download the data and then get going on the analysis.
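For the curious, a minimal sketch of what that download loop might have looked like (the comic range and output filename here are just illustrative). The API serves each comic’s metadata at xkcd.com/<number>/info.0.json:

import json
import urllib2

comics = []
for n in range(1, 1205):
    if n == 404:
        # comic number 404 is, fittingly, an actual 404 page
        continue
    url = "http://xkcd.com/%d/info.0.json" % n
    comics.append(json.load(urllib2.urlopen(url)))

with open("xkcd_metadata.json", "w") as f:
    json.dump(comics, f)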

Of course, after doing all that I realized that someone else was nice enough to have already written the equivalent code in R to access the data. D’oh! Oh well. Lesson learned – Google stuff first, code later.

But it was important to write that code in Python as I used the Python Imaging Library (PIL) (also awesome… thanks mysterious, shadowy developers at Pythonware/Secret Labs AB) to extract metadata from the comic images.

The data includes the 1204 comics from the very beginning (#1, Barrel – Part 1 posted on Jan 1, 2006) to #1204, Detail, posted on April 26, 2013.

As well as the data provided via the JSON (comic #, url, title, date, transcript and alt text) I pulled out additional fields using the Python Imaging Library (file format, filesize, dimensions, aspect ratio and luminosity). I also wanted to calculate hue, however, regrettably this is a somewhat more complicated process which my image processing chops were not immediately up to, and so I deferred on this point.
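A hedged sketch of that extraction step is below (the file path is illustrative, and luminosity here is taken to be the mean pixel value of the greyscale-converted image):

import os
from PIL import Image, ImageStat

def image_metadata(path):
    im = Image.open(path)
    width, height = im.size
    luminosity = ImageStat.Stat(im.convert("L")).mean[0]   # 0 (black) to 255 (white)
    return {
        "format": im.format,                 # 'PNG', 'JPEG', 'GIF', ...
        "mode": im.mode,                     # 'L', 'P', 'RGB', 'LA', 'RGBA'
        "filesize": os.path.getsize(path),   # in bytes
        "width": width,
        "height": height,
        "aspect_ratio": float(width) / height,
        "luminosity": luminosity,
    }

print(image_metadata("comics/0001_barrel_part_1.png"))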

Analysis

File type
Ring chart and bar chart of xkcd comics by file type


You can see out of the 1204 comics, 1073 (~89.19%) were in PNG format, 128 (~10.64%) were in JPEG and only 2 (#961, Eternal Flame and #1116, Stoplight) (~0.17%) were in GIF. This of course, being because the latter are the only two comics which are animated.

Looking at the filetype over time below, you can see that initially xkcd was primarily composed of JPEG images (mostly because they were scanned sketches) and this quickly changed over time to being almost exclusively PNG with the exception of the two aforementioned animated GIFs. The lone outlying JPEG near 600 is Alternative Energy Revolution (#556).

strip chart of xkcd comics by file type
Image Mode
Next we can look at the image mode of all the xkcd images. For a little context, the image modes are roughly as following:
  • L – 8 bit black & white
  • P – 8 bit colour
  • RGB – colour
  • LA, RGBA – black & white with alpha channel (transparency), colour with alpha channel

The breakdown for all the comics is depicted below.

ring chart of xkcd comics by image modebar chart of xkcd comics by image mode

You can see that the majority are in image mode L (847, ~70.41%), followed by 346 in RGB (~28.76%); a tiny remaining number are in P (8, ~0.7%), with the remaining two in L and RGB modes with alpha channel (LA & RGBA).

Any readers will know that the bulk of xkcd comics are simple black-and-white images with stick figures and you can see this reflected in the almost ¾ to ¼ ratio of monochrome to coloured images.

The two images with alpha channel are Helping (#383) and Click and Drag (#1110), most likely because of the soft image effect and interactivity, respectively.

Looking at the image mode over time, we can see that like the filetype, almost all of the images were initially in RGB mode as they were scans. After this period, the coloured comics are fairly evenly interspersed with the more common black and white images.

strip chart of xkcd comics by image mode
Luminosity

You can see in the figure on the left that given the black-and-white nature of xkcd the luminosity of each image is usually quite high (the maximum is 255). We can see the distribution better summarized on the right in a histogram:

scatterplot and histogram of luminosity of xkcd comics

Luminosity was the only quality of the images which showed a significant change over the years that Randall has created the comic. Doing an analysis of variance, we can see there is a statistically significant year-on-year difference in the average comic brightness (at the 99% level):

> aov(data$lumen ~ data$year)
Call:
aov(formula = data$lumen ~ data$year)

Terms:
data$year Residuals
Sum of Squares 5762.0 829314.2
Deg. of Freedom 1 1201

Residual standard error: 26.27774
Estimated effects may be unbalanced
> summary(aov(data$lumen ~ data$year))
Df Sum Sq Mean Sq F value Pr(>F)
data$year 1 5762 5762 8.344 0.00394 **
Residuals 1201 829314 691

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

True, there are currently fewer data points for 2013; however, doing the same analysis excluding this year still gives a result significant at the 99% level.

The average luminosity decreases by year, and this is seen in the plot below which shows a downward trend:

line plot of average luminosity of xkcd per year

Image Dimensions
Next we look at the sizes of each comic. xkcd ranges from very tall comic-book style strips to ultra-simplistic single small images driving the whole point or punch line home in one frame.

scatterplot of height vs. width for xkcd comics

Looking at the height of each comic versus the width, you can see that there appears to be several standard widths which Randall produces the comic at (not so with heights). These standard widths are 400, 500, 600, 640, and 740.

distribution of image heights of xkcd comic

We can see these reflected in the distribution of all image widths; 740 is by far the most common comic width. There is no such pattern in the image heights, which appear to follow a more logarithmic-like distribution.

histograms of width and height of xkcd comics

Interestingly, the ‘canonical’ widths are not constant over time – there were several widths which were used frequently near the beginning, after which the more common standard of 740px was adopted. This may be due to the large number of scanned images near the beginning, as I imagine scanning an A4 sheet of paper would often result in the same image resolutions.

scatterplot of width of xkcd comics

The one lone outlier on the high end of image width is 780px wide and is #1193, Externalities.

Looking at the aspect ratio of the comics over time, you can see that there appear to be two classes of comics – a larger number (about 60%) which are more tightly clustered around an even 1:1 aspect ratio, and then a second class more evenly distributed with aspect ratios of 2 and above. There are also small peaks around 1.5 and 1.75.
scatterplot and histogram of aspect ratio of xkcd comics
In case you were wondering the comic with an aspect ratio of ~8 is Tags (#1144) and the tallest comic proportionally is Future Timeline (#887).
Filesize

As well as examining the resolution (dimensions) of the comic images we can also examine the distribution of the images by their filesize.

distribution of file size of xkcd comics

You can see that the majority of the images are below 100K in size – in general the xkcd comics are quite small as the majority are simple PNGs displaying very little visual information.

We can also look at the comic size (area in square pixels) versus the filesize:

scatterplots of file size versus image size of xkcd comics (with and without trend line)

There is clearly a relationship here, as illustrated on the log-log plot on the right with the trend line. Of course, I am just stating the obvious – this relationship is not unique to the comics and exists as a property of the image formats in general.

If we separated out the images by file type (JPEG and PNG) I believe we would see different numbers for the relationship as a result of the particularities of the image compression techniques.
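A quick sketch of how that could be checked, assuming a pandas DataFrame built from metadata records like the ones sketched earlier (the column names are those illustrative ones, not necessarily what I actually used):

import numpy as np
import pandas as pd

def loglog_fit(group):
    # slope and intercept of log10(filesize) vs. log10(pixel area)
    x = np.log10(group["width"] * group["height"])
    y = np.log10(group["filesize"])
    slope, intercept = np.polyfit(x, y, 1)
    return pd.Series({"slope": slope, "intercept": intercept})

# df = pd.DataFrame([image_metadata(p) for p in image_paths])
# print(df.groupby("format").apply(loglog_fit))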

Conclusions

I have this theory that how funny a joke is to someone who gets it is inversely proportional to the number of other people who would get it. That is to say, the more esoteric and niche the comedy is, the funnier and more appealing it is to those who actually get the punch line. It’s a feeling of being special – a feeling that someone else understands and that the joke was made just for you, and others like you, and that you’re not alone in thinking that comics involving Avogadro’s number, Kurt Godel or Turing Completeness can be hilarious.

As an analyst who has come out of the school of mathematics, and continually been immersed in the world of technology, it is reassuring to read something like xkcd and feel like you’re not the only one who thinks matters of data, math, science, and technology can be funny, along with all the other quirkiness and craziness in life which Randall so aptly (and sometimes poignantly) portrays.

That being said, Randall’s one dedicated guy who has done some awesome work for the digitally connected social world of science, technology, and internet geekdom, and now we know how much he likes using 740px as the canvas width, and that xkcd has gradually been using fewer white pixels over the years.

And let’s hope there will be many more years to come.

Resources

xkcd
xkcd – JSON API
xkcd – Wikipedia
code on github

Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA, without a hint of skepticism or irony) people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can really go around calling themselves a data scientist because no one even really knows what data science is yet; there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve done the first real work which I think falls under the category of data science.

I’m no Python guru, but threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time; that’s for a later article. For now, here’s one of my exploratory plots, of the content rating by date. Already we can gather from this that, at least at Newgrounds, 4-and-a-half stars equals perfection.

Sure feels like science.

What’s in My Pocket? Read it now! (or Read It Later)

Introduction

You know what’s awesome? Pocket.

I mean, sure, it’s not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It’s pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I used to primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. Being a big fan of reading (and also procrastination) this was a really great application for me to discover, and I’m quite glad I did. Now I can still catch up on the latest Lifehacker even if I am on the subway and don’t have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get a hold of your data. The website has an export function which allows you to dump all your data for everything you’ve ever added to your reading list into HTML.

Having the URL of every article you’ve ever read in Pocket is handy, as you can revisit all the articles you’ve saved. But there’s more to it than that. The HTML export also contains the time each article was added (as a UNIX epoch timestamp). Combine this with an XML or JSON dump from the API, and now we’ve got some data to work with.
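As a rough sketch, pulling those fields out of the export with Beautiful Soup might look something like this (it assumes each saved article shows up as an <a> tag carrying a time_added attribute holding the epoch timestamp – check your own export for the exact structure and filename):

from datetime import datetime
from bs4 import BeautifulSoup

with open("ril_export.html") as f:
    soup = BeautifulSoup(f.read())

articles = []
for link in soup.find_all("a"):
    articles.append({
        "url": link.get("href"),
        "title": link.get_text(),
        "added": datetime.fromtimestamp(int(link.get("time_added", 0))),
        "tags": link.get("tags", ""),
    })

print(len(articles))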

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 – 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionality, I wrote a simple Python script using webarticle2text, which is available on github. This script downloaded all the text from each article URL and continually appended it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).
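Something along these lines – hedged as a sketch, since the exact script isn’t reproduced here, webarticle2text is assumed to expose an extractFromURL helper (check the repo for the actual entry point), and the example URL and file name are just placeholders:

from urlparse import urlparse
import webarticle2text

def process_article(url, corpus_file):
    text = webarticle2text.extractFromURL(url)    # grab the article body text
    corpus_file.write(text + "\n")                # append to the conglomerate file
    hostname = urlparse(url).hostname or ""
    return {
        "url": url,
        "word_count": len(text.split()),
        "tld": hostname.split(".")[-1],           # e.g. 'com', 'ca'
    }

with open("all_articles.txt", "a") as corpus:
    print(process_article("http://www.example.com/some-article", corpus))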

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:

And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:
Head and shoulders above all other websites, at nearly half of all articles saved, is Psychology Today. I would just like to be on the record as saying – don’t hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women; however, I find the majority of articles to be quite interesting (as evidenced by the number of articles I have read). Perhaps other men are not that interested in the goings-on in their own and other people’s heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (it detects if the browser is a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men’s magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two timestamps: time added and time ‘updated’. Looking at the data, I believe this ‘updated’ definition applies to multiple actions on the article, like marking as read, adding tags, re-downloading, et cetera. It would be ideal to actually have the date/time when the article was marked as read, as then further interesting analysis could be done – for example, looking at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.

The most salient features which immediately stand out are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done, on my commute to and from work on public transit.

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours in the evening. Although I sometimes do read articles (usually through the browser) in these hours, I think this may correspond to things like adding tags or also a delay in synching between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can’t see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.

We can also look at the daily volume of articles added:

This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where there are a large number. Looking at the distribution of the number of articles added daily, we see an exponential type distribution:

Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I’ve been reading:

Hmmmmm.

Well, that doesn’t look quite right. Did I really read an article 40,000 words long? That’s about 64 pages, isn’t it? Looking at the URLs for the articles with tens of thousands of words, I could see that those articles were the result of malfunctions of the Pocket article parser, the webarticle2text script, or both. For example, the 40,000 word article was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensical:

This is a little more what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes in the low end also correspond to failures of the article parsers.

Now what about the text content of the articles? I really do enjoy a good wordcloud; however, I know that some people tend to look down upon them, because there are alternate ways of depicting the same data which are more informative. Still, as I said, I do enjoy them, as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:

As you can see, there are issues with the download script, as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for “don’t” and “are”, or possibly “you’re”). But it appears that my recreational reading corresponds to the most common subjects of its main sources. The majority of my reading was from Psychology Today, and so the number one word we see is “people”. I also read a lot of articles from men’s magazines, and so we see words which I suspect primarily come from there (“women”, “social”, “sex”, “job”), as well as from the psych articles.
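For reference, a top-25 list like this can be pulled from the conglomerate text file with a few lines of Python (using the same made-up file name as the sketch above; the stopword list is again just a stand-in for whatever list was actually used):

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "but", "of", "in", "on", "to",
             "is", "are", "was", "were", "be", "i", "you", "it", "that",
             "this", "with", "as", "for", "not", "have"}

with open("all_articles.txt") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(w for w in words if w not in STOPWORDS)
for word, n in counts.most_common(25):
    print("%s %d" % (word, n))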

And now the pretty visualization:

Seeing the content of what I read depicted this way has made me have some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I’m glad data is in there as a ‘big word’ (just above ‘person’), though maybe not as big as some of the others. I’ve just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I’ll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: “The more that you read, the more things you will know. The more that you learn, the more places you’ll go!”

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily of exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What’s In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text