What’s in my Pocket? (Part II) – Analysis of Pocket App Article Tagging

Introduction

You know what’s still awesome? Pocket.

As I noted in an earlier post (oh god, was that really more than a year ago?!) I started using the Pocket application, previously known as Read It Later, in July of 2011 and it has changed my reading behavior ever since.

Lately I’ve been thinking a lot about quantified self and how I’m not really tracking anything anymore. Something which was noted at one of the Meetups is that data collection is really the hurdle: like anything in life – voting, marketing, dating, whatever – you have to make it easy otherwise most people probably won’t bother to do it. I’m pretty sure there’s a psychological term for this – something involving the word ‘threshold’.

That’s where smartphones come in. Some people have privacy concerns about having all their data in the cloud (obviously I don’t, as I’m willingly putting myself on display here on the blog), but that aside, one of the cool things about smartphone apps is that you are passively creating lots of data. Over time this results in a data set about you. And if you know how to pull that data you can analyze it (and hence yourself). I did this previously, for instance with my text messages and also with data from Pocket collected up to that time.

So let’s give it a go again, but this time with a different focus for the analysis.

Background

This time I wasn’t so interested in when I read articles and from where, but more so in the types of articles I was reading. In the earlier analysis, I summarized what I was reading by the top-level domain of the site – what resulted was a high-level overview of my online reading behavior.
Pocket has since added the ability to tag your articles. The tags are similar to labels in Gmail, so the relationships can be many to one. This provides a way to categorize your reading list (and archive), and, for the purposes of this analysis, to break the articles down accordingly.
First and foremost, we need the data (again). Unfortunately, over the course of the development of the Pocket application, the amount of data you can easily get via export (without using the API) has diminished. Originally the export was available as either XML or JSON, but those formats are no longer offered.
However, you can still export your reading list as an HTML file, which contains attributes in the link elements for the time the article was added and the tags it has attached.

Basically the export is quasi-XML, so it’s a simple matter of writing some R code using the XML library to get the data into a format we can work with (CSV):
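The original script isn’t embedded here, but a minimal sketch of the idea looks something like the following (the file name, and the time_added and tags attributes on the link elements, are assumptions based on what the export contained at the time – check your own file before running this):

```r
library(XML)

# Parse the Pocket HTML export (quasi-XML) and grab all the link elements
doc   <- htmlParse("ril_export.html")
links <- getNodeSet(doc, "//a")

# Pull the attributes of interest out of each <a> element
articles <- data.frame(
  title = sapply(links, xmlValue),
  url   = sapply(links, xmlGetAttr, "href"),
  added = as.POSIXct(as.numeric(sapply(links, xmlGetAttr, "time_added")),
                     origin = "1970-01-01"),
  tags  = sapply(links, xmlGetAttr, "tags", default = ""),
  stringsAsFactors = FALSE
)

# One binary column per tag: 1 if the article carries that tag, 0 otherwise
tag_list <- strsplit(articles$tags, ",")
all_tags <- setdiff(unique(unlist(tag_list)), "")
for (t in all_tags) {
  articles[[t]] <- as.integer(sapply(tag_list, function(x) t %in% x))
}

# Dump to CSV for further work (e.g. pivoting in Excel)
write.csv(articles, "pocket_articles.csv", row.names = FALSE)
```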

In the code, I extract the attributes and also create a column for each tag name, with a binary value indicating whether the article has that tag (one of my associates at work would call this a ‘classifier’, though it’s not the data science-y kind). Because I wrote this in a fairly general fashion, you should be able to run the code on your own Pocket export and get the same kind of output.
Now that we have some data we can plunk it into Excel and do some data visualization.

Analysis

First we examine the state of articles over time – what is the proportion of articles added over time which were tagged versus not?

Tagged vs. Untagged

You can see that initially I resisted tagging articles, but starting in November I adopted the practice and began tagging almost all articles added. And because stacked area graphs are not especially good data visualization, here is a line graph of the number of articles tagged per month:

This better shows that I gradually adopted tagging from October into November. Another thing to note from this graph is that my Pocket usage peaked between November of last year and May of this year, after which the number of articles added on a monthly basis decreases significantly (hence the previous graph being proportional).

Next we examine the number of articles by subject area. I’ve collected them into more-or-less meaningful groups and will explain the different tags as we go along. Note the changing scale on the y-axes for these graphs, as the absolute number of articles varies greatly by category.

Psych & Other Soft Topics
As I noted previously in the other post, when starting to use Pocket I initially read a very large number of psych articles.

I also read a fair number of “personal development” articles (read: self-helpish – mainly from The Art of Manliness) which has decreased greatly as of late. The purple are articles on communications, the light blue “parapsych”, which is my catchall for new-agey articles relating to things like the zodiac, astrology, mentalism, mythology, etc. (I know it’s all nonsense, but hey it’s good conversation for dinner parties and the next category).

The big spike near the end is from a cool site I found recently with lots of articles on the zodiac (see: The Barnum Effect). Most of these later got deleted.

Dating & Sex
Now that I have your attention… what, you don’t read articles on sex? The Globe and Mail’s Life section has a surprising number of them. Also, if you read men’s magazines online there are a lot, most of which are actually pretty awful. You can see too that articles on dating made up a large proportion of my reading back in the fall, also from those types of sites (which thankfully I now visit far less frequently).

News, etc.
This next graph is actually a bit busy for my liking, but I found this data set somewhat challenging to visualize overall, given the number of categories and how they change in time.

News is just that. Tech is mostly the internet and gadgets. Jobs is anything career-related. Finance is both in the news (macro) and personal. Marketing is a newcomer.

Web & Data

The data tag relates to anything data-centric – as of late more applied to big data, data science and analytics. Interestingly my reading on web analytics preceded my new career in it (January 2013), just like my readings in marketing did – which is kind of cool. It also goes to show that if you read enough about analytics in general you’ll eventually read about web analytics.

Data visualization is a tag I created recently, so it has very few articles – many of which I would have previously tagged with ‘data’.

Life & Humanities

If that other graph was a little too busy this one is definitely so, but I’m not going to bother to break it out into more graphs now. Articles on style are of occasional interest, and travel has become a recent one. ‘Living’ refers mainly to articles on city life (mostly from The Globe as well as the odd one from blogto).

Work
And finally some newcomers, making up the minority, related to work:

SEO is search engine optimization and dev refers to development, web and otherwise.

Gee that was fun, and kind of enlightening. But tagging in Pocket is like in Gmail – it is not one-to-one but many-to-one. So next I thought to try to answer the question: which tags are most related? That is, which tags are most commonly applied to articles together?

To do this we again turn to R and the following code snippet, on top of that previous, does the trick:
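The actual snippet isn’t embedded here either; a rough equivalent, building on the sketch above (object names are mine, not necessarily those in the real script), might look like this:

```r
# Pull out just the binary tag columns and drop articles with no tags at all
tag_df <- articles[, all_tags]
tagged <- tag_df[rowSums(tag_df) > 0, ]

# Pearson correlation between every pair of binary tag columns –
# for 0/1 variables this works out to the phi coefficient
tag_cor <- cor(tagged)

# Quick heatmap of the resulting 30 x 30 matrix (red = negative, green = positive)
heatmap(tag_cor, symm = TRUE, scale = "none",
        col = colorRampPalette(c("red", "white", "green"))(100))
```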

All this does is remove the untagged articles from the tag frame and then run a correlation between each column of the tag matrix. I’m no expert on exotic correlation coefficients, so I simply used the standard (Pearson’s). In the case of simple binary variables (true / false such as here), the internet informs me that this reduces to the phi coefficient.

Given there are 30 unique tags, this creates a 30 x 30 matrix, which is visualized below as a heatmap:

Redder is negative, greener is positive. I neglected to add a legend here as when not using ggplot or a custom function it is kind of a pain, but some interesting relationships can still immediately be seen. Most notably food and health articles are the most strongly positively correlated while data and psych articles are most strongly negatively correlated.

Other interesting relationships are that psych articles are negatively correlated with jobs, tech and web analytics (surprise, surprise) and positively correlated with communications, personal development and sex; news is positively correlated with finance, science and tech.

Conclusion

All in all this was a fun exercise and I also confirmed some things about my reading habits which I already suspected – the amount I read (or at least save to read later) has changed over time, as have the sorts of topics I read about. Also, some types of topics are far more likely to go together than others.
If I had a lot more time I could see taking this code and standing it up into some sort of generalized analytics web service (perhaps using Shiny if I was being really lazy) for Pocket users, if there was sufficient interest in that sort of thing.
Though it was still relatively easy to get the data out, I do wish that the XML/JSON export would be restored to provide easier access for people who want their data but are not necessarily developers. Not being a developer myself, my attempts to use the new API for extraction purposes were somewhat frustrating (and ultimately unsuccessful).

Though apps often make our lives easier with passive data collection, all this information being “in the cloud” does raise questions of data ownership (and governance) and I do wish more companies, large and small, would make it easier for us to get a hold of our data when we want it.

Because at the end of the day, it is ultimately our data that we are producing – and it’s the things it can tell us about ourselves that makes it valuable to us.

Resources

Pocket – Export Reading List to HTML
Pocket – Developer API
Phi Coefficient
The Barnum (Forer) Effect
code on github

Bananagrams!!!

It was nice to be home with the family for Thanksgiving, and to finally take some time off.

A fun little activity which took up a lot of our time over the past weekend was Bananagrams, which, if you don’t already know, is sort of like a more action-packed version of Scrabble without the board.

Being the type of guy that I am, I started to think about the distribution of letters in the game. A little Googling led to some prior art to this post.

The author did something neat (which I wouldn’t have thought of) by making a sort of bar chart using the game pieces. Strangely though, they chose not to graph the different distributions of letters in Bananagrams and Scrabble but instead listed them in a table.

So, assuming the data from the post are accurate, here is a quick breakdown of said distributions below. As an added bonus, I’ve also included that trendy digital game that everyone plays on Facebook and their iDevices:

Bar graph of letter frequencies of Scrabble, Bananagrams and Words with Friends

Looking at the graph, it’s clear that Bananagrams has more tiles than the other games (the total counts are 144, 104 and 100 for Bananagrams, Words with Friends and Scrabble respectively) and notably also does not have blank tiles, of which the other games have 2 each. Besides the obvious prevalence of vowels in all 3 games, T, S, R, N, L and D also have high occurrence.

We can also compare the relative frequencies of the different letters in each game with respect to Scrabble. Here I took the letter frequency for each game (as a percent) then divided it by the frequency of the same letter in Scrabble. The results are below:

Bar chart of Bananagrams and Words with Friends letter frequencies relative to Scrabble

Here it is interesting to note that the relative frequency of H in Words with Friends is nearly double that in Scrabble (~192%). Also D, S and T have greater relative frequencies. The remaining letters are fairly consistent with the exception of I and N which are notably less frequent.

Bananagrams relative letter frequency is fairly consistent overall, with the exception of J, K, Q, X, and Z, which are around the 140% mark. I guess the creator of the game felt there weren’t enough of the “difficult” letters in Scrabble.
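For the curious, the relative-frequency calculation is only a few lines of R. Here is a rough sketch using just the first few letters (tile counts as I recall the published distributions – check them against the source post before relying on this):

```r
# Tile counts per letter for each game (only the first few letters, for brevity)
tiles <- data.frame(
  letter      = c("A", "B", "C", "D", "E"),
  scrabble    = c(9, 2, 2, 4, 12),
  bananagrams = c(13, 3, 3, 6, 18),
  wwf         = c(9, 2, 2, 5, 13)
)

# Convert counts to a percentage of each game's total tiles
# (with all 26 letters, colSums() gives the 100 / 144 / 104 tile totals)
pct <- sweep(tiles[, -1], 2, colSums(tiles[, -1]), "/") * 100

# Each letter's frequency relative to Scrabble (Scrabble = 100%)
relative <- pct / pct$scrabble * 100
barplot(t(as.matrix(relative[, c("bananagrams", "wwf")])), beside = TRUE,
        names.arg = tiles$letter, ylab = "Frequency relative to Scrabble (%)")
```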

There’s more analysis that could be done here (looking at the number of points per letter in WWF & Scrabble versus their relative frequency immediately comes to mind) but that should do for now. Hope you found this post “a-peeling”.

Analysis of the TTC Open Data – Ridership & Revenue 2009-2012

Introduction

I would say that the relationship between the citizens of Toronto and public transit is a complicated one. Some people love it. Other people hate it and can’t stop complaining about how bad it is. The TTC want to raise fare prices. Or they don’t. It’s complicated.
I personally can’t say anything negative about the TTC. Running a business is difficult, and managing a complicated beast like Toronto’s public system (and trying to keep it profitable while keeping customers happy) cannot be easy. So I feel for them. 
I rely extensively on public transit – in fact, I used to ride it every day to get to work. All things considered, for what you’re paying, this way of getting around the city is a hell of a good deal (if you ask me) compared to the insanity that is driving in Toronto.
The TTC’s ridership and revenue figures are available as part of the (awesome) Toronto Open Data initiative for accountability and transparency. As I noted previously, I think the business of keeping track of things like how many people ride public transit every day must be a difficult one, so you have to appreciate having this data, even if it is likely more of an approximation and is in a highly summarized format.
There are larger sources of open data related to the TTC which would probably be a lot cooler to work with (as my acquaintance Mr. Branigan has done) but things have been busy at work lately, so we’ll stick to this little analysis exercise.

Background

The data set comprises numbers for: average weekday ridership (in 000’s), annual ridership (peak and off-peak), monthly & budgeted monthly ridership (in 000’s), and monthly revenue, actual and budgeted (in $ millions). More info here [XLS doc].

Analysis

First we consider the simplest data: the peak and off-peak ridership. Looking at this simple line graph you can see that off-peak ridership has increased more than peak ridership since 2009 – peak and off-peak ridership increased by 4.59% and 12.78% respectively. Total ridership over the period has increased by 9.08%.

Below we plot the average weekday ridership by month. As you can see, this reflects the increasing demand on the TTC system we saw summarized yearly above. Unfortunately Google Docs doesn’t have trendlines built-in like Excel (hint hint, Google), but unsurprisingly if you add a regression line the trend is highly significant ( > 99.9%) and the slope gives an increase of approximately 415 weekday passengers per month on average.
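For reference, here is roughly what that regression looks like in R (the data frame and column names are made up for the sketch):

```r
# Average weekday ridership (in 000's) against a running month index (1, 2, 3, ...)
# 'ttc' is an assumed data frame with columns month_index and weekday_riders
fit <- lm(weekday_riders ~ month_index, data = ttc)
summary(fit)    # the slope's p-value comes out well below 0.001 on this data

# The slope is in thousands of riders per month of elapsed time;
# ~0.415 here works out to roughly 415 additional weekday riders per month
coef(fit)["month_index"] * 1000
```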

Next we come to the ridership by month. If you look at the plot over the period of time, you can see that there is a distinct periodic behavior:

Taking the monthly averages we can better see the periodicity – there are peaks in March, June & September, and a mini-peak in the last month of the year:
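(The monthly averages themselves are a one-liner – a sketch with assumed object names:)

```r
# Average ridership by calendar month, across all years in the data –
# the March, June and September peaks show up in the result
# 'ttc_monthly' is an assumed data frame with columns month (1-12) and riders (000's)
monthly_avg <- aggregate(riders ~ month, data = ttc_monthly, FUN = mean)
barplot(monthly_avg$riders, names.arg = month.abb,
        ylab = "Average monthly ridership (000's)")
```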

This is also present in both the revenue (as one would expect) and the monthly budget (which means that the TTC is aware of it). As to why this is the case, I can’t immediately say, though I am curious to know the answer. This is where it would be great to have some finer-grained data (daily or hourly), or data related to geographic area or per station, to look for interesting outliers and patterns.

Alternatively if we look at the monthly averages over the years of average weekday ridership (an average of averages, I am aware – but the best we can do given the data we have), you can see that there is a different periodic behavior, with a distinct downturn over the summer, reaching a low in August which then recovers in September to the maximum. This is interesting and I’m not exactly sure what to make of it, so I will do what I normally do which is attribute it to students.

Lastly, we come to the matter of the financials. As I said the monthly revenue and budget for the TTC follow the same periodic pattern as the ridership, and on the plus side, with increased ridership, there is increased revenue. Taking the arithmetic difference of the budgeted (targeted) revenue from actual, you can see that over time there is a decrease in this quantity:
Again if you do a linear regression this is highly significant ( > 99.9%). Does this mean that the TTC is becoming less profitable over time? Maybe. Or perhaps they are just getting better at setting their targets? I acknowledge that I’m not an economist, and what’s been done here is likely a gross oversimplification of the financials of something as massive as the TTC.

That being said, the city itself acknowledges [warning – large PDF] that while the total cost per hour for an in-service transit vehicle has decreased, the operating cost has increased, which they attribute to increases in wages and fuel prices. Operating public transit is also apparently more expensive here in TO than in other cities in the province, because we have things like streetcars and the subway, whereas most other cities only have buses. Either way, as I said before, it’s complicated.

Conclusion

I always enjoy working with open data and I definitely appreciate the city’s initiative to be more transparent and accountable by providing the data for public use.
This was an interesting little analysis and visualization exercise and some of the key points to take away are that, over the period in question:
  • Off-peak usage of the TTC is increasing at a greater rate than peak usage
  • Usage as a whole is increasing, with about 415 more weekday riders per month on average, and a growth of ~9% from 2009 – 2012
  • Ridership per month shows periodic behavior over the course of the year
  • Average weekday ridership per month shows a different periodicity, with a peak in September
It would be really interesting to investigate the patterns in the data in finer detail, which hopefully should be possible in the future if more granular time-series, geographic, and categorical data become available. I may also consider digging into some of the larger data sets, which have been used by others to produce beautiful visualizations such as this one.

I, for one, continue to appreciate the convenience of public transit here in Toronto and wish the folks running it the best of luck with their future initiatives.

References & Resources

TTC Ridership – Ridership Numbers and Revenues Summary (at Toronto Open Data Portal)

Toronto Progress Portal – 2011 Performance Measurement and Benchmarking Report

Everything in Its Right Place: Visualization and Content Analysis of Radiohead Lyrics

Introduction

I am not a huge Radiohead fan.

To be honest, the Radiohead I know and love and remember is that which was a rock band without a lot of ‘experimental’ tracks – a band you discovered on Big Shiny Tunes 2, or because your friends told you about it, or because it was playing in the background of a bar you were at sometime in the 90’s.

But I really do like their music, I’ve become familiar with more of it and overall it does possess a certain unique character in its entirety. Their range is so diverse and has changed so much over the years that it would be really hard not to find at least one track that someone will like. In this way they are very much like the Beatles, I suppose.

I was interested in doing some more content analysis type work and text mining in R, so I thought I’d try song lyrics and Radiohead immediately came to mind.

Background

In order to first do the analysis, we need all the data (shocking, I know). Somewhat surprisingly, putting ‘radiohead data‘ into Google comes up with little except for many, many links to the video and project for House of Cards which was made using LIDAR technology and had the data set publicly released.
So once again we are in this situation where we are responsible for not only analyzing all the data and communicating the findings, but also getting it as well. Such is the life of an analyst, everyday and otherwise (see my previous musing on this point).
The lyrics data was taken from the listing of Radiohead lyrics at Green Plastic Radiohead.

Normally it would be simply a matter of throwing something together in Python using Beautiful Soup as I have done previously. Unfortunately, due to the way these particular pages were coded, that proved to be a bit more difficult than expected.

As a result the extraction process ended up being a convoluted ad-hoc data wrangling exercise involving the use of wget, sed and Beautiful Soup – a process which was neither enjoyable nor something I would care to repeat.
In retrospect, two points:
  1. Getting the data is not always easy. Sometimes sitting down beforehand and looking at where you are getting it from, the format it is in, and how best to go about getting it into the format you need will save you a lot of wasted time and frustration in the long run. Ask questions before you begin – what format is the data in now? What format do I need (or would like) it to be in to do the analysis? What steps are required to get from one to the other (i.e. what is the data transformation or mapping process)? That being said, my methods got me where I needed to be; however, there were most likely easier, more straightforward approaches which would have saved a lot of frustration on my part.
  2. If you’re going to code a website, use a sane page structure and give important page elements ids. Make it easy on your other developers (and the rest of the world in general) by labeling your <div> containers and other elements with ids (which are unique!!) or at least classes. Otherwise how are people going to scrape all your data and steal it for their own ends? I joke… kind of. In this case my frustrations actually stemmed mainly from some questionable code for a cache-buster. But even once I got past that, the contents of the main page containers were somewhat inconsistent. Such is life, and the internet.
The remaining data – album and track length – were taken from the Wikipedia pages for each album and later merged with the calculations done with the text data in R.
Okay, enough whinging – we have the data – let’s check it out.

Analysis

I stuck with what I consider to be the ‘canonical’ Radiohead albums – that is, the big releases you’ve probably heard about even if, like me, you’re not a hardcore Radiohead fan – 8 albums in total (Pablo Honey, The Bends, OK Computer, Kid A, Amnesiac, Hail to the Thief, In Rainbows, and The King of Limbs).
Unstructured (and non-quantitative) data always lends itself to more interesting analysis – with something like text, how do we analyze it? How do we quantify it? Let’s start with the easily quantifiable parts and go from there.
Track Length
Below is a boxplot of the track lengths per album, with the points overplotted.

Distribution of Radiohead track lengths by album
Interestingly, Pablo Honey and Kid A have the largest ranges of track length (from 2:12 to 4:40 and 3:42 to 7:01 respectively) – if you ignore the single tracks around 2 minutes on Amnesiac and Hail to the Thief, the variance of their track lengths is more in line with all the other albums. Ignoring the single outlier, The King of Limbs appears to be special given its narrow range of track lengths.
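(Not the original script, but this kind of boxplot with the individual tracks overplotted is straightforward in base R – here songs is an assumed data frame with album and track_length columns:)

```r
# Boxplot of track length by album, with each track overplotted as a jittered point
boxplot(track_length ~ album, data = songs, outline = FALSE, las = 2,
        ylab = "Track length (seconds)")
stripchart(track_length ~ album, data = songs, vertical = TRUE,
           method = "jitter", pch = 19, add = TRUE)
```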
Word Count
Next we look at the number of words (lyrics) per album:
Distribution of number of words per Radiohead album

There is a large range of word counts, from the two truly instrumental tracks (Treefingers on Kid A and Hunting Bears on Amnesiac) to the wordier tracks (Dollars and Cents and A Wolf at the Door). Pablo Honey almost looks like it has two categories of songs – with a split around the 80 word mark.

Okay, interesting and all, but again these are small amounts of data and only so much can be drawn from them.

Going forward we examine two calculated quantities.

Calculated Quantities – Lexical Density and ‘Lyrical Density’

In the realm of content analysis there is a measure known as lexical density which is a measure of the number of content words as a proportion of the total number of words – a value which ranges from 0 to 100. In general, the greater the lexical density of a text, the more content heavy it is and more ‘unpacking’ it takes to understand – texts with low lexical density are easier to understand.

According to Wikipedia the formula is as follows:

Ld = (Nlex / N) x 100

where Ld is the analysed text’s lexical density, Nlex is the number of lexical word tokens (nouns, adjectives, verbs, adverbs) in the analysed text, and N is the number of all tokens (total number of words) in the analysed text.

Now, I am not a linguist, however it sounds like this is just the ratio of words which are not stopwords to the total number – or could at least be approximated by it. That’s what I went with in the calculations in R using the tm package (because I’m not going to write a package to calculate lexical density by myself).
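A minimal sketch of that approximation (not the actual script, and the tokenization here is deliberately crude):

```r
library(tm)

# Approximate lexical density: percentage of words that are not stopwords
lexical_density <- function(lyrics) {
  words <- unlist(strsplit(tolower(lyrics), "\\s+"))
  words <- gsub("[[:punct:]]", "", words)
  words <- words[words != ""]
  if (length(words) == 0) return(0)                 # instrumental tracks
  content <- words[!(words %in% stopwords("en"))]   # stopword list from tm
  100 * length(content) / length(words)
}

lexical_density("I might be wrong, I might be wrong, I could have sworn I saw a light coming on")
```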

On a related note, I completely made up a quantity which I am calling ‘lyrical density’ which is much easier to calculate and understand – this is just the number of lyrics per song over the track length, and is measured in words per second. An instrumental track would have lyrical density of zero, and a song with one word per second for the whole track would have a lyrical density of 1.

Lexical Density

Distribution of lexical density of Radiohead songs by album
Looking at the calculated lexical density per album, we can see that the majority of songs have a lexical density between about 30 and 70. The two instrumental songs have a lexical density of 0 (as they have no words) and the distribution appears most even on OK Computer. The most content-word-heavy song, I Will (No Man’s Land), is on Hail to the Thief.
If you could imagine extending the number of songs Radiohead has written to infinity, you might get a density function something like the one below, with the bulk of songs having density between 30 and 70 (which I imagine is a reasonable range for any text) and a little bump at 0 for their instrumental songs:
Histogram of lexical density of Radiohead tracks with overplotted density function
Lyrical Density
Next we come to my calculated quantity, lyrical density – or the number of words per second on each track.
Distribution of lyrical density of Radiohead tracks by album

Interestingly, there are outlying tracks near the high end where the proportion of words to the song length is greater than 1 (Fitter Happier, A Wolf at the Door, and Faust Arp). Fitter Happier shouldn’t even really count, as it is really an instrumental track with a synthesized voice dubbed overtop. If you listen to A Wolf at the Door it is clear why the lyrical density is so high – Thom is practically rapping at points. Otherwise Kid A and The King of Limbs seem to have less quickly sung lyrics than the other albums on average.

Lexical Density + Lyrical Density
Putting it all together, we can examine the quantities for all of the Radiohead songs in one data visualization. You can examine different albums by clicking the color legend at the right, and compare multiple albums by holding CTRL and clicking more than one.


The songs are colour-coded by album. The points are plotted by lexical density along the y-axis against lyrical density along the x-axis, and sized by total number of words in the song. As such, the position of a point in the plot gives an idea of the rate of lyrical content in the track – a song like I Might Be Wrong fits far fewer content words in at a slower rate than a track like A Wolf at the Door, which is packed much tighter with both lyrics and meaning.

Conclusion

This was an interesting project and it was fascinating to take something everyday like song lyrics and analyze them as data (though some Radiohead fans might argue that there is nothing ‘everyday’ about Radiohead lyrics).
All in all, I feel that a lot of the analysis has to be taken with a grain of salt (or a shaker or two), given the size of the data set (n = 89). 
That being said, I still feel it is proof positive that you can take something typically thought of as very artistic and qualitative, like a song, and classify it in a meaningful, quantitative way. I had never listened to the song Fitter Happier, yet it is a clear outlier in several measures – and listening to the song I discovered why: it is a track with a robot-like voiceover rather than sung lyrics.
A more interesting and ambitious project would be to take a much larger data set, where the measures examined here would be more reliable given the large n, and look at things such as trends in time (the evolution of American rock lyrics) or by genre / style of music. This sort of thing already exists out there to an extent, for example in work done with The Million Song Data Set, which I came across in some of the Google searches I made for this project.
But as I said, this would be a large and ambitious amount of work, which is perhaps more suited for something like a research paper or thesis – I am just one (everyday) analyst. 

References & Resources

Radiohead Lyrics at Green Plastic Radiohead
The Million Song Data Set
Measuring the Evolution of Contemporary Western Popular Music [PDF]
Radiohead “House of Cards” by Aaron Koblin
code, data & plots on github

The heat is on…. or is it? Trend Analysis of Toronto Climate Data

The following is a guest post from Joel Harrison, PhD, consulting Aquatic Scientist.

For a luddite like me, this is a big step – posting something on the inter-web.  I’m not on Facebook.  I don’t know what Twitter is.  Hell, I don’t even own a smartphone.  But, I’ve been a devoted follower of Myles’ blog for some time, and he was kind enough to let his fellow-geek-of-a-brother contribute something to everyday analytics, so who was I to pass up such an opportunity?

The impetus for my choice of analysis was this:  in celebration of Earth Day, my colleagues and I watched a film about global climate change, which was a nice excuse to eat pizza and slouch in an office chair while sipping Dr. Pepper instead of doing other, presumably useful, things.  Anyway, a good chunk of the film centred on the evidence for anthropogenic greenhouse gas emissions altering the global climate system.

While I’ve seen lots of evidence for recent increases in air temperature in the mid-latitude areas of the planet, there’s nothing quite so convincing as doing your own analysis.  So, I downloaded climate data from Environment Canada and did my own climate change analysis.  I’m an aquatic scientist, not a climate scientist, so if I’ve made any egregious mistakes here, perhaps someone well-versed in climatology will show me the error of my ways, and I’ll learn something.  Anyway, here we go.

Let’s start with mean monthly temperatures from daily means (the average, for each month of the year, of the daily mean temperatures) for the city of Toronto, for which a fairly good record exists (1940 to 2012).  Here’s what the data look like:

So, you can see the clear trend in the data, can’t you?  Trend analysis is a tricky undertaking for a number of reasons, one of which is that variation can exist on a number of temporal scales.  We’re looking at temperatures here, so obviously we would expect significant seasonality in the data, and we are not disappointed:

One method of controlling for the variation related to seasonality is to ‘deseasonalize’ the data by subtracting the monthly medians from each datum.  Let’s look at a boxplot of the deseasonalized data (in part to ensure I’ve done the deseasonalizing correctly!):
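(In case you want to check my work yourself, the deseasonalizing boils down to something like the following – a rough sketch with made-up object names, not my exact code.)

```r
# Subtract each calendar month's median from that month's temperatures
# 'temps' is an assumed data frame with columns month (1-12) and mean_temp (deg C)
monthly_medians <- tapply(temps$mean_temp, temps$month, median, na.rm = TRUE)
temps$deseasonalized <- temps$mean_temp - monthly_medians[as.character(temps$month)]

# Boxplot by month to confirm the seasonal signal is gone
boxplot(deseasonalized ~ month, data = temps,
        xlab = "Month", ylab = "Deseasonalized temperature (deg C)")
```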

Whew, looks good, my R skills are not completely lacking, apparently.  Here is what the deseasonalized data look like as a time series plot:

Things were clear as mud when we originally viewed the time series plot of all of the data, but after removing the variation related to seasonality, a pattern has emerged:  an increase in temperature from 1940 to 1950, relatively stable temperatures from 1950 to 1960, then a decrease in temperature from 1960 to 1970, and a fairly consistent increase from 1970 to 2012.  Viewing the annual mean temperatures makes this pattern even more conspicuous:

Hold on, you say, why bother going to the trouble of deseasonalizing the data when you could just calculate annual means and perform linear regression to test for a trend?  This is an intuitively attractive way to proceed, but the problem is that if, say, temperatures were getting colder over time during the summer months, but proportionately warmer during the winter, the annual mean temperature would not change over time; the two opposing trends would in effect cancel each other out.  Apparently that is not the case here, as the deseasonalized data and the annual means show a similar pattern, but caution must be exercised for this reason (especially when you have little theoretical understanding of the phenomenon which you are investigating!).

So, this is all nice and good from a data visualization standpoint, but we need to perform some statistics in order to quantify the rate of change, and to decide if the change is significant in the statistical sense.  Below are the results from linear regression analyses of temperature vs. year using the original monthly means, the deseasonalized data, and the annual means.

Dependent (Response) Variable           n      slope    R²       p-value
Monthly Mean Temperature               876    0.022    0.001    0.17
Deseasonalized Monthly Temperatures    876    0.022    0.05     5.82 x 10^-12
Annual Mean Temperature                 73    0.022    0.20     4.65 x 10^-5

All 3 analyses yielded a slope of 0.022 °C/yr, which is to say, the average rate of change during the 70 years analysed was 1.54 °C.  The regression based on monthly mean temperatures had a very low goodness of fit (R² = 0.001) and was not significant at the conventional cut-off level of p < 0.05.  This is not surprising given the scatter we observed in the data due to seasonality.  What is therefore also not a surprise is that the deseasonalized data had a much better goodness of fit (R² = 0.05), as did the annual mean temperatures (R² = 0.20).  The much higher level of statistical significance of the regression on the deseasonalized data than on the annual means is likely a function of the higher power of the analysis (i.e., 876 data points vs. only 73).

Before we get too carried away here interpreting these results, is there anything we’re forgetting?  Right, those annoying underlying assumptions of the statistical test we just used.  According to Zar (1999), for simple linear regression these are:

  1. For any value of X there exists in the population a normal distribution of Y values.  This also means that, for each value of X there exists in the population a normal distribution of Ɛ’s.
  2. Must assume homogeneity of variances; that is, the variances of these population distributions of Y values (and of Ɛ’s) must all be equal to one another.
  3. The actual relationship is linear.
  4. The values of Y are to have come at random from the sampled population and are to be independent of one another.
  5. The measurements of X are obtained without error.  This…requirement…is typically impossible; so what we are doing in practice is assuming that the errors in the X data are negligible, or at least are small compared with the measurement errors in Y.

Hmm, this suddenly became a lot more complicated.   Let’s check the validity of these assumptions for the regression of the deseasonalized monthly temperatures vs. year.  Well, we can safely say that number 5 is not a concern, i.e., that the dates were measured without error, but what about the others?  Arguably, the data are not actually linear, because of the fall in temperature between 1960 and 1970, so this is something of a concern.  The Shapiro-Wilk test tells us that the residuals are not significantly non-normal (assumption 1) but just barely (p = 0.056).  We can visualize this via a Q-Q (Quantile-Quantile) plot of the residuals:
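(For reference, those checks are one-liners on the fitted model – here fit stands in for the lm object of deseasonalized temperature versus year.)

```r
# Normality of the residuals: Shapiro-Wilk test plus a Q-Q plot
shapiro.test(residuals(fit))   # p = 0.056 for these data
qqnorm(residuals(fit))
qqline(residuals(fit))
```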

For the most part the data fall right on the line, but a few points fall below and above the line at the extremes, suggestive of a somewhat ‘heavy tailed’ distribution.  Additionally, let’s inspect the histogram:

Again, there is some slight deviation from normality, as evidenced by the distance of the first and last bars from the rest, but it’s pretty minor.  So, there is some evidence of non-normality, but it appears negligible based on visual inspection of the Q-Q plot and histogram, and it is not statistically significant according to the Shapiro-Wilk test.  So, we’re good as far as normality goes.  Check.

What about assumption 2, homogeneity of variances?  This is typically assessed by plotting the residuals against the fitted values, like so:

There does not appear to be a systematic change in the magnitude of the residuals as a function of the predicted values, or at least nothing overly worrisome, so we’re good here, too.

Last, but certainly not least, do our data represent independent measurements?  This last assumption is frequently a problem in trend analysis.  While each temperature was presumably measured on a different day, in the statistical sense this does not necessarily imply that the measurements are not autocorrelated.   Several years of data could be influenced by an external factor which influences temperature over a multi-year timescale (El Niño?) which would cause the data from sequential years to be strongly correlated.  Such temporal autocorrelation (serial dependence) can be visualized using an autocorrelation function (ACF):
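(Again, a quick sketch rather than my exact code – acf() is built into R, and the lmtest package’s dwtest() is one option for the Durbin-Watson test.)

```r
library(lmtest)

# Autocorrelation of the regression residuals, and the Durbin-Watson test
acf(residuals(fit), main = "ACF of regression residuals")
dwtest(fit)   # highly significant autocorrelation for these data
```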

The plot tells us that at a variety of lag periods (differences between years) the level of autocorrelation is significant (i.e., the ACF is above the blue line).  The Durbin-Watson test confirms that the overall level of autocorrelation in the residuals is highly significant (p = 4.04 x 10^-13).

So, strictly speaking, linear regression is not appropriate for our data due to the presence of nonlinearity and serial correlation, which violate two of the five assumptions of linear regression analysis.  Now, don’t get me wrong, people violate these assumptions all the time.  Hell, you may have already violated them earlier today if you’re anything like I was in my early days of grad school.  But, as I said, this is my first blog post ever, and I don’t want to come across as some sloppy, apathetic, slap-dash, get-away-with-whatever-the-peer-reviewers-don’t-call-me-out-on type scientist – so let’s shoot for real statistical rigour here!

Fortunately, this is not too onerous a task, as there is a test that was tailor-made for trend analysis, and doesn’t have the somewhat strict requirements of linear regression.  Enter the Hirsch-Slack Test, a variation of the Seasonal Kendall Trend Test, which corrects for both seasonality and temporal autocorrelation.  I could get into more explanation as to how the test works, but this post is getting to be a little long, and hopefully you trust me by now.  So, drum roll please….

The Hirsch-Slack test gives very similar results to those obtained using linear regression; it indicates a highly significant (p = 1.48 x 10^-4) increasing trend in temperature (0.020 °C/yr), which is very close to the slope of 0.022 °C/yr obtained by linear regression.

So, no matter which way you slice it, there was a significant increase in Toronto’s temperature over the past 70 years.  I’m curious about what caused the dip in temperature between ~1960 and ~1970, and have a feeling it may reflect changes in aerosols and other aspects of air quality related to urbanization, but don’t feel comfortable speculating too much.  Perhaps it reflects some regional or global variation related to volcanic activity or something, I really have no idea.  Obviously, if we’d performed the analysis on the years 1970 to 2010 the slope (i.e., rate of temperature increase) would have been much higher than for the entire period of record.

I was also curious if Toronto was a good model for the rest of Canada given that it is a large, rapidly growing city, and changes in temperature there could have been related to urban factors, such as the changes in air quality I already speculated about.  For this reason, I performed the same analysis on data from rural Coldwater (near where Myles and I grew up) and obtained very similar results, which suggests the trend is not unique to the city of Toronto.

In case you’re wondering, the vast majority (98%) of Canadians believe the global climate is changing, according to a recent poll by Insightrix Research (but note that far fewer believe that human activity is solely to blame.)  So, perhaps the results of this analysis won’t be a surprise to very many people, but I did find it satisfying to perform the analysis myself, and with local data.

Well, that’s all for now – time to brace ourselves for the coming heat of summer.  I think I need a nice, cold beer.

References & Resources

Zar, J.H. (1999) Biostatistical Analysis, 4th ed. Upper Saddle River, New Jersey: Prentice Hall.
http://books.google.com/books/about/Biostatistical_analysis.html?id=LCRFAQAAIAAJ

Hirsch, R.M. & Slack, J.R. (1984). A Nonparametric Trend Test for Seasonal Data With Serial Dependence. Water Resources Research 20(6), 727-732. doi: 10.1029/WR020i006p00727
http://onlinelibrary.wiley.com/doi/10.1029/WR020i006p00727/abstract

National Post: Climate Change is real, Canadians say, but they can’t agree on the cause
http://news.nationalpost.com/2012/08/16/climate-change-is-real-canadians-say-while-disagreeing-on-the-causes

Climate Data at Canadian National Climate Data and Information Archive
http://climate.weatheroffice.gc.ca/climateData/canada_e.html

Joel Harrison, PhD, Aquatic Scientist
http://www.environmentalsciences.ca/newsite/staff/#harrison

xkcd: Visualized

Introduction

It’s been said that the ideal job is one you love enough to do for free but are good enough at that people will pay you for it. That if you do what you love no matter what others may say, and if you work at it hard enough, and long enough, eventually people will recognize it and you’ll be a success.

Such is the case with Randall Munroe. Because any nerd worth their salt knows what xkcd is.

What started as simply a hobby and posting some sketches online turned into a cornerstone of internet popular culture, with a cult following amongst geekdom, the technically savvy, and more.

Though I would say that it’s gone beyond that now, and even those less nerdy and techie know what xkcd means – it’s become such a key part of internet popular culture. Indeed, Mr. Munroe’s work has real sway due to the sheer number of people who know and love it, and content on the site has resulted in changes being made on some of the biggest sites on the Internet – take, for example, Google adding a comment read-aloud feature in 2008, quite possibly because of a certain comic.

As another nerdy / tech / data citizen of the internet who knows, loves and follows xkcd, I thought I could pay tribute to it with its own everyday analysis.

Background

Initially, I thought I would have to go about doing it the hard way again. I’ve done some web scraping before with Python and thought this would be the same using the (awesome) Beautiful Soup package.

But Randall, being the tech-savvy (and Creative Commons abiding) guy that he is, was nice enough to provide an API to return all the comic metadata in JSON format (thanks Randall!).

That being said, it was straightforward to write some Python with urllib2 to download the data and then get going on the analysis.

Of course, after doing all that I realized that someone else was nice enough to have already written the equivalent code in R to access the data. D’oh! Oh well. Lesson learned – Google stuff first, code later.
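For the curious, grabbing the metadata for a single comic from the JSON API takes only a couple of lines in R as well; a sketch using the jsonlite package (one option, not necessarily what that existing code uses) might look like:

```r
library(jsonlite)

# Fetch the metadata for comic #1204 from the xkcd JSON API
comic <- fromJSON("https://xkcd.com/1204/info.0.json")
comic$title   # "Detail"
comic$img     # URL of the comic image itself
```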

But it was important to write that code in Python as I used the Python Imaging Library (PIL) (also awesome… thanks mysterious, shadowy developers at Pythonware/Secret Labs AB) to extract metadata from the comic images.

The data includes the 1204 comics from the very beginning (#1, Barrel – Part 1 posted on Jan 1, 2006) to #1204, Detail, posted on April 26, 2013.

As well as the data provided via the JSON (comic #, url, title, date, transcript and alt text) I pulled out additional fields using the Python Imaging Library (file format, filesize, dimensions, aspect ratio and luminosity). I also wanted to calculate hue, however, regrettably this is a somewhat more complicated process which my image processing chops were not immediately up to, and so I deferred on this point.

Analysis

File type
Ring chart of xkcd comics by file type
Bar chart of xkcd comics by file type


You can see that out of the 1204 comics, 1073 (~89.19%) were in PNG format, 128 (~10.64%) were in JPEG, and only 2 (~0.17%) were in GIF (#961, Eternal Flame and #1116, Stoplight). This, of course, is because the latter are the only two comics which are animated.

Looking at the filetype over time below, you can see that initially xkcd was primarily composed of JPEG images (mostly because they were scanned sketches) and this quickly changed over time to being almost exclusively PNG with the exception of the two aforementioned animated GIFs. The lone outlying JPEG near 600 is Alternative Energy Revolution (#556).

strip chart of xkcd comics by file type
Image Mode
Next we can look at the image mode of all the xkcd images. For a little context, the image modes are roughly as follows:
  • L – 8 bit black & white
  • P – 8 bit colour
  • RGB – colour
  • LA, RGBA – black & white with alpha channel (transparency), colour with alpha channel

The breakdown for all the comics is depicted below.

ring chart of xkcd comics by image mode
bar chart of xkcd comics by image mode

You can see that the majority are image mode L (847, ~70.41%), followed by 346 in RGB (~28.76%); a tiny number are in P (8, ~0.7%), with the remaining two in L and RGB modes with an alpha channel (LA & RGBA).

Any readers will know that the bulk of xkcd comics are simple black-and-white images with stick figures and you can see this reflected in the almost ¾ to ¼ ratio of monochrome to coloured images.

The two images with alpha channel are Helping (#383) and Click and Drag (#1110), most likely because of the soft image effect and interactivity, respectively.

Looking at the image mode over time, we can see that like the filetype, almost all of the images were initially in RGB mode as they were scans. After this period, the coloured comics are fairly evenly interspersed with the more common black and white images.

strip chart of xkcd comics by image mode
Luminosity

You can see in the figure on the left that given the black-and-white nature of xkcd the luminosity of each image is usually quite high (the maximum is 255). We can see the distribution better summarized on the right in a histogram:

scatterplot of luminosity of xkcd comics
histogram of luminosity of xkcd comics

Luminosity was the only quality of the images which had significant change over the years that Randall has created the comic. Doing an analysis of variance we can see there is a statistically significant year-on-year difference in the average comic brightness (> 99%):

> aov(data$lumen ~ data$year)
Call:
   aov(formula = data$lumen ~ data$year)

Terms:
                data$year Residuals
Sum of Squares     5762.0  829314.2
Deg. of Freedom         1      1201

Residual standard error: 26.27774
Estimated effects may be unbalanced

> summary(aov(data$lumen ~ data$year))
              Df Sum Sq Mean Sq F value  Pr(>F)
data$year      1   5762    5762   8.344 0.00394 **
Residuals   1201 829314     691
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

True, there are currently fewer data points for the 2013 year; however, even doing the same analysis excluding this year gives a result significant at the 99% level.

The average luminosity decreases by year, and this is seen in the plot below which shows a downward trend:

line plot of average luminosity of xkcd per year

Image Dimensions
Next we look at the sizes of each comic. xkcd ranges from very tall comic-book style strips to ultra-simplistic single small images driving the whole point or punch line home in one frame.

scatterplot of height vs. width for xkcd comics

Looking at the height of each comic versus the width, you can see that there appear to be several standard widths at which Randall produces the comic (not so with heights). These standard widths are 400, 500, 600, 640, and 740.

distribution of image heights of xkcd comic

We can see these reflected in the distribution of all image widths – 740 is by far the most common comic width. There is no such pattern in the image heights, which appear to have a more logarithmic-like distribution.

histogram of width of xkcd comics
histogram of height of xkcd comics

Interestingly, the ‘canonical’ widths are not constant over time – there were several widths which were used frequently near the beginning, after which the more common standard of 740px was used. This may be due to the large number of scanned images near the beginning, as I imagine scanning an A4 sheet of paper would often result in the same image resolutions.

scatterplot of width of xkcd comics

The one lone outlier on the high end of image width is 780px wide and is #1193, Externalities.

Looking at the aspect ratio of the comics over time, you can see that there appear to be two classes of comics – a larger number (about 60%) which are tightly clustered around an even 1:1 aspect ratio, and a second class more evenly distributed with aspect ratios of 2 and above. There are also small peaks around 1.5 and 1.75.
scatterplot of aspect ratio of xkcd comics
histogram of aspect ratio of xkcd comics
In case you were wondering the comic with an aspect ratio of ~8 is Tags (#1144) and the tallest comic proportionally is Future Timeline (#887).
Filesize

As well as examining the resolution (dimensions) of the comic images we can also examine the distribution of the images by their filesize.

distribution of file size of xkcd comics

You can see that the majority of the images are below 100K in size – in general the xkcd comics are quite small as the majority are simple PNGs displaying very little visual information.

We can also look at the comic size (area in square pixels) versus the filesize:

scatterplot of file size versus image size of xkcd comics
scatterplot of file size versus image size of xkcd comics (with trend line)

There is clearly a relationship here, as illustrated on the log-log plot on the right with the trend line. Of course, I am just stating the obvious – this relationship is not unique to the comics and exists as a property of the image formats in general.

If we separated out the images by file type (JPEG and PNG) I believe we would see different numbers for the relationship as a result of the particularities of the image compression techniques.
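That split is easy enough to check – a sketch, assuming a data frame comics with filesize, width, height and filetype columns (names are mine, not from the original script):

```r
# Log-log fit of file size against image area, separately for PNG and JPEG
comics$area <- comics$width * comics$height
fit_png  <- lm(log(filesize) ~ log(area), data = subset(comics, filetype == "PNG"))
fit_jpeg <- lm(log(filesize) ~ log(area), data = subset(comics, filetype == "JPEG"))
coef(fit_png)
coef(fit_jpeg)   # different slopes would reflect the different compression schemes
```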

Conclusions

I have this theory that how funny a joke is to someone who gets it is inversely proportional to the number of other people who would get it. That is to say, the more esoteric and niche the comedy is, the funnier and more appealing it is to those who actually get the punch line. It’s a feeling of being special – a feeling that someone else understands and that the joke was made just for you, and others like you, and that you’re not alone in thinking that comics involving Avogadro’s number, Kurt Godel or Turing Completeness can be hilarious.

As an analyst who has come out of the school of mathematics, and continually been immersed in the world of technology, it is reassuring to read something like xkcd and feel like you’re not the only one who thinks matters of data, math, science, and technology can be funny, along with all the other quirkiness and craziness in life which Randall so aptly (and sometimes poignantly) portrays.

That being said, Randall’s one dedicated guy who has done some awesome work for the digitally connected social world of science, technology, and internet geekdom, and now we know how much he likes using 740px as the canvas width, and that xkcd is gradually using fewer white pixels over the years.

And let’s hope there will be many more years to come.

Resources

xkcd
xkcd – JSON API
xkcd – Wikipedia
code on github

Toronto Licensed Cats & Dogs 2012 Data Visualization

It’s raining cats and dogs! No, I lied, it’s not.

But I wanted to do some more data viz and work with some more open data.

So for this quick plot I present, Cat and Dog Licenses in the City of Toronto for 2012, visualized!


Above in the top pane is the number of licensed cats and dogs per postal code (or Forward Sortation Area, FSA). I really would like to have produced a filled map (choropleth) with the different postal code areas; however, Tableau unfortunately does not have Canadian postal code boundaries, just lat/lon, and getting geographic data in is a bit of an arduous process.

I needed something to plot given that I just had counts of cat and dog licenses per FSA, so I threw up a scatter plot – and there is amazing correlation! Surprise, surprise – this is just a lurking third variable at work, and I bet that if you found a map of (human) population density by postal code you’d see why the two quantities are so closely related. Or perhaps not – this is just my assumption – maybe some areas of the GTA are better about getting their pets licensed or have more cats and dogs. Interesting food for thought.


Above is the number of licenses per breed type. Note that the scale is logarithmic for both as the “hairs” (domestic shorthair, domestic mediumhair and domestic longhair) dominate for cats and I wanted to keep the two graphs consistent.

The graphs are searchable by keyword, try it out!

Also, I find it shocking that the second most popular breed of dog was Shih Tzu and the fourth most common type of cat was Siamese – really?

Resources

Toronto Licensed Cat & Dog Reports (at Toronto Open Data Portal)

Toronto Animal Services
http://www.toronto.ca/animal_services/

Top 10 Super Bowl XLVII Commercials in Social TV (Respin)

So the Super Bowl is kind of a big deal.

Not just because there’s a lot of football. And not just because it’s a great excuse to get together with friends and drink a whole lot of beer and eat unhealthy foods. And not because it’s a good excuse to shout at your new 72″ flatscreen with home theater surround that you bought at Best Buy just for your Super Bowl party and are going to try to return the next day even though you’re pretty sure now that they don’t let you do that any more.

The Super Bowl is a big deal for marketers. For creatives. For ‘social media gurus’. Because there’s a lot of eyeballs watching those commercials. In fact, I’m pretty sure there’s people going to Super Bowl parties who don’t even like football and are just there for the commercials, that is if they’ve not decided to catch all the best ones after the fact on YouTube.

And also, you know, because if you’re putting down $6 million for a minute of commercial airtime, you want to make sure that those dollars are well spent.

So Bluefin Labs has been generating a lot of buzz lately, as they were just acquired by Twitter. TV is big, social media is big, so Social TV analytics must be even bigger, right? Right?

Anyhow Bluefin showed up recently in my Twitter feed for a different reason: their report on the Top 10 Super Bowl XLVII commercials in Social TV that they did for AdAge.

The report’s pretty and all, but a little too pretty for my liking, so I thought I’d respin some of it.

Breakdown by Gender:

Superbowl XLVII Commercial Social Mentions by Gender

You can see that the male / female split is fairly even overall, with the exception of the NFL Network’s ad and to a lesser extent the ad for Fast & Furious 6 which were more heavily mentioned proportionally by males. The Budweiser, Calvin Klein and Taco Bell spots had greater percentages of women commenting.

Sentiment

The Taco Bell, Dodge and Budweiser ads had the most mentions with positive sentiment. The NFL ad had a very large number of neutral comments (74%), more so than any other ad, proportionally. The Go Daddy ad had the most negative mentions, for good reason – it’s gross and just kind of weird. It wouldn’t be the Super Bowl if Go Daddy didn’t air a commercial of questionable taste though, right?
Superbowl XLVII Commercial Sentiment Breakdown by Gender
Superbowl XLVII Commercial Sentiment Breakdown by Gender (Proportional)
Lastly, I am going to go against the grain here and say that the next big thing in football is most definitely going to be Leon Sandcastle.

Finer Points Regarding Data Visualization Choices

The human mind is limited.

We can only process so much information at one time. Numerals are text which communicate quantity. However, unlike other text, it’s a lot harder to read a whole bunch of numbers and get a high-level understanding of what is being communicated. There are sentences of numbers and quantities (these are called equations, but not everyone is as literate in them) however simply looking at a pile of data and having an understanding of the ‘big picture’ is not something most people can do. This is especially true as the amount of information becomes larger than a table with a few categories and values.

If you’re a market research, business, data, financial, or (insert other prefix here) analyst, part of your job is taking a lot of information and making sense of that information, so that other people don’t have to. Let’s face it – your Senior Manager or The VP doesn’t have time to wade through all the data – that’s why they hired you.

Ever since Descartes’ epiphany (and even before that) people have been realizing that there are other, more effective ways to communicate information than having to look at all the details. You can communicate the shape of the data without knowing exactly how many Twitter followers were gained each day. You can see what the data look like without having to know the exact dollar value for sales each and every day. You can feel what the data are like, and get an intuitive understanding of what’s going on, without having to look at all the raw information.

Enter data visualization.

Like any practice, data visualization – depicting quantitative relationships visually – can be done poorly or done well. I’m sure you’ve seen examples of the former, whether in a presentation or other report, or perhaps floating around the Internet. The latter, like so many good things, is not always so plentiful, nor so appreciated. Here I present some finer points about the choices in data visualization, in the hope that you will always find yourself depicting data well.

Pie (and Doughnut) Chart

Ah, the pie chart. The go-to the world over when most people seek to communicate data, and one both loved and loathed by many.
The pie chart should be used to compare quantities of different categories where the proportion of the whole is important, not the absolute values (though these can be added with labelling as well). It’s important that the number of categories being compared remains small – depending on the values, the readability of the chart decreases greatly as the number of categories increases. You can see this below. The second example is a case where an alternative representation should be considered, as the chart’s readability and usefulness are lower given the larger number of proportions being compared:

Doughnut charts are the same as pie charts but with a hole in the center. They may be used in the place of multiple pie charts by nesting the rings:

Hmm.

Though again, as the number of quantities being compared increases, the readability and visual utility generally decrease, and you are better served by a bar chart in these cases. There is also the issue that the same angle sweeps out a different area depending upon which ring (annulus) it falls in.

With circular charts it is best to avoid legends, as they cause the eye to flit back and forth between the segments and the legend; however, when abiding by this practice for doughnut charts, labelling becomes a problem, as you can see above.

Tufte contends that a bar chart will always serve better than a pie chart (though some others disagree). The issue is that there is some debate about the way the human mind processes comparisons of angular representations versus those depicted linearly or by area. I tend to agree, and find the chart below much better data visualization than the one we saw previously:

Isn’t that much better?

From a practical perspective – a pie chart is useful because of its simplicity and familiarity, and is a way to communicate proportion of quantities when the number of categories being compared is small. 
Bonus question:
Q. When is it a good idea to use a 3-D pie chart?
A. Never. Only as an example of bad data visualization!
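
To make the comparison concrete, here is a minimal sketch in R using base graphics and made-up category values (the ‘Traffic Share’ numbers are purely hypothetical), drawing the same data first as a pie chart and then as the bar chart alternative:

# Hypothetical category values, purely for illustration
shares <- c(Search = 45, Social = 25, Direct = 18, Email = 12)

# Pie chart: workable for a small number of categories where proportion matters
pie(shares, main = "Traffic Share by Channel")

# Bar chart alternative: lengths are easier to judge than angles
barplot(sort(shares, decreasing = TRUE),
        main = "Traffic Share by Channel", ylab = "Share (%)")

With only four categories both versions are readable; as the number of slices grows, the bar chart holds up much better.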

Bar Charts

Bar charts are used to depict the values of a quantity or quantities across categories. For example, to depict sales by department, or per product type.
This type of chart can be (and is) used to depict values over time; however, the time periods should be discrete (e.g. quarters, years) and small in number. When a comparison is to be made over time and the number of periods / data points is larger, the data are better visualized using a line chart.

As the number of categories becomes large, an alternative to the usual arrangement (‘column’ chart) is to arrange the categories vertically and bars horizontally. Note this is best done only for categorical / nominal data as data with an implied order (ordinal, interval, or ratio type data) should be displayed left-to-right in increasing order to be consistent with reading left to right.
Bar charts may also be stacked in order to depict both the component values within each category and the total across them. If the absolute values are not important, then stacked bar charts may be used in place of several pie charts, with all bars scaled to a maximum height of 100%:

Stephen Few contends that this still makes it difficult to compare proportions, similar to the problem with pie charts, and has other suggestions [PDF], though I think it is fine on some occasions, depending on the nature of the data being depicted.

When creating bar charts it is important to always start the y-axis from zero so as not to produce a misleading graph.

A column chart may also be combined with a line graph of the cumulative total across categories in a type of combo chart known as a Pareto chart.
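
As a rough sketch of the stacked and 100% stacked variants discussed above, again in base R with invented numbers (the product lines and regions are hypothetical):

# Hypothetical sales by product line (rows) within region (columns)
sales <- matrix(c(30, 45, 25,
                  50, 20, 30,
                  20, 35, 45),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Widgets", "Gadgets", "Gizmos"),
                                c("East", "Central", "West")))

# Stacked bars: component values plus the total per region
barplot(sales, main = "Sales by Region (Stacked)", legend.text = TRUE)

# 100% stacked bars: each column scaled to its total, emphasizing proportions
barplot(prop.table(sales, margin = 2) * 100,
        main = "Sales Mix by Region", ylab = "Share (%)", legend.text = TRUE)

(Passing horiz = TRUE to barplot() gives the horizontal arrangement mentioned earlier for long lists of categories.)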

Scatterplot (and Bubble Graphs)

Scatterplots are used to depict a relationship between two quantitative variables. The value pairs for the variables are plotted against each other, as below:

When used to depict relationships occurring over time, we instead use a special type of scatterplot known as a line graph (next section).

A bubble chart is a type of scatterplot used to compare relationships between three variables, where the points are sized according to the value of a third variable. Care should be taken to ensure that the points are sized correctly in this type of chart (by area, not by radius or diameter), so as not to incorrectly depict the relative proportions of the quantities.

Relationships between four variables may also be visualized by colouring each point according to the value of a fourth variable, though this may be a lot of information to depict all at once, depending upon the nature of the data. When animated to include a fifth variable (usually time) it is known as a motion chart, perhaps most famously demonstrated in Hans Rosling’s TED Talk, which has become something of a legend.
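
A minimal sketch of both in base R, using randomly generated (and entirely hypothetical) values for spend, sales, and market size; note that the symbol size is scaled by the square root of the value so that the bubble areas, not their radii, are proportional to the quantity:

# Hypothetical data: ad spend, sales, and market size for twenty products
set.seed(42)
spend  <- runif(20, 10, 100)
sales  <- 2.5 * spend + rnorm(20, sd = 20)
market <- runif(20, 1, 50)

# Plain scatterplot of the two quantitative variables
plot(spend, sales, xlab = "Ad Spend", ylab = "Sales", main = "Sales vs. Ad Spend")

# Bubble chart: cex scales the symbol's diameter, so use sqrt() to size by area
plot(spend, sales, xlab = "Ad Spend", ylab = "Sales",
     main = "Sales vs. Ad Spend (bubbles sized by market size)",
     pch = 21, bg = "grey", cex = 3 * sqrt(market / max(market)))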

Line Graphs

Line graphs are usually used to depict quantities changing over time. They may also be used to depict relationships between two (numeric) quantities when there is continuity in both.

For example, it makes sense to compare sales over time with a line graph, as time is a numerical quantity that varies continuously:

However, it would not make sense to use a line graph to compare sales across departments, as department is a categorical / nominal variable. Note that there is one exception to this rule: the aforementioned Pareto chart.

Omitting the points on the line graph and using a smooth curve instead of line segments creates an impression of more data being plotted, and hence of greater continuity. Compare the plot above with the one below:

So, practically speaking, save the smooth line graphs for when you have a lot of data and the points would just be visual clutter; otherwise it’s best to overplot the points to be clear about which quantities are being communicated.

Also note that unlike a bar chart, it is acceptable to have a non-zero starting point for the y-axis of a line graph as the change in values is being depicted, not their absolute values.
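
To illustrate, here is a sketch in base R with a made-up year of monthly sales, drawn once with the points overplotted and once as a smoothed line with the points omitted:

# Hypothetical monthly sales figures
months <- 1:12
sales  <- c(120, 132, 128, 145, 150, 161, 158, 170, 182, 175, 190, 205)

# Line graph with the points overplotted, so the actual values stay visible
plot(months, sales, type = "o", xlab = "Month", ylab = "Sales",
     main = "Monthly Sales")

# Smoothed line without points: implies more continuity than twelve values have
plot(spline(months, sales, n = 200), type = "l",
     xlab = "Month", ylab = "Sales", main = "Monthly Sales (Smoothed)")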

Now Go Be Great!

This is just a sample of some of the finer differences between the choices for visualizing data. There are of course many more ways to depict data, and, I would argue, the possibilities for data visualization are limited only by the imagination of the visualizer. However, when sticking with the tried, true and familiar, keep these points in mind to be great at what you do and get your point across quantitatively and visually.
Go, visualize the data, and be amazing!

What The Smeg? Some Text Analysis of the Red Dwarf Scripts

Introduction

Just as Pocket fundamentally changed my reading behaviour, I am finding that now having Netflix (and even before that, other downloadable or streaming digital content) is really changing my behaviour as far as television is concerned.

Where watching TV used to be an affair of browsing through 500 channels and complaining there was nothing on, now with the advent of on-demand digital services there is choice. Instead of flipping through hundreds of channels (is that a linear search or a random walk?), most of which have nothing whatsoever that interests you, now you can search for exactly the show you are looking for and watch it when you want. Without commercials.

Wait, what? That’s amazing! No wonder people are ‘cutting the cord’ and media corporations are concerned about the future of their business model.

True, you can still browse. People complain that the selection on Netflix is bad for Canada, but for 8 dollars a month what you’re getting is really pretty good. And given the… eclectic nature of the selection, I sometimes find myself watching something I would never think to look for directly, or give a second chance if I just caught 5 minutes of the middle of it on cable.

Such is the case with Red Dwarf. Red Dwarf is one of those shows that gained a cult following, and, despite its many flaws, for me has a certain charm and some great moments. This despite my not being able to understand all of the jokes (or dialogue!) as it is a show from the BBC.

The point is that before Netflix I probably wouldn’t have come across something like this, and I definitely wouldn’t have watched all of it, if that option weren’t so easily laid out.

So I watched a lot of this show and got to thinking, why not take this as an opportunity to do some more everyday analytics?

Background

If you’re not familiar with the show or a fan, I’ll briefly summarize here so you’re not totally lost.

The series centers around Dave Lister, an underachieving chicken-soup vending machine repairman aboard the intergalactic mining ship Red Dwarf. Lister inadvertently becomes the last human being alive after the ship’s computer, Holly, puts him into stasis for 3 million years when there is a radiation leak aboard the ship. The remainder of the ship’s crew are Arnold J. Rimmer, a hologram of Lister’s now-deceased bunkmate and superior officer; The Cat, a humanoid evolved from Lister’s pet cat; Kryten, a neurotic sanitation droid; and later Kristine Kochanski, a love interest who gets brought back to life from another dimension.

Conveniently, the Red Dwarf scripts are available online, transcribed by dedicated fans of the program. This just goes to show that the series truly does have a cult following, when there are fans who love the show so much as to sit and transcribe episodes for its own sake! But then again, I am doing data analysis and visualization on that same show…

Analysis

Of the ten seasons and 61 episodes of the series, the data set covers Seasons 1-8, comprising 51 of those 52 episodes (S08E03 – Back In The Red (Part III) – is missing).
I did some text analysis of the data with the tm package for R. 
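
For the curious, here is a minimal sketch of the kind of processing involved; it assumes each episode’s script has been saved as a plain-text file in a local ‘scripts/’ folder (a hypothetical layout – the actual code and data are linked under Resources below):

library(tm)

# Build a corpus from one plain-text file per episode
corpus <- VCorpus(DirSource("scripts/", encoding = "UTF-8"))

# Basic cleanup before building the term-document matrix
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix: rows are terms, columns are episodes
tdm <- TermDocumentMatrix(corpus)

# Count mentions of a few character names per episode
characters <- c("lister", "rimmer", "kryten", "cat", "holly", "kochanski")
m <- as.matrix(tdm)
mentions <- m[intersect(characters, rownames(m)), ]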

First we can see the prevalence of different characters within the show over the course of the series. I’ve omitted the x-axis labels as they made the chart appear cluttered; you can see them by interacting with the chart.

Lister and Rimmer, the two main characters, have the highest number of mentions overall. Kryten appears in the eponymous S02E01 and is then later introduced as one of the core characters at the beginning of Season 3. The Cat remains fairly constant throughout the whole series as he appears or speaks mainly for comedic value. In S01E06, Rimmer makes a duplicate of himself, which explains the high number of lines by his character and mentions of his name in the script. You can see he disappears after Episode 2 of Season 7, in which his character is written out, until re-appearing in Season 8 (he does appear in S07E05, an episode dedicated to the rest of the crew reminiscing about him).

Holly, the ship’s computer, appears consistently at the beginning of the program until disappearing with the Red Dwarf towards the beginning of Season 6. He is later reintroduced when it returns at the beginning of Season 8.

Lister wants to bring back Kochanski as a hologram in S01E03, and she also appears in S02E04, as it is a time travel episode. She is introduced as one of the core cast members in Episode 3 of Season 7 and continues to be so until the end of the series.

Ace is Rimmer’s macho alter-ego from another dimension. He appears a couple of times in the series before S07E02, in which he is used as a plot device to write Rimmer out of the show for that season.

Appearances and mentions of the other crew members of the Dwarf correspond to the beginning of the series and the end (Season 8), when they are reintroduced. The Captain, Hollister, appears much more frequently towards the end of the show.

Robots appear mainly as one-offs who are the focus of a single episode. The exceptions are the Scutters (Red Dwarf’s utility droids), whose appearances coincide with the parts of the show where the Dwarf exists, and the simulants, which are mentioned occasionally as villains / plot devices. The toaster and snarky dispensing machine also appear towards the beginning and end, with the former also having speaking parts in S04E04.

As mentioned before, the Dwarf gets destroyed towards the end of Season 5 and is not seen again until being reintroduced at the beginning of Season 8. During this time, the crew live in one of the ship’s shuttlecraft, Starbug. You can also see that Starbug is mentioned more frequently in episodes when the crew go on excursions (e.g. Season 3, Episodes 1 and 2).

One of the recurring themes of the show is how much Lister really enjoys Indian food, particularly chicken vindaloo. That and how he’d much rather just drink beer at the pub than do anything. S04E02 (spike 1) features a monster, a Chicken Vindaloo man (don’t ask), and the whole premise of S07E01 (spike 2) is Lister wanting to go back in time to get poppadoms.

Thought this would be fun. Space is a consistent theme of the show, obviously. S07E01 is a time travel episode, and the episodes with Pete at the end (Season 8, Episodes 6-7) feature a time-altering device.

Conclusions

I recall talking to an associate of mine who recounted his experiences in a data analysis and programming workshop where the data set used was the Enron emails. As he quite rightly pointed out, he knew nothing about the Enron emails, so doing the analysis was difficult – he wasn’t quite sure what he was looking at, or what he should be expecting. He said he later used the Seinfeld scripts as a starting point, as this was at least something he was familiar with.

And that’s an excellent point. You don’t necessarily need to be a subject matter expert to be an analyst, but it sure helps to have some idea of what exactly you are analyzing. Also, I would think you’re more likely to care about what you’re trying to analyze if you know something about it.

On that note, it was enjoyable to analyze the scripts in this manner, and see something so familiar as a television show visualized as data like any other. I think the major themes and changes in the plotlines of the show were well represented in this way.

In terms of future directions, I tried looking at the correlation between terms using the findAssocs() function but got strange results, which I believe is due to the small number of documents. At a later point I’d like to do that properly, with a larger number of documents (perhaps tweets). Also, this would work better if synonym replacement for the characters were handled in the original corpus, instead of ad hoc and after the fact (see code).
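
For reference, the call in question looks something like the sketch below, run against the term-document matrix built earlier; the 0.7 correlation threshold is arbitrary:

# Terms whose per-episode frequencies correlate with 'rimmer' above the threshold
findAssocs(tdm, terms = "rimmer", corlimit = 0.7)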

Lastly, another thing I took away from all this is that cult TV shows have very, very devoted fan bases. Probably due to Wikipedia’s systemic bias, there is an awful lot about Red Dwarf there, and elsewhere on the internet.

Resources

code and data on github
https://github.com/mylesmharrison/reddwarf

Red Dwarf Scripts (Lady of the Cake)