Training an RNN on the Archer Scripts

Introduction

So all the hype these days is around “AI” as opposed to “machine learning” (though I’ve yet to hear an exact distinction between the two), and one of the tools that seems to get talked about most is Google’s TensorFlow.
I wanted to play around with TensorFlow and RNNs a little, since they’re not the type of machine learning I’m most familiar with, and see what kind of outputs I could come up with for a low investment of time.

Background

A little digging and I came across this tutorial, which is a good brief introduction to RNNs, uses Keras, and computes things character-wise.
This in turn led me to word-rnn-tensorflow which, expanding on the work of others, uses a word-based model (instead of a character-based one).
I wasn’t about to spend my whole weekend rebuilding RNNs from scratch – no sense reinventing the wheel – so I thought it’d be interesting to play around a little with this one, and perhaps give it a more interesting dataset. Shakespeare is OK, but why not something a little more culturally relevant… like, I dunno, say the scripts from a certain cartoon featuring a dysfunctional foul-mouthed spy agency?


Twitter Pop-up Analytics

Introduction

So I’ve been thinking a lot lately. Well, that’s always true. I should say, I’ve been thinking a lot lately about the blog. When I started this blog I was very much into the whole quantified self thing, because it was new to me, I liked the data collection and analysis aspect, and I had a lot of time to play around with these little side projects.
When I started the blog I called it “everyday analytics” because that’s what I saw it always being: analysis of data on topics that were part of everyday life, the ordinary viewed under the analytical lens, things that everyone can relate to. You can see this in my original about page for the blog, which has remained the same since inception.
I was thinking a lot lately about how my interest in data analysis, visualization and analytics has matured, and how that’s not really the case so much anymore. The content of everyday analytics has become a lot less everyday. Analyzing the relative nutritional value of different items on the McDonald’s menu (yeesh, looking back now those graphs are pretty bad) is very much something to which most everyone can relate. 2-D histograms in R? PCA and k-means clustering? Not so much.
Along this line of thinking, I thought it was high time to get back to the original spirit of the site when it was started: some quick quantified-self type analysis, about something everyone can relate to, nothing fancy.
Let’s look at my Twitter feed.

Background

It wasn’t always easy to get data out of Twitter. If you look back at how Twitter’s API has changed over the years, there has been considerable uproar about the restrictions made in updates; however, they’re entitled to do so, as they hold the keys to the kingdom after all (it is their product). In fact, I thought it’d be easiest to do this analysis just using the twitteR package, but it appears to be broken since Twitter made said updates to their API.

Luckily I am not a developer, and my data needs are simple for some ad hoc analysis. All I need is the data pulled and I am ready to go. Twitter now makes this easy for anyone to do – just go to your settings page:

And then select the ‘Download archive’ button under ‘Your Twitter Archive’ (here it is a prompt to resend mine, as I took the screenshot after):

And boom! A CSV of all your tweets is in your inbox, ready for analysis. After all this talk about working with “Big Data” and trawling through large datasets, it’s nice to take a breather and work with something small and simple.

Analysis

So, as I said, nothing fancy here – just some intentionally hacky R code to do some “pop-up” analytics given Twitter’s output CSV. Why did I do it this way, which results in 1990s-looking graphs, instead of in Excel, making it all pretty? Why, for you, of course. Reproducibility. You can take my R code, run it on your own Twitter archive (which is probably a lot larger and more interesting than mine) and get the same graphs.
The data set comprises 328 tweets sent by myself between 2012-06-03 and 2014-10-02. The fields I examined were the datetime field (for time-parting analysis), the tweet source, and the text / content.
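Getting the archive into R is then just a couple of lines. A minimal sketch – the file name and the timestamp / source / text column names are assumptions based on the archive format at the time of writing, so check yours before running:

# Read the archive's CSV of tweets (file and column names assumed)
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)

# Parse the timestamp strings into POSIXct for the time-based analysis
tweets$timestamp <- as.POSIXct(tweets$timestamp,
                               format = "%Y-%m-%d %H:%M:%S", tz = "UTC")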
Time Parting
First let’s look at the time trending of my tweeting behaviour:
We can see there is some kind of periodicity, with peaks and valleys in how many tweets I send. The sharp decline near the end is because there are only 2 days of data for October. Also, compared to your average Twitter user, I’d say I don’t tweet a lot – generally only once every two days or so on average:
> summary(as.vector(monthly))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    8.00   12.00   11.31   15.00   21.00 
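For reference, the monthly object above is just a count of tweets per month; a sketch of one way to compute it, assuming the parsed timestamp column from the earlier snippet:

# Tabulate tweets per year-month; summary() as above then gives
# the distribution of monthly tweet counts
monthly <- table(format(tweets$timestamp, "%Y-%m"))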

Let’s take a look and see if there is any rhyme or reason to these peaks and valleys:

Looking at the total counts per month, it looks like I’ve tweeted less often in March, July and December, for whatever reason (for all of this, pardon my eyeballing…)

What about by day of week?

Looks like I’ve tweeted quite a bit more on Tuesday, and markedly less on the weekend. Now, how does that look over the course of the day?

My peak tweeting time seems to be around 4 PM. Apparently I have sent tweets even in the wee hours of the morning – this was a surprise to me. I took a stab at making a heatmap, but it was quite sparse; however the 4-6 PM peak does persist across the days of the week.
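The tabulations behind these charts are simple; a sketch, again assuming the parsed timestamp column:

# Tweets by day of week (weekdays() returns locale-dependent day names)
dow <- table(weekdays(tweets$timestamp))

# Tweets by hour of day (00-23)
hourly <- table(format(tweets$timestamp, "%H"))

# Day-by-hour contingency table, the basis for a (sparse) heatmap
by_day_hour <- table(weekdays(tweets$timestamp),
                     format(tweets$timestamp, "%H"))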

Tweets by Source
Okay, that was interesting. Where am I tweeting from?

Looks like the majority of my tweets are actually sent from the desktop site, followed by my phone, and then sharing from sites. I attribute this to the fact that I mainly use Twitter to share articles, which isn’t easy to do on my smartphone.
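In my archive, the source field came through as an HTML anchor tag rather than a plain client name, so it needs a little cleanup before tabulating; a sketch:

# The source field arrives as an HTML anchor tag, e.g.
# <a href="...">Twitter Web Client</a>; strip the tags and tabulate
sources <- gsub("<[^>]+>", "", tweets$source)
sort(table(sources), decreasing = TRUE)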

Content Analysis
Ah, now on to the interesting stuff! What’s actually in those tweets?
First let’s look at the length of my tweets in a simple histogram:
Looks like generally my tweets are above 70 characters or so, with a large peak close to the absolute limit of 140 characters.
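The histogram itself is essentially a one-liner in base R:

# Histogram of tweet lengths in characters
hist(nchar(tweets$text), breaks = 20,
     xlab = "Tweet length (characters)", main = "Tweet lengths")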
Okay, but what am I actually tweeting about? Using the very awesome tm package it’s easy to do some simple text mining and pull out both the top frequent terms and the hashtags.
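A minimal sketch of that kind of extraction – not my exact code, which is on github:

library(tm)

# Build and clean a corpus from the tweet text
corpus <- Corpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Top frequent terms (here, anything appearing 10+ times)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 10)

# Hashtags are easier to pull with a regex on the raw text,
# since removePunctuation would strip the '#'
hashtags <- regmatches(tweets$text, gregexpr("#[A-Za-z0-9_]+", tweets$text))
sort(table(tolower(unlist(hashtags))), decreasing = TRUE)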
So apparently I tweet a lot about data, analysis, Toronto and visualization. To anyone who’s read the blog this shouldn’t be overly surprising. Also, you can see I pass along articles and interact with others, as “via” and “thanks” are in there too. Too bad about that garbage ampersand.
Overwhelmingly the top hashtag I use is #dataviz, followed of course by #rstats. Again, for anyone who knows me (or has seen one of my talks) this should not come as a surprise. You can also see my use of Toronto Open Data in the #opendata and #dataeh hashtags.

Conclusion

That’s all for now. As I said, this was just a fun exercise to write some quick, easy R code to do some simple personal analytics on a small dataset. On the plus side the code is generalized, so I invite you to take it and look at your own twitter archive.
Or, you could pull all of someone else’s tweets, but that would, of course, require a little more work.

References

code at github
Twitter Help Center: Downloading Your Archive
The R Text Mining (tm) package at CRAN
twitteR package at CRAN

Everything in Its Right Place: Visualization and Content Analysis of Radiohead Lyrics

Introduction

I am not a huge Radiohead fan.

To be honest, the Radiohead I know, love and remember is the one that was a rock band without a lot of ‘experimental’ tracks – a band you discovered on Big Shiny Tunes 2, or because your friends told you about it, or because it was playing in the background of a bar you were at sometime in the ’90s.

But I really do like their music; I’ve become familiar with more of it, and overall it does possess a certain unique character in its entirety. Their range is so diverse and has changed so much over the years that it would be hard not to find at least one track that someone will like. In this way they are very much like the Beatles, I suppose.

I was interested in doing some more content analysis type work and text mining in R, so I thought I’d try song lyrics and Radiohead immediately came to mind.

Background

In order to first do the analysis, we need all the data (shocking, I know). Somewhat surprisingly, putting ‘radiohead data‘ into Google comes up with little except for many, many links to the video and project for House of Cards which was made using LIDAR technology and had the data set publicly released.
So once again we are in this situation where we are responsible for not only analyzing all the data and communicating the findings, but also getting it as well. Such is the life of an analyst, everyday and otherwise (see my previous musing on this point).
The lyrics data was taken from the listing of Radiohead lyrics at Green Plastic Radiohead.

Normally it would be simply a matter of throwing something together in Python using Beautiful Soup as I have done previously. Unfortunately, due to the way these particular pages were coded, that proved to be a bit more difficult than expected.

As a result the extraction process ended up being a convoluted ad-hoc data wrangling exercise involving the use of wget, sed and Beautiful Soup – a process which was neither enjoyable nor something I would care to repeat.
In retrospect, two points:

1. Getting the data is not always easy. Sometimes sitting down beforehand and looking at where you are getting the data from, the format it is in, and how best to get it into the format you need will save you a lot of wasted time and frustration in the long run. Ask questions before you begin: What format is the data in now? What format do I need (or want) it in to do the analysis? What steps are required to get from one to the other (i.e. what is the data transformation or mapping process)? That being said, my methods got me where I needed to be; however, there were most likely easier, more straightforward approaches which would have saved a lot of frustration on my part.

2. If you’re going to code a website, use a sane page structure and give important page elements ids. Make it easy on your fellow developers (and the rest of the world in general) by labeling your <div> containers and other elements with ids (which are unique!!) or at least classes. Otherwise how are people going to scrape all your data and steal it for their own ends? I joke… kind of. In this case my frustrations actually stemmed mainly from some questionable code for a cache-buster, but even once I got past that, the contents of the main page containers were somewhat inconsistent. Such is life, and the internet.
The remaining data – album and track length – was taken from the Wikipedia pages for each album and later merged with the calculations done on the text data in R.
Okay, enough whinging – we have the data – let’s check it out.

Analysis

I stuck with what I consider to be the ‘canonical’ Radiohead albums – that is, the big releases you’ve probably heard about even if you’re, like me, not a hardcore Radiohead fan – 8 albums in total (Pablo Honey, The Bends, OK Computer, Kid A, Amnesiac, Hail to the Thief, In Rainbows, and The King of Limbs).
Unstructured (and non-quantitative) data always lends itself to more interesting analysis – with something like text, how do we analyze it? How do we quantify it? Let’s start with the easily quantifiable parts and go from there.
Track Length
Below is a boxplot of the track lengths per album, with the points overplotted.

Distribution of Radiohead track lengths by album
Interestingly, Pablo Honey and Kid A have the largest ranges of track length (from 2:12 to 4:40 and 3:42 to 7:01 respectively) – if you ignore the single tracks around 2 minutes on Amnesiac and Hail to the Thief, the variance of their track lengths is more in line with all the other albums. Ignoring its single outlier, The King of Limbs appears to be special given its narrow range of track lengths.
Word Count
Next we look at the number of words (lyrics) per album:
Distribution of number of words per Radiohead album

There is a large range of word counts, from the two truly instrumental tracks (Treefingers on Kid A and Hunting Bears on Amnesiac) to the wordier tracks (Dollars and Cents and A Wolf at the Door). Pablo Honey almost looks like it has two categories of songs – with a split around the 80 word mark.

Okay, interesting and all, but again these are small amounts of data and only so much can be drawn out as such.

Going forward we examine two calculated quantities.

Calculated Quantities – Lexical Density and ‘Lyrical Density’

In the realm of content analysis there is a measure known as lexical density: the number of content words as a proportion of the total number of words, a value ranging from 0 to 100. In general, the greater the lexical density of a text, the more content-heavy it is and the more ‘unpacking’ it takes to understand – texts with low lexical density are easier to understand.

According to Wikipedia the formula is as follows:

Ld = (Nlex / N) × 100

where Ld is the analysed text’s lexical density, Nlex is the number of lexical word tokens (nouns, adjectives, verbs, adverbs) in the analysed text, and N is the number of all tokens (total number of words) in the analysed text.

Now, I am not a linguist, but it sounds like this is just the ratio of words which are not stopwords to the total number of words – or could at least be approximated by it. That’s what I went with in the calculations in R using the tm package (because I’m not going to write a package to calculate lexical density by myself).

On a related note, I completely made up a quantity which I am calling ‘lyrical density’, which is much easier to calculate and understand – this is just the number of words in a song over the track length, measured in words per second. An instrumental track would have a lyrical density of zero, and a song with one word per second for the whole track would have a lyrical density of 1.
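Neither takes much code. A minimal sketch of both calculations, using tm’s English stopword list as a stand-in for non-content words:

library(tm)

# Approximate lexical density: percentage of words that are not stopwords
lexical_density <- function(lyrics) {
  words <- tolower(unlist(strsplit(removePunctuation(lyrics), "\\s+")))
  words <- words[words != ""]
  if (length(words) == 0) return(0)  # instrumental tracks
  100 * sum(!(words %in% stopwords("english"))) / length(words)
}

# Lyrical density: words per second of track length
lyrical_density <- function(lyrics, track_seconds) {
  words <- unlist(strsplit(lyrics, "\\s+"))
  sum(words != "") / track_seconds
}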

Lexical Density

Distribution of lexical density of Radiohead songs by album
Looking at the calculated lexical density per album, we can see that the majority of songs have a lexical density between about 30 and 70. The two instrumental songs have a lexical density of 0 (as they have no words), and the distribution appears most even on OK Computer. The most content-word-heavy song, I Will (No Man’s Land), is on Hail to the Thief.
If you could imagine extending the number of songs Radiohead has written to infinity, you might get a density function something like the one below, with the bulk of songs having density between 30 and 70 (which I imagine is a reasonable range for any text) and a little bump at 0 for their instrumental songs:
Histogram of lexical density of Radiohead tracks with overplotted density function
Lyrical Density
Next we come to my calculated quantity, lyrical density – or the number of words per second on each track.
Distribution of lyrical density of Radiohead tracks by album

Interestingly, there are outlying tracks near the high end where the ratio of words to song length is greater than 1 (Fitter Happier, A Wolf at the Door, and Faust Arp). Fitter Happier shouldn’t even really count, as it is really an instrumental track with a synthesized voice dubbed overtop. If you listen to A Wolf at the Door it is clear why the lyrical density is so high – Thom is practically rapping at points. Otherwise, Kid A and The King of Limbs seem to have less quickly sung lyrics than the other albums on average.

Lexical Density + Lyrical Density
Putting it all together, we can examine the quantities for all of the Radiohead songs in one data visualization. You can examine different albums by clicking the color legend at the right, and compare multiple albums by holding CTRL and clicking more than one.


The songs are colour-coded by album. The points are plotted with lexical density along the y-axis against lyrical density along the x-axis, and sized by the total number of words in the song. As such, the position of a point gives an idea of the rate of lyrical content in the track – a song like I Might Be Wrong fits fewer content words in at a slower rate than a track like A Wolf at the Door, which is packed much tighter with both lyrics and meaning.

Conclusion

This was an interesting project and it was fascinating to take something everyday like song lyrics and analyze them as data (though some Radiohead fans might argue that there is nothing ‘everyday’ about Radiohead lyrics).
All in all, I feel that a lot of the analysis has to be taken with a grain of salt (or a shaker or two), given the size of the data set (n = 89). 
That being said, I still feel it is proof positive that you can take something typically thought of as very artistic and qualitative, like a song, and classify it in a meaningful, quantitative way. I had never listened to the song Fitter Happier, yet it is a clear outlier in several measures – and listening to the song I discovered why: it is a track with a robot-like voiceover, not containing sung lyrics at all.
A more interesting and ambitious project would be to take a much larger data set, where the measures examined here would be more reliable given the larger n, and look at things such as trends over time (the evolution of American rock lyrics) or by genre / style of music. This sort of thing already exists to an extent, for example in work done with the Million Song Data Set, which I came across in my Google searches for this project.
But as I said, this would be a large and ambitious amount of work, which is perhaps more suited for something like a research paper or thesis – I am just one (everyday) analyst. 

References & Resources

Radiohead Lyrics at Green Plastic Radiohead
The Million Song Data Set
Measuring the Evolution of Contemporary Western Popular Music [PDF]
Radiohead “House of Cards” by Aaron Koblin
code, data & plots on github

What The Smeg? Some Text Analysis of the Red Dwarf Scripts

Introduction

Just as Pocket fundamentally changed my reading behaviour, I am finding that now having Netflix (and even before that, other downloadable or streaming digital content) is really changing my behaviour as far as television is concerned.

Where watching TV used to be an affair of browsing through 500 channels and complaining there was nothing on, now with the advent of on-demand digital services there is choice. Instead of flipping through hundreds of channels (is that a linear search or a random walk?), most of which have nothing whatsoever that interests you, now you can search for exactly the show you are looking for and watch it when you want. Without commercials.

Wait, what? That’s amazing! No wonder people are ‘cutting the cord’ and media corporations are concerned about the future of their business model.

True, you can still browse. People complain that the selection on Netflix is bad in Canada, but for 8 dollars a month what you’re getting is really pretty good. And given the… eclectic nature of the selection, I sometimes find myself watching something I would never think to look for directly, or give a second chance if I just caught 5 minutes of the middle of it on cable.

Such is the case with Red Dwarf. Red Dwarf is one of those shows that gained a cult following and, despite its many flaws, for me has a certain charm and some great moments. This despite my not being able to understand all of the jokes (or dialogue!), as it is a show from the BBC.

The point is that before Netflix, I probably wouldn’t come across something like this, and I definitely wouldn’t watch all of it, if there wasn’t that option so easily laid out.

So I watched a lot of this show and got to thinking, why not take this as an opportunity to do some more everyday analytics?

Background

If you’re not familiar with the show or a fan, I’ll briefly summarize here so you’re not totally lost.

The series centers around Dave Lister, an underachieving chicken-soup vending machine repairman aboard the intergalactic mining ship Red Dwarf. Lister inadvertently becomes the last human being alive when he is put into stasis for 3 million years by the ship’s computer, Holly, after a radiation leak aboard the ship. The remainder of the ship’s crew are Arnold J. Rimmer, a hologram of Lister’s now-deceased bunkmate and superior officer; The Cat, a humanoid evolved from Lister’s pet cat; Kryten, a neurotic sanitation droid; and later Kristine Kochanski, a love interest who gets brought back to life from another dimension.

Conveniently, the Red Dwarf scripts are available online, transcribed by dedicated fans of the program. This just goes to show that the series truly does have a cult following, when there are fans who love the show so much as to sit and transcribe episodes for their own sake! But then again, I am doing data analysis and visualization on that same show…

Analysis

Of the ten seasons and 61 episodes of the series, the data set covers Seasons 1-8, comprising 51 of those 52 episodes (S08E03 – Back In The Red (Part III) – is missing).
I did some text analysis of the data with the tm package for R. 
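As a rough sketch of the setup – assuming one plain-text file per episode in a scripts/ directory, which is a stand-in for my actual layout (code and data are on github):

library(tm)

# One document per episode script
scripts <- Corpus(DirSource("scripts/"))
scripts <- tm_map(scripts, content_transformer(tolower))
scripts <- tm_map(scripts, removePunctuation)

# Character mentions per episode are just rows of the term-document matrix
# (note a name like "cat" will also count ordinary uses of the word)
tdm <- TermDocumentMatrix(scripts)
m <- as.matrix(tdm)
mentions <- m[c("lister", "rimmer", "cat", "kryten", "holly"), ]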

First we can see the prevalence of different characters within the show over the course of the series. I’ve omitted the x-axis labels as they made the chart appear cluttered; you can see them by interacting.

Lister and Rimmer, the two main characters, have the highest number of mentions overall. Kryten appears in the eponymous S02E01 and is then introduced as one of the core characters at the beginning of Season 3. The Cat remains fairly constant throughout the whole series, as he appears or speaks mainly for comedic value. In S01E06, Rimmer makes a duplicate of himself, which explains the high number of lines by his character and mentions of his name in the script. You can see he disappears after Episode 2 of Season 7, in which his character is written out, until re-appearing in Season 8 (he does appear in S07E05, an episode dedicated to the rest of the crew reminiscing about him).

Holly, the ship’s computer, appears consistently at the beginning of the program until disappearing with the Red Dwarf towards the beginning of Season 6. He is later reintroduced when it returns at the beginning of Season 8.

Lister wants to bring back Kochanski as a hologram in S01E03, and she also appears in S02E04, as it is a time travel episode. She is introduced as one of the core cast members in Episode 3 of Season 7 and continues to be so until the end of the series.

Ace is Rimmer’s macho alter-ego from another dimension. He appears a couple of times in the series before S07E02, in which he is used as a plot device to write Rimmer out of the show for that season.

Appearances and mentions of other crew members of the Dwarf correspond to the beginning of the series and the end (Season 8), when they are reintroduced. The Captain, Hollister, appears much more frequently towards the end of the show.

Robots appear mainly as one-offs who are the focus of a single episode. The exceptions are the Scutters (Red Dwarf’s utility droids), whose appearances coincide with the parts of the show where the Dwarf exists, and the simulants, who are mentioned occasionally as villains / plot devices. The toaster and snarky dispensing machine also appear towards the beginning and end, with the former also having speaking parts in S04E04.

As mentioned before, the Dwarf gets destroyed towards the end of Season 5 and is not reintroduced until the beginning of Season 8. During this time, the crew live in one of the ship’s shuttlecraft, Starbug. You can also see that Starbug is mentioned more frequently in episodes where the crew go on excursions (e.g. Season 3, Episodes 1 and 2).

One of the recurring themes of the show is how much Lister really enjoys Indian food, particularly chicken vindaloo – that, and how he’d much rather just drink beer at the pub than do anything else. S04E02 (spike 1) features a monster, a Chicken Vindaloo man (don’t ask), and the whole premise of S07E01 (spike 2) is Lister wanting to go back in time to get poppadoms.

Thought this would be fun. Space is a consistent theme of the show, obviously. S07E01 is a time travel episode, and the episodes with Pete (Season 8, Episodes 6-7) at the end feature a time-altering device.

Conclusions

I recall talking to an associate of mine who recounted his experiences in a data analysis and programming workshop where the data set used was the Enron emails. As he quite rightly pointed out, he knew nothing about the Enron emails, so doing the analysis was difficult – he wasn’t quite sure what he was looking at, or what he should be expecting. He said he later used the Seinfeld scripts as a starting point, as this was at least something he was familiar with.

And that’s an excellent point. You don’t necessarily need to be a subject matter expert to be an analyst, but it sure helps to have some idea of what exactly you are analyzing. I would also think there’s a higher probability you care about what you are analyzing if you know something about it.

On that note, it was enjoyable to analyze the scripts in this manner, and to see something as familiar as a television show visualized as data like any other. I think the major themes and changes in the plotlines of the show were well represented in this way.

In terms of future directions, I tried looking at the correlation between terms using the findAssocs() function but got strange results, which I believe is due to the small number of documents. At a later point I’d like to do that properly, with a larger number of documents (perhaps tweets). This would also work better if synonym replacement for the characters was handled in the original corpus, instead of ad hoc after the fact (see code).
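For reference, the call itself is straightforward; a sketch assuming a document-term matrix over the episode corpus from before:

# Terms correlated with "rimmer" across episodes at r >= 0.5;
# with only ~51 documents these correlations come out noisy
dtm <- DocumentTermMatrix(scripts)
findAssocs(dtm, "rimmer", 0.5)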

Lastly, another thing I took away from all this is that cult TV shows have very, very devoted fan bases. Probably due to Wikipedia’s systemic bias, there is an awful lot about Red Dwarf on there, and elsewhere on the internet.

Resources

code and data on github: https://github.com/mylesmharrison/reddwarf

Red Dwarf Scripts (Lady of the Cake)