Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA without a hint of skepticism or irony), people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can go around really calling themselves a data scientist because no one even really knows what data science is yet; there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve done the first real work which I think falls under the category of data science.

I’m no Python guru, but I threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time, that’s for a later article. For now, here’s one of my exploratory plots, of the content rating by date. Already we can gather from this that, at least at Newgrounds, 4-and-a-half stars equals perfection.
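
If you want to poke at the dump before my write-up, here’s a minimal R sketch of the sort of exploratory plot above. The file name and column names (date, rating) are hypothetical – adjust them to match however you save the scrape.

# Hypothetical file and column names -- adjust to match the actual dump
portal <- read.csv("newgrounds_metadata.csv", stringsAsFactors = FALSE)
portal$date <- as.Date(portal$date)

# One point per portal submission: rating vs. date submitted
plot(portal$date, portal$rating, pch = 20, cex = 0.3, col = rgb(0, 0, 0, 0.2),
     xlab = "Date submitted", ylab = "Rating (stars)")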

Sure feels like science.

The Hour of Hell of Every Morning – Commute Analysis, April to October 2012

Introduction

So a little while ago I quit my job.

Well, actually, that sounds really negative. I’m told that when you are discussing large changes in your life, like finding a new career, relationship, or brand of diet soda, it’s important to frame things positively.

So let me rephrase that – I’ve left the job I previously held to pursue other directions. Why? Because I have to do what I love. I have to move forward. And I have to work with data. It’s what I want, what I’m good at, and what I was meant to do.

So onward and upward to bigger, brighter and better things.

But I digress. The point is that my morning commute has changed.

Background

I really enjoyed this old post at Omninerd, about commute tracking activities and an attempt to use some data analysis to beat traffic mathematically. So I thought, hey, I’m commuting every day, and there’s a lot of data being generated there – why not collect some of it and analyze it too?

The difference here being that I was commuting with public transit instead of driving. So yes, the title is a bit dramatic (it’s an hour of hell in traffic for some people, I actually quite enjoy taking the TTC).

When I initially started collecting the data, I had intended to time both my commute to and from work. Unfortunately, I discovered that, due to having a busy personal and professional life outside of the 9 to 5, there was little point in tracking my commute at the end of the work day, as I was very rarely going straight home (I was ending up with a very sparse data set). I suppose this was one point of insight into my life before even doing any analysis in this experiment.

So I just collected data on the way to work in the morning.

Without going into the personal details of my life in depth, my commute went something like this:

  • walk from home to station
  • take streetcar from station west to next station
  • take subway north to station near place of work
  • walk from subway platform to place of work

Punching the route into Google Maps tells me the entire distance is 11.5 km. As we’ll see from the data, my travel time was pretty consistent and on average took about 40 minutes every morning (I knew this even before beginning the data collection). So my speed with all three modes of transportation averages out to ~17.25 km/hr. That probably doesn’t seem that fast, but if you’ve ever driven in Toronto traffic, trust me, it is.
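
For the curious, the back-of-the-envelope arithmetic in R (11.5 km over two-thirds of an hour):

> 11.5 / (40 / 60)
[1] 17.25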

In terms of the methodology for data collection, I simply used the stopwatch on my phone, starting it when I left my doorstep and stopping it when reaching the revolving doors by the elevators at work.

So all told, I kept track of the date, starting time and commute length (and therefore end time). As with many things in life, hindsight is 20/20, and looking back I realized I could have collected the data in a more detailed fashion by breaking it up for each leg of the journey.

This occurred to me towards the end of the experiment, and so I did this for a day. Though you can’t do much data analysis with just this one day, it gives a general idea of the typical structure of my commute:

Okay, that’s fun and all, but that’s really an oversimplification as the journey is broken up into distinct legs. So I made this graphic which shows the breakdown for the trip and makes it look more like a journey. The activity / transport type is colour-coded the same as the pie chart above. The circles are sized proportionally to the time spent, as are the lines between each section.

There should be another line coming from the last circle, but it looks better this way.

Alternatively the visualization can be made more informative by leaving the circles sized by time and changing the curve lengths to represent the distance of each leg travelled. Then the distance for the waiting periods is zero and the graphic looks quite different:

I really didn’t think the walk from my house was that long in comparison to the streetcar. Surprising.

Cool, no? And there’s an infinite number of other ways you could go about representing that data, but we’re getting into the realm of information design here. So let’s have a look at the data set.

Analysis

So first and foremost, we ask the question: is there a relationship between the starting time of my morning commute and the length of that commute? That is to say, does how early I leave for work in the morning impact how long it takes me to get there, regardless of which day it is?

Before even looking at the data this is an interesting question to consider, as you could assume (I would venture to say, know for a fact) that departure time is an important factor for a driving commute, since speed is directly impacted by congestion, which depends on the number of people commuting at any given time.

However, I was taking public transit, and I’m fairly certain congestion doesn’t affect it as much. Plus I headed in the opposite direction of most commuters (away from the downtown core). So is there a relationship here?

Looking at this graph we can see a couple of things. First of all, there doesn’t appear to be a salient relationship between the commute start time and duration. Some economists are perfectly happy to run a regression and slam a trend line through a big cloud of data points, but I’m not going to do that here. Maybe if there were a lot of points I’d consider it.

The other reason I’m not going to do that is that you can see from looking at this graph that the data are unevenly distributed. There are more large values and outliers in the middle, but that’s only because the majority of my commutes started between ~8:15 and ~9:20, so that’s where most of the data lie.

You can see this if we look at the distribution of starting hour:

I’ve included a density plot as well so I don’t have to worry about bin-sizing issues, though it should be noted that in this case it gives the impression of continuity when there isn’t any. It does help illustrate the earlier point however, about the distribution of starting times. If I were a statistician (which I’m not) I would comment on the distribution being symmetrical (i.e. not skewed) and on its kurtosis.
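
For reference, this kind of histogram-plus-density plot is only a couple of lines in R. A sketch, assuming the same commute data frame and starthour column used in the ANOVA further down:

# Distribution of commute start hour, with a kernel density overlay
hist(commute$starthour, breaks = 20, freq = FALSE,
     xlab = "Commute start hour", main = "")
lines(density(commute$starthour))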

The distribution of commute duration, on the other hand, is skewed:

I didn’t have any morning where the combination of my walking and the TTC could get me to North York in less than a half hour.

Next we look at commute duration and starting hour over time. The black line is a 5-day moving average.
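
The moving average itself is easy to compute with a centred filter. A sketch, assuming the commute data frame is sorted by date and has date and time (duration in minutes) columns:

# 5-day centred moving average of commute duration
ma5 <- stats::filter(commute$time, rep(1/5, 5), sides = 2)
plot(commute$date, commute$time, pch = 20,
     xlab = "Date", ylab = "Commute duration (minutes)")
lines(commute$date, ma5, lwd = 2)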

Other than several days near the beginning of the experiment in which I left for work extra early, the average start time for the morning trip did not change greatly over the course of the months. It looks like there might be some kind of pattern in the commute duration though, with those peaks?

We can investigate if this is the case by comparing the commute duration per day of week:

There seems to be slightly more variation in the commute duration on Monday, and it takes a bit longer on Thursdays? But look at the y-axis. These aren’t big differences, we’re talking about a matter of several minutes here. The breakdown for when I leave each day isn’t particularly earth-shattering either:

Normally, I’d leave it at that, but are these differences significant? We can do a one-way ANOVA and check:

> aov1 <- aov(starthour ~ weekday, data = commute)
> aov2 <- aov(time ~ weekday, data = commute)
> summary(aov1)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4  0.456  0.1140     0.7  0.593
Residuals   118 19.212  0.1628
> summary(aov2)
             Df Sum Sq Mean Sq F value Pr(>F)
weekday       4   86.4   21.59   1.296  0.275
Residuals   118 1965.4   16.66

This requires making a lot of assumptions about the data, but assuming they’re true, these results tell us there aren’t statistically significant differences in either the average commute start time or the average commute duration per weekday.

That is to say, on average, it took about the same amount of time per day to get to work and I left around the same time.

This is in stark contrast to what people talk about around the water cooler when they’re discussing their commute. I’ve never done any data analysis on a morning drive myself (or seen any, other than the post at Omninerd), but there are likely more clearly defined weekly patterns to your average driving commute than what we saw here with public transit.

Conclusions

There are a couple of ways you can look at this.

You could say there were no earth-shattering conclusions as a result of the experiment.

Or you could say that, other than the occasional outlier (of the “Attention All Passengers on the Yonge-University-Spadina line” variety), the TTC is remarkably consistent over the course of the week, as is my average departure time (which is astounding given my sleeping patterns).

It’s all about perspective. So onward and upward, until next time.

Resources

How to Beat Traffic Mathematically

TTC Trip Planner
myTTC (independently built by an acquaintance of mine – check out more of his cool work at branigan.ca)
FlowingData: Commute times in your area, mapped [US only]

OECD Data Visualization Challenge: My Entry

The people behind Visualising are doing some great things. As well as providing access to open data sets, an embeddable data visualization player, and a massive gallery of data visualizations, they are building an online community of data visualization enthusiasts (and professionals) with their challenges and events.

In addition, those behind it (Seed and GE) are also connecting people in the real world with their data visualization marathons for students, which are looking to be the dataviz equivalent of the ever-popular hackathons held around the world. As far as I know no one else is really doing this sort of thing, with a couple of notable exceptions – for example, Datakind and their Data Dive events (though these are not strictly visualization-focused).

Okay, enough copious hyperlinking.

The latest challenge was to visualize the return on education around the world using some educational indicators from the OECD, and I thought I’d give it a go in Tableau Public.

For my visualization I chose to highlight the differences in the return on education not only between nations, but also the gender-based differences for each country.

I incorporated some other data from the OECD portal on GDP and public spending on education, and so the countries included are those with data present in all three sets.

The World Map shows the countries, coloured by GDP. The bar chart to the right depicts the public spending on education, both tertiary (blue) and non-tertiary (orange), as a percentage of GDP.

The scatterplots contrast both the gender-specific benefit-cost ratios per country, as well as between public (circles) and private (squares) benefit, and between the levels of education. A point higher up on the plots and to the left has a greater benefit-cost ratio (BCR) than a point lower and to the right, which represents a worse investment. The points are sized by internal rate-of-return (ROR).

All in all it was fun and not only did I learn a lot more about using Tableau, it gave me a lot of food for thought about how to best depict data visually as well.

Top 5 Tips for Communicating Data

Properly communicating a message with data is not always easy.

If it were, everyone could do it, and there wouldn’t be questions at the end of presentations, discussions around the best way to tweak a scatterplot, or results to a Google Images search for chartjunk.

Much has been written on the subject of how to properly communicate data, and there’s a real art and science to it. Many fail to appreciate this, which can result in confusion – about the message trying to be conveyed, the salience of various features of the data being presented, or why the information is important.

There’s a lot to be said on the subject, but keep these 5 tips for communicating data in mind, and when you have a data-driven message to get across they will help you do so with clarity and precision.

1. Plan: Know What You Want to Say

Just like you wouldn’t expect an author to write a book without a plot, or an entrepreneur to launch a new venture without a business plan, you can’t expect to march blindly into creating a report or article using data without knowing what you want to say.

Sometimes all the analysis will have already been done, and so you just need to think about how to best present it to get your message across. What variables and relationships are most important? What is the best way to depict them? Why oh why am I using aquamarine in this bar chart?

Other times figuring out your exact message will come together with the analysis, and so you would instead start with a question you want to answer, like “How effective has our new marketing initiative been over the last quarter?” or “How has the size of the middle class in Canada changed over the last 15 years?”

2. Prepare: Be Ready

As I reflected upon in a previous post, sometimes people fail to recognize that just getting the information and putting it in the proper shape is a part of the process that should not be overlooked.

Before you even begin to think about communicating your message, you need to make sure you have the data available and in a format (or formats) that you can comfortably work with. You should also consider what data are most important and how to treat them accordingly, and if any other sets should also be included (see Tip #3).

On this same note, before launching into the analysis or creation of the end product (article, report, slidedeck, etc.) it is important to think about whether you are ready in terms of tools. What software packages or analysis environments will be used for the data analysis? What applications will be used to create the end product, whatever it may be?

3. Frame: Context is Key

Another important tip to remember is to properly frame your message by placing the data in context.

Failure to follow this tip results in simply serving up information – data are being presented but there is no message being communicated. Context answers the questions “Why is this important?” and “How is this related to x, y, and z?”

Placing the data in context allows the audience to see how it relates to other data, and why it matters. Do not forget about context, or you will have people asking why they should care about what you are trying to communicate.

4. Simplify: Less is More

Let me be incredibly clear about this: more is not always better. If you want to get a message across, simpler is better. Incredibly complicated relationships can be discussed, depicted, and dissected, but that doesn’t mean that your article, slide or infographic needs to look like a spreadsheet application threw up all over it.

Keep the amount of information that your audience has to process at a time (per slide, paragraph, or figure) small. Relationships and changes should be clearly depicted and key differences highlighted with differences in colour or shape. The amount of text on graphs should be kept to a minimum, and if this is not possible, then perhaps the information needs to be presented in a different way.

The last thing you want to do is muddle your message with information overload and end up confusing your audience.

5. Engage: It’s Useless If No One Knows It Exists

In the world of business, when creating a report or presenting some data, the audience is often predefined. You create a slidedeck to present to the VP and if your data are communicated properly (because you’ve followed Tips 1-4, wink wink) then all is well and you’re on your way to the top. You email the report and it gets delivered to the client and your dazzling data analysis skills make them an even greater believer in your product. And so on.

In other cases though, like when writing a blog post or news article, your audience may not be picked out for you and so it’s also your job to engage them. All your dazzling data analysis and beautiful visual work will contribute nothing if no eyeballs are laid upon it. For this reason, another tip to remember is to engage interested parties, either directly or indirectly through channels such as social media.

What Are You Waiting For?

So there are your Top 5 Tips for Communicating Data. Like I said, it’s not always easy. Keep these tips in mind, and you’ll ask yourself the right questions before you give all the answers.

Go. Explore the data, and be great. Happy communicating.

Quantified Self Toronto #15 – Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared about what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slidedeck form.

Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and so much interesting discussion about communications was had, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis which could have been done, and which, in most cases, I had overlooked. These suggestions to dig deeper included analysis of:

  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact, and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it’s what you want to know and what you want to say that help determine how to best look at it.

It’s late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.

Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. More can be found out about them on their awesome blogs, or by visiting quantifiedself.ca

What’s in My Pocket? Read it now! (or Read It Later)

Introduction

You know what’s awesome? Pocket.

I mean, sure, it’s not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It’s pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I used to primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. Being a big fan of reading (and also procrastination) this was a really great application for me to discover, and I’m quite glad I did. Now I can still catch up on the latest Lifehacker even if I am on the subway and don’t have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get a hold of your data. The website has an export function which allows you to dump all your data for everything you’ve ever added to your reading list into HTML.

Having the URL of every article you’ve ever read in Pocket is handy, as you can revisit all the articles you’ve saved. But there’s more to it than that. The HTML export also contains the time each article was added (in UNIX epoch). Combine this with an XML or JSON dump from the API, and now we’ve got some data to work with.
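
As a rough illustration, here is how the export could be pulled apart in R. This sketch assumes each saved article appears in the HTML as a link with href and time_added attributes (your export may differ), and the file name here is made up:

# Pull URLs and added-times out of the Pocket HTML export
raw   <- readLines("ril_export.html", warn = FALSE)
links <- unlist(regmatches(raw, gregexpr("<a [^>]+>", raw)))
urls  <- sub('.*href="([^"]*)".*', "\\1", links)
added <- as.POSIXct(as.numeric(sub('.*time_added="([^"]*)".*', "\\1", links)),
                    origin = "1970-01-01")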

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 – 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionalities, I wrote a simple Python script using webarticle2text, which is available on github. This script downloaded all the text from each article URL and continually added it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:

And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:

Head and shoulders above all other websites, at nearly half of all articles saved, is Psychology Today. I would just like to be on the record as saying – don’t hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women; however, I find the majority of articles to be quite interesting (as evidenced by the number of articles I have read). Perhaps other men are not that interested in the goings-on in their own and other people’s heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (it detects if the browser is a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men’s magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two categories: time added and time ‘updated’. Looking at the data, I believe this ‘updated’ definition applies to multiple actions on the article, like marking as read, adding tags, re-downloading, et cetera. It would be ideal to actually have the date/time when the article was marked as read, as then further interesting analysis could be done. For example, looking at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.

The most salient features which immediately stand out are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done, on my commute to and from work on public transit.

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours of the evening. Although I sometimes do read articles (usually through the browser) in these hours, I think this may correspond to things like adding tags, or a delay in syncing between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can’t see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.

We can also look at the daily volume of articles added:

This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where there are a large number. Looking at the distribution of the number of articles added daily, we see an exponential type distribution:
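
A sketch of how those daily counts and their distribution can be tallied, reusing the added timestamps from the parsing snippet earlier:

# Articles added per calendar day, and the distribution of that daily count
daily <- table(as.Date(added))
plot(as.Date(names(daily)), as.numeric(daily), type = "h",
     xlab = "Date", ylab = "Articles added")
hist(as.numeric(daily), breaks = 30, xlab = "Articles added per day", main = "")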

Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I’ve been reading:

Hmmmmm.

Well, that doesn’t look quite right. Did I really read an article 40,000 words long? That’s about 64 pages, isn’t it? Looking at the URLs for the articles with tens of thousands of words, I could see that those entries were due to malfunctions of either the Pocket article parser, the webarticle2text script, or both. For example, the 40,000-word article was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensical:

This is a little more like what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes in the low end also correspond to failures of the article parsers.

Now what about the text content of the articles? I really do enjoy a good wordcloud, though I know that some people tend to look down upon them, since there are alternative ways of depicting the same data which are more informative. However, as I said, I do enjoy them as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:

As you can see, there are issues with the download script as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for “don’t” and “are”, or possibly “you’re”). But it appears that my recreational reading corresponds to the most common subjects of its main sources. The majority of my reading was from Psychology Today and so the number one word we see is “people”. I also read a lot of articles from men’s magazines, and so we see words which I suspect primarily come from there (“women”, “social”, “sex”, “job”), as well as the psych articles.
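
For the curious, a minimal sketch of how those counts can be generated with the tm package, assuming the concatenated article text sits in a single file (the file name here is made up):

library(tm)

# Read, lowercase and strip non-letters, then drop stop words
words <- scan("articles.txt", what = "character", quote = "")
words <- tolower(gsub("[^[:alpha:]']", "", words))
words <- words[nchar(words) > 0 & !(words %in% stopwords("en"))]

# Top 25 words by frequency
head(sort(table(words), decreasing = TRUE), 25)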

And now the pretty visualization:

Seeing the content of what I read depicted this way has led me to some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I’m glad data is in there as a ‘big word’ (just above ‘person’), though maybe not as big as some of the others. I’ve just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I’ll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: “The more that you read, the more things you will know. The more that you learn, the more places you’ll go!”

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily of exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What’s In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text

Don’t Do Journey: Karaoke and a Data Analysis Musing

“DON’T DO JOURNEY!!” The look of terror and disbelief in her eyes was both sudden and palpable.

What can I say? People feel very strongly about karaoke. Ever since this joy/terror was gifted/unleashed upon the world, it seems that there is no shortage of people who have very strong feelings about it.

It’s kind of a love/hate relationship. People love it. Or they hate it. Or they love to hate it. Or they hate the fact that they love it. Either way, it’s kind of surprising how polarizing it can be.

There’s a place here in Toronto that’s quite popular for it. Well, actually I don’t know how popular it is, but they do have it five nights a week. As I was looking at their website one day, I had one of those “oh, neat” moments – the contents of their entire karaoke songbook, a list of all 32,636 songs, is available in PDF format.

Slam that into a PDF to CSV converter…. tidy up a little, and we’ve got data!
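
A sketch of the kind of tally that follows, assuming the tidied-up CSV has an Artist column (file and column names hypothetical):

# Top 10 artists by number of songs in the songbook
songbook <- read.csv("karaoke_songbook.csv", stringsAsFactors = FALSE)
head(sort(table(songbook$Artist), decreasing = TRUE), 10)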

So what’s the most available to sing at the Fox if you happen to be feeling courageous enough? The Top 10:

Hail to The King, baby.

Traditional? Standard? What the heck? I’ve never even heard of those artists! Are those some 70’s rock bands like The Eagles or…. oh, right. That makes sense. Really, traditional and standard should be the same category.

After traditional songs, no one can touch The King, followed by Ol’ Blue Eyes with about half as many songs. Just in case you were wondering, the next 10 spots after Celine Dion are a lot of country followed by The Stones.

And that, unfortunately, is it. Which brings us to my musing on data analysis.

On a very simplistic high level, you could say that there are 3 steps to data analysis:

1. Get the data
2. Make with the analysis
3. Write up report/article/paper/post for management/news outlet/academic journal/blog

And like I said, that is a massive oversimplification. Because really, you can break each step into many sub-steps, which don’t necessarily flow in order and could be iterative. For example, Step 1:

1a. Get the data
1b. Decide if there are any other data you need
1c. Get that data 
1d. Clean and process data in usable format
1e. ….

Et cetera. My roommate and I were having a discussion on these matters, and he quite astutely pointed out that many people take Step 1 for granted. Worse yet, some don’t appreciate that there is more to Step 1 than 1a.

And that is why this is another short post with only one graph. Because there’s only so much analysis you can do with Artist, Title and Song ID. There are options to pull in a whole bunch more data: Gracenote (though they appear to be a bit stingy with their API), freedb, MusicBrainz, and Discogs. But I’m not going to set up a local SQL server or write a bunch of code right now, though it would be interesting to see an in-depth analysis taking into consideration things like song length, year, genre, and lyric content, to name a few.

As my roommate and I were talking, he pointed out that if you had a karaoke machine (actually I think it’s computers with iTunes now) which kept track of all the songs picked, there’d be something more interesting to analyze: What is the distribution of the popularity of songs? How frequently are different songs of different genres and years picked?

We agreed that it’s most likely exponential (as many things are) – Don’t Stop Believin’ probably gets picked almost once a night, but there are likely many, many other songs that have never been (and probably never will be) picked. And lastly, I’m always left wondering, how many singers are actually in tune for more than half the song?

FBI iPhone Leak Breakdown

Don’t know if you heard, but something that is making the news today is that hacker group AntiSec purportedly gained control of an FBI agent’s laptop and got a hold of 12 million UDIDs which were apparently being tracked.

A UDID is Apple’s unique identifier for each of its ‘iDevices’, and if known could be used to get a lot of personally identifiable information about the owner of each product.

The hackers released the data on pastebin here. In the interests of protecting the privacy of the users, they removed all said personally identifiable information from the data. This is kind of a shame in a way, as it would have been interesting to do an analysis of the geographic distribution of the devices which were (allegedly) being tracked, amongst other things. I suppose they released the data for more (allegedly) altruistic purposes – i.e. to let people find out if the FBI was tracking them, not to have the data analyzed.

The one useful column that was left was the device type. Surprisingly, the majority of devices were iPads. Of course, this could just be unique to the million and one records of the 12 million which the group chose to release.

Breakdown:
iPhone: 345,384 (34.5%)
iPad: 589,720 (59%)
iPod touch: 63,724 (6.4%)
Undetermined: 1,173 (0.1%)
Total: 1,000,001
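
The percentages themselves are just a frequency table; a quick sketch in R, assuming the released records are loaded into a data frame udid with a device_type column (both names hypothetical):

# Device breakdown as percentages of the released records
round(100 * prop.table(table(udid$device_type)), 1)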

Forgive me Edward Tufte, for using a pie chart.

omg lol brb txt l8r – Text Message Analysis, 2011-2012

Introduction

I will confess, I don’t really like texting. I communicate through text messages, because it does afford many conveniences, and occupies a sort of middle ground between actual conversation and email, but that doesn’t mean that I like it.

Even though I would say I text a fair bit, more than some other Luddites I know, I’m not a serial texter. I’m not like one of these 14-year-old girls who sends thousands of text messages a day (about what, exactly?).

I recall reading about one such girl in the UK who sent in excess of 100,000 text messages one month. Unfortunately her poor parents received a rather hefty phone bill, as she did this without knowing she did not have an unlimited texting plan. But seriously, what the hell did she write? Even if she only wrote one word per text message, 100,000 words is ~200 pages of text. She typed all that out on a mobile phone keyboard (or even worse, a touch screen)? That would be a sizeable book.

If you do the math it’s even crazier in terms of time. There are only 24 hours in the day, so assuming little Miss Teen Texter of the Year did not sleep, she still would have to send 100,000 messages in a 24 * 30 = 720 hour period, which averages out to about one message every 25 seconds. I think by that point there is really no value added to the conversations you are having. I’m pretty sure I have friends I haven’t said 100,000 words to over all the time that we’ve known each other.
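
The quick sanity check in R, in seconds per message:

> (720 * 3600) / 100000
[1] 25.92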

But I digress.

Background

Actually getting all the data out turned out to be much easier than I anticipated. There exists an Android App which will not only back up all your texts (with the option of emailing it to you), but conveniently does so in an XML file with human-readable dates and a provided stylesheet (!). Import the XML file into Excel or other software and boom! You’ve got time series data for every single text message you’ve ever sent.
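
For reference, here’s a sketch of reading that backup straight into R with the XML package instead of going through Excel. It assumes the backup stores each message as an <sms> element with date (epoch milliseconds), type and body attributes:

library(XML)

doc  <- xmlParse("sms_backup.xml")
msgs <- getNodeSet(doc, "//sms")

# Build a data frame: timestamp, sent/received, message text
# (assumes the Android convention of type 1 = received, 2 = sent)
sms <- data.frame(
  when = as.POSIXct(as.numeric(sapply(msgs, xmlGetAttr, "date")) / 1000,
                    origin = "1970-01-01"),
  type = ifelse(sapply(msgs, xmlGetAttr, "type") == "2", "Sent", "Received"),
  body = sapply(msgs, xmlGetAttr, "body"),
  stringsAsFactors = FALSE
)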

My data set spans the time from when I first started using an Android phone (July 2011) up to approximately the present, when I last created the backup (August 13th).

In total over this time period (405 days) I sent 3655 messages (~46.8%) and received 4151 (~53.2%), for a grand total of 7806 messages. This averages out to approximately 19 messages per day total, or about 0.8 messages per hour. As I said, I’m not a serial texter. Also, I should probably work on responding to messages.

Analysis

First we can get a ‘bird’s eye view’ of the data by plotting a colour-coded data point for each message, with time of day on the y-axis and the date on the x-axis:


Looks like the majority of my texting occurs between the hours of 8 AM to midnight, which is not surprising. As was established in my earlier post on my sleeping patterns, I do enjoy the night life, as you can see from the intermittent activity in the range outside of these hours (midnight to 4 AM). As Dr. Wolfram commented in his personal analytics posting, it was interesting to look at the plot and think ‘What does this feature correspond to?’ then go back and say ‘Ah, I remember that day!’.

It’s also interesting to see the back and forth nature of the messaging. As I mentioned before, the split in Sent and Received is almost 50/50. This is not surprising – we humans call these ‘conversations’.

We can cross-tabulate the data to produce a graph of the total daily volume in SMS: 
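
One way to do that cross-tab, continuing with the sms data frame from the sketch above:

# Messages per day, split by sent vs. received, as a stacked bar chart
daily <- table(as.Date(sms$when), sms$type)
barplot(t(daily), legend.text = TRUE, border = NA,
        xlab = "Date", ylab = "Messages per day")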

Interesting to note here the spiking phenomenon, in what appears to be a somewhat periodic fashion. This corresponds to the fact that there are some days where I do a lot of texting (i.e. carry on several day-long conversations) contrasted with days where I might have one smaller conversation, or just send one message or so to confirm something (‘We still going to the restaurant at 8?’ – ‘Yup, you know it’ – ‘Cool. I’m going to eat more crab than they hauled in on the latest episode of Deadliest Catch!’).

I appeared to be texting more back in the Fall, and my overall volume of text diminished slightly into the New Year. Looking back at some of the spikes, some corresponded to noteworthy events (birthday, Christmas, New Year’s), whereas others did not. For example, the largest spike, which occurred on September 3rd, just happened to be a day where I had a lot of conversations at once not related to anything in particular.

Lastly, through the magic of a Tableau dashboard (pa-zow!) we can combine these two interactive graphs for some data visualization goodness:


Next we make a histogram of the data to look at the distribution of the daily message volume. The spiking behaviour and variation in volume previously evident can be seen in the tail of the histogram dropping off exponentially:

Note: that is the density in black, not a fitted theoretical distribution.

The daily volume follows what appears to be an exponential-type distribution (log-normal?). This is really neat to see, as I did not know what to expect (when in doubt, guess Gaussian), but it is not entirely shocking – other communication phenomena have been shown to follow Poisson processes (e.g. phone calls). Someone correct me if I am way out of line here.
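
If you wanted to check the log-normal hunch, one quick-and-dirty way (a sketch, not a proper test) is to fit one and eyeball a Q-Q plot:

library(MASS)

# Daily message counts (days with zero messages don't appear in the table)
vol <- as.numeric(table(as.Date(sms$when)))
fit <- fitdistr(vol, "lognormal")

# Observed daily volumes vs. quantiles of the fitted log-normal
qqplot(qlnorm(ppoints(length(vol)), fit$estimate["meanlog"], fit$estimate["sdlog"]),
       sort(vol), xlab = "Theoretical quantiles", ylab = "Observed daily volume")
abline(0, 1)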

Lastly we can analyze the volume of text messages per day of the week, by making a box plot:

Something’s not quite right here…

As we saw in the histogram, the data are of an exponential nature. Correcting the y-axis in this regard, the box plot looks a little more like one would expect:

Ahhhh.

We can see that overall there tends to be a greater volume of texts Thursday to Sunday. Hmmm, can you guess why this is? 🙂

This can be further broken down with a heat map of the total hourly volume per day of week:

This is way easier to make in Tableau than in R.

As seen previously in the scatterplot, the majority of messages are concentrated between the hours of 8 AM (here it looks more like 10) and midnight. In line with the boxplot just above, most of that traffic is towards the weekend. In particular, the largest share of messages fell in the mid-to-late afternoon on Fridays.
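
For what it’s worth, a rough approximation of that heat map in base R (no Tableau polish), again using the sms data frame from before; note that weekdays() returns English day names only in an English locale:

# Hour-of-day by day-of-week counts
hrs  <- factor(format(sms$when, "%H"), levels = sprintf("%02d", 0:23))
days <- factor(weekdays(sms$when),
               levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                          "Friday", "Saturday", "Sunday"))
hourly <- matrix(as.numeric(table(hrs, days)), nrow = 24)

image(x = 0:23, y = 1:7, z = hourly, xlab = "Hour of day", ylab = "", yaxt = "n")
axis(2, at = 1:7, labels = levels(days), las = 1)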

We have thus far mainly been looking at my text messages as time series data. What about the content of the texts I send and receive?

Let’s compare the distribution of message lengths, sent versus received. Since there are an unequal number of Sent and Received messages, I stuck with a density plot:

Line graphs are pretty.

Interestingly, again, the data are distributed in an exponential fashion.

You can see distinctive humps at the 160-character mark. This is due to longer messages being broken into multiple messages under the maximum length. Some carriers (or phones?) don’t break up the messages, and so there are a small number of messages with length greater than the ‘official’ limit.

Comparing the blue and red lines, you can see that in general I tend to be wordier than my friends and acquaintances.
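
That comparison amounts to a couple of density() calls; a sketch with the same sms data frame:

# Distribution of message length (characters), sent vs. received
len <- nchar(sms$body)
plot(density(len[sms$type == "Sent"]), xlab = "Message length (characters)", main = "")
lines(density(len[sms$type == "Received"]), lty = 2)
legend("topright", c("Sent", "Received"), lty = 1:2)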

Lastly, we can look at the written content. I do enjoy a good wordcloud, so we can plunk the message contents into R and create one:

Names blurred to protect the innocent (except me!).

What can we gather from this representation of the text? Well, nothing I didn’t already know…. my phone isn’t exactly a work Blackberry.
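
For completeness, the wordcloud really is only a few lines with the tm and wordcloud packages. A sketch, once more assuming the sms data frame from earlier:

library(tm)
library(wordcloud)

# Split message text into lowercase words, drop stop words and very short words
words <- tolower(unlist(strsplit(paste(sms$body, collapse = " "), "[^[:alpha:]']+")))
words <- words[nchar(words) > 2 & !(words %in% stopwords("en"))]

freq <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), max.words = 150, random.order = FALSE)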

Conclusions

  • Majority of text message volume is between 10 AM and midnight
  • Text messages split approximately 50/50 between sent and received due to conversations
  • Daily volume is distributed in an exponential fashion (Poisson?)
  • Majority of volume is towards the end of the week, especially Friday afternoon
  • I should be less wordy (isn’t that the point of the medium?)
  • Everybody’s working for the weekend

References & Resources

SMS Backup and Restore @ Google Play
https://play.google.com/store/apps/details?id=com.riteshsahu.SMSBackupRestore&hl=en

Tableau Public
http://www.tableausoftware.com/public/community