Quantified Self Toronto #15 – Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared about what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slidedeck form.

Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and there was so much interesting discussion about communications, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis I could have done, most of which I had overlooked. These suggestions to dig deeper included analysis of:

  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact, and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)
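
A few of these suggestions are easy to rough out. Below is a minimal Python sketch, not the analysis I presented: the sample messages, the (timestamp, contact, text) layout and the function names are all made up for illustration. It counts messages per contact per week, averages message length per contact, and measures the gap between consecutive messages with the same contact. It does not try to decide where one conversation ends and the next begins, which is the non-trivial part.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample data: (timestamp, contact, text). Not the real data set.
MESSAGES = [
    ("2012-01-02 09:00", "Alice", "Morning!"),
    ("2012-01-02 09:05", "Alice", "Coffee later?"),
    ("2012-01-03 18:30", "Bob", "Running late"),
    ("2012-01-10 12:00", "Alice", "Lunch?"),
]


def parse(rows):
    """Turn timestamp strings into datetime objects."""
    return [(datetime.strptime(ts, "%Y-%m-%d %H:%M"), contact, text)
            for ts, contact, text in rows]


def weekly_counts(messages):
    """Number of messages keyed by (contact, ISO year, ISO week)."""
    counts = defaultdict(int)
    for when, contact, _ in messages:
        year, week, _ = when.isocalendar()
        counts[(contact, year, week)] += 1
    return dict(counts)


def mean_length(messages):
    """Average message length in characters, per contact."""
    lengths = defaultdict(list)
    for _, contact, text in messages:
        lengths[contact].append(len(text))
    return {contact: mean(values) for contact, values in lengths.items()}


def gaps_minutes(messages):
    """Minutes between consecutive messages with the same contact."""
    times = defaultdict(list)
    for when, contact, _ in messages:
        times[contact].append(when)
    gaps = {}
    for contact, stamps in times.items():
        stamps.sort()
        gaps[contact] = [(later - earlier).total_seconds() / 60
                         for earlier, later in zip(stamps, stamps[1:])]
    return gaps


parsed = parse(MESSAGES)
print(weekly_counts(parsed))
print(mean_length(parsed))
print(gaps_minutes(parsed))
```

Summing the lengths instead of averaging them would give the "total amount of information exchanged" version of the same idea.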

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it’s what you want to know and what you want to say that help determine how to best look at it.

It’s late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.

Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. You can find out more about them on their awesome blogs, or by visiting quantifiedself.ca.

Don’t Do Journey: Karaoke and a Data Analysis Musing

“DON’T DO JOURNEY!!” The look of terror and disbelief in her eyes was both sudden and palpable.

What can I say? People feel very strongly about karaoke. Ever since this joy/terror was gifted/unleashed upon the world, it seems that there is no shortage of people who have very strong feelings about it.

It’s kind of a love/hate relationship. People love it. Or they hate it. Or they love to hate it. Or they hate the fact that they love it. Either way, it’s kind of surprising how polarizing it can be.

There’s a place here in Toronto that’s quite popular for it. Well, actually I don’t know how popular it is, but they do have it five nights a week. As I was looking at their website one day, I had one of these oh, neat moments – the contents of their entire karaoke songbook, a list of all 32,636 songs, is available in PDF format.

Slam that into a PDF to CSV converter…. tidy up a little, and we’ve got data!
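
Once the songbook is a CSV, counting songs per artist takes only a few lines. Here is a rough Python sketch of that step, assuming the tidied file has a header row with Artist, Title and Song ID columns. The exact header spelling, the sample rows and the function name are my own placeholders, not the real songbook.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the converted songbook (the real one has 32,636 rows).
SAMPLE_CSV = """Artist,Title,Song ID
Elvis Presley,Hound Dog,1001
Elvis Presley,Jailhouse Rock,1002
Elvis Presley,Suspicious Minds,1003
Frank Sinatra,My Way,2001
Frank Sinatra,Summer Wind,2002
Journey,Don't Stop Believin',3001
"""


def top_artists(csv_text, n=10):
    """Return the n artists with the most songs as (artist, count) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["Artist"].strip() for row in reader)
    return counts.most_common(n)


print(top_artists(SAMPLE_CSV, n=2))
```

For a file on disk, passing the open file object to csv.DictReader instead of the StringIO wrapper should work the same way.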

So which artists have the most songs available to sing at the Fox, if you happen to be feeling courageous enough? The Top 10:

Hail to The King, baby.

Traditional? Standard? What the heck? I’ve never even heard of those artists! Are those some 70’s rock bands like The Eagles or…. oh, right. That makes sense. Really, traditional and standard should be the same category.

After traditional songs, no one can touch The King, followed by Ol’ Blue Eyes with about half as many songs. Just in case you were wondering, the next 10 spots after Celine Dion are a lot of country followed by The Stones.

And that, unfortunately, is it. Which brings us to my musing on data analysis.

On a very simplistic high level, you could say that there are 3 steps to data analysis:

1. Get the data
2. Make with the analysis
3. Write up report/article/paper/post for management/news outlet/academic journal/blog

And like I said, that is a massive oversimplification. Because really, you can break each step into many sub-steps, which don’t necessarily flow in order and could be iterative. For example, Step 1:

1a. Get the data
1b. Decide if there are any other data you need
1c. Get that data 
1d. Clean and process data in usable format
1e. ….

Et cetera. My roommate and I were having a discussion on these matters, and he quite astutely pointed out that many people take Step 1 for granted. Worse yet, some don’t appreciate that there is more to Step 1 than 1a.

And that is why this is another short post with only one graph. Because there’s only so much analysis you can do with Artist, Title and Song ID. There are options for pulling in a whole bunch more data: Gracenote (but they appear to be a bit stingy with their API), freedb, MusicBrainz, and Discogs. But I’m not going to set up a local SQL server or write a bunch of code right now, though it would be interesting to see an in-depth analysis taking into consideration many things like song length, year, genre, and lyric content, to name a few.

As my roommate and I were talking, he pointed out that if you had a karaoke machine (actually I think it’s computers with iTunes now) which kept track of all the songs picked, there’d be something more interesting to analyze: What is the distribution of the popularity of songs? How frequently are different songs of different genres and years picked?

We agreed that it’s most likely exponential (as many things are) – Don’t Stop Believin’ probably gets picked almost once a night, but there are likely many, many other songs that have never been (and probably never will be) picked. And lastly, I’m always left wondering, how many singers are actually in tune for more than half the song?

FBI iPhone Leak Breakdown

Don’t know if you heard, but something that is making the news today is that hacker group AntiSec purportedly gained control of an FBI agent’s laptop and got a hold of 12 million UDIDs which were apparently being tracked.

A UDID is Apple’s unique identifier for each of its ‘iDevices’, and if known could be used to get a lot of personally identifiable information about the owner of each product.

The hackers released the data on pastebin here. In the interests of protecting the privacy of the users, they removed all said personally identifiable information from the data. This is kind of a shame in a way, as it would have been interesting to do an analysis of the geographic distribution of the devices which were (allegedly) being tracked, amongst other things. I suppose they released the data for more (allegedly) altruistic purposes – i.e. to let people find out if the FBI was tracking them, not to have the data analyzed.

The one useful column that was left was the device type. Surprisingly, the majority of devices were iPads. Of course, this could just be unique to the million and one records of the 12 million which the group chose to release.

iPhone: 345,384 (34.5%)
iPad: 589,720 (59%)
iPod touch: 63,724 (6.4%)
Undetermined: 1,173 (0.1%)
Total: 1,000,001
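
The percentages above are just each count divided by the total. A quick Python check, using the counts exactly as listed (the function name is only for illustration):

```python
# Device counts as listed in the breakdown above.
COUNTS = {
    "iPhone": 345384,
    "iPad": 589720,
    "iPod touch": 63724,
    "Undetermined": 1173,
}


def shares(counts):
    """Percentage share of each device type, rounded to one decimal place."""
    total = sum(counts.values())
    return {device: round(100 * n / total, 1) for device, n in counts.items()}


print(sum(COUNTS.values()))
print(shares(COUNTS))
```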

Forgive me Edward Tufte, for using a pie chart.

Let’s Go To The Ex!

I went to The Ex (that’s the Canadian National Exhibition for those of you not ‘in the know’) on Saturday. I enjoy stepping out of the ordinary from time to time and carnivals / fairs / midways / exhibitions etc. are always a great way to do that.

As far as exhibitions go, I believe the CNE is one of the more venerable – it’s been around since 1879 and attracts over 1.3 million visitors every year.

Looking at the website before I went, I saw that they had a nice summary of all the ride height requirements and number of tickets required. I thought perhaps the data could stand to be presented in a more visual form.

First, how about the number of tickets required for the different midways? All of the rides on the ‘Kiddie’ Midway require four tickets, except for one (The Wacky Worm Coaster). The Adult Midway rides are split about 50/50 for five or six tickets, except for one (Sky Ride) which only requires four.

With tickets being $1.50 each, or $1 if you buy them in sets of 22 or 55, that makes the ride price range $6-9 or $4-6. Assuming you buy the $1 tickets, the average price of an adult ride is $5.42 and the average price of a child ride $4.04.
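
The price ranges are simple multiplication, but here is the arithmetic spelled out as a small Python sketch. The ticket counts and prices are the ones quoted above; the function and constant names are just placeholders.

```python
# Ticket prices and per-ride ticket counts as quoted in the post.
SINGLE_PRICE = 1.50  # dollars per ticket when bought individually
PACK_PRICE = 1.00    # dollars per ticket when bought in packs of 22 or 55
MIN_TICKETS, MAX_TICKETS = 4, 6


def price_range(price_per_ticket):
    """Cheapest and priciest ride in dollars at a given ticket price."""
    return (MIN_TICKETS * price_per_ticket, MAX_TICKETS * price_per_ticket)


print(price_range(SINGLE_PRICE))  # range when paying $1.50 per ticket
print(price_range(PACK_PRICE))    # range when paying $1 per ticket
```

At $1 a ticket, the average price in dollars is also the average number of tickets per ride, so the adult midway averages 5.42 tickets a ride and the kiddie midway 4.04.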

The rides also have height requirements. Note that I’ve simplified things by taking the max height for cases where shorter/younger kids can ride supervised with an adult. Here’s a breakdown of the percentage of the rides in each midway type children can ride, given their height:

Google Docs does not allow non-stacked stepped area charts, so line graph it is.

And here’s the same breakdown with percentage of the total rides (both midways combined), coloured by type. This is a better way to represent the information, as it shows the discrete nature of the height requirement:

Basically if your child is over 4′ they are good for about 80% of all the rides at the CNE.

Something else to consider – how do you get maximum value from your tickets with none left over, given that they are sold in packs of 22 and 55? I would say go with the $36 all-you-can-ride option. Also, how minuscule are your actual odds of winning those carnival games? Because I want a giant purple plush gorilla.
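
For the curious, the no-tickets-left-over question is a small counting problem. The Python sketch below is a brute-force search, assuming rides cost 4, 5 or 6 tickets as described above. It ignores which specific rides you would actually want to go on, and the function name is made up.

```python
def exact_combos(total, costs=(4, 5, 6)):
    """Every way to spend exactly `total` tickets, as {ticket cost: number of rides}."""
    results = []

    def search(index, remaining, picked):
        # All ride costs considered: keep the combination only if nothing is left over.
        if index == len(costs):
            if remaining == 0:
                results.append(dict(zip(costs, picked)))
            return
        cost = costs[index]
        # Try every possible number of rides at this cost, then move to the next cost.
        for rides in range(remaining // cost + 1):
            search(index + 1, remaining - rides * cost, picked + [rides])

    search(0, total, [])
    return results


for combo in exact_combos(22):
    print(combo, "->", sum(combo.values()), "rides")
```

For a pack of 22 this finds four combinations, the best of which gets you five rides (for example, three four-ticket rides and two five-ticket rides). Calling exact_combos(55) does the same for the larger pack.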

See you next year!

Facebook Friends (in a graph)

I saw this post on FlowingData and thought, “Hey, I can do that, let’s give this Gephi thing a go.”

I don’t have that many Facebook friends, as I try to keep my network well maintained, and I’m also not a heavy user of the service. Also I’ve always kind of wondered – if you are one of those people who has 2000 Facebook friends, are they really all your ‘friends’? If I put you (you silly teenage girl) in a room with those 2000 people, would you be able to call all of them by name? Remember where you met them? What their favorite color is? I digress.

The steps to producing the graph are simple:
1. Install the netvizz Facebook application
2. Run it
3. Import gdf file into Gephi
4. Wow! A graph!

As I said, I don’t have that many Facebook friends but I still found the results pretty interesting:

Red is immediate family, green my Mom’s side and blue my Dad’s. The orange are university friends, and purple High School. Yellow are randoms and friends of friends. Teal is friends of my Mom’s relatives, and pink friends of one of the immediate family. Light blue is a group of friends made while travelling.

The nodes are sized by degree.
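
Degree is just the number of connections each person has, and you can compute it straight from the gdf file without opening Gephi. Here is a rough Python sketch, assuming the file follows the usual GDF layout: a nodedef> header line followed by node rows, then an edgedef> header line followed by edge rows, with the node id and label in the first two node columns and the two endpoints in the first two edge columns. The sample network is invented, and a real netvizz export may carry extra columns.

```python
import csv
from collections import Counter

# Hypothetical miniature GDF file; not my actual friend network.
SAMPLE_GDF = """nodedef>name VARCHAR,label VARCHAR
1,Alice
2,Bob
3,Carol
4,Dave
edgedef>node1 VARCHAR,node2 VARCHAR
1,2
1,3
2,3
3,4
"""


def degrees(gdf_text):
    """Return {label: degree} for every node in a GDF-formatted string."""
    labels = {}
    degree = Counter()
    section = None
    for row in csv.reader(gdf_text.splitlines()):
        if not row:
            continue
        if row[0].startswith("nodedef>"):
            section = "nodes"
        elif row[0].startswith("edgedef>"):
            section = "edges"
        elif section == "nodes":
            labels[row[0]] = row[1]
            degree[row[0]] += 0  # make sure isolated nodes show up with degree 0
        elif section == "edges":
            degree[row[0]] += 1
            degree[row[1]] += 1
    return {labels.get(node, node): count for node, count in degree.items()}


print(degrees(SAMPLE_GDF))
```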

Interesting point to note:
High school friend (purple, outlying from others) and friend of immediate family (pink, bottom right node) are both connected to friend of Mom’s family (teal node, bottom) through events totally unrelated to the rest of the network. Small world.

This is that case when you add a new friend on Facebook and it says you already have a mutual friend, and you stop and think, ‘Wait, we do? Sarah knows Thomas? But how did…. through who… when did…..? Huh.’

50 Shades of Grey Wordcloud

Sometimes you just want to see what all the fuss is about. File this under the ‘because I can’ category: I proudly (?) present – a wordcloud produced from the text of E. L. James’ “50 Shades of Grey”.

For a book which is getting all this press about being full of explicit sexuality, the data are not what you expect. Obviously the main characters’ names feature prominently, but if you ask me this visualization shows that this is another romance novel much like any other.

Sure, you probably wouldn’t expect to see the words ‘dominant’ (left, next to grey) and ‘submissive’ (right, next to don’t) in some other books of this type. But look at some of the other words which are largest besides the names of the main characters – eyes, head, hands, hair, voice, smile. Obviously, it’s not just about the sex.

Produced in R using the excellent tm and wordcloud packages.
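
The counting that sits underneath a wordcloud is simple enough to sketch. This is not the R code I used. It is a minimal Python illustration of the same idea: lowercase the text, drop common stopwords, and count what is left. The sample sentence and the tiny stopword list are made up, and a real run would use a proper stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a much fuller one.
STOPWORDS = {"a", "and", "her", "his", "the", "was"}

# Made-up sample text, not a quote from the book.
SAMPLE_TEXT = ("Her eyes met his eyes. His voice was low, "
               "and her hands found his hands. Eyes again.")


def word_frequencies(text, stopwords=STOPWORDS, n=5):
    """Return the n most common non-stopword words as (word, count) pairs."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)


print(word_frequencies(SAMPLE_TEXT, n=2))
```

A wordcloud then just draws each word at a size proportional to its count.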