Quantified Self Toronto #15 – Text Message Analysis (rehash)

Tonight was Quantified Self Toronto #15.

Eric, Sacha and Carlos shared what they saw at the Quantified Self Conference in California.

I presented my data analysis of a year of my text messaging behaviour, albeit in slide-deck form.

Sharing my analysis was both awesome and humbling.

It was awesome because I received so many interesting questions about the analysis, and it sparked so much interesting discussion about communication, both during the meeting and after.

It was humbling because I received so many insightful suggestions about further analysis which could have been done, and which, in most cases, I had overlooked. These suggestions to dig deeper included analysis of:

  • Time interval between messages in conversations (Not trivial, I noted)
  • Total amount of information exchanged over time (length, as opposed to the number of messages)
  • Average or distribution of message length per contact, and per gender
  • Number of messages per day per contact, as a measure/proxy of relationship strength over time
  • Sentiment analysis of messages, aggregate and per contact (Brilliant! How did I miss that?)

Again, it was quite humbling and also fantastic to hear all these suggestions.

The thing about data analysis is that there are always so many ways to analyze the data (and make data visualizations), and it’s what you want to know and what you want to say that help determine how to best look at it.

It’s late, and on that note, I leave you with a quick graph of the weekly number of messages for several contacts, as a proxy of relationship strength over time (pardon my lack of labeling). So looking forward to the next meeting.

Carlos Rizo, Sacha Chua, Eric Boyd and Alan Majer are the organizers of Quantified Self Toronto. More can be found out about them on their awesome blogs, or by visiting quantifiedself.ca

What’s in My Pocket? Read it now! (or Read It Later)

Introduction

You know what’s awesome? Pocket.

I mean, sure, it’s not the first. I think Instapaper existed a little before (perhaps). And there are alternatives, like Google Reader. But Pocket is still my favorite. It’s pretty awesome at what it does.

Pocket (or Read It Later, as it used to be known) has fundamentally changed the way I read.

Before I had an Android phone I used to primarily read books. But applications like Pocket allow you to save an article from the web so you can read it later. As a big fan of reading (and also of procrastination), I found this a really great application to discover, and I'm quite glad I did. Now I can still catch up on the latest Lifehacker even if I am on the subway and don't have data connectivity.

Background

The other interesting thing about this application is that they make it fairly easy to get hold of your data. The website has an export function which dumps everything you've ever added to your reading list into a single HTML file.

Having the URL of every article you've ever read in Pocket is handy, as you can revisit all the articles you've saved. But there's more to it than that: the HTML export also contains the time each article was added (as a UNIX epoch timestamp). Combine this with an XML or JSON dump from the API, and now we've got some data to work with.
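For anyone who wants to play along, here's a minimal Python parsing sketch. It assumes each saved article appears in the export as a link carrying time_added (epoch seconds) and tags attributes; check your own export file for the exact attribute names.

```python
# Minimal sketch: parse the Pocket HTML export into a list of dicts.
# Assumes each article is an <a> tag with 'time_added' and 'tags'
# attributes; verify against your own export file.
from datetime import datetime
from bs4 import BeautifulSoup

with open("ril_export.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

articles = []
for link in soup.find_all("a"):
    articles.append({
        "url": link.get("href"),
        "title": link.get_text(strip=True),
        "added": datetime.fromtimestamp(int(link.get("time_added", "0"))),
        "tags": link.get("tags", ""),
    })

print(len(articles), "articles found")
```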

My data set comprises a list of 2975 URLs added to the application over the period 14/07/2011 – 19/09/2012. The data from the export includes the article ID, article URL, date added and updated, and tags added to each article.

In order to add to the data provided by the export functionality, I wrote a simple Python script using webarticle2text, which is available on GitHub. This script downloaded all the text from each article URL and continually appended it to a single text file, as well as doing a word count for each article and extracting the top-level domain (TLD).
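The script need not be much more complicated than the rough sketch below (I'm assuming webarticle2text's extractFromURL() helper here; see the GitHub repo in the Resources section).

```python
# Rough sketch of the download step: pull the text of every saved URL,
# append it to one big file, and record a domain and word count per URL.
from urllib.parse import urlparse
import webarticle2text  # https://github.com/chrisspen/webarticle2text

article_urls = [a["url"] for a in articles]  # from the export sketch above
word_counts = {}
with open("articles.txt", "w", encoding="utf-8") as out:
    for url in article_urls:
        try:
            text = webarticle2text.extractFromURL(url)
        except Exception:
            continue  # skip URLs the parser chokes on
        out.write(text + "\n")
        word_counts[url] = (urlparse(url).netloc, len(text.split()))
```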

Analysis

First of all we can take a very simple overview of all the articles I have saved by site:

And because pie-type charts make Edward R. Tufte (and some other dataviz people) cry, here is the same information in a bar chart:

Head and shoulders above all other websites, at nearly half of all articles saved, is Psychology Today. I would just like to be on the record as saying: don't hate. I know this particular publication is written in such a fashion that it is usually thought of as being slanted towards women, however I find the majority of articles to be quite interesting (as evidenced by the number of articles I have read). Perhaps other men are not that interested in the goings-on in their own and other people's heads, but I am (apparently).

Also, I think this is largely due to the design of the site. I commented before that using Pocket has changed the way I read. Well, one example of this is that I find I save a lot more articles from sites which have well designed mobile versions, as I primarily add articles from my phone. For this reason I can also see why I have saved so many articles from Psych Today, as their well-designed mobile site has made it easy to do so. Plus the article titles are usually enough to grab me.

You can have a look at their visually appealing mobile site if you are on a phone (the site detects whether you're on a desktop browser). The other top sites in the list also have well-designed mobile sites (e.g. The Globe and Mail, AskMen, Ars Technica).

Good mobile site design aside, I like reading psych articles, men’s magazines, news, and tech.

Next we examine the data with respect to time.

Unfortunately the Pocket export only provides two timestamps: time added and time 'updated'. Looking at the data, I believe the 'updated' timestamp applies to multiple actions on the article, like marking as read, adding tags, re-downloading, et cetera. It would be ideal to actually have the date/time when the article was marked as read, as then further interesting analysis could be done; for example, looking at the time interval between when articles were added and read, or the number of articles read per day.

Anyhow, we continue with what data are available. As in a previous post, we can get a high-level overview of the data with a scatterplot:

Pretty.
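For the curious, here's a rough matplotlib sketch of how such a plot can be drawn (not my exact code), continuing from the export-parsing sketch above.

```python
# Scatterplot sketch: date added on x, hour of day on y.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(articles)  # from the export-parsing sketch
df["hour"] = df["added"].dt.hour + df["added"].dt.minute / 60

plt.scatter(df["added"].dt.date, df["hour"], s=5, alpha=0.5)
plt.xlabel("Date added")
plt.ylabel("Hour of day")
plt.show()
```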

The most salient features which immediately stand out are the two distinct bands in the early morning and late afternoon. These correspond to when the majority of my reading is done, on my commute to and from work on public transit.

You can also see the general usage lining up with events in my personal life. The bands start in early October, shortly after I began my new job and started taking public transit. There is also a distinct gap from late December to early January when I was home visiting family over the Christmas holidays.

You can see that as well as being added while I am on public transit, articles are also added all throughout the day. This is as expected; I often add articles (either on my phone or via browser) over the course of the day while at work. Again, it would be interesting to have more data to look at this further, in particular knowing which articles were read or added from which platform.

I am uncertain about articles which are listed as being updated in the late hours of the evening. Although I do sometimes read articles (usually through the browser) at these hours, I think this may correspond to things like adding tags, or to a delay in syncing between my phone and the Pocket servers.

I played around with heatmaps and boxplots of the data with respect to time, but there was nothing particularly interesting which you can’t see from this scatterplot. The majority of articles are added and updated Monday to Friday during commute hours.

We can also look at the daily volume of articles added:

This graph looks similar to one seen previously in my post on texting. There are some days where very few articles are added and a few where there are a large number. Looking at the distribution of the number of articles added daily, we see an exponential-type distribution:
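In pandas terms, both the daily-volume plot and its distribution fall out of a single groupby; a sketch, continuing from above:

```python
# Daily article counts, plotted as a time series and as a histogram.
import matplotlib.pyplot as plt

daily = df.groupby(df["added"].dt.date).size()
daily.plot(title="Articles added per day")
plt.show()
daily.plot.hist(bins=30, title="Distribution of daily adds")
plt.show()
```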

Lastly we examine the content of the articles I read. As I said, all the article text was downloaded using Python and word counts were calculated for each. We can plot a histogram of this to see the distribution of the article length for what I’ve been reading:

Hmmmmm.

Well, that doesn’t look quite right. Did I really read an article 40,000 words long? That’s about 64 pages, isn’t it? Looking at the URLs for the articles with tens of thousands of words, I could see that those entries were malfunctions of the Pocket article parser, the webarticle2text script, or both. For example, the 40,000-word ‘article’ was a post on the Dictionary.com blog where the article parser also grabbed the entire comment thread.

Leaving the data as is, but zooming in on a more reasonable portion of the histogram, we see something a little more sensible:

This is a little more like what we expect. The bulk of the data are distributed between very short articles and those about 1500 words long. The spikes in the low end also correspond to failures of the article parsers.

Now what about the text content of the articles? I really do enjoy a good wordcloud, however, I know that some people tend to look down upon them. This is because there are alternate ways of depicting the same data which are more informative. However, as I said, I do enjoy them as they are visually appealing.

So firstly I will present the word content in a more traditional way. After removing stop words, the top 25 words found in the conglomerate file of the article text are as follows:

As you can see, there are issues with the download script as there is some garbage in there (div, the years 2011 and 2012, and garbage characters for “don’t” and “are”, or possibly “you’re”). But it appears that my recreational reading corresponds to the most common subjects of its main sources. The majority of my reading was from Psychology Today and so the number one word we see is “people”. I also read a lot of articles from men’s magazines, and so we see words which I suspect primarily come from there (“women”, “social”, “sex”, “job”), as well as the psych articles.
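For reference, a word count like this takes only a few lines of Python. A sketch, with a deliberately stubby stop-word list standing in for a proper one (e.g. NLTK's):

```python
# Sketch: top 25 words in the combined article text, minus stop words.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "you", "i", "for", "on", "with", "as", "are"}

with open("articles.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(w for w in words if w not in STOP_WORDS)
for word, n in counts.most_common(25):
    print(f"{word:12s} {n:6d}")
```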

And now the pretty visualization:

Seeing the content of what I read depicted this way has made me have some realizations about my interests. I primarily think of myself as a data person, but obviously I am genuinely interested in people as well.

I’m glad data is in there as a ‘big word’ (just above ‘person’), though maybe not as big as some of the others. I’ve just started to fill my reading list with a lot of data visualization and analysis articles as of late.

Well, that was fun, and somewhat educational. In the meantime, I’ll keep on reading. Because the moment you stop reading is the moment you stop learning. As Dr. Seuss said: “The more that you read, the more things you will know. The more that you learn, the more places you’ll go!”

Conclusions

  • Majority of reading done during commute on public transit
  • Number of articles added daily follows an exponential-type distribution
  • Most articles read from very short to ~1500 words
  • Articles focused on people, dating, social topics; more recently data

Resources

Pocket (formerly Read It Later) on Google Play:
https://play.google.com/store/apps/details?id=com.ideashower.readitlater.pro

Pocket export to HTML:
http://getpocket.com/export

Mediagazer Editor Lyra McKee: What’s In My Pocket
http://getpocket.com/blog/2012/09/mediagazer-editor-lyra-mckee-whats-in-my-pocket/

Founder/CEO of Pocket Nate Weiner: What’s In My Pocket
http://getpocket.com/blog/2012/08/nate-weiner-whats-in-my-pocket/

Pocket Trends (Data analysis/analytics section of Pocket Blog)
http://getpocket.com/blog/category/trends/

webarticle2text (Python script by Chris Spencer)
https://github.com/chrisspen/webarticle2text

omg lol brb txt l8r – Text Message Analysis, 2011-2012

Introduction

I will confess, I don’t really like texting. I communicate through text messages, because it does afford many conveniences, and occupies a sort of middle ground between actual conversation and email, but that doesn’t mean that I like it.

Even though I would say I text a fair bit, more than some other Luddites I know, I’m not a serial texter. I’m not like one of these 14-year-old girls who sends thousands of text messages a day (about what, exactly?).

I recall reading about one such girl in the UK who sent in excess of 100,000 text messages one month. Unfortunately her poor parents received a rather hefty phone bill, as she did this without knowing she did not have an unlimited texting plan. But seriously, what the hell did she write? Even if she only wrote one word per text message, 100,000 words is ~200 pages of text. She typed all that out on a mobile phone keyboard (or even worse, a touch screen)? That would be a sizeable book.

If you do the math it’s even crazier in terms of time. There are only 24 hours in the day, so assuming little Miss Teen Texter of the Year did not sleep, she still would have to send 100,000 messages in a 24 * 30 = 720 hour period, which averages out to about one message every 25 seconds. I think by that point there is really no value added to the conversations you are having. I’m pretty sure I have friends I haven’t said 100,000 words to over all the time that we’ve known each other.

But I digress.

Background

Actually getting all the data out turned out to be much easier than I anticipated. There exists an Android app which will not only back up all your texts (with the option of emailing the backup to you), but conveniently does so in an XML file with human-readable dates and a provided stylesheet (!). Import the XML file into Excel or other software and boom! You’ve got time-series data for every single text message you’ve ever sent.
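Reading the backup into Python is also only a few lines. A sketch, assuming the app's usual format, in which date is epoch milliseconds and type is 1 for received, 2 for sent (check your own backup file):

```python
# Sketch: parse the SMS Backup & Restore XML into a list of messages.
# Assumed attributes: date (epoch ms), type (1=received, 2=sent), body.
from datetime import datetime
import xml.etree.ElementTree as ET

root = ET.parse("sms_backup.xml").getroot()
messages = []
for sms in root.iter("sms"):
    messages.append({
        "when": datetime.fromtimestamp(int(sms.get("date")) / 1000),
        "direction": "sent" if sms.get("type") == "2" else "received",
        "contact": sms.get("address"),
        "length": len(sms.get("body") or ""),
    })

print(len(messages), "messages parsed")
```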

My data set spans the time from when I first started using an Android phone (July 2011) up to approximately the present, when I last created the backup (August 13th).

In total over this time period (405 days) I sent 3655 messages (~46.8%) and received 4151 (~53.2%), for a grand total of 7806 messages. This averages out to approximately 19 messages per day, or about 0.8 messages per hour (closer to 1.2 per hour if you only count the ~16 hours a day I’m awake). As I said, I’m not a serial texter. Also, I should probably work on responding to messages.

Analysis

First we can get a ‘bird’s eye view’ of the data by plotting a colour-coded data point for each message, with time of day on the y-axis and the date on the x-axis:
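A rough matplotlib sketch of how such a plot can be drawn, continuing from the XML-parsing sketch above:

```python
# Scatterplot sketch: one point per message, coloured by direction.
import pandas as pd
import matplotlib.pyplot as plt

sms = pd.DataFrame(messages)  # from the XML-parsing sketch
sms["hour"] = sms["when"].dt.hour + sms["when"].dt.minute / 60

for direction, colour in [("sent", "tab:blue"), ("received", "tab:red")]:
    sub = sms[sms["direction"] == direction]
    plt.scatter(sub["when"].dt.date, sub["hour"],
                s=4, alpha=0.4, color=colour, label=direction)
plt.xlabel("Date")
plt.ylabel("Hour of day")
plt.legend()
plt.show()
```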


Looks like the majority of my texting occurs between the hours of 8 AM to midnight, which is not surprising. As was established in my earlier post on my sleeping patterns, I do enjoy the night life, as you can see from the intermittent activity in the range outside of these hours (midnight to 4 AM). As Dr. Wolfram commented in his personal analytics posting, it was interesting to look at the plot and think ‘What does this feature correspond to?’ then go back and say ‘Ah, I remember that day!’.

It’s also interesting to see the back and forth nature of the messaging. As I mentioned before, the split in Sent and Received is almost 50/50. This is not surprising – we humans call these ‘conversations’.

We can cross-tabulate the data to produce a graph of the total daily SMS volume:
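In pandas, that cross-tabulation is a one-line resample (a sketch, continuing from above):

```python
# Total daily SMS volume.
import matplotlib.pyplot as plt

daily_sms = sms.set_index("when").resample("D").size()  # messages per day
daily_sms.plot(title="Text messages per day")
plt.show()
```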

It’s interesting to note the spiking phenomenon here, which appears somewhat periodic. This corresponds to the fact that there are some days where I do a lot of texting (i.e. carry on several day-long conversations), contrasted with days where I might have one smaller conversation, or just send a message or so to confirm something (‘We still going to the restaurant at 8?’ – ‘Yup, you know it’ – ‘Cool. I’m going to eat more crab than they hauled in on the latest episode of Deadliest Catch!’).

I appeared to be texting more back in the Fall, and my overall volume of text diminished slightly into the New Year. Looking back at some of the spikes, some corresponded to noteworthy events (birthday, Christmas, New Year’s), whereas others did not. For example, the largest spike, which occurred on September 3rd, just happened to be a day where I had a lot of conversations at once not related to anything in particular.

Lastly, through the magic of a Tableau dashboard (pa-zow!) we can combine these two interactive graphs for some data visualization goodness:


Next we make a histogram of the data to look at the distribution of the daily message volume. The spiking behaviour and variation in volume previously evident can be seen in the tail of the histogram dropping off exponentially:

Note that the black line is the density, not a fitted theoretical distribution

The daily volume follows what appears to be an exponential-type distribution (log-normal?). This is really neat to see, as I did not know what to expect (when in doubt, guess Gaussian), but it is not entirely shocking: other communication phenomena, such as the arrival of phone calls, have been shown to follow a Poisson process. Someone correct me if I am way out of line here.
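One way to do better than eyeballing is to fit a couple of candidate distributions with scipy and compare log-likelihoods. A sketch, and by no means a rigorous test:

```python
# Crude distribution check on the daily counts (zeros dropped,
# since the log-normal has no mass at zero).
import numpy as np
from scipy import stats

counts = daily_sms[daily_sms > 0].values
for dist in (stats.expon, stats.lognorm):
    params = dist.fit(counts)
    loglik = np.sum(dist.logpdf(counts, *params))
    print(f"{dist.name}: log-likelihood = {loglik:.1f}")
```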

Lastly we can analyze the volume of text messages per day of the week, by making a box plot:

Something’s not quite right here…

As we saw in the histogram, the data are of an exponential nature. Putting the y-axis on a logarithmic scale to correct for this, the box plot looks a little more how one would expect:

Ahhhh.

We can see that overall there tends to be a greater volume of texts Thursday to Sunday. Hmmm, can you guess why this is? 🙂
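For completeness, a sketch of the corrected (log-scale) box plot in matplotlib, continuing from the earlier sketches:

```python
# Box plot of daily message counts per weekday, log y-axis.
import matplotlib.pyplot as plt

order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]
per_day = (sms.assign(date=sms["when"].dt.date,
                      dow=sms["when"].dt.day_name())
              .groupby(["date", "dow"]).size().reset_index(name="n"))
plt.boxplot([per_day.loc[per_day["dow"] == d, "n"] for d in order],
            labels=order)
plt.yscale("log")
plt.ylabel("Messages per day")
plt.show()
```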

This can be further broken down with a heat map of the total hourly volume per day of week:

This is way easier to make in Tableau than in R.
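It's not too bad in Python either; a pivot table gets you most of the way there (a sketch, again continuing from the parsed messages):

```python
# Heat map sketch: hourly message volume by day of week.
import matplotlib.pyplot as plt

pivot = (sms.assign(hour=sms["when"].dt.hour,
                    dow=sms["when"].dt.dayofweek)
            .pivot_table(index="hour", columns="dow",
                         values="length", aggfunc="count"))
plt.imshow(pivot, aspect="auto", origin="lower")
plt.xlabel("Day of week (0 = Monday)")
plt.ylabel("Hour of day")
plt.colorbar(label="Messages")
plt.show()
```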

As seen previously in the scatterplot, the majority of messages are concentrated between the hours of 8 AM (here it looks more like 10 AM) and midnight. In line with the boxplot just above, most of that traffic is towards the weekend. In particular, the majority of the messages were mid-to-late afternoon on Fridays.

We have thus far mainly been looking at my text messages as time-series data. What about the content of the texts I send and receive?

Let’s compare the distribution of message lengths, sent versus received. Since there are an unequal number of Sent and Received messages, I stuck with a density plot:

Line graphs are pretty.
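A sketch of how the density comparison can be drawn (pandas' density plot uses a kernel density estimate under the hood, so scipy is required):

```python
# Density of message length, sent vs. received.
import matplotlib.pyplot as plt

for direction in ("sent", "received"):
    sms.loc[sms["direction"] == direction, "length"].plot.density(label=direction)
plt.xlabel("Message length (characters)")
plt.legend()
plt.show()
```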

Interestingly, again, the data are distributed in an exponential fashion.

You can see distinctive humps at the 160-character mark. This is due to longer messages being broken down into multiple messages under the maximum length. Some carriers (or phones?) don’t break up the messages, and so there are a small number of messages longer than the ‘official’ limit.

Comparing the blue and red lines, you can see that in general I tend to be wordier than my friends and acquaintances.

Lastly, we can look at the written content. I do enjoy a good wordcloud, so we can plunk the message contents into R and create one:

Names blurred to protect the innocent (except me!).

What can we gather from this representation of the text? Well, nothing I didn’t already know… my phone isn’t exactly a work BlackBerry.

Conclusions

  • Majority of text message volume is between 10 AM to midnight
  • Text messages split approximately 50/50 between sent and received due to conversations
  • Daily volume is distributed in an exponential fashion (Poisson?)
  • Majority of volume is towards the end of the week, especially Friday afternoon
  • I should be less wordy (isn’t that the point of the medium?)
  • Everybody’s working for the weekend

References & Resources

SMS Backup and Restore @ Google Play
https://play.google.com/store/apps/details?id=com.riteshsahu.SMSBackupRestore&hl=en

Tableau Public
http://www.tableausoftware.com/public/community

How much do I weigh? – Quantified Self Toronto #12

Recently I spoke at the Quantified Self Toronto group (you can find the article on the other talks here).

It was in late November of last year that I decided I wanted to lose a few pounds. I read most of The Hacker’s Diet, then began tracking my weight using the excellent Libra Android application. Though my drastic reductions in caloric intake are no more (and so my weight is now fairly steady), I continue to track my weight day-to-day and build the dataset. Perhaps later I can analyze the patterns in my weight’s fluctuations, separate from the goal of weight loss.

What follows is a rough transcription of the talk I gave, illustrated by the accompanying slides.

Hello Everyone, I’m Myles Harrison and today I’d like to present my first experiment in quantified self and self-tracking. And the name of that experiment is “How Much Do I Weigh?”

So I want to say two things. First of all, at this point you are probably saying to yourself, “How much do I weigh? Well, geez, that’s kind of a stupid question… why don’t you just step on a scale and find out?” And that’s one of the things I discovered as a result of doing this: sometimes it’s not necessarily that simple. But I’ll get to that later in the presentation.

The second thing I want to say is that I am not fat.

However, there are not many people I know who, if you asked them, “Hey, would you like to lose 5 or 10 pounds?”, would answer no. The same is true for myself. So late last November I decided that I wanted to lose some weight and perhaps get into slightly better shape. Being the sort of person I am, I didn’t go to the gym, I didn’t go to a personal trainer, and I didn’t meet with my doctor to discuss my diet. I just Googled stuff. And that’s what led me to this:

The Hacker’s Diet, by John Walker. Walker was one of the co-founders of Autodesk, the company which created the popular AutoCAD software and later went on to become a giant multinational. Mr. Walker woke up one day and had a realization. He realized that he was very successful, very wealthy, and had a very attractive wife, but he was fat. Really fat. And so John Walker thought, “I’ve used my intelligence and analytical thinking to get all these other great things in my life, why can’t I apply my intelligence to the problem of weight, and solve it the same way?” So that’s exactly what he did. And he lost 70 pounds.

Walker’s method was this. He said, let’s forget all about making this too complicated. Let’s look at the problem of health and weight loss as an engineering problem. So there’s just you:

and your body is the entire system, and the only things we’re going to think about are this system’s inputs and outputs. I don’t care if you’re eating McDonald’s, or Subway, or spaghetti 3 times a day. We’re just talking about the amount of input – how much? Therefore, from this incredibly simplified model of the human body, the way to lose weight is just to ensure that the inputs are less than the outputs.

IN < OUT

Walker realized that this ‘advice’ is so simple and obvious that it is nearly useless in itself. He compared it to a wise financial guru who, on being asked by an apprentice how to make money on the stock market, gives the advice: “It’s simple, buy low and sell high.” Still, this is the framework we have as a starting point, so we proceed from here.

So now this raises the question, “Okay well how do we do that?” Well, this is a Quantified Self meet up, so as you’ve probably guessed, we do it by measuring.

We can measure our inputs by counting calories and keeping track of how much we eat. Measuring output is a little more difficult. It is possible to approximate the number of calories burned when exercising, but actually measuring how much energy you are using on a day-to-day basis, just walking around, sitting, going to work, sleeping, etc. is more complicated, and likely not practically possible. So instead, we measure weight as a proxy for output, since this is what we are really concerned with in the first place anyhow. i.e. Are we losing weight or not?

Okay, so we know now what we’ve got to do. How are we going to keep track of all this? Walker, being a technical guy, suggests entering all the information into a piece of computer software, oh, say, I don’t know, like a certain spreadsheet application. This way we can make all kinds of graphs, find the weighted moving average, and do all kinds of other analysis. But I didn’t do that. Now don’t get me wrong, I love data and I love analyzing it, and I would love doing all those different types of things. However, why would I use a piece of software that I hate (and am forced to use on a regular basis) any more than I already have to? Especially when this is the 21st century, I have a perfectly good smartphone, and somebody already wrote the software to do it for me!

So, I’m good! Starting in late November of last year I followed the Hacker’s Diet directions and weighed myself every day (or nearly every day, as often as I could) at approximately the same time of day. And along the way, I discovered some things.

One day I was at work and I got a text from my roommate which said, “Myles, did you draw a square on the bathroom floor in black permanent marker?” To which I responded, “Why yes I did.” To which the response was “Okay, good.” And the reason that I drew a square on the tiles of the bathroom floor in black permanent marker was observational error. More specifically, measurement error.

If you know anything about your typical drugstore bathroom scale, you probably know that they are not really that accurate. If you put the same scale on an uneven surface (say, like the tiles of a bathroom floor) you can take the same measurement back-to-back and get wildly different values. That is to say, the scales have a lot of random error in their measurement. And that’s why I drew that square on the bathroom floor: it was my attempt to control measurement error, by placing the scale in as close to the same position as I could every morning when I weighed myself. Otherwise you get into this sort of bizarre situation where you start thinking, “Okay, so is the scale measuring me or am I measuring the scale?” And if we are attempting to collect some meaningful data and do a quantified self experiment, that is not the sort of situation we want to be in.

So I continued to collect data from last November up until today. And this is what it looks like.

As you can see, like most dieters, I was very ambitious at the start and lost approximately 5 pounds between late November and the tail end of December. The data gap, followed by a large upswing, corresponds to the Christmas holidays, when I went off my diet. After that I continued to lose weight, albeit somewhat more gradually, up until about mid-March, and since then I have ever-so-slowly been gaining it back, mostly due to the fact that I have not been watching my input as much as I was before.

So, what can we take away from this graph? Well, from my simple ‘1-D’ analysis, we can see a couple of things. The first thing, which should be a surprise to no one, is that it is a lot easier to gain weight than it is to lose it. I think most everyone here (and all past dieters) already knew that. 

Secondly, my diet aside, it is remarkable to see how much variability there is in the daily measurements. True, some of this may be due to the aforementioned measurement error, however in my readings online I also found that a person’s weight can vary by as much as 1 to 3 pounds on a day-to-day basis, due to various biological factors and processes.

Walker comments on this variability in The Hacker’s Diet. It is one of the reasons he gives for why weighing oneself every day and looking at the moving average is important, if you want to be able to really track whether or not a diet is working. And that’s why doing things like Quantified Self are important, and also what I was alluding to earlier when I said that the question of “How much do I weigh?” is not so simple. It’s not simply a matter of stepping on the scale and looking at a number, because that number varies on a daily basis and isn’t a truly accurate measurement of how much you ‘really’ weigh.
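Incidentally, Walker’s trend line is just an exponentially smoothed moving average (he suggests 10% smoothing, if I remember the book correctly), which is a one-liner in pandas. The numbers below are made up for illustration:

```python
# Exponentially smoothed weight trend, Hacker's Diet style.
import pandas as pd

weights = pd.Series(  # hypothetical daily weigh-ins, in pounds
    [181.0, 180.2, 181.4, 179.8, 180.6, 179.2, 179.9],
    index=pd.date_range("2011-11-28", periods=7),
)
trend = weights.ewm(alpha=0.1).mean()  # 10% smoothing factor
print(trend.round(1))
```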


This ties into the third point that I wanted to draw from the data: the human body is not like a light switch, it’s more like a thermostat. I remember reading about a study which psychologists did to measure people’s understanding of delayed feedback. They gave people a room with a thermostat, but the thermostat responded only after a very long delay, on the order of several hours. The participants were tasked with getting the room to stay at a set temperature, however none of them could, because people (or most people, anyhow) do not intuitively understand things like delayed feedback. The participants kept fiddling with the thermostat, setting it higher and lower because they thought it wasn’t working, and so the temperature in the room always ended up fluctuating wildly. They were responding to what they saw the temperature to be, when they should have been responding to what the temperature was going to be.

And I think this is a good analogy for the problem with dieting and why it can be so hard. This is why it can be easy to become frustrated and difficult to tell if a diet is working or not. Because if you just step on the scale every day and look at that one number, you don’t see the overall picture, and it can be hard to tell whether you’re losing weight or not. And if you just see that one number, you’d never realize that though I can eat a pizza today and weigh the same tomorrow, it’s not until 3 days later that I have gained 2 pounds. It’s a problem of delayed feedback. And that’s one of the really interesting conclusions I came to as a result of performing this experiment.

So where does this leave us for the future?

Well, I think I did a pretty good job of measuring my weight almost every day and was able to draw some interesting conclusions from my simple ‘1-D’ analysis. However, though I did very well tracking the output, I did not track any of my inputs whatsoever. In the future, if I kept track of this as well (for instance by counting calories) I would have more data and be able to draw some more meaningful conclusions about how my diet is impacting my weight.

Secondly, there is one other thing I did not do at all: exercise. This is something Walker gets to later in his book (as do most diet/health books), however I did not implement any kind of exercise routine or measurement thereof.

In the future I think if I implement these two things, as well as continuing with my consistent measurement of my weight, then perhaps I could ‘get all the way there’.

 

|—————| 100%

 
That was my presentation, thank you for listening. If you have any questions I will be happy to answer them.

References / Resources

Libra Weight Manager for Android
https://play.google.com/store/apps/details?id=net.cachapa.libra 

The Hacker’s Diet
http://www.fourmilab.ch/hackdiet/www/hackdiet.html 

Quantified Self Toronto
http://quantifiedself.ca/