Introduction
You know what’s still awesome? Pocket.
As I noted in an earlier post (oh god, was that really more than a year ago?!), I started using the Pocket application, previously known as Read It Later, in July of 2011, and it has shaped my reading behavior ever since.
Lately I’ve been thinking a lot about quantified self and how I’m not really tracking anything anymore. Something that was noted at one of the Meetups is that data collection is really the hurdle: like anything in life – voting, marketing, dating, whatever – you have to make it easy, otherwise most people probably won’t bother to do it. I’m pretty sure there’s a psychological term for this – something involving the word ‘threshold’.
That’s where smartphones come in. Some people have privacy concerns about having all their data in the cloud (obviously I don’t, as I’m willing to put myself on display here on the blog) but that aside, one of the cool things about smartphone apps is that you are passively creating lots of data. Over time this results in a data set about you. And if you know how to pull that data, you can analyze it (and hence yourself). I did this previously, for instance with my text messages and also with the data from Pocket collected up to that time.
So let’s give it a go again, but this time with a different focus for the analysis.
Background
The Pocket export is quasi-XML, so it’s a simple matter of writing some R code using the XML library to get the data into a format we can work with (CSV):
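The R snippet itself is not reproduced in this text. As a rough stand-in, here is a minimal Python sketch of the same idea using only the standard library. It assumes each saved article appears in the export as an `<a>` element with `href`, `time_added` (Unix seconds) and `tags` (comma-separated) attributes; that layout, the `PocketExportParser` and `export_to_csv` names, and the sample data are all assumptions for illustration, not the code used for this post.

```python
import csv
import io
from datetime import datetime, timezone
from html.parser import HTMLParser


class PocketExportParser(HTMLParser):
    """Collect one record per <a> element found in the export text."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Assumed attribute: time_added holds a Unix timestamp in seconds.
        stamp = int(attr.get("time_added") or 0)
        added = datetime.fromtimestamp(stamp, tz=timezone.utc)
        self._current = {
            "url": attr.get("href") or "",
            "title": "",
            "added": added.strftime("%Y-%m-%d"),
            "tags": attr.get("tags") or "",  # empty string means untagged
        }

    def handle_data(self, data):
        # Text inside the open <a> is the article title.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.records.append(self._current)
            self._current = None


def export_to_csv(html_text):
    """Parse the export text and return the records as a CSV string."""
    parser = PocketExportParser()
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["url", "title", "added", "tags"])
    writer.writeheader()
    writer.writerows(parser.records)
    return out.getvalue()


# Made-up sample in the assumed layout, not real data from the post.
SAMPLE = """
<ul>
<li><a href="http://example.com/a" time_added="1352000000" tags="psych,sex">First article</a></li>
<li><a href="http://example.com/b" time_added="1362000000" tags="">Second article</a></li>
</ul>
"""

print(export_to_csv(SAMPLE))
```

Keeping the tags as a single comma-separated column makes the CSV easy to eyeball; splitting them into one column per tag can happen later, when the analysis needs it.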
Analysis
Tagged vs. Untagged
You can see that initially I resisted tagging articles, but starting in November I adopted it and began tagging almost all the articles I added. And because stacked area graphs are not especially good data visualization, here is a line graph of the number of articles tagged per month:
Which better shows that I gradually adopted tagging from October into November. Another thing to note from this graph is that my Pocket usage peaked between November of last year and May of this year, after which the number of articles added each month decreased significantly (hence the previous graph being proportional).
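The counts behind both graphs (tagged vs. untagged per month, and the tagged proportion) are simple to compute. The following is a small Python sketch of that bookkeeping, assuming each record has an `added` date string in `YYYY-MM-DD` form and a comma-separated `tags` string that is empty for untagged articles. The function names and the records are made up for illustration.

```python
from collections import defaultdict


def monthly_tag_counts(records):
    """Return {'YYYY-MM': (tagged, untagged)} sorted by month."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        month = rec["added"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        is_tagged = bool(rec["tags"].strip())
        counts[month][0 if is_tagged else 1] += 1
    return {month: tuple(pair) for month, pair in sorted(counts.items())}


def tagged_share(counts):
    """Proportion of each month's articles that carry at least one tag."""
    return {
        month: tagged / (tagged + untagged)
        for month, (tagged, untagged) in counts.items()
    }


# Hypothetical records, not the real data set.
records = [
    {"added": "2012-10-03", "tags": ""},
    {"added": "2012-10-17", "tags": "psych"},
    {"added": "2012-10-20", "tags": ""},
    {"added": "2012-11-02", "tags": "news,tech"},
    {"added": "2012-11-09", "tags": "dating"},
]

counts = monthly_tag_counts(records)
for month, share in tagged_share(counts).items():
    print(month, counts[month], round(share, 2))
```

The absolute counts feed the line graph, while the shares are what a proportional stacked area graph would show.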
Next we examine the number of articles by subject area. I’ve collected them into more-or-less meaningful groups and will explain the different tags as we go along. Note the changing scale on the y-axes for these graphs, as the absolute number of articles varies greatly by category.
Psych & Other Soft Topics
As I noted in the earlier post, when I started using Pocket I initially read a very large number of psych articles.
I also read a fair number of “personal development” articles (read: self-helpish – mainly from The Art of Manliness) which has decreased greatly as of late. The purple are articles on communications, the light blue “parapsych”, which is my catchall for new-agey articles relating to things like the zodiac, astrology, mentalism, mythology, etc. (I know it’s all nonsense, but hey it’s good conversation for dinner parties and the next category).
The recent big spike came from a cool site I found with lots of articles on the zodiac (see: The Barnum Effect). Most of these later got deleted.
Dating & Sex
Now that I have your attention… what, you don’t read articles on sex? The Globe and Mail’s Life section has a surprising number of them. Also, if you read men’s magazines online there are a lot, most of which are actually pretty awful. You can see too that articles on dating made up a large proportion of my reading back in the fall, also from those types of sites (which thankfully I now visit far less frequently).
News, etc.
This next graph is actually a bit busy for my liking, but I found this data set somewhat challenging to visualize overall, given the number of categories and how they change in time.
News is just that. Tech is mostly the internet and gadgets. Jobs is anything career related. Finance covers both finance in the news (macro) and personal finance. Marketing is a newcomer.
Web & Data
The data tag relates to anything data-centric – as of late more applied to big data, data science and analytics. Interestingly my reading on web analytics preceded my new career in it (January 2013), just like my readings in marketing did – which is kind of cool. It also goes to show that if you read enough about analytics in general you’ll eventually read about web analytics.
Data visualization is a tag I created recently, so it has very few articles – many of which I would previously have tagged with ‘data’.
Life & Humanities
If that other graph was a little too busy this one is definitely so, but I’m not going to bother to break it out into more graphs now. Articles on style are of occasional interest, and travel has become a recent one. ‘Living’ refers mainly to articles on city life (mostly from The Globe as well as the odd one from blogto).
Work
And finally some newcomers, making up the minority, related to work:
SEO is search engine optimization and dev refers to development, web and otherwise.
Gee, that was fun, and kind of enlightening. But tagging in Pocket is like in Gmail – it is not one-to-one but many-to-one: a single article can carry several tags. So next I thought to try to answer the question: which tags are most related? That is, which tags are most commonly applied to articles together?
To do this we again turn to R, and the following code snippet, building on the previous one, does the trick:
All this does is remove the untagged articles from the tag frame and then compute the correlation between each pair of columns in the tag matrix. I’m no expert on exotic correlation coefficients, so I simply used the standard one (Pearson’s). In the case of simple binary variables (true / false, such as here), the internet informs me that this reduces to the phi coefficient.
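To make that concrete, here is a small Python sketch of the same steps: drop the untagged articles, build a 0/1 column per tag, and correlate each pair of columns. It also computes the phi coefficient directly from the 2 x 2 table of counts so the two can be compared. This is an illustration with made-up tag strings and hypothetical function names, not the R code used for the post, and it assumes every tag is applied to at least one but not all of the tagged articles (otherwise the variance is zero and the correlation is undefined).

```python
from itertools import combinations
from math import sqrt


def tag_matrix(tag_strings):
    """Return {tag: [0/1 per tagged article]}, skipping untagged articles."""
    tagged = [
        {t.strip() for t in s.split(",") if t.strip()}
        for s in tag_strings
        if s.strip()
    ]
    tags = sorted(set().union(*tagged))
    return {tag: [1 if tag in row else 0 for row in tagged] for tag in tags}


def pearson(x, y):
    """Standard Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def phi(x, y):
    """Phi coefficient computed from the 2 x 2 table of co-occurrence counts."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom


# Made-up tag strings; the empty one is an untagged article and gets dropped.
articles = ["food,health", "food,health", "psych", "data", "psych,sex", "", "data,news"]

columns = tag_matrix(articles)
for a, b in combinations(columns, 2):
    print(a, b, round(pearson(columns[a], columns[b]), 3), round(phi(columns[a], columns[b]), 3))
```

On 0/1 columns the two functions agree up to floating-point error, which is why running an ordinary Pearson correlation on the tag matrix is a reasonable shortcut here.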
Given there are 30 unique tags, this creates a 30 x 30 matrix, which is visualized below as a heatmap:
Redder is negative, greener is positive. I neglected to add a legend here as when not using ggplot or a custom function it is kind of a pain, but some interesting relationships can still immediately be seen. Most notably food and health articles are the most strongly positively correlated while data and psych articles are most strongly negatively correlated.
Other interesting relationships are that psych articles are negatively correlated with jobs, tech and web analytics (surprise, surprise) and positively correlated with communications, personal development and sex; news is positively correlated with finance, science and tech.
Conclusion
Though apps often make our lives easier with passive data collection, all this information being “in the cloud” does raise questions of data ownership (and governance) and I do wish more companies, large and small, would make it easier for us to get a hold of our data when we want it.
Because at the end of the day, it is ultimately our data that we are producing – and it’s the things it can tell us about ourselves that make it valuable to us.