How much do I weigh? – Quantified Self Toronto #12

Recently I spoke at the Quantified Self Toronto group (you can find the article on the other talk here).

It was in late November of last year that I decided I wanted to lose a few pounds. I read most of The Hacker’s Diet, then began tracking my weight using the excellent Libra Android application. Though the drastic reductions in my caloric intake are no more (and so my weight is now fairly steady), I continue to track my weight day-to-day and build the dataset. Perhaps later I can do an analysis of the patterns in my weight’s fluctuations, separate from the goal of weight loss.

What follows is a rough transcription of the talk I gave, illustrated by the accompanying slides.

Hello Everyone, I’m Myles Harrison and today I’d like to present my first experiment in quantified self and self-tracking. And the name of that experiment is “How Much Do I Weigh?”

So I want to say two things. First of all, at this point you are probably saying to yourself, “How much do I weigh? Well, geez, that’s kind of a stupid question… why don’t you just step on a scale and find out?” And one of the things I discovered as a result of doing this is that sometimes it’s not necessarily that simple. But I’ll get to that later in the presentation.

The second thing I want to say is that I am not fat.

However, there are not many people whom I know where if you ask them, “Hey, would you like to lose 5 or 10 pounds?” the answer would be no. The same is true for myself. So late last November I decided that I wanted to lose some weight and perhaps get into slightly better shape. Being the sort of person I am, I didn’t go to the gym, I didn’t go to a personal trainer, and I didn’t meet with my doctor to discuss my diet. I just Googled stuff. And that’s what led me to this:

The Hacker’s Diet, by John Walker. Walker was one of the co-founders of the company Autodesk, which created the popular AutoCAD software and later went on to become a giant multinational company. Mr. Walker woke up one day and had a realization. He realized that he was very successful, very wealthy, and had a very attractive wife, but he was fat. Really fat. And so John Walker thought, “I’ve used my intelligence and analytical thinking to get all these other great things in my life, why can’t I apply my intelligence to the problem of weight, and solve it the same way?” So that’s exactly what he did. And he lost 70 pounds.

Walker’s method was this. He said, let’s forget all about making this too complicated. Let’s look at the problem of health and weight loss as an engineering problem. So there’s just you:

and your body is the entire system, and the only things we’re going to think about are the system’s inputs and outputs. I don’t care if you’re eating McDonald’s, or Subway, or spaghetti 3 times a day. We’re just talking about the amount of input – how much? Therefore, from this incredibly simplified model of the human body, the way to lose weight is just to ensure that the inputs are less than the outputs.

IN < OUT

Walker realized that this ‘advice’ is so simple and obvious that it is nearly useless in itself. He compared it to the wise financial guru, on being asked how to make money on the stock market by an apprentice, giving the advice: “It’s simple, buy low and sell high.” Still, this is the framework we have as a starting point, so we proceed from here.

So now this raises the question, “Okay well how do we do that?” Well, this is a Quantified Self meet up, so as you’ve probably guessed, we do it by measuring.

We can measure our inputs by counting calories and keeping track of how much we eat. Measuring output is a little more difficult. It is possible to approximate the number of calories burned when exercising, but actually measuring how much energy you are using on a day-to-day basis, just walking around, sitting, going to work, sleeping, etc. is more complicated, and likely not practically possible. So instead, we measure weight as a proxy for output, since this is what we are really concerned with in the first place anyhow. i.e. Are we losing weight or not?

Okay, so we know now what we’ve got to do. How are we going to keep track of all this? Walker, being a technical guy, suggests entering all the information into a piece of computer software, oh, say, I don’t know, like a certain spreadsheet application. This way we can make all kinds of graphs and find the weighted moving average, and do all kinds of other analysis. But I didn’t do that. Now don’t get me wrong, I love data and I love analyzing it, and so I would love doing all those different types of things. However, why would I use a piece of software that I hate (and am forced to use on a regular basis) any more than I already have to? Especially when this is the 21st century and I have a perfectly good smartphone and somebody already wrote the software to do it for me!

So, I’m good! Starting in late November of last year I followed the Hacker’s Diet directions and weighed myself every day (or nearly every day, as often as I could) at approximately the same time of day. And along the way, I discovered some things.

One day I was at work and I got a text from my roommate, and it said “Myles, did you draw a square on the bathroom floor in black permanent marker?” To which I responded, “Why yes I did.” To which the response was “Okay, good.” And the reason that I drew a square on the tiles of the bathroom floor in black permanent marker was because of observational error. More specifically, measurement error.

If you know anything about your typical drugstore bathroom scale you probably know that they are not really that accurate. If you put the same scale on an uneven surface (say, like tiles on a bathroom floor) you can make the same measurement back-to-back and get wildly different values. That is to say the scales have a lot of random error in their measurement. And that’s why I drew that square on the bathroom floor. That was my attempt to control measurement error, by placing the scale in as close to the same position I could every morning when I weighed myself. Otherwise you get into this sort of bizarre situation where you start thinking, “Okay, so is the scale measuring me or am I measuring the scale?” And if we are attempting to collect some meaningful data and do a quantified self experiment, that is not the sort of situation we want to be in.
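To make the point concrete, here is a toy simulation in R (entirely made-up numbers) of what roughly a pound of random scale error looks like when laid over a slow, genuine downward trend:

# Toy simulation: a 'true' weight losing ~0.1 lb/day, observed through a
# scale with about +/- 1 lb of random error. All numbers are invented.
set.seed(42)
days <- 1:60
true_weight <- 185 - 0.1 * days
measured <- true_weight + rnorm(length(days), mean = 0, sd = 1)

plot(days, measured, xlab = "Day", ylab = "Measured weight (lbs)")
lines(days, true_weight, col = "blue", lwd = 2)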

So I continued to collect data from last November up until today. And this is what it looks like.

As you can see, like most dieters I was very ambitious at the start and lost approximately 5 pounds between late November and the tail end of December. That data gap, followed by a large upswing, corresponds to the Christmas holidays, when I went off my diet. After that I continued to lose weight, albeit somewhat more gradually, up until about mid-March, and since then I have ever-so-slowly been gaining it back, mostly due to the fact that I have not been watching my input as much as I was before.

So, what can we take away from this graph? Well, from my simple ‘1-D’ analysis, we can see a couple of things. The first thing, which should be a surprise to no one, is that it is a lot easier to gain weight than it is to lose it. I think most everyone here (and all past dieters) already knew that. 

Secondly, my diet aside, it is remarkable to see how much variability there is in the daily measurements. True, some of this may be due to the aforementioned measurement error, however in my readings online I also found that a person’s weight can vary by as much as 1 to 3 pounds on a day-to-day basis, due to various biological factors and processes.

Walker comments on this variability in the Hacker’s Diet. It is one of the reasons he gives for why weighing oneself every day and looking at the moving average is important, if you want to be able to really track whether or not a diet is working. And that’s why doing things like Quantified Self are important, and also what I was alluding to earlier when I said that the question of “How much do I weigh?” is not so simple. It’s not simply a matter of stepping on the scale and looking at a number to see how much you weigh. Because that number you see varies on a daily basis and isn’t a truly accurate measurement of how much you ‘really’ weigh.
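As a rough illustration of the kind of trend line Walker advocates, here is a minimal R sketch of an exponentially smoothed moving average over daily weigh-ins; the weigh-in values are invented, and the 10% smoothing factor is my recollection of the book’s suggestion:

# Exponentially smoothed trend: each day, move the trend a fraction of the
# way toward today's reading, which damps the day-to-day noise.
weights <- c(185.0, 184.2, 185.6, 184.8, 183.9, 185.1, 184.0, 183.5, 184.4, 183.2)

smooth_trend <- function(w, alpha = 0.1) {
  trend <- numeric(length(w))
  trend[1] <- w[1]
  for (i in 2:length(w)) {
    trend[i] <- trend[i - 1] + alpha * (w[i] - trend[i - 1])
  }
  trend
}

plot(weights, xlab = "Day", ylab = "Weight (lbs)")
lines(smooth_trend(weights), col = "red")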


This ties into the third point that I wanted to draw from the data. That point is that the human body is not like a light switch, it’s more like a thermostat. I remember reading about a study which psychologists did to measure people’s understanding of delayed feedback. They gave people a room with a thermostat, but the thermostat responded with a very long delay, on the order of several hours. The participants were tasked with getting the room to stay at a set temperature; however, none of them could. Because people (or most people, anyhow) do not intuitively understand things like delayed feedback. The participants in the study kept fiddling with the thermostat and setting it higher and lower because they thought it wasn’t working, and so the temperature in the room always ended up fluctuating wildly. The participants in the study were responding to what they saw the temperature to be, when they should have been responding to what the temperature was going to be.

And I think this is a good analogy for the problem with dieting and why it can be so hard. This is why it can be easy to become frustrated and difficult to tell if a diet is working or not. Because if you just step on the scale every day and look at that one number, you don’t see the overall picture, and it can be hard to tell whether you’re losing weight or not. And if you just see that one number you’d never realize that though I can eat a pizza today and weigh the same tomorrow, it’s not until 3 days later that I have gained 2 pounds. It’s a problem of delayed feedback. And that’s one of the really interesting conclusions I came to as a result of performing this experiment.

So where does this leave us for the future?

Well, I think I did a pretty good job of measuring my weight almost every day and was able to make some interesting conclusions from my simple ‘1-D’ analysis. However, though I did very well tracking the output, I did not track any of my inputs whatsoever. In the future, if I kept track of this as well (for instance, by counting calories) I would have more data and be able to draw some more meaningful conclusions about how my diet is impacting my weight.

Secondly, I did not do one other thing at all. I didn’t exercise. This is something Walker gets to later in his book too (as do most diet/health books); however, I did not implement any kind of exercise routine or measurement thereof.

In the future I think if I implement these two things, as well as continuing with my consistent measurement of my weight, then perhaps I could ‘get all the way there’.

|—————| 100%

That was my presentation, thank you for listening. If you have any questions I will be happy to answer them.

References / Resources

Libra Weight Manager for Android
https://play.google.com/store/apps/details?id=net.cachapa.libra 

The Hacker’s Diet
http://www.fourmilab.ch/hackdiet/www/hackdiet.html 

Quantified Self Toronto
http://quantifiedself.ca/ 

Google Domestic Trends

Google’s mission is to organize all the world’s information and make it universally accessible and useful. In following their mission, the company has produced some amazing tools which allow any internet user to do some data visualization without so much as having to open a spreadsheet.

One of these tools which I stumbled across the other day (and which apparently has existed for some time) is Google Domestic Trends.

I was previously aware of Google Trends, which allows a user to compare the popularity of different search terms, whether it be for serious reasons (e.g. Android vs. iPhone) or say, for something less serious. In Domestic Trends, Google has aggregated relevant search terms across different sectors of the economy, with the results presumably providing insight into market trends by sector (or at least the popularity of those market sectors with respect to time).
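If you want to poke at the numbers behind a comparison like that yourself, the Trends page lets you export the series as a CSV; below is a minimal R sketch of reading and plotting one, where the file name, the number of header lines to skip, and the column layout are all assumptions for illustration:

# Read a CSV exported from Google Trends and plot the two series.
# File name, skip count, and column order are assumptions.
trends <- read.csv("multiTimeline.csv", skip = 2, stringsAsFactors = FALSE)
names(trends) <- c("week", "android", "iphone")
trends$week <- as.Date(trends$week)

plot(trends$week, trends$android, type = "l", col = "darkgreen",
     xlab = "Week", ylab = "Relative search interest")
lines(trends$week, trends$iphone, col = "grey40")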

I am not an economist, but data are data, so here goes with the pithy commentary and observations.

Air Travel
It’s seasonal, unsurprisingly. Looks like there might be some deals over the holidays I was unaware of. Or that might be a really bad time to buy tickets.

Link

Auto Buyers
As Google notes on the Domestic Trends frontpage, July 2009 was when the U.S. Government instituted its “Cash for Clunkers” program. However, it was also when Toyota recalled almost half a million vehicles due to defective airbags. Oh yeah, and that spike in 2005 is related to the outrageous change in the gas prices of the time.

Link

Bankruptcy
New record. I’m glad I rent.

Link

Computers and Electronics
Seriously, who buys desktops anymore?

Link

Credit Cards
A poignant portrait of the changing state of the American economy and personal debt.

Link

Durable Goods
Merry Christmas honey, I got you a Roomba.

Link

Education
School’s out for summer.

Link

Jobs
I want to say that the little spike later in 2011 has nothing to do with employment and is due to Mr. Jobs retiring; however, then I would expect a much larger one to be in October.

Link

Mobile and Wireless
The iPhone was revealed to the public on January 9th, 2007 and went on sale in June of the same year. The iPhone 3G and 3GS came out in July 2008 and June 2009 respectively. The 4S was released in October 2011. Not sure about mid-2010. The Blackberry Torch came out in June but that would hardly warrant what we see here.

Link

Rental & Real Estate
Apparently it is quite seasonal. Peaks drop off around late July and early August. Students, I would guess.

Link

Shopping
We’ve seen this before. No surprises here.

Link

Unemployment
I know the word you’re thinking of. It’s on the tip of your tongue and it starts with ‘R’.

Link

See also: Google NGram Experiments.

rhok (n’ roll)

This past weekend was rhok Toronto, a fun, exhausting, educational, and all-around amazing weekend which I was honoured to be involved in.

The team I was fortunate enough to be a part of produced a prototype web service to promote fair housing and simplify the submission process for investigations into housing by-law violations. An added bonus was that this resulted in this nice visualization of more City of Toronto data.

You can learn more about rhok here.

11 Million Yellow Slips – City of Toronto Parking Tickets, 2008-2011

Introduction

I don’t know about you, but I really hate getting parking tickets. Sometimes I feel like it’s all just a giant cash grab. Really? I can’t park there between the hours of 11 and 3, but every other time is okay? Well, why the hell not?

But ah, such is life. Rules must be in place to keep civil order, keep the engines of city life running and prevent total chaos in the downtown core. However knowing this does not make coming out to the street to find that bright yellow slip of paper under your windshield wiper any easier.

Like everything else in the universe, parking tickets are a source of data. The great people at Open Data Toronto (@Open_TO) have provided all the data from every parking ticket issued in Toronto from 2008 to the end of last year.

So, let us dive in and have a look. We might just discover why we keep getting all these tickets, or at least ease the collective pain a little in realizing how many others are sharing in it.

Background

The data set is an anonymized record of every parking ticket issued in the city of Toronto from the period 01/01/2008 – 12/31/2011. The fields provided are: the anonymized ticket #, date of infraction, infraction code, description, fine amount, time of infraction, and location (address).

The data set and more information can be found in Open Data Toronto’s data catalogue here.

Originally I had this brilliant idea to geocode every data point, and then create an awesome heat map of the geographical distribution of parking tickets issued. However, given that there are ~11 million records and the Google Maps API has a limit of 2,500 geocoding requests per day, even if I was completely diligent and performed the task daily it would still take approximately 4,400 days, or about 12 years, to complete. And no, I am not paying to use the API for Business (which at a limit of 100,000 requests per day would still take ~3.5 months).
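The back-of-the-envelope arithmetic, for anyone who wants to check it:

# Rough arithmetic for the geocoding limits mentioned above.
records <- 11e6           # ~11 million tickets
free_limit <- 2500        # geocoding requests per day (free API)
business_limit <- 100000  # geocoding requests per day (API for Business)

records / free_limit           # ~4400 days
records / free_limit / 365     # ~12 years
records / business_limit / 30  # ~3.7 months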

If anyone knows a way around this, please drop me an email and fill me in.

Otherwise, you can check out prior art. Patrick Cain at Global News created an awesome interactive map of aggregated parking ticket data from 2010 for locations in the city where over 500 tickets were issued. This turns out to be mainly hospitals, and unsurprisingly, tickets are clustered in the downtown core. Mr. Cain did a similar analysis while at the Toronto Star back in 2009, using data from the previous year.

I just don’t like throwing out data points.

Analysis

Parking Infractions by Type 
Next we consider the parking tickets for the period by infraction type. A simple bar chart outlines the most common parking ticket types:

We will consider those codes which stick out most on the bar chart (the top 10):

> sort(codeTable, decreasing=TRUE)[1:11]
    005     029     210     003     207     009     002     008     006     015
2336433 1822690 1366945 1354671  933478  718692  496283  443706  369079 173078
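For reference, codeTable above is presumably nothing more than a frequency table of the infraction codes; a minimal sketch of how it (and the bar chart) could be built follows, though the file and column names are my assumptions about the data set:

# Build a frequency table of infraction codes and plot the top 10.
# The file and column names below are assumptions.
tickets <- read.csv("parking_tickets.csv", stringsAsFactors = FALSE)
codeTable <- table(tickets$infraction_code)

barplot(sort(codeTable, decreasing = TRUE)[1:10], las = 2,
        xlab = "Infraction code", ylab = "Number of tickets")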

Putting that into more human-readable format, the most commonly issued types of parking infractions were:

1. 005 – Park on Highway at Prohibited Time of Day
2. 029 – Park Prohibited Place/Time – No Permit
3. 210 – Park Fail to Display Receipt
4. 003 – Park on Private Property w/o Consent
5. 207 – Park w/o ticket from machine
6. 009 – Stop on Highway at Prohibited Time/Day
7. 002 – Park Longer than 3 Hours
8. 008 – Vehicle Standing Prohibited Time/Day
9. 006 – Park on Highway – Excess of Permitted Time
10. 015 – Park within 3M of Fire Hydrant

In case you were wondering, the most expensive tickets (in the range of 100’s of dollars, the max being $450 [!!] ) are all related to handicapped parking spaces.

Time Distribution of Parking Infractions
Let us now consider the parking ticket information with regard to time. First and foremost, we consider the ticket data as a simple time series and plot the data for exploratory purposes:

Cool.

Most strikingly, there are clearly defined dips in the total number of tickets over the holiday season each year. There also appears to be some kind of periodic variation in the number of tickets issued over time (the downward spikes). A good first guess would be that this is related to the day of the week, given the cycle of the work week and its effect on the volume of vehicles parked in the city, et cetera.

Quickly whipping up a box plot for the data, we can see that a significantly smaller proportion of the tickets are issued on Sunday. Also, for some reason, there are many outliers on the low end. I suspect these fall in the aforementioned dips around the holiday season, though I did not investigate this.
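For the curious, here is a rough sketch of how the daily series and the weekday box plot can be put together; again, the file name, the date column name, and its format are assumptions:

# Daily ticket counts as a time series, then grouped by day of the week.
# File name, column name, and date format are assumptions.
tickets <- read.csv("parking_tickets.csv", stringsAsFactors = FALSE)
tickets$date <- as.Date(as.character(tickets$date_of_infraction), format = "%Y%m%d")

daily <- as.data.frame(table(tickets$date))
names(daily) <- c("date", "n")
daily$date <- as.Date(daily$date)

plot(daily$date, daily$n, type = "l", xlab = "Date", ylab = "Tickets issued")

daily$weekday <- weekdays(daily$date)
boxplot(n ~ weekday, data = daily, xlab = "Day of week", ylab = "Tickets issued")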

Conclusions

Performing a quick analysis of many different aspects of the data was not as easy as I had hoped, given the size of the set. Still, it is interesting to see the most common types of violations and the distribution of the majority of the parking tickets with respect to time.

Interesting general points of note:

  • The most common parking infractions are wrong place / wrong time, followed by various types of failing to display a permit / buy a ticket
  • Significantly reduced number of parking violations during the Christmas holiday season
  • More tickets issued during the work week

For Part II, I plan to create some heat maps / 2D histograms of the ticket data with respect to time, and I may yet create a geospatial representation of the data, albeit in aggregated form.

I’m Lovin’ It? – A Nutritional Analysis of McDonald’s

Introduction

The other day I ate at McDonald’s.

I am not particularly proud of this fact. But some days, you are just too tired, too lazy, or too hung-over to bother throwing something together in the kitchen and you just think, “Whatever, I’m getting a Big Mac.”

As I was standing in line, ready to be served a free smile, I saw that someone had put up on the wall the nutritional information poster. From far away I saw the little columns of data, all in neatly organized tabular form, and a light went on over my head. I got excited like the nerd I am. Look! Out in the real world! Neatly organized data just ready to be analyzed! Amazing!

So, of course, after finishing my combo #1 and doing my part to contribute to the destruction of the rain forest, the world’s increasingly worrying garbage problem, and the continuing erosion of my state of health, I rushed right home to download the nutritional information from Ronald McDonald’s website and dive into the data.

Analysis

First of all, I would just like to apologize in advance for
a) using a spreadsheet application, and
b) using bar charts

Forgive me, but we’re not doing any particularly heavy lifting here. And hey, at least it wasn’t in that one piece of software that everybody hates.

Also, by way of a disclaimer, I am not a nutritionist and the content of this article is in no way associated with McDonald’s restaurants or Health Canada.

Sandwiches
First things first. Surprisingly, the largest and fattiest of the items on the board is (what I consider to be) one of the “fringe” menu items: the Angus Deluxe Burger. Seriously, does anybody really ever order this thing? Wasn’t it just something the guys in the marketing department came up with to recover market share from Harvey’s? But I digress.

Weighing in at just a gram shy of 300, 47 grams of which come from fat (of which 17 are saturated), this is probably not something you should eat every day, given that it has 780 calories. Using a ballpark figure of 2000 calories a day for a healthy adult, eating just the burger alone would make up almost 40% of your daily caloric intake.

 
Unsurprisingly, the value menu burgers are not as bad in terms of calories and fat, due to their smaller size. This is also the case for the chicken snack wraps and fajita. The McBistro sandwiches, though they are chicken, are on par with the other larger burgers (Big Mac and Big Xtra) in terms of serving size and fat content, so as far as McD’s is concerned choosing a chicken sandwich is not really a healthier option over beef (this is also the case for the caloric content).

As the document on the McDonald’s website is a little dated, some newer, more popular menu items are missing from the data set. However these are available in the web site’s nutritional calculator (which unfortunately is in Flash). FYI the Double Big Mac has 700 calories and weighs 268 grams, 40 of which come from fat (17 saturated). Close, but still not as bad as the Angus Deluxe.

In terms of sodium and cholesterol, again our friend the Angus burger is the worst offender, this time the Angus with Bacon & Cheese, having both the most sodium and cholesterol of any burger on the menu. With a whopping 1990 mg of sodium, or approximately 80% of Health Canada’s recommended daily intake, that’s a salty burger. Here a couple of the smaller burgers are quite bad, the Double Cheeseburger and Quarter Pounder with Cheese both having marginally more sodium than the Big Mac as well as more cholesterol. Best stick with the snack wraps or the other value menu burgers.

Fries
Compared to the burgers, the fries don’t even really seem all that bad. Still, if you order a large, you’re getting over 40% of your recommended daily fat intake. I realize I’m using different units than before here, so for your reference the large fries have 560 calories, 27 grams of fat and 430 mg of sodium.

Soft Drinks
If you are trying to be health-conscious, the worst drinks you could possibly order at McDonald’s are the milkshakes. Our big winner in the drinks department is the large Chocolate Banana Triple Thick Milkshake®. With a serving size of 698g (~1.5 lbs), this delicious shake has over 1000 calories and nearly 30 grams of fat. In fact the milkshakes are, without question, the most caloric of all the drinks available, and are only exceeded in sugar content by some of the large soft drinks.

In terms of watching the calories and sugar, diet drinks are your friend as they have zero calories and no sugar. Below is the caloric and sugar content of the drinks available, sorted in ascending order of caloric content.

 

Salads
And now the big question – McDonald’s salads: a more conscientious choice, or another nutritional offender masquerading as a healthy alternative?

There are quite healthy alternatives in the salad department. Assuming you’re not going to order the Side Garden Salad (which I assume is just lettuce, looking at its caloric and fat content) the Spicy Thai Salad and Spicy Thai with Grilled Chicken are actually quite reasonable, though the latter has a large amount of sodium (520 mg), and all the Thai and Tuscan salads have a lot of sugar (19 and 16 grams of sugar respectively).

However, all these values are referring to the salads sans dressing. If you’re like me (and most other human beings) you probably put dressing on your salad.

The Spicy Thai Salad with the Asian Sesame Dressing added might still be considered within the realm of the healthy – totaling 250 calories and 11 grams of fat. However, keep in mind that would also have 530 mg of sodium (about a quarter of the recommended daily intake) and 29 grams of sugar. Not exactly health food, but not the worst thing you could order.

And for the love of god, just don’t order any old salad at McD’s and think you are making a healthy alternative choice. The Mighty Caesar with Crispy Chicken and Caesar dressing has more fat than a Big Mac combo with medium fries and a Coke (54 g vs. 46 g) and nearly as much sodium (1240 mg vs. 1300 mg), over half the daily recommended intake.

Conclusions

Doing this brief, simple examination of the McDonald’s menu will definitely help me be more mindful about the food the next time I choose to eat there. However, in terms of take-aways, there is nothing here that is really too surprising – we can see that McDonald’s food is, in general, very high in calories, fat, sugar and sodium. This is probably not a surprise for most, as many continue to eat it while being aware of these facts, myself included.

Still, it is somewhat shocking to see it all quantified and laid out in this fashion. A Big Mac meal with a medium fries and medium coke, for instance, has 1120 calories, 46 grams of fat, 1300 mg of sodium and 65 grams of sugar. Yikes. Assuming our 2000 calorie diet, that’s over half the day’s calories in one meal, as well as 71% and 54% of the recommended daily values for fat and sodium respectively. I will probably think twice in the future before I order that again.
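The percentages quoted above are just straightforward arithmetic against the reference amounts; a quick sketch, where the 2,000-calorie figure and the 65 g fat / 2,400 mg sodium daily values are the reference values I used (treat them as assumptions):

# Big Mac meal (medium fries and Coke) against assumed daily reference values.
meal  <- c(calories = 1120, fat_g = 46, sodium_mg = 1300)
daily <- c(calories = 2000, fat_g = 65, sodium_mg = 2400)

round(100 * meal / daily)  # roughly 56%, 71% and 54% respectively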

If you are trying to be health-conscious and still choose to eat underneath the golden arches, based on what we have seen here, some pointers are:

  • Avoid the Angus Burgers
  • Order a smaller burger (except the double cheese), snack wrap or fajita
  • Avoid the milkshakes
  • Drink diet soft drinks
  • Some salads are acceptable, Caesar dressing is to be avoided

References / Resources

McDonald’s Nutritional Information
http://www1.mcdonalds.ca/NutritionCalculator/NutritionFactsEN.pdf

McDonald’s Canada Nutritional Calculator
http://www.mcdonalds.ca/ca/en/food/nutrition_calculator.html

The Daily % Value (Health Canada)
http://www.hc-sc.gc.ca/fn-an/label-etiquet/nutrition/cons/dv-vq/info-eng.php

Dietary Reference Intake Tables, 2005 (Health Canada)
http://www.hc-sc.gc.ca/fn-an/nutrition/reference/table/index-eng.php

LibreOffice Calc
http://www.libreoffice.org/

My bookshelf

I’d like to start with something small, and simple. The thing about analyzing the data of your own life is that you are the only one doing the research, so you also have to collect all of the data yourself. This takes effort; and, if you’d like to build a large enough data set to do some really interesting (and valid) analysis, time.

So I thought I’d start small. And simple. What is an easily available source of data in my life with which to do some preliminary work? The answer was right next to me as I sat at my desk.

I am not a bibliophile by any stretch of the imagination, as I try to make good use of the public library when I can. I’d prefer to avoid spending copiously on books which will be read once and then collect dust. I have, over time however, amassed a small collection which is currently surpassing the capacity of my tiny IKEA bookcase.

I catalogued all the books in my collection and kept track of a few simple characteristics: number of pages, list price, publication year, binding, type (fiction, non-fiction or reference), subject, and whether or not I had read the book from cover-to-cover (“Completed”).

At the time of cataloguing I had a total of 60 books on my bookshelf. Summary of data:

> source("books.R")
[1] "Reading books.csv"
> summary(books)

     Pages       
 Min.   :  63.0  
 1st Qu.: 209.5  
 Median : 260.0  
 Mean   : 386.1  
 3rd Qu.: 434.0  
 Max.   :1694.0  

      Binding        Year               Type              Subject  
 Hardcover:21   Min.   :1921   Fiction    :15   Math          :12  
 Softcover:39   1st Qu.:1995   Non-fiction:34   Communications: 7  
                Median :2002   Reference  :11   Humour        : 6  
                Mean   :1994                    Coffee Table  : 5  
                3rd Qu.:2006                    Classics      : 4  
                Max.   :2011                    Sci-Fi        : 4  
                                                (Other)       :22  

     Price          Completed
 Min.   :  1.00     -:16     
 1st Qu.: 16.45     N:13     
 Median : 20.49     Y:31     
 Mean   : 35.41              
 3rd Qu.: 30.37              
 Max.   :155.90              

Some of this information is a bit easier to interpret if provided in visual form (click to enlarge):


Looking at the charts we can see that I’m not really into novels, and that almost 1/5th of my library is reference books – due mainly to textbooks from university that I still have kicking around. About 1/3rd of the books which are intended to be read cover-to-cover I have not read (“Not Applicable” refers to books like coffee-table and reference books which are not intended to be read in their entirety).
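For anyone wanting to reproduce that sort of composition chart from the catalogue, a minimal sketch using the same books.csv (the column names match the summary output above):

# Simple composition charts from the book catalogue.
books <- read.csv("books.csv", stringsAsFactors = TRUE)

par(mfrow = c(1, 2))
barplot(table(books$Type), ylab = "Number of books", main = "By type")
barplot(table(books$Completed), ylab = "Number of books", main = "Read cover-to-cover")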

Breaking it down further we look at the division by subject/topic:

Interestingly enough, the topics in my book collection are varied (apparently I am well-read?), with the largest chunks being made up by math (both pop-science and textbooks) and communications (professional development reading in the last year).

Let’s take a look at the relationship between the list price of books and other factors.

As expected, there does not appear to be any particular relationship between the publication year of the book and the list price. The outliers near the top of the price range are the textbooks and those on the very far left of publication date are Kafka.

A more likely relationship would be that between a book’s length and its price, as larger books are typically more costly. Having a look at the data for all the books it appears this could be the case:

We can coarsely fit a trendline to the data:
> price <- books$Price
> pages <- books$Pages
> page_price_line <- lm(price ~ pages)
> summary(page_price_line)

Call:
lm(formula = price ~ pages)

Residuals:
    Min      1Q  Median      3Q     Max
-56.620 -13.948  -6.641  -1.508 109.802

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  9.95801    6.49793   1.532    0.131   
pages        0.06592    0.01294   5.096 3.97e-06 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.19 on 58 degrees of freedom
Multiple R-squared: 0.3092,    Adjusted R-squared: 0.2973
F-statistic: 25.96 on 1 and 58 DF,  p-value: 3.971e-06
  
 

Our p-value is very small; however, our goodness of fit (R-squared) is not great. There appears to be some sort of clustering going on here, as the larger values (both in price and pages) are more dispersed. We re-examine the plot and divide by binding type:

The softcovers make up the majority of the tightly clustered values, and the values for the hardcovers seem to be more spread out. The dashed line is the linear fit for the hardcovers and the solid line for the softcovers. However, the small number (n=21) and dispersion of the points for the former make even doing this questionable. That point aside, we can see on the whole that hardcovers appear to be more expensive, as one would expect. This is illustrated in the box plot below:
 

However, there are a lot of outlying points on the plot. Looking at the scatterplot again, we divide by book type and the picture becomes clearer:

It is clear the reference books make up the majority of the extreme values away from those clustered in the lower regions of the plot and thus could be treated separately.
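For completeness, here is a sketch of the separate price-versus-pages fits by binding type described above, with the same caveat about the small hardcover sample:

# Price vs. pages, coloured by binding, with a separate linear fit per binding.
books <- read.csv("books.csv", stringsAsFactors = TRUE)

hard <- subset(books, Binding == "Hardcover")
soft <- subset(books, Binding == "Softcover")

plot(books$Pages, books$Price,
     col = ifelse(books$Binding == "Hardcover", "red", "blue"),
     xlab = "Pages", ylab = "List price ($)")
abline(lm(Price ~ Pages, data = hard), lty = 2)  # dashed: hardcover fit
abline(lm(Price ~ Pages, data = soft), lty = 1)  # solid: softcover fit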

Closing notes:

  • I did not realize how many non-fiction / general interest / popular reading books have subtitles (e.g. Zero – The Biography of A Dangerous Idea) until cataloguing the ones I own. I suppose this is to make them seem more interesting, with the hope that people browsing at bookstores will read the blurb on the back and be enticed to purchase the book.
  • Page numbering appears to be completely arbitrary. When I could I used the last page of each book which had a page number listed. Some books have the last page in the book numbered, others have the last full page of text numbered, and still others the last written page before supplementary material at the back (index, appendix, etc.) numbered. The first numbered page also varies, accounting for things like the table of contents, introduction, prologue, copyright notices and the like.
  • Textbooks are expensive. Unreasonably so.
  • Amazon has metadata for each book which you can see under “Details” when you view it (I had to look up some things like price when it was not listed on the book. In these cases, I used Amazon’s “list price”, the crossed out value at the top of the page for a book). I imagine there is an enormous trove of data which would lend itself to much more interesting and detailed analysis than I could perform here.