I Heart Sushi

Introduction

I like sushi.

I’ve been trying to eat a bit better lately though (aren’t we all?) and so got to wondering: just how bad for you is sushi exactly? What are some of the better nutritional choices I can make when I go out to eat at my favorite Japanese(ish) place? What on the menu should I definitely avoid?

And then I got thinking the way I normally think about the world: hey, it’s all just data. I remembered I could treat the nutritional information as raw data, as I did ages ago for Mickey D’s, and see if anything interesting pops out. Plus this seemed like as good an excuse as any to do some work with the good old data analysis and visualization stack for python, and ipython notebooks, instead of my usual go-to tool of R.

So let’s have a look, shall we?

Background

As always, the first step is getting the data; sometimes the most difficult step. Here the menu I chose to use was that of Sushi Stop (I am in no way affiliated or associated with said brand, nor am I endorsing it), where the nutritional information was unfortunately only available as a PDF, as is often the case.

This is a hurdle that data analysts (and, more often I think, research analysts and data journalists) can run into. Fortunately there are tools at our disposal to deal with this kind of thing, so not to worry. Using the awesome Tabula and a little bit of ad hoc cleaning from the command line, it was a simple matter of extracting the data from the PDF and into a convenient CSV. Boom, and we’re ready to go.
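My actual cleaning was ad hoc from the command line, but the same kind of tidy-up can be sketched in pandas. The steps below are illustrative assumptions (the real fixes depend on how cleanly Tabula extracted the tables), and the column names `category` and `item` are the ones from the dataset:

```python
import pandas as pd

def clean_tabula_csv(path_or_buffer):
    """Hypothetical tidy-up of a Tabula CSV export."""
    df = pd.read_csv(path_or_buffer)
    # Normalize column names: strip whitespace, lowercase, underscores
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
    # Strip stray whitespace from string cells
    for col in df.select_dtypes(include='object'):
        df[col] = df[col].str.strip()
    # Coerce everything except the label columns to numeric,
    # turning any leftover extraction junk into NaN
    for col in df.columns.drop(['category', 'item']):
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
```

The `errors='coerce'` option is the important part: rather than blowing up on a stray footnote character that leaked into a numeric column, it flags the cell as missing so you can inspect it.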

Tabula

The data comprises 335 unique items in 17 different categories with 15 different nutritional variables. Let’s dig in.

Analysis

First we include the usual suspects in the python data analysis stack (numpy, matplotlib and pandas), then read the data into a dataframe using pandas.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
data = pd.read_csv("tabula-nutritional-information.csv", delimiter=",")

Okay, what are we working with here? Let’s take a look:

In [3]:
print data.columns
print len(data.columns)
data.head()
Index([u'category', u'item', u'serving_size', u'calories', u'fat', u'saturated_fat', u'trans_fat', u'cholesterol', u'sodium', u'carbohydrates', u'fibre', u'sugar', u'protein', u'vitamin_a', u'vitamin_c', u'calcium', u'iron'], dtype='object')
17
Out[3]:
category item serving_size calories fat saturated_fat trans_fat cholesterol sodium carbohydrates fibre sugar protein vitamin_a vitamin_c calcium iron
0 APPETIZERS & SALADS Shrimp Tempura 60 180 8.0 0.0 0 40 125 18 0 0 8 0 0 0 0
1 APPETIZERS & SALADS Three salads 120 130 3.5 0.0 0 60 790 13 4 8 8 2 6 40 8
2 APPETIZERS & SALADS Wakame 125 110 2.0 0.0 0 0 1650 13 4 9 0 0 0 110 0
3 APPETIZERS & SALADS Miso soup 255 70 3.0 0.5 0 0 810 8 1 1 6 0 0 20 25
4 APPETIZERS & SALADS Grilled salmon salad 276 260 19.0 2.5 0 30 340 12 3 6 12 80 80 8 8

5 rows × 17 columns

Let’s look at the distribution of the different variables. You can see that most are heavily skewed or follow power law / log-normal type distributions, as most things in nature do. Interestingly, there is a little blip in the serving sizes around 600 which we’ll see later is the ramen soups.

In [4]:
# Have a look
plt.figure(0, figsize=(25,12), dpi=80)
for i in range(2,len(data.columns)):
    fig = plt.subplot(2,8,i)
    plt.title(data.columns[i], fontsize=25)
    plt.hist(data[data.columns[i]])
    # fig.tick_params(axis='both', which='major', labelsize=15)
plt.tight_layout()

Let’s do something really simple, and without looking at any of the other nutrients just look at the caloric density of the foods. We can find this by dividing the number of calories in each item by the serving size. We’ll just look at the top 10 worst offenders or so:

In [5]:
data['density']= data['calories']/data['serving_size']
data[['item','category','density']].sort('density', ascending=False).head(12)
Out[5]:
item category density
314 Yin Yang Sauce EXTRAS 5.000000
311 Ma! Sauce EXTRAS 4.375000
75 Akanasu (brown rice) HOSOMAKI 3.119266
0 Shrimp Tempura APPETIZERS & SALADS 3.000000
312 Spicy Light Mayo EXTRAS 2.916667
74 Akanasu HOSOMAKI 2.844037
67 Akanasu avocado (brown rice) HOSOMAKI 2.684564
260 Teriyaki Bomb ‐ brown rice (1 pc) TEMARI 2.539683
262 Teriyaki Bomb ‐ brown rice (4 pcs) TEMARI 2.539683
66 Akanasu avocado HOSOMAKI 2.483221
201 Inferno Roll (brown rice) SUMOMAKI 2.395210
259 Teriyaki Bomb (1 pc) TEMARI 2.380952

12 rows × 3 columns

The most calorically dense item is the Yin Yang Sauce, which as far as I could ascertain is just mayonnaise and something else arranged in a yin-yang shape on a plate. Excluding the other sauces (I assume Ma! also includes mayo), the most calorically dense foods are the variations of the Akanasu roll (sun-dried tomato pesto, light cream cheese, sesame), shrimp tempura (deep fried, so not surprising) and the teriyaki bombs, which are basically seafood, cheese and mayo smushed into a ball, deep fried and covered with sauce (guh!). I guess sun-dried tomato pesto has a lot of calories. Wait a second, does brown rice have more calories than white? Oh right, sushi is made with sticky rice, and yes, yes it does. Huh, today I learned.

We can get a more visual overview of the entire menu by plotting the two quantities together: calories on the y-axis and serving size on the x. Since caloric density is just calories divided by serving size, denser items sit higher above the diagonal. Here we colour by category and get a neat little scatterplot.

In [7]:
# Get the unique categories
categories = np.unique(data['category'])

# Get the colors for the unique categories
cm = plt.get_cmap('spectral')
cols = cm(np.linspace(0, 1, len(categories)))

# Iterate over the categories and plot
plt.figure(figsize=(12,8))
for category, col in zip(categories, cols):
    d = data[data['category']==category]
    plt.scatter(d['serving_size'], d['calories'], s=75, c=col, label=category.decode('ascii', 'ignore'))
plt.xlabel('Serving Size (g)', size=15)
plt.ylabel('Calories', size=15)
plt.title('Serving Size vs. Calories', size=18)

legend = plt.legend(title='Category', loc='center left', bbox_to_anchor=(1.01, 0.5),
                    ncol=1, fancybox=True, shadow=True, scatterpoints=1)
legend.get_title().set_fontsize(15)

You can see that the nigiri & sashimi generally have smaller serving sizes and so fewer calories. The ramen soup is in a category all its own, with much larger serving sizes than the other items, as I mentioned before and as we saw in the histograms. The other rolls are kind of in the middle. The combos, small ramen soups and some of the appetizers and salads also sit away from the ‘main body’ of the rest of the menu.

Points which lie further above the line y=x have higher caloric density. Even though the top offenders we picked out above had the highest raw values, and we can probably guess where they are in the graph (the sauces are the vertical blue line near the bottom left, and the Akanasu are probably those pairs of dark green dots to the right), there are other categories which are probably worse overall, like the cluster of red which is sushi pizza. Which category of the menu has the highest caloric density (and so is likely best avoided) overall?

In [8]:
# Find most caloric dense categories on average
density = data[['category','density']]
grouped = density.groupby('category')
grouped.agg(np.average).sort('density', ascending=False).head()
Out[8]:
density
category
EXTRAS 2.421875
SUSHI PIZZA 2.099515
CRISPY ROLLS 1.969304
TEMARI 1.807691
HAKO 1.583009

5 rows × 1 columns

As expected, we see that other than the extras (sauces) which have very small serving sizes, on average the sushi pizzas are the most calorically dense group of items on the menu, followed by crispy rolls. The data confirm: deep fried = more calories.

What if we were only concerned with fat (as many weight-conscious people dining out are)? Let’s take a look at the different categories with a little more depth than just a simple average:

In [9]:
# Boxplot of fat content
fat = data[['category','fat']]
grouped = fat.groupby('category')

# Sort
df2 = pd.DataFrame({col:vals['fat'] for col,vals in grouped})
meds = df2.median()
meds.sort(ascending=True)
df2 = df2[meds.index]

# Plot
plt.figure(figsize=(12,8))
fatplot = df2.boxplot(vert=False)

While the combos and the appetizers and salads have very wide ranges in their fat content, we see again that the sushi pizza and crispy rolls have the most fat collectively and so are best avoided.

Now another thing people are often worried about when they are trying to eat well is the amount of sodium they take in. So let’s repeat our previous approach in visually examining caloric density, only this time plot it as one metric on the x-axis and look at where different items on the menu sit with regards to their salt content.

In [10]:
fig = plt.figure(figsize=(12,8))
plt.xlim(0,6)
plt.ylim(-50, 2000)
for category, col in zip(categories, cols):
    d = data[data['category']==category]
    plt.scatter(d['density'], d['sodium'], s=75, c=col, label=category.decode('ascii', 'ignore'))
plt.xlabel('Caloric density (calories/g)', size=15)
plt.ylabel('Sodium (mg)', size=15)
plt.title('Sodium vs. Caloric Density', size=18)

legend = plt.legend(title='Category', loc='center left', bbox_to_anchor=(1.01, 0.5),
                    ncol=1, fancybox=True, shadow=True, scatterpoints=1)
legend.get_title().set_fontsize(15)

Here we can see that while the extras (sauces) are very calorically dense, you’re probably not going to take in a crazy amount of salt unless you go really heavy on them (bottom right). If we’re really worried about salt, the ramen soups should be avoided, as most of them have very high sodium content (the straight line of light green near the left), some north of 1500 mg, which is Health Canada’s recommended daily intake for adults aged 14 to 50. There are also some of the other items we’ve seen before not looking so good (sushi pizza). Some of the temari (like the teriyaki bombs) and sumomaki (‘regular’ white-on-the-outside maki rolls) look like they should be avoided too. But which ones?

A plot like this is pretty crowded, I’ll admit, so it’s really better explored interactively, and we can do that using the very cool (and very much under development) MPLD3 package, which combines the convenience of matplotlib with the power of D3.

Below is the same scatterplot, only interactive, so you can mouse over and see what each individual point is. The items to be most avoided (top right in grey and orange), are indeed the teriyaki bombs, as well as the inferno roll (tempura, light cream cheese, sun-dried tomato pesto, red and orange masago, green onion, spicy light mayo, spicy sauce, sesame) as we saw before. Apparently that sun-dried tomato pesto is best taken in moderation.

The Akanasu rolls are the horizontal line of 4 green points close by. Your best bet is probably just to stick to the nigiri and sashimi, and maybe some of the regular maki rolls closer to the bottom left corner.

In [11]:
import mpld3
fig, ax = plt.subplots(figsize=(12,8))
ax.set_xlim(0,6)
ax.set_ylim(-50,2000)
N = 100

for category, col in zip(categories, cols):
    d = data[data['category']==category]
    scatter = ax.scatter(d['density'], d['sodium'], s=40, c=col, label=category.decode('ascii', 'ignore'))
    labels = list(d['item'])
    tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
    mpld3.plugins.connect(fig, tooltip)


mpld3.display()
Out[11]:

Conclusion

Well, there we have it folks. A simple look at the data tells us some common-sense things we probably already knew:

  • Deep fried foods will make you fat
  • Mayo will make you fat
  • Soup at Japanese restaurants is very salty
  • Sashimi is healthy if you go easy on the soy

And surprisingly, one thing I would not have thought: that sun-dried tomato pesto is apparently really bad for you if you’re eating conscientiously.

That’s all for now. See you next time and enjoy the raw fish.

References and Resources

Tabula
http://tabula.technology/

Sushi Stop – Nutritional Information (PDF)
http://www.sushishop.com/themes/web/assets/files/nutritional-information-en.pdf

Food & Nutrition – Sodium in Canada (Health Canada)
http://www.hc-sc.gc.ca/fn-an/nutrition/sodium/index-eng.php

code & data on github
https://github.com/mylesmharrison/i_heart_sushi/

Fine Cuppa Joe: 96 Days and 162 Cups of Coffee

Introduction

Let’s get one thing straight: I love me some coffee.
Some people would disagree with me on this, but coffee is really important. Really, really important, and not just to me. Not just because companies like Starbucks and Second Cup and Caribou and Timothy’s and Tim Hortons make it their business, but for another reason.
As far as I know, there are only three legal, socially acceptable drugs: alcohol, nicotine, and caffeine (and some would argue that the first two are not always socially acceptable). Coffee is really important because coffee is the most common, effective and ubiquitous source of delivery for that third drug – and one which is acceptable and ubiquitous not only socially, but also in the world of business.
I remember a long time ago there was a big blackout. I remember that after people pointed out how such a widespread outage was caused by such a small point of failure – they said things like ‘This just goes to show how fragile our infrastructure is! If the terrorists want to win, all they have to do is take out one circuit breaker here or there and all of North America will collapse!’

Ha ha ha, yeah.
But I’d argue that if you really wanted all of North American society to shut down, you could really hit us where it hurts: take away something without which we are completely and totally helpless, and cut off our supply of coffee. Think about it! The widespread effects of everyone, across all walks of life and all industries, suddenly going cold turkey on coffee would be far more damaging in the long run than any little blackout. Run for the hills, the great Tim Hortons riots of 2013 have erupted, and apparently the Mayans only missed the date of The Apocalypse by a small margin!
Or at least I think so. Or at least I think the idea is entertaining, though I probably largely got the idea from this Dilbert comic (which I find funnier and more spot-on than most).
But I digress.

Background

Like I said, I love me some coffee (it says so in my Twitter profile), and I’m no stranger to quantified self either, so I thought it would be interesting to apply it and answer the question “Exactly how much coffee am I drinking?” amongst others.
I kept track of my coffee consumption over the period spanning November 30, 2012 to March 5, 2013. I recorded when I had coffee, where it was from, what size, and how I took it. It wasn’t until almost the end of January that I realized I could also be keeping track of how long it took me to consume each cup of coffee, so I started doing that as well. Every time I do something like this I forget, and then remember, how important it is to think about data collection before you set off on your merry way (like for example with the post on my commute).
As well as keeping track of the amount of coffee I drank in terms of cups, I converted the cups to volume by multiplying by the following approximate values:
  • Mug / Small / Short – 240 ml
  • Medium / Tall – 350 ml
  • Large / Grande – 470 ml
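That conversion is simple enough to sketch in python. The size labels and ml values are the ones listed above; the list-of-labels input format is just an assumption for illustration:

```python
# Approximate volume (ml) for each cup size, as listed above
SIZE_ML = {
    'mug': 240, 'small': 240, 'short': 240,
    'medium': 350, 'tall': 350,
    'large': 470, 'grande': 470,
}

def total_volume_ml(cups):
    """Total volume in ml for a sequence of cup-size labels."""
    return sum(SIZE_ML[size.lower()] for size in cups)
```

So a day with a Tall and a Grande, for example, would log as `total_volume_ml(['Tall', 'Grande'])`, or 820 ml.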

Analysis

First and foremost, we examine where the majority of my coffee came from over the 3-month period. Starbucks is the clear winner, and apparently I almost never go to Second Cup.
bar chart of coffee consumption by location
Second was at work (which is not really a fair comparison, as it’s actually Starbucks coffee anyway). Third was at Dad’s place, almost all of which was due to my being home over the holidays.
Next we look at the time of day when the coffees were consumed. I am going to use this as an illustrative example of why it is important to visualize data.
First consider a histogram for the entire time period of when all the java was imbibed:
histogram of coffee consumption by hour of day
You can see there are peaks at the hours of 10 AM and also at 2 PM. However, is this telling the whole story? Let’s look at all the data plotted out by time of day:
scatterplot of coffee consumption by date and time of day
Having the data plotted out, you can see there is a distinct shift in the hours of the day when I was drinking coffee around the beginning of January. The earliest cup of the day goes from being around 9 AM to around 8, and the latest from the evening (around 8 PM) to the late afternoon (3-4 PM). Well, what happened to constitute this shift in the time of my daily java consumption? Simple – I got a new job.
You can see this shift if we overplot histograms for the hour of day before and after the change:
combined histogram of coffee consumption by hour of day
You can see that the distribution of my coffee consumption is different after I started the new gig – my initial morning coffees occur earlier (in the hours of 7-8 AM instead of 9 or later). You wouldn’t have known that just from looking at the other histogram – so you can see why it’s important to look at all the data before you go jumping ahead into any analysis!
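Overplotting two semi-transparent histograms like this is a one-liner pattern in matplotlib. Here is a minimal sketch of the technique; the hours below are synthetic stand-ins, not my actual log:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic hours of the day: centred around 10 AM before the job
# change, and shifted earlier (around 8 AM) after it
before = rng.normal(loc=10, scale=2.5, size=80)
after = rng.normal(loc=8, scale=2.0, size=80)

# Shared bins plus alpha make the overlap readable
bins = np.arange(6, 22)
plt.hist(before, bins=bins, alpha=0.5, label='Before new job')
plt.hist(after, bins=bins, alpha=0.5, label='After new job')
plt.xlabel('Hour of day')
plt.ylabel('Cups')
plt.legend()
plt.savefig('coffee_hours.png')
```

The shared `bins` argument matters: letting each call pick its own bin edges makes the two distributions impossible to compare visually.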
Using the ml values for the different sizes listed above, we can calculate the amount consumed per day in ml and visualize my total coffee consumption over time by volume:
cumulative coffee consumption by date
You can see that my coffee consumption is fairly consistent over time. Over the whole time period of 96 days I drank approximately 50 L of java which comes out to about 520 ml a day (or about 1.5 Talls from Starbucks). 
We can see this by adding a trend line, which fits amazingly well: the slope is ~0.52 and the R-squared ~0.998:
cumulative coffee consumption by date (with trend line)
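Fitting that trend line amounts to a simple linear regression of cumulative volume on day number. A sketch with numpy (the data here are synthetic, generated to roughly match the ~520 ml/day rate, since the raw log isn’t reproduced in this post):

```python
import numpy as np

rng = np.random.default_rng(42)
days = np.arange(96)
# Synthetic daily consumption averaging ~520 ml, then accumulated
daily_ml = rng.normal(loc=520, scale=150, size=96)
cumulative_l = np.cumsum(daily_ml) / 1000.0  # litres

# Least-squares line: the slope is litres consumed per day
slope, intercept = np.polyfit(days, cumulative_l, 1)

# R-squared from the fitted values
fitted = slope * days + intercept
ss_res = np.sum((cumulative_l - fitted) ** 2)
ss_tot = np.sum((cumulative_l - cumulative_l.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

One caveat worth knowing: regressing a cumulative series will almost always give a near-perfect R-squared, because the running total smooths out the day-to-day noise. The consistent slope is the more meaningful part of the fit.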
So the answer to the question from the beginning (“Exactly how much coffee am I drinking?”) is: not as much as I thought – only about 1-2 cups a day. 
When am I drinking it? The peak times of day changed a little bit, but early in the morning and in the mid-afternoon (which I imagine is fairly typical).
How does my daily consumption look over the time period in question? Remarkably consistent.
And just in case you were wondering, out of the 162 cups of coffee I drank over the 3 months, 160 were black.

Conclusions

  • Majority of coffee bought from Starbucks
  • Marked shift in time of day when coffees were consumed due to change in employment
  • Regular / daily rate of consumption about 520 ml and consistent over period of examination
  • I’ll take mine black, thanks

I’m Lovin’ It? – A Nutritional Analysis of McDonald’s

Introduction

The other day I ate at McDonald’s.

I am not particularly proud of this fact. But some days, you are just too tired, too lazy, or too hung-over to bother throwing something together in the kitchen and you just think, “Whatever, I’m getting a Big Mac.”

As I was standing in line, ready to be served a free smile, I saw that someone had put up on the wall the nutritional information poster. From far away I saw the little columns of data, all in neatly organized tabular form, and a light went on over my head. I got excited like the nerd I am. Look! Out in the real world! Neatly organized data just ready to be analyzed! Amazing!

So, of course, after finishing my combo #1 and doing my part to contribute to the destruction of the rain forest, the world’s increasingly worrying garbage problem, and the continuing erosion of my state of health, I rushed right home to download the nutritional information from Ronald McDonald’s website and dive into the data.

Analysis

First of all, I would just like to apologize in advance for
a) using a spreadsheet application, and
b) using bar charts

Forgive me, but we’re not doing any particularly heavy lifting here. And hey, at least it wasn’t in that one piece of software that everybody hates.

Also, by way of a disclaimer, I am not a nutritionist and the content of this article is in no way associated with McDonald’s restaurants or Health Canada.

Sandwiches
First things first. Surprisingly, the largest and fattiest of the items on the board is (what I consider to be) one of the “fringe” menu items: the Angus Deluxe Burger. Seriously, does anybody ever really order this thing? Wasn’t it just something the guys in the marketing department came up with to recover market share from Harvey’s? But I digress.

Weighing in at just a gram shy of 300, 47 grams of which come from fat (17 of them saturated), this is probably not something you should eat every day, given that it has 780 calories. Using a ballpark figure of 2000 calories a day for a healthy adult, eating just the burger alone would make up almost 40% of your daily caloric intake.

Unsurprisingly, the value menu burgers are not as bad in terms of calories and fat, due to their smaller size. This is also the case for the chicken snack wraps and fajita. The McBistro sandwiches, though they are chicken, are on par with the other larger burgers (Big Mac and Big Xtra) in terms of serving size and fat content, so as far as McD’s is concerned choosing a chicken sandwich is not really a healthier option over beef (this is also the case for the caloric content).

As the document on the McDonald’s website is a little dated, some newer, more popular menu items are missing from the data set. However these are available in the web site’s nutritional calculator (which unfortunately is in Flash). FYI the Double Big Mac has 700 calories and weighs 268 grams, 40 of which come from fat (17 saturated). Close, but still not as bad as the Angus Deluxe.

In terms of sodium and cholesterol, again our friend the Angus burger is the worst offender, this time the Angus with Bacon & Cheese, having both the most sodium and cholesterol of any burger on the menu. With a whopping 1990 mg of sodium, or approximately 80% of Health Canada’s recommended daily intake, that’s a salty burger. Here a couple of the smaller burgers are quite bad, the Double Cheeseburger and Quarter Pounder with Cheese both having marginally more sodium than the Big Mac as well as more cholesterol. Best stick with the snack wraps or the other value menu burgers.

Fries
Compared to the burgers, the fries don’t even really seem all that bad. Still, if you order a large, you’re getting over 40% of your recommended daily fat intake. I realize I’m using different units than before here, so for your reference the large fries have 560 calories, 27 grams of fat and 430 mg of sodium.

Soft Drinks
If you are trying to be health-conscious, the worst drinks you could possibly order at McDonald’s are the milkshakes. Our big winner in the drinks department is the large Chocolate Banana Triple Thick Milkshake®. With a serving size of 698g (~1.5 lbs), this delicious shake has over 1000 calories and nearly 30 grams of fat. In fact the milkshakes are, without question, the most caloric of all the drinks available, and are only exceeded in sugar content by some of the large soft drinks.

In terms of watching the calories and sugar, diet drinks are your friend as they have zero calories and no sugar. Below is the caloric and sugar content of the drinks available, sorted in ascending order of caloric content.


Salads
And now the big question – McDonald’s salads: a more conscientious choice, or another nutritional offender masquerading as a healthy alternative?

There are quite healthy alternatives in the salad department. Assuming you’re not going to order the Side Garden Salad (which I assume is just lettuce, looking at its caloric and fat content) the Spicy Thai Salad and Spicy Thai with Grilled Chicken are actually quite reasonable, though the latter has a large amount of sodium (520 mg), and all the Thai and Tuscan salads have a lot of sugar (19 and 16 grams of sugar respectively).

However, all these values are referring to the salads sans dressing. If you’re like me (and most other human beings) you probably put dressing on your salad.

The Spicy Thai Salad with the Asian Sesame Dressing added might still be considered within the realm of the healthy – totaling 250 calories and 11 grams of fat. However, keep in mind that would also have 530 mg of sodium (about a quarter of the recommended daily intake) and 29 grams of sugar. Not exactly health food, but not the worst thing you could order.

And for the love of god, just don’t order any old salad at McD’s and think you are making a healthy alternative choice. The Mighty Caesar with Crispy Chicken and Caesar dressing has more fat than a Big Mac combo with medium fries and a Coke (54 g vs. 46 g) and nearly as much sodium (1240 mg vs. 1300 mg), over half the daily recommended intake.

Conclusions

This brief, simple examination of the McDonald’s menu will definitely help me be more mindful about the food the next time I choose to eat there. However, in terms of take-aways, there is nothing here really too surprising: we can see that McDonald’s food is, in general, very high in calories, fat, sugar and sodium. This is probably not a surprise for most, as many continue to eat it while being aware of these facts, myself included.

Still, it is somewhat shocking to see it all quantified and laid out in this fashion. A Big Mac meal with a medium fries and medium Coke, for instance, has 1120 calories, 46 grams of fat, 1300 mg of sodium and 65 grams of sugar. Yikes. Assuming our 2000-calorie diet, that’s over half the day’s calories in one meal, as well as 71% and 54% of the recommended daily values for fat and sodium respectively. I will probably think twice in the future before I order that again.
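The percent-of-daily-value arithmetic above is easy to check in a couple of lines. The reference values here are the ones implied by the percentages in the text (a 2000-calorie diet, 65 g of fat and 2400 mg of sodium per day):

```python
# Big Mac meal (medium fries, medium Coke) vs. reference daily values
meal = {'calories': 1120, 'fat_g': 46, 'sodium_mg': 1300}
daily = {'calories': 2000, 'fat_g': 65, 'sodium_mg': 2400}

# Percent of the daily value each nutrient in the meal represents
percent_dv = {k: round(100.0 * meal[k] / daily[k]) for k in meal}
# → {'calories': 56, 'fat_g': 71, 'sodium_mg': 54}
```

Which confirms the numbers quoted above: 56% of the day’s calories, 71% of the fat and 54% of the sodium, all in one meal.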

If you are trying to be health-conscious and still choose to eat underneath the golden arches, based on what we have seen here, some pointers are:

  • Avoid the Angus Burgers
  • Order a smaller burger (except the double cheese), snack wrap or fajita
  • Avoid the milkshakes
  • Drink diet soft drinks
  • Some salads are acceptable, Caesar dressing is to be avoided

References / Resources

McDonald’s Nutritional Information
http://www1.mcdonalds.ca/NutritionCalculator/NutritionFactsEN.pdf

McDonald’s Canada Nutritional Calculator
http://www.mcdonalds.ca/ca/en/food/nutrition_calculator.html

The Daily % Value (Health Canada)
http://www.hc-sc.gc.ca/fn-an/label-etiquet/nutrition/cons/dv-vq/info-eng.php

Dietary Reference Intake Tables, 2005 (Health Canada)
http://www.hc-sc.gc.ca/fn-an/nutrition/reference/table/index-eng.php

LibreOffice Calc
http://www.libreoffice.org/