Are We Solving The Wrong Problems With Machine Learning?

Let’s talk about corn.

Corn and how it gets from growing in fields onto your table.

Below is a video of a corn harvesting machine:

And here is a video of people gathering corn:

So, I hear you say, what all does this have to do with machine learning?

A lot, as it so happens.

Good Data Visualization Should Be Boring

So I’m going to make a statement that I’m sure some people are going to disagree with: good data visualization should be boring.

Well, at least kind of boring.

I’ve had a lot of conversations with a lot of people over the last few years or so about data visualization: why it’s important, what constitutes good and bad, and examples of its application in both problematic and very effective ways.

A salient point someone once made to me is that part of the problem with the practice of data visualization is that it isn’t viewed as a standalone discipline. It’s simply done, in high school math classes, university courses, or even in the workplace by professionals, and it is usually assumed that people will just pick it up without any discussion of it or its proper application.

I think this is gradually starting to change. With all the talk (or hype, depending on your point of view) around “Big Data”, analytics is becoming more mainstream, and data visualization along with it. I also think dataviz is beginning to – gradually, very gradually – become viewed as a standalone discipline, with courses now being offered in it, and the “data visualization evangelism” of academics such as Edward Tufte and Alberto Cairo and the work of practitioners like Stephen Few and Mike Bostock helping to raise awareness of what’s doing it wrong and what’s doing it right. This, along with others creating visualizations which go viral or delivering inspirational TED talks, is doing a lot for visualization as a practice.

The thing I found when I first started to get into dataviz is that even if you’re good with data that doesn’t necessarily mean you’re good at visualizing it. This is because, in addition to working with data, doing proper visualization involves questions of design and also the psychology of perception.

Less is More

I’m a minimalist, and therefore take what I call a functionalist perspective on data visualization. That is to say, the purpose of visualization is to represent the data as effectively as possible, so that the audience can understand it as quickly and easily as possible.

As such, I feel that good data visualization should be somewhat dull, or at least somewhat dry; in terms of depicting information and people perceiving it, it is usually the case that simpler is better. This is illustrated in principles like Tufte’s data-ink ratio.

So, look at the charts below. Which is more visually appealing to you? Which is simpler? Which one depicts the quantities such that you are able to interpret them the most quickly, accurately and with the most clarity?

If you’re like me, you’ll say the one on the right, which is a better visualization, even though it may not be as visually appealing to some. Most often you’re better served by a simpler, cleaner visualization (or perhaps several of them) than a lot of complexity and visual noise that doesn’t add to the reader’s understanding.

Never say always

That being said, as I mentioned, choices around data visualization are ultimately ones of design. I do believe that there are some hard and fast rules that should never be broken (e.g. always start the y-axis at 0 for bar charts of strictly positive values, don’t represent data with the same units on a secondary y-axis, never use a line chart for categorical data); however, I also believe there are some that are more flexible, depending on what you want to accomplish and who your audience is. Should you never, ever, use a pie chart? No. Some people are more comfortable with pie charts just from their familiarity with them. Is a bar chart a better choice in terms of representing the data? Yes. But that doesn’t mean there aren’t exceptions (just don’t make a 3D one).

The same individual who made the observation about dataviz not being taught also pointed out to me another factor that can influence design choices: what she called chart fatigue. Is the bar chart the best way to plot a single metric across a categorical variable? Almost always, yes. But show a room full of businesspeople bar chart after bar chart after bar chart and anyone can tell you that they’re all going to start to look the same, and interpretation of them is going to suffer as a result. Plus you’re probably going to lose the interest of your audience.

Practice makes perfect

In conclusion, I think that awareness of data visualization is only going to get better as companies (and the average consumer) become more “data savvy”. It is my sincere hope that people will give more and more emphasis, not only to the importance of visualization as a tool, but also to the design choices around it, and what constitutes good and bad depictions of data.

For now, just remember that data visualization is ultimately all about communicating and having your reader understand, not necessarily wowing them (though both together are not impossible). And sometimes, that means boring is better.

In Critique of Slopegraphs

I’ve been doing more research into less common types of data visualization techniques recently, and was reading up on slopegraphs.

Andy Kirk wrote a piece praising slopegraphs last December, which goes over the construction of a slopegraph with some example data very nicely. However, I’ve seen some bad examples of data visualization using them elsewhere on the web, and just thought I’d put in my two cents.

Introductory remarks

I tend to think of slopegraphs as a very boiled-down version of a normal line chart, in which you have only two values for your independent variable and strip away all the non-data ink. This works because if you label all the individual components, you can take away all the cruft because you don’t need the legend or axes anymore, do you? Here’s a before-and-after example of that below, using the soccer data from Andy’s post.

First as a line graph:

Hmm, that’s not very enlightening, is it? There are so many values for the categorical variable (team) that the graph requires a plethora of colours in the legend, and a considerable amount of back-and-forth to interpret. Contrast with the slopegraph, which is much easier to interpret as the individual values can be read off, and it also ditches the non-data ink of the axes:

Here it is much easier to read off values for the individual teams, it feels less cluttered, and more data have been encoded both in colour (orange for a decrease between the two years, and blue for an increase) as well as the thickness of the lines (thicker lines for change of > 25%).
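As a rough sketch of the encoding just described, the snippet below derives a colour and a line weight for each team from its two point totals. The team names and numbers are invented for illustration and are not the data from Andy’s post. The function name, the choice to measure the change relative to the first year’s value, and the grey colour for “no change” are all assumptions of mine, since the graphic itself does not spell them out.

```python
from dataclasses import dataclass


@dataclass
class SlopeLine:
    label: str
    start: float   # value in the first year (left-hand side)
    end: float     # value in the second year (right-hand side)
    colour: str    # "blue" = increase, "orange" = decrease
    weight: str    # "thick" = relative change above the threshold


def encode_slopegraph(points, threshold=0.25):
    """Turn {label: (start, end)} into styled slopegraph lines.

    Assumption: percentage change is measured relative to the first value.
    """
    lines = []
    for label, (start, end) in points.items():
        change = end - start
        if change > 0:
            colour = "blue"
        elif change < 0:
            colour = "orange"
        else:
            colour = "grey"  # assumption: "no change" is not covered above
        # Guard against dividing by zero when the first value is 0.
        if start:
            relative = abs(change) / abs(start)
        else:
            relative = float("inf") if change else 0.0
        weight = "thick" if relative > threshold else "thin"
        lines.append(SlopeLine(label, start, end, colour, weight))
    return lines


# Hypothetical values, for illustration only.
example = {"Team A": (40, 60), "Team B": (50, 45)}
for line in encode_slopegraph(example):
    print(line.label, line.colour, line.weight)
```

Keeping the styling decisions in one small function like this makes it easy to change the threshold or the colours without touching whatever code actually draws the lines.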

Pros and Cons

In my opinion, the slopegraph should be viewed as an extension of the line graph, and so even though traditional chart elements like the y-axis have been stripped away, consistency should be kept with the regular conventions of data visualization.

In the above example, Andy has correctly honoured vertical position, so that each team appears on either side of the graph at the correct height according to the number of points it has. This is the same as one of Dr. Tufte’s original graphs (from The Visual Display of Quantitative Information), which follows the same practice and which I quite like:

Brilliant. However, when you no longer honour the vertical position to encode value, you lose the ability to truly compare across the categorical variable, a practice I tend to disagree with. This is usually done for legibility’s sake (to “uncrowd” the graph when there are a lot of lines); however, I feel like it could still be avoided in most cases. See below for an example.

Here the vertical position is not honoured, as some values which are smaller appear above those which are larger, so that the lines do not cross and the graph is uncluttered.

Also it should be noted in this case there is more than one value in the independent variable. As long as the scale in the vertical direction is still consistent, the changes in quantity can still be compared by the slope of the lines, even if the exact values cannot be compared because the vertical position no longer corresponds directly to quantity.

Either way, this type of slopegraph is closer to a group of sparklines (as Tufte originally noted), as it allows comparison of the changes in the dependent variable across values of the independent for each value of the categorical variable, but not the exact quantities.

Where things really start to fall apart, though, is when slopegraphs are used to connect values from two different variables. Charlie Park has some examples of this in his blog post on the subject, such as the one from Ben Fry below:

So here’s the question – what, exactly, does the slope of the different lines correspond to? The variable on the left is win-loss record and on the right is total salary. The first author correctly notes that in this case, the slopegraph is an extension of a parallel coordinates graph, which requires some further discussion.

A parallel coordinates graph is all very well and good for doing exploratory data analysis, and finding patterns in data with a large number of variables. However, I would avoid graphs like the one above in general – because the variables on the left and the right are not the same, the slope of the line is essentially meaningless.

In the case of the baseball data, why not just display the information in a regular scatterplot, as below? Simple and clear. You can then include the additional information using colour and size respectively if desired and make a bubble chart.

Was the disproportionately large payroll of the Yankees as obvious in the previous visualization? Maybe, but not as saliently. The relative size of the payroll was encoded in the thickness of the line, but quantity is not interpreted as quickly and accurately when encoded using area/thickness as it is when using position. Also, because the previous data were ranked (vertical position did not portray quantity), the much smaller number of wins by Kansas relative to the other teams was not as apparent as it is here.

Fry notes that he chose not to use a scatterplot as he wanted ranking for both quantities, which I suppose is the advantage of the original treatment, and something which is not depicted in the alternative I’ve presented. Also, Park correctly notes in the examples on his post that different visualizations draw the eye to different features of the data, and some people have more difficulty interpreting a visualization like a bubble chart than a slopegraph. Still, I remain a skeptical functionalist as far as visualization is concerned, and prefer the treatment above to the former.

Alternatives

I’ve presented some criticism of slopegraphs here, but are there alternatives? Yes. In addition to the above, let’s explore some others, using the data from the soccer example.

Really what we are interested in is the change in the quantity over the two values of the independent variable (year). So we can instead look at that quantity (change between the two years), and visualize it as a bar graph with a baseline of zero. Here the bars are again coloured by whether the change is positive or negative.

This is fine; however, we have lost the information encoded in the thickness of the lines. We can encode that using the lightness (intensity) of the different bars. Dark for > 25% change, light for the others:

Hmm, not bad. However we’ve still lost the information about the absolute value of points each year. So let’s make that the value along the horizontal axis instead.

Okay fine, now the length of the bars corresponds to the magnitude of the change in points across the two years, with positive changes being coloured blue and negative orange, and the shading corresponding to whether the change was greater or less than 25%.

However, even if I added a legend and told you what the colours correspond to, it’s pretty common for people to think of things as progressing from left to right (at least in Western cultures). The graph is difficult to interpret because for bars in orange the score for the first year is on the right, whereas for those in blue it’s on the left. That is to say, we have the absolute values, but the direction of the change is not depicted well. Changing the bars to arrows solves this, as below:

Now we have the absolute values of the points in each year for each team, and the direction of the change is displayed better than just with colour. Adding the gridlines allows the viewer to read off the individual values of points more easily. Lastly, we encode the other categorical variable of interest (change greater/less than 25%) as the thickness of the line.
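To make the construction concrete, here is a small text-only sketch of an arrow chart. Everything in it is illustrative: the function name, the 0 to 100 scale and the two teams with their points are made up, and the “=” shaft is just my stand-in for a thicker line on changes of more than 25%. It also assumes every value falls between lo and hi.

```python
def arrow_row(label, start, end, lo, hi, width=40, heavy=False):
    """Draw one arrow from `start` to `end` on a shared horizontal scale."""
    def column(value):
        # Map a value in [lo, hi] onto a character column in [0, width - 1].
        return round((value - lo) / (hi - lo) * (width - 1))

    tail, head = column(start), column(end)
    cells = [" "] * width
    shaft = "=" if heavy else "-"   # heavier shaft for the larger changes
    for i in range(min(tail, head), max(tail, head) + 1):
        cells[i] = shaft
    cells[tail] = "o"               # the tail marks the first year's value
    if head != tail:
        cells[head] = ">" if head > tail else "<"
    return f"{label:<8}|{''.join(cells)}|"


# Hypothetical points totals (first year, second year), for illustration only.
teams = {"Team A": (40, 60), "Team B": (50, 45)}
for name, (first, second) in teams.items():
    big_change = abs(second - first) / first > 0.25
    print(arrow_row(name, first, second, lo=0, hi=100, heavy=big_change))
```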

Like so. After creating the above independently, I discovered visualization consultant Naomi Robbins had already written about this type of chart on Forbes, as an alternative to using multiple pie charts. Jon Peltier also has an excellent in-depth description of how to make these types of charts in Excel, as well as showing another alternative visualization option to slopegraphs, using a dot plot.

Of course, providing the usual fixings for a graph such as a legend, title and proper axis labels would complete the above, which brings me to my last point. Though I think it’s a good alternative to slopegraphs, it can in no way compete in simplicity with Dr. Tufte’s example of a slopegraph, which had zero non-data ink. And, of course, this type of graph will not work when there are more than two values of the independent variable to compare across.

Closing Remarks

It is easy to tell who the true thought leaders in data visualization are, because they often take it upon themselves to find special cases for visualization where people struggle or visualize data poorly, and then invent new visualization types to fill the need (Tufte with the slopegraph, and Few with the bullet graph, which he came up with to supplant god-awful gauges on dashboards).

As I discussed, there are certain cases when slopegraphs should not be used, and I feel you would be better served by other types of graphs; in particular, cases where the slopegraph is a variation of the parallel coordinates chart rather than the line graph, or where quantity is not encoded in vertical position and comparing quantities for each value of the independent variable is important.

That being said, it is (as always) very important when making choices regarding data visualization to consider the pros and cons of different visualization types, the properties of the data you are trying to communicate, and, of course, the target audience.

Judiciously used, slopegraphs provide a highly efficient way in terms of data-ink ratio to visualize change in quantity across a categorical variable with a large number of values. Their appeal lies both in this and their elegant simplicity.

References & Resources

Slopegraphs discussion on Edward Tufte forum
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk
In Praise of Slopegraphs, by Andy Kirk
Edward Tufte’s “Slopegraphs” by Charlie Park
http://charliepark.org/slopegraphs/
Peltier Tech: How to Make Arrow Charts in Excel
http://peltiertech.com/WordPress/arrow-charts-in-excel/

Salary vs. Performance of MLB Teams by Ben Fry
http://fathom.info/salaryper/

Salary vs. Performance Scatterplot (Tableau Public)

Looking for Your Lens: 3 Tips on How to Be a Great Analyst

The other day as I was walking to work, all of a sudden, “pop!” one of the lenses in my glasses decided to free itself from the prison of its metal frame and take flight.

Well, damn.

The sidewalk was wet and partially covered in snow, and also with little islands of ice here and there. Finding a transparent piece of glass was not going to be easy.

So there I was, wandering about a small patch of sidewalk next to Toronto City Hall, squatting on my haunches, peering down at the sidewalk and awkwardly searching for my special missing piece of glass. I was not optimistic about my chances.

Most people walked on by and paid me no notice, but one kind soul, a woman with black, curly hair, stopped to help.

“Did you lose something?” she asked.

“Yeah,” I said, defeated, and held up my half empty black frames.

“Can I help?” she kindly offered. “I’m good at this sort of thing.”

“I guess,” I said, having already given up on restoring my headgear to completeness.

We scoured the sidewalk while passersby gave the occasional puzzled look, hurrying along.

“Ah!” she said, and, amazingly, picked up my lens. It had been hiding on a small patch of snow near a planter.

“WOW!” I was genuinely impressed. “You are good at this. Thanks so much.”

“No problem,” she said. “Have a great day!” and then promptly disappeared down the street, leaving me standing there on the sidewalk, bewildered.

That single small episode, a tiny vignette of a single life in a giant city amongst millions of others, was quite profound for me. This was because it got me thinking about two things: one, the kindness of strangers, and the other, of course, what I am always thinking about – the business of doing analysis.

Because as it turns out, those few statements that kind stranger made are equally important in being a great analyst.

“Did you lose something?”

A problem that a lot of analysts deal with on a regular basis is one of communication. The business, the stakeholder, the client, whoever it may be, comes to the analyst for help. They want to find out something about their business because they have data, and it’s the job of the analyst to turn that data (information) into insights (knowledge).
But here’s the problem – you can’t find something if you don’t know what you’re looking for.
Just as the kind passerby wouldn’t have been able to help me find my missing lens if she didn’t know what to look for, if you don’t know what kinds of insights you want to pull out of the data you have, then you won’t be able to find what you’re looking for either.
“We want to know how our people are connecting with our brand.”
It is the job of the analyst to turn these (often vague) desires of the business into specific questions that can be answered by analyzing data.
What people? (everyone, purchasers only, Boomers, Gen X, Gen Y, single mothers between the ages of 22 and 32 in urban centers?) What does connecting with the brand mean? (viewing an ad, purchase, visits to the website, app downloads, posts on social media, all of the above?)
So remember that a very large part of the job of the analyst is communication – not just about data – but working with others to determine exactly what it is they want to know. Once you know that, you can determine how to best do analysis to find the answers that are being sought after – hiding in plain sight in the data, like a piece of glass on a snowy patch of sidewalk.

“Can I help?”

Here’s something I think that a lot of analytical-type thinkers (this author included) often need to be reminded of: you can’t know everything. Even if you really, really want to. I’m sorry but you just can’t.
And that’s why once you know what it is you’re looking for, and what you need, you’ll need to ask for help (and that’s okay, that’s why we have meetings!). Sometimes the mere process of tracking down the data is a considerable task in itself. Sometimes no one really has a great overall understanding of how a really large, complicated system works – that kind of knowledge is often very distributed. These sorts of situations may require the help of many others in your company (or another business, vendor or client) who all have varying knowledge bases and skill sets.
It’s the job of the analyst to connect with the people they need to, get the data that they need, and do analysis to find the answers which are desired. Also, if you’re a good analyst, you’ll probably provide some kind of context around the impact (i.e. business implications) of your answer, and what parties would need to be involved to take the most beneficial actions as a result.
So even if you’re a data rock star, don’t ever be afraid to ask for help; and conversely, don’t hesitate to let others know who could help them too.

“I’m good at this sort of thing.”

Getting the analysis done requires not only not being afraid of asking for help, but also knowing the strengths and weaknesses of yourself, your team, and any others you may be working with.
It’s hard, but in my opinion, it takes a bigger person to be honest and admit when they are out of their depth than to say they can do something they clearly cannot.
When you’re out of your depth you have three options, which are really just three different ways of finishing the statement I’m not an expert. And they go something like this: I’m not an expert….
  1.  “… so I’m not going to do it because: I don’t know how / wouldn’t be able to figure it out / it’s not in my job description.”
  2.  “… but I can: learn quickly / give it a try / do my best / become one in 5 days.”
  3.  “… but I know <colleague> is and could: provide context to the problem / definitely help do it / teach us how.”
And the difference between answer #1 and the last two is what separates the office drones from the thought leaders, the reporting monkeys from the truly great analysts, and the unsuccessful from the successful in the world of data.
As I noted in the section above, you should never be afraid to ask for help, because there are going to be others out there that are better at things than you, and if you’re good you’ll recognize this fact and both of you will benefit. Hey, you might even learn something too, so next time you will be the expert.
Just remember that you can do analysis without crunching every number personally. You can work in data science without building the predictive model all by yourself. And you can work with data without writing every line of code alone. No analyst is an island.

“No problem! Have a great day!”

I hope that my little story and these points will help, or at least help you think, about the business of working with data and doing analysis, and what it means to be a great analyst.

This last point is perhaps as important as, or even more important than, the others – always be kind to the people you work with; always make it look easy, no matter how hard it was; and always be happy to help. That, above all, is what will make you a truly great analyst.

The Top 0.1% – Keeping Pace with the Data Explosion?

I’ve been thinking a lot lately, as I do, about data.

When you work with something so closely, it is hard not to let the way you think about it shape the way you think about other parts of your life.

For instance, cooks don’t think about food the same way once they’ve worked in the kitchen; bankers don’t think about money the same way once they’ve seen a vault full of it; and analysts don’t think about data the same way, once they start constantly analyzing it.

The difference being, of course, that not everything is related to food or money, but everything is data if you know how to think like an analyst.

I remember when I was in elementary school as a young child and a friend of mine described to me the things of which he was afraid. We sat in the field behind the school and stared down at the gravel on the track.

“Think about counting every single pebble of gravel on the track,” he said. “That’s the sort of thing that really scares me.” He did seem genuinely concerned. “Or think about counting every grain of sand on a beach, and then think about how many beaches there are in the whole world, and counting every grain of sand on every single one of those beaches. That’s the sort of thing that frightens me.”

The thing that scared my childhood friend was infinity; or perhaps not the infinite, but just very, very large numbers – quantities of the magnitude associated with that thing everyone is talking about these days called Big Data.

And that’s the sort of thing I’ve been thinking about lately.

I don’t remember the exact figure, but if you scour the internet to read up on our information age, and in particular our age of “Big Data” you will find statements similar to that below:

…. there has been more information created in the past year than there was in all of recorded history before it.


Which brings me to my point about the Top 0.1%.

Or, perhaps, to be more fair, probably something more like the Top 0.01%.

There is so much information out there. Every day around the world, every milliliter of gas pumped, every transaction at POS, every mouse click on millions of websites on the internet is being recorded, and creating more data.

Our capacity to record and store information has exploded exponentially.

But, perhaps somewhat critically, our ability to work with it and analyze it has not.

In the same way that The Ingenuity Gap talks about how the complexity of problems facing society is ever increasing but our ability to implement solutions is not matching that pace, we might be in danger of similarly finding the amount of information being recorded and stored in our modern world is exponentially increasing but our ability to manage and analyze it is not. Not only from a technological perspective, but also from a human perspective – there is only so much information one person can handle working with and keep “in mind”.

I know that many other people are thinking this way as well. This is my crude, data-centric take on what people have been talking about since the 1970s – information overload. And I know that many other authors have touched on this point recently as it is of legitimate concern; for instance, by acknowledging that the skill set needed to work with and make sense of these data sets of exponentially increasing size is so specialized that data scientists will not scale.

Will our technology and ability to manage data be able to keep up with the ever increasing explosion of it? Will software and platforms develop and match pace such that those able to make sense of these large data sets are not just a select group of specialists? How will the analyst of tomorrow handle working with and thinking about the analysis of such massive and diverse data sets?

Only time will answer these questions, but the one thing that seems certain is that the data deluge will only continue as storage becomes ever cheaper and of greater density while the volume, velocity and variety of data collected worldwide continue to explode.

In other words: there’s more where that came from.

Problems of Measurement

The other day I was walking in the mall amongst the office buildings and I saw something that I thought was odd.

In the atrium, where the clear glass surrounded all and allowed the sunlight and view of the bustling city streets to trickle in, there sat a woman.

Even from far away I could tell that she looked weary – despondent even, from the slump of her shoulders. But the thing that worried me the most, that was so odd, was that unlike all the others on the platform, she sat. She sat on the ground with her legs dangling over the edge and her eyes staring out the window.  Her shoulders were slumped and she had a certain indifference to all that occurred around her.

As I got closer I saw she had a clipboard and immediately my concern dissipated. The clipboard sat on the floor to her right, and the paper on it had a table partially filled in with scribbles of blue pen. In her right hand was a black thumb-counter, the kind security guards at concerts and bouncers at nightclubs use to count patrons, coming and going.

I strode past her, in a hurry as always, but finally my curiosity got the better of me and I turned on my heel.

“What are you counting?” I said, trying my best to be inquisitive in the friendliest way possible.

“People getting on and off the buses,” she said, without looking up.  I followed her gaze and saw that from where we were stood we had a direct view of stop 214 on the main street outside. There was a continual flurry of activity which she was responsible for recording – buses stopping, buses leaving and people getting off and on, almost constantly. It was a continual flow of humanity in transit.

Which of course got me thinking about problems of measurement.

As I watched the people mill off and on the buses I thought about the city transit department and their problems of measurement. Surely there must be a better way to track how many people got off and on these buses. Some sort of automated system – a motion sensor or card reader.

But the flow of people getting off and on was a continual stream – it was not a separate series of blips on a radar screen – which made me think that of course the problem was not that simple, otherwise it would have already been solved.

In analytics, a fact which I did not always appreciate is that before there are problems of data, and before there are problems of analysis, there are problems of measurement.

Before you set off to gain insight about the world and your chosen topic of study, you first have to stop and say “What is it we want to know?” which leads to the question, “What is it we need to measure?” and then finally, and sometimes most importantly: “How will we measure it?”

In the world of web analytics this has a lot to do with what is known as implementation. You need to make sure all your evars and sprops are in a row, otherwise it’s going to be really hard to actually figure out what’s going on.

After all, if your measuring stick is a bathroom scale it’s very hard to figure out how tall people are.

And then I got to thinking about everyday analytics again. If you’re turning your analysis inward, and you want to learn about the thing that analysts are not paid the big bucks to do analysis on – you – you need to ask the same questions.

You have your own problems of measurement.

You need to start at the beginning and ask yourself – “What is it I want to know?”

What percent of your income do you spend on rent? How many coffees do you drink in a month? How much weight have you lost on your diet?

Then the second question is – “What is it I need to measure?”

Income and expenditures? Number of trips to Starbucks? Weight in pounds over time?

And now we come to the last question – “How will I measure it?”

Income? Dollars on my pay stub. Number of trips to Starbucks? Self-explanatory. Weight in pounds over time?  On a bathroom scale.

However, I would argue that when we come to the last question now we should treat it differently. We should treat it differently because now this question can be the starting point. Because in the last question, for you, the “it” is your life.

It’s not a website. It’s not a startup. It’s not a product.

It’s your life. 

And it’s yours and yours alone. So you get to decide what is really important to you – and you’re the only one who can. You get to solve your own problems of measurement, and figure out how you are going to measure your life.

So regardless of whether you practice quantified self, or care about everyday analytics, or not, the one question I will leave you with is this – How will you measure your life?

How to Think Like an Analyst

So I was talking to my Aunt a couple of weekends ago. My Aunt explained that though she was happy for me and the work that I do, she didn’t understand any of it. I tried my best to explain in general terms what web analytics, and analytics as a whole, is all about.

Our conversation continued, and I further offered that though she may not understand exactly what it is I do, she could understand the spirit in which it is done – the way to think about analysis.

Not everyone is cut out to be an analyst. There are those who have always been good with numbers, and there are those who describe themselves as ‘the one who was always bad at math in high school’.

And that’s fine. Like I said, not everyone is cut out to be an analyst, not everyone wants to be, and not everyone can be. However, it is a firm belief of mine that everyone, everyone can think like an analyst.

And I’ll show you how.

The Questions You Need to Ask

True, you may not have the skill set necessary to be an analyst – you may, in fact, be one of those who was bad at math in high school, and when people mention spreadsheets you think about bedding, not computer software.

But that doesn’t mean you can’t think like an analyst.

Part of being a good analyst is not just being able to do analysis, but being able to ask the right questions which lead to it.

All good analysis starts with a question. So all you have to do is ask the right questions.

And, in this humble author’s opinion, those two questions are how and what.

Question 1 – How (many)?

This is the simple question, and is one of measurement and descriptive statistics.

Thinking quantitatively is a key part of thinking like an analyst.

If you learn to think in this way you will find that ordinary, everyday situations can become part of ordinary, everyday analytics.

For instance, any time you are at some sort of gathering of people or social function you can think like an analyst by asking yourself the question – how many?

How many men are there in the room? How many women? How many are there proportionally?
How many people at the party are wearing glasses? How many are not?
How many people at the networking mixer are eating and drinking? How many are just eating? Just drinking?
How many people at the dinner party decided to have the chicken? How many did not? How many finished all their food and how many left food behind? How many plates did each person have?

And so on.
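If you did want to jot those counts down, a tally is all it takes. Here is a minimal sketch in Python; the guest records and the helper names (how_many, proportions) are made up purely for illustration.

```python
from collections import Counter

# Hypothetical observations from a dinner party: one record per guest.
guests = [
    {"glasses": True, "meal": "chicken"},
    {"glasses": False, "meal": "fish"},
    {"glasses": True, "meal": "chicken"},
    {"glasses": False, "meal": "chicken"},
]

def how_many(records, key):
    """Tally how many records fall under each value of `key`."""
    return Counter(record[key] for record in records)

def proportions(counts):
    """Convert raw counts into shares of the total."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

meal_counts = how_many(guests, "meal")
meal_shares = proportions(meal_counts)
print(meal_counts)
print(meal_shares)
```

The counts answer "how many?" and the shares answer "how many proportionally?", which is really all descriptive statistics is at this level.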

But as I said, the question of how many is simply one of describing the state of affairs. To really think like an analyst you also need to ask the second question.

Question 2 – What (is the relationship between….)?

The second question helps you to think like an analyst and go beyond simply describing things quantitatively and start thinking about possible relationships.

Here, to illustrate how thinking like an analyst is subject-matter independent, we can pick a topic, any topic. So let’s go with….. peanut butter. I like peanut butter.

The second question is what lately I find I’m asking myself all the time about almost everything (whether I like it or not). And that very important question you can ask yourself is – what is the relationship between……?

Pick properties of, or related to, your subject of analysis – some of which you may compare across or between, and others which may be measured. In technical terms these are known as dimensions and measures, respectively.

For example, using our randomly chosen topic of peanut butter, first we brainstorm all the things we could possibly think of related to peanut butter.

Type (chunky or smooth), brand, container (size, type, colour), price, sales, consumption, nutritional content, location, time…

And so on. Let’s stop there.

Then we ask the question: What is the relationship between a and b? Where a is one of the things we brainstormed as a category, and b is one of the things we brainstormed as a measurement.

What is the relationship between the type of peanut butter and its nutritional content? (That is, how is chunky peanut butter different from smooth peanut butter in terms of calories and fat?)

What is the relationship between the brand of peanut butter and its sales? (That is, how do the total sales of different peanut butter brands compare? You could also add time and location dimensions – how do sales between brands compare this year? Last year? Worldwide? In Canada vs. the US? Per store in Ontario?)

What is the relationship between the container size and location? (That is, do different countries have different sized containers for peanut butter? What is the average container size per country? In each region? Or look at location in store – are all the containers in the same aisle or are the different sizes in different places (e.g. the bulk food aisle)? How is the distribution of container sizes broken down across different stores across the country?)

And so forth. As you can see, there are so many questions you can ask by combining properties of a topic of interest in this way. And these are only questions with two properties – many more questions of greater complexity could be generated by combining multiple properties (e.g. What is the relationship between peanut butter sales and consumption and the brand and type?)
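To get a feel for how quickly the questions pile up, here is a small Python sketch that pairs every category with every measurement. The split of the brainstormed properties into dimensions and measures is my own assumption for illustration (container size, for instance, could arguably sit in either list), and relationship_questions is a made-up helper name.

```python
from itertools import product

# Brainstormed properties, split by hand (an assumption) into
# dimensions (things to compare across) and measures (things to quantify).
dimensions = ["type", "brand", "location", "time"]
measures = ["price", "sales", "consumption", "container size"]

def relationship_questions(dimensions, measures):
    """Pair every dimension with every measure as a 'relationship' question."""
    return [
        f"What is the relationship between {a} and {b}?"
        for a, b in product(dimensions, measures)
    ]

questions = relationship_questions(dimensions, measures)
print(len(questions))
for question in questions[:3]:
    print(question)
```

Four dimensions and four measures already give sixteen two-property questions, before combining three or more properties at a time.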

The Hard Question

There is one final question which I did not mention, which, if you really want to think like an analyst, is the most important question of all. In fact, I would go further and say that even if you are not thinking like an analyst, this is the most important question of all. And that ultimate question is why.

The question of why is the most important question, the hardest question, the question which drives all of the analysis that analysts the world over do.

Why.

Why has our new marketing initiative not resulted in increased sales in the third quarter? Why is the sky blue? Why does Amazon send me so many emails related to Home and Garden products? Why can’t I sleep at night? Why are there three million kinds of laundry detergent but only two kinds of baking powder? Why? Why? Why.

This is the question which drives all investigation, which drives all measurement, which drives all analysis.

And this is the question, whether you want to think like an analyst or not, you should always be asking yourself.

Finer Points Regarding Data Visualization Choices

The human mind is limited.

We can only process so much information at one time. Numerals are text which communicates quantity. However, unlike other text, it’s a lot harder to read a whole bunch of numbers and get a high-level understanding of what is being communicated. There are sentences of numbers and quantities (these are called equations, but not everyone is as literate in them). However, simply looking at a pile of data and having an understanding of the ‘big picture’ is not something most people can do. This is especially true as the amount of information becomes larger than a table with a few categories and values.

If you’re a market research, business, data, financial, or (insert other prefix here) analyst, part of your job is taking a lot of information and making sense of that information, so that other people don’t have to. Let’s face it – your Senior Manager or The VP doesn’t have time to wade through all the data – that’s why they hired you.

Ever since Descartes’ epiphany (and even before that) people have been realizing that there are other, more effective ways to communicate information than having to look at all the details. You can communicate the shape of the data without knowing exactly how many Twitter followers were gained each day. You can see what the data look like without having to know the exact dollar value for sales each and every day. You can feel what the data are like, and get an intuitive understanding of what’s going on, without having to look at all the raw information.

Enter data visualization.

Like any practice, data visualization and depicting quantitative relationships visually can be done poorly or can be done well. I’m sure you’ve seen examples of the former, whether it be in a presentation or other report, or perhaps floating around the Internet. And the latter, like so many good things, is not always so plentiful, nor appreciated. Here I present some finer points between data visualization choices, in the hope that you will always find yourself depicting data well.

Pie (and Doughnut) Chart

Ah, the pie chart. The go-to the world over when most people seek to communicate data, and one both loved and loathed by many.
The pie chart should be used to compare quantities of different categories where the proportion of the whole is important, not the absolute values (though these can be added with labelling as well). It’s important that the number of categories being compared remain small – depending on the values, the readability of the chart decreases greatly as the number of categories increases. You can see this below. The second example is a case where an alternate representation should be considered, as the chart’s readability and usefulness are lower given the larger number of proportions being compared:

Doughnut charts are the same as pie charts but with a hole in the center. They may be used in the place of multiple pie charts by nesting the rings:

Hmm.

Though again, as the number of quantities being compared increases the readability and visual utility generally decreases and you are better served by a bar chart in these cases. Also there is the issue that the area of each annulus will be different for the same angle, depending upon which ring it is in.
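The annulus point is easy to check with a little geometry: a ring segment spanning an angle θ between radii r and R has area θ(R² − r²)/2. Below is a minimal Python sketch; the radii are hypothetical and ring_segment_area is a name I made up for illustration.

```python
import math

def ring_segment_area(inner_radius, outer_radius, angle_degrees):
    """Area of a doughnut-chart segment: a sector of the annulus between two radii."""
    angle = math.radians(angle_degrees)
    return 0.5 * angle * (outer_radius**2 - inner_radius**2)

# Hypothetical nested doughnut: two rings of equal thickness, same 90 degree angle.
inner_ring = ring_segment_area(1.0, 2.0, 90)
outer_ring = ring_segment_area(2.0, 3.0, 90)
print(outer_ring / inner_ring)
```

With these radii the outer segment covers five thirds of the area of the inner one, even though both represent the same 25% share.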

With circular charts it is best to avoid legends, as they cause the eye to flit back and forth between the different segments and the legend. However, when abiding by this practice for doughnut charts, labeling becomes a problem, as you can see above.

Tufte contends that a bar chart will always serve better than a pie chart (though some others disagree). The issue is that there is some debate about the way the human mind processes comparisons with angular representations versus those depicted linearly or by area. I tend to agree and find the chart below much better data visualization than the one we saw previously:

Isn’t that much better?

From a practical perspective – a pie chart is useful because of its simplicity and familiarity, and is a way to communicate proportion of quantities when the number of categories being compared is small. 
Bonus question:
Q. When is it a good idea to use a 3-D pie chart?
A. Never. Only as an example of bad data visualization!

Bar Charts

Bar charts are used to depict the values of a quantity or quantities across categories. For example, to depict sales by department, or per product type.
This type of chart can be (and is) used to depict values over time; however, said chunks of time should be discrete (e.g. quarters, years) and of a small number. When a comparison is to be done over time and the number of periods / data points is larger, it is better visualized using a line chart.

As the number of categories becomes large, an alternative to the usual arrangement (‘column’ chart) is to arrange the categories vertically and bars horizontally. Note this is best done only for categorical / nominal data as data with an implied order (ordinal, interval, or ratio type data) should be displayed left-to-right in increasing order to be consistent with reading left to right.
Bar charts may also be stacked in order to depict both the values between categories as well as the total across them. If the absolute values are not important, then stacked bar charts may be used in this way in the place of several pie charts, with all bars having a maximum height of 100%:

Stephen Few contends that this still makes it difficult to compare proportions, similar to the problem with pie charts, and has other suggestions [PDF], though I think it is fine on some occasions, depending on the nature of the data being depicted.
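For the 100% stacked version, the only preparation needed is rescaling each bar's parts so they sum to 100. A minimal sketch in Python, with made-up sales figures and a hypothetical helper name (to_percent_stacks); it assumes every category has a non-zero total.

```python
def to_percent_stacks(table):
    """Rescale each category's parts so that every stacked bar sums to 100."""
    result = {}
    for category, parts in table.items():
        total = sum(parts.values())  # assumed non-zero
        result[category] = {
            name: 100 * value / total for name, value in parts.items()
        }
    return result

# Hypothetical sales by department, split by channel.
sales = {
    "Toys": {"online": 30, "in-store": 70},
    "Books": {"online": 120, "in-store": 80},
}
shares = to_percent_stacks(sales)
print(shares)
```

Note what is lost: Books sells twice as much as Toys overall, but after rescaling both bars are the same height, which is exactly why this form only suits cases where the absolute values are not important.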

When creating bar charts it is important to always start the y-axis from zero so as not to produce a misleading graph.
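To see how misleading a truncated axis can be, compare the ratio of the drawn bar heights with the ratio of the actual values. This is a rough sketch in Python; the numbers and the function name apparent_ratio are assumptions for illustration only.

```python
def apparent_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of drawn bar heights when the y-axis begins at `axis_start`."""
    return (value_a - axis_start) / (value_b - axis_start)

# Hypothetical values: 105 vs 100, a true difference of 5%.
from_zero = apparent_ratio(105, 100)       # y-axis starts at zero
truncated = apparent_ratio(105, 100, 95)   # y-axis starts at 95
print(from_zero, truncated)
```

Starting the axis at 95 makes the first bar look twice as tall as the second, when the underlying values differ by only 5%.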

A column chart may also be combined with a line graph of the running (cumulative) total across categories in a type of combo chart known as a Pareto chart.
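In its usual form, a Pareto chart sorts the bars from largest to smallest and the line tracks the cumulative share of the total. Here is a minimal sketch of preparing that data in Python; the department figures and the name pareto_series are hypothetical.

```python
def pareto_series(values):
    """Sort categories largest-first and attach the cumulative share in percent."""
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    total = sum(values.values())
    running = 0
    series = []
    for category, value in ordered:
        running += value
        series.append((category, value, 100 * running / total))
    return series

# Hypothetical sales by department.
sales = {"Books": 5, "Grocery": 50, "Toys": 15, "Clothing": 30}
series = pareto_series(sales)
for category, value, cumulative in series:
    print(category, value, round(cumulative))
```

The bars would be drawn from the second column and the line from the third, which always ends at 100%.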

Scatterplot (and Bubble Graphs)

Scatterplots are used to depict a relationship between two quantitative variables. The value pairs for the variables are plotted against each other, as below:

When used to depict relationships occurring over time, we instead use a special type of scatterplot known as a line graph (next section).

A bubble chart is a type of scatterplot used to compare relationships between three variables, where the points are sized by area according to a third value. Care should be taken to ensure that the points are sized correctly in this type of chart, so as not to incorrectly depict the relative proportion of quantities.
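Sizing by area means the radius should grow with the square root of the value, not with the value itself. A minimal sketch in Python, where bubble_radius and the maximum radius of 20 are assumptions for illustration:

```python
import math

def bubble_radius(value, max_value, max_radius=20.0):
    """Radius that makes the bubble's area, not its radius, proportional to `value`."""
    return max_radius * math.sqrt(value / max_value)

# Hypothetical third-variable values: one point is four times the other.
r_small = bubble_radius(25, 100)
r_large = bubble_radius(100, 100)
area_ratio = (r_large / r_small) ** 2
print(r_small, r_large, area_ratio)
```

The larger bubble gets twice the radius and therefore four times the area, matching the 4:1 ratio in the values. Scaling the radius directly would instead give sixteen times the area and badly overstate the difference.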

Relationships between four variables may also be visualized by colouring each point according to the value of a fourth variable, though this may be a lot of information to depict all at once, depending upon the nature of the data. When animated to include a fifth variable (usually time) it is known as a motion chart, which is perhaps most famously demonstrated in Hans Rosling’s landmark TED Talk which has become somewhat of a legend.

Line Graphs

Line graphs are usually used to depict quantities changing over time. They may also be used to depict relationships between two (numeric) quantities when there is continuity in both.

For example, it makes sense to compare sales over time with a line graph, as time is a numerical quantity that varies continuously:

However it would not make sense to use a line graph to compare sales across departments as that is categorical / nominal. Note that there is one exception to this rule and that is the aforementioned Pareto chart.

Omitting the points on the line graph and using a smooth graph instead of line segments creates an impression of more data being plotted, and hence a greater continuity. Compare the plot above with the one below:

So practically speaking save the smooth line graphs for when you have a lot of data and the points would just be visual clutter, otherwise it’s best to overplot the points to be clear about what quantities are being communicated.

Also note that unlike a bar chart, it is acceptable to have a non-zero starting point for the y-axis of a line graph as the change in values is being depicted, not their absolute values.

Now Go Be Great!

This is just a sample of some of the finer differences between the choices for visualizing data. There are of course many more ways to depict data, and I would argue that the possibilities for data visualization are limited only by the imagination of the visualizer. However, when sticking with the tried, true and familiar, keep these points in mind to be great at what you do and get your point across quantitatively and visually.
Go, visualize the data, and be amazing!

Seriously, What’s a Data Scientist? (and The Newgrounds Scrape)

So here’s the thing. I wouldn’t feel comfortable calling myself a data scientist (yet).

Whenever someone mentions the term data science (or, god forbid, BIG DATA, without a hint of skepticism or irony), people inevitably start talking about the elephant in the room (see what I did there?).

And I don’t know how to ride elephants (yet).

Some people (like yours truly, as just explained) are cautious – “I’m not a data scientist. Data science is a nascent field. No one can go around really calling themselves a data scientist because no one even really knows what data science is yet, there isn’t a strict definition.” (though Wikipedia’s attempt is noble).

Other people are not cautious at all – “I’m a data scientist! Hire me! I know what data are and know how to throw around the term BIG DATA! I’m great with pivot tables in Excel!!”

Aha ha. But I digress.

The point is that I’ve done the first real work which I think falls under the category of data science.

I’m no Python guru, but threw together a scraper to grab all the metadata from Newgrounds portal content.

The data are here if you’re interested in having a go at it already.

The analysis and visualization will take time; that’s for a later article. For now, here’s one of my exploratory plots, of the content rating by date. Already we can gather from this that, at least at Newgrounds, 4-and-a-half stars equals perfection.

Sure feels like science.

Top 5 Tips for Communicating Data

Properly communicating a message with data is not always easy.

If it were, everyone could do it, and there wouldn’t be questions at the end of presentations, discussions around the best way to tweak a scatterplot, or results to a Google Images search for chartjunk.

Much has been written on the subject of how to properly communicate data, and there’s a real art and science to it. Many fail to appreciate this, which can result in confusion – about the message trying to be conveyed, the salience of various features of the data being presented, or why the information is important.

There’s a lot to be said on the subject, but keep these 5 tips for communicating data in mind, and when you have a data-driven message to get across they will help you do so with clarity and precision.

1. Plan: Know What You Want to Say

Just like you wouldn’t expect an author to write a book without a plot, or an entrepreneur to launch a new venture without a business plan, you can’t expect to march blindly into creating a report or article using data without knowing what you want to say.

Sometimes all the analysis will have already been done, and so you just need to think about how to best present it to get your message across. What variables and relationships are most important? What is the best way to depict them? Why oh why am I using aquamarine in this bar chart?

Other times figuring out your exact message will come together with the analysis, and so you would instead start with a question you want to answer, like “How effective has our new marketing initiative been over the last quarter?” or “How has the size of the middle class in Canada changed over the last 15 years?”

2. Prepare: Be Ready

As I reflected upon in a previous post, sometimes people fail to recognize that just getting the information and putting it in the proper shape is a part of the process that should not be overlooked.

Before you even begin to think about communicating your message, you need to make sure you have the data available and in a format (or formats) that you can comfortably work with. You should also consider what data are most important and how to treat them accordingly, and if any other sets should also be included (see Tip #3).

On this same note, before launching into the analysis or creation of the end product (article, report, slidedeck, etc.) it is important to think about if you are ready in terms of tools. What software packages or analysis environments will be used for the data analysis? What applications will be used to create the end product, whatever it may be?

3. Frame: Context is Key

Another important tip to remember is to properly frame your message by placing the data in context.

Failure to follow this tip results in simply serving up information – data are being presented but there is no message being communicated. Context answers the questions “Why is this important?” and “How is this related to x, y, and z?”

Placing the data in context allows the audience to see how it relates to other data, and why it matters. Do not forget about context, or you will have people asking why they should care about what you are trying to communicate.

4. Simplify: Less is More

Let me be incredibly clear about this: more is not always better. If you want to get a message across, simpler is better. Incredibly complicated relationships can be discussed, depicted, and dissected, but that doesn’t mean that your article, slide or infographic needs to look like a spreadsheet application threw up all over it.

Keep the amount of information that your audience has to process at a time (per slide, paragraph, or figure) small. Relationships and changes should be clearly depicted and key differences highlighted with differences in colour or shape. The amount of text on graphs should be kept to a minimum, and if this is not possible, then perhaps the information should be presented in a different way.

The last thing you want to do is muddle your message with information overload and end up confusing your audience.

5. Engage: It’s Useless If No One Knows It Exists

In the world of business, when creating a report or presenting some data, the audience is often predefined. You create a slidedeck to present to the VP and if your data are communicated properly (because you’ve followed Tips 1-4, wink wink) then all is well and you’re on your way to the top. You email the report and it gets delivered to the client and your dazzling data analysis skills make them an even greater believer in your product. And so on.

In other cases though, like when writing a blog post or news article, your audience may not be picked out for you and so it’s also your job to engage them. All your dazzling data analysis and beautiful visual work will contribute nothing if no eyeballs are laid upon it. For this reason, another tip to remember is to engage interested parties, either directly or indirectly through channels such as social media.

What Are You Waiting For?

So there are your Top 5 Tips for Communicating Data. Like I said, it’s not always easy. Keep these tips in mind, and you’ll ask yourself the right questions before you give all the answers.

Go. Explore the data, and be great. Happy communicating.