Toronto Data Science Meetup – Machine Learning for Humans

A little while ago I spoke again at the Toronto Data Science Group, and gave a presentation I called “Machine Learning for Humans”.

I had originally intended to cover a wide variety of general “gotchas” around the practical application of machine learning; however, with half an hour there’s really only so much you can cover.

The talk ended up being more of an overview of binary classification, as well as some anecdotes around mistakes in using machine learning I’ve actually seen in the field, including:

  • Not doing any model evaluation at all
  • Doing model evaluation but without cross-validation
  • Not knowing what the cold start problem is and how to avoid it with a recommender system
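The second mistake is worth a sketch. Here is a toy k-fold cross-validation loop, written out by hand purely for illustration – in a real project you would likely reach for something like scikit-learn’s `cross_val_score`, and the function names below are my own:

```python
# Toy k-fold cross-validation, hand-rolled for illustration only.

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(fit, X, y, k=5):
    """Average held-out accuracy over k folds.

    `fit` takes (X_train, y_train) and returns a predict(x) function.
    """
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

The whole point is in the loop body: the model is fit on the training indices and scored only on the held-out indices, which is exactly what evaluating on your training data skips.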
All in all it was received very well, despite being a review for a lot of people in the room. As usual, I took away some learnings around presenting:

  • Always lowball for time (the presentation was rushed despite my blistering pace)
  • Never try to use fancy fonts in PowerPoint and expect them to carry over – it never works (copy-paste them as an image instead once you’ve got the final presentation)

Dan Thierl of Rubikloud gave a really informative and candid talk about what product management at a data science startup can look like. In particular, I was struck by his honesty around the challenges faced (both from a technical standpoint and with clients), how quickly you have to move / pivot, and how some clients are just looking for simple solutions (Can you help us dashboard?) and are perhaps not at a level of maturity to want, or fully utilize, a data science solution.
All in all, another great meetup that prompted some really interesting discussion afterward. I look forward to the next one. I’ve added the presentation to the speaking section.

Big Data Week Toronto 2014 Recap – Meetup #3: Big Data Visualization

This past week was Big Data Week for those of you that don’t know, a week of talks and events held worldwide to “unite the global data communities through series of events and meetups”.

Viafoura put on the events this year for Toronto and was kind enough to invite me to be one of the speakers, talking on data visualization and how it relates to all this “Big Data” stuff.

Paul spoke about detecting fraud online using visualization and data science techniques. Something I often think about when presenting is how to make your message clear and connect with both the least technical people in the audience (who, quite often, have attended strictly out of curiosity) and the most knowledgeable and technically-minded people present.

I was really impressed with Paul’s visual explanation of the Jaccard coefficient. Not everyone understands set theory, however almost everyone will understand a Venn diagram if you put it in front of them.

So, to explain the Jaccard index as a measure of the similarity between two sets when giving a presentation, which is better? You could put the definition up on a slide:

    J(A, B) = |A ∩ B| / |A ∪ B|

which is fine for the mathematically-minded in your audience but would probably lose a lot of others. Instead, you could use a visualization like this figure Paul included:
The two depict the same quantity, but the latter is far more accessible to a wide audience. Great stuff.
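For those who prefer code to either slide, the same quantity is a couple of lines of Python set arithmetic. A minimal sketch of my own (the empty-set convention here is my choice, not something from the talk):

```python
# The Jaccard index via Python set operations.

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| -- the fraction of the union the two sets share."""
    union = a | b
    if not union:
        return 0.0  # both sets empty: define similarity as 0 by convention
    return len(a & b) / len(union)
```

For example, `{1, 2, 3}` and `{2, 3, 4}` share 2 of their 4 distinct elements, giving a Jaccard index of 0.5.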
I spoke on “Practical Visualizations for Visualizing Big Data”, which covered some fundamentals (thinking about data and perception in visualization / visual encoding), the challenges the three “V”s of Big Data present when doing visualization and analysis, and some thoughts on how to address them.

This prompted some interesting discussions afterward; I found most people were much more interested in the fundamentals – how to do visualization effectively, what constitutes a visualization, and the perceptual elements of dataviz – and less in the data science aspects of the talk.

Overall it was a great evening and I was happy to get up and talk visualization again. Thanks to the guys from Viafoura for putting this on and inviting me, and to the folks at the Ryerson DMZ for hosting.

Mini-gallery culled from Twitter below:

The Top 0.1% – Keeping Pace with the Data Explosion?

I’ve been thinking a lot lately, as I do, about data.

When you work with something so closely, it is hard to keep the way you think about it from shaping the way you think about other things in other parts of your life.

For instance, cooks don’t think about food the same way once they’ve worked in the kitchen; bankers don’t think about money the same way once they’ve seen a vault full of it; and analysts don’t think about data the same way, once they start constantly analyzing it.

The difference being, of course, that not everything is related to food or money, but everything is data if you know how to think like an analyst.

I remember being in elementary school as a young child when a friend of mine described to me the things of which he was afraid. We sat in the field behind the school and stared down at the gravel on the track.

“Think about counting every single pebble of gravel on the track,” he said. “That’s the sort of thing that really scares me.” He seemed genuinely concerned. “Or think about counting every grain of sand on a beach, and then think about how many beaches there are in the whole world, and counting every grain of sand on every single one of those beaches. That’s the sort of thing that frightens me.”

The thing that scared my childhood friend was infinity; or perhaps not the infinite, but just very, very large numbers – quantities of the magnitude of that thing everyone is talking about these days called Big Data.

And that’s the sort of thing I’ve been thinking about lately.

I don’t remember the exact figure, but if you scour the internet to read up on our information age, and in particular our age of “Big Data”, you will find statements similar to the one below:

…. there has been more information created in the past year than there was in all of recorded history before it.

Which brings me to my point about the Top 0.1%.

Or, perhaps, to be more fair, probably something more like the Top 0.01%.

There is so much information out there. Every day around the world, every milliliter of gas pumped, every point-of-sale transaction, and every mouse click on millions of websites is being recorded, creating more data.

Our capacity to record and store information has exploded exponentially.

But, perhaps somewhat critically, our ability to work with it and analyze it has not.

In the same way that The Ingenuity Gap talks about how the complexity of problems facing society is ever increasing but our ability to implement solutions is not matching that pace, we might be in danger of similarly finding the amount of information being recorded and stored in our modern world is exponentially increasing but our ability to manage and analyze it is not. Not only from a technological perspective, but also from a human perspective – there is only so much information one person can handle working with and keep “in mind”.

I know that many other people are thinking this way as well. This is my crude, data-centric take on what people have been talking about since the 1970s – information overload. And I know that many other authors have touched on this point recently, as it is of legitimate concern; for instance, acknowledging that the skill set needed to work with and make sense of these data sets of exponentially increasing size is so specialized that data scientists will not scale.

Will our technology and ability to manage data be able to keep up with the ever increasing explosion of it? Will software and platforms develop and match pace such that those able to make sense of these large data sets are not just a select group of specialists? How will the analyst of tomorrow handle working with and thinking about the analysis of such massive and diverse data sets?

Only time will answer these questions, but the one thing that seems certain is that the data deluge will only continue as storage becomes ever cheaper and of greater density while the volume, velocity and variety of data collected worldwide continues to explode.

In other words: there’s more where that came from.