Learning from prediction contests
I don’t think it has ever been easier to get into machine learning and predictive analytics than right now. Let me explain …
As you probably know, a company called Kaggle organizes predictive analytics competitions where data scientists can earn money from their skills and companies can tap into some of the world’s undiscovered talent. There are other similar companies, like CrowdAnalytix, and more specialized/closed variants on the same idea such as Innocentive, but I think Kaggle has deservedly gotten the most buzz because they have done the best job of presenting their business case and vision. For example, here is a presentation that Jeremy Howard from Kaggle gave at the Strata NY 2011 conference, where he outlines how Kaggle wants to become a “meritocratic” platform that allows people who are good at analytics to finally get properly compensated for their skills.
I have known about Kaggle for quite some time and been a fan of their business idea, but with one full-time job, occasional work on the side and two young kids, I figured I’d never have the time to participate fruitfully in the competitions myself. As it happened, I got the chance to chat with Kaggle’s CEO Anthony Goldbloom at a conference (Strata 2011 in Santa Clara) and he persuaded me to give the competitions a try. So I finally jumped in, and found that despite not really having the time to spare, I still enjoyed it and learned a lot. So far I’ve only participated for real in one competition, the Dunnhumby Shopper Challenge, where the task was to predict (based on historical shopping records for thousands of customers) on what date each customer would next visit the store, and how much money (within $10) he or she would spend. This task turned out to be surprisingly non-standard and was definitely not something that you could just throw your favourite algorithm at right out of the box.
Already from this one competition I learned / noticed several things:
– You can sometimes get pretty far just by using common sense and a very simple conceptual model. In fact the winning entry by Alexander d’Yakonov (explained here) used essentially the same basic idea as my model, although he had added a couple of tricks that I hadn’t thought of.
– It’s extremely helpful to learn from your competitors. Kaggle often asks high-scoring contestants to explain how they did it, which is a huge service to the community. For Dunnhumby, there was the winning entry that I linked above, plus this from Neil Schneider, who placed second, and this from William Cukierski, who placed fourth. Similar explanations for other competitions can be found under the “How I did it” tag on Kaggle’s blog.
– A competition can really motivate you to learn new stuff that you wouldn’t have dreamed of touching otherwise. The Dunnhumby competition motivated me to learn survival analysis, although I didn’t end up using that particular statistical framework. (I tried, but couldn’t get it to work well on the problem.) I also started to brush up on time series analysis.
During the past few days, I’ve discovered a couple of really, really good resources about how to get started with prediction contests:
Using R for data mining competitions by Jonathan Lee shows case studies from Kaggle competitions. It’s heavy on the R material (which I like) but go look at it even if you don’t use R, as there is a lot more to the presentation than code.
Getting in shape for the sport of data science by Jeremy Howard of Kaggle (again). This is a really great nuts-and-bolts talk about how to compete in prediction contests, with useful tips on how to “munge” your data into shape using different tools – even Excel! – and set up your models. There are many nice tricks here. Finally, he explains the idea behind the random forest algorithm: “a lot of crappy predictors that are all crap in a slightly different way.” For a while I was tempted to apply the same idea to Kaggle competitors: all are crap in a different way, but the occasional competitor stumbles on something good and the rest cancel out each other’s errors 🙂 This theory doesn’t hold, though, as many of the top competitors (such as Jeremy himself before he joined Kaggle) are consistently good.
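To make the “crap in a slightly different way” intuition a bit more concrete, here is a small toy sketch in Python. It is not a random forest and not anything from Jeremy’s talk; it only illustrates the averaging idea, under the (strong) assumption that each predictor’s error is independent random noise. The function names, the noise level and the number of predictors are all made up for illustration.

```python
import random
import statistics


def noisy_predictor(truth, rng, noise_sd=5.0):
    """One 'crappy' predictor: the true value plus independent Gaussian error."""
    return truth + rng.gauss(0.0, noise_sd)


def ensemble_prediction(truth, rng, n_predictors=200, noise_sd=5.0):
    """Average the guesses of many independent noisy predictors."""
    guesses = [noisy_predictor(truth, rng, noise_sd) for _ in range(n_predictors)]
    return statistics.fmean(guesses)


def mean_abs_error(predict, truth, rng, trials=500):
    """Average absolute error of a predictor over repeated trials."""
    errors = [abs(predict(truth, rng) - truth) for _ in range(trials)]
    return statistics.fmean(errors)


rng = random.Random(0)  # fixed seed so the run is repeatable
truth = 42.0            # hypothetical quantity we are trying to predict

single_error = mean_abs_error(noisy_predictor, truth, rng)
ensemble_error = mean_abs_error(ensemble_prediction, truth, rng)

print(f"single predictor error: {single_error:.2f}")
print(f"ensemble error:         {ensemble_error:.2f}")
```

With independent errors, the averaged guess should land much closer to the truth than a typical single guess. Real random forests have to work to make their trees differ (for example by training each one on a different resample of the data), because predictors that all make the same mistake do not cancel out.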
Soo .. let’s see if I will have time to dive into the next contest in earnest …
P.S. Other interesting resources for learning about prediction / machine learning etc., apart from the stellar presentations mentioned above and the Kaggle “How I did it” testimonials, include:
Stanford’s free online AI class with Peter Norvig and Sebastian Thrun (of Google’s self-driving car fame) – I’ve been trying to follow this, but predictably enough haven’t had time to keep up
Stanford’s online machine learning class with Andrew Ng – I’ve only watched one lecture but it seems really good