Some links I have come across during the past weeks (or months):
Data is snake oil. Cautions against blindly jumping on the big data bandwagon without thinking through what you actually need.
Mining of Massive Datasets (pdf link). I just love it when you can download high-quality books legally and for free. (See also: The Elements of Statistical Learning.) I have only had time to read the first few chapters of MoMD, but I like it a lot so far. Whereas TEoSL is a superb reference for the statistical and machine-learning aspects of data crunching (explaining all the hippest algorithms while also providing a solid understanding of “basics” like how K-nearest neighbour and linear regression *really* work), MoMD is decidedly more on the applied/techy side of data mining and even states in the introduction that the book is not about models that learn from data. Instead, it offers an approachable but solid introduction to the MapReduce paradigm and how it is used to implement some common algorithms. I also looked briefly at chapter 4 on stream mining, which seems interesting, especially given that this topic is not very widely covered in the literature. The remainder of the book deals with topics like recommendation systems, frequent itemset mining and link analysis.
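For readers who have not met MapReduce before, here is a minimal single-machine sketch of the idea in Python, using the classic word-count example. It is not taken from the book: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are illustrative assumptions, and a real framework would spread these steps across many machines rather than run them in one process.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up sample input, purely for illustration.
docs = ["big data is big", "data is snake oil"]
counts = word_count(docs)
print(counts)
```

The point of the split is that each call to `map_phase` and each call to `reduce_phase` is independent of the others, which is what lets a cluster run them in parallel.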
The Heritage Health Prize looks like an interesting challenge in the same vein as the Netflix Prize and several prediction contests subsequently arranged by Kaggle. The prize sum is 3 million USD (!) and the task is to:
develop a breakthrough algorithm that uses available patient data, including health records and claims data, to predict and prevent unnecessary hospitalizations. […] The winning Team will create a predictive algorithm that can identify patients who are at risk for hospital admissions. Once known, health care providers can develop new care plans and strategies to reach patients before emergencies occur, thereby reducing the number of unnecessary hospitalizations.
The competition will run for two years! It will be really interesting to see what comes out of this. Without really knowing anything about the problem at hand, I think this looks like a typical real-world problem with all of the difficulties that entails for data analysis: inconsistent data formats, incomplete data, subjective judgments that have been fed into the data, etc. Of course, these are exactly the kinds of problems that we need to learn how to solve. Perhaps Heritage is right when it
[…] believes that incentivized competition – one that includes the involvement of those with passionate minds that don’t know what can’t be done – is the best way to achieve the radical breakthroughs and innovations necessary to reform our health care system.
Finally, on the off chance that anyone reading this hasn’t started following them yet, do check out the excellent Strata Week and Strata Gems updates that O’Reilly are posting in anticipation of their Strata conference (“Making Data Work”). In fact, those jam-packed blog posts were part of the reason I haven’t updated the blog in a while, because all of the things I wanted to cover appeared in Strata blog updates before I had the chance to write them up myself. But I love them anyway.