Web search based prediction works well for a first-pass analysis
A simple but interesting study, Predicting consumer behavior with Web search, was just published in PNAS. Inspired by Google Flu Trends and other ways of “predicting the present” by tracking web searches in (almost) real time, the authors compare these methods to “baseline” predictors that use other available sources of information. The results indicate that the search-based methods aren’t necessarily better than the baseline, and are sometimes clearly worse, but the less prior information there is, the better the search-based method does compared to the baseline predictor. For example, the revenues of video game sequels are well predicted by a baseline model that looks at, among other things, the revenue of the predecessor. The revenues of non-sequel games, on the other hand, are hard to predict with the baseline model, whereas the search-based prediction works well.
Combining the baseline and the search-based predictors typically results in a modest increase in accuracy over the better of the two. This suggests that search-based prediction is pretty robust, in the sense that it can be used in the absence of other relevant information and still give reasonable results. It may therefore be useful in various kinds of first-pass analysis, before building a more accurate predictor based on many different information sources.
Another advantage of search-based methods, which the authors don’t really go into, is the detection of turning points, like when a trend starts to take off. For example, in their flu prediction examples, an auto-regressive model (a model that uses a weighted average of the last few time points) tracks the actual flu outbreaks almost as well as the search-based model. However, looking closely at the plots, the auto-regressive model always lags a bit behind the search-based predictor. This makes sense, as it always bases its prediction on the previous time points, and so takes some time to catch on to the fact that the situation has changed radically.
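
To make the lag concrete, here is a minimal sketch in Python. It is not the model from the paper: it takes the simplified description above literally (a forecast is a weighted average of the last few time points), and both the series and the weights are made up for illustration rather than fitted to any data.

```python
def weighted_average_forecast(series, weights):
    """One-step-ahead forecasts from a weighted average of past values.

    weights[0] applies to the most recent observation, weights[1] to the
    one before it, and so on. The weights are normalised by their sum.
    """
    k = len(weights)
    total = sum(weights)
    forecasts = []
    for t in range(k, len(series)):
        recent = series[t - k:t][::-1]  # previous k values, most recent first
        forecasts.append(sum(w * x for w, x in zip(weights, recent)) / total)
    return forecasts


# Made-up series: a flat baseline followed by a sudden outbreak-like rise.
actual = [10, 10, 10, 10, 10, 20, 40, 80, 120, 140]
weights = [0.5, 0.3, 0.2]  # hypothetical weights, chosen by hand

forecasts = weighted_average_forecast(actual, weights)
for t, f in enumerate(forecasts, start=len(weights)):
    gap = actual[t] - f
    print(f"t={t}  actual={actual[t]:>3}  forecast={f:6.1f}  gap={gap:6.1f}")
```

While the series is flat the forecast matches it, but from the moment the rise starts every forecast falls short of the actual value, because it is built only from earlier, lower observations. A predictor driven by a signal that moves at the same time as the outbreak, such as search volume, does not have this built-in delay.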
Edit 30/9 2010
I just realized this is the first major publication I’ve seen whose Methods section states that the authors used Hadoop and Pig to analyze their data. Yet Benjamin Black tweets that Hadoop is already legacy. Things move fast.