Follow the Data

A data driven blog


Link roundup

Gearing up for Christmas, so no proper write-ups for these (interesting) links.

Personalized medicine is about data, not (just) drugs. Written by Thomas Goetz of The Decision Tree for Huffington Post. The Decision Tree also has a nice post about why self-tracking isn’t just for geeks.

A Billion Little Experiments (PDF link). An eloquent essay/report about “good” and “bad” patients and doctors, compliance, and access to your own health data.

Latent Semantic Indexing worked well for Netflix, but not for dating. MIT Technology Review writes about how the algorithms used to match people at dating sites (based on latent semantic indexing / SVD) are close to worthless. A bit lightweight, but a fun read.

A podcast about data mining in the mobile world. Featuring Deborah Estrin and Tom Mitchell. Mitchell recently wrote an article in Science about how data mining is changing: Mining Our Reality (subscription needed). The take-home message (or one of them) is that data mining is becoming much more real-time oriented. Data are increasingly being analyzed on the fly and used to make quick decisions.

How Zeo, the sleep optimizer, actually works. I mentioned Zeo in a blog post in August.

Netflix prize soon to be awarded

A few years back, the online movie rental site Netflix promised a million dollars to anyone who could improve its recommendation algorithm by 10%. In the same way that Amazon recommends books you might be interested in, Netflix recommends movies to users.

The Netflix Prize, as the challenge was called, turned out to be more important for data mining and related fields than I, at least, had anticipated. Many teams from all over the world put in enormous efforts to shave off the last percentage points needed to cross the finish line. At times, it seemed the teams might be doomed to approach the cut-off point asymptotically, never actually crossing it.

Last week, though, a team called BellKor did achieve the target percentage. That does not yet make them the winners: other teams still have the chance to submit a better model before July 26. Still, they are likely to win. At any rate, someone will have won, which is significant.

Although I have not followed the competition in much depth, I believe some of the lessons learned were about the surprising power of using multiple – and I mean many – predictive models. Some of the best predictors were linear combinations of over a hundred different kinds of models.
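To make the blending idea concrete, here is a minimal sketch of combining predictors with a least-squares linear blend. The numbers are made up and there are only three toy "models" rather than the hundred-plus used in the competition; this is an illustration of the principle, not anyone's actual prize entry.

```python
import numpy as np

# Hypothetical held-out ratings and predictions from three toy models
# (the actual competition blends combined over a hundred models).
y = np.array([4.0, 3.0, 5.0, 2.0, 4.0])           # true ratings
preds = np.column_stack([
    np.array([3.8, 3.2, 4.6, 2.5, 3.9]),          # e.g. a latent-factor model
    np.array([4.2, 2.7, 4.9, 2.2, 4.1]),          # e.g. a neighbourhood model
    np.array([3.5, 3.4, 4.5, 2.8, 3.6]),          # e.g. a simple baseline
])

# Least-squares fit of blending weights: minimise ||preds @ w - y||.
w, *_ = np.linalg.lstsq(preds, y, rcond=None)
blend = preds @ w

rmse = lambda p: np.sqrt(np.mean((p - y) ** 2))
print("individual RMSEs:", [round(rmse(preds[:, i]), 3) for i in range(3)])
print("blended RMSE:", round(rmse(blend), 3))
```

On the data it was fit on, the blend can never do worse than the best single model, since giving one model weight 1 and the others weight 0 is itself a linear combination.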

In particular, it turned out that methods based on “latent factors” in the data – regularities that can be fished out using so-called matrix factorization methods such as SVD or NMF – were very powerful tools for this application. They were especially effective when combined with “neighbourhood-based” methods, which basically make predictions by assuming that similar users (judged by how they have previously rated films) will like similar films – or, conversely, that films rated similarly by different users will tend to be similar.
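As a rough sketch of both ideas on a made-up 4-user, 4-movie rating matrix: a truncated SVD gives a low-rank (“latent factor”) reconstruction of the ratings, and cosine similarity between movie columns is one simple neighbourhood-style measure. (The actual prize models factorized only the observed entries and were far more elaborate; this just shows the two ingredients.)

```python
import numpy as np

# Hypothetical user x movie rating matrix (rows: users, columns: movies).
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Latent factors via SVD: keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation

# Neighbourhood view: cosine similarity between movie columns.
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

print(np.round(R_hat, 2))   # smoothed ratings from 2 latent factors
print(np.round(sim, 2))     # movies 0 and 1 come out as close neighbours
```

The first two movies, which users rated similarly, end up with high mutual similarity, while the low-rank reconstruction smooths the ratings through the two latent factors.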

One of the members of the winning team BellKor, Yehuda Koren, recently published an interesting paper that outlines strategies for the hard problem of accounting for preferences that change over time. Elements of these are likely to have been included in the prediction model that crossed the 10% line.
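One simple way to see how time can enter such a model is a baseline predictor whose movie bias depends on when the rating was given. The following toy sketch (my own illustration with made-up data, not Koren’s actual model) fits a global mean plus user, movie, and time-binned movie biases by stochastic gradient descent:

```python
import numpy as np

# Made-up data: (user, movie, time_bin, rating). A time bin could be,
# say, the year in which the rating was given.
ratings = [
    (0, 0, 0, 4.0), (0, 1, 1, 3.0),
    (1, 0, 1, 5.0), (1, 1, 0, 2.0),
]
n_users, n_movies, n_bins = 2, 2, 2

mu = np.mean([r for *_, r in ratings])     # global mean rating
b_u = np.zeros(n_users)                    # user bias
b_i = np.zeros(n_movies)                   # static movie bias
b_it = np.zeros((n_movies, n_bins))        # time-dependent movie bias

# A few epochs of stochastic gradient descent on regularized squared error.
lr, reg = 0.05, 0.02
for _ in range(200):
    for u, i, t, r in ratings:
        e = r - (mu + b_u[u] + b_i[i] + b_it[i, t])
        b_u[u] += lr * (e - reg * b_u[u])
        b_i[i] += lr * (e - reg * b_i[i])
        b_it[i, t] += lr * (e - reg * b_it[i, t])
```

Allowing `b_it` to vary by time bin is what lets the model absorb effects like a movie drifting up or down in popularity, instead of forcing one static bias per movie.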

There is a lot to chew on in the paper for the more technically inclined, but I would just like to mention a simple and interesting trend Koren found in the Netflix data: older movies tend to get higher ratings than newer ones.

Koren sets up two hypotheses that could explain this phenomenon:

either, customers will more readily choose to rent a new movie simply for its novelty value, while they would only choose an older movie after a careful selection process, which leads to a greater likelihood of enjoying the movie,

or, old movies are just better!

I can think of a couple of other possible explanations as well – nostalgia for movies seen a long time ago, for instance – but Koren goes on to compare these two hypotheses using the statistics in the Netflix database. By interpreting the parameters of the statistical model he has set up, he concludes that the first explanation (a more careful selection process for old movies) is the more likely one.
