Follow the Data

A data driven blog

Archive for the month “April, 2012”

IMPROVER, a disease-related predictive analytics contest

As I have said before, I think scientific prediction competitions (a form of crowdsourced research) are an interesting way to attack problems in science. The recently launched IMPROVER Systems Biology Verification is such a competition, and it’s especially nice in that it asks a very general question: Is it possible to extract reliable gene expression signatures for common diseases? The diseases selected for this challenge are psoriasis, multiple sclerosis, chronic obstructive pulmonary disease (COPD), and lung cancer, and contestants are allowed to use any public data to construct their predictors. We often read scientific publications with supposed gene expression signatures for various diseases, but a competition framework will better allow us to assess how sensitive and specific those signatures really are.
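To make "sensitive and specific" concrete, here is a minimal sketch of how a submitted signature could be scored against held-out samples. The function name and the toy labels are my own illustration (1 = diseased, 0 = healthy), not the contest's actual scoring code, which may well use different metrics.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = diseased, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diseased, correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diseased, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, wrongly flagged
    # Sensitivity: fraction of diseased samples the signature detects.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of healthy samples the signature leaves alone.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up example: 4 diseased and 6 healthy samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A signature that looks good in the paper that introduced it can do poorly on both numbers when scored on samples it has never seen, which is the kind of thing a competition setting exposes.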

I see a few problems with the competition (although I should stress that I think it’s a very good initiative, and we should have more of these!): (1) competitors must submit entries for all four diseases to be eligible for the prize (actually five classifiers are required, as the MS challenge is divided into two parts), which is very tough to manage as each problem is likely to be extremely difficult and the deadline is May 30, 2012 (of course, it may be possible to run the same model on all diseases, but somehow I doubt that will be very successful); (2) I suspect that the open-ended approach of allowing all public data to be used will lead to less successful models than in the typically tightly defined Kaggle competitions; (3) there is too little time for word of the competition to spread and still leave people enough time to build something that works before May 30. I am hoping to be wrong about point (2); it would be great if this competition could lead to some insights about how best to leverage diverse data from places like the Gene Expression Omnibus and ArrayExpress.

In view of my points (1)-(3), I predict that not many teams will submit predictions, which of course implies that it would be a good idea for anyone who reads this to participate: you will have a shot at the $50,000 prize (which, by the way, has to be used for research).

Machine learning

While preparing for our next podcast recording, I came across some interesting recent machine learning developments.

The Protocols and Structures for Inference (PSI) project aims to develop an architecture for presenting machine learning algorithms, their inputs (data) and outputs (predictors) as resource-oriented RESTful web services in order to make machine learning technology accessible to a broader range of people than just machine learning researchers.

Why?

Currently, many machine learning implementations (e.g., in toolkits such as Weka, Orange, Elefant, Shogun, SciKit.Learn, etc.) are tied to specific choices of programming language, and data sets to particular formats (e.g., CSV, svmlight, ARFF). This limits their accessability [sic], since new users may have to learn a new programming language to run a learner or write a parser for a new data format, and their interoperability, requiring data format converters and multiple language platforms.

I think it seems promising. The specification is here.
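The data format friction mentioned in the quote is easy to illustrate. Below is a minimal sketch of the kind of glue code people end up writing: a converter from numeric CSV to the sparse svmlight text format (`<label> <index>:<value> ...`, with 1-based feature indices and zero values omitted). The function name and the assumption of a header-less CSV with the label in the first column are mine; this has nothing to do with the PSI specification itself.

```python
import csv
import io


def csv_to_svmlight(csv_text, label_column=0):
    """Convert header-less numeric CSV text to svmlight-formatted lines."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        label = row[label_column]
        # Every column except the label is treated as a feature.
        features = [float(v) for i, v in enumerate(row) if i != label_column]
        # svmlight is sparse: 1-based indices, zero-valued features left out.
        pairs = [f"{idx}:{val:g}" for idx, val in enumerate(features, start=1) if val != 0.0]
        lines.append(" ".join([label] + pairs))
    return "\n".join(lines)


# Made-up two-row example: label in the first column, three features.
example = "1,0.5,0,2.0\n-1,0,3,0\n"
print(csv_to_svmlight(example))
```

Every toolkit has its own version of this chore, which is the interoperability cost a common web-service interface would try to remove.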

  • BigML, which has been mentioned in passing on this blog, has now published some videos of what the interface actually looks like. It seems quite nice. While watching the videos, I was thinking “OK, this looks really nice, but does it have an API?” Luckily, it turns out that it does, which is good news for us geekier people who don’t just want to use the GUI.
  • Machine learning in Google Goggles. A video describing some real cutting-edge ML research behind Google Goggles, Google’s image-recognition app for mobile phones. Definitely worth checking out.
