Follow the Data

A data driven blog

Archive for the tag “competitions”

Lessons from medical predictive analytics competititions

Presentations from the organizers and best performing teams of the Systems Biology Improver Challenge, which I have covered before, have been put online. There is a ton of interesting stuff here for people like me who are interested in prediction contests, machine learning, and/or medical bioinformatics.

To rewind a little bit, the SBV Improver Challenge was a set of competitions somewhat similar to Kaggle’s competitions (indeed, there is a presentation by Will Cukierski of Kaggle among these presentations) in that participants needed to build a predictive model for classifying diseased versus healthy samples for four diseases based on gene expression data (simplifying a little bit), and then predict the disease status for an unlabeled test set. The competitions differed from Kaggle competitions in that the training data sets were not fixed – the organizers just pointed to some suggested public data sets, which I think was a nice way to do it.

The point of these sub-competitions was to establish whether gene expression profiles are truly predictive of disease states in general or just in specific cases.

Anyway – I enjoyed browsing these presentations a lot. Some take-home messages:

  • Different diseases vary immensely in how much predictive signal the gene expression data contains. Psoriasis was much easier to classify correctly based on these data than the other diseases: multiple sclerosis, COPD and lung cancer. The nature of the disease was an incomparable more important variable than anything related to normalization, algorithms, pre-processing etc.
  • It was suggested that providing the whole test set at once may be a bad idea, because it may reveal information (such as cluster structure) that would not be known in a real-life scenario (for instance, if you were to go to the doctor to get a gene expression measurement from your own tissue.) It was suggested that next time, the platform would just provide one data point at a time to be predicted. Indeed, I have thought a lot about this “single-sample classification” problem in connection with gene expression data lately. As anyone who has worked with (microarray or RNA-seq) gene expression data knows, there are severe experimental batch effects in these data that are usually removed by normalizing all data points together, but this cannot always be done in practice.
  • In these competitions, none of the alleged superiority of random forest classifiers or GLMs (cf. http://strataconf.com/strata2012/public/schedule/detail/22658) was in evidence, with “simple methods” such as linear discriminant analysis and Mann-Whitney tests for feature selection performing the best. Yes, I know there is probably no statistical significance here due to the small sample size …
  • Speaking of random forests and GLMs, I was surprised to learn about the RGLM (Random Generalized Linear Model), which is kind of a mix of the two; generalized linear models are built on random subsets of the training examples and the features (like in random forest classifiers) and predictions of many of these models are aggregated to get a final prediction. The presentation is here.

The “lessons learned” presentation discusses (most of) these points and more, and is interesting throughout.

 

 

Advertisements

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests lets researchers escape from the “self-assessment trap“, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants to not only upload predictions for which the results is shown on a dynamic leaderboard (which they do), but also to force them to provide runnable code which is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R) which are called by the server software. Many groups do not like to use R for their model development but there are ways to easily wrap arbitrary types of code inside R.

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e g Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective that peer review. Kudos to Science Translational Medicine for that.

There’s a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), how they looked at millions of models to find out general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering is essential for good performance, etc.)
Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

IMPROVER, a disease-related predictive analytics contest

As I have said before, I think scientific prediction competitions (a form of crowdsourced research) are an interesting way to attack problems in science. The recently launched IMPROVER Systems Biology Verification is such a competition, and it’s especially nice in that it asks a very general question: Is it possible to extract reliable gene expression signatures for common diseases? The diseases selected for this challenge are psoriasis, multiple sclerosis, chronic obstructive pulmonary disease (COPD), and lung cancer, and contestants are allowed to use any public data to construct their predictors. We often read scientific publications with supposed gene expression signatures for various diseases, but a competition framework will better allow us to assess how sensitive and specific those signatures really are.

I see a few problems with the competition (although I should stress that I think it’s a very good initiative – we should have more of these!): (1) the competitors are obliged to submit entries for all four diseases (actually five classifiers are required as the MS challenge is divided into two parts) to be eligible for the prize, which is very tough to manage as each problem is likely to be extremely difficult and the deadline is May 30, 2012 (of course, it may be possible to run the same model on all diseases, but somehow I doubt that will be very successful); (2) I suspect that the open-ended approach allowing all public data to be used will lead to less successful models than in the typically tightly-defined Kaggle competitions; (3) there is too little time to disseminate information about the competition so that people have time to build something that works before 30/5. I am hoping to be wrong about point (2); it would be great if this competition could lead to some insights about how to best leverage diverse data from places like the Gene Expression Omnibus and ArrayExpress.

In view of my points (1)-(3), I predict that not many teams will submit predictions, which of course implies that it would be a good idea for anyone who reads this to participate – you will have a shot at the $50,000 prize (which by the way has to be used for research.)

New analysis competitions

Some interesting competitions in data analysis / prediction:

Kaggle is managing this year’s KDD Cup, which will be about Weibo, China’s rough equivalent to Twitter (with more support for adding pictures and comments on posts, it’s more like a hybrid between Twitter and Facebook maybe). There will be two tasks, (1) predicting which users a certain user will follow (all data being anonymized, of course), and (2) predicting click-through rate in online computational ad systems. According to Gordon Sun, chief scientist at Tencent (the company behind Weibo), the data set to be used is the largest one ever to have been released for competitive purposes.

CrowdAnalytix, an India-based company with a business idea similar to Kaggle’s, has started a fun quickie competition about sentiment mining. Actually the competition might already be over as it ran for just 9 days starting 16/2. The input consists of comments left by visitors to a major airport in India, and the goal is to identify and compile actionable and/or interesting information, such as what kind of services visitors think are missing.

The Clarity challenge is, for me, easily the most interesting challenge of the three, in that it concerns the use of genomic information in healthcare. This challenge (with a prize sum of $25,000) is, in effect, crowdsourcing genomic/medical research (although only 20 teams will get selected to participate). The goal is to identify and report on potential genetic features underlying medical disorders in three children, given the genome sequences of the children and their parents. These genetics features are presently unknown, which is why this competition really represents something new in medical research. I think this is a very nice initiative, in fact I had thought of initiating something similar at my own institute where I work, but this challenge is much better than what I had in mind. It will be very interesting to see what comes out of it.

Machine learning competitions and algorithm comparisons

Tomorrow, 29 May 2010, a lot of (European) people will be watching the Eurovision Song Contest to see which country will take home the prize. Personally, I don’t really care about who wins the contest itself, but I do care (somewhat) about which predictor will win the Eurovision Voting Forecast competition arranged by kaggle.com. Kaggle describes itself as “a platform for data mining, bioinformatics and forecasting competitions“. It provides an objective framework for comparing techniques and “allows organizations to have their data scrutinized by the world’s best statisticians.”

Contests like this are fun, but they can also have more serious aims. For instance, Kaggle also hosts a competition about predicting HIV progression based on the virus’ DNA sequence. The currently leading submission has already improved on the best methods reported in literature, and so a post at Kaggle’s No Free Hunch blog asks whether competitions might be the future of research. I think they may well be, at least in some domains. A few months back, I mentioned an interesting challenge at Innocentive which is essentially a very difficult pure research problem, and it will be interesting to learn how the winning team there did it (if any details are disclosed). (I signed up for this competition myself, but haven’t been able to devote more than one or two hours to it so far, unfortunately.)

There are other platforms for prediction competitions as well, for instance TunedITs challenge platform, which allows university teachers to “make their courses more attractive and valuable, through organization of on-line student competitions instead of traditional assignments.” TunedIT also has a research platform where you can run automated tests on machine learning algorithms and get reproducible experimental results. You can also benchmark results against a knowledge base or contribute to and use a repository of various data sets and algorithms.

Another initiative for serious evaluation of machine learning algorithms in various problem domains is MLcomp. Here, you can upload your own datasets (or use pre-loaded ones) and run existing algorithms on them through a web interface. MLcomp then reports various metrics that allow you to compare different methods.

By the way, 22 teams participated in Kaggle’s Eurovision challenge, and Azerbaijan is the clear favorite, having been picked as the winner by 14 teams. Let’s see how it goes tomorrow.

Post Navigation