Follow the Data

A data driven blog

Archive for the tag “competition”

Large-scale machine learning course & streaming organism classification

The NYU Large Scale Machine Learning course looks like it will be very worthwhile to follow. The instructors, John Langford and Yann Le Cun, are both key figures in the machine learning field – for instance, the former developed Vowpal Wabbit and the latter has done pioneering work in deep learning. It is not an online course like those at Coursera et al., but they have promised to put lecture videos and slides online. I’ll certainly try to follow along as best I can.


There is an interesting Innocentive challenge going on, called “Identify Organisms from A Stream of DNA Sequences.” This is interesting to me both because of the subject matter (classification based on DNA sequences) and also because the winner is explicitly required to submit an efficient, scalable solution (not just a good classifier.) Also, the prize sum is one million US dollars! It’s exactly this kind of algorithms that will be needed to enable the “genomic observatories” that I have mentioned before on this blog which will continuously stream out sequences obtained from the environment.

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests lets researchers escape from the “self-assessment trap“, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants to not only upload predictions for which the results is shown on a dynamic leaderboard (which they do), but also to force them to provide runnable code which is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R) which are called by the server software. Many groups do not like to use R for their model development but there are ways to easily wrap arbitrary types of code inside R.

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e g Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective that peer review. Kudos to Science Translational Medicine for that.

There’s a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), how they looked at millions of models to find out general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering is essential for good performance, etc.)
Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Crowdsourcing adverse drug event prediction algorithms

There’s an interesting competition, Observational Medical Outcomes Partnership Cup (OMOP Cup), going on until March this year (so unfortunately a bit late for laggards like me to participate). The background is that a lot of data on adverse drug events has recently become available, but much of this data is in free text and unstandardized formats. The development of algorithms for identifying patterns in adverse drug events, an by extension for predicting new events, has therefore been lagging. A good way to find and predict adverse drug events could save a lot of lives worldwide. The OMOP has therefore constructed a ”simulated” data set which resembles the kind of information you get from insurance claims and medical records. There are two algorithmic tasks, the first of which resembles classical data mining problems where you get an entire data set which you try to characterize as well as possible, and the second of which is more in a stream mining style where your algorithm is continuously evaluated by running it against observations that become sequentially available over time. The prizes are USD10.000 for the first task and USD5.000 for the second.

The OMOP Cup appears to be hosted by Orwik, a company which I was very vaguely aware of but hadn’t really looked at. Its product appears to be a data management solution for scientists, either individual researchers or groups of collaborators who want to keep their data available (for example, after key people have left) and perhaps sharable (for multi-group collaborations), all (I assume) with the aim of supporting well-documented and reproducible research.

Post Navigation