Follow the Data

A data driven blog

Archive for the tag “crowdsourcing”

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests let researchers escape from the “self-assessment trap”, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants not only to upload predictions whose results are shown on a dynamic leaderboard (which they do), but also to provide runnable code that is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R), which are called by the server software. Many groups do not like to use R for their model development, but there are ways to easily wrap arbitrary types of code inside R.
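
To make the two-method contract concrete, here is a minimal sketch of the idea in Python rather than R. Everything in it is hypothetical: the names custom_train, custom_predict, MeanOutcomeModel and run_submission are my own stand-ins and not part of Synapse, and the "server" is just a function that trains on data it holds and then asks for predictions.

```python
from statistics import mean


class MeanOutcomeModel:
    """Toy submission: predicts the mean training outcome for every patient."""

    def custom_train(self, features, outcomes):
        # Hypothetical analogue of customTrain(): learn from the training data.
        self.baseline = mean(outcomes)
        return self

    def custom_predict(self, features):
        # Hypothetical analogue of customPredict(): one prediction per patient.
        return [self.baseline for _ in features]


def run_submission(model, train_x, train_y, test_x):
    """What a scoring server might do: call train, then predict on held-out data."""
    model.custom_train(train_x, train_y)
    return model.custom_predict(test_x)


predictions = run_submission(
    MeanOutcomeModel(),
    train_x=[[1.0], [2.0], [3.0]],
    train_y=[10.0, 20.0, 30.0],
    test_x=[[4.0], [5.0]],
)
print(predictions)
```

The point of such a contract is that the organizers, not the competitors, decide which data the code sees, so the predictions on the leaderboard are reproducible by construction.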

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e.g. Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found, and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.
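
For readers who want a feel for what "iterating until a metagene converges" can look like, here is a rough Python sketch. It is not the team's code. I am assuming Pearson correlation as the association measure and an exponent that sharpens the weights; both are my stand-ins for illustration, and the function names are hypothetical. The arXiv paper is the place to look for the real definition.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / norm if norm else 0.0


def attractor_metagene(expression, seed, exponent=5.0, tol=1e-9, max_iter=100):
    """Iterate from a seed gene towards a weighted combination of genes.

    expression maps gene name -> expression values across the same samples.
    Assumption: weight = max(correlation with current metagene, 0) ** exponent.
    """
    n_samples = len(expression[seed])
    metagene = list(expression[seed])
    weights = {}
    for _ in range(max_iter):
        new_weights = {
            gene: max(pearson(values, metagene), 0.0) ** exponent
            for gene, values in expression.items()
        }
        total = sum(new_weights.values())
        if total == 0:
            break  # nothing correlates positively; keep the previous state
        # The metagene is the weighted average of all genes, sample by sample.
        metagene = [
            sum(new_weights[g] * expression[g][s] for g in expression) / total
            for s in range(n_samples)
        ]
        converged = bool(weights) and max(
            abs(new_weights[g] - weights[g]) for g in expression
        ) < tol
        weights = new_weights
        if converged:
            break
    return weights, metagene


expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 5.0, 8.0, 10.0],   # co-expressed with g1
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],    # unrelated to g1 and g2
}
weights, metagene = attractor_metagene(expression, seed="g1")
print(weights)
```

In this toy data the two co-expressed genes end up with large weights while the unrelated gene drops to zero, which is the behaviour the "attractor" name hints at.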

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.
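
The two simplest ways of ensembling mentioned above, averaging and majority voting, fit in a few lines. This is a generic Python sketch with made-up numbers and hypothetical function names; it says nothing about how team ENSEMBLE actually combines the leaderboard models.

```python
from collections import Counter
from statistics import mean


def average_ensemble(predictions_per_model):
    """Average numeric predictions across models, sample by sample."""
    return [mean(sample) for sample in zip(*predictions_per_model)]


def majority_vote(labels_per_model):
    """Pick the most common label per sample (ties go to the label seen first)."""
    return [
        Counter(sample).most_common(1)[0][0]
        for sample in zip(*labels_per_model)
    ]


# Three models, each predicting a risk score for the same two patients.
risk_scores = [
    [0.25, 1.0],
    [0.5, 0.5],
    [0.75, 0.0],
]
averaged = average_ensemble(risk_scores)

# Three models, each predicting a risk class for the same three patients.
risk_classes = [
    ["high", "low", "low"],
    ["high", "high", "low"],
    ["low", "high", "low"],
]
voted = majority_vote(risk_classes)
print(averaged, voted)
```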

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective than peer review. Kudos to Science Translational Medicine for that.

There are many more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers) and how they looked at millions of models to find general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering are essential for good performance, etc.).

Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!


Three angles on crowd science

Three recently announced news items illuminate crowd science (advancing science by somehow leveraging a community) from three different angles.

  • The Harvard Clinical and Translational Science Center (or Harvard Catalyst) has “launched a pilot service through which researchers at the university can submit computational problems in areas such as genomics, proteomics, radiology, pathology, and epidemiology” via the TopCoder online competitive community for software development and digital creation. One recently started Harvard Catalyst challenge is called FitnessEstimator. The aim of the project is to “use next-generation sequencing data to determine the abundance of specific DNA sequences at multiple time points in order to determine the fitness of specific sequences in the presence of selective pressure. As an example, the project abstract notes that such an approach might be used to measure how certain bacterial sequences become enriched or depleted in the presence of antibiotics.” (the quotes are from a GenomeWeb article that is behind a paywall) I think it’s very interesting to use online software development contests for scientific purposes, as a very useful complement to Kaggle competitions, where the focus is more on data analysis. Sometimes, really good code is important too!
  • This press release describes the idea of connectomics (which is very big in neuroscience circles now) and how the connectomics researcher Sebastian Seung and colleagues have developed a new online game, EyeWire, where players trace neural branches “through images of mouse brain scans by playing a simple online game, helping the computer to color a neuron as if the images were part of a three-dimensional coloring book.” The images are actual data from the lab of professor Winfried Denk. “Humans collectively spend 600 years each day playing Angry Birds. We harness this love of gaming for connectome analysis,” says Prof. Seung in the press release. (For similar online games that benefit research, see e.g. Phylo, FoldIt and EteRNA.)
  • Wisdom of Crowds for Robust Gene Network Inference is a newly published paper in Nature Methods, where the authors looked at a kind of community ensemble prediction method. Let’s back-track a bit. The Dialogue on Reverse Engineering Assessment and Methods (DREAM) initiative is a yearly challenge where contestants try to reverse engineer various kinds of biological networks and/or predict the output of some or all nodes in the network under various conditions. (If it sounds too abstract, go to the link above and check out what the actual challenges have been like.) The DREAM initiative is a nice way to check the performance of the currently touted methods in an unbiased way. In the Nature Methods paper, the authors show that “no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets” and that “Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.” So, in a very wisdom-of-crowds manner (as indeed the paper title suggests), it’s better to combine the predictions of all the contestants than to just use the best ones. It’s like taking a composite prediction of all Kaggle competitors in a certain contest and observing that this composite prediction was superior to all individual teams’ predictions. I’m sure Kaggle has already done this kind of experiment; does anyone know?
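
One simple way to build such a composite prediction for network inference is to let every method rank the candidate edges and then re-rank the edges by their average rank. The Python sketch below shows that idea only; whether it matches the exact integration procedure in the Nature Methods paper is an assumption on my part, and the edge names and function name are invented.

```python
def average_rank(rankings):
    """Combine several rankings of the same items (best first) by mean rank.

    Items with equal mean rank keep the order of the first ranking.
    """
    items = rankings[0]
    mean_rank = {
        item: sum(ranking.index(item) for ranking in rankings) / len(rankings)
        for item in items
    }
    return sorted(items, key=lambda item: mean_rank[item])


# Three hypothetical inference methods, each ordering the same candidate edges
# from most to least confident.
method_rankings = [
    ["A->B", "B->C", "A->C"],
    ["B->C", "A->B", "A->C"],
    ["A->B", "A->C", "B->C"],
]
community = average_rank(method_rankings)
print(community)
```

No single method needs to be right about every edge; an edge that most methods place near the top ends up near the top of the community ranking.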

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out through a test from 23andme, the company his wife founded) while simultaneously paving the way for a more rapid way to do science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing what is more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

Data mine your way to $100,000

The crowdsourcing company Innocentive, which serves up tough (usually scientific or technical) problems for anyone to solve against a monetary reward, has put up a predictive analytics challenge for which the reward is a whopping USD 100,000. You’ll have to register to find out what the challenge is about and in order to download the data, but I don’t think I’m giving away too much by saying that it’s a life science/bioinformatics challenge, albeit one which could be solved without much knowledge of biology. Innocentive has apparently implemented a new system for testing your predictive models against a reference data set, kind of like Netflix, and they also have a leaderboard showing the currently best models (measured by R^2 of predictions vs. test set using Spearman’s rank correlation). Of course, the final submission from each contestant will be scored on a completely separate test set.
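
My reading of that leaderboard metric is "square Spearman’s rank correlation between predictions and the true test values". I do not know how Innocentive actually computes it, so the Python sketch below is just that reading spelled out, with hypothetical function names and made-up numbers.

```python
from statistics import mean


def ranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_r2(predicted, observed):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    rho = pearson(ranks(predicted), ranks(observed))
    return rho ** 2


score = spearman_r2([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0])
print(round(score, 2))
```

Because only the ordering matters, a model scores well on this metric if it sorts the test cases correctly, even when its predicted values are on the wrong scale.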

Crowdsourcing adverse drug event prediction algorithms

There’s an interesting competition, Observational Medical Outcomes Partnership Cup (OMOP Cup), going on until March this year (so unfortunately a bit late for laggards like me to participate). The background is that a lot of data on adverse drug events has recently become available, but much of this data is in free text and unstandardized formats. The development of algorithms for identifying patterns in adverse drug events, and by extension for predicting new events, has therefore been lagging. A good way to find and predict adverse drug events could save a lot of lives worldwide. The OMOP has therefore constructed a “simulated” data set which resembles the kind of information you get from insurance claims and medical records. There are two algorithmic tasks, the first of which resembles classical data mining problems where you get an entire data set which you try to characterize as well as possible, and the second of which is more in a stream mining style where your algorithm is continuously evaluated by running it against observations that become sequentially available over time. The prizes are USD 10,000 for the first task and USD 5,000 for the second.
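
The stream mining style of evaluation in the second task is often done as "test, then train": each new observation is first used to score the model and only afterwards to update it. Here is a small Python sketch of that generic pattern. The toy model, the labels and the function names are invented for illustration, and I do not know the details of how the OMOP Cup actually scores submissions.

```python
from collections import Counter


class MajorityClassModel:
    """Toy streaming model: always predicts the most common label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, features):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def update(self, features, label):
        self.counts[label] += 1


def prequential_accuracy(model, stream):
    """Score each observation before the model is allowed to learn from it.

    Returns the running accuracy after every observation.
    """
    correct = 0
    history = []
    for features, label in stream:
        if model.predict(features) == label:
            correct += 1
        model.update(features, label)
        history.append(correct / (len(history) + 1))
    return history


# Made-up stream of (features, label) pairs arriving one at a time.
stream = [
    ({"drug": "X"}, "no_event"),
    ({"drug": "X"}, "no_event"),
    ({"drug": "Y"}, "event"),
    ({"drug": "X"}, "no_event"),
]
history = prequential_accuracy(MajorityClassModel(), stream)
print(history)
```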

The OMOP Cup appears to be hosted by Orwik, a company which I was very vaguely aware of but hadn’t really looked at. Its product appears to be a data management solution for scientists, either individual researchers or groups of collaborators who want to keep their data available (for example, after key people have left) and perhaps sharable (for multi-group collaborations), all (I assume) with the aim of supporting well-documented and reproducible research.

Food and politics

This is a couple of months old and a bit silly, but worth a mention, I think. The collaborative decision-making site, which wants to take the pain out of making decisions by letting you ask a stranger (actually an aggregate of a whole lot of them) for advice, has published a report on correlations between political persuasion and food preferences.

Some background on the site: you “teach” it about your personality by answering questions, so the resulting advice will be influenced by the choices of people who have personality profiles similar to yours. When you are actually about to make a decision, the system asks you more and more questions related to the specific choice you are facing, and weights its advice accordingly. You can also give feedback on the final recommendation and hopefully get even better advice in the future.

Of course, the site collects a lot of information on different kinds of preferences as a “side effect” of all of this advice-giving. This info can be mined in order to discover surprising – or sometimes not so surprising – correlations. In the food/politics report mentioned above, self-declared liberals and conservatives were compared with respect to their favorite foods, cooking skills and so on. Some of the results:

– If you have both liberals and conservatives at your dinner table, it’s safest to serve hot dogs or double cheeseburgers, as both groups like these. If you serve margaritas, do so with salt on the glass.

– Liberals like international food like Thai and Indian, while conservatives prefer things like pizza and Mac & Cheese.

– Conservatives have apple corers and know how to use them, but liberals don’t even know what they are.

Another report reveals that people who self-identify as Mac people like Andy Warhol while PC people don’t; that Mac users prefer Vespas but PC users prefer Harleys; and that Mac people like The Office more.

I guess all of these results sort of confirm stereotypes we already have, but it’s still good fun to read through these reports, which are periodically announced on the site’s blog. Who knows, perhaps one day a *useful* correlation pops out …

Crowdsourcing dinosaur science

The recently initiated Open Dinosaur project is an excellent example of crowdsourcing in science. The people behind the project are enlisting volunteers to find skeletal measurements from dinosaurs in published articles and submit them into a common database. Or as they put it, “Essentially, we aim to construct a giant spreadsheet with as many measurements for ornithischian dinosaur limb bones as possible.” All contributors (anyone can participate) get to be co-authors on the paper that will be submitted at the end of the project.

One good thing about the project is that its originators have obviously taken pains to help the participants get going. They’ve put up comprehensive tutorials about dinosaur bone structure (!) and about how to locate relevant references and find the correct information in them.

As of yesterday, they had over 300 verified entries, after just ten days. It will be interesting to see other similar efforts in the future.
