Follow the Data

A data-driven blog


Lessons learned from mining heterogeneous cancer data sets

How much can we learn about cancer treatment and prevention by large-scale data collection and analysis?

An interesting paper was just published: “Assessing the clinical utility of cancer genomic and proteomic data across tumor types”. I am afraid the article is behind a paywall, but no worries – I will summarize the main points here! Basically, the authors have done a large-scale data mining study of data published within The Cancer Genome Atlas (TCGA) project, a very ambitious effort to collect molecular data on different kinds of tumors. The main question they ask is how much clinical utility these molecular data can add to conventional clinical information such as tumor stage, tumor grade, age and gender.

The lessons I drew from the paper are:

  • The molecular data do not add much predictive information beyond the clinical information. As the authors put it in the discussion, “This echoes the observation that the number of cancer prognostic molecular markers in clinical use is pitifully small, despite decades of protracted and tremendous efforts.” It is an unfortunate fact of life in cancer genomics that many molecular classifiers (usually based on gene expression patterns) have been proposed to predict tumor severity, patient survival and so on, but different groups keep coming up with different gene sets, and these tend not to be validated in independent cohorts.
  • When looking at which factors explain most of the variation in predictive performance, tumor type explains the most (37.4%), followed by the type of data used (that is, gene expression, protein expression, micro-RNA expression, DNA methylation or copy number variation), which explains 17.4%, with the interaction between tumor type and data type in third place (11.8%), suggesting that some data types are more informative for certain tumors than for others. The algorithm used is fairly unimportant (5.2%). At the risk of drawing unwarranted conclusions, it is tempting to generalize: the most important factor is the intrinsic difficulty of modeling the system, the next most important is the choice of what data to collect and how to engineer features, while the type of algorithm used to learn the model comes far behind.
  • Perhaps surprisingly, there was essentially no cross-tumor predictive power between models (with one exception). That is, a model built for one type of tumor was typically useless for predicting the prognosis of another tumor type.
  • Individual molecular features (expression levels of individual genes, for instance) did not add predictive power beyond what was already in the clinical information, but in some cases the molecular subtype did. The molecular subtype is a “molecular footprint” derived using consensus NMF (nonnegative matrix factorization, an unsupervised method that can be used for dimension reduction and clustering); a minimal sketch of the general idea follows this list. This footprint, which describes a general pattern, was informative, whereas the individual features making up the footprint were not. This seems consistent with the issue mentioned above about gene sets failing to consistently predict tumor severity: the predictive information lives at a higher level than the individual genes.
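
For readers who have not used consensus NMF, here is a minimal sketch of the general idea, not the authors’ actual pipeline, using the NMF package in R; the data matrix, rank and number of runs are invented for illustration.

```r
# Minimal consensus-NMF subtyping sketch (toy data, illustrative parameters only)
library(NMF)

set.seed(1)
# Toy non-negative expression matrix: 200 genes (rows) x 50 tumor samples (columns)
expr <- matrix(rexp(200 * 50), nrow = 200)

# Factorize with rank 3 and 30 random restarts; multiple runs allow a
# consensus clustering of the samples across restarts
fit <- nmf(expr, rank = 3, nrun = 30, seed = 123456)

# Cluster membership per sample -- the "molecular subtype"-like label
subtype <- predict(fit)

# The consensus matrix shows how often pairs of samples co-cluster across runs
consensusmap(fit)
```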

The authors argue that one reason for the failure of predictive modeling in cancer research has been that investigators have relied too much on p values to say something about the clinical utility of their markers, when they should instead have focused more on the effect size, or the magnitude of difference in patient outcomes.
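
To see why this matters, consider a toy illustration (a hedged sketch with invented numbers, not taken from the paper): in a large enough cohort, a clinically negligible difference in outcome can still yield a very small p value.

```r
# Invented numbers: a tiny effect becomes "statistically significant" in a
# large cohort even though the effect size is clinically negligible.
set.seed(42)
n <- 20000                                    # patients per group
marker_neg <- rnorm(n, mean = 60.0, sd = 15)  # e.g. months of survival
marker_pos <- rnorm(n, mean = 60.5, sd = 15)  # half a month longer on average

t.test(marker_pos, marker_neg)$p.value        # small p value: "significant"
mean(marker_pos) - mean(marker_neg)           # effect size: ~0.5 months, hardly useful
```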

They also make a good point about reliability and reproducibility. I quote: “The literature of tumor biomarkers is plagued by publication bias and selective and/or incomplete reporting”. To help combat these biases, the authors (many of whom are associated with Sage Bionetworks, which I have mentioned repeatedly on this blog) have made available an open model-assessment platform, including of course all the models from the paper itself, but which can also be used to assess your own favorite model.

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and it ties together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they have hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than on all types of disease as before; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests let researchers escape from the “self-assessment trap”, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants not only to upload predictions whose results are shown on a dynamic leaderboard (which they do), but also to provide runnable code that is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R), which are called by the server software. Many groups do not like to use R for their model development, but there are ways to easily wrap arbitrary types of code inside R.
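
To make this concrete, here is a rough sketch of what such a wrapper might look like; only the method names customTrain() and customPredict() come from the description above, while the class structure, arguments and the glmnet model inside are my own assumptions for illustration.

```r
# Hypothetical Synapse-style model wrapper -- the class structure and argument
# conventions are assumptions; only the two method names come from the challenge.
library(glmnet)

DemoModel <- setRefClass(
  "DemoModel",
  fields = list(fit = "ANY"),
  methods = list(
    customTrain = function(x, y) {
      # x: feature matrix (samples x features), y: outcome vector
      fit <<- cv.glmnet(as.matrix(x), y)
    },
    customPredict = function(x) {
      # One prediction per row of x, using the cross-validated penalty
      as.numeric(predict(fit, newx = as.matrix(x), s = "lambda.min"))
    }
  )
)

# Usage sketch with made-up data:
# m <- DemoModel$new()
# m$customTrain(train_x, train_y)
# preds <- m$customPredict(test_x)
```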

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e.g. Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess whether informative signatures could be found, and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) through an iterative procedure that converges to certain characteristic metagenes, hence the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound, and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.
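
My rough understanding of the iteration is sketched below; treat it as a caricature rather than the authors’ algorithm, since I use a correlation-based weight as a simple stand-in for the association measure in the paper, and the data, seed gene and exponent are invented.

```r
# Caricature of an "attractor metagene" iteration: start from a seed gene, then
# repeatedly (1) score every gene's association with the current metagene and
# (2) redefine the metagene as an association-weighted average of all genes.
set.seed(1)
expr <- matrix(rnorm(500 * 40), nrow = 500)   # toy data: 500 genes x 40 samples

attractor_metagene <- function(expr, seed_gene = 1, a = 5, n_iter = 50) {
  metagene <- expr[seed_gene, ]
  for (i in seq_len(n_iter)) {
    assoc <- apply(expr, 1, function(g) cor(g, metagene))
    w <- pmax(assoc, 0)^a                     # emphasize strongly associated genes
    if (sum(w) == 0) break
    new_metagene <- colSums(w * expr) / sum(w)
    if (cor(new_metagene, metagene) > 0.9999) {  # converged to an "attractor"
      metagene <- new_metagene
      break
    }
    metagene <- new_metagene
  }
  metagene
}

mg <- attractor_metagene(expr)
```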

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.
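
Mechanically, such ensembling can be as simple as averaging (for continuous risk scores) or majority-voting (for class labels) over the stored predictions; here is a minimal sketch with made-up prediction vectors.

```r
# Minimal ensembling sketch over stored model predictions (made-up numbers)
pred_model_a <- c(0.80, 0.10, 0.55, 0.30)   # e.g. predicted risk per patient
pred_model_b <- c(0.70, 0.20, 0.60, 0.40)
pred_model_c <- c(0.90, 0.05, 0.55, 0.35)

# Averaging continuous scores
ensemble_avg <- rowMeans(cbind(pred_model_a, pred_model_b, pred_model_c))

# Majority vote on hard labels derived from each model
labels <- cbind(pred_model_a, pred_model_b, pred_model_c) > 0.5
ensemble_vote <- rowMeans(labels) > 0.5
```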

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly, I think) assumes that the rigorous evaluation process in Synapse is more objective than peer review. Kudos to Science Translational Medicine for that.

There are a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), and how they looked at millions of models to find general rules of thumb about predictive modeling (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering are essential for good performance, etc.).
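
The elastic net point is easy to try for yourself; here is a minimal sketch with the glmnet package, where the simulated data and the penalty mix (alpha = 0.5) are illustrative choices of mine, not the parameters used in the Sage analyses.

```r
# Minimal elastic net sketch with glmnet; alpha = 0.5 mixes lasso and ridge.
library(glmnet)

set.seed(7)
x <- matrix(rnorm(100 * 500), nrow = 100)   # 100 samples, 500 features
y <- rbinom(100, 1, plogis(x[, 1] - x[, 2]))

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
coef(cvfit, s = "lambda.min")               # sparse set of selected features
```
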
Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Sage commons and personalized medicine

Today I had the chance to talk to Stephen Friend, who started Sage Bionetworks, which I must have blogged about at some point but can’t find any entries for at the moment.

Sage was created to facilitate research through opening up genetic, clinical and other sorts of data as much as possible, or, as the web site puts it, to address “the acute need for a new approach to using complex genetic information for drug development.” Stephen Friend was previously at Rosetta and Merck, and among the current data sets in the Sage Repository there are several interesting ones, containing both genetic and phenotypic information, that have been used in high-profile Merck/Rosetta-related papers by Eric Schadt and others.

However, the Sage project is not only about providing data; it’s also about disease network modeling (I’m guessing at both the gene and protein levels), a goal that Friend is clearly serious about. Another interesting thing is that Sage has Jeff Hammerbacher – the whiz kid who built up Facebook’s IT platform – on its board of directors. And he’s not there as a token big data guy – Friend told me that Hammerbacher is actively involved in developing Sage’s IT infrastructure. I think it’s great to have data scientists working on biological problems. As Hilary Mason said at the Strata conference, we have enough ad optimization solutions now – let’s do something different!

I hadn’t looked at Sage for a while and was pleasantly surprised to learn that they have a nice Tumblr site, which linked to interesting content such as this Scientific American Pathways compilation of good articles about data, medicine and personalization, and the RNA game EteRNA, which is something along the lines of the Phylo and FoldIt games previously covered on this blog. I also found the Personalized Health Manifesto, written by David Ewing Duncan and underwritten by people like Stephen Friend, George Church, Eric Schadt, Atul Butte, Misha Angrist and many others. I haven’t read it yet but certainly aim to do so as soon as possible.
