Follow the Data

A data driven blog

Archive for the tag “R”

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than on all types of disease as it did earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests let researchers escape from the “self-assessment trap”, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.
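
As a quick sanity check on the “6000 compute years” figure, here is the arithmetic in R. I am assuming that one compute year means a single core running around the clock for a 365-day year; the variable names are my own.

```r
# Convert the donated core hours into single-core "compute years".
# Assumption: one compute year = one core running 24 hours a day for 365 days.
core_hours <- 50e6
hours_per_year <- 24 * 365
compute_years <- core_hours / hours_per_year

# Roughly 5700 years, which the quoted figure rounds up to 6000
round(compute_years)
```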

There was another reason not to use Kaggle as well. Sage wanted participants not only to upload predictions, with the results shown on a dynamic leaderboard (which they do), but also to provide runnable code that is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R), which are called by the server software. Many groups do not like to use R for their model development, but there are ways to easily wrap arbitrary types of code inside R.
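
To give a feel for what such a submission might look like, here is a minimal sketch of a model object exposing customTrain() and customPredict(). Only the two method names come from the description above. The class name, the argument names, and the choice of an R reference class are my own assumptions, not the actual Sage interface, and a plain linear model stands in for a real prognostic model.

```r
library(methods)

# Hypothetical model class. Only the method names customTrain/customPredict
# come from the challenge description; everything else is an assumption.
LinearPrognosisModel <- setRefClass(
  "LinearPrognosisModel",
  fields = list(fit = "ANY"),
  methods = list(
    customTrain = function(features, outcome) {
      # Fit a plain linear model as a stand-in for a real prognostic model
      train_df <- data.frame(outcome = outcome, features)
      fit <<- lm(outcome ~ ., data = train_df)
      invisible(.self)
    },
    customPredict = function(features) {
      # Return one numeric prediction per row of the new feature table
      as.numeric(predict(fit, newdata = data.frame(features)))
    }
  )
)

# Toy usage on simulated data (not challenge data)
set.seed(1)
x <- data.frame(g1 = rnorm(30), g2 = rnorm(30))
y <- 2 * x$g1 - x$g2 + rnorm(30, sd = 0.1)

model <- LinearPrognosisModel$new()
model$customTrain(x, y)
preds <- model$customPredict(x)
```

The point of an interface like this is that the server only needs to know the two method names; whatever happens inside them (including calls out to non-R code) is up to the competitor.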

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e.g. Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess whether informative signatures could be found, and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.
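
To illustrate the general idea of iterating towards a metagene, here is a heavily simplified sketch. It is not the Attractor Metagenes team's code: I am assuming Pearson correlation as the gene-to-metagene association measure and an arbitrary exponent to sharpen the weights, and the function name and parameters are hypothetical. See the arXiv paper for the actual method.

```r
# Simplified illustration of an iterative "attractor"-style metagene search.
# Assumptions (not from the paper): Pearson correlation as the association
# measure, negative associations set to zero, and an arbitrary exponent.
find_attractor <- function(expr, seed, power = 5, tol = 1e-6, max_iter = 100) {
  metagene <- expr[seed, ]                      # start from a single seed gene
  for (i in seq_len(max_iter)) {
    assoc <- as.vector(cor(t(expr), metagene))  # association of each gene with the metagene
    w <- pmax(assoc, 0)^power                   # sharpen: strongly associated genes dominate
    w <- w / sum(w)                             # normalise weights to sum to 1
    new_metagene <- as.vector(w %*% expr)       # weighted linear combination of genes
    converged <- max(abs(new_metagene - metagene)) < tol
    metagene <- new_metagene
    if (converged) break
  }
  list(metagene = metagene, weights = setNames(w, rownames(expr)), iterations = i)
}

# Simulated expression matrix (genes x samples): 5 co-expressed genes plus 20 noise genes
set.seed(1)
n_samples <- 50
signal <- rnorm(n_samples)
module <- t(sapply(1:5, function(i) signal + rnorm(n_samples, sd = 0.3)))
noise <- matrix(rnorm(20 * n_samples), nrow = 20)
expr <- rbind(module, noise)
rownames(expr) <- paste0("gene", 1:25)

res <- find_attractor(expr, seed = 1)
```

In this toy setting, most of the weight should end up on the five co-expressed genes, which is the sense in which the iteration is drawn to a characteristic metagene.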

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.
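
The two simplest ways of ensembling mentioned above, averaging and majority voting, can be sketched in a few lines of base R. The function names are my own and the inputs are made-up toy predictions, not leaderboard models.

```r
# Average numeric predictions (e.g. risk scores) from several models.
# pred_list: list of numeric vectors, one per model, all of the same length.
ensemble_average <- function(pred_list) {
  rowMeans(do.call(cbind, pred_list))
}

# Majority vote over class labels from several models.
# Ties are broken in favour of the label that sorts first alphabetically.
ensemble_vote <- function(label_list) {
  votes <- do.call(cbind, label_list)
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# Toy usage with made-up predictions from three hypothetical models
avg <- ensemble_average(list(c(0.2, 0.8), c(0.4, 0.6)))
vote <- ensemble_vote(list(c("a", "b"), c("a", "b"), c("b", "b")))
```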

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review; the journal (correctly, I think) assumes that the rigorous evaluation process in Synapse is more objective than peer review. Kudos to Science Translational Medicine for that.

There are a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), or how they looked at millions of models to find general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering are essential for good performance, etc.).

Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this; the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Stockholm R useR Group inaugural meeting

Yesterday, the Stockholm R useR group had its inaugural meeting, hosted by Statisticon, a statistical consulting firm with offices in the heart of the city. It was a smaller and more intimate affair than the Stockholm Big Data meetup last week, with perhaps 25 people attending. If my memory serves, the entities represented at the meeting were Statisticon itself, the Swedish Institute for Communicable Disease Control, Klarna, Stockholm University (researchers in 3 different fields), the Swedish Pensions Agency, and Karolinska Institutet.

There were two themes that came up again and again: firstly, reproducible dynamic reporting – everyone seemed to either use (and love) or want to learn Sweave (and to some extent knitr), and secondly, managing big data sets in R. Thus it was decided to focus on these for the next meeting: an expert from the group will give a presentation on Sweave, and another group of members will try to collect information on what is available for “big data” in R.

I thought it was interesting to see that the representatives from the Swedish Pensions Agency (there were 3 of them) seemed so committed to R, open source and open data. Nice! It was also mentioned that another employee of the same agency, who wasn’t present, has been developing his own big-data R package for dealing with the 9-million-row table containing pension-related data on all Swedish citizens.

A good week for (big) data (science)

Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.

Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, a nice Hadoop toolkit for sequencing data that enables running several different types of tasks in distributed fashion. Other things we looked at were the CloudBioLinux project, map/reduce sequence assembly using Contrail and CSC’s biological high-throughput data analysis platform Chipster.

On Friday, blog co-author Joel and I went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.

Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun and can serve as a checklist for myself in the future.
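
A bare-bones version of such a checklist might look like the sketch below. It is not the actual assignment: I am using R’s built-in mtcars data as a stand-in for a Kaggle download, a plain linear model, and absolute t-values as a crude measure of variable importance.

```r
# Stand-in data: the built-in mtcars table instead of a Kaggle download
set.seed(42)
data(mtcars)

# 1. Get the data into shape: hold out part of it for honest evaluation
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# 2. Fit a simple model on the training part only
fit <- lm(mpg ~ wt + hp, data = train)

# 3. Evaluate on the held-out part and compare against a trivial baseline
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
baseline_rmse <- sqrt(mean((test$mpg - mean(train$mpg))^2))

# 4. Crude variable importance: absolute t-values of the coefficients
importance <- sort(abs(summary(fit)$coefficients[-1, "t value"]), decreasing = TRUE)
```

Comparing rmse with baseline_rmse is the step that is easiest to forget: a model is only interesting if it beats simply predicting the training mean.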

Stay tuned for the first episode of Follow the Data podcast!

RStudio

I’m not normally a big user of IDEs, but I have to say that the new RStudio is pretty slick. It’s a free, open-source IDE for R and looks a bit like the Matlab IDE with a tabbed interface for convenient access to variables and objects, plots and data tables. RStudio runs on Mac, Linux and Windows or on a server, where it can be accessed remotely through a web browser. A nice touch is that it supports Sweave and TeX document creation, although I haven’t tested either of those yet. Maybe now’s the time to learn some Sweave. I started to use RStudio yesterday and I think it will replace the Mac GUI for R that I have been using. The latter is all right but a bit too disjointed when you start plotting and editing several files at once.

Food and health data set

I stumbled into an amazing dataset about food and health, available online here (Google spreadsheet) and described at the Canibais e Reis blog. I found it through the Cluster analysis of what the world eats blog post, which is cool, but which doesn’t go into the health part of the dataset. By the way, the R code used in that blog post is useful for learning how to plot things onto a map of the world in R (and it calculates the most deviant food habits in Mexico and USA as a bonus). Also note the first line:

diet <- read.csv("http://spreadsheets.google.com/pub?key=tdzqfp-_ypDqUNYnJEq8sgg&single=true&gid=0&output=csv")

which reads the data set directly from a URL into an R data structure, ready to be manipulated. I think it’s pretty neat, but then I am easily impressed.
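
Since remote spreadsheets have a habit of moving or disappearing, it can be worth wrapping the call so that a failed download does not stop a script. The helper below is a hypothetical sketch of my own. To keep it runnable offline, the demonstration reads a temporary local file through a file:// URL (assuming a Unix-like path) and the toy table is made up.

```r
# Hypothetical helper: returns a data frame, or NULL if the source cannot be read
read_remote_csv <- function(url) {
  tryCatch(
    suppressWarnings(read.csv(url, stringsAsFactors = FALSE)),
    error = function(e) NULL
  )
}

# Demonstrate with a local file:// URL so the sketch runs without network access.
# The table contents are made-up toy values, not the real diet data.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c("a", "b"), value = c(1.5, 2.5)), tmp, row.names = FALSE)

diet_demo <- read_remote_csv(paste0("file://", tmp))                      # succeeds
missing <- read_remote_csv(paste0("file://", tmp, ".does-not-exist"))     # NULL
```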

The Canibais e Reis author was interested in data on the relationship between nutrition, lifestyle and health worldwide, but those data were dispersed over various sources and used different formats. He therefore (heroically) combined information from sources like the FAO Statistical Yearbook (for world nutrition data), the British Heart Foundation (for world heart-related, diabetes, obesity, cholesterol etc. disease statistics) and the WHO Global Health Atlas and WHO Statistical Information System (for general world health statistics like mortality, sanitation, drinking water, etc.). After cleaning up the data set and removing incomplete entries, he ended up with a complete matrix of 101 nutrition, health and lifestyle variables for 86 countries. Let the mining begin!

As the blog post describing the data points out, there’s bound to be a lot of confounding variables and non-independence in the data set, so it would be a good idea to apply tools like PCA (see e.g. the recent article Principal Components for Modeling), canonical correlation analysis or something similar to it as a pre-processing step. I haven’t had time to do more than fiddle around a bit – for example, I ran a quick PCA on the food related part of the matrix to try to find out the major direction of variation in world diets. The first principal component (which, at 19.8%, is not very dominant) reflects a division between rice eating countries and “meat and wheat” countries with high consumption of animal products, wheat, meat and sugar.
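
A quick PCA of this kind takes only a few lines in R. Since I cannot assume the spreadsheet is still reachable, the sketch below uses the built-in USArrests table as a stand-in for the food-related columns; the variable names are my own.

```r
# Stand-in for the food-related part of the diet matrix (rows = observations,
# columns = numeric variables); replace with the relevant diet columns.
food <- USArrests

# Centre and scale so that variables measured in different units are comparable
pca <- prcomp(food, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings on the first component, sorted to show which variables pull
# in opposite directions along the major axis of variation
loadings_pc1 <- sort(pca$rotation[, 1])
```

The first element of var_explained corresponds to the kind of figure quoted above for the diet data, and the signs of loadings_pc1 show which variables sit at opposite ends of that component.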

Canibais e Reis provides a dynamic Excel file where some different types of analysis have been performed. It’s fun to explore the unexpected correlations (or absent correlations) that pop up (the worksheets BEST and WORST in the Excel file). One surprising finding that emerges is that cholesterol is not correlated with cardiovascular disease across this data set (in fact there is a slight negative correlation).

My favourite finding, though, is that cheese consumption is not positively correlated with death from non-communicable diseases or cardiovascular diseases. Those correlations may be massively influenced by confounding variables, but they are negative enough that I choose to continue chomping on those cheeses …

Video time

Here are a few video clips I’ve enjoyed watching over the past week.

From TEDMED2009, David Agus talks about cancer research and covers quite a lot of territory, from the value of monitoring your habits (he briefly discusses his own Philips DirectLife device) to the need for a molecular rather than tissue-based definition of cancer and his quest to model cancer as a complex system that has to do with a lot more than genetics.

The Argument for Better Health, in 3 Minutes & 53 Seconds is an attempt to summarize the most important arguments of Thomas Goetz’ new book The Decision Tree for a broader audience. In other words, it’s about how individuals can take control of their own health by using what Goetz calls a decision tree approach. The video, although good, is kind of entry-level material; if you want to go a bit deeper, you could download podcasts of the introduction and first chapter of the book. Here are three good reviews of the book.

Finally, a video in four parts explaining the benefits of using the R language for statistical analysis. I use R myself practically daily and think it’s great. These videos make it clear that it has now spread far outside of academia and has become an important part of the data analyst’s toolbox.
