Follow the Data

A data driven blog

Archive for the tag “R”

Tutorial: Exploring TCGA breast cancer proteomics data

Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).

The Cancer Genome Atlas (TCGA) has become a focal point for a lot of genomics and bioinformatics research. DNA and RNA level data on different tumor types are now used in countless papers to test computational methods and to learn more about hallmarks of different types of cancer.

Perhaps, though, there aren’t as many people who are using the quantitative proteomics data hosted by Clinical Proteomic Tumor Analysis Consortium (CPTAC). There are mass spectrometry based expression measurements for many different types of tumor available at their Data Portal.

As I have been comparing some (currently in-house, to be published eventually) cancer proteomics data sets against TCGA proteomics data, I thought I would share some code, tricks and tips for those readers who want to start analyzing TCGA data (whether proteomics, transcriptomics or other kinds) but don’t quite know where to start.

To this end, I have put a tutorial Jupyter notebook at Github: TCGA protein tutorial

The tutorial is written in R, mainly because I like the TCGA2STAT and Boruta packages (but I just learned there is a Boruta implementation in Python as well.) If you think it would be useful to have a similar tutorial in Python, I will consider writing one.

The tutorial consists, roughly, of these steps:

  • Getting a usable set of breast cancer proteomics data
    This consists of downloading the data, selecting the subset that we want to focus on, removing features with undefined values, etc..
  • Doing feature selection to find proteins predictive of breast cancer subtype.
    Here, the Boruta feature selection package is used to identify a compact set of proteins that can predict the so-called PAM50 subtype of each tumor sample. (The PAM50 subtype is based on mRNA expression levels.)
  • Comparing RNA-seq data and proteomics data on the same samples.
    Here, we use the TCGA2STAT package to obtain TCGA RNA-seq data and find the set of common gene names and common samples between our protein and mRNA-seq data in order to look at protein-mRNA correlations.

Please visit the notebook if you are interested!

Some of the take-aways from the tutorial may be:

  • A bit of messing about with metadata, sample names etc. is usually necessary to get the data in the proper format, especially if you are combining different kinds of data (such as RNA-seq and proteomics here). I guess you’ve heard them say that 80% of data science is data preparation!…
  • There are now quantitative proteomics data available for many types of TCGA tumor samples.
  • TCGA2STAT is a nice package for importing certain kinds of TCGA data into an R session.
  • Boruta is an interesting alternative for feature selection in a classification context.

This post was prepared with permission from CPTAC.

P.S. I may add some more material on a couple of ways to do multivariate data integration on TCGA data sets later, or make that a separate blog post. Tell me if you are interested.

Playing with Swedish election data

The Swedish general elections were held on September 14, 2014, and resulted in a messy parliamentary situation in which the party receiving the most votes, the Social Democrats, had a hard time putting together a functioning government. The previous right-wing (by Swedish standards) government led by the Moderates was voted out after eight years in power. The most discussed outcome was that the nationalist Sweden Democrats party surged to 12.9% of the vote, up from about 5% in the 2010 elections.

I read data journalist Jens Finnäs’ interesting blog post “Covering election night with R“. He was covering the elections with live statistical analysis on Twitter. I approached him afterwards and asked for the data sets he had put together on voting results and various characteristics of the voting municipalities, in order to do some simple statistical analysis of voting patterns, and he kindly shared them with me (they are also available on GitHub, possibly in a slightly different form, I haven’t checked).

What I wanted to find out was whether there was a clear urban-rural separation in voting patterns and whether a principal component analysis would reveal a “right-left” axis and a “traditional-cosmopolitan” axis corresponding to a schematic that I have seen a couple of times now. I also wanted to see if a random forest classifier would be able to predict the vote share of the Sweden Democrats, or some other party, in a municipality, based only on municipality descriptors.

There are some caveats in this analysis. For example, we are dealing with compositional data here in the voting outcomes: all the vote shares must sum to one (or 100%). That means that neither PCA nor random forests may be fully appropriate. Wilhelm Landerholm pointed me to an interesting paper about PCA for compositional data. As for the random forests, I suppose I should use some form of multiple-output RF, which could produce one prediction per party, but since this was a quick and dirty analysis, I just did it party by party.

The analysis is available as a document with embedded code and figures at Rpubs, or on GitHub if you prefer that. You’ll have to go there to see the plots, but some tentative “results” that I took away were:

  • There are two axes where one (PC1) can be interpreted as a Moderate – Social Democrat axis (rather than a strict right vs left axis), and one (PC2) that can indeed be interpreted as a traditionalist – cosmopolitan axis, with the Sweden Democrats at one end, and the Left party (V) (also to some extent the environmental party, MP, the Feminist initiative, FI, and the liberal FP) at the other end.
  • There is indeed a clear difference between urban and rural voters (urban voters are more likely to vote for the Moderates, rural voters for the Social democrats).
  • Votes for the Sweden Democrats are also strongly geographically determined, but here it is more of a gradient along Sweden’s length (the farther north, the less votes – on average – for SD).
  • Surprisingly (?), the reported level of crime in a municipality doesn’t seem to affect voting patterns at all.
  • A random forest classifier can predict votes for a party pretty well on unseen data based on municipality descriptors. Not perfectly by any means, but pretty well.
  • The most informative features for predicting SD vote share were latitude, longitude, proportion of motorcycles, and proportion of educated individuals.
  • The most informative feature for predicting Center party votes was the proportion of tractors 🙂 Likely a confounder/proxy for rural-ness.

There are other things that would be cool to look at, such as finding the most “atypical” municipalities based on the RF model. Also there is some skew in the RF prediction scatter plots that should be examined. I’ll leave it as is for now, and perhaps return to it at some point.

Edit 2015-01-03. I read a post at the Swedish Cornucopia blog, which points out that the number of asylum seekers per capita is positively correlated with SD votes (Pearson r~0.32) and negatively correlated (r~-0.34) with votes for the moderates, M). The author thought this was highly significant but I felt that there were probably more important indicators. I therefore downloaded data on asylum seekers per capita, which had been put together based on combining the Migration Board’s statistics on asylum seekers from December 2014 with Statistics Sweden’s population statistics, and introduced this as a new indicator in my models. I pushed the updated version to GitHub. My interpretation based on the PCA and random forest analyses is that the number of asylum seekers per capita is not among the most important indicators for explaining the SD vote share.

Book and MOOC

As of today, is stocking a book to which I have contributed, RNA-seq Data Analysis: A Practical Approach. I realize the title might sound obscure to readers who are unfamiliar with genomics and bioinformatics. Simply put, RNA-seq is short for RNA sequencing, a method for measuring what we call gene expression. While the DNA contained in each cell is (to a first approximation) identical, different tissues and cell types turn their genes on and off in different ways in response to different conditions. The process when DNA is transcribed to RNA is called gene expression. RNA-seq has become a rather important experimental method and the lead author of our book, Eija Korpelainen, wanted to put together a user-friendly, practical and hopefully unbiased compendium of the existing RNA-seq data analysis methods and toolkits, without neglecting underlying theory. I contributed one chapter, the one about differential expression analysis, which basically means statistical testing for significant gene expression differences between groups of samples.

I am also currently involved as an assistant teacher in the Explore Statistics with R course given by Karolinska Institutet through the edX MOOC platform. Specifically, I have contributed material to the final week (week 5) which will start next Tuesday (October 7th). That material is also about RNA-seq analysis – I try to show a range of tools available in R which allow you to perform a complete analysis workflow for a typical scenario. Until the fifth week starts, I am helping out with answering student questions in the forums. It’s been a positive experience so far, but it is clear that one can never prepare enough for a MOOC – errors in phrasing, grading, etc are bound to pop up. Luckily, several gifted students are doing an amazing job of answering the questions from other students, while teaching us teachers a thing or two about the finer points of R.

Speaking of MOOCs, Coursera’s Mining Massive Datasets course featuring Jure Leskovec, Anand Rajaraman and Jeff Ullman started today. My plan is to try to follow it – we shall see if I have time.

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadleys PhD thesis (free pdf here) is a very inspiring tour of different aspects of practical data analysis issues, such as reshaping data into a “tidy” for that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages) and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also made the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread.
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” which is designed for responsive dynamic graphics, streaming visualization and meant for the web. This looked really nice. For example, you can easily add sliders to a plot so you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R markdown documents with knitR so that the resulting report can contain sliders and other things. This is obviously not possible with PDFs though. ggvis will be released on CRAN “in a week or so”.

Hadley also highlighted the magrittr package which implements a pipe operator for R (Magritte/pipe … get it? (groan)) The pipe looks like %>% and at first blush it may not look like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary.)

Stockholm data happenings

The weather may be terrible at the moment in Stockholm (it was really a downer to come back from the US this morning) but there are a couple of interesting data-related events coming up. The past week, I missed two interesting events: the KTH Symposium on Big Data (past Mon, May 26) and the AWS Summit (past Tue, May 27).

In June, there will be meetups on deep learning (Machine Learning Stockholm group, June 9 at Spotify) and on Shiny and ggvis presented by Hadley Wickham himself (Stockholm useR group, June 16 at Pensionsmyndigheten.) There are wait lists for both.

Danny Bickson is giving a tutorial on GraphLab at Stockholm iSocial Summer school June 2-4. He has indicated that he would be happy to chat with anyone who is interested in connection with this.

King are looking for a “data guru” – a novel job title!

Finally, Wilhelm Landerholm, a seasoned data scientist who was way ahead of the hype curve, has finally started (or revived?) his blog on big data, which unfortunately is Swedish only: We Want Your Data.



Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests lets researchers escape from the “self-assessment trap“, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants to not only upload predictions for which the results is shown on a dynamic leaderboard (which they do), but also to force them to provide runnable code which is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R) which are called by the server software. Many groups do not like to use R for their model development but there are ways to easily wrap arbitrary types of code inside R.

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e g Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective that peer review. Kudos to Science Translational Medicine for that.

There’s a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), how they looked at millions of models to find out general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering is essential for good performance, etc.)
Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Stockholm R useR Group inaugural meeting

Yesterday, the Stockholm R useR group had its inaugural meeting, hosted by Statisticon, a statistical consulting firm with offices in the heart of the city. It was a smaller and more intimate affair than the Stockholm Big Data meetup last week, with perhaps 25 people attending. If my memory serves, the entities represented at the meeting were Statisticon itself, the Swedish Institute for Communicable Disease Control, Klarna, Stockholm University (researchers in 3 different fields), the Swedish Pensions Agency, and Karolinska Institutet.

There were two themes that came up again and again: firstly, reproducible dynamic reporting – everyone seemed to either use (and love) or want to learn Sweave (and to some extent knitr), and secondly, managing big data sets in R. Thus it was decided to focus on these for the next meeting: an expert from the group will give a presentation on Sweave, and another group of members will try to collect information on what is available for “big data” in R.

I thought it was interesting to see that the representatives from the Swedish Pensions Agency (there were 3 of them) seemed so committed to R, open source and open data. Nice! It was also mentioned that another employee of the same agency, who wasn’t present, has been developing his own big-data R package for dealing with the 9-million-row table containing pension-related data on all Swedish citizens.

A good week for (big) data (science)

Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.

Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, which is a nice Hadoop toolkit for sequencing data which enables running several different types of tasks in distributed fashion. Other things we looked at was the CloudBioLinux project, map/reduce sequence assembly using Contrail and CSC’s biological high-throughput data analysis platform Chipster.

On Friday, me and blog co-author Joel went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.

Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun and can serve as a checklist for my self in the future.

Stay tuned for the first episode of Follow the Data podcast!


I’m not normally a big user of IDEs, but I have to say that the new RStudio is pretty slick. It’s a free, open-source IDE for R and looks a bit like the Matlab IDE with a tabbed interface for convenient access to variables and objects, plots and data tables. RStudio runs on Mac, Linux and Windows or on a server, where it can be accessed remotely through a web browser. A nice touch is that it supports Sweave and TeX document creation, although I haven’t tested either of those yet. Maybe now’s the time to learn some Sweave. I started to use RStudio yesterday and I think it will replace the Mac GUI for R that I have been using. The latter is all right but a bit too disjointed when you start plotting and editing several files at once.

Food and health data set

I stumbled into an amazing dataset about food and health, available online here (Google spreadsheet) and described at the Canibais e Reis blog. I found it through the Cluster analysis of what the world eats blog post, which is cool, but which doesn’t go into the health part of the dataset. By the way, the R code used that blog post is useful for learning how to plot things onto a map of the world in R (and it calculates the most deviant food habits in Mexico and USA as a bonus). Also note the first line:


which reads the data set directly from an URL into an R data structure, ready to be manipulated. I think it’s pretty neat, but then I am easily impressed.

The Canibais e Reis author was interested in data on the relationship between nutrition, lifestyle and health worldwide, but those data were dispersed over various sources and used different formats. He therefore (heroically) combined information from sources like the FAO Statistical Yearbook (for world nutrition data), the British Heart Foundation (for world heart-related, diabetes, obesity, cholesterol etc. disease statistics) and the WHO Global Health Atlas and WHO Statistical Information System (for general world health statistics like mortality, sanitation, drinking water, etc.) After cleaning up the data set and removing incomplete entries, he ended up with a complete matrix of 101 nutrition, health and lifestyle variables for 86 countries. Let the mining begin!

As the blog post describing the data points out, there’s bound to be a lot of confounding variables and non-independence in the data set, so it would be a good idea to apply tools like PCA (see e.g. the recent article Principal Components for Modeling), canonical correlation analysis or something similar to it as a pre-processing step. I haven’t had time to do more than fiddle around a bit – for example, I ran a quick PCA on the food related part of the matrix to try to find out the major direction of variation in world diets. The first principal component (which, at 19.8%, is not very dominant) reflects a division between rice eating countries and “meat and wheat” countries with high consumption of animal products, wheat, meat and sugar.
Canibais e Reis provides a dynamic Excel file where some different types of analysis have been performed. It’s fun to explore the unexpected correlations (or absent correlations) that pop up (the worksheets BEST and WORST in the Excel file). One surprising finding that emerges is that cholesterol is not correlated to cardiovascular disease across this data set (in fact there is a slight negative correlation).

My favourite finding, though, is that cheese consumption is not correlated to death from non-communicable diseases or cardiovascular diseases. Those correlations may be massively influenced by confounding variables, but they are negative enough that I choose to continue chomping on those cheeses …

Post Navigation