Follow the Data

A data-driven blog

Archive for the tag “cancer”

Tutorial: Exploring TCGA breast cancer proteomics data

Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).

The Cancer Genome Atlas (TCGA) has become a focal point for a lot of genomics and bioinformatics research. DNA and RNA level data on different tumor types are now used in countless papers to test computational methods and to learn more about hallmarks of different types of cancer.

Perhaps, though, not as many people are using the quantitative proteomics data hosted by the Clinical Proteomic Tumor Analysis Consortium (CPTAC). Mass spectrometry-based expression measurements for many different tumor types are available at their Data Portal.

As I have been comparing some (currently in-house, to be published eventually) cancer proteomics data sets against TCGA proteomics data, I thought I would share some code, tricks and tips for those readers who want to start analyzing TCGA data (whether proteomics, transcriptomics or other kinds) but don’t quite know where to start.

To this end, I have put a tutorial Jupyter notebook on GitHub: TCGA protein tutorial

The tutorial is written in R, mainly because I like the TCGA2STAT and Boruta packages (though I just learned there is a Boruta implementation in Python as well). If you think it would be useful to have a similar tutorial in Python, I will consider writing one.

The tutorial consists, roughly, of these steps:

  • Getting a usable set of breast cancer proteomics data
    This consists of downloading the data, selecting the subset that we want to focus on, removing features with undefined values, etc.
  • Doing feature selection to find proteins predictive of breast cancer subtype.
    Here, the Boruta feature selection package is used to identify a compact set of proteins that can predict the so-called PAM50 subtype of each tumor sample. (The PAM50 subtype is based on mRNA expression levels.)
  • Comparing RNA-seq data and proteomics data on the same samples.
    Here, we use the TCGA2STAT package to obtain TCGA RNA-seq data and find the set of common gene names and common samples between our protein and mRNA-seq data in order to look at protein-mRNA correlations. (A condensed code sketch of these three steps follows below.)
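
For orientation, here is a minimal R sketch of the three steps above. It is not the notebook itself: the file names and the PAM50 annotation column are placeholders, and in practice the sample identifiers need some massaging before the proteomics and RNA-seq samples can be matched.

```r
# Minimal sketch of the tutorial workflow (placeholder file and column names).
library(Boruta)     # wrapper-based feature selection
library(TCGA2STAT)  # convenience downloads of TCGA data

## 1. Load a CPTAC breast cancer proteomics table (proteins x samples)
##    and keep only proteins quantified in every sample.
prot <- read.delim("breast_cancer_proteomics.tsv", row.names = 1)
prot <- prot[complete.cases(prot), ]

## 2. Feature selection: which proteins predict the PAM50 subtype?
##    "sample_annotation.tsv" with a PAM50 column is a placeholder.
annot <- read.delim("sample_annotation.tsv", row.names = 1)
pam50 <- factor(annot[colnames(prot), "PAM50"])
bor   <- Boruta(x = as.data.frame(t(prot)), y = pam50, maxRuns = 200)
getSelectedAttributes(bor, withTentative = FALSE)   # confirmed proteins

## 3. Protein-mRNA correlations on matched genes and samples.
rna     <- getTCGA(disease = "BRCA", data.type = "RNASeq2")  # rna$dat: genes x samples
genes   <- intersect(rownames(prot), rownames(rna$dat))
samples <- intersect(colnames(prot), colnames(rna$dat))
cors <- sapply(genes, function(g)
  cor(as.numeric(prot[g, samples]), as.numeric(rna$dat[g, samples]),
      method = "spearman"))
summary(cors)
```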

Please visit the notebook if you are interested!

Some of the take-aways from the tutorial may be:

  • A bit of messing about with metadata, sample names and so on is usually necessary to get the data into the proper format, especially if you are combining different kinds of data (such as RNA-seq and proteomics here). You have probably heard the saying that 80% of data science is data preparation!
  • There are now quantitative proteomics data available for many types of TCGA tumor samples.
  • TCGA2STAT is a nice package for importing certain kinds of TCGA data into an R session.
  • Boruta is an interesting alternative for feature selection in a classification context.

This post was prepared with permission from CPTAC.

P.S. I may add some more material on a couple of ways to do multivariate data integration on TCGA data sets later, or make that a separate blog post. Tell me if you are interested.

Lessons learned from mining heterogeneous cancer data sets

How much can we learn about cancer treatment and prevention by large-scale data collection and analysis?

An interesting paper was just published: “Assessing the clinical utility of cancer genomic and proteomic data across tumor types“. I am afraid the article is behind a paywall, but no worries – I will summarize the main points here! Basically the authors have done a large-scale data mining study of data published within The Cancer Genome Atlas (TCGA) project, a very ambitious effort to collect molecular data on different kinds of tumors. The main question they ask is how much clinical utility these molecular data can add to conventional clinical information such as tumor stage, tumor grade, age and gender.

The lessons I drew from the paper are:

  • The molecular data do not add that much predictive information beyond the clinical information. As the authors put it in the discussion, “This echoes the observation that the number of cancer prognostic molecular markers in clinical use is pitifully small, despite decades of protracted and tremendous efforts.” It is an unfortunate fact of life in cancer genomics that many molecular classifiers (usually based on gene expression patterns) have been proposed to predict tumor severity, patient survival and so on, but different groups keep coming up with different gene sets, and these tend not to be validated in independent cohorts.
  • When looking at what factors explain most of the variation, the type of tumor explains the most (37.4%), followed by the type of data used (that is, gene expression, protein expression, micro-RNA expression, DNA methylation or copy number variation), which explains 17.4%, with the interaction between tumor type and data type in third place (11.8%), suggesting that some data types are more informative for certain tumors than others. The algorithm used is fairly unimportant (5.2%). At the risk of drawing unwarranted conclusions, it is tempting to generalize: the most important factor is the intrinsic difficulty of modeling the system, the next most important is the choice of what data to collect (and how to engineer features), while the type of algorithm used to learn the model comes far behind.
  • Perhaps surprisingly, there was essentially no cross-tumor predictive power between models. (There was one exception to this.) That is, a model built for one type of tumor was typically useless when predicting the prognosis for another tumor type.
  • Individual molecular features (expression levels of individual genes, for instance) did not add predictive power beyond what was already in the clinical information, but in some cases the molecular subtype did. The molecular subtype is a “molecular footprint” derived using consensus NMF (nonnegative matrix factorization, an unsupervised method that can be used for dimension reduction and clustering). This footprint, which describes a general pattern, was informative, whereas the individual features making it up were not. That seems consistent with the issue mentioned above about gene sets failing to consistently predict tumor severity: the predictive information lives at a higher level than the individual genes. (A rough sketch of NMF-based clustering follows after this list.)
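
As a rough illustration of the idea (my own toy example, not the authors' pipeline), the R NMF package can run several random restarts of the factorization and derive a consensus clustering of the samples:

```r
# Toy illustration of NMF-based subtype discovery (not the paper's pipeline).
library(NMF)

set.seed(1)
expr <- matrix(rexp(200 * 60), nrow = 200)   # 200 "genes" x 60 "samples", non-negative

# Several random restarts; the consensus over runs gives a stable
# clustering of samples, i.e. the "molecular footprints".
res     <- nmf(expr, rank = 3, nrun = 20)
subtype <- predict(res, what = "consensus")  # cluster membership per sample
table(subtype)
```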

The authors argue that one reason for the failure of predictive modeling in cancer research has been that investigators have relied too much on p values to say something about the clinical utility of their markers, when they should instead have focused more on the effect size, or the magnitude of difference in patient outcomes.
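
A toy example of the distinction (mine, not from the paper): with a large enough cohort, an effect far too small to matter clinically can still produce an impressively small p value.

```r
# Toy example: a negligible effect becomes "statistically significant" with large n.
set.seed(42)
n <- 100000
outcome_a <- rnorm(n, mean = 0)      # outcome in group A
outcome_b <- rnorm(n, mean = 0.02)   # group B: shifted by 0.02 standard deviations

t.test(outcome_a, outcome_b)$p.value      # typically far below 0.05
mean(outcome_b) - mean(outcome_a)         # effect size: ~0.02, clinically negligible
```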

They also make a good point about reliability and reproducibility. I quote: “The literature of tumor biomarkers is plagued by publication bias and selective and/or incomplete reporting.” To help combat these biases, the authors (many of whom are associated with Sage Bionetworks, which I have mentioned repeatedly on this blog) have made available an open model-assessment platform, which of course includes all the models from the paper itself but can also be used to assess your own favorite model.

Cancer, machine learning and data integration

Machine Learning Methods in the Computational Biology of Cancer is an arXiv preprint of a pretty nice article dealing with analysis methods that can be used for high-dimensional biological (and other) data – although the examples come from cancer research, they could easily be about something else. The paper does a good job of describing penalized regression methods such as the lasso, ridge regression and the elastic net. It also goes into compressed sensing and its applicability to biology, although it cautions that compressed sensing cannot yet be straightforwardly applied to biological data: compressed sensing assumes that one can choose the “measurement matrix” freely, whereas in biology this matrix (usually called the “design matrix” in this context) is already fixed.
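
For anyone who wants to try these penalized regression methods hands-on, here is a minimal sketch using the glmnet R package on simulated data (my example, not code from the paper). The alpha parameter interpolates between ridge regression (0) and the lasso (1), with values in between giving the elastic net.

```r
# Minimal penalized regression sketch with glmnet on simulated p >> n data.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 500), nrow = 100)   # 100 samples, 500 features
y <- x[, 1] - 2 * x[, 2] + rnorm(100)       # outcome driven by only two features

fit <- cv.glmnet(x, y, alpha = 0.5)         # cross-validated elastic net
coef(fit, s = "lambda.min")                 # sparse coefficients; most are exactly zero
```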

The Critical Assessment of Massive Data Analysis (CAMDA) 2014 conference has released its data analysis challenges. Last year’s challenges on toxicogenomics and toxicity prediction will be reprised (perhaps in modified form, I didn’t check), but they have added a new challenge which I find interesting because it focuses on data integration (combining distinct data sets on gene, protein and micro-RNA expression as well as gene structural variation and DNA methylation) and uses published data from the International Cancer Genome Consortium (ICGC). I think it’s a good thing to re-analyze, mash up and meta-analyze data from these large-scale projects, and the CAMDA challenges are interesting because they are so open-ended, in contrast to e.g. Kaggle challenges (which I also like, but in a different way). The goals in the CAMDA challenges are quite open to interpretation (and also ambitious), for instance:

  • Question 1: What are disease causal changes? Can the integration of comprehensive multi-track -omics data give a clear answer?
  • Question 2: Can personalized medicine and rational drug treatment plans be derived from the data? And how can we validate them down the road?

Video time

Here are a few video clips I’ve enjoyed watching over the past week.

From TEDMED 2009, David Agus talks about cancer research and covers quite a lot of territory, from the value of monitoring your habits (he briefly discusses his own Philips DirectLife device) to the need for a molecular rather than tissue-based definition of cancer, and his quest to model cancer as a complex system that involves a lot more than genetics.

The Argument for Better Health, in 3 Minutes & 53 Seconds is an attempt to summarize the most important arguments of Thomas Goetz’ new book The Decision Tree for a broader audience. In other words, it’s about how individuals can take control of their own health by using what Goetz calls a decision tree approach. The video, although good, is kind of entry-level material; if you want to go a bit deeper, you could download podcasts of the introduction and first chapter of the book. Here are three good reviews of the book.

Finally, a video in four parts explaining the benefits of using the R language for statistical analysis. I use R myself practically daily and think it’s great. These videos make it clear that R has now spread far outside of academia and has become an important part of the data analyst’s toolbox.
