Follow the Data

A data driven blog

Tutorial: Exploring TCGA breast cancer proteomics data

Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).

The Cancer Genome Atlas (TCGA) has become a focal point for a lot of genomics and bioinformatics research. DNA and RNA level data on different tumor types are now used in countless papers to test computational methods and to learn more about hallmarks of different types of cancer.

The quantitative proteomics data hosted by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) probably get less use, though. Their Data Portal offers mass spectrometry-based protein expression measurements for many different tumor types.

As I have been comparing some (currently in-house, to be published eventually) cancer proteomics data sets against TCGA proteomics data, I thought I would share some code, tricks and tips for those readers who want to start analyzing TCGA data (whether proteomics, transcriptomics or other kinds) but don’t quite know where to start.

To this end, I have put a tutorial Jupyter notebook on GitHub: TCGA protein tutorial

The tutorial is written in R, mainly because I like the TCGA2STAT and Boruta packages (although I just learned that there is a Boruta implementation in Python as well). If you think it would be useful to have a similar tutorial in Python, I will consider writing one.

The tutorial consists, roughly, of these steps:

  • Getting a usable set of breast cancer proteomics data
    This consists of downloading the data, selecting the subset that we want to focus on, removing features with undefined values, etc.
  • Doing feature selection to find proteins predictive of breast cancer subtype.
    Here, the Boruta feature selection package is used to identify a compact set of proteins that can predict the so-called PAM50 subtype of each tumor sample. (The PAM50 subtype is based on mRNA expression levels.)
  • Comparing RNA-seq data and proteomics data on the same samples.
    Here, we use the TCGA2STAT package to obtain TCGA RNA-seq data and find the set of common gene names and common samples between our protein and RNA-seq data in order to look at protein-mRNA correlations.
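
To make the first and third steps a little more concrete, here is a minimal Python sketch. It is not the code from the notebook (which is in R and uses TCGA2STAT); the nested-dictionary layout, the toy values and the function names are all assumptions made for illustration. It drops proteins with undefined values, then computes a per-gene Pearson correlation over the genes and samples shared by the protein and RNA-seq tables.

```python
from math import sqrt


def drop_incomplete_features(matrix):
    """Keep only features (genes) that have a value for every sample."""
    return {
        gene: values
        for gene, values in matrix.items()
        if all(v is not None for v in values.values())
    }


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant vector has no defined correlation
    return cov / sqrt(var_x * var_y)


def protein_mrna_correlations(protein, mrna, min_samples=3):
    """Per-gene correlation over the genes and samples present in both tables."""
    result = {}
    for gene in sorted(set(protein) & set(mrna)):
        shared = sorted(
            s for s in set(protein[gene]) & set(mrna[gene])
            if protein[gene][s] is not None and mrna[gene][s] is not None
        )
        if len(shared) < min_samples:
            continue  # too few paired measurements to say anything
        r = pearson([protein[gene][s] for s in shared],
                    [mrna[gene][s] for s in shared])
        if r is not None:
            result[gene] = r
    return result


# Toy data (made up): gene -> sample -> value, None marks an undefined value.
protein = {
    "ESR1": {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S4": 4.0},
    "GATA3": {"S1": 0.5, "S2": None, "S3": 1.5, "S4": 2.0},
    "KRT5": {"S1": 3.0, "S2": 2.0, "S3": 1.0, "S4": 0.0},
}
mrna = {
    "ESR1": {"S1": 10.0, "S2": 20.0, "S3": 30.0, "S5": 99.0},
    "KRT5": {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S4": 4.0},
    "MKI67": {"S1": 5.0, "S2": 6.0, "S3": 7.0, "S4": 8.0},
}

complete = drop_incomplete_features(protein)  # GATA3 is removed
cors = protein_mrna_correlations(complete, mrna)
for gene, r in cors.items():
    print(gene, round(r, 3))
```

With real data you would more likely hold the tables in data frames and might prefer Spearman correlation, but the logic is the same: intersect gene names, intersect samples, then correlate gene by gene.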

Please visit the notebook if you are interested!

Some of the take-aways from the tutorial may be:

  • A bit of messing about with metadata, sample names etc. is usually necessary to get the data in the proper format, especially if you are combining different kinds of data (such as RNA-seq and proteomics here). You have probably heard the saying that 80% of data science is data preparation!
  • There are now quantitative proteomics data available for many types of TCGA tumor samples.
  • TCGA2STAT is a nice package for importing certain kinds of TCGA data into an R session.
  • Boruta is an interesting alternative for feature selection in a classification context.
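
The sample-name wrangling in the first bullet often comes down to reducing every column name to a common patient-level key. The sketch below (Python, for illustration only) relies on the fact that a standard TCGA barcode starts with `TCGA-<site>-<participant>`. The proteomics-style name format `A2-A0CM.07TCGA`, the example column names and the helper names `patient_id` and `shared_patients` are assumptions, so check them against the actual column names in your files.

```python
import re

# Matches "<site>-<participant>" with an optional leading "TCGA-".
_BARCODE = re.compile(r"^(?:TCGA-)?([A-Z0-9]{2})-([A-Z0-9]{4})(?:-|$)")


def patient_id(sample_name):
    """Reduce a sample name to a patient-level key such as 'A2-A0CM'.

    Dots are treated like hyphens, since some tools replace '-' with '.'
    in column names. Returns None if the name does not look like a barcode.
    """
    name = sample_name.strip().upper().replace(".", "-")
    match = _BARCODE.match(name)
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}"


def shared_patients(rna_cols, prot_cols):
    """Return (patient, rna_column, protein_column) for patients in both sets.

    If a patient has several columns in one table, the last one wins here;
    real data may need a more careful rule (e.g. averaging replicates).
    """
    rna = {patient_id(c): c for c in rna_cols}
    prot = {patient_id(c): c for c in prot_cols}
    common = sorted((set(rna) & set(prot)) - {None})
    return [(p, rna[p], prot[p]) for p in common]


# Hypothetical column names, for illustration only.
rna_cols = [
    "TCGA-A2-A0CM-01A-31R-A034-07",
    "TCGA-BH-A18Q-01A-12R-A12D-07",
    "TCGA.C8.A12T.01A",
]
prot_cols = ["A2-A0CM.07TCGA", "C8-A12T.01TCGA", "AO-A12D.01TCGA"]

pairs = shared_patients(rna_cols, prot_cols)
for patient, rna_col, prot_col in pairs:
    print(patient, rna_col, prot_col)
```

Matching on the patient-level prefix is a deliberate simplification: it ignores the sample-type and vial parts of the barcode, which matters if a patient has both tumor and normal samples.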

This post was prepared with permission from CPTAC.

P.S. I may add some more material on a couple of ways to do multivariate data integration on TCGA data sets later, or make that a separate blog post. Tell me if you are interested.


4 thoughts on “Tutorial: Exploring TCGA breast cancer proteomics data”

  1. Thank you for the great post! I’m interested in data mining techniques (feature selection, classification, association testing) from large datasets (GWAS), and the examples appearing often refer back to microarray datasets… This tutorial nicely shows up that…. Thank-you!

  2. Mikael Huss said:

    You’re welcome! I’m happy if people find these notes useful.

  3. Cool analysis! I’m surprised to see there’s such bad correlation between protein and mrna-seq data. Is that well known? Any thoughts on what biological mechanism is at play or if this is related to errors in measurement?

    • Mikael Huss said:

      It is well known that correlations between mRNA and protein levels can be pretty bad, but there is a lot of debate about what the main reasons for that could be. For sure, measurement noise in the mass spec (and also RNA-seq, but I feel mass spec is still a bit noisier) is likely a factor, but there are also biological aspects such as turnover time, protein localization etc. Some groups have tried to address the question by sequencing ribosome-bound RNAs (which are presumably being translated, and thus more directly connected to protein concentrations.) I am currently looking at a set of breast cancer tumors (different from the one in this blog post, but I can’t use those data here before we have submitted to a journal) and there the correlations are clearly better than here, although far from perfect (but for the “famous” PAM50 genes/proteins, the correlations are quite good, ~0.8 on average).
