Follow the Data

A data driven blog

Archive for the tag “feature selection”

Tutorial: Exploring TCGA breast cancer proteomics data

Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).

The Cancer Genome Atlas (TCGA) has become a focal point for a lot of genomics and bioinformatics research. DNA and RNA level data on different tumor types are now used in countless papers to test computational methods and to learn more about hallmarks of different types of cancer.

Perhaps, though, there aren’t as many people who are using the quantitative proteomics data hosted by Clinical Proteomic Tumor Analysis Consortium (CPTAC). There are mass spectrometry based expression measurements for many different types of tumor available at their Data Portal.

As I have been comparing some (currently in-house, to be published eventually) cancer proteomics data sets against TCGA proteomics data, I thought I would share some code, tricks and tips for those readers who want to start analyzing TCGA data (whether proteomics, transcriptomics or other kinds) but don’t quite know where to start.

To this end, I have put a tutorial Jupyter notebook at Github: TCGA protein tutorial

The tutorial is written in R, mainly because I like the TCGA2STAT and Boruta packages (but I just learned there is a Boruta implementation in Python as well.) If you think it would be useful to have a similar tutorial in Python, I will consider writing one.

The tutorial consists, roughly, of these steps:

  • Getting a usable set of breast cancer proteomics data
    This consists of downloading the data, selecting the subset that we want to focus on, removing features with undefined values, etc..
  • Doing feature selection to find proteins predictive of breast cancer subtype.
    Here, the Boruta feature selection package is used to identify a compact set of proteins that can predict the so-called PAM50 subtype of each tumor sample. (The PAM50 subtype is based on mRNA expression levels.)
  • Comparing RNA-seq data and proteomics data on the same samples.
    Here, we use the TCGA2STAT package to obtain TCGA RNA-seq data and find the set of common gene names and common samples between our protein and mRNA-seq data in order to look at protein-mRNA correlations.

Please visit the notebook if you are interested!

Some of the take-aways from the tutorial may be:

  • A bit of messing about with metadata, sample names etc. is usually necessary to get the data in the proper format, especially if you are combining different kinds of data (such as RNA-seq and proteomics here). I guess you’ve heard them say that 80% of data science is data preparation!…
  • There are now quantitative proteomics data available for many types of TCGA tumor samples.
  • TCGA2STAT is a nice package for importing certain kinds of TCGA data into an R session.
  • Boruta is an interesting alternative for feature selection in a classification context.

This post was prepared with permission from CPTAC.

P.S. I may add some more material on a couple of ways to do multivariate data integration on TCGA data sets later, or make that a separate blog post. Tell me if you are interested.

Advertisements

Post Navigation