In talks that I have given in the past few years, I have often made the point that most of genomics has not been “big data” in the usual sense, because although the raw data files can often be large, they are often processed in a more or less predictable way until they are “small” (e.g., tables of gene expression measurements or genetic variants in a small number of samples). This in turn depends on the fact that it is hard and expensive to obtain biological samples, so in a typical genomics project the sample size is small (from just a few to tens or in rare cases hundreds or thousands) while the dimensionality is large (e.g. 20,000 genes, 10,000 proteins or a million SNPs). This is in contrast to many “canonical big data” scenarios where one has a large number of examples (like product purchases) with a small dimensionality (maybe the price, category and some other properties of the product.)
Because of these issues, I have been hopeful about using published data on e.g. gene expression based on RNA sequencing or on metagenomics to draw conclusions based on data from many studies. In the former case (gene expression/RNA-seq) it could be to build classifiers for predicting tissue or cell type for a given gene expression profile. In the latter case (metagenomics/metatranscriptomics, maybe even metaproteomics) it could also be to build classifiers but also to discover completely new varieties of e.g. bacteria or viruses from the “biological dark matter” that makes up a large fraction of currently generated metagenomics data. These kinds of analysis are usually called meta-analysis, but I am fond of the term cumulative biology, which I came across in a paper by Samuel Kaski and colleagues (Toward Computational Cumulative Biology by Combining Models of Biological Datasets.)
Of course, there is nothing new about meta-analysis or cumulative biology – many “cumulative” studies have been published about microarray data – but nevertheless, I think that some kind of threshold has been crossed when it comes to really making use of the data deposited in public repositories. There has been development both in APIs allowing access to public data, in data structures that have been designed to deal specifically with large sequence data, and in automating analysis pipelines.
Below are some interesting papers and packages that are all in some way related to analyzing public gene expression data in different ways. I annotate each resource with a couple of tags.
Sequence Bloom Trees. [data structures] These data structures (described in the paper Fast search of thousands of short-read sequencing experiments) allow indexing of a very large number of sequences into a data structure that can be rapidly queried with your own data. I first tried it about a year ago and found it to be useful to check for the presence of short snippets of interest (RNA sequences corresponding to expressed peptides of a certain type) in published transcriptomes. The authors have made available a database of 2,652 RNA-seq experiments from human brain, breast and blood which served as a very useful reference point.
The Lair. [pipelines, automation, reprocessing] Lior Pachter and the rest of the gang behind popular RNA-seq analysis tools Kallisto and Sleuth have taken their concept further with Lair, a platform for interactive re-analysis of published RNA-seq datasets. They use a Snakemake based analysis pipeline to process and analyze experiments in a consistent way – see the example analyses listed here. Anyone can request a similar re-analysis of a published data set by providing a config file, design matrix and other details as described here.
Toil. [pipelines, automation, reprocessing] The abstract of this paper, which was recently submitted to bioRxiv, states: Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores. The authors used their workflow software to quantify expression in four studies: The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), Pacific Pediatric Neuro-Oncology Consortium (PNOC), and the Genotype Tissue Expression Project (GTEx).
EBI’s RNA-seq-API. [API, discovery, reprocessing, compendium] The RESTful RNA-seq Analysis API provided by the EBI currently contains raw, FPKM and TPM gene and exon counts for a staggering 265,000 public sequencing runs in 264 different species, as well as ftp locations of CRAM, bigWig and bedGraph files. See the documentation here.
Digital Expression Explorer. [reprocessing, compendium] This resource contains hundreds of thousands of uniformly processed RNA-seq data sets (e.g., >73,000 human data sets and >97,000 mouse ones). The data sets were processed into gene-level counts, which led to some Twitter debate between the transcript-level quantification hardliners and the gene-count-tolerant communities, if I may label the respective camps in that way. These data sets can be downloaded in bulk.
CompendiumDb. [API, discovery] This is an R package that facilitates the programmatic retrieval of functional genomics data (i.e., often gene expression data) from the Gene Expression Omnibus (GEO), one of the main repositories for this kind of data.
Omics Discovery Index (OmicsDI). [discovery] This is described as a “Knowledge Discovery framework across heterogeneous data (genomics, proteomics and metabolomics)” and is mentioned here both because a lot of it is gene expression data and because it seems like a good resource for finding data across different experimental types for the same conditions.
MetaRNASeq. [discovery] A browser-based query system for finding RNA-seq experiments that fulfill certain search criteria. Seems useful when looking for data sets from a certain disease state, for example.
Tradict. [applications of meta-analysis] In this study, the authors analyzed 23,000 RNA-seq experiments to find out whether gene expression profiles could be reconstructed from a small subset of just 100 marker genes (out of perhaps 20,000 available genes). The author claims that it works well and the manuscript contains some really interesting graphs showing, for example, how most of the variation in gene expression is driven by developmental stage and tissue.
In case you think that these types of meta-analysis are only doable with large computing clusters with lots of processing power and storage, you’ll be happy to find out that it is easy to analyze RNA-seq experiments in a streaming fashion, without having to download FASTQ or even BAM files to disk (Valentine Svensson wrote a nice blog post about this), and with tools such as Kallisto, it does not really take that long to quantify the expression levels in a sample.
Finally, I’ll acknowledge that the discovery-oriented tools above (APIs, metadata search etc) still work on the basis of knowing what kind of data set you are looking for. But another interesting way of searching for expression data would be querying by content, that is, showing a search system the data you have at hand and asking it to provide the data sets most similar to it. This is discussed in the cumulative biology paper mentioned at the start of this blog post: “Instead of searching for datasets that have been described similarly, which may not correspond to a statistical similarity in the datasets themselves, we would like to conduct that search in a data-driven way, using as the query the dataset itself or a statistical (rather than a semantic) description of it.” In a similar vein, Titus Brown has discussed using MinHash signatures for identifying similar samples and finding collaborators.