Follow the Data

A data driven blog

Archive for the tag “api”

Cumulative biology and meta-analysis of gene expression data

In talks I have given over the past few years, I have often made the point that most of genomics has not been “big data” in the usual sense: although the raw data files can be large, they are typically processed in a more or less predictable way until they become “small” (e.g., tables of gene expression measurements or genetic variants for a modest number of samples). This in turn reflects the fact that biological samples are hard and expensive to obtain, so in a typical genomics project the sample size is small (from just a few to tens, or in rare cases hundreds or thousands) while the dimensionality is large (e.g., 20,000 genes, 10,000 proteins or a million SNPs). This is in contrast to many “canonical big data” scenarios, where one has a large number of examples (like product purchases) with low dimensionality (perhaps the price, category and a few other properties of the product).

Because of this, I have been hopeful about using published data, for example gene expression from RNA sequencing or metagenomic sequencing, to draw conclusions from many studies at once. In the former case (gene expression/RNA-seq), the aim could be to build classifiers that predict tissue or cell type from a given gene expression profile. In the latter case (metagenomics/metatranscriptomics, maybe even metaproteomics), it could also be to build classifiers, but also to discover completely new varieties of, say, bacteria or viruses from the “biological dark matter” that makes up a large fraction of currently generated metagenomics data. These kinds of analyses are usually called meta-analysis, but I am fond of the term cumulative biology, which I came across in a paper by Samuel Kaski and colleagues (Toward Computational Cumulative Biology by Combining Models of Biological Datasets).

Of course, there is nothing new about meta-analysis or cumulative biology – many “cumulative” studies have been published about microarray data – but nevertheless, I think some kind of threshold has been crossed when it comes to really making use of the data deposited in public repositories. There has been progress in APIs that allow access to public data, in data structures designed specifically to deal with large sequence data, and in automating analysis pipelines.

Below are some interesting papers and packages that are all related in some way to analyzing public gene expression data. I annotate each resource with a couple of tags.

Sequence Bloom Trees. [data structures] These data structures (described in the paper Fast search of thousands of short-read sequencing experiments) allow indexing of a very large number of sequences into a data structure that can be rapidly queried with your own data. I first tried it about a year ago and found it to be useful to check for the presence of short snippets of interest (RNA sequences corresponding to expressed peptides of a certain type) in published transcriptomes. The authors have made available a database of 2,652 RNA-seq experiments from human brain, breast and blood which served as a very useful reference point.
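To give a flavor of the underlying idea (this is emphatically not how SBTs are implemented; the real data structure arranges thousands of Bloom filters into a tree so that whole branches can be skipped during a query), here is a toy Python sketch of a single Bloom filter built from the k-mers of one made-up “experiment” and queried with a short sequence:

```python
# Toy illustration of the idea behind Sequence Bloom Trees: a Bloom filter
# over the k-mers of one sequencing experiment, queried with a sequence of
# interest. The example reads and query sequences are made up.
import hashlib

K = 21            # k-mer length
M = 1 << 20       # number of bits in the filter
NUM_HASHES = 3    # number of hash functions

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def bit_positions(kmer):
    # Derive NUM_HASHES positions in the bit array from each k-mer.
    for i in range(NUM_HASHES):
        h = hashlib.sha1(f"{i}:{kmer}".encode()).hexdigest()
        yield int(h, 16) % M

def build_filter(reads):
    bits = bytearray(M // 8)
    for read in reads:
        for kmer in kmers(read):
            for pos in bit_positions(kmer):
                bits[pos // 8] |= 1 << (pos % 8)
    return bits

def contains(bits, kmer):
    return all(bits[p // 8] & (1 << (p % 8)) for p in bit_positions(kmer))

def query(bits, sequence, theta=0.8):
    """Report a hit if at least a fraction theta of the query k-mers are present."""
    qk = list(kmers(sequence))
    hits = sum(contains(bits, k) for k in qk)
    return hits / len(qk) >= theta

experiment_reads = ["ACGT" * 20, "TTGCA" * 16]   # stand-in for one RNA-seq run
bf = build_filter(experiment_reads)
print(query(bf, "ACGT" * 10))   # True: the snippet's k-mers are in the filter
print(query(bf, "GGGC" * 10))   # False (with very high probability)
```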

The Lair. [pipelines, automation, reprocessing] Lior Pachter and the rest of the gang behind the popular RNA-seq analysis tools Kallisto and Sleuth have taken their concept further with Lair, a platform for interactive re-analysis of published RNA-seq datasets. They use a Snakemake-based analysis pipeline to process and analyze experiments in a consistent way – see the example analyses listed here. Anyone can request a similar re-analysis of a published data set by providing a config file, design matrix and other details as described here.

Toil. [pipelines, automation, reprocessing] The abstract of this paper, which was recently submitted to bioRxiv, states: “Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores.” The authors used their workflow software to quantify expression in four studies: The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), Pacific Pediatric Neuro-Oncology Consortium (PNOC), and the Genotype Tissue Expression Project (GTEx).

EBI’s RNA-seq-API. [API, discovery, reprocessing, compendium] The RESTful RNA-seq Analysis API provided by the EBI currently contains raw, FPKM and TPM gene and exon counts for a staggering 265,000 public sequencing runs in 264 different species, as well as ftp locations of CRAM, bigWig and bedGraph files. See the documentation here.
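As a minimal sketch of what using the API from Python might look like (the endpoint path and JSON field names below follow my reading of the documentation and should be treated as assumptions; the study accession is just an example):

```python
# Sketch of querying the EBI RNA-seq analysis API for analysed runs in a study.
# The endpoint path (getRunsByStudy with a minimum mapping quality of 70) and
# the response field names are assumptions based on the documentation; check
# the docs for the exact URLs and fields before relying on this.
import requests

BASE = "https://www.ebi.ac.uk/fg/rnaseq/api"
study = "SRP033494"  # hypothetical example study accession

resp = requests.get(f"{BASE}/json/70/getRunsByStudy/{study}", timeout=60)
resp.raise_for_status()

for run in resp.json():
    # RUN_IDS, ORGANISM and CRAM_LOCATION are assumed field names.
    print(run.get("RUN_IDS"), run.get("ORGANISM"), run.get("CRAM_LOCATION"))
```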

Digital Expression Explorer. [reprocessing, compendium] This resource contains hundreds of thousands of uniformly processed RNA-seq data sets (e.g., >73,000 human data sets and >97,000 mouse ones). The data sets were processed into gene-level counts, which led to some Twitter debate between the transcript-level quantification hardliners and the gene-count-tolerant communities, if I may label the respective camps in that way. These data sets can be downloaded in bulk.
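Once you have downloaded one of the bulk count matrices, working with the gene-level counts is straightforward; here is a tiny pandas sketch (the file name is made up, and I assume genes as rows and runs as columns) that converts raw counts to counts per million:

```python
# Hypothetical sketch: normalise a downloaded gene-level count matrix
# (genes as rows, sequencing runs as columns; file name is a placeholder)
# to counts per million for rough cross-sample comparison.
import pandas as pd

counts = pd.read_csv("dee_bulk_counts.tsv", sep="\t", index_col=0)
cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
print(cpm.head())
```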

CompendiumDb. [API, discovery] This is an R package that facilitates the programmatic retrieval of functional genomics data (i.e., often gene expression data) from the Gene Expression Omnibus (GEO), one of the main repositories for this kind of data.

Omics Discovery Index (OmicsDI). [discovery] This is described as a “Knowledge Discovery framework across heterogeneous data (genomics, proteomics and metabolomics)” and is mentioned here both because a lot of it is gene expression data and because it seems like a good resource for finding data across different experimental types for the same conditions.

MetaRNASeq. [discovery] A browser-based query system for finding RNA-seq experiments that fulfill certain search criteria. Seems useful when looking for data sets from a certain disease state, for example.

Tradict. [applications of meta-analysis] In this study, the authors analyzed 23,000 RNA-seq experiments to find out whether gene expression profiles could be reconstructed from a small subset of just 100 marker genes (out of perhaps 20,000 available genes). The authors claim that it works well, and the manuscript contains some really interesting graphs showing, for example, how most of the variation in gene expression is driven by developmental stage and tissue.
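Tradict itself fits a considerably more sophisticated probabilistic model, but the core idea, predicting the rest of the transcriptome from a small marker panel, can be sketched with plain linear regression on simulated data:

```python
# Toy sketch of the marker-gene idea (not Tradict's actual model): fit a linear
# map from 100 "marker" genes to the remaining genes on training profiles, then
# reconstruct held-out profiles from their markers alone. All data are simulated.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_markers, n_factors = 500, 2000, 100, 10

# Simulate expression driven by a few latent factors (tissue, stage, ...).
factors = rng.normal(size=(n_samples, n_factors))
loadings = rng.normal(size=(n_factors, n_genes))
expr = factors @ loadings + 0.1 * rng.normal(size=(n_samples, n_genes))

markers = np.arange(n_markers)          # pretend the first 100 genes are markers
others = np.arange(n_markers, n_genes)
train, test = slice(0, 400), slice(400, None)

# Least-squares map from marker expression to the rest of the transcriptome.
coef, *_ = np.linalg.lstsq(expr[train][:, markers], expr[train][:, others], rcond=None)
pred = expr[test][:, markers] @ coef

corr = np.corrcoef(pred.ravel(), expr[test][:, others].ravel())[0, 1]
print(f"correlation between reconstructed and true expression: {corr:.2f}")
```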

In case you think that these types of meta-analyses are only doable with large computing clusters with lots of processing power and storage, you’ll be happy to find out that it is easy to analyze RNA-seq experiments in a streaming fashion, without having to download FASTQ or even BAM files to disk (Valentine Svensson wrote a nice blog post about this). And with tools such as Kallisto, it does not really take that long to quantify the expression levels in a sample.

Finally, I’ll acknowledge that the discovery-oriented tools above (APIs, metadata search etc) still work on the basis of knowing what kind of data set you are looking for. But another interesting way of searching for expression data would be querying by content, that is, showing a search system the data you have at hand and asking it to provide the data sets most similar to it. This is discussed in the cumulative biology paper mentioned at the start of this blog post: “Instead of searching for datasets that have been described similarly, which may not correspond to a statistical similarity in the datasets themselves, we would like to conduct that search in a data-driven way, using as the query the dataset itself or a statistical (rather than a semantic) description of it.” In a similar vein, Titus Brown has discussed using MinHash signatures for identifying similar samples and finding collaborators.
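To make the MinHash idea concrete, here is a minimal sketch (loosely inspired by, but not the implementation of, tools like sourmash) that compares two made-up sequences by the estimated Jaccard similarity of their k-mer sets:

```python
# Minimal MinHash sketch of content-based sample comparison: keep the smallest
# k-mer hashes of each sequence as its "signature" and estimate the Jaccard
# similarity from the signatures alone. Sequences are simulated.
import hashlib
import random

def kmer_hashes(seq, k=21):
    for i in range(len(seq) - k + 1):
        yield int(hashlib.sha1(seq[i:i + k].encode()).hexdigest(), 16)

def signature(seq, num_hashes=200):
    """Keep the num_hashes smallest distinct k-mer hashes as the signature."""
    return set(sorted(set(kmer_hashes(seq)))[:num_hashes])

def estimated_jaccard(sig_a, sig_b, num_hashes=200):
    # Bottom-k estimate: take the smallest hashes of the union and count how
    # many of them occur in both signatures.
    union_bottom = set(sorted(sig_a | sig_b)[:num_hashes])
    return len(union_bottom & sig_a & sig_b) / len(union_bottom)

random.seed(0)
def random_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

shared = random_seq(5000)                 # content common to both "samples"
a = shared + random_seq(2000)
b = shared + random_seq(2000)
print(estimated_jaccard(signature(a), signature(b)))  # roughly 0.5-0.6
```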

Watson hackathon in Uppsala

Today I spent most of the day trying to grok IBM Watson’s APIs during a hackathon (Hackup) in Uppsala, where the aim was to develop useful apps using those APIs. Watson is, of course, famous for being good at Jeopardy and for being at the center of IBM’s push into healthcare analytics, but I hadn’t spent much time before this hackathon checking out exactly what is currently available to users in terms of APIs. It turned out to be a fun learning experience and I think a good time was had by all.

We used IBM’s Bluemix platform to develop apps. As the available Watson APIs (also including the Alchemy APIs that are now part of Bluemix) are mostly focused on natural language analysis (rather than generic classification and statistical modeling), our team – consisting of me and two other bioinformaticians from Scilifelab – decided to try to build a service for transcribing podcasts (using the Watson Speech To Text API) in order to annotate and tag them using the Alchemy APIs for keyword extraction, entity extraction etc. This, we envisioned, would allow podcast buffs to identify in which episode of their favorite show a certain topic was discussed, for instance. Eventually, after ingesting a large number of podcast episodes, the tagging/annotation might also enable things like podcast recommendations and classification, as podcasts could be compared to each other based on themes and keywords. This type of “thematic mapping” could also be interesting for following a single podcast’s thematic development.
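For the curious, the skeleton of the pipeline looked roughly like the sketch below. The endpoint URLs, authentication style and response fields are written down from memory of the documentation at the time and should be treated as assumptions; the credentials and file names are placeholders.

```python
# Rough sketch of the transcribe-then-tag pipeline: post a podcast audio file
# to the Watson Speech to Text REST endpoint, then pass the transcript to the
# Alchemy keyword-extraction call. URLs, fields and auth are assumptions based
# on the documentation at the time; credentials are placeholders.
import requests

STT_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
ALCHEMY_URL = "http://gateway-a.watsonplatform.net/calls/text/TextGetRankedKeywords"

def transcribe(audio_path, username, password):
    with open(audio_path, "rb") as f:
        resp = requests.post(
            STT_URL,
            auth=(username, password),
            headers={"Content-Type": "audio/flac"},
            data=f,
        )
    resp.raise_for_status()
    chunks = resp.json().get("results", [])
    return " ".join(c["alternatives"][0]["transcript"] for c in chunks)

def extract_keywords(text, api_key):
    resp = requests.post(
        ALCHEMY_URL,
        data={"apikey": api_key, "text": text, "outputMode": "json"},
    )
    resp.raise_for_status()
    return [kw["text"] for kw in resp.json().get("keywords", [])]

# transcript = transcribe("episode42.flac", "bluemix-user", "bluemix-pass")
# print(extract_keywords(transcript, "alchemy-api-key"))
```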

As is often the case, we spent a fair amount of time on some supposedly mundane details. Since the speech-to-text conversion was relatively slow, we tried different strategies for splitting the audio files and processing them in parallel, but could not quite make it work. In the end, we did have a (Python-based) solution that was able to transcribe and tag podcast episodes, but it is still missing a front-end interface and a back-end database to hold information about multiple podcast episodes.

There were many other teams who developed cool apps. For instance, one team made a little app for voice control of a light switch using a Raspberry Pi, and another team had devised an “AI shopper” that reminds you to buy stuff that you have forgotten to put on your shopping list. One entry was a kind of recommendation system for what education you should pursue, based on comparing a user-submitted text against a model trained on papers by people in different careers, and another was an app for quantifying the average positive/negative/neutral sentiment found in tweets from different accounts (e.g. NASA had very positive tweets on average whereas BBC News was fairly negative).

All in all, a nice experience, and it was good to take a break from the Stockholm scene and see what’s going on in my old home town. Good job by Jason Dainter and the other organizers!

Genomics Today and Tomorrow presentation

Below is a Slideshare link/widget to a presentation I gave at the Genomics Today and Tomorrow event in Uppsala a couple of weeks ago (March 19, 2015).

I spoke after Jonathan Bingham of Google Genomics and talked a little bit about how APIs, machine learning, and what I call “querying by dataset” could make life easier for bioinformaticians working on data integration. In particular, I gave examples of a few types of queries that one would like to be able to do against “all public data” (slides 19-24).

Not long after, I saw this preprint (called “Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees”) that seems to provide part of the functionality that I was envisioning – in particular, the ability to query public sequence repositories by content (using a sequence as a query), rather than by annotation (metadata). The beginning of the abstract goes like this:

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments.

Google Prediction API open to all

I’ve been eagerly waiting to use the Google Prediction API ever since it was announced, and now (since sometime in May) it’s open for everyone who has a Google account (and a credit card). Previously, you had to be able to provide a U.S. mailing address.

Google’s Prediction API is basically a nice way to run your classification and/or prediction tasks through Google’s black-box set of machine learning tools. The way it works is that you upload your training data to Google Storage, which is something like Google’s version of Amazon’s S3: a cloud-based storage system where you store your data in “buckets”. (Google Storage, like S3, uses the term bucket and, also like S3, requires that bucket names only use lower-case letters.) You can activate both Google Storage and the Prediction API from the Google APIs Console. This is also where you will find (click “API access” on the left hand menu) the access key that you will need to run prediction tasks. You’ll have to give credit card details to pay for potential future usage.

The training examples that you put in Storage need to be formatted according to the specification in the Developer’s Guide. Once they have been uploaded, you can train a model on the uploaded data, make predictions about new examples, update existing models and more, using one of the client libraries or, even simpler, by copying some of the bash scripts shown on the same page (hidden behind ‘+’ signs, which can be expanded). For these bash scripts to work as written on that page, you need to paste your API key into a file called ‘googlekey’ in the directory from which you run the script.
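To make the input format concrete, here is a small sketch of writing training data the way the Developer’s Guide describes it: a headerless CSV with the target label in the first column and the features after it (the gene-expression-style numbers and the bucket name are made up):

```python
# Sketch of preparing Prediction API training data: a headerless CSV where the
# first column is the target label and the remaining columns are features
# (text fields double-quoted). The example values and bucket name are made up.
import csv

examples = [
    ("ALL", 0.12, 3.4, 7.8),   # label followed by expression-level features
    ("AML", 1.05, 0.2, 4.1),
]

with open("training_data.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    for label, *features in examples:
        writer.writerow([label, *features])

# The file would then be uploaded to a Google Storage bucket, e.g. with
# gsutil: gsutil cp training_data.csv gs://my-bucket/training_data.csv
# (the bucket name is a placeholder).
```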

I used this walkthrough example about cancer classification from gene expression data to get up to speed on how Google Prediction API works. Now I’m thinking about what data to throw at it next. Perhaps it would be fun to input some Kaggle contest data sets into it as a kind of “Google baseline” predictor? 🙂

Data sources on the web

So where are all these huge data sets that I (and others) have been talking about? Well, some of them are freely available for download. For example, the extensive Reality Mining data set from MIT (which I have blogged about) is available as a MySQL database for anyone to play around with.

There are several repositories for data sets. Infochimps has hundreds or probably thousands of data sets from a wide variety of sources. Some of the data is directly downloadable from the site, while other data sets are just pointed to. Datamob is a similar, though smaller, resource. Amazon’s Public Data Sets are meant to be used seamlessly from within Amazon’s cloud computing services, like the Elastic Compute Cloud (EC2). Here, we find massive datasets such as the collection of all publicly available DNA sequences from GenBank.

Peter Skomoroch has a del.icio.us tag for datasets, which is probably the most extensive reference for big downloadable data out there (and which makes this blog post rather superfluous …) Due to the magic of del.icio.us, this list is of course dynamic and continuously growing.

Finally, programmableweb is perhaps not strictly about data per se, but provides links to known APIs for access to web-based resources through your own programs.
