What can “big data” (read “Hadoop”) do for genomics?
Prompted by the recent news that Cloudera and Mount Sinai School of Medicine will collaborate to “solve medical challenges using big data” (more specifically, Cloudera’s Jeff Hammerbacher, ex-big data guru at Facebook, will collaborate with the equally trailblazing mathematician/biologist Eric Schadt at Mount Sinai’s Institute for Genomics and Multiscale Biology) and that NextBio will collaborate with Intel to “optimize the Hadoop stack and advance big data technologies in medicine”, I would like to offer some random thoughts on possible use cases.
Note that “big data” essentially means “Hadoop” in the above press releases, and that the “medicine” they mention should be understood as “genomic medicine” or just “genomics”. Since I happen to know a thing or two about genomics, I will limit myself to (parts of) genomics and Hadoop/MapReduce in this post. For a good overview of big data and medicine in a broader sense than I can describe here, check out this rather nice GigaOm article.
Existing Hadoop/MapReduce stuff for NGS
In the world of high-throughput, or next-generation sequencing (NGS), which is rapidly becoming more and more indispensable for genomics, there are a few Hadoop-based frameworks that I am aware of and that should probably be mentioned first. Packages like Cloudburst and Crossbow leverage Hadoop to perform “read mapping” (approximate string matching for taking a DNA sequence from the sequencer and figuring out where in a known genome it came from), Myrna and Eoulsan do the same but also extend the workflow to quantifying gene expression and identifying differentially expressed genes based on the sequences, and Contrail does Hadoop-based de novo assembly (piecing together a new genome from sequences without previous knowledge, like an extremely difficult jigsaw puzzle). These are essentially MapReduce implementations of existing software, which is all good and fine, but I haven’t seen these tools being used much so far. Perhaps one reason is that read mapping is usually not a major bottleneck compared to some other steps, and with recently released software such as SeqAlto and SNAP (thx Tom Dyar)
(and another package that I’m sure I read about the other day but can’t seem find right now) promising a further 10x-100x speed increase compared to existing tools, there is just not a pressing need at the moment. Contrail, the de novo assembler, does offer an opportunity for research groups who don’t have access to a very RAM-rich computers (de novo assembly is notoriously memory hungry, with 512 Gb RAM machines often being strained to the limit on certain data sets) to perform assembly on commodity clusters.
Then there are the projects that attempt to build a Hadoop infrastructure for next-generation sequencing, like Seal, which provides “map-reducification” for a number of common NGS operations, or Hadoop-BAM (a library for processing BAM files, a common sequence alignment format, in Hadoop) and SeqPig (a library with import and export functions to allow common bioinformatics formats to be used in Pig).
What Hadoop could be useful for
I’m sure people smarter than me will come up with many different use cases for Hadoop in genomics and medicine. At this point, however, I would suggest these general themes:
- Statistical associations between various kinds of data vectors – clinical, environmental, molecular, microbial... This is more or less a batch-processing problem and thus suited to Hadoop. NextBio (the company mentioned in the beginning, who are teaming up with Intel) are doing this as a core part of their business; computing correlations between gene expression levels in different tissues, diseases and conditions and clinical information, drug data etc. However, this concept could (and should) be extended to other things like environmental information, lifestyle factors, genetic variants (SNV, structural variations, copy number variations etc.), epigenetic data (chromatic structure, DNA methylation, histone modifications …), personal microbiomes (the gut microbiota in each patient etc.) Of course, collecting and compiling the data to perform these correlations will be hard; a much harder “big data” problem than computing the actual correlations. SolveBio is a new company that seems to want to understand cancer by compiling vast quantities of data in such a way. This is how they put it in an interview (titled, ambitiously, “The Cloud Will Cure Cancer“): “Patients can measure every feature, as the technology becomes cheaper: genome sequence, gene expression in every accessible tissue, chromatin state, small molecules and metabolites, indigenous microbes, pathogens, etc. These data pools can be created by anyone who has the consent of the patients: universities, hospitals, or companies. The resulting networks, the “data tornado”, will be huge. This will be a huge amount of data and a huge opportunity to use statistical learning for medicine.” In fact, a third recently announced bigdata/genomics collaboration, between Google and the Institute for Systems Biology (ISB), has already started to explore what this type of tools could look like in their Cancer Regulome Explorer. ISB has used the Google Compute Engine to scale a random forest algorithm to 600,000 cores across Google’s global data centers in order to “explore associations between DNA, RNA, epigenetic, and clinical cancer data.” See this case study for some more details (not many more to be honest.)
- Metagenomics. This means, according to one definition, “the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.” (There is really nothing “meta” about it, it’s just that you are looking at many species at once, which is why it is also called environmental genomics or community genomics in some cases.) For example, Craig Venter’s project to sequence as many living things as possible in the Sargasso sea is metagenomics, as is sequencing samples from the human gut, snot etc. in search of novel bacteria, viruses and fungi (or just characterizing the variety of known ones.) It’s a fascinating field; for an easy introduction, see the TED Talk called “What’s left to explore?” by Nathan Wolfe. Analyzing sequences from metagenomics projects is of course much more difficult than usual, because you are randomly sampling sequences for which you don’t know the source organism but have to infer it in some way. This calls for smart use of proper data structures for indexing and querying, and as much parallelization as possible, very likely in some Hadoopy kind of way. C Titus Brown has written a lot of interesting stuff about the metagenomics data deluge on his blog, Living in an Ivory Basement, where he has explored esoteric and useful things such as probabilistic de Bruijn graphs. Lately, compressive genomics - algorithms that compute directly on compressed genomic data - has become something of a buzz phrase (although similar ideas have been used for quite some time). Some combination of all of these approaches will be needed to combat the inevitable information overload.
Beyond batch processing
In my mind, Hadoop has been associated with batch processing, but today I heard that the newest version of Hadoop not only includes a completely overhauled version of MapReduce called YARN, but it will even allow using other kinds of frameworks, such as streaming real-time analytics frameworks, to operate on the data stored in HDFS. I’ve been thinking about possible applications of stream analytics in next-generation sequencing. Surprisingly, there is already software for streaming quantification of sequences, eXpress - these guys are surely ahead of their time. The immediate use case I can think of is for the USB-stick-sized MinION nanopore sequencer, which reportedly will produce output in a real-time manner (which no sequencers do today as far as I know) so that you can start your analysis while the sequencer is still running. If the vision about “genomic observatories” to “take the planet’s biological pulse” comes true, I’m sure there will be plenty of work to do for the stream analytics clusters of the world …
This has been a rambling post that will probably need a few updates in the coming days – congratulations and thanks if you made it to the end!