OK, so let’s see what applications of parallel computing frameworks we can currently find in “high-throughput biology”, represented here by genomics and proteomics. I’ll focus mostly on Hadoop, since I find it interesting to look at how much traction it is getting in the life sciences. It seems pretty clear that it’s not used as much in this space as in various other areas such as retail, advertising and gaming. This could be because
- the map-reduce framework lends itself more easily to transaction data (who bought what on Amazon, who clicked on a certain link etc.) or other kinds of data like tweets that can be represented in a single line of text
- biological data sets are simply not that big yet (the data volumes of a Netflix or a Facebook dwarf those of even the most powerful sequencing centers)
- computational biologists usually work on supercomputing clusters that are provided by universities or research institutes (and that are administered by someone else), or on a single server – but not on large clusters of cheap machines which they can administer themselves
- there are too few programmers in biology who can (or have time to) work with these systems
Any other suggestions for reasons behind this discrepancy?
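To make the first point above concrete: the map-reduce paradigm is at its best when each input record (a transaction, a click, a tweet) can be processed independently in the map phase and then aggregated by key in the reduce phase. A minimal single-machine sketch of the idea, using ad-click counting as a stand-in for the kind of line-oriented data Hadoop handles well (the data and function names are mine, purely for illustration):

```python
from collections import defaultdict

def map_phase(lines):
    # Each input record (one line of text) is processed independently
    # and emits (key, value) pairs -- here, (word, 1).
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # The framework groups pairs by key; the reducer sums the values.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

clicks = ["user1 clicked adA", "user2 clicked adB", "user1 clicked adA"]
print(reduce_phase(map_phase(clicks)))
```

Much biological data (a BAM file of alignments, say) does not decompose quite so naturally into independent one-line records, which may be part of the adoption gap.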
The parallel computing framework that has been used the most in bioinformatics is probably MPI (Message Passing Interface), for example in mpiBLAST and, more recently, in the Ray DNA sequence assembler, which performs very well on the kind of MPI-enabled clusters often found in academia.
Before we move on to Hadoop, I just wanted to mention that a newer cluster computing framework, Spark, has recently been used by Adam Roberts et al. for “streaming fragment assignment” in the cloud. Depending on your background, you can understand this either from a mathematical point of view, as performing expectation maximization efficiently in a distributed way, or from a biological point of view, as assigning sequence reads to their likely transcript of origin in a fast, probabilistic manner.
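The expectation-maximization idea behind fragment assignment can be sketched in a few lines. This is a toy single-machine version of my own devising, not the actual Spark implementation: each read is compatible with a set of transcripts; the E-step splits each read fractionally among its compatible transcripts in proportion to the current abundance estimates, and the M-step re-estimates abundances from those fractional counts.

```python
def em_fragment_assignment(compat, n_iter=50):
    """Toy EM sketch. compat[r] is the set of transcripts read r could
    map to. Returns estimated transcript abundances (proportions)."""
    transcripts = sorted({t for ts in compat for t in ts})
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        for ts in compat:
            # E-step: split the read among compatible transcripts
            # in proportion to current abundance estimates.
            z = sum(theta[t] for t in ts)
            for t in ts:
                counts[t] += theta[t] / z
        # M-step: new abundances are the normalized expected counts.
        total = sum(counts.values())
        theta = {t: c / total for t, c in counts.items()}
    return theta

# Two reads map uniquely to "tx1"; one is ambiguous between tx1/tx2,
# so EM pushes its mass toward the better-supported transcript.
reads = [{"tx1"}, {"tx1"}, {"tx1", "tx2"}]
abund = em_fragment_assignment(reads)
```

The streaming/distributed version is what makes this practical at scale: the E-step over reads is embarrassingly parallel, which is presumably why it maps well onto a framework like Spark.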
Now to the Hadoop applications in bioinformatics. For this part, I have relied partly on a review article, Survey of MapReduce frame operation in bioinformatics.
Hydra – A Hadoop-based search engine for matching spectra from shotgun mass spectrometry against increasingly large sequence databases (see paper)
Chorus – I’m not sure whether this actually uses Hadoop, but it is said to use the map-reduce paradigm on Amazon EC2. It is intended as a cloud-enabled storage area for all of the world’s mass spectrometry data.
SeqPig – a library to enable the usage of Hadoop Pig features for analyzing high-throughput sequencing data sets. Builds on Hadoop-BAM, a useful Java library for dealing with various high-throughput sequencing formats such as BAM, FASTQ, and BCF in Hadoop.
Seal – A suite of tools for DNA sequence alignment and related tasks; the only Hadoop tool for genomics that I know to be actively used in production.
Cloudburst – Sequence alignment. An early demonstration of what is possible, but not used much now (from what I can tell)
Crossbow – Resequencing analysis (sequence alignment + SNP calling)
Eoulsan – RNA sequencing analysis pipeline interfacing mostly existing tools
Myrna – RNA sequencing analysis pipeline with newly written code for some steps
Fx – RNA sequencing analysis pipeline. Uses Hadoop and the excellent RNA-seq aligner GSNAP, so it is probably a good solution
SeqWare – Includes LIMS-type functionality, a workflow engine, a query engine etc. for handling high-throughput sequencing data.
Contrail – De novo sequence assembly based on Hadoop. A very interesting approach (see e.g. this presentation), but when I tried it, it was rather poorly documented and I was unable to get satisfactory results. Lately I have been wondering whether new graph-based parallel computing frameworks like GraphLab could be adapted for de novo assembly (which is essentially a graph traversal problem).
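To illustrate why assembly is essentially a graph traversal problem: de Bruijn graph assemblers (the approach Contrail uses) break reads into k-mers and connect each k-mer’s prefix to its suffix; a contig is then a walk through this graph. A deliberately naive sketch (my own toy code, not Contrail’s algorithm), on reads chosen so that the graph is a single simple path:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    # Nodes are (k-1)-mers; each k-mer in a read adds an edge
    # from its length-(k-1) prefix to its length-(k-1) suffix.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    # Naive traversal: follow unused edges from `start`, appending one
    # base per step. Real assemblers must handle branches, sequencing
    # errors and repeats; this only works on a simple path.
    edges = {node: list(nbrs) for node, nbrs in graph.items()}
    contig, node = start, start
    while edges.get(node):
        node = edges[node].pop()
        contig += node[-1]
    return contig

reads = ["ACGTT", "GTTGC", "TGCA"]  # overlapping reads from "ACGTTGCA"
g = de_bruijn(reads, k=4)
contig = walk(g, "ACG")  # start node chosen by hand for this toy case
```

The hard part at scale is that the graph for a large genome has billions of edges and heavy branching, which is exactly where a distributed graph framework like GraphLab might plausibly help.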
There are also other projects listed in the review article I referenced above, but I have skipped those for various reasons (e.g. extremely niche applications, solving problems that are too small to need Hadoop, etc.)
I’d be happy to receive feedback on this little survey with things I’ve missed, reasons for why the adoption is not higher, suggestions for tools that should be developed, and so on.