Cloud analytics for big DNA data
It’s already become a cliché to point out that the cost of DNA sequencing is decreasing at a rate much faster than Moore’s law, so that the rate of acquisition of genomic data is rapidly overpowering our ability to process and analyze it. One might think that MapReduce, Hadoop and cloud computing in general would help us here, and indeed there are some types of DNA sequence analysis that can be attacked in this way (see e.g. Crossbow and Myrna). But there are other kinds of analysis, like de novo DNA sequence assembly, that just don’t lend themselves easily to parallel processing. (As an aside, this presentation by C. Titus Brown introduces some neat tricks for “handling ridiculous amounts of data with probabilistic data structures” with applications to sequence assembly. The same man has also written a good but scary blog post, “The sky is falling! The sky is falling!”, about how hard it will be, and already is, to deal with the massive amounts of sequence data that are coming out.)
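To give a flavor of what a “probabilistic data structure” can look like, here is a minimal sketch of a Bloom filter used to remember which k-mers (length-k substrings of a read) have been seen. That a Bloom filter is representative of the tricks in the presentation is my assumption; the class, the `kmers` helper, the parameter values and the example sequence are all made up for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


# Hypothetical read; real data would stream in from sequence files.
seen = BloomFilter()
for kmer in kmers("ACGTACGTGA", 4):
    seen.add(kmer)
```

The point of the design is that memory use is fixed up front (128 KiB here) no matter how many k-mers are added, at the price of an occasional false “yes” when testing membership with `"ACGT" in seen`.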
But, to the point! There has been a slow but steady buildup of various kinds of cloud-based analytics platforms for sequencing-derived data sets.
Perhaps the company with the highest profile among these is DNANexus, which was founded by researchers from Stanford University. It offers a cloud-based storage and analysis solution for sequencing centers and individual researchers. DNANexus is built on top of Amazon Web Services (AWS) and offers ready-made workflows for use cases such as identifying genetic variations. The company recently teamed up with Complete Genomics, a company that sequences large numbers of whole human genomes, so that Complete Genomics customers can view and analyze their sequences in DNANexus.
SeqCentral seems to be a close competitor to DNANexus, but I have heard surprisingly little about this company. It may be that their offering is a bit different from that of DNANexus. This TechCrunch piece says that ‘SeqCentral will allow scientists to compare their data to others to see if their sequencing is new or if it is “known.” The startup will bring in public data from universities, research organizations, and companies and allow you compare your sequencing to this existing data.’ So the value proposition may be more about data integration than about just analyzing your own samples. The TechCrunch piece also mentions a subscription fee of $99/year for researchers, which would perhaps make sense as a payment for convenient access to public data through the cloud (it would be way too cheap as a fee for analyzing even a single experiment). The SeqCentral blog has a somewhat interesting post on the surprising efficiency of Python (and the surprisingly poor performance of awk) for sequence file parsing on Amazon EC2.
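For context on what “sequence file parsing” involves, here is a minimal sketch of a FASTQ reader in plain Python. It is not the code from the SeqCentral post; it assumes the simple four-lines-per-record FASTQ layout with no blank lines, and the function name and the two example reads are invented.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:].rstrip("\n"), seq, qual


# Two made-up reads standing in for a real file opened with open(path).
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nHHHH\n")
records = list(parse_fastq(example))
total_bases = sum(len(seq) for _, seq, _ in records)
```

Because the parser is a generator, it only ever holds one record in memory, which matters when the input is many gigabytes.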
The Indian company Geschickten (bit of a weird name, but OK) also builds its GenomicsCloud solution on AWS and (I’m just guessing here) probably has a wider but less turnkey offering than DNANexus. From the looks of it, you may need to know a bit more about what you are doing with Geschickten’s solution (which is not necessarily bad).
GenomeQuest has its own data center/private cloud for processing huge sequencing data sets.
CloudNumbers is not exclusively focused on sequence data but rather aims to be “The platform for computationally intensive calculations in the cloud”. They do mention biotech as the first example field that they will focus on and have a range of DNA sequence analysis software packages integrated in the platform.
The platforms described above mostly deal with the “easier” problems like mapping (matching sequences to the genomes from which they presumably originate), which are still pretty difficult to solve well, but the Bio Cloud Computing Center at BGI (the sequencing behemoth previously known as Beijing Genomics Institute) also offers a “scalable assembly solution” called Hecate, of which Bio-IT World wrote:
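To make “mapping” concrete, here is a toy sketch of the seed-and-verify idea: index every k-mer of the reference, look up the first k-mer of the read, and check each candidate position by counting mismatches. Real mappers are far more sophisticated; the function names, the tiny genome and the parameters are all invented for illustration.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each k-mer to the list of positions where it occurs in genome."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k, max_mismatches=1):
    """Return positions where read fits with at most max_mismatches.

    Candidates are seeded by an exact match of the read's first k-mer.
    """
    hits = []
    for start in index.get(read[:k], []):
        window = genome[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the genome
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


# Made-up reference sequence.
genome = "TTGACCGTAAGCTTGACCA"
index = build_index(genome, 4)
```

Even in this toy version the hard part shows up: a read such as `"GACCA"` fits in more than one place once mismatches are allowed, and something has to decide which placement to believe.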
Hecate is a MapReduce program designed to run on a cluster, representing what Tianjian Chen calls “the first ring of the genome process tool chain. It solves the scalability of assembly. We can now assemble any genome on commodity machines. If you don’t get bigger machines, you just need more time. If you have enough funding, we can accelerate assembly about 10-20X.”
It would be really interesting to know where the MapReduce-ification is applied and how this piece of software works. Does anybody know? It’s a bit unclear to me whether the BGI Cloud will be a free service or if you have to pay for it. At any rate, it offers a wide variety of sequence analysis tools, many of which were developed at BGI in the first place, which hopefully means they will be well integrated into the cloud solution.
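For readers unfamiliar with the pattern, here is a minimal, single-process sketch of MapReduce-style k-mer counting, a step that commonly appears in assembly pipelines. This is emphatically not a description of how Hecate works, which I don’t know; the function names and the two example reads are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k):
    """Emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one k-mer."""
    return key, sum(values)


# Made-up reads; on a cluster each read could go to a different mapper.
reads = ["ACGTAC", "CGTACG"]
mapped = chain.from_iterable(map_phase(r, 3) for r in reads)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Counting parallelizes nicely because each read can be processed independently. The steps that follow in an assembler, building and traversing a graph that connects the k-mers, are where independence breaks down.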
Finally, there is the Galaxy cloud, which I highly recommend to fellow bioinformaticians. It’s a free and extensible platform (Python-based, running on top of Amazon EC2) for doing all kinds of sequence analysis. It requires some IT savvy to get up and running and some basic bioinformatics understanding to use it properly, but it’s really not that hard. In fact Galaxy is so good that it probably threatens the business of most of the companies mentioned above. Maybe if one of them can come up with a really efficient way to upload huge sequence files quickly into the cloud, they would have a shot …
P.S. Another intriguing company which doesn’t really have anything to do with cloud computing is BlueSeq, a “sequencing service exchange” that matches up sequencing projects with relevant sequencing service providers. A matchmaker for DNA sequencing!