Follow the Data

A data driven blog

Archive for the tag “genomics”

What can “big data” (read “Hadoop”) do for genomics?

Prompted by the recent news that Cloudera and Mount Sinai School of Medicine will collaborate to “solve medical challenges using big data” (more specifically, Cloudera’s Jeff Hammerbacher, ex-big data guru at Facebook, will collaborate with the equally trailblazing mathematician/biologist Eric Schadt at Mount Sinai’s Institute for Genomics and Multiscale Biology) and that NextBio will collaborate with Intel to “optimize the Hadoop stack and advance big data technologies in medicine”, I would like to offer some random thoughts on possible use cases.

Note that “big data” essentially means “Hadoop” in the above press releases, and that the “medicine” they mention should be understood as “genomic medicine” or just “genomics”. Since I happen to know a thing or two about genomics, I will limit myself to (parts of) genomics and Hadoop/MapReduce in this post. For a good overview of big data and medicine in a broader sense than I can describe here, check out this rather nice GigaOm article.

Existing Hadoop/MapReduce stuff for NGS

In the world of high-throughput, or next-generation sequencing (NGS), which is rapidly becoming more and more indispensable for genomics, there are a few Hadoop-based frameworks that I am aware of and that should probably be mentioned first. Packages like Cloudburst and Crossbow leverage Hadoop to perform “read mapping” (approximate string matching for taking a DNA sequence from the sequencer and figuring out where in a known genome it came from), Myrna and Eoulsan do the same but also extend the workflow to quantifying gene expression and identifying differentially expressed genes based on the sequences, and Contrail does Hadoop-based de novo assembly (piecing together a new genome from sequences without previous knowledge, like an extremely difficult jigsaw puzzle). These are essentially MapReduce implementations of existing software, which is all good and fine, but I haven’t seen these tools being used much so far. Perhaps one reason is that read mapping is usually not a major bottleneck compared to some other steps, and with recently released software such as SeqAlto and SNAP (thx Tom Dyar) (and another package that I’m sure I read about the other day but can’t seem find right now) promising a further 10x-100x speed increase compared to existing tools, there is just not a pressing need at the moment. Contrail, the de novo assembler,  does offer an opportunity for research groups who don’t have access to a very RAM-rich computers (de novo assembly is notoriously memory hungry, with 512 Gb RAM machines often being strained to the limit on certain data sets) to perform assembly on commodity clusters.

Then there are the projects that attempt to build a Hadoop infrastructure for next-generation sequencing, like Seal, which provides “map-reducification” for a number of common NGS operations, or Hadoop-BAM (a library for processing BAM files, a common sequence alignment format, in Hadoop) and SeqPig (a library with import and export functions to allow common bioinformatics formats to be used in Pig).

What Hadoop could be useful for

I’m sure people smarter than me will come up with many different use cases for Hadoop in genomics and medicine. At this point, however, I would suggest these general themes:

  • Statistical associations between various kinds of data vectors – clinical, environmental, molecular, microbial... This is more or less a batch-processing problem and thus suited to Hadoop. NextBio (the company mentioned in the beginning, who are teaming up with Intel) are doing this as a core part of their business; computing correlations between gene expression levels in different tissues, diseases and conditions and clinical information, drug data etc. However, this concept could (and should) be extended to other things like environmental information, lifestyle factors, genetic variants (SNV, structural variations, copy number variations etc.), epigenetic data (chromatic structure, DNA methylation, histone modifications …), personal microbiomes (the gut microbiota in each patient etc.) Of course, collecting and compiling the data to perform these correlations will be hard; a much harder “big data” problem than computing the actual correlations.  SolveBio is a new company that seems to want to understand cancer by compiling vast quantities of data in such a way. This is how they put it in an interview (titled, ambitiously, “The Cloud Will Cure Cancer“): “Patients can measure every feature, as the technology becomes cheaper: genome sequence, gene expression in every accessible tissue, chromatin state, small molecules and metabolites, indigenous microbes, pathogens, etc. These data pools can be created by anyone who has the consent of the patients: universities, hospitals, or companies. The resulting networks, the “data tornado”, will be huge. This will be a huge amount of data and a huge opportunity to use statistical learning for medicine.” In fact, a third recently announced bigdata/genomics collaboration, between Google and the Institute for Systems Biology (ISB), has already started to explore what this type of tools could look like in their Cancer Regulome Explorer. ISB has used the Google Compute Engine to scale a random forest algorithm to 600,000 cores across Google’s global data centers in order to “explore associations between DNA, RNA, epigenetic, and clinical cancer data.” See this case study for some more details (not many more to be honest.)
  • Metagenomics. This means, according to one definition, “the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.(There is really nothing “meta” about it, it’s just that you are looking at many species at once, which is why it is also called environmental genomics or community genomics in some cases.) For example, Craig Venter’s project to sequence as many living things as possible in the Sargasso sea is metagenomics, as is sequencing samples from the human gut, snot etc. in search of novel bacteria, viruses and fungi (or just characterizing the variety of known ones.) It’s a fascinating field; for an easy introduction, see the TED Talk called “What’s left to explore?” by Nathan Wolfe. Analyzing sequences from metagenomics projects is of course much more difficult than usual, because you are randomly sampling sequences for which you don’t know the source organism but have to infer it in some way. This calls for smart use of proper data structures for indexing and querying, and as much parallelization as possible, very likely in some Hadoopy kind of way. C Titus Brown has written a lot of interesting stuff about the metagenomics data deluge on his blog, Living in an Ivory Basement, where he has explored esoteric and useful things such as probabilistic de Bruijn graphs. Lately, compressive genomics - algorithms that compute directly on compressed genomic data - has become something of a buzz phrase (although similar ideas have been used for quite some time). Some combination of all of these approaches will be needed to combat the inevitable information overload.

Beyond batch processing

In my mind, Hadoop has been associated with batch processing, but today I heard that the newest version of Hadoop not only includes a completely overhauled version of MapReduce called YARN, but it will even allow using other kinds of frameworks, such as streaming real-time analytics frameworks, to operate on the data stored in HDFS. I’ve been thinking about possible applications of stream analytics in next-generation sequencing. Surprisingly, there is already software for streaming quantification of sequences, eXpress - these guys are surely ahead of their time. The immediate use case I can think of is for the USB-stick-sized MinION nanopore sequencer, which reportedly will produce output in a real-time manner (which no sequencers do today as far as I know) so that you can start your analysis while the sequencer is still running. If the vision about “genomic observatories” to “take the planet’s biological pulse” comes true, I’m sure there will be plenty of work to do for the stream analytics clusters of the world …

This has been a rambling post that will probably need a few updates in the coming days – congratulations and thanks if you made it to the end!

New analysis competitions

Some interesting competitions in data analysis / prediction:

Kaggle is managing this year’s KDD Cup, which will be about Weibo, China’s rough equivalent to Twitter (with more support for adding pictures and comments on posts, it’s more like a hybrid between Twitter and Facebook maybe). There will be two tasks, (1) predicting which users a certain user will follow (all data being anonymized, of course), and (2) predicting click-through rate in online computational ad systems. According to Gordon Sun, chief scientist at Tencent (the company behind Weibo), the data set to be used is the largest one ever to have been released for competitive purposes.

CrowdAnalytix, an India-based company with a business idea similar to Kaggle’s, has started a fun quickie competition about sentiment mining. Actually the competition might already be over as it ran for just 9 days starting 16/2. The input consists of comments left by visitors to a major airport in India, and the goal is to identify and compile actionable and/or interesting information, such as what kind of services visitors think are missing.

The Clarity challenge is, for me, easily the most interesting challenge of the three, in that it concerns the use of genomic information in healthcare. This challenge (with a prize sum of $25,000) is, in effect, crowdsourcing genomic/medical research (although only 20 teams will get selected to participate). The goal is to identify and report on potential genetic features underlying medical disorders in three children, given the genome sequences of the children and their parents. These genetics features are presently unknown, which is why this competition really represents something new in medical research. I think this is a very nice initiative, in fact I had thought of initiating something similar at my own institute where I work, but this challenge is much better than what I had in mind. It will be very interesting to see what comes out of it.

23andme seeks genetic markers for healthy aging

I recently blogged about the Harvard Study of Adult Development, which tries to identify factors predictive of happy aging. Well, just yesterday I read an interesting blog post at Genetic Future describing how 23andme is now looking to identify (genetic) factors for healthy aging.

Apparently, 23andme offered free genetic scans to participants in the Palo Alto Senior Games, a big sporting event for people of age 50 and up. A spokesman for the company told the Palo Alto Online that they want to use the genetic scans to try to find genetic factors underlying healthy aging. The participants in the Palo Alto Senior Games are, almost by definition, healthy seniors, and 23andme are looking to recruit (or already succeeded in recruiting – it wasn’t completely clear to me from the blog post) 4500 individuals, which is a pretty sizable cohort.

This is not the first large-scale recruitment drive from 23andme targeting a specific group – they previously announced a Parkinson’s project with 3000 participants. It’ll be interesting to see what comes out of these projects.

Sequencing data storm

Today, I attended a talk given by Wang Jun, a humorous and t-shirt-clad whiz kid who set up the bioinformatics arm of Beijing Genomics Institute (BGI) as a 23-year-old PhD student, became a professor at 27, and is now the director of BGI’s facility in Shenzhen, near Hong Kong. Although I work with bioinformatics at a genome institute myself, this presentation really drove home how much storage, computing power and know-how is really required for biology now and in the near future.

BGI does staggering amounts of genome sequencing – “If it tastes good, sequence it! If it is useful, sequence it!” as Wang Jun joked – from indigenous Chinese plants to rice, pandas and humans. They have a very interesting individual genome project where they basically apply many different techniques on samples from the same person and compare the results against known references. One of many interesting results from this project was the finding that human genomes not only vary in single “DNA letter” variants (so called SNPs, single nucleotide polymorphisms) or the number of times certain stretches of DNA are repeated (“copy number variations”) – it now turns out there are DNA snippets that, in largely binary fashion, some people have and some don’t.

Although the existing projects demand a lot of resources and manpower – the BGI has 250 bioinformaticians (!) which is still too few; according to Wang they want to quickly increase this number to 500 – this is nothing compared to what will happen after the next wave of sequencing technologies, when we will start to sequence single cells from different (or the same) tissues in an individual. Already, the data sets generated are so vast that they cannot be distributed over the internet. Wang  recounted how he had to bring ten terabyte drives to Europe by himself in order to share his data with researchers at EBI (European Bioinformatics Institute). Now, they are trying out cloud computing as a way to avoid moving the data around.

Wang attributed a lot of BGI’s success to young, hardworking programmers and scientists – many of them university dropouts – who don’t have any preconceptions about science and therefore are prepared to try anything. “These are teenagers that publish in Nature,” said Jun, apparently feeling that he was (at 33) already over the hill. “They don’t run on a 24-hour cycle like us, they run on 36h-cycles and bring sleeping bags to the lab.”

All in all, good fun.

Post Navigation

Follow

Get every new post delivered to your Inbox.

Join 46 other followers