Follow the Data

A data driven blog

Archive for the tag “sequencing”

Analytics challenges in genomics

Continuing on the theme of data analysis and genomics, here is a presentation I gave for the Data Mining course at Uppsala University in October this year. It talks a little bit about massively parallel DNA sequencing, then goes on to mention grand visions such as sequencing millions of genomes, discovering new species by metagenomics, “genomic observatories” etc., then goes into the practical difficulties and finally suggests some strategies like prediction contests. Enjoy!


What can “big data” (read “Hadoop”) do for genomics?

Prompted by the recent news that Cloudera and Mount Sinai School of Medicine will collaborate to “solve medical challenges using big data” (more specifically, Cloudera’s Jeff Hammerbacher, ex-big data guru at Facebook, will collaborate with the equally trailblazing mathematician/biologist Eric Schadt at Mount Sinai’s Institute for Genomics and Multiscale Biology) and that NextBio will collaborate with Intel to “optimize the Hadoop stack and advance big data technologies in medicine”, I would like to offer some random thoughts on possible use cases.

Note that “big data” essentially means “Hadoop” in the above press releases, and that the “medicine” they mention should be understood as “genomic medicine” or just “genomics”. Since I happen to know a thing or two about genomics, I will limit myself to (parts of) genomics and Hadoop/MapReduce in this post. For a good overview of big data and medicine in a broader sense than I can describe here, check out this rather nice GigaOm article.

Existing Hadoop/MapReduce stuff for NGS

In the world of high-throughput, or next-generation, sequencing (NGS), which is rapidly becoming more and more indispensable for genomics, there are a few Hadoop-based frameworks that I am aware of and that should probably be mentioned first. Packages like Cloudburst and Crossbow leverage Hadoop to perform “read mapping” (approximate string matching for taking a DNA sequence from the sequencer and figuring out where in a known genome it came from), Myrna and Eoulsan do the same but also extend the workflow to quantifying gene expression and identifying differentially expressed genes based on the sequences, and Contrail does Hadoop-based de novo assembly (piecing together a new genome from sequences without previous knowledge, like an extremely difficult jigsaw puzzle). These are essentially MapReduce implementations of existing software, which is all good and fine, but I haven’t seen these tools being used much so far. Perhaps one reason is that read mapping is usually not a major bottleneck compared to some other steps, and with recently released software such as SeqAlto and SNAP (thx Tom Dyar) (and another package that I’m sure I read about the other day but can’t seem to find right now) promising a further 10x-100x speed increase compared to existing tools, there is just not a pressing need at the moment. Contrail, the de novo assembler, does offer an opportunity for research groups who don’t have access to very RAM-rich computers (de novo assembly is notoriously memory hungry, with 512 GB RAM machines often being strained to the limit on certain data sets) to perform assembly on commodity clusters.
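To make “read mapping” a bit more concrete, here is a toy sketch in Python of the seed-and-extend idea, arranged in a map/reduce shape. It is not how Cloudburst or Crossbow are implemented; the function names, the tiny genome and the reads are all made up for illustration, and a real mapper would also handle indels, reverse complements and base qualities.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, genome, index, k, max_mismatches=1):
    """'Map' step: return (start, mismatches) of the best placement, or None."""
    best = None
    # Try every k-mer of the read as a seed, so that a mismatch inside one
    # seed can be rescued by another seed.
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            mm = hamming(read, genome[start:start + len(read)])
            if mm > max_mismatches:
                continue
            # Prefer fewer mismatches, then the leftmost position.
            if best is None or (mm, start) < (best[1], best[0]):
                best = (start, mm)
    return best


def count_by_position(mapped):
    """'Reduce' step: count how many reads start at each position."""
    counts = defaultdict(int)
    for start, _ in mapped:
        counts[start] += 1
    return dict(counts)


# Hypothetical toy data; the second read carries one sequencing error.
genome = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"
reads = ["TGCAAGGC", "TGCATGGC", "GGATCCTA", "TTTTTTTT"]

index = build_kmer_index(genome, k=4)
mapped = [m for m in (map_read(r, genome, index, k=4) for r in reads) if m is not None]
print(mapped)
print(count_by_position(mapped))
```

Each read is mapped independently of the others, which is why this step parallelizes so easily; only the counting step needs results to be grouped by key.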

Then there are the projects that attempt to build a Hadoop infrastructure for next-generation sequencing, like Seal, which provides “map-reducification” for a number of common NGS operations, or Hadoop-BAM (a library for processing BAM files, a common sequence alignment format, in Hadoop) and SeqPig (a library with import and export functions to allow common bioinformatics formats to be used in Pig).

What Hadoop could be useful for

I’m sure people smarter than me will come up with many different use cases for Hadoop in genomics and medicine. At this point, however, I would suggest these general themes:

  • Statistical associations between various kinds of data vectors – clinical, environmental, molecular, microbial... This is more or less a batch-processing problem and thus suited to Hadoop. NextBio (the company mentioned in the beginning, who are teaming up with Intel) are doing this as a core part of their business; computing correlations between gene expression levels in different tissues, diseases and conditions and clinical information, drug data etc. However, this concept could (and should) be extended to other things like environmental information, lifestyle factors, genetic variants (SNV, structural variations, copy number variations etc.), epigenetic data (chromatin structure, DNA methylation, histone modifications …), personal microbiomes (the gut microbiota in each patient etc.). Of course, collecting and compiling the data to perform these correlations will be hard; a much harder “big data” problem than computing the actual correlations. SolveBio is a new company that seems to want to understand cancer by compiling vast quantities of data in such a way. This is how they put it in an interview (titled, ambitiously, “The Cloud Will Cure Cancer”): “Patients can measure every feature, as the technology becomes cheaper: genome sequence, gene expression in every accessible tissue, chromatin state, small molecules and metabolites, indigenous microbes, pathogens, etc. These data pools can be created by anyone who has the consent of the patients: universities, hospitals, or companies. The resulting networks, the “data tornado”, will be huge. This will be a huge amount of data and a huge opportunity to use statistical learning for medicine.” In fact, a third recently announced bigdata/genomics collaboration, between Google and the Institute for Systems Biology (ISB), has already started to explore what this type of tool could look like in their Cancer Regulome Explorer. ISB has used the Google Compute Engine to scale a random forest algorithm to 600,000 cores across Google’s global data centers in order to “explore associations between DNA, RNA, epigenetic, and clinical cancer data.” See this case study for some more details (not many more to be honest.)
  • Metagenomics. This means, according to one definition, “the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.” (There is really nothing “meta” about it, it’s just that you are looking at many species at once, which is why it is also called environmental genomics or community genomics in some cases.) For example, Craig Venter’s project to sequence as many living things as possible in the Sargasso Sea is metagenomics, as is sequencing samples from the human gut, snot etc. in search of novel bacteria, viruses and fungi (or just characterizing the variety of known ones.) It’s a fascinating field; for an easy introduction, see the TED Talk called “What’s left to explore?” by Nathan Wolfe. Analyzing sequences from metagenomics projects is of course much more difficult than usual, because you are randomly sampling sequences for which you don’t know the source organism but have to infer it in some way. This calls for smart use of proper data structures for indexing and querying, and as much parallelization as possible, very likely in some Hadoopy kind of way. C Titus Brown has written a lot of interesting stuff about the metagenomics data deluge on his blog, Living in an Ivory Basement, where he has explored esoteric and useful things such as probabilistic de Bruijn graphs. Lately, compressive genomics – algorithms that compute directly on compressed genomic data – has become something of a buzz phrase (although similar ideas have been used for quite some time). Some combination of all of these approaches will be needed to combat the inevitable information overload.
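As a rough illustration of the “proper data structures” point in the metagenomics item, here is a minimal Python sketch of the idea behind a probabilistic de Bruijn graph: store k-mers in a Bloom filter and treat the graph edges as implicit, found by querying the four possible successor k-mers. The class name, the parameters and the reads are my own inventions for the example, and this is a simplification rather than the implementation described on Titus Brown's blog.

```python
import hashlib


class KmerBloomFilter:
    """Fixed-size bit array; membership tests may give false positives
    but never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive several hash positions by salting SHA-256 with a counter.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(kmer))


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bloom, kmer):
    """Implicit graph edges: successor k-mers that the filter reports as present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]


# Hypothetical reads from an unknown community.
sample_reads = ["ACGTACGGTC", "GGTCATTACG"]
bloom = KmerBloomFilter(num_bits=4096, num_hashes=3)
for read in sample_reads:
    for km in kmers(read, 5):
        bloom.add(km)

print(successors(bloom, "ACGTA"))
```

The memory use is fixed by `num_bits` no matter how many k-mers are added; the price is that some queries will wrongly report a k-mer as present, so any graph traversal built on top has to tolerate a few spurious edges.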

Beyond batch processing

In my mind, Hadoop has been associated with batch processing, but today I heard that the newest version of Hadoop not only includes a completely overhauled version of MapReduce called YARN, but it will even allow using other kinds of frameworks, such as streaming real-time analytics frameworks, to operate on the data stored in HDFS. I’ve been thinking about possible applications of stream analytics in next-generation sequencing. Surprisingly, there is already software for streaming quantification of sequences, eXpress – these guys are surely ahead of their time. The immediate use case I can think of is for the USB-stick-sized MinION nanopore sequencer, which reportedly will produce output in a real-time manner (which no sequencers do today as far as I know) so that you can start your analysis while the sequencer is still running. If the vision about “genomic observatories” to “take the planet’s biological pulse” comes true, I’m sure there will be plenty of work to do for the stream analytics clusters of the world …
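To show what “start your analysis while the sequencer is still running” could look like at its very simplest, here is a Python sketch that consumes a stream of read assignments and reports running relative abundances at regular intervals. It assumes each read has already been assigned to a single target, which glosses over the hard part; eXpress uses a considerably more sophisticated online algorithm, and the function name and data below are purely hypothetical.

```python
from collections import Counter


def stream_abundances(assignments, report_every):
    """Yield (reads_seen, relative abundances) after every `report_every` reads."""
    counts = Counter()
    seen = 0
    for target in assignments:
        counts[target] += 1
        seen += 1
        if seen % report_every == 0:
            # Snapshot of the estimate so far; later reads will refine it.
            yield seen, {t: c / seen for t, c in counts.items()}


# Hypothetical stream: one target name per read, in arrival order.
hypothetical_stream = iter(
    ["geneA", "geneB", "geneA", "geneA", "geneC", "geneA", "geneB", "geneA"]
)
snapshots = list(stream_abundances(hypothetical_stream, report_every=4))
for seen, estimate in snapshots:
    print(seen, estimate)
```

Because the function is a generator, nothing has to be held in memory beyond the counts, and a consumer can act on each snapshot as soon as it is produced.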

This has been a rambling post that will probably need a few updates in the coming days – congratulations and thanks if you made it to the end!

New “big DNA data” cloud and genome interpretation companies

It seems like cloud computing platforms for what I call “big DNA data” (mostly data derived from high-throughput-sequencing experiments) have really started to take off now. About a year ago, I blogged about companies based on these ideas, but in the past month or so I feel like I have been reading about new companies in this space every now and then. A related category of companies that has emerged is what I call “genome interpretation companies”; companies that want to help you to make sense of (e.g.) big sequence data sets to arrive at some more or less actionable medical information. The cloud infrastructure and genome interpretation parts of big DNA data analysis can’t be cleanly separated and many companies offer some combination of both, which makes sense – if you have already built up the infrastructure, you might as well provide some tools.

DNAnexus was already mentioned in the blog post from over a year ago, and it’s the company in this space that has received the most ecstatic press coverage. It has built up an impressive set of services compared to last year, but the most interesting thing for me at this point is that they have promised to launch “DNAnexus X”, a “community-inspired collaborative and scalable data technology platform”, in the near future.

Among new players, I have looked briefly at the following (I try to classify them roughly into “infrastructure-oriented” or “interpretation-oriented” although some are a mix of the two; I also mention two other companies that don’t fit into either category):


GeneStack promises a “Genomic Operating System”, which will be launched sometime during this year and is described as follows:

Access well-curated genomics and transcriptomics public data from major repositories worldwide. Store and share securely NGS data sets with your colleagues. Run high-performance computations on public and proprietary data in the cloud. Develop and sell genomics apps.

Appistry seems to be a general high-performance analytics company, although with one of its specializations in life science (meaning in this case “high performance sequencing”.) They seem to offer mainly infrastructure, including analytics pipelines, which I think probably don’t extend into what I am calling “interpretation” in this blog post.

Seven Bridges Genomics offers a cloud platform with open-source tools for genomics for hospitals, smaller labs and other organizations that don’t have their own computing infrastructure. They are also the first company I’ve seen that employs a zombie.

Bina Technologies offers an interesting “hybrid” approach to cloud genome analytics. Realizing that many customers are deterred by the long upload times to, for instance, Amazon EC2, they have something called the “Bina Box” that processes the raw sequence data locally, after which the pre-processed and compressed (and thus much smaller) version of the data is uploaded to the “Bina Cloud.”


Personalis’ tagline is “Founded by global leaders in human genome interpretation” and indeed, their team of founders would be very hard to beat. About a month ago, some details on the DNA variant detection engine used by the company were published. (The engine, called HugeSeq, is also freely available in an academic version which is not supposed to be quite as cutting-edge as the one used by Personalis.)

SolveBio is still in private beta but a recent rather visionary article by founder Kaganovich titled “The Cloud will Cure Cancer” talks about the birth of “Big Bio” and calculating correlations in the cloud for getting a handle on the molecular profiles of tumors, predicting drug targets and designing treatment regimens.

SVBio, or Silicon Valley Biosystems, is being very secretive so far but is said to offer “interpretive software for the human genome.”


DNA Guide (visualization) – not to be confused with Swedish 23andme clone DNA-Guide – has a technical solution for visualizing and navigating personal DNA data on the web safely while adhering to privacy regulations. (see Slideshare show)

Metaome (concept/knowledge search?) has developed DistilBio, a semantic search and data integration platform with a dynamic interface for navigating among biological concepts. It’s a bit hard to explain but kind of cool if you are into life science research. There is a demo on the site.

Final words

Since so much of this blog post has been about cloud computing and personal genomics, I should mention that Amazon has recently put up sequence data from the 1000 Genomes project in their cloud. There are instructions and a tutorial here for those that would like to play around with the data.

Also, on the topic of computing infrastructure for genomics, Chris Dagdigian’s slides from Bio-IT World Expo 2012 are pretty interesting. Among other things, he is suggesting that uploading genomic data into the cloud is now becoming feasible (using Aspera software).

A good week for (big) data (science)

Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.

Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, which is a nice Hadoop toolkit for sequencing data which enables running several different types of tasks in distributed fashion. Other things we looked at were the CloudBioLinux project, map/reduce sequence assembly using Contrail and CSC’s biological high-throughput data analysis platform Chipster.

On Friday, blog co-author Joel and I went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.

Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun and can serve as a checklist for myself in the future.

Stay tuned for the first episode of Follow the Data podcast!


I was discussing the importance of data visualization with a co-worker a couple of weeks ago. We agreed that some sort of dynamic, intuitive interfaces for looking at and interacting with huge data sets in general, and sequencing-based data sets in particular, would be extremely useful. As the Dataspora blog puts it in a recent post, “The ultimate end-point for most data analysis is a human decision-maker, whose highest bandwidth channel is his or her eyeballs.” (the post is worth reading in its entirety)

Apparently Illumina (one of the biggest vendors of high-throughput sequencers) agrees; they’ve announced a competition where the aim is to provide useful visualizations of a number of genomic datasets derived from a breast cancer cell line. The competition closes on March 15, 2011.

Here’s a nice paper, A Tour through the Visualization Zoo, which provides a whirlwind tour of different kinds of graphs. The figures are actually interactive, so you can mess around with them if you are reading the article online.

The Infosthetics blog highlights Patients Like Me as the most successful marriage of online social media and data visualization.

TR personalized medicine briefing

MIT’s Technology Review magazine has a briefing on personalized medicine. It’s worth a look, although it’s quite heavily tilted towards DNA sequencing technology (which I am interested in, but there is a lot more to personalized medicine). Not surprisingly, one of the articles in the briefing makes the point that the biggest bottleneck in personalized medicine will be data analysis, the risk being that “…we will end up with a collection of data … unable to predict anything.” (As an aside, I would be moderately wealthy if I had a euro for each time I’d read the phrase “drowning in data”, which appears in the article heading. I think I even rejected that as a name for this blog. It would be nice to see someone come up with a fresh alternative verb to “drowning” …)

Technology Review also has a piece on how IBM has started to put their mathematicians to work in business analytics. They mention a neat technique I hadn’t been aware of: “…they used a technique called high-quantile modeling–which tries to predict, say, the 90th percentile of a distribution rather than the mean–to estimate potential spending by each customer and calculate how much of that demand IBM could fulfill”.
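One common way to predict a quantile rather than a mean is to minimize the so-called pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically. I don't know whether this is what IBM actually did; the Python sketch below just shows, on made-up spending figures, that the constant minimizing the pinball loss at q = 0.9 lands on a high quantile of the data rather than near the mean.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile loss: under-prediction costs q per unit,
    over-prediction costs (1 - q) per unit."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(values, q):
    """Observed value that minimizes the pinball loss when used as a
    single prediction for every data point."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda c: pinball_loss(values, [c] * len(values), q))


# Hypothetical yearly spending per customer (arbitrary units).
spend = [10, 12, 15, 20, 22, 25, 30, 40, 60, 80, 200]

print("mean:", sum(spend) / len(spend))
print("q=0.5:", best_constant(spend, 0.5))
print("q=0.9:", best_constant(spend, 0.9))
```

In a real model the constant would be replaced by a function of customer features, fitted by minimizing the same loss; the asymmetry of the loss is what pushes the predictions towards the upper tail.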

The last part of the article talks about a very interesting problem: how to model a system where output from the model itself affects the system, or as the article puts it “…situations where a model must incorporate behavioral changes that the model itself has inspired”. I’m surprised the article doesn’t mention the obvious applicability of this to the stock market, where of course thousands of professional and amateur data miners use prediction models (their own and others’) to determine how they buy and sell stocks. Instead, its example comes from traffic control:

For example, […] a traffic congestion system might use messages sent to GPS units to direct drivers away from the site of a highway accident. But the model would also have to calculate how many people would take its advice, lest it end up creating a new traffic jam on an alternate route.

Body computing, preventive, predictive and social medicine

There have been many interesting articles and blog posts about the future of medicine, and specifically about the need to automatically monitor various physiological parameters, and, importantly, to start focusing more on health rather than disease; prevention rather than curing. The latter point has been stressed by Adam Bosworth, the former head of Google Health, in interviews like this one (audio) and this one (video, “The Body 2.0”). Bosworth is one of the founders of a company, Keas, that wants to help people understand their health data, set health goals and pursue them. He has a new blog post where he talks about machine learning in the context of health care. He (probably rightly) sees health care as lagging behind in adoption of predictive analytics. But he thinks this will change:

All the systems emerging to help consumers get personalized advice and information about their health are going to be incredible treasure troves of data about what works. And this will be a virtuous cycle. As the systems learn, they will encourage consumers to increasingly flow data into them for better more personalized advice and encourage physicians to do the same and then this data will help these systems to learn even more rapidly. I predict now that within a decade, no practicing physician will consider treating their patients without the support/advice of the expertise embodied in the machine learning that will have taken place. And finally, we will truly move to an evidence based health care system.

Along similar lines, the Broader Perspective blog writes about the “three tiers of medicine” that may make up the future healthcare system. The first tier consists of automated health monitoring tools that collect information about your health. The second tier is about preventive medicine and involves “health coaches”, who “…incorporate genomic data, together with family history and current phenotype and biomarker data into an overall care plan”. Finally, the third tier is the traditional health care system of today (hospitals, doctors, nurses).

I learned a new term for the enabling technology for the first (data-collection) tier: body computing. The Third Body Computing Conference will be hosted by the University of Southern California on Friday (9 October). The conference’s definition of body computing is that

“Body Computing” refers to an implanted wireless device, which can transmit up-to-the-second physiologic data to physicians, patients, and patients’ loved ones.

A new article about the future of health care in Fast Company also talks about body computing and predictive/preventive health care:

Wireless monitoring and communication devices are becoming a part of our everyday lives. Integrated into our daily activities, these devices unobtrusively collect information for us. For example, instead of doing an annual health checkup (i.e. cardiac risk assessment), near real-time health data access can be used to provide rolling assessments and alert patients of changes to their health risk based on biometrics assessment and monitoring (blood pressure, weight, sleep etc). With predictive health analytics, health information intelligence, and data visualization, major risks or abnormalities can be detected and sent to the doctor, possibly preempting complications such as stroke, heart attack, or kidney disease.

Although the article is named The Future of Health Care Is Social, it actually talks mostly about self-tracking and predictive analytics. It does go into social aspects of future healthcare, like online health/disease-related networks such as PatientsLikeMe or CureTogether. All in all, a nice article.

And finally (if anyone is still awake), it has been widely reported that IBM has joined the sequencing fray and is trying to develop a nanopore-based system, a “DNA transistor”, for cheap sequencing. There are now several players in this area (for example, Oxford Nanopore, Pacific Biosciences, NABSYS) and some of them are bound to lose out – time will tell who will emerge on top. Anyway, the reason I mentioned this is partly that IBM explicitly connected this announcement to healthcare reform and personalized healthcare (IBM CEO also wants to resequence the health-care system) and partly because of the surprising (to me) fact that “[…] IBM also manages the entire health system for Denmark.” Really?

By the way, a good way to get updates on body computing is to follow Dr Leslie Saxon on Twitter.

Sequencing data storm

Today, I attended a talk given by Wang Jun, a humorous and t-shirt-clad whiz kid who set up the bioinformatics arm of Beijing Genomics Institute (BGI) as a 23-year-old PhD student, became a professor at 27, and is now the director of BGI’s facility in Shenzhen, near Hong Kong. Although I work with bioinformatics at a genome institute myself, this presentation really drove home how much storage, computing power and know-how is really required for biology now and in the near future.

BGI does staggering amounts of genome sequencing – “If it tastes good, sequence it! If it is useful, sequence it!” as Wang Jun joked – from indigenous Chinese plants to rice, pandas and humans. They have a very interesting individual genome project where they basically apply many different techniques on samples from the same person and compare the results against known references. One of many interesting results from this project was the finding that human genomes not only vary in single “DNA letter” variants (so-called SNPs, single nucleotide polymorphisms) or the number of times certain stretches of DNA are repeated (“copy number variations”) – it now turns out there are DNA snippets that, in largely binary fashion, some people have and some don’t.

Although the existing projects demand a lot of resources and manpower – the BGI has 250 bioinformaticians (!) which is still too few; according to Wang they want to quickly increase this number to 500 – this is nothing compared to what will happen after the next wave of sequencing technologies, when we will start to sequence single cells from different (or the same) tissues in an individual. Already, the data sets generated are so vast that they cannot be distributed over the internet. Wang recounted how he had to bring ten terabyte drives to Europe by himself in order to share his data with researchers at EBI (European Bioinformatics Institute). Now, they are trying out cloud computing as a way to avoid moving the data around.

Wang attributed a lot of BGI’s success to young, hardworking programmers and scientists – many of them university dropouts – who don’t have any preconceptions about science and therefore are prepared to try anything. “These are teenagers that publish in Nature,” said Wang, apparently feeling that he was (at 33) already over the hill. “They don’t run on a 24-hour cycle like us, they run on 36h-cycles and bring sleeping bags to the lab.”

All in all, good fun.
