Follow the Data

A data-driven blog


Genomics Today and Tomorrow presentation

Below is a Slideshare link/widget to a presentation I gave at the Genomics Today and Tomorrow event in Uppsala a couple of weeks ago (March 19, 2015).

I spoke after Jonathan Bingham of Google Genomics and talked a little bit about how APIs, machine learning, and what I call “querying by dataset” could make life easier for bioinformaticians working on data integration. In particular, I gave examples of a few types of queries that one would like to be able to do against “all public data” (slides 19-24).

Not long after, I saw this preprint (called “Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees”) that seems to provide part of the functionality that I was envisioning – in particular, the ability to query public sequence repositories by content (using a sequence as a query), rather than by annotation (metadata). The beginning of the abstract goes like this:

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments.
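To make the idea concrete, here is a toy Python sketch of the basic building block: a Bloom filter over k-mers, which lets you ask “is this k-mer probably present in this sequencing experiment?”. This is of course nothing like the real SBT code (which, as I understand it, arranges one such filter per node in a binary tree over thousands of experiments so that whole subtrees can be pruned during a query); the class name and parameters below are made up for illustration.

```python
import hashlib

class KmerBloomFilter:
    """Toy Bloom filter over the k-mers of one sequencing experiment."""

    def __init__(self, size=1_000_003, num_hashes=3, k=21):
        self.size, self.num_hashes, self.k = size, num_hashes, k
        self.bits = bytearray(size)  # one byte per bit position, for simplicity

    def _positions(self, kmer):
        # derive num_hashes pseudo-independent hash values from SHA-1
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{kmer}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add_read(self, read):
        for j in range(len(read) - self.k + 1):
            for pos in self._positions(read[j:j + self.k]):
                self.bits[pos] = 1

    def query(self, sequence, threshold=0.8):
        """True if at least `threshold` of the query's k-mers are (probably) present."""
        kmers = [sequence[j:j + self.k] for j in range(len(sequence) - self.k + 1)]
        hits = sum(all(self.bits[p] for p in self._positions(km)) for km in kmers)
        return bool(kmers) and hits / len(kmers) >= threshold

# Tiny example with a short k for readability; real indexes use k around 20-30
bf = KmerBloomFilter(k=5)
bf.add_read("ACGTACGTGGTACCTTGAC")
print(bf.query("ACGTACGTGG"))  # True: this fragment's k-mers are in the experiment
print(bf.query("TTTTTAAAAA"))  # False (with high probability)
```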

Deep learning and genomics: the splicing code [and breast cancer features]

Last summer, I wrote a little bit about potential applications of deep learning to genomics. What I had in mind then was (i) to learn a hierarchy of cell types based on single-cell RNA sequencing data (with gene expression measures in the form of integers or floats as inputs) and (ii) to discover features in metagenomics data (based on short sequence snippets, i.e. k-mers). I had some doubts regarding the latter application because I was not sure how much the system could learn from short k-mers. Well, now someone has tried deep learning directly on DNA sequence!

Let’s back up a little bit. One of many intriguing questions in biology is exactly how splicing works. A lot is known about the rules controlling it, but not everything. A recent article in Science, The human splicing code reveals new insights into the genetic determinants of disease (unfortunately paywalled), used a machine learning approach (ensembles of neural networks) to predict splicing events, and the effects of single-base mutations on them, using only DNA sequence information as input. Melissa Gymrek has a good blog post on the paper, so I won’t elaborate too much. Importantly though, in this paper the features are still hand-crafted (there are 1,393 sequence-based features).

In an extension of this work, the same group used deep learning to actually learn the features from the sequence data. Hannes Bretschneider posted this presentation from NIPS 2014 describing the work, and it is very interesting. They used a convolutional network that was able to discover things like the reading frame (the three-nucleotide periodicity resulting from how amino acids are encoded in protein-coding DNA stretches) and known splicing signals.
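To give a flavor of what “learning features directly from the sequence” means, here is a minimal numpy sketch of the first-layer operation of such a network: one-hot encode the DNA and scan it with a convolutional filter. The filter below is written by hand to resemble the canonical 5′ splice-site donor motif purely for illustration; in the actual work the filters are learned from data, and this is not their architecture.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a 4 x L matrix (rows = A, C, G, T)."""
    m = np.zeros((4, len(seq)))
    for i, base in enumerate(seq):
        m[BASES.index(base), i] = 1.0
    return m

# Hand-written stand-in for a learned first-layer filter: a rough match to the
# splice-donor motif "GTAAGT". A trained network would learn many such filters.
motif_filter = one_hot("GTAAGT")

seq = "ATGGCAGGTAAGTTTCCAGGTAAGAC"
x = one_hot(seq)

L, w = x.shape[1], motif_filter.shape[1]
# 1D "valid" convolution (really cross-correlation) of the filter along the sequence
scores = np.array([np.sum(x[:, i:i + w] * motif_filter) for i in range(L - w + 1)])

# Positions where the filter fires strongly are candidate donor splice sites
print(np.argmax(scores), scores.max())  # position 7 scores a perfect 6.0 here
```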

They have also made available Hebel, a GPU-accelerated deep learning library for DNA sequence data written in Python. Right now it seems like only feedforward nets are available (not the convolutional nets mentioned in the talk). I am currently trying to install the package on my Mac.

Needless to say, I think this is a very interesting development and I hope to try this approach on some entirely different problem.

Edit 2015-01-06. Well, what do you know! I just found out that my suggestion (i) has been tried as well. At the currently ongoing PSB’15 conference, Jie Tan presented work using a denoising autoencoder network to learn a representation of breast cancer gene expression data. The learned features were shown to represent things like tumor vs. normal tissue status, estrogen receptor (ER) status and molecular subtypes. I had thought that there wasn’t enough data yet to support this kind of approach, and even said as much to someone who suggested using The Cancer Genome Atlas (TCGA) data at a data science workshop last month (this work uses TCGA data as well as data from METABRIC). The authors themselves remark in the paper that it is surprising that the method works so well. My previous thinking was that we needed to await the masses of single-cell gene expression data that will come out in the coming years.
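For the curious, here is roughly what a denoising autoencoder boils down to, as a minimal numpy sketch with a single hidden layer and tied weights. This is not the code from the PSB paper, the data below is random, and a real expression matrix would first need to be scaled to [0, 1]; the point is just the core trick: corrupt the input, train the network to reconstruct the uncorrupted version, and then use the hidden-layer activations as learned features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for a (samples x genes) expression matrix scaled to [0, 1]
n_samples, n_genes, n_hidden = 200, 500, 50
X = rng.random((n_samples, n_genes))

W = rng.normal(0.0, 0.01, (n_genes, n_hidden))   # tied weights: decoder uses W.T
b_hid, b_out = np.zeros(n_hidden), np.zeros(n_genes)
lr, corruption = 0.1, 0.1

for epoch in range(100):
    # denoising step: randomly zero out a fraction of the input entries
    mask = rng.random(X.shape) > corruption
    X_tilde = X * mask
    H = sigmoid(X_tilde @ W + b_hid)             # encoder
    X_hat = sigmoid(H @ W.T + b_out)             # decoder (reconstruction)
    # gradients of the cross-entropy reconstruction loss (sigmoid outputs)
    d_out = (X_hat - X) / n_samples
    d_hid = (d_out @ W) * H * (1.0 - H)
    W -= lr * (X_tilde.T @ d_hid + (H.T @ d_out).T)   # encoder + decoder parts
    b_hid -= lr * d_hid.sum(axis=0)
    b_out -= lr * d_out.sum(axis=0)

# Learned per-sample features: what one would inspect for ER status, subtype, etc.
features = sigmoid(X @ W + b_hid)
print(features.shape)  # (200, 50)
```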

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: the first by Pawel Herman, who gave a very good general introduction to the roots, history, present and future of deep learning, and the second, a more applied talk, by Josephine Sullivan, who showed some impressive results her group has obtained in image recognition, as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (though perhaps not as many as the number of parameters). Unfortunately, sample sizes in genomics are usually small; it is very much a small-n, large-p domain, at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as increasingly abstract sets of visual categories in the case of image classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of short sequence fragments (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a way to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Analytics challenges in genomics

Continuing on the theme of data analysis and genomics, here is a presentation I gave for the Data Mining course at Uppsala University in October this year. It talks a little bit about massively parallel DNA sequencing, then mentions grand visions such as sequencing millions of genomes, discovering new species by metagenomics and “genomic observatories”, then goes into the practical difficulties, and finally suggests some strategies such as prediction contests. Enjoy!

What can “big data” (read “Hadoop”) do for genomics?

Two recent announcements prompted this post: Cloudera and Mount Sinai School of Medicine will collaborate to “solve medical challenges using big data” (more specifically, Cloudera’s Jeff Hammerbacher, ex-big-data guru at Facebook, will work with the equally trailblazing mathematician/biologist Eric Schadt at Mount Sinai’s Institute for Genomics and Multiscale Biology), and NextBio will collaborate with Intel to “optimize the Hadoop stack and advance big data technologies in medicine”. In light of these, I would like to offer some random thoughts on possible use cases.

Note that “big data” essentially means “Hadoop” in the above press releases, and that the “medicine” they mention should be understood as “genomic medicine” or just “genomics”. Since I happen to know a thing or two about genomics, I will limit myself to (parts of) genomics and Hadoop/MapReduce in this post. For a good overview of big data and medicine in a broader sense than I can describe here, check out this rather nice GigaOm article.

Existing Hadoop/MapReduce stuff for NGS

In the world of high-throughput, or next-generation, sequencing (NGS), which is rapidly becoming indispensable for genomics, there are a few Hadoop-based frameworks that I am aware of and that should probably be mentioned first. Packages like Cloudburst and Crossbow leverage Hadoop to perform “read mapping” (approximate string matching for taking a DNA sequence from the sequencer and figuring out where in a known genome it came from), Myrna and Eoulsan do the same but also extend the workflow to quantifying gene expression and identifying differentially expressed genes based on the sequences, and Contrail does Hadoop-based de novo assembly (piecing together a new genome from sequences without previous knowledge, like an extremely difficult jigsaw puzzle). These are essentially MapReduce implementations of existing software, which is all good and fine, but I haven’t seen these tools being used much so far. Perhaps one reason is that read mapping is usually not a major bottleneck compared to some other steps, and with recently released software such as SeqAlto and SNAP (thanks, Tom Dyar) (and another package that I’m sure I read about the other day but can’t seem to find right now) promising a further 10x-100x speed increase compared to existing tools, there is just not a pressing need at the moment. Contrail, the de novo assembler, does offer an opportunity for research groups who don’t have access to very RAM-rich computers (de novo assembly is notoriously memory hungry, with 512 GB RAM machines often strained to the limit on certain data sets) to perform assembly on commodity clusters.

Then there are the projects that attempt to build a Hadoop infrastructure for next-generation sequencing, like Seal, which provides “map-reducification” for a number of common NGS operations, or Hadoop-BAM (a library for processing BAM files, a common sequence alignment format, in Hadoop) and SeqPig (a library with import and export functions to allow common bioinformatics formats to be used in Pig).
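As a concrete (if simplistic) illustration of what “map-reducification” of an NGS task can look like, here is a hypothetical Hadoop Streaming job in Python that counts mapped reads per chromosome from SAM-formatted input. The script names and the exact invocation are made up for the example; the point is just the mapper/reducer structure.

```python
#!/usr/bin/env python
# mapper.py -- emit "chromosome<TAB>1" for every mapped read in a SAM stream
import sys

for line in sys.stdin:
    if line.startswith("@"):                      # skip SAM header lines
        continue
    fields = line.rstrip("\n").split("\t")
    if len(fields) > 2 and fields[2] != "*":      # column 3 = reference name, "*" = unmapped
        print(fields[2] + "\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per chromosome (Hadoop delivers keys sorted)
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current and current is not None:
        print(current + "\t" + str(count))
        count = 0
    current = key
    count += int(value)
if current is not None:
    print(current + "\t" + str(count))
```

Something along the lines of hadoop jar hadoop-streaming.jar -input reads.sam -output counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py would then run it on a cluster, and the same pattern generalizes to per-gene read counting and other embarrassingly parallel NGS bookkeeping.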

What Hadoop could be useful for

I’m sure people smarter than me will come up with many different use cases for Hadoop in genomics and medicine. At this point, however, I would suggest these general themes:

  • Statistical associations between various kinds of data vectors: clinical, environmental, molecular, microbial and so on. This is more or less a batch-processing problem and thus suited to Hadoop. NextBio (the company mentioned at the beginning, which is teaming up with Intel) does this as a core part of its business: computing correlations between gene expression levels in different tissues, diseases and conditions on the one hand, and clinical information, drug data and so on on the other. However, this concept could (and should) be extended to other things like environmental information, lifestyle factors, genetic variants (SNVs, structural variations, copy number variations etc.), epigenetic data (chromatin structure, DNA methylation, histone modifications …) and personal microbiomes (the gut microbiota of each patient, for example). A toy version of this kind of association scan is shown in the first sketch after this list. Of course, collecting and compiling the data to perform these correlations will be hard; a much harder “big data” problem than computing the actual correlations. SolveBio is a new company that seems to want to understand cancer by compiling vast quantities of data in such a way. This is how they put it in an interview (titled, ambitiously, “The Cloud Will Cure Cancer”): “Patients can measure every feature, as the technology becomes cheaper: genome sequence, gene expression in every accessible tissue, chromatin state, small molecules and metabolites, indigenous microbes, pathogens, etc. These data pools can be created by anyone who has the consent of the patients: universities, hospitals, or companies. The resulting networks, the “data tornado”, will be huge. This will be a huge amount of data and a huge opportunity to use statistical learning for medicine.” In fact, a third recently announced big data/genomics collaboration, between Google and the Institute for Systems Biology (ISB), has already started to explore what this type of tool could look like with their Cancer Regulome Explorer. ISB has used the Google Compute Engine to scale a random forest algorithm to 600,000 cores across Google’s global data centers in order to “explore associations between DNA, RNA, epigenetic, and clinical cancer data.” See this case study for some more details (not many more, to be honest).
  • Metagenomics. This means, according to one definition, “the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.” (There is really nothing “meta” about it; it’s just that you are looking at many species at once, which is why it is also called environmental genomics or community genomics in some cases.) For example, Craig Venter’s project to sequence as many living things as possible in the Sargasso Sea is metagenomics, as is sequencing samples from the human gut, snot etc. in search of novel bacteria, viruses and fungi (or just characterizing the variety of known ones). It’s a fascinating field; for an easy introduction, see the TED Talk called “What’s left to explore?” by Nathan Wolfe. Analyzing sequences from metagenomics projects is of course much more difficult than usual, because you are randomly sampling sequences for which you don’t know the source organism but have to infer it in some way. This calls for smart use of proper data structures for indexing and querying, and for as much parallelization as possible, very likely in some Hadoopy kind of way. C. Titus Brown has written a lot of interesting stuff about the metagenomics data deluge on his blog, Living in an Ivory Basement, where he has explored esoteric and useful things such as probabilistic de Bruijn graphs (a toy, non-probabilistic de Bruijn graph is shown in the second sketch after this list). Lately, compressive genomics (algorithms that compute directly on compressed genomic data) has become something of a buzz phrase, although similar ideas have been used for quite some time. Some combination of all of these approaches will be needed to combat the inevitable information overload.
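First sketch: at its simplest, the “associations between data vectors” idea is just a large matrix of univariate tests. Here is a toy Python version using synthetic data in place of real expression and clinical measurements; a serious analysis would of course need covariates, a proper multiple-testing procedure (FDR rather than Bonferroni) and the kind of distributed execution discussed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-ins for real data: an expression matrix (samples x genes) and one clinical variable
n_samples, n_genes = 300, 2000
expr = rng.normal(size=(n_samples, n_genes))
clinical = rng.normal(size=n_samples)        # e.g. a drug response score or a lab value

# Rank-based association between every gene and the clinical variable
rho = np.empty(n_genes)
pvals = np.empty(n_genes)
for g in range(n_genes):
    rho[g], pvals[g] = stats.spearmanr(expr[:, g], clinical)

# Very conservative Bonferroni threshold, just to have something to report
hits = np.flatnonzero(pvals * n_genes < 0.05)
print(len(hits), "genes pass the threshold (should be about zero for random data)")
```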

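Second sketch: the de Bruijn graphs mentioned above are conceptually simple, even though making them memory-efficient (probabilistic, compressed) is where the real work lies. Here is a toy, entirely non-probabilistic version in Python; assembly then amounts to finding paths through this graph, and the probabilistic variants essentially replace the exact edge sets with Bloom-filter-like structures so that billions of k-mers fit in memory.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTACGT", "CGTACGTT", "GTACGTTA"]
for node, neighbours in sorted(de_bruijn(reads, k=4).items()):
    print(node, "->", ", ".join(sorted(neighbours)))
```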
Beyond batch processing

In my mind, Hadoop has been associated with batch processing, but today I heard that the newest version of Hadoop not only includes a completely overhauled resource-management layer called YARN (on top of which MapReduce is now just one kind of application), but will even allow other kinds of frameworks, such as streaming real-time analytics frameworks, to operate on the data stored in HDFS. I’ve been thinking about possible applications of stream analytics in next-generation sequencing. Surprisingly, there is already software for streaming quantification of sequences, eXpress; these developers are surely ahead of their time. The immediate use case I can think of is the USB-stick-sized MinION nanopore sequencer, which will reportedly produce output in real time (which no sequencer does today, as far as I know), so that you can start your analysis while the sequencer is still running. If the vision of “genomic observatories” that “take the planet’s biological pulse” comes true, I’m sure there will be plenty of work to do for the stream analytics clusters of the world …
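Just to make the streaming idea concrete, here is a toy Python consumer that could sit at the end of a pipe and keep running summary statistics while reads are still being produced. The input format (one read per line on stdin) is made up for the example; a real setup would watch the sequencer’s output files or a message queue.

```python
import sys

# Toy streaming consumer: reads arrive one per line on stdin; keep running
# summary statistics so that analysis can start while the run is in progress.
count, total_len, gc = 0, 0, 0

for line in sys.stdin:
    read = line.strip().upper()
    if not read:
        continue
    count += 1
    total_len += len(read)
    gc += read.count("G") + read.count("C")
    if count % 10_000 == 0:                     # periodic progress report
        print(f"{count} reads, mean length {total_len / count:.0f} bp, "
              f"GC {100 * gc / total_len:.1f}%", file=sys.stderr)
```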

This has been a rambling post that will probably need a few updates in the coming days – congratulations and thanks if you made it to the end!

New analysis competitions

Some interesting competitions in data analysis / prediction:

Kaggle is managing this year’s KDD Cup, which will be about Weibo, China’s rough equivalent of Twitter (with more support for adding pictures and commenting on posts, it is perhaps more like a hybrid between Twitter and Facebook). There will be two tasks: (1) predicting which users a given user will follow (all data being anonymized, of course), and (2) predicting click-through rates in online computational advertising systems. According to Gordon Sun, chief scientist at Tencent (the company behind Weibo), the data set to be used is the largest ever released for competitive purposes.

CrowdAnalytix, an India-based company with a business idea similar to Kaggle’s, has started a fun quickie competition about sentiment mining. Actually, the competition might already be over, as it ran for just nine days starting February 16. The input consists of comments left by visitors to a major airport in India, and the goal is to identify and compile actionable and/or interesting information, such as what kinds of services visitors think are missing.

The Clarity challenge is, for me, easily the most interesting of the three, in that it concerns the use of genomic information in healthcare. This challenge (with a prize sum of $25,000) is, in effect, crowdsourcing genomic/medical research (although only 20 teams will be selected to participate). The goal is to identify and report on potential genetic features underlying medical disorders in three children, given the genome sequences of the children and their parents. These genetic features are presently unknown, which is why this competition really represents something new in medical research. I think this is a very nice initiative; in fact, I had thought of initiating something similar at the institute where I work, but this challenge is much better than what I had in mind. It will be very interesting to see what comes out of it.

23andMe seeks genetic markers for healthy aging

I recently blogged about the Harvard Study of Adult Development, which tries to identify factors predictive of happy aging. Well, just yesterday I read an interesting blog post at Genetic Future describing how 23andMe is now looking to identify (genetic) factors for healthy aging.

Apparently, 23andMe offered free genetic scans to participants in the Palo Alto Senior Games, a big sporting event for people aged 50 and up. A spokesman for the company told Palo Alto Online that they want to use the genetic scans to try to find genetic factors underlying healthy aging. The participants in the Palo Alto Senior Games are, almost by definition, healthy seniors, and 23andMe is looking to recruit (or has already succeeded in recruiting; it wasn’t completely clear from the blog post) 4,500 individuals, which is a pretty sizable cohort.

This is not the first large-scale recruitment drive from 23andMe targeting a specific group; they previously announced a Parkinson’s disease project with 3,000 participants. It will be interesting to see what comes out of these projects.

Sequencing data storm

Today, I attended a talk given by Wang Jun, a humorous and T-shirt-clad whiz kid who set up the bioinformatics arm of the Beijing Genomics Institute (BGI) as a 23-year-old PhD student, became a professor at 27, and is now the director of BGI’s facility in Shenzhen, near Hong Kong. Although I work with bioinformatics at a genome institute myself, this presentation really drove home how much storage, computing power and know-how biology requires now and will require in the near future.

BGI does staggering amounts of genome sequencing (“If it tastes good, sequence it! If it is useful, sequence it!” as Wang Jun joked), covering everything from indigenous Chinese plants to rice, pandas and humans. They have a very interesting individual genome project in which they apply many different techniques to samples from the same person and compare the results against known references. One of many interesting results from this project was the finding that human genomes do not only vary in single “DNA letter” variants (so-called SNPs, single nucleotide polymorphisms) or in the number of times certain stretches of DNA are repeated (“copy number variations”); it now turns out that there are also DNA snippets that, in largely binary fashion, some people have and some don’t.

Although the existing projects already demand a lot of resources and manpower (BGI has no fewer than 250 bioinformaticians, which according to Wang is still too few; they want to quickly increase this number to 500), this is nothing compared to what will happen with the next wave of sequencing technologies, when we will start to sequence single cells from different (or the same) tissues in an individual. Already, the data sets generated are so vast that they cannot practically be distributed over the internet. Wang recounted how he had to bring ten terabyte drives to Europe himself in order to share his data with researchers at the EBI (European Bioinformatics Institute). Now, they are trying out cloud computing as a way to avoid moving the data around.

Wang attributed a lot of BGI’s success to young, hardworking programmers and scientists, many of them university dropouts, who don’t have any preconceptions about science and are therefore prepared to try anything. “These are teenagers that publish in Nature,” said Wang, apparently feeling that, at 33, he was already over the hill. “They don’t run on a 24-hour cycle like us; they run on 36-hour cycles and bring sleeping bags to the lab.”

All in all, good fun.
