Follow the Data

A data driven blog

Archive for the tag “deep-learning”

List of deep learning implementations in biology

[Note: this list now lives at GitHub, where it will be continuously updated, so please go there instead!]

I’m going to start collecting papers on, and implementations of, deep learning in biology (mostly genomics, but other areas as well) on this page. It’s starting to get hard to keep up! For the purposes of this list, I’ll consider things like single-layer autoencoders, although not literally “deep”, to qualify for inclusion. The categorizations will by necessity be arbitrary and might be changed around from time to time.

In parallel, I’ll try to post some of these on gitxiv as well under the tag bioinformatics plus other appropriate tags.

Please let me know about the stuff I missed!


Neural graph fingerprints [github][gitxiv]

A convolutional net that can learn features which are useful for predicting properties of novel molecules; “molecular fingerprints”. The net works on a graph where atoms are nodes and bonds are edges. Developed by the group of Ryan Adams, who co-hosts the very good Talking Machines podcast.


Pcons2 – Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns [web interface]

Here, a “deep random forest” with five layers is used to improve predictions of which residues (amino acids) in a protein are physically interacting which each other. This is useful for predicting the overall structure of the protein (a very hard problem.)


Gene expression

In modeling gene expression, the inputs are typically numerical values (integers or floats) estimating how much RNA is produced from a DNA template in a particular cell type or condition.

ADAGE – Analysis using Denoising Autoencoders of Gene Expression [github][gitxiv]

This is a Theano implementation of stacked denoising autoencoders for extracting relevant patterns from large sets of gene expression data, a kind of feature construction approach if you will. I have played around with this package quite a bit myself. The authors initially published a conference paper applying the model to a compendium of breast cancer (microarray) gene expression data, and more recently posted a paper on bioRxiv where they apply it to all available expression data (microarray and RNA-seq) on the pathogen Pseudomonas aeruginosa. (I understand that this manuscript will soon be published in a journal.)

Learning structure in gene expression data using deep architectures [paper]

This is also about using stacked denoising autoencoders for gene expression data, but there is no available implementation (as far as I could tell). Included here for the sake of completeness (or something.)

Gene expression inference with deep learning [github][paper]

This deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2/Theano.

Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model [paper]

The authors use stacked autoencoders to learn biological features in yeast from thousands of microarrays. They analyze the hidden layer representations and show that these encode biological information in a hierarchical way, so that for instance transcription factors are represented in the first hidden layer.

Predicting enhancers and regulatory regions

Here the inputs are typically “raw” DNA sequence, and convolutional networks (or layers) are often used to learn regularities within the sequence. Hat tip to Melissa Gymrek ( for pointing out some of these.

DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences [github][gitxiv]

Made for predicting the function of non-protein coding DNA sequence. Uses a convolution layer to capture regulatory motifs (i e single DNA snippets that control the expression of genes, for instance), and a recurrent layer (of the LSTM type) to try to discover a “grammar” for how these single motifs work together. Based on Keras/Theano.

Basset – learning the regulatory code of the accessible genome with deep convolutional neural networks [github][gitxiv]

Based on Torch, this package focuses on predicting the accessibility (or “openness”) of the chromatin – the physical packaging of the genetic information (DNA+associated proteins). This can exist in more condensed or relaxed states in different cell types, which is partly influenced by the DNA sequence (not completely, because then it would not differ from cell to cell.)

DeepSEA – Predicting effects of noncoding variants with deep learning–based sequence model [web server][paper]

Like the packages above, this one also models chromatin accessibility as well as the binding of certain proteins (transcription factors) to DNA and the presence of so-called histone marks that are associated with changes in accessibility. This piece of software seems to focus a bit more explicitly than the others on predicting how single-nucleotide mutations affect the chromatin structure. Published in a high-profile journal (Nature Methods).

DeepBind – Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning [code][paper]

This is from the group of Brendan Frey in Toronto, and the authors are also involved in the company Deep Genomics. DeepBind focuses on predicting the binding specificities of DNA-binding or RNA-binding proteins, based on experiments such as ChIP-seq, ChIP-chip, RIP-seq,  protein-binding microarrays, and HT-SELEX. Published in a high-profile journal (Nature Biotechnology.)

PEDLA: predicting enhancers with a deep learning-based algorithmic framework [code][paper]

This package is for predicting enhancers (stretches of DNA that can enhance the expression of a gene under certain conditions or in a certain kind of cell, often working at a distance from the gene itself) based on heterogeneous data from (e.g.) the ENCODE project, using 1,114 features altogether.

DEEP: a general computational framework for predicting enhancers

Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods (and several other papers applying various kinds of deep networks to regulatory region prediction) [code][one paper out of several]

Wyeth Wasserman’s group have made a kind of toolkit (based on the Theano tutorials) for applying different kinds of deep learning architectures to cis-regulatory element (DNA stretches that can modulate the expression of a nearby gene) prediction. They use a specific “feature selection layer” in their nets to restrict the number of features in the models. This is implemented as an additional sparse one-to-one linear layer between the input layer and the first hidden layer of a multi-layer perceptron.


Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks [paper][web server]

This implementation uses a stacked autoencoder with a supervised layer on top of it to predict whether a certain type of genomic region called “CpG islands” (stretches with an overrepresentation of a sequence pattern where a C nucleotide is followed by a G) is methylated (a chemical modification to DNA that can modify its function, for instance methylation in the vicinity of a gene is often but not always related to the down-regulation or silencing of that gene.) This paper uses a network structure where the hidden layers in the autoencoder part have a much larger number of nodes than the input layer, so it would have been nice to read the authors’ thoughts on what the hidden layers represent.

Single-cell applications

CellCnn – Representation Learning for detection of disease-associated cell subsets

This is a convolutional network (Lasagne/Theano) based approach for “Representation Learning for detection of phenotype-associated cell subsets.” It is interesting because most neural network approaches for high-dimensional molecular measurements (such as those in the gene expression category above) have used autoencoders rather than convolutional nets.

Population genetics

Deep learning for population genetic inference [paper]

No implementation available yet but says an open-source one will be made available soon.


This is a harder category to populate because a lot of theoretical work on neural networks and deep learning has been intertwined with neuroscience. For example, recurrent neural networks have long been used for modeling e.g. working memory and attention. In this post I am really looking for pure applications of DL rather than theoretical work, although that is extremely interesting.

For more applied DL, I have found

Deep learning for neuroimaging: a validation study [paper]

SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing [paper]

I’m sure there are many others. Maybe digging up some seminal neuroscience papers modeling brain areas and functions with different kinds of neural networks would be a worthy topic for a future blog post.



Deep learning and genomics: the splicing code [and breast cancer features]

Last summer, I wrote a little bit about potential applications of deep learning to genomics. What I had in mind then was (i) to learn a hierarchy of cell types based on single-cell RNA sequencing data (with gene expression measures in the form of integers or floats as inputs) and (ii) to discover features in metagenomics data (based on short sequence snippets; k-mers). I had some doubts regarding the latter application because I was not sure how much the system could learn from short k-mers. Well, now someone has tried deep learning from DNA sequence features!

Let’s back up a little bit. One of many intriguing questions in biology is exactly how splicing works. A lot is known about the rules controlling it but not everything. A recent article in Science, The human splicing code reveals new insights into the genetic determinants of disease (unfortunately paywalled), used a machine learning approach (ensembles of neural networks) to predict splicing events and the effects of single-base mutations on the same using only DNA sequence information as input. Melissa Gymrek has a good blog post on the paper, so I won’t elaborate too much. Importantly though, in this paper the features are still hand-crafted (there are 1393 sequence based features).

In an extension of this work, the same group used deep learning to actually learn the features from the sequence data. Hannes Bretschneider posted this presentation from NIPS 2014 describing the work, and it is very interesting. They used a convolutional network that was able to discover things like the reading frame (the three-nucleotide periodicity resulting from how amino acids are encoded in protein-coding DNA stretches) and known splicing signals.

They have also made available a GPU-accelerated deep learning library for DNA sequence data for Python: Hebel. Right now it seems like only feedforward nets are available (not the convolutional nets mentioned in the talk). I am currently trying to install the package on my Mac.

Needless to say, I think this is a very interesting development and I hope to try this approach on some entirely different problem.

Edit 2015-01-06. Well, what do you know! Just found out that my suggestion (i) has been tried as well. At the currently ongoing PSB’15 conference, Jie Tan has presented work using a denoising autoencoder network to learn a representation of breast cancer gene expression data. The learned features were shown to represent things like tumor vs. normal tissue status, estrogen receptor (ER) status and molecular subtypes. I had thought that there wasn’t enough data yet to support this kind of approach (and even told someone who suggested using The Cancer Genome Atlas [TCGA] data as much at a data science workshop last month – this work uses TCGA data as well as data from METABRIC), and the authors remark in the paper that it is surprising that the method seems to work so well. Previously my thinking was that we needed to await the masses of single-cell gene expression data that are going to come out in the coming years.

Quick notes

  • I’ve found the Data Skeptic to be a nice podcast about data science and related subjects. For example, the “data myths” episode and the one with Matthew Russell (who wrote Mining the Social Web) are fun.
  • When I was in China last month, the seat pocket in front of me in the cab we took from the Beijing airport had a glossy magazine in it. The first feature article was about big data (大数据) analysis applied to Chinese TV series and movies, Netflix-style. Gotta beat those Korean dramas! One of the hotels we stayed in Beijing had organized an international conference on big data analytics the day before we arrived at the hotel. The signs and posters were still there. Anecdotes, not data, but still.
  • November was a good meetup month in Stockholm. The Machine Learning group had another good event at Spotify HQ, with interesting presentations from Watty , both about how to “data bootstrap” a startup when you discover that the existing data you’ve acquired is garbage and need to start generating your own in a hurry, and about the actual nitty gritty details of their algorithms (which model and predict energy consumption from different devices in households by deconvoluting a composite signal), and also about embodied cognition and robotics by Jorge Davila-Chacon (slides here). Also, in an effort to revive the Stockholm Big Data group, I co-organized (together with Stefan Avestad from Ericsson) a meetup with Paco Nathan on Spark. The slides for the talk, which was excellent and extremely appreciated by the audience, can be found here. Paco also gave a great workshop the next day on how to actually use Spark. Finally, I’ve joined the organizing committee of SRUG, the Stockholm R useR group, and have started to plan some future meetups there. The next one will be on December 9 and will deal with how Swedish governmental organizations use R.
  • Erik Bernhardsson of Spotify has written a fascinating blog post combining two of my favorite subjects: chess and deep learning. He has trained a 3 layer deep and 2048 unit wide network on 100 million games from FICS (the Free Internet Chess Server, where I, incidentally, play quite often). I’ve often thought about why it seems to be so hard to build a chess engine that really learns the game from scratch, using actual machine learning, rather than the rule- and heuristic based programs that have ruled the roost, and which have been pre-loaded with massive opening libraries and endgame tablebases (giving the optimal move in any position with less than N pieces; I think that N is currently about =<7). It would be much cooler to have a system that just learns implicitly how to play and does not rely on knowledge. Well, Erik seems to have achieved that, kind of. The cool thing is that this program does not need to be told explicitly how the pieces move; it can infer it from data. Since the system is using amateur games, it sensibly enough does not care about the outcome of each game (that would be a weak label for learning). I do think that Erik is a bit optimistic when he writes that “Still, even an amateur player probably makes near-optimal moves for most time.” Most people who have analyzed their own games, or online games, with a strong engine know that amateur games are just riddled with blunders. (I remember the old Max Euwe book “Chess master vs chess amateur”, which also demonstrated this convincingly … but I digress).  Still, a very impressive demonstration! I once supervised a master’s thesis where the aim was to teach a neural network to play some specific endgames, and even that was a challenge. As Erik notes in his blog post, his system needs to be tried against a “real” chess engine. It is reported to score around 33% against Sunfish, but that is a fairly weak engine, as I found out by playing it half and hour ago.

Videos for rainy days

If, like me, you are on vacation, you might have time to watch some of these cool videos on a rainy day (unless you read books instead, as you probably should):

Video from the Europe-wide machine learning meetup on June 15. Andrew Ng’s talk is probably the highlight, but I also enjoyed Muthu Muthukrishnan’s “On Sketching” and Sam Bessalah’s “Abstract algebra for stream learning.”

Video (+ slides) from the Deep Learning meetup in Stockholm on June 9
I saw these live and they were quite good. After some preliminaries, the first presentation (by Pawel Herman) starts at around 00:06:00 in the video and the second presentation (by Josephine Sullivan) starts at around 1:18:30.

Videos from the Big Data in Biomedicine event at Stanford
. Obviously I haven’t seen all of these, but the ones I have seen have been of really high quality. I particularly enjoyed the talk by Google’s David Glazer on the search behemoth’s efforts in genomics and Sandrine Dudoit on the role of statisticians in data science (where she echoes to some extent Terry Speed’s pessimistic views), but I think all of the talks are worth watching.

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: First one by Pawel Herman, who gave a very good general introduction into the roots, history, present and future of deep learning, and a more applied talk by Josephine Sullivan, where she showed some impressive results obtained by her group in image recognition as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although not as many as the number of parameters perhaps). Unfortunately, the sample sizes in genomics are usually small – it’s a very small n, large p domain at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as a set of more and more abstract set of visual categories in the case of picture classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequences (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a ways to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Stockholm data happenings

The weather may be terrible at the moment in Stockholm (it was really a downer to come back from the US this morning) but there are a couple of interesting data-related events coming up. The past week, I missed two interesting events: the KTH Symposium on Big Data (past Mon, May 26) and the AWS Summit (past Tue, May 27).

In June, there will be meetups on deep learning (Machine Learning Stockholm group, June 9 at Spotify) and on Shiny and ggvis presented by Hadley Wickham himself (Stockholm useR group, June 16 at Pensionsmyndigheten.) There are wait lists for both.

Danny Bickson is giving a tutorial on GraphLab at Stockholm iSocial Summer school June 2-4. He has indicated that he would be happy to chat with anyone who is interested in connection with this.

King are looking for a “data guru” – a novel job title!

Finally, Wilhelm Landerholm, a seasoned data scientist who was way ahead of the hype curve, has finally started (or revived?) his blog on big data, which unfortunately is Swedish only: We Want Your Data.



Two good resources (about sklearn and deep learning)

I have been using R, mostly happily, for the past 6 or 7 years, for its variety of statistical and machine learning packages and the relative ease of producing nice-looking plots. At the same time I am a big user of Python for things that R really doesn’t do that well, such as large-scale string manipulation. I had been aware of scikit-learn (or sklearn) for a while as a potential way to be able to do “everything” in Python including stats and plotting, but never really felt the pull to start using it. In the beginning, it felt too immature; later, it felt too messy when I looked at the documentation.

Last week, however, I came across a really good tutorial by Jake Vanderplas that finally made sklearn click for me and perhaps will push me over the edge to start using it. (I don’t expect to leave R any time soon, though…) The tutorial shows, step by step, how to divide your data set into training and test sets, fit models and make predictions, perform grid searches for parameter settings, plot learning curves etc.

 Deep learning is another subject (although much bigger than sklearn of course) that I have kept up a passing interest in but never really looked into properly, because I wasn’t sure where to start. The new book Deep learning: Methods and applications (PDF link) by Li Deng and Dong Yu seems like a good place to start. I’ve only read a few chapters, but so far it has done a good job of clarifying terms and putting deep learning methods into a historical context.

Large-scale machine learning course & streaming organism classification

The NYU Large Scale Machine Learning course looks like it will be very worthwhile to follow. The instructors, John Langford and Yann Le Cun, are both key figures in the machine learning field – for instance, the former developed Vowpal Wabbit and the latter has done pioneering work in deep learning. It is not an online course like those at Coursera et al., but they have promised to put lecture videos and slides online. I’ll certainly try to follow along as best I can.


There is an interesting Innocentive challenge going on, called “Identify Organisms from A Stream of DNA Sequences.” This is interesting to me both because of the subject matter (classification based on DNA sequences) and also because the winner is explicitly required to submit an efficient, scalable solution (not just a good classifier.) Also, the prize sum is one million US dollars! It’s exactly this kind of algorithms that will be needed to enable the “genomic observatories” that I have mentioned before on this blog which will continuously stream out sequences obtained from the environment.

Post Navigation