Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: First one by Pawel Herman, who gave a very good general introduction into the roots, history, present and future of deep learning, and a more applied talk by Josephine Sullivan, where she showed some impressive results obtained by her group in image recognition as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.
Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although not as many as the number of parameters perhaps). Unfortunately, the sample sizes in genomics are usually small – it’s a very small n, large p domain at least in a general sense.
I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as a set of more and more abstract set of visual categories in the case of picture classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.
There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.
For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequences (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a ways to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.
I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!