Follow the Data

A data-driven blog

Archive for the category “Links”

List of deep learning implementations in biology

[Note: this list now lives at GitHub, where it will be continuously updated, so please go there instead!]

I’m going to start collecting papers on, and implementations of, deep learning in biology (mostly genomics, but other areas as well) on this page. It’s starting to get hard to keep up! For the purposes of this list, I’ll count things like single-layer autoencoders, which are not literally “deep”, as qualifying for inclusion. The categorizations will by necessity be arbitrary and might be changed around from time to time.

In parallel, I’ll try to post some of these on gitxiv as well under the tag bioinformatics plus other appropriate tags.

Please let me know about the stuff I missed!

Cheminformatics

Neural graph fingerprints [github][gitxiv]

A convolutional net that can learn features that are useful for predicting properties of novel molecules: “molecular fingerprints”. The net works on a graph where atoms are nodes and bonds are edges. Developed by the group of Ryan Adams, who co-hosts the very good Talking Machines podcast.
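As a rough illustration of the idea (this is my own minimal numpy sketch, not the authors’ implementation, and all sizes and weights are made-up placeholders): in a neural-fingerprint-style layer, each atom pools features from its bonded neighbours, passes them through a learned transformation, and writes a soft contribution into a fixed-length fingerprint vector.

    import numpy as np

    def neural_fingerprint_layer(atom_features, adjacency, W_hidden, W_out):
        """One neural-fingerprint-style layer: aggregate over bonds, transform,
        and pool per-atom contributions into a molecule-level fingerprint."""
        # Each atom sums its own features with those of its bonded neighbours
        messages = atom_features + adjacency @ atom_features
        hidden = np.tanh(messages @ W_hidden)
        # Each atom writes a softmax-normalised contribution into the fingerprint
        logits = hidden @ W_out
        contrib = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        return contrib.sum(axis=0)

    # Toy "molecule": three atoms in a chain, eight features per atom
    adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    atoms = np.random.randn(3, 8)
    fp = neural_fingerprint_layer(atoms, adjacency, np.random.randn(8, 8), np.random.randn(8, 16))
    print(fp.shape)  # (16,) molecular fingerprint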

Proteomics

PconsC2 – Improved Contact Predictions Using the Recognition of Protein-Like Contact Patterns [web interface]

Here, a “deep random forest” with five layers is used to improve predictions of which residues (amino acids) in a protein are physically interacting with each other. This is useful for predicting the overall structure of the protein (a very hard problem).

Genomics

Gene expression

In modeling gene expression, the inputs are typically numerical values (integers or floats) estimating how much RNA is produced from a DNA template in a particular cell type or condition.

ADAGE – Analysis using Denoising Autoencoders of Gene Expression [github][gitxiv]

This is a Theano implementation of stacked denoising autoencoders for extracting relevant patterns from large sets of gene expression data, a kind of feature construction approach if you will. I have played around with this package quite a bit myself. The authors initially published a conference paper applying the model to a compendium of breast cancer (microarray) gene expression data, and more recently posted a paper on bioRxiv where they apply it to all available expression data (microarray and RNA-seq) on the pathogen Pseudomonas aeruginosa. (I understand that this manuscript will soon be published in a journal.)
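To give a flavour of the general approach (this is not the ADAGE code itself, which is Theano-based; the layer sizes, noise level and expression matrix below are arbitrary placeholders), a single-layer denoising autoencoder for expression data can be sketched in Keras like this:

    import numpy as np
    from tensorflow.keras import layers, models

    n_samples, n_genes, n_hidden = 200, 5000, 50   # placeholder dimensions

    inputs = layers.Input(shape=(n_genes,))
    encoded = layers.Dense(n_hidden, activation="sigmoid")(inputs)   # compact "pattern" layer
    decoded = layers.Dense(n_genes, activation="sigmoid")(encoded)
    autoencoder = models.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")

    # Denoising: corrupt the inputs (here by zeroing ~10% of values) but
    # ask the network to reconstruct the clean, zero-one-scaled expression values.
    X = np.random.rand(n_samples, n_genes).astype("float32")
    X_noisy = X * (np.random.rand(n_samples, n_genes) > 0.1)
    autoencoder.fit(X_noisy, X, epochs=10, batch_size=10, verbose=0)

    # The hidden-node activations are the learned features for each sample.
    encoder = models.Model(inputs, encoded)
    features = encoder.predict(X)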

Learning structure in gene expression data using deep architectures [paper]

This is also about using stacked denoising autoencoders for gene expression data, but there is no available implementation (as far as I could tell). Included here for the sake of completeness (or something.)

Gene expression inference with deep learning [github][paper]

This deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2/Theano.
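A minimal sketch of this kind of landmark-to-target regression (my own Keras version of the general idea, not the authors’ Pylearn2/Theano implementation; the hidden-layer sizes and the number of target genes are assumptions for illustration):

    from tensorflow.keras import layers, models

    n_landmark = 1000    # ~1,000 pre-selected landmark genes (inputs)
    n_target = 9000      # hypothetical number of target genes to infer

    model = models.Sequential([
        layers.Input(shape=(n_landmark,)),
        layers.Dense(3000, activation="tanh"),
        layers.Dropout(0.1),
        layers.Dense(3000, activation="tanh"),
        layers.Dropout(0.1),
        layers.Dense(n_target, activation="linear"),   # multi-output regression
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(X_landmark, Y_target, ...) with expression matrices of matching shapes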

Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model [paper]

The authors use stacked autoencoders to learn biological features in yeast from thousands of microarrays. They analyze the hidden layer representations and show that these encode biological information in a hierarchical way, so that for instance transcription factors are represented in the first hidden layer.

Predicting enhancers and regulatory regions

Here the inputs are typically “raw” DNA sequence, and convolutional networks (or layers) are often used to learn regularities within the sequence. Hat tip to Melissa Gymrek (http://melissagymrek.com/science/2015/12/01/unlocking-noncoding-variation.html) for pointing out some of these.
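Common to essentially all of the tools below is the preprocessing step of turning the DNA string into a one-hot matrix (one channel per base) that convolutional filters can slide over. A tiny helper like the following (my own sketch, not code from any of the packages) shows the encoding:

    import numpy as np

    def one_hot_dna(seq):
        """Encode a DNA string as a (length, 4) one-hot matrix with columns A, C, G, T."""
        index = {"A": 0, "C": 1, "G": 2, "T": 3}
        mat = np.zeros((len(seq), 4), dtype=np.float32)
        for i, base in enumerate(seq.upper()):
            if base in index:            # ambiguous bases such as N stay all-zero
                mat[i, index[base]] = 1.0
        return mat

    print(one_hot_dna("ACGTN"))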

DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences [github][gitxiv]

Made for predicting the function of non-protein-coding DNA sequence. Uses a convolution layer to capture regulatory motifs (i.e. single DNA snippets that control the expression of genes, for instance), and a recurrent layer (of the LSTM type) to try to discover a “grammar” for how these single motifs work together. Based on Keras/Theano.
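The hybrid architecture can be sketched in a few lines of Keras. The layer sizes below roughly follow the published DanQ configuration as I remember it, but treat this as a schematic of the convolution-then-LSTM idea rather than the released model:

    from tensorflow.keras import layers, models

    seq_len, n_outputs = 1000, 919        # 1 kb one-hot DNA in, chromatin-feature labels out

    model = models.Sequential([
        layers.Input(shape=(seq_len, 4)),
        layers.Conv1D(320, kernel_size=26, activation="relu"),   # motif-scanning filters
        layers.MaxPooling1D(pool_size=13),
        layers.Dropout(0.2),
        layers.Bidirectional(layers.LSTM(320)),                  # "grammar" over motif occurrences
        layers.Dropout(0.5),
        layers.Dense(925, activation="relu"),
        layers.Dense(n_outputs, activation="sigmoid"),           # multi-task binary predictions
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")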

Basset – learning the regulatory code of the accessible genome with deep convolutional neural networks [github][gitxiv]

Based on Torch, this package focuses on predicting the accessibility (or “openness”) of the chromatin – the physical packaging of the genetic information (DNA+associated proteins). This can exist in more condensed or relaxed states in different cell types, which is partly influenced by the DNA sequence (not completely, because then it would not differ from cell to cell.)

DeepSEA – Predicting effects of noncoding variants with deep learning–based sequence model [web server][paper]

Like the packages above, this one also models chromatin accessibility as well as the binding of certain proteins (transcription factors) to DNA and the presence of so-called histone marks that are associated with changes in accessibility. This piece of software seems to focus a bit more explicitly than the others on predicting how single-nucleotide mutations affect the chromatin structure. Published in a high-profile journal (Nature Methods).
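The variant-scoring step these models share can be expressed generically: predict the chromatin features for the reference sequence and for the same sequence carrying the mutation, then compare. The helper below is my own sketch of that idea (it assumes a trained Keras-style model and a one-hot encoder like the one sketched earlier), not the DeepSEA code:

    import numpy as np

    def variant_effect_scores(model, ref_seq, alt_seq, encode):
        """Per-feature change in predicted probability caused by a single-nucleotide change.

        model  -- any trained sequence model exposing a Keras-style predict() method
        encode -- function mapping a DNA string to a (length, 4) one-hot matrix
        """
        ref = encode(ref_seq)[np.newaxis, ...]   # add a batch dimension
        alt = encode(alt_seq)[np.newaxis, ...]
        return model.predict(alt)[0] - model.predict(ref)[0]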

DeepBind – Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning [code][paper]

This is from the group of Brendan Frey in Toronto, and the authors are also involved in the company Deep Genomics. DeepBind focuses on predicting the binding specificities of DNA-binding or RNA-binding proteins, based on experiments such as ChIP-seq, ChIP-chip, RIP-seq, protein-binding microarrays, and HT-SELEX. Published in a high-profile journal (Nature Biotechnology).

PEDLA: predicting enhancers with a deep learning-based algorithmic framework [code][paper]

This package is for predicting enhancers (stretches of DNA that can enhance the expression of a gene under certain conditions or in a certain kind of cell, often working at a distance from the gene itself) based on heterogeneous data from (e.g.) the ENCODE project, using 1,114 features altogether.

DEEP: a general computational framework for predicting enhancers

Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods (and several other papers applying various kinds of deep networks to regulatory region prediction) [code][one paper out of several]

Wyeth Wasserman’s group have made a kind of toolkit (based on the Theano tutorials) for applying different kinds of deep learning architectures to cis-regulatory element (DNA stretches that can modulate the expression of a nearby gene) prediction. They use a specific “feature selection layer” in their nets to restrict the number of features in the models. This is implemented as an additional sparse one-to-one linear layer between the input layer and the first hidden layer of a multi-layer perceptron.
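A rough numpy illustration of such a one-to-one feature-selection layer (my own sketch of the idea, not the group’s Theano code; all sizes and weights are placeholders): every input feature gets exactly one scalar weight, which during training would carry a sparsity (L1) penalty so that most features are effectively switched off before the first ordinary hidden layer.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_hidden = 500, 100

    x = rng.random(n_features)                  # one input example

    # One-to-one "feature selection" layer: an element-wise product rather than a
    # full matrix multiplication, so each feature has a single learnable weight.
    # With an L1 penalty on s, most entries would be driven towards zero.
    s = rng.normal(size=n_features)
    selected = x * s

    # The selected features then feed an ordinary fully connected hidden layer.
    W1 = rng.normal(size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    h1 = np.maximum(0.0, selected @ W1 + b1)    # ReLU hidden activations
    print(h1.shape)                             # (100,)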

Methylation

Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks [paper][web server]

This implementation uses a stacked autoencoder with a supervised layer on top to predict whether a certain type of genomic region called a “CpG island” (a stretch with an overrepresentation of the sequence pattern where a C nucleotide is followed by a G) is methylated. (Methylation is a chemical modification of DNA that can modify its function; for instance, methylation in the vicinity of a gene is often, but not always, related to the down-regulation or silencing of that gene.) This paper uses a network structure where the hidden layers in the autoencoder part have a much larger number of nodes than the input layer, so it would have been nice to read the authors’ thoughts on what the hidden layers represent.

Single-cell applications

CellCnn – Representation Learning for detection of disease-associated cell subsets [code][paper]

This is a convolutional network (Lasagne/Theano) based approach for “Representation Learning for detection of phenotype-associated cell subsets.” It is interesting because most neural network approaches for high-dimensional molecular measurements (such as those in the gene expression category above) have used autoencoders rather than convolutional nets.

Population genetics

Deep learning for population genetic inference [paper]

No implementation is available yet, but the paper says an open-source one will be made available soon.

Neuroscience

This is a harder category to populate because a lot of theoretical work on neural networks and deep learning has been intertwined with neuroscience. For example, recurrent neural networks have long been used for modeling things like working memory and attention. In this post I am really looking for pure applications of DL rather than theoretical work, although that is extremely interesting.

For more applied DL, I have found:

Deep learning for neuroimaging: a validation study [paper]

SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing [paper]

I’m sure there are many others. Maybe digging up some seminal neuroscience papers modeling brain areas and functions with different kinds of neural networks would be a worthy topic for a future blog post.

Some resources

  • DataTau seems like a worthwhile Reddit-like site devoted to all things data.
  • Foundations of data science [PDF link]. A (quite complete) draft of a rather mathematically oriented book on data science. I haven’t had time to read it yet but it looks interesting. It seems to put quite a lot of emphasis on understanding the quirks of high-dimensional spaces.
  • Techniques to improve the accuracy of your predictive models. Useful (1h 20 min long) video of a presentation given by Phil Brierley at an R user group meeting in Melbourne.

Three angles on crowd science

Some recently announced news that illuminate crowd science, advancing science by somehow leveraging a community, from three different angles.

  • The Harvard Clinical and Translational Science Center (or Harvard Catalyst) has “launched a pilot service through which researchers at the university can submit computational problems in areas such as genomics, proteomics, radiology, pathology, and epidemiology” via the TopCoder online competitive community for software development and digital creation. One recently started Harvard Catalyst challenge is called FitnessEstimator. The aim of the project is to “use next-generation sequencing data to determine the abundance of specific DNA sequences at multiple time points in order to determine the fitness of specific sequences in the presence of selective pressure. As an example, the project abstract notes that such an approach might be used to measure how certain bacterial sequences become enriched or depleted in the presence of antibiotics.” (The quotes are from a GenomeWeb article that is behind a paywall.) I think it’s very interesting to use online software development contests for scientific purposes, as a very useful complement to Kaggle competitions, where the focus is more on data analysis. Sometimes, really good code is important too!
  • This press release describes the idea of connectomics (which is very big in neuroscience circles now) and how the connectomics researcher Sebastian Seung and colleagues have developed a new online game, EyeWire, where players trace neural branches “through images of mouse brain scans by playing a simple online game, helping the computer to color a neuron as if the images were part of a three-dimensional coloring book.” The images are actual data from the lab of professor Winfried Denk. “Humans collectively spend 600 years each day playing Angry Birds. We harness this love of gaming for connectome analysis,” says Prof. Seung in the press release. (For similar online games that benefit research, see e.g. Phylo, FoldIt and EteRNA.)
  • Wisdom of Crowds for Robust Gene Network Inference is a newly published paper in Nature Methods, where the authors looked at a kind of community ensemble prediction method. Let’s back-track a bit. The Dialogue on Reverse Engineering Assessment and Methods (DREAM) initiative is a yearly challenge where contestants try to reverse engineer various kinds of biological networks and/or predict the output of some or all nodes in the network under various conditions. (If it sounds too abstract, go to the link above and check out what the actual challenges have been like.) The DREAM initiative is a nice way to check the performance of the currently touted methods in an unbiased way. In the Nature Methods paper, the authors show that “no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets” and that “Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.” So, in a very wisdom-of-crowds manner (as indeed the paper title suggests), it’s better to combine the predictions of all the contestants than just use the best ones (a toy sketch of this kind of aggregation follows below). It’s like taking a composite prediction of all Kaggle competitors in a certain contest and observing that this composite prediction was superior to all individual teams’ predictions. I’m sure Kaggle has already done this kind of experiment; does anyone know?
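Just to make the aggregation concrete, here is a toy rank-averaging sketch (entirely made-up numbers, not taken from the paper): each method ranks the same candidate interactions, and the community prediction is the average rank.

    import numpy as np

    # Three hypothetical inference methods each score five candidate interactions
    # (higher score = more confident the interaction is real).
    scores = np.array([
        [0.9, 0.1, 0.4, 0.7, 0.2],   # method A
        [0.6, 0.2, 0.8, 0.5, 0.1],   # method B
        [0.7, 0.3, 0.2, 0.9, 0.4],   # method C
    ])

    # Convert each method's scores to within-method ranks, then average the ranks.
    ranks = scores.argsort(axis=1).argsort(axis=1)   # 0 = least confident
    consensus = ranks.mean(axis=0)
    print(consensus.argsort()[::-1])                 # interactions ordered by community confidence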

Snippets

Some interesting nuggets gleaned from the web, without much editorial comment 🙂

Quick links

Links without a common theme

  • Are we ready for a true data disaster? Interesting Infoworld article that talks about possibilities for devastating “data spills” that could have effects as bad as the oil spill, or worse.
  • Monkey Analytics – a “web based computation tool” that lets users run R, Python and Matlab commands in the cloud.
  • Blogs and tweets could predict the future. New Scientist article that mentions Google’s study from last year where they tried to use search data to predict various economic variables. A lot of organizations have seized upon that idea, and lately we have seen examples such as Recorded Future, a company that attempts to “mine the future” using future-related online text sources. Google famously used the “predictions from search data” idea to predict flu outbreaks. One of the interesting things here, I think, is that people’s searches (which could be viewed naïvely as ways to obtain data) actually become data in themselves; data that can be used as predictors in statistical models. The Physics of Data is an interesting video where Google’s Marissa Mayer talks about this topic and a lot of other googly stuff (I don’t really get the name of the presentation though, despite her attempt to justify it in the beginning …).
  • Wikiposit aims to be a “Wikipedia of numerical data.” It aggregates thousands of public data sets (currently 110,000) into a single format and offers a simple API to access them. As of now, it only supports time series data, mostly from the financial domain.

Link roundup

Here are some interesting links from the past few weeks (or in some cases, months). I’m toying with the idea of just tweeting most of the links I find in the future and reserving the blog for more in-depth ruminations. We’ll see how it turns out. Anyway … here are some links!

Open Data

The collaborative filtering news site Reddit has introduced a new Open Data category.

Following the example of New York and San Francisco (among others), London will launch an open data platform, the London Data Store.

Personal informatics and medicine

Quantified Self has a growing (and open/editable) list of self-tracking and related resources. Notable among those is Personal Informatics, which itself tracks a number of resources – I like the term personal informatics and the site looks slick.

Nicholas Felton’s Annual Report 2009. “Each day in 2009, I asked every person with whom I had a meaningful encounter to submit a record of this meeting through an online survey. These reports form the heart of the 2009 Annual Report.” Amazing guy.

What can I do with my personal genome? A slide show by LaBlogga of Broader Perspectives.

David Ewing Duncan, “the experimental man”, has read Francis Collins’ new book about the future of personalized medicine (Language of Life: DNA and the Revolution in Personalized Medicine) and written a rather lukewarm review about it.

Duncan himself is involved in a very cool experiment (again) – the company Cellular Dynamics International has promised to grow him some personalized heart cells. Say what? Well, basically, they are going to take blood cells from him, “re-program” them back to stem-cell-like cells (induced pluripotent cells), and make those differentiate into heart cells. These will of course be a perfect genetic match for him.

Duncan has also put information about his SNPs (single-nucleotide polymorphisms; basically variable DNA positions that  differ from person to person) online for anyone to view, and promises to make 2010 the year when he tries to make sense of all the data, including SNP information, that he obtained about his body when he was writing his book Experimental Man. As he puts it, “Producing huge piles of DNA for less money is exciting, but it’s time to move to the next step: to discover what all of this means.”

HolGenTech – a smartphone-based system for scanning barcodes of products and matching them to your genome (!) – that is, it can tell you to avoid some products if you have had a genome scan which found you have a genetic predisposition to react badly to certain substances. I don’t think that the marketing video is done in a very responsible way (it says that the system “makes all the optimal choices for your health and well being every time you shop for your genome”, but this is simply not true – we know too little about genomic risk factors to be able to make any kind of “optimal” choices), but I had to mention it.

The genome they use in the above presentation belongs to the journalist Boonsri Dickinson. Here are some interviews she recently did with Esther Dyson and Leroy Hood, on personalized medicine and systems biology, respectively, at the Personalized Medicine World Conference in January.

Online calculators for cancer outcome and general lifestyle advice. These are very much in the spirit of The Decision Tree blog, through which I in fact found these calculators.

Data mining

Microsoft has patented a system for “Personal Data Mining”. It is pretty heavy reading and I know too little about patents to be able to tell how much this would actually prevent anyone from building various types of recommendation systems and personal data mining tools in the future; probably not to any significant extent?

OKCupid has a fun analysis about various characteristics of profile pictures and how they correlate to online dating success. They mined over 7000 user profiles and associated images. Of course there are numerous caveats in the data interpretation and these are discussed in the comments; still good fun.

A microgaming network has tried to curb data mining of their poker data. Among other things, bulk downloading of hand histories will be made impossible.

Link roundup

Gearing up into Christmas mode, so no proper write-up for these (interesting) links.

Personalized medicine is about data, not (just) drugs. Written by Thomas Goetz of The Decision Tree for Huffington Post. The Decision Tree also has a nice post about why self-tracking isn’t just for geeks.

A Billion Little Experiments (PDF link). An eloquent essay/report about “good” and “bad” patients and doctors, compliance, and access to your own health data.

Latent Semantic Indexing worked well for Netflix, but not for dating. MIT Technology Review writes about how the algorithms used to match people at Match.com (based on latent semantic indexing / SVD) are close to worthless. A bit lightweight, but a fun read.

A podcast about data mining in the mobile world. Featuring Deborah Estrin and Tom Mitchell.  Mitchell just recently wrote an article in Science about how data mining is changing: Mining Our Reality (subscription needed). The take-home message (or one of them) is that data mining is becoming much more real-time oriented. Data are increasingly being analyzed on the fly and used to make quick decisions.

How Zeo, the sleep optimizer, actually works. I mentioned Zeo in a blog post in August.

Link roundup

A roundup of some interesting links from the past few weeks.

Brian Mossop of the Decision Tree blog is embarking on a project to find out how much personal data is needed to stay healthy. He will use devices like the Zeo sleep coach and the Nike+ sportband to record his personal data and post updates about what he has found. He’s also promised a longer blog post after 30 days summarizing his experiences.

dnaSnips is a site that compares reports received by the same person from three different direct-to-consumer (DTC) genetic analysis services: 23andMe, deCODEme and Navigenics. Summarizing the experiment, the author feels that all three services give pretty accurate results. (link found via Anthony Fejes’ blog)

By now, there are probably few people who haven’t heard about the scientist who turned out to be call girl blogger Belle de Jour. I was intrigued to find that her Amazon wishlist contains hardcore statistical data analysis books like Chris Bishop’s Neural Networks for Pattern Recognition and An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. In fact, she’s previously blogged about her Bayesian theory of relationships, so she’s clearly no slouch when it comes to statistics and data mining.

Algorithms in healthcare

The Medical Quack has made many posts about the use of algorithms (which, of course, go hand in hand with data) in healthcare over the past months. Here are links to a few of them:

Staying Alive – Dr. Craig Freied Co-Founder of (Azyxxi) Microsoft Amalga And Hospital Safety (ABC 20/20)
(about the “data junkie” behind Microsoft’s Amalga system)

The Algorithm Report: HealthCare and Other Related Instances

Brave New Films – “United Wealth Care” – It’s In the Algorithms

Are We Ever Going to Get Some Algorithm Centric Laws Passed for Healthcare!

Health Care Insurers Suggest Algorithms and Business Intelligence solutions to provide health insurance solution
