Follow the Data

A data driven blog

List of deep learning implementations in biology

[Note: this list now lives at GitHub, where it will be continuously updated, so please go there instead!]

I’m going to start collecting papers on, and implementations of, deep learning in biology (mostly genomics, but other areas as well) on this page. It’s starting to get hard to keep up! For the purposes of this list, I’ll consider things like single-layer autoencoders, although not literally “deep”, to qualify for inclusion. The categorizations will by necessity be arbitrary and might be changed around from time to time.

In parallel, I’ll try to post some of these on gitxiv as well under the tag bioinformatics plus other appropriate tags.

Please let me know about the stuff I missed!


Neural graph fingerprints [github][gitxiv]

A convolutional net that can learn features which are useful for predicting properties of novel molecules; “molecular fingerprints”. The net works on a graph where atoms are nodes and bonds are edges. Developed by the group of Ryan Adams, who co-hosts the very good Talking Machines podcast.


PconsC2 – Improved Contact Predictions Using the Recognition of Protein-like Contact Patterns [web interface]

Here, a “deep random forest” with five layers is used to improve predictions of which residues (amino acids) in a protein are physically interacting with each other. This is useful for predicting the overall structure of the protein (a very hard problem.)


Gene expression

In modeling gene expression, the inputs are typically numerical values (integers or floats) estimating how much RNA is produced from a DNA template in a particular cell type or condition.

ADAGE – Analysis using Denoising Autoencoders of Gene Expression [github][gitxiv]

This is a Theano implementation of stacked denoising autoencoders for extracting relevant patterns from large sets of gene expression data, a kind of feature construction approach if you will. I have played around with this package quite a bit myself. The authors initially published a conference paper applying the model to a compendium of breast cancer (microarray) gene expression data, and more recently posted a paper on bioRxiv where they apply it to all available expression data (microarray and RNA-seq) on the pathogen Pseudomonas aeruginosa. (I understand that this manuscript will soon be published in a journal.)

Learning structure in gene expression data using deep architectures [paper]

This is also about using stacked denoising autoencoders for gene expression data, but there is no available implementation (as far as I could tell). Included here for the sake of completeness (or something.)

Gene expression inference with deep learning [github][paper]

This deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2/Theano.
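The underlying idea, inferring unmeasured target genes from a landmark panel, can be illustrated with plain multi-output linear regression (the paper's deep network replaces this linear map with something far more expressive; the data below is entirely synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 50 "landmark" genes, 10 "target" genes
# whose expression is (noisily) a linear function of the landmarks.
n_samples, n_landmarks, n_targets = 200, 50, 10
landmarks = rng.normal(size=(n_samples, n_landmarks))
true_map = rng.normal(size=(n_landmarks, n_targets))
targets = landmarks @ true_map + 0.1 * rng.normal(size=(n_samples, n_targets))

# Fit one least-squares map from the landmark panel to all targets at once.
coef, *_ = np.linalg.lstsq(landmarks, targets, rcond=None)
pred = landmarks @ coef
```

A deep network earns its keep when the landmark-to-target relationship is non-linear, which is the case the paper argues for.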

Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model [paper]

The authors use stacked autoencoders to learn biological features in yeast from thousands of microarrays. They analyze the hidden layer representations and show that these encode biological information in a hierarchical way, so that for instance transcription factors are represented in the first hidden layer.

Predicting enhancers and regulatory regions

Here the inputs are typically “raw” DNA sequence, and convolutional networks (or layers) are often used to learn regularities within the sequence. Hat tip to Melissa Gymrek for pointing out some of these.
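As a concrete starting point, all of these models begin by turning the raw sequence into a numeric matrix, typically a one-hot encoding with one column per base. A minimal sketch (the exact layout and the handling of unknown bases varies between packages):

```python
import numpy as np

def one_hot_dna(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix over A, C, G, T."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:          # unknown bases (e.g. N) stay all-zero
            mat[i, mapping[base]] = 1.0
    return mat

x = one_hot_dna("ACGTN")
```

A convolutional layer then slides position-weight-matrix-like filters over this matrix, which is why learned first-layer filters often resemble known regulatory motifs.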

DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences [github][gitxiv]

Made for predicting the function of non-protein-coding DNA sequence. Uses a convolution layer to capture regulatory motifs (i.e. single DNA snippets that control the expression of genes, for instance), and a recurrent layer (of the LSTM type) to try to discover a “grammar” for how these single motifs work together. Based on Keras/Theano.

Basset – learning the regulatory code of the accessible genome with deep convolutional neural networks [github][gitxiv]

Based on Torch, this package focuses on predicting the accessibility (or “openness”) of the chromatin – the physical packaging of the genetic information (DNA+associated proteins). This can exist in more condensed or relaxed states in different cell types, which is partly influenced by the DNA sequence (not completely, because then it would not differ from cell to cell.)

DeepSEA – Predicting effects of noncoding variants with deep learning–based sequence model [web server][paper]

Like the packages above, this one also models chromatin accessibility as well as the binding of certain proteins (transcription factors) to DNA and the presence of so-called histone marks that are associated with changes in accessibility. This piece of software seems to focus a bit more explicitly than the others on predicting how single-nucleotide mutations affect the chromatin structure. Published in a high-profile journal (Nature Methods).

DeepBind – Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning [code][paper]

This is from the group of Brendan Frey in Toronto, and the authors are also involved in the company Deep Genomics. DeepBind focuses on predicting the binding specificities of DNA-binding or RNA-binding proteins, based on experiments such as ChIP-seq, ChIP-chip, RIP-seq,  protein-binding microarrays, and HT-SELEX. Published in a high-profile journal (Nature Biotechnology.)

PEDLA: predicting enhancers with a deep learning-based algorithmic framework [code][paper]

This package is for predicting enhancers (stretches of DNA that can enhance the expression of a gene under certain conditions or in a certain kind of cell, often working at a distance from the gene itself) based on heterogeneous data from (e.g.) the ENCODE project, using 1,114 features altogether.

DEEP: a general computational framework for predicting enhancers

Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods (and several other papers applying various kinds of deep networks to regulatory region prediction) [code][one paper out of several]

Wyeth Wasserman’s group have made a kind of toolkit (based on the Theano tutorials) for applying different kinds of deep learning architectures to cis-regulatory element (DNA stretches that can modulate the expression of a nearby gene) prediction. They use a specific “feature selection layer” in their nets to restrict the number of features in the models. This is implemented as an additional sparse one-to-one linear layer between the input layer and the first hidden layer of a multi-layer perceptron.
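That feature selection layer boils down to a learnable per-feature scaling, i.e. a diagonal weight matrix, which a sparsity penalty can then drive toward zero for uninformative inputs. A toy sketch of the idea (not their actual code; the names and the penalty constant are made up):

```python
import numpy as np

def one_to_one_layer(x, w):
    """Per-feature scaling: input feature i is multiplied by its own weight
    w[i] before anything reaches the first ordinary hidden layer."""
    return x * w          # elementwise; equivalent to x @ np.diag(w)

def l1_penalty(w, lam=0.01):
    """Sparsity pressure on the per-feature weights: during training this
    term pushes the weights of uninformative features toward zero."""
    return lam * np.abs(w).sum()

x = np.array([0.5, -1.0, 2.0])     # three input features
w = np.array([1.0, 0.0, 0.3])      # feature 1 has been "selected away"
out = one_to_one_layer(x, w)
penalty = l1_penalty(w)
```

After training, the surviving non-zero weights directly indicate which input features the model considers informative.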


Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks [paper][web server]

This implementation uses a stacked autoencoder with a supervised layer on top to predict whether a certain type of genomic region called a “CpG island” (a stretch with an overrepresentation of a sequence pattern where a C nucleotide is followed by a G) is methylated. (Methylation is a chemical modification of DNA that can alter its function; for instance, methylation in the vicinity of a gene is often, but not always, related to the down-regulation or silencing of that gene.) This paper uses a network structure where the hidden layers in the autoencoder part have a much larger number of nodes than the input layer, so it would have been nice to read the authors’ thoughts on what the hidden layers represent.

Single-cell applications

CellCnn – Representation Learning for detection of disease-associated cell subsets

This is a convolutional network (Lasagne/Theano) based approach for “Representation Learning for detection of phenotype-associated cell subsets.” It is interesting because most neural network approaches for high-dimensional molecular measurements (such as those in the gene expression category above) have used autoencoders rather than convolutional nets.

Population genetics

Deep learning for population genetic inference [paper]

No implementation is available yet, but the paper says an open-source one will be made available soon.


Neuroscience

This is a harder category to populate because a lot of theoretical work on neural networks and deep learning has been intertwined with neuroscience. For example, recurrent neural networks have long been used for modeling e.g. working memory and attention. In this post I am really looking for pure applications of DL rather than theoretical work, although that is extremely interesting.

For more applied DL, I have found

Deep learning for neuroimaging: a validation study [paper]

SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing [paper]

I’m sure there are many others. Maybe digging up some seminal neuroscience papers modeling brain areas and functions with different kinds of neural networks would be a worthy topic for a future blog post.



ASCII Autoencoder

Joel and I were playing around with TensorFlow, the deep learning library that Google recently released and that you have no doubt heard of. We had put together a little autoencoder implementation and were trying to get a handle on how well it was working.

An autoencoder can be viewed as a neural network where the final layer, the output layer, is supposed to reconstruct the values that have been fed into the input layer, possibly after some distortion of the inputs (like forcing a fraction of them to be zero, dropout, or adding some random noise). When the inputs are corrupted in this way, it’s called a denoising autoencoder, and the purpose of adding the noise or dropout is to make the system discover more robust statistical regularities in the input data (there is some good discussion here).

An autoencoder often has fewer nodes in the hidden layer(s) than in the input and is then used to learn a more compact and general representation of the data (the code or encoding). With only one hidden layer and linear activation functions, the encoding should be essentially the same as one gets from PCA (principal component analysis), but non-linear activation functions (e.g. sigmoid and tanh) will yield different representations, and multi-layer or stacked autoencoders will add a hierarchical aspect.
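To make this concrete, here is a minimal single-hidden-layer autoencoder written directly in numpy with linear activations, which should recover a PCA-like subspace on toy low-rank data (an illustrative sketch, not the TensorFlow version we used):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 5 dimensions that really live on a 2-D subspace.
latent = rng.normal(size=(100, 2))
basis = rng.normal(size=(2, 5))
X = latent @ basis

W_enc = rng.normal(scale=0.1, size=(5, 2))   # input -> hidden (the "code")
W_dec = rng.normal(scale=0.1, size=(2, 5))   # hidden -> reconstruction
lr = 0.05

for _ in range(1000):
    H = X @ W_enc                  # encode (linear activation)
    X_hat = H @ W_dec              # decode / reconstruct
    err = X_hat - X
    # Gradient steps on the squared reconstruction error.
    grad_dec = H.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

Swapping in a non-linear activation (and corrupting X before encoding) turns this into the denoising variant discussed above.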

Some references on autoencoders:

Ballard (1987) – Modular learning in neural networks

Andrew Ng’s lecture notes on sparse autoencoders

Vincent et al (2010) – Stacked denoising autoencoders

Tan et al (2015) – ADAGE analysis of publicly available gene expression data collections illuminates Pseudomonas aeruginosa-host interactions

Anyway, we were trying some different parametrizations of the autoencoder (its training performance can depend quite a lot on how the weights are initialized, the learning rate and the number of hidden nodes) and felt it was a bit boring to just look at a single number (the reconstruction error). We wanted to get a feel for how training was progressing across the input data matrix, so we made the script output, every 1,000 rounds of training, a colored block of text in the terminal where the background color represents the binned absolute difference between the target value and the reconstructed value. The “best” bin (bin 0) is dark green and represents that the reconstruction is very close to the original input; the “bad” bins have reddish colors. If the data point has been shifted to a new bin in the last 1,000 rounds (i.e. the reconstruction has improved or deteriorated noticeably), a colored digit indicating the new bin is shown in the foreground. (This makes more sense when you actually look at it.) We only show the first 75 training examples and the first 75 features, so if your data set is larger than that you won’t see all of it.
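The coloring trick itself is simple. A stripped-down sketch of the idea, with made-up bin thresholds and ANSI color codes (not necessarily the ones we used):

```python
import numpy as np

# One ANSI background color per error bin: dark green for the best bin,
# shading toward red for the worst.
BIN_EDGES = [0.05, 0.1, 0.2, 0.5]       # abs(reconstruction error) cutoffs
BG_COLORS = [42, 102, 43, 41, 101]      # ANSI background codes, best to worst

def error_bin(err):
    """Map an absolute reconstruction error to a bin index (0 = best)."""
    return int(np.searchsorted(BIN_EDGES, err))

def render(errors, prev_bins=None):
    """Return one ANSI-colored terminal row per matrix row; a digit is drawn
    wherever the bin changed since the previous rendering."""
    rows = []
    for i, row in enumerate(errors):
        cells = []
        for j, err in enumerate(row):
            b = error_bin(err)
            changed = prev_bins is not None and prev_bins[i][j] != b
            ch = str(b) if changed else " "
            cells.append(f"\x1b[{BG_COLORS[b]}m{ch}\x1b[0m")
        rows.append("".join(cells))
    return "\n".join(rows)

errs = np.abs(np.array([[0.01, 0.3], [0.15, 0.8]]))
print(render(errs))
```

Calling render repeatedly with the previous call's bins makes improving and deteriorating cells pop out as digits.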

The code is on GitHub. There are command-line switches for controlling the number of hidden nodes, learning rate, and other such things. There are probably many aspects that could be improved but we thought this was a fun way to visualize the progress and see if there are any regions that clearly stand out in some way.

Here are a few screenshots of an example execution of the script.

As the training progresses, the overall picture gets a bit greener (the reconstruction gets closer to the input values) and the reconstructions get a bit more stable (i.e. not as many values have a digit on them to indicate that the reconstruction has improved or deteriorated). The values under each screenshot indicate the number of training cycles and the mean squared reconstruction error.

Watson hackathon in Uppsala

Today I spent most of the day trying to grok IBM Watson’s APIs during a hackathon (Hackup) in Uppsala, where the aim was to develop useful apps using those APIs. Watson is, of course, famous for being good at Jeopardy and for being at the center of IBM’s push into healthcare analytics, but I hadn’t spent much time before this hackathon checking out exactly what is available to users now in terms of APIs etc. It turned out to be a fun learning experience and I think a good time was had by all.

We used IBM’s Bluemix platform to develop apps. As the available Watson APIs (including the Alchemy APIs that are now part of Bluemix) are mostly focused on natural language analysis (rather than generic classification and statistical modeling), our team – consisting of me and two other bioinformaticians from Scilifelab – decided to try to build a service for transcribing podcasts (using the Watson Speech To Text API) in order to annotate and tag them using the Alchemy APIs for keyword extraction, entity extraction etc. This, we envisioned, would allow podcast buffs to identify in which episode of their favorite show a certain topic was discussed, for instance. Eventually, after ingesting a large number of podcast episodes, the tagging/annotation might also enable things like podcast recommendations and classification, as podcasts could be compared to each other based on themes and keywords. This type of “thematic mapping” could also be interesting for following a single podcast’s thematic development.

As is often the case, we spent a fair amount of time on some supposedly mundane details. Since the speech-to-text conversion was relatively slow, we tried different strategies to split the audio files and process them in parallel, but could not quite make it work. Still, we ended up with a (Python-based) solution that was indeed able to transcribe and tag podcast episodes, but it’s still missing a front-end interface and a back-end database to hold information about multiple podcast episodes.

There were many other teams who developed cool apps. For instance one team made a little app for voice control of a light switch using a Raspberry Pi, and another team had devised an “AI shopper” that will remind you to buy stuff that you have forgotten to put on your shopping list. One entry was a kind of recommendation system for what education you should pursue, based on comparing a user-submitted text against a model trained on papers from people in different careers, and another one was an app for quantifying the average positive/negative/neutral sentiments found in tweets from different accounts (e.g. NASA had very positive tweets on average whereas BBC News was fairly negative).

All in all, a nice experience, and it was good to take a break from the Stockholm scene and see what’s going on in my old home town. Good job by Jason Dainter and the other organizers!

GitXiv – collaborative open source computer science

Just wanted to highlight GitXiv, an interesting new resource that combines paper pre-print publication, implementation code and a discussion forum in the same space. The About page explains the concept well:

In recent years, a highly interesting pattern has emerged: Computer scientists release new research findings on arXiv and just days later, developers release an open-source implementation on GitHub. This pattern is immensely powerful. One could call it collaborative open computer science (cocs).

GitXiv is a space to share links to open computer science projects. Countless Github and arXiv links are floating around the web. It’s hard to keep track of these gems. GitXiv attempts to solve this problem by offering a collaboratively curated feed of projects. Each project is conveniently presented as arXiv + Github + Links + Discussion. Members can submit their findings and let the community rank and discuss it. A regular newsletter makes it easy to stay up-to-date on recent advancements. It’s free and open.

The feed contains a lot of yummy research on things like deep learning, natural language processing and graphs, but GitXiv is not restricted to any particular computer science areas – anything is welcome!

Neural networks hallucinating text

I’ve always been a sucker for algorithmic creativity, so when I saw the machine generated Obama speeches, I immediately wanted to try the same method on other texts. Fortunately, that was easily done by simply cloning the char-rnn repository by Andrej Karpathy, which the Obama-RNN was based on. Andrej has also written a long and really very good introduction to recurrent neural networks if you want to know more about the inner workings of the thing.

I started by downloading an archive of all the posts on this blog and trained a network with default parameters according to the char-rnn instructions. In the training phase, the network tries to learn to predict the next character in the text. Note that it does not (initially) know anything about words or sentences, but learns about those concepts implicitly with training. After training, I let the network hallucinate new blog posts by sampling from the network state (this is also described on the GitHub page). The network can be “primed” with a word or a phrase, and a temperature parameter controls how conservative or daring the network should be when generating new text. Essentially a low temperature will give a repetitive output endlessly rehashing the same concepts (namely, the most probable ones based on the training data) while a high temperature will output more adventurous stuff such as weird new “words” and sometimes imaginary hyperlinks (if links were included in the input).
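The temperature is just a divisor applied to the network’s output scores before the softmax, so a small numpy sketch shows exactly how it sharpens or flattens the character distribution:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to a probability distribution over characters.
    Low temperature sharpens it toward the most likely character (repetitive,
    conservative text); high temperature flattens it (adventurous text)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()              # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]                # hypothetical scores for 3 characters
cold = softmax_with_temperature(logits, 0.2)   # nearly always picks char 0
hot = softmax_with_temperature(logits, 5.0)    # close to uniform
```

Sampling the next character from `cold` gives the endlessly rehashed output shown at temperature 0.2 below; sampling from `hot` gives the invented words.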

Here are a few samples from the Follow the Data RNN. You’ll be the judge of how well it captures the blog’s spirit.

Temperature 1:

predictive pullimation; in personal ining the find of R crition Large Liforrsion Sachelity Complents have bio carristed in software predictions on yenogic data is to first general revalumated by people of cholescologion, slides of programs where pluding this onlines it a hagn to be step-nature for genomes with onlowed data data set that from the system that me personito of the being prediction of pacen”</em>.


perhoul what yeverope informatation will likely moveling in langup genetic paring how health reed masses disease out its mames by (I’ve not, one would PDAscoprometain, this proop whothem the astimate all talks be a single wike this blogpors high tho out that company around and brarting provide nest potential cervel gods a sall sult of they are it is now laned to write can be used metagenome have not probable to the imail, the pettoling in an, described

Temperature 0.2:

articles are and self-tracking and company that the command the starts to see that the results of the prediction in the prediction contests about the prediction contests in the contest at the command the data sets of statistics and prediction contests about the presentation of the data set of both say that the results of the presentation is an interesting problems in the concept of the prediction data sets of both samples of data sets of the problem is that is a predictive analysis and some sequences and sense for the talk may be a set of the paper and the sequence profiles are an experience and probably the statistical analysis of analysis statistics and provides and prediction contests in the contests and company that the command the starts of the paper in the prediction probably and better that all the company that the (etc …)

OK, that was not great. Let’s raise the temperature to 0.7 and seed the net with “deep learning”:

maker the runion following how build on our related in atther are algorithms in scientific movement is sequence of the paper in a lot more and results of data posting to see in genetal sequence of discussion in the software completely (with Manalay will be all cantrusualing indelation and know possistration and simplement caper is the lately and the lately of the entroagh the clinical replot on what thoughts in the related network. The currently in the <a href=””>here</a&gt;.

In such as well be a new up some notes stall more results to release which can be used on the process of biolity (or because the problem is to have been every booking the clinical resomm that build of the paper is on the web company relation to me that it up a preteors, machine learning teams. It are blogging index, the exterment). I’ve lidel sentiment methods to call contain variant public.

The interview quarthyre of they asseance in the prediction methods.
A  multiple possible to such fut to this sequence data sets are both the facural statistics of courses where can be complenes and inpreserver the allow predictive results are and more scale some the paper is that the build betted examples are signity.</li>

You’ll note that the net has invented many interesting new words, which is a consequence of the fact that it is generating one character at a time.

For my next experiment, I couldn’t resist putting the collected short stories of JG Ballard (one of my favourite writers) into the network. Here, I used 256 nodes, but perhaps should have trained a bit longer as the results are still a bit shaky (although they have a distinctly Ballardian feel at times). In particular, I am missing words like “canopy”, “perimeter”, and of course the drained swimming pools. Here are some samples:

US DAY, counterlahes to face the films, one programme demolishiking science prison erecting out of the solar aircraft on my mind.

Television made him for itself his pieces of published telescope.

A simple motion of Kennedy heard your legs through the suit within a fortuna from Beach. Angel and London Hinton’s gas. A single tenant of the succession of all the transition limbs ultil to me that three overlooking the space–craft has been insilent.

An assocations of 112–hour long water in front of them all sweak, as if he was saying, his coming at the statue. Its still bays rated like a large physician from the watch–tobe. The ancient birds were simply to embrace her for a long wholly self–conscious police. Nothing two days, children before the cities, Charles Wallari Street calps our overhead, efforts gives up the drums.

Ward shook his head sadly. ‘I don’t felt some due.’

Mongable was like my people seems my fear–spinding beach–car. Yet, but you an overhead time, they’re going to do with the summer seems only in trister held of? I didn’t wasn’t already get to do. If the prayer has questioned much, however, as soon as selfables exhilaration of peaced Franz. Laster had lost impuly as wousen much wave. Perhaps if they meaning on flatper. ‘Let’s go outstands.’ He listened up in the lounge, a acut shifting in and out of the grass and hands. Ryan stood by her mask velvet, neutral on the darkened bedroom. Waist doldent helicopters whisky to stopped the chauffeur on to the water. Simulator was convinced the sculptor of the suffiring moonlight.

At the like the woman stopped and he could pany when it raised the emptines and lined themselves from the stairs, hoods would soon realize the tiled sand–rays of them, like the fourth houses were hidden from the streets, and then spotted in a small features of path Aselage.

At three of the speeding expirsm, Traven returned to me. Once, as I looked at his wife and shields. All afternoons the bel

Or with the max temperature:

Wind’s speaking, Halloway began to climb into the distant. The Sign of the crest of his treasure, like the hidden profile on one of the clocks, rost in the blockhen flourish.

For years before the soared assemble, their fragmented pen with a jewelly confused sector with Bayliss and London on either side with slow he had learned off the sleep. Surprised by the display’s optical relief than there. Accorring you, over–grey manner, the idleg to make on the abandoned gaps of magnine air, welephoring on their rough and owners during myself.

I didn’t have ask pressed to its professional record libbariards and reverse, creatively and given a single–washing suit that had slumped.

As he watched him he left her away from him, presumably slightly old but prismatic apparation of the air, turned back among the beaches.

As we sat down, Slept asleep again, and was sufficioned about in patiently. Taking his fingers, pantic season flew overrounds and satisfied him to lie–accaling with noise. As she sitused, slung at his feet in custom of letters turned on never at falles. Gunfir felt in one hand, he moved away at my carded grass. Her personante stepped off his chest, seemed to have passed. He was in advance with a survoinnes’ crawl. Quickly he regretted to receive while Mrs Osmond – survived the softly steps, pushing abandone in the stone bird. Drill as the houses opposite, whose side–struggle of down–thems alerts prepared to take part of the sea.

Ryker quickly waved him down to his mouth overlooking the raw building, and watched him into a valley behind it. When an interval of hundreds of its impact the Vicionic pores, however, first waiting for my moving palls, for the visual minor curved and shakes test of the town centre. Averything again even he was out of to the building, and by the timewinds like metal landmarsh fingers technique unlimited silk like the few ratts of heels, the unique light from the wheels and speedback on the wound, and continue the s

Using a low temperature (0.2 in this case) we get a lot of stations, statues and chauffeurs:

(the station we have been able to see me to the sound of the station and an essentian of some sort of conscious the sense of a starting concentration of the station and his last few days and the sea and the sea and the sea and set out the station and a series of conscious of the sea. The computers the statue was the special morning the station that had been a million disasted and set off in the corner of the car park. The statue had been transformed into a series of space complex and anti–games in the sun. The first stage of the station was a small car park in the sunlight. The chauffeur was being seen through the shadows of the sky. The bony skin was still standing at the store and started to stand up and down the staircase. He was aware of the chauffeur, and the car park was almost convinced that the station was a small conclusion of the station and a series of experiments in the sense of the sea. The station was almost to himself. He had been a sudden international art of the station that the station was the only way of world was a series of surface. An area of touch with a strange surge of fresh clock and started to stay here to the surrounding bunker. He stood up and stared at her and watched the statue for the first time the statue for the first time and started to stand up and down the stairway to the surface of the stairway. He was about to see me with a single flower, but he was aware of the continuous sight of the statue, and was suffered by the stars of the statue for the first time and the statue for the first time of the sight of the statue in the centre of the car, watching the shore like a demolish of some pathetic material.

That’s it for this time!

Genomics Today and Tomorrow presentation

Below is a Slideshare link/widget to a presentation I gave at the Genomics Today and Tomorrow event in Uppsala a couple of weeks ago (March 19, 2015).

I spoke after Jonathan Bingham of Google Genomics and talked a little bit about how APIs, machine learning, and what I call “querying by dataset” could make life easier for bioinformaticians working on data integration. In particular, I gave examples of a few types of queries that one would like to be able to do against “all public data” (slides 19-24).

Not long after, I saw this preprint (called “Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees”) that seems to provide part of the functionality that I was envisioning – in particular, the ability to query public sequence repositories by content (using a sequence as a query), rather than by annotation (metadata). The beginning of the abstract goes like this:

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments.
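The building block here, a Bloom filter over the k-mers observed in one experiment (the tree then arranges many such filters so whole subtrees of experiments can be ruled out at once), can be sketched at the single-filter level. This toy version is not the authors’ implementation, and all constants are arbitrary:

```python
import hashlib

class KmerBloomFilter:
    """Toy Bloom filter holding the k-mers of one sequencing experiment.
    Membership queries can give false positives but never false negatives."""

    def __init__(self, size=10_000, num_hashes=3, k=5):
        self.bits = [False] * size
        self.size = size
        self.num_hashes = num_hashes
        self.k = k

    def _positions(self, kmer):
        # Derive several bit positions per k-mer from a salted hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{kmer}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add_sequence(self, seq):
        for j in range(len(seq) - self.k + 1):
            for pos in self._positions(seq[j:j + self.k]):
                self.bits[pos] = True

    def might_contain(self, seq):
        """True if every k-mer of the query may be present in the experiment."""
        return all(
            all(self.bits[p] for p in self._positions(seq[j:j + self.k]))
            for j in range(len(seq) - self.k + 1)
        )

bf = KmerBloomFilter()
bf.add_sequence("ACGTACGTGG")
```

A query sequence is only pursued into a subtree whose filter answers "maybe", which is what makes terabase-scale search tractable.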

Some interesting new algorithms

Just wanted to note down some new algorithms that I came across for future reference. Haven’t actually tried any of these yet.

  • LIBFFM, a library for field-aware factorization machines. Developed by a group at National Taiwan University, this technique has been used to win two Kaggle click-through competitions (Criteo, Avazu).
  • Random Bits Regression, a “strong general predictor for big data” (paper). “This method first generates a large number of random binary intermediate/derived features based on the original input matrix, and then performs regularized linear/logistic regression on those intermediate/derived features to predict the outcome.”
  • BIDMach, a CPU and GPU-accelerated machine learning library that shows some amazing benchmark results compared to Spark, Vowpal Wabbit, scikit-learn etc.
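The Random Bits Regression recipe is simple enough to sketch end to end: threshold random projections of the inputs to get binary features, then fit a regularized linear model on them (ridge regression here; the data and constants are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression problem with a mild non-linearity.
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

# Step 1: many random binary features (random projection, random threshold).
n_bits = 500
proj = rng.normal(size=(5, n_bits))
thresholds = rng.normal(size=n_bits)
Z = (X @ proj > thresholds).astype(float)

# Step 2: ridge regression on the binary features (closed form).
lam = 1.0
A = Z.T @ Z + lam * np.eye(n_bits)
w = np.linalg.solve(A, Z.T @ y)
pred = Z @ w
```

The random binary features act as cheap, fixed basis functions, so the linear model on top can capture non-linear structure without any feature engineering.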

And another one, which is not as new but which I wanted to highlight because of a nice blog post about interactions and generalization by David Chudzicki.

Some interesting company announcements

  • Algorithmia, the open marketplace for algorithms, is now live. I find it an interesting concept: to build a community around algorithm development, where users can build on each other’s algorithms and make them available as a web service.
  •  SolveBio is also in public beta as of today. Please refer to an older post for some hints on how to use the API.
  • ForecastThis just launched their DSX platform for automated model testing and building. It looked pretty impressive in a small trial with a tricky dataset I have: a large number of models were constructed and run on the dataset, with a set of metrics reported for each, and many of those looked better than the ones I had from e.g. random forest. However, the trial version of the platform does not include access to the actual models, so I wasn’t able to see the details.
  • Seven Bridges Genomics will release the first graph-based human reference genome and related tools. There has been a lot of talk about graph-based genome references (a graph is a natural way to represent the many kinds of natural variation found among human genomes), but the tools to handle them have been lacking.

Peak meetup in Stockholm?

This month (March 2015) is simply packed with data related meetups in Stockholm. Are we witnessing the peak of interest or can it be sustained? Nice to see that there seems to be a large enough group of enthusiasts to sustain these meetings.

These are the meetups that I am aware of (and in some cases have been to) in Stockholm this month:

March 10. Kickoff for the new Stockholm Deep Learning group. Already held; there is a recording here.

March 11. Google BigQuery 101 with Jordan Tigani at Google’s Stockholm office. Already held.

March 18. Stockholm Hadoop User Group meeting: Druid at Criteo and History of Hadoop at Spotify.

March 18. Big Data, Stockholm.

March 23. Stockholm R useR Group meeting: R in genomics. (I am one of the organizers)

March 23. Stockholm Machine Learning / Spotify Tech Group: Large-scale Machine Learning with Spark and Flink

March 24. Stockholm Natural Language Processing group meeting: Word Embedding, from theory to practice at Gavagai.

March 26. Apache Spark Show and Tell.

That’s eight meetups, without me even really looking. Perhaps there are others too? I hope Stockholm can sustain the momentum (and perhaps that it spreads to other cities in the region.)

Notes on genomics APIs #3: SolveBio

This is the third in a short series of posts with notes on different genomics APIs. The first post, which was about the One Codex API, can be found here, and the second one, about Google Genomics, can be found here.

SolveBio “delivers the critical reference data used by hospitals and companies to run genomic applications”, according to their web page. They focus on clinical genomics and on helping developers who need to access various data sources in a programmatic way. Their curated data library provides access to (as of February 2015) “over 300 datasets for genomics, proteomics, literature annotation, variant-disease relationships, and more.” Some examples of those datasets are the ClinVar disease gene database from NIH, the Somatic Mutations dataset from The Cancer Genome Atlas, and the COSMIC catalogue of somatic mutations in cancer.

SolveBio offers a RESTful API with Python and Ruby clients already available and an R client under development. The Getting Started Guide really tells you most of what you need to know to use it, but let’s try it out here on this blog anyway!

You should, of course, start by signing up for a free account. After that, it’s time to get the client. I will use the Python one in this post. It can be installed by giving this command:

curl -skL | bash

You can also install it with pip.

Now you will need to log in. This will prompt you for the email and password you registered when signing up.

solvebio login

At this point you can view a useful tutorial by giving solvebio tutorial. The tutorial explains the concept of depositories, which are versioned containers for data sets. For instance (as explained in the docs), there is a ClinVar depository which (as of version 3.1.0) has three datasets: ClinVar, Variants, and Submissions. Each dataset within a depository is designed for a specific use-case. For example, the Variants dataset contains data on genomic variants, and supports multiple genome builds.

Now start the interactive SolveBio shell. This shell (in case you followed the instructions above) is based on IPython.


The command Depository.all() will show the list of available depositories.

In a similar way, you can view all the data sets with Dataset.all(). Type Dataset.all(latest=True) to view only the latest additions.

To work with a data set, you need to ‘retrieve’ it with a command like:

ds = Dataset.retrieve('ClinVar/3.1.0-2015-01-13/Variants')

It is perfectly possible to leave out the version of the data set: ds = Dataset.retrieve('ClinVar/Variants') but that is bad practice from a reproducibility viewpoint and is not recommended, especially in production code.
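To make the path convention concrete, here is a small self-contained sketch of how a full dataset path decomposes into depository, version, and dataset. The helper function is purely illustrative (it is not part of the SolveBio client); only the path format itself comes from the examples above.

```python
# Illustrative only: the parse_dataset_path helper is hypothetical,
# not part of the SolveBio client. It just shows the structure of
# dataset paths of the form "depository/version/dataset".
def parse_dataset_path(path):
    parts = path.split('/')
    if len(parts) == 3:
        depository, version, dataset = parts
    elif len(parts) == 2:
        # Version omitted: the latest version is used implicitly,
        # which is what hurts reproducibility.
        depository, dataset = parts
        version = None
    else:
        raise ValueError("expected 'depository[/version]/dataset'")
    return {'depository': depository, 'version': version, 'dataset': dataset}

print(parse_dataset_path('ClinVar/3.1.0-2015-01-13/Variants'))
# {'depository': 'ClinVar', 'version': '3.1.0-2015-01-13', 'dataset': 'Variants'}
```

The three-part form pins the exact version, which is why it is the one to use in production code.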

Now we can check which fields are available in the ds object (representing the data set we selected) with ds.fields().


There are fields for things like alternate alleles for the variant in question, sources of clinical information on the variant, the name of any gene(s) overlapping the variant, and the genomic coordinates for the variant.

You can create a Python iterator for looping through all the records (variants) using ds.query(). To view the first variant, type ds.query()[0]. This will give you an idea of how each record (variant) is described in this particular data set. In practice, you will almost always want to filter your query according to some specified criteria. So for example, to look for known pathogenic variants in the titin (TTN) gene, you could filter as follows:

ttn_vars = ds.query().filter(clinical_significance='Pathogenic', gene_symbol_hgnc='TTN')

This will give you an iterator with a bunch of records (currently 18) that you can examine in more detail.
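The keyword arguments to filter() act as exact-match conditions that are combined with AND. If it helps to see those semantics spelled out, here is a minimal self-contained sketch, with plain Python dicts standing in for ClinVar records; the field names are taken from the query above, but the records and the filter_records helper are made up for illustration.

```python
# Toy records standing in for ClinVar variants (made-up data).
records = [
    {'gene_symbol_hgnc': 'TTN', 'clinical_significance': 'Pathogenic'},
    {'gene_symbol_hgnc': 'TTN', 'clinical_significance': 'Benign'},
    {'gene_symbol_hgnc': 'BRCA1', 'clinical_significance': 'Pathogenic'},
]

def filter_records(records, **conditions):
    # Mimic exact-match, AND-combined filter semantics:
    # a record is kept only if every condition matches.
    return [r for r in records
            if all(r.get(k) == v for k, v in conditions.items())]

hits = filter_records(records,
                      clinical_significance='Pathogenic',
                      gene_symbol_hgnc='TTN')
print(len(hits))  # 1
```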

If you want to search for variants in some specified genomic region that you have identified as interesting, you can do that too, but only for some data sets. In this case it turns out that we can do it with this version of the ClinVar variant data set, because it is considered a “genomic” data set, which we can see because the command ds.is_genomic returns True. (Some of the older versions return False here.)

ds.query(genome_build='GRCh37').range('chr3', 22500000, 23000000)

Note that you can specify a genome build in the query, which is very convenient.
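A range query like this returns the variants whose coordinates overlap the given interval on the given chromosome. The overlap test itself is simple; as a self-contained illustration (with made-up variant coordinates and a hypothetical in_range helper, not the real client):

```python
# Made-up variant records with GRCh37-style coordinates.
variants = [
    {'chromosome': 'chr3', 'start': 22510000, 'stop': 22510100},
    {'chromosome': 'chr3', 'start': 24000000, 'stop': 24000050},
    {'chromosome': 'chr7', 'start': 22600000, 'stop': 22600010},
]

def in_range(variants, chromosome, start, stop):
    # Keep variants on the given chromosome whose interval
    # overlaps [start, stop].
    return [v for v in variants
            if v['chromosome'] == chromosome
            and v['start'] <= stop and v['stop'] >= start]

hits = in_range(variants, 'chr3', 22500000, 23000000)
print(len(hits))  # 1
```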

Moving on to a different depository and data set, we can search for diabetes-related variants as defined via genome wide association studies with something like the following:

ds = Dataset.retrieve('GWAS/1.0.0-2015-01-13/GWAS')
ds.fields() # Check out which fields are available
ds.query().filter(phenotype='diabetes') # Also works with "Diabetes"
ds.query().filter(journal='science', phenotype='diabetes') # Only look for diabetes GWAS published in Science

Also, giving a command like Dataset.retrieve('GWAS/1.0.0-2015-01-13/GWAS').help() will open a web page describing the dataset in your browser.

Post Navigation