Follow the Data

A data driven blog

Swedish school fires and Kaggle open data

For quite a while now, I have been rather mystified and intrigued by the fact that Sweden has one of the highest rates of school fires due to arson. According to the Division of Fire Safety Engineering at Lund University, “Almost every day between one and two school fires occur in Sweden. In most cases arson is the cause of the fire.” This is a lot for a small country with less than 10 million inhabitants, and the associated costs can be up to a billion SEK (around 120 million USD) per year.

It would be hard to find a suitable dataset to address the question why arson school fires are so frequent in Sweden compared to other countries in a data-driven way – but perhaps it would be possible to stay within a Swedish context and find out which properties and indicators of Swedish towns (municipalities, to be exact) might be related to a high frequency of school fires?

To answer this question, I  collected data on school fire cases in Sweden between 1998 and 2014 through a web site with official statistics from the Swedish Civil Contingencies Agency. As there was no API to allow easy programmatic access to schools fire data, I collected them by a quasi-manual process, downloading XLSX report generated from the database year by year, after which I joined these with an R script into a single table of school fire cases where the suspected reason was arson. (see Github link below for full details!)

To complement  these data, I used a list of municipal KPI:s (key performance indicators) from 2014, that Johan Dahlberg put together for our contribution in Hack for Sweden earlier this year. These KPIs were extracted from Kolada (a database of Swedish municipality and county council statistics) by repeatedly querying its API.

There is a Github repo containing all the data and detailed information on how it was extracted.

The open Kaggle dataset lives at So far, the process of uploading and describing the data has been smooth. I’ve learned that each Kaggle dataset has an associated discussion forum, and (potentially) a bunch of “kernels”, which are analysis scripts or notebooks in Python, R or Julia. I hope that other people will contribute script and analyses based on these data. Please do if you find this dataset intriguing!

Cumulative biology and meta-analysis of gene expression data

In talks that I have given in the past few years, I have often made the point that most of genomics has not been “big data” in the usual sense, because although the raw data files can often be large, they are often processed in a more or less predictable way until they are “small” (e.g., tables of gene expression measurements or genetic variants in a small number of samples). This in turn depends on the fact that it is hard and expensive to obtain biological samples, so in a typical genomics project the sample size is small (from just a few to tens or in rare cases hundreds or thousands) while the dimensionality is large (e.g. 20,000 genes, 10,000 proteins or a million SNPs). This is in contrast to many “canonical big data” scenarios where one has a large number of examples (like product purchases) with a small dimensionality (maybe the price, category and some other properties of the product.)

Because of these issues, I have been hopeful about using published data on e.g. gene expression based on RNA sequencing or on metagenomics to draw conclusions based on data from many studies. In the former case (gene expression/RNA-seq) it could be to build classifiers for predicting tissue or cell type for a given gene expression profile. In the latter case (metagenomics/metatranscriptomics, maybe even metaproteomics) it could also be to build classifiers but also to discover completely new varieties of e.g. bacteria or viruses from the “biological dark matter” that makes up a large fraction of currently generated metagenomics data. These kinds of analysis are usually called meta-analysis, but I am fond of the term cumulative biology, which I came across in a paper by Samuel Kaski and colleagues (Toward Computational Cumulative Biology by Combining Models of Biological Datasets.)

Of course, there is nothing new about meta-analysis or cumulative biology – many “cumulative” studies have been published about microarray data – but nevertheless, I think that some kind of threshold has been crossed when it comes to really making use of the data deposited in public repositories. There has been development both in APIs allowing access to public data, in data structures that have been designed to deal specifically with large sequence data, and in automating analysis pipelines.

Below are some interesting papers and packages that are all in some way related to analyzing public gene expression data in different ways. I annotate each resource with a couple of tags.

Sequence Bloom Trees. [data structures] These data structures (described in the paper Fast search of thousands of short-read sequencing experiments) allow indexing of a very large number of sequences into a data structure that can be rapidly queried with your own data. I first tried it about a year ago and found it to be useful to check for the presence of short snippets of interest (RNA sequences corresponding to expressed peptides of a certain type) in published transcriptomes. The authors have made available a database of 2,652 RNA-seq experiments from human brain, breast and blood which served as a very useful reference point.

The Lair. [pipelines, automation, reprocessing] Lior Pachter and the rest of the gang behind popular RNA-seq analysis tools Kallisto and Sleuth have taken their concept further with Lair, a platform for interactive re-analysis of published RNA-seq datasets. They use a Snakemake based analysis pipeline to process and analyze experiments in a consistent way – see the example analyses listed here. Anyone can request a similar re-analysis of a published data set by providing a config file, design matrix and other details as described here.

Toil. [pipelines, automation, reprocessing] The abstract of this paper, which was recently submitted to bioRxiv, states: Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores. The authors used their workflow software to quantify expression in  four studies: The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), Pacific Pediatric Neuro-Oncology Consortium (PNOC), and the Genotype Tissue Expression Project (GTEx).

EBI’s RNA-seq-API. [API, discovery, reprocessing, compendium] The RESTful RNA-seq Analysis API provided by the EBI currently contains raw, FPKM and TPM gene and exon counts for a staggering 265,000 public sequencing runs in 264 different species, as well as ftp locations of CRAM, bigWig and bedGraph files. See the documentation here.

Digital Expression Explorer. [reprocessing, compendium] This resource contains hundreds of thousands of uniformly processed RNA-seq data sets (e.g., >73,000 human data sets and >97,000 mouse ones). The data sets were processed into gene-level counts, which led to some Twitter debate between the transcript-level quantification hardliners and the gene-count-tolerant communities, if I may label the respective camps in that way. These data sets can be downloaded in bulk.

CompendiumDb. [API, discovery] This is an R package that facilitates the programmatic retrieval of functional genomics data (i.e., often gene expression data) from the Gene Expression Omnibus (GEO), one of the main repositories for this kind of data.

Omics Discovery Index (OmicsDI). [discovery] This is described as a “Knowledge Discovery framework across heterogeneous data (genomics, proteomics and metabolomics)” and is mentioned here both because a lot of it is gene expression data and because it seems like a good resource for finding data across different experimental types for the same conditions.

MetaRNASeq. [discovery] A browser-based query system for finding RNA-seq experiments that fulfill certain search criteria. Seems useful when looking for data sets from a certain disease state, for example.

Tradict. [applications of meta-analysis] In this study, the authors analyzed 23,000 RNA-seq experiments to find out whether gene expression profiles could be reconstructed from a small subset of just 100 marker genes (out of perhaps 20,000 available genes). The author claims that it works well and the manuscript contains some really interesting graphs showing, for example, how most of the variation in gene expression is driven by developmental stage and tissue.

In case you think that these types of meta-analysis are only doable with large computing clusters with lots of processing power and storage, you’ll be happy to find out that it is easy to analyze RNA-seq experiments in a streaming fashion, without having to download FASTQ or even BAM files to disk (Valentine Svensson wrote a nice blog post about this), and with tools such as Kallisto, it does not really take that long to quantify the expression levels in a sample.

Finally, I’ll acknowledge that the discovery-oriented tools above (APIs, metadata search etc) still work on the basis of knowing what kind of data set you are looking for. But another interesting way of searching for expression data would be querying by content, that is, showing a search system the data you have at hand and asking it to provide the data sets most similar to it. This is discussed in the cumulative biology paper mentioned at the start of this blog post: “Instead of searching for datasets that have been described similarly, which may not correspond to a statistical similarity in the datasets themselves, we would like to conduct that search in a data-driven way, using as the query the dataset itself or a statistical (rather than a semantic) description of it.” In a similar vein, Titus Brown has discussed using MinHash signatures for identifying similar samples and finding collaborators.

App for exploring brain region specific gene expression

(Short version: explore region-specific gene expression in two human brains at

The Allen Institute for Brain Science has done a tremendous amount of work to digitalize and make available information on gene expression at a fine-grained level both in the mouse brain and the human brain. The Allen Brain Atlas contains a lot of useful information on the developing brain in mouse and human, the aging brain, etc. – both via GUIs and an API.

Among other things, the Allen institute has published gene expression data for healthy human brains divided by brain structure, assessed using both microarrays and RNA sequencing. In the RNA-seq case (which I have been looking at for reasons outlined below), two brains have been sectioned into 121 different parts, each representing one of many anatomical structures. This gives “region-specific” expression data which are quite useful for other researchers who want to compare their brain gene expression experiments to publicly available reference data. Note that each of the defined regions will still be a mix of cell types (various kinds of neuron, astrocytes, oligodendrocytes etc.), so we are still looking at a mix of cell types here, although resolved into brain regions. (Update 2016-07-22: The recently released R package ABAEnrichment seems very useful for a more programmatic approach than the one described here to accessing information about brain structure and cell type specific genes in Allen Brain Atlas data!)

As I have been working on a few projects concerning gene expression in the brain in some specific disease states, there has been a need to compare our own data to “control brains” which are not (to our knowledge) affected by any disease. In one of the projects, it has also been of interest to compare gene expression profiles to expression patterns in specific brain regions. As these projects both used RNA sequencing as their method of quantifying gene (or transcript) expression, I decided to take a closer look at the Allen Institute brain RNA-seq data and eventually ended up writing a small interactive app which is currently hosted at (as well as a back-up location available on request if that one doesn’t work.)

Screen Shot 2016-07-23 at 15.03.45

A screenshot of the Allen brain RNA-seq visualization app

The primary functions of the app are the following:

(1) To show lists of the most significantly up-regulated genes in each brain structure (genes that are significantly more expressed in that structure than in others, on average). These lists are shown in the upper left corner, and a drop-down menu below the list marked “Main structure” is used to select the structure of interest. As there are data from two brains, the expression level is shown separately for these in units of TPM (transcripts per million). Apart from the columns showing the TPM for each sampled brain (A and B, respectively), there is a column showing the mean expression of the gene across all brain structures, and across both brains.

(2) To show box plots comparing the distribution of the TPM expression levels in the structure of interest (the one selected in the “Main structure” drop-down menu) with the TP distribution in other structures. This can be done on the level of one of the brains or both. You might wonder why there is a “distribution” of expression values in a structure. The reason is simply that there are many samples (biopsies) from the same structure.

So one simple usage scenario would be to select a structure in the drop-down menu, say “Striatum”, and press the “Show top genes” button. This would render a list of genes topped by PCP4, which has a mean TPM of >4,300 in brain A and >2,000 in brain B, but just ~500 on average in all regions. Now you could select PCP4, copy and paste it into the “gene” textbox and click “Show gene expression across regions.” This should render a (ggplot2) box plot partitioned by brain donor.

There is another slightly less useful functionality:

(3)  The lower part of the screen is occupied by a principal component plot of all of the samples colored by brain structure (whereas the donor’s identity is indicated by the shape of the plotting character.) The reason I say it’s not so useful is that it’s currently hard-coded to show principal components 1 and 2, while I ponder where I should put drop-down menus or similar allowing selection of arbitrary components.

The PCA plot clearly shows that most of the brain structures are similar in their expression profiles, apart from the structures: cerebral cortex, globus pallidus and striatum, which form their own clusters that consist of samples from both donors. In other words, the gene expression profiles for these structures are distinct enough not to get overshadowed by batch or donor effects and other confounders.

I hope that someone will find this little app useful!



Is it unusually cold today?

The frequently miserable Swedish weather often makes me think “Is it just me, or is it unusually cold today?” Occasionally, it’s the reverse scenario – “Hmm, seems weirdly warm for April 1st – I wonder what the typical temperature this time of year is?” So I made myself a little Shiny app which is now hosted here. I realize it’s not so interesting for people who don’t live in Stockholm, but then again I have many readers who do … and it would be dead simple to create the same app for another Swedish location, and probably many other locations as well.

The app uses three different data sources, all from the Swedish Meteorological and Hydrological Institute (SMHI). The estimate of the current temperature is taken from the “latest hour” data for Stockholm-Bromma (query). For the historical temperature data, I use two different sources with different granularity. There is a data set that goes back to 1756 which contains daily averages, and another one that goes back to 1961 but which has temperatures at 06:00 (6 am), 12:00 (noon) and 18:00 (6 pm). The latter one makes it easier to compare to the current temperature, at least if you happen to be close to one of those times.


Hacking open government data

I spent last weekend with my talented colleagues Robin Andéer and Johan Dahlberg participating in the Hack For Sweden hackathon in Stockholm, where the idea is to find the most clever ways to make use of open data from government agencies. Several government entities were actively supporting and participating in this well-organized though perhaps slightly unfortunately named event (I got a few chuckles from acquaintances when I mentioned my participations.)

Our idea was to use data from Kolada, a database containing more than 2000 KPIs (key performance indicators) for different aspects of life in the 290 Swedish municipalities (think “towns” or “cities”, although the correspondence is not exactly 1-to-1), to get a “birds-eye view” of how similar or different the municipalities/towns are in general. Kolada has an API that allows piecemeal retrieval of these KPIs, so we started by essentially scraping the database (a bulk download option would have been nice!) to get a table of 2,303 times 290 data points, which we then wanted to be able to visualize and explore in an interactive way.

One of the points behind this app is that it is quite hard to wrap your head around the large number of performance indicators, which might be a considerable mental barrier for someone trying to do statistical analysis on Swedish municipalities. We hoped to create a “jumping-board” where you can quickly get a sense on what is distinctive for each municipality and which variables might be of interest, after which a user would be able to go deeper into a certain direction of analysis.

We ended up using the Bokeh library for Python to make a visualization where the user can select municipalities and drill down a little bit to the underlying data, and Robin and Johan cobbled together a web interface (available at  We plotted the municipalities using principal component analysis (PCA) projections after having tried and discarded alternatives like MDS and t-SNE. When the user selects a town in the PCA plot, the web interface displays its most distinctive (i.e. least typical) characteristics. It’s also possible to select two towns and get a list of the KPIs that differ the most between the two towns (based on ranks across all towns). Note that all of the KPIs are named and described in Swedish, which may make the whole thing rather pointless for non-Swedish users.

The code is on GitHub and the current incarnation of the app is at Kommunvis.

Perhaps unsurprisingly, there were lots of cool projects on display at Hack for Sweden. The overall winners were the Ge0Hack3rs team, who built a striking 3D visualization of different parameters for Stockholm (e.g. the density of companies, restaurants etc.) as an aid for urban planners and visitors. A straightforward but useful service which I liked was Cykelranking, built by the Sweco Position team, an index for how well each municipality is doing in terms of providing opportunities for bicycling, including detailed info on bicycle paths and accident-prone locations.

This was the third time that the yearly Hack for Sweden event was held, and I think the organization was top-notch, in large, spacey locations with seemingly infinite supply of coffee, food, and snacks, as well as helpful government agency data specialists in green T-shirts whom you were able to consult with questions. We definitely hope to be back next year with fresh new ideas.

This was more or less a 24-hour hackathon (Saturday morning to Sunday morning), although certainly our team used less time (we all went home to sleep on Saturday evening), yet a lot of the apps built were quite impressive, so I asked some other teams how much they had prepared in advance. All of them claimed not to have prepared anything, but I suspect most teams did like ours did (and for which I am grateful): prepared a little dummy/bare-bones application just to make sure they wouldn’t get stuck in configuration, registering accounts etc. on the competition day. I think it’s a good thing in general to require (as this hackathon did) that the competitors state clearly in advance what they intend to do, and prod them a little bit to prepare in advance so that they can really focus on building functionality on the day(s) of the hackathon instead of fumbling around with installation.



Tutorial: Exploring TCGA breast cancer proteomics data

Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).

The Cancer Genome Atlas (TCGA) has become a focal point for a lot of genomics and bioinformatics research. DNA and RNA level data on different tumor types are now used in countless papers to test computational methods and to learn more about hallmarks of different types of cancer.

Perhaps, though, there aren’t as many people who are using the quantitative proteomics data hosted by Clinical Proteomic Tumor Analysis Consortium (CPTAC). There are mass spectrometry based expression measurements for many different types of tumor available at their Data Portal.

As I have been comparing some (currently in-house, to be published eventually) cancer proteomics data sets against TCGA proteomics data, I thought I would share some code, tricks and tips for those readers who want to start analyzing TCGA data (whether proteomics, transcriptomics or other kinds) but don’t quite know where to start.

To this end, I have put a tutorial Jupyter notebook at Github: TCGA protein tutorial

The tutorial is written in R, mainly because I like the TCGA2STAT and Boruta packages (but I just learned there is a Boruta implementation in Python as well.) If you think it would be useful to have a similar tutorial in Python, I will consider writing one.

The tutorial consists, roughly, of these steps:

  • Getting a usable set of breast cancer proteomics data
    This consists of downloading the data, selecting the subset that we want to focus on, removing features with undefined values, etc..
  • Doing feature selection to find proteins predictive of breast cancer subtype.
    Here, the Boruta feature selection package is used to identify a compact set of proteins that can predict the so-called PAM50 subtype of each tumor sample. (The PAM50 subtype is based on mRNA expression levels.)
  • Comparing RNA-seq data and proteomics data on the same samples.
    Here, we use the TCGA2STAT package to obtain TCGA RNA-seq data and find the set of common gene names and common samples between our protein and mRNA-seq data in order to look at protein-mRNA correlations.

Please visit the notebook if you are interested!

Some of the take-aways from the tutorial may be:

  • A bit of messing about with metadata, sample names etc. is usually necessary to get the data in the proper format, especially if you are combining different kinds of data (such as RNA-seq and proteomics here). I guess you’ve heard them say that 80% of data science is data preparation!…
  • There are now quantitative proteomics data available for many types of TCGA tumor samples.
  • TCGA2STAT is a nice package for importing certain kinds of TCGA data into an R session.
  • Boruta is an interesting alternative for feature selection in a classification context.

This post was prepared with permission from CPTAC.

P.S. I may add some more material on a couple of ways to do multivariate data integration on TCGA data sets later, or make that a separate blog post. Tell me if you are interested.

Finnish companies that do data science

I should start by saying that I have shamelessly poached this blog post from a LinkedIn thread started by one Ville Niemijärvi of Louhia Consulting in Finland. In my defence,  LinkedIn conversations are rather ephemeral and I am not sure how completely they are indexed by search engines, so to me it makes sense to sometimes highlight them in a slightly more permanent manner.

Ville asked for input (and from now on I am paraphrasing and summarising) on companies in Finland that do data analytics “for real”, as in data science, predictive analytics, data mining or statistical modelling. He required that the proposed companies should have several “actual” analysts and be able to show references to work performed in advanced analytics (i e not pure visualization/reporting). In a later comment he also mentioned price optimization, cross-sell analysis, sales prediction, hypothesis testing, and failure modelling.

The companies that had been mentioned when I went through this thread are listed below. I’ve tried to lump them together into categories after a very superficial review and would be happy to be corrected if I have gotten something wrong.

[EDIT 2016-02-04 Added a bunch of companies.]

Louhia analytics consulting (predictive analytics, Azure ML etc.)
BIGDATAPUMP analytics consulting (Hadoop, AWS, cloud etc.)
Houston Analytics analytics consulting (analytics partner of IBM)
Gofore IT architecture
Digia IT consulting
Techila Technologies distributed computing middleware
CGI IT consulting, multinational
Teradata data warehousing, multinational
Avanade IT consulting, multinational
Deloitte financial consulting, multinational
Information Builders business intelligence, multinational
SAS Institute analytics software, multinational
Tieto IT services, multinational (but originally Finnish)
Aureolis business intelligence
Olapcon business intelligence
Big Data Solutions business intelligence
Enfo Rongo business intelligence
Bilot business intelligence
Affecto digital services
Siili digital services
Reaktor digital services
Valuemotive digital services
Solita digital services
Comptel digital services?
Dagmar marketing
Frankly Partners marketing
ROIgrow marketing
Probic marketing
Avaus marketing
InlineMarket marketing automation
Steeri customer analytics
Tulos Helsinki customer analytics
Andumus customer analytics
Avarea customer analytics
Big Data Scoring customer analytics
Suomen Asiakastieto credit & risk management
Silta HR analytics
Quva industrial analytics
Ibisense industrial analytics
Ramentor industrial analytics
Indalgo manufacturing analytics
TTS-Ciptec optimization, sensor
SimAnalytics Logistics, simulation
Relex supply chain analytics
Analyse2 assortment planning
Genevia bioinformatics consultancy
Fonecta directory services
Monzuun analytics as a service
Solutive data visualization
Omnicom communications agency
NAPA naval analytics, ship operations
Primor consulting telecom?

There was an interesting comment saying that CGI manages its global data science “virtual team” from Finland and that they employ several successful Kagglers, one of whom was rated #37 out of 450000 Kaggle users in 2014.

On a personal note, I was happy to find a commercial company (Genevia) which appears to do pretty much the same thing as I do in my day job at Scilifelab Stockholm, that is, bioinformatics consulting (often with an emphasis on high throughput sequencing), except that I do it in an academic context.




List of deep learning implementations in biology

[Note: this list now lives at GitHub, where it will be continuously updated, so please go there instead!]

I’m going to start collecting papers on, and implementations of, deep learning in biology (mostly genomics, but other areas as well) on this page. It’s starting to get hard to keep up! For the purposes of this list, I’ll consider things like single-layer autoencoders, although not literally “deep”, to qualify for inclusion. The categorizations will by necessity be arbitrary and might be changed around from time to time.

In parallel, I’ll try to post some of these on gitxiv as well under the tag bioinformatics plus other appropriate tags.

Please let me know about the stuff I missed!


Neural graph fingerprints [github][gitxiv]

A convolutional net that can learn features which are useful for predicting properties of novel molecules; “molecular fingerprints”. The net works on a graph where atoms are nodes and bonds are edges. Developed by the group of Ryan Adams, who co-hosts the very good Talking Machines podcast.


Pcons2 – Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns [web interface]

Here, a “deep random forest” with five layers is used to improve predictions of which residues (amino acids) in a protein are physically interacting which each other. This is useful for predicting the overall structure of the protein (a very hard problem.)


Gene expression

In modeling gene expression, the inputs are typically numerical values (integers or floats) estimating how much RNA is produced from a DNA template in a particular cell type or condition.

ADAGE – Analysis using Denoising Autoencoders of Gene Expression [github][gitxiv]

This is a Theano implementation of stacked denoising autoencoders for extracting relevant patterns from large sets of gene expression data, a kind of feature construction approach if you will. I have played around with this package quite a bit myself. The authors initially published a conference paper applying the model to a compendium of breast cancer (microarray) gene expression data, and more recently posted a paper on bioRxiv where they apply it to all available expression data (microarray and RNA-seq) on the pathogen Pseudomonas aeruginosa. (I understand that this manuscript will soon be published in a journal.)

Learning structure in gene expression data using deep architectures [paper]

This is also about using stacked denoising autoencoders for gene expression data, but there is no available implementation (as far as I could tell). Included here for the sake of completeness (or something.)

Gene expression inference with deep learning [github][paper]

This deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2/Theano.

Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model [paper]

The authors use stacked autoencoders to learn biological features in yeast from thousands of microarrays. They analyze the hidden layer representations and show that these encode biological information in a hierarchical way, so that for instance transcription factors are represented in the first hidden layer.

Predicting enhancers and regulatory regions

Here the inputs are typically “raw” DNA sequence, and convolutional networks (or layers) are often used to learn regularities within the sequence. Hat tip to Melissa Gymrek ( for pointing out some of these.

DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences [github][gitxiv]

Made for predicting the function of non-protein coding DNA sequence. Uses a convolution layer to capture regulatory motifs (i e single DNA snippets that control the expression of genes, for instance), and a recurrent layer (of the LSTM type) to try to discover a “grammar” for how these single motifs work together. Based on Keras/Theano.

Basset – learning the regulatory code of the accessible genome with deep convolutional neural networks [github][gitxiv]

Based on Torch, this package focuses on predicting the accessibility (or “openness”) of the chromatin – the physical packaging of the genetic information (DNA+associated proteins). This can exist in more condensed or relaxed states in different cell types, which is partly influenced by the DNA sequence (not completely, because then it would not differ from cell to cell.)

DeepSEA – Predicting effects of noncoding variants with deep learning–based sequence model [web server][paper]

Like the packages above, this one also models chromatin accessibility as well as the binding of certain proteins (transcription factors) to DNA and the presence of so-called histone marks that are associated with changes in accessibility. This piece of software seems to focus a bit more explicitly than the others on predicting how single-nucleotide mutations affect the chromatin structure. Published in a high-profile journal (Nature Methods).

DeepBind – Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning [code][paper]

This is from the group of Brendan Frey in Toronto, and the authors are also involved in the company Deep Genomics. DeepBind focuses on predicting the binding specificities of DNA-binding or RNA-binding proteins, based on experiments such as ChIP-seq, ChIP-chip, RIP-seq,  protein-binding microarrays, and HT-SELEX. Published in a high-profile journal (Nature Biotechnology.)

PEDLA: predicting enhancers with a deep learning-based algorithmic framework [code][paper]

This package is for predicting enhancers (stretches of DNA that can enhance the expression of a gene under certain conditions or in a certain kind of cell, often working at a distance from the gene itself) based on heterogeneous data from (e.g.) the ENCODE project, using 1,114 features altogether.

DEEP: a general computational framework for predicting enhancers

Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods (and several other papers applying various kinds of deep networks to regulatory region prediction) [code][one paper out of several]

Wyeth Wasserman’s group have made a kind of toolkit (based on the Theano tutorials) for applying different kinds of deep learning architectures to cis-regulatory element (DNA stretches that can modulate the expression of a nearby gene) prediction. They use a specific “feature selection layer” in their nets to restrict the number of features in the models. This is implemented as an additional sparse one-to-one linear layer between the input layer and the first hidden layer of a multi-layer perceptron.


Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks [paper][web server]

This implementation uses a stacked autoencoder with a supervised layer on top of it to predict whether a certain type of genomic region called “CpG islands” (stretches with an overrepresentation of a sequence pattern where a C nucleotide is followed by a G) is methylated (a chemical modification to DNA that can modify its function, for instance methylation in the vicinity of a gene is often but not always related to the down-regulation or silencing of that gene.) This paper uses a network structure where the hidden layers in the autoencoder part have a much larger number of nodes than the input layer, so it would have been nice to read the authors’ thoughts on what the hidden layers represent.

Single-cell applications

CellCnn – Representation Learning for detection of disease-associated cell subsets

This is a convolutional network (Lasagne/Theano) based approach for “Representation Learning for detection of phenotype-associated cell subsets.” It is interesting because most neural network approaches for high-dimensional molecular measurements (such as those in the gene expression category above) have used autoencoders rather than convolutional nets.

Population genetics

Deep learning for population genetic inference [paper]

No implementation available yet but says an open-source one will be made available soon.


This is a harder category to populate because a lot of theoretical work on neural networks and deep learning has been intertwined with neuroscience. For example, recurrent neural networks have long been used for modeling e.g. working memory and attention. In this post I am really looking for pure applications of DL rather than theoretical work, although that is extremely interesting.

For more applied DL, I have found

Deep learning for neuroimaging: a validation study [paper]

SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing [paper]

I’m sure there are many others. Maybe digging up some seminal neuroscience papers modeling brain areas and functions with different kinds of neural networks would be a worthy topic for a future blog post.



ASCII Autoencoder

Joel and I were playing around with TensorFlow, the deep learning library that Google recently released and that you have no doubt heard of. We had put together a little autoencoder implementation and were trying to get a handle on how well it was working.

An autoencoder can be viewed as a neural network where the final layer, the output layer, is supposed to reconstruct the values that have been fed into the input layer, possibly after some distortion of the inputs (like forcing a fraction of them to be zero, dropout, or adding some random noise). In the case with corrupted, it’s called a denoising autoencoder, and the purpose of adding the noise or dropout is to make the system discover more robust statistical regularities in the input data (there is some good discussion here).

An autoencoder often has fewer nodes in the hidden layer(s) than in the input and is then used to learn a more compact and general representation of the data (the code or encoding). With only one hidden layer and linear activation functions, the encoding should be essentially the same as one gets from PCA (principal component analysis), but non-linear activation functions (e g sigmoid and tanh) will yield different representations, and multi-layer or stacked autoencoders will add a hierarchical aspect.

Some references on autoencoders:

Ballard (1987) – Modular learning in neural networks

Andrew Ng’s lecture notes on sparse autoencoders

Vincent et al (2010) – Stacked denoising autoencoders

Tan et al (2015) – ADAGE analysis of publicly available gene expression data collections illuminates Pseudomonas aeruginosa-host interactions

Anyway, we were trying some different parametrizations of the autoencoder (its training performance can depend quite a lot on how the weights are initialized, the learning rate and the number of hidden nodes) and felt it was a bit boring to just look at a single number (the reconstruction error). We wanted to get a feel for how training is progressing across the input data matrix, so we made the script output for each 1000 rounds of training a colored block of text in the terminal where the background color represents the absolute difference between the target value and the reconstructed value using bins. The “best” bin (bin 0) is dark green and represents that the reconstruction is very close to the original input; the “bad” bins have reddish colors. If the data point has been shifted t0 a new bin in the last 1000 rounds (i e the reconstruction has improved or deteriorated noticeably), a colored digit indicating the new bin is shown in the foreground. (This makes more sense when you actually look at it.) We only show the first 75 training examples and the first 75 features, so if your data set is larger than that you won’t see all of it.

The code is on GitHub. There are command-line switches for controlling the number of hidden nodes, learning rate, and other such things. There are probably many aspects that could be improved but we thought this was a fun way to visualize the progress and see if there are any regions that clearly stand out in some way.

Here are a few screenshots of an example execution of the script.

As the training progresses, the overall picture gets a bit greener (the reconstruction gets closer to the input values) and the reconstructions get a bit more stable (i e not as many values have a digit on them to indicate that the reconstruction has improved or deteriorated). The values under each screenshot indicates the number of training cycles and the mean squared reconstruction error.

Watson hackathon in Uppsala

Today I spent most of the day trying to grok IBM Watson’s APIs during a hackathon (Hackup) in Uppsala, where the aim was to develop useful apps using those APIs. Watson is, of course, famous for being good at Jeopardy and for being at the center for IBM’s push into healthcare analytics, but I hadn’t spent much time before this hackathon checking out exactly what is available to users now in terms of APIs etc. It turned out to be a fun learning experience and I think a good time was had by all.

We used IBM’s Bluemix platform to develop apps. As the available Watson API’s (also including the Alchemy APIs that are now part of Bluemix) are mostly focused on natural language analysis (rather than generic classification and statistical modeling), our team – consisting of me and two other bioinformaticians from Scilifelab – decided to try to build a service for transcribing podcasts (using the Watson Speech To Text API) in order to annotate and tag them using the Alchemy APIs for keyword extraction, entity extraction etc. This, we envisioned, would allow podcast buffs to identify in which episode of their favorite show a certain topic was discussed, for instance. Eventually, after ingesting a large number of podcast episodes, the tagging/annotation might also enable things like podcast recommendations and classification, as podcasts could be compared to each other based on themes and keywords. This type of “thematic mapping” could also be interesting for following a single podcast’s thematic development.

As is often the case, we spent a fair amount of time on some supposedly mundane details. Since the speech-to-text conversion was relatively slow, we tried different strategies to split the audio files and process them in parallel, but could not quite make it work. Still, we ended up with a (Python-based) solution that was indeed able to transcribe and tag podcast episodes, but it’s still missing a front-end interface and a back-end database to hold information about multiple podcast episodes.

There were many other teams who developed cool apps. For instance one team made a little app for voice control of a light switch using a Raspberry Pi, and another team had devised an “AI shopper” that will remind you to buy stuff that you have forgotten to put on your shopping list. One entry was a kind of recommendation system for what education you should pursue, based on comparing a user-submitted text against a model trained on papers from people in different careers, and another one was an app for quantifying the average positive/negative/neutral sentiments found in tweets from different accounts (e.g. NASA had very positive tweets on average whereas BBC News was fairly negative).

All in all, a nice experience, and it was good to take a break from the Stockholm scene and see what’s going on in my old home town. Good job by Jason Dainter and the other organizers!

Post Navigation