Follow the Data

A data driven blog

Archive for the category “Tools and Software”

Dynamics in Swedish Twitter communities


I made a community decomposition of Swedish Twitter accounts in 2015 and 2016 and you can explore it in an online app.


As reported on this blog a couple of months ago, (and also here). I have (together with Mattias Östmar) been investigating the community structure of Swedish Twitter users. The analysis we posted then addressed data from 2015 and we basically just wanted to get a handle on what kind of information you can get from this type of analysis.

With the processing pipeline already set up, it was straightforward to repeat the analysis for the fresh data from 2016 as soon as Mattias had finished collecting it. The nice thing about having data from two different years in that we can start to look at the dynamics – namely, how stable communities are, which communities are born or disappear, and how people move between them.

The app

First of all, I made an app for exploring these data. If you are interested in this topic, please help me understand the communities that we have detected by using the “Suggest topic” textbox under the “Community info” tab. That is an attempt to crowdsource the “annotation” of these communities. The suggestions that are submitted are saved in a text file which I will review from time to time and update the community descriptions accordingly.

The fastest climbers

By looking at the data in the app, we can find out some pretty interesting things. For instance, the account that easily increased to most in influence (measured in PageRank) was @BjorklundVictor, who climbed from a rank of 3673 in 2015 in community #4 (which we choose to annotate as an “immigration” community) to a rank of 3 (!) in community #4 in 2016 (this community has also been classified as an immigration-discussion community, and it is the most similar one of all 2016 communities to the 2015 immigration community.) I am not personally familiar with this account, but he must have done something to radically increase his reach in 2016.

Some other people/accounts that increased a lot in influence were professor Agnes Wold (@AgnesWold) who climbed from rank 59 to rank 3 in the biggest community, which we call the “pundit cluster” (it has ID 1 both in 2015 and 2016), @staffanlandin, who went from #189 to #16 in the same community, and @PssiP, who climbed from rank 135 to rank 8 in the defense/prepping community (ID 16 in 2015, ID 9 in 2016).

Some people have jumped to a different community and improved their rank in that way, like @hanifbali, who went from #20 in community 1 (general punditry) in 2015 to the top spot, #1 in the immigration cluster (ID 4) in 2016, and @fleijerstam, who went from #200 in the pundit community in 2015 to #10 in the politics community (#3) in 2016.

Examples of users who lost a lot of ground in their own community are @asaromson (Åsa Romson, the ex-leader of the Green Party; #7 -> #241 in the green community) and @rogsahl (#10 -> #905 in the immigration community).

The most stable communities

It turned out that the most stable communities (i.e. the communities that had the most members in common relative to their total sizes in 2015 and 2016 respectively) were the ones containing accounts using a different language from Swedish, namely the Norwegian, Danish and Finnish communities.

The least stable community

Among the larger communities in 2015, we identified the one that was furthest from having a close equivalent in 2016. This was 2015 community 9, where the most influential account was @thefooomusic. This is a boy band whose popularity arguably hit a peak in 2015. The community closest to it in 2016 is community 24, but when we looked closer at that (which you can also do in the app!), we found that many YouTube stars had “migrated” into 2016 cluster 24 from 2015 cluster 84, which upon inspection turned out to be a very clear Swedish YouTuber cluster with stars such as Clara Henry, William Spetz and Therese Lindgren.

So in other words, the The Fooo fan cluster and the YouTuber cluster from 2015 merged into a mixed cluster in 2016.

New communities

We were hoping to see some completely new communities appear in 2016, but that did not really happen, at least not for the top 100 communities. Granted, there was one that had an extremely low similarity to any 2015 community, but that turned out to be a “community” topped by @SJ_AB, a railway company that replies to a large number of customer queries and complaints on Twitter (which, by the way, makes it the top account of them all in terms of centrality.) Because this company is responding to queries from new people all the time, it’s not really part of a “community” as such, and the composition of the cluster will naturally change a lot from year to year.

Community 24, which was discussed above, was also dissimilar from all the 2015 communitites, but as described, we notice it has absorbed users from 2015 clusters 9 (The Fooo) and 84 (YouTubers).

Movement between the largest communities

The similarity score for the “pundit clusters” (community 1 in 2015 and community 1 in 2016, respectively) somewhat surprisingly showed that these were not very similar overall, although many of the top-ranked users are the same. A quick inspection also showed that the entire top list of community 3 in 2015 moved to community 1 in 2016, which makes the 2015 community 3 the closest equivalent to the 2016 community 1. Both of these communities can be characterized as general political discussion/punditry clusters.

Comparison: The defense/prepper community in 2015 vs 2016

In our previous blog post on this topic, we presented a top-10 list of defense Twitterers and compared that to a manually curated list from Swedish daily Svenska Dagbladet. Here we will present our top-10 list for 2016.

Username Rank in 2016 Rank in 2015 Community ID in 2016 Community ID in 2015
patrikoksanen 1 3 9 16
hallonsa 2 5 9 16
Cornubot 3 1 9 16
waterconflict 4 6 9 16
wisemanswisdoms 5 2 9 16
JohanneH 6 9 9 16
mikaelgrev 7 7 9 16
PssiP 8 135 9 16
oplatsen 9 11 9 16
stakmaskin 10 31 9 16

Comparison: The green community in 2015 vs 2016

One community we did not touch on in the last blog post is the green, environmental community. Here’s a comparison of the main influencers in that category in 2016 vs 2015.

Username Rank in 2016 Rank in 2015 Community ID in 2016 Community ID in 2015
rickardnordin 1 4 13 29
Ekobonden 2 1 13 109
ParHolmgren 3 19 13 29
BjornFerry 4 12 13 133
PWallenberg 5 12 13 109
mattiasgoldmann 6 3 13 29
JKuylenstierna 7 10 13 29
Axdorff 8 3 13 153
fores_sverige 9 11 13 29
GnestaEmma 10 17 13 29


Of course, many parts of this analysis could be improved and there are some important caveats. For example, the Infomap algorithm is not deterministic, which means that you are likely to get somewhat different results each time you run it. For these data, we have run it a number of times and seen that you get results that are similar in a general sense each time (in terms of community sizes, top influencers and so on), but it should be understood that some accounts (even top influencers) can in some cases move around between communities just because of this non-deterministic aspect of the algorithm.

Also, it is possible that the way we use to measure community similarity (the Jaccard index, which is the ratio between the number of members in common between two communities and the number of members that are in any or both of the communities – or to put it in another way, the intersection divided by the union) is too coarse, because it does not consider the influence of individual users.

Swedish school fires and Kaggle open data

For quite a while now, I have been rather mystified and intrigued by the fact that Sweden has one of the highest rates of school fires due to arson. According to the Division of Fire Safety Engineering at Lund University, “Almost every day between one and two school fires occur in Sweden. In most cases arson is the cause of the fire.” This is a lot for a small country with less than 10 million inhabitants, and the associated costs can be up to a billion SEK (around 120 million USD) per year.

It would be hard to find a suitable dataset to address the question why arson school fires are so frequent in Sweden compared to other countries in a data-driven way – but perhaps it would be possible to stay within a Swedish context and find out which properties and indicators of Swedish towns (municipalities, to be exact) might be related to a high frequency of school fires?

To answer this question, I  collected data on school fire cases in Sweden between 1998 and 2014 through a web site with official statistics from the Swedish Civil Contingencies Agency. As there was no API to allow easy programmatic access to schools fire data, I collected them by a quasi-manual process, downloading XLSX report generated from the database year by year, after which I joined these with an R script into a single table of school fire cases where the suspected reason was arson. (see Github link below for full details!)

To complement  these data, I used a list of municipal KPI:s (key performance indicators) from 2014, that Johan Dahlberg put together for our contribution in Hack for Sweden earlier this year. These KPIs were extracted from Kolada (a database of Swedish municipality and county council statistics) by repeatedly querying its API.

There is a Github repo containing all the data and detailed information on how it was extracted.

The open Kaggle dataset lives at So far, the process of uploading and describing the data has been smooth. I’ve learned that each Kaggle dataset has an associated discussion forum, and (potentially) a bunch of “kernels”, which are analysis scripts or notebooks in Python, R or Julia. I hope that other people will contribute script and analyses based on these data. Please do if you find this dataset intriguing!

Cumulative biology and meta-analysis of gene expression data

In talks that I have given in the past few years, I have often made the point that most of genomics has not been “big data” in the usual sense, because although the raw data files can often be large, they are often processed in a more or less predictable way until they are “small” (e.g., tables of gene expression measurements or genetic variants in a small number of samples). This in turn depends on the fact that it is hard and expensive to obtain biological samples, so in a typical genomics project the sample size is small (from just a few to tens or in rare cases hundreds or thousands) while the dimensionality is large (e.g. 20,000 genes, 10,000 proteins or a million SNPs). This is in contrast to many “canonical big data” scenarios where one has a large number of examples (like product purchases) with a small dimensionality (maybe the price, category and some other properties of the product.)

Because of these issues, I have been hopeful about using published data on e.g. gene expression based on RNA sequencing or on metagenomics to draw conclusions based on data from many studies. In the former case (gene expression/RNA-seq) it could be to build classifiers for predicting tissue or cell type for a given gene expression profile. In the latter case (metagenomics/metatranscriptomics, maybe even metaproteomics) it could also be to build classifiers but also to discover completely new varieties of e.g. bacteria or viruses from the “biological dark matter” that makes up a large fraction of currently generated metagenomics data. These kinds of analysis are usually called meta-analysis, but I am fond of the term cumulative biology, which I came across in a paper by Samuel Kaski and colleagues (Toward Computational Cumulative Biology by Combining Models of Biological Datasets.)

Of course, there is nothing new about meta-analysis or cumulative biology – many “cumulative” studies have been published about microarray data – but nevertheless, I think that some kind of threshold has been crossed when it comes to really making use of the data deposited in public repositories. There has been development both in APIs allowing access to public data, in data structures that have been designed to deal specifically with large sequence data, and in automating analysis pipelines.

Below are some interesting papers and packages that are all in some way related to analyzing public gene expression data in different ways. I annotate each resource with a couple of tags.

Sequence Bloom Trees. [data structures] These data structures (described in the paper Fast search of thousands of short-read sequencing experiments) allow indexing of a very large number of sequences into a data structure that can be rapidly queried with your own data. I first tried it about a year ago and found it to be useful to check for the presence of short snippets of interest (RNA sequences corresponding to expressed peptides of a certain type) in published transcriptomes. The authors have made available a database of 2,652 RNA-seq experiments from human brain, breast and blood which served as a very useful reference point.

The Lair. [pipelines, automation, reprocessing] Lior Pachter and the rest of the gang behind popular RNA-seq analysis tools Kallisto and Sleuth have taken their concept further with Lair, a platform for interactive re-analysis of published RNA-seq datasets. They use a Snakemake based analysis pipeline to process and analyze experiments in a consistent way – see the example analyses listed here. Anyone can request a similar re-analysis of a published data set by providing a config file, design matrix and other details as described here.

Toil. [pipelines, automation, reprocessing] The abstract of this paper, which was recently submitted to bioRxiv, states: Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to securely and reproducibly run scientific workflows efficiently at large-scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets free of computational batch effects that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptable cores. The authors used their workflow software to quantify expression in  four studies: The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), Pacific Pediatric Neuro-Oncology Consortium (PNOC), and the Genotype Tissue Expression Project (GTEx).

EBI’s RNA-seq-API. [API, discovery, reprocessing, compendium] The RESTful RNA-seq Analysis API provided by the EBI currently contains raw, FPKM and TPM gene and exon counts for a staggering 265,000 public sequencing runs in 264 different species, as well as ftp locations of CRAM, bigWig and bedGraph files. See the documentation here.

Digital Expression Explorer. [reprocessing, compendium] This resource contains hundreds of thousands of uniformly processed RNA-seq data sets (e.g., >73,000 human data sets and >97,000 mouse ones). The data sets were processed into gene-level counts, which led to some Twitter debate between the transcript-level quantification hardliners and the gene-count-tolerant communities, if I may label the respective camps in that way. These data sets can be downloaded in bulk.

CompendiumDb. [API, discovery] This is an R package that facilitates the programmatic retrieval of functional genomics data (i.e., often gene expression data) from the Gene Expression Omnibus (GEO), one of the main repositories for this kind of data.

Omics Discovery Index (OmicsDI). [discovery] This is described as a “Knowledge Discovery framework across heterogeneous data (genomics, proteomics and metabolomics)” and is mentioned here both because a lot of it is gene expression data and because it seems like a good resource for finding data across different experimental types for the same conditions.

MetaRNASeq. [discovery] A browser-based query system for finding RNA-seq experiments that fulfill certain search criteria. Seems useful when looking for data sets from a certain disease state, for example.

Tradict. [applications of meta-analysis] In this study, the authors analyzed 23,000 RNA-seq experiments to find out whether gene expression profiles could be reconstructed from a small subset of just 100 marker genes (out of perhaps 20,000 available genes). The author claims that it works well and the manuscript contains some really interesting graphs showing, for example, how most of the variation in gene expression is driven by developmental stage and tissue.

In case you think that these types of meta-analysis are only doable with large computing clusters with lots of processing power and storage, you’ll be happy to find out that it is easy to analyze RNA-seq experiments in a streaming fashion, without having to download FASTQ or even BAM files to disk (Valentine Svensson wrote a nice blog post about this), and with tools such as Kallisto, it does not really take that long to quantify the expression levels in a sample.

Finally, I’ll acknowledge that the discovery-oriented tools above (APIs, metadata search etc) still work on the basis of knowing what kind of data set you are looking for. But another interesting way of searching for expression data would be querying by content, that is, showing a search system the data you have at hand and asking it to provide the data sets most similar to it. This is discussed in the cumulative biology paper mentioned at the start of this blog post: “Instead of searching for datasets that have been described similarly, which may not correspond to a statistical similarity in the datasets themselves, we would like to conduct that search in a data-driven way, using as the query the dataset itself or a statistical (rather than a semantic) description of it.” In a similar vein, Titus Brown has discussed using MinHash signatures for identifying similar samples and finding collaborators.

List of deep learning implementations in biology

[Note: this list now lives at GitHub, where it will be continuously updated, so please go there instead!]

I’m going to start collecting papers on, and implementations of, deep learning in biology (mostly genomics, but other areas as well) on this page. It’s starting to get hard to keep up! For the purposes of this list, I’ll consider things like single-layer autoencoders, although not literally “deep”, to qualify for inclusion. The categorizations will by necessity be arbitrary and might be changed around from time to time.

In parallel, I’ll try to post some of these on gitxiv as well under the tag bioinformatics plus other appropriate tags.

Please let me know about the stuff I missed!


Neural graph fingerprints [github][gitxiv]

A convolutional net that can learn features which are useful for predicting properties of novel molecules; “molecular fingerprints”. The net works on a graph where atoms are nodes and bonds are edges. Developed by the group of Ryan Adams, who co-hosts the very good Talking Machines podcast.


Pcons2 – Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns [web interface]

Here, a “deep random forest” with five layers is used to improve predictions of which residues (amino acids) in a protein are physically interacting which each other. This is useful for predicting the overall structure of the protein (a very hard problem.)


Gene expression

In modeling gene expression, the inputs are typically numerical values (integers or floats) estimating how much RNA is produced from a DNA template in a particular cell type or condition.

ADAGE – Analysis using Denoising Autoencoders of Gene Expression [github][gitxiv]

This is a Theano implementation of stacked denoising autoencoders for extracting relevant patterns from large sets of gene expression data, a kind of feature construction approach if you will. I have played around with this package quite a bit myself. The authors initially published a conference paper applying the model to a compendium of breast cancer (microarray) gene expression data, and more recently posted a paper on bioRxiv where they apply it to all available expression data (microarray and RNA-seq) on the pathogen Pseudomonas aeruginosa. (I understand that this manuscript will soon be published in a journal.)

Learning structure in gene expression data using deep architectures [paper]

This is also about using stacked denoising autoencoders for gene expression data, but there is no available implementation (as far as I could tell). Included here for the sake of completeness (or something.)

Gene expression inference with deep learning [github][paper]

This deals with a specific prediction task, namely to predict the expression of specified target genes from a panel of about 1,000 pre-selected “landmark genes”. As the authors explain, gene expression levels are often highly correlated and it may be a cost-effective strategy in some cases to use such panels and then computationally infer the expression of other genes. Based on Pylearn2/Theano.

Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model [paper]

The authors use stacked autoencoders to learn biological features in yeast from thousands of microarrays. They analyze the hidden layer representations and show that these encode biological information in a hierarchical way, so that for instance transcription factors are represented in the first hidden layer.

Predicting enhancers and regulatory regions

Here the inputs are typically “raw” DNA sequence, and convolutional networks (or layers) are often used to learn regularities within the sequence. Hat tip to Melissa Gymrek ( for pointing out some of these.

DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences [github][gitxiv]

Made for predicting the function of non-protein coding DNA sequence. Uses a convolution layer to capture regulatory motifs (i e single DNA snippets that control the expression of genes, for instance), and a recurrent layer (of the LSTM type) to try to discover a “grammar” for how these single motifs work together. Based on Keras/Theano.

Basset – learning the regulatory code of the accessible genome with deep convolutional neural networks [github][gitxiv]

Based on Torch, this package focuses on predicting the accessibility (or “openness”) of the chromatin – the physical packaging of the genetic information (DNA+associated proteins). This can exist in more condensed or relaxed states in different cell types, which is partly influenced by the DNA sequence (not completely, because then it would not differ from cell to cell.)

DeepSEA – Predicting effects of noncoding variants with deep learning–based sequence model [web server][paper]

Like the packages above, this one also models chromatin accessibility as well as the binding of certain proteins (transcription factors) to DNA and the presence of so-called histone marks that are associated with changes in accessibility. This piece of software seems to focus a bit more explicitly than the others on predicting how single-nucleotide mutations affect the chromatin structure. Published in a high-profile journal (Nature Methods).

DeepBind – Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning [code][paper]

This is from the group of Brendan Frey in Toronto, and the authors are also involved in the company Deep Genomics. DeepBind focuses on predicting the binding specificities of DNA-binding or RNA-binding proteins, based on experiments such as ChIP-seq, ChIP-chip, RIP-seq,  protein-binding microarrays, and HT-SELEX. Published in a high-profile journal (Nature Biotechnology.)

PEDLA: predicting enhancers with a deep learning-based algorithmic framework [code][paper]

This package is for predicting enhancers (stretches of DNA that can enhance the expression of a gene under certain conditions or in a certain kind of cell, often working at a distance from the gene itself) based on heterogeneous data from (e.g.) the ENCODE project, using 1,114 features altogether.

DEEP: a general computational framework for predicting enhancers

Genome-Wide Prediction of cis-Regulatory Regions Using Supervised Deep Learning Methods (and several other papers applying various kinds of deep networks to regulatory region prediction) [code][one paper out of several]

Wyeth Wasserman’s group have made a kind of toolkit (based on the Theano tutorials) for applying different kinds of deep learning architectures to cis-regulatory element (DNA stretches that can modulate the expression of a nearby gene) prediction. They use a specific “feature selection layer” in their nets to restrict the number of features in the models. This is implemented as an additional sparse one-to-one linear layer between the input layer and the first hidden layer of a multi-layer perceptron.


Predicting DNA Methylation State of CpG Dinucleotide Using Genome Topological Features and Deep Networks [paper][web server]

This implementation uses a stacked autoencoder with a supervised layer on top of it to predict whether a certain type of genomic region called “CpG islands” (stretches with an overrepresentation of a sequence pattern where a C nucleotide is followed by a G) is methylated (a chemical modification to DNA that can modify its function, for instance methylation in the vicinity of a gene is often but not always related to the down-regulation or silencing of that gene.) This paper uses a network structure where the hidden layers in the autoencoder part have a much larger number of nodes than the input layer, so it would have been nice to read the authors’ thoughts on what the hidden layers represent.

Single-cell applications

CellCnn – Representation Learning for detection of disease-associated cell subsets

This is a convolutional network (Lasagne/Theano) based approach for “Representation Learning for detection of phenotype-associated cell subsets.” It is interesting because most neural network approaches for high-dimensional molecular measurements (such as those in the gene expression category above) have used autoencoders rather than convolutional nets.

Population genetics

Deep learning for population genetic inference [paper]

No implementation available yet but says an open-source one will be made available soon.


This is a harder category to populate because a lot of theoretical work on neural networks and deep learning has been intertwined with neuroscience. For example, recurrent neural networks have long been used for modeling e.g. working memory and attention. In this post I am really looking for pure applications of DL rather than theoretical work, although that is extremely interesting.

For more applied DL, I have found

Deep learning for neuroimaging: a validation study [paper]

SPINDLE: SPINtronic deep learning engine for large-scale neuromorphic computing [paper]

I’m sure there are many others. Maybe digging up some seminal neuroscience papers modeling brain areas and functions with different kinds of neural networks would be a worthy topic for a future blog post.



Some interesting new algorithms

Just wanted to note down some new algorithms that I came across for future reference. Haven’t actually tried any of these yet.

  • LIBFMM, a library for Field-aware Factorization machines. Developed by a group at National Taiwan University, this technique has been used to win two Kaggle click-through competitions. (Criteo, Avazu)
  • Random Bits Regression, a “strong general predictor for big data” (paper). “This method first generates a large number of random binary intermediate/derived features based on the original input matrix, and then performs regularized  linear/logistic regression on those intermediate/derived features to predict the outcome.
  • BIDMach, a CPU and GPU-accelerated machine learning library that shows some amazing benchmark results compared to Spark, Vowpal Wabbit, scikit-learn etc.

And another one which is not as new, but which I wanted to highlight because of a nice blog post about interactions and generalization by David Chudzicki:

Deep learning and genomics: the splicing code [and breast cancer features]

Last summer, I wrote a little bit about potential applications of deep learning to genomics. What I had in mind then was (i) to learn a hierarchy of cell types based on single-cell RNA sequencing data (with gene expression measures in the form of integers or floats as inputs) and (ii) to discover features in metagenomics data (based on short sequence snippets; k-mers). I had some doubts regarding the latter application because I was not sure how much the system could learn from short k-mers. Well, now someone has tried deep learning from DNA sequence features!

Let’s back up a little bit. One of many intriguing questions in biology is exactly how splicing works. A lot is known about the rules controlling it but not everything. A recent article in Science, The human splicing code reveals new insights into the genetic determinants of disease (unfortunately paywalled), used a machine learning approach (ensembles of neural networks) to predict splicing events and the effects of single-base mutations on the same using only DNA sequence information as input. Melissa Gymrek has a good blog post on the paper, so I won’t elaborate too much. Importantly though, in this paper the features are still hand-crafted (there are 1393 sequence based features).

In an extension of this work, the same group used deep learning to actually learn the features from the sequence data. Hannes Bretschneider posted this presentation from NIPS 2014 describing the work, and it is very interesting. They used a convolutional network that was able to discover things like the reading frame (the three-nucleotide periodicity resulting from how amino acids are encoded in protein-coding DNA stretches) and known splicing signals.

They have also made available a GPU-accelerated deep learning library for DNA sequence data for Python: Hebel. Right now it seems like only feedforward nets are available (not the convolutional nets mentioned in the talk). I am currently trying to install the package on my Mac.

Needless to say, I think this is a very interesting development and I hope to try this approach on some entirely different problem.

Edit 2015-01-06. Well, what do you know! Just found out that my suggestion (i) has been tried as well. At the currently ongoing PSB’15 conference, Jie Tan has presented work using a denoising autoencoder network to learn a representation of breast cancer gene expression data. The learned features were shown to represent things like tumor vs. normal tissue status, estrogen receptor (ER) status and molecular subtypes. I had thought that there wasn’t enough data yet to support this kind of approach (and even told someone who suggested using The Cancer Genome Atlas [TCGA] data as much at a data science workshop last month – this work uses TCGA data as well as data from METABRIC), and the authors remark in the paper that it is surprising that the method seems to work so well. Previously my thinking was that we needed to await the masses of single-cell gene expression data that are going to come out in the coming years.

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadleys PhD thesis (free pdf here) is a very inspiring tour of different aspects of practical data analysis issues, such as reshaping data into a “tidy” for that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages) and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also made the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread.
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” which is designed for responsive dynamic graphics, streaming visualization and meant for the web. This looked really nice. For example, you can easily add sliders to a plot so you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R markdown documents with knitR so that the resulting report can contain sliders and other things. This is obviously not possible with PDFs though. ggvis will be released on CRAN “in a week or so”.

Hadley also highlighted the magrittr package which implements a pipe operator for R (Magritte/pipe … get it? (groan)) The pipe looks like %>% and at first blush it may not look like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary.)

Follow the Data podcast, episode 3: Grokking Big Data with Paco Nathan

In this third episode of the Follow the Data podcast we talk to Paco Nathan, Data Scientist at Concurrent Inc.

Podcast link:

Paco’s blog:

The running time is about one hour.

Paco’s internet connection died just as we were about to start the podcast so he had to connect via Skype on the iPhone. We apologize on the behalf of his internet provider in Silicon Valley for the reduced sound quality caused by this.

Here’s a few links to stuff we discussed:
An application framework for Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop.
A dialect of Lisp that runs on the JVM.
A Scala library that makes it easy to write MapReduce jobs in Hadoop.
A simple command line interface for building large-scale data processing jobs based on Cascading.
states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, Partition tolerance
an article on the USB-sized Oxford Nanopore MinION sequencer
Previously known as Data Without Borders this organisation aims to do good with Big Data.
Prediction based insurance for farmers. All_Watched_Over_by_Machines_of_Loving_Grace_(TV_series)
An interesting take on how programming culture has affected life. Link to episode #2 (  “The use and abuse of vegetational concepts” – about how the idea of ecosystems came to be, sprung out of the notion of harmony in nature, how this influenced cybernetics and the perils of taking this animistic concept too far.
A great way to teach kids to code.
Another interesting tool for teaching kids to code and build games.
Free form virtual reality game.
Some info on arduino-based wireless wind measurement project by Karl-Petter Åkesson (in Swedish).
A pioneering internet retailer that Paco was one of the founders for.

What can “big data” (read “Hadoop”) do for genomics?

Prompted by the recent news that Cloudera and Mount Sinai School of Medicine will collaborate to “solve medical challenges using big data” (more specifically, Cloudera’s Jeff Hammerbacher, ex-big data guru at Facebook, will collaborate with the equally trailblazing mathematician/biologist Eric Schadt at Mount Sinai’s Institute for Genomics and Multiscale Biology) and that NextBio will collaborate with Intel to “optimize the Hadoop stack and advance big data technologies in medicine”, I would like to offer some random thoughts on possible use cases.

Note that “big data” essentially means “Hadoop” in the above press releases, and that the “medicine” they mention should be understood as “genomic medicine” or just “genomics”. Since I happen to know a thing or two about genomics, I will limit myself to (parts of) genomics and Hadoop/MapReduce in this post. For a good overview of big data and medicine in a broader sense than I can describe here, check out this rather nice GigaOm article.

Existing Hadoop/MapReduce stuff for NGS

In the world of high-throughput, or next-generation sequencing (NGS), which is rapidly becoming more and more indispensable for genomics, there are a few Hadoop-based frameworks that I am aware of and that should probably be mentioned first. Packages like Cloudburst and Crossbow leverage Hadoop to perform “read mapping” (approximate string matching for taking a DNA sequence from the sequencer and figuring out where in a known genome it came from), Myrna and Eoulsan do the same but also extend the workflow to quantifying gene expression and identifying differentially expressed genes based on the sequences, and Contrail does Hadoop-based de novo assembly (piecing together a new genome from sequences without previous knowledge, like an extremely difficult jigsaw puzzle). These are essentially MapReduce implementations of existing software, which is all good and fine, but I haven’t seen these tools being used much so far. Perhaps one reason is that read mapping is usually not a major bottleneck compared to some other steps, and with recently released software such as SeqAlto and SNAP (thx Tom Dyar) (and another package that I’m sure I read about the other day but can’t seem find right now) promising a further 10x-100x speed increase compared to existing tools, there is just not a pressing need at the moment. Contrail, the de novo assembler,  does offer an opportunity for research groups who don’t have access to a very RAM-rich computers (de novo assembly is notoriously memory hungry, with 512 Gb RAM machines often being strained to the limit on certain data sets) to perform assembly on commodity clusters.

Then there are the projects that attempt to build a Hadoop infrastructure for next-generation sequencing, like Seal, which provides “map-reducification” for a number of common NGS operations, or Hadoop-BAM (a library for processing BAM files, a common sequence alignment format, in Hadoop) and SeqPig (a library with import and export functions to allow common bioinformatics formats to be used in Pig).

What Hadoop could be useful for

I’m sure people smarter than me will come up with many different use cases for Hadoop in genomics and medicine. At this point, however, I would suggest these general themes:

  • Statistical associations between various kinds of data vectors – clinical, environmental, molecular, microbial... This is more or less a batch-processing problem and thus suited to Hadoop. NextBio (the company mentioned in the beginning, who are teaming up with Intel) are doing this as a core part of their business; computing correlations between gene expression levels in different tissues, diseases and conditions and clinical information, drug data etc. However, this concept could (and should) be extended to other things like environmental information, lifestyle factors, genetic variants (SNV, structural variations, copy number variations etc.), epigenetic data (chromatic structure, DNA methylation, histone modifications …), personal microbiomes (the gut microbiota in each patient etc.) Of course, collecting and compiling the data to perform these correlations will be hard; a much harder “big data” problem than computing the actual correlations.  SolveBio is a new company that seems to want to understand cancer by compiling vast quantities of data in such a way. This is how they put it in an interview (titled, ambitiously, “The Cloud Will Cure Cancer“): “Patients can measure every feature, as the technology becomes cheaper: genome sequence, gene expression in every accessible tissue, chromatin state, small molecules and metabolites, indigenous microbes, pathogens, etc. These data pools can be created by anyone who has the consent of the patients: universities, hospitals, or companies. The resulting networks, the “data tornado”, will be huge. This will be a huge amount of data and a huge opportunity to use statistical learning for medicine.” In fact, a third recently announced bigdata/genomics collaboration, between Google and the Institute for Systems Biology (ISB), has already started to explore what this type of tools could look like in their Cancer Regulome Explorer. ISB has used the Google Compute Engine to scale a random forest algorithm to 600,000 cores across Google’s global data centers in order to “explore associations between DNA, RNA, epigenetic, and clinical cancer data.” See this case study for some more details (not many more to be honest.)
  • Metagenomics. This means, according to one definition, “the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.(There is really nothing “meta” about it, it’s just that you are looking at many species at once, which is why it is also called environmental genomics or community genomics in some cases.) For example, Craig Venter’s project to sequence as many living things as possible in the Sargasso sea is metagenomics, as is sequencing samples from the human gut, snot etc. in search of novel bacteria, viruses and fungi (or just characterizing the variety of known ones.) It’s a fascinating field; for an easy introduction, see the TED Talk called “What’s left to explore?” by Nathan Wolfe. Analyzing sequences from metagenomics projects is of course much more difficult than usual, because you are randomly sampling sequences for which you don’t know the source organism but have to infer it in some way. This calls for smart use of proper data structures for indexing and querying, and as much parallelization as possible, very likely in some Hadoopy kind of way. C Titus Brown has written a lot of interesting stuff about the metagenomics data deluge on his blog, Living in an Ivory Basement, where he has explored esoteric and useful things such as probabilistic de Bruijn graphs. Lately, compressive genomics – algorithms that compute directly on compressed genomic data – has become something of a buzz phrase (although similar ideas have been used for quite some time). Some combination of all of these approaches will be needed to combat the inevitable information overload.

Beyond batch processing

In my mind, Hadoop has been associated with batch processing, but today I heard that the newest version of Hadoop not only includes a completely overhauled version of MapReduce called YARN, but it will even allow using other kinds of frameworks, such as streaming real-time analytics frameworks, to operate on the data stored in HDFS. I’ve been thinking about possible applications of stream analytics in next-generation sequencing. Surprisingly, there is already software for streaming quantification of sequences, eXpress – these guys are surely ahead of their time. The immediate use case I can think of is for the USB-stick-sized MinION nanopore sequencer, which reportedly will produce output in a real-time manner (which no sequencers do today as far as I know) so that you can start your analysis while the sequencer is still running. If the vision about “genomic observatories” to “take the planet’s biological pulse” comes true, I’m sure there will be plenty of work to do for the stream analytics clusters of the world …

This has been a rambling post that will probably need a few updates in the coming days – congratulations and thanks if you made it to the end!

MLDemos visualizes what classifiers do

MLDemos is based on a really nice idea – to visualize how different classifiers construct the decision boundaries around arbitrary sets of data points. I had of course seen the concept of decision boundaries before; in many machine-learning classes you will draw or at least get to see boundaries or surfaces that delineate the parts of the sample space where a classifier will yield different predictions. In MLDemos, you get to draw the points in the (2-D) sample space by hand, and you can choose between a variety of different algorithms. Or if you want, you can upload your own data sets. The software doesn’t just do decision boundaries, it also visualizes regression, clustering and dynamical systems in cool and downright beautiful ways.

Post Navigation