Follow the Data

A data driven blog

GraphLab Create

Just a heads-up that you can now get a free beta version of GraphLab Create. It’s a Python library that lets you use GraphLab functionality to easily do things like calculating PageRank scores, building recommender systems and so on. Good for people like me who don’t have the time or patience for complicated installation processes (you can just use pip install). So far I’ve only worked through some examples, like the Six Degrees of Kevin Bacon tutorial from Strata 2014, while waiting for inspiration to strike regarding what I should implement for my own purposes. It seems quite intuitive so far.
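In case PageRank is new to you, here is a minimal pure-Python sketch of the underlying power iteration – just an illustration of the computation itself, not of GraphLab Create’s API:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as {node: [outgoing links]}."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node starts with the "random jump" share, then receives
        # a fraction of the rank of every node that links to it.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:  # dangling node: spread its rank over all nodes
                for target in nodes:
                    new_ranks[target] += damping * ranks[node] / n
        ranks = new_ranks
    return ranks

# A tiny example graph: everyone links to "c", so it should rank highest.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

A library like GraphLab runs essentially this iteration, but distributed over a large graph rather than over a three-node dictionary.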

Cancer, machine learning and data integration

Machine Learning Methods in the Computational Biology of Cancer is an arXiv preprint of a pretty nice article dealing with analysis methods that can be used for high-dimensional biological (and other) data – although the examples come from cancer research, they could easily be about something else. The paper does a good job of describing penalized regression methods such as the lasso, ridge regression and the elastic net. It also goes into compressed sensing and its applicability to biology, while cautioning that it cannot yet be straightforwardly applied to biological data. This is because compressed sensing is based on the assumption that one can choose the “measurement matrix” freely, whereas in biology, this matrix (usually called the “design matrix” in this context) is already fixed.
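To make the contrast between those penalty types concrete, here is a small scikit-learn sketch on synthetic data (assuming scikit-learn and NumPy are installed; the dataset and parameter values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic high-dimensional data: 50 samples, 200 features,
# but only the first 5 features actually influence the response.
rng = np.random.RandomState(0)
X = rng.randn(50, 200)
true_coef = np.zeros(200)
true_coef[:5] = [3, -2, 4, -1, 2]
y = X @ true_coef + 0.1 * rng.randn(50)

# The lasso (L1 penalty) drives most coefficients exactly to zero,
# ridge (L2 penalty) only shrinks them towards zero, and the elastic
# net mixes both penalties via l1_ratio.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```

This is exactly the setting the paper cares about: far more features (genes, probes) than samples (patients), where the sparsity induced by the L1 penalty doubles as feature selection.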

The Critical Assessment of Massive Data Analysis (CAMDA) 2014 conference has released its data analysis challenges. Last year’s challenges on toxicogenomics and toxicity prediction will be reprised (perhaps in modified form, I didn’t check), but they have added a new challenge which I find interesting because it focuses on data integration (combining distinct data sets on gene, protein and micro-RNA expression as well as gene structural variations and DNA methylation) and uses published data from the International Cancer Genome Consortium (ICGC). I think it’s a good thing to re-analyze, mash up and meta-analyze data from these large-scale projects, and the CAMDA challenges are interesting because they are so open-ended, in contrast to e.g. Kaggle challenges (which I also like, but in a different way). The goals in the CAMDA challenges are quite open to interpretation (and also ambitious), for instance:

  • Question 1: What are disease causal changes? Can the integration of comprehensive multi-track -omics data give a clear answer?
  • Question 2: Can personalized medicine and rational drug treatment plans be derived from the data? And how can we validate them down the road?

Two good resources (about sklearn and deep learning)

I have been using R, mostly happily, for the past 6 or 7 years, for its variety of statistical and machine learning packages and the relative ease of producing nice-looking plots. At the same time I am a big user of Python for things that R really doesn’t do that well, such as large-scale string manipulation. I had been aware of scikit-learn (or sklearn) for a while as a potential way to be able to do “everything” in Python including stats and plotting, but never really felt the pull to start using it. In the beginning, it felt too immature; later, it felt too messy when I looked at the documentation.

Last week, however, I came across a really good tutorial by Jake Vanderplas that finally made sklearn click for me and perhaps will push me over the edge to start using it. (I don’t expect to leave R any time soon, though…) The tutorial shows, step by step, how to divide your data set into training and test sets, fit models and make predictions, perform grid searches for parameter settings, plot learning curves etc.
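A condensed version of that workflow looks something like the sketch below (import paths are for current scikit-learn versions; the module layout has changed since 2014, and the dataset and parameter grid here are just toy choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Step 1: load a toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: grid-search a parameter setting with cross-validation
# on the training set only.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

# Step 3: evaluate the chosen model on the held-out test set.
test_accuracy = grid.score(X_test, y_test)
```

The appealing thing, and what the tutorial shows well, is that every estimator follows the same fit/predict/score interface, so swapping the classifier for another model changes one line.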

 Deep learning is another subject (although much bigger than sklearn of course) that I have kept up a passing interest in but never really looked into properly, because I wasn’t sure where to start. The new book Deep learning: Methods and applications (PDF link) by Li Deng and Dong Yu seems like a good place to start. I’ve only read a few chapters, but so far it has done a good job of clarifying terms and putting deep learning methods into a historical context.

Machine learning goings-on in Stockholm

The predictive analytics scene in Stockholm hasn’t been very vibrant, but at least we now have the Machine Learning Stockholm meetup group, which had its inaugural session hosted at Spotify on February 25 under the heading “Graph-parallel machine learning”. There was a short introduction to graph-centric computing paradigms and hands-on demos of GraphLab and Giraph.

The Stockholm R useR group has hosted good analytics-themed meetups from time to time. On Saturday (March 29), they will organize a hackathon with two tracks: a predictive modelling contest about predicting flat prices (always a timely theme in this town) and an “R for beginners” track.

Finally, the Stockholm-based Watty are looking for a head of machine learning development (or maybe several machine learning engineers; see ad) to lead a team that will diagnose the energy use of buildings and work to minimize energy waste.

Some resources

  • DataTau seems like a worthwhile Reddit-like site devoted to all things data.
  • Foundations of data science [PDF link]. A (quite complete) draft of a rather mathematically oriented book on data science. I haven’t had time to read it yet but it looks interesting. It seems to put quite a lot of emphasis on understanding the quirks of high-dimensional spaces.
  • Techniques to improve the accuracy of your predictive models. Useful (1h 20 min long) video of a presentation given by Phil Brierley at an R user group meeting in Melbourne.
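One of those high-dimensional quirks is easy to demonstrate in a few lines: as the dimension grows, pairwise distances concentrate, so “nearest” and “farthest” neighbors become almost indistinguishable. A small pure-Python illustration (the point counts and dimensions are arbitrary):

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=42):
    """Ratio of the farthest to the nearest distance from the origin
    among random points in the cube [-1, 1]^dim.

    In high dimensions this ratio approaches 1 -- distances
    concentrate, which is one reason nearest-neighbor reasoning
    gets unreliable in high-dimensional spaces.
    """
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        point = [rng.uniform(-1, 1) for _ in range(dim)]
        dists.append(math.sqrt(sum(x * x for x in point)))
    return max(dists) / min(dists)

low = distance_contrast(2)      # large ratio: distances vary a lot
high = distance_contrast(1000)  # ratio near 1: distances concentrate
```

Running this shows a large contrast in 2 dimensions and a ratio close to 1 in 1000 dimensions – exactly the kind of counterintuitive behavior the book digs into.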

Hadoop (and other parallel computing framework) solutions for genomics &amp; proteomics

OK, so let’s see what applications of parallel computing frameworks we can currently find in “high-throughput biology”, represented here by genomics and proteomics. I’ll focus mostly on Hadoop, since I find it interesting to look at how much traction it gets in life science. It seems pretty clear that it’s not used as much in this space as in various other areas such as retail, advertising, gaming and so on. This could be because

  • the map-reduce framework lends itself more easily to transaction data (who bought what on Amazon, who clicked on a certain link etc.) or other kinds of data like tweets that can be represented in a single line of text
  • biological data sets are simply not that big yet (the data volumes of a Netflix or a Facebook dwarf those of even the most powerful sequencing centers)
  • computational biologists usually work on supercomputing clusters that are provided by universities or research institutes (and that are administered by someone else), or on a single server – but not on large clusters of cheap machines which they can administer themselves
  • there are too few programmers in biology who can (or have time to) work with these systems

Any other suggestions for reasons behind this discrepancy?
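To make the first point above concrete, here is the classic word-count example written as a pure-Python sketch of the map-reduce pattern (no Hadoop involved; in a real cluster the shuffle between the two phases is what Hadoop handles for you):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Mapper: emit a (key, 1) pair for each word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    """Reducer: sum the counts per key. In Hadoop, all pairs with the
    same key are shuffled to the same node before this step runs."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Each input record is a self-contained line -- the kind of data
# (transactions, log lines, tweets) that map-reduce handles naturally.
# A genome assembly, by contrast, can't be split into independent lines.
lines = ["to be or not to be", "to see or not to see"]
word_counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```

The mapper never needs to see more than one record at a time, which is precisely the property that much biological data lacks.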

The parallel computing framework that has been used the most in bioinformatics is probably MPI (Message Passing Interface), used for example in mpiBLAST and lately in the Ray (DNA sequence) assembler, which is very powerful on the kind of MPI-enabled clusters that are often found in academia.

Before we move over to Hadoop, I just wanted to mention that a newer cluster computing framework, Spark, has recently been used by Adam Roberts et al. for “streaming fragment assignment” in the cloud. Depending on your background, you can understand this from a mathematical point of view as performing expectation maximization efficiently in a distributed way, or from a biological point of view as assigning sequence reads to their likely transcript of origin in a fast and probabilistic manner.
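For readers who want the mathematical view made concrete, here is a toy, serial pure-Python sketch of that EM computation (an illustration of the idea only – the actual streaming, distributed implementation is far more sophisticated, and the function and variable names here are my own):

```python
def em_assign(compatibility, n_iter=100):
    """Estimate relative transcript abundances from ambiguous read
    mappings via expectation maximization.

    compatibility: list of sets; compatibility[i] holds the indices of
    the transcripts that read i could have originated from.
    """
    n_transcripts = max(max(s) for s in compatibility) + 1
    abundance = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        # E-step: split each read fractionally among its compatible
        # transcripts, in proportion to current abundance estimates.
        expected = [0.0] * n_transcripts
        for compat in compatibility:
            total = sum(abundance[t] for t in compat)
            for t in compat:
                expected[t] += abundance[t] / total
        # M-step: new abundances are the normalized expected read counts.
        n_reads = len(compatibility)
        abundance = [e / n_reads for e in expected]
    return abundance

# Three reads map only to transcript 0, one only to transcript 1,
# and two are ambiguous between the two.
reads = [{0}, {0}, {0}, {1}, {0, 1}, {0, 1}]
abundances = em_assign(reads)
```

The ambiguous reads end up mostly attributed to transcript 0, since the unambiguous reads already suggest it is more highly expressed – that is the “probabilistic assignment” in a nutshell.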

Now to the Hadoop applications in bioinformatics. For this part, I have relied partly on a review article, Survey of MapReduce frame operation in bioinformatics.


Hydra – A Hadoop-based search engine for matching spectra from shotgun mass spectrometry against increasingly large sequence databases (see paper)

Chorus – Not sure if this actually uses Hadoop, but it says it uses the map-reduce paradigm on Amazon EC2. It is intended as a cloud-enabled storage area for all of the world’s mass spectrometry data.


SeqPig – a library to enable the usage of Hadoop Pig features for analyzing high-throughput sequencing data sets. Builds on Hadoop-BAM, a useful Java library for dealing with various high-throughput sequencing formats such as BAM, FASTQ, and BCF in Hadoop.

Seal – A suite of tools for DNA sequence alignment and related tasks, which is the only Hadoop tool for genomics that I know is actively used in production.

Cloudburst – Sequence alignment. An early demonstration of what is possible; not used much now (from what I can tell)

Crossbow – Resequencing analysis (sequence alignment + SNP calling)

Eoulsan – RNA sequencing analysis pipeline interfacing mostly existing tools

Myrna – RNA sequencing analysis pipeline with newly written code for some steps

Fx – RNA sequencing analysis pipeline. Uses Hadoop and the excellent RNA-seq aligner GSNAP, so probably a good solution

SeqWare – Includes LIMS-type functionality, a workflow engine, a query engine etc. for handling high-throughput sequencing data.

Contrail – De novo sequence assembly based on Hadoop. Very interesting approach (see e.g. this presentation), but when I tried it, it was rather poorly documented and I was unable to get satisfactory results. Lately I have been thinking about whether new graph-based parallel computing frameworks like GraphLab could be adapted for de novo assembly (which is essentially a graph traversal problem).
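To give a flavor of why assembly is essentially a graph problem, here is a toy pure-Python sketch of the de Bruijn graph approach (purely illustrative, repeat-free input – no claim is made about how Contrail implements this):

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Reconstruct a sequence by consuming outgoing edges from `start`.

    Real assemblers find Eulerian paths and must cope with repeats,
    sequencing errors and uneven coverage; this greedy walk only works
    on simple, repeat-free toy input. Note that it mutates the graph.
    """
    sequence, node = start, start
    while graph.get(node):
        node = graph[node].pop()  # consume one outgoing edge
        sequence += node[-1]
    return sequence

# Two overlapping reads from the toy "genome" ATGGCGTCA.
reads = ["ATGGCG", "GGCGTCA"]
graph = de_bruijn_graph(reads, k=3)
assembled = walk(graph, "AT")  # reconstructs "ATGGCGTCA"
```

The traversal is where the parallelization pain lives – the graph is global, so unlike word counting it cannot be chopped into independent records, which is exactly why a graph-centric framework like GraphLab seems like a more natural fit than plain map-reduce.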

There are also other projects listed in the review article I referenced above, but I have skipped those for various reasons (e.g. extremely niche applications, solving problems that are too small to need Hadoop, etc.)

I’d be happy to receive feedback on this little survey with things I’ve missed, reasons for why the adoption is not higher, suggestions for tools that should be developed, and so on.

Analytics challenges in genomics

Continuing on the theme of data analysis and genomics, here is a presentation I gave for the Data Mining course at Uppsala University in October this year. It talks a little bit about massively parallel DNA sequencing, then goes on to mention grand visions such as sequencing millions of genomes, discovering new species by metagenomics, “genomic observatories” etc., then goes into the practical difficulties and finally suggests some strategies like prediction contests. Enjoy!

Genomics and data stuff

I’ve noticed that a lot of the traffic to this blog is driven by queries about genomics and “big data”. I guess I’d better re-post some of the more useful resources. Perhaps I will even make a little mini-series about it.

So, without further ado: Anyone who is interested in the intersection of genomics and (big) data (science) might want to check out the following.

Of course there are many others. I’ll follow up with more material during the next few weeks.

Data related vacancies in Sweden (mostly Stockholm)

A sampling of currently available positions.

New(ish) resources for learning about deep, statistical & machine learning
