Follow the Data

A data driven blog

Archive for the tag “bioinformatics”

Ramblings on graphs

During a few years at the beginning of the millennium, there was a flurry of excitement around graph theory and network science, with things like small-world networks and scale-free networks being (re-)discovered, and popular books like Six Degrees and Linked being published. I, too, made some efforts to apply ideas from graph theory to biology, usually guided by my friend, the graph ninja Petter Holme. (E.g. Subnetwork hierarchies of biochemical pathways where we looked at edge betweenness and community structure in biochemical networks and and Role-similarity based functional prediction in networked systems: Application to the yeast proteome where we used a PageRank-like method and the sociological idea of “role similarity” to predict functions of proteins.) I also dabbled in neuroscience applications of graph theory such as the connection between epilepsy and brain connectivity structures (alas, never published).

Then I drifted away from graphs and networks (the same thing, really!) to focus on statistics and arcane aspects of DNA sequencing. Lately, however, I’ve seen a lot of interesting graph-related developments.

The first is a superficial observation: Facebook has introduced Graph Search and Spotify has introduced the Music Graph as perfectly normal “products” without seeing the need for a catchier name. Has the word “graph” become quasi-mainstream, as happened with the “six degrees” concept which we’ve seen referenced many times in pop culture?

On the academic side, theory has now been developed for temporal graphs, which go beyond the static networks of yesterday and attempt to incorporate the time aspect. Much of this is due to the aforementioned graph ninja – see e g his book (cowritten with Jari Saramäki) or recent single-author paper in PLoS CompBio.

On the “big graph” side, one of the most interesting developments is GraphLab, a parallel computing framework for machine learning on graphs. This meaty presentation contrasts “data-parallel” problems, for which MapReduce works well, with “graph-parallel” problems (such as PageRank) for which GraphLab works well.

Other “big graph” projects that I’ve heard of but not really tried are Google’s Pregel, Apache Giraph, which as I understand it is an open-source implementation of Pregel (like how Hadoop was initially an open source implementation of Google’s MapReduce), and Neo4j.

GraphChi is a spin-off from GraphLab and is said to enable very fast processing of huge graphs on a single laptop by “using a novel algorithm to process the graph from disk”. This turned out not to be idle hype – when I downloaded it and tried it on my Mac on some sample (large) networks, it was blazingly fast. Implemented example applications include PageRank calculations, matrix factorizations, and triangle counting, but the framework is extensible. Definitely something to try if you don’t feel like setting up clusters. The installation process was easy too.

From a bioinformatics angle, so-called de novo genome assembly is essentially a graph traversal problem. It can be cast in different forms (usually based string graphs or de Bruijn graphs) – the paper Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing by Nagarajan and Pop gives a thorough overview of the various tough traversal problems that one ends up with (Hamiltonian paths etc.)

I’ve learned from the informative and frequently updated blog that a lot of tricky things are being done to improve the performance and scalability of genome assembly, like cascading Bloom filters and MPI and Hadoop implementations, but I can’t help wondering if it wouldn’t help to use one of these inherently graph-oriented approaches. Does anyone know if there are projects along these lines? A recent Twitter comment from Mike Schatz (assembly expert and developer of the Hadoop-based Contrail assembler) said that he was thinking about it but still waiting for a good open-source implementation of Pregel. I guess that implies he doesn’t consider Giraph to be sufficiently developed at this point.

Compressive sensing and bioinformatics

Compressive (or compressed) sensing is (as far as I can tell; I have learned everything I know about it from the excellent Nuit Blanche blog) a relatively newly named research subject somewhere around the intersection of signal processing, linear algebra and statistical learning. Roughly speaking, it deals with problems like how to reconstruct a signal from the smallest possible number of measurements and how to complete a matrix (to fill in missing matrix elements – if you think that sounds kind of abstract, this  presentation on collaborative filtering by Erik Bernhardssons nicely explains how Spotify uses matrix completion for song recommendations; NetFlix and others also do similar things). The famous mathematician Terence Tao, who has done a lot of work in developing compressed sensing, has much better explanations than I can give, for instance this 2007 blog post that exemplifies the theory with the “single-pixel camera” and this PDF presentation where he explains how CS relates to standard linear (Ax = b) equation systems that are under-determined (more variables than measurements). Igor Carron of the Nuit Blanche blog also has a “living document” with a lot of explanations about what CS is about (probably more than you will want to refer to in a single sitting).

Essentially, compressed sensing works well when the signal is sparse in some basis (this statement will sound much clearer if you have read one of the explanations I linked above). The focus on under-determined equation systems made me think about whether this could be useful for bioinformatics, where we frequently encounter the case where we have few measurements of a lot of variables (think of, e.g., the difficulty of obtaining patient samples, and the large number of gene expression measurements that are taken from them). The question is, though, whether gene expression vectors (for example) can be thought of as sparse in some sense. Another train of thought I had is that it would be good to develop even more approximate and, yes, compressive methods for things like metagenomic sequencing, where the sheer amount of information pretty quickly starts to break the available software. (C Titus Brown is one researcher who is developing software tools to this end.)

Of course, I was far from being the first one to make the connection between compressive sensing and bioinformatics. Olgica Milenkovic had thoughtfully provided a presentation on sparse problems in bioinformatics (problems that thus could be addressable with CS techniques).

Apart from the applications outlined in the above presentation, I was excited to see a new paper about a CS approach to metagenomics:

Quikr – A method for rapid reconstruction of bacterial communities using compressed sensing

Also there are a couple of interesting earlier publications:

A computational model for compressed sensing RNAi cellular screening

Compressive sensing DNA microarrays

This & that

  • The BigML blog has been on a roll lately with many interesting posts. I particularly liked this one, Bedtime for Boosting, which goes pretty deep into benchmarking various versions of the boosting algorithms we all know and love (?).
  • Mark Gerstein of Yale University has a nice slide deck about the big data blizzard in genomics (<– pdf link). There are lots of ideas here about how to build predictive models based on, for example, ENCODE data. I won’t get into the ongoing controversy around ENCODE here, suffice to say that I think the ENCODE data sets are a good resource for starting to build statistical models of genomic regulation on a larger scale.
  • The O’Reilly Radar has a good post about how Python data tools just keep getting better.
  • An “ultra-tricky” bioinformatics challenge will be run by Genome Biology on DNA Day (April 25), with a “truly awesome” prize. Intriguing.

Topology and data analysis: Gunnar Carlsson and Ayasdi

A few months ago, I read in Wired [Data-Visualization Firm’s New Software Autonomously Finds Abstract Connections] and Guardian [New big data firm to pioneer topological data analysis] about Ayasdi, the new data visualization & analytics company founded by professor Gunnar Carlsson at Stanford that has received millions of funding from Khosla Ventures, DARPA and other places. Today, I had the opportunity to hear Carlsson speak at the Royal Institute of Technology (KTH) in Stockholm about the mathematics underlying Ayasdi’s tools. I was very eager to hear how topology (Carlsson’s specialty) connects to data visualization, and about their reported success in classifying tumor samples from patients [Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival].

The talk was very nice. Actually there were two talks – one for a more “general audience” (which, in truth, probably consisted mostly of hardcore maths or data geeks) and one that went much deeper into the mathematics – and completely over my head.

One thing that intrigued me was that almost all of his examples were taken from biology: disease progression, cell cycle gene expression, RNA hairpin folding, copy number variations, mapping genotypes to geographic regions … the list goes on. I suppose it’s simply because Carlsson happens to work a lot with biologists at Stanford, but could it be that biology is an especially fertile area for innovation in data analysis? As Carlsson highlights in a review paper [Topology and Data], data sets from high-throughput biology often have many dimensions that are superfluous or have unknown significance, and there is no good understanding of which distance measures between data points that should be used. That is one reason to consider methods that do not break down when the scale is changed – such as topological methods, which are insensitive to the actual choice of metrics.

(Incidentally, I think there is a potential blog post waiting to be written about how many famous data scientists have come out of a biology/bioinformatics background. Names like Pete Skomoroch, Toby Segaran and Michael Driscoll come to mind, and those were just the ones I thought of instantly.)

Another nice aspect of the talk was that it was no sales pitch for Ayasdi (the company was hardly even mentioned) but more of a bird’s eye view of topology and its relation to clustering and visualization. In my (over)simplified understanding, the methods presented represent the data as a network where the nodes, which are supposed to represent “connected components” in an ideal scenario, are clusters derived using, e.g., hierarchical clustering. However, there is no cutoff value defined for breaking the data into clusters, but instead the whole outcome of the clustering – its profile, so to speak – is encoded and used in the following analysis. Carlsson mentioned that one of the points of this sort of network representation of data was to “avoid breaking things apart”, as clustering algorithms do. He talked about classifying the data point clouds using “barcodes”, ensuring persistence across changes of scale. The details of how these barcodes were calculated were beyond my comprehension.

Carlsson showed some examples of how visualizations created using his methods improved on hierarchical clustering dendrograms or PCA/MDS plots. He said that one of the advantages of the method is that it can identify small features in a large data set. These features would, he said, be “washed out” in PCA and not be picked up by standard clustering methods.

I look forward to learning more about topological data analysis. There are some (links to) papers at Ayasdi’s web site if you are interested, e.g.: Extracting insights from the shape of complex data using topology, Topology and Data (already mentioned above), Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition.

Synapse – a Kaggle for molecular medicine?

I have frequently extolled the virtues of collaborative crowdsourced research, online prediction contests and similar subjects on these pages. Almost 2 years ago, I also mentioned Sage Bionetworks, which had started some interesting efforts in this area at the time.

Last Thursday, I (together with colleagues) got a very interesting update on what Sage is up to at the moment, and those things tie together a lot of threads that I am interested in – prediction contests, molecular diagnostics, bioinformatics, R and more. We were visited by Adam Margolin, who is director of computational biology at Sage (one of their three units).

He described how Sage is compiling and organizing public molecular data (such as that contained in The Cancer Genome Atlas) and developing tools for working with it, but more importantly, that they had hit upon prediction contests as the most effective way to generate modelling strategies for prognostic and diagnostic applications based on these data. (As an aside, Sage now appears to be focusing mostly on cancers rather than all types of disease as earlier; applications include predicting cancer subtype severity and survival outcomes.) Adam thinks that objectively scored prediction contests lets researchers escape from the “self-assessment trap“, where one always unconsciously strives to present the performance of one’s models in the most positive light.

They considered running their competitions on Kaggle (and are still open to it, I think) but given that they already had a good infrastructure for reproducible research, Synapse, they decided to tweak that instead and run the competitions on their own platform. Also, Google donated 50 million core hours (“6000 compute years”) and petabyte-scale storage for the purpose.

There was another reason not to use Kaggle as well. Sage wanted participants to not only upload predictions for which the results is shown on a dynamic leaderboard (which they do), but also to force them to provide runnable code which is actually executed on the Sage platform to generate the predictions. The way it works is that competitors need to use R to build their models, and they need to implement two methods, customTrain() and customPredict() (analogous to the train() and predict() methods implemented by most or all statistical learning methods in R) which are called by the server software. Many groups do not like to use R for their model development but there are ways to easily wrap arbitrary types of code inside R.

The first full-scale competition run on Synapse (which is, BTW, not only a competition platform but a “collaborative compute space that allows scientists to share and analyze data together”, as the web page states) was the Sage/DREAM Breast Cancer Prognosis Challenge, which uses data from a cohort of almost 2,000 breast cancer patients. (The DREAM project is itself worthy of another blog post as a very early (in its seventh year now, I think) platform for objective assessment of predictive models and reverse engineering in computational biology, but I digress …)

The goal of the Sage/DREAM breast cancer prognosis challenge is to find out whether it is possible to identify reliable prognostic molecular signatures for this disease. This question, in a generalized form (can we define diseases, subtypes and outcomes from a molecular pattern?), is still a hot one after many years of a steady stream of published gene expression signatures that have usually failed to replicate, or are meaningless (see e g Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome). Another competition that I plugged on this blog, SBV Improver, also had as its goal to assess if informative signatures could be found and its outcomes were disclosed recently. The result there was that out of four diseases addressed (multiple sclerosis, lung cancer, psoriasis, COPD), the molecular portrait (gene expression pattern) for one of them (COPD) did not add any information at all to known clinical characteristics, while for the others the gene expression helped to some extent, notably in psoriasis where it could discriminate almost perfectly between healthy and diseased tissue.

In the Sage/DREAM challenge, the cool thing is that you can directly (after registering an account) lift the R code from the leaderboard and try to reproduce the methods. The team that currently leads, Attractor Metagenes, has implemented a really cool (and actually quite simple) approach to finding “metagenes” (weighted linear combinations of actual genes) by an iterative approach that converges to certain characteristic metagenes, thus the “attractor” in the name. There is a paper on arXiv outlining the approach. Adam Margolin said that the authors have had trouble getting the paper published, but the Sage/DREAM competition has at least objectively shown that the method is sound and it should find its way into the computational biology toolbox now. I for one will certainly try it for some of my work projects.

The fact that Synapse stores both data and models in an open way has some interesting implications. For instance, the models can be applied to entirely new data sets, and they can be ensembled very easily (combined to get an average / majority vote / …). In fact, Sage even encourages competitors to make ensemble versions of models on the leaderboard to generate new models while the competition is going on! This is one step beyond Kaggle. Indeed, there is a team (ENSEMBLE) that specializes in this approach and they are currently at #2 on the leaderboard after Attractor Metagenes.

In the end, the winning team will be allowed to publish a paper about how they did it in Science Translational Medicine without peer review – the journal (correctly I think) assumes that the rigorous evaluation process in Synapse is more objective that peer review. Kudos to Science Translational Medicine for that.

There’s a lot more interesting things to mention, like how Synapse is now tackling “pan-cancer analysis” (looking for commonalities between *all* cancers), how they looked at millions of models to find out general rules of thumb about predictive models (discretization makes for worse performance, elastic net algorithms work best on average, prior knowledge and feature engineering is essential for good performance, etc.)
Perhaps the most remarkable thing in all of this, though, is that someone has found a way to build a crowdsourced card game, The Cure, on top of the Sage/DREAM breast cancer prognosis challenge in order to find even better solutions. I have not quite grasped how they did this – the FAQ states:

TheCure was created as a fun way to solicit help in guiding the search for stable patterns that can be used to make biologically and medically important predictions. When people play TheCure they use their knowledge (or their ability to search the Web or their social networks) to make informed decisions about the best combinations of variables (e.g. genes) to use to build predictive patterns. These combos are the ‘hands’ in TheCure card game. Every time a game is played, the hands are evaluated and stored. Eventually predictors will be developed using advanced machine learning algorithms that are informed by the hands played in the game.

But I’ll try The Cure right now and see if I can figure out what it is doing. You’re welcome to join me!

Phylo – an alignment game

I’ve been playing some Phylo while snowed in during this weekend. This nifty game, developed by a group at McGill University in Canada, reminds me a lot of FoldIt, which I’ve mentioned several times on this blog. Like FoldIt, Phylo works well just as a logic/pattern-recognition game, but also has a hidden (well, actually not hidden at all) agenda; it tries to apply the strategies used by the (most skillful) players to actual scientific problems. In the case of Phylo, the problem that you are trying to solve is multiple sequence alignment, or described more simply, trying to match up DNA sequences from different species to each other. Multiple sequence alignment is one of the truly classic problems in bioinformatics, and there are many good algorithms for it, but these could still be improved. The idea of Phylo is to leverage human beings’ superior pattern recognition capabilities to solve really tricky multiple alignment problems. Related (or presumably related) DNA sequences from various organisms have already been matched up against each other (aligned) by an existing algorithm, and the idea is that human players may be able to further optimize the alignments “by eye”.

I think there are two things that are really cool about this game. The first thing is that the creators are actually picking the problems from a public resource, the UCSC Genome Browser, where they have located a number of poorly aligned stretches of DNA close to genes (stretches in so-called “promoter regions”). These are regions for which one might suspect that the best alignment hasn’t been found. Also, these regions are interesting from a disease perspective, and each task in Phylo has to do with a certain disease or type of disease.

The second thing that I like is the educational aspect of the game. I’ve studied alignment algorithms (a long time ago), and even though I knew about the scoring schemes on a theoretical level, I hadn’t really understood them in a tangible way before I played Phylo. It’s funny how a game with scores makes you motivated to understand how something works. If I was teaching on a bioinformatics course, I would not hesitate to have the students play Phylo in conjunction with the material on sequence alignment. Never mind the exam, just solve level 9 and you’ve passed the course!

Sequencing data storm

Today, I attended a talk given by Wang Jun, a humorous and t-shirt-clad whiz kid who set up the bioinformatics arm of Beijing Genomics Institute (BGI) as a 23-year-old PhD student, became a professor at 27, and is now the director of BGI’s facility in Shenzhen, near Hong Kong. Although I work with bioinformatics at a genome institute myself, this presentation really drove home how much storage, computing power and know-how is really required for biology now and in the near future.

BGI does staggering amounts of genome sequencing – “If it tastes good, sequence it! If it is useful, sequence it!” as Wang Jun joked – from indigenous Chinese plants to rice, pandas and humans. They have a very interesting individual genome project where they basically apply many different techniques on samples from the same person and compare the results against known references. One of many interesting results from this project was the finding that human genomes not only vary in single “DNA letter” variants (so called SNPs, single nucleotide polymorphisms) or the number of times certain stretches of DNA are repeated (“copy number variations”) – it now turns out there are DNA snippets that, in largely binary fashion, some people have and some don’t.

Although the existing projects demand a lot of resources and manpower – the BGI has 250 bioinformaticians (!) which is still too few; according to Wang they want to quickly increase this number to 500 – this is nothing compared to what will happen after the next wave of sequencing technologies, when we will start to sequence single cells from different (or the same) tissues in an individual. Already, the data sets generated are so vast that they cannot be distributed over the internet. Wang  recounted how he had to bring ten terabyte drives to Europe by himself in order to share his data with researchers at EBI (European Bioinformatics Institute). Now, they are trying out cloud computing as a way to avoid moving the data around.

Wang attributed a lot of BGI’s success to young, hardworking programmers and scientists – many of them university dropouts – who don’t have any preconceptions about science and therefore are prepared to try anything. “These are teenagers that publish in Nature,” said Jun, apparently feeling that he was (at 33) already over the hill. “They don’t run on a 24-hour cycle like us, they run on 36h-cycles and bring sleeping bags to the lab.”

All in all, good fun.

Post Navigation