Follow the Data

A data driven blog

Archive for the tag “rambling”

Ramblings on graphs

During a few years at the beginning of the millennium, there was a flurry of excitement around graph theory and network science, with things like small-world networks and scale-free networks being (re-)discovered, and popular books like Six Degrees and Linked being published. I, too, made some efforts to apply ideas from graph theory to biology, usually guided by my friend, the graph ninja Petter Holme. (E.g. Subnetwork hierarchies of biochemical pathways where we looked at edge betweenness and community structure in biochemical networks and and Role-similarity based functional prediction in networked systems: Application to the yeast proteome where we used a PageRank-like method and the sociological idea of “role similarity” to predict functions of proteins.) I also dabbled in neuroscience applications of graph theory such as the connection between epilepsy and brain connectivity structures (alas, never published).

Then I drifted away from graphs and networks (the same thing, really!) to focus on statistics and arcane aspects of DNA sequencing. Lately, however, I’ve seen a lot of interesting graph-related developments.

The first is a superficial observation: Facebook has introduced Graph Search and Spotify has introduced the Music Graph as perfectly normal “products” without seeing the need for a catchier name. Has the word “graph” become quasi-mainstream, as happened with the “six degrees” concept which we’ve seen referenced many times in pop culture?

On the academic side, theory has now been developed for temporal graphs, which go beyond the static networks of yesterday and attempt to incorporate the time aspect. Much of this is due to the aforementioned graph ninja – see e g his book (cowritten with Jari Saramäki) or recent single-author paper in PLoS CompBio.

On the “big graph” side, one of the most interesting developments is GraphLab, a parallel computing framework for machine learning on graphs. This meaty presentation contrasts “data-parallel” problems, for which MapReduce works well, with “graph-parallel” problems (such as PageRank) for which GraphLab works well.

Other “big graph” projects that I’ve heard of but not really tried are Google’s Pregel, Apache Giraph, which as I understand it is an open-source implementation of Pregel (like how Hadoop was initially an open source implementation of Google’s MapReduce), and Neo4j.

GraphChi is a spin-off from GraphLab and is said to enable very fast processing of huge graphs on a single laptop by “using a novel algorithm to process the graph from disk”. This turned out not to be idle hype – when I downloaded it and tried it on my Mac on some sample (large) networks, it was blazingly fast. Implemented example applications include PageRank calculations, matrix factorizations, and triangle counting, but the framework is extensible. Definitely something to try if you don’t feel like setting up clusters. The installation process was easy too.

From a bioinformatics angle, so-called de novo genome assembly is essentially a graph traversal problem. It can be cast in different forms (usually based string graphs or de Bruijn graphs) – the paper Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing by Nagarajan and Pop gives a thorough overview of the various tough traversal problems that one ends up with (Hamiltonian paths etc.)

I’ve learned from the informative and frequently updated blog that a lot of tricky things are being done to improve the performance and scalability of genome assembly, like cascading Bloom filters and MPI and Hadoop implementations, but I can’t help wondering if it wouldn’t help to use one of these inherently graph-oriented approaches. Does anyone know if there are projects along these lines? A recent Twitter comment from Mike Schatz (assembly expert and developer of the Hadoop-based Contrail assembler) said that he was thinking about it but still waiting for a good open-source implementation of Pregel. I guess that implies he doesn’t consider Giraph to be sufficiently developed at this point.


Post Navigation