Follow the Data

A data driven blog

Archive for the tag “graphs”

Finding communities in the Swedish Twitterverse with a mention graph approach

Mattias Östmar and me have published an analysis of the “big picture” of discourse in the Swedish Twitterverse that we have been working on for a while, on and off. Mattias hatched the idea to take a different perspective from looking at keywords or numbers of followers or tweets, and instead try to focus on engagement and interaction by looking at reciprocal mention graphs – graphs where two users get a link between them if both have mentioned each other at least once (as happens by default when you reply to a tweet, for example.) He then applied an eigenvector centrality measure to that network and was able to measure the influence of each user in that way (described in Swedish here).

In the present analysis we went further and tried to identify communities in the mention network by clustering the graph. After trying some different methods we eventually went with Infomap, a very general information-theory based method (it handles both directed and undirected, weighted and unweighted networks, and can do multi-level decompositions) that seems to work well for this purpose. Infomap not only detects clusters but also ranks each user by a PageRank measure so that the centrality score comes for free.

We immediately recognized from scanning the top accounts in each cluster that there seemed to be definite themes to the clusters. The easiest to pick out were Norwegian and Finnish clusters where most of the tweets were in those languages (but some were in Swedish, which had caused those accounts to be flagged as “Swedish”.) But it was also possible to see (at this point still by recognizing names of famous accounts) that there were communities that seemed to be about national defence or the state of Swedish schools, for instance. This was quite satisfying as we hadn’t used the actual contents of the tweets – no keywords or key phrases – just the connectivity of the network!

Still, knowing about famous accounts can only take us so far, so we did a relatively simple language analysis of the top 20 communities by size. We took all the tweets from all users in those communities, built a corpus of words of those, and calculated the TF-IDFs for each word in each community. In this way, we were able to identify words that were over-represented in a community with respect to the other communities.

The words that feel out of this analysis were in many cases very descriptive of the communities, and apart from the school and defence clusters we quickly identified an immigration-critical cluster, a cluster about stock trading, a sports cluster, a cluster about the boy band The Fooo Conspiracy, and many others. (In fact, we have since discovered that there are a lot of interesting and thematically very specific clusters beyond the top 20 which we are eager to explore!)

As detailed in the analysis blog post, the list of top ranked accounts in our defence community was very close to a curated list of important defence Twitter accounts recently published by a major Swedish daily. This probably means that we can identify the most important Swedish tweeps for many different topics without manual curation.

This work was done on tweets from 2015, but in mid-January we will repeat the analysis on 2016 data.

There is some code describing what we did on GitHub.

 

Ramblings on graphs

During a few years at the beginning of the millennium, there was a flurry of excitement around graph theory and network science, with things like small-world networks and scale-free networks being (re-)discovered, and popular books like Six Degrees and Linked being published. I, too, made some efforts to apply ideas from graph theory to biology, usually guided by my friend, the graph ninja Petter Holme. (E.g. Subnetwork hierarchies of biochemical pathways where we looked at edge betweenness and community structure in biochemical networks and and Role-similarity based functional prediction in networked systems: Application to the yeast proteome where we used a PageRank-like method and the sociological idea of “role similarity” to predict functions of proteins.) I also dabbled in neuroscience applications of graph theory such as the connection between epilepsy and brain connectivity structures (alas, never published).

Then I drifted away from graphs and networks (the same thing, really!) to focus on statistics and arcane aspects of DNA sequencing. Lately, however, I’ve seen a lot of interesting graph-related developments.

The first is a superficial observation: Facebook has introduced Graph Search and Spotify has introduced the Music Graph as perfectly normal “products” without seeing the need for a catchier name. Has the word “graph” become quasi-mainstream, as happened with the “six degrees” concept which we’ve seen referenced many times in pop culture?

On the academic side, theory has now been developed for temporal graphs, which go beyond the static networks of yesterday and attempt to incorporate the time aspect. Much of this is due to the aforementioned graph ninja – see e g his book (cowritten with Jari Saramäki) or recent single-author paper in PLoS CompBio.

On the “big graph” side, one of the most interesting developments is GraphLab, a parallel computing framework for machine learning on graphs. This meaty presentation contrasts “data-parallel” problems, for which MapReduce works well, with “graph-parallel” problems (such as PageRank) for which GraphLab works well.

Other “big graph” projects that I’ve heard of but not really tried are Google’s Pregel, Apache Giraph, which as I understand it is an open-source implementation of Pregel (like how Hadoop was initially an open source implementation of Google’s MapReduce), and Neo4j.

GraphChi is a spin-off from GraphLab and is said to enable very fast processing of huge graphs on a single laptop by “using a novel algorithm to process the graph from disk”. This turned out not to be idle hype – when I downloaded it and tried it on my Mac on some sample (large) networks, it was blazingly fast. Implemented example applications include PageRank calculations, matrix factorizations, and triangle counting, but the framework is extensible. Definitely something to try if you don’t feel like setting up clusters. The installation process was easy too.

From a bioinformatics angle, so-called de novo genome assembly is essentially a graph traversal problem. It can be cast in different forms (usually based string graphs or de Bruijn graphs) – the paper Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing by Nagarajan and Pop gives a thorough overview of the various tough traversal problems that one ends up with (Hamiltonian paths etc.)

I’ve learned from the informative and frequently updated homolog.us blog that a lot of tricky things are being done to improve the performance and scalability of genome assembly, like cascading Bloom filters and MPI and Hadoop implementations, but I can’t help wondering if it wouldn’t help to use one of these inherently graph-oriented approaches. Does anyone know if there are projects along these lines? A recent Twitter comment from Mike Schatz (assembly expert and developer of the Hadoop-based Contrail assembler) said that he was thinking about it but still waiting for a good open-source implementation of Pregel. I guess that implies he doesn’t consider Giraph to be sufficiently developed at this point.

Post Navigation