Follow the Data

A data driven blog

Archive for the tag “clustering”

Finding communities in the Swedish Twitterverse with a mention graph approach

Mattias Östmar and me have published an analysis of the “big picture” of discourse in the Swedish Twitterverse that we have been working on for a while, on and off. Mattias hatched the idea to take a different perspective from looking at keywords or numbers of followers or tweets, and instead try to focus on engagement and interaction by looking at reciprocal mention graphs – graphs where two users get a link between them if both have mentioned each other at least once (as happens by default when you reply to a tweet, for example.) He then applied an eigenvector centrality measure to that network and was able to measure the influence of each user in that way (described in Swedish here).

In the present analysis we went further and tried to identify communities in the mention network by clustering the graph. After trying some different methods we eventually went with Infomap, a very general information-theory based method (it handles both directed and undirected, weighted and unweighted networks, and can do multi-level decompositions) that seems to work well for this purpose. Infomap not only detects clusters but also ranks each user by a PageRank measure so that the centrality score comes for free.

We immediately recognized from scanning the top accounts in each cluster that there seemed to be definite themes to the clusters. The easiest to pick out were Norwegian and Finnish clusters where most of the tweets were in those languages (but some were in Swedish, which had caused those accounts to be flagged as “Swedish”.) But it was also possible to see (at this point still by recognizing names of famous accounts) that there were communities that seemed to be about national defence or the state of Swedish schools, for instance. This was quite satisfying as we hadn’t used the actual contents of the tweets – no keywords or key phrases – just the connectivity of the network!

Still, knowing about famous accounts can only take us so far, so we did a relatively simple language analysis of the top 20 communities by size. We took all the tweets from all users in those communities, built a corpus of words of those, and calculated the TF-IDFs for each word in each community. In this way, we were able to identify words that were over-represented in a community with respect to the other communities.

The words that feel out of this analysis were in many cases very descriptive of the communities, and apart from the school and defence clusters we quickly identified an immigration-critical cluster, a cluster about stock trading, a sports cluster, a cluster about the boy band The Fooo Conspiracy, and many others. (In fact, we have since discovered that there are a lot of interesting and thematically very specific clusters beyond the top 20 which we are eager to explore!)

As detailed in the analysis blog post, the list of top ranked accounts in our defence community was very close to a curated list of important defence Twitter accounts recently published by a major Swedish daily. This probably means that we can identify the most important Swedish tweeps for many different topics without manual curation.

This work was done on tweets from 2015, but in mid-January we will repeat the analysis on 2016 data.

There is some code describing what we did on GitHub.


Hello 2012!

The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I’ll recap the additions here so you don’t have to click on that link –

  • Markify is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query – like a name you have thought up for your next killer startup. As described on the company’s website, determining similarity is not that clear-cut, so (according to this write-up) they have adopted a data-driven strategy where they train their algorithm on “actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.” They claim it’s the worl’d most accurate comprehensive trademark search.
  • alaTest compiles, analyzes and rates product reviews to help customers select the most suitable product for them.
  • Intellus is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an ad for a master’s project out where they propose research to “find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning”, where the “platform for distributed big data analysis is already in place.” They promise a project at “the bleeding edge technology of machine learning and distributed big data analysis.”
  • Although I haven’t listed AstraZeneca as a “big data” company (yet), they seem to be jumping the “data science” train as they are now advertising for “data angels” (!) and “predictive science data experts.”

On the US stage, I’m curious about a new company called BigML, which is apparently trying to tackle a problem that many have thought about or tried to solve, but which has proven very difficult, that is, to provide a user-friendly and general solution for building predictive models based on a data source. A machine learning solution for regular people, as it were. This blog post talks about some of the motivations behind it. I’ve applied for an invite and will write up a blog post if I get the chance to try it.

Finally, I’d like to recommend this Top 10 data mining links of 2011 list. I’m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the MIC/MINE method which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit).  As this blog post puts it, “the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper).”

Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM’s systems that defeated human Jeopardy champions, all of which I have covered here (I think.) Finally, there are links to two really cool papers. The first of them, Graphical Inference for Infoviz (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of “visual hypothesis testing” based on generating “decoy plots” that are based on the null hypothesis distribution, and letting a test person pick out the actual observed data among the decoys. The procedure has been implemented in an R package called nullabor. I really liked their analogy between hypothesis testing and a trial (the term “the statistical justice system”!):

Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.

The other very cool article is from Gary King’s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.

Post Navigation