Repost: The big picture of public discourse on Twitter by clustering metadata
Note: this is a re-post of an analysis previously hosted at mindalyzer.com. Originally published in late December 2016, this blog post was later followed up by this extended analysis on Follow the Data.
The big picture of public discourse on Twitter by clustering metadata | Mindalyzer
Authors: Mattias Östmar, mattiasostmar (a) gmail.com, Mikael Huss, mikael.huss (a) gmail.com
Summary: We identify communities in the Swedish twitterverse by analyzing a large network of millions of reciprocal mentions in a sample of 312,292,997 tweets from 435,792 twitter accounts in 2015 and show that politically meaningful communities among others can be detected without having to read or search for specific words or phrases.
Inspired by Hampus Brynolf’s Twittercensus, we wanted to perform a large-scale analysis of the Swedish Twitterverse, but from a different perspective where we focus on discussions rather than follower statistics.
All images are licensed under Creative Commons CC-BY (mention the source) and the data is released under Creative Commons Zero which means you can freely download and use it for your own purposes, no matter what. The underlaying tweets are restricted by Twitters Developer Agreement and Policy and cannot be shared due to their restrictions, which are mainly there to protect the privacy of all Twitter users.
A pipeline connecting the different code parts for reproducing this experiment is available at github.com/hussius/bigpicture-twitterdiscouse.
The dataset was created by continously polling Twitter’s REST API for recent tweets from a fixed set of Twitter accounts during 2015. The API gives out tweets from before the polling starts as well, but Twitter does not document how those are selected. A more in depth description of how the dataset was created and what it looks like can be found at mindalyzer.com.
From the full dataset of tweets, the tweets originating from 2015 was filtered out and a network of reciprocal mentions was created by parsing out any at-mentions (e.g. ‘@mattiasostmar’) in them. Retweets of others people’s tweets have not been counted, even though they might contain mentions of other users. We look at reciprocal mention graphs, where a link between two users means that both have mentioned each other on Twitter at least once in the dataset (i.e. addressed the other user with that user’s Twitter handle, as happens by default when you reply to a Tweet, for instance). We take this as a proxy for a discussion happening between those two users. The mention graphs were generated using the NetworkX package for Python. We naturally model the graph as undirected (as both users sharing a link are interacting with each other, there is no notion of directionality) and unweighted. One could easily imagine a weighted version of the mention graph where the weight would represent the total number of reciprocal mentions between the users, but we did not feel that this was needed to achieve interesting results.
The final graph consisted of 377.545 nodes (Twitter accounts) and 15.862.275 edges (reciprocal mentions connecting Twitter accounts). The average number of reciprocal mentions for nodes in the graph was 42. The code for the graph creation can be found here and you can also download the pickled graph in NetworkX-format (104,5MB, license:CCO).
The visualizations of the graphs were done in Gephi using the Fruchterman Reingold layout algoritm and thereafter adjusting the nodes with the Noverlap algorithm and finally the labels where adjusted with the algoritm Label adjust. Node size were set based on the ‘importance’ measure that comes out of the Infomap algoritm.
In order to find communities in the mention graph (in other words, to cluster the mention graph), we use Infomap, an information-theory based approach to multi-level community detection that has been used for e.g. mapping of biogeographical regions such as Edler, Etal 2015 and scientific publications such as Rosvall, Bergström 2010, among many examples. This algorithm, which can be used both for directed and undirected, weighted and unweighted networks, allows for multi-level community detection, but here we only show results from a single partition into communities. (We also tried a multi-level decomposition, but did not feel that this added to the analysis presented here.)
“The Infomap algorithm returned a bunch of clusters along with a score for each user indicating how “central” that person was in the network, measured by a form of PageRank, which is the measure Google introduced to rank web pages. Roughly speaking, a person involved in a lot of discussions with other users who are in turn highly ranked would get high scores by this measure. For some clusters, it was enough to have a quick glance at the top ranked users to get a sense of what type of discourse that defines that cluster. To be able to look at them all we performed language analysis of each cluster’s users tweets to see what words were the most distinguishing. That way we also had words to judge the quaility of the clusters from.
What did we find?¶
We took the top 20 communities in terms of size, collected the tweets during 2015 from each member in those clusters, and created a textual corpus out of that (more specifically, a Dictionary using the Gensim package for Python). Then, for each community, we tried to find the most over-represented words used by people in that community by calculating the TF-IDF (term frequency-inverse document frequency) for each word in each community, and looking at the top 10 words for each community.
When looking at these overrepresented words, it was really easy to assign “themes” to our clusters. For instance, communities representing Norwegian and Finnish users (who presumably sometimes tweet in Swedish) were trivial to identify. It was also easy to spot a community dedicated to discussing the state of Swedish schools, another one devoted to the popular Swedish band The Fooo Conspiracy, and an immigration-critical cluster. In fact we have defined dozens of thematically distinct communities and continue to find new ones.
A “military defense” community¶
|Number of nodes||1224|
|Number of edges||14254|
|Data (GEXF graph)||Download (license: CC0)|
One of the communities we found, which tends to discuss military defense issues and “prepping”, is shown in a graph below. This corresponds almost eerily well to a set of Swedish Twitter users highlighted in large Swedish daily Svenska Dagbladet Försvarstwittrarna som blivit maktfaktor i debatten. In fact, of their list of the top 10 defense bloggers, we find each and every one of them in our top 18. Remember that our analysis uses no pre-existing knowledge of what we are looking for: the defense cluster just fell out of the mention graph decomposition.
|Top 10 accounts|
Top distinguishing words (measured by TF-IDF:
#svfm #säkpol #fofrk russian #föpol #svpol ukraine ryska #ukraine ryska russia nato putin
The graph below shows the top 104 accounts in the cluster ranked by Infomap algorithm’s importance measure. You can also download a zoomable pdf.
Graph of “general pundit” community¶
|Number of nodes||1100 (most important of 7332)|
|Number of edges||38684 (most important of 92332|
|Data (GEXF)||Download (license: CC0)|
The largest cluster is a bit harder to summarize easily than many of the other ones, but we think of it as a “pundit cluster” with influential political commentators, for example political journalists and politicians from many different parties. The most influential user in this community according to our analysis is @sakine, Sakine Madon, who was also the most influential Twitter user in Mattias eigenvector centrality based analysis of the whole mention graph (i.e. not divided into communities).
Top distinguishing words (measured by TF-IDF:
#svpol nya tycker borde läs bättre löfven svensksåld regeringen sveriges #eupol jobb
The graph below shows the top 106 accounts in the cluster ranked by Infomap algorithm’s importance measure. You can also download a zoomable pdf.
Graph of “immigration” community¶
|Number of nodes||2308)|
|Number of edges||33546|
|Data (GEXF)||Download (license: CC0)|
One of the larger clusters consists of accounts clearly focused on immigration issues judging by the most distinguishing words. An observation is that while all the larger Swedish political parties official Twitter accounts are located within the “general pundit” community, Sverigedemokraterna (The Sweden Democrats) that was born out of immigration critical sentiments is the only one of them located in this commuity. This suggests that they have (or at least had in the period up until 2015) an outsider position in the public discourse on Twitter that might or might not reflect such a position in the general public political discourse in Sweden. There is much debate and worry about filter bubbles formed by algorithms that selects what people get to see. Research such as Credibility and trust of information in online environments suggests that the social filtering of content is a strong factor for influence. Strong ties such as being part of a conversation graph such as this would most likely be an important factor in shaping of your world views.
Top distiguishing words (TF-IDF):
#motgift #natpol #migpol #dkpol 7-klövern #svpbs #artbymisen #tcot sanandaji arnstad massinvandring #sdu14 #amazoncart #ringp1 rlm riktpunkt.nu: #pkbor tino #pklogik #nordiskungdom
Since we have the pipeline ready, we can easily redo it for 2016 when the data are in hand. Possibly this will reveal dynamical changes in what gets discussed on Twitter, and may give indications on how people are moving between different communities. It could also be interesting to experiment with a weighted version of the graph, or to examine a hierarchical decomposition of the graph into multiple levels.
2017 Mattias Östmar
Graciously supported by The Swedish Memetic Society