Follow the Data

A data driven blog

Archive for the tag “sweden”

Repost: The big picture of public discourse on Twitter by clustering metadata

Note: this is a re-post of an analysis previously hosted at mindalyzer.com. Originally published in late December 2016, this blog post was later followed up by this extended analysis on Follow the Data.

The big picture of public discourse on Twitter by clustering metadata | Mindalyzer

Authors: Mattias Östmar, mattiasostmar (a) gmail.com, Mikael Huss, mikael.huss (a) gmail.com

Summary: We identify communities in the Swedish twitterverse by analyzing a large network of millions of reciprocal mentions in a sample of 312,292,997 tweets from 435,792 twitter accounts in 2015 and show that politically meaningful communities among others can be detected without having to read or search for specific words or phrases.

Background

Inspired by Hampus Brynolf’s Twittercensus, we wanted to perform a large-scale analysis of the Swedish Twitterverse, but from a different perspective where we focus on discussions rather than follower statistics.

All images are licensed under Creative Commons CC-BY (mention the source) and the data is released under Creative Commons Zero which means you can freely download and use it for your own purposes, no matter what. The underlaying tweets are restricted by Twitters Developer Agreement and Policy and cannot be shared due to their restrictions, which are mainly there to protect the privacy of all Twitter users.

Method

Code pipeline

A pipeline connecting the different code parts for reproducing this experiment is available at github.com/hussius/bigpicture-twitterdiscouse.

The Dataset

The dataset was created by continously polling Twitter’s REST API for recent tweets from a fixed set of Twitter accounts during 2015. The API gives out tweets from before the polling starts as well, but Twitter does not document how those are selected. A more in depth description of how the dataset was created and what it looks like can be found at mindalyzer.com.

Graph construction

From the full dataset of tweets, the tweets originating from 2015 was filtered out and a network of reciprocal mentions was created by parsing out any at-mentions (e.g. ‘@mattiasostmar’) in them. Retweets of others people’s tweets have not been counted, even though they might contain mentions of other users. We look at reciprocal mention graphs, where a link between two users means that both have mentioned each other on Twitter at least once in the dataset (i.e. addressed the other user with that user’s Twitter handle, as happens by default when you reply to a Tweet, for instance). We take this as a proxy for a discussion happening between those two users. The mention graphs were generated using the NetworkX package for Python. We naturally model the graph as undirected (as both users sharing a link are interacting with each other, there is no notion of directionality) and unweighted. One could easily imagine a weighted version of the mention graph where the weight would represent the total number of reciprocal mentions between the users, but we did not feel that this was needed to achieve interesting results.

The final graph consisted of 377.545 nodes (Twitter accounts) and 15.862.275 edges (reciprocal mentions connecting Twitter accounts). The average number of reciprocal mentions for nodes in the graph was 42. The code for the graph creation can be found here and you can also download the pickled graph in NetworkX-format (104,5MB, license:CCO).

The visualizations of the graphs were done in Gephi using the Fruchterman Reingold layout algoritm and thereafter adjusting the nodes with the Noverlap algorithm and finally the labels where adjusted with the algoritm Label adjust. Node size were set based on the ‘importance’ measure that comes out of the Infomap algoritm.

Community detection

In order to find communities in the mention graph (in other words, to cluster the mention graph), we use Infomap, an information-theory based approach to multi-level community detection that has been used for e.g. mapping of biogeographical regions such as Edler, Etal 2015 and scientific publications such as Rosvall, Bergström 2010, among many examples. This algorithm, which can be used both for directed and undirected, weighted and unweighted networks, allows for multi-level community detection, but here we only show results from a single partition into communities. (We also tried a multi-level decomposition, but did not feel that this added to the analysis presented here.)

“The Infomap algorithm returned a bunch of clusters along with a score for each user indicating how “central” that person was in the network, measured by a form of PageRank, which is the measure Google introduced to rank web pages. Roughly speaking, a person involved in a lot of discussions with other users who are in turn highly ranked would get high scores by this measure. For some clusters, it was enough to have a quick glance at the top ranked users to get a sense of what type of discourse that defines that cluster. To be able to look at them all we performed language analysis of each cluster’s users tweets to see what words were the most distinguishing. That way we also had words to judge the quaility of the clusters from.

What did we find?

We took the top 20 communities in terms of size, collected the tweets during 2015 from each member in those clusters, and created a textual corpus out of that (more specifically, a Dictionary using the Gensim package for Python). Then, for each community, we tried to find the most over-represented words used by people in that community by calculating the TF-IDF (term frequency-inverse document frequency) for each word in each community, and looking at the top 10 words for each community.

When looking at these overrepresented words, it was really easy to assign “themes” to our clusters. For instance, communities representing Norwegian and Finnish users (who presumably sometimes tweet in Swedish) were trivial to identify. It was also easy to spot a community dedicated to discussing the state of Swedish schools, another one devoted to the popular Swedish band The Fooo Conspiracy, and an immigration-critical cluster. In fact we have defined dozens of thematically distinct communities and continue to find new ones.

A “military defense” community

Number of nodes 1224
Number of edges 14254
Data (GEXF graph) Download (license: CC0)

One of the communities we found, which tends to discuss military defense issues and “prepping”, is shown in a graph below. This corresponds almost eerily well to a set of Swedish Twitter users highlighted in large Swedish daily Svenska Dagbladet Försvarstwittrarna som blivit maktfaktor i debatten. In fact, of their list of the top 10 defense bloggers, we find each and every one of them in our top 18. Remember that our analysis uses no pre-existing knowledge of what we are looking for: the defense cluster just fell out of the mention graph decomposition.

Top 10 accounts
1. Cornubot
2. wisemanswisdoms
3. patrikoksanen
4. annikanc
5. hallonsa
6. waterconflict
7. mikaelgrev
8. Spesam
9. JohanneH
10. Jagarchefen

Top distinguishing words (measured by TF-IDF:

#svfm
#säkpol
#fofrk
russian
#föpol
#svpol
ukraine
ryska
#ukraine
ryska
russia
nato
putin

The graph below shows the top 104 accounts in the cluster ranked by Infomap algorithm’s importance measure. You can also download a zoomable pdf.

Defence cluster top 104

Graph of “general pundit” community

Number of nodes 1100 (most important of 7332)
Number of edges 38684 (most important of 92332
Data (GEXF) Download (license: CC0)

The largest cluster is a bit harder to summarize easily than many of the other ones, but we think of it as a “pundit cluster” with influential political commentators, for example political journalists and politicians from many different parties. The most influential user in this community according to our analysis is @sakine, Sakine Madon, who was also the most influential Twitter user in Mattias eigenvector centrality based analysis of the whole mention graph (i.e. not divided into communities).

Accounts
1. Sakine
2. oisincantwell
3. RebeccaWUvell
4. fvirtanen
5. niklasorrenius
OhlyLars
7. Edward_Blom
8. danielswedin
9. Ivarpi
10. detljuvalivet

Top distinguishing words (measured by TF-IDF:

#svpol
nya
tycker
borde
läs
bättre
löfven
svensksåld
regeringen
sveriges
#eupol
jobb

The graph below shows the top 106 accounts in the cluster ranked by Infomap algorithm’s importance measure. You can also download a zoomable pdf.

pundits cluster top 106

Graph of “immigration” community

Number of nodes 2308)
Number of edges 33546
Data (GEXF) Download (license: CC0)

One of the larger clusters consists of accounts clearly focused on immigration issues judging by the most distinguishing words. An observation is that while all the larger Swedish political parties official Twitter accounts are located within the “general pundit” community, Sverigedemokraterna (The Sweden Democrats) that was born out of immigration critical sentiments is the only one of them located in this commuity. This suggests that they have (or at least had in the period up until 2015) an outsider position in the public discourse on Twitter that might or might not reflect such a position in the general public political discourse in Sweden. There is much debate and worry about filter bubbles formed by algorithms that selects what people get to see. Research such as Credibility and trust of information in online environments suggests that the social filtering of content is a strong factor for influence. Strong ties such as being part of a conversation graph such as this would most likely be an important factor in shaping of your world views.

Accounts
1. Jon_Brenelli
2. perraponken
3. sjunnedotcom
4. RolandXSweden
5. inkonsekvenshen
6. AnnaTSL
7. TommyFunebo
8. doppler3ffect
9. Stassministern
10. rogsahl

Top distiguishing words (TF-IDF):

#motgift
#natpol
#migpol
#dkpol
7-klövern
#svpbs
#artbymisen
#tcot
sanandaji
arnstad
massinvandring
#sdu14
#amazoncart
#ringp1
rlm
riktpunkt.nu:
#pkbor
tino
#pklogik
#nordiskungdom

immigration cluster top 102

Future work

Since we have the pipeline ready, we can easily redo it for 2016 when the data are in hand. Possibly this will reveal dynamical changes in what gets discussed on Twitter, and may give indications on how people are moving between different communities. It could also be interesting to experiment with a weighted version of the graph, or to examine a hierarchical decomposition of the graph into multiple levels.

2017 Mattias Östmar

License:
CC BY

Graciously supported by The Swedish Memetic Society

Advertisements

Swedish school fires and Kaggle open data

For quite a while now, I have been rather mystified and intrigued by the fact that Sweden has one of the highest rates of school fires due to arson. According to the Division of Fire Safety Engineering at Lund University, “Almost every day between one and two school fires occur in Sweden. In most cases arson is the cause of the fire.” This is a lot for a small country with less than 10 million inhabitants, and the associated costs can be up to a billion SEK (around 120 million USD) per year.

It would be hard to find a suitable dataset to address the question why arson school fires are so frequent in Sweden compared to other countries in a data-driven way – but perhaps it would be possible to stay within a Swedish context and find out which properties and indicators of Swedish towns (municipalities, to be exact) might be related to a high frequency of school fires?

To answer this question, I  collected data on school fire cases in Sweden between 1998 and 2014 through a web site with official statistics from the Swedish Civil Contingencies Agency. As there was no API to allow easy programmatic access to schools fire data, I collected them by a quasi-manual process, downloading XLSX report generated from the database year by year, after which I joined these with an R script into a single table of school fire cases where the suspected reason was arson. (see Github link below for full details!)

To complement  these data, I used a list of municipal KPI:s (key performance indicators) from 2014, that Johan Dahlberg put together for our contribution in Hack for Sweden earlier this year. These KPIs were extracted from Kolada (a database of Swedish municipality and county council statistics) by repeatedly querying its API.

There is a Github repo containing all the data and detailed information on how it was extracted.

The open Kaggle dataset lives at https://www.kaggle.com/mikaelhuss/swedish-school-fires. So far, the process of uploading and describing the data has been smooth. I’ve learned that each Kaggle dataset has an associated discussion forum, and (potentially) a bunch of “kernels”, which are analysis scripts or notebooks in Python, R or Julia. I hope that other people will contribute script and analyses based on these data. Please do if you find this dataset intriguing!

A good week for (big) data (science)

Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.

Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, which is a nice Hadoop toolkit for sequencing data which enables running several different types of tasks in distributed fashion. Other things we looked at was the CloudBioLinux project, map/reduce sequence assembly using Contrail and CSC’s biological high-throughput data analysis platform Chipster.

On Friday, me and blog co-author Joel went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.

Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun and can serve as a checklist for my self in the future.

Stay tuned for the first episode of Follow the Data podcast!

Hello 2012!

The first blog post of the new year. I made some updates to the Swedish big data company list from last year. I’ll recap the additions here so you don’t have to click on that link –

  • Markify is a service that searches a large set of databases for registered trademarks that are similar, in sound or in writing, to a given query – like a name you have thought up for your next killer startup. As described on the company’s website, determining similarity is not that clear-cut, so (according to this write-up) they have adopted a data-driven strategy where they train their algorithm on “actual case literature of disputed trademark claims to help it discover trademarks that were similar enough to be contested.” They claim it’s the worl’d most accurate comprehensive trademark search.
  • alaTest compiles, analyzes and rates product reviews to help customers select the most suitable product for them.
  • Intellus is a business process / business intelligence company. Frankly, these terms and web sites like theirs normally make me fall asleep, but they have an ad for a master’s project out where they propose research to “find and implement an effective way of automating analysis in non-normalized data by applying different approaches of machine learning”, where the “platform for distributed big data analysis is already in place.” They promise a project at “the bleeding edge technology of machine learning and distributed big data analysis.”
  • Although I haven’t listed AstraZeneca as a “big data” company (yet), they seem to be jumping the “data science” train as they are now advertising for “data angels” (!) and “predictive science data experts.”

On the US stage, I’m curious about a new company called BigML, which is apparently trying to tackle a problem that many have thought about or tried to solve, but which has proven very difficult, that is, to provide a user-friendly and general solution for building predictive models based on a data source. A machine learning solution for regular people, as it were. This blog post talks about some of the motivations behind it. I’ve applied for an invite and will write up a blog post if I get the chance to try it.

Finally, I’d like to recommend this Top 10 data mining links of 2011 list. I’m not usually very into top-10 lists, but this one contained some interesting stuff that I had missed. Of course, there is the MIC/MINE method which was published in Science, a clever generalization of correlation that works for non-linear relationships (to over-simplify a bit).  As this blog post puts it, “the consequential metric goes far beyond traditional measures of correlation, and rather towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper).”

Then there are of course the free data analysis textbooks, the free online ML and AI courses and IBM’s systems that defeated human Jeopardy champions, all of which I have covered here (I think.) Finally, there are links to two really cool papers. The first of them, Graphical Inference for Infoviz (where one of the authors is R luminary Hadley Wickham), introduces a very interesting method of “visual hypothesis testing” based on generating “decoy plots” that are based on the null hypothesis distribution, and letting a test person pick out the actual observed data among the decoys. The procedure has been implemented in an R package called nullabor. I really liked their analogy between hypothesis testing and a trial (the term “the statistical justice system”!):

Hypothesis testing is perhaps best understood with an analogy to the criminal justice system. The accused (data set) will be judged guilty or innocent based on the results of a trial (statistical test). Each trial has a defense (advocating for the null hypothesis) and a prosecution (advocating for the alternative hypothesis). On the basis of how evidence (the test statistic) compares to a standard (the p-value), the judge makes a decision to convict (reject the null) or acquit (fail to reject the null hypothesis). Unlike the criminal justice system, in the statistical justice system (SJS) evidence is based on the similarity between the accused and known innocents, using a specific metric defined by the test statistic. The population of innocents, called the null distribution, is generated by the combination of null hypothesis and test statistic. To determine the guilt of the accused we compute the proportion of innocents who look more guilty than the accused. This is the p-value, the probability that the accused would look this guilty if they actually were innocent.

The other very cool article is from Gary King’s lab and deals with the question of comparing different clusterings of data, and specifically determining a useful or insightful clustering for the user. They did this by implementing all (!) known clustering methods plus some new ones in a common interface in an R package. They then cluster text documents using all clustering methods and project the clusterings into a space that can be visualized and interactively explored to get a feeling for what the different methods are doing.

Mass e-epidemiology

The LifeGene project, which was recently started in Sweden, may in due time generate one of the most complex and interesting data sets ever. The project will study health, lifestyle and genetics (and much more) in the long term in a cohort of 500.000 (this is not a typo!) individuals. Participants will donate blood samples and be subjected to physical measurements (waist and hip circumference, blood pressure etc), but for a smaller subset of participants the study will really go deep, with global analysis of DNA, RNA, protein, metabolite and toxin levels, as well as epigenomics (simplifying a bit, this means genomic information that is not directly encoded in the DNA sequence). Two testing centres have opened during the fall – one in Stockholm and, more recently, one in Umeå.

Environmental factors will be examined too: “Exposures such as diet, physical activity, smoking, prenatal environment, infections, sleep-disorders, socioeconomic and psychosocial status, to name a few, will be assessed.” The data collection will be done through for instance mobile phones and the web, with sampling rates adjusted based on age and life events. The project consortium calls the approach e-epidemiology.

This might make each participant feel a bit like David Ewing Duncan, the man who decided to try as many genetic, medical and toxicological test on himself as he could, and wrote a book about it. Will they suffer from information overload from self-related data? For the statisticians involved, information overload is a certainty. It will be a tough – but interesting – task to collect, store and mine these data. But exactly this kind of project, which relates hereditary factors to environment and lifestyle and correlates these to outcomes (like disease states), is much needed.

Post Navigation