Playing with Swedish election data
The Swedish general elections were held on September 14, 2014, and resulted in a messy parliamentary situation in which the party receiving the most votes, the Social Democrats, had a hard time putting together a functioning government. The previous right-wing (by Swedish standards) government led by the Moderates was voted out after eight years in power. The most discussed outcome was that the nationalist Sweden Democrats party surged to 12.9% of the vote, up from about 5% in the 2010 elections.
I read data journalist Jens Finnäs’ interesting blog post “Covering election night with R”. He covered the elections with live statistical analysis on Twitter. I approached him afterwards and asked for the data sets he had put together on voting results and various characteristics of the voting municipalities, in order to do some simple statistical analysis of voting patterns, and he kindly shared them with me (they are also available on GitHub, possibly in a slightly different form; I haven’t checked).
What I wanted to find out was whether there was a clear urban-rural separation in voting patterns and whether a principal component analysis would reveal a “right-left” axis and a “traditional-cosmopolitan” axis corresponding to a schematic that I have seen a couple of times now. I also wanted to see if a random forest model would be able to predict the vote share of the Sweden Democrats, or some other party, in a municipality, based only on municipality descriptors.
There are some caveats in this analysis. For example, we are dealing with compositional data here in the voting outcomes: all the vote shares must sum to one (or 100%). That means that neither PCA nor random forests may be fully appropriate. Wilhelm Landerholm pointed me to an interesting paper about PCA for compositional data. As for the random forests, I suppose I should use some form of multiple-output RF, which could produce one prediction per party, but since this was a quick and dirty analysis, I just did it party by party.
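To make the compositional caveat concrete, here is a minimal Python sketch of the centered log-ratio (CLR) transform, one common way to map shares that must sum to one into unconstrained coordinates before running something like PCA. This is an illustration only and is not taken from the analysis or from the paper mentioned above; the function names, the pseudocount, and the example vote shares are all made up for the sketch.

```python
import math

def closure(shares):
    """Rescale non-negative parts so that they sum to 1."""
    total = sum(shares)
    return [s / total for s in shares]

def clr(shares, pseudocount=1e-6):
    """Centered log-ratio transform of one composition.

    The pseudocount is an assumption of this sketch: it avoids log(0)
    for a party with no votes in a municipality.
    """
    parts = closure([s + pseudocount for s in shares])
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log centers the coordinates so they sum to 0.
    return [value - mean_log for value in logs]

# Made-up vote shares for one municipality (not real election data).
example_shares = [0.31, 0.23, 0.13, 0.07, 0.06, 0.06, 0.05, 0.05, 0.04]
example_clr = clr(example_shares)
```

The transformed values sum to zero instead of one, so they still carry one redundant dimension, but relative differences between parties are now expressed as log-ratios rather than as raw shares.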
The analysis is available as a document with embedded code and figures at RPubs, or on GitHub if you prefer that. You’ll have to go there to see the plots, but some tentative “results” that I took away were:
- There are two axes where one (PC1) can be interpreted as a Moderate – Social Democrat axis (rather than a strict right vs left axis), and one (PC2) that can indeed be interpreted as a traditionalist – cosmopolitan axis, with the Sweden Democrats at one end, and the Left party (V) (also to some extent the environmental party, MP, the Feminist initiative, FI, and the liberal FP) at the other end.
- There is indeed a clear difference between urban and rural voters (urban voters are more likely to vote for the Moderates, rural voters for the Social Democrats).
- Votes for the Sweden Democrats are also strongly geographically determined, but here it is more of a gradient along Sweden’s length (the farther north, the fewer votes – on average – for SD).
- Surprisingly (?), the reported level of crime in a municipality doesn’t seem to affect voting patterns at all.
- A random forest model can predict a party’s vote share pretty well on unseen data based on municipality descriptors. Not perfectly by any means, but pretty well.
- The most informative features for predicting SD vote share were latitude, longitude, proportion of motorcycles, and proportion of educated individuals.
- The most informative feature for predicting Center party votes was the proportion of tractors 🙂 Likely a confounder/proxy for rural-ness.
There are other things that would be cool to look at, such as finding the most “atypical” municipalities based on the RF model. Also there is some skew in the RF prediction scatter plots that should be examined. I’ll leave it as is for now, and perhaps return to it at some point.
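One simple reading of “atypical” (an assumption on my part, there are others) is a municipality where the model’s prediction on unseen data is far from the actual vote share. A minimal Python sketch of that ranking could look like the following; the function name, the placeholder municipality names, and the numbers are hypothetical and only serve to show the idea.

```python
def most_atypical(names, actual, predicted, top_n=3):
    """Rank municipalities by absolute prediction error, largest first.

    Returns (name, residual) pairs, where residual = actual - predicted.
    """
    residuals = [(name, a - p) for name, a, p in zip(names, actual, predicted)]
    residuals.sort(key=lambda item: abs(item[1]), reverse=True)
    return residuals[:top_n]

# Placeholder names and made-up vote shares in percent (not real data).
names = ["A", "B", "C", "D"]
actual = [12.0, 25.0, 8.0, 15.0]
predicted = [13.0, 16.0, 9.5, 15.2]

ranking = most_atypical(names, actual, predicted, top_n=2)
```

The sign of the residual is kept on purpose: a positive value means the party did better than the municipality descriptors would suggest, a negative value that it did worse.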
Edit 2015-01-03. I read a post at the Swedish Cornucopia blog, which points out that the number of asylum seekers per capita is positively correlated with SD votes (Pearson r ≈ 0.32) and negatively correlated (r ≈ -0.34) with votes for the Moderates (M). The author thought this was highly significant, but I felt that there were probably more important indicators. I therefore downloaded data on asylum seekers per capita, which had been put together by combining the Migration Board’s statistics on asylum seekers from December 2014 with Statistics Sweden’s population statistics, and introduced this as a new indicator in my models. I pushed the updated version to GitHub. My interpretation based on the PCA and random forest analyses is that the number of asylum seekers per capita is not among the most important indicators for explaining the SD vote share.
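To put the two quoted correlations in perspective: the square of a Pearson correlation is the share of variance in one variable that a straight-line fit on the other accounts for. The Python sketch below does that arithmetic for the two values quoted above. The `pearson_r` helper is written for this sketch only and is shown just to make the definition explicit; it is not run on the actual election data here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Correlations quoted in the text (asylum seekers per capita vs. vote share).
r_sd = 0.32   # Sweden Democrats
r_m = -0.34   # Moderates

# Squared correlation: share of variance a one-variable linear fit accounts for.
shared_sd = r_sd ** 2   # about 0.10
shared_m = r_m ** 2     # about 0.12
```

In other words, each of these correlations corresponds to roughly a tenth of the variance in a one-variable linear fit. That is a statement about the pairwise relationship only and says nothing by itself about how the indicator ranks inside a model with many descriptors.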