Follow the Data

A data driven blog

Playing with Swedish election data

The Swedish general elections were held on September 14, 2014, and resulted in a messy parliamentary situation in which the party receiving the most votes, the Social Democrats, had a hard time putting together a functioning government. The previous right-wing (by Swedish standards) government led by the Moderates was voted out after eight years in power. The most discussed outcome was that the nationalist Sweden Democrats party surged to 12.9% of the vote, up from about 5% in the 2010 elections.

I read data journalist Jens Finnäs’ interesting blog post “Covering election night with R”. He was covering the elections with live statistical analysis on Twitter. I approached him afterwards and asked for the data sets he had put together on voting results and various characteristics of the voting municipalities, in order to do some simple statistical analysis of voting patterns, and he kindly shared them with me (they are also available on GitHub, possibly in a slightly different form; I haven’t checked).

What I wanted to find out was whether there was a clear urban-rural separation in voting patterns and whether a principal component analysis would reveal a “right-left” axis and a “traditional-cosmopolitan” axis corresponding to a schematic that I have seen a couple of times now. I also wanted to see if a random forest classifier would be able to predict the vote share of the Sweden Democrats, or some other party, in a municipality, based only on municipality descriptors.

There are some caveats in this analysis. For example, we are dealing with compositional data here in the voting outcomes: all the vote shares must sum to one (or 100%). That means that neither PCA nor random forests may be fully appropriate. Wilhelm Landerholm pointed me to an interesting paper about PCA for compositional data. As for the random forests, I suppose I should use some form of multiple-output RF, which could produce one prediction per party, but since this was a quick and dirty analysis, I just did it party by party.
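As a sketch of what the compositional-data adjustment might look like, here is a centered log-ratio (clr) transform followed by PCA on a made-up vote-share matrix (the numbers are invented; the clr is one standard approach from the compositional-data literature, not necessarily the exact method in the paper Wilhelm pointed to):

```python
import numpy as np

# Toy vote-share matrix: 4 municipalities x 3 parties, rows sum to 1.
shares = np.array([
    [0.50, 0.30, 0.20],
    [0.40, 0.40, 0.20],
    [0.30, 0.30, 0.40],
    [0.55, 0.25, 0.20],
])

# Centered log-ratio (clr) transform: log of each share relative to the
# row's geometric mean. This maps compositions off the simplex into
# ordinary Euclidean space, where PCA is better justified.
log_s = np.log(shares)
clr = log_s - log_s.mean(axis=1, keepdims=True)

# PCA via SVD on the column-centered clr data.
centered = clr - clr.mean(axis=0)
U, sing, Vt = np.linalg.svd(centered, full_matrices=False)
scores = U * sing                      # municipality coordinates on the PCs
explained = sing**2 / np.sum(sing**2)  # fraction of variance per component

print(explained)
```

Note that clr rows sum to zero, so one component always carries (essentially) no variance; with real data the interesting structure is in the leading components.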

The analysis is available as a document with embedded code and figures at Rpubs, or on GitHub if you prefer that. You’ll have to go there to see the plots, but some tentative “results” that I took away were:

  • There are two axes where one (PC1) can be interpreted as a Moderate – Social Democrat axis (rather than a strict right vs left axis), and one (PC2) that can indeed be interpreted as a traditionalist – cosmopolitan axis, with the Sweden Democrats at one end, and the Left party (V) (also to some extent the environmental party, MP, the Feminist initiative, FI, and the liberal FP) at the other end.
  • There is indeed a clear difference between urban and rural voters (urban voters are more likely to vote for the Moderates, rural voters for the Social Democrats).
  • Votes for the Sweden Democrats are also strongly geographically determined, but here it is more of a gradient along Sweden’s length (the farther north, the fewer votes, on average, for SD).
  • Surprisingly (?), the reported level of crime in a municipality doesn’t seem to affect voting patterns at all.
  • A random forest classifier can predict votes for a party pretty well on unseen data based on municipality descriptors. Not perfectly by any means, but pretty well.
  • The most informative features for predicting SD vote share were latitude, longitude, proportion of motorcycles, and proportion of educated individuals.
  • The most informative feature for predicting Center party votes was the proportion of tractors :) Likely a confounder/proxy for rural-ness.
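The party-by-party random forest approach described above can be sketched roughly like this (all data here are synthetic, and the descriptor names are illustrative stand-ins, not the actual columns from the shared data set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 290  # roughly the number of Swedish municipalities

# Synthetic municipality descriptors; the names are illustrative only.
X = rng.normal(size=(n, 4))
features = ["latitude", "longitude", "motorcycles_per_capita", "share_higher_ed"]

# Fake vote share driven mostly by the two "geographic" columns plus noise.
y = 0.10 + 0.03 * X[:, 0] + 0.02 * X[:, 1] + 0.005 * rng.normal(size=n)

# Hold out some municipalities as "unseen data".
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("R^2 on held-out municipalities:", rf.score(X_te, y_te))
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

On data simulated this way the geographic columns dominate the importance ranking, mirroring the latitude/longitude result above; with the real data the interplay between descriptors is of course much messier.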

There are other things that would be cool to look at, such as finding the most “atypical” municipalities based on the RF model. Also there is some skew in the RF prediction scatter plots that should be examined. I’ll leave it as is for now, and perhaps return to it at some point.

Book and MOOC

As of today, Amazon.com is stocking a book to which I have contributed, RNA-seq Data Analysis: A Practical Approach. I realize the title might sound obscure to readers who are unfamiliar with genomics and bioinformatics. Simply put, RNA-seq is short for RNA sequencing, a method for measuring what we call gene expression. While the DNA contained in each cell is (to a first approximation) identical, different tissues and cell types turn their genes on and off in different ways in response to different conditions. The process by which DNA is transcribed into RNA is called gene expression. RNA-seq has become a rather important experimental method, and the lead author of our book, Eija Korpelainen, wanted to put together a user-friendly, practical and hopefully unbiased compendium of the existing RNA-seq data analysis methods and toolkits, without neglecting underlying theory. I contributed one chapter, the one about differential expression analysis, which basically means statistical testing for significant gene expression differences between groups of samples.

I am also currently involved as an assistant teacher in the Explore Statistics with R course given by Karolinska Institutet through the edX MOOC platform. Specifically, I have contributed material to the final week (week 5) which will start next Tuesday (October 7th). That material is also about RNA-seq analysis – I try to show a range of tools available in R which allow you to perform a complete analysis workflow for a typical scenario. Until the fifth week starts, I am helping out with answering student questions in the forums. It’s been a positive experience so far, but it is clear that one can never prepare enough for a MOOC – errors in phrasing, grading, etc are bound to pop up. Luckily, several gifted students are doing an amazing job of answering the questions from other students, while teaching us teachers a thing or two about the finer points of R.

Speaking of MOOCs, Coursera’s Mining Massive Datasets course featuring Jure Leskovec, Anand Rajaraman and Jeff Ullman started today. My plan is to try to follow it – we shall see if I have time.

Data science in China

China’s been on my mind lately, as I’ve been putting together my visa application to go there in October. I feel that the “(big) data (science)” mindset has really taken root there, which is maybe not so surprising in such a huge and populous country where it’s natural to think of billions of potential customers and where engineering and quantitative sciences are appreciated (e.g., many of the Communist party leaders are science PhDs and engineers).

For instance, I heard about Mayer-Schönberger’s and Cukier’s book Big Data: A Revolution That Will Transform How We Live, Work, and Think before it appeared in English from professor 周涛 (Zhou Tao), who told me that it was already being translated into Chinese. In fact, if you look at the publication dates on Amazon, it looks like the Chinese version was published first, but that can hardly have been the case, can it?

I learned the abbreviation BAT – Baidu, Alibaba and Tencent – for the three big Chinese internet companies from Quora, in a thread where they were pinpointed as also being the main big data players in China right now. Baidu, of course, has made waves with the relatively recent announcement that Andrew Ng (deep learning and general machine learning guru of Stanford, Google and Coursera) has joined their artificial intelligence lab, which he will head and where he intends to implement some truly visionary projects. (You can hear more about Ng’s plans for the future of AI here – the audio is pretty bad though.) Tencent is the company that developed WeChat and QQ – huge platforms in China and some other parts of the world – but I don’t know much about their data science efforts, so I’ll pass them over in silence. Finally, Alibaba was of course recently introduced as a publicly traded company to great fanfare, and offers a very interesting service (connecting customers directly to manufacturers).

Here is a recent blog post about how Alibaba use Spark and GraphX to analyze their e-commerce platform, which is one of the largest in the world, collecting hundreds of petabytes of data.

Alibaba are currently looking for senior data scientists for their Hangzhou office. Looks like a fun gig.

By the way, the International Conference on Machine Learning (ICML) 2014 was held in Beijing. Here is a link to the videos (some of which seem quite interesting).

I’d love to learn more about data science in China and welcome any comments.

Videos for rainy days

If, like me, you are on vacation, you might have time to watch some of these cool videos on a rainy day (unless you read books instead, as you probably should):

  • Video from the Europe-wide machine learning meetup on June 15. Andrew Ng’s talk is probably the highlight, but I also enjoyed Muthu Muthukrishnan’s “On Sketching” and Sam Bessalah’s “Abstract algebra for stream learning.”
  • Video (+ slides) from the Deep Learning meetup in Stockholm on June 9. I saw these live and they were quite good. After some preliminaries, the first presentation (by Pawel Herman) starts at around 00:06:00 in the video and the second presentation (by Josephine Sullivan) starts at around 1:18:30.
  • Videos from the Big Data in Biomedicine event at Stanford. Obviously I haven’t seen all of these, but the ones I have seen have been of really high quality. I particularly enjoyed the talk by Google’s David Glazer on the search behemoth’s efforts in genomics and Sandrine Dudoit’s on the role of statisticians in data science (where she echoes to some extent Terry Speed’s pessimistic views), but I think all of the talks are worth watching.

Data size estimates

As part of preparing for a talk, I collected some available information on data sizes in a few corporations and other organizations. Specifically, I looked for estimates of the amount of data processed per day and the amount of data stored by each organization. For what it’s worth, here are the numbers I currently have. Feel free to add new data points, correct misconceptions etc.

Data processed per day

Organization | Est. amount of data processed per day | Source
eBay | 100 PB | http://www-conf.slac.stanford.edu/xldb11/talks/xldb2011_tue_1055_TomFastner.pdf
Google | 100 PB | http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014
Baidu | 10-100 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf
NSA | 29 PB | http://arstechnica.com/information-technology/2013/08/the-1-6-percent-of-the-internet-that-nsa-touches-is-bigger-than-it-seems/
Facebook | 600 TB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
Twitter | 100 TB | http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
Spotify | 2.2 TB (compressed; becomes 64 TB in Hadoop) | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam
Sanger Institute | 1.7 TB (DNA sequencing data only) | http://www.slideshare.net/insideHPC/cutts

100 PB seems to be the amount du jour for the giants. I was a bit surprised that eBay reported, as early as 2011, that they were processing 100 PB/day. As I mentioned in an earlier post, I suspect a lot of this is self-generated data from “query rewriting”, but I am not sure.
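To put “100 PB per day” in perspective, a quick back-of-the-envelope conversion (pure arithmetic, using decimal units):

```python
PB = 10**15          # bytes in a petabyte (decimal convention)
SECONDS_PER_DAY = 86_400

daily = 100 * PB
per_second = daily / SECONDS_PER_DAY
print(f"{per_second / 10**12:.2f} TB/s")  # ≈ 1.16 TB/s sustained
```

So the headline figure corresponds to a sustained rate of well over a terabyte per second, which makes it clearer why only a handful of organizations operate at this scale.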

Data stored

Organization | Est. amount of data stored | Source
Google | 15,000 PB (= 15 exabytes) | https://what-if.xkcd.com/63/
NSA | 10,000 PB (possibly overestimated, see source) | http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/
Baidu | 2,000 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf
Facebook | 300 PB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
eBay | 90 PB | http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx
Sanger Institute | 22 PB (DNA sequencing data only; ~45 PB for everything, per Ewan Birney, May 2014) | http://insidehpc.com/2013/10/07/sanger-institute-deploys-22-petabytes-lustre-powered-ddn-storage/
Spotify | 10 PB | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam

It can be noted that eBay appears to store less than what it processes in a single day (perhaps related to the query rewriting thing mentioned above), while Google, Baidu and the NSA (of course) hoard data. I didn’t find an estimate of how much data Twitter stores, but the size of all existing tweets cannot be that large, perhaps less than the 100 TB they claim to process every day. In 2011, it was 20 TB (link), so it might be hovering around 100 TB now.

Lessons learned from mining heterogeneous cancer data sets

How much can we learn about cancer treatment and prevention by large-scale data collection and analysis?

An interesting paper was just published: “Assessing the clinical utility of cancer genomic and proteomic data across tumor types”. I am afraid the article is behind a paywall, but no worries – I will summarize the main points here! Basically, the authors have done a large-scale data mining study of data published within The Cancer Genome Atlas (TCGA) project, a very ambitious effort to collect molecular data on different kinds of tumors. The main question they ask is how much clinical utility these molecular data can add to conventional clinical information such as tumor stage, tumor grade, age and gender.

The lessons I drew from the paper are:

  • The molecular data does not add that much predictive information beyond the clinical information. As the authors put it in the discussion, “This echoes the observation that the number of cancer prognostic molecular markers in clinical use is pitifully small, despite decades of protracted and tremendous efforts.” It is an unfortunate fact of life in cancer genomics that many molecular classifiers (based on gene expression patterns usually) have been proposed to predict tumor severity, patient survival and so on, but different groups keep coming up with different gene sets and they tend not to be validated in independent cohorts.
  • When looking at which factors explain most of the variation in predictive performance, tumor type explains the most (37.4%), followed by the type of data used (that is, gene expression, protein expression, micro-RNA expression, DNA methylation or copy number variation), which explains 17.4%, with the interaction between tumor type and data type in third place (11.8%), suggesting that some data types are more informative for certain tumors than others. The algorithm used is fairly unimportant (5.2%). At the risk of drawing unwarranted conclusions, it is tempting to generalize: the most important factor is the intrinsic difficulty of modeling the system, the next most important is the choice of what data to collect and/or feature engineering, while the type of algorithm used for learning the model comes far behind.
  • Perhaps surprisingly, there was essentially no cross-tumor predictive power between models. (There was one exception to this.) That is, a model built for one type of tumor was typically useless when predicting the prognosis for another tumor type.
  • Individual molecular features (expression levels of individual genes, for instance) did not add predictive power beyond what was already in the clinical information, but in some cases molecular subtype did. The molecular subtype is a “molecular footprint” derived using consensus NMF (non-negative matrix factorization, an unsupervised method that can be used for dimension reduction and clustering). This footprint, which describes a general pattern, was informative even though the individual features making it up weren’t. This seems consistent with the issue mentioned above about gene sets failing to consistently predict tumor severity: the predictive information lives at a higher level than the individual genes.
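To make the NMF idea concrete, here is a minimal single-factorization sketch on synthetic non-negative “expression” data (the paper uses a consensus procedure over many factorizations; this only shows the basic decomposition):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic non-negative expression matrix: 60 tumors x 100 genes,
# built from 3 underlying "subtype" programs plus noise.
W_true = rng.random((60, 3))
H_true = rng.random((3, 100))
X = W_true @ H_true + 0.05 * rng.random((60, 100))

model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # per-tumor weights on each program (the "footprint")
H = model.components_        # per-gene loadings of each program

# A crude subtype call: assign each tumor to its dominant program.
subtype = W.argmax(axis=1)
print(W.shape, H.shape, np.bincount(subtype))
```

The point of the factorization is exactly the one made above: the rows of W summarize each tumor at a higher level than any individual gene does.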

The authors argue that one reason for the failure of predictive modeling in cancer research has been that investigators have relied too much on p values to say something about the clinical utility of their markers, when they should instead have focused more on the effect size, or the magnitude of difference in patient outcomes.

They also make a good point about reliability and reproducibility. I quote: “The literature of tumor biomarkers is plagued by publication bias and selective and/or incomplete reporting”. To help combat these biases, the authors (many of whom are associated with Sage Bionetworks, which I have mentioned repeatedly on this blog) have made available an open model-assessment platform, including of course all the models from the paper itself, but which can also be used to assess your own favorite model.

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadley’s PhD thesis (free pdf here) is a very inspiring tour of different aspects of practical data analysis, such as reshaping data into a “tidy” form that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages) and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also made the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread.
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” designed for responsive, dynamic graphics, streaming visualization and the web. This looked really nice. For example, you can easily add sliders to a plot so that you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R Markdown documents with knitr, so that the resulting report can contain sliders and other interactive elements (this is obviously not possible with PDFs, though). ggvis will be released on CRAN “in a week or so”.

Hadley also highlighted the magrittr package, which implements a pipe operator for R (Magritte/pipe … get it? Groan.). The pipe looks like %>%, and at first blush it may not seem like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.
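For readers who don’t use R, the flavor of these verbs and the pipe can be approximated with method chaining in pandas (a rough analogue of the dplyr/magrittr style, not the actual R API; the data are made up):

```python
import pandas as pd

votes = pd.DataFrame({
    "municipality": ["A", "B", "C", "D"],
    "region": ["north", "north", "south", "south"],
    "sd_share": [0.08, 0.06, 0.15, 0.12],
})

# select / filter / mutate / summarize / arrange, chained like %>%:
summary = (
    votes
    .loc[:, ["region", "sd_share"]]             # select
    .query("sd_share > 0.05")                   # filter
    .assign(sd_pct=lambda d: 100 * d.sd_share)  # mutate
    .groupby("region", as_index=False)
    .agg(mean_pct=("sd_pct", "mean"))           # summarize
    .sort_values("mean_pct", ascending=False)   # arrange
)
print(summary)
```

The readability argument is the same in both languages: each step names one operation, and the chain reads top to bottom instead of inside out.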

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary).

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: the first by Pawel Herman, who gave a very good general introduction to the roots, history, present and future of deep learning, and a more applied talk by Josephine Sullivan, in which she showed some impressive results obtained by her group in image recognition, as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although not as many as the number of parameters perhaps). Unfortunately, the sample sizes in genomics are usually small – it’s a very small n, large p domain at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as increasingly abstract visual categories in the case of image classification. Perhaps we could learn something about abstract transcriptional states on various levels. Or not.
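As a toy illustration of what “learning a representation” means in the simplest possible setting, here is a linear autoencoder trained by gradient descent on synthetic low-rank “expression” data (real deep learning models add nonlinearities, depth and far more data; this is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic expression matrix: 200 samples x 50 genes with rank-5 structure.
n, p, k = 200, 50, 5
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n, p))

W_enc = 0.1 * rng.normal(size=(p, k))  # encoder: genes -> k latent units
W_dec = 0.1 * rng.normal(size=(k, p))  # decoder: latent units -> genes
lr = 0.005

def mse(A, B):
    return float(np.mean((A - B) ** 2))

before = mse(X @ W_enc @ W_dec, X)
for _ in range(1000):
    H = X @ W_enc                  # latent representation of each sample
    R = H @ W_dec                  # reconstruction of the expression profile
    G = 2.0 * (R - X) / X.size     # gradient of the MSE w.r.t. R
    W_dec -= lr * (H.T @ G)
    W_enc -= lr * (X.T @ (G @ W_dec.T))
after = mse(X @ W_enc @ W_dec, X)
print(before, "->", after)
```

The interesting object here is not the reconstruction but H, the compressed per-sample representation; the deep learning hope is that stacking nonlinear layers would yield representations capturing something like “transcriptional states” at several levels of abstraction.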

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequences (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have a ways to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Stockholm data happenings

The weather may be terrible at the moment in Stockholm (it was really a downer to come back from the US this morning), but there are a couple of interesting data-related events coming up. This past week, I missed two interesting events: the KTH Symposium on Big Data (Mon, May 26) and the AWS Summit (Tue, May 27).

In June, there will be meetups on deep learning (Machine Learning Stockholm group, June 9 at Spotify) and on Shiny and ggvis, presented by Hadley Wickham himself (Stockholm useR group, June 16 at Pensionsmyndigheten). There are wait lists for both.

Danny Bickson is giving a tutorial on GraphLab at Stockholm iSocial Summer school June 2-4. He has indicated that he would be happy to chat with anyone who is interested in connection with this.

King are looking for a “data guru” – a novel job title!

Finally, Wilhelm Landerholm, a seasoned data scientist who was way ahead of the hype curve, has finally started (or revived?) his blog on big data, which unfortunately is in Swedish only: We Want Your Data.

BioFrontiers Symposium presentation

I just returned from Boulder, Colorado (lovely place!) where I was one of the speakers at the BioFrontiers Symposium on Big Data, Genomics and Molecular Networks. Here are the slides for the presentation, in the form they were supposed to be in (in actual fact, I ended up working from a slightly outdated version on stage).

The other talks were all good. Some themes that came up a few times were the importance of collaboration, getting data out of silos and into the open, and improved software development standards. (Maybe I just remember those because I was talking about the first two myself.) Again, I appreciated all the talks, but some nuggets that have stayed with me are Sean Eddy’s flashbacks to 1980s sequence analysis, David Haussler’s talk about why we have a chance to beat cancer (“Cancer isn’t smart, it dies with the patient”; “A tumor genome is an evolving metagenome, a mixture of genomes of subclones”) and about how to set up a ~100 PB repository of cancer-related sequence data, and Michael Snyder’s story of obsessively characterizing not only his own genome, proteome and transcriptome but also his antibody-ome, microbiome, methylome, etc., to which he is now also adding sensors to measure sleep, number of steps per day and so on.

EDIT: I have a question for my readers regarding the estimates of “TB processed per day” and “PB stored” for different companies and organizations in the linked presentation. One of the big surprises was that eBay claim to process so much data (100 PB per day), much more than Google (20 PB per day). My sources for the eBay figure are this PDF and this interview; the number pops up in many other places. Is this because of the “query rewriting” mentioned in this blog post, or is there another explanation?
