Follow the Data

A data-driven blog

Videos for rainy days

If, like me, you are on vacation, you might have time to watch some of these cool videos on a rainy day (unless you read books instead, as you probably should):

Video from the Europe-wide machine learning meetup on June 15. Andrew Ng’s talk is probably the highlight, but I also enjoyed Muthu Muthukrishnan’s “On Sketching” and Sam Bessalah’s “Abstract algebra for stream learning.”


Video (+ slides) from the Deep Learning meetup in Stockholm on June 9. I saw these live and they were quite good. After some preliminaries, the first presentation (by Pawel Herman) starts at around 00:06:00 in the video and the second presentation (by Josephine Sullivan) starts at around 1:18:30.


Videos from the Big Data in Biomedicine event at Stanford. Obviously I haven’t seen all of these, but the ones I have seen have been of really high quality. I particularly enjoyed the talk by Google’s David Glazer on the search behemoth’s efforts in genomics and the one by Sandrine Dudoit on the role of statisticians in data science (where she echoes, to some extent, Terry Speed’s pessimistic views), but I think all of the talks are worth watching.

Data size estimates

As part of preparing for a talk, I collected some available information on data sizes in a few corporations and other organizations. Specifically, I looked for estimates of the amount of data processed per day and the amount of data stored by each organization. For what it’s worth, here are the numbers I currently have. Feel free to add new data points, correct misconceptions etc.

Data processed per day

Organization | Est. amount of data processed per day | Source
eBay | 100 PB | http://www-conf.slac.stanford.edu/xldb11/talks/xldb2011_tue_1055_TomFastner.pdf
Google | 100 PB | http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014
Baidu | 10-100 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf
NSA | 29 PB | http://arstechnica.com/information-technology/2013/08/the-1-6-percent-of-the-internet-that-nsa-touches-is-bigger-than-it-seems/
Facebook | 600 TB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
Twitter | 100 TB | http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
Spotify | 2.2 TB (compressed; becomes 64 TB in Hadoop) | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam
Sanger Institute | 1.7 TB (DNA sequencing data only) | http://www.slideshare.net/insideHPC/cutts

100 PB seems to be the amount du jour for the giants. I was a bit surprised that eBay was already reporting in 2011 that it was processing 100 PB per day. As I mentioned in an earlier post, I suspect a lot of this is self-generated data from “query rewriting”, but I am not sure.

Data stored

Organization | Est. amount of data stored | Source
Google | 15,000 PB (= 15 exabytes) | https://what-if.xkcd.com/63/
NSA | 10,000 PB (possibly overestimated, see source) | http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/
Baidu | 2,000 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf
Facebook | 300 PB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
eBay | 90 PB | http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx
Sanger Institute | 22 PB (DNA sequencing data only; ~45 PB for everything, per Ewan Birney, May 2014) | http://insidehpc.com/2013/10/07/sanger-institute-deploys-22-petabytes-lustre-powered-ddn-storage/
Spotify | 10 PB | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam

It can be noted that eBay appears to store less than what it processes in a single day (perhaps related to the query rewriting mentioned above), while Google, Baidu and the NSA (of course) hoard data. I didn’t find an estimate of how much data Twitter stores, but the size of all existing tweets cannot be that large, perhaps less than the 100 TB they claim to process every day. In 2011, it was 20 TB (link), so it might be hovering around 100 TB now.

Lessons learned from mining heterogeneous cancer data sets

How much can we learn about cancer treatment and prevention by large-scale data collection and analysis?

An interesting paper was just published: “Assessing the clinical utility of cancer genomic and proteomic data across tumor types”. I am afraid the article is behind a paywall, but no worries – I will summarize the main points here! Basically the authors have done a large-scale data mining study of data published within The Cancer Genome Atlas (TCGA) project, a very ambitious effort to collect molecular data on different kinds of tumors. The main question they ask is how much clinical utility these molecular data can add to conventional clinical information such as tumor stage, tumor grade, age and gender.

The lessons I drew from the paper are:

  • The molecular data does not add that much predictive information beyond the clinical information. As the authors put it in the discussion, “This echoes the observation that the number of cancer prognostic molecular markers in clinical use is pitifully small, despite decades of protracted and tremendous efforts.” It is an unfortunate fact of life in cancer genomics that many molecular classifiers (usually based on gene expression patterns) have been proposed to predict tumor severity, patient survival and so on, but different groups keep coming up with different gene sets, and these tend not to be validated in independent cohorts.
  • When looking at which factors explain most of the variation, the type of tumor explains the most (37.4%), followed by the type of data used (that is, gene expression, protein expression, micro-RNA expression, DNA methylation or copy number variation), which explains 17.4%, with the interaction between tumor type and data type in third place (11.8%), suggesting that some data types are more informative for certain tumors than others. The algorithm used is fairly unimportant (5.2%). At the risk of drawing unwarranted conclusions, it is tempting to generalize: the most important factor is the intrinsic difficulty of modeling the system; the next most important is deciding what data to collect and how to engineer features; and the type of algorithm used to learn the model comes far behind.
  • Perhaps surprisingly, there was essentially no cross-tumor predictive power between models. (There was one exception to this.) That is, a model built for one type of tumor was typically useless when predicting the prognosis for another tumor type.
  • Individual molecular features (expression levels of individual genes, for instance) did not add predictive power beyond what was already in the clinical information, but in some cases molecular subtype did. The molecular subtype is a “molecular footprint” derived using consensus NMF (non-negative matrix factorization, an unsupervised method that can be used for dimensionality reduction and clustering; see the sketch after this list). This footprint, which describes a general pattern, was informative, whereas the individual features making up the footprint weren’t. This seems consistent with the issue mentioned above about gene sets failing to consistently predict tumor severity: the predictive information sits at a higher level than the individual genes.
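To make the consensus-NMF idea a little more concrete, here is a minimal sketch in Python with scikit-learn (which is not what the authors used; the matrix dimensions, number of subtypes and random data are all made up) of factorizing an expression matrix and assigning each sample to the component with the largest weight:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical expression matrix: 200 samples x 5,000 genes, non-negative values
rng = np.random.RandomState(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(200, 5000))

# Factorize into k "metagene" components (k = assumed number of subtypes)
k = 4
model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)   # samples x components: per-sample weights
H = model.components_        # components x genes: the "footprints"

# Assign each sample to the component with the largest weight --
# a crude molecular subtype label
subtype = W.argmax(axis=1)
print(np.bincount(subtype))
```

A real consensus NMF would repeat the factorization over many random restarts or resamplings and cluster samples by how often they end up in the same component; the sketch above shows only a single run.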

The authors argue that one reason for the failure of predictive modeling in cancer research has been that investigators have relied too much on p values to say something about the clinical utility of their markers, when they should instead have focused more on the effect size, or the magnitude of difference in patient outcomes.

They also make a good point about reliability and reproducibility. I quote: “The literature of tumor biomarkers is plagued by publication bias and selective and/or incomplete reporting”. To help combat these biases, the authors (many of whom are associated with Sage Bionetworks, which I have mentioned repeatedly on this blog) have made an open model-assessment platform available; it includes all the models from the paper itself, but can also be used to assess your own favorite model.

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadley’s PhD thesis (free PDF here) is a very inspiring tour of different aspects of practical data analysis, such as reshaping data into a “tidy” form that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages) and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also written the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread.
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” designed for responsive, dynamic graphics and streaming visualization, aimed at the web. This looked really nice. For example, you can easily add sliders to a plot so that you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R Markdown documents with knitr so that the resulting report can contain sliders and other interactive elements (this is obviously not possible with PDFs, though). ggvis will be released on CRAN “in a week or so”.

Hadley also highlighted the magrittr package, which implements a pipe operator for R (Magritte/pipe … get it? Groan). The pipe is written %>%, and at first blush it may not look like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary).

Deep learning and genomics?

Yesterday, I attended an excellent meetup organized by the Stockholm Machine Learning meetup group at Spotify’s headquarters. There were two presentations: the first, by Pawel Herman, was a very good general introduction to the roots, history, present and future of deep learning, and the second, a more applied talk by Josephine Sullivan, showed some impressive results obtained by her group in image recognition, as detailed in a recent paper titled “CNN features off-the-shelf: An astounding baseline for recognition” [pdf]. I’m told that slides from the presentations will be posted on the meetup web page soon.

Anyway, this meetup naturally got me thinking about whether deep learning could be used for genomics in some fruitful way. At first blush it does not seem like a good match: deep learning models have an enormous number of parameters and mostly seem to be useful with a very large number of training examples (although perhaps not as many as the number of parameters). Unfortunately, sample sizes in genomics are usually small – it is very much a small-n, large-p domain, at least in a general sense.

I wonder whether it would make sense to throw a large number of published human gene expression data sets (microarray or RNA-seq; there should be thousands of these by now) into a deep learner to see what happens. The idea would not necessarily be to create a good classification model, but rather to learn a good hierarchical representation of gene expression patterns. Both Pawel and Josephine stressed that one of the points of deep learning is to automatically learn a good multi-level data representation, such as increasingly abstract visual categories in the case of image classification. Perhaps we could learn something about abstract transcriptional states at various levels. Or not.
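Just to make the speculation concrete, here is a minimal sketch of what such a representation learner could look like, using a modern Python deep learning library (Keras via TensorFlow) purely for illustration; the data, layer sizes and preprocessing are invented, and a real attempt would need careful cross-study normalization:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for a compendium of expression profiles (samples x genes),
# assumed to be scaled to [0, 1]; real data would come from public repositories
n_samples, n_genes = 5000, 2000
X = np.random.rand(n_samples, n_genes).astype("float32")

# A small autoencoder: the bottleneck ("code") layer is the learned
# lower-dimensional representation of transcriptional state
inputs = keras.Input(shape=(n_genes,))
h = layers.Dense(512, activation="relu")(inputs)
code = layers.Dense(64, activation="relu", name="code")(h)
h = layers.Dense(512, activation="relu")(code)
outputs = layers.Dense(n_genes, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=128, validation_split=0.1)

# Extract the learned representation for downstream clustering or classification
encoder = keras.Model(inputs, autoencoder.get_layer("code").output)
codes = encoder.predict(X)
print(codes.shape)  # (5000, 64)
```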

There are currently two strains of genomics that I feel are especially interesting from a “big data” perspective, namely single-cell transcriptomics and metagenomics (or metatranscriptomics, metaproteomics and what have you). Perhaps deep learning could actually be a good paradigm for analyzing single-cell transcriptomics (single-cell RNA-seq) data. Some researchers are talking about generating tens of thousands of single-cell expression profiles. The semi-redundant information obtained from many similar but not identical profiles is reminiscent of the redundant visual features that deep learning methods like to consume as input (according to the talks yesterday). Maybe this type of data would fit better than the “published microarray data” idea above.

For metagenomics (or meta-X-omics), it’s harder to speculate on what a useful deep learning solution would be. I suppose one could try to feed millions or billions of bits of sequence (k-mers) to a deep learning system in the hope of learning some regularities in the data. However, it was also mentioned at the meetup that deep learning methods still have some way to go when it comes to natural language processing, and it seems to me that DNA “words” are closer to natural language than they are to pixel data.
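As a trivial illustration of what a k-mer representation looks like (toy sequences, arbitrary k, nothing deep-learning-specific yet), one could count overlapping k-mers per read and feed those counts to whatever model one likes:

```python
from collections import Counter

def kmer_counts(sequence, k=4):
    """Count overlapping k-mers in a DNA sequence (a crude bag-of-words encoding)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# Toy reads; a real metagenomic data set would have millions of them
reads = ["ACGTACGTGGCT", "TTGACGTACGTA"]
total = Counter()
for read in reads:
    total.update(kmer_counts(read, k=4))

print(total.most_common(5))
```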

I suppose we will find out eventually what can be done in this field now that Google has joined the genomics party!

Stockholm data happenings

The weather may be terrible in Stockholm at the moment (it was really a downer to come back from the US this morning), but there are a couple of interesting data-related events coming up. This past week, I missed two interesting events: the KTH Symposium on Big Data (Monday, May 26) and the AWS Summit (Tuesday, May 27).

In June, there will be meetups on deep learning (Machine Learning Stockholm group, June 9 at Spotify) and on Shiny and ggvis, presented by Hadley Wickham himself (Stockholm useR group, June 16 at Pensionsmyndigheten). There are wait lists for both.

Danny Bickson is giving a tutorial on GraphLab at the Stockholm iSocial Summer School, June 2-4. He has indicated that he would be happy to chat with anyone who is interested in connection with this.

King are looking for a “data guru” – a novel job title!

Finally, Wilhelm Landerholm, a seasoned data scientist who was way ahead of the hype curve, has started (or revived?) his blog on big data, which unfortunately is in Swedish only: We Want Your Data.

BioFrontiers Symposium presentation

I just returned from Boulder, Colorado (lovely place!), where I was one of the speakers at the BioFrontiers Symposium on Big Data, Genomics and Molecular Networks. Here are the slides for the presentation, in the form they were supposed to be in (in actual fact, I ended up working from a slightly outdated version on stage).

The other talks were all good. Some themes that came up a few times were the importance of collaboration, getting data out of silos and into the open, and improved software development standards. (Maybe I just remember those because I talked about the first two myself.) Again, I appreciated all the talks, but some nuggets that have stayed with me are Sean Eddy’s flashbacks to 1980s sequence analysis, David Haussler’s talk about why we have a chance to beat cancer (“Cancer isn’t smart, it dies with the patient”; “A tumor genome is an evolving metagenome, a mixture of genomes of subclones”) and about how to set up a ~100 PB repository of cancer-related sequence data, and Michael Snyder’s story of obsessively characterizing not only his own genome, proteome and transcriptome but also his antibody-ome, microbiome, methylome, etc., to which he is now also adding sensors to measure sleep, number of steps per day and so on.

EDIT: I have a question for my readers regarding the estimates of “TB processed per day” and “PB stored” for different companies and organizations in the linked presentation. One of the big surprises was that eBay claims to process so much data (100 PB per day), much more than Google (20 PB per day). My sources for the eBay figure are this PDF and this interview; the number pops up in many other places. Is this because of the “query rewriting” mentioned in this blog post, or something else?

GraphLab Create

Just a heads-up that you can now get a free beta version of GraphLab Create. It’s a Python library that lets you use GraphLab functionality to easily do things like calculating PageRank, building recommender systems and so on. Good for people like me who don’t have the time or patience for complicated installation processes (you can just use pip install). So far I’ve only worked through some examples, like the Six Degrees of Kevin Bacon tutorial from Strata 2014, while waiting for inspiration to strike regarding what I should implement for my own purposes. It seems quite intuitive so far.
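For flavor, here is roughly what a minimal recommender looks like with the library, written from memory of the beta API; the file name and column names are hypothetical, and the exact function names may differ:

```python
import graphlab  # GraphLab Create beta, installed via pip

# Load user-item ratings into an SFrame (hypothetical file and column names)
ratings = graphlab.SFrame.read_csv("ratings.csv")

# Build a default recommender and get the top 10 recommendations per user
model = graphlab.recommender.create(ratings, user_id="user_id", item_id="item_id")
recommendations = model.recommend(k=10)
print(recommendations.head())
```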

Cancer, machine learning and data integration

Machine Learning Methods in the Computational Biology of Cancer is an arXiv preprint of a pretty nice article on analysis methods that can be used for high-dimensional biological (and other) data – although the examples come from cancer research, they could easily be about something else. The paper does a good job of describing penalized regression methods such as the lasso, ridge regression and the elastic net. It also goes into compressed sensing and its applicability to biology, although it cautions that compressed sensing cannot yet be straightforwardly applied to biological data. This is because compressed sensing assumes that one can choose the “measurement matrix” freely, whereas in biology this matrix (usually called the “design matrix” in this context) is already fixed.
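As a reminder of what these penalties do in practice, here is a small scikit-learn sketch (simulated data, arbitrary penalty strengths; the paper itself is not tied to any particular library) comparing the lasso, ridge regression and the elastic net on a p >> n problem:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split

# Simulated high-dimensional data: far more features than samples,
# with only a handful of truly informative features
X, y = make_regression(n_samples=100, n_features=1000, n_informative=10,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "lasso": Lasso(alpha=1.0),                           # L1 penalty: sparse coefficients
    "ridge": Ridge(alpha=1.0),                           # L2 penalty: shrinkage, no sparsity
    "elastic net": ElasticNet(alpha=1.0, l1_ratio=0.5),  # mix of L1 and L2
}
for name, model in models.items():
    model.fit(X_train, y_train)
    n_nonzero = int(np.sum(model.coef_ != 0))
    print(f"{name}: test R^2 = {model.score(X_test, y_test):.2f}, "
          f"non-zero coefficients = {n_nonzero}")
```

The point of the comparison is simply that the L1-type penalties zero out most coefficients, which is what makes them attractive when p is much larger than n.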

The Critical Assessment of Massive Data Analysis (CAMDA) 2014 conference has released its data analysis challenges. Last year’s challenges on toxicogenomics and toxicity prediction will be reprised (perhaps in modified form, I didn’t check), but there is also a new challenge which I find interesting because it focuses on data integration (combining distinct data sets on gene, protein and micro-RNA expression as well as structural variations and DNA methylation) and uses published data from the International Cancer Genome Consortium (ICGC). I think it’s a good thing to re-analyze, mash up and meta-analyze data from these large-scale projects, and the CAMDA challenges are interesting because they are so open-ended, in contrast to, for example, Kaggle challenges (which I also like, but in a different way). The goals in the CAMDA challenges are quite open to interpretation (and also ambitious), for instance:

  • Question 1: What are disease causal changes? Can the integration of comprehensive multi-track -omics data give a clear answer?
  • Question 2: Can personalized medicine and rational drug treatment plans be derived from the data? And how can we validate them down the road?

Two good resources (about sklearn and deep learning)

I have been using R, mostly happily, for the past 6 or 7 years, for its variety of statistical and machine learning packages and the relative ease of producing nice-looking plots. At the same time, I am a big user of Python for things that R really doesn’t do that well, such as large-scale string manipulation. I had been aware of scikit-learn (or sklearn) for a while as a potential way to do “everything” in Python, including stats and plotting, but never really felt the pull to start using it. In the beginning, it felt too immature; later, it felt too messy when I looked at the documentation.

Last week, however, I came across a really good tutorial by Jake Vanderplas that finally made sklearn click for me and perhaps will push me over the edge to start using it. (I don’t expect to leave R any time soon, though…) The tutorial shows, step by step, how to divide your data set into training and test sets, fit models and make predictions, perform grid searches for parameter settings, plot learning curves etc.
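The workflow the tutorial walks through looks roughly like this; a minimal sketch using the built-in iris data and an SVM (not taken from the tutorial itself, and written against current scikit-learn module names):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Split the data, then tune hyperparameters with a cross-validated grid search
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```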

Deep learning is another subject (although a much bigger one than sklearn, of course) that I have kept up a passing interest in but never really looked into properly, because I wasn’t sure where to start. The new book Deep Learning: Methods and Applications (PDF link) by Li Deng and Dong Yu seems like a good place to start. I’ve only read a few chapters, but so far it has done a good job of clarifying terms and putting deep learning methods into a historical context.
