Follow the Data

A data-driven blog

Archive for the tag “data-analysis”

Data size estimates

As part of preparing for a talk, I collected some available information on data sizes in a few corporations and other organizations. Specifically, I looked for estimates of the amount of data processed per day and the amount of data stored by each organization. For what it’s worth, here are the numbers I currently have. Feel free to add new data points, correct misconceptions etc.

Data processed per day

Organization | Est. amount of data processed per day | Source
eBay | 100 PB | http://www-conf.slac.stanford.edu/xldb11/talks/xldb2011_tue_1055_TomFastner.pdf
Google | 100 PB | http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014
Baidu | 10-100 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf
NSA | 29 PB | http://arstechnica.com/information-technology/2013/08/the-1-6-percent-of-the-internet-that-nsa-touches-is-bigger-than-it-seems/
Facebook | 600 TB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
Twitter | 100 TB | http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf
Spotify | 2.2 TB (compressed; becomes 64 TB in Hadoop) | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam
Sanger Institute | 1.7 TB (DNA sequencing data only) | http://www.slideshare.net/insideHPC/cutts

100 PB seems to be the amount du jour for the giants. I was a bit surprised that eBay was already reporting in 2011 that it processed 100 PB per day. As I mentioned in an earlier post, I suspect a lot of this is self-generated data from “query rewriting”, but I am not sure.

Data stored

Organization | Est. amount of data stored | Source
Google | 15,000 PB (= 15 exabytes) | https://what-if.xkcd.com/63/
NSA | 10,000 PB (possibly an overestimate; see source) | http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/
Baidu | 2,000 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf
Facebook | 300 PB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
eBay | 90 PB | http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx
Sanger Institute | 22 PB (DNA sequencing data only; ~45 PB for everything, per Ewan Birney, May 2014) | http://insidehpc.com/2013/10/07/sanger-institute-deploys-22-petabytes-lustre-powered-ddn-storage/
Spotify | 10 PB | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam

It can be noted that eBay appears to store less data than it processes in a single day (perhaps related to the query rewriting mentioned above), while Google, Baidu and (of course) the NSA hoard data. I didn’t find an estimate of how much data Twitter stores, but the size of all existing tweets cannot be that large – perhaps less than the 100 TB they claim to process every day. In 2011, it was 20 TB (link), so it might be hovering around 100 TB now.

Topology and data analysis: Gunnar Carlsson and Ayasdi

A few months ago, I read in Wired [Data-Visualization Firm’s New Software Autonomously Finds Abstract Connections] and the Guardian [New big data firm to pioneer topological data analysis] about Ayasdi, the new data visualization & analytics company founded by Stanford professor Gunnar Carlsson, which has received millions in funding from Khosla Ventures, DARPA and elsewhere. Today, I had the opportunity to hear Carlsson speak at the Royal Institute of Technology (KTH) in Stockholm about the mathematics underlying Ayasdi’s tools. I was very eager to hear how topology (Carlsson’s specialty) connects to data visualization, and about their reported success in classifying tumor samples from patients [Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival].

The talk was very nice. Actually, there were two talks – one for a more “general audience” (which, in truth, probably consisted mostly of hardcore maths or data geeks) and one that went much deeper into the mathematics, and largely over my head.

One thing that intrigued me was that almost all of his examples were taken from biology: disease progression, cell cycle gene expression, RNA hairpin folding, copy number variations, mapping genotypes to geographic regions … the list goes on. I suppose it’s simply because Carlsson happens to work a lot with biologists at Stanford, but could it be that biology is an especially fertile area for innovation in data analysis? As Carlsson highlights in a review paper [Topology and Data], data sets from high-throughput biology often have many dimensions that are superfluous or have unknown significance, and there is no good understanding of which distance measures between data points should be used. That is one reason to consider methods that do not break down when the scale is changed – such as topological methods, which are insensitive to the actual choice of metric.

(Incidentally, I think there is a potential blog post waiting to be written about how many famous data scientists have come out of a biology/bioinformatics background. Names like Pete Skomoroch, Toby Segaran and Michael Driscoll come to mind, and those were just the ones I thought of instantly.)

Another nice aspect of the talk was that it was not a sales pitch for Ayasdi (the company was hardly even mentioned) but more of a bird’s-eye view of topology and its relation to clustering and visualization. In my (over)simplified understanding, the methods presented represent the data as a network whose nodes – which, in an ideal scenario, correspond to “connected components” – are clusters derived using, e.g., hierarchical clustering. However, no single cutoff value is defined for breaking the data into clusters; instead, the whole outcome of the clustering – its profile, so to speak – is encoded and used in the subsequent analysis. Carlsson mentioned that one of the points of this sort of network representation of data was to “avoid breaking things apart”, as clustering algorithms do. He talked about classifying the data point clouds using “barcodes”, which record which features persist across changes of scale. The details of how these barcodes are calculated were beyond my comprehension.
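To make the network idea concrete for myself, here is a minimal sketch of a Mapper-style construction in Python, pieced together from published descriptions of the method. It is certainly not Ayasdi’s actual implementation; the one-dimensional filter function, the interval cover and the clustering cut height are all simplifying assumptions on my part, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def mapper_graph(points, filter_values, n_intervals=10, overlap=0.5, cut=0.1):
    """Mapper-style network: cover the range of a 1-D filter function with
    overlapping intervals, cluster the points falling in each interval, and
    connect clusters that share points."""
    lo, hi = filter_values.min(), filter_values.max()
    # Interval length chosen so that n_intervals intervals with the given
    # fractional overlap exactly cover [lo, hi].
    length = (hi - lo) / ((n_intervals - 1) * (1 - overlap) + 1)
    step = length * (1 - overlap)
    nodes = []  # each node is a set of point indices (one local cluster)
    for i in range(n_intervals):
        start = lo + i * step
        idx = np.where((filter_values >= start) &
                       (filter_values <= start + length))[0]
        if len(idx) == 1:
            nodes.append(set(idx))
        elif len(idx) > 1:
            # Single-linkage clustering within the slice; the cut height is a
            # local heuristic, not a global cutoff on the whole data set.
            labels = fcluster(linkage(points[idx], method="single"),
                              t=cut, criterion="distance")
            nodes.extend(set(idx[labels == lab]) for lab in np.unique(labels))
    # Link two nodes whenever their clusters share at least one data point,
    # which can only happen in the overlap between adjacent intervals.
    edges = {(a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))
             if nodes[a] & nodes[b]}
    return nodes, edges
```

A natural filter function might be the first principal component or a density estimate. Because the resulting graph only depends on which points end up clustered together, rescaling the data tends to deform the picture rather than destroy it – which, as I understand it, is the point.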

Carlsson showed some examples of how visualizations created using his methods improved on hierarchical clustering dendrograms and PCA/MDS plots. One advantage of the method, he said, is that it can identify small features in a large data set – features that would be “washed out” in PCA and not picked up by standard clustering methods.

I look forward to learning more about topological data analysis. There are some (links to) papers at Ayasdi’s web site if you are interested, e.g.: Extracting insights from the shape of complex data using topology, Topology and Data (already mentioned above), Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition.

Online course experiences: Coursera Data Analysis and Syracuse U. Data Science

I’ve been following two online courses related to data analysis during the past few months: the Data Analysis course given by Johns Hopkins U. through Coursera and the Introduction to Data Science course given by Syracuse U. through Coursesites.

The Data Analysis course is the third one I have enrolled in on Coursera, and the first where I have completed all the coursework (I received my statement of accomplishment this past weekend, yippee!). Of the two previous courses, I had tried to follow one but gave up because the platform graded the quizzes incorrectly – a childish reason to quit a course, because it’s what you learn that should matter, but the weird grading left me uncertain about which parts of the material I had really understood.

I think the Data Analysis course was quite good, because it focused not only on R and statistics (which is great) but also on more practical aspects of data analysis, like how you might organize your files and write up a good analysis report. It introduced me to things like R Markdown and knitr, which I had heard about but not used until now. The course contents were also surprisingly up to date, with things like the medley package being included in the video lectures. This package, which was developed by a Kaggle competitor to make building ensemble models easier, was first mentioned in January 2013 on a Kaggle forum and is not yet available as an official R package, yet it was covered in the course with nice examples of how to run it!
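As far as I understand it, the core trick in medley is greedy forward model selection, in the spirit of Caruana et al.’s ensemble selection. Here is a minimal Python sketch of that idea; the function names and interfaces are my own invention for illustration, and medley itself (an R package) surely differs in its details:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error, used here as the score to minimize."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def greedy_ensemble(predictions, y_true, metric=rmse, n_rounds=20):
    """Greedy forward selection: repeatedly add (with replacement) whichever
    model's hold-out predictions most improve the averaged ensemble.

    predictions: dict mapping model name -> array of hold-out predictions
    y_true: array of true hold-out targets
    """
    chosen = []
    ensemble_sum = np.zeros_like(y_true, dtype=float)
    for _ in range(n_rounds):
        best_name, best_score = None, np.inf
        for name, pred in predictions.items():
            # Score the ensemble as it would look with this model added.
            score = metric(y_true, (ensemble_sum + pred) / (len(chosen) + 1))
            if score < best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        ensemble_sum += predictions[best_name]
    return chosen, ensemble_sum / len(chosen)
```

Selecting with replacement means a strong model can enter the average several times, which acts as a crude weighting scheme.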

There is a “post-mortem” podcast at Simply Statistics where Jeff Leek (the main instructor of the course) and Roger Peng discuss what went right and what went wrong.

The course videos are on YouTube and course lectures are available on GitHub; both videos and lectures are tagged by week. Some numbers on participation given by Jeff Leek:

There were approximately 102,000 students enrolled in the course, about 51,000 watched videos, 20,000 did quizzes, and 5,500 did/graded the data analysis assignments.

Personally, I would perhaps have liked the contents to be slightly more difficult (because I came in with a fair amount of subject knowledge), but on the other hand the given level of difficulty let me get away with spending 3-5 hours per week on the course on average, as advertised. I suspect many students spent a lot more.

The other course I participated in, Introduction to Data Science from Syracuse University, was similar to the Data Analysis course in that it used R as the vehicle for introducing statistical concepts. However, this course was much more limited in scope and basically did not assume any prior exposure to statistics or programming. I felt that this was a mismatch for me and in the end did not finish all of the coursework. I did read the accompanying textbook which, in parts, did a very good job of explaining the value of data analysis in real-world scenarios. I think the course would be most useful for people who are curious about “big data” and “data science” and want to dip their toes in a little, but who don’t necessarily intend to work with data analysis. Maybe that was the intention.

Foolhardy as I am, I plan to take another MOOC data science course beginning in May, namely Introduction to Data Science. I’ll report back here afterwards!


Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also theinfo.org, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the Comprehensive Knowledge Archive Network (CKAN), which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as data.gov.

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data”. Factual basically wants to list facts, and puts the emphasis on data accuracy, letting you express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology”, which allows users to explore the data in a more sophisticated way than other services such as Amazon’s Public Data Sets.

Companies specializing in helping users make sense of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users, versus Data Applied’s <100 KB file size limit for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.

Beautiful data

One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: first, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook and Zillow via their APIs, and second, how to draw interesting conclusions from those data using techniques such as clustering, collaborative filtering, matrix decompositions and decision trees. Everything was demonstrated in simple Python code, so it was easy to try it all out yourself.
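To give a flavor of that style, here is a minimal sketch of user-based collaborative filtering in the same spirit – my own illustration, not code from the book – which scores unseen items by a similarity-weighted average of other users’ ratings:

```python
import math

def pearson_similarity(prefs, a, b):
    """Pearson correlation between two users' ratings of shared items."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    sum_a = sum(prefs[a][it] for it in shared)
    sum_b = sum(prefs[b][it] for it in shared)
    sum_a2 = sum(prefs[a][it] ** 2 for it in shared)
    sum_b2 = sum(prefs[b][it] ** 2 for it in shared)
    sum_ab = sum(prefs[a][it] * prefs[b][it] for it in shared)
    num = sum_ab - sum_a * sum_b / n
    den = math.sqrt((sum_a2 - sum_a ** 2 / n) * (sum_b2 - sum_b ** 2 / n))
    return num / den if den else 0.0

def recommend(prefs, user):
    """Score items the user hasn't rated, weighting other users' ratings
    by their similarity to this user."""
    totals, sim_sums = {}, {}
    for other in prefs:
        if other == user:
            continue
        sim = pearson_similarity(prefs, user, other)
        if sim <= 0:
            continue  # ignore dissimilar and uncorrelated users
        for item, rating in prefs[other].items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + rating * sim
                sim_sums[item] = sim_sums.get(item, 0.0) + sim
    return sorted(((totals[it] / sim_sums[it], it) for it in totals),
                  reverse=True)

# Example:
# prefs = {"ann": {"m1": 5, "m2": 3}, "bob": {"m1": 4, "m2": 2, "m3": 5}}
# recommend(prefs, "ann")  # -> [(5.0, 'm3')]
```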

When I heard this spring that Segaran had co-authored a new book, Programming the Semantic Web, and co-edited another, Beautiful Data, I pre-ordered them both on Amazon to Singapore, where I live. I got the former book about a month ago, but I won’t discuss it here because, frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).

Beautiful Data, on the other hand, is better suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released on 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before the physical book arrived. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects, along with a couple of more essay-like chapters. Each chapter is written by a different author, and the scope is very wide. Most people who read the book will probably find a couple of chapters they really like and a couple they don’t care much about.

One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i.e., analysis with no specific hypothesis in mind at the outset), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me was to see the Swedish word “fjortis” as one of the most female-specific (= most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What surprises me is both that a Swedish slang term (which means, roughly, an immature adolescent – it is derived from the word “fjorton”, which means “fourteen”) is apparently so common on an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word whose meaning has shifted slightly.
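The chapter does its analysis in R; as a rough illustration of what “most characteristic keywords” can mean, here is a minimal Python sketch using smoothed log-odds of word frequencies between two groups. This is my own stand-in, not necessarily the scoring the FaceStats authors used:

```python
import math
from collections import Counter

def characteristic_words(group_a_texts, group_b_texts, min_count=5):
    """Rank words by smoothed log-odds of occurring in group A vs group B.
    Positive scores are A-characteristic, negative scores B-characteristic."""
    counts_a = Counter(w for t in group_a_texts for w in t.lower().split())
    counts_b = Counter(w for t in group_b_texts for w in t.lower().split())
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {}
    for w in set(counts_a) | set(counts_b):
        if counts_a[w] + counts_b[w] < min_count:
            continue  # skip rare words, whose ratios are noisy
        p_a = (counts_a[w] + 1) / (total_a + 1)  # add-one smoothing
        p_b = (counts_b[w] + 1) / (total_b + 1)
        scores[w] = math.log(p_a / p_b)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Run on descriptions of female vs male faces, the top of such a ranking would contain words like “fjortis” in the FaceStats example.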

Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

The description of the day-to-day workload in that quote sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?

There is lots of other interesting stuff in the book. You will read about how to design an image processing system for a spacecraft going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.

There’s no point in enumerating all of the interesting topics here – suffice it to say that I recommend the book to anyone who wants to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter, which outlines what data can’t do and how we frequently get fooled by data and fail to intuitively understand probabilities. A refreshing pinch of skepticism.
