Follow the Data

A data driven blog

Data size estimates

As part of preparing for a talk, I collected some available information on data sizes at a few corporations and other organizations. Specifically, I looked for estimates of the amount of data processed per day and the amount of data stored by each organization. For what it’s worth, here are the numbers I currently have. Feel free to add new data points, correct misconceptions, and so on.

Data processed per day

| Organization | Est. amount of data processed per day | Source |
|---|---|---|
| eBay | 100 PB | http://www-conf.slac.stanford.edu/xldb11/talks/xldb2011_tue_1055_TomFastner.pdf |
| Google | 100 PB | http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014 |
| Baidu | 10-100 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf |
| NSA | 29 PB | http://arstechnica.com/information-technology/2013/08/the-1-6-percent-of-the-internet-that-nsa-touches-is-bigger-than-it-seems/ |
| Facebook | 600 TB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ |
| Twitter | 100 TB | http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf |
| Spotify | 2.2 TB (compressed; becomes 64 TB in Hadoop) | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam |
| Sanger Institute | 1.7 TB (DNA sequencing data only) | http://www.slideshare.net/insideHPC/cutts |

100 PB seems to be the amount du jour for the giants. I was a bit surprised that eBay was already reporting in 2011 that it processed 100 PB per day. As I mentioned in an earlier post, I suspect a lot of this is self-generated data from “query rewriting”, but I am not sure.
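
To put these daily volumes in perspective, here is a quick back-of-the-envelope sketch in Python that converts the estimates from the table into average sustained throughput. It assumes decimal units (1 PB = 1,000 TB = 10^15 bytes) and uses only the rough figures listed above; nothing here is more precise than the estimates themselves.

```python
# Back-of-the-envelope conversion of the daily estimates above into
# average sustained throughput (decimal units: 1 PB = 1000 TB = 10**15 bytes).

PB = 10**15  # bytes
TB = 10**12  # bytes

# Rough per-day estimates from the table above.
daily_bytes = {
    "eBay": 100 * PB,
    "Google": 100 * PB,
    "NSA": 29 * PB,
    "Facebook": 600 * TB,
    "Twitter": 100 * TB,
    "Spotify (compressed)": 2.2 * TB,
}

SECONDS_PER_DAY = 24 * 60 * 60

for org, volume in daily_bytes.items():
    gb_per_s = volume / SECONDS_PER_DAY / 10**9
    print(f"{org:>22}: ~{gb_per_s:,.1f} GB/s sustained")
```

Even at this level of approximation, 100 PB per day works out to more than a terabyte per second of sustained throughput.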

Data stored

| Organization | Est. amount of data stored | Source |
|---|---|---|
| Google | 15,000 PB (= 15 exabytes) | https://what-if.xkcd.com/63/ |
| NSA | 10,000 PB (possibly overestimated, see source) | http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/ |
| Baidu | 2,000 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf |
| Facebook | 300 PB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ |
| eBay | 90 PB | http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx |
| Sanger Institute | 22 PB (DNA sequencing data only; ~45 PB for everything, per Ewan Birney, May 2014) | http://insidehpc.com/2013/10/07/sanger-institute-deploys-22-petabytes-lustre-powered-ddn-storage/ |
| Spotify | 10 PB | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam |

It is worth noting that eBay appears to store less data than it processes in a single day (perhaps related to the query rewriting mentioned above), while Google, Baidu and the NSA (of course) hoard data. I didn’t find an estimate of how much data Twitter stores, but the size of all existing tweets cannot be that large, perhaps less than the 100 TB they claim to process every day. In 2011 it was about 20 TB (link), so it might be hovering around 100 TB now.
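
As a rough illustration of the store-versus-process comparison, here is a minimal Python sketch that divides each stored estimate by the corresponding daily processing estimate for the organizations that appear in both tables. The figures are taken straight from the tables above (in PB, decimal units); Baidu is omitted because its daily figure is a range.

```python
# Crude "days of intake" comparison: how many days of the reported daily
# processing volume would fit into the reported stored volume?
# All figures are the rough estimates from the two tables above, in PB.

processed_per_day_pb = {"eBay": 100, "Google": 100, "NSA": 29, "Facebook": 0.6}
stored_pb = {"eBay": 90, "Google": 15000, "NSA": 10000, "Facebook": 300}

for org, daily in processed_per_day_pb.items():
    days = stored_pb[org] / daily
    print(f"{org}: stored data ~ {days:.0f} day(s) of daily processing")
```

By this crude measure, eBay stores less than one day’s worth of what it processes, while the other three store hundreds of days’ worth.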
