Data size estimates
As part of preparing for a talk, I collected some available information on data sizes at a few corporations and other organizations. Specifically, I looked for estimates of the amount of data processed per day and the amount of data stored by each organization. For what it’s worth, here are the numbers I currently have. Feel free to add new data points, correct misconceptions, etc.
Data processed per day
Organization | Est. amount of data processed per day | Source |
eBay | 100 PB | http://www-conf.slac.stanford.edu/xldb11/talks/xldb2011_tue_1055_TomFastner.pdf |
Google | 100 PB | http://www.slideshare.net/kmstechnology/big-data-overview-2013-2014 |
Baidu | 10-100 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf |
NSA | 29 PB | http://arstechnica.com/information-technology/2013/08/the-1-6-percent-of-the-internet-that-nsa-touches-is-bigger-than-it-seems/ |
Facebook | 600 TB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ |
Twitter | 100 TB | http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf |
Spotify | 2.2 TB (compressed; becomes 64 TB in Hadoop) | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam |
Sanger Institute | 1.7 TB (DNA sequencing data only) | http://www.slideshare.net/insideHPC/cutts |
100 PB seems to be the amount du jour for the giants. I was a bit surprised that eBay was already reporting 100 PB/day in 2011. As I mentioned in an earlier post, I suspect a lot of this is self-generated data from “query rewriting”, but I am not sure.
Data stored
Organization | Est. amount of data stored | Source |
Google | 15,000 PB (= 15 exabytes) | https://what-if.xkcd.com/63/ |
NSA | 10,000 PB (possibly overestimated, see source) | http://www.forbes.com/sites/netapp/2013/07/26/nsa-utah-datacenter/ |
Baidu | 2,000 PB | http://on-demand.gputechconf.com/gtc/2014/presentations/S4651-deep-learning-meets-heterogeneous-computing.pdf |
Facebook | 300 PB | https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/ |
eBay | 90 PB | http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx |
Sanger Institute | 22 PB (for DNA sequencing data only; ~45 PB for everything per Ewan Birney, May 2014) | http://insidehpc.com/2013/10/07/sanger-institute-deploys-22-petabytes-lustre-powered-ddn-storage/ |
Spotify | 10 PB | http://www.slideshare.net/AdamKawa/hadoop-operations-powered-by-hadoop-hadoop-summit-2014-amsterdam |
Note that eBay appears to store less than it processes in a single day (perhaps related to the query rewriting mentioned above), while Google, Baidu and the NSA (of course) hoard data. I didn’t find an estimate of how much data Twitter stores, but the size of all existing tweets cannot be that large, perhaps less than the 100 TB they claim to process every day. In 2011, it was 20 TB (link), so it might be hovering around 100 TB now.
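One rough way to compare the two tables is to divide each organization's stored volume by its daily processed volume, which gives the number of days of processing that the stored data corresponds to. The sketch below does this for the organizations that appear in both tables with a single figure. It assumes decimal units (1 petabyte = 1,000 terabytes) and uses Spotify's 64 terabyte uncompressed figure; the function and variable names are made up for illustration, and the result is only as good as the estimates that go into it.

```python
# Estimates copied from the two tables, all converted to terabytes.
# Assumption: decimal units, i.e. 1 petabyte = 1,000 terabytes.
TB_PER_PB = 1_000

processed_per_day_tb = {
    "eBay": 100 * TB_PER_PB,
    "NSA": 29 * TB_PER_PB,
    "Facebook": 600,          # row sourced from code.facebook.com
    "Spotify": 64,            # uncompressed size in Hadoop
    "Sanger Institute": 1.7,  # DNA sequencing data only
}

stored_tb = {
    "eBay": 90 * TB_PER_PB,
    "NSA": 10_000 * TB_PER_PB,  # possibly overestimated, see source
    "Facebook": 300 * TB_PER_PB,
    "Spotify": 10 * TB_PER_PB,
    "Sanger Institute": 22 * TB_PER_PB,  # DNA sequencing data only
}


def days_of_processing(stored, processed_per_day):
    """Return stored volume divided by daily processed volume, per organization.

    Only organizations present in both mappings are included.
    """
    return {
        org: stored[org] / processed_per_day[org]
        for org in stored
        if org in processed_per_day
    }


if __name__ == "__main__":
    ratios = days_of_processing(stored_tb, processed_per_day_tb)
    # Print from the smallest ratio to the largest.
    for org, days in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{org}: {days:,.1f} days of processing stored")
```

With these inputs eBay comes out at 0.9 days, matching the observation that it stores less than it processes in a day, while Facebook's 300 petabytes correspond to 500 days at 600 terabytes per day.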