Did you know that the word data means “things given” in Latin? That’s just one of the things I learned from a very interesting (free) article, The Pathologies of Big Data by former computational neuroscientist Adam Jacobs. He also makes the perceptive comment that the word data tends to get used as a mass noun in English, as if it denoted a substance. (After reading these interesting insights, it was no surprise to learn that Jacobs also has a degree in linguistics.)
The article discusses what “big data” really means in this day and age when we can actually keep, for instance, a dataset containing information about the entire world population in memory (not to mention on disk) on a pretty ordinary Dell server. Jacobs argues that getting stuff into databases is easy, but getting it out (in a useful form) is hard; the bottleneck lies in the analysis rather than the raw data manipulation.
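Just for fun, here is a rough back-of-the-envelope sketch of why such a dataset fits in memory. The record layout is my own hypothetical assumption (a few bytes of basic demographics per person), not taken from Jacobs’s article:

```python
# Back-of-the-envelope: can a world-population dataset fit in RAM?
# Hypothetical record layout: per person, age (1 byte), sex (1 byte),
# and a region code (2 bytes) = 4 bytes per record.
WORLD_POPULATION = 7_000_000_000  # rough figure, circa 2009
BYTES_PER_RECORD = 4

total_bytes = WORLD_POPULATION * BYTES_PER_RECORD
total_gib = total_bytes / 2**30
print(f"{total_gib:.1f} GiB")  # about 26 GiB -- well within a decent server's RAM
```

Even with a more generous record size, you stay comfortably in the range of an ordinary server’s memory, which is Jacobs’s point: storage is not the bottleneck.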
He also argues that most data-processing tools, including standard relational database management systems, are not really built for the kinds of huge datasets we are starting to encounter now. Although we can in principle keep billions of rows of data in RAM, we can’t easily manipulate them using something like PostgreSQL. And other tools, like the statistical programming language R (one of my favourites), run into hard memory limits, often around 4 GB.
A recommended read for those interested in the nerdier side of data.
On a related note, O’Reilly released a report, Big Data: Technologies and Techniques for Large-Scale Data, in January. I haven’t read it (it costs quite a lot of money to buy the PDF), but there is a sample PDF which makes for pretty interesting reading in itself.