Follow the Data

A data-driven blog


CouchDB: MapReduce for the masses

CouchDB, for some time now an Apache Software Foundation project, is a document-oriented database that can be queried with simple JavaScript functions written in map/reduce fashion. CouchDB is written in Erlang, a very interesting functional language built for extreme reliability in the telecom industry. One of the advantages of Erlang is its support for parallelism: add more cores and servers, and the map/reduce queries run faster. Traditional relational databases like MySQL or PostgreSQL don't scale out across several servers easily, so if you have built your application around a single database, the endgame is to buy one really big piece of iron. With technology like CouchDB, that problem goes away. This neat interactive demo shows what CouchDB is all about.
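To give a flavour of what those JavaScript queries look like, here is a minimal sketch of a CouchDB view that counts documents per tag (the document shape and field names are my own assumptions, not from the demo):

```javascript
// Map: called once per document; emits one (tag, 1) row per tag.
// Assumes documents shaped like {"title": "...", "tags": ["erlang", "databases"]}.
function (doc) {
  if (doc.tags) {
    for (var i = 0; i < doc.tags.length; i++) {
      emit(doc.tags[i], 1);
    }
  }
}

// Reduce: sums the emitted 1s for each tag key. This also works on
// re-reduce, since partial sums are themselves numbers; CouchDB's
// built-in _sum reduce does the same job.
function (keys, values, rereduce) {
  return sum(values);
}
```

Querying the view with group=true then returns one row per tag with its count, and because the map step runs independently per document, it parallelizes naturally across cores and nodes.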

What is big data?

Did you know that the word data means “things given” in Latin? That’s just one of the things I learned from a very interesting (free) article, The Pathologies of Big Data, by former computational neuroscientist Adam Jacobs. He also makes the perceptive comment that the word data tends to get used as a mass noun in English, as if it denoted a substance. (After reading these interesting insights, it was no surprise to learn that Jacobs also has a degree in linguistics.)

The article discusses what “big data” really means in this day and age when we can actually keep, for instance, a dataset containing information about the entire world population in memory (not to mention on disk) on a pretty ordinary Dell server. Jacobs argues that getting stuff into databases is easy, but getting it out (in a useful form) is hard; the bottleneck lies in the analysis rather than the raw data manipulation.
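To put a rough number on that claim (my own back-of-envelope figures, not from the article): with around 6.8 billion people in the world and, say, 16 bytes of packed numeric fields per person, the whole dataset comes to about 6.8 × 10⁹ × 16 bytes ≈ 100 GB, which really is within reach of a big but otherwise ordinary rack server.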

He also argues that most data-processing tools, including standard relational database management systems, are not really built for the kinds of huge datasets we are starting to encounter now. Although we can in principle keep billions of rows of data in RAM, we can’t easily manipulate them using something like PostgreSQL. And other tools, like the statistical programming language R (one of my favourites), run into memory limits, often around 4 GB on 32-bit builds.
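The usual way around such limits is to stream over the data and keep only running aggregates in memory. A minimal Node.js sketch of the idea (the file name and its one-value-per-line layout are hypothetical):

```javascript
// Constant-memory aggregation over a file too big for RAM:
// read the rows as a stream and keep only a count and a running sum.
const fs = require('fs');
const readline = require('readline');

const rl = readline.createInterface({
  input: fs.createReadStream('incomes.csv'), // hypothetical: one income value per line
  crlfDelay: Infinity,
});

let count = 0;
let sum = 0;

rl.on('line', (line) => {
  const income = parseFloat(line);
  if (!Number.isNaN(income)) {
    count += 1;
    sum += income;
  }
});

rl.on('close', () => {
  console.log(`rows: ${count}, mean income: ${(sum / count).toFixed(2)}`);
});
```

Memory use stays flat no matter how many rows the file holds; the price is that you only get the statistics you thought to accumulate in advance, which is exactly why the analysis, not the storage, becomes the bottleneck.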

A recommended read for those interested in the nerdier side of data.

On a related note, O’Reilly released a report, Big Data: Technologies and Techniques for Large-Scale Data, in January. I haven’t read it (it costs quite a lot of money to buy the PDF), but there is a sample PDF which makes for pretty interesting reading in itself.
