Follow the Data

A data-driven blog


Summer reading

Some nice reading for the summer (in case of a rainy day, of course):

  • Prediction, Learning and Games (PDF link) – Nice textbook on prediction. Via @ML_hipster (worth following on Twitter if you like @bigdatahipster and/or authentic, hand-crafted decision trees)
  • Data Science 101, a very nice blog which points to a multitude of resources
  • School of Data and the accompanying Data Wrangling Handbook
  • Agile Data by Russell Jurney (who is well worth following on Twitter and especially Quora). This book isn’t finished yet but can be viewed in its current state of development at the given link, which is within the Open Feedback Publishing System at O’Reilly Media. So you can, on one hand, read the book (or parts of it) for free before publication, and on the other hand, provide feedback and thus shape the contents of the book.
  • (edit 17/7 2012) Might as well throw this one in: Data Jujitsu: The Art of Turning Data into Product by DJ Patil, a free O’Reilly Radar report (EPUB/PDF/mobile).

CouchDB — MapReduce for the masses

CouchDB, for some time now an Apache Software Foundation project, is a document-oriented database that can be queried with simple JavaScript functions in map/reduce fashion. CouchDB is built on Erlang, a very interesting functional language designed for extreme reliability in the telecom industry. One of Erlang’s advantages is its support for parallelism: add more cores and servers, and the map/reduce queries run faster. Traditional databases like MySQL or PostgreSQL can’t easily scale across several servers, so if you have built your application around a single database, the endgame is to buy one really big piece of iron; with technology like CouchDB, that problem goes away. This neat interactive demo shows what CouchDB is all about.
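To give a flavour of what those JavaScript map/reduce queries look like, here is a minimal sketch. The sample documents, the view logic (counting documents per tag), and the tiny in-memory harness are all invented for illustration; in real CouchDB the map and reduce functions live in a design document and CouchDB itself supplies `emit()` and runs the view across shards.

```javascript
// Sample documents standing in for what you'd store in CouchDB.
const docs = [
  { _id: "a", type: "post", tags: ["data", "erlang"] },
  { _id: "b", type: "post", tags: ["data"] },
  { _id: "c", type: "note", tags: ["couchdb"] },
];

// Map: emit a (tag, 1) row for every tag on every document.
function map(doc, emit) {
  (doc.tags || []).forEach((tag) => emit(tag, 1));
}

// Reduce: sum the emitted counts for each key.
function reduce(keys, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Minimal harness standing in for CouchDB's view engine:
// run map over all docs, group rows by key, then reduce each group.
function runView(docs, map, reduce) {
  const rows = [];
  docs.forEach((doc) => map(doc, (key, value) => rows.push({ key, value })));
  const grouped = {};
  rows.forEach((r) => {
    grouped[r.key] = (grouped[r.key] || []).concat(r.value);
  });
  const result = {};
  Object.keys(grouped).forEach((k) => {
    result[k] = reduce([k], grouped[k]);
  });
  return result;
}

console.log(runView(docs, map, reduce)); // { data: 2, erlang: 1, couchdb: 1 }
```

Because the map step treats each document independently, CouchDB is free to run it on as many cores or servers as are available, which is exactly the parallelism argument made above.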


To follow up on yesterday’s post about data sources on the web, I’d like to mention an interesting resource, predict.i2pi, which automatically builds predictive models based on data that you upload. Using it could hardly be simpler – you just have to prepare a comma-separated text file with attributes (predictor variables) and one or more target values (response variables), with the latter being identified as such by putting a star (*) in front of the variable name in the header row. The system will then match your particular data file to a set of suitable prediction algorithms (for example, regression models rather than classification models for a continuous response variable), evaluate the performance of these algorithms on a hold-out set from your data, and output the best results. As the site itself puts it,

Our team of elves will work on your file, running it against a range of model types and keeping track of the best ones. Every now and then we will update your page indicating the best models to date.

There’s also an API for predict.i2pi, and developers of statistical learning methods are encouraged to integrate their own favourite algorithms into the system. Read this blog post for more details.
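A sketch of the upload format described above: predictor columns plus a response column whose header name is marked with a leading star. The column names and values here are invented for illustration; only the star-in-the-header convention comes from the post.

```javascript
// Header row: two predictor variables and one starred response variable.
const header = ["sepal_length", "sepal_width", "*species"];

// A couple of made-up data rows.
const rows = [
  [5.1, 3.5, "setosa"],
  [7.0, 3.2, "versicolor"],
];

// Assemble the comma-separated text file to upload.
const csv = [header.join(",")]
  .concat(rows.map((r) => r.join(",")))
  .join("\n");

console.log(csv);
// sepal_length,sepal_width,*species
// 5.1,3.5,setosa
// 7.0,3.2,versicolor
```

Since `species` here is a categorical response, the site would presumably match this file to classification rather than regression algorithms, per the description above.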

For in-depth background on the various statistical learning and machine learning algorithms, you could do worse than to check out the lectures at There’s really an astounding amount of information there about lots of different fields, but in particular computer science, with a skew towards machine learning.
