One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: one, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook, Zillow and so on via APIs, and two, how to draw interesting conclusions from those data using analysis techniques such as clustering, collaborative filtering, matrix decompositions, decision trees etc. Everything was demonstrated in simple Python code, so it was easy to try it all by yourself.
When I heard this spring that Segaran was the co-author of a new book, Programming the Semantic Web, and a co-editor of another one, Beautiful Data, I pre-ordered them both on Amazon to Singapore, where I live. I got the former book about a month ago, but I’ll not discuss this here because frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).
Beautiful Data, on the other hand, is more suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before I got the physical book. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects. There are also a couple of more essay-like chapters. Each chapter is written by different authors, and the scope is very wide. Most people who read the book will probably have a couple of chapters they really like and a couple they don’t care that much about.
One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on a hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i. e. analysis with no specific hypothesis in mind at the beginning), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me there was to see the Swedish word “fjortis” as one of the most female-specific (=most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What makes me surprised is both that a Swedish slang term (which means, roughly, an immature adolescent – it’s derived from the word “fjorton” which means “fourteen”) is apparently so common at an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word which has had its meaning slightly changed.
Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:
At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”
The part in italics sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?
There is lots of other interesting stuff in the book. You will read about how to design an image processing system for a space shuttle going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.
There’s no point in enumerating all of the interesting topics here – suffice to say that I recommend it to anyone who want to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter which outlines what data can’t do and how we frequently get fooled by data and fail to intuitively understand probabilities. A refreshing pinch of skepticism.