Follow the Data

A data driven blog

Archive for the category “Books”

The fourth paradigm

A new book about science in the age of big data, The Fourth Paradigm: Data-Intensive Scientific Discovery, is available as a free download. The book was reviewed in Nature today. It’s written by people from Microsoft Research and has a foreword by Gordon Bell, one of the authors of Total Recall: How the E-memory Revolution Will Change Everything.

PDF version of useful textbook

I just learned (via Friendfeed) that a new edition of the meaty data analysis textbook Elements of Statistical Learning: Data Mining, Inference and Prediction is available as a free PDF download here. It seems very good – I just started reading it.

Self-tracking news

FitBit have started to ship their clip-on device for tracking the number of calories burnt, steps taken and distance travelled, as well as sleep quality. As explained here, the FitBit contains an accelerometer (many new phones have one as well, but they are bulkier than the FitBit) whose readings are fed to custom algorithms. These algorithms were trained using “ground truth” measurements like breath gas composition, and can accurately estimate calorie consumption for different kinds of movements like walking to the kitchen, jogging or dashing to the bus.

Now the only thing left is to measure how many calories go into your system. Luckily, DailyBurn have just released FoodScanner, an iPhone app that lets you scan barcodes on the food you buy. It’s currently on sale for just 99 cents (!), but unfortunately it might only be available in the U.S. version of the App Store (at least the Swedish one doesn’t have it as I write this). The information you scan can be uploaded to DailyBurn’s web site and added to a growing food database. Seems pretty cool.

Another recent release (Sept. 17) was a book called Total Recall – How the E-memory Revolution Will Change Everything, by Gordon Bell and Jim Gemmell, who are involved with Microsoft Research’s MyLifeBits project. Bell has been trying to record as much of his life as possible – including photos, letters and phone calls – since 1998. A quote from the blurb, where I find the part in italics particularly interesting:

We are capturing so much of our lives now, be it on the date–and location–stamped photos we take with our smart phones or in the continuous records we have of our emails, instant messages, and tweets–not to mention the GPS tracking of our movements many cars and smart phones do automatically. We are storing what we capture either out there in the “cloud” of services such as Facebook or on our very own increasingly massive and cheap hard drives. But the critical technology, and perhaps least understood, is our magical new ability to find the information we want in the mountain of data that is our past. And not just Google it, but data mine it so that, say, we can chart how much exercise we have been doing in the last four weeks in comparison with what we did four years ago. In health, education, work life, and our personal lives, the Total Recall revolution is going to change everything.

Beautiful data

One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: one, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook, Zillow and so on via APIs, and two, how to draw interesting conclusions from those data using analysis techniques such as clustering, collaborative filtering, matrix decompositions, decision trees etc. Everything was demonstrated in simple Python code, so it was easy to try it all by yourself.

When I heard this spring that Segaran was the co-author of a new book, Programming the Semantic Web, and a co-editor of another one, Beautiful Data, I pre-ordered them both from Amazon for delivery to Singapore, where I live. I got the former book about a month ago, but I won’t discuss it here because frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).

Beautiful Data, on the other hand, is more suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before I got the physical book. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects. There are also a couple of more essay-like chapters. Each chapter is written by different authors, and the scope is very wide. Most people who read the book will probably have a couple of chapters they really like and a couple they don’t care that much about.

One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i.e. analysis with no specific hypothesis in mind at the beginning), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me there was to see the Swedish word “fjortis” as one of the most female-specific (i.e. most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What surprises me is both that a Swedish slang term (which means, roughly, an immature adolescent – it’s derived from the word “fjorton”, which means “fourteen”) is apparently so common on an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word whose meaning has changed slightly.
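To make the idea of group-specific keywords a bit more concrete, here is a minimal Python sketch. It is not the FaceStats authors’ R code, and I don’t know exactly how they scored words. The scoring used here (a smoothed log-ratio of relative word frequencies), the function name characteristic_words and the tiny tag lists are all my own assumptions, made up purely for illustration.

```python
from collections import Counter
import math


def characteristic_words(group_a, group_b, smoothing=1.0):
    """Rank words by how much more typical they are of group_a than group_b.

    Each group is a list of tag lists (one list of words per photo).
    The score is the log of the ratio of smoothed relative frequencies;
    positive means "more typical of group_a". This is an assumed scoring
    scheme, not the one used in the book chapter.
    """
    counts_a = Counter(word for tags in group_a for word in tags)
    counts_b = Counter(word for tags in group_b for word in tags)
    vocab = set(counts_a) | set(counts_b)

    # Add-k smoothing so that words unseen in one group do not
    # produce a division by zero or log(0).
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / p_b)

    # Highest score first: most characteristic of group_a.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data, NOT taken from FaceStats.
female_tags = [["smile", "fjortis"], ["fjortis", "hair"], ["smile"]]
male_tags = [["beard", "smile"], ["beard"], ["hair"]]

ranking = characteristic_words(female_tags, male_tags)
for word, score in ranking:
    print(f"{word:10s} {score:+.2f}")
```

With this toy input, “fjortis” ends up at the top of the list and “beard” at the bottom. On real data you would probably also want a minimum word count, since rare words can get extreme scores by chance.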

Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

The part in italics sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?

There is lots of other interesting stuff in the book. You will read about how to design an image processing system for a spacecraft going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.

There’s no point in enumerating all of the interesting topics here – suffice it to say that I recommend the book to anyone who wants to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter, which outlines what data can’t do, how we frequently get fooled by data, and how we fail to intuitively understand probabilities. A refreshing pinch of skepticism.
