Follow the Data

A data driven blog

Archive for the tag “Books”

Book and MOOC

As of today, Amazon.com is stocking a book to which I have contributed, RNA-seq Data Analysis: A Practical Approach. I realize the title might sound obscure to readers who are unfamiliar with genomics and bioinformatics. Simply put, RNA-seq is short for RNA sequencing, a method for measuring what we call gene expression. While the DNA contained in each cell is (to a first approximation) identical, different tissues and cell types turn their genes on and off in different ways in response to different conditions. The process by which DNA is transcribed to RNA is called gene expression. RNA-seq has become a rather important experimental method, and the lead author of our book, Eija Korpelainen, wanted to put together a user-friendly, practical and hopefully unbiased compendium of the existing RNA-seq data analysis methods and toolkits, without neglecting the underlying theory. I contributed one chapter, the one about differential expression analysis, which basically means statistical testing for significant gene expression differences between groups of samples.
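To make "statistical testing for expression differences between groups" a bit more concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not the method from the book chapter: dedicated RNA-seq tools typically model read counts with more suitable statistical models. The sketch log-transforms made-up counts, runs an exact permutation test per gene, and adjusts the p-values with the Benjamini-Hochberg procedure. The gene names, counts and function names are all hypothetical.

```python
from itertools import combinations
from math import log2
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    hits = 0
    count = 0
    # Enumerate every way of relabelling n_a samples as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        count += 1
    return hits / count


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differential_expression(counts_a, counts_b):
    """Return (gene, log2 fold change, p-value, adjusted p-value) per gene.

    counts_a and counts_b map gene name -> list of read counts per sample.
    """
    rows = []
    for gene in sorted(counts_a):
        log_a = [log2(c + 1) for c in counts_a[gene]]  # +1 avoids log2(0)
        log_b = [log2(c + 1) for c in counts_b[gene]]
        rows.append((gene, mean(log_b) - mean(log_a),
                     exact_permutation_pvalue(log_a, log_b)))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    return [(g, lfc, p, q) for (g, lfc, p), q in zip(rows, adjusted)]


# Hypothetical toy data: two genes, four samples per group.
control = {"geneA": [10, 12, 11, 9], "geneB": [50, 52, 48, 51]}
treated = {"geneA": [100, 120, 90, 110], "geneB": [49, 53, 50, 47]}

results = differential_expression(control, treated)
for gene, lfc, p, q in results:
    print(f"{gene}: log2FC={lfc:.2f} p={p:.3f} adjusted p={q:.3f}")
```

With only four samples per group, an exact permutation test cannot produce very small p-values, which is one reason why real analyses lean on model-based methods that share information across genes.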

I am also currently involved as an assistant teacher in the Explore Statistics with R course given by Karolinska Institutet through the edX MOOC platform. Specifically, I have contributed material to the final week (week 5), which will start next Tuesday (October 7th). That material is also about RNA-seq analysis – I try to show a range of tools available in R which allow you to perform a complete analysis workflow for a typical scenario. Until the fifth week starts, I am helping out with answering student questions in the forums. It’s been a positive experience so far, but it is clear that one can never prepare enough for a MOOC – errors in phrasing, grading, etc. are bound to pop up. Luckily, several gifted students are doing an amazing job of answering the questions from other students, while teaching us teachers a thing or two about the finer points of R.

Speaking of MOOCs, Coursera’s Mining Massive Datasets course featuring Jure Leskovec, Anand Rajaraman and Jeff Ullman started today. My plan is to try to follow it – we shall see if I have time.

Beautiful data

One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: one, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook, Zillow and so on via APIs, and two, how to draw interesting conclusions from those data using analysis techniques such as clustering, collaborative filtering, matrix decompositions, decision trees etc. Everything was demonstrated in simple Python code, so it was easy to try it all by yourself.

When I heard this spring that Segaran was the co-author of a new book, Programming the Semantic Web, and a co-editor of another one, Beautiful Data, I pre-ordered them both on Amazon to Singapore, where I live. I got the former book about a month ago, but I won’t discuss it here because frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).

Beautiful Data, on the other hand, is more suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released on 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before I got the physical book. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects. There are also a couple of more essay-like chapters. Each chapter is written by a different author or group of authors, and the scope is very wide. Most people who read the book will probably have a couple of chapters they really like and a couple they don’t care that much about.

One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i.e. analysis with no specific hypothesis in mind at the beginning), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me there was to see the Swedish word “fjortis” as one of the most female-specific (i.e. most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What surprises me is both that a Swedish slang term (which means, roughly, an immature adolescent – it’s derived from the word “fjorton”, which means “fourteen”) is apparently so common at an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word which has had its meaning slightly changed.
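As an aside on the keyword analysis mentioned above: the chapter works in R, and I am not reproducing its actual method here. One simple way to find words characteristic of a group, sketched in Python below, is to rank words by a smoothed log-ratio of their relative frequency in the target group versus all other groups. The group names, tags, function name and the smoothing parameter `alpha` are all made up for illustration.

```python
from collections import Counter
from math import log


def characteristic_words(tags_by_group, target, top_n=3, alpha=1.0):
    """Rank words by smoothed log-ratio of frequency in target vs. the rest.

    tags_by_group maps a group name to a list of tag words (with repeats).
    alpha is an additive smoothing constant so unseen words do not give
    a division by zero.
    """
    target_counts = Counter(tags_by_group[target])
    other_counts = Counter()
    for group, tags in tags_by_group.items():
        if group != target:
            other_counts.update(tags)

    vocab = set(target_counts) | set(other_counts)
    n_target = sum(target_counts.values()) + alpha * len(vocab)
    n_other = sum(other_counts.values()) + alpha * len(vocab)

    scores = {}
    for word in vocab:
        p_target = (target_counts[word] + alpha) / n_target
        p_other = (other_counts[word] + alpha) / n_other
        scores[word] = log(p_target / p_other)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical toy data, not taken from FaceStats.
tags = {
    "group_a": ["smile", "smile", "hat", "glasses"],
    "group_b": ["beard", "beard", "hat", "glasses", "beard"],
}

print(characteristic_words(tags, "group_a", top_n=1))
print(characteristic_words(tags, "group_b", top_n=1))
```

Words that are common in every group (like "hat" in the toy data) score near zero, while words concentrated in one group float to the top, which is the kind of effect that would push a term like “fjortis” up a group-specific list.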

Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

The part in italics sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?

There is a lot of other interesting stuff in the book. You will read about how to design an image processing system for a spacecraft going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.

There’s no point in enumerating all of the interesting topics here – suffice it to say that I recommend it to anyone who wants to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter, which outlines what data can’t do and how we frequently get fooled by data and fail to intuitively understand probabilities. A refreshing pinch of skepticism.
