Follow the Data

A data driven blog

Archive for the tag “visualization”

Hadley Wickham lecture: ggvis, tidyr, dplyr and much more

Another week, another great meetup. This time, the very prolific Hadley Wickham visited the Stockholm R useR group and talked for about an hour about his new projects.

Perhaps some background is in order. Hadleys PhD thesis (free pdf here) is a very inspiring tour of different aspects of practical data analysis issues, such as reshaping data into a “tidy” for that is easy to work with (he developed the R reshape package for this), visualizing clustering and classification problems (see his classifly, clusterfly, and meifly packages) and creating a consistent language for describing plots and graphics (which resulted in the influential ggplot2 package). He has also made the plyr package as a more consistent version of the various “apply” functions in R. I learned a lot from this thesis.

Today, Hadley talked about several new packages that he has been developing to further improve on his earlier toolkit. He said that in general, his packages become simpler and simpler as he re-defines the basic operations needed for data analysis.

  • The newest one (“I wrote it about four days ago”, Hadley said) is called tidyr (it’s not yet on CRAN but can be installed from GitHub) and provides functions for getting data into the “tidy” format mentioned above. While reshape had the melt and cast commands, tidyr has gather, separate, and spread.
  • dplyr – the “next iteration of plyr”, which is faster and focuses on data frames. It uses commands like select, filter, mutate, summarize, arrange.
  • ggvis – a “dynamic version of ggplot2” which is designed for responsive dynamic graphics, streaming visualization and meant for the web. This looked really nice. For example, you can easily add sliders to a plot so you can change the parameters and watch how the plot changes in real time. ggvis is built on Shiny but provides easier ways to make the plots. You can even embed dynamic ggvis plots in R markdown documents with knitR so that the resulting report can contain sliders and other things. This is obviously not possible with PDFs though. ggvis will be released on CRAN “in a week or so”.

Hadley also highlighted the magrittr package which implements a pipe operator for R (Magritte/pipe … get it? (groan)) The pipe looks like %>% and at first blush it may not look like a big deal, but Hadley made a convincing case that using the pipe together with (for example) dplyr results in code that is much easier to read, write and debug.

Hadley is writing a book, Advanced R (wiki version here), which he said has taught him a lot about the inner workings of R. He mentioned Rcpp as an excellent way to write C++ code and embed it in R packages. The bigvis package was mentioned as a “proof of concept” of how one might visualize big data sets (where the number of data points is larger than the number of pixels on the screen, so it is physically impossible to plot everything and summarization is necessary.)

Advertisements

Visualization

I was discussing the importance of data visualization with a co-worker a couple of weeks ago. We agreed that some sort of dynamic, intuitive interfaces for looking at and interacting with huge data sets in general, and sequencing-based data sets in particular, would be extremely useful. As the Dataspora blog puts it in a recent post, “The ultimate end-point for most data analysis is a human decision-maker, whose highest bandwidth channel is his or her eyeballs.” (the post is worth reading in its entirety)

Apparently Illumina (one the biggest vendors of high-throughput sequencers) agree; they’ve announced a competition where the aim is to provide useful visualizations of a number of genomic datasets derived from a breast cancer cell line. The competition closes at March 15, 2011.

Here’s a nice paper, A Tour through the Visualization Zoo, which provides a whirlwind tour of different kinds of graphs. The figures are actually interactive, so you can mess around with them if you are reading the article online.

The Infosthetics blog highlights Patients Like Me as the most successful marriage of online social media and data visualization.

Beautiful data

One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: one, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook, Zillow and so on via APIs, and two, how to draw interesting conclusions from those data using analysis techniques such as clustering, collaborative filtering, matrix decompositions, decision trees etc. Everything was demonstrated in simple Python code, so it was easy to try it all by yourself.

When I heard this spring that Segaran was the co-author of a new book, Programming the Semantic Web, and a co-editor of another one, Beautiful Data, I pre-ordered them both on Amazon to Singapore, where I live. I got the former book about a month ago, but I’ll not discuss this here because frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).

Beautiful Data, on the other hand, is more suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before I got the physical book. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects. There are also a couple of more essay-like chapters. Each chapter is written by different authors, and the scope is very wide. Most people who read the book will probably have a couple of chapters they really like and a couple they don’t care that much about.

One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on a hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i. e. analysis with no specific hypothesis in mind at the beginning), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me there was to see the Swedish word “fjortis” as one of the most female-specific (=most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What makes me surprised is both that a Swedish slang term (which means, roughly, an immature adolescent – it’s derived from the word “fjorton” which means “fourteen”) is apparently so common at an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word which has had its meaning slightly changed.

Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

The part in italics sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?

There is lots of other interesting stuff in the book. You will read about how to design an image processing system for a space shuttle going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.

There’s no point in enumerating all of the interesting topics here – suffice to say that I recommend it to anyone who want to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter which outlines what data can’t do and how we frequently get fooled by data and fail to intuitively understand probabilities. A refreshing pinch of skepticism.

Post Navigation