Follow the Data

A data driven blog

Archive for the tag “data”

Cool data > big data?

From a 2006 post on Seth Roberts’ blog:

One day on the track I met a professor who had recently gotten tenure. He had only published three articles (maybe he had 700 in the pipeline), so his getting tenure surprised me. I asked him: What’s the secret? What was so great about those three papers? His answer was two words: “Cool data.”

[…]

I’m a big believer in cool data. The design goal is: How far can we possibly push it so that it makes it a vivid point? Most academics push it just far enough to get it published. I try to push it beyond that to make it much more vivid. That’s what [Stanley] Milgram did with his experiments. First, he showed obedience to authority in the lab. Then he stripped away a whole lot of things to show how extreme it was. He took away lab coats, the college campus. That’s what made it so powerful.

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also theinfo.org, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the Comprehensive Knowledge Archive Network (CKAN), which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as data.gov.

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data”. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology” which allows users to explore the data in a more sophisticated way than other services such as Amazon’s Public Data Sets.

Companies specializing in helping users make sense out of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users vs Data Applied’s file size limit of under 100 KB for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.

Beautiful data

One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: one, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook, Zillow and so on via APIs, and two, how to draw interesting conclusions from those data using analysis techniques such as clustering, collaborative filtering, matrix decompositions and decision trees. Everything was demonstrated in simple Python code, so it was easy to try it all yourself.
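To give a flavor of what collaborative filtering means in practice, here is a minimal sketch of the user-based variant. It is not code from the book: the ratings are invented toy data, and the names `ratings`, `cosine_similarity` and `predict` are my own, chosen only for illustration. The idea is to predict how a user would rate an unseen item by averaging other users' ratings, weighted by how similar their tastes are.

```python
from math import sqrt

# Invented toy data (an assumption for illustration): user -> {item: rating, 1-5}
ratings = {
    "ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "cat": {"A": 1.0, "B": 5.0, "D": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`.

    Returns None when nobody else has rated the item.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        total_weight += sim
    return weighted_sum / total_weight if total_weight else None


if __name__ == "__main__":
    # ann has not rated item "D"; estimate her rating from bob and cat.
    print(predict(ratings, "ann", "D"))
```

In this toy data ann's ratings resemble bob's more than cat's, so the estimate for item "D" is pulled toward bob's rating. Real systems usually also subtract each user's mean rating and handle negative similarities, which this sketch leaves out.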

When I heard this spring that Segaran was the co-author of a new book, Programming the Semantic Web, and a co-editor of another one, Beautiful Data, I pre-ordered them both on Amazon to Singapore, where I live. I got the former book about a month ago, but I won’t discuss it here because frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).

Beautiful Data, on the other hand, is more suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before I got the physical book. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects. There are also a couple of more essay-like chapters. Each chapter is written by different authors, and the scope is very wide. Most people who read the book will probably have a couple of chapters they really like and a couple they don’t care that much about.

One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i.e. analysis with no specific hypothesis in mind at the beginning), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me there was to see the Swedish word “fjortis” as one of the most female-specific (=most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What surprises me is both that a Swedish slang term (which means, roughly, an immature adolescent – it’s derived from the word “fjorton” which means “fourteen”) is apparently so common at an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word which has had its meaning slightly changed.
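One common way to find the keywords most characteristic of a group is to compare how often each word is used inside the group with how often it is used everywhere else. The sketch below does this with a smoothed log ratio of relative frequencies. Note that this is an assumption on my part and not the FaceStats authors' method (they worked in R, and I am not reproducing their procedure); the tags are invented toy data and `characteristic_words` is a hypothetical helper name.

```python
from collections import Counter
from math import log

# Invented toy tags (an assumption for illustration); this is not FaceStats data.
tags_by_group = {
    "female": ["cute", "pretty", "fjortis", "smile", "cute", "fjortis"],
    "male": ["beard", "smile", "serious", "beard", "cute"],
}


def characteristic_words(tags_by_group, group, smoothing=1.0):
    """Rank words by how much more often they describe `group` than other groups.

    The score is the log of the ratio between a word's smoothed relative
    frequency inside the group and its smoothed relative frequency outside it.
    Add-`smoothing` counts keep words unseen on one side from dividing by zero.
    """
    inside = Counter(tags_by_group[group])
    outside = Counter()
    for name, tags in tags_by_group.items():
        if name != group:
            outside.update(tags)

    vocab = set(inside) | set(outside)
    n_inside = sum(inside.values()) + smoothing * len(vocab)
    n_outside = sum(outside.values()) + smoothing * len(vocab)

    scores = {
        word: log((inside[word] + smoothing) / n_inside)
        - log((outside[word] + smoothing) / n_outside)
        for word in vocab
    }
    # Highest score first: the most group-specific words lead the list.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for word, score in characteristic_words(tags_by_group, "female"):
        print(f"{word:10s} {score:+.2f}")
```

Words with a positive score are over-represented in the chosen group, and words with a negative score are over-represented elsewhere. With real data you would also want a minimum count threshold, since rare words can get extreme scores by chance.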

Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

The part in italics sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?

There is lots of other interesting stuff in the book. You will read about how to design an image processing system for a spacecraft going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.

There’s no point in enumerating all of the interesting topics here – suffice it to say that I recommend it to anyone who wants to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter which outlines what data can’t do and how we frequently get fooled by data and fail to intuitively understand probabilities. A refreshing pinch of skepticism.

Data sources on the web

So where are all these huge data sets that I (and others) have been talking about? Well, some of them are freely available for download. For example, the extensive Reality Mining data set from MIT (which I have blogged about) is available as a MySQL database for anyone to play around with.

There are a couple of repositories for data sets. Infochimps has hundreds or probably thousands of data sets from a wide variety of sources. Some of the data is directly downloadable from the site, while other data sets are just pointed to. Datamob is a similar, though smaller, resource. Amazon’s Public Data Sets are meant to be used seamlessly from within Amazon’s cloud computing services, such as the Elastic Compute Cloud (EC2). Here, we find massive datasets such as the collection of all publicly available DNA sequences from GenBank.

Peter Skomoroch has a del.icio.us tag for datasets which is probably the most extensive reference for big downloadable data out there (and which makes this blog post rather superfluous …). Due to the magic of del.icio.us, this list is of course dynamic and continuously growing.

Finally, programmableweb is perhaps not strictly about data per se, but provides links to known APIs for access to web-based resources through your own programs.
