Follow the Data

A data driven blog

Archive for the month “August, 2009”

Citizen science and bees as data collectors

I really like the concept of citizen science, projects where anyone can contribute to scientific research. As described in this Economist article from a couple of months back, when people are able to sample their surroundings in different ways and report variables such as traffic noise and pollution levels, their view on both science and data is likely to change (this of course ties in with self-tracking of physiological parameters etc. as well; see this article for an exercise-related application).

Eric Paulos, who is interviewed in that Economist article, has equipped taxis in Accra, Ghana and street sweepers in San Francisco with pollution detectors that collect data which enabled him to construct pollution maps of those cities. This page contains an interesting description of his research project Citizen Science: Enabling Participatory Urbanism.

There is a dedicated room on FriendFeed on citizen science with frequent updates. The Maryland Science Center has a citizen science program for the earth sciences, where people can, e.g., take daily UV radiation readings and compare them to predicted values, or take temperature measurements that will shed light on the effects of climate change.

And that takes me to the subject of bees that collect data on climate change, from a fascinating press release that I discovered via the above-mentioned FriendFeed room. A NASA scientist named Wayne Esaias has figured out that honey bees are really good at sampling the environment around a hive in an even way when they scout for nectar. This, according to the press release, “means they excel in keeping tabs on the dynamics of flowering ecosystems in ways that even a small army of graduate students can not.”

Volunteering beekeepers – citizen scientists – weigh their hives on industrial-sized scales every day to track the nectar flow. This small citizen research network is called HoneyBeeNet. The data collected by the bees and their keepers reflects a warming trend that could be due to climate change. When Esaias and his colleagues compared nectar flow data from HoneyBeeNet to satellite data on vegetation in the spring, they found a near perfect correspondence, suggesting that the citizen-science derived bee data are reliable and useful.

Speed of data collection

Can this quote from a new Wall Street Journal article really be true?

In fact, more technical data have been collected in the past year alone than in all previous years since science began, says Johns Hopkins astrophysicist Alexander Szalay, an authority on large data sets and their impact on science.

The article is about how to preserve and capture scientific data. This is a pressing question, as evidenced by another quote:

“Our ability to collect data now outstrips our ability to maintain it for the long run,” says William Michener at the University of New Mexico, who leads a data-preservation network called DataONE. “We lose an awful lot of data that is collected with public funds.”

An interesting point mentioned in the article is that although advances in information technology mean that most of our data is now much more suitable for preservation (electronic documents rather than hand-written notes and scribbles), they have also led graduate students to communicate a lot by instant messaging, which acts as a sinkhole for a lot of information.

Erlang user group Stockholm 27 August

I went to the Erlang user group meeting this Thursday and attended two interesting presentations. The first was on Erlang tools for Eclipse, by Jakob Cederlund, one of the core developers of ErlIde, as it is called. Good stuff, although there seems to be room for improvement. I definitely think there has to be something like this for massive adoption of Erlang in the software industry.

I use Eclipse a lot when I develop in Java, although I prefer to use Emacs or TextMate for other languages. I haven’t got much experience with Erlang apart from doing the exercises in Joe Armstrong’s book Programming Erlang, of which I downloaded the beta PDF, and then I used TextMate, I think. I would like to do some real development in Erlang; the trouble would be to find someone to pay me to do it :) . I was in favor of building the core services we have at my work in Erlang, but it was voted down, so for the foreseeable future Java + Rails seems to be on top there.

The other presentation was by Adam Lindberg on using QuickCheck to test a DSL by generating random DSL data with the DSL’s grammar rules. QuickCheck is a property-based test framework: you state properties that should always hold, and it generates random test cases to check them. Pretty cool stuff, and it was incredibly fast too! I’m used to slowly churning through masses of Cucumber story tests in Rails, which take forever to run!
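To give a flavour of the idea, here is a minimal Python sketch of grammar-driven random testing. It is not QuickCheck and not the DSL from the talk: the toy arithmetic grammar, the `evaluate` function under test and the property being checked are all made up for illustration.

```python
import random

# Hypothetical toy grammar (an assumption, not the DSL from the talk):
#   expr := digit | "(" expr op expr ")"
#   op   := "+" | "-" | "*"
def random_expr(rng, depth=0, max_depth=4):
    """Generate a random expression string by expanding the grammar rules."""
    if depth >= max_depth or rng.random() < 0.3:
        return str(rng.randint(0, 9))
    left = random_expr(rng, depth + 1, max_depth)
    right = random_expr(rng, depth + 1, max_depth)
    return "(" + left + " " + rng.choice("+-*") + " " + right + ")"

def evaluate(text):
    """Small recursive-descent evaluator for the toy DSL (the code under test)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return int(tok)
        left = parse()
        op = tokens[pos]
        pos += 1
        right = parse()
        pos += 1  # skip the closing ")"
        if op == "+":
            return left + right
        if op == "-":
            return left - right
        return left * right

    return parse()

def check_property(trials=200, seed=0):
    """Property: the evaluator agrees with Python's own arithmetic on every
    generated expression. Returns a counterexample, or None if all pass."""
    rng = random.Random(seed)
    for _ in range(trials):
        expr = random_expr(rng)
        if evaluate(expr) != eval(expr):
            return expr
    return None
```

Calling `check_property()` runs a couple of hundred generated expressions through the evaluator; a real QuickCheck would additionally try to shrink any failing input to a minimal counterexample, which this sketch does not attempt.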

http://www.erlang-consulting.com/etc/usergroup/stockholm

A nice evening, very interesting people and good beer at the Erlounge afterwards!

Beautiful data

One of my favorite books of the last few years is Toby Segaran’s Programming Collective Intelligence, where the author really hit the sweet spot between the theory and practice of data analysis. Broadly speaking, the book had two themes: one, how to get hold of raw data from web sites such as eBay, del.icio.us, Facebook, Zillow and so on via APIs, and two, how to draw interesting conclusions from those data using analysis techniques such as clustering, collaborative filtering, matrix decompositions, decision trees etc. Everything was demonstrated in simple Python code, so it was easy to try it all by yourself.

When I heard this spring that Segaran was the co-author of a new book, Programming the Semantic Web, and a co-editor of another one, Beautiful Data, I pre-ordered them both on Amazon to Singapore, where I live. I got the former book about a month ago, but I’ll not discuss this here because frankly, I’ve been too lazy to give it the kind of attention needed to properly evaluate it (following the code examples and so on).

Beautiful Data, on the other hand, is more suited to browsing (and reading at the playground while my kids are playing). I actually got so frustrated waiting for it – although it was released 26 July in the States, I didn’t get it until 21 August – that I downloaded a PDF from the web and read part of it before I got the physical book. (Sorry about that, O’Reilly – but I did pay for the book with my own money!) It’s definitely a nice book. Loosely based on the concept of a previous book, Beautiful Code, it describes various interesting real-life data analysis and visualization projects. There are also a couple of more essay-like chapters. Each chapter is written by different authors, and the scope is very wide. Most people who read the book will probably have a couple of chapters they really like and a couple they don’t care that much about.

One of the more hands-on chapters is the one about the FaceStats site. This site, which I hadn’t heard about before (and which appears to be on a hiatus), lets users upload photos of themselves and judge the photos of other people. In this chapter, the creators of FaceStats walk the reader through a session of exploratory data analysis (i.e. analysis with no specific hypothesis in mind at the beginning), performed in the statistical scripting language R. Among other things, they show how to find the keywords most characteristic of different groups of people. A big surprise for me there was to see the Swedish word “fjortis” as one of the most female-specific (=most used to describe female faces) words in the database! Unfortunately, the authors don’t comment on this. What surprises me is both that a Swedish slang term (which means, roughly, an immature adolescent – it’s derived from the word “fjorton” which means “fourteen”) is apparently so common at an international web site, and that it is so strongly associated with females – as far as I know, it can be used for both male and female adolescents in Swedish. Looking at this site, it does seem to be a sort of new English loan word which has had its meaning slightly changed.
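I don’t know exactly how the FaceStats authors scored their keywords (and they worked in R), but one simple way to find group-specific words is to compare smoothed relative frequencies between two groups. The Python sketch below does just that; the function name, the smoothing constant and the example tags are my own assumptions.

```python
from collections import Counter

def characteristic_words(group_a, group_b, smoothing=1.0):
    """Rank words by how much more often they occur in group_a than in group_b.

    Uses a ratio of add-one-smoothed relative frequencies, so that words
    missing from one group do not cause a division by zero.
    """
    counts_a, counts_b = Counter(group_a), Counter(group_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = len(group_a) + smoothing * len(vocab)
    total_b = len(group_b) + smoothing * len(vocab)
    ratio = {
        w: ((counts_a[w] + smoothing) / total_a)
           / ((counts_b[w] + smoothing) / total_b)
        for w in vocab
    }
    # Highest ratio first: most characteristic of group_a.
    return sorted(vocab, key=lambda w: ratio[w], reverse=True)

# Made-up tags for two groups of photos (illustrative data only).
tags_a = ["cute", "cute", "cute", "smart"]
tags_b = ["smart", "smart", "tall", "cute"]
ranking = characteristic_words(tags_a, tags_b)
```

With real data you would also want to drop very rare words, since a ratio based on a handful of occurrences is mostly noise.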

Google’s director of research, Peter Norvig, contributes a nice chapter on statistical language modelling. Many of Google’s tricks are probably sketched here. Toby Segaran’s chapter is basically a compressed version of Programming the Semantic Web. One of my favorite chapters is the one by Jeff Hammerbacher, where he describes how he and others built up Facebook’s information platforms. I like his thoughts about the emerging species of data scientists:

At Facebook, we felt that traditional titles such as Business Analyst, Statistician, Engineer, and Research Scientist didn’t quite capture what we were after for our team. The workload for the role was diverse: on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture the skill set required to perform this multitude of tasks, we created the role of “Data Scientist.”

The part in italics sounds a lot like my everyday work activities. Maybe I’ve been a data scientist all along without even knowing it?

There is lots of other interesting stuff in the book. You will read about how to design an image processing system for a spacecraft going to Mars, how to shoot a Radiohead video without actually using film, how to visualize scientific data in Second Life, and much more.

There’s no point in enumerating all of the interesting topics here – suffice it to say that I recommend it to anyone who wants to understand more about real-life data analysis challenges. After you’ve been blown away by all the cool projects and methods, don’t forget to cool off with Coco Krumme’s sober chapter which outlines what data can’t do and how we frequently get fooled by data and fail to intuitively understand probabilities. A refreshing pinch of skepticism.

predict.i2pi

To follow up on yesterday’s post about data sources on the web, I’d like to mention an interesting resource, predict.i2pi, which automatically builds predictive models based on data that you upload. Using it could hardly be simpler – you just have to prepare a comma-separated text file with attributes (predictor variables) and one or more target values (response variables), with the latter being identified as such by putting a star (*) in front of the variable name in the header row. The system will then match your particular data file to a set of suitable prediction algorithms (for example, regression models rather than classification models for a continuous response variable), evaluate the performance of these algorithms on a hold-out set from your data, and output the best results. As the site itself puts it,

Our team of elves will work on your file, running it against a range of model types and keeping track of the best ones. Every now and then we will update your page indicating the best models to date.
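To make the file format and the hold-out idea concrete, here is a small Python sketch. It only mimics the description above and does not call predict.i2pi itself; the column names, the two toy models and the 25% hold-out fraction are assumptions of mine, not anything documented by the site.

```python
import csv
import io
import random

def make_csv(rows, predictors, targets):
    """Write rows as CSV text, starring the target columns in the header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(predictors + ["*" + t for t in targets])
    for row in rows:
        writer.writerow([row[name] for name in predictors + targets])
    return buf.getvalue()

def split_columns(csv_text):
    """Read the header back and separate predictors from starred targets."""
    header = next(csv.reader(io.StringIO(csv_text)))
    targets = [h[1:] for h in header if h.startswith("*")]
    predictors = [h for h in header if not h.startswith("*")]
    return predictors, targets

def holdout_compare(xs, ys, test_fraction=0.25, seed=0):
    """Fit two toy models on a training split and score them on held-out data.

    Returns the name of the model with the lowest mean squared error on the
    hold-out set, plus all scores.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]

    mean_x = sum(xs[i] for i in train) / len(train)
    mean_y = sum(ys[i] for i in train) / len(train)
    sxx = sum((xs[i] - mean_x) ** 2 for i in train)
    sxy = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    models = {
        "mean baseline": lambda x: mean_y,
        "linear regression": lambda x: intercept + slope * x,
    }

    def mse(model):
        return sum((model(xs[i]) - ys[i]) ** 2 for i in test) / len(test)

    scores = {name: mse(model) for name, model in models.items()}
    return min(scores, key=scores.get), scores

# Hypothetical housing data: two predictors and one starred target.
rows = [{"size": 1, "rooms": 2, "price": 3}]
text = make_csv(rows, ["size", "rooms"], ["price"])
```

The real service presumably tries far more model types than these two, but the principle of keeping score on data the models have not seen is the same.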

There’s also an API for predict.i2pi, and developers of statistical learning methods are encouraged to integrate their own favourite algorithms into the system. Read this blog post for more details.

For in-depth background on the various statistical learning and machine learning algorithms, you could do worse than to check out the lectures at videolectures.net. There’s really an astounding amount of information there about lots of different fields, but in particular computer science, with a skew towards machine learning.

Data sources on the web

So where are all these huge data sets that I (and others) have been talking about? Well, some of them are freely available for download. For example, the extensive Reality Mining data set from MIT (which I have blogged about) is available as a MySQL database for anyone to play around with.

There are a couple of repositories for data sets. Infochimps has hundreds or probably thousands of data sets from a wide variety of sources. Some of the data is directly downloadable from the site, while other data sets are just pointed to. Datamob is a similar, though smaller, resource. Amazon’s Public Data Sets are meant to be used seamlessly from within Amazon’s cloud computing services, like the Elastic Compute Cloud (EC2). Here, we find massive datasets such as the collection of all publicly available DNA sequences from GenBank.

Peter Skomoroch has a del.icio.us tag for datasets which is probably the most extensive reference for big downloadable data out there (and which makes this blog post rather superfluous …) Due to the magic of del.icio.us, this list is of course dynamic and continuously growing.

Finally, programmableweb is perhaps not strictly about data per se, but provides links to known APIs for access to web-based resources through your own programs.

Predicting flight delays

It’s rare to find a company with a business idea based purely on prediction, but FlightCaster is such a company. FlightCaster launched less than two weeks ago (August 14) and promises to predict flight delays with high accuracy 6 hours before the flight.

The core of FlightCaster’s service is a patent-pending algorithm that pulls in data regarding weather forecasts, in-bound aircraft tracking, flight history, and other things from various sources. The algorithm actually assigns a probability to a certain flight being delayed, so it can tell you if, for instance, there is an 80% chance of a delay. This can be useful to know, since airlines usually do not warn about delays unless they are 100% sure that a delay will occur.

FlightCaster recently demonstrated the application in front of venture capitalists, and correctly predicted that a flight to New York would be delayed, even though it was reported by the airline to be on time at the moment the prediction was made.

The application is available for BlackBerry and iPhone (for 10 USD) and on the web (for free). You can test a prediction for a random flight here. (At the moment, it only works for US flights – an international version would obviously be a killer app, but it would presumably be much more difficult to pull in the needed data.) One of the nice things about this service is that it tells you the factors it used to make the prediction.

Update: An interesting interview with Bradford Cross from FlightCaster here. It seems their application is built on Hadoop and Amazon EC2 using Rails and Clojure. Peter Skomoroch, who did the interview, does a good job of explaining what I failed to put into my blog post:

FlightCaster strikes me as a great example of the next generation of web applications that will leverage [raw data that has been collected by the government and industry but sits untapped in large data warehouses]: bootstrapped startups that apply machine learning and data processing at scale to solve a focused problem people actually care about.

Quick links

Ran across a couple of interesting links:

Space-Time Travel Data is Analytic Super-Food!, a very meaty blog post where Jeff Jonas starts by discussing largely the same themes that I blogged about a while back, but he has thought more – a lot more! – deeply about it and delivers a number of interesting insights and predictions. The comments section contains some good stuff too.

Data is Journalism – this post discusses the acquisition by MSNBC.com of the local data aggregator service Everyblock. The question of whether data “is” journalism reminds me of the world of science – is a big and hard-to-obtain data set worthy of being published in a prestigious journal, even if the accompanying paper lacks a clear advancement in scientific knowledge? These questions may not be correctly formulated, and when it comes to journalism, I’m certain that data analysis and presentation will play an important role in its future, along with the more traditional components.

Science by crowdsourcing

Science Daily reports that chemists at the annual meeting of the American Chemical Society will be using a computer game format to try to think up creative ways to find new energy sources. The press release describes it as a “‘collaborative think’ project” which “…leverages the intellectual power of chemists for the greater good.”

The game seems to involve avatars moving around in a virtual future world, with the player thinking up ideas in response to scenarios presented in the game. The ideas are reviewed by moderators and the best ideas will be compiled and released to the public.

This exercise reminded me of FoldIt, which is actually a very cool idea and fun to boot. FoldIt is, simply put, a game about folding proteins. It’s pretty easy to get into and quite addictive, but the really interesting thing about it is that everybody’s play data is recorded and, indirectly, used for scientific purposes. The idea is to try to leverage the “human factor” to improve existing algorithms for predicting how proteins fold. Such algorithms suffer from the common problem that they easily get stuck in “local minima”, energy states that look good compared to the near surroundings but that are worse than other states further away. FoldIt was developed by protein folding guru David Baker’s research group.
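The local-minimum problem is easy to demonstrate with a toy example. The Python sketch below has nothing to do with real protein energy functions or with FoldIt’s actual algorithms; the one-dimensional “energy” curve and the greedy search are invented purely to show how a search that only accepts downhill steps can end up in the wrong valley.

```python
def energy(x):
    """Toy 1-D energy landscape with two valleys: a shallow one near
    x = 1.35 and a deeper one near x = -1.47."""
    return x**4 - 4 * x**2 + x

def greedy_descent(x, step=0.01, max_iters=10_000):
    """Repeatedly move to the lower-energy neighbour; stop when neither
    neighbour improves on the current position."""
    for _ in range(max_iters):
        best = min((x - step, x, x + step), key=energy)
        if best == x:
            break
        x = best
    return x

def with_restarts(starts):
    """Run the greedy search from several starting points and keep the
    lowest-energy result."""
    return min((greedy_descent(s) for s in starts), key=energy)

stuck = greedy_descent(2.0)    # ends in the shallow valley on the right
better = greedy_descent(-2.0)  # ends in the deeper valley on the left
```

Starting from x = 2.0 the search settles in the shallow valley and never discovers the deeper one; trying several starting points, as in `with_restarts([2.0, 0.5, -2.0])`, is one crude way out. In FoldIt, the hope is that human players can supply the kind of long-range moves that get a structure out of such a valley.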

Another kind of scientific crowdsourcing is represented by Innocentive, which is linked to a couple of pharma companies. The concept behind Innocentive is simple: pharmas (or other entities) can post challenges (basically, scientific or technical problems to solve) with an associated amount of prize money, and then anyone who has registered as an “Innocentive Solver” can try to solve the problem and claim the prize. Since the web knows no geographical borders, Innocentive can – in principle – access competent people from the whole world. I should add that the challenges look extremely difficult – but they do occasionally get solved.

Another company, Imaginatik, describes itself as “the leader in innovation, idea management and enterprise crowdsourcing software.” The company claims it has a procedure for obtaining better ideas within an organization by efficiently capturing and sharing ideas generated by employees.

In a recent press release, Imaginatik in collaboration with Pfizer and CambridgeSoft announced a new visual collaboration tool for scientists, ChemBioConnect. Quoting the press release:

ChemBioConnect allows scientists to draw, view, edit, archive and search chemical structures and biological systems in a secure, robust collaboration and idea management environment. The software solution effectively allows for collaborative problem-solving among scientists by combining CambridgeSoft’s chemically intelligent visual toolsets with Imaginatik’s leading-edge innovation and idea management platform, comprised of software, services and a deep understanding of how human networks operate.

Are there other examples of scientific crowdsourcing? It certainly feels like a fruitful arena for future development.

Measuring, understanding and fixing (?) your sleep habits

I’ve recently stumbled over articles describing two systems for quantifying and improving people’s sleep patterns. I’m sure there are more of them out there, but I’ll restrict myself to the two I’ve read about (out of sheer laziness).

The first system is called Proactive Sleep. It’s an iPhone-based application based on a couple of small tools. The “sleep diary” is used to track and graph the average sleep amount and the average number of times the user wakes up in the middle of the night. The “vigilance task” is a game that you play immediately on waking up. It “involves following a randomly moving ball with your pointer finger. While you are performing the task, your ability to follow the ball is measured. Performing the task quickly and accurately indicates healthier sleep and less sleep inertia.” The vigilance task is dynamic, so that the difficulty level is adjusted depending on the user’s performance.

The data generated by playing the vigilance game reflects the user’s variations in alertness and can therefore be used to better understand their sleep patterns. With a good understanding of a person’s sleep, you could, for example, wake them up in a lighter sleep phase, so that they feel more rested even though they have slept less than usual. As the web site says:

This would work by sampling wake-up times within a range of when you want to wake up and comparing how you perform on the stimulating game. Since performance on similar tasks as the game show that the deeper the stage of sleep just before awakening, the poorer the performance [...], it is possible that the score on the game can be used to determine better or worse times for you to awaken.
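As a rough illustration of the comparison described in that quote, here is a Python sketch that averages game scores per wake-up time and picks the best one. This is my own guess at the simplest possible version, not Proactive Sleep’s actual method; the function name, the offsets (minutes relative to the desired wake-up time) and the scores are all made up.

```python
from collections import defaultdict

def best_wake_offset(observations):
    """Pick the wake-up offset with the highest average game score.

    observations: (offset_minutes, score) pairs, one per morning, where a
    higher score is assumed to mean better performance on the game.
    """
    by_offset = defaultdict(list)
    for offset, score in observations:
        by_offset[offset].append(score)
    averages = {o: sum(s) / len(s) for o, s in by_offset.items()}
    best = max(averages, key=averages.get)
    return best, averages

# Hypothetical mornings: woken 30, 15 or 0 minutes before the target time.
mornings = [(-30, 62), (-15, 80), (0, 55), (-30, 58), (-15, 84), (0, 61)]
offset, averages = best_wake_offset(mornings)
```

A handful of mornings per offset is obviously very noisy, so in practice you would want many more samples before trusting the winner.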

The other system I’ve read about is called the Zeo Personal Sleep Coach. Like Proactive Sleep, it graphs your sleep habits, including average sleep duration, time taken to fall asleep, number of awakenings per night and so on. These things are measured using a nifty headband recording device. Zeo also measures the time spent in REM, light and deep sleep. Instead of using a game, Zeo lets you quickly record how you feel about your sleep when you’ve just woken up. Then, you can “…compare how you feel you slept to the objective data Zeo provides.” The user also has access to a personal coach who suggests ways to improve poor sleep patterns.

The Decision Tree (nice blog by the way), Technology Review and USA Today have written about Zeo in some depth.

Presumably, both Zeo and Proactive Sleep will eventually have a pretty large collection of data on different individuals’ sleep patterns and how they have succeeded (or not) in improving them. This may then lead to different sorts of population-level analysis. For example, maybe people can eventually be classified into different “sleeper types” based on their sleep characteristics. Knowing a person’s sleeper type might then guide the choice of behavioural modification in order for the person to get more sleep.
