Taxi driving and data mining
Tim O’Reilly and John Battelle’s Web Squared essay from a couple of months back relates an interesting anecdote:
Radar blogger Nat Torkington tells the story of a taxi driver he met in Wellington, NZ, who kept logs of six weeks of pickups (GPS, weather, passenger, and three other variables), fed them into his computer, and did some analysis to figure out where he should be at any given point in the day to maximize his take. As a result, he’s making a very nice living with much less work than other taxi drivers. Instrumenting the world pays off.
I think this kind of thing could be applied in many different professions. It would be interesting to know how well-versed the taxi driver was in statistics. If he wasn’t statistically trained, he presumably used a simple common-sense model, the success of which suggests that large gains can be had simply by quantifying what you do and picking up the major trends. Of course, he may have been a real data-analysis ninja. Either way, it’s probably fair to say, as O’Reilly and Battelle do in their article, that “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skillset. Employers take notice.” Experienced taxi drivers have probably built up an equally effective implicit model of how to get the most income, but the Wellington taxi driver may have been able to “skip ahead” a couple of years using his statistics.
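To make the "simple common-sense model" idea concrete, here is a minimal sketch of what it might look like: average the fare per hour and pickup zone, then head for the zone with the best average at each hour. I don't know what the Wellington driver actually did; the log entries, zone names and the `best_zone_by_hour` function are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log entries: (hour_of_day, zone, fare). All values are invented.
pickups = [
    (8, "station", 22.0),
    (8, "station", 18.0),
    (8, "airport", 15.0),
    (18, "airport", 40.0),
    (18, "airport", 44.0),
    (18, "station", 12.0),
]


def best_zone_by_hour(log):
    """Return {hour: (zone, average_fare)} for the best-paying zone per hour."""
    # Collect every fare seen for each (hour, zone) pair.
    fares = defaultdict(list)
    for hour, zone, fare in log:
        fares[(hour, zone)].append(fare)

    # Keep, for each hour, the zone with the highest average fare.
    best = {}
    for (hour, zone), values in fares.items():
        average = mean(values)
        if hour not in best or average > best[hour][1]:
            best[hour] = (zone, average)
    return best


recommendation = best_zone_by_hour(pickups)
for hour in sorted(recommendation):
    zone, average = recommendation[hour]
    print(f"{hour:02d}:00 -> {zone} (average fare {average:.2f})")
```

With only a handful of records per cell the averages are of course noisy, which is one reason six weeks of logs would be more convincing than six days.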
Another thought that occurred to me is how one would go about building a generic web-based tool where people can track everyday data with a view towards prediction. It would likely be a combination of something like your.flowingdata for the tracking and predict.i2pi for the simple, no-fuss prediction part. Maybe such an application already exists?
The user would of course still have to put some work into defining the problem properly, like deciding what to record and how to encode it. For instance, the taxi driver mentioned above would have had to think about whether to record his location in terms of, for example, neighbourhoods, streets or exact GPS location (or all three). Each has its own advantages and drawbacks: coarse categories such as neighbourhoods gather many observations per category but blur local detail, while exact coordinates keep the detail but leave very few observations for any single spot.
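The encoding question can be sketched in code. The snippet below records one pickup at three levels of detail: a named neighbourhood, a street, and a coarse grid cell derived from the GPS position. The bounding boxes, zone names, coordinates and the 0.01-degree cell size are assumptions made up for this example, not real Wellington boundaries.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pickup:
    lat: float
    lon: float
    street: str


# Hypothetical bounding boxes: (south, west, north, east) in degrees.
NEIGHBOURHOODS = {
    "zone_a": (-41.30, 174.77, -41.28, 174.79),
    "zone_b": (-41.33, 174.79, -41.30, 174.82),
}


def neighbourhood(p):
    """Coarsest encoding: the name of the first box containing the point."""
    for name, (south, west, north, east) in NEIGHBOURHOODS.items():
        if south <= p.lat <= north and west <= p.lon <= east:
            return name
    return "other"


def grid_cell(p, size=0.01):
    """Intermediate encoding: snap the GPS position to a square grid cell."""
    return (math.floor(p.lat / size), math.floor(p.lon / size))


def encode(p):
    """Record all three encodings so the choice can be postponed."""
    return {
        "neighbourhood": neighbourhood(p),
        "street": p.street,
        "cell": grid_cell(p),
        "gps": (p.lat, p.lon),
    }


a = Pickup(-41.2905, 174.7762, "Example Street")
b = Pickup(-41.2951, 174.7718, "Sample Road")
print(encode(a))
print(grid_cell(a) == grid_cell(b))  # nearby points share a cell
```

Storing the raw GPS position alongside the coarser labels keeps the options open, since the coarse encodings can always be recomputed later with different boundaries or cell sizes.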
A really useful general tracking/prediction tool would probably also need some sort of automatic model optimization and validation framework (e.g. built-in variable selection and cross-validation cycles), which would be mostly kept out of the user’s view (unless the user explicitly wants to see it).
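As a rough idea of what such a hidden validation loop could do, the sketch below tries every subset of the recorded variables, scores each with k-fold cross-validation, and keeps the subset with the lowest error. The "model" is deliberately crude (the mean fare of matching training records), the data is synthetic, and names like `cv_error` and `select_features` are my own; a real tool would presumably use a proper statistics library.

```python
from itertools import combinations
from statistics import mean


def predict(train, features, row):
    """Mean fare of training rows that match `row` on the chosen features."""
    key = tuple(row[f] for f in features)
    matches = [r["fare"] for r in train if tuple(r[f] for f in features) == key]
    # Fall back to the overall training mean when no row matches.
    return mean(matches) if matches else mean(r["fare"] for r in train)


def cv_error(rows, features, k=4):
    """Mean absolute error over k folds (fold = row index modulo k)."""
    errors = []
    for fold in range(k):
        test = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        errors.extend(abs(predict(train, features, r) - r["fare"]) for r in test)
    return mean(errors)


def select_features(rows, candidates, k=4):
    """Exhaustive variable selection; ties go to the smaller subset."""
    # Subsets are enumerated smallest-first, and min() keeps the first minimum.
    subsets = [c for n in range(len(candidates) + 1)
               for c in combinations(candidates, n)]
    return min(subsets, key=lambda s: cv_error(rows, s, k))


# Synthetic data: the fare depends on the zone only; weather is irrelevant.
rows = [
    {
        "zone": ["airport", "station"][i % 2],
        "weather": ["rain", "dry", "wind"][i % 3],
        "fare": 30.0 if i % 2 == 0 else 10.0,
    }
    for i in range(24)
]

chosen = select_features(rows, ["zone", "weather"])
print(chosen)
```

On this toy data the zone alone predicts the fare perfectly, so adding weather cannot improve the score and the smaller subset is kept. Exhaustive search only works for a handful of variables, since the number of subsets doubles with each one added.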