Strata, Feb 3
The third and last day of Strata 2011 kept up the momentum from the previous days. There were some energetic and interesting keynote talks in the morning. Simon Rogers of The Guardian talked about the paper’s Data Store and Data Blog and how it aims to “put big datasets out there and let people develop stories.” He illustrated this with some insights about government spending, Wikileaks, etc. Increasingly, The Guardian is displaying data in one way in the paper edition and another way on the web. The following panel discussion with Bradford Cross, Toby Segaran and Amber Case could have been great, but the allotted time was too short for the discussion to really get going.
Barry Devlin, who introduced the term “data warehouse”, said he had now been invited to Strata to announce its death. However, he said data warehousing is not dead; it just needs to be complemented by more “chaotic” (in a good way) approaches for big data. This was reminiscent of a point that had been made in the earlier panel discussion and which came up again later in the day – that big data is characterized by being unstructured and that you don’t know in advance what it’s going to be used for. DJ Patil of LinkedIn presented some further visualizations of Strata attendee networks (sort of building on Pete Skomoroch’s analysis from the previous day). He also introduced a new data product: LinkedIn Skills and Expertise. Patil also gave a tip on the “secret sauce” for creating good data products: it’s about the people, not about the tech you use.
Carol McCall gave a good presentation about how big data can help repair the healthcare system. She said healthcare has a bad business model, where you end up paying for sickness rather than health. There is a culture of passive patients and not enough focus on proactivity and prevention. Importantly, there is no knowledge architecture to learn what works and what doesn’t. McCall is working on adverse drug events (ADEs) in the elderly. ADEs are rarely documented as such (less than 10% of the time) and there aren’t enough ADE databases. She has used claims data from 1.2 million seniors and intersected it with known ADEs to try to build prediction models for ADEs. A sound bite from Carol’s presentation: “Your data will suck. Deal with it.”
After lunch (during which I joined a “birds of a feather” session on healthcare data), it was time for a panel discussion about where the money is in big data. I’m afraid I didn’t get much out of that session, beyond noting that Paul Kedrosky of the Kauffman Foundation was saying that “search is broken” and that “curation is the new search.” He wants to go back to “little data” and do manual curation for seeding new, better search algorithms. The following panel on the theme of Online sentiment, machine learning and prediction was all right, but I got a serious case of “session envy” when I saw the tweets that were raving about the Flip Kromer talk on building data teams. At the online sentiment panel, the point was again made that the difference between big data and “regular” enterprise data is that big data is unstructured and you don’t know what it’s for. Creve Maples gave a good, slightly eccentric talk about human-machine interfaces. He talked about information anxiety – the black hole between data and knowledge – and an updated version of the von Neumann gap (system performance will be limited by the speed of information transfer between human and machine). He showed a recording of how the eyes move over a computer screen (actually several screens) during one minute of work and mentioned FoldIt, which I’ve blogged about repeatedly.
The last sessions of Strata for me were panels on data marketplaces and predicting the future. The latter was, to me, one of the most interesting sessions of the conference. Christopher Ahlberg from Recorded Future started out by saying that “we are surrounded by the future”; there are clues everywhere around us that help us predict the future. But the web is organized around publication time; we need to understand event time in order to do temporal analytics. Recorded Future is building up a temporal index which indexes events rather than keywords and can be used for answering questions such as “What will Mubarak be doing during this year?” or “What drug patents will expire within the next year?” The company is working on an iPad app in English and Chinese. An interesting tidbit from their initial analytic efforts was that published information gets factored into stock prices within 1-3 days. Robert McGrew from Palantir followed up with a thought-provoking talk about predicting terrorist attacks. Terrorists are an adaptive adversary; you can’t just extrapolate from past behavior what they will do in the future. You have to try to predict their intentions and get inside their heads. This is impossible using algorithms only; you need human analysts as well. “Don’t think machine learning, think game theory. Think Spy vs. Spy.” Finally, Rion Snow from Twitter presented an impressive array of studies where Twitter moods or tweeted keywords predicted all kinds of things, from stock market fluctuations to flu outbreaks and election outcomes.
After the end of the conference proper, there was a session with lightning talks from people at Rackspace, followed by an open bar (sponsored by Rackspace) with really tasty hamburgers. We had the chance to talk to some really cool people. While the presentations at Strata were interesting, the best part was meeting, in the flesh, people whose blogs and Twitter feeds I follow, and talking to them.