Strata, pre-conference tutorial day (Feb 1) impressions
I’m in Santa Clara, California, attending the inaugural O’Reilly Strata conference (Twitter hashtag #strataconf), which is about … well, what? I would say it’s a business-oriented data analysis conference with a focus on huge data sets and scalable technical solutions (Hadoop/MapReduce, distributed databases, etc.). But that dry summary doesn’t convey the variety of the audience. During the first day (Tuesday Feb 1st), which was devoted mostly to pre-conference tutorials, I saw or talked to people from social network sites, venture capital firms, hedge funds, libraries, public radio, weather forecasting companies, publishers and more. This kind of variety is very nice, and so far it seems like O’Reilly has really managed to pull it off with this conference.
For the morning tutorial session, I decided to go for How to Develop Big Data Applications with Hadoop, hosted by Karmasphere. It was nice enough, although the time allotted for the hands-on session was not really enough to do the setup, learn how to use the Karmasphere software and understand what the code was doing. Still, I thought it was useful to get a bit more exposure to different MapReduce and Hadoop-related tools.
After lunch, I decided to sample a couple of different sessions, starting with Apache Cassandra in Action. This was good but a bit too database-technical for me, so I switched to Communicating Data Clearly, where Naomi Robbins presented useful Tufte-like pointers on how to appropriately convey information in a graph. I then skipped to the Executive Summit, where JC Herz from Batchtags LLC gave an entertaining talk about how executives often regard data analysts as some sort of occult witch doctors: “Don’t tell me how those throwing bones work, just throw them and tell me the future!” She suggested that some executives exhibit a kind of “high status helplessness” where they take a perverse pride in regarding analytics methods as a black box that they’re not supposed to know about. She called for more structured validation tests that keep both the customer and the developer focused and honest, and suggested that a pay-for-performance model for analytics might often be a good idea. Following her, John Fritz from AMD talked about analytics and ethics (“with big data comes big responsibility”) and illustrated his points with stories about two companies, Caesar’s Entertainment and El Paso Energy, that according to him “were big data back when it was medium data”. As far as I understand, the talks from the conference will eventually be made available through the conference site, YouTube and other channels. I’ll be sure to catch up on the Executive Summit talks that I missed, such as the talk on retail, which I heard afterwards was really good, and Mining the Tar Sands of Big Data by Michael Driscoll.
After dinner, it was time for the Startup Showcase with demos from some, well, startups. You can see who won on the page I just linked, but my own favorite was probably BuzzData, which could be very briefly summarized as a GitHub for data sets, with the trackability and social network aspects built in from the start.
The Startup Showcase session was followed by drinks at a nearby hotel bar, where I had the opportunity to discuss topics such as graph databases (to my surprise, they came up twice in independent discussions), a possible “data bubble”, and various other geeky subjects with fellow data enthusiasts.
I’m really looking forward to today’s packed schedule of keynotes (which will be live streamed), sessions and evening tutorials. I’ll be live tweeting (@mikaelhuss) and eventually summarizing on this blog.