Strata, Feb 2
This was the first day of the conference proper, with keynotes and other activities extending throughout the day until 9:20pm. It was a jam-packed day for sure. People were crowding around demonstration booths and bars, and the notice board was full of job openings for data scientists. The big data field seems to be on a roll – at least for now.
The keynotes are already available on O’Reilly’s YouTube channel. The keynotes I enjoyed the most were those by Hilary Mason and Mark Madsen. Hilary Mason came up with two “Strata memes” that people seemed to pick up on: (a) how nice it is to be able to spin up computing clusters while at home in your underwear and (b) that we have enough ad-optimizing algorithms already – it’s time to use analytics for stuff that is actually important. Mark Madsen drew parallels between the current data hype and the gold rush of the 19th century, calling the image of the lone developer with a PC mining terabytes of data a myth. It’s actually changes in business processes – according to him – that will lead to the really disruptive changes. He also talked about “software eating itself” and how “code is a commodity”. Not sure I agree, but worth thinking about. Like JC Herz yesterday, Madsen cautioned against pretty visualizations with no use case, urging us not to “become the tabloid journalists of the data industry.” He also said that the one paper you definitely have to read if you want to become a useful data scientist is The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis by Pirolli and Card (2005).
Following the keynotes, there was a brief panel session where Mike Olson from Cloudera opined that the real value of data lies in combining disparate data sets – a big data set is not necessarily interesting by itself, but becomes interesting when you intersect it with other data sets. He said that exploration & analysis tools that can ingest data from more sources are now more important than technology platforms.
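Mechanically, this kind of intersection is just a join on a shared key. A minimal sketch in Python, with made-up retail data (the `store_visits`/`zip_codes` names and numbers are purely illustrative, not anything Olson showed):

```python
# Hypothetical illustration of Olson's point: neither data set is very
# interesting on its own, but joining them on a shared key is.
store_visits = {"alice": 12, "bob": 3, "carol": 7}   # e.g. loyalty-card data
zip_codes = {"alice": "94107", "bob": "10001"}       # e.g. CRM data

# Intersect on customer name: total visits per zip code.
visits_by_zip = {}
for customer, visits in store_visits.items():
    zipc = zip_codes.get(customer)
    if zipc is not None:  # customers missing from either set drop out
        visits_by_zip[zipc] = visits_by_zip.get(zipc, 0) + visits

print(visits_by_zip)  # {'94107': 12, '10001': 3}
```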
Pete Skomoroch of LinkedIn delivered a talk where he gave suggestions for how to plan and implement a big data analysis project. He had intersected the Strata attendee list with data from LinkedIn to generate interesting statistics and graphs about the skill sets of the attendees (most common: Python, MySQL, data mining, machine learning), which skills were the most specific to Strata attendees (Hadoop, big data, machine learning), and which rough thematic clusters of attendees could be discerned (O’Reilly, Media, Startups, Hadoop, Big Data). He also looked at where the Strata attendees were working five years ago (Microsoft, Yahoo, Google, IBM, Oracle, Sun) and where they are now (Google, Cloudera, LinkedIn, Microsoft, IBM, Thomson Reuters). He said there will be more analysis presented today by DJ Patil.
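I don’t know exactly how Skomoroch computed the “most specific” skills, but a simple way to get that kind of ranking is to divide each skill’s frequency among attendees by its frequency in some baseline population – a lift score. A toy sketch (the skill lists and baseline frequencies below are invented for illustration, not LinkedIn’s data):

```python
from collections import Counter

# Hypothetical data: skill lists for conference attendees, plus the
# fraction of a baseline population listing each skill.
attendee_skills = [
    ["python", "hadoop", "machine learning"],
    ["python", "mysql", "big data"],
    ["hadoop", "big data", "machine learning"],
    ["python", "data mining", "hadoop"],
]
baseline_freq = {
    "python": 0.05, "mysql": 0.06, "hadoop": 0.002,
    "machine learning": 0.004, "big data": 0.001, "data mining": 0.005,
}

counts = Counter(s for skills in attendee_skills for s in skills)
n = len(attendee_skills)

# Most common skills: raw frequency among attendees.
most_common = counts.most_common(3)

# Most *specific* skills: attendee frequency divided by baseline
# frequency, so rare-but-overrepresented skills rise to the top.
lift = {s: (c / n) / baseline_freq[s] for s, c in counts.items()}
most_specific = sorted(lift, key=lift.get, reverse=True)[:3]

print(most_common)    # python leads by raw count
print(most_specific)  # hadoop/big data lead by lift
```

Note how the two rankings differ: a widely held skill like Python tops the raw counts, while niche skills dominate the lift ranking – matching the pattern in the talk.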
I spent the next session watching two talks, but unfortunately neither of them was MAD Skills: A Magnetic, Agile and Deep Approach to Scalable Analytics. I heard from a couple of people that it was really good. There is a paper that I’ll have to read instead. A data cleaning/transformation tool called Wrangler, which looks interesting, was also introduced at that talk.
After lunch, I started out by watching Anthony Goldbloom from Kaggle, which I’ve written about many times on this blog. Anthony gave an engaging talk about the value of prediction competitions and described Kaggle as a matchmaker between those who have the data and those who are hungry to analyze it. There is often a disconnect between, e.g., companies that are sitting on huge databases but don’t know how to analyze them and academics who have cool prediction algorithms but no real-world data to use them on, and Kaggle can help solve this dilemma. Different people tend to be good at solving different types of prediction tasks, but Kaggle noticed that one particular person was doing well in all their competitions. They subsequently hired this person (Jeremy Howard) as chief data scientist, and he appeared at the end of the talk to make an eloquent case for participating in prediction contests. In fact, he convinced me to take up a belated New Year’s resolution to participate in at least one contest this year.
The next session was a real-world applications panel on medical and healthcare opportunities. As it turned out, it consisted mostly of company presentations, and there weren’t many concrete suggestions for data startups in the medical space beyond just saying that there are big opportunities. However, I liked the concept behind Asthmapolis, which develops a networked, GPS-enabled asthma inhaler that will allow the company to obtain large-scale data about locations and events that are correlated with asthma attacks.
During the following break, I went to the Sponsor Pavilion in search of freebies and interesting companies. LinkedIn had a booth where you could get your LinkedIn contact graph printed out. It’s kind of a gimmick but pretty nice nonetheless. My graph had been clustered in a sensible way with clusters corresponding to childhood friends, grad school friends, postdoc friends etc. It may be more useful to look at another person’s (maybe a customer’s or a potential employer’s) to see what kind of professional social network they have.
The next talk was one that I had been looking forward to, and it did not disappoint. Joseph Turian talked about recent interesting developments in machine learning and NLP (natural language processing) and made me realize that I had been missing a really machine-learning-oriented talk so far. He discussed deep learning, which I was already familiar with, but then he threw in “smart” (= data-driven) hashing, which when combined with deep learning is called “semantic hashing”. In this technique, input is processed through several layers of a deep learning architecture until one gets a compact 32-bit representation, which becomes a code for the data point. Using this representation, you can store a billion data points on a normal laptop. He also discussed graph stores – as I mentioned in my previous blog post, this topic had come up repeatedly the previous day – and namechecked the Swedish Neo4j graph database. Turian said that the map-reduce framework is too high-level for most machine learning methods, which contain graph-like dependencies, and that graph-based parallelism will be a more fruitful framework for those. He mentioned Pregel and GraphLab as two abstractions for parallel graph operations. Finally, he talked about recent work by Poon and Domingos that seems to be capable of almost magically extracting knowledge from text. Pedro Domingos is the guy who wrote the paper about why Naïve Bayes often works well even when inputs are not independent. (I have also heard rumors that he used to be a pop star in a Portuguese boy band; maybe the internet can confirm/deny.)
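The storage claim is easy to check: a billion 32-bit codes is about 4 GB. And here is a toy sketch of the retrieval side of semantic hashing, assuming a deep network has already mapped each item to a 32-bit code so that similar items get codes at small Hamming distance (random codes stand in for the network’s output here):

```python
import random

# A billion 32-bit (4-byte) codes occupy about 4 GB, which does fit on
# a normal laptop:
print(1_000_000_000 * 4 / 1e9, "GB")  # 4.0 GB

# Random stand-ins for the codes a trained deep network would emit.
random.seed(0)
codes = [random.getrandbits(32) for _ in range(10_000)]

def hamming(a, b):
    """Number of differing bits between two 32-bit codes."""
    return bin(a ^ b).count("1")

# Nearest-neighbor lookup: find the stored code closest to a query code
# (here, a code 3 bits away from item 42).
query = codes[42] ^ 0b101
nearest = min(range(len(codes)), key=lambda i: hamming(codes[i], query))
print(nearest, hamming(codes[nearest], query))
```

In practice, lookups can be even cheaper than this linear scan: with short codes you can probe all codes within a small Hamming radius of the query directly, using the code itself as a memory address.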
At this point I was pretty tired from the jet lag. I didn’t really pay attention to the next talk but instead headed over to the sponsor pavilion, where O’Reilly was handing out snacks, wine and books. Yes, they gave away free signed books by the likes of Toby Segaran (Programming Collective Intelligence, Beautiful Data, Programming the Semantic Web). I picked up a book that seems really promising: Mining the Social Web by Matthew Russell. I will review it eventually here on the blog. After dinner, I went to a talk by the author of the book. It was an enjoyable talk about simple techniques for analyzing Twitter data. The final presentation I went to was nominally about Apache Mahout, a scalable machine-learning/data-mining library, but I found that the presentation was a bit short on actual Mahout material, focusing more on machine learning and prediction in general. Nice enough overview, but I was already familiar with most of those topics and would have preferred to learn a bit more about the nuts and bolts of Mahout.
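To give a flavor of what “simple techniques” can mean here, a minimal sketch of term and hashtag counting over a handful of made-up tweets (not code from the book or the talk):

```python
import re
from collections import Counter

# Made-up tweets standing in for data fetched from the Twitter API.
tweets = [
    "Loving the #strata keynotes! #bigdata",
    "Hadoop and machine learning everywhere at #strata",
    "#bigdata is the new gold rush #strata",
]

# Term frequencies over all words, plus a separate hashtag count.
words = Counter(w for t in tweets for w in re.findall(r"\w+", t.lower()))
hashtags = Counter(h for t in tweets for h in re.findall(r"#\w+", t.lower()))

print(hashtags.most_common(2))  # [('#strata', 3), ('#bigdata', 2)]
```

From counts like these it is a short step to the slightly fancier analyses typically done on tweet corpora, such as co-occurring hashtags or retweet graphs.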
Phew … I guess that’s about it. At this moment I’m waiting for the keynotes of the third day to commence.