Follow the Data

A data driven blog


Data themed podcasts

I’ve started to follow a number of data-themed podcasts:

  • O’Reilly Data Show. This is a newish spin-off from the O’Reilly Radar Podcast, hosted by O’Reilly’s chief data scientist, Ben Lorica. The first three episodes have dealt with “the science of moving dots” (analyzing sports videos), Bitcoin analytics, and Kafka.
  • Partially Derivative. This bills itself as a podcast about “data, data science and awesomeness”, and also seems to have a beer side theme (check out the logo, for instance). I have only heard the latest episode (#7), which I liked.
  • The Data Skeptic Podcast. A very good podcast with frequent episodes. Some are “minisodes”, shorter shows where the host usually discusses some specific statistical issue or model with his wife Linhda, and some are longer interviews with people who talk about surprisingly interesting topics such as “data myths”, how to make unbiased dice, how algorithms can help people find love, etc.
  • Data Stories. This one focuses on data visualization, with interviews with strong practitioners in the field. Definitely worth a listen.

I also used to be a fan of The Cloud of Data, but it does not seem to be updated anymore. Are there any other good ones?

Edit 2015-01-04: There is a brand new machine learning podcast called Talking Machines, which doesn’t seem to be on iTunes yet. It seems like a must-listen based on the guest list in the first show: Hanna Wallach (of the much-linked Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency talk), Kevin Murphy (who wrote one of the most highly praised textbooks in the area), Max Welling (co-chair of the NIPS conference), and the three deep learning giants Yann LeCun, Yoshua Bengio and Geoff Hinton.

I also found another machine learning themed podcast from Udacity called Linear Digressions.

Data science workshop at Stockholm University

I spent last Thursday and Friday (Dec. 4-5, 2014) in Kista, just outside Stockholm, at a data science workshop hosted by Professor Henrik Boström at the Department of Computer and Systems Sciences at Stockholm University. It was a nice opportunity to get to know some new data science people from around Sweden (the geographical spread was decent) whom I hadn’t encountered at my usual hunting grounds (mainly meetups). The workshop was perhaps more oriented towards academic data scientists (“Most of the people here have published,” as Henrik said), although there were participants from companies such as Scania and SAAB as well.

The meeting was very enjoyable, and we heard presentations on topics such as predictive car maintenance, mining electronic health records for unknown drug side effects, predicting anomalies in elderly people’s movement patterns in their apartments, pinpointing dangerous mutations in flu viruses, and much more. On the second day, we were divided into groups to discuss various topics; I participated in two of the discussion groups. The first one was about computational life sciences, where we discussed the balance between method development and purely applied analysis in biological data science (bioinformatics). We also talked about the somewhat paradoxical situation that arises in high-throughput biological studies, where a careful statistical analysis performed on a large dataset is often only considered believable after experimental validation on a much smaller (and clearly more biased) sample. Perhaps other types of validation, such as independent testing against previously published data, would often be more useful. We also argued that performing (good!) analyses on already published data should be valued much more highly than it is today, when “novel data” are often required for publication.

The second discussion was about the comprehensibility, or interpretability, of classifiers. According to one of the participants, Leo Breiman (who invented random forests, among other things) said, when asked about the most pressing problems in data mining: “There are three: interpretability, interpretability, and interpretability!” (I’m paraphrasing.) Domain experts often prefer models that can give clear and convincing reasons for their predictions, that use a small number of relevant features, and that (in some cases) emulate human decision making in the way they arrive at predictions. However, interpretability or comprehensibility is subjective and hard to measure, and there isn’t even any agreed terminology or definition for it (as exemplified by the fact that I am using a couple of different words for it here!). This problem will likely become more and more pressing as machine learning tools grow more popular and the number of non-expert users increases. It was a very interesting discussion where a lot of ideas were thrown around: cost-sensitive methods with a feature acquisition cost (“pay extra” for features), posting models on Mechanical Turk and asking whether they are comprehensible, gamification where people have to use different models to solve a problem, hybrid classifiers such as decision trees where some nodes might contain a whole neural network, etc.
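To make the idea of a comprehensible model concrete, here is a minimal sketch (my own illustration, nothing shown at the workshop) of the kind of model domain experts tend to like: a shallow decision tree whose entire rule set can be printed and read. It assumes scikit-learn.

```python
# A minimal sketch (not from the workshop) of one kind of "comprehensible"
# model: a shallow decision tree whose complete rule set can be printed
# and read by a domain expert. Assumes scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# Capping the depth trades some accuracy for a rule set small enough for
# a human to actually follow.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))
```

Each printed path is a human-readable reason for a prediction, which is exactly the property the discussion kept circling around.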

Other groups discussed, e.g., automated experimentation and experimental design, and the possible establishment of a PhD program in data science.

One idea that came up a few times throughout the meeting, and which I hadn’t heard about before, was conformal prediction. Apparently this was introduced in the book Algorithmic Learning in a Random World (2005). As far as I understood, conformal prediction methods are concerned with quantifying their own accuracy: they supposedly provide reliable bounds on how they will generalize to new data. The price you pay for this reliability is that you don’t get a point estimate as your prediction; instead of a single predicted class (in classification) you might get a set of possible classes, and instead of a single number (in regression) you get an interval. One of the participants likened this to a confidence interval, but a general one that does not assume a specific distribution of the data. Within this framework, you can get p-values for your predictions and perform hypothesis tests.
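Since this was new to me, here is a toy sketch of the split (inductive) variant of conformal classification as I understand it; this is my own illustration, not code from the book or the workshop, and it assumes scikit-learn and numpy.

```python
# A toy sketch of split (inductive) conformal classification, as I
# understand it; my own illustration, not code from the book or the
# workshop. Assumes scikit-learn and numpy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Nonconformity score: 1 - predicted probability of the true class,
# computed on a held-out calibration set.
cal_probs = model.predict_proba(X_cal)
cal_scores = 1.0 - cal_probs[np.arange(len(y_cal)), y_cal]

def prediction_set(x, epsilon=0.05):
    """All classes whose conformal p-value exceeds epsilon.

    Under exchangeability, the returned set contains the true class
    with probability at least 1 - epsilon. (Assumes model.classes_
    are 0, 1, 2, ... so the index c is also the class label.)
    """
    probs = model.predict_proba(x.reshape(1, -1))[0]
    keep = []
    for c, p in enumerate(probs):
        score = 1.0 - p  # nonconformity of the hypothesized label c
        p_value = (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)
        if p_value > epsilon:
            keep.append(c)
    return keep

print(prediction_set(X_cal[0], epsilon=0.05))
```

Instead of a single label, prediction_set returns every label that cannot be rejected at the chosen significance level, which is exactly the “more than one possible class” behavior described above.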

There were also some nice ideas about how to embed various types of data as graphs, and how to validate graphs induced from biological experiments: another type of “generalized confidence interval”, where the aim is to find “reliable” subnetworks based on repeated subsampling.
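As a rough illustration of that subsampling idea (my own sketch; the simple correlation-threshold “network inference” below is just a stand-in for whatever method was actually presented):

```python
# A rough sketch (my own, not the method from the workshop) of the
# subsampling idea: re-infer a network on repeated subsamples of the
# data and keep only the edges that recur in most of them.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 100, 10
data = rng.normal(size=(n_samples, n_vars))   # toy data matrix

n_rounds, corr_cut, reliability_cut = 200, 0.5, 0.8
edge_counts = np.zeros((n_vars, n_vars))

for _ in range(n_rounds):
    idx = rng.choice(n_samples, size=80, replace=False)  # subsample rows
    corr = np.corrcoef(data[idx].T)                      # vars x vars
    edge_counts += np.abs(corr) > corr_cut

# An edge is "reliable" if it shows up in at least 80% of the rounds.
reliable = (edge_counts / n_rounds) >= reliability_cut
np.fill_diagonal(reliable, False)
print(np.argwhere(reliable))
```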

Quick notes

  • I’ve found the Data Skeptic to be a nice podcast about data science and related subjects. For example, the “data myths” episode and the one with Matthew Russell (who wrote Mining the Social Web) are fun.
  • When I was in China last month, the seat pocket in front of me in the cab we took from the Beijing airport held a glossy magazine whose first feature article was about big data (大数据) analysis applied to Chinese TV series and movies, Netflix-style. Gotta beat those Korean dramas! One of the hotels we stayed at in Beijing had hosted an international conference on big data analytics the day before we arrived; the signs and posters were still there. Anecdotes, not data, but still.
  • November was a good meetup month in Stockholm. The Machine Learning group had another good event at Spotify HQ, with interesting presentations from Watty, both about how to “data bootstrap” a startup when you discover that the existing data you’ve acquired is garbage and you need to start generating your own in a hurry, and about the actual nitty-gritty details of their algorithms (which model and predict energy consumption from different devices in households by deconvolving a composite signal), as well as a talk about embodied cognition and robotics by Jorge Davila-Chacon (slides here). Also, in an effort to revive the Stockholm Big Data group, I co-organized (together with Stefan Avestad from Ericsson) a meetup with Paco Nathan on Spark. The slides for the talk, which was excellent and much appreciated by the audience, can be found here. Paco also gave a great workshop the next day on how to actually use Spark. Finally, I’ve joined the organizing committee of SRUG, the Stockholm R useR group, and have started to plan some future meetups there. The next one will be on December 9 and will deal with how Swedish governmental organizations use R.
  • Erik Bernhardsson of Spotify has written a fascinating blog post combining two of my favorite subjects: chess and deep learning. He has trained a three-layer-deep, 2048-unit-wide network on 100 million games from FICS (the Free Internet Chess Server, where I, incidentally, play quite often); a toy sketch of such an architecture follows after this list. I’ve often wondered why it seems so hard to build a chess engine that really learns the game from scratch, using actual machine learning, rather than the rule- and heuristic-based programs that have ruled the roost, pre-loaded with massive opening libraries and endgame tablebases (giving the optimal move in any position with at most N pieces; I think that N is currently 7). It would be much cooler to have a system that just learns implicitly how to play and does not rely on built-in knowledge. Well, Erik seems to have achieved that, kind of. The cool thing is that this program does not need to be told explicitly how the pieces move; it can infer that from data. Since the system is trained on amateur games, it sensibly enough does not care about the outcome of each game (that would be a weak label for learning). I do think that Erik is a bit optimistic when he writes that “Still, even an amateur player probably makes near-optimal moves for most time.” Most people who have analyzed their own games, or online games, with a strong engine know that amateur games are riddled with blunders. (I remember the old Max Euwe book “Chess Master vs. Chess Amateur”, which also demonstrated this convincingly … but I digress.) Still, a very impressive demonstration! I once supervised a master’s thesis where the aim was to teach a neural network to play some specific endgames, and even that was a challenge. As Erik notes in his blog post, his system needs to be tried against a “real” chess engine. It is reported to score around 33% against Sunfish, but that is a fairly weak engine, as I found out by playing it half an hour ago.
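For a rough idea of what a network like that looks like, here is a toy sketch of a similarly shaped model. This is my own reconstruction, not Erik’s actual code (he used Theano, and his training objective was more elaborate than the placeholder loss here): a board is one-hot encoded as 64 squares × 12 piece types = 768 inputs, fed through three 2048-unit layers to a scalar position score. It assumes TensorFlow/Keras.

```python
# A toy sketch of a similarly shaped model: my own reconstruction, not
# Erik's actual code (he used Theano, and his objective was more
# elaborate than the placeholder loss below). A board is one-hot
# encoded as 64 squares x 12 piece types = 768 inputs.
import numpy as np
from tensorflow import keras

def encode_board(piece_at):
    """piece_at: dict mapping square index (0-63) to piece index (0-11)."""
    x = np.zeros(64 * 12, dtype=np.float32)
    for square, piece in piece_at.items():
        x[square * 12 + piece] = 1.0
    return x

model = keras.Sequential([
    keras.layers.Dense(2048, activation="relu", input_shape=(768,)),
    keras.layers.Dense(2048, activation="relu"),
    keras.layers.Dense(2048, activation="relu"),
    keras.layers.Dense(1),  # scalar evaluation of the position
])
model.compile(optimizer="adam", loss="mse")  # placeholder objective

# Tiny encoding example: eight white pawns (piece index 0) on rank 2.
pawns = {8 + i: 0 for i in range(8)}
print(model.predict(encode_board(pawns)[None, :]).shape)  # -> (1, 1)
```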
