Large-scale machine learning course & streaming organism classification

The NYU Large Scale Machine Learning course looks like it will be very worthwhile to follow. The instructors, John Langford and Yann LeCun, are both key figures in the machine learning field – for instance, the former developed Vowpal Wabbit and the latter has done pioneering work in deep learning. It is not an online course like those at Coursera et al., but they have promised to put lecture videos and slides online. I’ll certainly try to follow along as best I can.


There is an interesting InnoCentive challenge going on, called “Identify Organisms from A Stream of DNA Sequences.” This is interesting to me both because of the subject matter (classification based on DNA sequences) and because the winner is explicitly required to submit an efficient, scalable solution (not just a good classifier). Also, the prize sum is one million US dollars! It’s exactly this kind of algorithm that will be needed to enable the “genomic observatories” that I have mentioned before on this blog, which will continuously stream out sequences obtained from the environment.
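
To make the idea of classifying a stream of sequences a bit more concrete, here is a minimal toy sketch in Python. It is not the challenge's method or anything close to a competitive entry: it assumes each organism is represented by one short reference sequence, and it assigns each incoming read to the reference that shares the most k-mers with it. The reference sequences, the names `organism_a` and `organism_b`, and the choice of k = 4 are all made up for illustration.

```python
from collections import Counter


def kmers(sequence, k=4):
    """Yield every overlapping substring of length k."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_profiles(references, k=4):
    """Map each (hypothetical) organism name to the k-mer counts of its reference."""
    return {name: Counter(kmers(seq, k)) for name, seq in references.items()}


def classify_read(read, profiles, k=4):
    """Return the organism whose profile contains the most k-mers of the read.

    Ties go to whichever organism comes first in `profiles`.
    """
    read_kmers = list(kmers(read, k))

    def score(name):
        profile = profiles[name]
        return sum(1 for kmer in read_kmers if kmer in profile)

    return max(profiles, key=score)


def classify_stream(reads, profiles, k=4):
    """Lazily classify reads one at a time, so the stream never sits in memory."""
    for read in reads:
        yield read, classify_read(read, profiles, k)


# Made-up reference sequences, purely for illustration.
references = {
    "organism_a": "ACGTACGTACGTACGTACGT",
    "organism_b": "TTGGCCAATTGGCCAATTGG",
}
profiles = build_profiles(references)
results = list(classify_stream(iter(["GTACGTAC", "GGCCAATT"]), profiles))
```

Because `classify_stream` is a generator, the reads could just as well come from a file or a socket. A real solution would of course need far more compact data structures and a much better scoring scheme than this.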

A good week for (big) data (science)

Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.

Monday to Wednesday were spent at a Hadoop and NGS (next-generation [DNA] sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed Hadoop software for high-throughput DNA sequencing: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, a nice Hadoop toolkit for sequencing data that enables running several different types of tasks in a distributed fashion. Other things we looked at were the CloudBioLinux project, map/reduce sequence assembly using Contrail, and CSC’s biological high-throughput data analysis platform Chipster.

On Friday, blog co-author Joel and I went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.

Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun, and it can serve as a checklist for myself in the future.
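
For readers who have not tried a prediction contest, two of the checklist items (evaluating a model on held-out data and finding out which variables matter) can be sketched in a few lines. The sketch below is in Python rather than R and is not the course assignment. It uses made-up synthetic data, a deliberately simple nearest-centroid classifier, and permutation importance as one possible way of ranking variables. All function names are invented for this illustration.

```python
import random


def make_toy_data(n=200, seed=1):
    """Synthetic rows: column 0 separates the classes, column 1 is pure noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = label * 4.0 + rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        rows.append(([informative, noise], label))
    return rows


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Predict the class whose centroid is closest in squared distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=distance)


def accuracy(centroids, rows):
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)


def permutation_importance(centroids, rows, column, seed=0):
    """Accuracy lost when one column is shuffled across the held-out rows."""
    rng = random.Random(seed)
    values = [f[column] for f, _ in rows]
    rng.shuffle(values)
    permuted = [(f[:column] + [v] + f[column + 1:], y)
                for (f, y), v in zip(rows, values)]
    return accuracy(centroids, rows) - accuracy(centroids, permuted)


train, test = train_test_split(make_toy_data())
model = fit_centroids(train)
test_accuracy = accuracy(model, test)
importances = [permutation_importance(model, test, c) for c in range(2)]
```

The point is only the workflow: fit on one part of the data, score on the part the model has not seen, and then check how much the score suffers when each variable is scrambled.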

Stay tuned for the first episode of Follow the Data podcast!

Computational advertising course

I’ve written about one company that exemplifies how advertising is becoming more data-driven, and now I find there is a Stanford University course about computational advertising. One of the lecture note PDFs defines computational advertising as “A principled way to find the ‘best match’ between a user in a context and a suitable ad”. Although I agree with this O’Reilly Radar blog post in thinking that it’s a stretch to call computational advertising a “scientific discipline”, the lecture notes are nevertheless fun and interesting to read. The instructors are from Yahoo! Research, and a lot of the material they cover is probably being used by Yahoo! in some way.

