A good week for (big) data (science)
Perhaps as a subconscious compensation for my failure to attend Strata 2012 last week (I did watch some of the videos and study the downloads from the “Two Most Important Algorithms in Predictive Modeling Today” session), I devoted this week to more big-data/data-science things than usual.
Monday to Wednesday were spent at a Hadoop and NGS (Next Generation [DNA] Sequencing) data processing hackathon hosted by CSC in Espoo, Finland. All of the participants were very nice and accomplished; I’ll just single out two people for having developed high-throughput DNA sequencing related Hadoop software: Matti Niemenmaa, who is the main developer of Hadoop-BAM, a library for manipulating aligned sequence data in the cloud, and Luca Pireddu, who is the main developer of Seal, which is a nice Hadoop toolkit for sequencing data which enables running several different types of tasks in distributed fashion. Other things we looked at was the CloudBioLinux project, map/reduce sequence assembly using Contrail and CSC’s biological high-throughput data analysis platform Chipster.
On Friday, me and blog co-author Joel went to record our first episode of the upcoming Follow the Data podcast series with Fredrik Olsson and Magnus Sahlgren from Gavagai. In the podcast series, we will try to interview mainly Swedish but also other companies that we feel are big data or analytics related in an interesting way. Today I have been listening to the first edit and feel relatively happy with it, even though it is quite rough, owing to our lack of experience. I also hate to hear my own recorded voice, especially in English … I am working on one or two blog posts to summarize the highlights of the podcast (which is in English) and the following discussion in Swedish.
Over the course of the week, I’ve also worked in the evenings and on planes to finish an assignment for an academic R course I am helping out with. I decided to experiment a bit with this assignment and to base it on a Kaggle challenge. The students will download data from Kaggle and get instructions that can be regarded as a sort of “prediction contests 101”, discussing the practical details of getting your data into shape, evaluating your models, figuring out which variables are most important and so on. It’s been fun and can serve as a checklist for my self in the future.
Stay tuned for the first episode of Follow the Data podcast!