I spent last Thursday and Friday (Dec. 4-5, 2014) in Kista, just outside Stockholm, at a data science workshop hosted by Professor Henrik Boström at the Department of Computer and Systems Sciences at Stockholm University. It was a nice opportunity to get to know some new data science people from around Sweden (the geographical spread was decent) whom I haven’t encountered at my usual hunting grounds (mainly meetups). This workshop was perhaps more oriented towards academic data scientists (“Most of the people here have published,” as Henrik said), although there were participants from companies such as Scania and SAAB as well.
The meeting was very enjoyable and we heard presentations on topics such as predictive car maintenance, mining electronic health records for unknown drug side effects, anomaly prediction of elderly people’s movement patterns in their apartments, pinpointing dangerous mutations in flu viruses and much more. On the second day, we were divided into groups where we discussed various topics. I participated in two discussion groups. The first one was about computational life sciences, where we discussed the balance between method development and pure applied analysis in biological data science (bioinformatics). We also talked about the somewhat paradoxical situations that occur in high-throughput biological studies, where careful statistical analysis performed on a large dataset is often only considered believable after experimental validation on a much smaller (and clearly more biased) sample. Perhaps other types of validation, such as independent testing against previously published data, would often be more useful. We also argued that performing (good!) analyses on already published data should be valued much higher than it is today, when “novel data” are often required for publication.
The second discussion was about the comprehensibility or interpretability of classifiers. According to one of the participants, Leo Breiman (who invented random forests, among other things) said, when asked about the most pressing problems in data mining: “There are three: interpretability, interpretability, and interpretability!” (I’m paraphrasing.) Domain experts often prefer models that can give clear and convincing reasons for predictions, that have a small number of relevant features, and that (in some cases) emulate human decision making in the way they produce predictions. However, interpretability or comprehensibility is subjective and hard to measure, and there isn’t even any good terminology or definition of it (which is exemplified by the fact that I am using a couple of different words for it here!). It’s likely that this problem will become more and more pressing as machine learning tools become more popular and the number of non-expert users increases. This was a very interesting discussion where a lot of ideas were thrown around: cost-sensitive methods where you have a feature acquisition cost (“pay extra” for features), posting models on Mechanical Turk and asking if they are comprehensible, gamification where people have to use different models to solve a problem, and hybrid classifiers such as decision trees where some nodes might contain a whole neural network.
Other groups discussed, e.g., automated experimentation and experimental design, and the possible establishment of a PhD program in data science.
One idea that came up a few times throughout the meeting, and which I hadn’t heard about before, was the concept of conformal prediction. Apparently this was introduced in the book Algorithmic Learning in a Random World (2005). As far as I understood, conformal prediction methods are concerned with quantifying their own accuracy. These methods supposedly provide reliable bounds for how they will generalize to new data. The price you pay for this reliability is that you don’t get a point estimate as your prediction; instead of a single predicted class (in classification) you might get a set of possible classes, and instead of a single number (in regression) you get an interval. One of the participants likened this to a confidence interval, but a general one which does not assume a specific distribution of the data. Using this framework, you can get p-values for your predictions and perform hypothesis tests.
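To make the idea of set-valued predictions with p-values a bit more concrete for myself, here is a minimal sketch of what I believe is called split (or inductive) conformal classification. This is my own toy illustration, not something shown at the workshop: the nearest-centroid model, the one-dimensional synthetic data, and all function names (`fit_centroids`, `nonconformity`, `conformal_p_values`, `prediction_set`) are assumptions chosen to keep the example short.

```python
import random


def fit_centroids(pairs):
    """Toy underlying model (an assumption): per-class mean of a 1-D feature."""
    sums, counts = {}, {}
    for x, label in pairs:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def nonconformity(centroids, x, label):
    """How strange x looks if it had this label: distance to that class centroid."""
    return abs(x - centroids[label])


def conformal_p_values(centroids, cal_scores, x):
    """One p-value per candidate label.

    The p-value is the fraction of scores (calibration scores plus the test
    point's own score) that are at least as large as the test point's score.
    """
    n = len(cal_scores)
    p_values = {}
    for label in centroids:
        score = nonconformity(centroids, x, label)
        at_least_as_strange = sum(1 for c in cal_scores if c >= score)
        p_values[label] = (at_least_as_strange + 1) / (n + 1)
    return p_values


def prediction_set(p_values, alpha):
    """Keep every label that cannot be rejected at significance level alpha."""
    return {label for label, p in p_values.items() if p > alpha}


if __name__ == "__main__":
    rng = random.Random(0)

    def sample(n):
        # Synthetic data: class "a" centred at 0, class "b" centred at 3.
        data = []
        for _ in range(n):
            label = rng.choice(["a", "b"])
            mean = 0.0 if label == "a" else 3.0
            data.append((rng.gauss(mean, 1.0), label))
        return data

    train, calib = sample(200), sample(200)
    centroids = fit_centroids(train)
    # Calibration scores use the true labels of held-out examples.
    cal_scores = [nonconformity(centroids, x, y) for x, y in calib]

    for x in (0.0, 1.5, 3.0):
        p = conformal_p_values(centroids, cal_scores, x)
        print(x, p, prediction_set(p, alpha=0.05))
```

A point close to one centroid should typically get a single-class prediction set, while a point halfway between the two may get both classes, which is the "more than one possible class" behaviour described above. As far as I understand, the guarantee is that, if the data are exchangeable, the set contains the true class with probability at least 1 - alpha, regardless of how good the underlying model is; a poor model just produces larger sets.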
There were also some nice ideas about how to embed various types of data as graphs, and how to validate graphs induced from biological experiments: another type of “generalized confidence interval” where the aim is to find “reliable” subnetworks based on repeated subsampling.
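I don’t know the details of the methods that were presented, but the general subsampling idea could be sketched roughly as follows. Everything here is an assumption on my part: the graph is induced by thresholding absolute Pearson correlations between variables, an edge counts as “reliable” if it reappears in a large fraction of random subsamples, and the names `induce_edges` and `stable_edges` and all default parameters are made up for illustration.

```python
import random
from itertools import combinations


def pearson(u, v):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sum((a - mean_u) ** 2 for a in u) ** 0.5
    sd_v = sum((b - mean_v) ** 2 for b in v) ** 0.5
    if sd_u == 0 or sd_v == 0:
        return 0.0
    return cov / (sd_u * sd_v)


def induce_edges(rows, threshold):
    """Induce a graph: connect variables i and j if |correlation| >= threshold.

    rows is a list of samples, each a list with one value per variable.
    """
    n_vars = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_vars)]
    return {
        (i, j)
        for i, j in combinations(range(n_vars), 2)
        if abs(pearson(columns[i], columns[j])) >= threshold
    }


def stable_edges(rows, threshold=0.6, n_rounds=100, frac=0.5, min_freq=0.9, seed=0):
    """Keep edges that show up in at least min_freq of the subsampled graphs."""
    rng = random.Random(seed)
    k = max(2, int(frac * len(rows)))
    counts = {}
    for _ in range(n_rounds):
        subsample = rng.sample(rows, k)  # draw k samples without replacement
        for edge in induce_edges(subsample, threshold):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge for edge, c in counts.items() if c / n_rounds >= min_freq}


if __name__ == "__main__":
    rng = random.Random(1)
    rows = []
    for _ in range(60):
        x = rng.gauss(0, 1)
        # Variable 1 tracks variable 0 closely; variable 2 is independent noise.
        rows.append([x, x + rng.gauss(0, 0.3), rng.gauss(0, 1)])
    print(stable_edges(rows))
```

The point of the design is that an edge which only appears because of a few particular samples will come and go between subsamples, whereas an edge supported by the bulk of the data should survive almost every round.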