Follow the Data

A data-driven blog


Explaining predictions and imbalanced data sets

In many applications it is of interest to know why a specific prediction was made for an individual instance. For example, a credit card company might want to be able to explain why a customer’s card was blocked even though there was no actual fraud. Obviously, this kind of “reverse engineering” of predictions is very important for improving the predictive model, but many (most?) machine learning methods don’t have a straightforward way of explaining why a certain specific prediction was made, as opposed to displaying a “decision surface” or similar that describes in general terms what matters for the classification. There are exceptions, like decision trees, where the resulting classifier shows in a transparent way (using binary tests on the features) what the prediction will be for a specific example, and naïve Bayes classifiers, where it is straightforward to quantify how much each feature value contributes to the prediction.
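
To make the naïve Bayes case concrete, here is a minimal sketch of that per-feature decomposition, using scikit-learn’s GaussianNB and the iris data purely for illustration (this is just the textbook decomposition, not code from any of the papers discussed below):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)

x = X[0]                      # the instance whose prediction we want to explain
pred = clf.predict([x])[0]

# Per-feature log p(x_j | class) under the fitted Gaussian model.
# (clf.var_ is called clf.sigma_ in scikit-learn versions before 1.0.)
def per_feature_loglik(x, cls):
    mu, var = clf.theta_[cls], clf.var_[cls]
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Because the naive Bayes log-posterior is a sum over features, the
# difference in these terms between the predicted class and an alternative
# class (here just the next label, for illustration) shows how much each
# feature pushed the prediction one way or the other.
other = (pred + 1) % clf.classes_.size
contrib = per_feature_loglik(x, pred) - per_feature_loglik(x, other)
for j, c in enumerate(contrib):
    print(f"feature {j}: {c:+.2f}")
```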

Someone asked a question about this on MetaOptimize QA, and the answers contained some interesting pointers to published work on this problem. Alexandre Passos recommends two papers, How To Explain Individual Classification Decisions (pdf link) by Baehrens et al. and An Efficient Explanation of Individual Classifications using Game Theory (pdf link) by Strumbelj and Kononenko. The Baehrens et al. paper defines something called an “explanation vector” for a data point, which is (if I have understood the paper correctly) a vector with the same dimension as the data point itself (the number of features) that points in the direction of maximum “probability flow” away from the class in question. The entries of this vector with large absolute values correspond to features that have a large local influence on which class is predicted. The problem is that this vector typically cannot be calculated directly from the classification model (except in some cases, like Gaussian processes), so it has to be estimated using some sort of smoothing of the data; in the paper, they use Parzen windows.

I would really like to have code to do this (ideally, an R package) but couldn’t find any in the paper.
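
Lacking that, here is a rough sketch of my reading of the method: approximate the class probability with a Gaussian-kernel Parzen estimate and take a finite-difference gradient at the point of interest. All the names (X_train, y_train, x, the class label cls, the bandwidth h) are placeholders, and note that the paper defines the explanation vector in terms of probability flowing away from the class, i.e. essentially the negative of the gradient computed here:

```python
import numpy as np

# A rough sketch of the explanation-vector idea, not code from the paper.
# X_train, y_train, x, cls and the bandwidth h are placeholders that would
# come from your own data and problem.

def parzen_class_prob(x, X_train, y_train, cls, h=1.0):
    """P(class = cls | x) under a Gaussian Parzen-window density estimate."""
    w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * h ** 2))
    return w[y_train == cls].sum() / w.sum()

def explanation_vector(x, X_train, y_train, cls, h=1.0, eps=1e-4):
    """Finite-difference gradient of P(cls | x) at x. Entries with large
    absolute values mark features with a strong local influence on the
    prediction; the paper's explanation vector points away from the class,
    so it is essentially the negative of this gradient."""
    grad = np.zeros(len(x))
    for j in range(len(x)):
        xp, xm = x.astype(float).copy(), x.astype(float).copy()
        xp[j] += eps
        xm[j] -= eps
        grad[j] = (parzen_class_prob(xp, X_train, y_train, cls, h)
                   - parzen_class_prob(xm, X_train, y_train, cls, h)) / (2 * eps)
    return grad
```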

The Strumbelj paper takes a completely different approach, one that I frankly can’t quite wrap my head around. It is based on game theory, specifically the idea that an explanation of a classifier’s prediction can be treated as something called a “coalitional form game”, in which the instance’s feature values form a “coalition” that causes a change in the classifier’s prediction. This lets the authors use something called the “Shapley value” to assess the contribution of each feature.

Again, it would be really nice to have the code for this, even though the authors state that “the explanation method is a straightforward Java implementation of the equations presented in this paper.”
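
In lieu of that code, here is a Monte Carlo sketch of the general idea (my own reconstruction, not the authors’ Java implementation): a feature’s Shapley value is estimated as its average marginal effect on the prediction over random feature orderings, with “absent” features filled in from a randomly drawn training instance. The names predict, x and X_train are placeholders:

```python
import numpy as np

def shapley_contributions(predict, x, X_train, n_samples=500, seed=0):
    """Monte Carlo estimate of each feature's Shapley value for a single
    prediction. `predict` maps a 1-D feature vector to a scalar score."""
    rng = np.random.default_rng(seed)
    n_features = len(x)
    phi = np.zeros(n_features)
    for _ in range(n_samples):
        order = rng.permutation(n_features)
        # Start from a randomly drawn "background" training instance and
        # reveal x's feature values one at a time in a random order; the
        # change in the prediction at each step is credited to that feature.
        b = X_train[rng.integers(len(X_train))].astype(float)
        prev = predict(b)
        for j in order:
            b[j] = x[j]
            cur = predict(b)
            phi[j] += cur - prev
            prev = cur
    return phi / n_samples
```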

On another “practical machine learning tips” note, the newish and very good blog p-value.info linked to a very interesting article, Learning from Imbalanced Data. In working with genome-scale biological data, we often encounter imbalanced data scenarios where the positive class may contain, say, 1,000 instances and the negative class 1,000,000 instances. I knew from experience that it is not straightforward to build and assess the performance of classifiers for this type of data set (for example, concepts like accuracy and ROC-AUC become highly problematic), but like Carl at p-value.info I was surprised by the sheer variety of approaches outlined in this review. For instance, there are very nice expositions of the dangers of under- and oversampling and of how to perform more informed versions of those. Also, I had realized that cost-sensitive evaluation methods could be useful (it may be much more important to classify instances in the rare class correctly, for example), but before reading this review I hadn’t thought about how to integrate cost-sensitivity into the actual learning method.
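
As a small illustration of that last point, here is a minimal sketch (on synthetic data, not the genomics setting above) of pushing cost-sensitivity into the learning step itself by re-weighting the rare class in the loss, which scikit-learn exposes as the class_weight parameter:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data set: roughly 1% positives.
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" re-weights each class inversely to its frequency, so errors on
# the rare class cost correspondingly more during fitting.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# With heavy imbalance, precision-recall summaries are usually more
# informative than accuracy or ROC-AUC.
scores = clf.predict_proba(X_test)[:, 1]
print("average precision:", average_precision_score(y_test, scores))
```

An explicit cost ratio can be encoded the same way, e.g. class_weight={0: 1, 1: 100} to make mistakes on the rare class a hundred times more expensive during training.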

Lessons from medical predictive analytics competitions

Presentations from the organizers and best performing teams of the Systems Biology Improver Challenge, which I have covered before, have been put online. There is a ton of interesting stuff here for people like me who are interested in prediction contests, machine learning, and/or medical bioinformatics.

To rewind a little bit, the SBV Improver Challenge was a set of competitions somewhat similar to Kaggle’s (indeed, one of the presentations is by Will Cukierski of Kaggle) in that participants needed to build a predictive model for classifying diseased versus healthy samples for four diseases based on gene expression data (simplifying a little bit), and then predict the disease status for an unlabeled test set. The competitions differed from Kaggle’s in that the training data sets were not fixed – the organizers just pointed to some suggested public data sets, which I think was a nice way to do it.

The point of these sub-competitions was to establish whether gene expression profiles are truly predictive of disease states in general or just in specific cases.

Anyway – I enjoyed browsing these presentations a lot. Some take-home messages:

  • Different diseases vary immensely in how much predictive signal the gene expression data contains. Psoriasis was much easier to classify correctly based on these data than the other diseases: multiple sclerosis, COPD and lung cancer. The nature of the disease was an incomparably more important variable than anything related to normalization, algorithms, pre-processing, etc.
  • It was suggested that providing the whole test set at once may be a bad idea, because it may reveal information (such as cluster structure) that would not be known in a real-life scenario (for instance, if you were to go to the doctor to get a gene expression measurement from your own tissue). One proposal was that next time, the platform would just provide one data point at a time to be predicted. Indeed, I have thought a lot about this “single-sample classification” problem in connection with gene expression data lately. As anyone who has worked with (microarray or RNA-seq) gene expression data knows, there are severe experimental batch effects in these data; they are usually removed by normalizing all data points together, but that cannot always be done in practice.
  • In these competitions, the alleged superiority of random forest classifiers and GLMs (cf. http://strataconf.com/strata2012/public/schedule/detail/22658) was not in evidence; “simple methods” such as linear discriminant analysis and Mann-Whitney tests for feature selection performed best. Yes, I know there is probably no statistical significance here due to the small sample size …
  • Speaking of random forests and GLMs, I was surprised to learn about the RGLM (Random Generalized Linear Model), which is kind of a mix of the two: generalized linear models are built on random subsets of the training examples and of the features (as in random forest classifiers), and the predictions of many such models are aggregated to get a final prediction (see the sketch after this list). The presentation is here.
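
To make that concrete, here is a simplified sketch of the RGLM idea as I understand it (the actual method includes further refinements; fit_rglm and predict_rglm are my own illustrative names, not the authors’ code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_rglm(X, y, n_models=100, feature_frac=0.5, seed=0):
    """Fit many GLMs, each on a bootstrap sample of the examples and a
    random subset of the features (as in a random forest)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, int(feature_frac * p))
    models = []
    for _ in range(n_models):
        rows = rng.integers(n, size=n)                # bootstrap the examples
        cols = rng.choice(p, size=k, replace=False)   # random feature subset
        glm = LogisticRegression(max_iter=1000).fit(X[rows][:, cols], y[rows])
        models.append((cols, glm))
    return models

def predict_rglm(models, X):
    """Aggregate the member GLMs by averaging their class-1 probabilities."""
    probs = [glm.predict_proba(X[:, cols])[:, 1] for cols, glm in models]
    return np.mean(probs, axis=0)
```

Presumably part of the appeal is that each member remains an interpretable GLM while the ensemble gets the variance reduction of bagging.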

The “lessons learned” presentation discusses (most of) these points and more, and is interesting throughout.

Data data data

I hadn’t used Google Plus much since I signed up this summer, but that is changing now that they have launched the “communities” concept and I have found the Data data data and Machine Learning communities, where a lot of interesting discussions can be found, by “big names” and “smart unknowns” alike. Check them out if you haven’t done so already.
