Explaining predictions and imbalanced data sets
In many applications it is of interest to know why a specific prediction was made for an individual instance. For example, a credit card company might want to be able to explain why a customer’s card was blocked even though there was no actual fraud. Obviously, this kind of “reverse engineering” of predictions is very important for improving the predictive model, but many (most?) machine learning methods don’t have a straightforward way of explaining why a specific prediction was made, as opposed to displaying a “decision surface” or similar that describes in general what matters for the classification. There are exceptions, like decision trees, where the resulting classifier shows in a transparent way (using binary tests on the features) what the prediction will be for a specific example, and naïve Bayes classifiers, where it is straightforward to quantify how much each feature value contributes to the prediction.
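For naïve Bayes, that per-feature decomposition can be made concrete. Here is a minimal sketch (the function name and dictionary layout are my own, and the likelihood tables in the example are made up): because the model is a product of per-feature likelihoods, each observed feature value contributes its log-likelihood ratio, additively, to the overall log-odds of the prediction.

```python
import math

def nb_feature_contributions(priors, likelihoods, instance):
    """Per-feature contribution to a naive Bayes prediction, expressed as
    the log-odds each observed feature value adds for class 1 vs class 0.
    The total log-odds is the prior term plus the sum of contributions."""
    contribs = {}
    for feat, value in instance.items():
        p1 = likelihoods[feat][value][1]  # P(value | class 1)
        p0 = likelihoods[feat][value][0]  # P(value | class 0)
        contribs[feat] = math.log(p1 / p0)
    prior_term = math.log(priors[1] / priors[0])
    return prior_term, contribs
```

A feature whose value is equally likely under both classes contributes exactly zero, which is what makes this decomposition a genuine explanation of the individual prediction.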
Someone asked a question about this on MetaOptimize QA and the answers contained some interesting pointers to published work on this problem. Alexandre Passos recommends two papers: How To Explain Individual Classification Decisions (pdf link) by Baehrens et al., and An Efficient Explanation of Individual Classifications using Game Theory (pdf link) by Strumbelj and Kononenko. The Baehrens et al. paper defines something called an “explanation vector” for a data point, which is (if I have understood the paper correctly) a vector that has the same dimension as the data point itself (the number of features) and that points in the direction of maximum “probability flow” away from the class in question. The entries in this vector that have large absolute values correspond to features that have a large local influence on which class is predicted. The problem is that this vector typically cannot be calculated directly from the classification model (except in some cases, like Gaussian processes), so it has to be estimated using some sort of smoothing of the data; in the paper they use Parzen windows.
I would really like to have code to do this (ideally, an R package) but couldn’t find any in the paper.
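In the meantime, the core idea can be sketched in a few lines of Python. This is my own rough reading of the approach, not the authors' code, and the function names are mine: smooth the training data with Parzen windows to get a class-probability estimate, then take a finite-difference gradient at the point of interest. (Note the sign convention: this gradient points toward the class; the paper's explanation vector points in the opposite direction.)

```python
import numpy as np

def parzen_class_prob(x, X, y, cls, bandwidth=1.0):
    """Parzen-window (Gaussian kernel) estimate of p(class = cls | x)."""
    sq_dists = np.sum((X - x) ** 2, axis=1)
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    total = weights.sum()
    if total == 0.0:
        return 0.5  # no mass near x; fall back to an uninformative value
    return weights[y == cls].sum() / total

def explanation_vector(x, X, y, cls, bandwidth=1.0, eps=1e-4):
    """Finite-difference gradient of the smoothed class probability at x.
    Entries with large absolute values flag features that have a strong
    local influence on the prediction."""
    grad = np.zeros_like(x, dtype=float)
    for j in range(len(x)):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += eps
        x_minus[j] -= eps
        grad[j] = (parzen_class_prob(x_plus, X, y, cls, bandwidth)
                   - parzen_class_prob(x_minus, X, y, cls, bandwidth)) / (2 * eps)
    return grad
```

On data where only one feature actually drives the class, the gradient entry for that feature dominates, which is exactly the "local influence" reading of the explanation vector.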
The Strumbelj paper uses a completely different approach, which I frankly can’t quite wrap my head around, based on game theory: specifically, the idea that an explanation of a classifier’s prediction can be treated as something called a “coalitional form game”, where the instance’s feature values form a “coalition” that causes a change in the classifier’s prediction. This lets them use the “Shapley value” to assess the contribution of each feature.
Again, it would be really nice to have the code for this, even though the authors state that “the explanation method is a straightforward Java implementation of the equations presented in this paper.”
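The core computation is easier to grasp in code than in the game-theoretic framing. Below is a sketch of the standard sampling approximation of Shapley values (my own simplification, not the authors' Java implementation; the function name and the use of a single baseline vector instead of averaging over the data are assumptions): average, over random feature orderings, how much the prediction changes when each feature's value is revealed.

```python
import random

def shapley_contributions(predict, instance, baseline, n_samples=2000, seed=0):
    """Monte Carlo estimate of each feature's Shapley value: the average
    change in the prediction caused by revealing that feature, over random
    feature orderings. Features not yet revealed take their value from
    `baseline` (a stand-in for averaging over the data distribution)."""
    rng = random.Random(seed)
    n = len(instance)
    contrib = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        prev_score = predict(current)
        for j in order:
            current[j] = instance[j]  # reveal feature j
            score = predict(current)
            contrib[j] += score - prev_score
            prev_score = score
    return [c / n_samples for c in contrib]
```

A useful sanity check: by construction, the contributions always sum to the difference between the prediction for the instance and the prediction for the baseline, so the "explanation" fully accounts for the model's output.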
On another “practical machine learning tips” note, the newish and very good blog p-value.info linked to a very interesting article, Learning from Imbalanced Data. In working with genome-scale biological data, we often encounter imbalanced data scenarios where the positive class may contain, say, 1,000 instances, and the negative class 1,000,000 instances. I knew from experience that it is not straightforward to build and assess the performance of classifiers for this type of data set (for example, concepts like accuracy and ROC-AUC become highly problematic), but like Carl at p-value.info I was surprised by the sheer variety of approaches outlined in this review. For instance, there are very nice expositions of the dangers of under- and oversampling and how to perform more informed versions of those. Also, I had realized that cost-sensitive evaluation methods could be useful (it may be much more important to classify instances in the rare class correctly, for example), but before reading this review I hadn’t thought about how to integrate cost-sensitivity into the actual learning method.
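To make that last point concrete, here is a minimal sketch of folding misclassification costs into the learning step itself, using class-weighted logistic regression trained with plain gradient descent (the function name and parameters are my own, and real implementations, such as scikit-learn's class_weight option, are more robust):

```python
import numpy as np

def fit_weighted_logreg(X, y, cost_pos=1.0, cost_neg=1.0, lr=1.0, n_iter=3000):
    """Logistic regression whose loss weights errors on each class by a
    misclassification cost, so the rare positive class can be made to
    matter more during training rather than only at evaluation time."""
    w = np.zeros(X.shape[1])
    b = 0.0
    sample_w = np.where(y == 1, cost_pos, cost_neg)
    sample_w = sample_w / sample_w.sum()  # normalise so lr behaves consistently
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = sample_w * (p - y)          # gradient of cost-weighted log-loss
        w -= lr * (X.T @ err)
        b -= lr * err.sum()
    return w, b
```

With equal costs and a 100:1 class ratio, the fitted decision boundary is pushed far into the rare class's territory; raising the cost on the positive class pulls it back, trading false positives for recall on the class you actually care about.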