Follow the Data

A data driven blog

Model explanation followup – anchors, Shapley values, counterfactuals etc.

Last year, I published a blog post about model explanations (a term I will use interchangeably with “model interpretation” here, although there might be some subtle differences). Just eleven months later, so much has happened in this space that the earlier post looks completely obsolete. I suspect the surge in interest in model interpretation techniques is partly due to the recently introduced GDPR regulations and partly due to pure momentum from a couple of influential papers. Perhaps practitioners have also started to realize that customers or other model users frequently want to have the option of peeking into the “black box”. In this post, I’ll try to provide some newer and better resources on model explanation and briefly introduce some new approaches.


This update deals with “black-box” explanation methods which should work on any type of predictive model and the aim of which is to provide the user of a predictive model with a useful explanation of why a certain prediction was made. In other words, I am talking about local rather than global explanations.

Out of scope for this post are neural network-specific and/or image-oriented methods such as Grad-CAM, Understanding the inner workings of neural networks, etc. I also don’t include things like RandomForestExplainer, although I like it, because it is used for global investigation of feature importance rather than for explaining single predictions.

I’ll assume that you have read the previous post and have at least heard about LIME, which has been an influential model interpretation method in the past few years. Although many methods preceded it, the LIME authors were successful in communicating its usefulness and arguing in favor of its approach. To summarize very briefly what LIME does, it attempts to explain a specific prediction by building a local, sparse, linear surrogate model around that data point and returning the nonzero coefficients of the fit. It does this by creating a “fake” data set by sampling new points around the point to be explained, classifying those points with the model, and then fitting a lasso model to the new “fake” (x, y) set of points. There are some further details, e.g. the contribution of each point to the loss depends on its distance from the original point, and there is also a penalty for having a complex model – please see the “Why should I trust you?” paper for details.
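To make the recipe above a bit more concrete, here is a minimal sketch of a LIME-like procedure in plain Python. It is not the code of the lime package: the black-box model, the Gaussian sampling, the exponential proximity kernel and all parameter names (scale, kernel_width, alpha) are assumptions chosen for illustration, and the lasso is fitted with a simple coordinate descent loop.

```python
import math
import random


def black_box(x):
    """Hypothetical model: strong effect of x[0], weak effect of x[1], none of x[2]."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - 0.5 * x[1])))


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lime_like_explanation(predict, x, n_samples=500, scale=0.5,
                          kernel_width=1.0, alpha=0.01, n_iter=50, seed=0):
    """Fit a sparse, locally weighted linear surrogate around the point x."""
    rng = random.Random(seed)
    d = len(x)
    # 1. Create a "fake" data set by sampling offsets around x.
    offsets = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n_samples)]
    # 2. Label the perturbed points with the black-box model.
    labels = [predict([xi + oi for xi, oi in zip(x, o)]) for o in offsets]
    # 3. Points closer to x get a larger weight in the loss.
    weights = [math.exp(-sum(oi * oi for oi in o) / kernel_width ** 2) for o in offsets]
    total = sum(weights)
    # Center features and labels (weighted) so that no intercept is needed.
    y_mean = sum(w * y for w, y in zip(weights, labels)) / total
    o_mean = [sum(w * o[j] for w, o in zip(weights, offsets)) / total for j in range(d)]
    Z = [[o[j] - o_mean[j] for j in range(d)] for o in offsets]
    y = [label - y_mean for label in labels]
    # 4. Weighted lasso via coordinate descent.
    coef = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            rho = norm = 0.0
            for w, z, target in zip(weights, Z, y):
                partial = target - sum(coef[k] * z[k] for k in range(d) if k != j)
                rho += w * z[j] * partial
                norm += w * z[j] * z[j]
            coef[j] = soft_threshold(rho / total, alpha) / (norm / total)
    return coef


explanation = lime_like_explanation(black_box, [0.0, 0.0, 0.0])
print(explanation)
```

With these assumed settings, the irrelevant third feature should be dropped by the L1 penalty, while the first feature should get by far the largest coefficient. The real method additionally works on an interpretable representation of the input, which this sketch skips.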

General sources

I’ve found this ebook, Interpretable Machine Learning, written by Christoph Molnar, a PhD student in Germany, to be really useful. It goes into the reasons for thinking about model interpretability as well as technical details on partial dependence plots, feature importance, feature interactions, LIME and SHAP.

The review paper “A Survey Of Methods For Explaining Black Box Models” by Guidotti et al. does a pretty good job of explaining all the nuances of different types of explanatory models. It also discusses some much earlier, interesting model explanation approaches.

O’Reilly have released an ebook, “An Introduction to Machine Learning Interpretability” which is available via Safari (you can read it via a free trial). I haven’t had time to read it yet, but trust it is good based on the authors’ (they are from H2O) previous blog posts on the subject, such as Ideas on Interpreting Machine Learning.

New methods

(1) SHAP

Probably my personal favorite of the methods I’ve tried so far, SHAP (SHapley Additive exPlanations) is based on a concept from game theory called Shapley values. These values reflect the optimal way of distributing credit in a multiplayer game based on how much each player contributes to some payoff in the game. In a machine learning context, you can see features as “players” and the payoff as being a prediction (or the difference between a prediction and a naïve baseline prediction.) There is a great blog post by Cody Marie Wild that explains this in more detail, and also a double episode of the Linear Digressions podcast which is well worth a listen.
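As an illustration of the game-theoretic idea, here is a small sketch that computes exact Shapley values by brute force over all feature orderings. This is not how the SHAP package works internally. The toy model, the feature names and the choice to represent an “absent” feature by a fixed baseline value are all assumptions made for the example, and the approach only scales to a handful of features.

```python
from itertools import permutations


def model(features):
    """Hypothetical model with an interaction between income and age."""
    income, debt, age = features
    return 2.0 * income - 1.0 * debt + 0.5 * income * age


def shapley_values(value_fn, x, baseline):
    """Average each feature's marginal contribution over all orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with all features "absent"
        previous = value_fn(current)
        for i in order:
            current[i] = x[i]             # feature i "joins the game"
            new = value_fn(current)
            totals[i] += new - previous   # its marginal contribution
            previous = new
    return [t / len(orderings) for t in totals]


x = [3.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)       # credit assigned to income, debt, age
print(sum(phi))  # equals model(x) - model(baseline)
```

The credit for the income-age interaction is split equally between the two features, and the values add up to the difference between the prediction and the baseline prediction, which is the “payoff” being distributed.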

Maybe even more important than the sound theoretical underpinnings, SHAP has a good Python interface with great plots built in. It plugs into standard scikit-learn type predictors (or really anything you want) with little hassle. It is especially good for tree ensemble models (random forest, gradient boosting). For these models, there are efficient ways of calculating Shapley values without running into combinatorial explosion, and therefore even very big datasets can be visualized in terms of each data point’s Shapley values if a tree ensemble has been used.

(1b) Shapley for deep learning: Integrated gradients

For deep learning models, there is an interface for Keras that allows for calculating Shapley score-like quantities using “integrated gradients” (see the paper “Axiomatic Attribution for Deep Networks”), which is basically a way to compute gradient-based attributions without violating one of the conditions (“sensitivity”) of feature attribution. This is done by aggregating gradients over a straight-line path between a reference point and the point to explain.
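A tiny numerical sketch may help show why the path matters. The model below is a hypothetical function that saturates in its first input, with a hand-written gradient standing in for what a deep learning framework would compute automatically. The names and the number of integration steps are assumptions.

```python
def model(x):
    """Hypothetical model: saturates in x[0] once x[0] >= 1, linear in x[1]."""
    return 1.0 - max(0.0, 1.0 - x[0]) + 0.5 * x[1]


def gradient(x):
    """Hand-written gradient of the model above."""
    return [1.0 if x[0] < 1.0 else 0.0, 0.5]


def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Average the gradient along the straight line from baseline to x,
    then scale by (x - baseline)."""
    n = len(x)
    average = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule for the path integral
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(n):
            average[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, average)]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
print(gradient(x))                                   # plain gradient at x
print(integrated_gradients(gradient, x, baseline))   # path-averaged attribution
print(model(x) - model(baseline))                    # total change to be explained
```

The plain gradient at the point assigns nothing to the first feature, because the model is saturated there, even though changing that feature from the reference value clearly changes the output. The path-averaged attribution does give it credit, and the attributions sum to the difference between the two predictions.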

(2) Counterfactual explanations

A paper from last year, “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR”, comes at the problem from a slightly different angle, which reflects that it was written by a data ethicist, a computer scientist, and a lawyer. It discusses under what conditions an explanation of a prediction is required by GDPR and when it is actually meaningful to the affected person. The authors arrive at the conclusion that the most useful way to explain a prediction is a counterfactual that changes the input variables as little as possible while ending up with a different prediction. For example, if you are denied a loan by an automated algorithm, it might be sufficient to learn that you would have gotten the loan if your income had been 5% higher. This leads to a method where one looks for “the closest possible world” where the decision would have been different. That is, one tries to find a point as close as possible to the data point under explanation where the algorithm would have chosen a different class.
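The loan example can be sketched as a brute-force search. Note that this is a simplification and not the paper's method, which formulates the search as an optimization problem with a distance function. Here I only try changing one feature at a time in 1% steps, and the loan model, its threshold and the applicant's numbers are all made up.

```python
def loan_model(applicant):
    """Hypothetical black box: approve when income sufficiently exceeds debt."""
    return applicant["income"] - 2 * applicant["debt"] >= 40400


def closest_counterfactual(predict, applicant, max_percent=50):
    """Return the smallest single-feature percentage change that flips the decision."""
    original = predict(applicant)
    for percent in range(1, max_percent + 1):   # try the smallest changes first
        for name in sorted(applicant):
            for sign in (+1, -1):
                candidate = dict(applicant)
                candidate[name] = applicant[name] * (100 + sign * percent) / 100
                if predict(candidate) != original:
                    return name, sign * percent, candidate
    return None  # no counterfactual found within max_percent


applicant = {"income": 50000, "debt": 6000}
print(loan_model(applicant))  # False: the loan is denied
print(closest_counterfactual(loan_model, applicant))
```

For this made-up applicant, the search reports that a 5% higher income would have been enough, whereas the debt would have had to drop by a much larger fraction to change the outcome.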

(3) Anchors

The group that published LIME has extended their work after noticing that LIME explanations can have unclear coverage, i.e. it is not clear whether a given explanation applies in the region where an unseen instance is located. They have written a new paper, “Anchors: High-Precision Model-Agnostic Explanations”, which deals with “anchors”: high-precision explanation rules that “anchor” a prediction locally so that changes to the values of the other features don’t matter. On instances where the anchor holds, the prediction is (almost) always the same. (The degree to which it has to hold can be controlled with a parameter.) This tends to yield compact rules that are also easily understood by users. There is a Python interface for anchors.
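Here is a rough sketch of what “precision” means for an anchor, together with a naive greedy search for one. The paper uses a more sophisticated bandit-based search, and this is not the interface of the anchor package. The feature values, the toy model and the 0.95 threshold are assumptions for illustration.

```python
import random

# Hypothetical categorical features and their possible values.
FEATURE_VALUES = {
    "education": ["primary", "secondary", "university"],
    "married": [True, False],
    "hours": ["part-time", "full-time"],
}


def model(person):
    """Hypothetical black box: predicts a high income for full-time university graduates."""
    return person["education"] == "university" and person["hours"] == "full-time"


def precision(predict, instance, anchor, rng, n_samples=1000):
    """Fraction of perturbed samples that keep the prediction when the
    anchored features are fixed and all other features are resampled."""
    target = predict(instance)
    hits = 0
    for _ in range(n_samples):
        sample = {
            name: instance[name] if name in anchor else rng.choice(values)
            for name, values in FEATURE_VALUES.items()
        }
        hits += predict(sample) == target
    return hits / n_samples


def find_anchor(predict, instance, threshold=0.95, seed=0):
    """Greedily add the feature that raises precision the most."""
    rng = random.Random(seed)
    anchor = set()
    while len(anchor) < len(instance) and precision(predict, instance, anchor, rng) < threshold:
        remaining = sorted(set(instance) - anchor)
        best = max(remaining, key=lambda name: precision(predict, instance, anchor | {name}, rng))
        anchor.add(best)
    return anchor


person = {"education": "university", "married": True, "hours": "full-time"}
print(find_anchor(model, person))
```

For this toy model the resulting rule fixes education and hours, and marital status is left out because changing it never affects the prediction. The threshold argument plays the role of the parameter that controls how strictly the anchor has to hold.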

I’d be happy to hear about other interesting explanation methods that I’ve missed!
