Follow the Data

A data driven blog


Model explanation followup – anchors, Shapley values, counterfactuals etc.

Last year, I published a blog post about model explanations (a term I will use interchangeably with “model interpretation” here, although there might be some subtle differences). Just eleven months later, so much has happened in this space that that blog post looks completely obsolete. I suspect the surge in interest in model interpretation techniques is partly due to the recently introduced GDPR regulations and partly due to pure momentum from a couple of influential papers. Perhaps practitioners have also started to realize that customers and other model users frequently want the option of peeking into the “black box”. In this post, I’ll try to provide some newer and better resources on model explanation and briefly introduce some new approaches.


This update deals with “black-box” explanation methods, which should work on any type of predictive model and which aim to provide the user of the model with a useful explanation of why a certain prediction was made. In other words, I am talking about local rather than global explanations.

Out of scope for this post are neural network-specific and/or image-oriented methods such as Grad-CAM, Understanding the inner workings of neural networks, etc. I also don’t include things like RandomForestExplainer, although I like it, because it is used for global investigation of feature importance rather than for explaining single predictions.

I’ll assume that you have read the previous post and have at least heard about LIME, which has been an influential model interpretation method in the past few years. Although many methods preceded it, the LIME authors were successful in communicating its usefulness and arguing in favor of its approach. To summarize very briefly what LIME does, it attempts to explain a specific prediction by building a local, sparse, linear surrogate model around that data point and returning the nonzero coefficients of the fit. It does this by creating a “fake” data set by sampling new points around the point to be explained, classifying those points with the model, and then fitting a lasso model to the new “fake” (x, y) set of points. There are some further details, e.g. the contribution of each point to the loss depends on its distance from the original point, and there is also a penalty for having a complex model – please see the “Why should I trust you?” paper for details.
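
To make this a bit more concrete, here is a minimal sketch of what explaining a single prediction with the lime Python package might look like. The dataset and model below are just placeholders I picked for illustration; check the package documentation for the exact API.

```python
# A minimal sketch of explaining one prediction with LIME on tabular data.
# The dataset and model are placeholders, not from the original post.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
# Fit the local, sparse, linear surrogate around one data point and
# return the nonzero coefficients as (feature, weight) pairs
exp = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(exp.as_list())
```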

General sources

I’ve found this ebook, Interpretable Machine Learning, written by Christoph Molnar, a PhD student in Germany, to be really useful. It goes into the reasons for thinking about model interpretability as well as technical details on partial dependence plots, feature importance, feature interactions, LIME and SHAP.

The review paper “A Survey Of Methods For Explaining Black Box Models” by Guidotti et al. does a pretty good job of explaining all the nuances of different types of explanatory models. It also discusses some much earlier, interesting model explanation approaches.

O’Reilly have released an ebook, “An Introduction to Machine Learning Interpretability”, which is available via Safari (you can read it via a free trial). I haven’t had time to read it yet, but I trust it is good based on the authors’ previous blog posts on the subject (they are from H2O), such as Ideas on Interpreting Machine Learning.

New methods

(1) SHAP

Probably my personal favorite of the methods I’ve tried so far, SHAP (SHapley Additive exPlanations) is based on a concept from game theory called Shapley values. These values reflect the optimal way of distributing credit in a multiplayer game based on how much each player contributes to some payoff in the game. In a machine learning context, you can see features as “players” and the payoff as being a prediction (or the difference between a prediction and a naïve baseline prediction). There is a great blog post by Cody Marie Wild that explains this in more detail, and also a double episode of the Linear Digressions podcast which is well worth a listen.

Maybe even more important than the sound theoretical underpinnings, SHAP has a good Python interface with great plots built in. It plugs into standard scikit-learn-type predictors (or really anything you want) with little hassle. It is especially good for tree ensemble models (random forest, gradient boosting): for these models, there are efficient ways of calculating Shapley values without running into combinatorial explosion, so even very big datasets can be visualized in terms of each data point’s Shapley values if a tree ensemble has been used.
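
As an illustration, here is a rough sketch of what the tree-ensemble workflow can look like with the shap Python package. The dataset and model are placeholders, and the exact API may differ between versions.

```python
# A minimal sketch of computing SHAP values for a tree ensemble.
# Dataset and model are placeholders; see the shap documentation for details.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(n_estimators=100).fit(data.data, data.target)

# TreeExplainer uses the fast algorithm that is specific to tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)

# Local explanation: one Shapley value per feature for the first prediction,
# which (together with explainer.expected_value) sums to the model's raw output
print(dict(zip(data.feature_names, shap_values[0])))

# Global overview built from every data point's Shapley values
shap.summary_plot(shap_values, data.data, feature_names=list(data.feature_names))
```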

(1b) Shapley for deep learning: Integrated gradients

For deep learning models, there is an interface for Keras that allows for calculating Shapley-like attribution scores using “integrated gradients” (see the paper “Axiomatic Attribution for Deep Networks“). This is essentially a way of computing gradient-based attributions that does not violate one of the desired conditions (“sensitivity”) of feature attribution, and it works by aggregating gradients over a straight-line path between the point to explain and a reference point.
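
The idea itself is simple enough to sketch in a framework-agnostic way: average the gradients at points interpolated along the straight line from the reference point to the input, and scale by the difference between the two. In the sketch below, grad_fn is a placeholder for whatever your framework uses to compute gradients of the model output with respect to its input; it is not part of any particular library.

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Rough approximation of integrated gradients for a single input.

    x        : the point to explain (1-D numpy array)
    baseline : the reference point, e.g. np.zeros_like(x)
    grad_fn  : callable returning d(model output)/d(input) at a given input;
               supplied by your deep learning framework (placeholder here)
    steps    : number of points used to approximate the path integral
    """
    # Gradients at interpolated points along the straight line baseline -> x
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    # Riemann approximation of the integral, scaled by the input difference
    return (x - baseline) * grads.mean(axis=0)
```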

(2) Counterfactual explanations

A paper from last year, “Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR“, comes at the problem from a slightly different angle, which reflects that it was written by a data ethicist, a computer scientist and a lawyer. It discusses under what conditions an explanation of a prediction is required by the GDPR and when an explanation is actually meaningful to the affected person. The authors arrive at the conclusion that the most useful way to explain a prediction is a counterfactual: a version of the input that changes the input variables as little as possible while ending up with a different prediction. For example, if you are denied a loan by an automated algorithm, it might be sufficient to learn that you would have gotten the loan if your income had been 5% higher. This leads to a method where one looks for “the closest possible world” where the decision would have been different, i.e. one tries to find a point as close as possible to the data point under explanation where the algorithm would have chosen a different class.
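
As a toy illustration of the idea (not the exact optimization scheme from the paper), one can search for the smallest perturbation of the input that flips the prediction by minimizing the distance to the original point plus a penalty for not reaching the desired class. The function below is a hypothetical sketch along those lines; all names are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

def find_counterfactual(x, predict_proba, target_class, lam=10.0):
    """Toy search for a 'closest possible world' counterfactual.

    x             : the original data point (1-D numpy array)
    predict_proba : model function returning class probabilities per row
    target_class  : the class we would like the counterfactual to end up in
    lam           : weight of the 'reach the other prediction' penalty
    """
    def objective(x_prime):
        # Stay as close as possible to the original point...
        dist = np.sum((x_prime - x) ** 2)
        # ...while pushing the predicted probability of the target class towards 1
        p = predict_proba(x_prime.reshape(1, -1))[0, target_class]
        return dist + lam * (1.0 - p) ** 2

    # Gradient-free search; only practical for low-dimensional tabular data
    result = minimize(objective, x0=x.copy(), method="Nelder-Mead")
    return result.x
```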

(3) Anchors

The group that published LIME has extended their work after noticing that LIME explanations can have unclear coverage, i.e. it is not clear whether a given explanation still applies in the region where a new, unseen instance is located. They have written a new paper, “Anchors: High-Precision Model-Agnostic Explanations“, which deals with “anchors”: high-precision explanation rules that “anchor” a prediction locally so that changes to the values of the remaining features don’t matter. On instances where the anchor holds, the prediction is (almost) always the same (the degree to which it has to hold can be controlled with a parameter). This tends to yield compact rules that are also easily understood by users. There is a Python interface for anchors.
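
Rather than reproducing the search algorithm from the paper, here is a toy sketch of the property an anchor is supposed to have: if we fix the features mentioned in the rule and resample everything else (here, from some background data), the prediction should stay the same nearly all of the time. All names below are placeholders for the example.

```python
import numpy as np

def anchor_precision(x, anchored, model_predict, X_background, n_samples=1000, seed=0):
    """Estimate the precision of a candidate anchor for one data point.

    x             : the data point being explained (1-D numpy array)
    anchored      : indices of the features the rule 'anchors' (keeps fixed)
    model_predict : model function returning a class label per row
    X_background  : data to resample the non-anchored features from
    """
    rng = np.random.default_rng(seed)
    original_pred = model_predict(x.reshape(1, -1))[0]
    # Perturb: draw rows from the background data, then pin the anchored features
    rows = rng.integers(0, len(X_background), size=n_samples)
    perturbed = X_background[rows].copy()
    perturbed[:, anchored] = x[anchored]
    # Precision = fraction of perturbed instances that keep the original prediction
    return np.mean(model_predict(perturbed) == original_pred)
```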

I’d be happy to hear about other interesting explanation methods that I’ve missed!

Personal reflections on data science jobs

A bit more than a year ago, I took the plunge and left my academic job to try my luck as a corporate data scientist, first at IBM (obviously a very big company) and now at Peltarion (a startup which I still want to call small although it is growing rapidly). I am not sure if this blog post is premature or not, but in any case I’d like to share some of my experiences and impressions of the different roles I’ve been in. So without further ado, I present my three last data science positions!

(1) Bioinformatics scientist at Stockholm University (May 2010 – May 2017) + freelance gigs.

I was working as a senior bioinformatician at SciLifeLab/Stockholm University in different capacities for seven years. At first I was hired as a general bioinformatics go-to person in a so-called core facility that does DNA sequencing, where I would be involved in a lot of different kinds of things: setting up data pipelines, deciding on quality control routines, trying to figure out what had gone wrong, delivering data to and communicating with customers, performing routine or custom analysis, and sometimes doing some actual research and writing papers. After a while, I moved into a different role where my job was more explicitly to help researchers with data analysis, statistics and programming – more research-oriented and long-term work. In a way, I was an academic data science consultant. Of course, we didn’t really call it “data science” because we were doing science, plain and simple, but in terms of what we did all day, it was in many ways similar to “data science” in industry.

Characteristics of data science in an academic (biology) setting

Note: this is the type of role I have the most experience with, or the most data on, if you will, so I am more confident about the pronouncements here than in the other categories.

  • The final product is almost always a paper. This has some positive and negative implications. On the good side, there is (at least nowadays) a strong focus on reproducibility. On the bad side, there is almost no emphasis on putting predictive models into production or making them easily usable. Code quality can also be spotty as a result.
  • Bioinformatics data scientists tend to be good at data visualization, often in R or Python. They understand the concept of batch effects (drift in distribution parameters) and are good at dealing with high-dimensional data where the number of examples is usually much smaller than the number of dimensions (n << p), for example datasets with measurements of 20,000 genes for 20 different individuals. This makes it necessary for bioinformatics data scientists to be familiar with dimensionality reduction and multivariate methods such as PCA, PLS, t-SNE and so on.
  • They like to use notebooks (Jupyter or R Markdown) to communicate analyses, because these have a similar structure to scientific presentations or manuscripts.
  • They often like to use pipelining tools such as Nextflow, Snakemake or Bpipe to chain operations together.

During this time, I was also consulting part-time (usually less than 10%) for a few startups. From one of these gigs I learned to build very complex processing pipelines with Snakemake. From another, I learned to build obscure functionality for web applications in Shiny. These are both tools that fit naturally into a bioinformatician’s mindset. For yet another customer, I suggested a way to use PCA and MDS to view their data from a global point of view which they had not considered, guiding them onto a path that eventually resulted in this Medium article.


(2) Senior Data Scientist at IBM (June-November 2017).

After having been an academic consultant for quite a while, I decided to try to be a corporate one for a change. I got a position at IBM’s consulting arm, Global Business Services, in Kista outside of Stockholm. Since I was only there for six months, I only had time to participate in a handful of projects, which were mostly related to the manufacturing industry. Fortunately, the familiarity with high-dimensional data that I had from biology stood me in good stead when working on these problems. It was not difficult to apply the skills I had obtained from academia in this setting.

Characteristics of data science in a “big consulting” setting

Note: Given my short experience, I find it hard to tell whether some of the points below are true of “big consulting” in general or specific to the company (IBM in this case).

  • Pragmatism is the word that best summarizes data science in “big consulting”. There isn’t time to think through every wrinkle of a problem as there is (if only in theory) in academia. The end goal is specified by a contract, which you try to fulfill as closely as possible within the allotted amount of time. Notably, your task is not to do as much as possible but to do exactly what has been agreed upon. There is almost always a trade-off between time and model performance.
  • Consultants are good at giving effective presentations. One of the first things I learned was to completely rework the way I had done presentations in academia to more clearly highlight the important findings and tailor them for the management level in companies. Communication is a very important skill for a data science consultant; maybe the most important one.
  • As in academia, there is not that much emphasis on productization, because that part will typically be handled by a software engineering team that comes in after you have completed a proof of concept (PoC), if that PoC leads to a longer engagement. On the other hand, the IBM stack (see below) has good support for deploying models, e.g. via Node-RED.
  • (This part might be more or less company-specific) In my team, we did not make very much use of code version control with GitHub, for a couple of different reasons. Since we worked mostly with short PoC projects, the priority was to find a promising approach in the allotted time, after which the software engineering team would come in to build the final implementation if the contract was prolonged. Also, some of my colleagues worked mainly with non-code tools such as SPSS Modeler, which has its own built-in version control. We ensured reproducibility mainly through the version control mechanism in Box, where we stored scripts, documentation and metadata.
  • Automated data cleaning and model building (AutoML) are important in this setting because of the time constraints. Data cleaning can yield big “quick wins” but is tedious, and a lot can be gained by automating it, for example with packages such as vtreat for R. AutoML with TPOT, auto-sklearn or H2O is interesting for rapidly finding a good-enough model.
  • Feature importance or other types of model explanation are very important for communicating results to customers (also see below). Decision trees are still used surprisingly often, and for random forests and gradient boosting, there is feature importance and various tree-model explanation interfaces.
  • It’s quite common to encounter projects with unbalanced data and to use tools like SMOTE, ADASYN or ROSE to do smart oversampling of the rare class(es). It is also not uncommon that some classes are so rare that one needs to go for an anomaly detection approach rather than standard classification.
  • (Company specific, at least in part) In terms of tooling, there was a larger emphasis on using commercial products (preferably from the IBM ecosystem) such as SPSS Modeler rather than open-source programming languages. Naturally, one has to rapidly become conversant with Bluemix (now called IBM Cloud) offerings and associated products in order to be an effective IBM consultant.


(3) Data scientist at Peltarion (Nov 2017-).

In the autumn of 2017, I got an offer from a deep learning company, Peltarion, that I had applied to before starting at IBM. I decided to take it on the strength of the skills of my new colleagues, many of whom I knew from the Stockholm AI and machine learning scene. As the company is a startup, I have worn many hats during the first six months, working in customer projects, writing documentation and blog posts, testing our deep learning platform, sitting together with beta testers, keeping an eye on competitors and so on.

Characteristics of data science in a startup setting (surely not representative of all startups…)

Note: I suspect that the variance among startups is much higher than among academic groups or big consulting companies, so almost everything here is probably highly company-specific.

  • (Possibly company-specific) There is more emphasis on software engineering practices than in academia or big consulting. Git and GitHub (or some equivalent) are not “nice-to-haves” but the core of the whole enterprise, and frequent pull requests and code reviews are much more common. Virtual environments and containers (e.g. Docker) are important (though these are also found in academic bioinformatics to a large extent).
  • Data scientists in startups tend to think more about deployment and productization of models, because it hits closer to home (there often isn’t a supporting software engineering team to do that for the data scientists, or the startup is building its own deployment functionality, like we are at Peltarion).
  • (Possibly company-specific) Startup data scientists tend to be more informed about the latest technical advances in machine learning. Consultants don’t have time to keep up as much (or to install and play with the latest tools) and academics are often more interested in keeping up with the latest scientific advances in their specific field rather than general ML news. It is also more important to keep track of competitors.
  • (Possibly company-specific, e.g. Spotify uses Luigi) Reproducibility is achieved by writing libraries rather than chaining together operations with pipelines. Continuous integration (CI), with tools like Travis or Jenkins, is much more common than in academia, although it is starting to appear there as well. For us at Peltarion, CI is essential because we need to move fast and make every effort to minimize technical debt that could come back and bite us in the future.

I hope you enjoyed this highly subjective look at different kinds of data scientist positions. Feel free to ask questions in the comments section or provide your own views on different roles.

