Shapley Additive Explanations (SHAP) - YouTube

Channel: KIE

Hi everyone, I’m Rob Geada, an engineer on the TrustyAI team, and I’m going to talk a little about my work with Shapley Additive Explanations, or SHAP. Before I go into the specifics of SHAP and how it works, I first have to talk about the mathematical foundation it’s built on, and that’s Shapley values from game theory.

Shapley values were invented by Lloyd Shapley as a way of providing a fair solution to the following question: if we have a coalition C that collaborates to produce a value V, how much did each individual member contribute to that final value? So what does this mean? We have a coalition C, a group of cooperating members that work together to produce some value V, called the coalition value. This could be something like a corporation of employees that together generate a certain profit, or a dinner group running up a restaurant bill. We want to know exactly how much each member contributed to that final coalition value: what share of the profit does each employee deserve, and how much does each person in the dinner party owe to settle the bill?
However, answering this gets tricky when there are interacting effects between members, when certain permutations cause members to contribute more than the sum of their parts. To find a fair answer that takes these interaction effects into account, we can compute the Shapley value for each member of the coalition.

So let’s compute the Shapley value for member 1 of our example coalition. The way this is done is by sampling a coalition that contains member 1, and then looking at the coalition formed by removing that member. We then look at the respective values of these two coalitions and compare the difference between the two. This difference is the marginal contribution of member 1 to the coalition consisting of members 2, 3, and 4: how much member 1 contributed to that specific group. We then enumerate all such pairs of coalitions, that is, all pairs of coalitions that differ only in whether member 1 is included, and look at the marginal contribution in each. The mean marginal contribution is the Shapley value of that member. We can repeat this same process for each member of the coalition, and we’ve found a fair solution to our original question. Mathematically, the whole process looks like this, but all we need to know is that the Shapley value is the average contribution that a particular member makes to the coalition value.
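To make this concrete, here is a minimal sketch (mine, not from the talk) of the exact computation: for each member, every coalition that excludes that member is weighted by the fraction of orderings in which exactly that coalition precedes the member, and the weighted marginal contributions are summed. The `dinner_bill` value function is a made-up example with an interaction effect between members A and B.

```python
from itertools import combinations
from math import factorial

def shapley_values(members, v):
    """Exact Shapley values: the weighted average of each member's
    marginal contribution v(S | {m}) - v(S) over all coalitions S
    that exclude that member."""
    n = len(members)
    phi = {}
    for m in members:
        others = [x for x in members if x != m]
        total = 0.0
        for size in range(n):
            for S in combinations(others, size):
                S = frozenset(S)
                # Fraction of orderings in which exactly S precedes m.
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {m}) - v(S))
        phi[m] = total
    return phi

def dinner_bill(coalition):
    # Hypothetical bill: $10 a head, plus a $5 bottle that A and B
    # only order when they're both at the table.
    return 10 * len(coalition) + (5 if {"A", "B"} <= coalition else 0)

print({m: round(p, 6) for m, p in shapley_values(["A", "B", "C"], dinner_bill).items()})
# → {'A': 12.5, 'B': 12.5, 'C': 10.0}
```

Note that A and B each pick up half of the shared bottle, and the three values sum to the full bill of 35, exactly the fair split we were after.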
Now, translating this concept to model explainability is relatively straightforward, and that’s exactly what Scott Lundberg and Su-In Lee did in 2017 with their paper “A Unified Approach to Interpreting Model Predictions,” where they introduced SHAP. SHAP reframes the Shapley value problem from one where we look at how members of a coalition contribute to a coalition value to one where we look at how individual features contribute to a model’s outputs. They do this in a very specific way, one we can get a clue to from the name of their algorithm: Shapley Additive Explanations. We know what Shapley values are, we know what explanations are, but what do they mean by additive?
Lundberg and Lee define an additive feature attribution as follows: if we have a set of inputs x and a model f(x), we can define a set of simplified local inputs x’ (which usually means that we turn a feature vector into a discrete binary vector, where features are either included or excluded), and we can also define an explanatory model g. What we need to ensure is that, one: if x’ is roughly equal to x, then g(x’) should be roughly equal to f(x); and two: g must take this form, where phi_0 is the null output of the model, that is, the average output of the model, and phi_i is the explained effect of feature i: how much that feature changes the output of the model. This is called its attribution.
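Written out (this is the additive form from the paper, with M the number of features and x’_i ∈ {0, 1} indicating whether feature i is included), the explanatory model is:

```latex
g(x') = \phi_0 + \sum_{i=1}^{M} \phi_i x'_i
```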
If we have these two, we have an explanatory model with additive feature attribution. The advantage of this form of explanation is that it’s really easy to interpret: we can see the exact contribution and importance of each feature just by looking at the phi values.
Now, Lundberg and Lee go on to describe a set of three desirable properties of such an additive feature attribution method: local accuracy, missingness, and consistency. We’ve actually already touched upon local accuracy; it simply says that if the input and the simplified input are roughly the same, then the actual model and the explanatory model should produce roughly the same output. Missingness states that if a feature is excluded from the model, its attribution must be zero; that is, the only thing that can affect the output of the explanation model is the inclusion of features, not their exclusion. Finally, we have consistency (and this one’s a little harder to represent mathematically), but it states that if the original model changes so that a particular feature’s contribution changes, the attribution in the explanatory model cannot change in the opposite direction. So, for example, if we have a new model where a specific feature has a more positive contribution than in the original, the attribution in our new explanatory model cannot decrease.
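For reference, the formal statement from the paper reads roughly as follows, where f_x(z’) is the model evaluated on the simplified input z’, and z’ \ i denotes z’ with feature i excluded:

```latex
f'_x(z') - f'_x(z' \setminus i) \;\ge\; f_x(z') - f_x(z' \setminus i)
\quad \text{for all } z' \in \{0,1\}^M
\;\;\Longrightarrow\;\;
\phi_i(f', x) \ge \phi_i(f, x)
```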
Now, while a bunch of different explanation methods satisfy some of these properties, Lundberg and Lee argue that only SHAP satisfies all three: if the feature attributions in our additive explanatory model are specifically chosen to be the Shapley values of those features, then all three properties are upheld. The problem with this, however, is that computing Shapley values means you have to sample the coalition values for every possible feature permutation, which in a model explainability setting means we have to evaluate our model that number of times. For a model that operates over 4 features, that’s easy enough, just 64 coalitions to sample to get all the Shapley values. For 32 features, that’s over 17 billion samples, which is entirely untenable. To get around this, Lundberg and Lee devise the Shapley Kernel, a means of approximating Shapley values through far fewer samples.
So what we do is pass samples through the model, samples of various feature permutations of the particular datapoint that we’re trying to explain. Of course, most ML models won’t just let you omit a feature, so what we do is define a background dataset B, one that contains a set of representative datapoints that the model was trained over. We then fill in our omitted feature or features with values from the background dataset, while holding the features that are included in the permutation fixed to their original values. We then take the average of the model output over all of these new synthetic datapoints as our model output for that feature permutation, which we’ll call y-bar.
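As a sketch (the names here are mine, not TrustyAI’s or the shap library’s), that substitute-and-average step might look like:

```python
import numpy as np

def coalition_output(model, x, included, background):
    """y-bar for one feature permutation: excluded features are filled
    in from each background row, included features stay fixed at their
    original values, and the model outputs are averaged."""
    synthetic = np.array(background, dtype=float)
    synthetic[:, included] = np.asarray(x, dtype=float)[included]
    return float(model(synthetic).mean())

# Toy linear model over 3 features, with an all-zeros background:
model = lambda X: X @ np.array([1.0, 10.0, 100.0])
background = np.zeros((5, 3))
x = np.array([2.0, 3.0, 4.0])
print(coalition_output(model, x, [1], background))  # only feature 1 included -> 30.0
```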
So once we have a number of samples computed in this way, we can formulate this as a weighted linear regression, with each feature assigned a coefficient. With a very specific choice of weighting for each sample, based on a combination of the total number of features in the model, the number of coalitions with the same number of features as this particular sample, and the number of features included and excluded in this permutation, we ensure that the solution to this weighted linear regression returns coefficients equivalent to the Shapley values. This weighting scheme is the basis of the Shapley Kernel, and the weighted linear regression process as a whole is Kernel SHAP.
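Putting the pieces together, here is a minimal brute-force sketch of Kernel SHAP (mine, not the talk’s; it enumerates every coalition rather than sampling, which is only sensible for a handful of features). The per-sample weight is the Shapley Kernel from the paper, (M − 1) / (C(M, |z|) · |z| · (M − |z|)); the all-included and all-excluded coalitions, where that weight diverges, get a huge finite weight so they act as near-hard constraints.

```python
from itertools import product
from math import comb

import numpy as np

def kernel_shap(f, x, background):
    """Brute-force Kernel SHAP: score every coalition by background
    substitution, then solve a weighted linear regression whose
    coefficients are the approximate Shapley values."""
    M = len(x)
    B = np.array(background, dtype=float)
    rows, ybars, weights = [], [], []
    for z in product([0, 1], repeat=M):
        z = np.array(z)
        synthetic = B.copy()
        synthetic[:, z == 1] = x[z == 1]   # included features stay fixed
        rows.append(z)
        ybars.append(f(synthetic).mean())  # y-bar for this coalition
        s = int(z.sum())
        if s in (0, M):
            weights.append(1e6)            # near-hard endpoint constraints
        else:
            weights.append((M - 1) / (comb(M, s) * s * (M - s)))
    Z = np.column_stack([np.ones(2 ** M), np.array(rows)])  # intercept = phi_0
    W = np.diag(weights)
    y = np.array(ybars)
    phi = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return phi[0], phi[1:]  # (null output, per-feature attributions)

# For a linear model, the SHAP values should be w_i * (x_i - mean(B_i)):
f = lambda X: X @ np.array([1.0, 10.0, 100.0])
x = np.array([2.0, 3.0, 4.0])
background = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
phi0, phi = kernel_shap(f, x, background)
print(phi0, phi)  # phi0 ≈ 111.0, phi ≈ [1, 20, 300]
```

Local accuracy falls out of the construction: the null output plus the attributions recovers the model’s prediction for the original datapoint.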
Now, there are a lot of other forms of SHAP presented in the paper, ones that make use of model-specific assumptions and optimizations to speed up the algorithm and the sampling process, but Kernel SHAP is the one among them that is universal and can be applied to any type of machine learning model. This general applicability is why we chose Kernel SHAP as the first form of SHAP to implement for TrustyAI; we want to be able to cover every possible use-case first, and add specific optimizations later. At the moment of recording this video, the 18th of March 2021, I’d estimate I’m about 85% done with our implementation, and it should be ready in a week or so. Since our version isn’t quite finished yet, I’ll run through an example of the Python SHAP implementation provided by Lundberg and Lee.
So first I’ll grab a dataset to run our example over, and I’ve picked the Boston housing price dataset, which consists of various attributes about Boston neighborhoods and the corresponding house prices within each neighborhood. Next, I’ll train a model over that dataset, in this case an XGBoost regressor. Let’s take a quick look at the performance of our model, just to make sure the model we’ll be explaining is actually any good. Here I’m comparing the predicted house value on the x-axis to the actual house value on the y-axis, and we can see that our plot runs pretty close to y=x, indicating that our model is relatively decent; it has a mean absolute error of 2.27, so about 5-10% error given the magnitude of the predictions, more than good enough for our purposes. So now that we have a model, let’s imagine we’re trying to use it to predict the value of our own house. We’ll take a look at the input features, fill them out, and pass them through the model, and we see that our house has a value of around 22 thousand dollars. But why? To answer that, let’s set up a Kernel SHAP explainer; we’ll pass it our prediction function and some background data. Next, we’ll pass it our sample datapoint, the one we created earlier.
Before we take a look at the SHAP values, let’s make sure local accuracy is upheld, that our explanatory model is equivalent to the original model. We’ll do this by adding the model null to the sum of the SHAP values, and we find that the result is exactly identical to the original prediction. Perfect, our explanatory model agrees with our original model. Now we can take a look at the SHAP values for each feature. Here I present the value of each feature in our sample datapoint, the attribution of each feature, as well as the average value that each feature takes in the background data, just so we know which direction of change caused the attribution.
We can see that the biggest attributions come from the below-average crime rate, the above-average number of rooms, and the above-average percentage of neighbors with low income. The question, however, is: are these true? Did, for example, this last feature cause us to lose exactly $1,190 from the value of our house? We can test this by passing our datapoint back through the model, replacing the last feature with values from the background dataset. The average of these outputs is our new house value, having excluded the Lower Status feature. Our original datapoint predicted 22.09, while the datapoint with the excluded feature predicted 24.42 on average. That’s a change of around 2.33, almost double the change predicted by SHAP. So where did SHAP go wrong?
Well, all of these attributions come from a weighted linear regression, one trained over noisy samples. There is going to be implicit error in each of the attributions, giving each one an error bound. The existing Python implementation by Lundberg and Lee doesn’t report these error bounds, so it’s very possible that the delta of -1.19 is actually -1.19 plus or minus 1.19. This is something the TrustyAI implementation will remedy, so that we can ensure the output attributions and bounds always match reality and are, therefore, trustworthy. But until then, that’s all I’ve got time for; I’d love to hear any questions you may have, and thanks so much for listening!