The Science Behind InterpretML: SHAP - YouTube

Channel: unknown

[0]
>> On this special build edition of the AI Show,
[3]
we'll get to hear from Scott Lundberg,
[5]
Senior Researcher on the Microsoft Research team.
[8]
It is definitely important to debug and
[10]
explain your machine learning models.
[12]
In this video, Scott will explain the science behind
[16]
SHAP values and how they can be used to
[18]
explain and debug your models. Make sure you tune in.
[21]
[MUSIC]
[30]
>> Hi, my name is Scott Lundberg.
[32]
I'm a Senior Researcher here at Microsoft Research AI,
[34]
and I look forward today to
[36]
diving into the background behind SHAP,
[38]
which is a tool in research
[42]
designed to use Shapley values from
[43]
game theory to explain machine learning models.
[46]
So to understand how this works,
[48]
let's start with a model that we're going to explain.
[50]
This model, let's imagine,
[51]
works at a bank and takes
[54]
information about customers like John and
[56]
outputs predictions about their likelihood
[59]
of having repayment problems if the bank were to give him a loan.
[62]
In this case, the risk of repayment problems is a bit high,
[65]
so the bank's unlikely to give him a loan.
[67]
As a data scientist responsible for building this model,
[70]
you may have used a whole variety
[72]
of different packages in the model development process.
[74]
Anything from scikit-learn: linear models,
[77]
trees, gradient boosted decision trees,
[80]
deep networks, all in pursuit
[83]
of producing a good, accurate, high-quality model.
[86]
But in that process, a lot of these things are
[88]
very complicated and very opaque.
[91]
Which means in order to debug these things,
[95]
you need to be able to interpret them.
[96]
So one huge motivation for interpretability and
[98]
explainability is the ability to debug and understand your model.
[101]
Not just for the data scientists though,
[103]
it's also for customers.
[104]
There's even legal requirements in the finance domain.
[107]
But in many areas,
[108]
you need to be able to communicate to
[109]
a customer why a model is making a decision about them.
[113]
Also for businesses that depend on these models.
[117]
Understanding how they work and, hence,
[118]
when they can break is extremely important in order
[122]
to manage the risk these models create for the business.
[126]
All of these motivate how
[128]
important it is to have interpretability and explainability.
[131]
So how does SHAP help with this?
[132]
Well, if we go back to John,
[134]
it's important to understand that
[136]
whenever a model makes a prediction,
[138]
it's always got some prior in mind,
[140]
in the sense that there's always some base rate
[143]
that we would have predicted if we knew nothing about John.
[145]
In this case, it could be the average rate of defaults over
[148]
our training dataset,
[150]
or could be some test dataset that we have in mind,
[153]
or some particular group of people,
[156]
like all the people who got accepted.
[157]
Whatever that background prior knowledge is,
[160]
that's actually where we start when we don't
[162]
know anything about the person we're predicting for.
[164]
But John didn't get predicted the base rate,
[167]
which in this case was 16 percent for our dataset.
[169]
That's just the expected value for our model's output.
[171]
He got predicted 22 percent.
[173]
So what SHAP does, it says, "Hey look,
[175]
we need to explain not 22 percent from zero,
[179]
because zero is just an arbitrary number.
[180]
What we need to explain is how we
[182]
got from the base rate where we knew
[184]
nothing about John to the current prediction for John,
[187]
which is 22 percent."
[189]
How do we go about doing this?
[191]
Well, essentially, we can look at this expectation of
[193]
the model's prediction over our training dataset in this case,
[197]
and then we can fill out John's application one field at a time.
[201]
In this case, we're filling out that his income is not verified.
[205]
Now, what that does is it bumps up
[208]
the expected value of the model by 2.2 percent.
[211]
So we can say that 2.2 percent
[212]
must be attributable to
[214]
the fact that John didn't have his income verified.
[217]
So relative [inaudible] and the training
[218]
dataset, this increases his risk.
[220]
If we do the same thing for his debt-to-income ratio,
[223]
we see that that is at 30,
[225]
which bumps him up to 21 percent.
[227]
Then we see that he had a delinquent payment 10 months ago,
[230]
which further increases his risk up to 22.5 percent.
[233]
Again, we're filling out his application one entry at a time.
[237]
Then we fill in the fact that he had no recent account openings,
[240]
and this drops his risk significantly
[243]
because not applying for credit is a good sign.
[246]
But then finally, we fill in
[248]
the fact that he has 46 years of credit history,
[250]
which you would think would be a really good thing,
[252]
but ironically in this case,
[254]
it turns out that that hurts him
[255]
significantly and bumps his risk up to 22 percent.
[258]
So now what we've done is we've filled out his entire application
[261]
and we've arrived at the prediction of the model,
[265]
but we've done it piece by piece so that
[267]
we can attribute each piece to each feature.
[269]
Hence, explain how we got from when we knew
[272]
nothing about John to the model's final prediction.
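A minimal sketch of this one-ordering walkthrough in code (the model, features, and numbers below are invented for illustration, not the actual bank model from the video):

```python
import numpy as np

# Hypothetical toy risk model on three features; the coefficients and
# data here are illustrative only.
def model(X):
    return 0.16 + 0.022 * X[:, 0] + 0.001 * X[:, 1] - 0.03 * X[:, 2]

background = np.array([[0.0, 30.0, 1.0],   # background ("prior") dataset
                       [1.0, 10.0, 0.0],
                       [0.0, 20.0, 1.0]])
john = np.array([1.0, 30.0, 0.0])          # the application to explain

def attribution_single_order(model, background, x, order):
    """Fill in x's features one at a time (in `order`) and attribute
    each change in the expected model output to the feature just fixed."""
    current = background.copy()
    prev = model(current).mean()           # base rate E[f(X)]
    phi = np.zeros(len(x))
    for j in order:
        current[:, j] = x[j]               # fix feature j to x's value
        new = model(current).mean()
        phi[j] = new - prev
        prev = new
    return phi

phi = attribution_single_order(model, background, john, order=[0, 1, 2])
# By construction the attributions sum from the base rate to the prediction.
```

The telescoping sum guarantees the pieces always add up from the base rate to the individual's prediction, whatever the ordering.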
[275]
Now, let's back up and see how this works for
[279]
a simple linear regression model from scikit-learn, for example.
[283]
So here's a model trained
[284]
on this lending dataset where it's a linear model,
[289]
and I'm showing you a straight line that
[291]
represents the partial dependence plot of that linear model.
[295]
What we can see is, on the x-axis, we have the feature I'm explaining,
[301]
which in this case happens to be annual income.
[303]
Then on the y-axis, we just
[304]
have the axis for the partial dependence plot,
[307]
which happens to be the expected value of the model's
[309]
output as we change one feature,
[311]
which for a linear model is a straight line.
[313]
What I want to highlight here is how easy it is for
[316]
a linear model to read off
[317]
the SHAP value from a partial dependence plot.
[319]
So the gray line here is just the average output of the model.
[322]
But that's the prior base rate we were talking about.
[324]
Then what we can see is that the SHAP value is just the difference
[329]
between that average and
[332]
the partial dependence plot for
[333]
the value of the feature we're interested in.
[335]
For a linear model,
[336]
we can simply look and say,
[339]
"John makes $140,000 a year.
[341]
That puts him at a certain point on the partial dependence plot."
[345]
Then we can just measure that height from the mean value,
[350]
which if it's higher than
[351]
the average, it's just going to be positive.
[353]
If it's lower than the average, it's going to be negative.
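A sketch of that read-off for the linear case, with made-up coefficients and data: the SHAP value of feature j is just coef_j times (x_j minus the feature's mean), the height of the partial dependence line above or below the model's average output.

```python
import numpy as np

# Illustrative linear risk model; the coefficients and data below are
# invented for this sketch, not taken from the video's lending model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # background dataset
coef = np.array([0.5, -1.0])
intercept = 0.1

def predict(X):
    return intercept + X @ coef

x_john = np.array([1.2, -0.4])       # the individual to explain

# SHAP value of feature j for a linear model: the height of the partial
# dependence line at x_j minus the model's average output, i.e.
# coef_j * (x_j - E[X_j]). Positive above the mean, negative below it.
shap_values = coef * (x_john - X.mean(axis=0))

base_rate = predict(X).mean()        # the gray line / prior base rate
# base_rate + sum(shap_values) recovers the individual's prediction.
```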
[355]
We can do the exact same thing for more complicated models like
[358]
generalized additive models, such as with the EBM package.
[363]
What happens there is now you don't have a straight line anymore.
[366]
You can have a much more flexible partial dependence plot.
[369]
But again, the SHAP values,
[371]
because there's no interaction effects going on,
[374]
we still have an additive model,
[375]
the SHAP values are again,
[376]
just exactly the difference between
[379]
the height of the partial dependence plot
[380]
and the expected value of the model.
[382]
So if you were to plot the SHAP values
[384]
for many different individuals,
[386]
you would get essentially a line,
[388]
and that line would be exactly
[389]
the mean centered partial dependence plot.
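Here's a small sketch of why that holds, using two invented smooth component functions in place of a trained GAM: with no interaction effects, each feature's SHAP value is exactly its mean-centered component function, i.e. the mean-centered partial dependence.

```python
import numpy as np

# Sketch of an additive model f(x) = f1(x1) + f2(x2), no interactions.
# The component functions and data are invented for illustration.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 2))

def f1(x):
    return np.sin(x)          # flexible shape, no longer a straight line

def f2(x):
    return 0.3 * x ** 2

def predict(X):
    return f1(X[:, 0]) + f2(X[:, 1])

# SHAP value of each feature for each person is its mean-centered
# component function, which is the mean-centered partial dependence.
shap_1 = f1(X[:, 0]) - f1(X[:, 0]).mean()
shap_2 = f2(X[:, 1]) - f2(X[:, 1]).mean()
base = predict(X).mean()      # expected value of the model
```

Plotting `shap_1` against `X[:, 0]` for everyone would trace out exactly that mean-centered curve, with no vertical dispersion.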
[392]
So more complicated models though,
[395]
are of course where people are most
[396]
interested in this kind of stuff.
[398]
In that case, you can't just use a single ordering.
[401]
You can't just introduce features one at a time,
[403]
because it turns out that the order
[404]
you introduce features in matters.
[406]
If there's an and function or an or function,
[409]
the first or second one you introduce will get all the credit.
[412]
So here's an example on a real dataset where we
[414]
have no recent account openings and 46 years of credit history,
[417]
we've filled out account openings first and then credit history.
[420]
What if, in filling out this application we first
[422]
fill out credit history and then account openings?
[424]
Turns out, it makes a huge difference.
[425]
What that means is there's a strong interaction effect
[428]
between credit history and account openings.
[429]
That's where SHAP comes in to try and fairly
[433]
distribute the effects that are
[435]
going on in higher-order interactions.
[437]
You can say, "How on earth are we going to do this?"
[439]
Well, it turns out we can go back to the 1950s
[441]
and rely on some very solid theory in game theory
[445]
that is all about how to do this fairly in
[448]
complicated games with lots of
[449]
interacting players that have higher-order interaction effects.
[453]
How can we share those interaction effects fairly among all of
[456]
the players such that a set of basic axioms are satisfied?
[460]
It turns out there's only one way to do it.
[463]
It came from values that are now
[465]
called the Shapley values after Lloyd Shapley.
[467]
Lloyd Shapley did a lot of great work in
[469]
game theory and allocation and things like this,
[471]
and actually got a Nobel Prize in 2012.
[472]
So this is based on some solid math.
[475]
So going back to our data scientist,
[478]
you can say, "That's great.
[479]
I'm really convinced by this.
[480]
I think I should use these values.
[482]
How do I compute them?" Well, it turns
[484]
out that they result from averaging
[486]
what we just talked about before,
[487]
using a single ordering, but doing it over all orderings.
[490]
That's computationally intractable,
[492]
and it's even worse because it's
[493]
NP-hard, if you know what that means.
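To make the ordering issue and the averaging concrete, here is a brute-force sketch on a tiny invented AND model: introducing the features in different orders gives different credits, and averaging over all orderings yields the Shapley values. Enumerating every ordering is exactly what becomes intractable as the number of features grows.

```python
import itertools
import numpy as np

# Tiny invented model: an AND of two binary features.
def model(x):
    return 1.0 if (x[0] == 1 and x[1] == 1) else 0.0

background = [(0, 0), (0, 1), (1, 0), (1, 1)]  # background dataset
x = (1, 1)                                     # the individual to explain

def expected(fixed):
    """E[f(X)] with the features in `fixed` pinned to x's values."""
    vals = [model([x[j] if j in fixed else b[j] for j in range(len(x))])
            for b in background]
    return sum(vals) / len(vals)

def attributions(order):
    """Credit from introducing features one at a time in `order`."""
    phi, fixed, prev = [0.0] * len(x), set(), expected(set())
    for j in order:
        fixed.add(j)
        new = expected(fixed)
        phi[j] = new - prev
        prev = new
    return phi

# The two orderings disagree: whichever feature is introduced second
# gets more credit, because it completes the AND.
print(attributions((0, 1)))
print(attributions((1, 0)))

# Shapley values: average the per-ordering credits over all orderings.
orders = list(itertools.permutations(range(len(x))))
shapley = np.mean([attributions(o) for o in orders], axis=0)
```

With two features that's only two orderings, but with n features it's n! of them, which is why the naive average is hopeless in practice.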
[496]
So that's where the real challenge in these values lies,
[500]
it's how to compute these things efficiently.
[502]
I'm not going to go into the algorithms that allow us to do that,
[505]
but that's at the heart of what is in
[506]
the SHAP package and the research behind it.
[509]
It is designed to enable us to compute
[511]
these very well-justified values
[514]
efficiently and effectively on real datasets.
[518]
If we do that for XGBoost,
[519]
we can actually solve it exactly in polynomial time, very quickly.
[523]
Now we see that the SHAP values no longer exactly match
[527]
the partial dependence plot because
[529]
they're accounting for these interaction effects.
[531]
Because when you look at a partial dependence plot,
[533]
you're losing all the higher-order interaction information
[537]
about ands and ors that your model may be doing.
[540]
But the SHAP values account for that and
[542]
then drop that credit down onto each feature.
[545]
So you'll see vertical dispersion when you
[547]
plot many people's SHAP values for a feature.
[550]
So let's do this for
[552]
a particular feature to dive into this credit history.
[555]
Because remember, it was a bit surprising that credit history
[558]
hurt John's predicted risk.
[561]
So if we plot credit history
[563]
versus the SHAP value for that credit history,
[565]
we get a dot for every person.
[567]
Again, a little bit of vertical dispersion
[568]
from the interaction effects.
[570]
Then if we look at John,
[571]
we'll see he's in this tail
[572]
here at the end where he's got a really,
[574]
really long credit history.
[575]
It doesn't take too long before you
[577]
realize that debugging was super important
[579]
because this model was actually
[581]
identifying retirement-age individuals
[583]
based on their long credit histories and
[584]
increasing the risk of default for them.
[587]
This is a big problem because age is a protected class.
[590]
So essentially we found
[592]
that a complicated model
[594]
was able to pull out credit history and use it as a proxy for age.
[599]
So it's really important to explain and debug your models.
[603]
So we've talked a little bit about
[605]
just one example of explainable AI in
[607]
practice: debugging and model exploration.
[610]
But I'd like to highlight the fact
[612]
that there are so many other ways to
[613]
use these types of interpretability tools in your workflow.
[616]
You can monitor models by explaining their error over time.
[619]
You can encode prior beliefs about models and then
[623]
use explanations to actually control your model training process.
[626]
You can talk about customer retention by
[628]
supporting call centers by explaining why a churn model made its prediction.
[635]
We've applied this in decision support for medical settings.
[639]
There's a lot of places where human risk oversight of
[642]
machine learning models is enhanced with explanations.
[645]
In regulatory compliance, there's a lot of need for
[648]
these types of transparency for consumer explanations.
[651]
It can help you better understand
[653]
anti-discrimination as we just showed an example of,
[656]
and of course, risk-management,
[658]
where you're understanding what your model will
[660]
do when economic conditions change.
[662]
All the more important right now.
[664]
Even in scientific discovery,
[666]
you can find that explanations can
[668]
help you better do population subtyping,
[671]
extremely helpful for pattern discovery and
[674]
even signal recovery inside things like DNA.
[678]
All of these are just some of the many
[681]
downstream applications of interpretable ML
[684]
that's supported by these types of tools and research,
[687]
and I hope that this insight has given you a bit of a taste
[690]
and excitement for what can be done here. Thanks.
[694]
[MUSIC]