Bayes theorem, the geometry of changing beliefs - YouTube

Channel: 3Blue1Brown

[0]
The goal is for you to come away from this video understanding one of the most important
[3]
formulas in all of probability, Bayes’ theorem.
[7]
This formula is central to scientific discovery, it’s a core tool in machine learning and
[12]
AI, and it’s even been used for treasure hunting, when in the ’80s a small team led
[17]
by Tommy Thompson used Bayesian search tactics to help uncover a ship that had sunk a century
[23]
and a half earlier carrying what, in today’s terms, amounts to $700,000,000 worth of gold.
[30]
So it's a formula worth understanding.
[33]
But of course there are multiple levels of possible understanding.
[37]
At the simplest there’s just knowing what each part means, so you can plug in numbers.
[42]
Then there’s understanding why it’s true; and later I’m gonna show you a certain diagram that’s helpful
[47]
for rediscovering the formula on the fly as needed.
[51]
Then there’s being able to recognize when you need to use it.
[56]
With the goal of gaining a deeper understanding, you and I will tackle these in reverse order.
[60]
So before dissecting the formula, or explaining the visual that makes it obvious, I’d like
[65]
to tell you about a man named Steve. Listen carefully.
[72]
Steve is very shy and withdrawn, invariably helpful but with very little interest in people
[78]
or in the world of reality. A meek and tidy soul, he has a need for order and structure,
[83]
and a passion for detail.
[85]
Which of the following do you find more likely: “Steve is a librarian”, or “Steve is
[89]
a farmer”?
[91]
Some of you may recognize this as an example from a study conducted by the psychologists
[95]
Daniel Kahneman and Amos Tversky, whose Nobel-prize-winning work was popularized in books like “Thinking
[103]
Fast and Slow” and “The Undoing Project”. They researched human
[108]
judgments, with a frequent focus on when these judgments irrationally contradict what the
[113]
laws of probability suggest they should be.
[116]
The example with Steve, the maybe-librarian-maybe-farmer, illustrates one specific type of irrationality.
[122]
Or maybe I should say “alleged” irrationality; some people debate the conclusion, but more
[127]
on all that in a moment.
[130]
According to Kahneman and Tversky, after people are given this description of Steve as a “meek
[134]
and tidy soul”, most say he is more likely to be a librarian than a farmer. After all,
[139]
these traits line up better with the stereotypical view of a librarian than that of a farmer.
[143]
And according to Kahneman and Tversky, this is irrational.
[147]
The point is not whether people hold correct or biased views about the personalities of
[151]
librarians or farmers, it’s that almost no one thinks to incorporate information about
[156]
the ratio of farmers to librarians into their judgments. In their paper, Kahneman and Tversky
[162]
said that in the US that ratio is about 20 to 1. The numbers I can find for today put
[167]
it much higher than that, but let’s just run with the 20 to 1 ratio since it’s a
[171]
bit easier to illustrate, and proves the point just as well.
[173]
To be clear, no one who is asked this question is expected to have perfect information on the
[179]
actual statistics of farmers, librarians, and their personality traits. But the question
[184]
is whether people even think to consider this ratio, enough to make a rough estimate. Rationality
[190]
is not about knowing facts, it’s about recognizing which facts are relevant.
[196]
If you do think to make this estimate, there’s a pretty simple way to reason about the question
[199]
– which, spoiler alert, involves all the essential reasoning behind Bayes’ theorem.
[204]
You might start by picturing a representative sample of farmers and librarians, say, 200
[209]
farmers and 10 librarians. Then when you hear the meek and tidy soul description, let’s
[215]
say your gut instinct is that 40% of librarians would fit that description and that 10% of
[220]
farmers would. That would mean that from your sample, you’d expect that about 4 librarians
[226]
fit it, and that 20 farmers do. The probability that a random person who fits this description
[235]
is a librarian is 4/24, or 16.7%.
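As a sanity check, the arithmetic of this representative sample can be sketched in a few lines of code. The 10/200 split and the 40%/10% gut estimates are the figures from the example above, not real statistics:

```python
# Representative sample from the example: 10 librarians, 200 farmers.
librarians, farmers = 10, 200

# Gut-instinct estimates: 40% of librarians and 10% of farmers
# fit the "meek and tidy soul" description.
librarians_fitting = 0.4 * librarians   # 4
farmers_fitting = 0.1 * farmers         # 20

# Among everyone fitting the description, what fraction are librarians?
p_librarian_given_description = librarians_fitting / (
    librarians_fitting + farmers_fitting
)
print(round(p_librarian_given_description, 3))  # 0.167, i.e. about 16.7%
```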
[240]
So even if you think a librarian is 4 times as likely as a farmer to fit this description,
[245]
that’s not enough to overcome the fact that there are way more farmers. The upshot, and
[250]
this is the key mantra underlying Bayes’ theorem, is that new evidence should not completely
[255]
determine your beliefs in a vacuum; it should update prior beliefs.
[261]
If this line of reasoning makes sense to you, the way seeing evidence restricts the space
[265]
of possibilities, and the ratio you need to consider after that, then congratulations! You understand the heart of Bayes’ theorem.
[273]
Maybe the numbers you’d estimate would be a little bit different, but what matters is how you fit
[277]
the numbers together to update a belief based on evidence. Here, see if you can take a minute
[285]
to generalize what we just did and write it down as a formula.
[292]
The general situation where Bayes’ theorem is relevant is when you have some hypothesis,
[296]
say that Steve is a librarian, and you see some evidence, say this verbal description
[302]
of Steve as a “meek and tidy soul”, and you want to know the probability that the
[306]
hypothesis holds given that the evidence is true. In the standard notation, this vertical
[312]
bar means “given that”. As in, we’re restricting our view only to the possibilities
[317]
where the evidence holds.
[320]
The first relevant number is the probability that the hypothesis holds before considering
[326]
the new evidence. In our example, that was the 1/21, which came from considering the
[331]
ratio of farmers to librarians in the general population. This is known as the prior.
[338]
After that, we needed to consider the proportion of librarians that fit this description; the
[342]
probability we would see the evidence given that the hypothesis is true. Again, when you
[348]
see this vertical bar, it means we’re talking about a proportion of a limited part of the
[353]
total space of possibilities, in this case, limited to the left side where the hypothesis
[358]
holds. In the context of Bayes’ theorem, this value also has a special name, it’s
[363]
the “likelihood”.
[364]
Similarly, we need to know how much of the other side of our space includes the evidence;
[369]
the probability of seeing the evidence given that our hypothesis isn’t true. This little
[375]
elbow symbol is commonly used to mean “not” in probability.
[380]
Now remember what our final answer was. The probability that our librarian hypothesis
[385]
is true given the evidence is the total number of librarians fitting the evidence, 4, divided
[391]
by the total number of people fitting the evidence, 24.
[395]
Where does that 4 come from? Well it’s the total number of people, times the prior probability
[401]
of being a librarian, giving us the 10 total librarians, times the probability that one
[406]
of those fits the evidence. That same number shows up again in the denominator, but we
[412]
need to add in the total number of people times the proportion who are not librarians,
[417]
times the proportion of those who fit the evidence, which in our example gave 20.
[423]
The total number of people in our example, 210, gets canceled out – which of course
[427]
it should, that was just an arbitrary choice we made for illustration – leaving us finally
[432]
with the more abstract representation purely in terms of probabilities. This, my friends,
[438]
is Bayes’ theorem.
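Written symbolically, with the denominator expanded into the two cases, the formula reads:

```latex
P(H \mid E) \;=\; \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) \;+\; P(\neg H)\,P(E \mid \neg H)}
```

Plugging in the numbers from the example, with prior P(H) = 1/21, likelihood P(E|H) = 0.4, and P(E|¬H) = 0.1, this gives (1/21 · 0.4) / (1/21 · 0.4 + 20/21 · 0.1) = 4/24, about 16.7%, matching the count-based answer.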
[440]
You often see this big denominator written more simply as P(E), the total probability
[446]
of seeing the evidence. In practice, to calculate it, you almost always have to break it down
[454]
into the case where the hypothesis is true, and the one where it isn’t.
[458]
Piling on one final bit of jargon, this final answer is called the “posterior”; it’s
[465]
your belief about the hypothesis after seeing the evidence.
[470]
Writing it all out abstractly might seem more complicated than just thinking through the
[473]
example directly with a representative sample; and yeah, it is! Keep in mind, though, the
[480]
value of a formula like this is that it lets you quantify and systematize the idea of changing
[486]
beliefs. Scientists use this formula when analyzing the extent to which new data validates
[491]
or invalidates their models; programmers use it in building artificial intelligence, where
[497]
you sometimes want to explicitly and numerically model a machine’s belief. And honestly just
[502]
for how you view yourself, your own opinions and what it takes for your mind to change,
[506]
Bayes’ theorem can reframe how you think about thought itself. Putting a formula to
[513]
it is also all the more important as the examples get more intricate.
[517]
However you end up writing it, I’d actually encourage you not to memorize the formula,
[522]
but to draw out this diagram as needed.
[524]
This is sort of the distilled version of thinking with a representative sample where we think
[529]
with areas instead of counts, which is more flexible and easier to sketch on the fly.
[534]
Rather than bringing to mind some specific number of examples, think of the space of
[538]
all possibilities as a 1x1 square. Any event occupies some subset of this space, and the
[546]
probability of that event can be thought about as the area of that subset. For example, I
[552]
like to think of the hypothesis as filling the left part of this square, with a width
[556]
of P(H).
[557]
I recognize I’m being a bit repetitive, but when you see evidence, the space of possibilities
[563]
gets restricted. Crucially, that restriction may not happen evenly between the left and
[568]
the right. So the new probability for the hypothesis is the proportion it occupies in
[574]
this restricted subspace.
[578]
If you happen to think a farmer is just as likely to fit the evidence as a librarian,
[582]
then the proportion doesn’t change, which should make sense. Irrelevant evidence doesn’t
[587]
change your belief. But when these likelihoods are very different, that's when your belief changes a lot.
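That observation is easy to verify numerically: writing the posterior as a proportion of areas, equal likelihoods leave the prior untouched, while very different ones move it a lot. A minimal sketch, where the helper `posterior` and the sample numbers are just for illustration:

```python
def posterior(prior, likelihood, likelihood_not):
    """Bayes' theorem: P(H|E) as the proportion of the
    evidence-fitting area taken up by the hypothesis."""
    evidence = prior * likelihood + (1 - prior) * likelihood_not
    return prior * likelihood / evidence

prior = 1 / 21  # proportion of librarians in the sample

# Irrelevant evidence: a farmer is just as likely to fit as a librarian,
# so the posterior equals the prior.
print(posterior(prior, 0.4, 0.4))

# Very different likelihoods: the belief moves a lot (to 1/6, about 16.7%).
print(posterior(prior, 0.4, 0.1))
```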
[595]
This is actually a good time to step back and consider a few broader takeaways about
[619]
how to make probability more intuitive, beyond Bayes’ theorem. First off, there’s the
[624]
trick of thinking about a representative sample with a specific number of examples, like our
[629]
210 librarians and farmers. There’s actually another Kahneman and Tversky result to this
[635]
effect, which is interesting enough to interject here.
[638]
They did an experiment similar to the one with Steve, but where people were given the
[642]
following description of a fictitious woman named Linda:
[646]
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy.
[652]
As a student, she was deeply concerned with issues of discrimination and social justice,
[656]
and also participated in anti-nuclear demonstrations.
[660]
They were then asked what is more likely: That Linda is a bank teller, or that Linda
[667]
is a bank teller and is active in the feminist movement. 85% of participants said the latter
[674]
is more likely, even though the set of bank tellers active in the feminist movement is a
[681]
subset of the set of bank tellers!
[684]
But, what’s fascinating is that there’s a simple way to rephrase the question that
[691]
dropped this error from 85% to 0. Instead, if participants are told there are 100 people
[698]
who fit this description, and asked to estimate how many of those 100 are bank
[703]
tellers, and how many are bank tellers who are active in the feminist movement, no one
[707]
makes the error. Everyone correctly assigns a higher number to the first option than to
[712]
the second.
[715]
Somehow a phrase like “40 out of 100” kicks our intuition into gear more effectively
[720]
than “40%”, much less “0.4”, or abstractly referencing the idea of something being more
[727]
or less likely.
[729]
That said, representative samples don’t easily capture the continuous nature of probability,
[734]
so turning to area is a nice alternative, not just because of the continuity, but also
[738]
because it’s way easier to sketch out while you’re puzzling over some problem.
[744]
You see, people often think of probability as being the study of uncertainty. While that
[750]
is, of course, how it’s applied in science, the actual math of probability is really just
[756]
the math of proportions, where turning to geometry is exceedingly helpful.
[761]
I mean, if you look at Bayes’ theorem as a statement about proportions – proportions
[769]
of people, of areas, whatever – once you digest what it’s saying, it’s actually
[773]
kind of obvious. Both sides tell you to look at all the cases where the evidence is true,
[778]
and consider the proportion where the hypothesis is also true. That’s it. That’s all it’s
[785]
saying.
[786]
What’s noteworthy is that such a straightforward fact about proportions can become hugely significant
[792]
for science, AI, and any situation where you want to quantify belief. You’ll get a better
[799]
glimpse of this as we get into more examples.
[801]
But before any more examples, we have some unfinished business with Steve. Some psychologists
[808]
debate Kahneman and Tversky’s conclusion, that the rational thing to do is to bring
[812]
to mind the ratio of farmers to librarians. They complain that the context is ambiguous.
[818]
Who is Steve, exactly? Should you expect he’s a randomly sampled American? Or would you
[823]
be better off assuming he’s a friend of these two psychologists interrogating you?
[827]
Or perhaps someone you’re personally likely to know? This assumption determines the prior.
[832]
I, for one, run into many more librarians in a given month than farmers. And needless
[837]
to say, the probability of a librarian or a farmer fitting this description is highly
[842]
open to interpretation.
[843]
But for our purposes, understanding the math, notice how any questions worth debating can
[850]
be pictured in the context of the diagram. Questions of context shift around the prior,
[855]
and questions of personalities and stereotypes shift the relevant likelihoods.
[861]
All that said, whether or not you buy this particular experiment, the ultimate point that
[865]
evidence should not determine beliefs, but update them, is worth tattooing in your mind.
[871]
I’m in no position to say whether this does or doesn’t run against natural human intuition,
[876]
we’ll leave that to the psychologists. What’s more interesting to me is how we can reprogram
[881]
our intuitions to authentically reflect the implications of math, and bringing to mind
[886]
the right image can often do just that.