Why “probability of 0” does not mean “impossible” | Probabilities of probabilities, part 2
Channel: 3Blue1Brown
Imagine you have a weighted coin. The probability of flipping heads might not be exactly 50/50. It could be 20%, or maybe 90%, or 0%, or 31.41592%. The point is that you just don't know. But imagine that you flipped this coin 10 different times, and 7 of those times it came up heads.
Do you think that the underlying weight of this coin is such that each flip has a 70% chance of coming up heads? If I were to ask you, "Hey, what's the probability that the true probability of flipping heads is 0.7?", what would you say?
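To get a feel for why this question is slippery, here's a quick sketch (in Python, treating the coin's weight `h` as a hypothetical free parameter) of how likely 7 heads in 10 flips is under a few different weights, using the standard binomial formula:

```python
from math import comb

def prob_7_of_10(h):
    # Binomial probability of exactly 7 heads in 10 flips of a coin
    # whose chance of heads on each flip is h
    return comb(10, 7) * h**7 * (1 - h)**3

for h in (0.5, 0.6, 0.7, 0.8):
    print(f"h = {h}: P(7 heads in 10 flips) = {prob_7_of_10(h):.4f}")
```

Notice that h = 0.7 makes the observed data most likely, but weights like 0.6 or 0.8 also produce 7 heads roughly a fifth of the time, so the data alone doesn't pin down a single value of h.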
This is a pretty weird question, for two reasons. First of all, it's asking about a probability of a probability, as in, the value we don't know is itself some kind of long-run frequency for a random event, which, frankly, is hard to think about. But the more pressing weirdness comes from asking about probabilities in the setting of continuous values.
Let's give this unknown probability of flipping heads some kind of name, like h. Keep in mind that h could be any real number from 0 up to 1, ranging from a coin that always flips tails up to one that always flips heads, and everything in between.
So if I ask, "Hey, what's the probability that h is precisely 0.7, as opposed to, say, 0.70000001, or any other nearby value?", well, there's going to be a strong potential for paradox if we're not careful. It feels like no matter how small the answer to this question, it just wouldn't be small enough.
If every specific value within some range, all uncountably infinitely many of them, had a non-zero probability, then even if that probability were minuscule, adding them all up to get the total probability of any one of these values would blow up to infinity.

On the other hand, if all of these probabilities are 0, then aside from the fact that that gives you no useful information about the coin, the total sum of those probabilities would be 0, when it should be 1. After all, this weight of the coin h is *something*, so the probability of it being any one of these values should add up to 1. So, if these values can't all be non-zero, and they can't all be zero, what do you do?
Where we're going with this, by the way, is that I'd like to talk about the very practical question of using data to create meaningful answers to these sorts of probabilities-of-probabilities questions. But for this video, let's take a moment to appreciate how to work with probabilities over continuous values, and resolve this apparent paradox.
The key is not to focus on individual values, but on ranges of values. For example, we might make buckets to represent the probability that h is between, say, 0.8 and 0.85. Also, and this is more important than it might seem, rather than thinking of the *height* of each of these bars as representing the probability, think of the *area* of each one as representing that probability. Where exactly those areas come from is something that we'll answer later.
For right now, just know that in principle, there's *some* answer to the probability of h sitting inside one of these ranges. Our task right now is to take the answers to these very coarse-grained questions and get a more exact understanding of the distribution at the level of each individual input. The natural thing to do would be to consider finer and finer buckets. When you do, the smaller probability of falling into any one of them is accounted for by the thinner *width* of each of these bars, while the heights stay roughly the same.
That's important, because it means that as you take this process to the limit, you approach some kind of smooth curve. So even though all of the individual probabilities of falling into any one particular bucket approach 0, the overall shape of the distribution is preserved, and even refined, in this limit. If, on the other hand, we had let the *heights* of the bars represent probabilities, everything would have gone to 0. In the limit, we would have had a flat line giving no information about the overall shape of the distribution.
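Here's a small numerical sketch of that limiting process, using a made-up density f(h) = 6h(1-h) (purely an illustration, not anything derived from the coin data), whose running total F(h) = 3h² - 2h³ gives the exact accumulated area up to any point:

```python
# As buckets get thinner, the probability of landing in any one of them
# shrinks toward 0, but height = probability / width settles on a fixed value.
# Hypothetical density f(h) = 6h(1-h), with cumulative area F(h) = 3h^2 - 2h^3.

def F(h):
    return 3 * h**2 - 2 * h**3

for width in (0.1, 0.01, 0.001):
    prob = F(0.7 + width) - F(0.7)   # area of the bar starting at h = 0.7
    print(f"width = {width}: prob = {prob:.6f}, height = {prob / width:.4f}")
```

The probabilities shrink roughly tenfold with each thinner bucket, while the heights settle toward f(0.7) = 6 · 0.7 · 0.3 = 1.26, which is exactly the smooth curve the bars approach.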
So, wonderful! Letting area represent probability helps solve this problem. But let me ask you: if the y-axis no longer represents probability, what exactly are the units here? Since probability sits in the area of these bars, or width times height, the height represents a kind of probability per unit in the x-direction, what's known in the business as a "probability density".
The other thing to keep in mind is that the total area of all these bars has to equal 1 at every level of the process. That's something that has to be true for any valid probability distribution.
The idea of probability density is actually really clever when you step back to think about it. As you take things to the limit, even if there are all sorts of paradoxes associated with assigning a probability to each of these uncountably infinitely many values of h between 0 and 1, there's no problem if we associate a probability *density* with each one of them, giving what's known as a "probability density function", or PDF for short.
Any time you see a PDF in the wild, the way to interpret it is that the probability of your random variable lying *between* two values equals the area under this curve between those values. So, for example, what's the probability of getting any one very specific number, like 0.7? Well, the area of an infinitely thin slice is 0, so it's 0. What's the probability of all of them put together? Well, the area under the full curve is 1. You see? Paradox sidestepped.
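As a concrete check (reusing the hypothetical density f(h) = 6h(1-h), with a simple midpoint-rule integration, nothing specific to the video), here's how the area-under-the-curve interpretation plays out in code:

```python
# Probability = area under the PDF between two values.
# Hypothetical density f(h) = 6h(1-h) on [0, 1].

def f(h):
    return 6 * h * (1 - h)

def prob_between(a, b, n=10_000):
    # Midpoint-rule approximation of the area under f between a and b
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

print(prob_between(0.6, 0.8))   # a definite probability for a range
print(prob_between(0.0, 1.0))   # the full area, close to 1
print(prob_between(0.7, 0.7))   # a single value is a width-0 range: prints 0.0
```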
And the way it's been sidestepped is a bit subtle. In normal, finite settings, like rolling a die or drawing a card, the probability that a random value falls into a given collection of possibilities is simply the sum of the probabilities of being any one of them. This feels very intuitive, and it's even true in a countably infinite context.
But to deal with a continuum, the rules themselves have shifted. The probability of falling into a range of values is no longer the sum of the probabilities of each individual value. Instead, probabilities associated with ranges are the fundamental primitive objects, and the only sense in which it's meaningful to talk about an individual value here is to think of it as a range of width 0.
If the idea of the rules changing between a finite setting and a continuous one feels unsettling, well, you'll be happy to know that mathematicians are way ahead of you. There's a field of math called "measure theory", which helps to unite these two settings and make rigorous the idea of associating numbers like probabilities to various subsets of all possibilities in a way that combines and distributes nicely.
For example, say you're in a setting where you have a random number that equals 0 with 50% probability, and the rest of the time it's some positive number according to a distribution that looks like half of a bell curve. This is an awkward middle ground between a finite context, where a single value has a non-zero probability, and a continuous one, where probabilities are found as areas under the appropriate density function. This is the sort of thing that measure theory handles very smoothly. I mention this mainly for the especially curious viewer, and you can find more reading material in the description.
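A minimal sketch of that awkward middle ground (assuming Python's standard `random` module, and taking the half bell curve to be the absolute value of a standard normal, a choice made purely for illustration):

```python
import random

def sample():
    # With probability 1/2, the value is exactly 0:
    # a single point carrying real probability (an "atom").
    if random.random() < 0.5:
        return 0.0
    # Otherwise, draw from half of a bell curve: the continuous part.
    return abs(random.gauss(0, 1))

random.seed(0)
draws = [sample() for _ in range(100_000)]
print(sum(d == 0.0 for d in draws) / len(draws))  # close to 0.5
```

No single density function can describe this distribution, since half the probability sits on the lone point 0; measure theory gives one framework that covers both pieces at once.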
It's a pretty common rule of thumb that if you find yourself using a sum in a discrete context, you use an integral in the continuous context, which is the tool from calculus for finding areas under curves. In fact, you could argue this video would be way shorter if I had just said that at the front and called it good.
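That rule of thumb, side by side in a short sketch (a fair die for the discrete case, and the same hypothetical density f(h) = 6h(1-h) for the continuous one):

```python
# Discrete context: probability of a set of outcomes is a SUM.
# Fair six-sided die, P(outcome is 5 or 6):
p_discrete = sum(1 / 6 for outcome in (5, 6))
print(p_discrete)  # about 0.333

# Continuous context: probability of a range is an INTEGRAL,
# i.e. the area under the density, here approximated by a Riemann sum.
def f(h):
    return 6 * h * (1 - h)

n = 100_000
dx = (0.8 - 0.6) / n
p_continuous = sum(f(0.6 + (i + 0.5) * dx) for i in range(n)) * dx
print(p_continuous)  # close to the exact area 0.248
```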
For my part, though, I've always found it a little unsatisfying to do this blindly without thinking through what it really means. And in fact, if you really dig into the theoretical underpinnings of integrals, what you'd find is that in addition to the way the integral is defined in a typical intro calculus class, there is a separate, more powerful definition based on measure theory, this formal foundation of probability.
Looking back to when I first learned probability, I definitely remember grappling with the weird idea that in continuous settings, like random variables that are real numbers, or throwing a dart at a dartboard, you have a bunch of outcomes that are possible, and yet each one has a probability of 0. And somehow, altogether, they have a probability of 1.
One step in coming to terms with this is to realise that possibility is better tied to probability density than to probability, but just swapping out sums of one for integrals of the other never quite scratched the itch for me. It only really clicked when I realised that the rules for combining probabilities of different sets were not quite what I thought they were, and that there was simply a different axiom system underlying it all.
But anyway, steering away from the theory, somewhere back in the loose direction of application, look back to our original question about the coin with an unknown weight. What we've learned here is that the right question to ask is: what's the probability density function that describes this value h after seeing the outcomes of a few tosses? If you can find that PDF, you can use it to answer questions like "What's the probability that the true probability of flipping heads falls between 0.6 and 0.8?". To find that PDF, join me in the next part.