Why “probability of 0” does not mean “impossible” | Probabilities of probabilities, part 2
Channel: 3Blue1Brown
Imagine you have a weighted coin. The probability of flipping heads might not be exactly 50/50. It could be 20%, or maybe 90%, or 0%, or 31.41592%. The point is that you just don't know. But imagine that you flipped this coin 10 different times, and 7 of those times it came up heads.
Do you think that the underlying weight of this coin is such that each flip has a 70% chance of coming up heads? If I were to ask you, "Hey, what's the probability that the true probability of flipping heads is 0.7?", what would you say?
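To get a feel for why this question is slippery, here's a quick sketch (in Python, treating the coin's weight `h` as a hypothetical free parameter) of how likely 7 heads in 10 flips is under a few different weights, using the standard binomial formula:

```python
from math import comb

def prob_7_of_10(h):
    # Binomial probability of exactly 7 heads in 10 flips of a coin
    # whose chance of heads on each flip is h
    return comb(10, 7) * h**7 * (1 - h)**3

for h in (0.5, 0.6, 0.7, 0.8):
    print(f"h = {h}: P(7 heads in 10 flips) = {prob_7_of_10(h):.4f}")
```

Notice that h = 0.7 makes the observed data most likely, but weights like 0.6 or 0.8 also produce 7 heads roughly a fifth of the time, so the data alone doesn't pin down a single value of h.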
This is a pretty weird question, for two reasons. First of all, it's asking about a probability of a probability, as in, the value we don't know is itself some kind of long-run frequency for a random event, which, frankly, is hard to think about. But the more pressing weirdness comes from asking about probabilities in the setting of continuous values.
Let's give this unknown probability of flipping heads some kind of name, like h. Keep in mind that h could be any real number from 0 up to 1, ranging from a coin that always flips tails up to one that always flips heads, and everything in between.
So if I ask, "Hey, what's the probability that h is precisely 0.7, as opposed to, say, 0.70000001, or any other nearby value?", well, there's going to be a strong potential for paradox if we're not careful. It feels like no matter how small the answer to this question, it just wouldn't be small enough.
If every specific value within some range, all uncountably infinitely many of them, had a non-zero probability, then even if that probability were minuscule, adding them all up to get the total probability of any one of these values would blow up to infinity.

On the other hand, if all of these probabilities are 0, then aside from the fact that that gives you no useful information about the coin, the total sum of those probabilities would be 0, when it should be 1. After all, this weight of the coin h is *something*, so the probability of it being any one of these values should add up to 1. So, if these values can't all be non-zero, and they can't all be zero, what do you do?
Where we're going with this, by the way, is that I'd like to talk about the very practical question of using data to create meaningful answers to these sorts of probabilities-of-probabilities questions. But for this video, let's take a moment to appreciate how to work with probabilities over continuous values, and resolve this apparent paradox.
The key is not to focus on individual values, but on ranges of values. For example, we might make buckets to represent the probability that h is between, say, 0.8 and 0.85. Also, and this is more important than it might seem, rather than thinking of the *height* of each of these bars as representing the probability, think of the *area* of each one as representing that probability. Where exactly those areas come from is something that we'll answer later.
For right now, just know that in principle, there's *some* answer to the probability of h sitting inside one of these ranges. Our task right now is to take the answers to these very coarse-grained questions and get a more exact understanding of the distribution at the level of each individual input. The natural thing to do would be to consider finer and finer buckets. When you do, the smaller probability of falling into any one of them is accounted for by the thinner *width* of each of these bars, while the heights stay roughly the same.
That's important, because it means that as you take this process to the limit, you approach some kind of smooth curve. So even though all of the individual probabilities of falling into any one particular bucket approach 0, the overall shape of the distribution is preserved, and even refined, in this limit. If, on the other hand, we had let the *heights* of the bars represent probabilities, everything would have gone to 0. In the limit, we would have had a flat line giving no information about the overall shape of the distribution.
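Here's a small numerical sketch of that limiting process, using a made-up density f(h) = 6h(1-h) (purely an illustration, not anything derived from the coin data), whose running total F(h) = 3h² - 2h³ gives the exact accumulated area up to any point:

```python
# As buckets get thinner, the probability of landing in any one of them
# shrinks toward 0, but height = probability / width settles on a fixed value.
# Hypothetical density f(h) = 6h(1-h), with cumulative area F(h) = 3h^2 - 2h^3.

def F(h):
    return 3 * h**2 - 2 * h**3

for width in (0.1, 0.01, 0.001):
    prob = F(0.7 + width) - F(0.7)   # area of the bar starting at h = 0.7
    print(f"width = {width}: prob = {prob:.6f}, height = {prob / width:.4f}")
```

The probabilities shrink roughly tenfold with each thinner bucket, while the heights settle toward f(0.7) = 6 · 0.7 · 0.3 = 1.26, which is exactly the smooth curve the bars approach.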
So, wonderful! Letting area represent probability helps solve this problem. But let me ask you: if the y-axis no longer represents probability, what exactly are the units here? Since probability sits in the area of these bars, or width times height, the height represents a kind of probability per unit in the x-direction, what's known in the business as a "probability density".
The other thing to keep in mind is that the total area of all these bars has to equal 1 at every level of the process. That's something that has to be true for any valid probability distribution.
The idea of probability density is actually really clever when you step back to think about it. As you take things to the limit, even if there are all sorts of paradoxes associated with assigning a probability to each of these uncountably infinitely many values of h between 0 and 1, there's no problem if we associate a probability *density* with each one of them, giving what's known as a "probability density function", or PDF for short.
Any time you see a PDF in the wild, the way to interpret it is that the probability of your random variable lying *between* two values equals the area under this curve between those values. So, for example, what's the probability of getting any one very specific number, like 0.7? Well, the area of an infinitely thin slice is 0, so it's 0. What's the probability of all of them put together? Well, the area under the full curve is 1. You see? Paradox sidestepped.
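As a concrete check (reusing the hypothetical density f(h) = 6h(1-h), with a simple midpoint-rule integration, nothing specific to the video), here's how the area-under-the-curve interpretation plays out in code:

```python
# Probability = area under the PDF between two values.
# Hypothetical density f(h) = 6h(1-h) on [0, 1].

def f(h):
    return 6 * h * (1 - h)

def prob_between(a, b, n=10_000):
    # Midpoint-rule approximation of the area under f between a and b
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

print(prob_between(0.6, 0.8))   # a definite probability for a range
print(prob_between(0.0, 1.0))   # the full area, close to 1
print(prob_between(0.7, 0.7))   # a single value is a width-0 range: prints 0.0
```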
And the way it's been sidestepped is a bit subtle. In normal, finite settings, like rolling a die or drawing a card, the probability that a random value falls into a given collection of possibilities is simply the sum of the probabilities of being any one of them. This feels very intuitive, and it's even true in a countably infinite context.
But to deal with a continuum, the rules themselves have shifted. The probability of falling into a range of values is no longer the sum of the probabilities of each individual value. Instead, probabilities associated with ranges are the fundamental primitive objects, and the only sense in which it's meaningful to talk about an individual value here is to think of it as a range of width 0.
If the idea of the rules changing between a finite setting and a continuous one feels unsettling, well, you'll be happy to know that mathematicians are way ahead of you. There's a field of math called "measure theory", which helps to unite these two settings and make rigorous the idea of associating numbers like probabilities to various subsets of all possibilities in a way that combines and distributes nicely.
For example, say you're in a setting where you have a random number that equals 0 with 50% probability, and the rest of the time it's some positive number according to a distribution that looks like half of a bell curve. This is an awkward middle ground between a finite context, where a single value has a non-zero probability, and a continuous one, where probabilities are found as areas under the appropriate density function. This is the sort of thing that measure theory handles very smoothly. I mention this mainly for the especially curious viewer, and you can find more reading material in the description.
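A minimal sketch of that awkward middle ground (assuming Python's standard `random` module, and taking the half bell curve to be the absolute value of a standard normal, a choice made purely for illustration):

```python
import random

def sample():
    # With probability 1/2, the value is exactly 0:
    # a single point carrying real probability (an "atom").
    if random.random() < 0.5:
        return 0.0
    # Otherwise, draw from half of a bell curve: the continuous part.
    return abs(random.gauss(0, 1))

random.seed(0)
draws = [sample() for _ in range(100_000)]
print(sum(d == 0.0 for d in draws) / len(draws))  # close to 0.5
```

No single density function can describe this distribution, since half the probability sits on the lone point 0; measure theory gives one framework that covers both pieces at once.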
It's a pretty common rule of thumb that if you find yourself using a sum in a discrete context, you use an integral in the continuous context, which is the tool from calculus for finding areas under curves. In fact, you could argue this video would be way shorter if I had just said that at the front and called it good.
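That rule of thumb, side by side in a short sketch (a fair die for the discrete case, and the same hypothetical density f(h) = 6h(1-h) for the continuous one):

```python
# Discrete context: probability of a set of outcomes is a SUM.
# Fair six-sided die, P(outcome is 5 or 6):
p_discrete = sum(1 / 6 for outcome in (5, 6))
print(p_discrete)  # about 0.333

# Continuous context: probability of a range is an INTEGRAL,
# i.e. the area under the density, here approximated by a Riemann sum.
def f(h):
    return 6 * h * (1 - h)

n = 100_000
dx = (0.8 - 0.6) / n
p_continuous = sum(f(0.6 + (i + 0.5) * dx) for i in range(n)) * dx
print(p_continuous)  # close to the exact area 0.248
```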
For my part, though, I've always found it a little unsatisfying to do this blindly without thinking through what it really means. And in fact, if you really dig into the theoretical underpinnings of integrals, what you'd find is that in addition to the way the integral is defined in a typical intro calculus class, there is a separate, more powerful definition based on measure theory, this formal foundation of probability.
Looking back to when I first learned probability, I definitely remember grappling with the weird idea that in continuous settings, like random variables that are real numbers, or throwing a dart at a dartboard, you have a bunch of outcomes that are possible, and yet each one has a probability of 0. And somehow, altogether, they have a probability of 1.
One step in coming to terms with this is to realise that possibility is better tied to probability density than to probability, but just swapping out sums of one for integrals of the other never quite scratched the itch for me. It only really clicked when I realised that the rules for combining probabilities of different sets were not quite what I thought they were, and that there was simply a different axiom system underlying it all.
But anyway, steering away from the theory, somewhere back in the loose direction of application, look back to our original question about the coin with an unknown weight. What we've learned here is that the right question to ask is: what's the probability density function that describes this value h after seeing the outcomes of a few tosses? If you can find that PDF, you can use it to answer questions like "What's the probability that the true probability of flipping heads falls between 0.6 and 0.8?". To find that PDF, join me in the next part.