Backpropagation calculus | Chapter 4, Deep learning

Channel: 3Blue1Brown

The hard assumption here is that you’ve watched part 3, giving an intuitive walkthrough of the backpropagation algorithm. Here, we get a bit more formal and dive into the relevant calculus. It’s normal for this to be a little confusing, so the mantra to regularly pause and ponder certainly applies as much here as anywhere else. Our main goal is to show how people in machine learning commonly think about the chain rule from calculus in the context of networks, which has a different feel from how most introductory calculus courses approach the subject. For those of you uncomfortable with the relevant calculus, I do have a whole series on the topic.
Let’s just start off with an extremely simple network, one where each layer has a single neuron in it. This particular network is determined by 3 weights and 3 biases, and our goal is to understand how sensitive the cost function is to these variables. That way, we know which adjustments to these terms will cause the most efficient decrease to the cost function.
And we’re going to focus just on the connection between the last two neurons. Let’s label the activation of that last neuron a with a superscript L, indicating which layer it’s in, so the activation of the previous neuron is a^(L-1). These are not exponents; they’re just a way of indexing what we’re talking about, since I want to save subscripts for different indices later on.
Let’s say that the value we want this last activation to be for a given training example is y. For example, y might be 0 or 1. So the cost of this simple network for a single training example is (a^(L) - y)^2. We’ll denote the cost of this one training example as C_0.
As a reminder, this last activation is determined by a weight, which I’m going to call w^(L), times the previous neuron’s activation, plus some bias, which I’ll call b^(L). Then you pump that through some special nonlinear function like the sigmoid or a ReLU. It’s actually going to make things easier for us if we give a special name to this weighted sum, like z, with the same superscript as the relevant activations.
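Written out, those relationships for this one-neuron-per-layer chain are:

```latex
z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}, \qquad
a^{(L)} = \sigma\left(z^{(L)}\right), \qquad
C_0 = \left(a^{(L)} - y\right)^2
```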
So there are a lot of terms. And a way you might conceptualize this is that the weight, the previous activation, and the bias all together are used to compute z, which in turn lets us compute a, which finally, along with the constant y, lets us compute the cost. And of course a^(L-1) is influenced by its own weight and bias, and so on, but we’re not going to focus on that right now.
All of these are just numbers, right? And it can be nice to think of each one as having its own little number line. Our first goal is to understand how sensitive the cost function is to small changes in our weight w^(L). Or, phrased differently, what’s the derivative of C with respect to w^(L)? When you see this “∂w” term, think of it as meaning “some tiny nudge to w”, like a change by 0.01. And think of this “∂C” term as meaning “whatever the resulting nudge to the cost is”. What we want is their ratio.
Conceptually, this tiny nudge to w^(L) causes some nudge to z^(L), which in turn causes some change to a^(L), which directly influences the cost. So we break this up by first looking at the ratio of a tiny change in z^(L) to the tiny change in w^(L); that is, the derivative of z^(L) with respect to w^(L). Likewise, you then consider the ratio of a change in a^(L) to the tiny change in z^(L) that caused it, as well as the ratio between the final nudge to C and this intermediate nudge to a^(L). This right here is the chain rule, where multiplying together these three ratios gives us the sensitivity of C to small changes in w^(L).
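In symbols, that chain rule expansion reads:

```latex
\frac{\partial C_0}{\partial w^{(L)}} =
\frac{\partial z^{(L)}}{\partial w^{(L)}} \,
\frac{\partial a^{(L)}}{\partial z^{(L)}} \,
\frac{\partial C_0}{\partial a^{(L)}}
```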
So on screen right now there are kind of a lot of symbols, so take a moment to make sure it’s clear what they all are, because now we’re going to compute the relevant derivatives.
The derivative of C with respect to a^(L) works out to be 2(a^(L) - y). Notice, this means its size is proportional to the difference between the network’s output and the thing we want it to be. So if that output was very different, even slight changes stand to have a big impact on the cost function.
The derivative of a^(L) with respect to z^(L) is just the derivative of our sigmoid function, or whatever nonlinearity you choose to use. And the derivative of z^(L) with respect to w^(L) in this case comes out to be just a^(L-1).
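To make this concrete, here’s a minimal Python sketch of those three factors for a single training example; the variable names and the specific numbers here are just illustrative, not anything from the video.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative values for one training example.
a_prev = 0.7   # a^(L-1), the previous neuron's activation
w_L = 1.5      # w^(L)
b_L = -0.3     # b^(L)
y = 1.0        # desired output

# Forward pass: z^(L) = w^(L) a^(L-1) + b^(L), then a^(L) = sigma(z^(L)).
z_L = w_L * a_prev + b_L
a_L = sigmoid(z_L)

# The three ratios from the chain rule.
dC_da = 2.0 * (a_L - y)     # derivative of C_0 with respect to a^(L)
da_dz = sigmoid_prime(z_L)  # derivative of a^(L) with respect to z^(L)
dz_dw = a_prev              # derivative of z^(L) with respect to w^(L)

# Their product is the sensitivity of C_0 to w^(L).
dC_dw = dz_dw * da_dz * dC_da
print(dC_dw)
```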
Now, I don’t know about you, but I think it’s easy to get stuck head-down in these formulas without taking a moment to sit back and remind yourself what they all actually mean. In the case of this last derivative, the amount that a small nudge to this weight influences the last layer depends on how strong the previous neuron is. Remember, this is where that “neurons that fire together wire together” idea comes in.
And all of this is the derivative with respect to w^(L) of the cost for one specific training example only. Since the full cost function involves averaging together all those costs across many training examples, its derivative requires averaging this expression over all training examples.
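In symbols, with n training examples:

```latex
\frac{\partial C}{\partial w^{(L)}} =
\frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}
```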
And of course, that is just one component of the gradient vector, which itself is built up from the partial derivatives of the cost function with respect to all those weights and biases. But even though it’s just one of the many partial derivatives we need, it’s more than 50% of the work.
The sensitivity to the bias, for example, is almost identical. We just need to swap out this ∂z/∂w term for a ∂z/∂b, and if you look at the relevant formula, that derivative comes out to be 1.
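So the bias version of the chain rule expansion becomes:

```latex
\frac{\partial C_0}{\partial b^{(L)}} =
\frac{\partial z^{(L)}}{\partial b^{(L)}} \,
\frac{\partial a^{(L)}}{\partial z^{(L)}} \,
\frac{\partial C_0}{\partial a^{(L)}}
= 1 \cdot \sigma'\left(z^{(L)}\right) \cdot 2\left(a^{(L)} - y\right)
```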
Also, and this is where the idea of propagating backwards comes in, you can see how sensitive this cost function is to the activation of the previous layer; namely, this initial derivative in the chain rule expansion, the sensitivity of z to the previous activation, comes out to be the weight w^(L). And again, even though we won’t be able to directly influence that activation, it’s helpful to keep track of, because now we can just keep iterating this chain rule idea backwards to see how sensitive the cost function is to previous weights and previous biases.
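To see what that backwards iteration looks like in practice, here’s a small Python sketch of backpropagation through a chain of one-neuron layers, like the network in this example; the function name and the particular numbers are illustrative, not from the video.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_chain(weights, biases, a0, y):
    """Gradients of C_0 = (a^(L) - y)^2 for a chain of one-neuron layers."""
    # Forward pass, remembering every z and every activation.
    activations, zs = [a0], []
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))

    # Start from the sensitivity of the cost to the last activation.
    dC_da = 2.0 * (activations[-1] - y)

    grads_w, grads_b = [], []
    # Walk backwards through the layers, reusing dC_da at each step.
    for layer in reversed(range(len(weights))):
        dC_dz = dC_da * sigmoid_prime(zs[layer])
        grads_w.insert(0, dC_dz * activations[layer])  # dz/dw = a^(L-1)
        grads_b.insert(0, dC_dz)                       # dz/db = 1
        dC_da = dC_dz * weights[layer]                 # dz/da^(L-1) = w^(L)
    return grads_w, grads_b

# Three weights and three biases, as in the simple network above.
gw, gb = backprop_chain([1.2, -0.8, 0.5], [0.1, 0.0, -0.2], a0=0.6, y=1.0)
print(gw, gb)
```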
And you might think this is an overly simple example, since all the layers have just 1 neuron, and that things are going to get exponentially more complicated in a real network. But honestly, not that much changes when we give the layers multiple neurons. Really, it’s just a few more indices to keep track of.
Rather than the activation of a given layer simply being a^(L), it’s also going to have a subscript indicating which neuron of that layer it is. Let’s go ahead and use the letter k to index the layer (L-1), and j to index the layer (L). For the cost, again we look at what the desired output is. But this time we add up the squares of the differences between these last-layer activations and the desired output. That is, you take the sum of (a_j^(L) - y_j)^2 over all neurons j in the last layer.
Since there are a lot more weights, each one has to have a couple more indices to keep track of where it is. So let’s call the weight of the edge connecting this k-th neuron to the j-th neuron w_{jk}^(L). Those indices might feel a little backwards at first, but it lines up with how you’d index the weight matrix that I talked about in the part 1 video.
Just as before, it’s still nice to give a name to the relevant weighted sum, like z, so that the activation of the last layer is just your special function, like the sigmoid, applied to z.
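Written out with those indices, the equations are essentially the ones from before:

```latex
z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}, \qquad
a_j^{(L)} = \sigma\left(z_j^{(L)}\right), \qquad
C_0 = \sum_j \left(a_j^{(L)} - y_j\right)^2
```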
You can kind of see what I mean, right? These are all essentially the same equations we had before in the one-neuron-per-layer case; they just look a little more complicated. And indeed, the chain-rule derivative expression describing how sensitive the cost is to a specific weight looks essentially the same. I’ll leave it to you to pause and think about each of these terms if you want.
What does change here, though, is the derivative of the cost with respect to one of the activations in the layer (L-1). In this case, the difference is that the neuron influences the cost function through multiple paths. That is, on the one hand it influences a_0^(L), which plays a role in the cost function, but it also has an influence on a_1^(L), which also plays a role in the cost function. And you have to add those up.
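Adding up those paths gives a sum over all the neurons j in layer L:

```latex
\frac{\partial C_0}{\partial a_k^{(L-1)}} =
\sum_j
\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \,
\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \,
\frac{\partial C_0}{\partial a_j^{(L)}}
= \sum_j w_{jk}^{(L)} \, \sigma'\left(z_j^{(L)}\right) \, 2\left(a_j^{(L)} - y_j\right)
```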
And that... well, that is pretty much it. Once you know how sensitive the cost function is to the activations in this second-to-last layer, you can just repeat the process for all the weights and biases feeding into that layer. So pat yourself on the back! If all of this makes sense, you have now looked deep into the heart of backpropagation, the workhorse behind how neural networks learn.
These chain rule expressions give you the derivatives that determine each component in the gradient that helps minimize the cost of the network by repeatedly stepping downhill. Phew! If you sit back and think about all that, that’s a lot of layers of complexity to wrap your mind around, so don’t worry if it takes time for your mind to digest it all.