Backpropagation calculus | Chapter 4, Deep learning

Channel: 3Blue1Brown

The hard assumption here is that you’ve watched part 3, giving an intuitive walkthrough of the backpropagation algorithm. Here, we get a bit more formal and dive into the relevant calculus. It’s normal for this to be a little confusing, so the mantra to regularly pause and ponder certainly applies as much here as anywhere else. Our main goal is to show how people in machine learning commonly think about the chain rule from calculus in the context of networks, which has a different feel from how most introductory calculus courses approach the subject. For those of you uncomfortable with the relevant calculus, I do have a whole series on the topic.
Let’s just start off with an extremely simple network, one where each layer has a single neuron in it. This particular network is determined by 3 weights and 3 biases, and our goal is to understand how sensitive the cost function is to these variables. That way, we know which adjustments to these terms will cause the most efficient decrease to the cost function.
And we’re going to focus just on the connection between the last two neurons. Let’s label the activation of that last neuron a with a superscript L, indicating which layer it’s in, so the activation of the previous neuron is a^(L-1). These are not exponents; they’re just a way of indexing what we’re talking about, since I want to save subscripts for different indices later on.
Let’s say that the value we want this last activation to be for a given training example is y. For example, y might be 0 or 1. So the cost of this simple network for a single training example is (a^(L) - y)^2. We’ll denote the cost of this one training example as C_0.
As a reminder, this last activation is determined by a weight, which I’m going to call w^(L), times the previous neuron’s activation, plus some bias, which I’ll call b^(L). Then you pump that through some special nonlinear function like the sigmoid or a ReLU. It’s actually going to make things easier for us if we give a special name to this weighted sum, like z, with the same superscript as the relevant activations.
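Written out, those relationships for this one-neuron-per-layer chain are:

```latex
z^{(L)} = w^{(L)} a^{(L-1)} + b^{(L)}, \qquad
a^{(L)} = \sigma\left(z^{(L)}\right), \qquad
C_0 = \left(a^{(L)} - y\right)^2
```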
So there are a lot of terms. And a way you might conceptualize this is that the weight, the previous activation, and the bias all together are used to compute z, which in turn lets us compute a, which finally, along with the constant y, lets us compute the cost. And of course a^(L-1) is influenced by its own weight and bias, and so on, but we’re not going to focus on that right now.
All of these are just numbers, right? And it can be nice to think of each one as having its own little number line. Our first goal is to understand how sensitive the cost function is to small changes in our weight w^(L). Or, phrased differently, what’s the derivative of C with respect to w^(L)? When you see this “∂w” term, think of it as meaning “some tiny nudge to w”, like a change by 0.01. And think of this “∂C” term as meaning “whatever the resulting nudge to the cost is”. What we want is their ratio.
Conceptually, this tiny nudge to w^(L) causes some nudge to z^(L), which in turn causes some change to a^(L), which directly influences the cost. So we break this up by first looking at the ratio of a tiny change in z^(L) to the tiny change in w^(L); that is, the derivative of z^(L) with respect to w^(L). Likewise, you then consider the ratio of a change in a^(L) to the tiny change in z^(L) that caused it, as well as the ratio between the final nudge to C and this intermediate nudge to a^(L). This right here is the chain rule, where multiplying together these three ratios gives us the sensitivity of C to small changes in w^(L).
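In symbols, that chain rule expansion reads:

```latex
\frac{\partial C_0}{\partial w^{(L)}} =
\frac{\partial z^{(L)}}{\partial w^{(L)}} \,
\frac{\partial a^{(L)}}{\partial z^{(L)}} \,
\frac{\partial C_0}{\partial a^{(L)}}
```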
So on screen right now there are kind of a lot of symbols, so take a moment to make sure it’s clear what they all are, because now we’re going to compute the relevant derivatives.
The derivative of C with respect to a^(L) works out to be 2(a^(L) - y). Notice, this means its size is proportional to the difference between the network’s output and the thing we want it to be. So if that output was very different, even slight changes stand to have a big impact on the cost function.
The derivative of a^(L) with respect to z^(L) is just the derivative of our sigmoid function, or whatever nonlinearity you choose to use. And the derivative of z^(L) with respect to w^(L) in this case comes out to be just a^(L-1).
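To make this concrete, here’s a minimal Python sketch of those three factors for a single training example; the variable names and the specific numbers here are just illustrative, not anything from the video.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative values for one training example.
a_prev = 0.7   # a^(L-1), the previous neuron's activation
w_L = 1.5      # w^(L)
b_L = -0.3     # b^(L)
y = 1.0        # desired output

# Forward pass: z^(L) = w^(L) a^(L-1) + b^(L), then a^(L) = sigma(z^(L)).
z_L = w_L * a_prev + b_L
a_L = sigmoid(z_L)

# The three ratios from the chain rule.
dC_da = 2.0 * (a_L - y)     # derivative of C_0 with respect to a^(L)
da_dz = sigmoid_prime(z_L)  # derivative of a^(L) with respect to z^(L)
dz_dw = a_prev              # derivative of z^(L) with respect to w^(L)

# Their product is the sensitivity of C_0 to w^(L).
dC_dw = dz_dw * da_dz * dC_da
print(dC_dw)
```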
Now, I don’t know about you, but I think it’s easy to get stuck head-down in these formulas without taking a moment to sit back and remind yourself what they all actually mean. In the case of this last derivative, the amount that a small nudge to this weight influences the last layer depends on how strong the previous neuron is. Remember, this is where that “neurons that fire together wire together” idea comes in.
And all of this is the derivative with respect to w^(L) of the cost for one specific training example only. Since the full cost function involves averaging together all those costs across many training examples, its derivative requires averaging this expression over all training examples.
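In symbols, with n training examples:

```latex
\frac{\partial C}{\partial w^{(L)}} =
\frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^{(L)}}
```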
And of course, that is just one component of the gradient vector, which itself is built up from the partial derivatives of the cost function with respect to all those weights and biases. But even though it’s just one of the many partial derivatives we need, it’s more than 50% of the work.
The sensitivity to the bias, for example, is almost identical. We just need to swap out this ∂z/∂w term for a ∂z/∂b, and if you look at the relevant formula, that derivative comes out to be 1.
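So the bias version of the chain rule expansion becomes:

```latex
\frac{\partial C_0}{\partial b^{(L)}} =
\frac{\partial z^{(L)}}{\partial b^{(L)}} \,
\frac{\partial a^{(L)}}{\partial z^{(L)}} \,
\frac{\partial C_0}{\partial a^{(L)}}
= 1 \cdot \sigma'\left(z^{(L)}\right) \cdot 2\left(a^{(L)} - y\right)
```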
Also, and this is where the idea of propagating backwards comes in, you can see how sensitive this cost function is to the activation of the previous layer; namely, this initial derivative in the chain rule expansion, the sensitivity of z to the previous activation, comes out to be the weight w^(L). And again, even though we won’t be able to directly influence that activation, it’s helpful to keep track of, because now we can just keep iterating this chain rule idea backwards to see how sensitive the cost function is to previous weights and previous biases.
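To see what that backwards iteration looks like in practice, here’s a small Python sketch of backpropagation through a chain of one-neuron layers, like the network in this example; the function name and the particular numbers are illustrative, not from the video.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_chain(weights, biases, a0, y):
    """Gradients of C_0 = (a^(L) - y)^2 for a chain of one-neuron layers."""
    # Forward pass, remembering every z and every activation.
    activations, zs = [a0], []
    for w, b in zip(weights, biases):
        z = w * activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))

    # Start from the sensitivity of the cost to the last activation.
    dC_da = 2.0 * (activations[-1] - y)

    grads_w, grads_b = [], []
    # Walk backwards through the layers, reusing dC_da at each step.
    for layer in reversed(range(len(weights))):
        dC_dz = dC_da * sigmoid_prime(zs[layer])
        grads_w.insert(0, dC_dz * activations[layer])  # dz/dw = a^(L-1)
        grads_b.insert(0, dC_dz)                       # dz/db = 1
        dC_da = dC_dz * weights[layer]                 # dz/da^(L-1) = w^(L)
    return grads_w, grads_b

# Three weights and three biases, as in the simple network above.
gw, gb = backprop_chain([1.2, -0.8, 0.5], [0.1, 0.0, -0.2], a0=0.6, y=1.0)
print(gw, gb)
```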
And you might think this is an overly simple example, since all the layers have just 1 neuron, and that things are going to get exponentially more complicated in a real network. But honestly, not that much changes when we give the layers multiple neurons. Really, it’s just a few more indices to keep track of.
Rather than the activation of a given layer simply being a^(L), it’s also going to have a subscript indicating which neuron of that layer it is. Let’s go ahead and use the letter k to index the layer (L-1), and j to index the layer (L). For the cost, again we look at what the desired output is. But this time we add up the squares of the differences between these last-layer activations and the desired output. That is, you take the sum of (a_j^(L) - y_j)^2 over all neurons j in the last layer.
Since there are a lot more weights, each one has to have a couple more indices to keep track of where it is. So let’s call the weight of the edge connecting this k-th neuron to the j-th neuron w_{jk}^(L). Those indices might feel a little backwards at first, but it lines up with how you’d index the weight matrix that I talked about in the part 1 video.
Just as before, it’s still nice to give a name to the relevant weighted sum, like z, so that the activation of the last layer is just your special function, like the sigmoid, applied to z.
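Written out with those indices, the equations are essentially the ones from before:

```latex
z_j^{(L)} = \sum_k w_{jk}^{(L)} a_k^{(L-1)} + b_j^{(L)}, \qquad
a_j^{(L)} = \sigma\left(z_j^{(L)}\right), \qquad
C_0 = \sum_j \left(a_j^{(L)} - y_j\right)^2
```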
You can kind of see what I mean, right? These are all essentially the same equations we had before in the one-neuron-per-layer case; they just look a little more complicated. And indeed, the chain-rule derivative expression describing how sensitive the cost is to a specific weight looks essentially the same. I’ll leave it to you to pause and think about each of these terms if you want.
What does change here, though, is the derivative of the cost with respect to one of the activations in the layer (L-1). In this case, the difference is that the neuron influences the cost function through multiple paths. That is, on the one hand it influences a_0^(L), which plays a role in the cost function, but it also has an influence on a_1^(L), which also plays a role in the cost function. And you have to add those up.
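Adding up those paths gives a sum over all the neurons j in layer L:

```latex
\frac{\partial C_0}{\partial a_k^{(L-1)}} =
\sum_j
\frac{\partial z_j^{(L)}}{\partial a_k^{(L-1)}} \,
\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \,
\frac{\partial C_0}{\partial a_j^{(L)}}
= \sum_j w_{jk}^{(L)} \, \sigma'\left(z_j^{(L)}\right) \, 2\left(a_j^{(L)} - y_j\right)
```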
And that... well, that is pretty much it. Once you know how sensitive the cost function is to the activations in this second-to-last layer, you can just repeat the process for all the weights and biases feeding into that layer. So pat yourself on the back! If all of this makes sense, you have now looked deep into the heart of backpropagation, the workhorse behind how neural networks learn.
These chain rule expressions give you the derivatives that determine each component in the gradient that helps minimize the cost of the network by repeatedly stepping downhill. Phew! If you sit back and think about all that, that’s a lot of layers of complexity to wrap your mind around, so don’t worry if it takes time for your mind to digest it all.