Neural Networks Pt. 1: Inside the Black Box
Channel: StatQuest with Josh Starmer
Neural networks seem so complicated... but they're not! StatQuest!

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about Neural Networks, Part 1: Inside the Black Box.
Neural networks, one of the most popular algorithms in machine learning, cover a broad range of concepts and techniques. However, people call them a "black box" because it can be hard to understand what they're doing. The goal of this series is to take a peek into the black box by breaking down each concept and technique into its components and walking through how they fit together, step by step.

In this first part, we will learn about what neural networks do and how they do it.
In Part 2, we'll talk about how neural networks are fit to data with backpropagation. Then we will talk about variations on the simple neural network presented in this part, including deep learning.

NOTE: Crazy awesome news! I have a new way to think about neural networks that will help beginners and seasoned experts alike gain a deep insight into what neural networks do. For example, most tutorials use cool-looking, but hard to understand, graphs and fancy mathematical notation to represent neural networks. In contrast, I'm going to label every little thing on the neural network to make it easy to keep track of the details, and the math will be as simple as possible while still being true to the algorithm. These differences will help you develop a deep understanding of what neural networks actually do.
So, with that said, let's imagine we tested a drug that was designed to treat an illness, and we gave the drug to three different groups of people with three different dosages: low, medium, and high. The low dosages were not effective, so we set them to 0 on this graph. In contrast, the medium dosages were effective, so we set them to 1. And the high dosages were not effective, so those are set to 0.

Now that we have this data, we would like to use it to predict whether or not a future dosage will be effective. However, we can't just fit a straight line to the data to make predictions, because no matter how we rotate the straight line, it can only accurately predict two of the three dosages. The good news is that a neural network can fit a squiggle to the data. The green squiggle is close to 0 for low dosages, close to 1 for medium dosages, and close to 0 for high dosages. And even if we have a really complicated dataset like this, a neural network can fit a squiggle to it.

In this StatQuest, we're going to use this super simple dataset and show how this neural network creates this green squiggle.
But first, let's just talk about what a neural network is. A neural network consists of nodes and connections between the nodes.

NOTE: The numbers along each connection represent parameter values that were estimated when this neural network was fit to the data. For now, just know that these parameter estimates are analogous to the slope and intercept values that we solve for when we fit a straight line to data. Likewise, a neural network starts out with unknown parameter values that are estimated when we fit the neural network to a dataset using a method called backpropagation. We will talk about how backpropagation estimates these parameters in Part 2 of this series, but for now, just assume that we've already fit this neural network to this specific dataset, and that means we have already estimated these parameters.
Also, you may have noticed that some of the nodes have curved lines inside of them. These bent or curved lines are the building blocks for fitting a squiggle to data. The goal of this StatQuest is to show you how these identical curves can be reshaped by the parameter values and then added together to get a green squiggle that fits the data.

NOTE: There are many common bent or curved lines that we can choose for a neural network. This specific curved line is called SoftPlus, which sounds like a brand of toilet paper. Alternatively, we could use this bent line, called ReLU, which is short for Rectified Linear Unit and sounds like a robot. Or we could use a sigmoid shape, or any other bent or curved line.

Oh no! It's the dreaded Terminology Alert! The curved or bent lines are called activation functions. When you build a neural network, you have to decide which activation function, or functions, you want to use. When most people teach neural networks, they use the sigmoid activation function. However, in practice, it is much more common to use the ReLU activation function or the SoftPlus activation function, so we'll use the SoftPlus activation function in this StatQuest. Anyway, we'll talk more about how you choose activation functions later in this series.
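As a quick sketch (not from the video itself), the three activation functions mentioned above can each be written in a line or two of Python:

```python
import math

def softplus(x):
    # SoftPlus: a smooth curve, log(1 + e^x); log means the natural log
    return math.log(1 + math.exp(x))

def relu(x):
    # ReLU (Rectified Linear Unit): a bent line that is 0 for negative inputs
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: an s-shaped curve squashed between 0 and 1
    return 1 / (1 + math.exp(-x))

print(round(softplus(2.14), 2))  # → 2.25, the value used later in this StatQuest
```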
NOTE: This specific neural network is about as simple as they get. It only has one input node, where we plug in the dosage; only one output node, to tell us the predicted effectiveness; and only two nodes between the input and output nodes. However, in practice, neural networks are usually much fancier, and have more than one input node, more than one output node, different layers of nodes between the input and output nodes, and a spiderweb of connections between each layer of nodes.
Oh no! It's another Terminology Alert! These layers of nodes between the input and output nodes are called hidden layers. When you build a neural network, one of the first things you do is decide how many hidden layers you want, and how many nodes go into each hidden layer. Although there are rules of thumb for making decisions about the hidden layers, you essentially make a guess and see how well the neural network performs, adding more layers and nodes if needed.
Now, even though this neural network looks fancy, it is still made from the same parts used in this simple neural network, which has only one hidden layer with two nodes. So let's learn how this neural network creates new shapes from the curved or bent lines in the hidden layer, and then adds them together to get a green squiggle that fits the data.

NOTE: To keep the math simple, let's assume dosages go from 0 (for low) to 1 (for high).
The first thing we are going to do is plug the lowest dosage, 0, into the neural network. Now, to get from the input node to the top node in the hidden layer, this connection multiplies the dosage by -34.4 and then adds 2.14, and the result is an x-axis coordinate for the activation function. For example, the lowest dosage, 0, is multiplied by -34.4, and then we add 2.14 to get 2.14 as the x-axis coordinate for the activation function.

To get the corresponding y-axis value, we plug 2.14 into the activation function, which in this case is the SoftPlus function.

NOTE: If we had chosen the sigmoid curve for the activation function, then we would plug 2.14 into the equation for the sigmoid curve. And if we had chosen the ReLU bent line for the activation function, then we would plug 2.14 into the ReLU equation. But since we are using SoftPlus for the activation function, we plug 2.14 into the SoftPlus equation, and the log of 1 plus e raised to the 2.14 power is 2.25.

NOTE: In statistics, machine learning, and most programming languages, the log function implies the natural log, or the log base e.

Anyway, the y-axis coordinate for the activation function is 2.25, so let's extend this y-axis up a little bit and put a blue dot at 2.25 for when dosage = 0.
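The arithmetic above can be sketched in a few lines of Python, using the weight (-34.4) and bias (2.14) read off the video's diagram:

```python
import math

def softplus(x):
    # the SoftPlus activation function: log(1 + e^x), log base e
    return math.log(1 + math.exp(x))

def top_node(dosage):
    # the connection to the top hidden node multiplies the dosage
    # by -34.4 and then adds 2.14 to get the x-axis coordinate...
    x = dosage * -34.4 + 2.14
    # ...and the activation function turns it into the y-axis value
    return softplus(x)

print(round(top_node(0.0), 2))  # → 2.25
print(round(top_node(0.1), 2))  # → 0.24
```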
Now, if we increase the dosage a little bit and plug 0.1 into the input, the x-axis coordinate for the activation function is -1.3, and the corresponding y-axis value is 0.24. So let's put a blue dot at 0.24 for when dosage = 0.1. And if we continue to increase the dosage values all the way to 1, the maximum dosage, we get this blue curve.

NOTE: Before we move on, I want to point out that the full range of dosage values, from 0 to 1, corresponds to this relatively narrow range of values from the activation function. In other words, when we plug dosage values from 0 to 1 into the neural network, and then multiply them by -34.4 and add 2.14, we only get x-axis coordinates that are within the red box, and thus, only the corresponding y-axis values in the red box are used to make this new blue curve.
BAM!

Now we scale the y-axis values for the blue curve by -1.3. For example, when dosage = 0, the current y-axis coordinate for the blue curve is 2.25, so we multiply 2.25 by -1.3 and get -2.93, and -2.93 corresponds to this position on the y-axis. Likewise, we multiply all of the other y-axis coordinates on the blue curve by -1.3, and we end up with a new blue curve. BAM!
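As a sketch, the entire scaled blue curve can be traced by sweeping dosages from 0 to 1; the -1.3 is the parameter on the top hidden node's connection to the output:

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

# sweep dosages from 0 to 1 and trace the scaled blue curve
dosages = [i / 10 for i in range(11)]
blue_curve = [-1.3 * softplus(d * -34.4 + 2.14) for d in dosages]

print(round(blue_curve[0], 2))  # dosage = 0 → -2.93
```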
Now let's focus on the connection from the input node to the bottom node in the hidden layer. However, this time we multiply the dosage by -2.52, instead of -34.4, and we add 1.29, instead of 2.14, to get the x-axis coordinate for the activation function. Remember, these values come from fitting the neural network to the data with backpropagation, and we'll talk about that in Part 2 of this series.

Now, if we plug the lowest dosage, 0, into the neural network, then the x-axis coordinate for the activation function is 1.29. Now we plug 1.29 into the activation function to get the corresponding y-axis value, and get 1.53, and that corresponds to this yellow dot. Now we just plug in dosage values from 0 to 1 to get the corresponding y-axis values, and we get this orange curve.
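The bottom node's calculation mirrors the top node's, just with its own weight and bias (a sketch, with the values read off the diagram):

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def bottom_node(dosage):
    # the connection to the bottom hidden node multiplies the dosage
    # by -2.52 and then adds 1.29 before applying the activation function
    return softplus(dosage * -2.52 + 1.29)

print(round(bottom_node(0.0), 2))  # → 1.53
```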
NOTE: Just like before, I want to point out that the full range of dosage values, from 0 to 1, corresponds to this narrow range of values from the activation function. In other words, when we plug dosage values from 0 to 1 into the neural network, we only get x-axis coordinates that are within the red box, and thus, only the corresponding y-axis values in the red box are used to make this new orange curve.

So we see that fitting a neural network to data gives us different parameter estimates on the connections, and that results in each node in the hidden layer using different portions of the activation functions to create these new and exciting shapes.

Now, just like before, we scale the y-axis coordinates on the orange curve, only this time we scale by a positive number, 2.28, and that gives us this new orange curve.
Now the neural network tells us to add the y-axis coordinates from the blue curve to the orange curve, and that gives us this green squiggle. Then, finally, we subtract 0.58 from the y-axis values on the green squiggle, and we have a green squiggle that fits the data. BAM!

Now, if someone comes along and says that they are using dosage = 0.5, we can look at the corresponding y-axis coordinate on the green squiggle and see that the dosage will be effective. Or we can solve for the y-axis coordinate by plugging dosage = 0.5 into the neural network and doing the math. And we see that the y-axis coordinate on the green squiggle is 1.03, and since 1.03 is closer to 1 than 0, we will conclude that a dosage = 0.5 is effective. DOUBLE BAM!
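Putting all the pieces together, the whole network is just a short chain of arithmetic. This sketch uses the parameter values from the diagram (multipliers -34.4, -2.52, -1.3, and 2.28; added terms 2.14, 1.29, and -0.58):

```python
import math

def softplus(x):
    # the activation function: log(1 + e^x), log base e
    return math.log(1 + math.exp(x))

def predict_effectiveness(dosage):
    # top hidden node: reshape the activation function with one weight and bias
    top = softplus(dosage * -34.4 + 2.14)
    # bottom hidden node: same activation function, different weight and bias
    bottom = softplus(dosage * -2.52 + 1.29)
    # scale each curve, add them together, then shift the sum to fit the data
    return -1.3 * top + 2.28 * bottom - 0.58

print(round(predict_effectiveness(0.5), 2))  # → 1.03, closer to 1 than 0
```

Checking the endpoints the same way, dosages of 0 and 1 both come out close to 0, matching the green squiggle.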
Now, if you've made it this far, you may be wondering why this is called a neural network instead of a big fancy squiggle-fitting machine. The reason is that way back in the 1940s and '50s, when neural networks were invented, they thought the nodes were vaguely like neurons, and the connections between the nodes were sort of like synapses. However, I think they should be called big fancy squiggle-fitting machines, because that's what they do.
NOTE: Whether or not you call it a squiggle-fitting machine, the parameters that we multiply are called weights, and the parameters that we add are called biases.

NOTE: This neural network starts with two identical activation functions, but the weights and biases on the connections slice them, flip them, and stretch them into new shapes, which are then added together to get a squiggle that is entirely new. And then the squiggle is shifted to fit the data.

Now, if we can create this green squiggle with just two nodes in a single hidden layer, just imagine what types of green squiggles we could fit with more hidden layers and more nodes in each hidden layer. In theory, neural networks can fit a green squiggle to just about any dataset, no matter how complicated. And I think that's pretty cool. TRIPLE BAM!
Now it's time for some Shameless Self-Promotion! If you want to review statistics and machine learning offline, check out the StatQuest study guides at statquest.org. There's something for everyone. Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate. The links are in the description below. All right, until next time, Quest on!