Neural Networks Pt. 1: Inside the Black Box
Channel: StatQuest with Josh Starmer
Neural networks seem so complicated... but they're not! StatQuest!

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about Neural Networks, Part 1: Inside the Black Box.
Neural networks, one of the most popular algorithms in machine learning, cover a broad range of concepts and techniques. However, people call them a "black box" because it can be hard to understand what they're doing. The goal of this series is to take a peek into the black box by breaking down each concept and technique into its components and walking through how they fit together, step by step.

In this first part, we will learn about what neural networks do and how they do it.
In Part 2, we'll talk about how neural networks are fit to data with backpropagation. Then we will talk about variations on the simple neural network presented in this part, including deep learning.

NOTE: Crazy awesome news! I have a new way to think about neural networks that will help beginners and seasoned experts alike gain a deep insight into what neural networks do. For example, most tutorials use cool-looking, but hard to understand, graphs and fancy mathematical notation to represent neural networks. In contrast, I'm going to label every little thing on the neural network to make it easy to keep track of the details, and the math will be as simple as possible while still being true to the algorithm. These differences will help you develop a deep understanding of what neural networks actually do.
So, with that said, let's imagine we tested a drug that was designed to treat an illness, and we gave the drug to three different groups of people with three different dosages: low, medium, and high. The low dosages were not effective, so we set them to 0 on this graph. In contrast, the medium dosages were effective, so we set them to 1. And the high dosages were not effective, so those are set to 0.

Now that we have this data, we would like to use it to predict whether or not a future dosage will be effective. However, we can't just fit a straight line to the data to make predictions, because no matter how we rotate the straight line, it can only accurately predict two of the three dosages. The good news is that a neural network can fit a squiggle to the data. The green squiggle is close to 0 for low dosages, close to 1 for medium dosages, and close to 0 for high dosages. And even if we have a really complicated dataset like this, a neural network can fit a squiggle to it.

In this StatQuest, we're going to use this super simple dataset and show how this neural network creates this green squiggle.
But first, let's just talk about what a neural network is. A neural network consists of nodes and connections between the nodes.

NOTE: The numbers along each connection represent parameter values that were estimated when this neural network was fit to the data. For now, just know that these parameter estimates are analogous to the slope and intercept values that we solve for when we fit a straight line to data. Likewise, a neural network starts out with unknown parameter values that are estimated when we fit the neural network to a dataset using a method called backpropagation. We will talk about how backpropagation estimates these parameters in Part 2 of this series, but for now, just assume that we've already fit this neural network to this specific dataset, and that means we have already estimated these parameters.
Also, you may have noticed that some of the nodes have curved lines inside of them. These bent or curved lines are the building blocks for fitting a squiggle to data. The goal of this StatQuest is to show you how these identical curves can be reshaped by the parameter values and then added together to get a green squiggle that fits the data.

NOTE: There are many common bent or curved lines that we can choose for a neural network. This specific curved line is called SoftPlus, which sounds like a brand of toilet paper. Alternatively, we could use this bent line, called ReLU, which is short for Rectified Linear Unit and sounds like a robot. Or we could use a sigmoid shape, or any other bent or curved line.

Oh no! It's the dreaded Terminology Alert! The curved or bent lines are called activation functions. When you build a neural network, you have to decide which activation function, or functions, you want to use. When most people teach neural networks, they use the sigmoid activation function. However, in practice, it is much more common to use the ReLU activation function or the SoftPlus activation function, so we'll use the SoftPlus activation function in this StatQuest. Anyway, we'll talk more about how you choose activation functions later in this series.
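As a quick sketch (not from the video itself), the three activation functions mentioned above can each be written in a line or two of Python:

```python
import math

def softplus(x):
    # SoftPlus: a smooth curve, log(1 + e^x); log means the natural log
    return math.log(1 + math.exp(x))

def relu(x):
    # ReLU (Rectified Linear Unit): a bent line that is 0 for negative inputs
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: an s-shaped curve squashed between 0 and 1
    return 1 / (1 + math.exp(-x))

print(round(softplus(2.14), 2))  # → 2.25, the value used later in this StatQuest
```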
NOTE: This specific neural network is about as simple as they get. It only has one input node, where we plug in the dosage; only one output node, to tell us the predicted effectiveness; and only two nodes between the input and output nodes. However, in practice, neural networks are usually much fancier, and have more than one input node, more than one output node, different layers of nodes between the input and output nodes, and a spiderweb of connections between each layer of nodes.
Oh no! It's another Terminology Alert! These layers of nodes between the input and output nodes are called hidden layers. When you build a neural network, one of the first things you do is decide how many hidden layers you want, and how many nodes go into each hidden layer. Although there are rules of thumb for making decisions about the hidden layers, you essentially make a guess and see how well the neural network performs, adding more layers and nodes if needed.
Now, even though this neural network looks fancy, it is still made from the same parts used in this simple neural network, which has only one hidden layer with two nodes. So let's learn how this neural network creates new shapes from the curved or bent lines in the hidden layer, and then adds them together to get a green squiggle that fits the data.

NOTE: To keep the math simple, let's assume dosages go from 0 (for low) to 1 (for high).
The first thing we are going to do is plug the lowest dosage, 0, into the neural network. Now, to get from the input node to the top node in the hidden layer, this connection multiplies the dosage by -34.4 and then adds 2.14, and the result is an x-axis coordinate for the activation function. For example, the lowest dosage, 0, is multiplied by -34.4, and then we add 2.14 to get 2.14 as the x-axis coordinate for the activation function.

To get the corresponding y-axis value, we plug 2.14 into the activation function, which in this case is the SoftPlus function.

NOTE: If we had chosen the sigmoid curve for the activation function, then we would plug 2.14 into the equation for the sigmoid curve. And if we had chosen the ReLU bent line for the activation function, then we would plug 2.14 into the ReLU equation. But since we are using SoftPlus for the activation function, we plug 2.14 into the SoftPlus equation, and the log of 1 plus e raised to the 2.14 power is 2.25.

NOTE: In statistics, machine learning, and most programming languages, the log function implies the natural log, or the log base e.

Anyway, the y-axis coordinate for the activation function is 2.25, so let's extend this y-axis up a little bit and put a blue dot at 2.25 for when dosage = 0.
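The arithmetic above can be sketched in a few lines of Python, using the weight (-34.4) and bias (2.14) read off the video's diagram:

```python
import math

def softplus(x):
    # the SoftPlus activation function: log(1 + e^x), log base e
    return math.log(1 + math.exp(x))

def top_node(dosage):
    # the connection to the top hidden node multiplies the dosage
    # by -34.4 and then adds 2.14 to get the x-axis coordinate...
    x = dosage * -34.4 + 2.14
    # ...and the activation function turns it into the y-axis value
    return softplus(x)

print(round(top_node(0.0), 2))  # → 2.25
print(round(top_node(0.1), 2))  # → 0.24
```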
Now, if we increase the dosage a little bit and plug 0.1 into the input, the x-axis coordinate for the activation function is -1.3, and the corresponding y-axis value is 0.24. So let's put a blue dot at 0.24 for when dosage = 0.1. And if we continue to increase the dosage values all the way to 1, the maximum dosage, we get this blue curve.

NOTE: Before we move on, I want to point out that the full range of dosage values, from 0 to 1, corresponds to this relatively narrow range of values from the activation function. In other words, when we plug dosage values from 0 to 1 into the neural network, and then multiply them by -34.4 and add 2.14, we only get x-axis coordinates that are within the red box, and thus, only the corresponding y-axis values in the red box are used to make this new blue curve.
BAM!

Now we scale the y-axis values for the blue curve by -1.3. For example, when dosage = 0, the current y-axis coordinate for the blue curve is 2.25, so we multiply 2.25 by -1.3 and get -2.93, and -2.93 corresponds to this position on the y-axis. Likewise, we multiply all of the other y-axis coordinates on the blue curve by -1.3, and we end up with a new blue curve. BAM!
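As a sketch, the entire scaled blue curve can be traced by sweeping dosages from 0 to 1; the -1.3 is the parameter on the top hidden node's connection to the output:

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

# sweep dosages from 0 to 1 and trace the scaled blue curve
dosages = [i / 10 for i in range(11)]
blue_curve = [-1.3 * softplus(d * -34.4 + 2.14) for d in dosages]

print(round(blue_curve[0], 2))  # dosage = 0 → -2.93
```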
Now let's focus on the connection from the input node to the bottom node in the hidden layer. However, this time we multiply the dosage by -2.52, instead of -34.4, and we add 1.29, instead of 2.14, to get the x-axis coordinate for the activation function. Remember, these values come from fitting the neural network to the data with backpropagation, and we'll talk about that in Part 2 of this series.

Now, if we plug the lowest dosage, 0, into the neural network, then the x-axis coordinate for the activation function is 1.29. Now we plug 1.29 into the activation function to get the corresponding y-axis value, and get 1.53, and that corresponds to this yellow dot. Now we just plug in dosage values from 0 to 1 to get the corresponding y-axis values, and we get this orange curve.
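The bottom node's calculation mirrors the top node's, just with its own weight and bias (a sketch, with the values read off the diagram):

```python
import math

def softplus(x):
    return math.log(1 + math.exp(x))

def bottom_node(dosage):
    # the connection to the bottom hidden node multiplies the dosage
    # by -2.52 and then adds 1.29 before applying the activation function
    return softplus(dosage * -2.52 + 1.29)

print(round(bottom_node(0.0), 2))  # → 1.53
```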
NOTE: Just like before, I want to point out that the full range of dosage values, from 0 to 1, corresponds to this narrow range of values from the activation function. In other words, when we plug dosage values from 0 to 1 into the neural network, we only get x-axis coordinates that are within the red box, and thus, only the corresponding y-axis values in the red box are used to make this new orange curve.

So we see that fitting a neural network to data gives us different parameter estimates on the connections, and that results in each node in the hidden layer using different portions of the activation functions to create these new and exciting shapes.

Now, just like before, we scale the y-axis coordinates on the orange curve, only this time we scale by a positive number, 2.28, and that gives us this new orange curve.
Now the neural network tells us to add the y-axis coordinates from the blue curve to the orange curve, and that gives us this green squiggle. Then, finally, we subtract 0.58 from the y-axis values on the green squiggle, and we have a green squiggle that fits the data. BAM!

Now, if someone comes along and says that they are using dosage = 0.5, we can look at the corresponding y-axis coordinate on the green squiggle and see that the dosage will be effective. Or we can solve for the y-axis coordinate by plugging dosage = 0.5 into the neural network and doing the math. And we see that the y-axis coordinate on the green squiggle is 1.03, and since 1.03 is closer to 1 than 0, we will conclude that a dosage = 0.5 is effective. DOUBLE BAM!
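Putting all the pieces together, the whole network is just a short chain of arithmetic. This sketch uses the parameter values from the diagram (multipliers -34.4, -2.52, -1.3, and 2.28; added terms 2.14, 1.29, and -0.58):

```python
import math

def softplus(x):
    # the activation function: log(1 + e^x), log base e
    return math.log(1 + math.exp(x))

def predict_effectiveness(dosage):
    # top hidden node: reshape the activation function with one weight and bias
    top = softplus(dosage * -34.4 + 2.14)
    # bottom hidden node: same activation function, different weight and bias
    bottom = softplus(dosage * -2.52 + 1.29)
    # scale each curve, add them together, then shift the sum to fit the data
    return -1.3 * top + 2.28 * bottom - 0.58

print(round(predict_effectiveness(0.5), 2))  # → 1.03, closer to 1 than 0
```

Checking the endpoints the same way, dosages of 0 and 1 both come out close to 0, matching the green squiggle.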
Now, if you've made it this far, you may be wondering why this is called a neural network instead of a big fancy squiggle-fitting machine. The reason is that way back in the 1940s and '50s, when neural networks were invented, they thought the nodes were vaguely like neurons, and the connections between the nodes were sort of like synapses. However, I think they should be called big fancy squiggle-fitting machines, because that's what they do.
NOTE: Whether or not you call it a squiggle-fitting machine, the parameters that we multiply are called weights, and the parameters that we add are called biases.

NOTE: This neural network starts with two identical activation functions, but the weights and biases on the connections slice them, flip them, and stretch them into new shapes, which are then added together to get a squiggle that is entirely new. And then the squiggle is shifted to fit the data.

Now, if we can create this green squiggle with just two nodes in a single hidden layer, just imagine what types of green squiggles we could fit with more hidden layers and more nodes in each hidden layer. In theory, neural networks can fit a green squiggle to just about any dataset, no matter how complicated. And I think that's pretty cool. TRIPLE BAM!
Now it's time for some Shameless Self-Promotion! If you want to review statistics and machine learning offline, check out the StatQuest study guides at statquest.org. There's something for everyone. Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs or a t-shirt or a hoodie, or just donate. The links are in the description below. All right, until next time, Quest on!