馃攳
Statistics 101: Linear Regression, Residual Analysis - YouTube
Channel: Brandon Foltz
Hello, my name is Brandon, and welcome to the next video in my series on basic statistics. If you are new to the channel, welcome; if you're a returning viewer, it's great to have you back. If you like the video, please give it a thumbs up and share it with classmates, colleagues, friends, or anyone else you think might benefit from watching. So now that we are introduced, let's go ahead and get started.

This video is the next in my series on simple linear regression; however, it has implications beyond just simple regression. This video is about residual analysis, and residual analysis does two primary things for us. Number one, it tells us how well the model we have produced fits the data we are looking at; in other words, is our error large or is our error small? Number two, and maybe most importantly, it tells us whether or not the model we are using is actually appropriate for the data we are looking at. As you probably know by now, there are many, many ways to model a data set, and some models are more appropriate than others; residuals can help us decide among them. So residual analysis goes well beyond basic statistics: it comes up in higher-level statistics, in data science, and of course in machine learning, whenever we're deciding which model to choose for our application. Let's go ahead and get started learning about residual analysis.

This video is brought to you by the great people at The Great Courses Plus. If you're watching this, chances are you need to, want to, or like to learn things, and there are few better places to learn pretty much anything you want than The Great Courses Plus. They have over 10,000 video lectures on everything from photography to literature, philosophy to finance, and yes, statistics. So please check out the link in the description below and learn how you can get a free trial to The Great Courses Plus; it helps my channel and it also helps them. Now let's go ahead and learn about residuals.
Here is the data we've been using for this entire playlist. I'm not going to go into it very much, but in case you're new to the playlist, I want to go over it briefly. You go into a restaurant, you eat a meal, and of course they give you the bill at the end; and, especially here in the US (maybe not everywhere in the world), it's customary to tip the server for that meal. So we have the bill amount along the bottom, our x-axis, the tip amount along the y-axis, and each diamond is the intersection of those two things. As far as the data table goes, you can see it over on the right: our first bill was $34 and the server had a tip of $5, and so on and so forth. The mean of each variable: for the bill amount it was $74, and for the tip amount it was $10. That's the data set we'll be using.

When we put that data into a regression model, this is what we get. First you can see the regression line that goes across the middle of our graph. We have a centroid of (74, 10); again, that's just the mean of each variable. We have a regression line of y = 0.1462x − 0.8188. (The two regression lines shown are slightly different, and that's just because of differences in the algorithms of the software, but they're basically the same.) So we have a slope of 0.1462, and we have our intercept down there in the lower left. Overall we can interpret this as: as the bill amount goes up, the tip amount goes up; that's why our slope is positive. As far as the actual numbers go, for every one-dollar increase in the meal bill, we would expect, or predict, an increase in the tip amount of about fifteen cents. So this is the very simple, small-data-set model we're using, and this is what it looks like when we actually plot it and put it into a regression model.
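As a sketch, here is how a fit like this can be reproduced in code. The six bill/tip pairs below are illustrative values chosen to be consistent with the numbers quoted in the video (first bill $34 with a $5 tip, mean bill $74, mean tip $10, slope ≈ 0.1462); they are not necessarily the exact table shown on screen.

```python
import numpy as np

# Hypothetical bill/tip pairs consistent with the video's summary numbers.
bills = np.array([34.0, 108.0, 64.0, 88.0, 99.0, 51.0])
tips = np.array([5.0, 17.0, 11.0, 8.0, 14.0, 5.0])

# Least-squares fit of tip = slope * bill + intercept.
slope, intercept = np.polyfit(bills, tips, 1)
print(round(slope, 4), round(intercept, 4))  # slope ≈ 0.1462, intercept ≈ -0.820
```

The exact intercept here comes out near −0.820; the video's −0.8188 follows from using the rounded slope of 0.1462.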
So what is residual analysis? By definition, a residual is the quantity remaining after other things have been taken into account, subtracted, or allowed for. In our daily lives, most of us get a paycheck of some sort, and then we have bills to pay, food to buy, and other things; hopefully, at the end of all that, we have a little bit of money left to save, to use for leisure, or whatever else. Once all those obligations are met, that little bit of money we might have left is the residual of our paycheck. A residual is literally what's left over. In this case, it's what's left over after our model is done explaining, or has run out of the ability to explain, the data we are looking at. In statistics, it's the difference between the observed value of the dependent variable, which in this case is the tip amount, and what is predicted by the regression model. Our regression line in the previous slide is a way of predicting what tip we would expect for a given meal amount, but we also have observed values, and the residual is the difference between those two. For example, if the regression model predicts a tip of $10 for a given meal, but the observed tip that actually lands on the table is $12, then the residual is 12 minus 10, or 2. The notation, which we have seen before in many cases, is y_i − ŷ_i: the observed tip minus the predicted tip.
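The video's $12-observed versus $10-predicted example, in code:

```python
# A residual is simply observed minus predicted: y_i - y_hat_i.
observed_tip = 12.0   # the tip actually left on the table
predicted_tip = 10.0  # the tip the regression model predicted
residual = observed_tip - predicted_tip
print(residual)  # 2.0
```

Note the sign convention: a positive residual means the model under-predicted, a negative residual means it over-predicted.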
Remember the standard regression model: y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ is the slope, and ε is an error term. The first two terms are our regression model, and at the end is the error. The regression model at the beginning will hopefully explain a lot of the variation in our dependent variable, but it's probably not going to explain all of it; it's very rare that it will. So there's some left over, and what's left is what we call the residuals. Only part of the variance in the dependent variable will be explained by the values of the independent variable. We see that as the value of R² in regression output, which is just the sum of squares due to regression, the SSR, divided by the total sum of squares. That is the variance explained by the model itself, but that's not the whole story.
The variance left unexplained is due to model error. Our model will fit the data to a certain point, but then there's some left, and that's our error, or our residuals. You can think of it as how far off the model is, if you're thinking negatively, or, if you're thinking positively, how well the model accounts for the variance in the dependent variable. It will hopefully explain a good chunk of it, but there's probably going to be some left.
This graph has a lot of things going on; it's actually from a previous video I did on regression, so if you go back in the playlist you'll see the first time I went over this slide. Just to set the stage real quick and refresh your memory (or, if you're new to the playlist, to give you this information), a couple of things are happening here. The sloped line, the one with two purple dots that goes from lower left to upper right, is our regression line, given by the equation up in the upper left. The purple dots are predicted values, so it should make sense that the purple dots on the line are the predicted values produced by that regression equation. You also have a dashed line across the middle: that is the mean of the dependent variable. The mean tip amount was $10, so we draw a flat line right at the mean of the dependent variable; that line has a couple of black dots on it. Then you have some orange diamonds: those are our observed values. So we have three kinds of things going on here: the regression line with the purple dots for the predicted values, the dashed line with the black dots at the mean of the dependent variable, and the orange diamonds, which are the actual observed values. Now let's talk very quickly about what SSE, SST, and SSR are.
First, SSE. Here's the equation: the sum of (y_i − ŷ_i)². Again, that's just observed minus predicted, squared, and then summed up; that's what SSE is, the error. In distance terms, remember the purple dot is the predicted value and the orange diamond is the observed value, so SSE is the difference between those two, squared and then summed up for each point along the line; there's one of those distances up there in the upper right. Next we have SST, the total sum of squares: the sum of (y_i − ȳ)². Remember y_i is the observed value, the orange diamond, and ȳ is the black dashed line across the middle, so we take that distance, square it, and sum them up; that's the distance between the orange diamond and the black dot, and there's another one there. Now, looking at these brackets, you'll notice there's one left, and that is SSR, the sum of squares due to our model, or sum of squares due to regression: the sum of (ŷ_i − ȳ)², where ŷ_i is the purple dot, the predicted value, and ȳ is the mean of the dependent variable; square those, sum them up, and that's that distance there and there. So we have three measures going on: SSE, the sum of squares due to error; SST, the total sum of squares; and SSR, the sum of squares due to regression. And if you look over here on the right, there's a good example: the total sum of squares, in the orange bracket on the far right, is literally made up of the SSE, in green, and the SSR, in purple.
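As a sketch, these three sums of squares can be computed directly and the decomposition SST = SSE + SSR verified. The six bill/tip pairs are illustrative values consistent with the summary numbers quoted in the video, not necessarily the exact on-screen table:

```python
import numpy as np

# Hypothetical bill/tip data consistent with the video's summary numbers.
bills = np.array([34.0, 108.0, 64.0, 88.0, 99.0, 51.0])
tips = np.array([5.0, 17.0, 11.0, 8.0, 14.0, 5.0])

slope, intercept = np.polyfit(bills, tips, 1)
predicted = slope * bills + intercept          # y_hat_i
y_bar = tips.mean()                            # mean of the dependent variable

sse = np.sum((tips - predicted) ** 2)          # error: observed vs predicted
ssr = np.sum((predicted - y_bar) ** 2)         # regression: predicted vs mean
sst = np.sum((tips - y_bar) ** 2)              # total: observed vs mean

print(round(sse, 2), round(ssr, 2), round(sst, 2))
print(round(ssr / sst, 4))  # R-squared: the share of variance explained
```

With these illustrative values, SST comes out to 120 and the model explains roughly three-quarters of the variance in tip amount.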
Now a few model assumptions. Here's the standard regression model we talked about, and here are some assumptions that we make. One, the residuals offer the best information about the error term. Again, the β₀ + β₁x part, the regression model, won't explain everything; it will explain some but not all, so we have the epsilon, the error, that's left, and the residuals offer the best information about that remainder of the story in our model. Two, the expected value of the error term, the mean of the error term, is zero. Three, for all values of the independent variable x, the variance of the error term is the same. What we're saying there is that regardless, in this case, of what meal amount you have, a meal of $30, $50, or $75, the variance of the error term at each point along the independent variable is constant; it's the same. Four, the values of the error term are independent of each other: there is no relationship between the error terms. And five, the error term follows a normal distribution; if we took all of our errors, or all of our residuals, put them in their own distribution, and looked at it, it should follow a normal, bell-shaped distribution.
So again, here are the residuals in this model. I wanted to put them up here because we're going to graph them in a second. What we can do now is graph the residuals, and often the best way to look at residuals is on a graph, or a scatter plot. We're going to graph the residuals against two things: against the independent variable, which is the meal amount here along the bottom, and then as a function of the predicted values. Let's look at both of those graphs.

First we have the residual plot against the independent variable, in this case the x variable, which is the bill amount, and here's what that looks like. You can see that the residual for the first meal amount over here on the left-hand side, a meal of about thirty-seven dollars, was a little bit under one; for the meal amount here in the middle, around fifty or fifty-one dollars, the residual was a little bit less than negative two; and so on and so forth. These are the residuals plotted against the bill amount for each observation.

Next, and maybe most importantly, we're going to plot the residuals against the predicted values, so here's that. How do we interpret this? Look at the first dot over here on the left: for that first meal, the difference between the tip predicted by our model and the observed tip in the data was a little bit less than one. Now look at the second one: the difference between the predicted tip our model gave us and the observed tip was a little bit less than negative two. So you can see that what we're doing here is looking at the observed versus the predicted. This is the residual plot against ŷ, the predicted dependent variable, and it's probably the most important one. What we're looking for is patterns in the residuals.
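Both residual plots can be sketched with matplotlib. The data values are the same illustrative ones used above, chosen to be consistent with the video's summary numbers:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data consistent with the video's summary numbers.
bills = np.array([34.0, 108.0, 64.0, 88.0, 99.0, 51.0])
tips = np.array([5.0, 17.0, 11.0, 8.0, 14.0, 5.0])

slope, intercept = np.polyfit(bills, tips, 1)
predicted = slope * bills + intercept
residuals = tips - predicted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(bills, residuals)       # residuals vs the independent variable
ax1.axhline(0, linestyle="--")
ax1.set(xlabel="Bill amount ($)", ylabel="Residual")
ax2.scatter(predicted, residuals)   # residuals vs the predicted values
ax2.axhline(0, linestyle="--")
ax2.set(xlabel="Predicted tip ($)", ylabel="Residual")
fig.savefig("residual_plots.png")
```

Because the fit includes an intercept, the residuals sum to zero, so the points scatter around the dashed zero line in both panels.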
Let's talk about some general patterns we can look for, using generic graphs of the different ways residuals can appear. The first case is the best case: if we graph our residuals like we did in the previous two slides and they look like this, evenly scattered left to right and top to bottom all over the graph, that's a good thing. They all fit here in the middle, with no pattern to them other than being uniformly distributed pretty much everywhere. There's a technical word for that: homoscedasticity, or constant variance. We can see that the variance of the residuals here in the middle is constant from left to right; there's no bending, bowing, squeezing, or anything like that. All the residuals are in a nice, even distribution across the graph: constant variance.
But we could have something that looks like this instead: heteroscedasticity, or non-constant variance. On the left side of our graph the residuals are much more spread out than they are over on the right side, and this should give us some pause (we'll see why in a minute): our residuals are not evenly distributed across the graph from left to right. The error is larger down on this end than it is over on the right end, and that can be a problem. Another type of heteroscedasticity comes from nonlinear data, or from using the wrong model. Here our residuals are in a bow shape; we're going to see this in other videos coming up when we talk about nonlinear models. The residuals follow an arc, either from the lower left up and back down to the right, or in another direction; it could be half of an arc or something like that. This might show us that our data is actually nonlinear, and that a linear model may not be appropriate for this data. In the next few videos, on nonlinear models, you will see this pattern in the residuals when we go to look at which model is best for fitting the data.
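One rough, informal way to see non-constant variance numerically (not a formal test) is to compare the residual spread on the two halves of the x range. The data here are simulated with noise that deliberately grows with x, producing the cone or fan shape described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated heteroscedastic data: the noise scale grows with x.
x = np.linspace(1, 100, 200)
y = 0.15 * x + rng.normal(scale=0.05 * x)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Informal check: residual spread on the left half vs the right half of x.
left_spread = residuals[:100].std()
right_spread = residuals[100:].std()
print(round(left_spread, 2), round(right_spread, 2))
```

For homoscedastic data the two spreads would be about equal; here the right half is clearly wider than the left.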
So here is the same residual plot we had before, the residual plot against ŷ, our predicted values. Along the bottom is the predicted tip amount, and then we have the difference between that and the observed value; that distance is the residual. What pattern does this follow? Well, it follows a fairly even pattern from left to right. Again, we only have six observations in this very small data set, so you might see patterns where there aren't really any; but in this case I think it's fair to say that the residuals occur on both the top and bottom of our plot and are about the same left to right. There's no cone shape, there's no curve to them or anything like that, so this is a good residual plot.
Here are our two plots side by side. First we have the residual plot against the independent variable, the bill amount: a nice pattern, in that there's no pattern. Then over here we have the residual plot against the predicted dependent variable: same thing. These look pretty good. Now let's put it all together with our bill-amount line fit plot. First we have our observed values, the orange circles; on top of that we can put our regression line, and then our predicted values, the yellow circles. So we can see how each observed value falls above or below the predicted amount, and those distances are our residuals.
Here's our bill-amount residual plot: there, there, and there. And what pattern? Well, no real pattern, and that's a good residual plot. So, a few final points. What happens if the residual analysis reveals heteroscedasticity? That means our residuals are not uniformly distributed across the residual plot: they might have a curvature to them, or they might be non-constant, like a cone shape in one direction. What can we do?
We could rebuild the model with a different independent variable, or variables; that's one option. We could perform some type of transformation on the nonlinear data, taking a logarithm or something else for that variable. We could fit a nonlinear regression model: linear regression is not the only type; there are many others, nonlinear regression, piecewise regression, all kinds. But be careful: don't overfit the model. In my next playlist, where we talk about nonlinear regression, we will talk a lot about the dangers of overfitting.
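As a quick sketch of the transformation idea: if y grows exponentially with x, a straight line will not fit y itself, but it will fit log(y). The data below are simulated purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated nonlinear data: y grows exponentially with x, with mild noise.
x = np.linspace(1, 10, 50)
y = np.exp(0.5 + 0.3 * x) * rng.lognormal(sigma=0.05, size=50)

# A straight line fits log(y): log(y) ≈ a + b * x.
b, a = np.polyfit(x, np.log(y), 1)
print(round(b, 2), round(a, 2))
```

After the log transform, the fitted slope and intercept recover the true generating values (0.3 and 0.5) closely, and the residuals of the transformed fit lose their arc shape.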
A final question: are there quantitative statistical tests for residuals? The answer is yes. There is the Breusch-Pagan test, the White test, and the NCV (non-constant variance) test. For the sake of this video, we're not going to go into those; they are more advanced, and I actually think there are other ways, both visual and computational, to figure out if you have a problem with your residuals. So we'll stick with those for now, but I want you to be aware that there are statistical tests out there for residuals.
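As a sketch, the core of the Breusch-Pagan idea (in its Koenker LM form) can be implemented directly: regress the squared residuals on x and compute LM = n·R² of that auxiliary regression, which is approximately chi-square distributed. The data are simulated to be heteroscedastic; for real work, statsmodels provides `statsmodels.stats.diagnostic.het_breuschpagan`.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data whose error variance grows with x (heteroscedastic).
x = np.linspace(1, 100, 200)
y = 0.15 * x + rng.normal(scale=0.05 * x)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Breusch-Pagan idea (Koenker LM form): regress squared residuals on x;
# LM = n * R^2 of that auxiliary regression, chi-square with 1 df here.
sq = resid ** 2
b1, b0 = np.polyfit(x, sq, 1)
fitted_sq = b1 * x + b0
r2_aux = 1 - np.sum((sq - fitted_sq) ** 2) / np.sum((sq - sq.mean()) ** 2)
lm_stat = len(x) * r2_aux
print(round(lm_stat, 1))  # compare with the chi-square(1) critical value 3.84
```

For this deliberately heteroscedastic sample the LM statistic lands far above 3.84, so the constant-variance assumption would be rejected.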
This video is brought to you by The Great Courses Plus, where you can get unlimited access to over 10,000 video lectures taught by award-winning professors from the Ivy League and other top schools around the world. You can learn about anything that interests you: science, literature, and yes, statistics, like this lecture from Professor Talithia Williams called "Linear Regression Models and Assumptions," from her course Learning Statistics: Concepts and Applications in R. Right now The Great Courses Plus is offering my viewers a free trial, and it's now also optimized for Australia and the UK. So go to thegreatcoursesplus.com/BrandonFoltz (my name) to get access to the ten-thousand-video lecture library, or click on the link in the description below.

Okay, so that wraps up our video on residual analysis in simple linear regression. Again, it is a very important concept when figuring out, one, how good our model is, and two, whether the model we're trying to implement is actually appropriate for the data we have. And it has implications for more advanced areas and other fields, such as advanced statistics, data science, machine learning, and things of that nature. I hope you found this very visual and very insightful, and that it's something you can take with you as you progress. Thank you very much for watching, and I look forward to seeing you again in our next video. Take care.