Simple Linear Regression: Checking Assumptions with Residual Plots

Channel: unknown

[1]
Let's look into checking the model assumptions with residual plots in simple linear regression.
[6]
Recall our simple linear regression model,
[9]
where Y is assumed to have this linear relationship with X,
[13]
and epsilon is a random error component,
[17]
representing the fact that the Ys have some variability
[20]
and are randomly distributed about that line.
[22]
To do any statistical inference
[25]
we needed to make some assumptions about that random error term epsilon.
[29]
The error terms are assumed to be normally distributed
[32]
and homoscedastic, meaning they all have the same variance,
[36]
and they are assumed to be independent of one another.
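
Written out, the model and the assumptions just described look like the following. The symbols \beta_0, \beta_1, and \sigma^2 are the usual textbook notation and are assumed here, since the transcript does not reproduce the slide.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n
```

The "iid normal with variance \sigma^2" statement bundles the three assumptions together: normality, constant variance, and independence.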
[39]
These assumptions may or may not be true,
[41]
and so we should investigate them where we can.
[45]
If these assumptions are true then the observed residuals should behave in a similar fashion.
[51]
Epsilon_i would represent the true theoretical error term for the ith individual
[58]
whereas e_i represents the observed residual that we see in the sample,
[63]
the value of Y for that individual minus the predicted value of Y from the model.
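
As a minimal sketch of that definition, the residuals can be computed by fitting the least squares line and subtracting each predicted value from the observed one. The function names `fit_line` and `residuals` and the five data points are made up for illustration; they are not from the video.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y):
    """Observed residuals e_i = y_i - y_hat_i from the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Made-up illustrative data, not from the video.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
e = residuals(x, y)
```

Each entry of `e` is one observed residual; these are the values that go on the vertical axis of a residual plot.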
[70]
And if the assumptions of our model are true
[72]
then these observed residuals should behave in a similar fashion.
[75]
In other words, they should be approximately normally distributed
[78]
with constant variance for all the different X values.
[83]
Strictly speaking the observed residuals are not going to be independent of one another
[87]
but that's not really a big problem for our purposes.
[91]
Here I've plotted the residuals for a simulated dataset against X.
[95]
I could have plotted them against the predicted values (the Y-hat values),
[100]
and that would give us basically the same picture with just a different scaling on the x-axis.
[105]
But we shouldn't plot them against the observed Y values.
[108]
We can't plot the residuals against the observed values of Y because the residuals are correlated with those observed values,
[113]
so that would give us a misleading picture.
[115]
We shouldn't plot that out.
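
One way to see why this matters is to compare correlations on simulated data. The sketch below uses arbitrary simulation settings (intercept 3, slope 2, error standard deviation 4, all assumptions for illustration). In a least squares fit with an intercept, the residuals are uncorrelated with X and with the predicted values, but their correlation with the observed Y works out to sqrt(1 - r^2), where r is the correlation between X and Y.

```python
import math
import random


def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Simulated data; the intercept, slope, and error SD are arbitrary choices.
random.seed(1)
x = [random.uniform(0, 10) for _ in range(200)]
y = [3.0 + 2.0 * xi + random.gauss(0, 4.0) for xi in x]

# Least squares fit, predicted values, and residuals.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fitted)]

r_ex = corr(e, x)        # essentially zero
r_ef = corr(e, fitted)   # essentially zero
r_ey = corr(e, y)        # clearly positive
r_xy = corr(x, y)
```

A plot of the residuals against the observed Y values would therefore show an upward trend even when the model is perfectly fine, which is the misleading picture being warned about.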
[117]
One thing to note right off the bat is that the residuals always sum to 0 in simple linear regression.
[123]
These observed residuals are going to sum to 0,
[125]
and that's why I put this 0 line in for a little perspective.
[128]
The residuals sum to zero and so they have a mean of zero.
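
The reason is short enough to write out. Assuming the usual least squares estimates, with the intercept estimate equal to \bar{y} - \hat\beta_1 \bar{x} (standard notation, not shown in the transcript), the residuals cancel exactly:

```latex
\sum_{i=1}^{n} e_i
  = \sum_{i=1}^{n} \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right)
  = n\left( \bar{y} - \hat\beta_0 - \hat\beta_1 \bar{x} \right)
  = 0,
\quad \text{since } \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.
```

This relies on the model including an intercept term, as the simple linear regression model here does.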
[132]
What we are hoping to see is simply a random scattering of points,
[136]
nothing giving any indication that the assumptions of our model are false.
[140]
So here for instance it looks like the variability is about the same at all these different values of X.
[146]
The variability here is approximately equal all the way along.
[151]
There also doesn't appear to be any curvature
[153]
or just any other indications that there is a problem with the model.
[156]
So I'm going to give this the big check mark, indicating that that residual plot
[161]
gives us no indication that the assumptions of our model are false.
[165]
Here's another type of plot we might see.
[167]
At first glance this might look a little bit different from that plot on the last page
[170]
because we've got all these different measurements at these individual values of X.
[176]
We might see this type of thing in an experiment
[179]
where we have some control over the levels of X.
[182]
For instance, if we're giving different dose levels of a drug or something to that effect.
[185]
But here it's a similar story to the last page: there's not really much going on here.
[190]
The variability is approximately the same across the board.
[193]
There's no systematic curvature -- nothing indicating non-normality or anything of that nature.
[199]
Overall this is a very reasonable residual plot,
[202]
a fairly random scattering of points. I'm going to give that the check mark,
[206]
indicating there are no real obvious problems with our assumed model.
[211]
In this residual plot there's a more obvious problem.
[214]
The variance of the residuals is increasing with X
[217]
and that's a violation of the constant variance assumption.
[222]
This type of situation is not uncommon.
[225]
It's not unusual in statistics for the variance to increase with the mean,
[229]
so we do see this kind of thing. We do have ways of dealing with this.
[233]
One option would be to use something we call weighted regression to deal with this changing variance,
[239]
but overall the assumptions of our model are not reasonable here,
[243]
so I'm going to give that the big X.
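
The video does not go into how weighted regression works, so what follows is only a rough sketch under a stated assumption: that the error standard deviation is proportional to X, in which case weights proportional to 1/X^2 are the natural choice. The function name `weighted_fit` and the simulation settings are hypothetical.

```python
import random


def weighted_fit(x, y, w):
    """Return (intercept, slope) minimizing sum of w_i * (y_i - b0 - b1*x_i)^2."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_w - slope * x_w, slope


# Simulated data whose error SD grows in proportion to x (an assumption).
random.seed(2)
x = [random.uniform(1, 10) for _ in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0, 0.5 * xi) for xi in x]

# Weights proportional to 1 / variance, i.e. 1 / x^2 under that assumption.
w = [1.0 / xi ** 2 for xi in x]
b0_w, b1_w = weighted_fit(x, y, w)

# Equal weights reproduce ordinary least squares, for comparison.
b0_u, b1_u = weighted_fit(x, y, [1.0] * len(x))
```

The weights downplay the noisy observations at large X, so they have less influence on the fitted line than they do under ordinary least squares. In practice the right weights depend on how the variance actually changes, which has to be judged from the data or the context.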
[246]
Here's something that's a more obvious problem.
[249]
There's systematic curvature in the residuals
[253]
and that indicates that the assumed linear relationship between Y and X is not a reasonable one.
[259]
We may be able to deal with this by using a slightly different model,
[262]
but for now this residual plot indicates serious problems with our assumed model,
[266]
and I'm going to give it the big X.
[270]
If we recorded our observations in time order of some nature,
[273]
then we should also plot the residuals against time order.
[276]
Here there is something going on:
[278]
our residuals are small and then they're big and then they're small and then they're big.
[283]
There is some sort of time effect that has not been properly included in our model,
[289]
and we're going to have to deal with that in some way.
[291]
If we ignore it that's going to be a problem.
[293]
So this residual plot indicates there's something systematic going on,
[297]
and there's a problem with our assumed model, so it gets the big X.
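
The video judges this by eye. As a numerical companion, and as a technique that is not part of the video, one could compute the lag-1 autocorrelation of the residuals or the Durbin-Watson statistic. Both functions below are plain implementations of the standard formulas, and the cyclic residuals are invented to mimic the small-big-small-big pattern.

```python
import math


def durbin_watson(resid):
    """Durbin-Watson statistic: near 2 suggests no lag-1 autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(r ** 2 for r in resid)
    return num / den


def lag1_autocorrelation(resid):
    """Sample correlation between each residual and the one before it."""
    n = len(resid)
    mean = sum(resid) / n
    num = sum((resid[t] - mean) * (resid[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in resid)
    return num / den


# Hypothetical residuals in time order, following a slow cycle.
cyclic = [math.sin(2 * math.pi * t / 20) for t in range(40)]
dw = durbin_watson(cyclic)          # far below 2
r1 = lag1_autocorrelation(cyclic)   # close to 1
```

A Durbin-Watson value well below 2, or a lag-1 autocorrelation well above 0, says that neighbouring residuals tend to look alike, which is what a time effect left out of the model would produce.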
[303]
Now let's look at a real-world data set that we've looked at previously.
[307]
This is activation level in the pain centres of the brain for 16 women
[311]
versus their score on the empathic concern scale.
[314]
I've also plotted the least squares regression line.
[317]
Let's see what the residual plot looks like.
[321]
Here I've plotted the residuals against the explanatory variable X,
[325]
and I'm going to say there are no obvious problems here.
[327]
There's no systematic curvature, and there are no major outliers.
[332]
One could make an argument I suppose
[334]
that the variance of the residuals might seem to be a little bit lower here than over here
[338]
but that's not necessarily an obvious effect here
[341]
and I'm not going to consider that to be a big problem.
[343]
So I'm gonna give this the check mark -- that's a reasonable residual plot.
[348]
Here's a normal quantile-quantile plot of those residuals.
[353]
If you recall, if the residuals are approximately normally distributed
[357]
then the normal quantile-quantile plot of those residuals will result in an approximately straight line,
[363]
and I'm going to say that for a normal quantile-quantile plot that's a pretty straight line.
[367]
So I'm going to give the check mark there,
[369]
and say that the residuals are approximately normally distributed.
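
For readers who want to build such a plot themselves, here is a minimal sketch of the points that go into it. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as software packages differ. The correlation between the two axes is added as a rough numerical stand-in for "how straight is the line", not something the video uses.

```python
import math
import random
from statistics import NormalDist


def qq_points(resid):
    """Return (theoretical normal quantiles, sorted residuals) for a Q-Q plot."""
    n = len(resid)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sorted(resid)


def qq_correlation(resid):
    """Correlation between the two Q-Q axes; near 1 means nearly straight."""
    t, s = qq_points(resid)
    n = len(t)
    t_bar, s_bar = sum(t) / n, sum(s) / n
    sts = sum((a - t_bar) * (b - s_bar) for a, b in zip(t, s))
    stt = sum((a - t_bar) ** 2 for a in t)
    sss = sum((b - s_bar) ** 2 for b in s)
    return sts / math.sqrt(stt * sss)


# Simulated stand-ins for residuals: one normal set, one right-skewed set.
random.seed(3)
normal_resid = [random.gauss(0, 1) for _ in range(200)]
skewed_resid = [random.expovariate(1.0) - 1.0 for _ in range(200)]
r_normal = qq_correlation(normal_resid)
r_skewed = qq_correlation(skewed_resid)
```

Plotting the two lists returned by `qq_points` against each other gives the normal quantile-quantile plot. Normally distributed residuals fall close to a straight line, while skewed residuals bend away from it in the tails.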
[373]
So those different residual plots don't indicate any problems with our assumed model,
[378]
and it would be okay to go ahead with some of our statistical inference techniques.
[383]
Here's another data set that we've looked at previously.
[387]
This is Janka hardness versus density for 36 Australian trees,
[390]
and we fit a least squares regression line through those 36 points.
[394]
It looks to be a pretty reasonable fit,
[396]
but let's see what the residual plot looks like.
[399]
Here I've plotted the residuals against the explanatory variable X
[404]
and there does appear to be some systematic curvature here.
[408]
As well, one could argue that the variability over here
[412]
seems to be a little bit less than the variability over here,
[415]
so this residual plot is indicating a possible problem with our assumed model.
[421]
And if we go back to this scatterplot
[423]
armed with the knowledge of what the residual plot looked like,
[426]
we might be able to see that curvature here.
[428]
We might be able to see, wait a minute,
[430]
it seems to be doing something like this,
[434]
but the residual plot makes that a little bit more obvious.
[438]
The residual plot removes that increasing trend and then rescales the y-axis
[443]
so it's a little bit easier to see these issues sometimes in the residual plot,
[448]
even though the scatterplot gives us basically the same information.
[452]
So the residual plot has told us that there's a problem with the assumed model.
[456]
Perhaps that straight line relationship was a reasonable one to begin with,
[460]
but the residual plot told us that it's not perfect and that we can probably improve upon it.
[465]
To improve our model we may consider adding an X^2 term, to fit a curve through those points,
[471]
or we might consider a transformation of one or both of our variables,
[475]
in an effort to come up with a straight line relationship
[478]
but those are other talks for other days.
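
To give a feel for the first of those options, here is a rough sketch of fitting a curve with an added X^2 term by solving the least squares normal equations directly. The helper names (`solve`, `fit_polynomial`, `sse`) and the curved data are invented for illustration; this is not the hardness data from the video, and real analyses would normally use a statistics package instead.

```python
def solve(a, b):
    """Solve the small linear system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the top.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef


def fit_polynomial(x, y, degree):
    """Least squares coefficients [b0, b1, ..., b_degree]."""
    powers = range(degree + 1)
    xtx = [[sum(xi ** (j + k) for xi in x) for k in powers] for j in powers]
    xty = [sum((xi ** j) * yi for xi, yi in zip(x, y)) for j in powers]
    return solve(xtx, xty)


def sse(x, y, coef):
    """Sum of squared residuals for the fitted polynomial."""
    fitted = [sum(c * xi ** j for j, c in enumerate(coef)) for xi in x]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Made-up data that follow an exact curve, so the contrast is easy to see.
x = [float(i) for i in range(1, 11)]
y = [1.0 + 2.0 * xi + 0.5 * xi ** 2 for xi in x]

line = fit_polynomial(x, y, 1)   # straight line: leaves curved residuals
quad = fit_polynomial(x, y, 2)   # adds the X^2 term
```

Comparing `sse(x, y, line)` with `sse(x, y, quad)` shows how much of the leftover structure the X^2 term absorbs. With real, noisy data the improvement would be smaller, and the residual plot of the new model should be checked again.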