
Simple Linear Regression: Checking Assumptions with Residual Plots - YouTube

[1]

Let's look into checking the model assumptions with residual plots in simple linear regression.

[6]

Recall our simple linear regression model,

[9]

where Y is assumed to have this linear relationship with X,

[13]

and epsilon is a random error component,

[17]

representing the fact that the Ys have some variability

[20]

and are randomly distributed about that line.

[22]

To do any statistical inference

[25]

we needed to make some assumptions about that random error term epsilon.

[29]

The error terms are assumed to be normally distributed

[32]

and homoscedastic, meaning they all have the same variance,

[36]

and they are assumed to be independent of one another.
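
Written out, the model and its assumptions look roughly like this. The slide with the formula is not captured in the transcript, so the symbols used here (beta_0 and beta_1 for the intercept and slope, sigma^2 for the common error variance) are conventional notation assumed for illustration.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n
```

Normality, the single sigma^2 (homoscedasticity), and independence are the three assumptions the residual plots are meant to probe.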

[39]

These assumptions may or may not be true,

[41]

and so we should investigate them where we can.

[45]

If these assumptions are true then the observed residuals should behave in a similar fashion.

[51]

Epsilon_i would represent the true theoretical error term for the ith individual

[58]

whereas e_i represents the observed residual that we see in the sample,

[63]

the value of Y for that individual minus the predicted value of Y from the model.

[70]

And if the assumptions of our model are true

[72]

then these observed residuals should behave in a similar fashion.

[75]

In other words, they should be approximately normally distributed

[78]

with constant variance for all the different X values.

[83]

Strictly speaking the observed residuals are not going to be independent of one another

[87]

but that's not really a big problem for our purposes.

[91]

Here I've plotted the residuals for a simulated dataset against X.

[95]

I could have plotted them against the predicted values (the Y-hat values),

[100]

and that would give us basically the same picture with just a different scaling on the x-axis.

[105]

But we shouldn't plot them against the observed Y values.

[108]

We shouldn't plot the residuals against the observed values of Y because the two are correlated,

[113]

so that would give us a misleading picture.

[115]

We shouldn't plot that out.

[117]

One thing to note right off the bat is that the residuals always sum to 0 in simple linear regression.

[123]

These observed residuals are going to sum to 0,

[125]

and that's why I put this 0 line in for a little perspective.

[128]

The residuals sum to zero and so they have a mean of zero.
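
The reason is short enough to write down. Using b_0 and b_1 for candidate intercept and slope values (notation assumed here, since the slides are not in the transcript), least squares minimizes the sum of squared deviations, and setting the derivative with respect to the intercept to zero at the estimates gives:

```latex
\begin{aligned}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\left.\frac{\partial S}{\partial b_0}\right|_{(\hat\beta_0, \hat\beta_1)}
  &= -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)
   = -2 \sum_{i=1}^{n} e_i = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} e_i = 0
\end{aligned}
```

This relies on the model including an intercept term.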

[132]

What we are hoping to see is simply a random scattering of points,

[136]

nothing giving any indication that the assumptions of our model are false.

[140]

So here for instance it looks like the variability is about the same at all these different values of X.

[146]

The variability here is approximately equal all the way along.

[151]

There also doesn't appear to be any curvature

[153]

or just any other indications that there is a problem with the model.

[156]

So I'm going to give this the big check mark, indicating that that residual plot

[161]

gives us no indication that the assumptions of our model are false.

[165]

Here's another type of plot we might see.

[167]

At first glance this might look a little bit different from that plot on the last page

[170]

because we've got all these different measurements at these individual values of X.

[176]

We might see this type of thing in an experiment

[179]

where we have some control over the levels of X.

[182]

For instance, if we're giving different dose levels of a drug or something to that effect.

[185]

But here it's a similar story to the last page, there's not really much going on here.

[190]

The variability is approximately the same across the board.

[193]

There's no systematic curvature -- nothing indicating non normality or anything of that nature.

[199]

Overall this is a very reasonable residual plot.

[202]

A fairly random scattering of points. I'm going to give that the check mark,

[206]

indicating there are no really obvious problems with our assumed model.

[211]

In this residual plot there's a more obvious problem.

[214]

The variance of the residuals is increasing with X

[217]

and that's a violation of the constant variance assumption.

[222]

This type of situation is not uncommon.

[225]

It's not unusual in statistics for the variance to increase with the mean,

[229]

so we do see this kind of thing. We do have ways of dealing with this.

[233]

One option would be to use something we call weighted regression to deal with this changing variance,

[239]

but overall the assumptions of our model are not reasonable here,

[243]

so I'm going to give that the big X.

[246]

Here's something that's a more obvious problem.

[249]

There's systematic curvature in the residuals

[253]

and that indicates that that assumed linear relationship between Y and X is not a reasonable one.

[259]

We may be able to deal with this by using a slightly different model,

[262]

but for now this residual plot indicates serious problems with our assumed model,

[266]

and I'm going to give it the big X.

[270]

If we recorded our observations in time order of some nature,

[273]

then we should also plot the residuals against time order.

[276]

Here there is something going on:

[278]

our residuals are small and then they're big and then they're small and then they're big.

[283]

There is some sort of time effect that has not been properly included in our model,

[289]

and we're going to have to deal with that in some way.

[291]

If we ignore it that's going to be a problem.

[293]

So this residual plot indicates there's something systematic going on,

[297]

and there's a problem with our assumed model, so it gets the big X.

[303]

Now let's look at a real-world data set that we've looked at previously.

[307]

This is activation level in the pain centres of the brain for 16 women

[311]

versus their score on the empathic concern scale,

[314]

and I've plotted in the least squares regression line.

[317]

Let's see what the residual plot looks like.

[321]

Here I've plotted the residuals against the explanatory variable X,

[325]

and I'm going to say there are no obvious problems here.

[327]

There's no systematic curvature, and there are no major outliers.

[332]

One could make an argument I suppose

[334]

that the variance of the residuals might seem to be a little bit lower here than over here

[338]

but that's not necessarily an obvious effect here

[341]

and I'm not going to consider that to be a big problem.

[343]

So I'm going to give this the check mark -- that's a reasonable residual plot.

[348]

Here's a normal quantile-quantile plot of those residuals.

[353]

If you recall, if the residuals are approximately normally distributed,

[357]

then the normal quantile-quantile plot of those residuals will result in an approximately straight line,

[363]

and I'm going to say that for a normal quantile-quantile plot that's a pretty straight line.

[367]

So I'm going to give the check mark there,

[369]

and say that the residuals are approximately normally distributed.

[373]

So those different residual plots don't indicate any problems with our assumed model,

[378]

and it would be okay to go ahead with some of our statistical inference techniques.

[383]

Here's another data set that we've looked at previously.

[387]

This is Janka hardness versus density for 36 Australian trees,

[390]

and we fit a least-squares regression line through those 36 points,

[394]

and it looks to be a pretty reasonable fit,

[396]

but let's see what the residual plot looks like.

[399]

Here I've plotted the residuals against the explanatory variable X,

[404]

and there does appear to be some systematic curvature here,

[408]

as well, one could argue that the variability over here

[412]

seems to be a little bit less than the variability over here,

[415]

so this residual plot is indicating a possible problem with our assumed model.

[421]

And if we go back to this scatterplot

[423]

armed with the knowledge of what the residual plot looked like,

[426]

we might be able to see that curvature here.

[428]

We might be able to see, "wait a minute,

[430]

it seems to be doing something like this,"

[434]

but the residual plot makes that a little bit more obvious.

[438]

The residual plot removes that increasing trend and then rescales the y-axis

[443]

so it's a little bit easier to see these issues sometimes in the residual plot,

[448]

even though the scatterplot gives us basically the same information.

[452]

So the residual plot has told us that there's a problem with the assumed model.

[456]

Perhaps that straight line relationship was a reasonable one to begin with,

[460]

but the residual plot told us that it's not perfect and that we can probably improve upon it.

[465]

To improve our model we may consider adding an X^2 term, to fit a curve through those points,

[471]

or we might consider a transformation of one or both of our variables,

[475]

in an effort to come up with a straight line relationship,

[478]

but those are other talks for other days.
