Calculating R-squared | Regression | Probability and Statistics | Khan Academy - YouTube

Channel: Khan Academy

[0]
In the last video, we were able to find the equation for
[2]
the regression line for these four data points.
[7]
What I want to do in this video is figure out the r
[9]
squared for these data points.
[10]
Figure out how well this line fits the data.
[13]
Or even better, figure out the percentage-- which is really
[16]
the same thing-- of the variation of these data
[19]
points, especially the variation in y, that is due
[23]
to, or that can be explained by variation in x.
[26]
And to do that, I'm actually going to get
[27]
a spreadsheet out.
[29]
I've actually tried to do this with a calculator and it's
[31]
much harder.
[31]
So hopefully this doesn't confuse you too much to use a
[34]
spreadsheet.
[35]
And I'm going to make a couple of columns here.
[36]
And spreadsheets actually have functions that'll do all of
[38]
this automatically, but I really want to do it so that
[40]
you could do it by hand if you had to.
[42]
So I'm going to make a couple of columns here.
[44]
This is going to be my x column.
[45]
This is going to be my y column.
[48]
This is going to be the column-- I'll call this y
[50]
star-- this'll be the y value that our line predicts based
[56]
on our x value.
[57]
This is going to be the error with the line.
[68]
Let me call it the squared error with the line.
[80]
I don't want us to take up too much space.
[88]
And then the next one, I'm going to have the squared
[101]
variation for that y value from the mean y.
[116]
And I think these columns by themselves will be enough for
[119]
us to do everything.
[120]
So let's first put all the data points in.
[122]
So we had negative 2 comma negative 3.
[125]
That was one data point.
[126]
Negative 1 comma negative 1.
[128]
And we had 1 comma 2.
[130]
Then we have 4 comma 3.
[134]
Now, what does our line predict?
[139]
Well our line says, you give me an x value, I'm going to
[141]
tell you what y value I'll predict.
[143]
So when x is equal to negative 2, the y value on the line is
[147]
going to be the slope.
[149]
So this is going to be equal to 41 divided by
[154]
42 times our x value.
[158]
And I just selected that cell.
[160]
And just a little bit of a primer on spreadsheets, I'm
[163]
selecting the cell D2.
[166]
I was able to just move my cursor over and select that.
[168]
But that tells me the x value.
[171]
Minus 5/21.
[173]
Minus 5 divided by 21.
[179]
Just like that.
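The same cell formula can be sketched in Python. This is only an illustration of the step above, assuming the line from the previous video has slope 41/42 and intercept -5/21; the names `SLOPE`, `INTERCEPT`, and `predict` are made up for this sketch and do not appear in the video.

```python
# Regression line from the previous video: y* = (41/42) * x - 5/21.
# SLOPE, INTERCEPT and predict are illustrative names, not from the video.
SLOPE = 41 / 42
INTERCEPT = -5 / 21


def predict(x):
    """Return the y value the line predicts for a given x."""
    return SLOPE * x + INTERCEPT


y_star = predict(-2)
print(round(y_star, 2))  # -2.19
```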
[182]
So just to be clear about what we're doing.
[184]
This y star here, I got negative 2.19.
[187]
That tells us at this point right over
[189]
here is negative 2.19.
[197]
So when we figure out the error, we're going to figure
[198]
out the distance between negative 3, that's our y
[205]
value, and negative 2.19.
[208]
So let's do that.
[211]
So the error is just going to be equal to our y value.
[215]
That's cell E2.
[219]
Minus the value that our line would predict.
[225]
So just that value is the actual error.
[227]
But we want to square it.
[236]
And then, the next thing we want to do
[238]
is the squared distance.
[240]
So this is equal to the squared distance of our y
[243]
value from the y's mean.
[245]
So what's the mean of the y's?
[247]
Mean of the y's is 1/4.
[249]
So minus 0.25, which is the same thing as 1/4.
[253]
And we also want to square that.
[257]
Now, this is what's fun about spreadsheets.
[259]
I can apply those formulas to every row now.
[263]
And notice, what it did when I did that.
[265]
Now all of a sudden, this is the y value that my line would
[269]
predict, it's now using this x value and
[271]
sticking it over here.
[273]
It's now figuring out the squared distance from the line
[276]
using what the line would predict and using the
[280]
y value, this one.
[281]
And then does the same thing over here.
[285]
It figures out the squared distance of this y
[287]
value from the mean.
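Filling the formulas down every row can be sketched as a loop. This is a rough Python equivalent of the spreadsheet columns, assuming the four data points and the line y* = (41/42)x - 5/21 described above; the variable names are hypothetical.

```python
# The four data points entered in the x and y columns.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]

# Mean of the y values (1/4 in the video).
mean_y = sum(y for _, y in points) / len(points)

rows = []
for x, y in points:
    y_star = 41 / 42 * x - 5 / 21   # value the line predicts
    sq_err = (y - y_star) ** 2      # squared error with the line
    sq_dev = (y - mean_y) ** 2      # squared distance from the mean of y
    rows.append((x, y, y_star, sq_err, sq_dev))

# Print one line per spreadsheet row.
for row in rows:
    print("{:>3} {:>3} {:>7.2f} {:>7.4f} {:>8.4f}".format(*row))
```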
[291]
So what is the total squared error with the line?
[293]
So let me just sum this up.
[295]
The total squared error with the line is 2.73.
[299]
And then the total variation from the mean, squared
[305]
distances from the mean of the y, is 22.75.
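The two column totals can be reproduced with a pair of sums. A minimal sketch, assuming the same four points and the same line; `sse` and `sst` are shorthand labels chosen here and are not terms used in the video.

```python
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
mean_y = sum(y for _, y in points) / len(points)

# Total squared error with the line (about 2.74 when rounded).
sse = sum((y - (41 / 42 * x - 5 / 21)) ** 2 for x, y in points)

# Total squared distance of the y values from their mean.
sst = sum((y - mean_y) ** 2 for _, y in points)

print(round(sse, 2), sst)  # 2.74 22.75
```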
[310]
So let me be very clear what this is.
[314]
So let me write these numbers down.
[318]
I'll write it up here so we can keep looking at this
[320]
actual graph.
[322]
So our squared error versus our line, our total squared
[327]
error, we just computed to be 2.74.
[331]
I rounded a little bit.
[332]
And what that is, is you take each of these data points'
[335]
vertical distance to the line.
[337]
So this distance squared, plus this distance squared, plus
[341]
this distance squared, plus this distance squared.
[343]
That's all we just calculated on Excel.
[346]
And that total squared variation to the line is 2.74.
[352]
Or total squared error with the line.
[354]
And then the other number we figured out was the total
[357]
distance from the mean.
[358]
So the mean here is y is equal to 1/4.
[360]
So that's going to be right over here.
[368]
This is 1/2.
[369]
So right over here.
[374]
So this is our mean y value.
[379]
Or the central tendency for our y values.
[382]
And so what we calculated next was the total error, the
[386]
squared error, from the mean of our y values.
[390]
That's what we calculated over here in the spreadsheet.
[394]
You see in the formula.
[396]
It is this number, E2, minus 0.25, which is the mean of our
[401]
y's, and then squared.
[403]
That's exactly what we calculated.
[404]
We calculated for each of the y values.
[406]
And then we summed them all up.
[407]
It's 22.75.
[409]
It is equal to 22.75.
[418]
So this is essentially the error that the
[421]
line does not explain.
[423]
This is the total error, this is the total
[426]
variation of the numbers.
[427]
So if you wanted to know the percentage of the total
[429]
variation that is not explained by the line, you
[433]
could take this number divided by this number.
[435]
So 2.74 over 22.75.
[443]
This tells us the percentage of total variation not
[459]
explained by the line or by the variation in x.
[470]
And so what is this number going to be?
[473]
I can just use Excel for this.
[476]
So I'm just going to divide this number divided by this
[482]
number right over there.
[484]
I get 0.12.
[487]
So this is equal to 0.12.
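Written out as a formula, the ratio just computed looks like the following. The labels SE_line and SE_ȳ are notation chosen for this note rather than symbols from the video, and y* is the value the line predicts.

```latex
\frac{SE_{\text{line}}}{SE_{\bar{y}}}
  = \frac{\sum_{i=1}^{4} (y_i - y_i^{*})^2}{\sum_{i=1}^{4} (y_i - \bar{y})^2}
  = \frac{2.74}{22.75} \approx 0.12
```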
[496]
Or another way to think about it is 12% of the total
[499]
variation is not explained by the variation in x.
[502]
The total squared distance between each of the points or
[506]
their kind of spread, their variation, is not explained by
[509]
the variation in x.
[510]
So if you want the amount that is explained by the variance
[512]
in x, you just subtract that from 1.
[514]
So let me write it right over here.
[516]
So we have our r squared, which is the percent of the
[520]
total variation that is explained by x, is going to be
[522]
1 minus that 0.12 that we just calculated.
[531]
Which is going to be 0.88.
[535]
So our r squared here is 0.88.
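Putting the pieces together, r squared is one minus the unexplained fraction. The function below is a sketch of that calculation under the same assumptions (four points, line y* = (41/42)x - 5/21); `r_squared` and its parameters are hypothetical names, not a library API.

```python
def r_squared(points, slope, intercept):
    """Return 1 - (squared error with the line) / (squared distance from mean y)."""
    mean_y = sum(y for _, y in points) / len(points)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    sst = sum((y - mean_y) ** 2 for _, y in points)
    return 1 - sse / sst


points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
r2 = r_squared(points, slope=41 / 42, intercept=-5 / 21)
print(round(r2, 2))  # 0.88
```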
[538]
It's very, very close to 1.
[539]
The highest number it can be is 1.
[542]
So what this tells us, or a way to interpret this, is that
[545]
88% of the total variation of these y values is explained by
[558]
the line or by the variation in x.
[563]
And you can see that it looks like a pretty good fit.
[566]
Each of these aren't too far.
[571]
Each of these points are definitely much closer to the
[573]
line than they are to the mean line.
[578]
In fact, all of them are closer to our actual line than
[581]
to the mean.
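That last observation can be checked numerically. The sketch below compares each point's vertical distance to the line with its distance to the mean of y, again assuming the line from the previous video; the variable names are illustrative.

```python
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
mean_y = sum(y for _, y in points) / len(points)

# For each point: is it closer (vertically) to the line than to the mean of y?
closer_to_line = [
    abs(y - (41 / 42 * x - 5 / 21)) < abs(y - mean_y)
    for x, y in points
]

all_closer = all(closer_to_line)
print(all_closer)  # True
```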