Calculating R-squared | Regression | Probability and Statistics | Khan Academy - YouTube
Channel: Khan Academy
[0]
In the last video, we were able
to find the equation for
[2]
the regression line for these
four data points.
[7]
What I want to do in this video
is figure out the r
[9]
squared for these data points.
[10]
Figure out how well this
line fits the data.
[13]
Or even better, figure out the
percentage-- which is really
[16]
the same thing-- of the
variation of these data
[19]
points, especially the variation
in y, that is due
[23]
to, or that can be explained
by variation in x.
[26]
And to do that, I'm actually
going to get
[27]
a spreadsheet out.
[29]
I've actually tried to do this
with a calculator and it's
[31]
much harder.
[31]
So hopefully this doesn't
confuse you too much to use a
[34]
spreadsheet.
[35]
And I'm going to make a couple
of columns here.
[36]
And spreadsheets actually have
functions that'll do all of
[38]
this automatically, but I really
want to do it so that
[40]
you could do it by hand
if you had to.
[42]
So I'm going to make a couple
of columns here.
[44]
This is going to
be my x column.
[45]
This is going to
be my y column.
[48]
This is going to be the column--
I'll call this y
[50]
star-- this'll be the y value
that our line predicts based
[56]
on our x value.
[57]
This is going to be the
error with the line.
[68]
Let me call it the squared
error with the line.
[80]
I don't want us to take
up too much space.
[88]
And then the next one, I'm
going to have the squared
[101]
variation for that y value
from the mean y.
[116]
And I think these columns by
themselves will be enough for
[119]
us to do everything.
[120]
So let's first put all
the data points in.
[122]
So we had negative 2
comma negative 3.
[125]
That was one data point.
[126]
Negative 1 comma negative 1.
[128]
And we had 1 comma 2.
[130]
Then we have 4 comma 3.
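As a quick check on the line carried over from the last video, here is a minimal Python sketch that recomputes the slope and intercept from these four points. The least-squares formulas in terms of means are assumed here (they are not restated in this video), and the variable names are only illustrative.

```python
from fractions import Fraction

# The four data points read out in the video.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
n = len(points)

# Means of x, y, x*y and x^2, kept as exact fractions.
mean_x = Fraction(sum(x for x, _ in points), n)
mean_y = Fraction(sum(y for _, y in points), n)
mean_xy = Fraction(sum(x * y for x, y in points), n)
mean_x2 = Fraction(sum(x * x for x, _ in points), n)

# Standard least-squares formulas (an assumption of this sketch).
slope = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 41/42 -5/21
```

These match the 41 divided by 42 and the minus 5/21 that get typed into the spreadsheet next.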
[134]
Now, what does our
line predict?
[139]
Well our line says, you give
me an x value, I'm going to
[141]
tell you what y value
I'll predict.
[143]
So when x is equal to negative
2, the y value on the line is
[147]
going to be the slope.
[149]
So this is going to be equal
to 41 divided by
[154]
42 times our x value.
[158]
And I just selected that cell.
[160]
And just a little bit of a
primer on spreadsheets, I'm
[163]
selecting the cell D2.
[166]
I was able to just move my
cursor over and select that.
[168]
But that tells me the x value.
[171]
Minus 5/21.
[173]
Minus 5 divided by 21.
[179]
Just like that.
[182]
So just to be clear of what
we're even doing.
[184]
This y star here, I
got negative 2.19.
[187]
That tells us at this
point right over
[189]
here is negative 2.19.
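The y star column can be sketched in Python as follows. This is only an illustration of the spreadsheet formula (41 divided by 42 times x, minus 5 divided by 21); the names `predict` and `y_star` are made up for this sketch.

```python
from fractions import Fraction

# x and y values of the four data points.
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

SLOPE = Fraction(41, 42)
INTERCEPT = Fraction(-5, 21)

def predict(x):
    """y value the regression line predicts for a given x."""
    return SLOPE * x + INTERCEPT

# One predicted value per row, like filling the y star column.
y_star = [predict(x) for x in xs]

for x, p in zip(xs, y_star):
    print(f"x = {x:>2}  y* = {float(p):.2f}")
```

The first row should come out as negative 2.19, the value read off the spreadsheet.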
[197]
So when we figure out the error,
we're going to figure
[198]
out the distance between
negative 3, that's our y
[205]
value, and negative 2.19.
[208]
So let's do that.
[211]
So the error is just going to
be equal to our y value.
[215]
That's cell E2.
[219]
Minus the value that our
line would predict.
[225]
So just that value is
the actual error.
[227]
But we want to square it.
[236]
And then, the next thing
we want to do
[238]
is the squared distance.
[240]
So this is equal to the squared
distance of our y
[243]
value from the y's mean.
[245]
So what's the mean of the y's?
[247]
Mean of the y's is 1/4.
[249]
So minus 0.25, which is the
same thing as 1/4.
[253]
And we also want
to square that.
[257]
Now, this is what's fun
about spreadsheets.
[259]
I can apply those formulas
to every row now.
[263]
And notice, what it did
when I did that.
[265]
Now all of a sudden, this is the
y value that my line would
[269]
predict, it's now using
this x value and
[271]
sticking it over here.
[273]
It's now figuring out the square
distance from the line
[276]
using what the line would
predict and using the
[280]
y value, this one.
[281]
And then does the same
thing over here.
[285]
It figures out the squared
distance of this y
[287]
value from the mean.
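Filling the formulas down every row can be mirrored with a loop. The sketch below builds the same five columns (x, y, y star, squared error with the line, squared distance from the mean); the tuple layout and names are assumptions made for illustration, not the spreadsheet's actual structure.

```python
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

def predict(x):
    # Regression line from the previous video.
    return 41 / 42 * x - 5 / 21

mean_y = sum(ys) / len(ys)  # 0.25, i.e. 1/4

rows = []
for x, y in zip(xs, ys):
    y_star = predict(x)
    sq_err_line = (y - y_star) ** 2   # squared error with the line
    sq_dev_mean = (y - mean_y) ** 2   # squared distance from the mean of y
    rows.append((x, y, y_star, sq_err_line, sq_dev_mean))

for row in rows:
    print("{:>3} {:>3} {:>7.3f} {:>7.3f} {:>8.4f}".format(*row))
```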
[291]
So what is the total squared
error with the line?
[293]
So let me just sum this up.
[295]
The total squared error
with the line is 2.73.
[299]
And then the total variation
from the mean, squared
[305]
distances from the mean
of the y, are 22.75.
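Summing the two squared columns takes only a few lines of Python. This is a sketch under the same assumptions as the spreadsheet (the line is 41/42 times x minus 5/21); `sse` and `sst` are shorthand names chosen here, not terms used in the video.

```python
import math

xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

mean_y = sum(ys) / len(ys)

# Total squared error with the line.
sse = sum((y - (41 / 42 * x - 5 / 21)) ** 2 for x, y in zip(xs, ys))

# Total squared distance of the y values from their mean.
sst = sum((y - mean_y) ** 2 for y in ys)

print(round(sse, 2), round(sst, 2))  # 2.74 22.75
```

The unrounded first total is about 2.738, which is why it shows up as 2.73 in one place and 2.74 in another.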
[310]
So let me be very clear
what this is.
[314]
So let me write these
numbers down.
[318]
I'll write it up here so we
can keep looking at this
[320]
actual graph.
[322]
So our squared error versus our
line, our total squared
[327]
error, we just computed
to be 2.74.
[331]
I rounded a little bit.
[332]
And what that is, is you take
each of these data points'
[335]
vertical distance to the line.
[337]
So this distance squared, plus
this distance squared, plus
[341]
this distance squared, plus
this distance squared.
[343]
That's all we just calculated
in Excel.
[346]
And that total squared variation
to the line is 2.74.
[352]
Or total squared error
with the line.
[354]
And then the other number we
figured out was the total
[357]
squared distance from the mean.
[358]
So the mean here is
y is equal to 1/4.
[360]
So that's going to be
right over here.
[368]
This is 1/2.
[369]
So right over here.
[374]
So this is our mean y value.
[379]
Or the central tendency
for our y values.
[382]
And so what we calculated next
was the total error, the
[386]
squared error, from the
mean of our y values.
[390]
That's what we calculated over
here in the spreadsheet.
[394]
You see in the formula.
[396]
It is this number, E2, minus
0.25, which is the mean of our
[401]
y's, all squared.
[403]
That's exactly what
we calculated.
[404]
We calculated for each
of the y values.
[406]
And then we summed
them all up.
[407]
It's 22.75.
[409]
It is equal to 22.75.
[418]
So this 2.74 is essentially
the error that the
[421]
line does not explain.
[423]
This 22.75 is the total error,
this is the total
[426]
variation of the numbers.
[427]
So if you wanted to know the
percentage of the total
[429]
variation that is not explained
by the line, you
[433]
could take this number divided
by this number.
[435]
So 2.74 over 22.75.
[443]
This tells us the percentage
of total variation not
[459]
explained by the line or
by the variation in x.
[470]
And so what is this number
going to be?
[473]
I can just use Excel for this.
[476]
So I'm just going to divide this
number divided by this
[482]
number right over there.
[484]
I get 0.12.
[487]
So this is equal to 0.12.
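Written out as a formula, the ratio just computed looks like this. The symbols are notation chosen for this note, not something written in the video: y_i^* is the value the line predicts for the i-th point, and \bar{y} is the mean of the y values.

```latex
\[
\frac{SE_{\text{line}}}{SE_{\bar{y}}}
  = \frac{\sum_{i} (y_i - y_i^{*})^2}{\sum_{i} (y_i - \bar{y})^2}
  \approx \frac{2.74}{22.75}
  \approx 0.12
\]
```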
[496]
Or another way to think about
it is 12% of the total
[499]
variation is not explained
by the variation in x.
[502]
The total squared distance
between each of the points or
[506]
their kind of spread, their
variation, is not explained by
[509]
the variation in x.
[510]
So if you want the amount that
is explained by the variation
[512]
in x, you just subtract
that from 1.
[514]
So let me write it
right over here.
[516]
So we have our r squared, which
is the percent of the
[520]
total variation that is
explained by x, is going to be
[522]
1 minus that 0.12 that
we just calculated.
[531]
Which is going to be 0.88.
[535]
So our r squared here is 0.88.
[538]
It's very, very close to 1.
[539]
The highest number
it can be is 1.
[542]
So what this tells us, or a way
to interpret this, is that
[545]
88% of the total variation of
these y values is explained by
[558]
the line or by the
variation in x.
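The whole calculation, 1 minus the unexplained fraction, fits in one small function. The sketch below is a by-hand version in Python rather than any built-in spreadsheet function; `r_squared` is a hypothetical helper name, and it assumes the slope and intercept are already known.

```python
def r_squared(xs, ys, slope, intercept):
    """1 minus (squared error with the line / squared distance from mean y)."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

r2 = r_squared([-2, -1, 1, 4], [-3, -1, 2, 3], 41 / 42, -5 / 21)
print(f"{r2:.2f}")  # 0.88
```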
[563]
And you can see that it looks
like a pretty good fit.
[566]
Each of these aren't too far.
[571]
Each of these points are
definitely much closer to the
[573]
line than they are
to the mean line.
[578]
In fact, all of them are closer
to our actual line than
[581]
to the mean.
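That last claim can be checked point by point. This short Python sketch compares each point's vertical distance to the regression line with its distance to the mean line y = 1/4; the names are illustrative only.

```python
xs = [-2, -1, 1, 4]
ys = [-3, -1, 2, 3]

mean_y = sum(ys) / len(ys)  # the horizontal "mean line"

def predict(x):
    # Regression line from the previous video.
    return 41 / 42 * x - 5 / 21

# True for a point when it sits closer to the regression line than to the mean.
closer_to_line = [
    abs(y - predict(x)) < abs(y - mean_y) for x, y in zip(xs, ys)
]

print(all(closer_to_line))  # True
```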