ANOVA 2: Calculating SSW and SSB (total sum of squares within and between) | Khan Academy - YouTube
Channel: Khan Academy
[0]
In the last video, we
were able to calculate
[2]
the total sum of squares
for these nine data points
[5]
right here.
[6]
And these nine data
points are grouped
[8]
into three different groups, or
if we want to speak generally,
[11]
into m different groups.
[12]
What I want to do
in this video is
[14]
to figure out how much of
this total sum of squares
[18]
is due to variation within
each group versus variation
[23]
between the actual groups.
[26]
So first, let's figure
out the total variation
[28]
within the group.
[30]
So let's call that the
sum of squares within.
[33]
So let's calculate the
sum of squares within.
[35]
I'll do that in yellow.
[36]
Actually, I already used
yellow, so let me do blue.
[40]
So the sum of squares within.
[46]
Let me make it clear.
[47]
That stands for within.
[51]
So we want to see how
much of the variation
[53]
is due to how far each
of these data points
[55]
are from their central tendency,
from their respective mean.
[59]
So this is going to
be equal to-- let's
[61]
start with these guys.
[62]
So instead of taking the
distance between each data
[65]
point and the mean
of means, I'm going
[67]
to find the distance
between each data
[69]
point and that group's
mean, because we
[71]
want the total sum of squared
distances between each data
[77]
point and its respective mean.
[79]
So let's do that.
[80]
So it's 3 minus-- the mean here
is 2-- squared, plus 2 minus 2
[87]
squared,
plus 1 minus 2 squared.
[92]
And then
plus-- I'm going
[95]
to do this for
all of the groups,
[96]
but for each group, the
distance between each data point
[98]
and its mean.
[99]
So plus 5 minus 4 squared,
[105]
plus-- the next point is
[110]
3-- plus 3 minus 4 squared,
plus 4 minus 4 squared.
[117]
And then finally, we
have the third group.
[120]
We're finding the
sum of squares
[122]
from each point to its
central tendency within its group,
[125]
and then we're going to
add them all up.
[127]
And then we find
the third group.
[128]
So we have 5 minus-- oh, its
mean is 6-- 5 minus 6 squared,
[135]
plus 6 minus 6 squared,
plus 7 minus 6 squared.
[140]
And what is this going to equal?
[142]
So this is going to
be equal to-- up here,
[145]
it's going to be
1 plus 0 plus 1.
[149]
So that's going to
be equal to 2 plus.
[152]
And then this is going to be
equal to 1 plus 1 plus 0--
[158]
so another 2--
plus this is going
[162]
to be equal to 1 plus 0 plus 1.
[164]
7 minus 6 is 1 squared is 1.
[166]
So plus.
[167]
So that's 2 over here.
[169]
So this is going to be
equal to our sum of squares
[173]
within, I should say, is 6.
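As a rough sketch of the calculation just narrated, the sum of squares within can be reproduced in a few lines of Python. The data values are the ones read off in the narration (3, 2, 1; 5, 3, 4; 5, 6, 7), and the names `groups`, `mean`, and `sum_of_squares_within` are made up for this illustration.

```python
# Data values as read off in the narration: three groups of three points.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sum_of_squares_within(groups):
    """Add up squared distances from each point to its own group's mean."""
    total = 0.0
    for group in groups:
        group_mean = mean(group)
        total += sum((x - group_mean) ** 2 for x in group)
    return total


ssw = sum_of_squares_within(groups)
print(ssw)  # 6.0
```

Each group contributes 2, which matches the 2 plus 2 plus 2 worked out by hand.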
[176]
So one way to think about it--
our total variation was 30.
[181]
And based on this
calculation, 6 of that 30
[186]
comes from a variation
within these samples.
[188]
Now, the next thing
I want to think about
[190]
is how many degrees of freedom
do we have in this calculation?
[195]
How many independent data
points do we actually have?
[199]
Well, for each of
these-- so over here,
[205]
we have n data points in each one.
[207]
In particular, n is 3 here.
[209]
But if you know n
minus 1 of them,
[211]
you can always figure
out the nth one
[214]
if you know the
actual sample mean.
[218]
So in this case, for
any of these groups,
[220]
if you know two of
these data points,
[221]
you can always
figure out the third.
[223]
If you know these two, you can
always figure out the third
[225]
if you know the sample mean.
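One way to see why the last point is not free: given the sample mean and all but one of the points, the missing point is forced. A minimal sketch, where `recover_last_point` is a hypothetical helper written only for this illustration:

```python
def recover_last_point(known_points, sample_mean, n):
    """Return the one missing value, given n - 1 values and the mean of all n.

    The mean times n is the total of all n points, so subtracting the
    known points leaves the missing one.
    """
    return n * sample_mean - sum(known_points)


# First group from the narration: points 3 and 2 are known, the mean is 2.
print(recover_last_point([3, 2], sample_mean=2, n=3))  # 1
```

The same call with `[5, 6]` and a mean of 6 gives back 7, the last point of the third group.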
[227]
So in general, let's figure out
the degrees of freedom here.
[231]
For each group,
when you did this,
[234]
you had n minus 1
degrees of freedom.
[237]
Remember, n is the
number of data points
[243]
you had in each group.
[244]
So you have n minus
1 degrees of freedom
[246]
for each of these groups.
[249]
So it's n minus 1, n
minus 1, n minus 1.
[252]
Or let me put it this
way-- you have n minus 1
[255]
for each of these groups,
and there are m groups.
[261]
So there's m times the quantity
n minus 1 degrees of freedom.
[269]
And in this case in particular,
each group-- n minus 1 is 2.
[272]
Or in each case, you had
2 degrees of freedom,
[275]
and there's three
groups of that.
[277]
So there are 6
degrees of freedom.
[285]
And in the future, we might
do a more detailed discussion
[289]
of what degrees of
freedom mean, and how
[292]
to mathematically
think about it.
[293]
But the best-- the simplest
way to think about it
[296]
is really, truly
independent data
[298]
points, assuming you
knew, in this case,
[300]
the central statistic
that we used
[301]
to calculate the squared
distance in each of them.
[303]
If you know them already,
the third data point
[305]
could actually be calculated
from the other two.
[308]
So we have 6 degrees
of freedom over here.
[310]
Now, that was how much
of the total variation
[316]
is due to variation
within each sample.
[318]
Now let's think about
how much of the variation
[320]
is due to variation
between the samples.
[325]
And to do that, we're
going to calculate.
[327]
Let me get a nice color here.
[329]
I think I've run
out of all the colors.
[330]
We'll call this sum
of squares between.
[334]
The B stands for between.
[341]
So another way to
think about it--
[342]
how much of this
total variation is
[344]
due to the variation
between the means,
[347]
between the central
tendency-- that's
[349]
what we're going to
calculate right now--
[351]
and how much is due to
variation from each data
[354]
point to its mean?
[356]
So let's figure out how
much is due to variation
[359]
between these guys over here.
[367]
Actually, let's think about
just this first group.
[369]
For this first group,
how much variation
[371]
for each of these guys is due to
the variation between this mean
[375]
and the mean of means?
[378]
Well, so for this
first guy up here--
[381]
I'll just write it
all out explicitly--
[383]
the variation is going
to be its sample mean.
[386]
So it's going to be 2 minus
the mean of means squared.
[390]
And then for this
guy, it's going
[392]
to be the same thing--
his sample mean,
[394]
2 minus the mean
of means squared.
[397]
Plus same thing for this guy, 2
minus the mean of means squared.
[401]
Or another way to
think about it--
[403]
this is equal to-- I'll
write it over here--
[408]
this is equal to 3
times 2 minus 4 squared,
[412]
which is the same thing as
[414]
3 times negative 2 squared, or 3 times 4.
[420]
Three times 4 is equal to 12.
[422]
And then we could do
it for each of them.
[423]
And actually, I want
to find the total sum.
[425]
So let me just write
it all out, actually.
[427]
I think that might
be an easier thing
[428]
to do, because I want to find,
for all of these guys combined,
[433]
the sum of squares
due to the differences
[436]
between the samples.
[438]
So that's from the contribution
from the first sample.
[441]
And then from the second sample,
you have this guy over here.
[445]
Oh, sorry.
[447]
You don't want to calculate him.
[448]
For this data point,
the amount of variation
[451]
due to the difference
between the means
[452]
is going to be 4
minus 4 squared.
[457]
Same thing for this guy.
[458]
It's going to be
4 minus 4 squared.
[461]
And we're not taking
it into consideration.
[462]
We're only taking its sample
mean into consideration.
[466]
And then finally, plus
4 minus 4 squared.
[469]
We're taking this
minus this squared
[471]
for each of these data points.
[473]
And then finally, we'll do
that with the last group.
[476]
With the last group,
sample mean is 6.
[478]
So it's going to be
6 minus 4 squared,
[481]
plus 6 minus 4 squared, plus
6 minus 4
[489]
squared.
[490]
Now, let's think about how
many degrees of freedom
[495]
we had in this calculation
right over here.
[500]
Well, in general, I
guess the easiest way
[503]
to think about it is, how
much information did we
[506]
have, assuming that we
knew the mean of means?
[508]
If we know the mean
of means, how much
[510]
here is new information?
[512]
Well, if you know
the mean of means,
[515]
and you know two of
these sample means,
[517]
you can always
figure out the third.
[518]
If you know this
one and this one,
[519]
you can figure out that one.
[520]
And if you know that
one and that one,
[521]
you can figure out that one.
[523]
And that's because this is the
mean of these means over here.
[526]
So in general, if you have m
groups, or if you have m means,
[531]
there are m minus 1
degrees of freedom here.
[538]
Let me write that.
[546]
And in this
case, m is 3.
[548]
So we could say there's
two degrees of freedom
[553]
for this exact example.
[555]
Let's actually calculate
the sum of squares between.
[557]
So what is this going to be?
[560]
I'll just scroll down.
[561]
Running out of space.
[562]
This is going to be equal to--
this right here is 2 minus 4
[567]
is negative 2 squared is 4.
[569]
And then we have
three 4's over here.
[571]
So it's 3 times 4, plus
3 times-- what is this?
[577]
3 times 0 plus-- what is this?
[583]
The difference between
each of these-- 6 minus 4
[585]
is 2 squared is 4-- so
that means we have another
[587]
3 times 4.
[591]
And we get 3 times 4 is 12,
plus 0, plus 12 is equal to 24.
[599]
So the sum of
squares, or we could
[601]
say, the variation due
to what's the difference
[604]
between the groups,
between the means, is 24.
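The between-group calculation can be sketched the same way. The data values are again the ones read off in the narration, and `between_contributions` is a made-up helper name. Note the assumption that every group has the same size, which is what makes the overall mean equal to the mean of means used in the narration.

```python
# Data values as read off in the narration: three groups of three points.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def between_contributions(groups):
    """Per-group term: group size times (group mean - grand mean) squared."""
    all_points = [x for group in groups for x in group]
    # With equal group sizes, this grand mean equals the mean of means (4).
    grand_mean = mean(all_points)
    return [len(g) * (mean(g) - grand_mean) ** 2 for g in groups]


contributions = between_contributions(groups)
ssb = sum(contributions)
print(contributions)  # [12.0, 0.0, 12.0]
print(ssb)            # 24.0
```

The three entries correspond to 3 times 4, 3 times 0, and 3 times 4 from the hand calculation.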
[609]
Now, let's put it all together.
[610]
We said that the
total variation,
[614]
that if you looked at
all 9 data points, is 30.
[617]
Let me write that over here.
[619]
So the total sum of
squares is equal to 30.
[625]
We figured out
the sum of squares
[627]
between each data point
and its central tendency,
[631]
its sample mean--
we figured out,
[633]
and when you totaled
it all up, we got 6.
[636]
So the sum of squares
within was equal to 6.
[643]
And in this case, it was
6 degrees of freedom.
[648]
Or if we wanted to
write it generally,
[650]
there were m times the quantity n
minus 1 degrees of freedom.
[654]
And actually, for the
total, we figured out
[656]
we have mn minus
1 degrees of freedom.
[662]
Actually, let me just
write degrees of freedom
[664]
in this column right over here.
[666]
In this case, the number
turned out to be 8.
[669]
And then just now we
calculated the sum
[671]
of squares between the samples.
[674]
The sum of squares between
the samples is equal to 24.
[678]
And we figured out that it had
m minus 1 degrees of freedom,
[682]
which ended up being 2.
[684]
Now, the interesting
thing here-- and this
[687]
is why this analysis of variance
all fits nicely together,
[691]
and in future videos we'll
think about how we can actually
[694]
test hypotheses using
some of the tools
[696]
that we're thinking
about right now--
[698]
is that the sum
of squares within,
[700]
plus the sum of
squares between, is
[702]
equal to the total
sum of squares.
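A quick numerical check of this identity, as a sketch under the same assumptions as before (data values read off in the narration, equal group sizes, illustrative variable names):

```python
# Data values as read off in the narration: three groups of three points.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


all_points = [x for g in groups for x in g]
grand_mean = mean(all_points)

# Total: every point against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_points)
# Within: every point against its own group's mean.
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
# Between: every group's mean against the grand mean, once per point.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

print(sst, ssw, ssb)  # 30.0 6.0 24.0
assert ssw + ssb == sst
```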
[705]
So a way to think about it is
that the total variation
[709]
in this data right
here can be described
[712]
as the sum of the
variation within each
[715]
of these groups, when
you take that total,
[718]
plus the sum of the
variation between the groups.
[723]
And even the degrees
of freedom work out.
[726]
The sum of squares between
had 2 degrees of freedom.
[729]
The sum of squares
within each of the groups
[731]
had 6 degrees of freedom.
[732]
2 plus 6 is 8.
[734]
That's the total
degrees of freedom
[736]
we had for all of
the data combined.
[739]
It even works if you
look at the more general case.
[742]
So our sum of
squares between had
[744]
m minus 1 degrees of freedom.
[747]
Our sum of squares
within had m times the quantity n
[751]
minus 1 degrees of freedom.
[753]
So the sum is equal to m
minus 1, plus mn minus m.
[758]
The m and the minus m cancel out.
[759]
This is equal to mn minus
1 degrees of freedom, which
[764]
is exactly the total
degrees of freedom we
[766]
had for the total
sum of squares.
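The degrees-of-freedom bookkeeping can be checked the same way. This is only a sketch: the function names are invented for the illustration, and it assumes m groups with n points in each, as in the narration.

```python
def df_between(m):
    """Degrees of freedom between groups: m - 1."""
    return m - 1


def df_within(m, n):
    """Degrees of freedom within groups: m * (n - 1)."""
    return m * (n - 1)


def df_total(m, n):
    """Degrees of freedom for all m * n points together: m * n - 1."""
    return m * n - 1


m, n = 3, 3
print(df_between(m), df_within(m, n), df_total(m, n))  # 2 6 8

# (m - 1) + (mn - m) = mn - 1, so the pieces add up for any m and n.
for groups_count in range(2, 6):
    for points_per_group in range(2, 6):
        assert (df_between(groups_count)
                + df_within(groups_count, points_per_group)
                == df_total(groups_count, points_per_group))
```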
[769]
So the whole point
of the calculations
[771]
that we did in the last
video and in this video
[773]
is just to appreciate that
this total variation over here,
[778]
this total variation
that we first calculated,
[780]
can be viewed as the sum
of these two component
[783]
variations-- how much
variation is there
[790]
within each of the samples plus
how much variation is there
[794]
between the means
of the samples?
[797]
Hopefully that's
not too confusing.