ANOVA 2: Calculating SSW and SSB (total sum of squares within and between) | Khan Academy - YouTube

Channel: Khan Academy

[0]
In the last video, we were able to calculate
[2]
the total sum of squares for these nine data points
[5]
right here.
[6]
And these nine data points are grouped
[8]
into three different groups, or if we want to speak generally,
[11]
into m different groups.
[12]
What I want to do in this video is
[14]
to figure out how much of this total sum of squares
[18]
is due to variation within each group versus variation
[23]
between the actual groups.
[26]
So first, let's figure out the total variation
[28]
within the group.
[30]
So let's call that the sum of squares within.
[33]
So let's calculate the sum of squares within.
[35]
I'll do that in yellow.
[36]
Actually, I already used yellow, so let me do blue.
[40]
So the sum of squares within.
[46]
Let me make it clear.
[47]
That stands for within.
[51]
So we want to see how much of the variation
[53]
is due to how far each of these data points
[55]
are from their central tendency, from their respective mean.
[59]
So this is going to be equal to-- let's
[61]
start with these guys.
[62]
So instead of taking the distance between each data
[65]
point and the mean of means, I'm going
[67]
to find the distance between each data
[69]
point and that group's mean, because we
[71]
want the total sum of squared distances between each data
[77]
point and its respective mean.
[79]
So let's do that.
[80]
So it's 3 minus 2-- the mean here is 2-- squared, plus 2 minus 2
[87]
squared, plus 1 minus 2 squared.
[92]
Plus-- I'm going
[95]
to do this for all of the groups,
[96]
but for each group, the distance between each data point
[98]
and its mean.
[99]
So plus 5 minus 4 squared,
[105]
plus 4 minus 4 squared-- sorry, the next point was
[110]
3-- plus 3 minus 4 squared, plus 4 minus 4 squared.
[117]
And then finally, we have the third group.
[120]
We're finding the sum of squares
[122]
from each point to its central tendency within its own group,
[125]
but we're going to add them all up.
[127]
So now, the third group.
[128]
So we have 5 minus-- oh, its mean is 6-- 5 minus 6 squared,
[135]
plus 6 minus 6 squared, plus 7 minus 6 squared.
[140]
And what is this going to equal?
[142]
So this is going to be equal to-- up here,
[145]
it's going to be 1 plus 0 plus 1.
[149]
So that's going to be equal to 2 plus.
[152]
And then this is going to be equal to 1, 1 plus 1 plus 0--
[158]
so another 2-- plus this is going
[162]
to be equal to 1 plus 0 plus 1.
[164]
7 minus 6 is 1 squared is 1.
[166]
So plus.
[167]
So that's 2 over here.
[169]
So this is going to be equal to our sum of squares
[173]
within, I should say, is 6.
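
A minimal Python sketch of the calculation just described. The data values are read off the spoken arithmetic above, and the function name is made up for illustration, so treat this as a check on the numbers rather than anything shown in the video.

```python
# The nine data points, grouped as in the spoken arithmetic above.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def sum_of_squares_within(groups):
    """Add up squared distances from each point to its own group's mean."""
    total = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        total += sum((x - group_mean) ** 2 for x in group)
    return total

ssw = sum_of_squares_within(groups)
print(ssw)  # 6.0
```

Each group contributes 2, which matches the 2 plus 2 plus 2 worked out by hand.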
[176]
So one way to think about it-- our total variation was 30.
[181]
And based on this calculation, 6 of that 30
[186]
comes from a variation within these samples.
[188]
Now, the next thing I want to think about
[190]
is how many degrees of freedom do we have in this calculation?
[195]
How many independent data points do we actually have?
[199]
Well, for each of these-- so over here,
[205]
we have n data points in each one.
[207]
In particular, n is 3 here.
[209]
But if you know n minus 1 of them,
[211]
you can always figure out the nth one
[214]
if you know the actual sample mean.
[218]
So in this case, for any of these groups,
[220]
if you know two of these data points,
[221]
you can always figure out the third.
[223]
If you know these two, you can always figure out the third
[225]
if you know the sample mean.
[227]
So in general, let's figure out the degrees of freedom here.
[231]
For each group, when you did this,
[234]
you had n minus 1 degrees of freedom.
[237]
Remember, n is the number of data points
[243]
you had in each group.
[244]
So you have n minus 1 degrees of freedom
[246]
for each of these groups.
[249]
So it's n minus 1, n minus 1, n minus 1.
[252]
Or let me put it this way-- you have n minus 1
[255]
for each of these groups, and there are m groups.
[261]
So there are m times (n minus 1) degrees of freedom.
[269]
And in this case in particular, each group-- n minus 1 is 2.
[272]
Or in each case, you had 2 degrees of freedom,
[275]
and there's three groups of that.
[277]
So there are 6 degrees of freedom.
[285]
And in the future, we might do a more detailed discussion
[289]
of what degrees of freedom mean, and how
[292]
to mathematically think about it.
[293]
But the best-- the simplest way to think about it
[296]
is really, truly independent data
[298]
points, assuming you knew, in this case,
[300]
the central statistic that we used
[301]
to calculate the squared distance in each of them.
[303]
If you know them already, the third data point
[305]
could actually be calculated from the other two.
[308]
So we have 6 degrees of freedom over here.
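
One way to see the "you can always figure out the third" idea is a small Python sketch. The helper name is hypothetical, and the numbers are the first group's values from above.

```python
def missing_point(known_points, sample_mean, n):
    """Recover the one unknown point from the sample mean and the other n - 1 points."""
    # The n points must add up to n * sample_mean, so the last one is whatever is left over.
    return n * sample_mean - sum(known_points)

# First group: we know 3 and 2, and we know the sample mean is 2.
recovered = missing_point([3, 2], sample_mean=2, n=3)

# m groups, each with n - 1 free data points.
m, n = 3, 3
df_within = m * (n - 1)

print(recovered, df_within)  # 1 6
```

Only two points per group are free once the group mean is fixed, which is where the n minus 1 per group comes from.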
[310]
Now, that was how much of the total variation
[316]
is due to variation within each sample.
[318]
Now let's think about how much of the variation
[320]
is due to variation between the samples.
[325]
And to do that, we're going to calculate.
[327]
Let me get a nice color here.
[329]
I think I've run out of all the colors.
[330]
We'll call this sum of squares between.
[334]
The B stands for between.
[341]
So another way to think about it--
[342]
how much of this total variation is
[344]
due to the variation between the means,
[347]
between the central tendency-- that's
[349]
what we're going to calculate right now--
[351]
and how much is due to variation from each data
[354]
point to its mean?
[356]
So let's figure out how much is due to variation
[359]
between these guys over here.
[367]
Actually, let's think about just this first group.
[369]
For this first group, how much variation
[371]
for each of these guys is due to the variation between this mean
[375]
and the mean of means?
[378]
Well, so for this first guy up here--
[381]
I'll just write it all out explicitly--
[383]
the variation is going to be its sample mean.
[386]
So it's going to be 2 minus the mean of means squared.
[390]
And then for this guy, it's going
[392]
to be the same thing-- his sample mean,
[394]
2 minus the mean of means, squared.
[397]
Plus same thing for this guy, 2 minus the mean of means, squared.
[401]
Or another way to think about it--
[403]
this is equal to-- I'll write it over here--
[408]
this is equal to 3 times (2 minus 4) squared,
[412]
which is the same thing as 3 times (negative 2) squared.
[414]
This is equal to 3 times 4.
[420]
Three times 4 is equal to 12.
[422]
And then we could do it for each of them.
[423]
And actually, I want to find the total sum.
[425]
So let me just write it all out, actually.
[427]
I think that might be an easier thing
[428]
to do, because I want to find, for all of these guys combined,
[433]
the sum of squares due to the differences
[436]
between the samples.
[438]
So that's the contribution from the first sample.
[441]
And then from the second sample, you have this guy over here.
[445]
Oh, sorry.
[447]
You don't want to calculate him.
[448]
For this data point, the amount of variation
[451]
due to the difference between the means
[452]
is going to be 4 minus 4 squared.
[457]
Same thing for this guy.
[458]
It's going to be 4 minus 4 squared.
[461]
And we're not taking it into consideration.
[462]
We're only taking its sample mean into consideration.
[466]
And then finally, plus 4 minus 4 squared.
[469]
We're taking this minus this squared
[471]
for each of these data points.
[473]
And then finally, we'll do that with the last group.
[476]
With the last group, sample mean is 6.
[478]
So it's going to be 6 minus 4 squared,
[481]
plus 6 minus 4 squared, plus 6 minus 4
[489]
squared.
[490]
Now, let's think about how many degrees of freedom
[495]
we had in this calculation right over here.
[500]
Well, in general, I guess the easiest way
[503]
to think about it is, how much information did we
[506]
have, assuming that we knew the mean of means?
[508]
If we know the mean of means, how much
[510]
here is new information?
[512]
Well, if you know the mean of the means,
[515]
and you know two of these sample means,
[517]
you can always figure out the third.
[518]
If you know this one and this one,
[519]
you can figure out that one.
[520]
And if you know that one and that one,
[521]
you can figure out that one.
[523]
And that's because this is the mean of these means over here.
[526]
So in general, if you have m groups, or if you have m means,
[531]
there are m minus 1 degrees of freedom here.
[538]
Let me write that.
[546]
And in this case, m is 3.
[548]
So we could say there's two degrees of freedom
[553]
for this exact example.
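
The same argument can be sketched in Python for the group means. The helper name is hypothetical, and the sketch assumes every group has the same number of points, so that the mean of means is just the average of the group means.

```python
def missing_group_mean(known_means, grand_mean, m):
    """Recover the one unknown group mean from the mean of means and the other m - 1 means."""
    # With equal-sized groups, the m group means must add up to m * grand_mean.
    return m * grand_mean - sum(known_means)

# We know two sample means (2 and 4) and the mean of means (4).
third_mean = missing_group_mean([2, 4], grand_mean=4, m=3)

df_between = 3 - 1  # m - 1, with m = 3 groups

print(third_mean, df_between)  # 6 2
```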
[555]
Now let's actually calculate the sum of squares between.
[557]
So what is this going to be?
[560]
I'll just scroll down.
[561]
Running out of space.
[562]
This is going to be equal to-- this right here is 2 minus 4
[567]
is negative 2 squared is 4.
[569]
And then we have three 4's over here.
[571]
So it's 3 times 4, plus 3 times-- what is this?
[577]
3 times 0 plus-- what is this?
[583]
The difference between each of these-- 6 minus 4
[585]
is 2 squared is 4-- so that means we have another 3 times
[587]
4.
[591]
And we get 3 times 4 is 12, plus 0, plus 12 is equal to 24.
[599]
So the sum of squares, or we could
[601]
say, the variation due to the difference
[604]
between the groups, between the means, is 24.
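
Here is a minimal Python sketch of that between-groups calculation. The data values are read off the spoken arithmetic earlier in the video, and the function name is made up for illustration.

```python
# The nine data points, grouped as in the spoken arithmetic above.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]

def sum_of_squares_between(groups):
    """For every data point, square the distance from its group mean to the mean of means."""
    all_points = [x for group in groups for x in group]
    grand_mean = sum(all_points) / len(all_points)
    total = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # Every point in the group contributes the same term, so multiply by the group size.
        total += len(group) * (group_mean - grand_mean) ** 2
    return total

ssb = sum_of_squares_between(groups)
print(ssb)  # 24.0
```

The three groups contribute 12, 0, and 12, the same 3 times 4, plus 3 times 0, plus 3 times 4 worked out by hand.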
[609]
Now, let's put it all together.
[610]
We said that the total variation,
[614]
that if you looked at all 9 data points, is 30.
[617]
Let me write that over here.
[619]
So the total sum of squares is equal to 30.
[625]
We figured out the sum of squares
[627]
between each data point and its central tendency,
[631]
its sample mean-- we figured out,
[633]
and when you totaled it all up, we got 6.
[636]
So the sum of squares within was equal to 6.
[643]
And in this case, it was 6 degrees of freedom.
[648]
Or if we wanted to write it generally,
[650]
there were m times (n minus 1) degrees of freedom.
[654]
And actually, for the total, we figured out
[656]
we have mn minus 1 degrees of freedom.
[662]
Actually, let me just write degrees of freedom
[664]
in this column right over here.
[666]
In this case, the number turned out to be 8.
[669]
And then just now we calculated the sum
[671]
of squares between the samples.
[674]
The sum of squares between the samples is equal to 24.
[678]
And we figured out that it had m minus 1 degrees of freedom,
[682]
which ended up being 2.
[684]
Now, the interesting thing here-- and this
[687]
is why this analysis of variance all fits nicely together,
[691]
and in future videos we'll think about how we can actually
[694]
test hypotheses using some of the tools
[696]
that we're thinking about right now--
[698]
is that the sum of squares within,
[700]
plus the sum of squares between, is
[702]
equal to the total sum of squares.
[705]
So a way to think about it is that the total variation
[709]
in this data right here can be described
[712]
as the sum of the variation within each
[715]
of these groups, when you take that total,
[718]
plus the sum of the variation between the groups.
[723]
And even the degrees of freedom work out.
[726]
The sum of squares between had 2 degrees of freedom.
[729]
The sum of squares within each of the groups
[731]
had 6 degrees of freedom.
[732]
2 plus 6 is 8.
[734]
That's the total degrees of freedom
[736]
we had for all of the data combined.
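
As a quick check, this Python sketch recomputes all three sums of squares and their degrees of freedom in one place. The data values are read off the spoken arithmetic, and the variable names are just for illustration.

```python
# The nine data points, grouped as in the spoken arithmetic above.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
all_points = [x for g in groups for x in g]
grand_mean = sum(all_points) / len(all_points)      # the mean of means
group_means = [sum(g) / len(g) for g in groups]     # one sample mean per group

# Total: every point against the mean of means.
sst = sum((x - grand_mean) ** 2 for x in all_points)
# Within: every point against its own group's mean.
ssw = sum((x - mean) ** 2 for g, mean in zip(groups, group_means) for x in g)
# Between: every group's mean against the mean of means, once per point in the group.
ssb = sum(len(g) * (mean - grand_mean) ** 2 for g, mean in zip(groups, group_means))

m, n = len(groups), len(groups[0])
df_total, df_within, df_between = m * n - 1, m * (n - 1), m - 1

print(sst, ssw, ssb)                      # 30.0 6.0 24.0
print(df_total, df_within, df_between)    # 8 6 2
```

Both the sums of squares and the degrees of freedom add up: 6 plus 24 is 30, and 6 plus 2 is 8.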
[739]
It even works if you look at the more general case.
[742]
So our sum of squares between had
[744]
m minus 1 degrees of freedom.
[747]
Our sum of squares within had m times (n
[751]
minus 1) degrees of freedom.
[753]
So this is equal to m minus 1, plus mn minus m.
[758]
The m and the minus m cancel out.
[759]
This is equal to mn minus 1 degrees of freedom, which
[764]
is exactly the total degrees of freedom we
[766]
had for the total sum of squares.
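
Written out as an equation, the general degrees-of-freedom bookkeeping looks like this, with m groups of n data points each as in the video:

```latex
\underbrace{(m - 1)}_{\text{between}} + \underbrace{m(n - 1)}_{\text{within}}
  = m - 1 + mn - m
  = \underbrace{mn - 1}_{\text{total}}
```

With m = 3 and n = 3, this is 2 + 6 = 8.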
[769]
So the whole point of the calculations
[771]
that we did in the last video and in this video
[773]
is just to appreciate that this total variation over here,
[778]
this total variation that we first calculated,
[780]
can be viewed as the sum of these two component
[783]
variations-- how much variation is there
[790]
within each of the samples plus how much variation is there
[794]
between the means of the samples?
[797]
Hopefully that's not too confusing.