ANOVA Part III: F Statistic and P Value | Statistics Tutorial #27 | MarinStatsLectures - YouTube

Channel: MarinStatsLectures-R Programming & Statistics

[0]
Let's build up the test statistic for one-way analysis of variance. Recall
[5]
we were working with this example comparing weight loss on one of four
[9]
diets: A, B, C, or D. We can see the observations here as well as the summary
[15]
statistics, the mean weight loss and standard deviation of weight loss for
[18]
each of the four diets. We're working with the null hypothesis that all means are
[23]
equal, and the alternative that at least one differs from the rest. We previously
[28]
talked about how we can take the total variability in weight loss, or the total
[32]
sum of squares, and separate it into two parts: that which is explained by the
[36]
diet and that which is not explained by the diet. So let's look at how we can use
[42]
that to build up the test statistic. First, a quick note on notation; again, we
[45]
want to focus on the concepts, not on plugging into formulas, but this
[51]
helps us understand the formula and what's written in the notation. i is
[57]
used to index the group: group one, two, three, or four; k signifies the
[65]
number of groups; j is used to represent the observation number within a group;
[71]
Yij is the individual observation in group i, observation number j. So, for
[78]
example, Y1,3 is the observed value for group 1, person number 3
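As a small sketch of this indexing, here it is in Python (rather than the R used on this channel), with made-up weight-loss values, since the raw observations aren't listed in this transcript:

```python
# Hypothetical weight-loss data for four diets (values are made up
# for illustration; the study's actual observations aren't shown here).
# groups[i-1][j-1] corresponds to Y_ij: group i, observation j.
groups = [
    [3.8, 6.0, 0.7, 2.9],   # diet A (group i = 1)
    [2.1, 7.8, 1.5, 5.5],   # diet B (group i = 2)
    [4.3, 0.2, 3.7, 6.1],   # diet C (group i = 3)
    [5.0, 5.4, 0.9, 3.3],   # diet D (group i = 4)
]

k = len(groups)                              # number of groups
n_i = [len(g) for g in groups]               # sample size in group i
ybar_i = [sum(g) / len(g) for g in groups]   # group means, Yi-bar
all_obs = [y for g in groups for y in g]
ybar = sum(all_obs) / len(all_obs)           # overall (grand) mean, Y-bar

print(groups[0][2])   # Y_13: group 1, observation 3
print(ybar_i, ybar)
```

Any of these quantities then plugs directly into the sums-of-squares formulas that follow.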
[86]
Yi-bar is the mean for group i, Y-bar with no subscript is the overall or grand
[94]
mean: the mean weight loss for everyone in the study; Si is the standard
[99]
deviation for people in group i, and ni is the sample size for people in
[104]
group i. So we saw that we can take the total variability and separate it into two
[112]
parts: that which is explained by diet, which we signified as the variance
[119]
between diets, or sometimes called the mean square between; that was the sum
[126]
of squares between divided by its degrees of freedom,
[133]
the degrees of freedom between groups. Looking at the formula, and again we don't want to
[138]
get stuck on this but this helps us see the concepts we just learned from a
[141]
slightly different angle. You can think of this as: we're going to sum over all the
[146]
groups, group one, two, three, four, taking the sample size in each group times
[152]
how far the group-specific mean is from the overall mean, squared, divided by its
[158]
degrees of freedom. If we work that out for this example, we'd find that the sum of
[165]
squares between groups, or the explained sum of squares, is 97.3,
[169]
with degrees of freedom 3, right, four groups minus one, and that's going
[174]
to come out to 32.4. We also saw we can think of the unexplained,
[179]
and again, this is the variability that's not explained by diet, or not explained
[185]
by X; this is the variability going on within a group, or the mean square within.
[193]
Again, this is the sum of squares within groups divided by the degrees of
[200]
freedom within, and formulaically we can think of summing over all observations:
[211]
how far is each individual from their group-specific mean, squared, divided by
[219]
its degrees of freedom, n minus k. Right, again we have n observations and
[223]
we lose k degrees of freedom by estimating the k group means. We can also
[228]
express this as summing over the groups:
[235]
each group's sample size minus one, times the sample variance of
[242]
each group, divided by its degrees of freedom. The reason
[247]
why I write it this way is that you can take a moment yourself to note that this
[252]
is the exact formula for the pooled variance that we talked about in the
[256]
two-sample t-test assuming equal variance in the two groups: we're taking the sample
[263]
variance of each group, weighted by their degrees of freedom. OK, so you can take
[268]
a moment yourself to work your way through and convince yourself that this
[273]
within-group variance is exactly the same as the pooled variance in the two-sample
[278]
t-test assuming equal variance. If you work this out for this example, you're going
[283]
to find the sum of squares within groups is 297, its degrees of freedom 56,
[289]
and this comes out to be 5.3. So, as noted, we want to compare these two to each
[295]
other: the mean square between groups to the mean square within groups, or
[300]
the average sum of squares that can be explained by diet to the average sum of
[304]
squares that cannot be explained by diet. So let's try and think our way through
[308]
some of this. First, suppose the alternative hypothesis is true: at
[317]
least one mean differs, not all the means are the same. How would we expect,
[326]
statistically, the mean square between groups to compare to the
[333]
mean square within groups? If diets are different, we'd expect the between to be
[338]
larger than the within: there should be much more variability that's explained
[342]
by diet than not explained by diet. If we take a ratio of these, this is going
[349]
to be what we call our F statistic, or our test statistic:
[353]
it's the mean square between groups over the mean square within groups. If we
[359]
expect the top to be larger than the bottom, we expect this test statistic to
[363]
be larger than 1. If, on the other hand, our null hypothesis is
[370]
true, if all the means are equal at the population level, what would we expect to
[378]
see? We'd expect the mean square between, okay, the variability that's explained by
[385]
diet, to be roughly the same as the mean square within, or the variability that's
[391]
not explained by diet. When looking at an F statistic, you're taking the ratio of
[399]
these two; we'd expect that to come out to be roughly 1. If we do this for our
[408]
set of data, our F statistic is 32.4 over 5.3, and that's going to come out to be
[418]
6.1. OK, so the larger our F statistic gets, the more evidence we have that the
[425]
alternative is likely true, or the null is false. Well, we don't want to get too
[430]
caught up on looking things up in tables; it's important to note that this F
[436]
statistic follows what's called an F distribution. It has degrees of
[445]
freedom for the numerator and degrees of freedom for the denominator:
[451]
the degrees of freedom for the numerator are k minus 1, right, the degrees of
[457]
freedom of what's in the numerator, and the degrees of freedom for the
[461]
denominator are n minus k. OK, so again, a piece of software can do all these
[469]
calculations for you; we don't want to focus on an F table and looking up an
[473]
exact p-value from that table, so let's just jump to the interpretation
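As one illustration of letting software do the lookup, here's a Python sketch that approximates P(F ≥ 6.1) for an F distribution with 3 and 56 degrees of freedom by simulation; the video itself would use R or an F table, and the simulation here is only to make the tail-probability idea concrete:

```python
import random

random.seed(0)

df_between, df_within = 3, 56   # k - 1 = 3 and n - k = 56 from the example
f_observed = 6.1                # the observed F statistic

# Under the null, the F statistic follows an F(3, 56) distribution:
# a ratio of two independent chi-square variables, each divided by its
# degrees of freedom. A chi-square with d degrees of freedom can be
# drawn as Gamma(shape=d/2, scale=2).
def sample_f(d1, d2):
    num = random.gammavariate(d1 / 2, 2) / d1
    den = random.gammavariate(d2 / 2, 2) / d2
    return num / den

n_sims = 200_000
exceed = sum(sample_f(df_between, df_within) >= f_observed
             for _ in range(n_sims))
p_value = exceed / n_sims
print(p_value)   # approximates the exact tail area of about 0.0011
```

Any stats package gives the exact tail area directly; the simulation just shows what "probability of an F this large or larger under the null" means.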
[478]
If we were going to work this out, looking at a table or using a piece of
[482]
software, the p-value is going to tell us, like it always does, the
[487]
probability of our observed test statistic, or one even more extreme, if the
[492]
null is true. So: what's the probability of getting an
[498]
F stat greater than or equal to 6.1? If our null is true, we'd
[508]
expect our test statistic to be
[513]
roughly 1, so what's the chance of seeing an estimate of 6.1 or more?
[518]
You'll find that this comes out to 0.0011
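For the record, the arithmetic chain quoted above can be checked in a few lines (Python here, using the sums of squares from the example):

```python
# Sums of squares and degrees of freedom quoted in the example
ss_between, df_between = 97.3, 3    # explained by diet; k - 1 = 4 - 1
ss_within, df_within = 297.0, 56    # unexplained; n - k = 60 - 4

ms_between = ss_between / df_between  # mean square between
ms_within = ss_within / df_within     # mean square within
f_stat = ms_between / ms_within       # the F statistic

print(round(ms_between, 1), round(ms_within, 1), round(f_stat, 1))
# prints 32.4 5.3 6.1, matching the values in the video
```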
[522]
OK, roughly 0.1 percent. So again, if our null is true, if all these diets are the
[527]
same, an F stat like this, or differences like the ones we saw or even
[532]
larger, will only happen about 0.1 percent of the time. That gives us
[537]
evidence to reject our null hypothesis: we have evidence to believe the
[541]
alternative is likely true, we have evidence to believe at least one diet
[545]
differs from the rest. So now we need to decide which diets might differ from the
[550]
others, and to do that we're going to compare all possible pairwise means;
[553]
that's a topic we're going to get to in a moment
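To tie the whole decomposition together before moving on, here's a Python sketch on made-up data (the study's actual observations aren't reproduced in this transcript) showing that the within-group sum of squares equals the pooled sum of (ni − 1) times si squared, and that SS between plus SS within equals the total sum of squares:

```python
import statistics

# Hypothetical weight-loss observations for four diets (made-up values)
groups = [
    [3.8, 6.0, 0.7, 2.9],
    [2.1, 7.8, 1.5, 5.5],
    [4.3, 0.2, 3.7, 6.1],
    [5.0, 5.4, 0.9, 3.3],
]

k = len(groups)
n = sum(len(g) for g in groups)
grand_mean = sum(y for g in groups for y in g) / n
means = [sum(g) / len(g) for g in groups]

# SS between: sum over groups of n_i * (group mean - grand mean)^2
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))

# SS within: sum over observations of (y - group mean)^2 ...
ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

# ... which equals the pooled form: sum of (n_i - 1) * s_i^2,
# where statistics.variance is the sample variance (n - 1 denominator)
pooled = sum((len(g) - 1) * statistics.variance(g) for g in groups)

# Total SS decomposes as between + within
ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(ss_between + ss_within, ss_total, abs(ss_within - pooled) < 1e-9)
```

The same decomposition, run on the study's real data, produces the 97.3, 297, and 6.1 worked through above.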
[558]
Thanks! For more videos, please subscribe to marinstatslectures