ANOVA: Crash Course Statistics #33 - YouTube
Channel: CrashCourse
Hi, I'm Adriene Hill, and welcome back to Crash Course Statistics.
In many of our episodes we've looked at t-tests, which, among other things, are good for testing the difference between two groups.
Like people with or without cats.
Families below the poverty line...and families above it.
Petri dishes of cells that are treated with a chemical and those that aren't.
But the world isn't always so binary.
We often want to compare measurements of MORE than two groups.
Things like ethnicity, medical diagnosis, country of origin, or job title.
So today, we're going to apply the General Linear Model framework we learned in the last episode to test the difference between multiple groups, using a new model called the ANOVA.

INTRO
The GLM framework takes all the information that our data contain and partitions it into two piles: information that can be explained by a model that represents the way we think things work, and error, which is the amount of information that our model fails to explain.
So let's apply that to a new model: the ANOVA.

ANOVA is an acronym for ANalysis Of VAriance.
It's actually very similar to regression, except we're using a categorical variable to predict a continuous one.
Like using a soccer player's position to predict the number of yards he runs in a game.
Or using highest completed degree to predict a person's salary.
Note that this alone isn't evidence that getting a degree causes a higher salary, just that knowing someone's degree might help estimate how much they get paid.
Like regression, the ANOVA builds a model of how the world works.
For example, my model for how many bunnies I'll see on my walk into work might be that if it's raining I'll see 1 bunny, and if it's sunny, I'll see 5.
I walk through a bunny preserve...
1 and 5 are my predictions for how many bunnies I'll see, based on whether or not it's raining.
Yesterday it rained.
And I saw two bunnies!
My model predicted 1, and my error is 1.
And we can represent this model as a sort of regression where there are ONLY two possible values that the variable Weather can have: 0 if it rains, or 1 if it doesn't.
In this case, the expected number of bunnies on a rainy day is 1, and beta is the difference between the two means: 5 - 1 = 4.
Which means our ANOVA model looks like this:
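In GLM form, the model just described (with Weather coded 0 for rain and 1 for sun, an intercept of 1 bunny, and a beta of 4) can be sketched as:

```latex
\hat{Y} = \beta_0 + \beta_1 X = 1 + 4X
```

On a rainy day (X = 0) it predicts 1 bunny; on a sunny day (X = 1) it predicts 1 + 4 = 5.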
In a regression we did a statistical test of the slope, and that's what this simple ANOVA is doing too.
Since we assigned rainy days to be coded as 0, and sunny days as 1, the change in the X-direction is just one (1 - 0).
So the slope of this line is the difference between the mean bunny count on sunny days, five, minus the mean bunny count on rainy days, one.
This difference of 4 is the change in the Y-direction.
We test this difference in the same way that we tested the regression slope.
And this slope tells us the difference between the means of the two groups.
Usually we'd like to think of this slope as the difference between two group means.
But knowing that our model treats it like a slope helps us understand how ANOVAs relate to regression.
In a regression the slope tells you how much an increase of one unit in X affects Y.
Like, for example, how much an increase of 1 year increases shoe size in kids.
An ANOVA actually does the same thing.
It looks at how much an increase from 0 (rainy days) to 1 (non-rainy days) affects the number of bunnies you'd see.
Now...on to another example.
Let's look at the ratings of various chocolate bars based on the type of cocoa bean used.
We'll use a dataset you can find at Kaggle.com, courtesy of Brady Brelinski.
Our three groups are chocolate bars made with Criollo beans, Forastero beans, or Trinitario beans.
Chocolate making is complex, so we took a small sample of bars that only contained one of these three beans.
And the chocolate taster used a scale--with 5 as the highest score, "transcending beyond the ordinary limits."
1 was "mostly unpalatable"...
But is there really "mostly unpalatable" chocolate out there?
We want to know if the type of bean affects our taster's ratings.
To find out, we need the ANOVA model!
Like regression, we can calculate a Sums of Squares Total by adding up the squared differences between each chocolate rating and the overall mean chocolate rating.
This gives us our Sums of Squares Total, or SST.
If that sounds like how we calculated variance, that's because it is!
SST is just n times the (population) variance.
This sum represents the total amount of variation, or information, in the data.
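That relationship between SST and variance can be checked in a few lines of Python, using made-up ratings (not the real Kaggle data):

```python
from statistics import pvariance

# Hypothetical chocolate-bar ratings (illustration only)
ratings = [3.0, 3.5, 2.75, 4.0, 3.25, 3.5]
n = len(ratings)
grand_mean = sum(ratings) / n

# SST: total squared distance of each rating from the grand mean
sst = sum((r - grand_mean) ** 2 for r in ratings)

# SST equals n times the population variance
assert abs(sst - n * pvariance(ratings)) < 1e-9
```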
Now, we need to partition this variation.
When we previously used a simple linear regression model, we partitioned this variation into two parts: Sums of Squares for Regression, and Sums of Squares for Error.
And the ANOVA does the same thing.
The first step is to figure out how much of the variation is explained by our model.
In an ANOVA--what we're using here--our best guess of a chocolate bar's rating is its group mean: for bars made with Criollo beans 3.1, Forastero beans 3.25, and Trinitario beans 3.27.
So we sum up the squared distances between each point's predicted value--its group mean--and the overall mean rating.
This is called our Model Sums of Squares (or SSM), because it's the variation our model explains.
So now we have the amount of variation explained by the model--in other words, how much variation is accounted for if we just assumed each rating value were its group mean rating.
We're also going to need the amount of variation that it DOESN'T explain.
In other words, how much ratings vary within each group of cacao beans.
So we can sum up the squared differences between each data point and its group mean to get our Sums of Squares for Error: the amount of information that our model doesn't explain.
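The whole partition can be sketched in Python with hypothetical ratings (illustration only, not the Kaggle data), to show that SST splits exactly into SSM plus SSE:

```python
# Hypothetical ratings for three bean types (illustration only)
groups = {
    "Criollo":    [2.75, 3.0, 3.25, 3.5],
    "Forastero":  [3.0, 3.25, 3.25, 3.5],
    "Trinitario": [3.0, 3.25, 3.5, 3.75],
}

all_ratings = [r for g in groups.values() for r in g]
grand_mean = sum(all_ratings) / len(all_ratings)

# SST: squared distance of every rating from the grand mean
sst = sum((r - grand_mean) ** 2 for r in all_ratings)

# SSM: squared distance of each point's group mean from the grand mean
ssm = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2
    for g in groups.values()
)

# SSE: squared distance of each point from its own group mean
sse = sum(
    (r - sum(g) / len(g)) ** 2
    for g in groups.values()
    for r in g
)

# The partition holds: SST = SSM + SSE
assert abs(sst - (ssm + sse)) < 1e-9
```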
Now that we have that information, we can calculate our F-statistic, just like we did for regression.
The F-statistic compares how much variation our model accounts for vs. how much it can't account for.
The larger that F is, the more information our model is able to give us about our chocolate bar ratings.
Again, SSM is the variation our model explains and SSE is the variation it doesn't explain.
We want to compare the two.
But we also need to account for the amount of independent information that each one uses.
So, we divide each Sums of Squares by its degrees of freedom.
Our ANOVA model has 2 degrees of freedom.
In general, the formula for degrees of freedom for a categorical variable (like cocoa bean type) in an ANOVA is k - 1, where k is the number of groups.
In our case we have 3 groups.
Our Sums of Squares for Error has 787 degrees of freedom, because we originally had 790 data points, but we calculated 3 means.
The general formula for degrees of freedom for your errors is n minus k, where n is the sample size and k is the number of groups.
For our test, we got an F-statistic of 7.7619.
This F-statistic--sometimes called an F-ratio--has a distribution that looks like this:
And we're going to use this distribution to find our p-value.
We want to know whether the effect of bean type on chocolate bar ratings is significant.
In this case we have a p-value of 0.000459.
Small enough to reject the null.
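Rather than reading the p-value off the F-distribution by eye, we can ask a computer for the upper-tail probability. Here is a sketch using SciPy (a tool choice of ours, not something named in the episode), with the F-statistic and degrees of freedom from above:

```python
from scipy.stats import f

f_stat = 7.7619    # our F-statistic
df_model = 2       # k - 1, for k = 3 bean types
df_error = 787     # n - k, for 790 bars and 3 groups

# p-value: the chance of an F-ratio at least this large if the null is true
p_value = f.sf(f_stat, df_model, df_error)   # sf is the upper tail, 1 - cdf
print(p_value)     # a very small number, in line with the 0.000459 above
```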
So we've found evidence that the beans influenced the chocolate bar ratings.
A statistically significant result means that there is SOME statistically significant difference SOMEWHERE in the groups, but it doesn't tell you where that difference is.
Maybe Trinitario is significantly different from Criollo but not Forastero beans.
An F-test is an example of an omnibus test, which means it's a test that contains many items or groups.
When we get a significant F-statistic, it means that there's SOME statistically significant difference somewhere between the groups, but we still have to look for it.
It's kinda like walking into your kitchen and smelling something realllllllly stinky.
You know there's SOMETHING gross, but you have to do more work to find out exactly what is rotting...
We already have tools to do this, in statistics at least, because you can follow up a significant F-test in an ANOVA with multiple t-tests: one for every unique pair of categories your variable had.
We had 3, which means we only need to do 3 t-tests in order to find the statistically significant difference or differences.
To conduct these t-tests, we take just the data in the two categories for that t-test, and calculate the t-statistic and p-value.
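Those follow-up comparisons can be sketched in Python with SciPy and made-up ratings (the real analysis would use the Kaggle data; these numbers are hypothetical):

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical ratings for each bean type (illustration only)
groups = {
    "Criollo":    [2.5, 2.75, 3.0, 3.25, 3.0, 2.75],
    "Forastero":  [3.25, 3.5, 3.0, 3.25, 3.5, 3.0],
    "Trinitario": [3.25, 3.0, 3.5, 3.25, 3.5, 3.75],
}

# One t-test for every unique pair of bean types: 3 choose 2 = 3 tests
results = {}
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    t_stat, p = ttest_ind(a, b)
    results[(name_a, name_b)] = (t_stat, p)
    print(f"{name_a} vs {name_b}: t = {t_stat:.2f}, p = {p:.3f}")
```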
For our first t-test we just look at the bars with Trinitario and Criollo beans.
First, we follow our test statistic general formula: we take the difference between the mean rating of chocolates made with Trinitario and Criollo beans, and divide by the standard error.
And once we do this for all three comparisons, we can see where our statistically significant differences are.
It looks--from our graph--like ratings of chocolate bars made with Criollo beans are different, in a statistically significant way, than those made with Trinitario or Forastero beans.
And our graph and group means show that Criollo bars have a slightly lower mean rating.
But bars made with Trinitario beans are NOT statistically significantly different than those made with Forastero beans.
So our ANOVA F-test told us that there WERE some differences, and our follow-up t-tests told us WHERE they were.
And this is interesting.
Criollo beans are generally considered a delicacy, and of a much higher quality than Forastero.
And Trinitario is a hybrid of the two.
But we found...in this data set...that Criollo bars had statistically significantly lower ratings.
This might be because we excluded bars with combinations of our three bean types...or because the rater has a different preference...or it might even be caused by some other unknown factor that our model does not include.
Like who made the chocolate.
Or the country of origin of the beans.
We can also use ANOVAs for more than 3 groups.
For example, the ANOVA was first created by the statistician R.A. Fisher when he was on a potato farm looking at studies of fertilizer.
In one of the first experiments he described, he looked at 12 different species of potato and the effect of various fertilizers.
Let's look at a simple version of Fisher's potato study.
Here we have 12 different varieties of potato.
We'll represent each of them with a letter, A through L.
There are 21 of each of the potato plants, for a total of 252 potato plants.
We give our future french fries about a season to grow, then we dig them up and weigh each one.
This graph shows the potato weights that we recorded, as well as the total mean potato weight and each group's mean potato weight.
Using these numbers, we can calculate our Total Sums of Squares, Model Sums of Squares, and Sums of Squares for Error.
We're going to let a computer do that for us this time.
And our computer spit out this: the degrees of freedom, sums of squares, mean squares, F-statistic, and p-value.
This is called an ANOVA table, and it organizes all the information our ANOVA models give us.
Here we can see that our model had an F-statistic--or F-value--of around 3, and a p-value of 0.000829.
So we reject the null hypothesis.
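Letting a computer do the whole thing is a one-liner with SciPy's `f_oneway`. Here is a sketch on simulated stand-in data for the potato study (12 varieties, 21 plants each; the weights are randomly generated, not Fisher's actual measurements):

```python
import random
from scipy.stats import f_oneway

random.seed(0)

# Simulated stand-in for the potato data: 12 varieties, 21 plants each
# (hypothetical weights in arbitrary units, NOT Fisher's measurements)
varieties = {
    letter: [random.gauss(8 + i * 0.1, 2) for _ in range(21)]
    for i, letter in enumerate("ABCDEFGHIJKL")
}

# f_oneway runs the whole one-way ANOVA in one call
f_stat, p_value = f_oneway(*varieties.values())
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```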
We found evidence that the potato varieties don't all have the same mean weight.
But since this was an omnibus test, our statistically significant F-test just means that there is some statistically significant difference somewhere in those 12 potato varieties.
We don't know where it is.
In that way, ANOVAs can be thought of as a first step.
We do an overall test that tells us whether there's a needle in our haystack.
If we find out there is a needle, then we go looking for it.
However, if our test tells us there's no needle, we're done.
No need to look for something that probably doesn't exist.
But you can see that this significant F-statistic for potato varieties will require MANY follow-up tests.
12 choose 2.
Or 66.
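That count is just the number of unique pairs of varieties, which Python's standard library can confirm:

```python
from math import comb

# 12 potato varieties taken 2 at a time: one t-test per pair
print(comb(12, 2))  # 66
```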
We showed a lot of calculations today, but there are two big ANOVA ideas to take away from this.
First, a lot of these different statistical models are more similar than they are different.
ANOVAs and regressions both use the General Linear Model form to create a story about how the world might work.
The ANOVA says that the best guess for a data point--like the rating of a new chocolate bar--is the mean rating of whatever group it belongs to.
Whether that's Criollo, Trinitario, or Forastero.
If we don't know anything else, we'd guess that the rating of a Criollo chocolate bar is the mean rating for all Criollo bars.
Also, an ANOVA is a great example of filtering.
If there's no evidence that bean type has an overall effect on chocolate-bar ratings, we don't want to go chasing more specific effects.
Our time is precious...and we want to use it as best as we can.
So we have more time out in the world...to look for bunnies.
Thanks for watching, I'll see you next time.