ANOVA: Crash Course Statistics #33 - YouTube

Channel: CrashCourse

Hi, I’m Adriene Hill, and welcome back to Crash Course Statistics.

In many of our episodes we’ve looked at t-tests, which, among other things, are good for testing the difference between two groups. Like people with or without cats. Families below the poverty line... and families above it. Petri dishes of cells that are treated with a chemical and those that aren’t.

But the world isn’t always so binary. We often want to compare measurements of MORE than two groups. Things like ethnicity, medical diagnosis, country of origin, or job title.

So today, we’re going to apply the General Linear Model framework we learned in the last episode to test the difference between multiple groups using a new model called the ANOVA.
INTRO
The GLM framework takes all the information that our data contain and partitions it into two piles: information that can be explained by a model that represents the way we think things work, and error, which is the amount of information that our model fails to explain.

So let’s apply that to a new model: the ANOVA. ANOVA is an acronym for ANalysis Of VAriance. It’s actually very similar to regression, except we’re using a categorical variable to predict a continuous one.
Like using a soccer player’s position to predict the number of yards he runs in a game. Or using highest completed degree to predict a person’s salary. Note that this alone isn’t evidence that getting a degree causes a higher salary, just that knowing someone’s degree might help estimate how much they get paid.
Like regression, the ANOVA builds a model of how the world works. For example, my model for how many bunnies I’ll see on my walk into work might be that if it’s raining I’ll see 1 bunny, and if it’s sunny, I’ll see 5. I walk through a bunny preserve...

1 and 5 are my predictions for how many bunnies I’ll see, based on whether or not it’s raining. Yesterday it rained. And I saw two bunnies! My model predicted 1, and my error is 1.
And we can represent this model as a sort of regression where there are ONLY two possible values that the variable Weather can have: 0 if it rains, or 1 if it doesn’t. In this case, the expected number of bunnies on a rainy day is 1, and beta is the difference between the two means: 5 - 1 = 4. Which means our ANOVA model looks like this:
In a regression we did a statistical test of the slope, and that’s what this simple ANOVA is doing too. Since we assigned rainy days to be coded as 0, and sunny days as 1, the change in the X-direction is just one (1 - 0). So the slope of this line is the mean bunny count on sunny days, five, minus the mean bunny count on rainy days, one. This difference of 4 is the change in the Y-direction. We test this difference in the same way that we tested the regression slope.

Usually we like to think of this slope as the difference between the two group means. But knowing that our model treats it like a slope helps us understand how ANOVAs relate to regression.
In a regression, the slope tells you how much an increase of one unit in X affects Y. For example, how much an increase of 1 year increases shoe size in kids. An ANOVA actually does the same thing: it looks at how much an increase from 0 (rainy days) to 1 (non-rainy days) affects the number of bunnies you’d see.
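That dummy-coding idea can be sketched in a few lines of Python. The bunny counts below are hypothetical stand-ins, chosen so the group means come out to 1 and 5 like in the example:

```python
# Dummy-coded ANOVA-as-regression sketch: Weather is 0 (rainy) or 1 (sunny).
# Counts are hypothetical, picked so the group means are 1 and 5.
rainy = [1, 2, 0, 1]   # hypothetical bunny counts on rainy days (mean 1)
sunny = [5, 4, 6, 5]   # hypothetical bunny counts on sunny days (mean 5)

mean_rainy = sum(rainy) / len(rainy)   # the intercept: prediction when x = 0
mean_sunny = sum(sunny) / len(sunny)   # the prediction when x = 1
beta = mean_sunny - mean_rainy         # the "slope": change as x goes 0 -> 1

print(mean_rainy, mean_sunny, beta)    # 1.0 5.0 4.0
```

Because the X-direction changes by exactly one, the fitted slope is just the difference of the two group means.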
Now... on to another example. Let’s look at the ratings of various chocolate bars based on the type of cocoa bean used. We’ll use a dataset you can find at Kaggle.com, courtesy of Brady Brelinski. Our three groups are chocolate bars made with Criollo beans, Forastero beans, or Trinitario beans. Chocolate making is complex, so we took a small sample of bars that contained only one of these three beans.

The chocolate taster used a scale where 5 was the highest score, ā€œtranscending beyond the ordinary limits,ā€ and 1 was ā€œmostly unpalatableā€... But is there really ā€œmostly unpalatableā€ chocolate out there? We want to know if the type of bean affects our taster’s ratings. To find out, we need the ANOVA model!
Like regression, we can calculate a Sums of Squares Total by adding up the squared differences between each chocolate rating and the overall mean chocolate rating. This gives us our Sums of Squares Total, or SST. If that sounds like how we calculated variance, that’s because it is! SST is just N times the variance. This sum represents the total amount of variation, or information, in the data.
Now, we need to partition this variation. When we previously used a simple linear regression model, we partitioned this variation into two parts: Sums of Squares for Regression and Sums of Squares for Error. And the ANOVA does the same thing. The first step is to figure out how much of the variation is explained by our model. In the ANOVA we’re using here, our best guess of a chocolate bar’s rating is its group mean.
For bars made with Criollo beans that’s 3.1, for Forastero beans 3.25, and for Trinitario beans 3.27. So, for every data point, we sum up the squared distance between its group mean and the overall mean. This is called our Model Sums of Squares (or SSM), because it’s the variation our model explains.
So now we have the amount of variation explained by the model; in other words, how much variation is accounted for if we just assumed each rating were its group’s mean rating. We’re also going to need the amount of variation that the model DOESN’T explain; in other words, how much the ratings vary within each group of cacao beans. So, we can sum up the squared differences between each data point and its group mean to get our Sums of Squares for Error: the amount of information that our model doesn’t explain.
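The partition can be sketched like this. The three groups of ratings are hypothetical, but the identity SST = SSM + SSE holds for any data:

```python
# Partitioning total variation into model (SSM) and error (SSE) piles.
# Hypothetical ratings for the three bean types, not the real data.
groups = {
    "Criollo":    [3.0, 3.25, 3.0, 2.75],
    "Forastero":  [3.25, 3.5, 3.0, 3.25],
    "Trinitario": [3.5, 3.25, 3.0, 3.5],
}

all_ratings = [r for g in groups.values() for r in g]
grand_mean = sum(all_ratings) / len(all_ratings)
sst = sum((r - grand_mean) ** 2 for r in all_ratings)

# SSM: for every data point, squared distance from its group mean to the grand mean
ssm = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
# SSE: squared distance from each data point to its own group mean
sse = sum((r - sum(g) / len(g)) ** 2 for g in groups.values() for r in g)

print(abs(sst - (ssm + sse)) < 1e-9)  # True: the partition is exact
```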
Now that we have that information, we can calculate our F-statistic, just like we did for regression. The F-statistic compares how much variation our model accounts for vs. how much it can’t account for. The larger F is, the more information our model is able to give us about our chocolate bar ratings.

Again, SSM is the variation our model explains and SSE is the variation it doesn’t explain. We want to compare the two. But we also need to account for the amount of independent information that each one uses. So, we divide each Sums of Squares by its degrees of freedom.
Our ANOVA model has 2 degrees of freedom. In general, the formula for the degrees of freedom of a categorical variable (like cocoa bean type) in an ANOVA is k - 1, where k is the number of groups; in our case we have 3 groups. Our Sums of Squares for Error has 787 degrees of freedom, because we originally had 790 data points, but we calculated 3 means. The general formula for the error degrees of freedom is n - k, where n is the sample size and k is the number of groups.
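In code, the F-ratio looks like this. The group count and sample size match the chocolate example, but the sums of squares here are hypothetical placeholders, since we don’t have the raw data in the transcript:

```python
# F = (SSM / df_model) / (SSE / df_error)
k, n = 3, 790           # 3 bean types, 790 chocolate bars (from the transcript)
df_model = k - 1        # degrees of freedom for the model
df_error = n - k        # degrees of freedom for error

ssm, sse = 7.0, 355.0   # hypothetical sums of squares, for illustration only
f_stat = (ssm / df_model) / (sse / df_error)

print(df_model, df_error)  # 2 787
```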
For our test, we got an F-statistic of 7.7619. This F-statistic, sometimes called an F-ratio, has a distribution that looks like this:

And we’re going to use this distribution to find our p-value. We want to know whether the effect of bean type on chocolate bar ratings is significant. In this case we have a p-value of 0.000459, small enough to reject the null. So we’ve found evidence that beans influenced the chocolate bar ratings.
A statistically significant result means that there is SOME statistically significant difference SOMEWHERE among the groups, but it doesn’t tell you where that difference is. Maybe Trinitario is significantly different from Criollo but not from Forastero beans.

An F-test is an example of an omnibus test, which means it’s a test that covers many items or groups at once. When we get a significant F-statistic, it means that there’s SOME statistically significant difference somewhere between the groups, but we still have to look for it. It’s kinda like walking into your kitchen and smelling something realllllllly stinky. You know there’s SOMETHING gross, but you have to do more work to find out exactly what is rotting...
We already have tools to do this, in statistics at least, because you can follow up a significant F-test in an ANOVA with multiple t-tests, one for every unique pair of categories your variable has. We had 3 categories, which means we only need to do 3 t-tests in order to find the statistically significant difference or differences. To conduct these t-tests, we take just the data in the two categories for that t-test and calculate the t-statistic and p-value.
For our first t-test we just look at the bars with Trinitario and Criollo beans. First, we follow our general formula for a test statistic: we take the difference between the mean rating of chocolates made with Trinitario and Criollo beans, and divide by the standard error.
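One of those follow-up t-tests can be sketched by hand. The ratings below are hypothetical, not the real Trinitario and Criollo data, and the standard error here is the Welch (unpooled) version:

```python
import math

# Two-sample t-statistic: difference in group means over its standard error.
# Hypothetical ratings, not the real Trinitario/Criollo data.
trinitario = [3.5, 3.25, 3.5, 3.0, 3.25]
criollo    = [3.0, 3.25, 2.75, 3.0, 3.25]

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Welch (unpooled) standard error of the difference in means
se = math.sqrt(sample_var(trinitario) / len(trinitario)
               + sample_var(criollo) / len(criollo))
t_stat = (mean(trinitario) - mean(criollo)) / se
print(round(t_stat, 2))  # 1.89
```

From here you’d compare the t-statistic to a t-distribution to get the p-value for that pair.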
And once we do this for all three comparisons, we can see where our statistically significant differences are.
It looks, from our graph, like ratings of chocolate bars made with Criollo beans are different, in a statistically significant way, from those made with Trinitario or Forastero beans. And our graph and group means show that Criollo bars have a slightly lower mean rating. But bars made with Trinitario beans are NOT statistically significantly different from those made with Forastero beans. So our ANOVA F-test told us that there WERE some differences, and our follow-up t-tests told us WHERE they were.
And this is interesting. Criollo beans are generally considered a delicacy and of much higher quality than Forastero, and Trinitario are a hybrid of the two. But we found, in this data set, that Criollo bars had statistically significantly lower ratings. This might be because we excluded bars with combinations of our three bean types... or because the rater has a different preference... or it might even be caused by some other unknown factor that our model does not include, like who made the chocolate, or the country of origin of the beans.
We can also use ANOVAs for more than 3 groups. For example, the ANOVA was first created by the statistician R.A. Fisher while he was looking at fertilizer studies on a potato farm. In one of the first experiments he described, he looked at 12 different varieties of potato and the effect of various fertilizers.
Let’s look at a simple version of Fisher’s potato study. Here we have 12 different varieties of potato. We’ll represent each of them with a letter, A through L. There are 21 of each of the potato plants, for a total of 252 potato plants. We give our future french fries about a season to grow, then we dig them up and weigh each one. This graph shows the potato weights that we recorded, as well as the overall mean potato weight and each group’s mean potato weight. Using these numbers, we can calculate our Total Sums of Squares, Model Sums of Squares, and Sums of Squares for Error.
We’re going to let a computer do that for us this time. And our computer spit out this: the degrees of freedom, sums of squares, mean squares, F-statistic, and p-value. This is called an ANOVA table, and it organizes all the information our ANOVA model gives us. Here we can see that our model had an F-statistic, or F-value, of around 3, and a p-value of 0.000829.
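Here’s a by-hand sketch of the quantities in that table, on a toy dataset of three hypothetical varieties (not Fisher’s 12, and not his data):

```python
# Building the pieces of an ANOVA table by hand for toy data.
groups = [
    [2.1, 2.4, 2.2, 2.5],   # hypothetical weights for variety A
    [2.8, 2.6, 2.9, 2.7],   # variety B
    [2.0, 2.3, 2.1, 2.2],   # variety C
]

n = sum(len(g) for g in groups)      # total sample size
k = len(groups)                      # number of groups
grand_mean = sum(x for g in groups for x in g) / n

ssm = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

msm = ssm / (k - 1)                  # mean square for the model
mse = sse / (n - k)                  # mean square for error
f_stat = msm / mse

print(f"df: {k - 1}, {n - k}; SS: {ssm:.3f}, {sse:.3f}; F: {f_stat:.2f}")
```

Software just fills in these same columns (degrees of freedom, sums of squares, mean squares, F) plus the p-value from the F distribution.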
So we reject the null hypothesis. We found evidence that the potato varieties don’t all have the same mean weight. But since this was an omnibus test, our statistically significant F-test just means that there is some statistically significant difference somewhere among those 12 potato varieties. We don’t know where it is.
In that way, ANOVAs can be thought of as a first step. We do an overall test that tells us whether there’s a needle in our haystack. If we find out there is a needle, then we go looking for it. However, if our test tells us there’s no needle, we’re done. No need to look for something that probably doesn’t exist. But you can see that this significant F-statistic for potato varieties will require MANY follow-up tests: 12 choose 2, or 66.
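That count is just a combination, which the Python standard library can confirm:

```python
import math

# Number of unique pairs among 12 potato varieties: "12 choose 2"
pairs = math.comb(12, 2)
print(pairs)  # 66
```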
We showed a lot of calculations today, but there are two big ANOVA ideas to take away from this. First, a lot of these different statistical models are more similar than they are different. ANOVAs and regressions both use the General Linear Model form to create a story about how the world might work. The ANOVA says that the best guess for a data point, like the rating of a new chocolate bar, is the mean rating of whatever group it belongs to, whether that’s Criollo, Trinitario, or Forastero. If we don’t know anything else, we’d guess that the rating of a Criollo chocolate bar is the mean rating for all Criollo bars.

Second, an ANOVA is a great example of filtering. If there’s no evidence that bean type has an overall effect on chocolate-bar ratings, we don’t want to go chasing more specific effects. Our time is precious... and we want to use it as best we can. So we have more time out in the world... to look for bunnies.

Thanks for watching, I’ll see you next time.