Regression: Crash Course Statistics #32 - YouTube
Channel: CrashCourse
Hi, I'm Adriene Hill and welcome back to Crash Course Statistics.

There's something to be said for flexibility. It allows you to adapt to new circumstances. Like a Transformer is a truck, but it can also be an awesome fighting robot.

Today we'll introduce you to one of the most flexible statistical tools: the General Linear Model, or GLM. The GLM will allow us to create many different models to help describe the world. The first we'll talk about is the Regression Model.

INTRO
General Linear Models say that your data can be explained by two things: your model, and some error.

First, the model. It usually takes the form Y = mx + b, or rather, Y = b + mx: an intercept plus a slope times our input variable.
Say I want to predict the number of trick-or-treaters I'll get this Halloween by using enrollment numbers from the local middle school. I have to make sure I have enough candy on hand. I expect a baseline of 25 trick-or-treaters. And then for every middle school student, I'll increase the number of trick-or-treaters I expect by 0.01. So this would be my model:

trick-or-treaters = 25 + 0.01(students)

There were about 1,000 middle school students nearby last year, so based on my model, I predicted that I'd get 35 trick-or-treaters. But reality doesn't always match predictions. When Halloween came around, I got 42, which means that the error in this case was 7.
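The arithmetic above can be sketched in a few lines of Python. The numbers come straight from the episode; the variable names are just for illustration:

```python
# The GLM view: each observation = model prediction + error.
# Model from the episode: expected trick-or-treaters = 25 + 0.01 * students
baseline = 25
per_student = 0.01

students = 1000                       # middle schoolers nearby last year
predicted = baseline + per_student * students
observed = 42                         # trick-or-treaters who actually showed up
error = observed - predicted          # the deviation the model can't explain

print(predicted, error)               # 35.0 and 7.0, matching the episode
```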
Now, error doesn't mean that something's WRONG, per se. We call it error because it's a deviation from our model. So the data isn't wrong; the model is.
And these errors can come from many sources: like variables we didn't account for in our model, including the candy-crazed kindergartners from the elementary school, or just random variation.

Models allow us to make inferences, whether it's the number of kids on my doorstep at Halloween, or the number of credit card frauds committed in a year.

General Linear Models take the information that data give us and portion it out into two major parts: information that can be accounted for by our model, and information that can't be.
There are many types of GLMs. One is Linear Regression, which can also provide a prediction for our data. But instead of predicting our data using a categorical variable like we do in a t-test, we use a continuous one.

For example, we can predict the number of likes a trending YouTube video gets based on the number of comments that it has. Here, the number of comments would be our input variable and the number of likes our output variable.
Our model will look something like this:

likes = b + m(comments)

The first thing we want to do is plot our data from 100 videos. This allows us to check whether we think that the data is best fit by a straight line, and to look for outliers: points that are really extreme compared to the rest of our data.
These two points look pretty far away from our data, so we need to decide how to handle them. We covered outliers in a previous episode, and the same rules apply here. We're trying to catch data that doesn't belong. Since we can't always tell when that's happened, we set a criterion for what an outlier is, and stick to it.

One reason that we're concerned with outliers in regression is that values that are really far away from the rest of our data can have an undue influence on the regression line. Without this extreme point, our line would look like this. But with it, like this. That's a lot of difference for one little point! There are a lot of different ways to decide, but in this case we're gonna leave them in.
One of the assumptions that we make when using linear regression is that the relationship is linear. So if there's some other shape our data takes, we may want to look into some other models. This plot looks linear, so we'll go ahead and fit our regression model. Usually a computer is going to do this part for us, but we want to show you how this line fits.
A regression line is the straight line that's as close as possible to all the data points at once. That means that it's the one straight line that minimizes the sum of the squared distance of each point to the line.
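For simple regression, that minimizing line has a closed form: the slope is the covariance-like sum Sxy divided by Sxx, and the intercept puts the line through the point of means. A minimal sketch, using made-up comment/like counts rather than the episode's real data:

```python
import numpy as np

# Hypothetical comment (x) and like (y) counts for five videos.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([100.0, 180.0, 240.0, 330.0, 390.0])

# Least-squares line: slope = Sxy / Sxx; the intercept places the line
# through the point of means (x-bar, y-bar).
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit minimizes the same sum of squared residuals.
m, b = np.polyfit(x, y, 1)
print(slope, intercept)  # 7.3 and 29.0 for this toy data
```

In practice this is the part the computer does for us, but the formula is all there is to it.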
The blue line is our regression line. Its equation looks like this:

likes = 9,104 + 6.468(comments)

This number, the y-intercept, tells us how many likes we'd expect a trending video with zero comments to have. Often, the intercept might not make much sense. In this model, it's possible that you could have a video with 0 comments, but a video with 0 comments and 9,104 likes does seem to conflict with our experience on YouTube.

The slope, a.k.a. the coefficient, tells us how much our likes are determined by the number of comments. Our coefficient here is about 6.5, which means that on average, an increase of 1 comment is associated with an increase of about 6.5 likes.
But there's another part of the General Linear Model: the error. Before we go any further, let's take a look at these errors, also called residuals. The residual plot looks like this, and we can tell a lot by looking at its shape. We want a pretty evenly spaced cloud of residuals. Ideally, we don't want them to be extreme in some areas and close to 0 in others.
It's especially concerning if you can see a weird pattern in your residuals like this, which would indicate that the error of your predictions depends on how big your predictor variable's value is. That would be like if our YouTube model was pretty accurate at predicting the number of likes for videos with very few comments, but was wildly inaccurate on videos with a lot of comments.

So, now that we've looked at this error, this is where statistical tests come in.
There are actually two common ways to do a Null Hypothesis Significance Test on a regression coefficient. Today we'll cover the F-test. The F-test, like the t-test, helps us quantify how well we think our data fit a distribution, like the null distribution. Remember, the general form of many test statistics is the difference between what we observed and what we'd expect, divided by the average variation. But I'm going to make one small tweak to the wording of our general formula to help us understand F-tests a little better.
The null hypothesis here is that there's NO relationship between the number of comments on a trending YouTube video and the number of likes. If that were true, we'd expect a kind of blob-y, amorphous-cloud-looking scatter plot and a regression line with a slope of 0. It would mean that the number of comments wouldn't help us predict the number of likes. We'd just predict the mean number of likes no matter how many comments there were.
Back to our actual data. This blue line is our observed model. And the red is the model we'd expect if the null hypothesis were true. Let's add some notation so it's easier to read our formulas. Y-hat (ŷ) represents the predicted value for our outcome variable; here it's the predicted number of likes. Y-bar (ȳ) represents the mean value of likes in this sample.

Taking the squared difference between each data point and the mean line tells us the total variation in our data set. This might look similar to how we calculated variance, because it is. Variance is just this sum of squared deviations, called the Sum of Squares Total, divided by N.
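Written out in symbols (with ȳ the sample mean of the likes), that total variation and the variance it implies are:

```latex
\mathrm{SST} = \sum_{i=1}^{N} (y_i - \bar{y})^2,
\qquad
\mathrm{Variance} = \frac{\mathrm{SST}}{N}
```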
And we want to know how much of that total variation is accounted for by our regression model, and how much is just error. That would allow us to follow the General Linear Model framework and explain our data with two things: the model's prediction, and error.

We can look at the difference between our observed slope coefficient, 6.468, and the one we'd expect if there were no relationship, 0, for each point.
And we'll start here with this point. The green line represents the difference between our observed model, which is the blue line, and the model that would occur if the null were true, which is the red line. And we can do this for EVERY point in the data set. We want negative differences and positive differences to count equally, so we square each difference so that they're all positive. Then we add them all up to get part of the numerator of our F-statistic.

The numerator has a special name in statistics: the Sums of Squares for Regression, or SSR for short. Like the name suggests, this is the sum of the squared distances between our regression model and the null model.
Now we just need a measure of average variation. We already found a measure of the total variation in our sample data, the Total Sums of Squares. And we calculated the variation that's explained by our model. The other portion of the variation should then represent the error: the variation of data points around our model, shown here in orange. The sum of these squared distances is called the Sums of Squares for Error, or SSE.

If data points are close to the regression line, then our model is pretty good at predicting outcome values, like likes on trending YouTube videos, and so our SSE will be small. If the data are far from the regression line, then our model isn't too good at predicting outcome values, and our SSE is going to be big.
Alright, so now we have all the pieces of our puzzle: Total Sums of Squares, Sums of Squares for Regression, and Sums of Squares for Error.

Total Sums of Squares represents ALL the information that we have from our data on YouTube likes. Sums of Squares for Regression represents the portion of that information that we can explain using the model we created. And Sums of Squares for Error represents the leftover information: the portion of Total Sums of Squares that the model can't explain. So the Total Sums of Squares is the sum of SSR and SSE.
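That decomposition, SST = SSR + SSE, holds exactly for any least-squares fit. A quick numerical check, again on hypothetical comment/like counts rather than the episode's real data:

```python
import numpy as np

# Hypothetical comment (x) and like (y) counts for five videos.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([100.0, 180.0, 240.0, 330.0, 390.0])

m, b = np.polyfit(x, y, 1)   # least-squares slope and intercept
y_hat = b + m * x            # model predictions (y-hat)
y_bar = y.mean()             # the null model just predicts the mean (y-bar)

sst = np.sum((y - y_bar) ** 2)      # total variation
ssr = np.sum((y_hat - y_bar) ** 2)  # variation the model explains
sse = np.sum((y - y_hat) ** 2)      # leftover (error) variation

print(sst, ssr, sse)  # sst equals ssr + sse
```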
Now we've followed the General Linear Model framework and taken our data and portioned it into two categories: Regression Model and Error. And now that we have the SSE, our measurement of error, we can finally start to fill in the bottom of our F-statistic. But we're not quite done yet. The last step to getting our F-statistic is to divide each Sums of Squares by its respective degrees of freedom.
Remember, degrees of freedom represent the amount of independent information that we have. The Sums of Squares for Error has n, the sample size, minus 2 degrees of freedom. We had 100 pieces of independent information from our data, and we used 1 to calculate the y-intercept and 1 to calculate the regression coefficient. So the Sums of Squares for Error has 98 degrees of freedom. The Sums of Squares for Regression has one degree of freedom, because we're using one piece of independent information to estimate our coefficient, our slope.

We have to divide each Sums of Squares by its degrees of freedom because we want to weight each one appropriately. More degrees of freedom mean more information. It's like how you wouldn't be surprised that Katie Mack, who has a PhD in astrophysics, can explain more about the planets than someone taking a high school physics class. Of course she can; she has way more information. Similarly, we want to make sure to scale the Sums of Squares based on the amount of independent information each has.
So we鈥檙e finally left with this:
[607]
And using an F-distribution, we can find our
p-value: the probability that we鈥檇 get a
[612]
F statistic as big or bigger than 59.613.
[616]
Our p-value is super tiny.
[618]
It鈥檚 about 0.000-000-000-000-99.
[624]
With an alpha level of 0.05, we reject the
null that there is NO relationship between
[628]
likes and YouTube comments on trending videos.
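The whole F computation fits in a few lines. A sketch on hypothetical five-video data (so the F value here is the toy data's, not the episode's 59.613):

```python
import numpy as np

# Hypothetical comment (x) and like (y) counts for five videos.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([100.0, 180.0, 240.0, 330.0, 390.0])
n = len(x)

m, b = np.polyfit(x, y, 1)
y_hat = b + m * x

ssr = np.sum((y_hat - y.mean()) ** 2)  # 1 degree of freedom (the slope)
sse = np.sum((y - y_hat) ** 2)         # n - 2 degrees of freedom

f_stat = (ssr / 1) / (sse / (n - 2))
print(f_stat)
```

If SciPy is available, `scipy.stats.f.sf(f_stat, 1, n - 2)` then gives the right-tail p-value for this F-statistic.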
So we reject that the true coefficient for the relationship between likes and comments on YouTube is 0. The F-statistic allows us to directly compare the amount of variation that our model can and cannot explain. When our model explains a lot of variation, we consider it statistically significant.

And it turns out, if we did a t-test on this coefficient, we'd get the exact same p-value. That's because these two methods of hypothesis testing are equivalent; in fact, if you square our t-statistic, you'll get our F-statistic!
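You can check that t-squared equals F numerically. A sketch on hypothetical toy data; the identity holds for any simple linear regression, not just this dataset (the standard-error formula below is the usual one for a slope coefficient):

```python
import numpy as np

# Hypothetical comment (x) and like (y) counts for five videos.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([100.0, 180.0, 240.0, 330.0, 390.0])
n = len(x)

m, b = np.polyfit(x, y, 1)
y_hat = b + m * x
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

f_stat = ssr / (sse / (n - 2))

# t-statistic for the slope: coefficient divided by its standard error.
se_m = np.sqrt((sse / (n - 2)) / np.sum((x - x.mean()) ** 2))
t_stat = m / se_m

print(t_stat ** 2, f_stat)  # the same number, up to floating-point error
```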
And we're going to talk more about why F-tests are important later.

Regression is a really useful tool to understand. Scientists, economists, and political scientists use it to make discoveries and communicate those discoveries to the public. Regression can be used to model the relationship between increased taxes on cigarettes and the average number of cigarettes people buy. Or to show the relationship between peak heart rate during exercise and blood pressure. Note that we're not able to use regression alone to determine whether one thing causes changes in another.

But more abstractly, we learned today about the General Linear Model framework. What happens in life can be explained by two things: what we know about how the world works, and error, or deviations, from that model. Like say you budgeted $30 for gas and only ended up needing $28 last week. The reality deviated from your guess and now you get to go to The Blend Den again! Or just how angry your roommate is that you left dishes in the sink can be explained by how many days you left them out, with a little wiggle room for error depending on how your roommate's day was.

Alright, thanks for watching. I'll see you next time.