Regression: Crash Course Statistics #32

Channel: CrashCourse

[3]
Hi, I'm Adriene Hill and welcome back to Crash Course Statistics.
[6]
There鈥檚 something to be said for flexibility.
[8]
It allows you to adapt to new circumstances.
[11]
Like a Transformer is a truck, but it can also be an awesome fighting robot.
[15]
Today we'll introduce you to one of the most flexible statistical tools--the General
[18]
Linear Model, or GLM.
[21]
The GLM will allow us to create many different models to help describe the world.
[25]
The first we'll talk about is the regression model.
[27]
INTRO
[36]
General Linear Models say that your data can be explained by two things: your model, and
[41]
some error:
[42]
First, the model.
[43]
It usually takes the form Y = mx + b, or rather, Y = b + mx, as it's most often written in statistics.
[49]
Say I want to predict the number of trick-or-treaters I'll get this Halloween by using enrollment
[54]
numbers from the local middle school.
[56]
I have to make sure I have enough candy on hand.
[58]
I expect a baseline of 25 trick-or-treaters.
[61]
And then for every middle school student, I'll increase the number of trick-or-treaters
[64]
I expect by 0.01.
[67]
So this would be my model:
[69]
There were about 1,000 middle school students nearby last year, so based on my model, I
[73]
predicted that I'd get 35 trick-or-treaters.
[76]
But reality doesn't always match predictions.
[79]
When Halloween came around, I got 42, which means that the error in this case was 7.
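Spelled out as a quick sketch (hypothetical Python; the function name is ours, but the numbers come straight from the example):

```python
# Linear model from the example: y = b + m*x
# b = baseline of 25 trick-or-treaters, m = 0.01 extra per middle schooler.
def predict_trick_or_treaters(n_students, baseline=25, rate=0.01):
    return baseline + rate * n_students

predicted = predict_trick_or_treaters(1000)  # 25 + 0.01 * 1000 = 35
observed = 42
error = observed - predicted                 # 42 - 35 = 7
```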
[84]
Now, error doesn't mean that something's WRONG, per se.
[87]
We call it error because it's a deviation from our model.
[90]
So the data isn't wrong, the model is.
[92]
And these errors can come from many sources: like variables we didn't account for in
[96]
our model--including the candy-crazed kindergartners from the elementary school--or just random variation.
[102]
Models allow us to make inferences--whether it's the number of kids on my doorstep at
[106]
Halloween, or the number of credit card frauds committed in a year.
[110]
General Linear Models take the information that data give us and portion it out into
[114]
two major parts: information that can be accounted for by our model, and information that can't be.
[120]
There are many types of GLMs. One is linear regression,
[124]
which can also provide a prediction for our data.
[127]
But instead of predicting our data using a categorical variable like we do in a t-test,
[132]
we use a continuous one.
[134]
For example, we can predict the number of likes a trending YouTube video gets based
[138]
on the number of comments that it has.
[141]
Here, the number of comments would be our input variable and the number of likes our
[145]
output variable.
[147]
Our model will look something like this:
[149]
The first thing we want to do is plot our data from 100 videos:
[153]
This allows us to check whether we think that the data is best fit by a straight line, and
[158]
look for outliers--those are points that are really extreme compared to the rest of our data.
[163]
These two points look pretty far away from our data.
[166]
So we need to decide how to handle them.
[168]
We covered outliers in a previous episode, and the same rules apply here.
[172]
We're trying to catch data that doesn't belong.
[174]
Since we can't always tell when that happened, we set a criterion for what an outlier is,
[180]
and stick to it.
[181]
One reason that we're concerned with outliers in regression is that values that are really
[185]
far away from the rest of our data can have an undue influence on the regression line.
[190]
Without this extreme point, our line would look like this.
[193]
But with it, like this.
[194]
That's a lot of difference for one little point!
[197]
There are a lot of different ways to decide, but in this case we're gonna leave them in.
[201]
One of the assumptions that we make when using linear regression is that the relationship
[206]
is linear.
[207]
So if there's some other shape our data takes, we may want to look into some other models.
[212]
This plot looks linear, so we'll go ahead and fit our regression model.
[215]
Usually a computer is going to do this part for us, but we want to show you how this line fits.
[219]
A regression line is the straight line that's as close as possible to all the data points
[224]
at once.
[225]
That means that it's the one straight line that minimizes the sum of the squared distances
[231]
from each point to the line.
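A sketch of that criterion with made-up numbers (the data below are ours, not the video's); `np.polyfit` with degree 1 returns exactly the slope and intercept that minimize the summed squared distances:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up predictor values
y = np.array([2.0, 5.0, 6.0, 9.0])   # made-up outcome values

# Degree-1 polyfit is ordinary least squares: it minimizes
# sum((y - (intercept + slope * x))**2) over all possible lines.
slope, intercept = np.polyfit(x, y, 1)

# Any other line leaves a bigger sum of squared residuals:
best  = ((y - (intercept + slope * x)) ** 2).sum()
worse = ((y - (intercept + (slope + 0.1) * x)) ** 2).sum()
assert best < worse
```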
[233]
The blue line is our regression line.
[235]
Its equation looks like this:
[237]
This number--the y-intercept--tells us how many likes we'd expect a trending video
[241]
with zero comments to have.
[244]
Often, the intercept might not make much sense.
[247]
In this model, it's possible that you could have a video with 0 comments, but a video
[251]
with 0 comments and 9104 likes does seem to conflict with our experience on YouTube.
[258]
The slope--aka the coefficient--tells us how much our likes are determined by the number
[263]
of comments.
[264]
Our coefficient here is about 6.5, which means that on average, an increase in 1 comment
[269]
is associated with an increase of about 6.5 likes.
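Plugging those fitted numbers into the model is just arithmetic (a sketch; the helper name is ours, the coefficients are the video's):

```python
# Fitted model from the video: likes is roughly 9104 + 6.5 * comments
def predicted_likes(comments, intercept=9104, slope=6.5):
    return intercept + slope * comments

predicted_likes(0)    # the intercept: 9104 likes at zero comments
predicted_likes(100)  # 9104 + 6.5 * 100 = 9754 likes
```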
[273]
But there's another part of the General Linear Model: the error.
[276]
Before we go any further, let's take a look at these errors--also called residuals.
[280]
The residual plot looks like this:
[283]
And we can tell a lot by looking at its shape.
[285]
We want a pretty evenly spaced cloud of residuals.
[288]
Ideally, we don't want them to be extreme in some areas and close to 0 in others.
[294]
It's especially concerning if you can see a weird pattern in your residuals like this:
[298]
Which would indicate that the error of your predictions is dependent on how big your predictor
[303]
variable value is.
[305]
That would be like if our YouTube model was pretty accurate at predicting the number of
[308]
likes for videos with very few comments, but was wildly inaccurate on videos with a lot
[313]
of comments.
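Residuals are just observed minus predicted values. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up observed likes and model predictions for five videos.
y     = np.array([9100.0, 9450.0, 9880.0, 10500.0, 11600.0])
y_hat = np.array([9104.0, 9429.0, 9884.0, 10404.0, 11704.0])

residuals = y - y_hat   # observed minus predicted: [-4., 21., -4., 96., -104.]

# A healthy residual plot scatters evenly around zero; a fan shape or
# curve (errors growing with the predictor) signals a problem.
```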
[314]
So, now that we've looked at this error, this is where statistical tests come in.
[318]
There are actually two common ways to do a Null Hypothesis Significance test on a regression coefficient.
[324]
Today we'll cover the F-test.
[326]
The F-test, like the t-test, helps us quantify how well we think our data fit a distribution,
[331]
like the null distribution.
[332]
Remember, the general form of many test statistics is this:
[335]
But I'm going to make one small tweak to the wording of our general formula to help
[339]
us understand F-tests a little better.
[341]
The null hypothesis here is that there's NO relationship between the number of comments
[345]
on a trending YouTube video and the number of likes.
[348]
If that were true, we'd expect a kind of blob-y, amorphous-cloud-looking scatter plot
[353]
and a regression line with a slope of 0.
[356]
It would mean that the number of comments wouldn't help us predict the number of likes.
[359]
We'd just predict the mean number of likes no matter how many comments there were.
[364]
Back to our actual data.
[365]
This blue line is our observed model.
[368]
And the red is the model we'd expect if the null hypothesis were true.
[372]
Let's add some notation so it's easier to read our formulas.
[375]
Y-hat looks like this, and it represents the predicted value for our outcome variable--here
[380]
it's the predicted number of likes.
[382]
Y-bar looks like this, and it represents the mean value of likes in this sample.
[387]
Taking the squared difference between each data point and the mean line tells us the
[392]
total variation in our data set.
[394]
This might look similar to how we calculated variance, because it is.
[399]
Variance is just this sum of squared deviations--called the Sum of Squares Total--divided by N.
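In symbols, the Sum of Squares Total is the sum of (y minus y-bar) squared, and variance is that divided by N. A tiny worked example (the numbers are ours):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])   # toy outcome values
y_bar = y.mean()                      # 6.0

sst = ((y - y_bar) ** 2).sum()        # 9 + 1 + 1 + 9 = 20: Sum of Squares Total
variance = sst / len(y)               # 20 / 4 = 5
```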
[404]
And we want to know how much of that total variation is accounted for by our regression
[407]
model, and how much is just error.
[409]
That would allow us to follow the General Linear Model framework and explain our data
[413]
with two things: the model's prediction, and error.
[416]
We can look at the difference between our observed slope coefficient--6.468--and the
[421]
one we'd expect if there were no relationship--0--for each point.
[425]
And we'll start here with this point:
[426]
The green line represents the difference between our observed model--which is the blue line--and
[431]
the model that would occur if the null were true--which is the red line.
[435]
And we can do this for EVERY point in the data set.
[437]
We want negative differences and positive differences to count equally, so we square
[441]
each difference so that they're all positive.
[444]
Then we add them all up to get part of the numerator of our F-statistic:
[448]
The numerator has a special name in statistics.
[451]
It's called the Sums of Squares for Regression, or SSR for short.
[455]
Like the name suggests, this is the sum of the squared distances between our regression
[459]
model and the null model.
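With a fitted line in hand, SSR is a one-liner: the squared gaps between the model's predictions and the mean (null) line, summed. A sketch on toy data (ours, not the video's):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # toy predictor
y = np.array([2.0, 5.0, 6.0, 9.0])      # toy outcome

slope, intercept = np.polyfit(x, y, 1)  # least-squares fit
y_hat = intercept + slope * x           # the regression line's predictions

# Squared distance between the regression model and the null (mean) model:
ssr = ((y_hat - y.mean()) ** 2).sum()   # 24.2 for this data
```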
[460]
Now we just need a measure of average variation.
[463]
We already found a measure of the total variation in our sample data, the Total Sums of Squares.
[468]
And we calculated the variation that's explained by our model.
[472]
The other portion of the variation should then represent the error, the variation of
[475]
data points around our model.
[477]
Shown here in orange.
[478]
The sum of these squared distances is called the Sums of Squares for Error (SSE).
[483]
If data points are close to the regression line, then our model is pretty good at predicting
[487]
outcome values like likes on trending YouTube videos.
[491]
And so our SSE will be small.
[492]
If the data are far from the regression line, then our model isn't too good at predicting
[496]
outcome values.
[497]
And our SSE is going to be big.
[498]
Alright, so now we have all the pieces of our puzzle.
[501]
Total Sums of Squares, Sums of Squares for Regression, and Sums of Squares for Error:
[506]
Total Sums of Squares represents ALL the information that we have from our Data on YouTube likes.
[511]
Sums of Squares for Regression represents the proportion of that information that we
[515]
can explain using the model we created.
[517]
And Sums of Squares for Error represents the leftover information--the portion of Total
[522]
Sums of Squares that the model can鈥檛 explain.
[525]
So the Total Sums of Squares is the Sum of SSR and SSE.
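That identity--Total Sums of Squares equals SSR plus SSE--is easy to verify numerically on toy data (the numbers are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = ((y - y.mean()) ** 2).sum()      # total variation:        25.0
ssr = ((y_hat - y.mean()) ** 2).sum()  # explained by the model: 24.2
sse = ((y - y_hat) ** 2).sum()         # leftover error:          0.8

assert abs(sst - (ssr + sse)) < 1e-9   # SST = SSR + SSE
```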
[528]
Now we've followed the General Linear Model framework and taken our data and portioned
[532]
it into two categories: Regression Model, and Error.
[536]
And now that we have the SSE, our measurement of error, we can finally start to fill in
[539]
the bottom of our F-statistic.
[541]
But we're not quite done yet.
[542]
The final step to getting our F-statistic is to divide each Sums of Squares by its
[547]
respective degrees of freedom.
[549]
Remember degrees of freedom represent the amount of independent information that we have.
[554]
The Sums of Squares for Error has n--the sample size--minus 2 degrees of freedom.
[558]
We had 100 pieces of independent information from our data, and we used 1 to calculate
[563]
the y-intercept and 1 to calculate the regression coefficient.
[567]
So the Sums of Squares for Error has 98 degrees of freedom.
[570]
The Sums of Squares for Regression has one degree of freedom, because we're using one
[575]
piece of independent information to estimate our coefficient, the slope.
[579]
We have to divide each sums of squares by its degrees of freedom because we want to
[583]
weight each one appropriately.
[585]
More degrees of freedom mean more information.
[587]
It's like how you wouldn't be surprised that Katie Mack, who has a PhD in astrophysics,
[591]
can explain more about the planets than someone taking a high school physics class.
[595]
Of course she can; she has way more information.
[598]
Similarly, we want to make sure to scale the Sums of Squares based on the amount of independent
[604]
information each has.
[605]
So we're finally left with this:
[607]
And using an F-distribution, we can find our p-value: the probability that we'd get an
[612]
F-statistic as big as or bigger than 59.613.
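Putting the pieces together on toy data (the data and resulting numbers below are ours; the video's F of 59.613 comes from its own 100-video dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

n = len(y)
ssr = ((y_hat - y.mean()) ** 2).sum()
sse = ((y - y_hat) ** 2).sum()

# F = (SSR / df_regression) / (SSE / df_error), with df = 1 and n - 2.
f_stat = (ssr / 1) / (sse / (n - 2))   # 60.5 for this toy data

# The p-value is the area to the right of f_stat under an F(1, n-2)
# distribution, e.g. scipy.stats.f.sf(f_stat, 1, n - 2) if SciPy is available.
```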
[616]
Our p-value is super tiny.
[618]
It's about 0.00000000000099.
[624]
With an alpha level of 0.05, we reject the null that there is NO relationship between
[628]
likes and YouTube comments on trending videos.
[631]
So we reject the idea that the true coefficient for the relationship between likes and comments on
[636]
YouTube is 0.
[637]
The F-statistic allows us to directly compare the amount of variation that our model can
[643]
and cannot explain.
[645]
When our model explains a lot of variation, we consider it statistically significant.
[649]
And it turns out, if we did a t-test on this coefficient, we'd get the exact same p-value.
[654]
That's because these two methods of hypothesis testing are equivalent; in fact, if you square
[658]
our t-statistic, you'll get our F-statistic!
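You can check that equivalence directly (toy data again, ours; the slope's t-statistic uses the standard-error formula for simple regression):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

n = len(y)
sse = ((y - y_hat) ** 2).sum()
mse = sse / (n - 2)                                    # error variance estimate
se_slope = np.sqrt(mse / ((x - x.mean()) ** 2).sum())  # standard error of slope
t_stat = slope / se_slope

ssr = ((y_hat - y.mean()) ** 2).sum()
f_stat = (ssr / 1) / (sse / (n - 2))

assert abs(t_stat ** 2 - f_stat) < 1e-8   # t squared equals F
```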
[661]
And we're going to talk more about why F-tests are important later.
[664]
Regression is a really useful tool to understand.
[667]
Scientists, economists, and political scientists use it to make discoveries and communicate
[671]
those discoveries to the public.
[673]
Regression can be used to model the relationship between increased taxes on cigarettes and
[677]
the average number of cigarettes people buy.
[680]
Or to show the relationship between peak-heart-rate-during-exercise and blood pressure.
[684]
Not that we're able to use regression alone to determine whether one variable causes changes in the other.
[689]
But more abstractly, we learned today about the General Linear Model framework.
[693]
What happens in life can be explained by two things: what we know about how the world works,
[697]
and error--or deviations--from that model.
[700]
Like say you budgeted $30 for gas and only ended up needing $28 last week.
[705]
The reality deviated from your guess, and now you get to go to The Blend Den again!
[709]
Or just how angry your roommate is that you left dishes in the sink can be explained by
[713]
how many days you left them out, with a little wiggle room for error depending on how your
[718]
roommate's day was.
[719]
Alright, thanks for watching. I'll see you next time.