Regression: Crash Course Statistics #32

Channel: CrashCourse

[3]
Hi, I'm Adriene Hill and welcome back to Crash Course Statistics.
[6]
There鈥檚 something to be said for flexibility.
[8]
It allows you to adapt to new circumstances.
[11]
Like a Transformer is a truck, but it can also be an awesome fighting robot.
[15]
Today we'll introduce you to one of the most flexible statistical tools--the General
[18]
Linear Model, or GLM.
[21]
The GLM will allow us to create many different models to help describe the world.
[25]
The first we'll talk about is the regression model.
[27]
INTRO
[36]
General Linear Models say that your data can be explained by two things: your model, and
[41]
some error:
[42]
First, the model.
[43]
It usually takes the form Y = mx + b, or rather, Y = b + mx, as it's most often written in statistics.
[49]
Say I want to predict the number of trick-or-treaters I'll get this Halloween by using enrollment
[54]
numbers from the local middle school.
[56]
I have to make sure I have enough candy on hand.
[58]
I expect a baseline of 25 trick-or-treaters.
[61]
And then for every middle school student, I'll increase the number of trick-or-treaters
[64]
I expect by 0.01.
[67]
So this would be my model:
[69]
There were about 1,000 middle school students nearby last year, so based on my model, I
[73]
predicted that I'd get 35 trick-or-treaters.
[76]
But reality doesn't always match predictions.
[79]
When Halloween came around, I got 42, which means that the error in this case was 7.
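Spelled out as a quick sketch (hypothetical Python; the function name is ours, but the numbers come straight from the example):

```python
# Linear model from the example: y = b + m*x
# b = baseline of 25 trick-or-treaters, m = 0.01 extra per middle schooler.
def predict_trick_or_treaters(n_students, baseline=25, rate=0.01):
    return baseline + rate * n_students

predicted = predict_trick_or_treaters(1000)  # 25 + 0.01 * 1000 = 35
observed = 42
error = observed - predicted                 # 42 - 35 = 7
```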
[84]
Now, error doesn't mean that something's WRONG, per se.
[87]
We call it error because it's a deviation from our model.
[90]
So the data isn't wrong, the model is.
[92]
And these errors can come from many sources: like variables we didn't account for in
[96]
our model--including the candy-crazed kindergartners from the elementary school--or just random variation.
[102]
Models allow us to make inferences--whether it's the number of kids on my doorstep at
[106]
Halloween, or the number of credit card frauds committed in a year.
[110]
General Linear Models take the information that data give us and portion it out into
[114]
two major parts: information that can be accounted for by our model, and information that can't be.
[120]
There are many types of GLMs. One is linear regression,
[124]
which can also provide a prediction for our data.
[127]
But instead of predicting our data using a categorical variable like we do in a t-test,
[132]
we use a continuous one.
[134]
For example, we can predict the number of likes a trending YouTube video gets based
[138]
on the number of comments that it has.
[141]
Here, the number of comments would be our input variable and the number of likes our
[145]
output variable.
[147]
Our model will look something like this:
[149]
The first thing we want to do is plot our data from 100 videos:
[153]
This allows us to check whether we think that the data is best fit by a straight line, and
[158]
look for outliers--those are points that are really extreme compared to the rest of our data.
[163]
These two points look pretty far away from our data.
[166]
So we need to decide how to handle them.
[168]
We covered outliers in a previous episode, and the same rules apply here.
[172]
We're trying to catch data that doesn't belong.
[174]
Since we can't always tell when that happened, we set a criterion for what an outlier is,
[180]
and stick to it.
[181]
One reason that we're concerned with outliers in regression is that values that are really
[185]
far away from the rest of our data can have an undue influence on the regression line.
[190]
Without this extreme point, our line would look like this.
[193]
But with it, like this.
[194]
That's a lot of difference for one little point!
[197]
There are a lot of different ways to decide, but in this case we're gonna leave them in.
[201]
One of the assumptions that we make when using linear regression is that the relationship
[206]
is linear.
[207]
So if there's some other shape our data takes, we may want to look into some other models.
[212]
This plot looks linear, so we'll go ahead and fit our regression model.
[215]
Usually a computer is going to do this part for us, but we want to show you how this line fits.
[219]
A regression line is the straight line that's as close as possible to all the data points
[224]
at once.
[225]
That means that it's the one straight line that minimizes the sum of the squared distances
[231]
from each point to the line.
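A sketch of that criterion with made-up numbers (the data below are ours, not the video's); `np.polyfit` with degree 1 returns exactly the slope and intercept that minimize the summed squared distances:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up predictor values
y = np.array([2.0, 5.0, 6.0, 9.0])   # made-up outcome values

# Degree-1 polyfit is ordinary least squares: it minimizes
# sum((y - (intercept + slope * x))**2) over all possible lines.
slope, intercept = np.polyfit(x, y, 1)

# Any other line leaves a bigger sum of squared residuals:
best  = ((y - (intercept + slope * x)) ** 2).sum()
worse = ((y - (intercept + (slope + 0.1) * x)) ** 2).sum()
assert best < worse
```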
[233]
The blue line is our regression line.
[235]
Its equation looks like this:
[237]
This number--the y-intercept--tells us how many likes we'd expect a trending video
[241]
with zero comments to have.
[244]
Often, the intercept might not make much sense.
[247]
In this model, it's possible that you could have a video with 0 comments, but a video
[251]
with 0 comments and 9104 likes does seem to conflict with our experience on YouTube.
[258]
The slope--aka the coefficient--tells us how much our likes are determined by the number
[263]
of comments.
[264]
Our coefficient here is about 6.5, which means that on average, an increase in 1 comment
[269]
is associated with an increase of about 6.5 likes.
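Plugging those fitted numbers into the model is just arithmetic (a sketch; the helper name is ours, the coefficients are the video's):

```python
# Fitted model from the video: likes is roughly 9104 + 6.5 * comments
def predicted_likes(comments, intercept=9104, slope=6.5):
    return intercept + slope * comments

predicted_likes(0)    # the intercept: 9104 likes at zero comments
predicted_likes(100)  # 9104 + 6.5 * 100 = 9754 likes
```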
[273]
But there's another part of the General Linear Model: the error.
[276]
Before we go any further, let's take a look at these errors--also called residuals.
[280]
The residual plot looks like this:
[283]
And we can tell a lot by looking at its shape.
[285]
We want a pretty evenly spaced cloud of residuals.
[288]
Ideally, we don't want them to be extreme in some areas and close to 0 in others.
[294]
It's especially concerning if you can see a weird pattern in your residuals like this:
[298]
Which would indicate that the error of your predictions is dependent on how big your predictor
[303]
variable value is.
[305]
That would be like if our YouTube model was pretty accurate at predicting the number of
[308]
likes for videos with very few comments, but was wildly inaccurate on videos with a lot
[313]
of comments.
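Residuals are just observed minus predicted values. A minimal sketch with made-up numbers:

```python
import numpy as np

# Made-up observed likes and model predictions for five videos.
y     = np.array([9100.0, 9450.0, 9880.0, 10500.0, 11600.0])
y_hat = np.array([9104.0, 9429.0, 9884.0, 10404.0, 11704.0])

residuals = y - y_hat   # observed minus predicted: [-4., 21., -4., 96., -104.]

# A healthy residual plot scatters evenly around zero; a fan shape or
# curve (errors growing with the predictor) signals a problem.
```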
[314]
So, now that we've looked at this error, this is where statistical tests come in.
[318]
There are actually two common ways to do a Null Hypothesis Significance test on a regression coefficient.
[324]
Today we'll cover the F-test.
[326]
The F-test, like the t-test, helps us quantify how well we think our data fit a distribution,
[331]
like the null distribution.
[332]
Remember, the general form of many test statistics is this:
[335]
But I'm going to make one small tweak to the wording of our general formula to help
[339]
us understand F-tests a little better.
[341]
The null hypothesis here is that there's NO relationship between the number of comments
[345]
on a trending YouTube video and the number of likes.
[348]
If that were true, we'd expect a kind of blob-y, amorphous-cloud-looking scatter plot
[353]
and a regression line with a slope of 0.
[356]
It would mean that the number of comments wouldn't help us predict the number of likes.
[359]
We'd just predict the mean number of likes no matter how many comments there were.
[364]
Back to our actual data.
[365]
This blue line is our observed model.
[368]
And the red is the model we'd expect if the null hypothesis were true.
[372]
Let's add some notation so it's easier to read our formulas.
[375]
Y-hat looks like this, and it represents the predicted value for our outcome variable--here
[380]
it's the predicted number of likes.
[382]
Y-bar looks like this, and it represents the mean value of likes in this sample.
[387]
Taking the squared difference between each data point and the mean line tells us the
[392]
total variation in our data set.
[394]
This might look similar to how we calculated variance, because it is.
[399]
Variance is just this sum of squared deviations--called the Sum of Squares Total--divided by N.
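In symbols, the Sum of Squares Total is the sum of (y minus y-bar) squared, and variance is that divided by N. A tiny worked example (the numbers are ours):

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])   # toy outcome values
y_bar = y.mean()                      # 6.0

sst = ((y - y_bar) ** 2).sum()        # 9 + 1 + 1 + 9 = 20: Sum of Squares Total
variance = sst / len(y)               # 20 / 4 = 5
```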
[404]
And we want to know how much of that total variation is accounted for by our regression
[407]
model, and how much is just error.
[409]
That would allow us to follow the General Linear Model framework and explain our data
[413]
with two things: the model's prediction, and error.
[416]
We can look at the difference between our observed slope coefficient--6.468--and the
[421]
one we'd expect if there were no relationship--0--for each point.
[425]
And we'll start here with this point:
[426]
The green line represents the difference between our observed model--which is the blue line--and
[431]
the model that would occur if the null were true--which is the red line.
[435]
And we can do this for EVERY point in the data set.
[437]
We want negative differences and positive differences to count equally, so we square
[441]
each difference so that they're all positive.
[444]
Then we add them all up to get part of the numerator of our F-statistic:
[448]
The numerator has a special name in statistics.
[451]
It's called the Sums of Squares for Regression, or SSR for short.
[455]
Like the name suggests, this is the sum of the squared distances between our regression
[459]
model and the null model.
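With a fitted line in hand, SSR is a one-liner: the squared gaps between the model's predictions and the mean (null) line, summed. A sketch on toy data (ours, not the video's):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])      # toy predictor
y = np.array([2.0, 5.0, 6.0, 9.0])      # toy outcome

slope, intercept = np.polyfit(x, y, 1)  # least-squares fit
y_hat = intercept + slope * x           # the regression line's predictions

# Squared distance between the regression model and the null (mean) model:
ssr = ((y_hat - y.mean()) ** 2).sum()   # 24.2 for this data
```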
[460]
Now we just need a measure of average variation.
[463]
We already found a measure of the total variation in our sample data, the Total Sums of Squares.
[468]
And we calculated the variation that's explained by our model.
[472]
The other portion of the variation should then represent the error, the variation of
[475]
data points around our model.
[477]
Shown here in orange.
[478]
The sum of these squared distances is called the Sums of Squares for Error (SSE).
[483]
If data points are close to the regression line, then our model is pretty good at predicting
[487]
outcome values like likes on trending YouTube videos.
[491]
And so our SSE will be small.
[492]
If the data are far from the regression line, then our model isn't too good at predicting
[496]
outcome values.
[497]
And our SSE is going to be big.
[498]
Alright, so now we have all the pieces of our puzzle.
[501]
Total Sums of Squares, Sums of Squares for Regression, and Sums of Squares for Error:
[506]
Total Sums of Squares represents ALL the information that we have from our Data on YouTube likes.
[511]
Sums of Squares for Regression represents the proportion of that information that we
[515]
can explain using the model we created.
[517]
And Sums of Squares for Error represents the leftover information--the portion of Total
[522]
Sums of Squares that the model can鈥檛 explain.
[525]
So the Total Sums of Squares is the Sum of SSR and SSE.
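That identity--Total Sums of Squares equals SSR plus SSE--is easy to verify numerically on toy data (the numbers are ours):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = ((y - y.mean()) ** 2).sum()      # total variation:        25.0
ssr = ((y_hat - y.mean()) ** 2).sum()  # explained by the model: 24.2
sse = ((y - y_hat) ** 2).sum()         # leftover error:          0.8

assert abs(sst - (ssr + sse)) < 1e-9   # SST = SSR + SSE
```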
[528]
Now we've followed the General Linear Model framework and taken our data and portioned
[532]
it into two categories: Regression Model, and Error.
[536]
And now that we have the SSE, our measurement of error, we can finally start to fill in
[539]
the bottom of our F-statistic.
[541]
But we're not quite done yet.
[542]
The final step to getting our F-statistic is to divide each Sums of Squares by its
[547]
respective degrees of freedom.
[549]
Remember degrees of freedom represent the amount of independent information that we have.
[554]
The Sums of Squares for Error has n--the sample size--minus 2 degrees of freedom.
[558]
We had 100 pieces of independent information from our data, and we used 1 to calculate
[563]
the y-intercept and 1 to calculate the regression coefficient.
[567]
So the Sums of Squares for Error has 98 degrees of freedom.
[570]
The Sums of Squares for Regression has one degree of freedom, because we're using one
[575]
piece of independent information to estimate our coefficient, the slope.
[579]
We have to divide each sums of squares by its degrees of freedom because we want to
[583]
weight each one appropriately.
[585]
More degrees of freedom mean more information.
[587]
It's like how you wouldn't be surprised that Katie Mack, who has a PhD in astrophysics,
[591]
can explain more about the planets than someone taking a high school physics class.
[595]
Of course she can; she has way more information.
[598]
Similarly, we want to make sure to scale the Sums of Squares based on the amount of independent
[604]
information each has.
[605]
So we're finally left with this:
[607]
And using an F-distribution, we can find our p-value: the probability that we'd get an
[612]
F-statistic as big as or bigger than 59.613.
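Putting the pieces together on toy data (the data and resulting numbers below are ours; the video's F of 59.613 comes from its own 100-video dataset):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

n = len(y)
ssr = ((y_hat - y.mean()) ** 2).sum()
sse = ((y - y_hat) ** 2).sum()

# F = (SSR / df_regression) / (SSE / df_error), with df = 1 and n - 2.
f_stat = (ssr / 1) / (sse / (n - 2))   # 60.5 for this toy data

# The p-value is the area to the right of f_stat under an F(1, n-2)
# distribution, e.g. scipy.stats.f.sf(f_stat, 1, n - 2) if SciPy is available.
```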
[616]
Our p-value is super tiny.
[618]
It's about 0.00000000000099.
[624]
With an alpha level of 0.05, we reject the null that there is NO relationship between
[628]
likes and YouTube comments on trending videos.
[631]
So we reject the idea that the true coefficient for the relationship between likes and comments on
[636]
YouTube is 0.
[637]
The F-statistic allows us to directly compare the amount of variation that our model can
[643]
and cannot explain.
[645]
When our model explains a lot of variation, we consider it statistically significant.
[649]
And it turns out, if we did a t-test on this coefficient, we'd get the exact same p-value.
[654]
That's because these two methods of hypothesis testing are equivalent; in fact, if you square
[658]
our t-statistic, you'll get our F-statistic!
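You can check that equivalence directly (toy data again, ours; the slope's t-statistic uses the standard-error formula for simple regression):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

n = len(y)
sse = ((y - y_hat) ** 2).sum()
mse = sse / (n - 2)                                    # error variance estimate
se_slope = np.sqrt(mse / ((x - x.mean()) ** 2).sum())  # standard error of slope
t_stat = slope / se_slope

ssr = ((y_hat - y.mean()) ** 2).sum()
f_stat = (ssr / 1) / (sse / (n - 2))

assert abs(t_stat ** 2 - f_stat) < 1e-8   # t squared equals F
```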
[661]
And we're going to talk more about why F-tests are important later.
[664]
Regression is a really useful tool to understand.
[667]
Scientists, economists, and political scientists use it to make discoveries and communicate
[671]
those discoveries to the public.
[673]
Regression can be used to model the relationship between increased taxes on cigarettes and
[677]
the average number of cigarettes people buy.
[680]
Or to show the relationship between peak-heart-rate-during-exercise and blood pressure.
[684]
Not that we're able to use regression alone to determine whether one variable causes changes in the other.
[689]
But more abstractly, we learned today about the General Linear Model framework.
[693]
What happens in life can be explained by two things: what we know about how the world works,
[697]
and error--or deviations--from that model.
[700]
Like say you budgeted $30 for gas and only ended up needing $28 last week.
[705]
The reality deviated from your guess, and now you get to go to The Blend Den again!
[709]
Or just how angry your roommate is that you left dishes in the sink can be explained by
[713]
how many days you left them out, with a little wiggle room for error depending on how your
[718]
roommate's day was.
[719]
Alright, thanks for watching. I'll see you next time.