Pearson's Correlation, Clearly Explained!!! - YouTube

Channel: StatQuest with Josh Starmer

[0]

correlation it's the sensation across

[6]

the nation StatQuest hello I'm Josh

[13]

Starmer welcome to StatQuest today

[16]

is part two in our series on covariance

[18]

and correlation this time we're going to

[21]

talk about correlation however before we

[25]

dive deep into correlation I want to

[27]

talk about relationships not the fun

[30]

and/or confusing kind we sometimes find

[33]

ourselves in will you hold my hand um

[38]

you don't have a hand

[41]

you're just a stick figure dang instead

[47]

I want to talk about the relationships

[49]

between data on the x-axis and data on

[53]

the y-axis

[55]

in this example we're looking at mRNA

[58]

transcripts from gene X in five

[60]

different cells on the x-axis and from

[64]

gene Y in the same five different cells

[67]

on the y-axis

[70]

however if mRNA transcripts doesn't mean

[73]

anything to you

[74]

imagine we went into five different

[76]

grocery stores and put the number of

[78]

green apples on the x-axis and the number

[82]

of red apples on the y-axis

[85]

each pair of measurements was taken

[88]

from a single cell or grocery store and

[90]

can be represented by a blue dot

[94]

we can see that in general relatively

[98]

low values for gene X are paired with

[100]

relatively low values for gene Y and

[103]

relatively high values for gene X are

[106]

paired with relatively high values for

[108]

gene y

[110]

we can use a straight line with a

[113]

positive slope to represent this trend

[115]

and if someone told us that they

[118]

collected a new measurement for gene X

[120]

20 then we can use the line to predict

[125]

that when gene X equals 20 then the

[128]

value for gene Y should be somewhere

[131]

around 27

[134]

alternatively if someone gave us a value

[137]

for gene Y we could use the trend to

[141]

predict a range of values for gene X

[146]

in both cases we made guesses based on the

[148]

trend we observed in the data

[152]

if the data were closer to the trendline

[155]

then given a gene X value

[158]

we might guess that the value for gene Y

[160]

falls in a smaller range

[164]

in this case the closer the data are to

[167]

the line the more gene X can tell us

[169]

about gene y alternatively we could say

[174]

that the relationship between gene X and

[176]

gene Y is relatively strong

[181]

if the data were further from the trendline

[183]

then we might guess that the value for

[185]

gene Y falls in a larger range

[189]

in this case we could say that the

[192]

values for gene X tell us less about the

[194]

values for gene y alternatively we could

[199]

say that the relationship between gene X

[201]

and gene Y is relatively weak

[205]

note just to be clear all we are saying

[208]

is that we observed that low values for

[211]

gene X tend to be paired with low values

[214]

for gene Y and that high values for gene

[218]

X tend to be paired with relatively high

[221]

values for gene Y and that this

[224]

observation suggests a trend that we can

[228]

use to make predictions and inferences

[230]

aka

[231]

educated guesses we are not saying that

[235]

a low value for gene X causes gene Y to

[239]

have a low value or that a high value

[243]

for gene Y causes gene X to have a high

[247]

value in other words we are not ruling

[250]

out the possibility that something else

[253]

causes the trend that we observe small

[256]

bam so far we have looked at a

[260]

relatively weak relationship and a

[263]

relatively strong relationship we can

[266]

quantify the strength of a relationship

[269]

with correlation

[271]

in other words these data with a

[274]

relatively weak relationship have a

[276]

small correlation value these data with

[281]

a moderate relationship have a moderate

[283]

correlation value

[285]

and these data with a strong

[288]

relationship have a relatively large

[290]

correlation value the maximum value for

[294]

correlation is 1

[297]

correlation equals one when a straight

[300]

line with a positive slope can go

[302]

through the center of every data point

[305]

this means that if someone gave us a

[307]

value for gene X

[310]

then we could guess that gene Y had a

[312]

value in a very very narrow range

[316]

note correlation does not depend on the

[320]

scale of the data

[322]

in fact I intentionally omitted putting

[325]

numbers on the axes because they do not

[327]

affect correlation at all

[330]

in other words regardless of the scale

[333]

of the data correlation equals one when

[336]

a straight line with a positive slope

[337]

can go through all of the data

[341]

that means that correlation can equal

[343]

one when the slope is large and when the

[347]

slope is small
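
The point about scale and slope can be checked numerically. This is a minimal sketch, not anything shown in the video: the `pearson_r` helper and all of the numbers are made up for illustration. It computes the correlation for a steep line, a shallow line, and a noisy dataset before and after rescaling both axes.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; the 1/(n-1) factors in covariance and variance cancel."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # made-up x values
steep = [50.0 * x + 3.0 for x in xs]      # straight line, large positive slope
shallow = [0.01 * x + 3.0 for x in xs]    # straight line, small positive slope
r_steep = pearson_r(xs, steep)
r_shallow = pearson_r(xs, shallow)

noisy = [2.1, 3.9, 6.2, 7.8, 10.5]        # made-up values that only roughly follow a line
r_original = pearson_r(xs, noisy)
# Stretch the x-axis by 1000 and shrink the y-axis by half.
r_rescaled = pearson_r([1000.0 * x for x in xs], [0.5 * y for y in noisy])

print(r_steep, r_shallow)       # both are 1 up to floating-point rounding
print(r_original, r_rescaled)   # the same value before and after rescaling
```

Both straight lines give a correlation of one regardless of slope, and rescaling the axes leaves the correlation of the noisy data unchanged.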

[349]

note when a straight line with a

[352]

positive slope goes through the data

[354]

correlation equals one regardless of how

[357]

much data we have

[359]

for example if we only had two data

[362]

points then we can draw a straight line

[364]

with a positive slope by just connecting

[367]

the two dots

[369]

and then correlation equals one and that

[372]

makes the relationship appear strong

[376]

but we should not have any confidence in

[378]

predictions made with this line because

[381]

we have so little data

[384]

to understand why we should have low

[386]

confidence in correlations made with

[388]

small datasets let's start with an empty

[392]

graph and draw two random points on it

[397]

then just like before we could draw a

[400]

straight line that goes through the

[402]

center of each point just by connecting

[404]

the dots and that means correlation

[408]

equals one for these two randomly drawn

[410]

dots

[412]

in fact we can always draw a straight

[415]

line between any two random dots

[419]

now let's go back to the original data

[422]

and imagine that instead of two pairs of

[424]

measurements we had three pairs of

[427]

measurements

[429]

now just like before since we can draw a

[432]

straight line through all three points

[434]

correlation equals one however now we

[438]

can have more confidence in the

[440]

predictions we make with this line

[443]

this is because if we started with an

[445]

empty graph and drew three random points

[449]

on it then even though it's easy to draw

[452]

a straight line to connect any two

[454]

points there is a very small chance that

[457]

we will be able to draw a straight line

[459]

through all three points ultimately the

[464]

probability that we can connect three

[466]

randomly drawn points with a straight

[468]

line is very small and thus we can have

[472]

more confidence that the observed

[474]

correlation isn't just the result of

[476]

random chance
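
The two-points versus three-points argument can be tried out with a small simulation. This is a rough sketch under assumptions of my own: points are drawn uniformly at random from the unit square, "on a straight line" means the absolute correlation is within a small tolerance of one, and the function names are hypothetical. The sketch checks the absolute value because two random dots can just as easily be joined by a line with a negative slope, which gives negative one rather than one.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation; the 1/(n-1) factors in covariance and variance cancel."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fraction_on_a_line(n_points, trials=2000, tol=1e-3, seed=0):
    """Fraction of random datasets whose points fall (almost) exactly on a straight line."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n_points)]
        ys = [rng.random() for _ in range(n_points)]
        if abs(pearson_r(xs, ys)) > 1.0 - tol:
            hits += 1
    return hits / trials

two = fraction_on_a_line(2)     # two random dots can always be connected
three = fraction_on_a_line(3)   # three random dots rarely line up
ten = fraction_on_a_line(10)    # ten random dots essentially never do
print(two, three, ten)
```

Every pair of random dots lines up, only a small fraction of random triples do, and the fraction keeps shrinking as points are added.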

[478]

in general the more data we have the

[482]

more confidence we have in the

[484]

predictions we make with the line

[487]

because the probability that we can draw

[490]

a straight line through the same number

[492]

of randomly placed points gets smaller

[494]

and smaller with each additional point

[498]

note we could draw a squiggly line that

[501]

connects all of the dots

[504]

but when we're talking about correlation

[506]

we're only talking about using straight

[508]

lines

[511]

oh no it's the dreaded terminology alert

[514]

for correlation a p-value tells us the

[517]

probability that randomly drawn dots

[519]

will result in a similarly strong

[521]

relationship or stronger

[525]

thus the smaller the p-value the more

[528]

confidence we have in the predictions we

[530]

make with the line in this case the

[533]

p-value is crazy small 2.2 times 10 to

[537]

the negative 16 which means that the

[541]

probability of random data creating a

[543]

similarly strong or stronger

[545]

relationship is crazy small
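
The description of the p-value as the probability that randomly drawn dots give a similarly strong or stronger relationship can be approximated by brute force. The sketch below is an assumption-laden illustration, not how the p-value in the video was computed: statistical software normally uses a formula based on the t-distribution, while this code just draws unrelated random points many times and counts. The function name `simulated_p_value` and its parameters are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation; the 1/(n-1) factors in covariance and variance cancel."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulated_p_value(r_observed, n_points, trials=5000, seed=0):
    """Share of random datasets whose correlation is at least as strong as r_observed."""
    rng = random.Random(seed)
    as_strong = 0
    for _ in range(trials):
        # x and y are drawn independently, so any relationship is pure chance.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        if abs(pearson_r(xs, ys)) >= abs(r_observed):
            as_strong += 1
    return as_strong / trials

p_few_points = simulated_p_value(0.9, n_points=4)
p_many_points = simulated_p_value(0.9, n_points=30)
print(p_few_points, p_many_points)
```

With only four points, random data reaches a correlation as strong as 0.9 fairly often, so the simulated p-value is far from tiny. With thirty points it almost never happens, so the simulated p-value is close to zero.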

[549]

to summarize what we've talked about so

[551]

far the maximum value for correlation

[554]

one occurs whenever you can draw a

[557]

straight line with a positive slope that

[560]

goes through all of the data and our

[563]

confidence in how useful the

[565]

relationship is depends on how much data

[567]

we have

[569]

of these three examples we should have

[572]

the least confidence in this

[573]

relationship since it is supported by

[576]

the least amount of data and we should

[579]

have the most confidence in this

[581]

relationship since it is supported by

[583]

the most data and has the smallest

[585]

p-value BAM

[590]

when a straight line with a negative

[592]

slope can go through the center of every

[594]

data point then the correlation equals

[597]

negative one

[599]

since a straight line can go through all

[601]

of the data points correlation equals

[604]

negative one implies that there is a

[606]

strong relationship in the data and if

[609]

someone gives us a value for gene X then

[612]

we can guess a value for gene Y within a

[615]

very narrow range

[617]

just like before our confidence in that

[620]

guess which we quantify with a p-value

[623]

depends on how much data we have if we

[627]

had a lot of data we could have a lot of

[629]

confidence in the guess because the

[631]

p-value would be super small and the

[635]

less data we have the less confidence we

[638]

have in the guess because the p-value

[639]

gets larger

[643]

like before as long as a straight line

[645]

goes through all of the data and the

[647]

slope of the line is negative

[649]

correlation equals negative one when the

[651]

slope is large and when the slope is

[654]

small BAM so far we've seen that when

[660]

the slope of the line is negative the

[662]

strongest relationship has correlation

[665]

equal to negative one and when the slope

[668]

of the line is positive the strongest

[670]

relationship has correlation equal to

[673]

one

[675]

in both cases if a straight line cannot

[678]

go through all of the data then we will

[680]

get correlation values closer to zero

[682]

and the worse the fit the closer the

[686]

correlation gets to zero

[689]

and when there is no relationship that

[692]

we can represent with a straight line

[694]

correlation equals zero when correlation

[698]

equals zero a value on the x-axis

[702]

doesn't tell us anything about what to

[704]

expect on the y-axis because there is no

[708]

reason to choose one value over another

[711]

BAM as long as the correlation value is

[716]

not zero we can still use the line to

[721]

make inferences

[723]

but our guesses become more refined the

[725]

closer the correlation values get to

[727]

negative one or one

[731]

and just like before our confidence in

[734]

our inferences depends on the amount of

[736]

data we have collected and the p-value

[739]

in the left graph we have very little

[742]

confidence in the trend because we have

[744]

very little data and the p-value equals

[747]

0.8

[749]

in the middle we have moderate

[752]

confidence in the trend because we have

[754]

more data and the p-value equals 0.08 on

[758]

the right we have a lot of confidence in

[761]

the trend because we have even more data

[763]

and the p-value equals zero point zero

[766]

zero eight note the correlation equals

[771]

zero point three in all three examples

[773]

in this case increasing the sample size

[777]

did not increase correlation and that

[781]

means adding data did not refine our

[784]

guess

[786]

all it did was increase our confidence

[788]

in the guess
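
To see how a fixed correlation of 0.3 can come with very different p-values, here is a small sketch. It uses the Fisher z transformation with a normal approximation, which is a swapped-in shortcut and not necessarily what produced the p-values quoted in the video; the approximation is rough for very small samples. The sample sizes are hypothetical and are not the ones in the three graphs.

```python
import math

def approx_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r from n points (n > 3).

    Uses the Fisher z transformation: atanh(r) * sqrt(n - 3) is treated as
    a standard normal statistic.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

r = 0.3                        # same correlation in every case
sample_sizes = [5, 40, 100]    # hypothetical sample sizes
p_values = [approx_p_value(r, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(n, round(p, 4))
```

The correlation never changes, so the guesses are no more refined, but the p-value drops steadily as the sample size grows.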

[791]

thus our guesses will probably be pretty

[793]

bad in all three cases however we'll

[797]

have the most confidence in the bad

[799]

guess that came from this data

[802]

in other words just because you have a

[805]

lot of data and you have a lot of

[807]

confidence in your guess

[809]

if the correlation value is small your

[812]

guess will still be bad double bam if

[817]

you know how to calculate variance and

[820]

covariance calculating correlation is a

[823]

snap note if you're not already familiar

[827]

with the concepts of variance and

[829]

covariance check out the quests the

[831]

links are in the description below

[834]

if this were the data then the

[838]

correlation equals the covariance of

[841]

gene X and gene Y divided by the square

[845]

root of the variance for gene X times

[849]

the square root of the variance for gene

[851]

Y

[853]

as we saw in the StatQuest on

[856]

covariance the numerator can be any

[858]

value between positive and negative

[861]

infinity depending on whether the slope

[864]

of the line that represents the

[866]

relationship is positive or negative how

[870]

far the data are spread out around the

[873]

means and the scale of the data

[878]

thus when we calculate correlation the

[881]

denominator squeezes the covariance to

[884]

be a number from negative 1 to 1 in

[887]

other words the denominator ensures that

[890]

the scale of the data does not affect

[893]

the correlation value and this makes

[895]

correlations much easier to interpret

[898]

when the data all fall on a straight

[901]

line with a positive or negative slope

[904]

then the covariance and the product of

[907]

the square root of the variance terms

[909]

are the same and division gives us 1 or

[912]

negative 1 depending on the slope
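
Written out, the formula described above and the straight-line case look like the following. The symbols are my own notation, not the video's: $a$ and $b$ stand for the slope and intercept of a hypothetical line that every data point sits on.

```latex
r \;=\; \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)}\,\sqrt{\operatorname{Var}(Y)}}

% If every point lies on the line y_i = a x_i + b with a \neq 0:
\operatorname{Cov}(X, Y) = a \operatorname{Var}(X),
\qquad
\operatorname{Var}(Y) = a^{2} \operatorname{Var}(X)

r \;=\; \frac{a \operatorname{Var}(X)}{\sqrt{\operatorname{Var}(X)}\;\lvert a \rvert \sqrt{\operatorname{Var}(X)}}
  \;=\; \frac{a}{\lvert a \rvert}
  \;=\; \begin{cases} \;\;\,1 & \text{if } a > 0 \\ -1 & \text{if } a < 0 \end{cases}
```

The size of the slope cancels out completely and only its sign survives, which is why steep and shallow lines both give a correlation of exactly one or negative one.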

[916]

when the data do not fall on a straight

[918]

line with a positive or negative slope

[921]

then the covariance accounts for less of

[924]

the variance in the data and the

[926]

correlation is closer to zero

[929]

as we saw in the StatQuest on

[932]

covariance the covariance value for this

[935]

data is 116 so the denominator will

[940]

squeeze 116 down to a value from

[943]

negative 1 to 1

[946]

the variance in the gene X data is

[949]

101.8 and the variance in the gene Y data

[953]

is 160.3 and when we do the math

[958]

we get 0.9
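
The arithmetic for this example can be reproduced in a few lines. This sketch reads the spoken variances as 101.8 for gene X and 160.3 for gene Y, and takes the covariance of 116 quoted above; the variable names are just for illustration.

```python
import math

covariance_xy = 116.0   # covariance of gene X and gene Y, as quoted
variance_x = 101.8      # variance of gene X, as quoted
variance_y = 160.3      # variance of gene Y, as quoted

# Correlation = covariance / (sqrt(variance of X) * sqrt(variance of Y))
correlation = covariance_xy / (math.sqrt(variance_x) * math.sqrt(variance_y))
print(round(correlation, 2))
```

The result is roughly 0.91, which matches the rounded value of 0.9.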

[961]

like I mentioned earlier we can quantify

[964]

our confidence in this relationship with

[966]

a p-value

[968]

the smaller the p-value the more

[971]

confidence we can have in the guesses we

[973]

make in this case the p-value is 0.03

[980]

that means that there is a 3% chance

[982]

that random data could produce a

[984]

similarly strong relationship or

[987]

stronger

[989]

triple bam

[992]

before we go there's one last important

[995]

thing I want to mention about

[996]

correlation even though correlation

[1000]

values are way easier to interpret than

[1002]

covariance values they are still not

[1005]

super easy to interpret for example it's

[1009]

not super obvious that this relationship

[1011]

where correlation equals zero point nine

[1014]

is twice as good at making predictions

[1017]

as this relationship where correlation

[1021]

equals zero point six four

[1023]

the good news is that R squared which is

[1027]

related to correlation solves this

[1029]

problem the better news is that if you

[1033]

want to learn more about r-squared you

[1035]

can check out these quests the links are

[1038]

in the description below
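
The "twice as good" comparison between correlations of 0.9 and 0.64 becomes easier to see once the values are squared. This sketch assumes the R squared in question is simply the square of the Pearson correlation, which holds when a straight line is fit with a single predictor; the variable names are illustrative.

```python
r_strong = 0.9       # correlation of the stronger relationship
r_moderate = 0.64    # correlation of the weaker relationship

r_squared_strong = r_strong ** 2       # about 0.81
r_squared_moderate = r_moderate ** 2   # about 0.41

# How many times larger the first R squared is than the second.
ratio = r_squared_strong / r_squared_moderate
print(round(r_squared_strong, 2), round(r_squared_moderate, 2), round(ratio, 2))
```

The ratio of the two squared values comes out very close to two, which is not obvious from comparing 0.9 with 0.64 directly.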

[1040]

P.S. another awesome thing about R squared

[1044]

is that it can quantify relationships

[1046]

that are more complicated than simple

[1048]

straight lines

[1051]

in summary correlation quantifies the

[1055]

strengths of relationships if you have a

[1058]

weak relationship then you will have a

[1061]

small correlation value if you have a

[1064]

moderate relationship then you'll have a

[1066]

moderate correlation value and if you

[1069]

have a strong relationship then you will

[1072]

have a large correlation value

[1075]

correlation values go from negative one

[1077]

which is the strongest linear

[1079]

relationship with a negative slope to

[1082]

one which is the strongest linear

[1084]

relationship with a positive slope in

[1087]

both cases if a straight line cannot go

[1091]

through all of the data then we will get

[1093]

correlation values closer to zero and

[1096]

the worse the fit the closer the

[1099]

correlation values get to zero and when

[1103]

there is no relationship that we can

[1105]

represent with a straight line

[1107]

correlation equals zero lastly our

[1111]

confidence in the inferences depends on

[1113]

the amount of data we have collected and

[1115]

the p-value

[1118]

the more data we have the smaller the

[1120]

p-value and the more confidence we have

[1122]

in our inferences BAM hooray

[1129]

we made it to the end of another

[1130]

exciting StatQuest if you like this

[1133]

StatQuest and want to see more please

[1135]

subscribe and if you want to support

[1137]

StatQuest consider buying one or two

[1139]

of my original songs or a t-shirt or a

[1141]

hoodie or just donate the links are in

[1144]

the description below alright until next

[1147]

time quest on
