Pearson's Correlation, Clearly Explained!!! - YouTube
Channel: StatQuest with Josh Starmer
[0]
correlation it's the sensation across
[6]
the nation StatQuest hello I'm Josh
[13]
Starmer welcome to StatQuest today
[16]
is part two in our series on covariance
[18]
and correlation this time we're going to
[21]
talk about correlation however before we
[25]
dive deep into correlation I want to
[27]
talk about relationships not the fun
[30]
and/or confusing kind we sometimes find
[33]
ourselves in will you hold my hand um
[38]
you don't have a hand
[41]
you're just a stick figure dang instead
[47]
I want to talk about the relationships
[49]
between data on the x-axis and data on
[53]
the y-axis
[55]
in this example we're looking at mRNA
[58]
transcripts from gene X in five
[60]
different cells on the x-axis and from
[64]
gene Y in the same five different cells
[67]
on the y-axis
[70]
however if mRNA transcripts doesn't mean
[73]
anything to you
[74]
imagine we went into five different
[76]
grocery stores and put the number of
[78]
green apples on the x-axis and the number
[82]
of red apples on the y-axis
[85]
each pair of measurements was taken
[88]
from a single cell or grocery store and
[90]
can be represented by a blue dot
[94]
we can see that in general relatively
[98]
low values for gene X are paired with
[100]
relatively low values for gene Y and
[103]
relatively high values for gene X are
[106]
paired with relatively high values for
[108]
gene Y
[110]
we can use a straight line with a
[113]
positive slope to represent this trend
[115]
and if someone told us that they
[118]
collected a new measurement for gene X of
[120]
20 then we can use the line to predict
[125]
that when gene X equals 20 then the
[128]
value for gene Y should be somewhere
[131]
around 27
[134]
alternatively if someone gave us a value
[137]
for gene Y we could use the trend to
[141]
predict a range of values for gene X
[146]
in both cases we made guesses based on the
[148]
trend we observed in the data
[152]
if the data were closer to the trendline
[155]
then given a gene X value
[158]
we might guess that the value for gene Y
[160]
falls in a smaller range
[164]
in this case the closer the data are to
[167]
the line the more gene X can tell us
[169]
about gene Y alternatively we could say
[174]
that the relationship between gene X and
[176]
gene Y is relatively strong
[181]
if the data were further from the trendline
[183]
then we might guess that the value for
[185]
gene Y falls in a larger range
[189]
in this case we could say that the
[192]
values for gene X tell us less about the
[194]
values for gene Y alternatively we could
[199]
say that the relationship between gene X
[201]
and gene Y is relatively weak
[205]
note just to be clear all we are saying
[208]
is that we observed that low values for
[211]
gene X tend to be paired with low values
[214]
for gene Y and that high values for gene
[218]
X tend to be paired with relatively high
[221]
values for gene Y and that this
[224]
observation suggests a trend that we can
[228]
use to make predictions and inferences
[230]
aka
[231]
educated guesses we are not saying that
[235]
a low value for gene X causes gene Y to
[239]
have a low value or that a high value
[243]
for gene Y causes gene X to have a high
[247]
value in other words we are not ruling
[250]
out the possibility that something else
[253]
causes the trend that we observe small
[256]
bam so far we have looked at a
[260]
relatively weak relationship and a
[263]
relatively strong relationship we can
[266]
quantify the strength of a relationship
[269]
with correlation
[271]
in other words these data with a
[274]
relatively weak relationship have a
[276]
small correlation value these data with
[281]
a moderate relationship have a moderate
[283]
correlation value
[285]
and these data with a strong
[288]
relationship have a relatively large
[290]
correlation value the maximum value for
[294]
correlation is 1
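As a rough illustration of weak, moderate, and strong relationships, the sketch below scatters made-up points around a hypothetical trend line (y = x) with different amounts of noise and computes the correlation for each. The data, the noise levels, and the helper names `pearson` and `noisy_line` are assumptions for illustration, not values from the video.

```python
import math
import random

def pearson(xs, ys):
    """Pearson's correlation (the 1/(n-1) factors cancel, so they are omitted)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the sketch is repeatable
gene_x = [random.uniform(0, 30) for _ in range(200)]  # hypothetical measurements

def noisy_line(noise_sd):
    # Hypothetical trend y = x plus Gaussian scatter of the given size.
    return [x + random.gauss(0, noise_sd) for x in gene_x]

r_strong = pearson(gene_x, noisy_line(1))     # points hug the line
r_moderate = pearson(gene_x, noisy_line(10))  # some scatter around the line
r_weak = pearson(gene_x, noisy_line(50))      # lots of scatter around the line
print(r_strong, r_moderate, r_weak)
```

The more the points spread away from the line, the smaller the correlation value becomes.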
[297]
correlation equals one when a straight
[300]
line with a positive slope can go
[302]
through the center of every data point
[305]
this means that if someone gave us a
[307]
value for gene X
[310]
then we could guess that gene Y had a
[312]
value in a very very narrow range
[316]
note correlation does not depend on the
[320]
scale of the data
[322]
in fact I intentionally omitted putting
[325]
numbers on the axes because they do not
[327]
affect correlation at all
[330]
in other words regardless of the scale
[333]
of the data correlation equals one when
[336]
a straight line with a positive slope
[337]
can go through all of the data
[341]
that means that correlation can equal
[343]
one when the slope is large and when the
[347]
slope is small
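A small sketch of this point, using made-up numbers: as long as every point sits exactly on a straight line with a positive slope, the correlation comes out as 1 whether the slope is steep or shallow, and whether or not the x values are rescaled. The helper `pearson` and all values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation (the 1/(n-1) factors cancel, so they are omitted)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]                  # hypothetical x values
steep = [100 * x + 7 for x in xs]     # large positive slope
shallow = [0.01 * x + 7 for x in xs]  # small positive slope
rescaled_xs = [1000 * x for x in xs]  # same data in different units

r_steep = pearson(xs, steep)
r_shallow = pearson(xs, shallow)
r_rescaled = pearson(rescaled_xs, steep)
print(r_steep, r_shallow, r_rescaled)  # all equal to 1 up to rounding error
```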
[349]
note when a straight line with a
[352]
positive slope goes through the data
[354]
correlation equals one regardless of how
[357]
much data we have
[359]
for example if we only had two data
[362]
points then we can draw a straight line
[364]
with a positive slope by just connecting
[367]
the two dots
[369]
and then correlation equals one and that
[372]
makes the relationship appear strong
[376]
but we should not have any confidence in
[378]
predictions made with this line because
[381]
we have so little data
[384]
to understand why we should have low
[386]
confidence in correlations made with
[388]
small datasets let's start with an empty
[392]
graph and draw two random points on it
[397]
then just like before we could draw a
[400]
straight line that goes through the
[402]
center of each point just by connecting
[404]
the dots and that means correlation
[408]
equals one for these two randomly drawn
[410]
dots
[412]
in fact we can always draw a straight
[415]
line between any two random dots
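The sketch below tries this with randomly generated pairs of dots. One detail worth noting: the line through two random dots can slope downward as well as upward, so the correlation comes out as either 1 or negative 1, but always at full strength. The helper `pearson` and the seed are assumptions for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson's correlation (the 1/(n-1) factors cancel, so they are omitted)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(1)  # any seed works; fixed only for repeatability
results = []
for _ in range(5):
    # Two random dots on an empty graph.
    xs = [random.random(), random.random()]
    ys = [random.random(), random.random()]
    results.append(round(pearson(xs, ys), 6))
print(results)  # every entry is 1.0 or -1.0
```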
[419]
now let's go back to the original data
[422]
and imagine that instead of two pairs of
[424]
measurements we had three pairs of
[427]
measurements
[429]
now just like before since we can draw a
[432]
straight line through all three points
[434]
correlation equals one however now we
[438]
can have more confidence in the
[440]
predictions we make with this line
[443]
this is because if we started with an
[445]
empty graph and drew three random points
[449]
on it then even though it's easy to draw
[452]
a straight line to connect any two
[454]
points there is a very small chance that
[457]
we will be able to draw a straight line
[459]
through all three points ultimately the
[464]
probability that we can connect three
[466]
randomly drawn points with a straight
[468]
line is very small and thus we can have
[472]
more confidence that the observed
[474]
correlation isn't just the result of
[476]
random chance
[478]
in general the more data we have the
[482]
more confidence we have in the
[484]
predictions we make with the line
[487]
because the probability that we can draw
[490]
a straight line through the same number
[492]
of randomly placed points gets smaller
[494]
and smaller with each additional point
[498]
note we could draw a squiggly line that
[501]
connects all of the dots
[504]
but when we're talking about correlation
[506]
we're only talking about using straight
[508]
lines
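To see what "only straight lines" means in practice, here is a sketch with hypothetical points that sit exactly on a curve (y equals x squared, centered on zero). A curve fits them perfectly, yet no straight line captures the pattern, and the correlation comes out as zero. The data and the helper `pearson` are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation (the 1/(n-1) factors cancel, so they are omitted)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]  # hypothetical values, symmetric around zero
ys = [x ** 2 for x in xs]      # a perfect curved relationship

r_curve = pearson(xs, ys)
print(r_curve)  # zero: the rising and falling halves cancel out
```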
[511]
oh no it's the dreaded terminology alert
[514]
for correlation a p-value tells us the
[517]
probability that randomly drawn dots
[519]
will result in a similarly strong
[521]
relationship or stronger
[525]
thus the smaller the p-value the more
[528]
confidence we have in the predictions we
[530]
make with the line in this case the
[533]
p-value is crazy small 2.2 times 10 to
[537]
the negative 16 which means that the
[541]
probability of random data creating a
[543]
similarly strong or stronger
[545]
relationship is crazy small
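One way to make this definition concrete is a simulation: draw random dots many times and count how often they produce a relationship at least as strong as the one observed. The sketch below is only an approximation under stated assumptions (dots drawn uniformly at random, strength measured in either direction) and is not necessarily how the p-value in the video was calculated. The function name `random_dots_p_value`, the observed correlation of 0.9, and the sample sizes are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson's correlation (the 1/(n-1) factors cancel, so they are omitted)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def random_dots_p_value(r_observed, n_points, n_trials=5000, seed=0):
    """Fraction of random datasets whose correlation is at least as strong."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_points)]
        ys = [rng.random() for _ in range(n_points)]
        if abs(pearson(xs, ys)) >= abs(r_observed):
            hits += 1
    return hits / n_trials

# Same observed correlation, different (hypothetical) numbers of points.
p_values = {n: random_dots_p_value(0.9, n) for n in (2, 3, 10)}
print(p_values)
```

With two points the estimate is 1.0, because any two random dots line up perfectly; it drops as more points are required to line up by chance.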
[549]
to summarize what we've talked about so
[551]
far the maximum value for correlation
[554]
one occurs whenever you can draw a
[557]
straight line with a positive slope that
[560]
goes through all of the data and our
[563]
confidence in how useful the
[565]
relationship is depends on how much data
[567]
we have
[569]
of these three examples we should have
[572]
the least confidence in this
[573]
relationship since it is supported by
[576]
the least amount of data and we should
[579]
have the most confidence in this
[581]
relationship since it is supported by
[583]
the most data and has the smallest
[585]
p-value BAM
[590]
when a straight line with a negative
[592]
slope can go through the center of every
[594]
data point then the correlation equals
[597]
negative one
[599]
since a straight line can go through all
[601]
of the data points correlation equals
[604]
negative one implies that there is a
[606]
strong relationship in the data and if
[609]
someone gives us a value for gene X then
[612]
we can guess a value for gene Y within a
[615]
very narrow range
[617]
just like before our confidence in that
[620]
guess which we quantify with a p-value
[623]
depends on how much data we have if we
[627]
had a lot of data we could have a lot of
[629]
confidence in the guess because the
[631]
p-value would be super small and the
[635]
less data we have the less confidence we
[638]
have in the guess because the p-value
[639]
gets larger
[643]
like before as long as a straight line
[645]
goes through all of the data and the
[647]
slope of the line is negative
[649]
correlation equals negative one when the
[651]
slope is large and when the slope is
[654]
small BAM so far we've seen that when
[660]
the slope of the line is negative the
[662]
strongest relationship has correlation
[665]
equal to negative one and when the slope
[668]
of the line is positive the strongest
[670]
relationship has correlation equal to
[673]
one
[675]
in both cases if a straight line cannot
[678]
go through all of the data then we will
[680]
get correlation values closer to zero
[682]
and the worse the fit the closer the
[686]
correlation gets to zero
[689]
and when there is no relationship that
[692]
we can represent with a straight line
[694]
correlation equals zero when correlation
[698]
equals zero a value on the x-axis
[702]
doesn't tell us anything about what to
[704]
expect on the y-axis because there is no
[708]
reason to choose one value over another
[711]
BAM as long as the correlation value is
[716]
not zero we can still use the line to
[721]
make inferences
[723]
but our guesses become more refined the
[725]
closer the correlation values get to
[727]
negative one or one
[731]
and just like before our confidence in
[734]
our inferences depends on the amount of
[736]
data we have collected and the p-value
[739]
in the left graph we have very little
[742]
confidence in the trend because we have
[744]
very little data and the p-value equals
[747]
0.8
[749]
in the middle we have moderate
[752]
confidence in the trend because we have
[754]
more data and the p-value equals 0.08 on
[758]
the right we have a lot of confidence in
[761]
the trend because we have even more data
[763]
and the p-value equals 0.008
[766]
note the correlation equals
[771]
0.3 in all three examples
[773]
in this case increasing the sample size
[777]
did not increase correlation and that
[781]
means adding data did not refine our
[784]
guess
[786]
all it did was increase our confidence
[788]
in the guess
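The effect of sample size at a fixed correlation can be sketched with the Fisher z approximation, which needs only the standard library. This is an approximation chosen for illustration and not necessarily the method behind the p-values quoted in the video; the function name `approx_p_value` and the sample sizes are hypothetical, since the transcript does not say how many points each graph contains.

```python
import math

def approx_p_value(r, n):
    """Approximate two-sided p-value for Pearson's r with n points (Fisher z)."""
    z = math.atanh(r) * math.sqrt(n - 3)     # roughly standard normal under no relationship
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability

r = 0.3  # the same correlation in every case
p_small = approx_p_value(r, 10)   # hypothetical small dataset
p_medium = approx_p_value(r, 40)  # hypothetical medium dataset
p_large = approx_p_value(r, 80)   # hypothetical large dataset
print(p_small, p_medium, p_large)
```

The correlation never changes, so the guesses are no more refined; only the p-value shrinks as the number of points grows.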
[791]
thus our guesses will probably be pretty
[793]
bad in all three cases however we'll
[797]
have the most confidence in the bad
[799]
guess that came from this data
[802]
in other words just because you have a
[805]
lot of data and you have a lot of
[807]
confidence in your guess
[809]
if the correlation value is small your
[812]
guess will still be bad double bam if
[817]
you know how to calculate variance and
[820]
covariance calculating correlation is a
[823]
snap note if you're not already familiar
[827]
with the concepts of variance and
[829]
covariance check out the StatQuests the
[831]
links are in the description below
[834]
if this were the data then the
[838]
correlation equals the covariance of
[841]
gene X and gene Y divided by the square
[845]
root of the variance for gene X times
[849]
the square root of the variance for gene
[851]
y
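The formula just described can be sketched directly in code: compute the covariance and the two variances, then divide. The data below are made up for illustration (they are not the values from the video), and the function names are assumptions.

```python
import math

def variance(values):
    """Sample variance (divides by n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (math.sqrt(variance(xs)) * math.sqrt(variance(ys)))

gene_x = [3, 8, 12, 19, 25]   # hypothetical measurements from five cells
gene_y = [10, 9, 22, 24, 35]  # hypothetical measurements from the same cells

r = correlation(gene_x, gene_y)
print(r)
```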
[853]
as we saw in the StatQuest on
[856]
covariance the numerator can be any
[858]
value between positive and negative
[861]
infinity depending on whether the slope
[864]
of the line that represents the
[866]
relationship is positive or negative how
[870]
far the data are spread out around the
[873]
means and the scale of the data
[878]
thus when we calculate correlation the
[881]
denominator squeezes the covariance to
[884]
be a number from negative 1 to 1 in
[887]
other words the denominator ensures that
[890]
the scale of the data does not affect
[893]
the correlation value and this makes
[895]
correlations much easier to interpret
[898]
when the data all fall on a straight
[901]
line with a positive or negative slope
[904]
then the covariance and the product of
[907]
the square roots of the variance terms
[909]
have the same magnitude and division gives us 1 or
[912]
negative 1 depending on the slope
[916]
when the data do not fall on a straight
[918]
line with a positive or negative slope
[921]
then the covariance accounts for less of
[924]
the variance in the data and the
[926]
correlation is closer to zero
[929]
as we saw in the StatQuest on
[932]
covariance the covariance value for this
[935]
data is 116 so the denominator will
[940]
squeeze 116 down to a value from
[943]
negative 1 to 1
[946]
the variance in the gene X data is
[949]
101.8 and the variance in the gene Y data
[953]
is 160.3 and when we do the math
[958]
we get 0.9
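The arithmetic can be checked in a few lines, reading the stated numbers as a covariance of 116 and variances of 101.8 and 160.3. The variable names are just labels for this sketch.

```python
import math

covariance_xy = 116.0  # covariance of gene X and gene Y, as stated
variance_x = 101.8     # variance of gene X, as stated
variance_y = 160.3     # variance of gene Y, as stated

correlation = covariance_xy / (math.sqrt(variance_x) * math.sqrt(variance_y))
print(round(correlation, 2))  # 0.91, which rounds to the 0.9 quoted here
```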
[961]
like I mentioned earlier we can quantify
[964]
our confidence in this relationship with
[966]
a p-value
[968]
the smaller the p-value the more
[971]
confidence we can have in the guesses we
[973]
make in this case the p-value is 0.03
[980]
that means that there is a 3% chance
[982]
that random data could produce a
[984]
similarly strong relationship or
[987]
stronger
[989]
triple bam
[992]
before we go there's one last important
[995]
thing I want to mention about
[996]
correlation even though correlation
[1000]
values are way easier to interpret than
[1002]
covariance values they are still not
[1005]
super easy to interpret for example it's
[1009]
not super obvious that this relationship
[1011]
where correlation equals 0.9
[1014]
is twice as good at making predictions
[1017]
as this relationship where correlation
[1021]
equals 0.64
[1023]
the good news is that R squared which is
[1027]
related to correlation solves this
[1029]
problem the better news is that if you
[1033]
want to learn more about r-squared you
[1035]
can check out these quests the links are
[1038]
in the description below
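A quick sketch of why R squared makes the comparison easier, assuming the simple straight-line case where R squared is just the correlation multiplied by itself: squaring the two correlation values from this example shows the "twice as good" relationship directly.

```python
r_strong = 0.9    # correlation of the first relationship
r_weaker = 0.64   # correlation of the second relationship

# Assumption: for a straight-line fit, R squared equals the correlation squared.
r2_strong = r_strong ** 2  # about 0.81
r2_weaker = r_weaker ** 2  # about 0.41
ratio = r2_strong / r2_weaker
print(r2_strong, r2_weaker, ratio)  # the ratio is close to 2
```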
[1040]
P.S. another awesome thing about R squared
[1044]
is that it can quantify relationships
[1046]
that are more complicated than simple
[1048]
straight lines
[1051]
in summary correlation quantifies the
[1055]
strengths of relationships if you have a
[1058]
weak relationship then you will have a
[1061]
small correlation value if you have a
[1064]
moderate relationship then you'll have a
[1066]
moderate correlation value and if you
[1069]
have a strong relationship then you will
[1072]
have a large correlation value
[1075]
correlation values go from negative one
[1077]
which is the strongest linear
[1079]
relationship with a negative slope to
[1082]
one which is the strongest linear
[1084]
relationship with a positive slope in
[1087]
both cases if a straight line cannot go
[1091]
through all of the data then we will get
[1093]
correlation values closer to zero and
[1096]
the worse the fit the closer the
[1099]
correlation values get to zero and when
[1103]
there is no relationship that we can
[1105]
represent with a straight line
[1107]
correlation equals zero lastly our
[1111]
confidence in the inferences depends on
[1113]
the amount of data we have collected and
[1115]
the p-value
[1118]
the more data we have the smaller the
[1120]
p-value and the more confidence we have
[1122]
in our inferences BAM hooray
[1129]
we made it to the end of another
[1130]
exciting stat quest if you like this
[1133]
stack quest and want to see more please
[1135]
subscribe and if you want to support
[1137]
stack quest consider buying one or two
[1139]
of my original songs or a t-shirt or a
[1141]
hoodie or just donate the links are in
[1144]
the description below alright until next
[1147]
time quest on