p-values: What they are and how to interpret them - YouTube

Channel: StatQuest with Josh Starmer

[Music] ♪ Gonna talk about p-values, yeah... StatQuest! ♪
Hello! I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about what p-values are and how to interpret them.
Imagine I have two drugs, Drug A and Drug B, and I want to know if Drug A is different from Drug B. So I give one person Drug A, and I give one other person Drug B. The person using Drug A is cured. Hooray! The person using Drug B is not cured. Bummer. Can we conclude that Drug A is better than Drug B? Nope.
Drug B may have failed for a lot of different reasons. Maybe this guy is taking a medication that has a bad interaction with Drug B. Or maybe this guy has a rare allergy to Drug B. Or maybe this guy didn't take Drug B properly and missed a dose. Or maybe Drug A doesn't actually work, and the placebo effect deserves all of the credit. There are a lot of weird, random things that can happen when doing a test, and this means that we need to try each drug on more than just one person each.
So we redo the experiment, but this time we give each drug to two different people. This time, both people taking Drug A are cured. Hooray! One person taking Drug B is cured, and one person is not cured. Hooray and bummer. Is Drug A better? Are both drugs the same? We can't answer either of those questions, because maybe something weird happened to this guy that caused Drug B to fail. Or maybe something weird happened to this guy: maybe the drug was mislabeled and he actually took Drug A, and that's why he was cured.
So now we test the drugs on a lot of different people, and these are the results. Drug A cured a whole lot of people, 1,043, compared to the number of people it didn't cure, 3. In other words, 99.7% of the 1,046 people using Drug A were cured. In contrast, Drug B only cured a few people, 2, compared to the number of people it didn't cure, 1,432. In other words, only 0.1% of the 1,434 people using Drug B were cured.

If these were the results, then it would be pretty obvious that Drug A was better than Drug B. In other words, it would seem unrealistic to suppose that these results were just random chance and that there is no real difference between Drug A and Drug B. It's possible that some of these people were cured by placebo, and some of these people were not cured because of some rare allergy, but there are just too many people cured by Drug A and too few cured by Drug B for us to seriously think that these results are just random and that Drug A is no better or worse than Drug B.
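To make "too extreme to be random chance" concrete, here's a small sketch that formalizes this comparison with a Fisher's exact test on the 2x2 table of cured vs. not-cured counts. The video doesn't show any code; the function `fisher_exact_two_sided` is my own minimal pure-Python implementation (scipy's `fisher_exact` does the same job), kept exact by working with integer binomial coefficients.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    rows are drugs, columns are cured / not cured."""
    r1, r2 = a + b, c + d        # per-drug totals
    c1 = a + c                   # total number of people cured
    n = r1 + r2                  # everyone in the study
    den = comb(n, c1)            # shared denominator of every table's probability
    obs = comb(r1, a) * comb(r2, c1 - a)  # numerator for the observed table
    # Sum the probabilities of every table with the same margins that is
    # no more likely than the observed one ("at least as extreme").
    p_num = 0
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        num = comb(r1, x) * comb(r2, c1 - x)
        if num <= obs:
            p_num += num
    return p_num / den

# Drug A: 1,043 cured, 3 not cured; Drug B: 2 cured, 1,432 not cured
print(f"Drug A cure rate: {1043 / 1046:.1%}")  # 99.7%
print(f"Drug B cure rate: {2 / 1434:.1%}")     # 0.1%
p = fisher_exact_two_sided(1043, 3, 2, 1432)
print(f"p-value: {p:.3g}")  # vanishingly small, far below 0.05
```

With results this lopsided, the p-value is essentially zero, matching the intuition that "just random chance" is not a serious explanation.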
In contrast, what if these were the results? Now only 37% of the people that took Drug A were cured, compared to 31% that took Drug B. So Drug A cured a larger percentage of people, but given that no study is perfect and there are always a few random things that happen, how confident can we be that Drug A is superior? That's where the p-value comes in.
[271]
p-values are numbers between 0 and 1
[274]
that in this example
[276]
quantify how confident we should be that
[279]
drug a is different from drug b
[282]
the closer a p-value is to zero
[285]
the more confidence we have that drug a
[288]
and drug b are different
[291]
so the question is how small does a
[294]
p-value have to be before we are
[296]
sufficiently confident that drug a is
[299]
different from drug b
[302]
in other words what threshold can we use
[304]
to make a good decision
In practice, a commonly used threshold is 0.05. It means that if there is no difference between Drug A and Drug B, and if we did this exact same experiment a bunch of times, then only 5% of those experiments would result in the wrong decision. Yes, this is an awkward sentence, so let's go through an example and work this out one step at a time.
[339]
imagine i gave the same drug drug a to
[342]
two different groups
[345]
now
[346]
any differences in the results are 100
[348]
percent attributable to weird random
[351]
things
[352]
like a rare allergy in one person or a
[355]
strong placebo effect in another
[358]
in this case the p-value would be 0.9
[362]
which is way larger than 0.05
[367]
thus we would say that we fail to see a
[370]
difference between the two groups
[373]
if we repeated this same experiment a
[375]
lot of times
[377]
most of the time we would get similarly
[379]
large p values
However, every once in a while, all of the people with rare allergies might end up in the group on the left, and all of the people with strong placebo reactions might end up in the group on the right. As a result, the p-value for this specific run of the experiment is 0.01, since the results are pretty different. Thus, in this case, we would say that the two groups are different, even though they both took the same drug.
Oh no, it's the dreaded Terminology Alert! Getting a small p-value when there is no difference is called a false positive. A 0.05 threshold for p-values means that 5% of the experiments, where the only differences come from weird, random things, will generate a p-value smaller than 0.05. In other words, if there's no difference between Drug A and Drug B, 5% of the time we do the experiment we will get a p-value less than 0.05, a.k.a. a false positive.
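This "5% of null experiments come out positive" claim is easy to check by simulation. The sketch below is my own illustration, not from the video: both "drugs" cure people with the same 50% probability (so the null hypothesis is true by construction), each simulated experiment is scored with a pooled two-proportion z-test (a standard textbook approximation; the helper `two_proportion_p_value` is hypothetical, written for this example), and we count how often p drops below 0.05 anyway.

```python
import math
import random

def two_proportion_p_value(cured_a, n_a, cured_b, n_b):
    """Approximate two-sided p-value for a difference in cure rates,
    using a pooled two-proportion z-test (normal approximation)."""
    pool = (cured_a + cured_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all: no evidence of a difference
    z = (cured_a / n_a - cured_b / n_b) / se
    # Convert |z| to a two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)  # reproducible runs
n = 100          # people per group
cure_rate = 0.5  # identical for both groups: any difference is pure chance
runs = 2000
false_positives = 0
for _ in range(runs):
    cured_a = sum(random.random() < cure_rate for _ in range(n))
    cured_b = sum(random.random() < cure_rate for _ in range(n))
    if two_proportion_p_value(cured_a, n, cured_b, n) < 0.05:
        false_positives += 1

print(f"False positive rate: {false_positives / runs:.1%}")  # lands near 5%
```

Even though the two groups are drawn from exactly the same distribution, roughly 1 experiment in 20 produces p < 0.05, which is exactly what the 0.05 threshold promises.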
[457]
note if it is extremely important that
[460]
we are correct when we say the drugs are
[462]
different then we can use a smaller
[465]
threshold like 0.00001
[471]
using a threshold of 0.00001
[476]
means we would only get a false positive
[479]
once every 100 000 experiments
[483]
likewise if it's not that important for
[486]
example if we're trying to decide if the
[488]
ice cream truck will arrive on time
[491]
then we can use a larger threshold like
[493]
0.2
[496]
using a threshold of 0.2 means we are
[499]
willing to get a false positive two
[502]
times out of 10.
[504]
that said the most common threshold is
[507]
0.05
[509]
because trying to reduce the number of
[511]
false positives below 5
[514]
often costs more than it's worth
So, if we calculate a p-value for this experiment, and the p-value is less than 0.05, then we will decide that Drug A is different from Drug B. That said, the p-value is actually 0.24, so we are not confident that Drug A is different from Drug B. BAM!
Okay, before we're done, let me say two more things about p-values. Unfortunately, the first thing I want to say is just more terminology. In fancy statistical lingo, the idea of trying to determine if these drugs are the same or not is called hypothesis testing. The null hypothesis is that the drugs are the same, and the p-value helps us decide if we should reject the null hypothesis or not. Small BAM.
Okay, now that we have that fancy terminology out of the way, the second thing I want to say is way more interesting. While a small p-value helps us decide if Drug A is different from Drug B, it does not tell us how different they are. In other words, you can have a small p-value regardless of the size of the difference between Drug A and Drug B. The difference can be tiny or huge. For example, this experiment gives us a relatively large p-value, 0.24, even though there is a 6-point difference between Drug A and Drug B. In contrast, this experiment, which involves a lot more people, gives us a smaller p-value, 0.04, even though, given the new data, there is only a 1-point difference between Drug A and Drug B. In summary, a small p-value does not imply that the effect size, or difference between Drug A and Drug B, is large.
DOUBLE BAM!
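The "p-value is not effect size" point can be demonstrated numerically. The video doesn't give the group sizes behind its 0.24 and 0.04 p-values, so the sample sizes below are made up for illustration, and the helper `two_proportion_p_value` is my own pooled two-proportion z-test sketch, not the test the video used. The point is only that the p-value depends on both the size of the difference and the number of people tested.

```python
import math

def two_proportion_p_value(cured_a, n_a, cured_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test
    (normal approximation)."""
    pool = (cured_a + cured_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    z = (cured_a / n_a - cured_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Small study, 6-point difference: 37% vs 31% cured, 100 people per group
p_small_study = two_proportion_p_value(37, 100, 31, 100)
print(f"6-point difference, n=100 per group:    p = {p_small_study:.2f}")

# Huge study, 1-point difference: 37% vs 36% cured, 50,000 people per group
p_huge_study = two_proportion_p_value(18500, 50000, 18000, 50000)
print(f"1-point difference, n=50,000 per group: p = {p_huge_study:.4f}")
```

The bigger gap fails to clear the 0.05 threshold while the tiny gap does, which is why a small p-value should never be read as "the drugs are very different", only as "the drugs are probably not the same".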
Hooray! We've made it to the end of another exciting StatQuest. If you liked this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time: Quest on!