Calculating the Mean, Variance and Standard Deviation, Clearly Explained!!! - YouTube

Channel: StatQuest with Josh Starmer

[0]
I was home last night
[4]
variance or standard deviation so I
[8]
estimated them and it's cool StatQuest
[13]
[Music]
[16]
hello I'm Josh Starmer and welcome to
[19]
StatQuest today we're gonna continue
[21]
our series on statistics fundamentals
[23]
this time we're gonna talk about
[25]
estimating the mean variance and
[27]
standard deviation note this StatQuest
[31]
assumes you already know about
[33]
histograms statistical distributions and
[36]
specifically the normal distribution if
[39]
not check out the quests the links are
[42]
in the description below also this
[46]
StatQuest assumes you already understand why
[48]
we want to estimate population
[49]
parameters if not check out the quest
[55]
in the StatQuest on population parameters
[57]
we counted the number of mRNA
[59]
transcripts from gene X in five
[61]
different liver cells alternatively if
[65]
mRNA transcripts and liver cells didn't
[68]
mean anything to you we counted the
[70]
number of green apples in five different
[72]
grocery stores or green t-shirts in
[75]
five different clothing stores or
[78]
whatever you want to measure in five
[81]
different units
[83]
this green dot represented a liver cell
[85]
that had three mRNA transcripts for gene
[88]
X and this green dot represented a liver
[91]
cell that had 13 mRNA transcripts 19 24
[98]
and 29 now if we had a lot of time and
[103]
money on our hands we could count the
[105]
number of mRNA transcripts for gene X in
[107]
all 240 billion liver cells
[111]
now we can draw a histogram of the
[114]
measurements if we wanted to fit a
[117]
normal curve to this histogram like this
[120]
then we need to calculate the population
[123]
mean and the population variance or
[125]
population standard deviation
[129]
calculating the population mean is easy
[131]
we take the average of all 240 billion
[134]
measurements BP booty P due to baby boo
[139]
doo bee doo doo boo boo and we get 20 for
[144]
the population mean
[147]
and we center the normal curve on the
[149]
population mean note because we
[153]
calculated the mean with all 240 billion
[156]
measurements in the population this is
[158]
not an estimate of the population mean
[160]
it is the population mean
[165]
however since we rarely if ever have
[168]
enough time and money to measure every
[170]
single thing in a population we almost
[173]
always estimate the population mean
[175]
using a relatively small sample in this
[179]
example we have the measurements from
[182]
only five of the 240 billion cells
[186]
estimating the population mean is super
[189]
easy we just calculate the average of
[191]
the measurements we collected doodoo
[195]
booty boot boot boot boot and in this
[199]
case the estimated population mean is
[201]
17.6
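As a quick check of that arithmetic, here is a minimal Python sketch that averages the five counts mentioned earlier (3, 13, 19, 24 and 29). The variable names are made up for illustration.

```python
# The five mRNA transcript counts from the example.
measurements = [3, 13, 19, 24, 29]

# The estimated population mean (the sample mean) is just the average.
estimated_mean = sum(measurements) / len(measurements)

print(round(estimated_mean, 1))  # prints 17.6
```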
[204]
oh no it's the dreaded terminology alert
[209]
statisticians often use the symbol x-bar
[212]
to refer to the estimated mean which is
[215]
also called the sample mean
[218]
and they use the Greek symbol mu to
[221]
refer to the population mean
[224]
the estimated mean x-bar is different
[228]
from the population mean mu but with
[230]
more and more data
[232]
x-bar should get closer and closer to mu
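One way to see this is a small simulation. The sketch below assumes the population follows a normal curve with mean 20 and standard deviation 10 (the values reported for the true distribution in this example) and draws larger and larger random samples from it. The function name `sample_mean` and the sample sizes are illustrative choices, not anything from the video.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

MU = 20     # assumed population mean
SIGMA = 10  # assumed population standard deviation

def sample_mean(n):
    """Average of n random draws from a normal curve with mean MU and sd SIGMA."""
    draws = [random.gauss(MU, SIGMA) for _ in range(n)]
    return sum(draws) / n

for n in (5, 50, 5_000, 200_000):
    x_bar = sample_mean(n)
    print(f"n = {n:>7}: x-bar = {x_bar:.2f}, distance from mu = {abs(x_bar - MU):.2f}")
```

Any single small sample can land close to mu or far from it by chance; the point is that the largest samples reliably end up very close.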
[236]
going back to the full set of population
[238]
data
[239]
we will now determine how wide to make
[242]
the curve by calculating not estimating
[245]
the variance and standard deviation in
[248]
other words we want to calculate how the
[251]
data are spread around the population
[253]
mean
[255]
this is the formula we use to calculate
[258]
not estimate the population variance
[262]
note I'm making a big deal about
[265]
calculating versus estimating variance
[267]
because it makes a big difference that
[269]
we'll talk about later
[271]
this part X minus mu means we subtract
[275]
the population mean mu from each
[278]
measurement X boo-boo-boo-boo-boo the
[285]
square tells us to square each term
[290]
the Greek character Sigma tells us to add up all
[293]
the terms
[295]
lastly we want the average of the
[297]
squared differences so we divide by the
[300]
number of measurements n which in this
[302]
case is 240 billion
[306]
thus we're just calculating the average
[308]
of the squared differences between the
[310]
data and the population mean
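Written out, the formula described above looks like the following. The symbol \sigma^2 for the population variance and the index i are notational assumptions; the narration only names x, mu and n.

```latex
\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}
```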
[313]
note squaring each term ensures that
[316]
each difference is positive
[319]
otherwise the measurements on the left
[321]
side of the mean would give negative
[323]
differences
[325]
which would cancel out the positive
[327]
differences from the measurements on the
[329]
right side of the mean
[332]
note if you are wondering why we don't
[334]
take the absolute value of each term
[336]
great we'll talk about that in the
[339]
follow-up video that dives deep into
[341]
these details
[344]
anyway and now we just do the math and
[347]
we get 100 for the population variance
[351]
BAM okay we calculated the population
[356]
variance and we're all proud of
[358]
ourselves however there is one thing
[360]
that is annoying about it
[363]
because each term is squared the units
[367]
for the result 100 are mRNA transcripts
[371]
squared note if the data had been the
[375]
number of apples in grocery stores then
[377]
the variance would be 100 apples squared
[380]
either way we can't plot the variance on
[384]
the graph since the units on the x axis
[386]
are not squared
[390]
to solve this problem we can take the
[392]
square root of everything and that gives
[395]
us the population standard deviation
[398]
so the population standard deviation is
[401]
the square root of 100 the population
[404]
variance which is 10
[407]
and we can plot that on the graph
[411]
this shows the mean 20 plus and minus
[414]
the standard deviation 10 mRNA
[417]
transcripts
[419]
um
[423]
note before we move on I want to
[425]
emphasize the point that we almost never
[427]
have the population data so we almost
[430]
never calculate the population mean and
[432]
population variance and standard
[434]
deviation
[436]
instead we estimate the population
[439]
variance and population standard
[441]
deviation from the relatively small
[443]
number of measurements that we have
[446]
remember the population variance and
[449]
standard deviation determines how much
[451]
the curve spreads out
[454]
and that means the estimated variance
[456]
and the estimated standard deviation
[459]
should reflect how the data are spread
[461]
around the population mean
[464]
however when we do an experiment we
[467]
don't see the curve or the population
[469]
mean we only see the data
[473]
so we have to use the estimated mean
[475]
x-bar instead
[479]
this is the formula we use to estimate
[480]
the population variance
[484]
because we almost always work with a
[486]
relatively small sample and not the
[488]
entire population this is the formula we
[491]
will use most of the time
[493]
the differences between this formula and
[496]
the one for the calculated population
[498]
variance are subtle but important first
[502]
since we don't know the population mean
[504]
mu we use the sample mean x-bar
[509]
second we are dividing by n minus 1
[512]
instead of n
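With those two changes, x-bar in place of mu and n - 1 in place of n, the estimating formula can be written as follows. The symbol s^2 for the estimated variance and the index i are notational assumptions, not names used in the narration.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```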
[514]
dividing by n minus one compensates for
[517]
the fact that we are calculating the
[519]
differences from the sample mean instead
[521]
of the population mean otherwise we
[525]
would consistently underestimate the
[527]
variance around the population mean
[531]
this is because the differences between
[533]
the data and the sample mean tend to be
[536]
smaller than the differences between the
[538]
data and the population mean
[540]
thus the differences around the
[543]
population mean will result in a larger
[546]
average and the larger average is what
[550]
we are trying to estimate
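The five measurements from this example make the point concrete. The sketch below adds up the squared differences twice: once around the sample mean (17.6) and once around the population mean of 20 reported for the true distribution. The helper name `sum_of_squared_differences` is made up for illustration.

```python
measurements = [3, 13, 19, 24, 29]

sample_mean = sum(measurements) / len(measurements)  # x-bar = 17.6
population_mean = 20                                  # mu for the true distribution

def sum_of_squared_differences(data, center):
    """Add up the squared differences between each value and `center`."""
    return sum((x - center) ** 2 for x in data)

around_x_bar = sum_of_squared_differences(measurements, sample_mean)
around_mu = sum_of_squared_differences(measurements, population_mean)

print(round(around_x_bar, 1))  # 407.2
print(around_mu)               # 436
```

The total around x-bar is the smaller one, and that is not a fluke of these numbers: the sum of squared differences is always smallest when measured from the data's own average.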
[552]
note if you're like me and want to know
[555]
more details about why we need to
[557]
compensate for calculating differences
[559]
from the sample mean check out the
[561]
follow-up StatQuest the link is in the
[564]
description below
[567]
now let's do the math
[570]
first we calculate the differences
[572]
between the mean and the data doopa
[575]
doopa doopa doopa - then we square each
[579]
term
[581]
add up each term
[583]
but now we divide by n minus 1
[587]
and the estimated population variance is
[590]
101.8
[593]
now we just take the square root of the
[596]
estimated variance to get the estimated
[598]
standard deviation and we get 10.1
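The same steps can be followed in a few lines of Python. This is a sketch using the five counts from the example; the variable names are illustrative.

```python
import math

measurements = [3, 13, 19, 24, 29]
n = len(measurements)

# Step 1: the sample mean, x-bar.
x_bar = sum(measurements) / n

# Step 2: the difference between each measurement and x-bar, squared.
squared_differences = [(x - x_bar) ** 2 for x in measurements]

# Step 3: add up the squared terms and divide by n - 1 (not n).
estimated_variance = sum(squared_differences) / (n - 1)

# Step 4: the square root puts the result back in the original units.
estimated_sd = math.sqrt(estimated_variance)

print(round(estimated_variance, 1))  # 101.8
print(round(estimated_sd, 1))        # 10.1
```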
[604]
we can draw the mean plus and minus the
[607]
standard deviation on the graph
[610]
double BAM
[613]
the estimated population parameters
[616]
correspond to this purple curve with
[618]
mean equals 17.6 and
[621]
standard deviation equals 10.1
[625]
which isn't too far off from the true
[627]
distribution with mean equals 20 and
[630]
standard deviation equals 10
[633]
with more data the estimated parameters
[636]
would be more accurate and we would have
[638]
more confidence in them
[641]
however with just five measurements we
[644]
still did pretty well and that saved us
[646]
a ton of time and money hooray
[651]
in summary if we have all of the data
[654]
from a population we can calculate the
[657]
population mean the population mean
[660]
equals the sum of the measurements
[662]
divided by the number of measurements
[664]
and that equals the average measurement
[666]
mu
[669]
when we don't have the population data
[671]
we can estimate the population mean with
[674]
the same formula the estimated
[677]
population mean equals the sum of the
[680]
measurements divided by the number of
[682]
measurements which equals the average
[684]
measurement x-bar
[688]
when we have the population data we can
[691]
calculate the population variance and
[693]
standard deviation
[695]
the population variance is the average
[698]
of the squared differences between the
[700]
data and the population mean mu
[705]
in other words we square these
[706]
differences to prevent the ones on the
[709]
left from canceling the ones on the
[710]
right and then take the average
[714]
and the population standard deviation is
[717]
just the square root of the population
[719]
variance and since the standard
[722]
deviation is in the original units that
[725]
we measured we can draw it on the graph
[728]
however we almost never have the
[731]
population data so chances are you
[734]
should not use these formulas
[737]
instead we almost always estimate the
[740]
variance and standard deviation
[743]
when we estimate the population variance
[745]
we divide by n minus one to compensate
[749]
for measuring distances from the sample
[751]
mean instead of the population mean
[754]
and the estimated standard deviation is
[757]
just the square root of the estimated
[759]
population variance and since the
[762]
standard deviation is in the same units
[765]
that we measured the data we can draw it
[767]
on the graph and one last shameless plug
[771]
for the follow-up StatQuest if you want
[774]
to know why dividing by n underestimates
[777]
the variance check out the quest the
[780]
link is in the description below
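Short of the full explanation, a simulation can at least show the underestimate happening. The sketch below assumes the true distribution is a normal curve with mean 20 and standard deviation 10 (so a variance of 100), repeatedly draws samples of five measurements, and averages what each divisor gives. The constants and the function name are illustrative assumptions.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 20, 10   # assumed true distribution (variance = 100)
SAMPLE_SIZE = 5
TRIALS = 20_000

def average_variance_estimates():
    """Return the average divide-by-n and divide-by-(n - 1) results over all trials."""
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(TRIALS):
        sample = [random.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE)]
        x_bar = sum(sample) / SAMPLE_SIZE
        sum_sq = sum((x - x_bar) ** 2 for x in sample)
        total_n += sum_sq / SAMPLE_SIZE
        total_n_minus_1 += sum_sq / (SAMPLE_SIZE - 1)
    return total_n / TRIALS, total_n_minus_1 / TRIALS

avg_n, avg_n_minus_1 = average_variance_estimates()
print(f"dividing by n:     {avg_n:.1f}")
print(f"dividing by n - 1: {avg_n_minus_1:.1f}")
```

Under these assumptions the divide-by-n average should settle near 80, well short of the true variance of 100, while the divide-by-(n - 1) average should settle near 100.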
[783]
triple BAM
[784]
[Music]
[787]
in this StatQuest I made a big deal
[790]
about how we rarely have the population
[792]
data and we almost always estimate the
[795]
population parameters one reason I did
[799]
this was because while many software
[801]
packages estimate the variance and
[803]
standard deviation by default Microsoft
[806]
Excel does not
[809]
instead it gives two choices one
[812]
function VAR.P calculates the population
[816]
variance the other VAR.S estimates it
[821]
since we almost always have a relatively
[824]
small sample rather than the population
[827]
data we should almost always use VAR.S
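Python's standard `statistics` module offers the same pair of choices, which makes for a handy comparison. The sketch below runs both on the five counts from this example; pairing each function with an Excel function in the comments is an analogy, not something stated in the video.

```python
import statistics

measurements = [3, 13, 19, 24, 29]

# Treats the data as the whole population: divides by n (like VAR.P).
population_style = statistics.pvariance(measurements)

# Treats the data as a sample: divides by n - 1 (like VAR.S).
sample_style = statistics.variance(measurements)

print(round(population_style, 2))  # 81.44
print(round(sample_style, 1))      # 101.8
```

Only the second result matches the estimated variance worked out earlier, which is the one to reach for when the data are a small sample.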
[832]
hooray we've made it to the end of
[834]
another exciting StatQuest if you like
[837]
this StatQuest and want to see more
[839]
please subscribe and if you want to
[841]
support StatQuest well consider buying
[843]
an original song or a t-shirt or a
[845]
hoodie or just donating the links are
[848]
all in the description below alright
[850]
until next time quest on