
Calculating the Mean, Variance and Standard Deviation, Clearly Explained!!! - YouTube

Channel: StatQuest with Josh Starmer

[0]

I was home last night

[4]

variance or standard deviation so I

[8]

estimated them and it's cool stat quest

[13]

[Music]

[16]

hello I'm Josh Starmer and welcome to

[19]

stat quest today we're gonna continue

[21]

our series on statistics fundamentals

[23]

this time we're gonna talk about

[25]

estimating the mean variance and

[27]

standard deviation note this stat quest

[31]

assumes you already know about

[33]

histograms statistical distributions and

[36]

specifically the normal distribution if

[39]

not check out the quests the links are

[42]

in the description below also this stat

[46]

quest assumes you already understand why

[48]

we want to estimate population

[49]

parameters if not check out the quest

[55]

the stat quest on population parameters

[57]

we counted the number of mRNA

[59]

transcripts from gene X in five

[61]

different liver cells alternatively if

[65]

mRNA transcripts and liver cells didn't

[68]

mean anything to you we counted the

[70]

number of green apples in five different

[72]

grocery stores or green t-shirts in

[75]

five different clothing stores or

[78]

whatever you want to measure in five

[81]

different units

[83]

this green dot represented a liver cell

[85]

that had three mRNA transcripts for gene

[88]

X and this green dot represented a liver

[91]

cell that had 13 mRNA transcripts 19, 24

[98]

and 29 now if we had a lot of time and

[103]

money on our hands we could count the

[105]

number of mRNA transcripts for gene X in

[107]

all 240 billion liver cells

[111]

now we can draw a histogram of the

[114]

measurements if we wanted to fit a

[117]

normal curve to this histogram like this

[120]

then we need to calculate the population

[123]

mean and the population variance or

[125]

population standard deviation

[129]

calculating the population mean is easy

[131]

we take the average of all 240 billion

[134]

measurements BP booty P due to baby boo

[139]

doo bee doo doo boo boo and we get 20

[144]

the population mean

[147]

and we center the normal curve on the

[149]

population mean note because we

[153]

calculated the mean with all 240 billion

[156]

measurements in the population this is

[158]

not an estimate of the population mean

[160]

it is the population mean

[165]

however since we rarely if ever have

[168]

enough time and money to measure every

[170]

single thing in a population we almost

[173]

always estimate the population mean

[175]

using a relatively small sample in this

[179]

example we have the measurements from

[182]

only five of the 240 billion cells

[186]

estimating the population mean is super

[189]

easy we just calculate the average of

[191]

the measurements we collected doodoo

[195]

booty boot boot boot boot and in this

[199]

case the estimated population mean is

[201]

seventeen point six
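
As a quick check of the arithmetic above, here is a minimal Python sketch that averages the five example measurements (3, 13, 19, 24 and 29). The variable names are illustrative and do not come from the video.

```python
# The five example measurements (mRNA transcript counts per liver cell).
measurements = [3, 13, 19, 24, 29]

# The estimated (sample) mean is the sum of the measurements
# divided by the number of measurements.
x_bar = sum(measurements) / len(measurements)

print(x_bar)
```

The printed value should match the estimated mean quoted above.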

[204]

oh no it's the dreaded terminology alert

[209]

statisticians often use the symbol x-bar

[212]

to refer to the estimated mean which is

[215]

also called the sample mean

[218]

and they use the Greek symbol mu to

[221]

refer to the population mean

[224]

the estimated mean x-bar is different

[228]

from the population mean mu but with

[230]

more and more data

[232]

x-bar should get closer and closer to mu

[236]

going back to the full set of population

[238]

data

[239]

we will now determine how wide to make

[242]

the curve by calculating not estimating

[245]

the variance and standard deviation in

[248]

other words we want to calculate how the

[251]

data are spread around the population

[253]

mean

[255]

this is the formula we use to calculate

[258]

not estimate the population variance

[262]

note I'm making a big deal about

[265]

calculating versus estimating variance

[267]

because it makes a big difference that

[269]

we'll talk about later

[271]

this part X minus mu means we subtract

[275]

the population mean mu from each

[278]

measurement X boo-boo-boo-boo-boo the

[285]

square tells us to square each term

[290]

and the Greek character Sigma tells us to add up all

[293]

the terms

[295]

lastly we want the average of the

[297]

squared differences so we divide by the

[300]

number of measurements n which in this

[302]

case is 240 billion

[306]

thus we're just calculating the average

[308]

of the squared differences between the

[310]

data and the population mean
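
The formula itself is shown on screen and is not captured in the transcript. Based on the description above, it can be written as follows, where x is each measurement, mu is the population mean and n is the number of measurements. The symbol sigma squared for the population variance is the usual convention and is an assumption here.

```latex
\sigma^2 = \frac{\sum (x - \mu)^2}{n}
```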

[313]

note squaring each term ensures that

[316]

each difference is positive

[319]

otherwise the measurements on the left

[321]

side of the mean would give negative

[323]

differences

[325]

which would cancel out the positive

[327]

differences from the measurements on the

[329]

right side of the mean
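
A small Python sketch can illustrate the cancellation. Since the full population data are not available, it uses the five example measurements and their own average as a stand-in for the population and its mean; that substitution is an assumption made only for illustration.

```python
# Stand-in data: the five example measurements from earlier in the video.
measurements = [3, 13, 19, 24, 29]
mean = sum(measurements) / len(measurements)

# Raw differences: negative on the left of the mean, positive on the right.
differences = [x - mean for x in measurements]

# Squared differences: every term is zero or positive.
squared_differences = [d ** 2 for d in differences]

# The raw differences cancel out (up to floating-point rounding),
# while the squared differences add up to a positive number.
print(sum(differences))
print(sum(squared_differences))
```

The first sum comes out at essentially zero no matter how spread out the data are, which is why the raw differences cannot be used to measure spread.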

[332]

note if you are wondering why we don't

[334]

take the absolute value of each term

[336]

great we'll talk about that in the

[339]

follow-up video that dives deep into

[341]

these details

[344]

anyway and now we just do the math and

[347]

we get 100 for the population variance

[351]

BAM okay we calculated the population

[356]

variance and we're all proud of

[358]

ourselves however there is one thing

[360]

that is annoying about it

[363]

because each term is squared the units

[367]

for the result 100 are mRNA transcripts

[371]

squared note if the data had been the

[375]

number of apples in grocery stores then

[377]

the variance would be 100 apples squared

[380]

either way we can't plot the variance on

[384]

the graph since the units on the x axis

[386]

are not squared

[390]

to solve this problem we can take the

[392]

square root of everything and that gives

[395]

us the population standard deviation

[398]

so the population standard deviation is

[401]

the square root of 100 the population

[404]

variance which is 10

[407]

and we can plot that on the graph

[411]

this shows the mean 20 plus and minus

[414]

the standard deviation 10 mRNA

[417]

transcripts

[419]

[423]

note before we move on I want to

[425]

emphasize the point that we almost never

[427]

have the population data so we almost

[430]

never calculate the population mean and

[432]

population variance and standard

[434]

deviation

[436]

instead we estimate the population

[439]

variance and population standard

[441]

deviation from the relatively small

[443]

number of measurements that we have

[446]

remember the population variance and

[449]

standard deviation determine how much

[451]

the curve spreads out

[454]

and that means the estimated variance

[456]

and the estimated standard deviation

[459]

should reflect how the data are spread

[461]

around the population mean

[464]

however when we do an experiment we

[467]

don't see the curve or the population

[469]

mean we only see the data

[473]

so we have to use the estimated mean

[475]

x-bar instead

[479]

this is the formula we use to estimate

[480]

the population variance

[484]

because we almost always work with a

[486]

relatively small sample and not the

[488]

entire population this is the formula we

[491]

will use most of the time

[493]

the differences between this formula and

[496]

the one for the calculated population

[498]

variance are subtle but important first

[502]

since we don't know the population mean

[504]

mu we use the sample mean x-bar

[509]

second we are dividing by n minus 1

[512]

instead of n
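
Putting the two differences together, the estimating formula described above can be sketched as follows. The on-screen formula is not captured in the transcript, so the symbol s squared for the estimated variance is a common convention assumed here.

```latex
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
```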

[514]

dividing by n minus one compensates for

[517]

the fact that we are calculating the

[519]

differences from the sample mean instead

[521]

of the population mean otherwise we

[525]

would consistently underestimate the

[527]

variance around the population mean

[531]

this is because the differences between

[533]

the data and the sample mean tend to be

[536]

smaller than the differences between the

[538]

data and the population mean

[540]

thus the differences around the

[543]

population mean will result in a larger

[546]

average and the larger average is what

[550]

we are trying to estimate
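
One way to see the underestimate is with a simulation. The sketch below is not from the video: it assumes a normal population with mean 20 and standard deviation 10 (so a true variance of 100), repeatedly draws samples of five, and compares the average result of dividing by n with the average result of dividing by n minus 1. The seed, trial count and function name are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 20, 10       # assumed population mean and standard deviation
N, TRIALS = 5, 20_000    # sample size and number of simulated experiments


def sum_of_squared_differences(sample):
    """Sum of squared differences between the data and the sample mean."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample)


total = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    total += sum_of_squared_differences(sample)

average_sum = total / TRIALS
avg_dividing_by_n = average_sum / N
avg_dividing_by_n_minus_1 = average_sum / (N - 1)

print("dividing by n:    ", round(avg_dividing_by_n, 1))
print("dividing by n - 1:", round(avg_dividing_by_n_minus_1, 1))
```

Under these assumptions, dividing by n should average out near 80, well short of the true variance of 100, while dividing by n minus 1 should average out near 100.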

[552]

note if you're like me and want to know

[555]

more details about why we need to

[557]

compensate for calculating differences

[559]

from the sample mean check out the

[561]

follow-up stat quest the link is in the

[564]

description below

[567]

now let's do the math

[570]

first we calculate the differences

[572]

between the mean and the data doopa

[575]

doopa doopa doopa then we square each

[579]

term

[581]

add up each term

[583]

but now we divide by n minus 1

[587]

and the estimated population variance is

[590]

101.8

[593]

now we just take the square root of the

[596]

estimated variance to get the estimated

[598]

standard deviation and we get 10.1
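
The steps just described can be sketched in Python using the five example measurements. The variable names are illustrative and are not taken from the video.

```python
import math

measurements = [3, 13, 19, 24, 29]
n = len(measurements)

# Step 1: the estimated (sample) mean, x-bar.
x_bar = sum(measurements) / n

# Step 2: the difference between each measurement and x-bar, squared.
squared_differences = [(x - x_bar) ** 2 for x in measurements]

# Step 3: add up the squared terms and divide by n - 1 (not n).
estimated_variance = sum(squared_differences) / (n - 1)

# Step 4: the square root of the estimated variance is the
# estimated standard deviation, in the original units.
estimated_sd = math.sqrt(estimated_variance)

print(round(estimated_variance, 1), round(estimated_sd, 1))
```

Rounded to one decimal place, the two values should agree with the estimated variance and standard deviation quoted above.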

[604]

we can draw the mean plus and minus the

[607]

standard deviation on the graph

[610]

double BAM

[613]

the estimated population parameters

[616]

correspond to this purple curve with

[618]

mean equals seventeen point six and

[621]

standard deviation equals ten point one

[625]

which isn't too far off from the true

[627]

distribution with mean equals 20 and

[630]

standard deviation equals 10

[633]

with more data the estimated parameters

[636]

would be more accurate and we would have

[638]

more confidence in them

[641]

however with just five measurements we

[644]

still did pretty well and that saved us

[646]

a ton of time and money hooray

[651]

in summary if we have all of the data

[654]

from a population we can calculate the

[657]

population mean the population mean

[660]

equals the sum of the measurements

[662]

divided by the number of measurements

[664]

and that equals the average measurement

[666]

[669]

when we don't have the population data

[671]

we can estimate the population mean with

[674]

the same formula the estimated

[677]

population mean equals the sum of the

[680]

measurements divided by the number of

[682]

measurements which equals the average

[684]

measurement x-bar

[688]

when we have the population data we can

[691]

calculate the population variance and

[693]

standard deviation

[695]

the population variance is the average

[698]

of the squared differences between the

[700]

data and the population mean mu

[705]

in other words we square these

[706]

differences to prevent the ones on the

[709]

left from canceling the ones on the

[710]

right and then take the average

[714]

and the population standard deviation is

[717]

just the square root of the population

[719]

variance and since the standard

[722]

deviation is in the original units that

[725]

we measured we can draw it on the graph

[728]

however we almost never have the

[731]

population data so chances are you

[734]

should not use these formulas

[737]

instead we almost always estimate the

[740]

variance and standard deviation

[743]

when we estimate the population variance

[745]

we divide by n minus one to compensate

[749]

for measuring distances from the sample

[751]

mean instead of the population mean

[754]

and the estimated standard deviation is

[757]

just the square root of the estimated

[759]

population variance and since the

[762]

standard deviation is in the same units

[765]

that we measured the data we can draw it

[767]

on the graph and one last shameless plug

[771]

for the follow up stat quest if you want

[774]

to know why dividing by n underestimates

[777]

the variance check out the quest the

[780]

link is in the description below

[783]

triple BAM

[784]

[Music]

[787]

in this stat quest I made a big deal

[790]

about how we rarely have the population

[792]

data and we almost always estimate the

[795]

population parameters one reason I did

[799]

this was because while many software

[801]

packages estimate the variance and

[803]

standard deviation by default Microsoft

[806]

Excel does not

[809]

instead it gives two choices one

[812]

function VAR.P calculates the population

[816]

variance the other VAR.S estimates it

[821]

since we almost always have a relatively

[824]

small sample rather than the population

[827]

data we should almost always use VAR.S
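
The same choice shows up outside Excel. As a sketch, Python's standard-library statistics module offers pvariance, which divides by n, and variance, which divides by n minus 1. Comparing these to VAR.P and VAR.S is an analogy added here and is not something stated in the video.

```python
import statistics

measurements = [3, 13, 19, 24, 29]

# Treats the data as the entire population: divides by n.
calculated_variance = statistics.pvariance(measurements)

# Treats the data as a sample from a larger population: divides by n - 1.
estimated_variance = statistics.variance(measurements)
estimated_sd = statistics.stdev(measurements)

print(calculated_variance, estimated_variance, round(estimated_sd, 1))
```

With only five measurements the two variances differ noticeably, so it matters which function is called.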

[832]

hooray we've made it to the end of

[834]

another exciting stat quest if you like

[837]

this stat quest and want to see more

[839]

please subscribe and if you want to

[841]

support stat quest well consider buying

[843]

an original song or a t-shirt or a

[845]

hoodie or just donating the links are

[848]

all in the description below alright

[850]

until next time quest on
