Variance, Standard Deviation, Coefficient of Variation - YouTube
Channel: 365 Data Science
[0]
There are many ways to quantify variability,
however, we will focus on the most common
[4]
ones: variance, standard deviation, and coefficient
of variation.
[9]
In the field of statistics, we will typically
use different formulas when working with population
[14]
data and sample data.
[16]
Let’s think about this for a bit.
[18]
When you have the whole population, each data
point is known so you are 100% sure of the
[23]
measures you are calculating.
[26]
When you take a sample of this population
and you compute a sample statistic, it is
[30]
interpreted as an approximation of the population
parameter.
[33]
Moreover, if you extract 10 different samples
from the same population, you will get 10
[38]
different measures.
[40]
Statisticians have solved the problem by adjusting
the algebraic formulas for many statistics
[45]
to reflect this issue.
[46]
Therefore, we will explore both population
and sample formulas, as they are both used.
[53]
You may be asking yourself why the mean, median
and mode were each given only a single formula.
[59]
Well, actually, the sample mean is the average
of the sample data points, while the population
[64]
mean is the average of the population data
points.
[68]
Technically there are two different formulas,
but they are computed in the same way.
[73]
Okay, now.
[75]
After this short clarification, it’s time
to get onto variance.
[81]
Variance measures the dispersion of a set
of data points around their mean value.
[86]
Population variance, denoted by sigma squared,
is equal to the sum of squared differences
[90]
between the observed values and the population
mean, divided by the total number of observations.
[98]
Sample variance, on the other hand, is denoted
by s squared and is equal to the sum of squared
[104]
differences between observed sample values
and the sample mean, divided by the number
[109]
of sample observations minus 1.
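The two definitions just given can be written out as a minimal Python sketch. The function names and the data set are illustrative assumptions, not something shown in the video.

```python
def population_variance(values):
    """Sum of squared differences from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Same numerator, but divided by n - 1 instead of n."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(sample_variance(data))      # 32 / 7, roughly 4.57
```

The only difference between the two functions is the denominator, which is why the sample figure always comes out larger for the same numbers.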
[113]
Alright.
[114]
***
When you are getting acquainted with statistics,
[116]
it is hard to grasp everything right away.
[119]
Therefore, let’s stop for a second to examine
the formula for the population and try to
[124]
clarify its meaning.
[126]
The main part of the formula is its numerator,
so that’s what we want to comprehend.
[132]
The sum of squared differences between the
observations and the mean.
[136]
Hmm… so, the closer a number is to the mean,
the lower the result we will obtain, right?
[142]
And the further away from the mean it lies,
the larger this difference.
[148]
Easy.
[149]
But why do we square the differences?
[153]
Squaring the differences has two main purposes.
[156]
First, by squaring the numbers, we always
get non-negative values.
[161]
Without going too deep into the mathematics
of it, it is intuitive that dispersion cannot
[166]
be negative.
[167]
Dispersion is about distance and distance
cannot be negative.
[171]
If, on the other hand, we calculate the differences
and do not square them, we
[177]
would obtain both positive and negative values
that when summed would cancel out, leaving
[182]
us with no information about the dispersion.
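A quick way to see the cancellation is to compute the raw differences for a small data set. This is only a sketch; the numbers 1 to 5 are used as illustrative data.

```python
values = [1, 2, 3, 4, 5]
mean = sum(values) / len(values)

# Raw differences from the mean: negatives and positives mirror each other.
deviations = [x - mean for x in values]
raw_sum = sum(deviations)
squared_sum = sum(d ** 2 for d in deviations)

print(deviations)   # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(raw_sum)      # 0.0
print(squared_sum)  # 10.0
```

The raw differences always sum to zero, whatever the data, so only the squared version carries information about spread.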
[185]
Second, squaring amplifies the effect of large
differences.
[190]
For example, if the mean is 0 and you have
an observation of 100, the squared difference
[195]
is 10,000!
Alright, enough dry theory.
[200]
It is time for a practical example.
[203]
We have a population of five observations
– 1, 2, 3, 4 and 5.
[209]
Let’s find its variance.
[211]
We start by calculating the mean: 1+2+3+4+5
divided by 5 equals 3.
[222]
Then we apply the formula we just saw: 1 minus
3 squared, plus, 2 minus 3 squared, plus,
[230]
3 minus 3, squared, plus, 4 minus 3, squared,
plus, 5 minus 3, squared.
[240]
All of these components have to be divided
by 5.
[243]
When we do the math, we get 2.
[246]
So, the population variance of the data set
is 2.
[250]
But what about the sample variance?
[253]
This would only be suitable if we were told
that these five observations were a sample
[257]
drawn from a population.
[258]
So, let’s imagine that’s the case.
[262]
The sample mean is once again 3.
[265]
The numerator is the same, but the denominator
is going to be 4, instead of 5, giving us
[270]
a sample variance of 2.5.
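Both hand calculations can be checked with Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by n - 1. The variable names below are illustrative.

```python
import statistics

observations = [1, 2, 3, 4, 5]

# Treat the five numbers as the whole population: divide by 5.
pop_var = statistics.pvariance(observations)

# Treat the same five numbers as a sample: divide by 4.
samp_var = statistics.variance(observations)

print(f"population variance: {pop_var:.2f}")  # population variance: 2.00
print(f"sample variance: {samp_var:.2f}")     # sample variance: 2.50
```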
[274]
To conclude the variance topic, we should
interpret the result.
[278]
Why is the sample variance bigger than the
population variance?
[282]
In the first case, we knew the population,
that is, we had all the data and we calculated
[287]
the variance.
[288]
In the second case, we were told that 1, 2,
3, 4 and 5 were a sample, drawn from a bigger
[295]
population.
[297]
Imagine the population behind this sample were
these 9 numbers: 1, 1, 1, 2, 3, 4, 5, 5 and
[306]
5.
[307]
Clearly, the numbers are the same, but there
is a concentration around the two extremes
[312]
of the data set – 1 and 5.
[315]
The variance of this population is about 2.89.
[319]
So, our sample variance has rightly been corrected
upwards in order to reflect the higher potential
[326]
variability.
[328]
This is the reason why there are different
formulas for sample and population data.
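The upward correction can also be checked exhaustively. The sketch below assumes samples of size 3 drawn with replacement from the nine-number population (the video does not specify a sampling scheme), enumerates every possible sample, and averages both versions of the formula using exact fractions.

```python
from fractions import Fraction
from itertools import product

population = [1, 1, 1, 2, 3, 4, 5, 5, 5]


def variance(values, ddof):
    """Sum of squared differences from the mean, divided by (n - ddof)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


true_variance = variance(population, ddof=0)

# Every ordered sample of size 3 with replacement: 9 ** 3 = 729 samples.
samples = list(product(population, repeat=3))
avg_dividing_by_n = sum(variance(s, ddof=0) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance(s, ddof=1) for s in samples) / len(samples)

print(true_variance)              # 26/9
print(avg_dividing_by_n)          # 52/27
print(avg_dividing_by_n_minus_1)  # 26/9
```

Under this assumption, dividing by n underestimates the population variance on average, while dividing by n - 1 recovers it exactly.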
[334]
***
While variance is a common measure of data
[337]
dispersion, in most cases the figure you will
obtain is pretty large and hard to compare
[341]
as the unit of measurement is squared.
[343]
The easy fix is to calculate its square root
and obtain a statistic known as standard deviation.
[349]
In most analyses you perform, standard deviation
will be much more meaningful than variance.
[356]
As we saw in the previous lecture, there are
different measures for the population and
[360]
sample variance.
[361]
Consequently, there are also population and
sample standard deviations.
[366]
The formulas are: the square root of the population
variance and square root of the sample variance
[372]
respectively.
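For anyone who prefers code to a calculator, here is a minimal sketch using the numbers 1 to 5 as illustrative data. It takes the square roots by hand and compares them with the `pstdev` and `stdev` functions from Python's `statistics` module.

```python
import math
import statistics

observations = [1, 2, 3, 4, 5]

# Square roots of the population and sample variances.
pop_sd = math.sqrt(statistics.pvariance(observations))
samp_sd = math.sqrt(statistics.variance(observations))

# The statistics module also offers both directly.
same_pop = math.isclose(pop_sd, statistics.pstdev(observations))
same_samp = math.isclose(samp_sd, statistics.stdev(observations))

print(f"{pop_sd:.3f}")  # 1.414
print(f"{samp_sd:.3f}")  # 1.581
```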
[373]
I believe there is no need for an example
of the calculation, right?
[378]
If you have a calculator in your hands, you’ll
be able to do the job.
[382]
Alright.
[384]
The other measure we still have to introduce
is the coefficient of variation.
[388]
It is equal to the standard deviation, divided
by the mean.
[393]
Another name for the term is relative standard
deviation.
[396]
This is an easy way to remember its formula
– it is simply the standard deviation relative
[401]
to the mean.
[403]
As you probably guessed, there is a population
and sample formula once again.
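As a rough sketch, the coefficient of variation is one division on top of the standard deviation. The function name, the `sample` flag and the data are assumptions made for illustration.

```python
import statistics


def coefficient_of_variation(values, sample=True):
    """Standard deviation divided by the mean (a unit-free ratio)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return sd / mean


# Hypothetical heights in centimetres, not taken from the video.
heights_cm = [160, 165, 170, 175, 180]
cv_sample = coefficient_of_variation(heights_cm, sample=True)
cv_population = coefficient_of_variation(heights_cm, sample=False)

print(f"{cv_sample:.4f}")      # 0.0465
print(f"{cv_population:.4f}")  # 0.0416
```

The guard against a zero mean is a design choice of this sketch: the ratio is not defined there, and it becomes unstable when the mean is close to zero.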
[408]
So, standard deviation is the most common
measure of variability for a single data set.
[414]
But why do we need yet another measure such
as the coefficient of variation?
[418]
Well, comparing the standard deviations of
two data sets with different units or scales is misleading, but
[424]
comparing coefficients of variation is not.
[428]
As the old saying goes: “Tell me, I’ll forget.
[432]
Show me, I’ll remember.
[435]
Involve me, I’ll understand.”
[437]
To make sure you remember, here’s an example
of a comparison between standard deviations.
[443]
Let’s take the prices of pizza at 10 different
places in New York.
[447]
They range from 1 to 11 dollars.
[451]
Now, imagine that you only have Mexican pesos
and to you the prices look more like 18.81
[457]
pesos to 206.91 pesos, given the exchange
rate of 18.81 pesos for one dollar.
[464]
Let’s combine our knowledge so far and find
the standard deviations and coefficients of
[469]
variation of these two data sets.
[472]
First, we have to see if this is a sample
or a population.
[476]
Are there only 10 restaurants in New York?
[479]
Of course not; this is obviously a sample
drawn from all the restaurants in the city.
[483]
Then we have to use the formulas for sample
measures of variability.
[487]
Second, we have to find the mean.
[490]
The mean in dollars is equal to 5.5 and the
mean in pesos to 103.46.
[495]
The third step of the process is finding the
sample variance.
[500]
Following the formula that we showed earlier,
we can obtain 10.72 dollars squared and 3793.69
[509]
pesos squared.
[511]
The respective sample standard deviations
are 3.27 dollars and 61.59 pesos.
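The transcript does not list the ten individual prices, so the sketch below uses an assumed set of ten prices between 1 and 11 dollars that is consistent with the stated mean of 5.5 and sample variance of 10.72. Treat the list as a stand-in, not as the video's data.

```python
import statistics

# ASSUMPTION: ten illustrative prices, picked to match the quoted
# mean (5.5) and sample variance (10.72); not taken from the video.
prices_usd = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]

PESOS_PER_DOLLAR = 18.81  # exchange rate quoted in the example
prices_mxn = [price * PESOS_PER_DOLLAR for price in prices_usd]

summary = {}
for currency, prices in (("USD", prices_usd), ("MXN", prices_mxn)):
    mean = statistics.mean(prices)
    variance = statistics.variance(prices)  # sample formula, n - 1
    sd = statistics.stdev(prices)           # sample formula, n - 1
    summary[currency] = (mean, variance, sd)
    print(f"{currency}: mean={mean:.2f} variance={variance:.2f} sd={sd:.2f}")
```

Converting to pesos multiplies the mean and the standard deviation by 18.81, and the variance by 18.81 squared, which is why the peso variance is so much larger.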
[517]
Let’s make a couple of observations.
[521]
First, variance gives results in squared units,
while standard deviation gives them in the original units.
[526]
This is the main reason why professionals
prefer to use standard deviation as the main
[531]
measure of variability.
[532]
It is directly interpretable.
[535]
Squared dollars mean nothing even in the
field of statistics.
[539]
Second, we got standard deviations of 3.27
and 61.59 for the same pizza at the same 10
[547]
restaurants in New York City.
[549]
Seems wrong, right?
[550]
Don’t worry.
[551]
It is time to use our last tool – the coefficient
of variation.
[557]
Dividing the standard deviations by the respective
means, we get the two coefficients of variation.
[562]
The result is the same – 0.60.
[567]
Notice that it is not dollars, pesos, dollars
squared or pesos squared.
[571]
It is just 0.60.
[573]
This shows us the great advantage that the
coefficient of variation gives us.
[579]
Now, we can confidently say that the two data
sets have the same variability, which was
[584]
what we expected beforehand.
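The reason the two coefficients match is that multiplying every price by a constant multiplies both the standard deviation and the mean by that constant, so the ratio does not change. A minimal sketch, again using an assumed list of ten prices (the transcript does not give the individual values):

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# ASSUMPTION: illustrative prices consistent with the quoted summary
# figures; the actual ten prices are not listed in the transcript.
prices_usd = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
prices_mxn = [price * 18.81 for price in prices_usd]

cv_usd = coefficient_of_variation(prices_usd)
cv_mxn = coefficient_of_variation(prices_mxn)

print(f"{cv_usd:.2f}")  # 0.60
print(f"{cv_mxn:.2f}")  # 0.60
print(math.isclose(cv_usd, cv_mxn))  # True
```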
[587]
Let’s recap what we have learned so far.
[590]
There are three main measures of variability
– variance, standard deviation and coefficient
[595]
of variation.
[597]
Each of them has different strengths and applications.
[600]
You should feel confident using all of them
as we are getting closer to more complex statistical
[604]
topics.
[605]
Thanks for watching!