Variance, Standard Deviation, Coefficient of Variation - YouTube

Channel: 365 Data Science

[0]

There are many ways to quantify variability; however, we will focus on the most common

[4]

ones: variance, standard deviation, and coefficient of variation.

[9]

In the field of statistics, we will typically use different formulas when working with population

[14]

data and sample data.

[16]

Let’s think about this for a bit.

[18]

When you have the whole population, each data point is known so you are 100% sure of the

[23]

measures you are calculating.

[26]

When you take a sample of this population and you compute a sample statistic, it is

[30]

interpreted as an approximation of the population parameter.

[33]

Moreover, if you extract 10 different samples from the same population, you will most likely get 10

[38]

different measures.
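
To see this sampling variability in action, here is a small Python sketch. The population (the integers 1 to 1000), the sample size of 30 and the seed are all invented for illustration and do not come from the lecture.

```python
import random
import statistics

# Hypothetical population: the integers 1..1000 (an assumption for illustration).
population = list(range(1, 1001))
population_mean = statistics.mean(population)  # known exactly because we have every value

rng = random.Random(42)  # fixed seed so the run is repeatable

# Draw 10 samples of 30 values each and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 30)) for _ in range(10)]

print("population mean:", population_mean)
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f}")
```

Each sample mean is only an approximation of the population mean, and the ten approximations differ from one another.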

[40]

Statisticians have solved the problem by adjusting the algebraic formulas for many statistics

[45]

to reflect this issue.

[46]

Therefore, we will explore both population and sample formulas, as they are both used.

[53]

You must be asking yourself why there is only one formula each for the mean, median and mode.

[59]

Well, actually, the sample mean is the average of the sample data points, while the population

[64]

mean is the average of the population data points.

[68]

Technically there are two different formulas, but they are computed in the same way.

[73]

Okay, now.

[75]

After this short clarification, it’s time to get onto variance.

[81]

Variance measures the dispersion of a set of data points around their mean value.

[86]

Population variance, denoted by sigma squared, is equal to the sum of squared differences

[90]

between the observed values and the population mean, divided by the total number of observations.

[98]

Sample variance, on the other hand, is denoted by s squared and is equal to the sum of squared

[104]

differences between observed sample values and the sample mean, divided by the number

[109]

of sample observations minus 1.
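
Written out, the two definitions just described look roughly as follows. The lecture names only sigma squared and s squared; the other symbols (N for the population size, n for the sample size, mu for the population mean, x-bar for the sample mean, x_i for an observation) are conventional notation assumed here.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
```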

[113]

Alright.

[114]

*** When you are getting acquainted with statistics,

[116]

it is hard to grasp everything right away.

[119]

Therefore, let’s stop for a second to examine the formula for the population and try to

[124]

clarify its meaning.

[126]

The main part of the formula is its numerator, so that’s what we want to comprehend.

[132]

The sum of the squared differences between the observations and the mean.

[136]

Hmm… so, the closer a number is to the mean, the lower the result we will obtain, right?

[142]

And the further away from the mean it lies, the larger this difference.

[148]

Easy.

[149]

But why do we square the differences?

[153]

Squaring the differences has two main purposes.

[156]

First, by squaring the numbers, we always get non-negative values.

[161]

Without going too deep into the mathematics of it, it is intuitive that dispersion cannot

[166]

be negative.

[167]

Dispersion is about distance and distance cannot be negative.

[171]

If, on the other hand, we calculate the differences and do not square them, we

[177]

would obtain both positive and negative values that when summed would cancel out, leaving

[182]

us with no information about the dispersion.
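
A quick way to convince yourself of the cancelling effect is the sketch below. The five values are just a convenient illustration; any data set behaves the same way.

```python
data = [1, 2, 3, 4, 5]  # illustrative values
mean = sum(data) / len(data)  # 3.0

deviations = [x - mean for x in data]              # [-2.0, -1.0, 0.0, 1.0, 2.0]
squared_deviations = [d ** 2 for d in deviations]  # [4.0, 1.0, 0.0, 1.0, 4.0]

print(sum(deviations))          # 0.0: positives and negatives cancel out
print(sum(squared_deviations))  # 10.0: squaring keeps the information about spread
```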

[185]

Second, squaring amplifies the effect of large differences.

[190]

For example, if the mean is 0 and you have an observation of 100, the squared difference

[195]

is 10,000! Alright, enough dry theory.

[200]

It is time for a practical example.

[203]

We have a population of five observations – 1, 2, 3, 4 and 5.

[209]

Let’s find its variance.

[211]

We start by calculating the mean: 1+2+3+4+5 divided by 5 equals 3.

[222]

Then we apply the formula we just saw: (1 minus 3) squared, plus (2 minus 3) squared, plus

[230]

(3 minus 3) squared, plus (4 minus 3) squared, plus (5 minus 3) squared.

[240]

The sum of all these components then has to be divided by 5.

[243]

When we do the math, we get 2.

[246]

So, the population variance of the data set is 2.

[250]

But what about the sample variance?

[253]

This would only be suitable if we were told that these five observations were a sample

[257]

drawn from a population.

[258]

So, let’s imagine that’s the case.

[262]

The sample mean is once again 3.

[265]

The numerator is the same, but the denominator is going to be 4, instead of 5, giving us

[270]

a sample variance of 2.5.
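
The two calculations can be checked with a minimal Python sketch. The function names are made up for this illustration; they implement the population and sample formulas as described in the lecture.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    if len(values) < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


data = [1, 2, 3, 4, 5]
print(population_variance(data))  # 2.0
print(sample_variance(data))      # 2.5
```

The numerator is 10 in both cases; only the denominator changes from 5 to 4.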

[274]

To conclude the variance topic, we should interpret the result.

[278]

Why is the sample variance bigger than the population variance?

[282]

In the first case, we knew the population, that is, we had all the data and we calculated

[287]

the variance.

[288]

In the second case, we were told that 1, 2, 3, 4 and 5 was a sample, drawn from a bigger

[295]

population.

[297]

Imagine the population of this sample were these 9 numbers: 1, 1, 1, 2, 3, 4, 5, 5 and

[306]

5.

[307]

Clearly, the numbers are the same, but there is a concentration around the two extremes

[312]

of the data set – 1 and 5.

[315]

The variance of this population is about 2.89.

[319]

So, our sample variance has rightfully been corrected upwards in order to reflect the higher potential

[326]

variability.

[328]

This is the reason why there are different formulas for sample and population data.
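
One way to see why the sample formula divides by n minus 1 is to enumerate every possible sample from a tiny population and average the results. The lecture does not make this argument explicitly, so treat the sketch as an illustration under stated assumptions: samples of size 2, drawn with replacement from the population 1 to 5.

```python
from itertools import product

population = [1, 2, 3, 4, 5]


def variance(values, denominator):
    """Sum of squared deviations from the mean, divided by a chosen denominator."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / denominator


true_variance = variance(population, len(population))  # population formula

# Every possible ordered sample of size 2, drawn with replacement (25 in total).
samples = list(product(population, repeat=2))

# Average sample result when dividing by n (2) versus n - 1 (1).
avg_divide_by_n = sum(variance(s, 2) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(true_variance)            # 2.0
print(avg_divide_by_n)          # 1.0
print(avg_divide_by_n_minus_1)  # 2.0
```

Dividing by n understates the population variance on average, while dividing by n minus 1 recovers it in this setup.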

[334]

*** While variance is a common measure of data

[337]

dispersion, in most cases the figure you will obtain is pretty large and hard to compare

[341]

as the unit of measurement is squared.

[343]

The easy fix is to calculate its square root and obtain a statistic known as standard deviation.

[349]

In most analyses you perform, standard deviation will be much more meaningful than variance.

[356]

As we saw in the previous lecture, there are different measures for the population and

[360]

sample variance.

[361]

Consequently, there are also population and sample standard deviations.

[366]

The formulas are: the square root of the population variance and square root of the sample variance

[372]

respectively.
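
For readers without a calculator at hand, here is a short sketch using Python's standard `statistics` module on the five observations from the earlier example (1, 2, 3, 4 and 5).

```python
import statistics

data = [1, 2, 3, 4, 5]

population_sd = statistics.pstdev(data)  # square root of the population variance (2)
sample_sd = statistics.stdev(data)       # square root of the sample variance (2.5)

print(round(population_sd, 3))  # 1.414
print(round(sample_sd, 3))      # 1.581
```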

[373]

I believe there is no need for an example of the calculation, right?

[378]

If you have a calculator in your hands, you’ll be able to do the job.

[382]

Alright.

[384]

The other measure we still have to introduce is the coefficient of variation.

[388]

It is equal to the standard deviation, divided by the mean.

[393]

Another name for the term is relative standard deviation.

[396]

This is an easy way to remember its formula – it is simply the standard deviation relative

[401]

to the mean.
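
In symbols, the definition reads roughly as follows. The notation is assumed here rather than taken from the lecture: c_v for the population coefficient of variation, with a hat for the sample version, sigma and s for the standard deviations, mu and x-bar for the means.

```latex
c_v = \frac{\sigma}{\mu}
\qquad
\hat{c}_v = \frac{s}{\bar{x}}
```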

[403]

As you probably guessed, there are population and sample formulas once again.

[408]

So, standard deviation is the most common measure of variability for a single data set.

[414]

But why do we need yet another measure such as the coefficient of variation?

[418]

Well, comparing the standard deviations of two data sets with different units or scales can be misleading, but

[424]

comparing coefficients of variation is not.

[428]

As an old saying goes: “Tell me, I’ll forget.

[432]

Show me, I’ll remember.

[435]

Involve me, I’ll understand.”

[437]

To make sure you remember, here’s an example of a comparison between standard deviations.

[443]

Let’s take the prices of pizza at 10 different places in New York.

[447]

They range from 1 to 11 dollars.

[451]

Now, imagine that you only have Mexican pesos and to you the prices look more like 18.81

[457]

pesos to 206.91 pesos, given the exchange rate of 18.81 pesos for one dollar.

[464]

Let’s combine our knowledge so far and find the standard deviations and coefficients of

[469]

variation of these two data sets.

[472]

First, we have to see if this is a sample or a population.

[476]

Are there only 10 restaurants in New York?

[479]

Of course not; this is obviously a sample drawn from all the restaurants in the city.

[483]

Then we have to use the formulas for sample measures of variability.

[487]

Second, we have to find the mean.

[490]

The mean in dollars is equal to 5.5 and the mean in pesos to 103.46.

[495]

The third step of the process is finding the sample variance.

[500]

Following the formula that we showed earlier, we can obtain 10.72 dollars squared and 3793.69

[509]

pesos squared.

[511]

The respective sample standard deviations are 3.27 dollars and 61.59 pesos.

[517]

Let’s make a couple of observations.

[521]

First, variance gives results in squared units, while standard deviation gives them in the original units.

[526]

This is the main reason why professionals prefer to use standard deviation as the main

[531]

measure of variability.

[532]

It is directly interpretable.

[535]

Squared dollars mean nothing, even in the field of statistics.

[539]

Second, we got standard deviations of 3.27 and 61.59 for the same pizza at the same 10

[547]

restaurants in New York City.

[549]

Seems wrong, right?

[550]

Don’t worry.

[551]

It is time to use our last tool – the coefficient of variation.

[557]

Dividing the standard deviations by the respective means, we get the two coefficients of variation.

[562]

The result is the same – 0.60.

[567]

Notice that it is not dollars, pesos, dollars squared or pesos squared.

[571]

It is just 0.60.

[573]

This shows us the great advantage that the coefficient of variation gives us.

[579]

Now, we can confidently say that the two data sets have the same variability, which was

[584]

what we expected beforehand.
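
The whole comparison can be reproduced with a short Python sketch. The transcript does not list the individual prices, so the ten dollar values below are hypothetical, chosen only because they match the quoted summary figures (range 1 to 11, mean 5.5, sample variance about 10.72). The exchange rate is the one quoted in the lecture.

```python
import statistics

# Hypothetical prices in dollars (assumed; the transcript does not list them).
prices_usd = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
RATE = 18.81  # pesos per dollar, as quoted in the lecture
prices_mxn = [p * RATE for p in prices_usd]


def summarize(values):
    """Return mean, sample variance, sample standard deviation and CV."""
    mean = statistics.mean(values)
    var = statistics.variance(values)  # sample variance (divides by n - 1)
    sd = statistics.stdev(values)      # sample standard deviation
    return mean, var, sd, sd / mean    # last item: coefficient of variation


for label, values in (("USD", prices_usd), ("MXN", prices_mxn)):
    mean, var, sd, cv = summarize(values)
    print(f"{label}: mean={mean:.2f} variance={var:.2f} sd={sd:.2f} cv={cv:.2f}")
```

Converting the currency multiplies both the standard deviation and the mean by the same factor, so their ratio is unchanged.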

[587]

Let’s recap what we have learned so far.

[590]

There are three main measures of variability – variance, standard deviation and coefficient

[595]

of variation.

[597]

Each of them has different strengths and applications.

[600]

You should feel confident using all of them as we are getting closer to more complex statistical

[604]

topics.

[605]

Thanks for watching!
