Real-world application of the Central Limit Theorem (CLT) - YouTube

Channel: 365 Data Science

[0]
Hi everyone and welcome! In this video, we’ll talk about the real-world application of one
[5]
of the most widely used theorems in data science: The Central Limit Theorem. For more super
[10]
practical videos like this one, make sure to subscribe to our channel right now.
[14]
The Central Limit Theorem is at the core of ‘hypothesis testing’, an approach in statistics that
[19]
lets you use data to evaluate your ideas. In fact, this theorem can be applied to a
[23]
variety of real-life problems. Let’s illustrate with an example.
[28]
Say you own a business connected to the fish market, more specifically, a trout farm.
[33]
You own hundreds of fish reservoirs where you keep and breed trout in order to sell
[37]
them to the main fish stores which supply the biggest cities in the country. The farm
[41]
operates in the following way: you buy and breed fish, which you later sell. Your clients
[46]
range from single fish vendors, to supermarket chains.
[49]
Quite straightforward, right? But to transform the above procedure into
[53]
a cycle, you need to use your capacity of reservoirs appropriately. Here’s how it
[57]
works. First, you have to label the reservoir depending on the approximate size of fish
[62]
in it. There are three labels: newly hatched, middle size, and first-class. As the fish
[68]
grow, you move them from newly hatched to middle size and from middle size to first-class
[72]
and once they’re fully grown, you sell them. Meanwhile, you’ve stocked the pool with
[77]
newly hatched trout and the process goes on. Now, what’s crucial to know is that first-class
[82]
fish are the largest among all the fish-groups. Why is that so important? Well, as a business
[87]
owner, your goal is to maximize profit, and, to achieve that, you must sell fish when they
[92]
reach the largest possible size, as customers pay by the pound. What’s more, there’s
[97]
a regulation set by the government that allows you to keep 1,000 fish maximum in the first-class
[102]
reservoirs. All things considered, selling the first-class
[105]
fish as large as possible would be your best strategy to increase profit. Therefore, you
[110]
need to maximize the length of each fish in every single tank.
[114]
Easy to say, but how can you do that? How long is it going to take? Is the effort worth
[119]
the time you will lose while in competition with other fish farms?
[123]
Let’s think about it for a second. One option is to try to measure each fish separately...
[128]
But there are 1,000 fish in each tank, and more than 20 tanks stocked with first-class
[132]
fish. So, manually measuring each fish doesn’t sound like a good idea. It will simply take
[138]
too long. You and your employees will be stuck measuring fish from dawn till dusk which is
[143]
highly inefficient. What’s more, if you can quickly find out the average size of your
[147]
fish, you can project how long it will take for each tank to reach the necessary size.
[152]
This will allow you to plan key resources such as staff and fish food supplies.
[156]
Finally, having an edge in the business depends on your ability to stay competitive and agile.
[162]
Knowing what kind of sales volumes you can produce daily helps you to be prepared when
[166]
a customer calls to purchase a certain number of tanks due on a particular date.
[171]
And this is where you can truly benefit from some maths knowledge to optimize the process.
[175]
More precisely, you can use the Central Limit Theorem - it will help you tremendously with
[180]
time-saving and, what’s more, you will also maximize your profit at the same time.
[185]
So, what is the Central Limit Theorem and how does it work?
[189]
The Central Limit Theorem is a theorem in probability theory, whose first version was
[193]
proposed by the French mathematician Abraham de Moivre in 1733. De Moivre published an article
[199]
where he used a normal distribution to approximate the distribution of the number of heads resulting
[204]
from many tosses of a fair coin. The finding was nearly forgotten until the French mathematician
[209]
Pierre-Simon Laplace expanded it in his monumental work in the 19th century. Over the years,
[215]
numerous versions of it have been discovered and proven by other mathematicians.
[219]
In its basic form, the Central Limit Theorem states that if we have a population and we
[224]
take sufficiently large random samples from it, then the sample means will be approximately
[229]
normally distributed. We call the average of each sample a sample mean.
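The statement above can be checked with a small simulation. This is a minimal sketch, not something shown in the video: the population (exponential values with mean 10), the sample size of 50, and the helper name `sample_means` are all assumptions made for illustration. The population is deliberately skewed, yet the sample means should cluster symmetrically around the population mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, strongly skewed population (exponential, mean 10).
population = [random.expovariate(1 / 10) for _ in range(100_000)]


def sample_means(pop, sample_size, n_samples):
    """Draw n_samples random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(pop, sample_size))
        for _ in range(n_samples)
    ]


means = sample_means(population, sample_size=50, n_samples=2_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

center = statistics.fmean(means)   # the mean of sample means
spread = statistics.stdev(means)   # the standard deviation of sample means
theory_spread = pop_sd / 50 ** 0.5  # sigma / sqrt(n)

# Share of sample means within one standard deviation of their center;
# for a normal distribution this is roughly two thirds.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"population mean      : {pop_mean:.2f}")
print(f"mean of sample means : {center:.2f}")
print(f"sd of sample means   : {spread:.2f} (sigma/sqrt(n): {theory_spread:.2f})")
print(f"share within one sd  : {within_one_sd:.2f}")
```

If the theorem applies, the mean of sample means should sit close to the population mean, and roughly two thirds of the sample means should fall within one standard deviation of it.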
[234]
In case you’re wondering what a sample mean is, let’s go back to our fish example and
[238]
see. If you simply take groups of fish from each
[240]
first-class reservoir and record the average size in each group, that would be the so-called
[245]
sample means. In fact, that’s exactly how the Central Limit Theorem can be applied to
[250]
improve our fish measuring dilemma. Let’s see it in practice!
[253]
A common rule of thumb is that the minimum sample size for applying the CLT is 30. That’s why
[259]
we start with a sample size of 30. That means, each group of fish you select from your 1,000
[264]
first-class fish reservoir will consist of 30 fish. Then, you’ll increase the sample
[269]
size to 50, 100, etc., and each time you record the sample mean. The idea is to gradually
[275]
increase the sample size because the bigger the sample the better the theorem applies.
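One way to see the effect of a bigger sample is to simulate a single tank and compare how tightly the sample means cluster for sizes 30, 50, and 100. This is only a sketch: the tank contents (1,000 lengths drawn around 48 cm with a spread of 4 cm), the 500 repetitions, and the function name `spread_of_sample_means` are assumptions, not figures from the video.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical tank: 1,000 fish lengths in cm (values invented for illustration).
tank = [random.gauss(48, 4) for _ in range(1_000)]


def spread_of_sample_means(fish, sample_size, n_samples=500):
    """Standard deviation of the means of n_samples random groups of fish."""
    means = [
        statistics.fmean(random.sample(fish, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)


spreads = {n: spread_of_sample_means(tank, n) for n in (30, 50, 100)}

for n, s in spreads.items():
    print(f"sample size {n:>3}: sd of sample means = {s:.2f} cm")
```

The spread of the sample means should shrink as the sample size grows, which is why a larger group of fish gives a more reliable estimate of the tank average, at the cost of more measuring time.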
[280]
However, you must not try with too many values for sample sizes because you want to be able
[284]
to act as quickly as possible for each first-class tank. So, there are two possible outcomes:
[290]
Either the average size of fish in the respective pool has reached the maximum, 50 cm, which means you
[295]
can sell the whole tank, or you must keep feeding them and take measurements
[299]
at a future date. Then you do this for each first-class reservoir.
[303]
After recording and plotting the sample means, you can see the plot fits under the bell-shaped
[307]
curve, illustrating the Normal distribution. Therefore, going forward, you are in a position
[312]
to perform a statistical analysis using the properties of this distribution.
[316]
From the normal distribution graph, you see that the middle is denoted by μ, the mean
[321]
of sample means, which divides the area into two equal and symmetric halves. Moreover,
[326]
the area under the curve, which is this area here is equal to 1.
[330]
The first key observation is that approximately two thirds of the collected means are within one
[335]
standard deviation of the mean of sample means and almost all of them (about 95%) lie
[340]
within two standard deviations of it. But how does all this affect your fish measuring
[344]
business? Well, for example, if the mean of your sample means is
[347]
48 and their standard deviation is 2 for a tank, then the theorem says that approximately two
[353]
thirds of your observed sample means are in the range between 46 and 50. Moreover, almost
[358]
all of the sample means are between 44 and 52. This tells you that you must feed the
[363]
fish in the tank a little bit more and check again at the next measurement. This can have a massive
[367]
effect on your planning, as it helps you track the rate of growth in length of fish.
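The arithmetic behind the 46 to 50 and 44 to 52 ranges can be sketched in a few lines. The mean of 48 and standard deviation of 2 come from the example; the helper names `band` and `share_inside` are made up for this sketch, and the sample means are assumed to follow a normal distribution exactly.

```python
from statistics import NormalDist

mean_of_means = 48.0  # cm, from the example
sd_of_means = 2.0     # cm, from the example


def band(center, sd, k):
    """Interval reaching k standard deviations either side of the center."""
    return (center - k * sd, center + k * sd)


# Assumption: the sample means are modelled as exactly normal.
dist = NormalDist(mu=mean_of_means, sigma=sd_of_means)


def share_inside(interval):
    """Probability mass of the normal model inside the interval."""
    low, high = interval
    return dist.cdf(high) - dist.cdf(low)


one_sd = band(mean_of_means, sd_of_means, 1)
two_sd = band(mean_of_means, sd_of_means, 2)

print(f"within 1 sd: {one_sd}, share = {share_inside(one_sd):.3f}")
print(f"within 2 sd: {two_sd}, share = {share_inside(two_sd):.3f}")
```

Under this model the one-standard-deviation band holds about 68% of the sample means and the two-standard-deviation band about 95%, which is where the "two thirds" and "almost all" figures come from.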
[371]
Also, the sample means are approximately normally distributed random variables, which means we can
[376]
standardize them. More precisely, standardization in this case means subtracting the mean and dividing
[381]
by the standard deviation, so the result has mean 0 and variance 1. This way, we can easily find information
[387]
in statistical tables about the area under the curve of the standard normal distribution.
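Standardization is usually written as z = (x - mean) / standard deviation. Here is a minimal sketch; the list of sample means is invented for illustration and `z_score` is a hypothetical helper name.

```python
import statistics


def z_score(value, mean, sd):
    """Standardize a value: how many standard deviations it lies from the mean."""
    return (value - mean) / sd


# Hypothetical sample means (cm) recorded from one tank.
sample_means = [46.1, 47.3, 48.0, 48.4, 49.2, 50.5, 47.8, 48.9]

mu = statistics.fmean(sample_means)
sigma = statistics.pstdev(sample_means)

standardized = [z_score(m, mu, sigma) for m in sample_means]

# After standardizing, the values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(f"mean of z-scores: {statistics.fmean(standardized):.6f}")
print(f"sd of z-scores  : {statistics.pstdev(standardized):.6f}")

# With the example's mean of 48 and standard deviation of 2,
# a sample mean of 50 cm is one standard deviation above the center.
print(z_score(50, 48, 2))
```

A z-score is what you look up in a standard normal table: the table gives the area under the curve to the left of that value.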
[393]
Knowing all this, now we can answer some very interesting questions:
[397]
What is the probability of seeing the average of the 5th sample of 30 fish in the range
[401]
between 45 and 50 cm? Or what is the probability that the mean of the 10th sample is
[407]
bigger than 48 cm? Extracting those probabilities from the table
[412]
above, will give you a more intuitive picture of what’s happening in your tanks.
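Instead of reading a printed table, the same probabilities can be computed directly. The video does not give numbers for these questions, so this sketch assumes the earlier example's values: sample means that are normally distributed with mean 48 cm and standard deviation 2 cm. Under that model every sample mean has the same distribution, so it does not matter whether you ask about the 5th or the 10th sample.

```python
from statistics import NormalDist

# Assumed model for the sample means, reusing the numbers from the example.
sampling_dist = NormalDist(mu=48.0, sigma=2.0)

# P(45 < sample mean < 50): area under the curve between the two values.
p_between_45_and_50 = sampling_dist.cdf(50) - sampling_dist.cdf(45)

# P(sample mean > 48): area to the right of 48, which is the center here.
p_above_48 = 1 - sampling_dist.cdf(48)

# P(sample mean >= 50): chance that a sample already averages the 50 cm target.
p_at_least_50 = 1 - sampling_dist.cdf(50)

print(f"P(45 < mean < 50) = {p_between_45_and_50:.3f}")
print(f"P(mean > 48)      = {p_above_48:.3f}")
print(f"P(mean >= 50)     = {p_at_least_50:.3f}")
```

With these assumed values, a sample mean lands between 45 and 50 cm about 77% of the time, and exceeds 48 cm half of the time because 48 is the center of the distribution.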
[416]
Alright! So, this is how the Central Limit Theorem
[419]
can be applied in a real-world scenario. The power of this theorem lies in the fact that
[424]
it makes it possible to analyze data even with incomplete information about it and allows
[429]
averages over large datasets to be approximated accurately from samples.
[433]
If you enjoyed this video, don’t forget to hit the “like” or “share” button!
[438]
And if you’d like to become an expert in all things data science, subscribe to our
[441]
channel for more great videos every week! Thanks for watching!