The Shape of Data: Distributions: Crash Course Statistics #7

Channel: CrashCourse

Hi, I’m Adriene Hill, and welcome to Crash Course Statistics.

We’ve spent a lot of time talking about data visualization and different kinds of frequency plots--like dot plots and histograms--that tell us how frequently things occur in data we actually have. But so far in this series, the data we have talked about usually isn’t ALL the data that exists. If I want to know about student loan debt in America, I am definitely not going to ask over 300 million Americans. I’m lazy like that. But maybe I can find the time to ask 2,000 of them.

Samples...and the shapes they give us...are shadows of what all the data would look like. We collect samples because we think they’ll give us a glimpse of the bigger picture. They’ll tell us something about the shape of all the data. Because it turns out we can learn almost everything we need to know about data from its shape.
Intro
Picture a histogram of every single person’s height. Now imagine the bars getting thinner, and thinner, and thinner, as the bins get smaller and smaller. Till they are so thin that the outline of our histogram looks like a smooth line, since this is a distribution of continuous numbers, and there are infinitely many possible heights. I am 1.67642… (and on and on) meters tall.

If we let our bars be infinitely small, we get a smooth curve, also known as the distribution of the data. A distribution represents all possible values for a set of data...and how often those values occur.
[93]
Distributions can also be discrete.
[95]
Like the number of countries people have visited.
[97]
That means they only have a few set values--that they can take on.
[101]
These distributions look a lot more like the histograms we’re used to seeing.
[105]
Like a histogram, the distribution tells us about the shape and spread of data.
[109]
We can think of distributions as a set of instructions for a machine that generates
[113]
random numbers.
[114]
Let’s say it generates the number of leaves on a tree.
[116]
You may well be wondering why we’d have a tree-leaf-number generating machine.
[120]
The idea here is that EVERYTHING can generate data.
[123]
It’s not just mechanical stuff.
[124]
It’s leaves and animals and even people.
[127]
The distribution is what specifies how the knobs and dials on our machine are set.
[131]
Once the machine is set, every time there’s a new tree, the machine pops out a random
[135]
number of leaves from the distribution.
[137]
It won’t be the same number each time though.
[140]
That’s because it’s a random selection based on the information the knobs and dials
[144]
tell us about our distribution of leaves.
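The machine analogy can be sketched in a few lines of code. Here the "knobs" are the parameters of a normal distribution; the mean of 2,000 leaves and spread of 300 are made-up illustration values, not numbers from the episode.

```python
import random

# A toy leaf machine: its "knobs and dials" are the parameters of a
# distribution. We assume, purely for illustration, that leaf counts are
# roughly normal with mean 2000 and standard deviation 300.
def leaf_machine(rng, mean=2000, sd=300):
    """Pop out a random whole number of leaves for one new tree."""
    return max(0, round(rng.gauss(mean, sd)))

rng = random.Random(42)
sample = [leaf_machine(rng) for _ in range(5)]
print(sample)  # a different count for each tree, same knob settings
```

Each call draws from the same distribution, so the counts differ from tree to tree even though the knobs never move.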
When we look at samples of data generated by our leaf machine, we’re trying to guess the shape of the distribution and how that machine’s knobs and dials are set. But remember, samples of data are not all the data, so when we compare the shapes of two samples of data, we’re really asking whether the same distribution--one set of machine settings--could have produced these two different, but sorta similar, shapes.

If you got an especially expensive electricity bill last month, you may want to look at histograms of your average daily energy consumption this month and the same month last year, side by side. It’s not that realistic to expect that you consumed energy at EXACTLY the same rate this month as you did the year before. There are probably some differences. But your question is whether there’s enough difference to conclude that your energy-consuming behaviors have changed.

When we think about data samples as being just some of the data made using a certain distribution shape, it helps us compare samples in a more meaningful way. Because we know that the samples approximate some theoretical shape, we can draw connections between the sample and the theoretical machine that generated it, which is what we really care about.
While data come in all sorts of shapes, let’s take a look at a few of the most common, starting with the normal distribution. We mentioned the normal distribution when we talked about the different ways to measure the center of data, since the mean, median, and mode of a normal distribution are the same. This tells us that the distribution is symmetric--meaning you could fold it in half and those halves would be the same--and that it’s unimodal, meaning there’s only one peak.

The shape of a normal distribution is set by two familiar statistics: the mean and the standard deviation. The mean tells us where the center of the distribution is. The standard deviation tells us how thin or squished the normal distribution is. Since the standard deviation is the average distance between any point and the mean, the smaller it is, the closer all the data will be to the mean, and the skinnier the normal distribution will be.

Most of the data in a normal distribution--about 68%--is within 1 standard deviation of the mean on either side. Just like the quartiles in a boxplot, the smaller the range that 68% of the data has to occupy, the more squished it gets.
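That 68% figure is easy to check by simulation. A quick sketch, using made-up IQ-style numbers (mean 100, standard deviation 15) purely as an example:

```python
import random
import statistics

# Draw a large sample from a normal distribution and check that roughly
# 68% of the values land within one standard deviation of the mean.
rng = random.Random(0)
data = [rng.gauss(100, 15) for _ in range(100_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)
share = sum(mean - sd <= x <= mean + sd for x in data) / len(data)
print(f"{share:.1%} of the sample is within 1 SD of the mean")
```

With a sample this large, the printed share lands very close to 68%.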
Speaking of boxplots, here’s what the boxplot for normally distributed data looks like. The two halves of our box are exactly the same because the normal distribution is symmetric.

You’ve probably seen the normal distribution in a lot of different places; it sometimes gets called a bell curve. Attributes like IQ and the number of Froot Loops you get in a box are approximately normally distributed. Normal distributions come up a lot when we look at groups of things, like the total value rolled after 10 dice rolls, or birth weights. We’ll talk more about why the normal distribution is so useful in the future.
As we’ve seen in this series, data isn’t always normal or symmetric; oftentimes it has some extreme values on one side, making it a little bit skewed. Age at death during the Middle Ages is left-skewed, ’cause lots of people died young, while the time it takes to fill out the Nerdfighteria survey was right-skewed, because some people lollygagged.

In a boxplot of data from a skewed distribution, the median will not usually split the box into two even pieces. Instead, the side with the skewed tail will tend to be stretched out, and often we’ll see a lot of outliers on that side, just like the boxplot of the Nerdfighteria survey times. When we see those features in our sample of data, it suggests that the distribution that generated our data also has some kind of skewed tail.
Skew can be a useful way to compare data. For example, teachers often look at the distribution of scores on a test to see how difficult the test was. Really difficult tests tend to generate skewed scores, with most students doing pretty poorly and a few who still ace it.

Say we flashed pictures of 20 Pokémon and asked people to name them. Here are their grades. Or another sample, from a test asking people to list all 195 countries. We can compare the shapes and centers of these two groups of tests, as well as any other notable features. First of all, these two samples look pretty similar. Both have a right skew. Both have a pretty low center, but the second test has a more extreme skew.

Bigger skewed tails usually mean that the data--and therefore the distribution--has both a larger range and a bigger standard deviation than data with a smaller tail. The standard deviation is higher because not only are extreme data further away from the mean, they drag the mean toward them, making most of the other points just a little further from the mean too.
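You can watch that dragging happen in a small simulation. The numbers below are arbitrary; the point is just that bolting a right tail onto a symmetric sample raises both the mean and the standard deviation.

```python
import random
import statistics

# Start with a roughly symmetric sample, then add a skewed right tail.
rng = random.Random(1)
symmetric = [rng.gauss(50, 5) for _ in range(1000)]
skewed = symmetric + [rng.uniform(80, 120) for _ in range(50)]  # extreme tail

print(statistics.fmean(symmetric), statistics.stdev(symmetric))
print(statistics.fmean(skewed), statistics.stdev(skewed))  # both are larger
```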
While the direction of the skew tells you where most of the data is--always on the opposite side of the skewed tail--the extremeness of the skew can help you mentally compare approximate measures of spread, like range and standard deviation.

But we compare the shapes of two samples in order to ask whether the shapes of the distributions that generated them are different, or whether ONE shape could have randomly created both samples. In terms of our machine analogy, we ask whether one machine, with its knob settings, could have spit out two sets of scores: one that looks like test A, and one that looks like test B. Answering that question gets complicated, but we’ll get there.
Now that we’ve examined the tails, let’s look at the middle of some distributions. Almost all the distributions we’ve seen so far are unimodal--they only have one peak. But there are many times when data might have two or more peaks. We call it bimodal or multimodal data. It looks like the back of a camel, or maybe like two of our unimodal distributions pasted side by side. And that’s probably what’s happening--the unimodal distributions, not the camel thing.

Often when you see multimodal data in the world, it’s because there are two different machines with two different distributions that are both generating data that is being--for some reason or other--measured together.

One possible example of this is the length in minutes that the geyser Old Faithful erupts. Most eruptions last either about 2 minutes or about 4 minutes, with few eruptions around the 3-minute mark, giving us a bimodal distribution. It’s entirely possible that there are two different mechanisms behind the data, even though they’re being measured together. For example, one set of conditions may lead to an eruption that’s about 2 minutes long, and another--maybe a different temperature or latency--leads to a different kind of eruption, which lasts on average 4 minutes. Since these two potentially different types of eruptions are being measured together, the data look like they come from one distribution with two bumps, but it is likely that there are two unimodal distributions being measured at the same time.
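The two-machines idea is easy to mimic: mix samples from two unimodal distributions and the combined data comes out bimodal. The means and spreads below are made-up illustration values, not real Old Faithful measurements.

```python
import random

# Two "machines": short eruptions around 2 minutes, long ones around 4.
rng = random.Random(7)
short_eruptions = [rng.gauss(2.0, 0.3) for _ in range(500)]
long_eruptions = [rng.gauss(4.0, 0.3) for _ in range(500)]
eruptions = short_eruptions + long_eruptions  # measured together

# Crude check for two bumps: counts pile up near 2 and 4, with a dip near 3.
near_2 = sum(1.7 <= x <= 2.3 for x in eruptions)
near_3 = sum(2.7 <= x <= 3.3 for x in eruptions)
near_4 = sum(3.7 <= x <= 4.3 for x in eruptions)
print(near_2, near_3, near_4)
```

A histogram of `eruptions` would show the camel-back shape, even though each half came from a plain unimodal machine.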
Another example that you don’t need to be a geologist to understand is the race times for some marathons. While this data may look like it comes from a unimodal distribution, in reality there are two big groups of people who run a marathon: those that are competing, and those that just want to prove they can do it. There’s usually one peak around the time that all the professional runners cross the finish line, and another when the amateurs do. While we don’t know for sure that bimodal data is secretly two distributions disguised as one, it is a good reason to look at things more closely.
We’ll finish today with the uniform distribution. Even though we haven’t mentioned uniform distributions yet, you’ve probably come across them in your everyday life. Each value in a uniform distribution has the same frequency, just like each number on a die has exactly the same chance of being rolled.

When you need to decide something fairly--like which of your 6 roommates has to do dishes tonight, or which friend to take to the Jay-Z concert--the best thing you can do is use something, like a die, that has a uniform distribution. That gives everyone an equal chance of being picked. And you can have uniform distributions with any number of outcomes. There are 20-sided dice. When you’re in Vegas playing a round of roulette, the ball is equally likely to land in any of 38 slots.

There’s a difference between the shape of all the data and the shape of a sample of the data. When we talk about a uniform distribution, we’re talking about the settings of that data-generating machine. It doesn’t mean that every sample--or even most samples--of our data will have exactly the same frequency for each outcome. It’s entirely possible that rolling a die 60 times results in a sample shaped like this:

Even if we know the theoretical distribution looks like this:
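A quick sketch of that gap between sample and theory: simulate 60 rolls of a fair die and print a crude text histogram. The flat theoretical shape would be exactly 10 of each face; the sample almost never comes out that even.

```python
import random
from collections import Counter

# The theoretical distribution is uniform: each face has probability 1/6.
rng = random.Random(3)
rolls = [rng.randint(1, 6) for _ in range(60)]
counts = Counter(rolls)

for face in range(1, 7):
    print(face, "x" * counts[face])  # a quick text histogram of the sample
```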
Using statistics allows us to take the shape of samples that have some randomness and uncertainty, and make a guess about the true distribution that created that sample of data. Statistics is all about making decisions when we’re not sure. It allows us to look at the shape of 60 dice rolls and figure out whether we believe the die is fair, or whether the die is loaded, or whether we need to keep rolling. Whether it’s finding the true distribution of eruption times at Old Faithful, or showing evidence that a company is discriminating based on age, gender, or race, the shape of data gives us a glimpse into the true nature of what is happening in the world.

Thanks for watching, and DFTBAQ. I’ll see you next time.