The Shape of Data: Distributions: Crash Course Statistics #7 - YouTube
Channel: CrashCourse
Hi, I'm Adriene Hill, and welcome to Crash Course Statistics.

We've spent a lot of time talking about data visualization and different kinds of frequency plots, like dot plots and histograms, that tell us how frequently things occur in data we actually have.

But so far in this series, the data we've talked about usually isn't ALL the data that exists. If I want to know about student loan debt in America, I'm definitely not going to ask over 300 million Americans. I'm lazy like that. But maybe I can find the time to ask 2,000 of them.
Samples, and the shapes they give us, are shadows of what all the data would look like. We collect samples because we think they'll give us a glimpse of the bigger picture. They'll tell us something about the shape of all the data. Because it turns out we can learn almost everything we need to know about data from its shape.

[Intro]
Picture a histogram of every single person's height. Now imagine the bars getting thinner and thinner as the bins get smaller and smaller, until they're so thin that the outline of our histogram looks like a smooth line, since this is a distribution of continuous numbers, and there are infinitely many possible heights. I am 1.67642... (and on and on) meters tall. If we let our bars be infinitely small, we get a smooth curve, also known as the distribution of the data.
A distribution represents all possible values for a set of data, and how often those values occur.

Distributions can also be discrete, like the number of countries people have visited. That means they can only take on a few set values. These distributions look a lot more like the histograms we're used to seeing.
Like a histogram, the distribution tells us about the shape and spread of data.

We can think of a distribution as a set of instructions for a machine that generates random numbers. Let's say it generates the number of leaves on a tree. You may well be wondering why we'd have a tree-leaf-number generating machine. The idea here is that EVERYTHING can generate data. It's not just mechanical stuff; it's leaves and animals and even people.

The distribution is what specifies how the knobs and dials on our machine are set. Once the machine is set, every time there's a new tree, the machine pops out a random number of leaves from the distribution. It won't be the same number each time, though. That's because it's a random selection based on the information the knobs and dials tell us about our distribution of leaves.
When we look at samples of data generated by our leaf machine, we're trying to guess the shape of the distribution and how that machine's knobs and dials are set.
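The machine analogy can be sketched in a few lines of Python. The choice of a normal distribution, with its mean and standard deviation as the "knobs," is purely an assumption for illustration; the video doesn't say which distribution leaf counts actually follow.

```python
import random

# Hypothetical "leaf machine": the knobs are the distribution's
# parameters. We assume (illustrative values only) leaf counts are
# roughly normal with mean 2000 and standard deviation 300.
def leaf_machine(mean=2000.0, sd=300.0):
    """Pop out one random leaf count from the distribution."""
    return max(0, round(random.gauss(mean, sd)))

random.seed(42)
print([leaf_machine() for _ in range(5)])  # a different count each tree
```

Each call draws a fresh random value, so repeated "trees" give different counts even though the knob settings never change.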
But remember, samples of data are not all the data. So when we compare the shapes of two samples of data, we're really asking whether the same distribution, the same machine settings, could have produced these two different, but sorta similar, shapes.
If you got an especially expensive electricity bill last month, you may want to look at histograms of your average daily energy consumption for this month and for the same month last year, side by side. It's not realistic to expect that you consumed energy at EXACTLY the same rate this month as you did the year before; there are probably some differences. But your question is whether there's enough difference to conclude that your energy-consuming behaviors have changed.
When we think about data samples as being just some of the data made using a certain distribution shape, it helps us compare samples in a more meaningful way. Because we know that the samples approximate some theoretical shape, we can draw connections between the sample and the theoretical machine that generated it, which is what we really care about.
While data come in all sorts of shapes, let's take a look at a few of the most common, starting with the normal distribution.

We mentioned the normal distribution when we talked about the different ways to measure the center of data, since the mean, median, and mode of a normal distribution are the same. This tells us that the distribution is symmetric, meaning you could fold it in half and those halves would be the same, and that it's unimodal, meaning there's only one peak.
The shape of a normal distribution is set by two familiar statistics: the mean and the standard deviation. The mean tells us where the center of the distribution is. The standard deviation tells us how skinny or spread out the normal distribution is. Since the standard deviation is roughly the average distance between any point and the mean, the smaller it is, the closer all the data will be to the mean, and the skinnier the normal distribution will be.

Most of the data in a normal distribution, about 68%, is within 1 standard deviation of the mean on either side. Just like the quartiles in a boxplot, the smaller the range that 68% of the data has to occupy, the more squished it gets.
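A quick simulation can confirm the 68% figure. This is a minimal sketch using Python's standard library; the particular mean and standard deviation are arbitrary.

```python
import random

# Draw 100,000 values from a normal distribution and count the
# share that land within 1 standard deviation of the mean.
random.seed(1)
mean, sd = 0.0, 1.0
draws = [random.gauss(mean, sd) for _ in range(100_000)]
share = sum(abs(x - mean) <= sd for x in draws) / len(draws)
print(f"{share:.3f}")  # close to the theoretical 0.683
```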
Speaking of boxplots, here's what the boxplot for normally distributed data looks like. The two halves of our box are exactly the same, because the normal distribution is symmetric.

You've probably seen the normal distribution in a lot of different places; it sometimes gets called a bell curve. Attributes like IQ and the number of Froot Loops you get in a box are approximately normally distributed. Normal distributions come up a lot when we look at groups of things, like the total value rolled after 10 dice rolls, or birth weights. We'll talk more about why the normal distribution is so useful in the future.
As we've seen in this series, data isn't always normal or symmetric; oftentimes it has some extreme values on one side, making it a little bit skewed. Age at death during the Middle Ages is left-skewed, because lots of people died young, while the time it takes to fill out the Nerdfighteria survey was right-skewed, because some people lollygagged.

In a boxplot of data from a skewed distribution, the median will not usually split the box into two even pieces. Instead, the side with the skewed tail will tend to be stretched out, and often we'll see a lot of outliers on that side, just like in the boxplot of the Nerdfighteria survey times.
When we see those features in our sample of data, it suggests that the distribution that generated our data also has some kind of skewed tail.

Skew can be a useful way to compare data. For example, teachers often look at the distribution of scores on a test to see how difficult the test was. Really difficult tests tend to generate skewed scores, with most students doing pretty poorly and a few who still ace it.
Say we flashed pictures of 20 Pokémon and asked people to name them. Here are their grades. Or another sample, from a test asking people to list all 195 countries. We can compare the shapes and centers of these two groups of tests, as well as any other notable features.

First of all, these two samples look pretty similar. Both have a right skew. Both have a pretty low center, but the second test has a more extreme skew.
Bigger skewed tails usually mean that the data, and therefore the distribution, has both a larger range and a bigger standard deviation than data with a smaller tail. The standard deviation is higher because not only are extreme data further away from the mean, they drag the mean toward them, making most of the other points just a little further from the mean too. While the direction of the skew tells you where most of the data is (always on the opposite side of the skewed tail), the extremeness of the skew can help you mentally compare approximate measures of spread, like range and standard deviation.
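The effect of a skewed tail on the mean and the spread is easy to see in a simulation. This sketch compares a right-skewed sample (drawn, as an assumption, from an exponential distribution) with a symmetric normal sample centered at the same value.

```python
import random
import statistics

random.seed(2)
# Right-skewed sample: exponential with mean 10 (an illustrative
# stand-in for something like survey completion times).
skewed = [random.expovariate(1 / 10) for _ in range(10_000)]
# Symmetric sample centered at the same value, with a smaller spread.
symmetric = [random.gauss(10, 2) for _ in range(10_000)]

# The long right tail drags the mean above the median...
print(statistics.mean(skewed) > statistics.median(skewed))     # True
# ...and gives the skewed sample a bigger standard deviation.
print(statistics.stdev(skewed) > statistics.stdev(symmetric))  # True
```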
But we compare the shapes of two samples in order to ask whether the shapes of the distributions that generated them are different, or whether ONE shape could have randomly created both samples. In terms of our machine analogy, we ask whether one machine, with its knob settings, could have spit out two sets of scores: one that looks like test A, and one that looks like test B. Answering that question gets complicated, but we'll get there.
Now that we've examined the tails, let's look at the middle of some distributions. Almost all the distributions we've seen so far are unimodal: they only have one peak. But there are many times when data might have two or more peaks. We call that bimodal or multimodal data. It looks like the back of a camel, or maybe like two of our unimodal distributions pasted side by side. And that's probably what's happening... the unimodal distributions, not the camel thing.
Often when you see multimodal data in the world, it's because there are two different machines, with two different distributions, that are both generating data that is being (for some reason or other) measured together.

One possible example of this is the length in minutes that the geyser Old Faithful erupts. Most eruptions last either about 2 minutes or about 4 minutes, with few eruptions around the 3-minute mark, giving us a bimodal distribution. It's entirely possible that there are two different mechanisms behind the data, even though they're being measured together. For example, one set of conditions may lead to an eruption that's about 2 minutes long, and another, maybe a different temperature or latency, leads to a different kind of eruption which lasts on average 4 minutes. Since these two potentially different types of eruptions are being measured together, the data look like they come from one distribution with two bumps, but it's likely that there are two unimodal distributions being measured at the same time.
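Two machines measured together can be sketched as a mixture of two distributions. The normal components and their parameters below are assumptions chosen to mimic the Old Faithful pattern, not the real eruption data.

```python
import random

random.seed(3)

def eruption_minutes():
    """Draw from one of two hypothetical 'machines' at random."""
    if random.random() < 0.5:
        return random.gauss(2.0, 0.3)  # short-eruption machine
    return random.gauss(4.0, 0.4)      # long-eruption machine

durations = [eruption_minutes() for _ in range(10_000)]
# The valley: few eruptions land near the 3-minute mark...
near_3 = sum(2.9 < d < 3.1 for d in durations) / len(durations)
# ...compared with the peak near 2 minutes.
near_2 = sum(1.9 < d < 2.1 for d in durations) / len(durations)
print(near_3 < near_2)  # True: two peaks with a dip between them
```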
Another example that you don't need to be a geologist to understand is the race times for some marathons. While this data may look like it comes from a unimodal distribution, in reality there are two big groups of people who run a marathon: those that are competing, and those that do it to prove they can. There's usually one peak around the time that all the professional runners cross the finish line, and another when the amateurs do. While we don't know for sure that bimodal data is secretly two distributions disguised as one, it is a good reason to look at things more closely.
We'll finish today with the uniform distribution. Even though we haven't mentioned uniform distributions yet, you've probably come across them in your everyday life. Each value in a uniform distribution has the same frequency, just like each number on a die has exactly the same chance of being rolled.

When you need to decide something fairly, like which of your 6 roommates has to do dishes tonight, or which friend to take to the Jay-Z concert, the best thing you can do is use something, like a die, that has a uniform distribution. That gives everyone an equal chance of being picked. And you can have uniform distributions with any number of outcomes: there are 20-sided dice, and when you're in Vegas playing a round of roulette, the ball is equally likely to land in any of 38 slots.
There's a difference between the shape of all the data and the shape of a sample of the data. When we talk about a uniform distribution, we're talking about the settings of that data-generating machine. It doesn't mean that every sample, or even most samples, of our data will have exactly the same frequency for each outcome. It's entirely possible that rolling a die 60 times results in a sample shaped like this, even if we know the theoretical distribution looks like this.
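Simulating 60 rolls shows how lumpy a small sample from a perfectly uniform distribution can be. A minimal stdlib sketch:

```python
import random
from collections import Counter

random.seed(4)
# 60 rolls of a fair die: the theoretical distribution is uniform,
# so each face is "expected" 10 times, but a real sample rarely
# comes out that even.
rolls = [random.randint(1, 6) for _ in range(60)]
counts = Counter(rolls)
print(dict(sorted(counts.items())))
```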
Using statistics allows us to take the shape of samples that have some randomness and uncertainty, and make a guess about the true distribution that created that sample of data. Statistics is all about making decisions when we're not sure. It allows us to look at the shape of 60 dice rolls and figure out whether we believe the die is fair, or whether the die is loaded, or whether we need to keep rolling.
Whether it's finding the true distribution of eruption times at Old Faithful, or showing evidence that a company is discriminating based on age, gender, or race, the shape of data gives us a glimpse into the true nature of what is happening in the world.

Thanks for watching, and DFTBAQ. I'll see you next time.