Non-Normal Distribution in Statistics – Skewness and Kurtosis (3-9) - YouTube
Channel: Research By Design
Now that I have explained to you the ubiquity of the normal distribution, its regular appearance in human measurements, you may begin to hope or even expect that all of the distributions we encounter will be normal curves. But if that is your expectation, you will have to get used to disappointment. (Princess Bride reference there.) Because many curves, perhaps most curves, are not normal distributions, we need a way to talk about the shape of distributions when they differ from normality.

The first difference we may find is that the scores in the distribution are more spread out than we expected, or more closely packed together than we expected. The name for the peakedness or flatness of a curve is kurtosis. When the scores are very close together, the curve becomes peaked. We call this a "leptokurtic" curve; think of the scores leaping up - leptokurtic. When the scores are very spread out, the curve becomes flat like a plate; we call this "platykurtic." "Plat" rhymes with flat: platykurtic is a flattened curve in the shape of a plate. A normal curve is mesokurtic; its kurtosis is medium. So kurtosis can be measured as leptokurtic (tall), platykurtic (flat), or mesokurtic (medium). Kurtosis is caused by the variability in the distribution.

Another thing that can happen to a curve is when the scores are pulled out in only one direction.
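The three kurtosis labels above can be checked numerically. Here is a minimal sketch using NumPy; the Laplace and uniform distributions are stand-ins I chose for "peaked" and "flat" curves, and the sample size is arbitrary:

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (a normal curve scores about 0)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(0)
meso = excess_kurtosis(rng.normal(size=100_000))       # normal curve: near 0
lepto = excess_kurtosis(rng.laplace(size=100_000))     # peaked curve: positive
platy = excess_kurtosis(rng.uniform(-1, 1, 100_000))   # flat, plate-like curve: negative
print(meso, lepto, platy)
```

A positive value flags a leptokurtic (tall) curve, a negative value a platykurtic (flat) one, and a value near zero a mesokurtic, normal-like curve.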
When the scores are dragged down (or rather, out) in only one direction, this creates a skew in our curve, so we need to talk about the skewness of our distribution. Negatively skewed distributions have a higher than expected frequency of high or extreme scores on the right, and the tail is pulled out to the left end of the number line on the x-axis. For example, if we were interested in the running speeds of football players, we might find a lot of very fast players (high scores) but only a few slower runners (low scores). Skewness is always caused by outliers in the direction of the tail.

In a positively skewed distribution, the higher than expected frequencies are on the low end of the curve, and the tail is pulled out on the right, or positive, end of the number line. If we were measuring reaction time, we would expect a large number of very quick responses (low scores) and only a few slower responses, taking more time, further up the positive end of that scale. Skewed distributions are not normal.

How can you remember which direction is positive or negative when we talk about skewness? Stats Cow tells us that the skew is in the tail. Skewness is caused by outliers, extreme scores in the tail of the distribution, and the direction that the tail is pulled out (positive or negative) is the direction of the skew.

Here are two curves. The first one is positively skewed, and the second is negatively skewed. The top curve is positively skewed because the tail is pulled out on the right, the positive direction of the number line. The bottom curve is negatively skewed; its tail is pulled out on the negative, or left, end of the number line. In both of these curves, you can see what happens to the mean and the median in the case of skewness: both are pulled in the direction of the outliers, but the mean is pulled further. That is because the mean is more susceptible to the outliers that are causing the skewness. Mathematically, we can calculate a measure of skewness by comparing the mean and the median, and this will give us a value we can use to quantify the skewness of our curve.

But there are other things that can go wrong with our normal curve!
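That mean-versus-median comparison can be turned into a single number. A minimal sketch using Pearson's second skewness coefficient, one common way to compare the mean and the median; the reaction-time numbers below are made up for illustration:

```python
import numpy as np

def pearson_skew(x):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    x = np.asarray(x, dtype=float)
    return 3.0 * (x.mean() - np.median(x)) / x.std()

rng = np.random.default_rng(1)
# Reaction times: many quick responses, a few slow ones out in the right tail.
reaction_times = rng.exponential(scale=0.3, size=10_000) + 0.2

skew = pearson_skew(reaction_times)
print(skew)  # positive: the mean sits to the right of the median
```

A positive value means the tail (and the mean) is pulled toward the high end of the number line; a negative value means the tail points left. Flipping the sign of every score flips the sign of the skew.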
Instead of having one peak, sometimes we have two peaks. This occurs when there is more than one most frequently occurring score; we call this type of curve bimodal. A curve can be bimodal when there really are two most frequently occurring scores. For instance, when is the best time to go fishing? At what time of day will you catch the most fish? Probably early in the morning, and then in the evening when the sun is going down. In the middle of the day, when the sun is at its height, you will catch fewer fish. So if we plot the number of fish caught, we will see a peak in the morning at dawn and another peak in the evening at dusk. This would be a true bimodal distribution.

On the other hand, we might have a bimodal distribution when there are actually two distributions overlying each other. When we had both males and females on the football field and we were comparing heights, we saw that there was a distribution for males and another distribution for females. The distributions overlapped - some females were taller than some males - but the average height was greater for males. They really were two distinct distributions that should be separated before being analyzed.

A multimodal distribution has three or more most frequently occurring scores.
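The overlapping-heights case described above is easy to simulate. A minimal sketch; the mean heights and standard deviation here are assumed values for illustration, not measurements:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two overlying distributions: assumed mean heights of 64" and 70", sd 2".
female = rng.normal(64, 2, 5_000)
male = rng.normal(70, 2, 5_000)
combined = np.concatenate([female, male])

# Count scores within half an inch of each point of interest.
near_64 = np.sum(np.abs(combined - 64) < 0.5)
near_67 = np.sum(np.abs(combined - 67) < 0.5)
near_70 = np.sum(np.abs(combined - 70) < 0.5)
print(near_64, near_67, near_70)  # two peaks with a dip between them
```

The combined sample piles up near 64 and near 70 with a dip in between: a bimodal curve that is really two distinct distributions and should be analyzed separately.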
You may wonder why we don't call it a trimodal distribution, or a quadrimodal distribution with four peaks. The answer is that when we start getting three, four, five modes, there is something very wrong in our data set. Three or more modes is multimodal, and it's messed up; we need to figure out what is going on before we try to analyze those data.

Rectangular distributions have the same frequency for all scores. If you roll a single die 100 times, how many times do you expect to get a one? About one-sixth of the time. In fact, you would expect to get each of the scores, one through six, approximately one-sixth of the time. That is a rectangular distribution. Once you add a second die, however, your distribution will begin to look more normal.
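The one-die-versus-two-dice claim can be simulated with the standard library alone; the trial count below is arbitrary:

```python
import random

random.seed(0)

def roll_totals(n_dice, trials=60_000):
    """Frequency of each total when rolling n_dice fair dice."""
    counts = {}
    for _ in range(trials):
        total = sum(random.randint(1, 6) for _ in range(n_dice))
        counts[total] = counts.get(total, 0) + 1
    return counts

one_die = roll_totals(1)   # rectangular: each face turns up about 1/6 of the time
two_dice = roll_totals(2)  # peaked at 7, tapering off toward 2 and 12
print(one_die)
print(two_dice)
```

With one die, every outcome has roughly the same frequency and the histogram has no tails. With two dice, the middle total (7) occurs far more often than the extremes (2 or 12), so the shape starts to resemble a normal curve.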
Rectangular distributions have exactly the same frequency for all scores, and they do not have tails.

Before we conclude, there is one more thing that I want to tell you about the normal curve: it can be overlaid with a number line, and this is where things get really interesting and quite useful. If we have a normal curve, we can add the value of the mean right in the middle, where it belongs. In this example we're going to imagine that our mean is 50, so we could lay out a number line with four-point delineations. Half of our scores will always be above the mean, or above 50, and the remaining half will always be below 50. That is what a measure of central tendency tells us: it is the point at which half of the scores fall above and half fall below (on a symmetric curve like the normal, the mean and the median coincide at that point).

The next thing that we could do is measure the proportion of the scores that fall within a certain range above or below the mean. The proportion is the total area under the normal curve that corresponds to the relative frequency of those scores. To better understand this, let's return to our picture of the people standing on the football field. Remember that everyone (100%) is standing below the rope that represents our distribution. We want to know the proportion of people who are between five foot six and five foot nine inches tall. We ask everyone in those rows - five foot six, seven, eight, and nine - to stay where they are; everyone else, please leave the field. So how many people are in those four rows? Divide the number of people in the four rows by the total number of people and you have a proportion. This is the proportion of people who are in that range underneath the distribution. It would also be the relative frequency of the number of people in that range, and this is going to become a very useful technique when we talk about z-scores. But for now, just remember what we've learned about the frequency table, and specifically how the relative frequency relates to what we know about the normal curve.
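The count-the-rows procedure above is just a relative-frequency calculation. A minimal sketch with made-up heights; the mean and standard deviation are assumptions for illustration, not the video's data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical heights in inches, rounded to whole-inch "rows" on the field.
heights = np.round(rng.normal(68, 3, 10_000)).astype(int)

# Keep only the rows for 5'6" (66") through 5'9" (69").
in_range = (heights >= 66) & (heights <= 69)
proportion = in_range.sum() / heights.size
print(proportion)  # relative frequency = area under the curve for that range
```

Dividing the count in the four rows by the total head-count gives a proportion, which is the same number as the relative frequency in a frequency table, and the same number as the area under the curve between those two heights.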