What are outliers, How to Detect Them, When to Remove Them (Trimmed Mean, Z Score, Modified Z Score) - YouTube
Channel: The Engineering Toolbox Channel
[5]
in the last video I showed how to remove
outliers from a data set using the IQR
[9]
method in this video I want to talk a
little bit more in depth about outliers
[12]
and some other methods of outlier
detection and removal I'll go over three
[16]
methods of trimming outliers from data
sets to get more accurate means the trimmed
[19]
mean method the standard z-score method
and the modified z-score method hello
[23]
everyone welcome to The Engineering Toolbox
Channel where I give you the tools
[26]
you need to solve real-world engineering
problems so just like in the last video
[30]
I want to preface this with saying
values determined as outliers aren't
[33]
always something that should be removed
in fact pretty much the opposite this is
[37]
very important to understand anyone with
a good understanding of stats that's
[40]
watching this video is probably yelling
at their screen already but hopefully
[43]
I'm able to explain my case typically in
statistics we're told that outliers
[46]
should not be removed unless they can be
proven to be erroneous data points these
[50]
are things that are either not possible
like an adult weighing 12 pounds or
[53]
something like that or we can gather at
least anecdotal evidence to prove that
[57]
the data point is invalid and this
totally makes sense however from an
[60]
engineering perspective we're often
trying to act on these statistics with
[63]
limited data time or ability to set up
controlled experiments so if 9 out of 10
[67]
data points suggest one thing and the
last looks to be an outlier I'm gonna
[70]
assume that's a special case and act on
the 90 percent but just remember that by
[74]
removing outliers from a data set we're
making an assumption and it's not
[77]
necessarily correct as you'll see in
this video there are different ways of
[80]
determining outliers that right there
shows us that outliers are somewhat
[83]
loosely defined the method used to
determine outliers directly affects what
[86]
is classified as one each data set and
scenario is different and without
[90]
understanding the method used to
determine outliers what we plan to do
[93]
with them and having an overall
conceptual understanding of the data
[95]
itself we can end up making incorrect
assumptions we should always ask
[99]
ourselves what the goal is and be
careful not to introduce bias by picking
[102]
a certain method or just removing
outliers just because it gives us the
[105]
numbers we want by the way if there's
any statisticians out there I'd love to
[108]
get your input on this anyways
let's dive in before we do this in Excel
[111]
let's quickly run through each method first
is the trimmed mean method this is not a
[115]
method for detecting outliers but it is
one way of removing the potential
[118]
outlier values in order to give us a
potentially more accurate mean it's very
[122]
important to note that this method could
very easily remove
[125]
non-outlier values it works by taking a certain
number of values or a certain percent
[129]
of the total count and removing them from
the mean calculation in this example the
[133]
top three and bottom three values are
being removed this method does not work
[136]
well for small data sets but it does
work well on skewed or fat-tailed
[139]
distributions the assumption we're
making here is that the data left after
[142]
trimming will be more normally
distributed and give a more useful
[144]
approximation of the mean next the
standard z-score method is a common way
[147]
to detect outliers in large normally
distributed data sets
[150]
it works by calculating a mean and
standard deviation for the entire data
[153]
set then calculating the z-score for
each data point then comparing those
[157]
z-scores to a certain threshold often
times values more than three two and a half or two
[161]
standard deviations from the mean can be
considered outliers you can start to see
[164]
how outlier detection can be pretty dicey
this method will not work with small
[168]
sample sizes definitely not under
around 12 because the max standard
[171]
z-score for that sample size is around
like three and it gets even smaller as
[175]
the sample size gets smaller last is the
modified z-score which is very robust
[179]
because it uses median values to
calculate a modified z-score value for
[183]
that we use a formula where we need to
find the MAD or median absolute
[187]
deviation now it's pretty hard to
explain this one visually but I'll try
[190]
to explain it more when we get into
Excel just for comparison let's look at
[193]
an example here are the resulting means
after removing the outliers each method
[197]
detected you can see that the values
vary depending on the method used
[202]
you may have also noticed that
essentially what's happening when we
[206]
remove outliers is the mean of the
remaining values moves closer to the
[209]
original median so you might ask why not
just use the median in the first place
[212]
depending on what you're trying to
achieve this is certainly an option we
[215]
need to ask ourselves again what is the
goal is it to approximate the normal
[219]
distribution of the data or just to
understand where the central tendency is
[221]
if you're just trying to understand the
central tendency where 50% of the values
[225]
are above and 50% are below then using
the median is obviously the way to go
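The link between trimming and the median can be sketched in a few lines of Python. This is an illustrative helper, not the Excel approach from the video: `trimmed_mean` and `trim_count` are hypothetical names, and the data is made up. It drops the same number of values from each end of the sorted data, like the top three and bottom three example above.

```python
import statistics


def trimmed_mean(values, trim_count):
    """Mean after dropping `trim_count` values from each end of the sorted data."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return statistics.mean(kept)


# Hypothetical sample: nine ordinary readings and one suspicious one.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 60]

# Trim progressively more values and watch the mean settle.
for k in range(5):
    print("trim", k, "from each end ->", trimmed_mean(data, k))

print("median ->", statistics.median(data))
```

Trimming as much as possible leaves only the middle one or two values, so the trimmed mean ends up equal to the median. Seen that way, the median is just the most aggressive trim, which is why it is a reasonable choice when central tendency is all you need.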
[229]
let's look at some simple examples where
we might want to remove outliers to get
[232]
more representative means if we're
trying to understand the distribution
[235]
and central tendency of a univariate
data set or one variable we might want
[239]
to detect and remove outliers to gain a
more realistic understanding of the
[242]
distribution in this example we can see
that the mean is very different
[245]
depending on the inclusion of the
outliers if my job were to create a
[249]
model to describe the interaction
between two variables like in this
[252]
regression analysis I would more than
likely exclude these outliers because
[255]
the resulting regression line does not
fit the majority of the data points
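As a rough illustration of how a single point can pull a regression line, here is a small Python sketch using an ordinary least-squares fit written out by hand. The data is hypothetical (roughly y = 2x with one extreme last point), `fit_line` is an illustrative helper, and the outlier is excluded by hand here rather than detected by any of the methods above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: y is roughly 2x, except for the last point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 40.0]

slope_all, intercept_all = fit_line(xs, ys)
slope_trim, intercept_trim = fit_line(xs[:-1], ys[:-1])  # last point excluded

print("with outlier:    slope", round(slope_all, 2), "intercept", round(intercept_all, 2))
print("without outlier: slope", round(slope_trim, 2), "intercept", round(intercept_trim, 2))
```

In this made-up sample the slope roughly doubles when the last point is kept, so the fitted line no longer describes the other seven points well.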
[257]
again this is making a huge assumption
that the data points aren't actually
[260]
representative of the interaction even
though we may choose to exclude the
[264]
outliers we should probably follow up
this analysis by implementing more
[267]
controls of the process and that brings
me to my final very important note as
[270]
engineers we should always ask the
question why are there outliers in the
[273]
first place just the fact that
there are outliers can tell us a lot
[277]
about the underlying process or the
method of measurement we should always
[280]
try to investigate outliers and try to
understand why they even exist often
[283]
times there's a ton of value in this
investigation now you might find that
[286]
there was just a simple error or you
might find that there is something
[288]
fundamentally wrong with the process
you're measuring the cool thing about
[291]
being an engineer is usually we aren't
trying to simply understand data just
[294]
for the sake of understanding it but
actually understand the data so we can
[297]
change the underlying process that is
driving it that's what makes outlier
[300]
detection so valuable all right I admit
I was all over the place in this one
[304]
probably because the topic of outliers
can be so nebulous and it's dependent on
[307]
the situation and the outcomes we're
looking for I felt like I was talking in
[310]
circles at times but sometimes that's
just the way my mind works now that we
[313]
hopefully have a better understanding of
outliers why detecting them is important
[316]
and why we might want to remove them
I'll go over how we can actually do this
[319]
using Excel in the next video anyway I
hope you found this video valuable and
[322]
if not at least interesting or
entertaining if you did make sure to
[325]
like and subscribe to get my weekly
videos if you have any questions or
[327]
there's anything I can go into more
depth on make sure to let me know and
[330]
maybe if you're lucky I'll make a video
about it and as always thank you so much
[333]
for watching