馃攳
Removing Outliers From Excel Pivot Table Part 2 - (Trimmed Mean, Standard Z, Modified Z Methods) - YouTube
Channel: The Engineering Toolbox Channel
[5]
in the last two videos I've been
discussing outliers first I showed you
[9]
how to use one of the most common and
robust measures of outliers to identify
[12]
and even remove those outliers from the
excel data set and cooler still I even
[15]
showed you how to remove them from a
given table summary for me that was a
[18]
two key method which uses quartiles and
the interquartile range to identify
[22]
outliers then in the last video I
covered three other methods of all our
[25]
infection or removal the trim mean
standard z-score and modified z-score
[29]
methods I also rounded things off a
little bit by going into some more depth
[32]
about outliers how to detect them when
to remove them and why all our detection
[36]
can be very valuable so if you haven't
watched it all - make sure to go back
[39]
and check them all now in this video I
actually want to go into Excel and
[42]
figure out how we can identify those
outliers using the three methods of
[45]
entry from the last video hey everyone
and welcome back to the engineering tool
[48]
box channel where I give you the tools
you need to solve real-world engineering
[51]
problems so in the first video we got up
to this point where we identified
[55]
outliers using the do key method now
let's look at the three other methods
[58]
that I discussed and compare the results
so first was the trim mean method that's
[63]
why I have a column for that and in
order to calculate this we'll have to
[66]
use the trim mean function and we'll
also make this an array function so
[71]
we're going to use an if statement to
say if this part number is equal to this
[76]
array of parts then we're going to look
at this array of values and then the
[86]
false will just enter a blank value
because it should always be true and
[89]
then the percent will use is 10% so
point one and 10% is a pretty typical
[96]
value but you might also want to use 20%
or something like that that might be a
[99]
bit extreme depending on your sample
size but what control shift enter to
[103]
make that an array function you'll
notice these little brackets so let's
[107]
take a look at the pivot table of this
will refresh our data sort of things up
[111]
to date and then we'll drop in the trim
mean value and
[115]
average and just for comparison let's
look at the hours per part we'll take
[120]
the average of that so this will be the
original mean and we can see that
[124]
they're pretty much the same except for
this one value so that was the only case
[130]
where it took out some values in order
to give us a trim mean the reason it
[134]
only affected this part number is
because we use the value of 10% for our
[140]
trim percentage that means it takes off
5% of the count or five percent of the
[145]
samples from both ends of the
distribution so as part number was the
[149]
only one that had more than 20 samples
so 5% of 20 is one so that means it
[154]
takes off one value from each end if
it's less than 20 it actually rounds
[158]
down so it won't actually trim any
values alright next let's look at the
[163]
standard z-score method for detecting
outliers so standard z-score or a lot of
[168]
column and for this we're actually going
to add another column and this is going
[172]
to be the standard deviation and the
standard deviation is used to calculate
[179]
AZ value so we'll need to calculate that
first and we're again we're gonna use an
[183]
array method we're gonna say standard
deviation and we'll want there's two
[189]
here this is for the population standard
deviation this is for the sample
[193]
standard deviation we're gonna use the
sample and then we'll do the standard
[197]
deviation if same thing we're gonna
match our part number to the array the
[204]
value of true is going to be our our
part values like the false control shift
[213]
enter this in the parentheses okay there
we go
[216]
this is going to be the standard
deviation for each part number we'll
[220]
also need to add in the mean so the
original mean or average this is going
[226]
to be average and if this part number
equals the rave
[235]
then use hours per partner rate Elia
falses playing close off the average
[243]
pick control-shift-n dirt
so now that we have a standard deviation
[247]
of mean we can calculate the standard
z-score and the z-score is calculated
[252]
with this equation basically we're
finding out how many standard deviations
[255]
are there between a given sample and the
media or mean of that sample so we could
[261]
just enter in or set up an equation
manually but Excel also has a built-in
[265]
function for calculating this so we can
say equals standardized find the x value
[271]
for that sample compare that to the mean
of all the samples and then feed the
[276]
function the standard deviation for
these samples and then when we hit enter
[281]
there we get our z-score values so if
you remember from the last video then I
[286]
said that we can determine outliers by
using the z-score and setting a limit
[290]
for the number of standard deviations
that were allowing data to be considered
[294]
non outliers so I said we could use plus
or minus two two-and-a-half or three
[299]
standard deviations or z-score values so
let's just use three as our limit on the
[307]
z-score values so let's say a z-score
outlier I'm glad I've gone for that then
[313]
we'll just set up an if statement say if
the absolute value of this is greater
[325]
than three then true and if not then
false
[331]
so now we can go back to our pivot table
and see how the mean changes whether or
[336]
not B includes ZL layers so let's just
kind of copy and paste these down to so
[341]
we can compare things so this is the
original mean original mean and then
[348]
we'll say trim to mean and now if we add
a slicer to our pivot table to look at
[356]
ZL our values we only want to look at
values that are false so if we basically
[361]
exclude any Z outliers we can compare
those values to our original means so
[367]
there's our original mean here is our
mean with the Z outliers excluded then
[375]
let's actually add in the two key
outliers this we can compare that as
[379]
well so for Tokyo iers add that in there
and we want those to be false and that
[387]
to be true as well let's look at that
looks like so now we can start to see
[393]
how each method returns different
results so now let's look at the last
[397]
method which is going to be the modified
z-score this one is tricky but it's
[403]
pretty robust so this is one of my
favorites it works pretty well for small
[407]
sample sizes and ends up returning
similar results to the Tukey method as
[411]
well because it uses median values to
determine all tires so we're going to
[415]
need to add two columns here we're gonna
say the MA D and modified z-score so the
[420]
MA D is going to be the median absolute
deviation of each data point from the
[425]
meaning so that's kind of hard to
explain so I'm actually gonna add
[427]
another column which will hopefully make
it a little bit easier to understand so
[431]
this is going to be the median or
actually the absolute
[438]
so now this value is going to be the
difference from our data point and the
[444]
sample median so add that actually sorry
that's the absolute value of that so
[450]
we're not concerned with direction we're
just concerned with how far our data
[454]
point is from the sample median and the
M ad then is going to be another array
[458]
function we're going to take the median
of these values will go quartile just
[462]
because I like to use that formula and
then this would be another array
[466]
function so we'll say if this part
number matches the part array again then
[472]
we're going to look to our absolute
deviation values nothing goes off the if
[480]
statement and then tell the quartile
function which quartile we're looking at
[485]
and that's the second quartile for the
median then control shift enter for that
[488]
and this is our median absolute
deviation so this tells us how far each
[494]
data point is from the meeting so in
order to calculate the modified z-score
[498]
now if we look back at our formula from
our last video here the modified z-score
[502]
equals this constant times the
difference between the data sample and
[506]
the sample median divided by the ma D
let's build this formula into our table
[511]
here they say point six four six seven
four five times the difference of our
[520]
data sample and our median quartile 2
and then all that divided by M ad and I
[528]
know if we go and through there these
are going to be the values for a
[531]
modified z-score the last thing we need
to do is add the column to say if it is
[536]
a mod z-score outlier so I'll add a
column here mod Z outlier right quick if
[543]
statement to say if the absolute value
of this
[548]
it is greater than 3.5 is typically the
value used as a threshold for outliers
[555]
when using the modified Zeitz for method
so if it is greater than 3.5 then we're
[560]
gonna return true to say yes it is an
all liar and if not false it is not
[563]
enough like and then now we have
identified our mod Z outliers so we can
[570]
go to our pivot table quick and take
this filter off refresh our data here we
[575]
want to add a slicer here so we can
filter out the false values or the true
[580]
values to filter out the outliers so
I'll just select just the false here is
[585]
the mean after modified Z outliers have
been removed so there are results so I'm
[594]
sure some of you might have been
wondering why we were using array
[597]
functions they can be a little bit
tricky but the reason we need to use
[600]
them is because we're trying to
summarize our data by part number to get
[605]
an average excluding these certain all
liars
[609]
that was kind of the overall goal that's
just to kind of compare the different
[612]
methods and see what they do for
identifying outliers how they change the
[615]
mean things like that then one other
note just be cautious of using these
[619]
array functions in large data sets
they're very nice that we can kind of
[623]
aggregate data by a different you know a
category here like we did for the part
[627]
numbers but if this were a really long
list the part numbers definitely a lot
[630]
bigger than this I'm talking in like the
thousands it would really slow down the
[633]
calculations that Excel has to do in
order to calculate our values so just be
[637]
cautious of that I've had spreadsheets
before in the past that are in the
[640]
thousands of part numbers and it really
gets bogged down so I'll just be
[645]
cautious all right I hope you all
enjoyed that one to get more videos as
[649]
they are released make sure to hit that
subscribe button if you like this video
[653]
make sure you like and share and if you
didn't like it well then leave me a
[656]
comment and let me know how I can
improve last but not least this channel
[659]
is for you so if you have any
suggestions for topics you want me to
[662]
cover make sure to let me know thanks
for watching we'll see you next time
Most Recent Videos:
You can go back to the homepage right here: Homepage





