Removing Outliers From Excel Pivot Table Part 2 - (Trimmed Mean, Standard Z, Modified Z Methods) - YouTube

Channel: The Engineering Toolbox Channel

[5]
in the last two videos I've been discussing outliers first I showed you
[9]
how to use one of the most common and robust measures of outliers to identify
[12]
and even remove those outliers from the excel data set and cooler still I even
[15]
showed you how to remove them from a given table summary for me that was a
[18]
two key method which uses quartiles and the interquartile range to identify
[22]
outliers then in the last video I covered three other methods of all our
[25]
infection or removal the trim mean standard z-score and modified z-score
[29]
methods I also rounded things off a little bit by going into some more depth
[32]
about outliers how to detect them when to remove them and why all our detection
[36]
can be very valuable so if you haven't watched it all - make sure to go back
[39]
and check them all now in this video I actually want to go into Excel and
[42]
figure out how we can identify those outliers using the three methods of
[45]
entry from the last video hey everyone and welcome back to the engineering tool
[48]
box channel where I give you the tools you need to solve real-world engineering
[51]
problems so in the first video we got up to this point where we identified
[55]
outliers using the do key method now let's look at the three other methods
[58]
that I discussed and compare the results so first was the trim mean method that's
[63]
why I have a column for that and in order to calculate this we'll have to
[66]
use the trim mean function and we'll also make this an array function so
[71]
we're going to use an if statement to say if this part number is equal to this
[76]
array of parts then we're going to look at this array of values and then the
[86]
false will just enter a blank value because it should always be true and
[89]
then the percent will use is 10% so point one and 10% is a pretty typical
[96]
value but you might also want to use 20% or something like that that might be a
[99]
bit extreme depending on your sample size but what control shift enter to
[103]
make that an array function you'll notice these little brackets so let's
[107]
take a look at the pivot table of this will refresh our data sort of things up
[111]
to date and then we'll drop in the trim mean value and
[115]
average and just for comparison let's look at the hours per part we'll take
[120]
the average of that so this will be the original mean and we can see that
[124]
they're pretty much the same except for this one value so that was the only case
[130]
where it took out some values in order to give us a trim mean the reason it
[134]
only affected this part number is because we use the value of 10% for our
[140]
trim percentage that means it takes off 5% of the count or five percent of the
[145]
samples from both ends of the distribution so as part number was the
[149]
only one that had more than 20 samples so 5% of 20 is one so that means it
[154]
takes off one value from each end if it's less than 20 it actually rounds
[158]
down so it won't actually trim any values alright next let's look at the
[163]
standard z-score method for detecting outliers so standard z-score or a lot of
[168]
column and for this we're actually going to add another column and this is going
[172]
to be the standard deviation and the standard deviation is used to calculate
[179]
AZ value so we'll need to calculate that first and we're again we're gonna use an
[183]
array method we're gonna say standard deviation and we'll want there's two
[189]
here this is for the population standard deviation this is for the sample
[193]
standard deviation we're gonna use the sample and then we'll do the standard
[197]
deviation if same thing we're gonna match our part number to the array the
[204]
value of true is going to be our our part values like the false control shift
[213]
enter this in the parentheses okay there we go
[216]
this is going to be the standard deviation for each part number we'll
[220]
also need to add in the mean so the original mean or average this is going
[226]
to be average and if this part number equals the rave
[235]
then use hours per partner rate Elia falses playing close off the average
[243]
pick control-shift-n dirt so now that we have a standard deviation
[247]
of mean we can calculate the standard z-score and the z-score is calculated
[252]
with this equation basically we're finding out how many standard deviations
[255]
are there between a given sample and the media or mean of that sample so we could
[261]
just enter in or set up an equation manually but Excel also has a built-in
[265]
function for calculating this so we can say equals standardized find the x value
[271]
for that sample compare that to the mean of all the samples and then feed the
[276]
function the standard deviation for these samples and then when we hit enter
[281]
there we get our z-score values so if you remember from the last video then I
[286]
said that we can determine outliers by using the z-score and setting a limit
[290]
for the number of standard deviations that were allowing data to be considered
[294]
non outliers so I said we could use plus or minus two two-and-a-half or three
[299]
standard deviations or z-score values so let's just use three as our limit on the
[307]
z-score values so let's say a z-score outlier I'm glad I've gone for that then
[313]
we'll just set up an if statement say if the absolute value of this is greater
[325]
than three then true and if not then false
[331]
so now we can go back to our pivot table and see how the mean changes whether or
[336]
not B includes ZL layers so let's just kind of copy and paste these down to so
[341]
we can compare things so this is the original mean original mean and then
[348]
we'll say trim to mean and now if we add a slicer to our pivot table to look at
[356]
ZL our values we only want to look at values that are false so if we basically
[361]
exclude any Z outliers we can compare those values to our original means so
[367]
there's our original mean here is our mean with the Z outliers excluded then
[375]
let's actually add in the two key outliers this we can compare that as
[379]
well so for Tokyo iers add that in there and we want those to be false and that
[387]
to be true as well let's look at that looks like so now we can start to see
[393]
how each method returns different results so now let's look at the last
[397]
method which is going to be the modified z-score this one is tricky but it's
[403]
pretty robust so this is one of my favorites it works pretty well for small
[407]
sample sizes and ends up returning similar results to the Tukey method as
[411]
well because it uses median values to determine all tires so we're going to
[415]
need to add two columns here we're gonna say the MA D and modified z-score so the
[420]
MA D is going to be the median absolute deviation of each data point from the
[425]
meaning so that's kind of hard to explain so I'm actually gonna add
[427]
another column which will hopefully make it a little bit easier to understand so
[431]
this is going to be the median or actually the absolute
[438]
so now this value is going to be the difference from our data point and the
[444]
sample median so add that actually sorry that's the absolute value of that so
[450]
we're not concerned with direction we're just concerned with how far our data
[454]
point is from the sample median and the M ad then is going to be another array
[458]
function we're going to take the median of these values will go quartile just
[462]
because I like to use that formula and then this would be another array
[466]
function so we'll say if this part number matches the part array again then
[472]
we're going to look to our absolute deviation values nothing goes off the if
[480]
statement and then tell the quartile function which quartile we're looking at
[485]
and that's the second quartile for the median then control shift enter for that
[488]
and this is our median absolute deviation so this tells us how far each
[494]
data point is from the meeting so in order to calculate the modified z-score
[498]
now if we look back at our formula from our last video here the modified z-score
[502]
equals this constant times the difference between the data sample and
[506]
the sample median divided by the ma D let's build this formula into our table
[511]
here they say point six four six seven four five times the difference of our
[520]
data sample and our median quartile 2 and then all that divided by M ad and I
[528]
know if we go and through there these are going to be the values for a
[531]
modified z-score the last thing we need to do is add the column to say if it is
[536]
a mod z-score outlier so I'll add a column here mod Z outlier right quick if
[543]
statement to say if the absolute value of this
[548]
it is greater than 3.5 is typically the value used as a threshold for outliers
[555]
when using the modified Zeitz for method so if it is greater than 3.5 then we're
[560]
gonna return true to say yes it is an all liar and if not false it is not
[563]
enough like and then now we have identified our mod Z outliers so we can
[570]
go to our pivot table quick and take this filter off refresh our data here we
[575]
want to add a slicer here so we can filter out the false values or the true
[580]
values to filter out the outliers so I'll just select just the false here is
[585]
the mean after modified Z outliers have been removed so there are results so I'm
[594]
sure some of you might have been wondering why we were using array
[597]
functions they can be a little bit tricky but the reason we need to use
[600]
them is because we're trying to summarize our data by part number to get
[605]
an average excluding these certain all liars
[609]
that was kind of the overall goal that's just to kind of compare the different
[612]
methods and see what they do for identifying outliers how they change the
[615]
mean things like that then one other note just be cautious of using these
[619]
array functions in large data sets they're very nice that we can kind of
[623]
aggregate data by a different you know a category here like we did for the part
[627]
numbers but if this were a really long list the part numbers definitely a lot
[630]
bigger than this I'm talking in like the thousands it would really slow down the
[633]
calculations that Excel has to do in order to calculate our values so just be
[637]
cautious of that I've had spreadsheets before in the past that are in the
[640]
thousands of part numbers and it really gets bogged down so I'll just be
[645]
cautious all right I hope you all enjoyed that one to get more videos as
[649]
they are released make sure to hit that subscribe button if you like this video
[653]
make sure you like and share and if you didn't like it well then leave me a
[656]
comment and let me know how I can improve last but not least this channel
[659]
is for you so if you have any suggestions for topics you want me to
[662]
cover make sure to let me know thanks for watching we'll see you next time