🔍

Removing Outliers From Excel Pivot Table Part 2 - (Trimmed Mean, Standard Z, Modified Z Methods) - YouTube

Channel: The Engineering Toolbox Channel

[5]

in the last two videos I've been discussing outliers first I showed you

[9]

how to use one of the most common and robust measures of outliers to identify

[12]

and even remove those outliers from the excel data set and cooler still I even

[15]

showed you how to remove them from a given table summary for me that was a

[18]

two key method which uses quartiles and the interquartile range to identify

[22]

outliers then in the last video I covered three other methods of all our

[25]

infection or removal the trim mean standard z-score and modified z-score

[29]

methods I also rounded things off a little bit by going into some more depth

[32]

about outliers how to detect them when to remove them and why all our detection

[36]

can be very valuable so if you haven't watched it all - make sure to go back

[39]

and check them all now in this video I actually want to go into Excel and

[42]

figure out how we can identify those outliers using the three methods of

[45]

entry from the last video hey everyone and welcome back to the engineering tool

[48]

box channel where I give you the tools you need to solve real-world engineering

[51]

problems so in the first video we got up to this point where we identified

[55]

outliers using the do key method now let's look at the three other methods

[58]

that I discussed and compare the results so first was the trim mean method that's

[63]

why I have a column for that and in order to calculate this we'll have to

[66]

use the trim mean function and we'll also make this an array function so

[71]

we're going to use an if statement to say if this part number is equal to this

[76]

array of parts then we're going to look at this array of values and then the

[86]

false will just enter a blank value because it should always be true and

[89]

then the percent will use is 10% so point one and 10% is a pretty typical

[96]

value but you might also want to use 20% or something like that that might be a

[99]

bit extreme depending on your sample size but what control shift enter to

[103]

make that an array function you'll notice these little brackets so let's

[107]

take a look at the pivot table of this will refresh our data sort of things up

[111]

to date and then we'll drop in the trim mean value and

[115]

average and just for comparison let's look at the hours per part we'll take

[120]

the average of that so this will be the original mean and we can see that

[124]

they're pretty much the same except for this one value so that was the only case

[130]

where it took out some values in order to give us a trim mean the reason it

[134]

only affected this part number is because we use the value of 10% for our

[140]

trim percentage that means it takes off 5% of the count or five percent of the

[145]

samples from both ends of the distribution so as part number was the

[149]

only one that had more than 20 samples so 5% of 20 is one so that means it

[154]

takes off one value from each end if it's less than 20 it actually rounds

[158]

down so it won't actually trim any values alright next let's look at the

[163]

standard z-score method for detecting outliers so standard z-score or a lot of

[168]

column and for this we're actually going to add another column and this is going

[172]

to be the standard deviation and the standard deviation is used to calculate

[179]

AZ value so we'll need to calculate that first and we're again we're gonna use an

[183]

array method we're gonna say standard deviation and we'll want there's two

[189]

here this is for the population standard deviation this is for the sample

[193]

standard deviation we're gonna use the sample and then we'll do the standard

[197]

deviation if same thing we're gonna match our part number to the array the

[204]

value of true is going to be our our part values like the false control shift

[213]

enter this in the parentheses okay there we go

[216]

this is going to be the standard deviation for each part number we'll

[220]

also need to add in the mean so the original mean or average this is going

[226]

to be average and if this part number equals the rave

[235]

then use hours per partner rate Elia falses playing close off the average

[243]

pick control-shift-n dirt so now that we have a standard deviation

[247]

of mean we can calculate the standard z-score and the z-score is calculated

[252]

with this equation basically we're finding out how many standard deviations

[255]

are there between a given sample and the media or mean of that sample so we could

[261]

just enter in or set up an equation manually but Excel also has a built-in

[265]

function for calculating this so we can say equals standardized find the x value

[271]

for that sample compare that to the mean of all the samples and then feed the

[276]

function the standard deviation for these samples and then when we hit enter

[281]

there we get our z-score values so if you remember from the last video then I

[286]

said that we can determine outliers by using the z-score and setting a limit

[290]

for the number of standard deviations that were allowing data to be considered

[294]

non outliers so I said we could use plus or minus two two-and-a-half or three

[299]

standard deviations or z-score values so let's just use three as our limit on the

[307]

z-score values so let's say a z-score outlier I'm glad I've gone for that then

[313]

we'll just set up an if statement say if the absolute value of this is greater

[325]

than three then true and if not then false

[331]

so now we can go back to our pivot table and see how the mean changes whether or

[336]

not B includes ZL layers so let's just kind of copy and paste these down to so

[341]

we can compare things so this is the original mean original mean and then

[348]

we'll say trim to mean and now if we add a slicer to our pivot table to look at

[356]

ZL our values we only want to look at values that are false so if we basically

[361]

exclude any Z outliers we can compare those values to our original means so

[367]

there's our original mean here is our mean with the Z outliers excluded then

[375]

let's actually add in the two key outliers this we can compare that as

[379]

well so for Tokyo iers add that in there and we want those to be false and that

[387]

to be true as well let's look at that looks like so now we can start to see

[393]

how each method returns different results so now let's look at the last

[397]

method which is going to be the modified z-score this one is tricky but it's

[403]

pretty robust so this is one of my favorites it works pretty well for small

[407]

sample sizes and ends up returning similar results to the Tukey method as

[411]

well because it uses median values to determine all tires so we're going to

[415]

need to add two columns here we're gonna say the MA D and modified z-score so the

[420]

MA D is going to be the median absolute deviation of each data point from the

[425]

meaning so that's kind of hard to explain so I'm actually gonna add

[427]

another column which will hopefully make it a little bit easier to understand so

[431]

this is going to be the median or actually the absolute

[438]

so now this value is going to be the difference from our data point and the

[444]

sample median so add that actually sorry that's the absolute value of that so

[450]

we're not concerned with direction we're just concerned with how far our data

[454]

point is from the sample median and the M ad then is going to be another array

[458]

function we're going to take the median of these values will go quartile just

[462]

because I like to use that formula and then this would be another array

[466]

function so we'll say if this part number matches the part array again then

[472]

we're going to look to our absolute deviation values nothing goes off the if

[480]

statement and then tell the quartile function which quartile we're looking at

[485]

and that's the second quartile for the median then control shift enter for that

[488]

and this is our median absolute deviation so this tells us how far each

[494]

data point is from the meeting so in order to calculate the modified z-score

[498]

now if we look back at our formula from our last video here the modified z-score

[502]

equals this constant times the difference between the data sample and

[506]

the sample median divided by the ma D let's build this formula into our table

[511]

here they say point six four six seven four five times the difference of our

[520]

data sample and our median quartile 2 and then all that divided by M ad and I

[528]

know if we go and through there these are going to be the values for a

[531]

modified z-score the last thing we need to do is add the column to say if it is

[536]

a mod z-score outlier so I'll add a column here mod Z outlier right quick if

[543]

statement to say if the absolute value of this

[548]

it is greater than 3.5 is typically the value used as a threshold for outliers

[555]

when using the modified Zeitz for method so if it is greater than 3.5 then we're

[560]

gonna return true to say yes it is an all liar and if not false it is not

[563]

enough like and then now we have identified our mod Z outliers so we can

[570]

go to our pivot table quick and take this filter off refresh our data here we

[575]

want to add a slicer here so we can filter out the false values or the true

[580]

values to filter out the outliers so I'll just select just the false here is

[585]

the mean after modified Z outliers have been removed so there are results so I'm

[594]

sure some of you might have been wondering why we were using array

[597]

functions they can be a little bit tricky but the reason we need to use

[600]

them is because we're trying to summarize our data by part number to get

[605]

an average excluding these certain all liars

[609]

that was kind of the overall goal that's just to kind of compare the different

[612]

methods and see what they do for identifying outliers how they change the

[615]

mean things like that then one other note just be cautious of using these

[619]

array functions in large data sets they're very nice that we can kind of

[623]

aggregate data by a different you know a category here like we did for the part

[627]

numbers but if this were a really long list the part numbers definitely a lot

[630]

bigger than this I'm talking in like the thousands it would really slow down the

[633]

calculations that Excel has to do in order to calculate our values so just be

[637]

cautious of that I've had spreadsheets before in the past that are in the

[640]

thousands of part numbers and it really gets bogged down so I'll just be

[645]

cautious all right I hope you all enjoyed that one to get more videos as

[649]

they are released make sure to hit that subscribe button if you like this video

[653]

make sure you like and share and if you didn't like it well then leave me a

[656]

comment and let me know how I can improve last but not least this channel

[659]

is for you so if you have any suggestions for topics you want me to

[662]

cover make sure to let me know thanks for watching we'll see you next time

Most Recent Videos:

WE KILLED 6 HEROIC BOSSES! - YouTube

¿Quién inventó el dinero? - YouTube

Cuándo se inventó el dinero y cómo el dólar se convirtió en la principal moneda del mundo - YouTube

This Citizenship Program is Failing - YouTube

Candida Treatment Protocol w/ Dr. DiNezza - YouTube

$500M investor reacts to Real Estate Tik Toks 2 - YouTube

You can go back to the homepage right here: Homepage