Log normal distribution | Math, Statistics for data science, machine learning - YouTube

Channel: codebasics

[0]
we need to understand normal
[2]
distribution before we move on to
[4]
log normal distribution. here I have
[6]
a dataset of people's heights
[8]
and if you plot that on a histogram it
[11]
looks like a bell curve so
[12]
if you are aware of the normal
[15]
distribution you know that
[16]
this shape is called a bell curve and this
[19]
is
[20]
a normal distribution. there are many
[22]
examples of normal distribution in real
[24]
life, such as test scores,
[26]
employee performance and so on but let's
[29]
think about
[30]
a dataset of people's incomes. here most of
[32]
the people have income
[34]
around 50,000 here in the US
[37]
but on the higher end you could have
[39]
people like
[40]
Jeff Bezos, Elon Musk you know they could
[42]
be earning
[43]
a lot more than the regular population
[46]
so this curve
[47]
is right skewed actually it looks very
[50]
different from a normal distribution
[52]
because on the right hand side this tail
[54]
it kind of never ends you know because
[57]
people might earn 1 billion, 2 billion;
[59]
the income could be really
[61]
high
[62]
whereas if you're thinking about let's
[64]
say employee performance,
[66]
it could be just in a limited range
[69]
a test score can never be more than 100.
[72]
that's why these distributions form a
[75]
bell curve whereas
[76]
this other distribution is right
[80]
skewed, and
[81]
in the chart this tail can really get
[84]
very long. but if you apply a log function
[87]
to the x
[88]
axis, so again, I will apply a log function
[90]
to the x axis,
[92]
then it becomes a normal distribution
[94]
you see
[95]
I'm adding a zero between these two
[98]
numbers:
[99]
this is multiplied by 10, this is
[101]
multiplied by 10.
[103]
this is the fundamental idea behind log
[106]
so again, when I apply a log
[109]
function
[110]
to this axis this x axis
[113]
the distribution becomes normal
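To make the idea above concrete, here is a minimal sketch using only Python's standard library. The incomes are synthetic, not the data shown in the video: the median of 50,000 and the spread of 0.8 are assumptions chosen purely for illustration, and `skewness` is a small helper written for this example. It draws right-skewed values and checks that their skewness is large before taking the log and close to zero afterwards.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness: near 0 for a symmetric bell curve, positive for a long right tail."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


random.seed(0)  # fixed seed so the run is repeatable

# Synthetic incomes (an assumption, not real data): lognormvariate(mu, sigma)
# returns values whose natural log is normally distributed with mean mu
# and standard deviation sigma, so the median income here is about 50,000.
incomes = [random.lognormvariate(math.log(50_000), 0.8) for _ in range(20_000)]

# Apply a base-10 log, matching the "multiply by 10" ticks on a log axis.
log_incomes = [math.log10(x) for x in incomes]

print(f"skewness of income:        {skewness(incomes):.2f}")
print(f"skewness of log10(income): {skewness(log_incomes):.2f}")
```

The first number should come out clearly positive (a long right tail) and the second close to zero (roughly symmetric), which is what "the distribution becomes normal" looks like in numbers rather than in a chart.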
[116]
okay, so here is what I did: I had this
[120]
distribution, I applied log
[122]
of income, and it became a bell curve, and
[125]
when that happens this original
[127]
distribution is called
[129]
log normal distribution so again if you
[132]
get a normal distribution by applying a
[133]
log function to a data set
[135]
then the data set is said to have a log
[138]
normal
[139]
distribution all right there are other
[142]
examples of log normal distribution such
[144]
as
[144]
hospitalization days most of the people
[147]
spend 5, 10, or 15 days.
[149]
I hope you don't have to spend
[152]
any days
[152]
in the hospital but there are
[155]
unfortunately
[156]
some critically ill patients they spend
[158]
300 days
[159]
400 days my wife works in a hospital and
[161]
she says
[162]
there are people who spend many days in
[164]
the hospital. so this is also
[167]
a log-normally distributed graph.
[170]
another example is advertising budget: small or mid-tier
[173]
companies will not have much
[175]
advertising budget, but the big companies,
[179]
the companies that have
[181]
higher revenues and a lot of consumers,
[183]
might have a huge budget, you know,
[187]
500 million
[188]
1 billion 10 billion so in this case
[190]
also when you're doing some budget
[192]
analysis
[193]
you will come across this type of log
[196]
normal
[196]
distribution. how is the log normal distribution
[199]
used in data science? well, we have
[200]
seen this example before
[202]
but let's say you are trying to build a
[204]
machine learning model
[205]
which can predict if you want to give a
[207]
loan to a person or not
[209]
you are doing some credit risk analysis
[211]
you want to figure out
[212]
if you want to approve a loan for a
[215]
given person or not
[216]
here you can see that Puja has a
[220]
lot of income she seems to be a rich
[222]
lady
[222]
and this value is quite different
[226]
than other values so if you're using
[228]
income as your independent variable
[230]
in building your machine learning model,
[233]
the model might not get high
[235]
accuracy, because the general principle
[236]
of machine learning models is that
[238]
if the numbers are on a similar
[241]
scale
[242]
then the model will perform better so
[245]
you can apply a log transform on this
[248]
income column and come up with a new column
[251]
called log income,
[252]
where by applying the log function you will
[256]
get the values in a similar
[257]
range. you see 4.9, 4.8, and now Puja,
[261]
although
[261]
she has a high income after applying log
[263]
you have 5.7 which is kind of in a
[265]
similar
[266]
range as other numbers so just to
[269]
summarize
[270]
log transform is a popular technique
[273]
where
[274]
if you have a log normal
[275]
distribution, you apply a log transform
[278]
and use that transformed column
[281]
as a feature in building your machine
[285]
learning
[285]
model. we're going to show
[289]
the log normal distribution using the seaborn
[291]
library in Python.
[292]
here I have a U.S. income data set which I
[296]
got from the census.gov website,
[299]
so these are the ranges of people's income
[302]
and
[302]
let's say between twenty thousand and
[305]
twenty five thousand dollars
[306]
there are six thousand people, you know,
[308]
and I used this data, but I came
[311]
up with a simplified
[313]
version of this file where I have only
[315]
two columns
[316]
one is income and the other is count. so the count of people
[319]
who have income
[321]
up to five thousand dollars is this,
[322]
between 5,000 and 10,000
[324]
is this, and so on. and I
[327]
loaded that here into my pandas data
[331]
frame
[331]
see I loaded that and my data frame
[334]
looks like this
[335]
and I use the seaborn library to plot a bar
[337]
plot,
[338]
and you can see my bar plot looks
[340]
log normally distributed.
[342]
by the way, I skipped all
[345]
the data with more than two
[347]
hundred thousand dollars of income. if I
[349]
include all of that,
[350]
you'll see a long tail you know very
[352]
right skewed graph here
[354]
but this also looks roughly normally
[357]
distributed.
[358]
when you apply log to the x scale,
[361]
you see it becomes more like a normal
[364]
distribution.
[365]
just ignore the bar width here they are
[367]
not uniform
[368]
but overall, if you see, the chart
[370]
looks more like
[372]
a bell curve, you see that?
[375]
all right, I guess you have a pretty good
[376]
understanding of log normal distribution
[379]
the link to this code is given in the
[382]
video description below
[383]
if you like this video please share it
[385]
with your friends it's a simple concept
[387]
but
[387]
we see log normal distribution in our
[389]
day to day life and
[391]
while solving data science problems you
[393]
will come across this,
[394]
and if it's creating a problem for your
[397]
machine learning model's accuracy,
[399]
don't forget to apply log transform
[402]
thank you
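
As a recap of the log transform step from the loan example, here is a minimal sketch using only Python's standard library. The applicant names "A" and "B" and all three income amounts are assumptions, not the video's actual table; they were picked so that the base-10 log lands near the 4.9, 4.8, and 5.7 mentioned in the video. Base 10 is assumed because the video describes each step on the log axis as "multiplied by 10".

```python
import math

# Hypothetical applicants: names "A"/"B" and all amounts are illustrative
# assumptions, not the data shown in the video.
applicants = [
    {"name": "A", "income": 80_000},
    {"name": "B", "income": 63_000},
    {"name": "Puja", "income": 500_000},
]

# Add a log_income value next to each raw income (base-10 log, 1 decimal).
for row in applicants:
    row["log_income"] = round(math.log10(row["income"]), 1)

incomes = [row["income"] for row in applicants]
log_incomes = [row["log_income"] for row in applicants]

# Compare how far apart the largest and smallest values are on each scale.
print("raw income, max / min:", round(max(incomes) / min(incomes), 2))
print("log income, max / min:", round(max(log_incomes) / min(log_incomes), 2))
print("log_income values:", log_incomes)
```

With a pandas data frame like the one in the video, the same step would typically be a single line such as `df["log_income"] = np.log10(df["income"])`, where the column names are again assumptions.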