Statistics 101: Logistic Regression, An Introduction - YouTube
Channel: Brandon Foltz
Hello and welcome, Brandon here. Thanks for choosing my video. If you liked the video, please give it a thumbs up. If you think someone you know can also benefit by watching, please share. And as always, please subscribe; I appreciate it very much.
So let's go ahead and get started. Here we are in logistic regression: a very useful, if in my opinion underutilized, statistical procedure that is not all that intuitive, which is maybe why it's underutilized. Now, as with many of my other videos, we're going to start out with an actual problem. This problem is one that I made up, so I made up the text and the data for it; just keep that in mind going forward. However, I do think it has the side benefit of being potentially useful in your everyday life. So let's go ahead and take a look at it.

We'll call it "first-time homebuyer." As a first-time homebuyer, you are busy organizing your financial records so you can apply for a home mortgage. As part of this process, you order a copy of your credit report to check for errors and gauge your credit score, which can range, at least here in the US, from 300 to 850. Now, lenders will factor in your credit score when deciding to approve or not approve you for a mortgage. They will factor in other things too, like your income and how long you've been at your job, but your credit score is definitely an important part of their decision. It turns out your credit score is 720 on that scale of 300 to 850.

While doing your research, which you are dutifully doing as a potential homebuyer, you find some raw data online. There's data floating around the web everywhere, and you are lucky enough to come across a data set that has 1,000 applicant credit scores and whether or not the application was approved, yes or no, for the home mortgage. Using the data you found, you would like to do the following (and we will do all of these as we progress throughout the video series):

1. Develop a model that will provide the probability and the odds of being approved for any given credit score.

2. Discover approximately what credit score is associated with a probability of 50% of being approved, so the odds are even. In other words, if I walk into the bank with a certain credit score, it's basically like flipping a coin: my probability of being approved is 50%, which is the same as saying the odds are even. We want to know what credit score that is on our scale.

3. Input your score of 720 into the model to determine the probability and the odds of you being approved for a mortgage, which of course is very important to you.

4. Finally, determine how improving your credit score from 720 to 750 would affect your probability and odds of being approved for the mortgage. So let's say you find out your score is 720, and you're going to wait a little bit and see if you can get your credit score a bit higher, up to 750, by paying down some debt. Or maybe you know you're going to get a promotion and a higher salary sometime soon, and you think your score may improve. You want to know how that improvement in your score would affect your probability and odds of being approved for the mortgage.
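Probability and odds both appear in these goals, and they are easy to mix up. As a quick refresher (this is my own illustrative sketch, not something shown in the video), odds are the probability of an event divided by the probability of its complement:

```python
def odds_from_prob(p):
    """Convert a probability (0 <= p < 1) to odds = p / (1 - p)."""
    return p / (1.0 - p)

def prob_from_odds(odds):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

# A 50% probability of approval means even odds (1 to 1):
print(odds_from_prob(0.50))   # 1.0
# A 75% probability means odds of 3 to 1 in favor:
print(odds_from_prob(0.75))   # 3.0
# Converting back recovers the probability:
print(prob_from_odds(3.0))    # 0.75
```

This even-odds case is exactly what goal number two asks about: the credit score where the probability of approval is 50%.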
Here is just a little chunk of that 1,000-observation data set. There are only 15 rows here, but I wanted you to see how it's organized: we have the credit score on the left and approved on the right. Again, the n is 1,000, and the credit score is the applicant's credit score, from 300 all the way up to 850. Approved is coded as a 1 for approved and a 0 for not approved, so it is binary; it is a dichotomous variable, and it is mutually exclusive. You're either approved or you're not approved; there's no in-between.
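To get a feel for data organized this way, here is a small sketch. Since the video's made-up data set isn't reproduced here, I generate a synthetic stand-in (the underlying S-shaped relationship is my own assumption), then group credit scores into 100-point bins and compute the fraction approved in each bin. The proportion approved climbs as the score rises:

```python
import random
import math

random.seed(42)

def true_prob(score):
    # Hypothetical underlying relationship (an S-shaped curve), assumed
    # purely to generate illustrative data.
    return 1.0 / (1.0 + math.exp(-(score - 650) / 40.0))

# Each row is (credit_score, approved), with approved coded 1 or 0.
data = []
for _ in range(1000):
    score = random.randint(300, 850)
    approved = 1 if random.random() < true_prob(score) else 0
    data.append((score, approved))

# Fraction approved within each 100-point score bin.
for lo in range(300, 900, 100):
    rows = [a for s, a in data if lo <= s < lo + 100]
    if rows:
        print(f"{lo}-{lo + 99}: {sum(rows) / len(rows):.2f} approved")
```

The binned proportions trace out the rising curve that logistic regression will model directly.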
Now, as a good analyst and a good stat student, or whatever it is you might be, you create a scatterplot of your 1,000 observations, but it looks like this. What is this? If you look on the left-hand side, we have approved, with 0 at the bottom, meaning the application was not approved, and 1 at the top, meaning the application was approved. But we have the data points in two lines against the credit score on the bottom. (FICO score is just a certain type of credit score that's widely used.) So if a dot is on the bottom, that means for that credit score the application was not approved; if it's at the top, it means it was approved. Now, how can we put a best-fit regression line on a scatterplot that looks like this? It doesn't make any sense to do it the way we usually would in normal linear regression. So obviously we're going to come up with some other technique, and that's what logistic regression allows us to do. Now that we have set the stage with the problem, we're going to look at: what is logistic regression?
Logistic regression seeks to do the following, among other things. It seeks to model the probability of an event occurring depending on the values of the independent variables, in this case credit score, which can be categorical or numerical. It seeks to estimate the probability that an event occurs for a randomly selected observation versus the probability that the event does not occur; so for a random observation in the data, or some other observation we want to predict, we estimate the probability that the event occurs versus the probability that it does not. It seeks to predict the effect of a series of variables on a binary response variable; in this case we only have one independent variable, credit score, but we can have more, so logistic regression can work a lot like multiple regression, with several independent variables and one dependent variable that is binary, 0 or 1. And we can also seek to classify observations by estimating the probability that an observation is in a particular category; in this case the applicant is either in the approved category or the not-approved category. So: model, estimate, predict, and classify.
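Those four jobs can all be sketched with the logistic (sigmoid) function itself. The coefficients below are hypothetical, chosen only for illustration; fitting real coefficients to data is what the later videos in the series cover:

```python
import math

# Hypothetical coefficients, for illustration only (not fitted to data):
# log-odds of approval = b0 + b1 * credit_score
b0, b1 = -16.0, 0.025

def prob_approved(score):
    """Model/estimate: map a credit score to a probability of approval
    via the logistic function, which always lands between 0 and 1."""
    log_odds = b0 + b1 * score
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(score, threshold=0.5):
    """Classify: assign the approved (1) or not-approved (0) category."""
    return 1 if prob_approved(score) >= threshold else 0

# Predict for one applicant with a score of 670:
p = prob_approved(670)
print(f"P(approved | 670) = {p:.3f}, odds = {p / (1 - p):.3f}")
print(f"category: {classify(670)}")
```

Note that whatever score goes in, the probability that comes out stays inside the 0-to-1 range, which is exactly the property ordinary linear regression lacks.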
So let's try to understand and visualize the problem we're working with. In this case we have a bunch of credit scores: an applicant walks into the bank with some sort of credit score. The bank or other lending institution feeds that into their lending model; the credit score goes into the model, and when it comes out, it's either approved or not approved. That black box in the middle is what we're trying to understand. We could ask: what is the probability that an application with a credit score of 670 would end up in the approved category up here on the top? So credit scores get put into some decision model by the bank or other lender, and then the bank or lender puts that application into the approved or not-approved category. That's basically what we're trying to model in this logistic regression problem.
Now, I am kind of making the assumption that if you're studying logistic regression, you have to some extent studied simple linear regression and multiple regression. If you've studied those, you might have a very good question: why can't I use one of those for this type of problem? Well, here's why. Number one, simple linear regression is one quantitative variable predicting another quantitative variable; in this case we have a dichotomous dependent variable, approved or not approved, 1 or 0, which is not a quantitative variable. Multiple regression is just simple regression with more independent variables, so those are basically the same type of problem. We also have nonlinear regression, but that's still two quantitative variables where the data is curvilinear. Now, if we ignored those warnings, running a typical linear regression in the usual way on this type of data has some major problems.
First, binary data, in this case approved or not approved, does not have a normal distribution, and you can see that by looking at the scatterplot; a normal distribution is a condition needed for most other types of regression. Second, the predicted values of the dependent variable can go beyond 0 and 1 in those other types of regression. Remember, in logistic regression we're dealing with probabilities, and the rule of probability is that it has to be between 0 and 1; if we use the other types of regression, the predicted values can fall outside 0 and 1, which obviously is not going to work. Third, probabilities are often not linear, taking shapes such as U shapes, where the probability is very low or very high at the extremes of the x values. You can probably think of different examples. One example could be the probability of contracting the flu: the probability of getting the flu is higher if you're younger, say a baby, infant, or toddler, and if you're older, say in your 60s, 70s, and 80s. So the probability is higher at the extremes than it is in the middle; probabilities often have different shapes in their distribution along the x variable.
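That second problem, predictions escaping the 0-to-1 range, is easy to demonstrate. The sketch below is my own illustration on synthetic 0/1 data (the S-shaped generating curve is an assumption): it fits an ordinary least-squares line to the binary outcomes and shows the fitted line predicting "probabilities" below 0 and above 1 at extreme scores, something a logistic curve can never do:

```python
import random
import math

random.seed(0)

# Synthetic binary data: approval becomes more likely as the score rises.
scores = [random.randint(300, 850) for _ in range(1000)]
approved = [1 if random.random() < 1 / (1 + math.exp(-(s - 600) / 40)) else 0
            for s in scores]

# Ordinary least-squares fit of approved on score (textbook slope and
# intercept formulas for simple linear regression).
n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(approved) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, approved))
         / sum((x - mean_x) ** 2 for x in scores))
intercept = mean_y - slope * mean_x

linear_pred = lambda s: intercept + slope * s

# The straight line leaves the [0, 1] probability range at the extremes:
print(f"linear prediction at 300: {linear_pred(300):.3f}")  # below 0
print(f"linear prediction at 850: {linear_pred(850):.3f}")  # above 1

# A logistic curve, by contrast, is bounded: 0 < logistic(z) < 1 for any z.
logistic = lambda z: 1 / (1 + math.exp(-z))
```

A straight line simply has no way to flatten out near 0 and 1, which is why logistic regression swaps it for an S-shaped curve.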
So now that we have set the stage by introducing our problem and going over the basic conceptual foundation of what logistic regression is, let's talk about where we're going in the next video. In the next video we will do the following. We will review basic probability; we won't go into much depth, we'll just go over the basics, because understanding probability is central to learning about logistic regression. We will learn about what odds are and what the odds ratio is, because again, that's central to understanding logistic regression. We will briefly discuss how to interpret the odds ratio in a logistic regression context. And finally, we will note things we have to keep in mind when interpreting the odds ratio: the odds ratio is related to probability, of course, but there are some dangers in how we interpret it, and we'll definitely discuss that in the next video. So let's go ahead and wrap up this video, and I will see you in the next one.