Multiple Linear Regression - Variance Inflation Factor - Part 1 - YouTube

Channel: unknown

welcome back to this session on multiple regression. in the previous session we saw how a path diagram helps us understand the direct and indirect effects of an explanatory variable on the response variable, and we saw that path diagrams are most relevant when the explanatory variables are correlated. today we are going to extend that discussion and talk about one more quantification of this collinearity. when the explanatory variables are related to one another, that situation is referred to as collinearity, or multicollinearity.
we have seen the effect of collinearity through the path diagram. what we are going to do today is look at what is called the variance inflation factor, VIF, which is also a quantification of collinearity amongst the explanatory variables. so first of all, what is the variance inflation factor? the variance inflation factor is based on the amount of unique variation in each explanatory variable, and it essentially measures the effect of collinearity.
in particular, the VIF for a particular explanatory variable is calculated as 1 / (1 - R^2). this R^2 is not the regular R^2 of the multiple linear regression; it is the coefficient of determination of a special regression in which that particular explanatory variable is the response variable and all the other explanatory variables are the explanatory variables. what do i mean by that? let us say there is a multiple linear regression in which the explanatory variables are x1, x2, x3, x4, and we are trying to study the impact of these explanatory variables on the response variable y.
now what will the VIF of x1 be? VIF(x1) = 1 / (1 - R1^2). what is this R1^2? R1^2 is the coefficient of determination in a regression where x1 is the response variable and x2, x3, x4 are the explanatory variables. similarly, when we calculate the VIF of x2, it is 1 / (1 - R2^2), where R2^2 comes from the regression in which x2 is the response variable and x1, x3, x4 are the explanatory variables.
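the auxiliary-regression recipe above can be sketched in code. this is a minimal illustration with made-up data (the variable names and the helper `vif` are not from the lecture): each column in turn is regressed on the remaining columns, and 1 / (1 - R^2) of that fit is that column's VIF.

```python
import numpy as np

def vif(X, j):
    """VIF of column j of X: regress x_j on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2) of that fit."""
    y = X[:, j]                                   # x_j plays the response
    others = np.delete(X, j, axis=1)              # remaining explanatory variables
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

# hypothetical data: x2 partly depends on x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 2) for j in range(3)])   # x1, x2 inflated; x3 near 1
```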
okay, so now what happens if this R^2 is a large value? when will the R^2 be large? the R^2 will be large if that special regression is significant, which means that the explanatory variable x1 is fairly correlated with x2, x3, x4. if that happens, this R^2 will be large, and if this R^2 is large, the VIF will be large. so this is how you quantify the variance inflation factor. now, why is it called the variance inflation factor?
now, if you recall, we discussed the partial slope and the marginal slope. what is the expression from which we estimate the partial slopes? the model is y = beta0 + beta1 x1 + beta2 x2 and so on, and beta1, beta2 are the partial slopes. from the sample of data on which we run the regression, i am going to get an estimate of beta1, which we call b1, and an estimate of beta2, which is b2. these are only estimates, and therefore each is going to have a standard error of its own. the standard error in estimating beta1 can be calculated, and if you had noticed in the excel output, this standard error is also recorded. similarly there is a standard error in estimating beta2, called SE(b2), just as there is a standard error in b1, SE(b1).
now, what was the expression for this standard error? first of all, let us see where it is recorded. let us go to the excel output of the gpa example that we had discussed last time, where we were looking at cgpa in the mba program as our response variable, and the scores in the entrance examination and the scores in the interview as our explanatory variables. you recall that regression from the previous session. there we had seen that the estimate of beta1 is 0.455 and the estimate of beta2 is 0.622. excel also reported the standard errors in this estimation: 0.168 and 0.213, the standard errors in estimating b1 and b2 respectively.
so how are these calculated? that is where they are reported; now let us see the formula. in the simple linear regression, where we had considered only one of the explanatory variables, the standard error in estimating the slope, SE(b1), was the standard error of the regression (which you already know is an estimate of sigma of epsilon) divided by the square root of n, multiplied by 1 over the standard deviation of x. what is this standard deviation of x? here it would be the standard deviation of x1.
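as a quick numeric sanity check (simulated data, not from the lecture), the form se / (sqrt(n) * s_x), with s_x computed using denominator n, is the same number as the textbook form se / sqrt(sum((x - xbar)^2)):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

# least-squares slope and intercept for the simple regression
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
s_e = np.sqrt(resid @ resid / (n - 2))        # estimate of sigma_epsilon
s_x = np.sqrt(np.mean((x - x.mean()) ** 2))   # SD of x with denominator n

se_b1 = s_e / (np.sqrt(n) * s_x)              # the form used in the lecture
se_b1_alt = s_e / np.sqrt(((x - x.mean()) ** 2).sum())  # textbook form
print(se_b1, se_b1_alt)                       # the two agree
```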
now let us go back to the variance inflation factor. generally, the standard error in b1 is estimated as the standard error of the error terms divided by the square root of n, times 1 over the standard deviation of that particular explanatory variable. so if you are estimating the standard error in b1, this is the standard deviation of x1; if you are calculating the standard error in b2, this is the standard deviation of x2. now, why is the standard deviation of x in the denominator? if the range of x1 is quite large, which means that the standard deviation of x1 is quite large, that actually helps me understand the variation in y, and therefore, if the standard deviation in x is quite large, the standard error in the corresponding beta value will be smaller.
and what do i mean by this standard error value being smaller? i get very high precision in estimating that particular beta value. once again, take the extreme example: what if all the x values were the same — 1, 1, 1, 1, 1, 1? then the standard deviation would be 0, and if the standard deviation is 0, the standard error of b1 will skyrocket, which means you get absolutely no precision in estimating that particular beta. okay, so this formula is what applies in the absence of collinearity. when will collinearity be absent? it will be absent if the explanatory variables are uncorrelated. but if there is collinearity, the standard error in the estimation of b1 gets inflated, and how does it get inflated? it gets inflated by a factor: with the VIF, the standard error in estimating b1 is actually larger by the square root of the VIF.
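that multiplicative relationship can be checked numerically. the sketch below (simulated data, with two correlated regressors; none of these numbers are from the lecture) computes the exact standard error of b1 from (X'X)^-1 and compares it with the no-collinearity standard error multiplied by sqrt(VIF):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)   # x2 correlated with x1
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

A = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ b
s_e = np.sqrt(resid @ resid / (n - 3))

# exact standard errors from the covariance matrix of the estimates
cov_b = s_e ** 2 * np.linalg.inv(A.T @ A)
se_b1_exact = np.sqrt(cov_b[1, 1])

# no-collinearity SE, then multiplied by sqrt(VIF) of x1
s_x1 = np.sqrt(np.mean((x1 - x1.mean()) ** 2))
r2 = np.corrcoef(x1, x2)[0, 1] ** 2              # R^2 of x1 regressed on x2
vif1 = 1 / (1 - r2)
se_b1_formula = (s_e / (np.sqrt(n) * s_x1)) * np.sqrt(vif1)
print(se_b1_exact, se_b1_formula)                # the two agree
```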
now, going further: if the explanatory variables are completely uncorrelated, then the coefficient of determination in that special regression would be 0, and if you plug in 0 here you will get a VIF of 1. a VIF of 1 means there is no change in our precision; the standard error of b1 remains pretty much the same. when will the VIF be 1? the VIF will be 1 when the correlation amongst the explanatory variables is simply absent.
when we run that special regression, where one of the explanatory variables is made the response variable, if that regression has an R^2 of 0, then the VIF will be 1. however, if the explanatory variables are somehow correlated, and the VIF turns out to be a value greater than 1, then we can say that there is collinearity in our model. and as we saw, a larger value of the VIF essentially increases the standard error in estimating the partial slope, and therefore it can make our estimates very unreliable.
now let us look at our data; let us go back to the gpa example. this was the multiple linear regression model, and if you recall from the data, the explanatory variables were correlated: the entrance examination was an explanatory variable, the interview was an explanatory variable, and 0.54 was the coefficient of correlation between the two explanatory variables.
how does that get reflected? it gets reflected by running a regression in which you make one of the explanatory variables the response variable while the other explanatory variable remains an explanatory variable, and you see that the R^2 is almost 0.3 — 0.29 is the R^2. this is the R^2 that is going to get used in calculating the VIF. let me say that again.
what is this special regression? earlier there was a simple linear regression where we had used one of the explanatory variables against the response variable: the response variable was the cgpa, and one of the explanatory variables was kept in the model. in the second simple linear regression, our response variable did not change — it was still the cgpa during the mba program — only the explanatory variable changed. but this special regression is different: here we have made one of the original explanatory variables the response variable, and the other explanatory variable remains as an explanatory variable.
so the R^2 reported was 0.29, and therefore i can calculate the variance inflation factor: this 0.29 is the R^2. similarly, the other special regression is also going to report the same R^2 of 0.29 — with two explanatory variables it does not matter whether the entrance examination or the interview plays the role of the response variable; the R^2 is still going to be 0.29. therefore the VIF is going to be 1 / (1 - 0.29), about 1.41, and the square root of this, about 1.18, is the inflation. so we can say from this value that there is going to be an 18 percent increase in the standard error of the corresponding beta value.
so, going back to the expression: the square root of the VIF turned out to be 1.18, and therefore the standard error in estimating beta1 is going to increase by about 18 percent; similarly, the standard error in estimating beta2 is going to increase by about 18 percent. now, fortunately for us, in the example we have taken, the inflation because of the VIF was not much — it was only an 18 percent increase.
however, sometimes the VIF can be very large. we were a little more fortunate: our r was only 0.54, and in particular the R^2 was only 0.29. now imagine if this R^2 were of the order of, say, 0.7. let us see what would have happened. the variance inflation factor would have been 1 / (1 - 0.7) = 3.33, and the square root of that is about 1.82: there would have been an 82 percent increase in the standard error of b1.
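these numbers are easy to reproduce. a small sketch (the helper name `inflation` is mine, not from the lecture) plugging both the example's R^2 of 0.29 and the hypothetical 0.7 into 1 / (1 - R^2):

```python
import math

def inflation(r2):
    """Return (VIF, sqrt(VIF)) for a given auxiliary-regression R^2."""
    vif = 1 / (1 - r2)            # variance inflation factor
    return vif, math.sqrt(vif)    # sqrt(VIF) multiplies the standard error

for r2 in (0.0, 0.29, 0.7):
    vif, mult = inflation(r2)
    print(f"R^2 = {r2}: VIF = {vif:.2f}, SE multiplied by {mult:.2f}")
```

for R^2 = 0.29 this gives VIF of roughly 1.41 with an SE multiplier near 1.19, and for R^2 = 0.7 a VIF of roughly 3.33 with a multiplier near 1.83, matching the lecture's 18 percent and 82 percent figures up to rounding.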
okay, so what does this do? why do i want to keep the standard error in estimating beta1 small? in general, because it is a standard error — anywhere i see a standard error in regression, i want to keep it to the minimum. now, what will happen if this standard error gets inflated, which is why we are referring to this as the VIF, the variance inflation factor? look at the multiple linear regression model; let me delete this so that it becomes clearer.
okay, now suppose the standard error terms get inflated. for us, fortunately, the inflation in the standard error was quite small — only 18 percent, not 82 percent. but if it had been very high, what would happen to the t statistic? the t statistic would come down. why would the t statistic come down? how is the t statistic calculated?
the t statistic is calculated like this: it is for the null hypothesis that that particular beta value is zero, and it is calculated as the estimated value of that beta divided by the standard error of that beta, t = b / SE(b). now, if this standard error gets inflated because of the VIF, this t value is going to reduce.
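to make the effect concrete, here is a purely illustrative sketch: it takes the slope estimate from the gpa example (0.455) together with a hypothetical baseline standard error of 0.168 and shows what multiplying that SE by sqrt(VIF) does to the t statistic. (in practice the SE reported by excel already contains the inflation; the multipliers here are only for illustration.)

```python
def t_stat(b, se, infl=1.0):
    """t statistic for H0: beta = 0, with the SE scaled by an inflation factor."""
    return b / (se * infl)

b1, se_base = 0.455, 0.168                # slope and hypothetical baseline SE
for mult in (1.00, 1.18, 1.82):           # no inflation, R^2 = 0.29, R^2 = 0.7
    print(f"SE x {mult}: t = {t_stat(b1, se_base, mult):.2f}")
```

the larger the inflation factor, the smaller the t statistic, which is exactly the mechanism described next.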
okay, so this t value is going to come down. and what if this value comes down? it may actually impact my p value. if this t statistic is very small, i may not be able to reject the null hypothesis. i may end up saying: well, i don't know — this beta could be zero; i cannot say with confidence that this beta is not zero; i am not able to reject this null hypothesis.
and what if i am not able to reject this null hypothesis? it means that that particular explanatory variable may be statistically insignificant for the regression. that is really the extreme case of collinearity.
i will discuss another example where i will demonstrate such an extreme case. but coming back to this: i don't want this explanatory variable to be insignificant in my regression; therefore i don't want this t value to be small; therefore i don't want this standard error to be a large value. and if i don't want this standard error to be a large value, i had better make sure that the VIF is under control — and the only way to keep the VIF under control is to ensure that the explanatory variables don't have too much correlation amongst them. is that point understood? okay.