Durbin Watson Test for checking Residual Autocorrelation - YouTube

Channel: Bhavesh Bhatt

[6]
Hello everyone.
[8]
After we've created a linear regression model,
[10]
it becomes really important to validate whether the linear regression assumptions are met.
[16]
One of the assumptions of linear regression is
[18]
that there should be no autocorrelation in the residuals of your linear regression model.
[24]
Essentially, this implies that no lagged version of the residuals is correlated with the residuals themselves.
[31]
So, how do we go about doing that?
[33]
We have something called the Durbin-Watson statistic.
[38]
The Durbin-Watson statistic is a test for autocorrelation
[41]
in the residuals from a statistical regression analysis.
[45]
The Durbin-Watson statistic always has a value between 0 and 4.
[50]
How we infer whether there is any autocorrelation in the residuals
[54]
based on the Durbin-Watson test is what we look at in this video.
[57]
So let's get started by importing the necessary modules.
[61]
I have a file called durban_data.csv,
[65]
which I load into my dataframe df.
[70]
Let's look at the total number of rows & columns in my dataframe.
[75]
As you can clearly see I have 200 rows in my data frame and 4 columns.
[80]
Let's move forward and print out the first 5 rows of my dataframe.
[84]
As you can clearly see
[86]
My dataframe has 3 feature columns, which are X1, X2 & X3
[91]
and also a target column called Y.
[95]
So it's a regression problem where all the features are continuous and the target is continuous as well.
[101]
Let's now separate X and y values
[104]
where X will contain the data of my feature columns which are X1, X2 & X3
[109]
and y will contain the value of my target column which in our case is capital Y.
[115]
After we have split the data into X and y,
[118]
Let's proceed forward and split the data into train and test.
[123]
I'll be using 70 percent of the total 200 rows for training
[127]
and 30 percent for my testing purposes.
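Since durban_data.csv isn't bundled with the video, here is a minimal sketch of the loading-and-splitting step, with a synthetic dataframe standing in for the file (the column names and the 70/30 split match the video; the data itself is made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for durban_data.csv: 200 rows, features X1..X3, target Y
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["X1", "X2", "X3"])
df["Y"] = 2.7 + 0.04 * df["X1"] + 0.2 * df["X2"] + rng.normal(size=200)

# Separate features and target
X = df[["X1", "X2", "X3"]]
y = df["Y"]

# 70/30 split: first 140 rows for training, remaining 60 for testing
X_train, X_test = X.iloc[:140], X.iloc[140:]
y_train, y_test = y.iloc[:140], y.iloc[140:]
print(X_train.shape, X_test.shape)
```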
[132]
Now without doing much of EDA or
[134]
pre-processing or feature engineering,
[136]
let's go forward and fit a linear regression model.
[140]
I'm using the statsmodels API to fit a linear regression model.
[146]
After fitting the regression model,
[148]
this is what I get.
[149]
I get an intercept value of 2.70.
[152]
For feature X1, I get a weight value of 0.04
[156]
For feature X2, I get a weight value of 0.199
[160]
for feature X3, I get a weight value of 0.006.
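The video fits this with the statsmodels API (sm.OLS); as a self-contained sketch of what that fit computes, here is the same ordinary least squares solved with plain numpy on synthetic data (the numbers 2.7, 0.04, 0.2, 0.006 are only used to seed the fake data, echoing the values the video reports):

```python
import numpy as np

# Synthetic stand-in for the video's 140-row training set
rng = np.random.default_rng(0)
X_train = rng.normal(size=(140, 3))                     # X1, X2, X3
true_w = np.array([0.04, 0.2, 0.006])
y_train = 2.7 + X_train @ true_w + rng.normal(scale=0.5, size=140)

# Add an explicit intercept column (what sm.add_constant does),
# then solve ordinary least squares
X_design = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)
print(beta)  # [intercept, w1, w2, w3]
```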
[166]
We can also print out the OLS regression results, which are given by the statsmodels API,
[172]
to find out how well our model fits.
[176]
As my results suggest,
[177]
the model has a very high R-squared as well as adjusted R-squared value,
[183]
both having a value of around 90 percent.
[186]
Rather than spending a lot of time on this report, on which I've already created a lot of videos,
[191]
I'll link those videos in the description section so you can have a look at them.
[196]
let's jump to something called the Durbin-Watson value,
[199]
which in our case turns out to be 2.285.
[204]
How is this calculated?
[205]
What is its significance?
[207]
What does it imply in terms of autocorrelation of the residuals?
[211]
This is what we'll look at as we derive it from scratch.
[217]
So now let's spend some time in understanding the Durbin Watson test.
[223]
For any statistical test
[225]
you have something called a Null Hypothesis
[227]
& an Alternate Hypothesis,
[230]
so in our case
[231]
The Null Hypothesis states that
[232]
there is no autocorrelation of my residuals with any of their lagged versions.
[238]
On the other hand the alternate hypothesis states that
[242]
there is autocorrelation of the residual series with its lagged versions.
[248]
Now that the hypothesis statement is clear,
[251]
let's go on and calculate the statistic value for the Durbin-Watson test.
[257]
Now you might be wondering what this formula is exactly,
[261]
but to give you a simple interpretation of it:
[264]
the value dw, which is the test statistic of the Durbin-Watson test,
[269]
is basically the ratio of the sum of squared differences of successive residuals
[274]
to the sum of squared residuals.
[278]
So essentially
[279]
your numerator is nothing but the difference
[281]
of your actual residual values
[283]
and the one-lag version of your residual values.
[289]
So this term, e(i-1),
[292]
is your residual series itself with a lag of one,
[295]
so you take the difference of these two terms
[298]
square it
[299]
which gives you the numerator term
[301]
the denominator term is basically the sum of squares of the residual values.
[306]
So essentially your
[307]
residual values are nothing but your
[310]
observations minus your predictions,
[314]
Which is what is denoted by this formula.
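The formula just described, dw = Σᵢ(eᵢ − eᵢ₋₁)² / Σᵢ eᵢ², can be sketched in a few lines of numpy (the residuals here are synthetic white noise, so the statistic comes out near 2):

```python
import numpy as np

def durbin_watson_stat(e):
    """dw = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    numerator = np.sum(np.diff(e) ** 2)   # (e_i - e_{i-1})^2 summed
    denominator = np.sum(e ** 2)          # e_i^2 summed
    return numerator / denominator

# Uncorrelated (white-noise) residuals give a statistic close to 2
rng = np.random.default_rng(0)
print(durbin_watson_stat(rng.normal(size=140)))
```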
[319]
After you have computed the Durbin-Watson test statistic,
[323]
which we'll store in a variable called dw:
[327]
if the value of dw is less than a lower critical value dL,
[333]
I reject the Null Hypothesis
[335]
which was that there is no autocorrelation between my residuals and the lagged version of the residuals.
[342]
If my Durbin-Watson test statistic
[345]
is greater than an upper critical value called dU,
[349]
then I fail to reject the null hypothesis.
[353]
There is also a case where your dw,
[355]
that is, your test statistic, lies between the upper critical value
[360]
and the lower critical value
[362]
then the test is inconclusive
[365]
so you can neither reject the null hypothesis
[367]
nor fail to reject it.
[370]
so this is that grey area which you have to keep in mind when you perform this test.
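The three decision branches just described can be sketched as a small helper; dl and du are the table's lower and upper critical values, and the specific numbers below are the ones the video looks up later (N = 150, k = 3):

```python
def interpret_dw(dw, dl, du):
    """Decision rule for the (positive-autocorrelation) Durbin-Watson test."""
    if dw < dl:
        return "reject H0: evidence of positive autocorrelation"
    if dw > du:
        return "fail to reject H0: no evidence of autocorrelation"
    return "inconclusive"  # the grey area between dl and du

# The video's numbers: dw = 2.285, dL = 1.584, dU = 1.665
print(interpret_dw(2.285, 1.584, 1.665))
```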
[375]
If this idea is clear to you,
[377]
let's try to compute these values from scratch.
[382]
To compute the values from scratch
[385]
Firstly, I'll make use of the model that I've created to find out the predictions
[389]
on my training data set
[391]
Just a disclaimer.
[396]
I want to arrive at the Durbin-Watson value of 2.285,
[399]
which has been computed for the training data set.
[402]
So whatever I'm doing is essentially on the training dataset itself.
[407]
So I calculate y_predict
[409]
which is basically the predictions on my X train itself using the model that I've fitted.
[416]
Let's also visualize the shape of my y_predict
[418]
so as you can clearly see, it has 140 samples,
[422]
and my residuals are nothing but my observed minus my predicted values.
[428]
I save all my residual values into a dataframe called residual_df
[433]
and this is how my overall residual data frame would look like.
[438]
It will have 140 rows
[440]
along with one column which is specified as "ei"
[444]
where each row is basically the difference of my actual value minus the predicted value.
[450]
The next thing that I do is
[452]
I compute something called ei_square,
[456]
which is nothing but the square of your ei column.
[460]
Now if you go up,
[462]
the calculation that I've done, ei_square,
[466]
is for the denominator term.
[469]
So essentially what I do is, I take the sum
[473]
of my ei_square column
[475]
and save it into a variable called
[478]
sum_of_squared_residuals.
[482]
So I run the cell and print out the value,
[485]
so the denominator that I have computed,
[487]
which is essentially the sum of squared residuals,
[489]
has a value of 347.10.
[495]
Let's go forward and visualize the dataframe that I have.
[498]
So I have 2 columns at this stage: ei and ei_square.
[505]
The next thing I have to do is introduce a lag of one
[508]
to my ei column in the residual dataframe,
[511]
so I run the cell,
[514]
and this is what I observe.
[517]
So I created a new column called
[519]
ei_minus_one,
[521]
which is basically the lagged version of my ei column
[525]
so this point which was at location or index 0 has shifted to index 1.
[531]
So if I now visualize the last five rows as well,
[536]
this value has disappeared in ei_minus_one,
[540]
and the second-to-last value becomes the last value.
[545]
Since the first row has an NaN value,
[548]
I drop that row entirely,
[551]
and now if I visualize the shape of my dataframe, I'll have 139 rows.
[557]
The next step in the numerator calculation is
[561]
finding out the difference of the ei column and the ei_minus_1 column,
[565]
so I do this operation in this cell.
[571]
Once I have the difference of ei and ei_minus_1,
[574]
I square that column and save it into a new column called
[577]
square_of_ei_sub_ei_minus_1,
[582]
& I visualize the data frame that I have.
[586]
So now this column is of prime importance
[588]
for the numerator sum calculation,
[591]
which was computed by first taking the difference
[593]
of the ei column and the ei_minus_one column and saving it into this column.
[598]
Squaring that difference gave me
[600]
this column.
[601]
So that's how I've been able to reach the penultimate stage of my overall calculations.
[606]
Now, I have the column of all the squared differences
[610]
of my ei
[611]
and ei lagged by 1.
[614]
Now, all I have to do is I have to take the sum of these values.
[618]
Which is a fairly easy task in Pandas,
[621]
so I call the .sum() function,
[623]
pass in the column that I want to sum,
[626]
and save this value
[628]
into a variable holding
[630]
the sum of squared differences of residuals,
[632]
and I run the cell.
[635]
So the numerator term
[637]
which is
[638]
the sum of squared differences of residuals, has a value of 793.31.
[644]
Now the test statistic that I mentioned
[646]
is basically
[648]
the ratio of numerator by denominator.
[650]
So when I run the cell
[652]
and when I kind of visualize the value of
[654]
the test statistic,
[655]
it turns out to be 2.285,
[658]
let's go up
[659]
and validate if the value is correct based on our calculations.
[663]
So we are chasing a value of 2.285,
[666]
if I go back up.
[670]
and if I validate it against this value,
[673]
I get the exact same value as given in the statsmodels API report.
[677]
So we have successfully implemented the Durbin-Watson test
[681]
from scratch using Pandas
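The whole from-scratch walkthrough can be condensed into one pandas sketch. Synthetic residuals stand in for the video's y_train − y_predict, so the final number will differ from 2.285, but the steps (ei_square, the one-row lag, the dropna down to 139 rows, and the final ratio) are the same:

```python
import numpy as np
import pandas as pd

# Synthetic residuals stand in for observed minus predicted (y_train - y_predict)
rng = np.random.default_rng(0)
residual_df = pd.DataFrame({"ei": rng.normal(size=140)})

# Denominator: sum of squared residuals over all 140 rows
residual_df["ei_square"] = residual_df["ei"] ** 2
sum_of_squared_residuals = residual_df["ei_square"].sum()

# Lag the residuals by one; the first row becomes NaN and is dropped -> 139 rows
residual_df["ei_minus_one"] = residual_df["ei"].shift(1)
lagged = residual_df.dropna()

# Numerator: sum of squared differences e_i - e_{i-1}
diff = lagged["ei"] - lagged["ei_minus_one"]
sum_of_squared_diff = (diff ** 2).sum()

dw = sum_of_squared_diff / sum_of_squared_residuals
print(round(dw, 3))
```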
[684]
But the story is still far from over,
[686]
because we also have to interpret whatever
[688]
value we are getting:
[690]
does it imply that our overall residuals
[693]
are uncorrelated or autocorrelated?
[696]
For that we will have to make use of
[699]
something called the test table.
[705]
So as I've already mentioned you have
[707]
the upper critical value dU,
[708]
and a lower critical value called dL.
[711]
Where do you find them?
[712]
How do you find them?
[713]
This is where the Durbin-Watson table comes in handy,
[717]
so I'll switch over to a PDF
[719]
which kind of has the table ready.
[721]
I'll also link that table in the description section of the video.
[725]
So you can have a look at the PDF file as well.
[729]
So, this was the Durbin-Watson table that I was mentioning some time back.
[733]
Now, how do you find the values of dU
[736]
and dL,
[737]
that is, your upper critical value and your lower critical value?
[740]
So if I keep going down, I'll start seeing a table like format.
[744]
So, this is the table that I was mentioning.
[747]
So you see something called k mentioned here:
[749]
k = 1, k = 2, k = 3 and so on and so forth
[753]
and you have some numbers N here.
[757]
k basically signifies
[759]
the total number of features in your overall linear regression model.
[767]
So as you can clearly see
[769]
whatever I have just mentioned is again validated in the PDF itself
[773]
where k is the total number of regressors
[775]
excluding the intercept term, so
[777]
if you remember we had three features X1, X2 and X3.
[783]
So we'll only be looking at the k = 3 column
[786]
for finding out our upper critical value & lower critical value.
[791]
Now, we have finalized the column
[793]
Now, we have to finalize the row.
[796]
Now, N is the total number of samples
[799]
that you're testing on.
[800]
So in our case,
[802]
we used our entire training dataset to come up with that value of 2.28.
[808]
So our training set, if you remember, consists of 140 samples.
[811]
So I'll go down.
[813]
and I'll pick 150, since it's closest to 140.
[817]
So the lower critical value
[819]
at "N" equal to 150
[820]
is 1.584
[823]
and the upper critical value at "N" equal to 150
[827]
for 3 regressors is 1.665
[831]
the statistic value that we computed was 2.28
[834]
our lower & upper critical values are 1.5 & 1.6 respectively,
[839]
so our statistic value
[841]
is greater than our lower and upper critical value.
[845]
Let's go back to the hypothesis that we framed.
[849]
I'll repeat, since I've changed screens:
[852]
The lower critical value was 1.5
[855]
for N equal to 150
[857]
& k equal to 3
[859]
the upper critical value, which in our case is dU, was around 1.6
[863]
for the same combination of N and k,
[866]
& the dw that we calculated was 2.28
[870]
so essentially dw is greater than dU
[873]
so I fail to reject the Null Hypothesis
[875]
which states that there is no autocorrelation of my residuals
[879]
with any of their lagged versions.
[883]
So this is a good linear regression model that I've constructed
[886]
which does not violate the autocorrelation assumption of my residuals.
[892]
Now that we've proven this statistically
[895]
we can also validate this by using the ACF plot or the auto correlation plot.
[904]
So as you can clearly see from this autocorrelation plot,
[906]
if I take my residual values as they are
[909]
and keep taking lagged versions of the
[911]
residual series and computing the correlation,
[914]
this is what I find.
[916]
There is no significant
[917]
autocorrelation of the actual residual series
[922]
with any of its lagged versions.
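The video draws this with an ACF plot (statsmodels' plot_acf). A dependency-free sketch of the same check computes lag-k autocorrelations of a (here, synthetic) residual series and compares them to the rough ±1.96/√n white-noise band:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a series at a given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rng = np.random.default_rng(0)
residuals = rng.normal(size=140)          # stand-in for the model's residuals
band = 1.96 / np.sqrt(len(residuals))     # rough 95% band for white noise

for k in range(1, 6):
    r = autocorr(residuals, k)
    print(f"lag {k}: {r:+.3f}  significant: {abs(r) > band}")
```

For uncorrelated residuals, almost all lags should fall inside the band, which is the visual conclusion the video draws from the plot.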
[925]
So this was my attempt at explaining how the Durbin-Watson test works internally
[930]
& how you can use it smartly to validate your linear regression assumption
[934]
of autocorrelation of residuals.
[936]
I hope you found this video informative.
[940]
If you have any questions about
[941]
what we covered in this video, then
[942]
feel free to ask in the comment section below,
[945]
and I'll do my best to answer them.
[947]
If you enjoy these tutorials & would like to support
[949]
them then the easiest way is to simply like the video
[952]
& give it a thumbs up
[954]
& also it's a huge help to share these videos with anyone who you think would find them useful.
[960]
Please consider clicking the SUBSCRIBE button to be notified for future videos
[963]
& thank you so much for watching the video.