Durbin Watson Test for checking Residual Autocorrelation - YouTube

Channel: Bhavesh Bhatt

[6]
Hello everyone.
[8]
After we've created a linear regression model,
[10]
it becomes really important to validate whether the linear regression assumptions are met.
[16]
One of the assumptions of linear regression is
[18]
that there should be no autocorrelation in the residuals of your linear regression model.
[24]
Essentially, this implies that no lagged version of the residuals is correlated with the residuals themselves.
[31]
So, how do we go about doing that?
[33]
We have something called the Durbin-Watson statistic.
[38]
The Durbin-Watson statistic is a test for autocorrelation
[41]
in the residuals from a statistical regression analysis.
[45]
The Durbin-Watson statistic always has a value between 0 and 4.
[50]
How we infer whether there is any autocorrelation in the residuals
[54]
based on the Durbin-Watson test is what we look at in this video.
[57]
So let's get started by importing the necessary modules.
[61]
I have a file called durban_data.csv,
[65]
which I load into my dataframe df.
[70]
Let's look at the total number of rows & columns in my dataframe.
[75]
As you can clearly see I have 200 rows in my data frame and 4 columns.
[80]
Let's move forward and print out the first 5 rows of my dataframe.
[84]
As you can clearly see
[86]
My dataframe has 3 feature columns, which are X1, X2 & X3
[91]
and also a target column called Y.
[95]
So it's a regression problem where all the features are continuous and the target is continuous as well.
[101]
Let's now separate X and y values
[104]
where X will contain the data of my feature columns which are X1, X2 & X3
[109]
and y will contain the value of my target column which in our case is capital Y.
[115]
After we have split the data into X and y,
[118]
Let's proceed forward and split the data into train and test.
[123]
I'll be using 70 percent of the total 200 rows for training
[127]
and 30 percent for my testing purposes.
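Since durban_data.csv isn't bundled with the video, here is a minimal sketch of the loading-and-splitting step, with a synthetic dataframe standing in for the file (the column names and the 70/30 split match the video; the data itself is made up):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for durban_data.csv: 200 rows, features X1..X3, target Y
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["X1", "X2", "X3"])
df["Y"] = 2.7 + 0.04 * df["X1"] + 0.2 * df["X2"] + rng.normal(size=200)

# Separate features and target
X = df[["X1", "X2", "X3"]]
y = df["Y"]

# 70/30 split: first 140 rows for training, remaining 60 for testing
X_train, X_test = X.iloc[:140], X.iloc[140:]
y_train, y_test = y.iloc[:140], y.iloc[140:]
print(X_train.shape, X_test.shape)
```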
[132]
Now without doing much of EDA or
[134]
pre-processing or feature engineering,
[136]
let's go forward and fit a linear regression model.
[140]
I'm using the statsmodels API to fit a linear regression model.
[146]
After fitting the regression model,
[148]
this is what I get.
[149]
I get an intercept value of 2.70.
[152]
For feature X1, I get a weight value of 0.04
[156]
For feature X2, I get a weight value of 0.199
[160]
for feature X3, I get a weight value of 0.006.
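The video fits this with the statsmodels API (sm.OLS); as a self-contained sketch of what that fit computes, here is the same ordinary least squares solved with plain numpy on synthetic data (the numbers 2.7, 0.04, 0.2, 0.006 are only used to seed the fake data, echoing the values the video reports):

```python
import numpy as np

# Synthetic stand-in for the video's 140-row training set
rng = np.random.default_rng(0)
X_train = rng.normal(size=(140, 3))                     # X1, X2, X3
true_w = np.array([0.04, 0.2, 0.006])
y_train = 2.7 + X_train @ true_w + rng.normal(scale=0.5, size=140)

# Add an explicit intercept column (what sm.add_constant does),
# then solve ordinary least squares
X_design = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)
print(beta)  # [intercept, w1, w2, w3]
```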
[166]
We can also print out the OLS regression results, which are given by the statsmodels API,
[172]
to find out how well our model fits.
[176]
As my results suggest,
[177]
the model has a very high R-squared as well as adjusted R-squared value,
[183]
both having a value of around 90 percent.
[186]
Rather than spending a lot of time on this report, on which I've already created a lot of videos,
[191]
I'll link those videos in the description section so you can have a look at them.
[196]
let's jump to something called the Durbin-Watson value,
[199]
which in our case turns out to be 2.285.
[204]
How is this calculated?
[205]
What is its significance?
[207]
What does it imply in terms of autocorrelation of the residuals?
[211]
This is what we'll look at as we derive it from scratch.
[217]
So now let's spend some time in understanding the Durbin Watson test.
[223]
For any statistical test
[225]
you have something called a Null Hypothesis
[227]
& an Alternate Hypothesis,
[230]
so in our case
[231]
The Null Hypothesis states that
[232]
there is no autocorrelation of my residuals with any of their lagged versions.
[238]
On the other hand the alternate hypothesis states that
[242]
there is autocorrelation of the residual series with its lagged versions.
[248]
Now that the hypothesis statement is clear,
[251]
let's go on and calculate the statistic value for the Durbin-Watson test.
[257]
Now you might be wondering what this formula is exactly,
[261]
but to give you a simple interpretation of it:
[264]
the value dw, which is the test statistic of the Durbin-Watson test,
[269]
is basically the ratio of the sum of squared differences of successive residuals
[274]
to the sum of squared residuals.
[278]
So essentially
[279]
your numerator is nothing but the difference
[281]
of your actual residual values
[283]
and the one-lag version of your residual values.
[289]
So this term, e(i-1),
[292]
is your residual series itself with a lag of one,
[295]
so you take the difference of these two terms
[298]
square it
[299]
which gives you the numerator term
[301]
the denominator term is basically the sum of squares of the residual values.
[306]
So essentially your
[307]
residual values are nothing but your
[310]
observations minus your predictions,
[314]
Which is what is denoted by this formula.
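The formula just described, dw = Σᵢ(eᵢ − eᵢ₋₁)² / Σᵢ eᵢ², can be sketched in a few lines of numpy (the residuals here are synthetic white noise, so the statistic comes out near 2):

```python
import numpy as np

def durbin_watson_stat(e):
    """dw = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    numerator = np.sum(np.diff(e) ** 2)   # (e_i - e_{i-1})^2 summed
    denominator = np.sum(e ** 2)          # e_i^2 summed
    return numerator / denominator

# Uncorrelated (white-noise) residuals give a statistic close to 2
rng = np.random.default_rng(0)
print(durbin_watson_stat(rng.normal(size=140)))
```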
[319]
After you have computed the Durbin-Watson test statistic,
[323]
which we'll store in a variable called dw:
[327]
if the value of dw is less than a lower critical value dL,
[333]
I reject the Null Hypothesis
[335]
which was that there is no autocorrelation between my residuals and the lagged version of the residuals.
[342]
If my Durbin-Watson test statistic
[345]
is greater than an upper critical value called dU,
[349]
then I fail to reject the null hypothesis.
[353]
There is also a case where your dw,
[355]
that is, your test statistic, lies between the upper critical value
[360]
and the lower critical value
[362]
then the test is inconclusive
[365]
so you can neither reject the null hypothesis
[367]
nor fail to reject it.
[370]
so this is that grey area which you have to keep in mind when you perform this test.
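The three decision branches just described can be sketched as a small helper; dl and du are the table's lower and upper critical values, and the specific numbers below are the ones the video looks up later (N = 150, k = 3):

```python
def interpret_dw(dw, dl, du):
    """Decision rule for the (positive-autocorrelation) Durbin-Watson test."""
    if dw < dl:
        return "reject H0: evidence of positive autocorrelation"
    if dw > du:
        return "fail to reject H0: no evidence of autocorrelation"
    return "inconclusive"  # the grey area between dl and du

# The video's numbers: dw = 2.285, dL = 1.584, dU = 1.665
print(interpret_dw(2.285, 1.584, 1.665))
```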
[375]
If this idea is clear to you,
[377]
let's try to compute these values from scratch.
[382]
To compute the values from scratch
[385]
Firstly, I'll make use of the model that I've created to find out the predictions
[389]
on my training data set
[391]
Just a disclaimer.
[396]
I want to arrive at the Durbin-Watson value of 2.285,
[399]
which has been computed for the training data set.
[402]
So whatever I'm doing is essentially on the training dataset itself.
[407]
So I calculate y_predict
[409]
which is basically the predictions on my X train itself using the model that I've fitted.
[416]
Let's also visualize the shape of my y_predict
[418]
so as you can clearly see, it has 140 samples,
[422]
and my residuals are nothing but my observed minus my predicted values.
[428]
I save all my residual values into a dataframe called residual_df
[433]
and this is how my overall residual data frame would look like.
[438]
It will have 140 rows
[440]
along with one column which is specified as "ei"
[444]
where each row is basically the difference of my actual value minus the predicted value.
[450]
The next thing that I do is
[452]
I compute something called ei_square,
[456]
which is nothing but the square of your ei column.
[460]
Now if you go up,
[462]
the calculation that I've done, ei_square,
[466]
is for the denominator term.
[469]
So essentially what I do is, I take the sum
[473]
of my ei_square column
[475]
and save it into a variable called
[478]
sum_of_squared_residuals.
[482]
So I run the cell and print out the value,
[485]
so the denominator that I have computed,
[487]
which is essentially the sum of squared residuals,
[489]
has a value of 347.10.
[495]
Let's go forward and visualize the dataframe that I have.
[498]
So I have 2 columns at this stage: ei and ei_square.
[505]
The next thing I have to do is introduce a lag of one
[508]
to my ei column in the residual dataframe,
[511]
so I run the cell,
[514]
and this is what I observe.
[517]
So I created a new column called
[519]
ei_minus_one,
[521]
which is basically the lagged version of my ei column
[525]
so this point which was at location or index 0 has shifted to index 1.
[531]
So if I now visualize the last five rows as well,
[536]
this value has disappeared in ei_minus_one,
[540]
and the second-to-last value becomes the last value.
[545]
Since the first row has an NaN value,
[548]
I drop that row entirely,
[551]
and now if I visualize the shape of my dataframe, I'll have 139 rows.
[557]
The next step in the numerator calculation is
[561]
finding out the difference of the ei column and the ei_minus_1 column,
[565]
so I do this operation in this cell.
[571]
Once I have the difference of ei and ei_minus_1,
[574]
I square that column and save it into a new column called
[577]
square_of_ei_sub_ei_minus_1,
[582]
& I visualize the data frame that I have.
[586]
So now this column is of prime importance
[588]
for the numerator sum calculation,
[591]
which was computed by first taking the difference
[593]
of the ei column and the ei_minus_one column and saving it into this column.
[598]
Squaring that difference gave me
[600]
this column.
[601]
So that's how I've been able to reach the penultimate stage of my overall calculations.
[606]
Now, I have the column of all the squared differences
[610]
of my ei
[611]
and ei lagged by 1.
[614]
Now, all I have to do is I have to take the sum of these values.
[618]
Which is a fairly easy task in Pandas,
[621]
so I call the .sum() function,
[623]
pass in the column that I want to sum,
[626]
and save this value
[628]
into a variable holding
[630]
the sum of squared differences of residuals,
[632]
and I run the cell.
[635]
So the numerator term
[637]
which is
[638]
the sum of squared differences of residuals, has a value of 793.31.
[644]
Now the test statistic that I mentioned
[646]
is basically
[648]
the ratio of numerator by denominator.
[650]
So when I run the cell
[652]
and when I kind of visualize the value of
[654]
the test statistic,
[655]
it turns out to be 2.285,
[658]
let's go up
[659]
and validate if the value is correct based on our calculations.
[663]
So we are chasing a value of 2.285,
[666]
if I go back up.
[670]
and if I validate it against this value,
[673]
I get the exact same value as given in the statsmodels API report.
[677]
So we have successfully implemented the Durbin-Watson test
[681]
from scratch using Pandas
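The whole from-scratch walkthrough can be condensed into one pandas sketch. Synthetic residuals stand in for the video's y_train − y_predict, so the final number will differ from 2.285, but the steps (ei_square, the one-row lag, the dropna down to 139 rows, and the final ratio) are the same:

```python
import numpy as np
import pandas as pd

# Synthetic residuals stand in for observed minus predicted (y_train - y_predict)
rng = np.random.default_rng(0)
residual_df = pd.DataFrame({"ei": rng.normal(size=140)})

# Denominator: sum of squared residuals over all 140 rows
residual_df["ei_square"] = residual_df["ei"] ** 2
sum_of_squared_residuals = residual_df["ei_square"].sum()

# Lag the residuals by one; the first row becomes NaN and is dropped -> 139 rows
residual_df["ei_minus_one"] = residual_df["ei"].shift(1)
lagged = residual_df.dropna()

# Numerator: sum of squared differences e_i - e_{i-1}
diff = lagged["ei"] - lagged["ei_minus_one"]
sum_of_squared_diff = (diff ** 2).sum()

dw = sum_of_squared_diff / sum_of_squared_residuals
print(round(dw, 3))
```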
[684]
But the story is still far from over,
[686]
because we also have to interpret whatever
[688]
value we are getting:
[690]
does it imply that our overall residuals
[693]
are uncorrelated or autocorrelated?
[696]
For that we will have to make use of
[699]
something called the test table.
[705]
So as I've already mentioned you have
[707]
the upper critical value dU,
[708]
and a lower critical value called dL.
[711]
Where do you find them?
[712]
How do you find them?
[713]
This is where the Durbin-Watson table comes in handy,
[717]
so I'll switch over to a PDF
[719]
which kind of has the table ready.
[721]
I'll also link that table in the description section of the video.
[725]
So you can have a look at the PDF file as well.
[729]
So, this was the Durbin-Watson table that I was mentioning some time back.
[733]
Now, how do you find the values of dU
[736]
and dL,
[737]
that is, your upper critical value and your lower critical value?
[740]
So if I keep going down, I'll start seeing a table like format.
[744]
So, this is the table that I was mentioning.
[747]
So you see something called k mentioned here:
[749]
k = 1, k = 2, k = 3 and so on and so forth
[753]
and you have some numbers N here.
[757]
k basically signifies
[759]
the total number of features in your overall linear regression model.
[767]
So as you can clearly see
[769]
whatever I have just mentioned is again validated in the PDF itself
[773]
where k is the total number of regressors
[775]
excluding the intercept term, so
[777]
if you remember we had three features X1, X2 and X3.
[783]
So we'll only be looking at the k = 3 column
[786]
for finding out our upper critical value & lower critical value.
[791]
Now, we have finalized the column
[793]
Now, we have to finalize the row.
[796]
Now, N is the total number of samples
[799]
that you're testing on.
[800]
So in our case,
[802]
we used our entire training dataset to come up with that value of 2.28.
[808]
So our training set, if you remember, consists of 140 samples.
[811]
So I'll go down.
[813]
and I'll pick 150, since it's closest to 140.
[817]
So the lower critical value
[819]
at "N" equal to 150
[820]
is 1.584
[823]
and the upper critical value at "N" equal to 150
[827]
for 3 regressors is 1.665
[831]
the statistic value that we computed was 2.28
[834]
our lower & upper critical values are 1.5 & 1.6 respectively,
[839]
so our statistic value
[841]
is greater than our lower and upper critical value.
[845]
Let's go back to the hypothesis that we framed.
[849]
I'll repeat, since I've changed screens:
[852]
The lower critical value was 1.5
[855]
for N equal to 150
[857]
& k equal to 3
[859]
the upper critical value, which in our case is dU, was around 1.6
[863]
for the same combination of N and k,
[866]
& the dw that we calculated was 2.28
[870]
so essentially dw is greater than dU
[873]
so I fail to reject the Null Hypothesis
[875]
which states that there is no autocorrelation of my residuals
[879]
with any of their lagged versions.
[883]
So this is a good linear regression model that I've constructed
[886]
which does not violate the autocorrelation assumption of my residuals.
[892]
Now that we've proven this statistically
[895]
we can also validate this by using the ACF plot or the auto correlation plot.
[904]
So as you can clearly see from this autocorrelation plot,
[906]
if I take my residual values as they are
[909]
and keep taking lagged versions of the
[911]
residual series and computing the correlation,
[914]
this is what I find.
[916]
There is no significant
[917]
autocorrelation of the actual residual series
[922]
with any of its lagged versions.
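The video draws this with an ACF plot (statsmodels' plot_acf). A dependency-free sketch of the same check computes lag-k autocorrelations of a (here, synthetic) residual series and compares them to the rough ±1.96/√n white-noise band:

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of a series at a given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

rng = np.random.default_rng(0)
residuals = rng.normal(size=140)          # stand-in for the model's residuals
band = 1.96 / np.sqrt(len(residuals))     # rough 95% band for white noise

for k in range(1, 6):
    r = autocorr(residuals, k)
    print(f"lag {k}: {r:+.3f}  significant: {abs(r) > band}")
```

For uncorrelated residuals, almost all lags should fall inside the band, which is the visual conclusion the video draws from the plot.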
[925]
So this was my attempt at explaining how the Durbin-Watson test works internally
[930]
& how you can use it smartly to validate your linear regression assumption
[934]
of autocorrelation of residuals.
[936]
I hope you found this video informative.
[940]
If you have any questions about
[941]
what we covered in this video, then
[942]
feel free to ask in the comment section below,
[945]
and I'll do my best to answer them.
[947]
If you enjoy these tutorials & would like to support
[949]
them then the easiest way is to simply like the video
[952]
& give it a thumbs up
[954]
& also it's a huge help to share these videos with anyone who you think would find them useful.
[960]
Please consider clicking the SUBSCRIBE button to be notified for future videos
[963]
& thank you so much for watching the video.