Polynomial Regression in R | R Tutorial 5.12 | MarinStatsLectures - YouTube

Channel: MarinStatsLectures-R Programming & Statistics

Hi, I'm Mike Marin, and in this video we'll discuss the idea of polynomial regression and how to fit and assess these models in R. Polynomial regression is a special case of linear regression in which the relationship between x and y is modeled using a polynomial rather than a straight line. It can be used when the relationship between x and y is nonlinear, although it is still considered a special case of multiple linear regression. We will be working with a different version of the lung capacity data than was used in other videos. A link to download this data can be found in the video description below.
You can also download the R script used in this video there. I've already imported the data into RStudio and attached it. As an example, we'll model the relationship between lung capacity and height. Let's begin by looking at a scatter plot, then fit a simple linear regression model and look at a summary of that model. We can take note that the R-squared is about 75% and the residual standard error is 1.292. We can also add the fitted line to the plot using the abline() function. Visually, the relationship between lung capacity and height looks a bit curved, or nonlinear.
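The steps just described might be sketched as follows. Note this uses simulated data as a stand-in for the video's lung capacity dataset (the variable names Height and LungCap follow the video; the simulated values are not the real data):

```r
# Simulated stand-in for the video's data: Height (inches), LungCap (liters)
set.seed(1)
Height <- runif(200, 45, 80)
LungCap <- 0.00125 * Height^2 + rnorm(200, sd = 1)

# Scatter plot of lung capacity against height
plot(Height, LungCap, main = "Lung Capacity vs Height")

# Fit the simple linear regression and inspect the fit;
# note the R-squared and residual standard error in the output
model1 <- lm(LungCap ~ Height)
summary(model1)

# Overlay the fitted regression line on the scatter plot
abline(model1, col = "red", lwd = 2)
```

The R-squared and residual standard error values will of course differ from the video's, since the data here are simulated.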
A reminder that we can also use residual plots to help assess linearity and check model assumptions; for a more thorough discussion of this topic, you can refer to one of our earlier videos on checking the assumptions of linear regression. There are many approaches to dealing with nonlinearity; the one we will discuss in this video is including polynomial terms in our model. We will start by including height squared in the model.
First, let's take a look at the wrong way to do this. While it may seem like including height squared directly in the model call will work, R will not include height squared in the model if it is entered this way. Let's take a look at that.
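A sketch of the wrong way, using the same simulated stand-in data as above. Inside an R formula, `^` is formula syntax (crossing to a given degree) rather than arithmetic, which is why the squared term silently disappears:

```r
# Simulated stand-in data (not the video's real dataset)
set.seed(1)
Height <- runif(200, 45, 80)
LungCap <- 0.00125 * Height^2 + rnorm(200, sd = 1)

# The wrong way: ^ inside a formula is NOT arithmetic squaring,
# so Height^2 reduces to just Height and the squared term is dropped
wrong <- lm(LungCap ~ Height + Height^2)
summary(wrong)       # only an intercept and Height appear
length(coef(wrong))  # 2 -- no squared term, and no warning is given
```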
We can see that if we enter height squared directly in the model call, height squared is left out of the model summary; R has simply ignored it. It is important to take note of this, because R does not give a warning or an error message here.

Now let's take a look at the right way to do this: we use a capital I and include height squared within the parentheses. If we ask for a summary of the model, we can now see that height squared has been included. We'll return to this model in a moment, but first let's look at a few other ways to get the same result.

Instead of using the approach we just showed, we could first create a new variable called height squared and then include this variable in the model. If we ask for the summary, you can see that this produces exactly the same model and results as the previous set of commands. We could also make use of the poly() function in R: here we let R know that we would like to include polynomial terms for the height variable, and we set the degree argument to the degree of polynomial we'd like. In this case, setting degree equal to 2 will include height and height squared. If, instead of setting the raw argument to TRUE, we set it to FALSE, R would fit a model using orthogonal polynomials. Let's fit that model; again, you can take the time to verify that it produces exactly the same results as the two earlier approaches.

Let's give ourselves a quick reminder of the model that we fit. With this polynomial model including height squared, the R-squared is about 77% and the residual standard error is 1.238. You may recall that for the model with only height, the R-squared was about 75% and the residual standard error was 1.292; it looks like height squared may be improving the model. Let's look at this visually: we can add the polynomial model to the plot using the lines() function, so let's add it as a thick blue line.
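The three equivalent ways of fitting the quadratic model, plus the lines() call, can be sketched like this (again on simulated stand-in data; the helper name HeightSquare is illustrative):

```r
# Simulated stand-in data (not the video's real dataset)
set.seed(1)
Height <- runif(200, 45, 80)
LungCap <- 0.00125 * Height^2 + rnorm(200, sd = 1)

# (1) Wrap the squaring in I() so ^ is treated as arithmetic
model2 <- lm(LungCap ~ Height + I(Height^2))
summary(model2)

# (2) Create the squared variable first, then include it in the formula
HeightSquare <- Height^2
model2b <- lm(LungCap ~ Height + HeightSquare)

# (3) Use poly(); raw = TRUE requests raw (not orthogonal) polynomials
model2c <- lm(LungCap ~ poly(Height, degree = 2, raw = TRUE))

# All three produce identical fitted values
all.equal(unname(fitted(model2)), unname(fitted(model2b)))  # TRUE

# Add the fitted quadratic curve to the plot as a thick blue line;
# sort by Height so lines() draws a smooth curve left to right
plot(Height, LungCap)
ord <- order(Height)
lines(Height[ord], fitted(model2)[ord], col = "blue", lwd = 3)
```

Setting `raw = FALSE` in poly() would instead use orthogonal polynomials; the individual coefficients change, but the fitted curve is the same.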
Subjectively, it looks like the model that includes height squared may provide a better fit to the data than the model that does not.
Let's compare these two models formally using the partial F test; for a more detailed discussion of the partial F test, you can see one of our earlier videos where we discuss it. This test has the null hypothesis that there is no significant difference between the two models, and the alternative hypothesis that the full model (the model that includes height squared) is significantly better. We can run the test in R using the anova() function. With such a small p-value, we reject the null hypothesis and conclude that we have evidence that the model including height squared provides a statistically significantly better fit than the model without height squared.
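The partial F test described above amounts to passing the two nested models to anova(), sketched here on the simulated stand-in data (the p-value will differ from the video's, which used the real dataset):

```r
# Simulated stand-in data (not the video's real dataset)
set.seed(1)
Height <- runif(200, 45, 80)
LungCap <- 0.00125 * Height^2 + rnorm(200, sd = 1)

model1 <- lm(LungCap ~ Height)                # reduced model
model2 <- lm(LungCap ~ Height + I(Height^2))  # full model

# anova() on two nested lm fits performs the partial F test;
# a small p-value is evidence in favor of the full model
anova(model1, model2)
```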
Most often we won't want to include polynomial terms much beyond x squared or x cubed. Let's explore a model that includes x cubed as well. It's worth noting that you must always include all lower-order terms in a model: if we include height cubed, we must also include height squared and height. Let's fit this model, including height squared and height cubed, and ask for a summary. We can add this model to the plot using the lines() function, this time as a thick green dashed line, and let's also add a legend to the plot to help remind ourselves which model is which. We can see visually that there is almost no difference between the model that includes height cubed and the model that does not. As before, we can use the partial F test to help us decide whether height cubed improves the model.
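The cubic model, overlay, legend, and follow-up partial F test might be sketched as follows, again on the simulated stand-in data:

```r
# Simulated stand-in data (not the video's real dataset)
set.seed(1)
Height <- runif(200, 45, 80)
LungCap <- 0.00125 * Height^2 + rnorm(200, sd = 1)

model2 <- lm(LungCap ~ Height + I(Height^2))
# Cubic model: all lower-order terms (Height, Height^2) are kept
model3 <- lm(LungCap ~ Height + I(Height^2) + I(Height^3))
summary(model3)

# Plot both fitted curves: quadratic in solid blue, cubic in dashed green
plot(Height, LungCap)
ord <- order(Height)
lines(Height[ord], fitted(model2)[ord], col = "blue", lwd = 3)
lines(Height[ord], fitted(model3)[ord], col = "green", lwd = 3, lty = 2)
legend("topleft", legend = c("quadratic", "cubic"),
       col = c("blue", "green"), lwd = 3, lty = c(1, 2))

# Partial F test: does adding the cubic term improve the model?
anova(model2, model3)
```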
When we conduct the partial F test, we can see that the p-value is large: there is not a statistically significant difference between the models, so we can decide that including height cubed is not necessary in our model. Before finishing off, we should note that there are other approaches to dealing with nonlinearity, some of which include transforming the x or y variable, converting x to a categorical variable (factor), or using nonlinear regression methods instead. All of these different approaches have their pros and cons!
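The first two alternatives mentioned can be sketched briefly (these are illustrative choices, not the video's code, and use the simulated stand-in data; the log transform and four-bin cut are arbitrary examples):

```r
# Simulated stand-in data (not the video's real dataset)
set.seed(1)
Height <- runif(200, 45, 80)
LungCap <- 0.00125 * Height^2 + rnorm(200, sd = 1)

# Transform the x variable: e.g. regress on log(Height)
m_log <- lm(LungCap ~ log(Height))

# Convert x to a categorical variable (factor) with cut()
HeightCat <- cut(Height, breaks = 4)
m_cat <- lm(LungCap ~ HeightCat)
summary(m_cat)
```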
Thanks for watching this video. Make sure to subscribe to MarinStatsLectures, like us on Facebook, and visit our website (statslectures.com).