
Stepwise regression

Statement of problem
A common problem is that there is a large set of candidate predictor variables.
The goal is to choose a small subset from the larger set so that the resulting regression model is simple, yet has good predictive ability.

What method should you use: forward or backward?
If you have a very large set of potential independent variables from which you wish to extract a few, you should generally go forward.
If you have a modest-sized set of potential variables from which you wish to eliminate a few, you should generally go backward.

Stepwise regression:
Preliminary steps
1. Specify an Alpha-to-Enter significance level (αE = 0.05).
2. Specify an Alpha-to-Remove significance level (αR = 0.05).

Stepwise regression:
Stopping the procedure
The procedure is stopped when adding an additional predictor does not yield a t-test P-value below αE = 0.05.
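A minimal illustration of the stopping test in Python (statsmodels), on invented synthetic data; the variables and the single-candidate setup are assumptions made for the sketch, not part of the slides:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # candidate predictor (pure noise here)
y = 2.0 * x1 + rng.normal(size=n)    # y depends only on x1

ALPHA_ENTER = 0.05                   # the Alpha-to-Enter level from step 1

# The current model contains x1; test whether x2 should enter.
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
p_candidate = fit.pvalues[2]         # t-test p-value for x2's coefficient
if p_candidate >= ALPHA_ENTER:
    print(f"Stop: p = {p_candidate:.3f} is not below the Alpha-to-Enter level")
```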

Caution about stepwise regression!
Do not jump to the conclusion
that all the important predictor variables for predicting y have been identified, or
that all the unimportant predictor variables have been eliminated.

Caution about stepwise regression!
The probability is high
that we included some unimportant predictors, and
that we excluded some important predictors.

Drawbacks of stepwise regression
The final model is not guaranteed to be optimal in any specified sense.
The procedure yields a single final model, although in practice there are often several equally good models.
It doesn't take into account a researcher's knowledge about the predictors.

The three most commonly used automated procedures are:
Forward selection -- start with the best predictor and add predictors to get a best model
Backward selection -- start with a full model and delete predictors to get a best model
Stepwise selection -- a combination of the first two

Forward selection
Step 1 -- the first predictor in the model is the best single predictor.
Select the predictor with the numerically largest simple correlation with the dependent variable:
ry,x1 vs. ry,x2 vs. ry,x3 vs. ry,x4

Step 2 -- the next predictor in the model is the one that will contribute the most -- with two equivalent definitions:
1. The 2-predictor model (including the first predictor) with the numerically largest R² -- if the R² is significant and significantly larger than the r² from the first step:

R²y.x3,x1 vs. R²y.x3,x2 vs. R²y.x3,x4

2. Add to the model the predictor with the highest semi-partial correlation with the dependent variable, controlling the predictor for the predictor already in the model:

ry(x1.x3) vs. ry(x2.x3) vs. ry(x4.x3)
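The two definitions are equivalent because the squared semi-partial correlation equals the gain in R² from adding the predictor; for example:

r²y(x1.x3) = R²y.x1,x3 - R²y.x3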

All subsequent steps -- the next predictor in the model is the one that will contribute the most -- with two equivalent definitions:
1. The 3-predictor model (including the predictors already in the model) with the numerically largest R² -- if the R² is significant and significantly larger than the R² from the previous step:

R²y.x3,x2,x1 vs. R²y.x3,x2,x4

2. Add to the model the predictor with the highest semi-partial correlation with the dependent variable, controlling the predictor for the predictors already in the model:

ry(x1.x3,x2) vs. ry(x4.x3,x2)

When to quit? When no additional predictor will significantly increase the R² (same as when no multiple semi-partial is significant).

Difficulties with the forward inclusion model
The major potential problem is over-inclusion -- a predictor that contributes to a smaller (earlier) model fails to continue to contribute as the model gets larger (with increased collinearity), but the predictor stays in the model.
The resulting model may not be the best -- there may be another model with the same # of predictors but a larger R², etc.
All of these problems are exacerbated by increased collinearity!
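A minimal forward-inclusion sketch in Python (pandas + statsmodels), using the coefficient t-test p-value as the entry criterion; the function name and the default αE = 0.05 are illustrative assumptions, not a fixed API:

```python
import pandas as pd
import statsmodels.api as sm

def forward_select(y, X, alpha_enter=0.05):
    """Greedy forward inclusion: at each step, add the candidate whose
    coefficient t-test p-value is smallest, if it is below alpha_enter."""
    included = []
    while True:
        remaining = [c for c in X.columns if c not in included]
        if not remaining:
            break
        # p-value of each candidate's coefficient when added to the current model
        pvals = {}
        for c in remaining:
            fit = sm.OLS(y, sm.add_constant(X[included + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break          # no remaining predictor significantly increases R²
        included.append(best)
    return included
```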

Backward selection
Step 1 -- start with the full model (all predictors) -- if the R² is significant. Consider the regression weights of this model.
Step 2 -- remove from the model the predictor that contributes the least.
Delete the predictor with the largest p-value associated with its regression (b) weight -- if that p-value is greater than .05. (The idea is that the predictor with the largest p-value is the one most likely not to be contributing to the model in the population.)

bx1 (p=.08) vs. bx2 (p=.02) vs. bx3 (p=.02) vs. bx4 (p=.27)

On all subsequent steps, the next predictor dropped from the model is the one whose regression weight has the largest (non-significant) p-value.

bx1 (p=.21) vs. bx2 (p=.14) vs. bx3 (p=.012)

When to quit? When all the predictors in the model are contributing to the model.
Difficulties with the backward deletion model
The major potential problem is under-inclusion -- a predictor that is deleted from a larger (earlier) model would contribute to a smaller model, but isn't re-included.
The resulting model may not be the best -- there may be another model with the same # of predictors but a larger R², etc.
All of these problems are exacerbated by increased collinearity!
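A matching backward-deletion sketch under the same illustrative assumptions (αR = 0.05; pandas DataFrame X, Series y):

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X, alpha_remove=0.05):
    """Start from the full model and repeatedly drop the predictor whose
    b weight has the largest p-value, while that p-value exceeds alpha_remove."""
    included = list(X.columns)
    while included:
        fit = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = fit.pvalues.drop('const')   # p-values of the b weights only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha_remove:
            break                           # every remaining predictor contributes
        included.remove(worst)
    return included
```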

Stepwise regression
Step 1 -- the first predictor in the model is the best single predictor (same as the forward inclusion model).
Select the predictor with the numerically largest simple correlation with the criterion -- if it is a significant correlation.
By using this procedure we are sure that the initial model works.

Step 2 -- the next predictor in the model is the one that will contribute the most -- with two equivalent definitions (same as the forward inclusion model):
1. The 2-predictor model (including the first predictor) with the numerically largest R² -- if the R² is significant and significantly larger than the r² from the first step.
2. Add to the model the predictor with the highest semi-partial correlation with the criterion, controlling the predictor for the predictor already in the model -- if the semi-partial is significant.
By using this procedure we are sure the 2-predictor model works, and works better than the 1-predictor model.

On all subsequent steps (each having two parts):
a. Remove from the model the predictor that contributes the least (same as the backward deletion model).
Delete the predictor with the largest p-value associated with its regression (b) weight -- if that p-value is greater than .05. (The idea is that the predictor with the largest p-value is the one most likely not to be contributing to the model in the population.)
-- If a predictor is deleted, look for a second (third, etc.) that should also be deleted, before moving on to part b.
By using this procedure, we are sure that all the predictors in the model are contributing before adding any additional predictors to the model.

b. The next predictor in the model is the one that will contribute the most (same as for forward inclusion) -- with two equivalent definitions:
1. The enlarged model (including the predictors already in it) with the numerically largest R² -- if the R² is significant and significantly larger than the R² from the previous step.
2. Add to the model the predictor with the highest semi-partial correlation with the criterion, controlling the predictor for the predictors already in the model -- if the semi-partial is significant.
By using this procedure we are sure the model with the added predictor works, and works better than the model without it.

When to quit? When BOTH of two conditions hold:
1. All predictors included in the model are contributing to it.
2. None of the predictors that are not in the model would contribute if they were added.
By using this procedure we avoid both over-inclusion and under-inclusion.
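Putting parts a and b together, a hedged sketch of the full stepwise loop (same illustrative conventions as the forward and backward sketches above; real implementations usually require αR ≥ αE to avoid cycling):

```python
import pandas as pd
import statsmodels.api as sm

def stepwise_select(y, X, alpha_enter=0.05, alpha_remove=0.05):
    """Alternate a forward entry step (part b) with backward removal
    steps (part a) until both stopping conditions hold."""
    included = []
    while True:
        changed = False
        # Part b: add the best remaining predictor, if it enters at alpha_enter.
        remaining = [c for c in X.columns if c not in included]
        if remaining:
            pvals = {c: sm.OLS(y, sm.add_constant(X[included + [c]])).fit().pvalues[c]
                     for c in remaining}
            best = min(pvals, key=pvals.get)
            if pvals[best] < alpha_enter:
                included.append(best)
                changed = True
        # Part a: drop predictors that no longer contribute at alpha_remove.
        while len(included) > 1:
            fit = sm.OLS(y, sm.add_constant(X[included])).fit()
            pv = fit.pvalues.drop('const')
            worst = pv.idxmax()
            if pv[worst] <= alpha_remove:
                break
            included.remove(worst)
            changed = True
        if not changed:
            break   # conditions 1 and 2 both hold
    return included
```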

Difficulties with stepwise regression
The resulting model may not be the best -- there may be another model with the same # of predictors but a larger R².
It assumes that the best model is found by starting with the best single predictor.
This problem is exacerbated by increased collinearity!

Model selection
A full model is one that includes all the variables.
A null model is one that includes only the intercept.
Selection of which variables to include can be done by you, by the computer, or both.
Types of selection: forward, backward, stepwise.
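As a quick illustration of the full-versus-null distinction, a small Python sketch on invented synthetic data (the names and coefficients are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 50
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, 0.0, -0.5]) + rng.normal(size=n)

null_fit = sm.OLS(y, np.ones((n, 1))).fit()     # null model: intercept only
full_fit = sm.OLS(y, sm.add_constant(X)).fit()  # full model: all variables
print(null_fit.rsquared, full_fit.rsquared)     # 0.0 vs. something substantial
```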

Backward selection
Starts with a full model.
Removes variables, starting with the least significant variable.
Often the best approach to start with.

What do you get when you cross a statistician with a chiropractor?
You get an adjusted R squared from a BACKward regression problem!

Forward selection
Starts with a null model.
Enters the variables into the model starting with the most significant.
Can miss important associations or interactions.

Stepwise selection
Starts with a full or null model (usually a full model, i.e., backward stepwise).
Adds or removes variables based on their significance in the model.
Looks at the variable itself and its relationship with the others in the model.
Can be considered the best automatic model selection, especially with many exposure variables.

Stepwise Regression Analysis
Stepwise finds the explanatory variable with the highest R² to start with. It then checks each of the remaining variables until the two variables with the highest R² are found. It then repeats the process until the three variables with the highest R² are found, and so on.
The overall R² gets larger as more variables are added.
Stepwise may be useful in the early exploratory stage of data analysis, but it should not be relied upon for the confirmatory stage.
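The claim that the overall R² only grows can be checked directly with noise predictors; a small Python illustration on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 60
y = rng.normal(size=n)
X = rng.normal(size=(n, 5))          # five pure-noise predictors

for k in range(1, 6):
    fit = sm.OLS(y, sm.add_constant(X[:, :k])).fit()
    print(k, round(fit.rsquared, 3), round(fit.rsquared_adj, 3))
# R² rises with every added column even though all predictors are noise;
# adjusted R² does not, which is one reason raw R² cannot be trusted at
# the confirmatory stage.
```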

Week assignment
Summary of stepwise regression (3 pages)
Run stepwise regression (data in next slide)
https://www.youtube.com/watch?v=eme0ErU7GJA

Data for stepwise regression
[Flattened data table: columns childAA, childA, childIn, parentIn, teacherIn, frequency; the cell values (each extracted as 30) could not be fully reconstructed from the slide.]
