Lesson 42
Section 6.2
Model Selection
The best model is not always the most
complicated. Sometimes including variables that
are not evidently important can actually reduce the
accuracy of predictions.
However, it is not always clear whether a variable
should be included in the final model, so we need a
strategy for eliminating the less important variables
from the model.
The model that includes all available explanatory
variables is often referred to as the full model.
Our goal is to assess whether the full model is the
best model. If it isn't, we want to identify a smaller
model that is preferable.
The table below provides a summary of the
regression output for the full model for the Mario
Kart auction data.
Backward-Elimination
The backward-elimination strategy starts with the
model that includes all potential predictor variables.
Variables are eliminated one at a time from the model
until only variables with statistically significant
p-values remain.
The strategy within each elimination step is to drop the
variable with the largest p-value, refit the model, and
reassess the inclusion of all variables.
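As a rough illustration, this drop-the-worst-and-refit loop can be sketched in Python (the slides show R output, not code, so everything here is an assumption for the demo: the `ols_pvalues` helper, the synthetic data, and a normal approximation to the t distribution used for simplicity):

```python
import numpy as np
from math import erfc, sqrt

def ols_pvalues(X, y):
    """Fit least squares with an intercept; return two-sided p-values for the
    slope coefficients.  Uses a normal approximation to the t distribution,
    which is accurate for large residual degrees of freedom."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - Xd.shape[1])   # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
    pvals = np.array([erfc(abs(b / s) / sqrt(2)) for b, s in zip(beta, se)])
    return pvals[1:]                             # drop the intercept's p-value

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the predictor with the largest p-value, refit, and repeat until
    every remaining predictor is significant at level alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        pvals = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                                # all remaining variables stay
        keep.pop(worst)                          # eliminate and refit
    return [names[j] for j in keep]

# Synthetic demo: y depends on x1 and x2 but not on the noise column.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)
selected = backward_eliminate(X, y, ["x1", "x2", "noise"])
```

The strongly related predictors survive every refit, while a pure-noise column is usually the first to be dropped.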
Example 3
Results corresponding to the full model for the Mario
Kart data are shown below. How should we proceed
under the backward-elimination strategy?
In the latest model, we see that the two remaining
predictors have statistically significant coefficients with
p-values of about zero.
Since there are no variables remaining that could be
eliminated from the model, we stop.
Forward-Selection
For the Mario Kart data, we would start with (1) the
model that includes no variables.
Now we fit each of the possible models with just one
variable. That is, we fit (2) the model including just the
cond_new predictor, then (3) the model including just
the stock_photo variable, then (4) the model with just
duration, and (5) the model with just wheels.
Each of the four models (yes, we fit four models!)
provides a p-value for the coefficient of the predictor
variable.
Out of these four variables, the wheels variable had
the smallest p-value and largest test statistic. Since its
p-value is less than 0.05 (the p-value was smaller than
2 × 10⁻¹⁶), we add the Wii wheels variable to the model.
Once a variable is added in forward-selection, it will be
included in all models considered as well as the final
model.
Since we successfully found a first variable to add, we
consider adding another. We fit three new models: (6)
the model including just the cond_new and wheels
variables, (7) the model including just the stock_photo
and wheels variables, and (8) the model including only
the duration and wheels variables.
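The add-one-candidate-at-a-time loop described above can be sketched similarly. As before, this is an illustrative Python sketch rather than the slides' actual R workflow; the `ols_pvalues` helper, the synthetic data, and the normal approximation to the t distribution are all assumptions made for the demo:

```python
import numpy as np
from math import erfc, sqrt

def ols_pvalues(X, y):
    """Two-sided p-values for the slope coefficients of an OLS fit with an
    intercept (normal approximation to the t distribution)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - Xd.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
    return np.array([erfc(abs(b / s) / sqrt(2)) for b, s in zip(beta, se)])[1:]

def forward_select(X, y, names, alpha=0.05):
    """At each step, try adding every remaining candidate, keep the one whose
    coefficient has the smallest p-value; stop when none is significant."""
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        best_p, best_j = min(
            (ols_pvalues(X[:, chosen + [j]], y)[-1], j) for j in remaining
        )
        if best_p > alpha:
            break                      # no candidate earns its way in; stop
        chosen.append(best_j)          # the new variable stays in all later models
        remaining.remove(best_j)
    return [names[j] for j in chosen]

# Synthetic demo mirroring the slides: two real predictors plus a noise column.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=200)
selected = forward_select(X, y, ["x1", "x2", "noise"])
```

The strongest predictor enters first, the weaker real predictor second, and the loop typically stops before the noise column qualifies.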
Model 6        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     39.7849      0.7066    52.06    0.0000
wheels           7.2328      0.5419    13.35    0.0000
cond_new         5.5848      0.9245     6.04    0.0000
R²adj = 0.7124                                 df = 138
Model 8        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     39.8029      1.1806    33.71    0.0000
wheels           8.1844      0.5664    14.45    0.0000
duration         0.4729      0.1848     2.56    0.0116
R²adj = 0.6528                                 df = 138
Of these models, the model with the wheels and
cond_new variables had the lowest p-value and
highest test statistic for its new variable (the p-value
corresponding to cond_new was 1.4 × 10⁻⁸).
We now repeat the process a third time, fitting two new
models: (9) the model including the stock_photo,
cond_new, and wheels variables and (10) the model
including the duration, cond_new, and wheels
variables.
Model 9        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     36.0483      0.9745    36.99    0.0000
wheels           7.2984      0.5448    13.40    0.0000
cond_new         5.1763      0.9961     5.20    0.0000
stock_photo      1.1177      1.0192     1.10    0.2747
R²adj = 0.7128                                 df = 137
Model Selection Summary
The backward-elimination strategy begins with the
largest model and eliminates variables one-by-one
until we are satisfied that all remaining variables are
important to the model.
The forward-selection strategy starts with no variables
included in the model, then it adds in variables
according to their importance until no other important
variables are found.
It is worth noting that there is no guarantee that the
backward-elimination and forward-selection strategies
will arrive at the same final model.
It is also worth noting that there is no guarantee
that either strategy will arrive at the overall best model,
especially when there are hundreds, thousands, or
even millions of variables to check.
For 50 variables, even if you could check 1,000,000
models a second, it would take you about 36 years to
look at all 2⁵⁰ possible models.
For 100 variables at 1 trillion models a second, it
would still take about 40 billion years to check all
possible models. Good luck!
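These figures follow from there being 2ᵏ possible subsets of k candidate variables, hence 2ᵏ candidate models. A quick back-of-the-envelope check, using the checking speeds assumed on the slide:

```python
# With k candidate variables there are 2**k subsets, i.e. 2**k models.
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# 50 variables at one million models per second:
years_50 = 2**50 / 1e6 / SECONDS_PER_YEAR      # about 36 years

# 100 variables at one trillion models per second:
years_100 = 2**100 / 1e12 / SECONDS_PER_YEAR   # about 4e10, i.e. ~40 billion years
```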
It is often impossible to consider all possible models,
so the best model cannot be guaranteed. Fortunately,
backward elimination and forward selection will give
pretty good models to use.
It is generally acceptable to use just one strategy.
However, if the backward-elimination and forward-
selection strategies are both tried and they arrive at
different models, choose the model with the larger
adjusted R² as a tie-breaker.
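For reference, adjusted R² is 1 − (1 − R²)(n − 1)/(n − p − 1), which penalizes R² for each extra predictor so that models of different sizes can be compared. A minimal sketch (the R² values here are hypothetical; n = 141 is chosen to match the df = 138 in the two-predictor output above):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fit to n observations.
    Plain R^2 never decreases when a predictor is added; the (n-1)/(n-p-1)
    penalty means adjusted R^2 rises only if the variable pulls its weight."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: does a third predictor justify its penalty?
small = adjusted_r2(0.70, n=141, p=2)   # two-predictor model
big   = adjusted_r2(0.71, n=141, p=3)   # three-predictor model
```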