MATH& 146

Lesson 42
Section 6.2
Model Selection

Model Selection
The best model is not always the most
complicated. Sometimes including variables that
are not evidently important can actually reduce the
accuracy of predictions.
However, it is not always clear when a variable
should or should not be included in the final model,
so a strategy needs to be developed that will help
us eliminate from the model variables that are less
important.

Model Selection
The model that includes all available explanatory
variables is often referred to as the full model.
Our goal is to assess whether the full model is the
best model. If it isn't, we want to identify a smaller
model that is preferable.

Model Selection
The table below provides a summary of the
regression output for the full model for the Mario
Kart auction data.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration        0.0268      0.1904     0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
Adjusted R^2 = 0.7108    df = 136

Model Selection
The last column of the table lists the p-values that can
be used to assess hypotheses of the following form:
H0: β_i = 0, HA: β_i ≠ 0, assuming the other
explanatory variables are held constant in the model.
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration        0.0268      0.1904     0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
Adjusted R^2 = 0.7108    df = 136

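As a rough check on how the table columns fit together, the sketch below recomputes each t value as Estimate / Std. Error and converts it to a two-sided p-value using a t distribution with df = 136. The helper names t_pdf and two_sided_p are made up for this illustration, and Simpson's rule is only one simple way to get the tail area. Statistical software would normally report these values directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, using Simpson's rule on [0, |t|] (steps is even)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3              # approximates P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)

# (estimate, standard error) pairs copied from the full-model table
full_model = {
    "cond_new": (5.1306, 1.0511),
    "stock_photo": (1.0803, 1.0568),
    "duration": (0.0268, 0.1904),
    "wheels": (7.2852, 0.5547),
}
DF = 136

for name, (estimate, std_error) in full_model.items():
    t = estimate / std_error          # the "t value" column
    print(f"{name:12s} t = {t:6.2f}   p = {two_sided_p(t, DF):.4f}")
```

Up to rounding, the printed t values and p-values should agree with the last two columns of the table.
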
Example 1
The coefficient of cond_new has a point estimate of
b1 = 5.13 and a p-value for its corresponding
hypotheses (H0: β_1 = 0, HA: β_1 ≠ 0) of about zero. How
can this be interpreted?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration        0.0268      0.1904     0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
Adjusted R^2 = 0.7108    df = 136

Example 2
Identify the p-values for each variable in the model. Is
there strong evidence supporting the connection of
these variables with the total price in the model?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration        0.0268      0.1904     0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
Adjusted R^2 = 0.7108    df = 136

Model Selection
There is no statistically significant evidence that either
the stock_photo or the duration variable contributes
meaningfully to the model. Next we consider common
strategies for pruning such variables from a model.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration        0.0268      0.1904     0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
Adjusted R^2 = 0.7108    df = 136

Model Selection
Two common strategies for adding or removing
variables in a multiple regression model are called
backward-elimination and forward-selection.
These techniques are often referred to as stepwise
model selection strategies, because they add or delete
one variable at a time as they "step" through the
candidate predictors.

Backward-Elimination
The backward-elimination strategy starts with the
model that includes all potential predictor variables.
Variables are eliminated one-at-a-time from the model
until only variables with statistically significant p-values
remain.
The strategy within each elimination step is to drop the
variable with the largest p-value, refit the model, and
reassess the inclusion of all variables.

Example 3
Results corresponding to the full model for the Mario
Kart data are shown below. How should we proceed
under the backward-elimination strategy?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.2110      1.5140    23.92    0.0000
cond_new        5.1306      1.0511     4.88    0.0000
stock_photo     1.0803      1.0568     1.02    0.3085
duration        0.0268      0.1904     0.14    0.8882
wheels          7.2852      0.5547    13.13    0.0000
Adjusted R^2 = 0.7108    df = 136

Example 4
The variable duration has been removed and a new
model fitted. Now how should we proceed under the
backward-elimination strategy?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.0483      0.9745    36.99    0.0000
cond_new        5.1763      0.9961     5.20    0.0000
stock_photo     1.1177      1.0192     1.10    0.2747
wheels          7.2984      0.5448    13.40    0.0000
Adjusted R^2 = 0.7128    df = 137

Backward-Elimination
Notice that the p-value for stock_photo changed a little
from the full model (0.3085) to the model that did not
include the duration variable (0.2747).
It is common for p-values of one variable to change,
due to collinearity, after eliminating a different variable.
This fluctuation emphasizes the importance of refitting
a model after each variable elimination step. The p-
values tend to change dramatically when the
eliminated variable is highly correlated with another
variable in the model.

Backward-Elimination
In the latest model, fitted after also dropping
stock_photo, we see that the two remaining predictors
have statistically significant coefficients with
p-values of about zero.
Since no remaining variable has a large p-value, there
is nothing left to eliminate from the model, and we stop.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.7849      0.7066    52.06    0.0000
cond_new        5.5848      0.9245     6.04    0.0000
wheels          7.2328      0.5419    13.35    0.0000
Adjusted R^2 = 0.7124    df = 138

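The elimination steps walked through above can be sketched as a loop. This is only an illustration: instead of refitting a regression, the hypothetical helper fit_p_values looks up the p-values printed in the three Mario Kart tables, and the 0.05 cutoff is an assumed significance level.

```python
# p-values copied from the three tables above, keyed by the set of
# predictors in the model (a stand-in for actually refitting the model).
P_VALUES = {
    frozenset({"cond_new", "stock_photo", "duration", "wheels"}): {
        "cond_new": 0.0000, "stock_photo": 0.3085,
        "duration": 0.8882, "wheels": 0.0000,
    },
    frozenset({"cond_new", "stock_photo", "wheels"}): {
        "cond_new": 0.0000, "stock_photo": 0.2747, "wheels": 0.0000,
    },
    frozenset({"cond_new", "wheels"}): {
        "cond_new": 0.0000, "wheels": 0.0000,
    },
}

def fit_p_values(predictors):
    """Hypothetical stand-in for refitting: look up the slide's p-values."""
    return P_VALUES[frozenset(predictors)]

def backward_eliminate(predictors, alpha=0.05):
    """Drop the predictor with the largest p-value until all are below alpha."""
    predictors = list(predictors)
    while predictors:
        p_values = fit_p_values(predictors)      # refit after every drop
        worst = max(predictors, key=lambda v: p_values[v])
        if p_values[worst] < alpha:              # everything is significant
            break
        predictors.remove(worst)                 # remove one variable only
    return predictors

final = backward_eliminate(["cond_new", "stock_photo", "duration", "wheels"])
print(final)
```

With the slide's numbers, the loop drops duration first, then stock_photo, and stops with cond_new and wheels.
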
Example 5
a) Write out our final model for predicting the total
auction price.
b) What is the expected price for a new Mario Kart
game that included two wheels?
c) What is the expected price for a used Mario Kart
game that did not include any wheels?

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.7849      0.7066    52.06    0.0000
cond_new        5.5848      0.9245     6.04    0.0000
wheels          7.2328      0.5419    13.35    0.0000
Adjusted R^2 = 0.7124    df = 138

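One way to work parts (b) and (c) is to plug values into the fitted equation, using the coefficients from the table above. The sketch assumes cond_new is an indicator coded 1 for a new game and 0 for a used one, which the variable name suggests but the slides do not state. The function name predicted_price is made up for this illustration.

```python
def predicted_price(cond_new, wheels):
    """Fitted model: price = 36.7849 + 5.5848*cond_new + 7.2328*wheels."""
    return 36.7849 + 5.5848 * cond_new + 7.2328 * wheels

# (b) new game (cond_new = 1, an assumed coding) with two wheels
new_two_wheels = predicted_price(cond_new=1, wheels=2)

# (c) used game (cond_new = 0) with no wheels: just the intercept
used_no_wheels = predicted_price(cond_new=0, wheels=0)

print(f"{new_two_wheels:.2f}")   # 56.84
print(f"{used_no_wheels:.2f}")   # 36.78
```
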
Forward-Selection
The forward-selection strategy is the reverse of the
backward-elimination technique.
Instead of eliminating variables one-at-a-time, we
add variables one-at-a-time until we cannot find
any variables that present strong evidence of their
importance in the model.

Forward-Selection
For the Mario Kart data, we would start with (1) the
model that includes no variables.

Model 1       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    47.4319      0.7675    61.80    0.0000
Adjusted R^2 = 0    df = 140

Forward-Selection
Now we fit each of the possible models with just one
variable. That is, we fit (2) the model including just the
cond_new predictor, then (3) the model including just
the stock_photo variable, then (4) the model with just
duration, and (5) the model with just wheels.
Each of the four models (yes, we fit four models!)
provides a p-value for the coefficient of the predictor
variable.

Forward-Selection

Model 2       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    42.8711      0.8140    52.67    0.0000
cond_new       10.8996      1.2583     8.66    0.0000
Adjusted R^2 = 0.3459    df = 139

Model 3       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    44.3272      1.4935    29.68    0.0000
stock_photo     4.1692      1.7307     2.41    0.0173
Adjusted R^2 = 0.0332    df = 139

Forward-Selection

Model 4       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    52.3736      1.2608    41.54    0.0000
duration        1.3172      0.2769     4.76    0.0000
Adjusted R^2 = 0.1338    df = 139

Model 5       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    37.5020      0.7804    48.06    0.0000
wheels          8.6427      0.5479    15.77    0.0000
Adjusted R^2 = 0.6390    df = 139

Forward-Selection
Out of these four variables, the wheels variable had
the smallest p-value and largest test statistic. Since its
p-value is less than 0.05 (the p-value was smaller than
2 × 10^-16), we add the Wii wheels variable to the model.
Once a variable is added in forward-selection, it will be
included in all models considered afterward as well as
the final model.

Forward-Selection
Since we successfully found a first variable to add, we
consider adding another. We fit three new models: (6)
the model including just the cond_new and wheels
variables, (7) the model including just the stock_photo
and wheels variables, and (8) the model including only
the duration and wheels variables.

Forward-Selection
Model 6       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.7849      0.7066    52.06    0.0000
wheels          7.2328      0.5419    13.35    0.0000
cond_new        5.5848      0.9245     6.04    0.0000
Adjusted R^2 = 0.7124    df = 138

Model 7       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    35.3144      1.0512    33.60    0.0000
wheels          8.5384      0.5339    15.99    0.0000
stock_photo     3.0985      1.0305     3.08    0.0031
Adjusted R^2 = 0.6587    df = 138

Forward-Selection
Model 8       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    39.8029      1.1806    33.71    0.0000
wheels          8.1844      0.5664    14.45    0.0000
duration        0.4729      0.1848     2.56    0.0116
Adjusted R^2 = 0.6528    df = 138

Forward-Selection
Of these models, the model with the wheels and
cond_new variables had the lowest p-value and
highest test statistic for its new variable (the p-value
corresponding to cond_new was 1.4 × 10^-8).
Because this p-value is below 0.05, we add the
cond_new variable to the model. Now the final model
is guaranteed to include both the condition and wheels
variables.

Forward-Selection
We now repeat the process a third time, fitting two new
models: (9) the model including the stock_photo,
cond_new, and wheels variables and (10) the model
including the duration, cond_new, and wheels
variables.

Forward-Selection
Model 9       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    36.0483      0.9745    36.99    0.0000
wheels          7.2984      0.5448    13.40    0.0000
cond_new        5.1763      0.9961     5.20    0.0000
stock_photo     1.1177      1.0192     1.10    0.2747
Adjusted R^2 = 0.7128    df = 137

Model 10      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    37.1750      1.1846    31.38    0.0000
wheels          7.2018      0.5488    13.12    0.0000
cond_new        5.4170      1.0133     5.35    0.0000
duration        0.0758      0.1843     0.41    0.6817
Adjusted R^2 = 0.7107    df = 137

Forward-Selection
The p-value corresponding to stock_photo in Model 9
(0.2747) was smaller than the p-value corresponding
to duration in Model 10 (0.6817).
However, since this smaller p-value was not below
0.05, there was no evidence that it should be included
in the best model. Therefore, neither variable is added
and we are finished.
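
The forward-selection walk-through above can be sketched the same way. Rather than fitting Models 2 through 10, the sketch looks up the t value and p-value reported for each newly added variable. The table name CANDIDATE_STATS and the function forward_select are made up for this illustration, and 0.05 is the cutoff used on the slides.

```python
# (t value, p-value) of the newly added variable, copied from Models 2-10.
# Key: (variables already in the model, candidate variable).
CANDIDATE_STATS = {
    (frozenset(), "cond_new"): (8.66, 0.0000),
    (frozenset(), "stock_photo"): (2.41, 0.0173),
    (frozenset(), "duration"): (4.76, 0.0000),
    (frozenset(), "wheels"): (15.77, 0.0000),
    (frozenset({"wheels"}), "cond_new"): (6.04, 0.0000),
    (frozenset({"wheels"}), "stock_photo"): (3.08, 0.0031),
    (frozenset({"wheels"}), "duration"): (2.56, 0.0116),
    (frozenset({"wheels", "cond_new"}), "stock_photo"): (1.10, 0.2747),
    (frozenset({"wheels", "cond_new"}), "duration"): (0.41, 0.6817),
}

def forward_select(candidates, alpha=0.05):
    """Add the strongest candidate each round until none is below alpha."""
    selected = []
    remaining = list(candidates)
    while remaining:
        stats = {v: CANDIDATE_STATS[(frozenset(selected), v)]
                 for v in remaining}
        # The rounded p-values tie at 0.0000, so rank by |t| instead; with
        # equal degrees of freedom the largest |t| has the smallest p-value.
        best = max(remaining, key=lambda v: abs(stats[v][0]))
        if stats[best][1] >= alpha:      # no strong evidence left: stop
            break
        selected.append(best)            # once added, a variable stays in
        remaining.remove(best)
    return selected

chosen = forward_select(["cond_new", "stock_photo", "duration", "wheels"])
print(chosen)
```

With the slide's numbers, wheels is added first, then cond_new, and the third round stops because the best remaining p-value (0.2747) is not below 0.05.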

Model Selection Summary
The backward-elimination strategy begins with the
largest model and eliminates variables one-by-one
until we are satisfied that all remaining variables are
important to the model.
The forward-selection strategy starts with no variables
included in the model, then it adds in variables
according to their importance until no other important
variables are found.

Model Selection Summary
It is worth noting that there is no guarantee that the
backward-elimination and forward-selection strategies
will arrive at the same final model.
It is also worth noting that there is no guarantee
that either strategy will arrive at the overall best model,
especially when there are hundreds, thousands, or
even millions of variables to check.

Model Selection Summary
For 50 variables, even if you could check 1,000,000
models a second, it would take you about 36 years to
look at all 2^50 possible models.
For 100 variables at 1 trillion models a second, it
would still take about 40 billion years to check all
2^100 possible models. Good luck!
It is often impossible to consider all possible models,
so the best model cannot be guaranteed. Fortunately,
backward elimination and forward selection usually give
reasonably good models to use.

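The arithmetic behind these estimates can be checked with a few lines. Each of the k variables is either in or out of a model, so there are 2^k possible models. The function name years_to_check_all is made up for this illustration, and a year is taken to be 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def years_to_check_all(n_variables, models_per_second):
    """Years needed to fit every subset of n_variables predictors."""
    n_models = 2 ** n_variables          # each variable is in or out
    return n_models / models_per_second / SECONDS_PER_YEAR

years_50 = years_to_check_all(50, 1_000_000)     # one million models/second
years_100 = years_to_check_all(100, 10 ** 12)    # one trillion models/second

print(f"{years_50:.1f} years")
print(f"{years_100:.2e} years")
```

The first figure comes out a little under 36 years and the second at roughly 4 × 10^10 years, in line with the numbers quoted above.
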
Model Selection Summary
It is generally acceptable to use just one strategy.
However, if the backward-elimination and
forward-selection strategies are both tried and they
arrive at different models, choose the model with the
larger adjusted R^2 as a tie-breaker.
