Topics Outline
Partial F Test
Adjusted r²
Cp Statistic
Include/Exclude Decisions
Variable Selection Procedures
Partial F Test
There are many situations where a set of explanatory variables form a logical group. It is then
common to include all of the variables in the equation or exclude all of them. An example of this
is when one of the explanatory variables is categorical with more than two categories.
In this case you model it by including dummy variables, using one fewer dummy than the number of categories.
If you decide that the categorical variable is worth including, you might want to keep all of the
dummies. Otherwise, you might decide to exclude all of them.
Consider the following general situation. Suppose you have already estimated a reduced
multiple regression model that includes the variables x_1 through x_j:
y = β_0 + β_1 x_1 + … + β_j x_j + ε
Now you are proposing to estimate a larger model, referred to as the full model, which includes
x_{j+1} through x_k in addition to the variables x_1 through x_j:
y = β_0 + β_1 x_1 + … + β_j x_j + β_{j+1} x_{j+1} + … + β_k x_k + ε
That is, the full model includes all of the variables from the smaller model, but it also includes
k − j extra variables.
The partial F test is used to determine whether the extra variables provide enough extra
explanatory power as a group to warrant their inclusion in the equation. In other words,
the partial F test tests whether the full model is significantly better than the reduced model.
The null and alternative hypotheses can be stated as follows.
H_0: β_{j+1} = β_{j+2} = … = β_k = 0
H_1: at least one of β_{j+1}, …, β_k is not 0
The test statistic is
F = [(SSE_reduced − SSE_full)/(k − j)] / [SSE_full/(n − k − 1)]
where SSE_reduced and SSE_full are the sums of squared residuals of the reduced and full models.
This test statistic measures how much the sum of squared residuals, SSE, decreases by including
the extra variables in the equation. SSE cannot increase when extra variables are added to an
equation, so it will decrease by some amount, however small. But if it does not decrease
sufficiently, the extra variables might not explain enough to justify their inclusion in the
equation, and they should probably be excluded.
If the null hypothesis is true, this test statistic has an F distribution with df1 = k − j and
df2 = n − k − 1 degrees of freedom. If the corresponding P-value is sufficiently small,
you can reject the null hypothesis that the extra variables have no explanatory power.
To perform the partial F test in Excel, run two regressions, one for the reduced model
(with explanatory variables x_1 through x_j) and one for the full model (with explanatory
variables x_1 through x_k), and use the appropriate values from their ANOVA tables to calculate
the F test statistic. Then use Excel's FDIST function (F.DIST.RT in newer versions of Excel)
to calculate the corresponding P-value.
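The two-regression recipe above can be sketched in code. This is a minimal illustration, not part of the original handout: it takes the SSE values read off the two ANOVA tables, and the helper name partial_f is ours.

```python
def partial_f(sse_reduced, sse_full, n, k_reduced, k_full):
    """Partial F statistic for H0: the k_full - k_reduced extra
    coefficients are all zero, given n observations."""
    df1 = k_full - k_reduced        # number of extra variables
    df2 = n - k_full - 1            # error df of the full model
    f_stat = ((sse_reduced - sse_full) / df1) / (sse_full / df2)
    return f_stat, df1, df2

# The P-value is the upper tail of the F(df1, df2) distribution,
# e.g. FDIST(f_stat, df1, df2) in Excel or scipy.stats.f.sf(f_stat, df1, df2).
```

Feeding in the SSE values from the reduced and full ANOVA tables gives the statistic to compare against the F(df1, df2) critical value.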
Reminder: The ANOVA table for the full equation has the following form.

| Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares (Variance) | F statistic | P-value |
| Regression | k | SSR | MSR = SSR / k | F = MSR / MSE | Prob > F |
| Error | n − k − 1 | SSE | MSE = SSE / (n − k − 1) | | |
| Total | n − 1 | SST | | | |
Notes:
1. Many users look only at the r² and se values to check whether extra variables are doing a
good job. For example, they might cite that r² went from 80% to 90% or that se went
from 500 to 400 as evidence that extra variables provide a significantly better fit.
Although these are important indicators, they are not the basis for a formal hypothesis test.
The partial F test is the formal test of significance for an extra set of variables.
2. If the partial F test shows that a group of variables is significant, it does not imply that each
variable in this group is significant. Some of these variables can have low t-values
(and consequently, large P-values). Some analysts favor excluding the individual variables
that aren't significant, whereas others favor keeping the whole group or excluding the whole
group. Either approach is valid. Fortunately, the final model building results are often nearly
the same either way.
3. StatTools performs partial F tests as part of the procedure for building regression models
when the option Block is chosen in the Regression Type dropdown list.
Example 1
Heating Oil Consumption
A real estate developer wants to predict the heating oil consumption in single-family houses
based on the effect of atmospheric temperature and the amount of attic insulation.
Data are collected from a sample of 15 single-family houses. Of the 15 houses selected, houses
1, 4, 6, 7, 8, 10, and 12 are ranch-style houses. The data are organized and stored in Heating_Oil.xlsx.
| House | Gallons | Temperature (°F) | Insulation (inches) | Style |
| 1 | 275.3 | 40 | 3 | 1 |
| 2 | 363.8 | 27 | 3 | 0 |
| … | … | … | … | … |
| 14 | 323 | 38 | 3 | 0 |
| 15 | 52.5 | 58 | 10 | 0 |

Style is coded 1 for ranch-style houses and 0 otherwise.
Excel output for the reduced model (no interaction terms):

Regression Statistics
| Multiple R | 0.9942 |
| R Square | 0.9884 |
| Adjusted R Square | 0.9853 |
| Standard Error | 15.7489 |
| Observations | 15 |

ANOVA
| | df | SS | MS | F | Significance F |
| Regression | 3 | 233406.9094 | 77802.3031 | 313.6822 | 0.0000 |
| Residual | 11 | 2728.3200 | 248.0291 | | |
| Total | 14 | 236135.2293 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| Intercept | 592.5401 | 14.3370 | 41.3295 | 0.0000 | 560.9846 | 624.0956 |
| Temperature | -5.5251 | 0.2044 | -27.0267 | 0.0000 | -5.9751 | -5.0752 |
| Insulation | -21.3761 | 1.4480 | -14.7623 | 0.0000 | -24.5632 | -18.1891 |
| Style | -38.9727 | 8.3584 | -4.6627 | 0.0007 | -57.3695 | -20.5759 |
Excel output for the full model (with the three interaction terms):

Regression Statistics
| Multiple R | 0.9966 |
| R Square | 0.9931 |
| Adjusted R Square | 0.9880 |
| Standard Error | 14.2506 |
| Observations | 15 |

ANOVA
| | df | SS | MS | F | Significance F |
| Regression | 6 | 234510.5818 | 39085.0970 | 192.4607 | 0.0000 |
| Residual | 8 | 1624.6475 | 203.0809 | | |
| Total | 14 | 236135.2293 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| Intercept | 642.8867 | 26.7059 | 24.0728 | 0.0000 | 581.3028 | 704.4706 |
| Temperature | -6.9263 | 0.7531 | -9.1969 | 0.0000 | -8.6629 | -5.1896 |
| Insulation | -27.8825 | 3.5801 | -7.7882 | 0.0001 | -36.1383 | -19.6268 |
| Style | -84.6088 | 29.9956 | -2.8207 | 0.0225 | -153.7787 | -15.4389 |
| Temp*Insulation | 0.1702 | 0.0886 | 1.9204 | 0.0911 | -0.0342 | 0.3746 |
| Temp*Style | 0.6596 | 0.4617 | 1.4286 | 0.1910 | -0.4051 | 1.7242 |
| Insulation*Style | 4.9870 | 3.5137 | 1.4193 | 0.1936 | -3.1156 | 13.0895 |
To test whether the three interactions significantly improve the regression model, you use the
partial F test. The null and alternative hypotheses are
H_0: β_4 = β_5 = β_6 = 0
H_1: at least one of β_4, β_5, β_6 is not 0
The test statistic, using the SSE values from the two ANOVA tables, is
F = [(2728.3200 − 1624.6475)/3] / [1624.6475/8] = 367.8908/203.0809 = 1.8116
with df1 = 3 and df2 = 8 degrees of freedom, for a P-value of about 0.22.
Because of the large P-value, you conclude that the interactions do not make a significant
contribution to the model, given that the model already includes temperature (x_1),
insulation (x_2), and whether the house is ranch style (x_3). Therefore, the multiple regression
model using x_1, x_2, and x_3 but no interaction terms is the better model.
If you rejected this null hypothesis, you would then test the contribution of each interaction
separately in order to determine which interaction terms to include in the model.
Adjusted r²
Adding new explanatory variables will always keep the r² value the same or increase it;
it can never decrease. In general, adding explanatory variables to the model causes the
prediction errors to become smaller, thus reducing the sum of squares due to error, SSE.
Because SSR = SST − SSE, when SSE becomes smaller, SSR becomes larger, causing
r² = SSR/SST to increase. Therefore, if a variable is added to the model, r² usually becomes larger
even if the variable added is not statistically significant. This can lead to fishing expeditions,
where you keep adding variables to the model, some of which have no conceptual relationship to
the response variable, just to inflate the r² value.
To avoid overestimating the impact of adding an explanatory variable on the amount of
variability explained by the estimated regression equation, many analysts prefer adjusting r² for
the number of explanatory variables. The adjusted r² is defined as
r²_adj = 1 − (1 − r²)(n − 1)/(n − k − 1)
The adjusted r² imposes a penalty for each new term that is added to the model in an attempt
to make models of different sizes (numbers of explanatory variables) comparable. It can decrease
when unnecessary explanatory variables are added to the regression model. Therefore, it serves
as an index that you can monitor. If you add variables and the adjusted r² decreases, the extra
variables are essentially not pulling their weight and should probably be omitted.
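The definition above translates directly to a one-line function. This small sketch is ours, not the handout's:

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-squared for n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: r2 = 0.9931 with n = 15 and k = 6 gives roughly 0.988,
# so the penalty for the six terms barely dents the fit in that case.
```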
For the full model of the Heating Oil Consumption example (with the interaction terms),
n = 15, k = 6, and r² = 0.9931. Thus, the adjusted r² is:
r²_adj = 1 − (1 − r²)(n − 1)/(n − k − 1) = 1 − (1 − 0.9931)(15 − 1)/(15 − 6 − 1)
      = 1 − (0.0069)(1.75) = 0.9880
The adjusted r² for the reduced model (without the interaction terms) is 0.9853.
The adjusted r² for the full model indicates too small an improvement in explaining the variation
in the consumption of heating oil to justify keeping the interaction terms in the model even if the
partial F test were significant.
Note: It can happen that the value of r²_adj is negative. This is not a mistake, but the result of
a model that fits the data very poorly. In this case, some software systems set r²_adj equal to 0.
Excel will print the actual value.
Cp Statistic
Another measure often used in the evaluation of competing regression models is the Cp statistic
developed by Mallows. The formula for computing Cp is
C_p = SSE(k)/MSE(full) − (n − 2k − 2)
where
SSE(k) is the error sum of squares for a regression model that has k explanatory variables, k = 1, 2, ...
MSE(full) is the mean square error for the regression model that has all explanatory variables included.
Theory says that if the value of Cp is large, then the mean square error of the fitted values is large,
indicating either a poor fit, substantial bias in the fit, or both. In addition, if the value of Cp is
much greater than k + 1, then there is a large bias component in the regression, usually indicating
omission of an important variable. Therefore, when evaluating which regression is best, it is
recommended that regressions with small Cp values and those with values near k + 1 be considered.
Although the Cp measure is highly recommended as a useful criterion for choosing between
alternative regressions, keep in mind that the bias is measured with respect to the total group of
variables provided by the researcher. This criterion cannot detect when the researcher has
omitted some variable not included in the total group.
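As a quick sketch (ours, not the handout's), the formula translates directly to code:

```python
def mallows_cp(sse_k, mse_full, n, k):
    """Mallows' Cp for a candidate model with k explanatory variables,
    where mse_full comes from the model with all variables included."""
    return sse_k / mse_full - (n - 2 * k - 2)

# A model with little bias should have Cp close to (or below) k + 1;
# for the full model itself, Cp equals k + 1 exactly.
```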
Include/Exclude Decisions
Finding the best xs (or the best form of the xs) to include in a regression model is undoubtedly
the most difficult part of any real regression analysis problem. You are always trying to get the
best fit possible. The principle of parsimony suggests using the fewest explanatory
variables that can predict the response variable adequately. Regression models with fewer
explanatory variables are easier to interpret and are less likely to be affected by interaction or
collinearity problems. On the other hand, more variables can only increase r², and they usually
reduce the standard error of estimate se. This presents a trade-off, which is the heart of the
challenge of selecting a good model.
The best regression models, in addition to satisfying the conditions of multiple regression, have:
- A relatively small value of se, the standard deviation of the residuals, indicating that the
magnitude of the errors is small.
- Relatively small P-values for the F- and t-statistics, showing that the overall model is better than
a simple summary with the mean and that the individual parameters are reliably different from zero.
Here are several guidelines for including and excluding variables. These guidelines are not
ironclad rules. They typically involve choices at the margin, that is, between equations that are
very similar and seem equally useful.
Guidelines for Including/Excluding Variables in a Regression Model
1. Look at a variable's t-value and its associated P-value. If the P-value is above some accepted
significance level, such as 0.05, this variable is a candidate for exclusion.
2. It is a mathematical fact that:
- If a variable's t-value is less than 1 in absolute value, then se will decrease and adjusted r²
will increase if this variable is excluded from the equation.
- If a variable's t-value is greater than 1 in absolute value, the opposite will occur.
Because of this, some statisticians advocate excluding variables with t-values less than 1 and
including variables with t-values greater than 1 (in absolute value). However, analysts who base
the decision on statistical significance at the usual 5% level, as in guideline 1, typically exclude
a variable from the equation unless its t-value is at least 2 (approximately) in absolute value.
This latter approach is more stringent (fewer variables will be retained), but it is probably the
more popular approach.
3. When there is a group of variables that are in some sense logically related, it is sometimes a
good idea to include all of them or exclude all of them. In this case, their individual t-values
are less relevant. Instead, a partial F test can be used to make the include/exclude decision.
4. Use economic, theoretical or practical considerations to decide whether to include or exclude
variables. Some variables might really belong in an equation because of their theoretical
relationship with the response variable, and their low t-values, possibly the result of an
unlucky sample, should not necessarily disqualify them from being in the equation.
Similarly, a variable that has no economic or physical relationship with the response variable
might have a significant t-value just by chance. This does not necessarily mean that it should
be included in the equation.
You should not agonize too much about whether to include or exclude a variable at the margin.
If you decide to exclude a variable that doesn't add much explanatory power, you get a somewhat
cleaner model, and you probably won't see any dramatic shifts in Cp, r², r²_adj, or se.
On the other hand, if you decide to keep such a variable in the model, the model is less parsimonious
and you have one more variable to interpret, but otherwise there is no real penalty for including it.
In real applications there are often several equations that, for all practical purposes, are equally
useful for describing the relationships or making predictions. There are so many aspects of what
makes a model useful that human judgment is necessary to make a final choice. For example,
in addition to favoring explanatory variables that can be measured reliably, you may want to
favor those that are less expensive to measure. The statistician George Box, who had an
illustrious academic career at the University of Wisconsin, is often quoted as saying,
"All models are wrong, but some models are useful."
Variable Selection Procedures
In most cases the final results of these four procedures (forward selection, backward elimination,
stepwise regression, and best subsets regression) are very similar. However, there is no guarantee
that they will all produce exactly the same final equation. Deciding which estimated regression
equation to use remains a topic for discussion. Ultimately, the analyst's judgment must be applied.
Excel does not come with any variable selection techniques built in. StatTools can be used for
forward selection, backward elimination, and stepwise regression, but it cannot perform best
subsets regression. SAS and Minitab can perform all four techniques.
Example 2
Standby Hours
The operations manager at WTT-TV station is looking for ways to reduce labor expenses.
Currently, the graphic artists at the station receive hourly pay for a significant number of hours
during which they are idle. These hours are called standby hours. The operations manager wants to
determine which factors most heavily affect standby hours of graphic artists. Over a period of 26
weeks, he collected data concerning standby hours (y) and four factors that he suspects are related
to the excessive number of standby hours the station is currently experiencing:
| Standby | Total Staff | Remote | Dubner | Total Labor |
| 245 | 338 | 414 | 323 | 2001 |
| 177 | 333 | 598 | 340 | 2030 |
| … | … | … | … | … |
| 261 | 315 | 164 | 223 | 1839 |
| 232 | 331 | 270 | 272 | 1935 |
How should he build a multiple regression model with the most appropriate mix of explanatory variables?
Solution:
(a) Compute the variance inflation factors to measure the amount of collinearity among the
explanatory variables. (Reminder: VIF_j = 1/(1 − R_j²), where R_j² is the coefficient of
determination from regressing x_j on the other explanatory variables.)
This is always a good starting point for any multiple regression analysis. It involves running
four regressions, one for each explanatory variable against the other x variables.
The following table summarizes the results.
| | Total Staff and all other X | Remote and all other X | Dubner and all other X | Total Labor and all other X |
| Multiple R | 0.6437 | 0.4349 | 0.5610 | 0.7070 |
| R Square | 0.4143 | 0.1891 | 0.3147 | 0.4998 |
| Adjusted R Square | 0.3345 | 0.0786 | 0.2213 | 0.4316 |
| Standard Error | 16.4715 | 124.9392 | 57.5525 | 114.4118 |
| Observations | 26 | 26 | 26 | 26 |
| VIF | 1.7074 | 1.2333 | 1.4592 | 1.9993 |
All the VIF values are relatively small, ranging from a high of 1.9993 for total labor hours to
a low of 1.2333 for remote hours. Thus, on the basis of the criterion that all VIF values should be
less than 5, there is little evidence of collinearity among the set of explanatory variables.
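The VIF computation itself is trivial once the auxiliary R² values are in hand; a small sketch (ours), with the threshold-check helper `collinearity_flags` an illustrative name of our own:

```python
def vif(r2_aux):
    """Variance inflation factor, given the R-squared from regressing
    one explanatory variable on all the others."""
    return 1.0 / (1.0 - r2_aux)

def collinearity_flags(r2_by_var, threshold=5.0):
    """Return the variables (with their VIFs) exceeding the threshold."""
    return {name: vif(r2) for name, r2 in r2_by_var.items()
            if vif(r2) > threshold}
```

With the auxiliary R² values from the table above, every VIF stays below 2, so nothing is flagged at the usual threshold of 5.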
(b) Run forward selection, backward elimination, and stepwise regression and compare the results.
StatTools regression output from running the three procedures is shown on the next two pages.
A significance level of 0.05 is used to enter a variable into the model or to delete a variable
from the model (that is, P-value to Enter = P-value to Leave = 0.05).
The correlations between the response variable and the explanatory variables are:

| | Total Staff | Remote | Dubner | Total Labor |
| Standby | 0.6050 | 0.0953 | 0.2443 | 0.4136 |
As the computer output shows, the forward selection and stepwise regression methods
produce the same results for these data. The first variable entered into the model is total staff,
the variable that correlates most highly with the response variable standby hours
(r = 0.6050). The P-value for the t-test of total staff is 0.0011 (Note: StatTools does not show it
in the final output.) Because it is less than 0.05, total staff is included in the regression model.
The next step involves selecting a second independent variable for the model. The second variable
chosen is one that makes the largest contribution to the model, given that the first variable has
been selected. For this model, the second variable is remote hours. Because the P-value of 0.0269
for remote hours is less than 0.05, remote hours is included in the regression model.
After the remote hours variable is entered into the model, the stepwise regression procedure
determines whether total staff is still an important contributing variable or whether it can be
eliminated from the model. Because the P-value of 0.0001 for total staff is less than 0.05,
total staff remains in the regression model.
The next step involves selecting a third independent variable for the model. Because none of
the other variables meets the 0.05 criterion for entry into the model, the stepwise procedure
terminates with a model that includes total staff present and the number of remote hours.
The backward elimination procedure produces a model that includes all explanatory variables.
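The enter/leave logic just described can be sketched generically. This is our illustration, not StatTools' actual implementation: `pvalue_if_added` is a hypothetical caller-supplied helper that refits the model with one extra variable and returns that variable's t-test P-value.

```python
def forward_selection(candidates, pvalue_if_added, p_to_enter=0.05):
    """Greedy forward selection: at each step add the candidate with the
    smallest entry P-value, stopping when none falls below p_to_enter."""
    selected, remaining = [], list(candidates)
    while remaining:
        pvals = {v: pvalue_if_added(selected, v) for v in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_to_enter:
            break                      # no remaining variable qualifies
        selected.append(best)
        remaining.remove(best)
    return selected
```

Stepwise regression extends this loop with a matching leave test: after each entry, any already-selected variable whose P-value has drifted above the leave threshold is dropped again.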
Forward Selection

Summary
| Multiple R | R-Square | Adjusted R-Square | StErr of Estimate |
| 0.6999 | 0.4899 | 0.4456 | 35.3873 |

ANOVA Table
| | Degrees of Freedom | Sum of Squares | Mean of Squares | F-Ratio | p-Value |
| Explained | 2 | 27662.5429 | 13831.2714 | 11.0450 | 0.0004 |
| Unexplained | 23 | 28802.0725 | 1252.2640 | | |

Regression Table
| | Coefficient | Standard Error | t-Value | p-Value | Lower | Upper |
| Constant | -330.6748 | 116.4802 | -2.8389 | 0.0093 | -571.6325 | -89.7171 |
| Total Staff | 1.7649 | 0.3790 | 4.6562 | 0.0001 | 0.9808 | 2.5490 |
| Remote | -0.1390 | 0.0588 | -2.3635 | 0.0269 | -0.2606 | -0.0173 |

Step Information
| | Multiple R | R-Square | Adjusted R-Square | StErr of Estimate | Entry Number |
| Total Staff | 0.6050 | 0.3660 | 0.3396 | 38.6206 | 1 |
| Remote | 0.6999 | 0.4899 | 0.4456 | 35.3873 | 2 |
Stepwise Regression

Summary
| Multiple R | R-Square | Adjusted R-Square | StErr of Estimate |
| 0.6999 | 0.4899 | 0.4456 | 35.3873 |

ANOVA Table
| | Degrees of Freedom | Sum of Squares | Mean of Squares | F-Ratio | p-Value |
| Explained | 2 | 27662.5429 | 13831.2714 | 11.0450 | 0.0004 |
| Unexplained | 23 | 28802.0725 | 1252.2640 | | |

Regression Table
| | Coefficient | Standard Error | t-Value | p-Value | Lower | Upper |
| Constant | -330.6748 | 116.4802 | -2.8389 | 0.0093 | -571.6325 | -89.7171 |
| Total Staff | 1.7649 | 0.3790 | 4.6562 | 0.0001 | 0.9808 | 2.5490 |
| Remote | -0.1390 | 0.0588 | -2.3635 | 0.0269 | -0.2606 | -0.0173 |

Step Information
| | Multiple R | R-Square | Adjusted R-Square | StErr of Estimate | Enter or Exit |
| Total Staff | 0.6050 | 0.3660 | 0.3396 | 38.6206 | Enter |
| Remote | 0.6999 | 0.4899 | 0.4456 | 35.3873 | Enter |
Backward Elimination

Summary
| Multiple R | R-Square | Adjusted R-Square | StErr of Estimate |
| 0.7894 | 0.6231 | 0.5513 | 31.8350 |

ANOVA Table
| | Degrees of Freedom | Sum of Squares | Mean of Squares | F-Ratio | p-Value |
| Explained | 4 | 35181.7937 | 8795.4484 | 8.6786 | 0.0003 |
| Unexplained | 21 | 21282.8217 | 1013.4677 | | |

Regression Table
| | Coefficient | Standard Error | t-Value | p-Value | Lower | Upper |
| Constant | -330.8318 | 110.8954 | -2.9833 | 0.0071 | -561.4514 | -100.2123 |
| Total Staff | 1.2456 | 0.4121 | 3.0229 | 0.0065 | 0.3887 | 2.1026 |
| Remote | -0.1184 | 0.0543 | -2.1798 | 0.0408 | -0.2314 | -0.0054 |
| Dubner | -0.2971 | 0.1179 | -2.5189 | 0.0199 | -0.5423 | -0.0518 |
| Total Labor | 0.1305 | 0.0593 | 2.2004 | 0.0391 | 0.0072 | 0.2539 |

Step Information
| | Multiple R | R-Square | Adjusted R-Square | StErr of Estimate | Exit Number |
| All Variables | 0.7894 | 0.6231 | 0.5513 | 31.8350 | (none removed) |
(c) Which of the two models suggested by the above procedures would you choose based on the
Cp selection criterion?
The model suggested by the forward selection and stepwise regression procedures includes
two explanatory variables: total staff (x_1) and remote hours (x_2). The backward elimination
procedure suggests the full model with all explanatory variables included.
For the model suggested by the forward selection and stepwise regression procedures,
n = 26, k = 2
SSE(k) = 28802.0725
MSE(full) = 1013.4677
C_p = 28802.0725/1013.4677 − (26 − 4 − 2) = 28.4193 − 20 = 8.4193
For the model suggested by the backward elimination procedure,
n = 26, k = 4
SSE(k) = SSE(full) = 21282.8217
MSE(full) = 1013.4677
C_p = 21282.8217/1013.4677 − (26 − 8 − 2) = 21 − 16 = 5
The model chosen by the forward selection and stepwise regression procedures has a Cp value of
8.4193, which is substantially above the suggested criterion of k + 1 = 2 + 1 = 3 for that model.
For the model chosen by the backward elimination procedure, k + 1 = 4 + 1 = 5 and Cp = 5.
Thus, according to the Cp criterion, the model including all four variables is the better model.
(d) Below are the results from the best subsets regression procedure of all possible regression
models for the standby hours data. Which is the best model?
(X1 = total staff, X2 = remote hours, X3 = Dubner hours, X4 = total labor hours)

| Model | k + 1 | Cp | r² | r²_adj | se |
| X1 | 2 | 13.32 | 0.3660 | 0.3396 | 38.62 |
| X2 | 2 | 33.21 | 0.0091 | -0.0322 | 48.28 |
| X3 | 2 | 30.39 | 0.0597 | 0.0205 | 47.03 |
| X4 | 2 | 24.18 | 0.1710 | 0.1365 | 44.16 |
| X1X2 | 3 | 8.42 | 0.4899 | 0.4456 | 35.39 |
| X1X3 | 3 | 10.65 | 0.4499 | 0.4021 | 36.75 |
| X1X4 | 3 | 14.80 | 0.3754 | 0.3211 | 39.16 |
| X2X3 | 3 | 32.31 | 0.0612 | -0.0205 | 48.01 |
| X2X4 | 3 | 23.25 | 0.2238 | 0.1563 | 43.65 |
| X3X4 | 3 | 11.82 | 0.4288 | 0.3791 | 37.45 |
| X1X2X3 | 4 | 7.84 | 0.5362 | 0.4729 | 34.50 |
| X1X2X4 | 4 | 9.34 | 0.5092 | 0.4423 | 35.49 |
| X1X3X4 | 4 | 7.75 | 0.5378 | 0.4748 | 34.44 |
| X2X3X4 | 4 | 12.14 | 0.4591 | 0.3853 | 37.26 |
| X1X2X3X4 | 5 | 5.00 | 0.6231 | 0.5513 | 31.84 |
Because model building requires you to compare models with different numbers of explanatory
variables, the adjusted coefficient of determination r²_adj is more appropriate than r²
(although sometimes it is a matter of preference). The adjusted r² reaches a maximum value
of 0.5513 when all four explanatory variables are included in the model. Therefore, using this
criterion the best model is the model with all four explanatory variables.
The same conclusion is reached when using the Cp selection criterion because only the model
with all four explanatory variables considered has a Cp value close to or below k + 1.
Note: Although it was not the case here, the Cp statistic often provides several alternative
models for you to evaluate in greater depth. Moreover, the best model or models using the Cp
criterion might differ from the model selected using the adjusted r² and/or the models selected
using the three procedures discussed in (a) through (c).
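Best subsets regression is brute-force enumeration. A compact sketch (ours), with `score` a caller-supplied criterion such as adjusted r² (or negated Cp, so that higher is better):

```python
from itertools import combinations

def best_subsets(variables, score):
    """Enumerate every non-empty subset of explanatory variables and
    rank them by the caller-supplied score (higher is better)."""
    subsets = [s
               for r in range(1, len(variables) + 1)
               for s in combinations(variables, r)]
    return sorted(subsets, key=score, reverse=True)
```

With four candidate variables this evaluates 2^4 − 1 = 15 models, matching the fifteen rows of the table above; the count doubles with each extra variable, which is why the procedure is practical only for modest numbers of candidates.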
(e) Perform a residual analysis to evaluate the regression assumptions for the best model.
The best model turned out to be the model containing all four explanatory variables.
On the next page are the plots for the residual analysis of this model.
None of the residual plots versus total staff, remote hours, Dubner hours, and total labor hours
reveals apparent patterns. In addition, a plot of the residuals versus the predicted values of y
does not show any patterns or evidence of unequal variance.
The histogram of the residuals indicates only moderate departure from normality (skewness = 0.54).
The plot of the residuals versus time shows no indication of autocorrelation in the residuals.
[Figures: plots of the residuals versus each explanatory variable (Total Staff, Remote, Dubner,
Total Labor), residuals versus the predicted standby hours (Residuals vs Fit), a histogram of the
residuals, and a plot of the residuals versus week.]