
BUILDING REGRESSION MODELS PART 1

Topics Outline
Partial F Test
Adjusted r²
Cp Statistic
Include/Exclude Decisions
Variable Selection Procedures
Partial F Test
There are many situations where a set of explanatory variables form a logical group. It is then
common to include all of the variables in the equation or exclude all of them. An example of this
is when one of the explanatory variables is categorical with more than two categories.
In this case you model it by including one fewer dummy variable than the number of categories.
If you decide that the categorical variable is worth including, you might want to keep all of the
dummies. Otherwise, you might decide to exclude all of them.
Consider the following general situation. Suppose you have already estimated a reduced
multiple regression model that includes the variables x1 through x j :
y = α + β1x1 + ... + βjxj + ε

Now you are proposing to estimate a larger model, referred to as the full model, which includes xj+1 through xk
in addition to the variables x1 through xj:
y = α + β1x1 + ... + βjxj + βj+1 xj+1 + ... + βkxk + ε

That is, the full model includes all of the variables from the smaller model, but it also includes
k - j extra variables.
The partial F test is used to determine whether the extra variables provide enough extra
explanatory power as a group to warrant their inclusion in the equation. In other words,
the partial F test tests whether the full model is significantly better than the reduced model.
The null and alternative hypotheses can be stated as follows.
H0: βj+1 = ... = βk = 0

(The extra variables have no effect on y.)

Ha: At least one of βj+1, ..., βk is not zero

(At least one of the extra variables has an effect on y.)

The test statistic is:


F = [(SSE(reduced) - SSE(full)) / (number of extra terms)] / MSE(full)

This test statistic measures how much the sum of squared residuals, SSE, decreases by including
the extra variables in the equation. It must decrease by some amount because the sum of squared
residuals cannot increase when extra variables are added to an equation. But if it does not
decrease sufficiently, the extra variables might not explain enough to justify their inclusion in the
equation, and they should probably be excluded.
If the null hypothesis is true, this test statistic has an F distribution with df1 = k - j and
df2 = n - k - 1 degrees of freedom. If the corresponding P-value is sufficiently small,
you can reject the null hypothesis that the extra variables have no explanatory power.
To perform the partial F test in Excel, run two regressions, one for the reduced model
(with explanatory variables x1 through x j ) and one for the full model (with explanatory
variables x1 through x k ), and use the appropriate values from their ANOVA tables to calculate
the F test statistic. Then use Excel's FDIST function to calculate the corresponding P-value.
Reminder: The ANOVA table for the full equation has the following form.
Source of      Degrees of     Sum of      Mean Squares               F statistic    P-value
Variation      Freedom        Squares     (Variance)

Regression     k              SSR         MSR = SSR / k              F = MSR/MSE    Prob > F
Error          n - k - 1      SSE         MSE = SSE / (n - k - 1)
Total          n - 1          SST
Notes:
1. Many users look only at the r 2 and se values to check whether extra variables are doing a
good job. For example, they might cite that r 2 went from 80% to 90% or that se went
from 500 to 400 as evidence that extra variables provide a significantly better fit.
Although these are important indicators, they are not the basis for a formal hypothesis test.
The partial F test is the formal test of significance for an extra set of variables.
2. If the partial F test shows that a group of variables is significant, it does not imply that each
variable in this group is significant. Some of these variables can have low t-values
(and consequently, large P-values). Some analysts favor excluding the individual variables
that aren't significant, whereas others favor keeping the whole group or excluding the whole
group. Either approach is valid. Fortunately, the final model building results are often nearly
the same either way.
3. StatTools performs partial F tests as part of the procedure for building regression models
when the option Block is chosen in the Regression Type dropdown list.


Example 1
Heating Oil Consumption
A real estate developer wants to predict the heating oil consumption in single-family houses
based on the effect of atmospheric temperature and the amount of attic insulation.
Data are collected from a sample of 15 single-family houses. Of the 15 houses selected, houses
1, 4, 6, 7, 8, 10, and 12 are ranch-style houses. The data are organized and stored in Heating_Oil.xlsx.
House    Gallons    Temperature (F)    Insulation (inch)    Style
1        275.3      40                 3                    1
2        363.8      27                 3                    0
⋮        ⋮          ⋮                  ⋮                    ⋮
14       323        38                 3                    0
15       52.5       58                 10                   0

(a) Develop and analyze an appropriate regression model.


The explanatory variables considered are
x1 = atmospheric temperature
x2 = the amount of attic insulation
x3 = dummy variable (1 if the style is ranch, 0 otherwise)
Assuming that the slope between heating oil consumption and atmospheric temperature x1,
and between heating oil consumption and the amount of attic insulation x2, is the same for
both styles of houses, the regression model is
y = α + β1x1 + β2x2 + β3x3 + ε
The regression results for this model are:
Regression Statistics
Multiple R            0.9942
R Square              0.9884
Adjusted R Square     0.9853
Standard Error        15.7489
Observations          15

ANOVA
              df    SS             MS            F           Significance F
Regression    3     233406.9094    77802.3031    313.6822    0.0000
Residual      11    2728.3200      248.0291
Total         14    236135.2293

              Coefficients    Standard Error    t Stat      P-value    Lower 95%    Upper 95%
Intercept     592.5401        14.3370           41.3295     0.0000     560.9846     624.0956
Temperature   -5.5251         0.2044            -27.0267    0.0000     -5.9751      -5.0752
Insulation    -21.3761        1.4480            -14.7623    0.0000     -24.5632     -18.1891
Style         -38.9727        8.3584            -4.6627     0.0007     -57.3695     -20.5759

(b) Interpret the regression coefficients.


The regression equation is
y = 592.5401 - 5.5251x1 - 21.3761x2 - 38.9727x3
Predicted Consumption = 592.5401 - 5.5251 Temperature - 21.3761 Insulation - 38.9727 Style
For houses that are ranch style, because x3 = 1, the regression equation reduces to
y = 553.5674 - 5.5251x1 - 21.3761x2
For houses that are not ranch style, because x3 = 0, the regression equation reduces to
y = 592.5401 - 5.5251x1 - 21.3761x2


The regression coefficients are interpreted as follows:
b1 = -5.5251: Holding constant the attic insulation and the house style, for each additional
1°F increase in atmospheric temperature, you estimate that the predicted
heating oil consumption decreases by 5.5251 gallons.
b2 = -21.3761: Holding constant the atmospheric temperature and the house style, for each
additional 1-inch increase in attic insulation, you estimate that the predicted
heating oil consumption decreases by 21.3761 gallons.
b3 = -38.9727: b3 measures the effect on oil consumption of having a ranch-style house (x3 = 1)
compared with having a house that is not ranch style (x3 = 0). Thus, with
atmospheric temperature and attic insulation held constant, you estimate that the
predicted heating oil consumption is 38.9727 gallons less for a ranch-style house
than for a house that is not ranch style.
(c) Does each of the three variables make a significant contribution to the regression model?
The three t-test statistics for the slopes of temperature, insulation, and ranch style are
-27.0267, -14.7623, and -4.6627, respectively. Each of the corresponding P-values is extremely small
(less than 0.001). Thus, each of the three variables makes a significant contribution to the model.
In addition, the coefficient of determination indicates that 98.84% of the variation in oil usage
is explained by variation in temperature, insulation, and whether the house is ranch style.
(d) Determine whether adding the interaction terms makes a significant contribution to the model.
To evaluate possible interactions between the explanatory variables, three interaction terms
are constructed as follows:

x4 = x1 · x2 (interaction between temperature and insulation)
x5 = x1 · x3 (interaction between temperature and style)
x6 = x2 · x3 (interaction between insulation and style)
The regression model is now
y = α + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + β6x6 + ε
The regression results for this model are:
Regression Statistics
Multiple R            0.9966
R Square              0.9931
Adjusted R Square     0.9880
Standard Error        14.2506
Observations          15

ANOVA
              df    SS             MS            F           Significance F
Regression    6     234510.5818    39085.0970    192.4607    0.0000
Residual      8     1624.6475      203.0809
Total         14    236135.2293

                    Coefficients    Standard Error    t Stat     P-value    Lower 95%     Upper 95%
Intercept           642.8867        26.7059           24.0728    0.0000     581.3028      704.4706
Temperature         -6.9263         0.7531            -9.1969    0.0000     -8.6629       -5.1896
Insulation          -27.8825        3.5801            -7.7882    0.0001     -36.1383      -19.6268
Style               -84.6088        29.9956           -2.8207    0.0225     -153.7787     -15.4389
Temp*Insulation     0.1702          0.0886            1.9204     0.0911     -0.0342       0.3746
Temp*Style          0.6596          0.4617            1.4286     0.1910     -0.4051       1.7242
Insulation*Style    4.9870          3.5137            1.4193     0.1936     -3.1156       13.0895

To test whether the three interactions significantly improve the regression model, you use the
partial F test. The null and alternative hypotheses are
H0: β4 = β5 = β6 = 0
(There are no interactions among x1, x2, and x3.)
Ha: At least one of β4, β5, β6 is not zero
(x1 interacts with x2, and/or x1 interacts with x3, and/or x2 interacts with x3.)

From the full regression output (see above),


SSE(full) = 1624.6475
MSE(full) = 203.0809
From the reduced regression output (see part (a)),
SSE(reduced) = 2728.3200
The test statistic is:
F = [(SSE(reduced) - SSE(full)) / (number of extra terms)] / MSE(full)
  = [(2728.3200 - 1624.6475) / 3] / 203.0809
  = 367.8908 / 203.0809
  = 1.8115
df1 = k - j = 6 - 3 = 3
df2 = n - k - 1 = 15 - 6 - 1 = 8
P-value = FDIST(1.8115, 3, 8) = 0.2230

Because of the large P-value, you conclude that the interactions do not make a significant
contribution to the model, given that the model already includes temperature x1 , insulation x 2 ,
and whether the house is ranch style x3 . Therefore, the multiple regression model using x1 , x2 ,
and x3 but no interaction terms is the better model.
If you rejected this null hypothesis, you would then test the contribution of each interaction
separately in order to determine which interaction terms to include in the model.
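The same comparison can also be scripted. The sketch below (again assuming the column names used in part (a)) asks statsmodels to compare the two nested models; for nested fits, anova_lm reports the partial F statistic and its P-value directly.

    # Sketch: partial F test for the interaction terms via a nested-model comparison.
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    oil = pd.read_excel("Heating_Oil.xlsx")  # assumed column names
    reduced = smf.ols("Gallons ~ Temperature + Insulation + Style", data=oil).fit()
    full = smf.ols(
        "Gallons ~ Temperature + Insulation + Style"
        " + Temperature:Insulation + Temperature:Style + Insulation:Style",
        data=oil,
    ).fit()
    print(anova_lm(reduced, full))  # F statistic and Pr(>F) for the extra terms
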
Adjusted r²
Adding new explanatory variables will always keep the r 2 value the same or increase it;
it can never decrease it. In general, adding explanatory variables to the model causes the
prediction errors to become smaller, thus reducing the sum of squares due to error, SSE.
Because SSR = SST - SSE, when SSE becomes smaller, SSR becomes larger, causing
r² = SSR/SST to increase. Therefore, if a variable is added to the model, r² usually becomes larger
even if the variable added is not statistically significant. This can lead to fishing expeditions,
where you keep adding variables to the model, some of which have no conceptual relationship to
the response variable, just to inflate the r 2 value.
To avoid overestimating the impact of adding an explanatory variable on the amount of
variability explained by the estimated regression equation, many analysts prefer adjusting r 2 for
the number of explanatory variables. The adjusted r² is defined as

adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1)

The adjusted r 2 imposes a penalty for each new term that is added to the model in an attempt
to make models of different sizes (numbers of explanatory variables) comparable. It can decrease
when unnecessary explanatory variables are added to the regression model. Therefore, it serves
as an index that you can monitor. If you add variables and the adjusted r 2 decreases, the extra
variables are essentially not pulling their weight and should probably be omitted.
For the full model of the Heating Oil Consumption example (with the interaction terms),
n = 15, k = 6, and r² = 0.9931. Thus, the adjusted r² is:

adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1) = 1 - (1 - 0.9931)(15 - 1)/(15 - 6 - 1) = 1 - (0.0069)(1.75) = 0.9880
The adjusted r 2 for the reduced model (without the interaction terms) is 0.9853.
The adjusted r 2 for the full model indicates too small an improvement in explaining the variation
in the consumption of heating oil to justify keeping the interaction terms in the model even if the
partial F test were significant.
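The adjusted r² formula is simple enough to verify with a couple of lines of code; the sketch below reproduces the calculation above (small rounding differences aside).

    # Sketch: adjusted r-squared for a model with k explanatory variables.
    def adjusted_r2(r2, n, k):
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(0.9931, 15, 6))  # about 0.9879; 0.9880 with the unrounded r2
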
Note: It can happen that the value of the adjusted r² is negative. This is not a mistake, but a result of a
model that fits the data very poorly. In this case, some software systems set the adjusted r² equal to 0.
Excel will print the actual value.


Cp Statistic

Another measure often used in the evaluation of competing regression models is the C p statistic
developed by Mallows. The formula for computing C p is
Cp = SSE(k)/MSE(full) - (n - 2k - 2)

where
SSE (k ) is the error sum of squares for a regression model that has k explanatory variables, k = 1, 2, ...
MSE (full) is the mean square error for a regression model that has all explanatory variables included.

Theory says that if the value of C p is large, then the mean square error of the fitted values is large,
indicating either a poor fit, substantial bias in the fit, or both. In addition, if the value of C p is
much greater than k + 1, then there is a large bias component in the regression, usually indicating
omission of an important variable. Therefore, when evaluating which regression is best, it is
recommended that regressions with small C p values and those with values near k + 1 be considered.
Although the C p measure is highly recommended as a useful criterion in choosing between
alternate regressions, keep in mind that the bias is measured with respect to the total group of
variables provided by the researcher. This criterion cannot determine when the researcher has
forgotten about some variable not included in the total group.
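A small helper for the Cp statistic, following the formula above, might look like this (SSE(k) comes from the candidate model, MSE(full) from the model with all candidate variables included):

    # Sketch: Mallows' Cp for a candidate model with k explanatory variables.
    def mallows_cp(sse_k, mse_full, n, k):
        return sse_k / mse_full - (n - 2 * k - 2)
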
Include/Exclude Decisions
Finding the best xs (or the best form of the xs) to include in a regression model is undoubtedly
the most difficult part of any real regression analysis problem. You are always trying to get the
best fit possible. The principle of parsimony suggests using the fewest explanatory variables
that can adequately predict the response variable. Regression models with fewer
explanatory variables are easier to interpret and are less likely to be affected by interaction or
collinearity problems. On the other hand, more variables certainly increase r 2 , and they usually
reduce the standard error of estimate se. This presents a trade-off which is the heart of the
challenge of selecting a good model.
The best regression models, in addition to satisfying the conditions of multiple regression, have:

- Relatively few explanatory variables.
- Relatively high r² and adjusted r², indicating that much of the variability in y is accounted for by
  the regression model.
- A small value of Cp (close to or less than k + 1).
- A relatively small value of se, the standard deviation of the residuals, indicating that the
  magnitude of the errors is small.
- Relatively small P-values for the F- and t-statistics, showing that the overall model is better than
  a simple summary with the mean and that the individual parameters are reliably different from zero.

Here are several guidelines for including and excluding variables. These guidelines are not
ironclad rules. They typically involve choices at the margin, that is, between equations that are
very similar and seem equally useful.
Guidelines for Including/Excluding Variables in a Regression Model
1. Look at a variable's t-value and its associated P-value. If the P-value is above some accepted
significance level, such as 0.05, this variable is a candidate for exclusion.
2. It is a mathematical fact that:
If a variable's t-value is less than 1 in absolute value, then se will decrease and adjusted r² will
increase if this variable is excluded from the equation.
If its t-value is greater than 1 in absolute value, the opposite will occur.
Because of this, some statisticians advocate excluding variables with t-values less than 1 and
including variables with t-values greater than 1. However, analysts who base the decision on
statistical significance at the usual 5% level, as in guideline 1, typically exclude a variable
from the equation unless its t-value is at least 2 (approximately). This latter approach is more
stringent (fewer variables will be retained), but it is probably the more popular approach.
3. When there is a group of variables that are in some sense logically related, it is sometimes a
good idea to include all of them or exclude all of them. In this case, their individual t-values
are less relevant. Instead, a partial F test can be used to make the include/exclude decision.
4. Use economic, theoretical or practical considerations to decide whether to include or exclude
variables. Some variables might really belong in an equation because of their theoretical
relationship with the response variable, and their low t-values, possibly the result of an
unlucky sample, should not necessarily disqualify them from being in the equation.
Similarly, a variable that has no economic or physical relationship with the response variable
might have a significant t-value just by chance. This does not necessarily mean that it should
be included in the equation.
You should not agonize too much about whether to include or exclude a variable at the margin.
If you decide to exclude a variable that doesn't add much explanatory power, you get a somewhat
cleaner model, and you probably won't see any dramatic shifts in Cp, r², adjusted r², or se.
On the other hand, if you decide to keep such a variable in the model, the model is less parsimonious
and you have one more variable to interpret, but otherwise, there is no real penalty for including it.
In real applications there are often several equations that, for all practical purposes, are equally
useful for describing the relationships or making predictions. There are so many aspects of what
makes a model useful that human judgment is necessary to make a final choice. For example,
in addition to favoring explanatory variables that can be measured reliably, you may want to
favor those that are less expensive to measure. The statistician George Box, who had an
illustrious academic career at the University of Wisconsin, is often quoted as saying,
"All models are wrong, but some models are useful."


Variable Selection Procedures


Model building is the process of developing an estimated regression equation that describes the
relationship between a response variable and one or more explanatory variables.
The major issues in model building are finding the proper functional form of the relationship and
selecting the explanatory variables to be included in the model.
Many statistical packages provide some assistance by including automatic model-building options.
These options estimate a series of regression models by successively adding or deleting variables
according to prescribed rules. These rules can vary from package to package but usually the t test
for the slope or the partial F test is used and the corresponding P-value serves as a criterion to
determine whether variables are added or deleted. The significance levels α1 and α2 for
determining whether an explanatory variable should be entered into the model or removed from
the model are typically referred to as P-value to Enter and P-value to Leave.
Usually, by default P-value to Enter = 0.05 and P-value to Leave = 0.10.
The four most common types of model-building procedures that statistical packages implement are:
forward selection, backward elimination, stepwise regression, and best subsets regression.
Today, many businesses use these variable selection procedures as part of the research technique
called data mining, which tries to identify significant statistical relationships in very large data
sets that contain an extremely large number of variables.
The forward selection procedure begins with no explanatory variables in the model and successively
adds variables one at a time until no remaining variables make a significant contribution.
The forward selection procedure does not permit a variable to be removed from the model once it
has been entered. The procedure stops if the P-value for each of the explanatory variables not in
the model is greater than the prescribed P-value to Enter.
The backward elimination procedure begins with a model that includes all potential
explanatory variables. It then deletes one explanatory variable at a time by comparing its P-value
to the prescribed P-value to Leave. The backward elimination procedure does not permit a
variable to be reentered once it has been removed. The procedure stops when none of the
explanatory variables in the model have a P-value greater than P-value to Leave.
The stepwise regression procedure is much like a forward procedure, except that it also considers
possible deletions along the way. Because of the nature of the stepwise regression procedure,
an explanatory variable can enter the model at one step, be removed at a subsequent step,
and then enter the model at a later step. The procedure stops when no explanatory variables can
be removed from or entered into the model.
The best subsets regression procedure works by trying possible subsets from the list of possible
explanatory variables. This procedure does not actually compute all possible regressions.
There are ways to exclude models known to be worse than some already examined models.
Typical computer output reports results for a collection of best models, usually the two best
one-variable models, the two best two-variable models, the two best three-variable models, and so on.
The user can then select the best model based on such measures as Cp, r², adjusted r², and se.
-9-

In most cases the final results of these four procedures are very similar. However, there is no guarantee
that they will all produce exactly the same final equation. Deciding which estimated regression
equation to use remains a topic for discussion. Ultimately, the analyst's judgment must be applied.
Excel does not come with any variable selection techniques built in. StatTools can be used for
forward selection, backward elimination and stepwise regression, but it cannot perform the best
subsets regression. SAS and Minitab can perform all four techniques.
Example 2
Standby Hours
The operations manager at WTT-TV station is looking for ways to reduce labor expenses.
Currently, the graphic artists at the station receive hourly pay for a significant number of hours
during which they are idle. These hours are called standby hours. The operations manager wants to
determine which factors most heavily affect standby hours of graphic artists. Over a period of 26
weeks, he collected data concerning standby hours (y) and four factors that he suspects are related
to the excessive number of standby hours the station is currently experiencing:

x1 = the total number of staff present
x2 = remote hours
x3 = Dubner hours
x4 = total labor hours
The data are organized and stored in Standby.xlsx.
Week    Standby    Total Staff    Remote    Dubner    Total Labor
1       245        338            414       323       2001
2       177        333            598       340       2030
⋮       ⋮          ⋮              ⋮         ⋮         ⋮
25      261        315            164       223       1839
26      232        331            270       272       1935

How can you build a multiple regression model with the most appropriate mix of explanatory variables?
Solution:
(a) Compute the variance inflation factors to measure the amount of collinearity among the
explanatory variables. (Reminder: VIFj = 1 / (1 - rj²), where rj² is the R² from regressing xj
on the other explanatory variables.)
This is always a good starting point for any multiple regression analysis. It involves running
four regressions, one for each explanatory variable against the other x variables.
The following table summarizes the results.

                       Total Staff        Remote             Dubner             Total Labor
                       and all other X    and all other X    and all other X    and all other X
Multiple R             0.6437             0.4349             0.5610             0.7070
R Square               0.4143             0.1891             0.3147             0.4998
Adjusted R Square      0.3345             0.0786             0.2213             0.4316
Standard Error         16.4715            124.9392           57.5525            114.4118
Observations           26                 26                 26                 26
VIF                    1.7074             1.2333             1.4592             1.9993

All the VIF values are relatively small, ranging from a high of 1.9993 for the total labor hours to
a low of 1.2333 for remote hours. Thus, on the basis of the criteria that all VIF values should be
less than 5, there is little evidence of collinearity among the set of explanatory variables.
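The auxiliary-regression route to these VIFs can be scripted as below; the column names are assumed to match the data table shown earlier, so adjust them if Standby.xlsx uses different headings.

    # Sketch: VIF for each predictor via a regression on the other predictors.
    import pandas as pd
    import statsmodels.api as sm

    standby = pd.read_excel("Standby.xlsx")  # assumed column names
    predictors = ["Total Staff", "Remote", "Dubner", "Total Labor"]

    for var in predictors:
        others = [p for p in predictors if p != var]
        aux = sm.OLS(standby[var], sm.add_constant(standby[others])).fit()
        print(var, "VIF =", round(1 / (1 - aux.rsquared), 4))
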
(b) Run forward selection, backward elimination, and stepwise regression and compare the results.
StatTools regression output from running the three procedures is shown below.
A significance level of 0.05 is used to enter a variable into the model or to delete a variable
from the model (that is, P-value to Enter = P-value to Leave = 0.05).
The correlations between the response variable and the explanatory variables are:

           Total Staff    Remote    Dubner    Total Labor
Standby    0.6050         0.0953    0.2443    0.4136

As the computer output shows, the forward selection and stepwise regression methods
produce the same results for these data. The first variable entered into the model is total staff,
the variable that correlates most highly with the response variable standby hours
(r = 0.6050). The P-value for the t-test of total staff is 0.0011 (Note: StatTools does not show it
in the final output.) Because it is less than 0.05, total staff is included in the regression model.
The next step involves selecting a second independent variable for the model. The second variable
chosen is one that makes the largest contribution to the model, given that the first variable has
been selected. For this model, the second variable is remote hours. Because the P-value of 0.0269
for remote hours is less than 0.05, remote hours is included in the regression model.
After the remote hours variable is entered into the model, the stepwise regression procedure
determines whether total staff is still an important contributing variable or whether it can be
eliminated from the model. Because the P-value of 0.0001 for total staff is less than 0.05,
total staff remains in the regression model.
The next step involves selecting a third independent variable for the model. Because none of
the other variables meets the 0.05 criterion for entry into the model, the stepwise procedure
terminates with a model that includes total staff present and the number of remote hours.
The backward elimination procedure produces a model that includes all explanatory variables.

Forward Selection

Summary             Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                    0.6999        0.4899      0.4456               35.3873

ANOVA Table         Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained           2                     27662.5429        13831.2714         11.0450    0.0004
Unexplained         23                    28802.0725        1252.2640

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%    Upper 95%
Constant            -330.6748      116.4802          -2.8389    0.0093     -571.6325    -89.7171
Total Staff         1.7649         0.3790            4.6562     0.0001     0.9808       2.5490
Remote              -0.1390        0.0588            -2.3635    0.0269     -0.2606      -0.0173

Step Information    Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Entry Number
Total Staff         0.6050        0.3660      0.3396               38.6206              1
Remote              0.6999        0.4899      0.4456               35.3873              2

Stepwise Regression

Summary             Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                    0.6999        0.4899      0.4456               35.3873

ANOVA Table         Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained           2                     27662.5429        13831.2714         11.0450    0.0004
Unexplained         23                    28802.0725        1252.2640

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%    Upper 95%
Constant            -330.6748      116.4802          -2.8389    0.0093     -571.6325    -89.7171
Total Staff         1.7649         0.3790            4.6562     0.0001     0.9808       2.5490
Remote              -0.1390        0.0588            -2.3635    0.0269     -0.2606      -0.0173

Step Information    Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Enter or Exit
Total Staff         0.6050        0.3660      0.3396               38.6206              Enter
Remote              0.6999        0.4899      0.4456               35.3873              Enter

Backward Elimination

Summary             Multiple R    R-Square    Adjusted R-Square    StErr of Estimate
                    0.7894        0.6231      0.5513               31.8350

ANOVA Table         Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained           4                     35181.7937        8795.4484          8.6786     0.0003
Unexplained         21                    21282.8217        1013.4677

Regression Table    Coefficient    Standard Error    t-Value    p-Value    Lower 95%    Upper 95%
Constant            -330.8318      110.8954          -2.9833    0.0071     -561.4514    -100.2123
Total Staff         1.2456         0.4121            3.0229     0.0065     0.3887       2.1026
Remote              -0.1184        0.0543            -2.1798    0.0408     -0.2314      -0.0054
Dubner              -0.2971        0.1179            -2.5189    0.0199     -0.5423      -0.0518
Total Labor         0.1305         0.0593            2.2004     0.0391     0.0072       0.2539

Step Information    Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Exit Number
All Variables       0.7894        0.6231      0.5513               31.8350

(c) Which of the two models suggested by the above procedures would you choose based on the
C p selection criterion?
The model suggested by the forward selection and stepwise regression procedures includes
two explanatory variables: total staff ( x1 ) and remote hours ( x 2 ). The backward elimination
procedure suggests the full model with all explanatory variables included.
For the model suggested by the forward selection and stepwise regression procedures,
n = 26, k = 2
SSE (k ) = 28802.0725
MSE (full) = 1013.4677
Cp = SSE(k)/MSE(full) - (n - 2k - 2) = 28802.0725/1013.4677 - (26 - 4 - 2) = 28.4193 - 20 = 8.4193
For the model suggested by the backward elimination procedure,
n = 26, k = 4
SSE (k ) = SSE (full) = 21282.8217
MSE (full) = 1013.4677
Cp = SSE(k)/MSE(full) - (n - 2k - 2) = 21282.8217/1013.4677 - (26 - 8 - 2) = 21 - 16 = 5
The model chosen by the forward selection and stepwise regression procedures has a C p value of
8.4193, which is substantially above the suggested criterion of k + 1 = 2 + 1 = 3 for that model.
For the model chosen by the backward elimination procedure, k + 1 = 4 + 1 = 5 and C p = 5.
Thus, according to the C p criterion, the model including all four variables is the better model.

(d) Below are the results from the best subsets regression procedure of all possible regression
models for the standby hours data. Which is the best model?
Model          k+1    Cp       r²        adjusted r²    se
X1             2      13.32    0.3660    0.3396         38.62
X2             2      33.21    0.0091    -0.0322        48.28
X3             2      30.39    0.0597    0.0205         47.03
X4             2      24.18    0.1710    0.1365         44.16
X1X2           3      8.42     0.4899    0.4456         35.39
X1X3           3      10.65    0.4499    0.4021         36.75
X1X4           3      14.80    0.3754    0.3211         39.16
X2X3           3      32.31    0.0612    -0.0205        48.01
X2X4           3      23.25    0.2238    0.1563         43.65
X3X4           3      11.82    0.4288    0.3791         37.45
X1X2X3         4      7.84     0.5362    0.4729         34.50
X1X2X4         4      9.34     0.5092    0.4423         35.49
X1X3X4         4      7.75     0.5378    0.4748         34.44
X2X3X4         4      12.14    0.4591    0.3853         37.26
X1X2X3X4       5      5.00     0.6231    0.5513         31.84

Because model building requires you to compare models with different numbers of explanatory
variables, the adjusted coefficient of determination (adjusted r²) is more appropriate than r²
(although sometimes it is a matter of preference). The adjusted r² reaches a maximum value
of 0.5513 when all four explanatory variables are included in the model. Therefore, using this
criterion, the best model is the model with all four explanatory variables.
The same conclusion is reached when using the C p selection criterion because only the model
with all four explanatory variables considered has a C p value close to or below k + 1.
Note: Although it was not the case here, the C p statistic often provides several alternative
models for you to evaluate in greater depth. Moreover, the best model or models using the C p
criterion might differ from the model selected using the adjusted r 2 and/or the models selected
using the three procedures discussed in (a) through (c).
(e) Perform a residual analysis to evaluate the regression assumptions for the best model.
The best model turned out to be the model containing all four explanatory variables.
Below are the plots for the residual analysis of this model.
None of the residual plots versus the total staff, the remote hours, the Dubner hours, and the
total labor hours reveal apparent patterns. In addition, a plot of the residuals versus the
predicted values of y does not show any patterns or evidence of unequal variance.
The histogram of the residuals indicates only moderate departure from normality (skewness = 0.54).
The plot of the residuals versus time shows no indication of autocorrelation in the residuals.
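A sketch of producing these diagnostic plots in Python is shown below, assuming matplotlib is available and the column names match the data table for Standby.xlsx.

    # Sketch: residual diagnostics for the four-variable standby-hours model.
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    standby = pd.read_excel("Standby.xlsx")  # assumed column names
    predictors = ["Total Staff", "Remote", "Dubner", "Total Labor"]
    fit = sm.OLS(standby["Standby"], sm.add_constant(standby[predictors])).fit()
    resid = fit.resid

    # Residuals versus each explanatory variable
    for col in predictors:
        plt.figure()
        plt.scatter(standby[col], resid)
        plt.xlabel(col)
        plt.ylabel("Residuals")
        plt.title(col + " Residual Plot")

    # Residuals versus fitted values, histogram, and time-order plot
    plt.figure()
    plt.scatter(fit.fittedvalues, resid)
    plt.xlabel("Predicted Standby")
    plt.ylabel("Residuals")
    plt.title("Residuals vs Fit")

    plt.figure()
    plt.hist(resid, bins=8)
    plt.title("Histogram of Residuals")

    plt.figure()
    plt.plot(range(1, len(resid) + 1), resid, marker="o")
    plt.xlabel("Week")
    plt.ylabel("Residuals")
    plt.title("Time Series Plot of Residuals")

    plt.show()
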
[Residual plots omitted from this text version: Total Staff Residual Plot, Remote Residual Plot, Dubner Residual Plot, Total Labor Residual Plot, Residuals vs Fit (residuals versus Predicted Standby), Histogram of Residuals, and Time Series Plot of Residuals across Weeks 1-26.]