
MULTIPLE REGRESSION

Put all your eggs in one basket and WATCH THAT BASKET.
Mark Twain, Following the Equator

The multiple regression model is a generalization of the simple regression model. Multiple
regression has one response variable (y), but allows for many predictor variables. Multiple
regression accommodates the study of the relationship between many variables.
The groundwork for multiple regression has been laid by our study of simple regression. For
simple regression analysis, we wrote the sample regression line as:

y = b0 + b1X
calling b0 the y-intercept and b1 the slope. The two-variable multiple regression model adds an
additional predictor to the right-hand-side of the model:

y = b0 + b1X1 + b2X2
Here, b1 is the slope for X1 and b2 is the slope for X2. The two-variable model has one y-intercept
and two slopes to be estimated from the data.
The k-variable multiple regression model is written:

y = b0 + b1X1 + b2X2 + . . . + bkXk
We can add as many variables to the right-hand-side as we need (presuming that we have data for
all those X's, of course). The multiple regression model expands our ability to understand the
behavior of y by allowing more predictors to enter the model. The multiple regression model, like
the simple regression model, is typically estimated using the principle of least squares.
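To make the least-squares idea concrete, here is a minimal sketch, not part of the original handbook, of how the coefficients of a k-variable model could be computed; the function name and the use of NumPy are our own illustrative choices.

```python
import numpy as np

def fit_least_squares(X, y):
    """Estimate b0, b1, ..., bk for y = b0 + b1*X1 + ... + bk*Xk by least squares."""
    X = np.asarray(X, dtype=float)                   # n-by-k matrix of predictor values
    y = np.asarray(y, dtype=float)                   # n-vector of responses
    design = np.column_stack([np.ones(len(y)), X])   # prepend a column of ones for the intercept
    coefs, _, _, _ = np.linalg.lstsq(design, y, rcond=None)
    return coefs                                     # [b0, b1, ..., bk]
```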
Many of the statistical quantities calculated from a simple regression model are also
available for multiple regression. The mathematical computations of these quantities are unwieldy
(and probably uninformative at this level). The extension from simple to multiple regression will be
illustrated with an example.

Copyright by Marlene A. Smith, 1997-2000. All rights reserved.

Predicting U.S. West's Stock Price [1]


The task is to construct a regression model of U.S. West's stock, with the ultimate goal of
predicting future values of its stock. The data set, displayed in the Appendix to this section, is a
quarterly time-series running from the first quarter of 1984 (U.S. West's first year in business)
through the last quarter of 1991.
The response variable is stock price (Stock_$) measured in dollars. This variable is the
closing price of U.S. West's stock on the last day of the quarter. The three predictor variables are
taken from the firm's financial statements: Sales is quarterly income measured in millions of
dollars, Liability is the ratio of total liability to total assets (%), and Capital is funds expended for
property, plant, and equipment, also measured in millions of dollars.
The three-variable regression model of U.S. West's stock price is shown in Exhibit 1. The
regression model containing sales, the ratio of liability to assets, and capital accounts for 86.5% of
the observed variability in U.S. West's stock price over this period. The model estimates that:
increasing sales by $100 million, while holding the other two variables constant, increases stock
price by about $2 (with a typical fluctuation of $0.29); each 1% increase in the ratio of total liability
to total assets (all else constant) increases stock price by about $0.89 (typical fluctuation of $0.30);
and increasing capital expenditures by $1 billion relates to a roughly $7 decrease in stock price
(typically fluctuating $3.84 from sample to sample).
Exhibit 1. Multiple Regression Model of U.S. West Stock Prices

The regression equation is


Stock_$ = - 69.8 + 0.0203 Sales + 0.894 Liability
- 0.00715 Capital
Predictor     Coef        Stdev       t-ratio     p
Constant     -69.83       15.32       -4.56       0.000
Sales          0.020279    0.002858    7.09       0.000
Liability      0.8939      0.3028      2.95       0.006
Capital       -0.007148    0.003838   -1.86       0.073

s = 2.952     R-sq = 86.5%     R-sq(adj) = 85.0%

Actual stock prices typically fluctuate $3 from those that would have been predicted by the
model (s = 2.952 in Exhibit 1). If a 5% significance level is chosen, Sales and Liability are
statistically significant variables; Capital is not.
We might gauge the model's forecasting capability by using the regression equation to
forecast stock prices for 1992. (Recall that we stopped our data set in 1991 for the purpose of
estimating the model.) By looking up company reports for 1992, we know the following.
[1] This data set, and more, was collected by Kevin Aberle and Steve Dreiling.


           Sales     Liability   Capital
1992i      2514.3    64.79       464.5
1992ii     2547.8    64.29       562.3
1992iii    2556.2    64.59       528.1
1992iv     2652.8    70.43       738.2

Using these data, the calculation to predict stock price for the first quarter of 1992 would be as
follows:
y_pred = -69.83 + 0.020279 Sales + 0.8939 Liability - 0.007148 Capital
       = -69.83 + 0.020279 (2514.3) + 0.8939 (64.79) - 0.007148 (464.5)
       = -69.83 + 50.987 + 57.916 - 3.320
       = $35.75

The model estimates that the first quarter stock price will be $35.75. The actual stock price for this
quarter was $34.13, so the model has a forecast error (the difference between the actual and
predicted stock price) of 34.13 - 35.75 = -$1.62; the model overstated the actual stock price by
$1.62 for the first quarter of 1992.
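As a cross-check, the same arithmetic can be scripted. The sketch below, an illustration of ours rather than the handbook's software, plugs the Exhibit 1 coefficients into the fitted equation for each 1992 quarter.

```python
# Coefficients from Exhibit 1, rounded as reported in the printout.
b0, b_sales, b_liab, b_cap = -69.83, 0.020279, 0.8939, -0.007148

# 1992 values of Sales, Liability, and Capital from the table above.
quarters_1992 = {
    "1992i":   (2514.3, 64.79, 464.5),
    "1992ii":  (2547.8, 64.29, 562.3),
    "1992iii": (2556.2, 64.59, 528.1),
    "1992iv":  (2652.8, 70.43, 738.2),
}

for quarter, (sales, liability, capital) in quarters_1992.items():
    pred = b0 + b_sales * sales + b_liab * liability + b_cap * capital
    print(f"{quarter}: predicted Stock_$ = {pred:.2f}")
```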
The forecast errors for all four quarters in 1992 are as follows:
           Actual    Predicted    Error
1992i      34.13     35.75        -$1.62
1992ii     36.50     35.28        +$1.22
1992iii    38.00     35.96        +$2.04
1992iv     38.38     41.64        -$3.26

The usefulness of the model depends on how much error you are willing to accept: the model
overestimates the true stock price for the fourth quarter by about 8%.
The mean square prediction error is a summary measure of the forecast error for all four
quarters. It is:
MSEP = \frac{\sum_{i=1}^{n^{+}} (y_i - y_{i,\mathrm{pred}})^2}{n^{+}}
     = \frac{(-1.62)^2 + (1.22)^2 + (2.04)^2 + (-3.26)^2}{4}
     = 4.73
(In the MSEP formula, n+ represents the number of forecasted data points, not the number of
observations in the original data set.)
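A short script, again only an illustration of ours, reproduces the MSEP from the four forecast errors above:

```python
errors = [-1.62, 1.22, 2.04, -3.26]              # 1992 forecast errors (actual - predicted)
msep = sum(e ** 2 for e in errors) / len(errors)
print(msep)                                      # 4.7255, reported as 4.73 in the text
```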
A computer printout of the predictions from our model for all four quarters in 1992 is
shown in Exhibit 2, which includes 95% confidence and prediction intervals.

Exhibit 2. 1992 Predictions of Stock_$ for the Three-Variable Regression Model

Fit       Stdev.Fit   95% C.I.             95% P.I.
35.746    0.955       (33.788, 37.703)     (29.388, 42.103)
35.279    0.873       (33.491, 37.067)     (28.972, 41.586)
35.962    0.912       (34.092, 37.831)     (29.631, 42.293)
41.639    1.558       (38.448, 44.831)     (34.800, 48.478)
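The intervals in Exhibit 2 can be reproduced approximately from the reported Fit, Stdev.Fit, and s values. The sketch below is our own reconstruction, assuming the usual t-based formulas and n - k - 1 = 28 error degrees of freedom; it uses SciPy, which is not part of the original tutorial.

```python
from math import sqrt
from scipy.stats import t

fit, stdev_fit = 35.746, 0.955       # Fit and Stdev.Fit for the first 1992 quarter (Exhibit 2)
s, df = 2.952, 28                    # residual standard error and error degrees of freedom
t_crit = t.ppf(0.975, df)            # two-sided 95% critical value, about 2.048

ci = (fit - t_crit * stdev_fit, fit + t_crit * stdev_fit)    # 95% C.I. for the mean response
half_pi = t_crit * sqrt(s ** 2 + stdev_fit ** 2)             # prediction interval adds s^2
pi = (fit - half_pi, fit + half_pi)                          # 95% P.I. for an individual price
print(ci, pi)                        # roughly (33.79, 37.70) and (29.39, 42.10)
```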

As in the simple regression model, it is wise to examine the residuals of the model.
(Residuals are still calculated as the difference between the actual and fitted values for the
multiple regression model.) Because this is a time-series data set, we usually begin with a
time-series plot of the residuals; see Exhibit 3. The Durbin-Watson statistic for this model is 1.07,
indicating the presence of positive first-order autocorrelation.
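For reference, the Durbin-Watson statistic can be computed directly from the residuals (for example, the Residuals column of Exhibit 4). The function below is a minimal sketch of ours, not the handbook's code.

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den                 # values well below 2 suggest positive autocorrelation
```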

Exhibit 3. Time-series Residual Plot of the Three-Variable Multiple Regression Model

[Time-series plot of the residuals, with Residuals on the vertical axis and Time (1984.i - 1991.iv) on the horizontal axis]


Exhibit 4. Residuals and Fitted Values for the Three-Variable Multiple Regression Model

Year.Qtr   Standardized   Fitted      Residuals
           Residuals      Values
84.1       -1.36738       18.1194     -3.67936
84.2       -1.47514       18.5533     -4.11335
84.3       -1.17708       19.0542     -3.30421
84.4       -0.42424       18.8100     -1.18001
85.1       -0.68016       20.6453     -1.92532
85.2       -0.06519       20.4350     -0.18500
85.3       -1.00971       21.2763     -2.86634
85.4        0.32502       21.3353      0.91467
86.1        0.71057       23.0460      2.04403
86.2        1.53993       23.1470      4.41304
86.3        0.70409       24.6196      2.01042
86.4        1.13702       23.7307      3.26930
87.1        1.25852       23.9565      3.54352
87.2        0.75402       24.1687      2.14129
87.3        1.20013       26.0812      3.35883
87.4       -1.68984       30.4434     -4.88338
88.1       -0.23391       26.8409     -0.65086
88.2        0.41685       27.0471      1.14294
88.3       -0.66783       31.1139     -1.80392
88.4        0.45628       27.9790      0.90102
89.1       -1.04844       34.0993     -2.90931
89.2       -0.62613       36.1906     -1.69062
89.3        0.87603       33.3552      2.39476
89.4        1.11265       37.0526      3.00740
90.1        0.12984       36.5382      0.34177
90.2        0.73042       33.9272      2.07284
90.3       -0.42342       35.2065     -1.20654
90.4        0.73887       36.6032      2.02675
91.1        1.50789       34.8517      4.27830
91.2        0.06050       35.4575      0.17247
91.3       -0.63201       36.8930     -1.76299
91.4       -2.51733       43.7521     -5.87214

In simple regression, we also typically plot the residuals against the values of the predictor
variable to check for the presence of nonlinearities or heteroskedasticity. With the multiple
regression model, because there are multiple predictors, the appropriate plot is the scatterplot of the
residuals against the fitted values. We examine this plot for nonlinearities or other signs of
nonrandom behavior. The full listing of the residuals, standardized residuals, and fitted values over
the estimation period is shown in Exhibit 4. From these data, the diagnostic plot of the residuals
against the fitted values is constructed and displayed in Exhibit 5. Exhibit 5 shows a fairly random
pattern of the residuals within a constant band; one anomalous observation, the large fitted value
for the fourth quarter of 1991, may be worth further investigation.


Exhibit 5. Scatterplot of Residuals versus Fitted Values from the
Three-Variable Multiple Regression Model

[Scatterplot with Residuals (about -8 to 8) on the vertical axis and Fitted Values (about 20 to 40) on the horizontal axis]

The Analysis of Variance (ANOVA) Table


The simple regression model examines the relevance of the single predictor variable by testing the
null hypothesis:
H0: β1 = 0
against the alternative hypothesis
H1: β1 ≠ 0
where β1 is the true (unknown) population slope. This test is accomplished by examining the
t-ratio, or the p-value, for the estimated slope b1.
Multiple regression, having k different predictor variables, has k tests of this kind, one for
each of the slope coefficients. For instance, in the U.S. West three-variable regression model
displayed in Exhibit 1, we decided that Sales, with an estimated slope of 0.020279 and t-ratio of
7.09, is a statistically significant variable, as is Liability; we also concluded that Capital is not a
statistically significant factor at the 5% significance level. Two of the three slope coefficients are
statistically different from zero.
This leads to a more general test often applied to the multiple regression model: the test for
overall model adequacy. This test is stated in the form of the null hypothesis:
H0: β1 = β2 = . . . = βk = 0.


This hypothesis states that all slope coefficients in the model equal zero. Because the slopes
represent the strength and importance of the predictor variables, and since the predictor variables
make up the multiple regression model, this hypothesis states that the regression model as a whole
is unsatisfactory.
The alternative hypothesis is stated as:
H1: at least one of the βj's is not zero.
In order to test overall model adequacy, the variability observed in the data for y is broken up into
two components: the variability in y that is associated with the regression model, and the variability
in y that is not associated with the regression model. The higher the portion of variability related to
the regression model, the more likely we are to believe that the model is adequate.
Specifically, the total sum of squares for y is given by
TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2

The total sum of squares makes reference only to the amount of variability around the mean for the
response variable, y. TSS makes no reference to the regression model. The total sum of squares
can be broken into two parts: the regression sum of squares (RSS) and the error sum of squares
(ESS).
RSS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2

ESS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

RSS is the variability in y around its mean when y is measured by its fitted value from the
regression equation, ŷ. RSS is sometimes interpreted to mean the variability in y associated with
the regression model. ESS, on the other hand, is the error sum of squares--the variability in y that is
not explained by the regression model. (ESS is the quantity that is minimized when applying
the least squares principle.) Mathematically, it will be true that TSS = RSS + ESS.
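In code, the decomposition can be checked from the observed and fitted values. This is a minimal sketch with illustrative names; for a least-squares fit that includes an intercept, TSS equals RSS plus ESS up to rounding.

```python
def sums_of_squares(y, y_hat):
    """Return (TSS, RSS, ESS) given observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total variability around the mean
    rss = sum((fi - y_bar) ** 2 for fi in y_hat)            # variability captured by the model
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # leftover (error) variability
    return tss, rss, ess
```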
The analysis of variance table displays the disaggregation of the total sum of squares
into RSS and ESS. The ANOVA table for the U.S. West three-variable regression model is
shown in Exhibit 6. Under the column labeled SOURCE, we see three "sources" of variability:
Regression, Error, and Total. Under SS, standing for sum of squares, we have for this model:
TSS = 1803.19, RSS = 1559.18, and ESS = 244.01. Note that TSS = RSS + ESS, and that RSS
is a larger portion of the total than ESS.


Exhibit 6. ANOVA Table for the Three-Variable Multiple Regression Model

Analysis of Variance

SOURCE          DF         SS         MS         F        p
Regression       3    1559.18     519.73     59.64    0.000
Error           28     244.01       8.71
Total           31    1803.19

The idea behind the ANOVA table, and the test of the hypothesis of overall model
adequacy, is this. Of the total variability we observe in the response variable, y, some portion is
associated with the regression model, and some is not. The larger the ratio of the regression
component relative to the error component, the stronger the statistical evidence in favor of the
regression model, and the more the data indicate deviation from the null hypothesis. The value
labeled F in Exhibit 6 is designed to mimic this idea. When the F-statistic is large, the regression
mean square largely outweighs the error mean square. We reject the null hypothesis of overall
model inadequacy, and conclude that the
model is statistically significant, when the p-value for the F-statistic is less than the pre-chosen
significance level. The ANOVA table in Exhibit 6, with p-value < 0.001, indicates that the three-variable multiple regression model of U.S. West's stock prices is statistically significant. The
overall model is adequate, even though one of the variables (Capital, see Exhibit 1) is individually
insignificant.
Some of this discussion should sound familiar. Indeed, we considered extracting the
regression sum of squares from the total sum of squares when deriving the R2 statistic for the
simple regression model. R2 for the multiple regression model, sometimes called the multiple
coefficient of determination, is calculated as
R² = RSS/TSS = 1 - ESS/TSS

From the ANOVA table in Exhibit 6, R² = 1559.18 / 1803.19 = 86.46%, which appears (rounded to 86.5%) in Exhibit 1.
We now fill in a few of the mathematical details of the ANOVA table. To begin, the
column labeled DF signifies the degrees of freedom for each of the sum of squares. In general, the
degrees of freedom of a statistic is the number of independent observations required to calculate the
statistic. The degrees of freedom for TSS is n - 1, where n is the sample size. We subtract one from
the sample size because we calculate one quantity, the sample mean y, before beginning the
summation. The degrees of freedom for the error sum of squares is n - k - 1, where k is the number
of predictors in the multiple regression model. (k + 1) is subtracted from n in this case because we
need to calculate k + 1 quantities before we can begin the summation: k slopes plus one intercept to
calculate ŷ = b0 + b1X1 + ... + bkXk.


A mean square, labeled MS in Exhibit 6, is the sum of squares per degree of freedom, an
average sum of squares per independent observation. The mean square for the regression sum of
squares is RSS/k, which, from Exhibit 6, is 1559.18/3 ≈ 519.73. The F-statistic is then the ratio of
the regression mean square to the error mean square. From Exhibit 6, F = 519.73/8.71 ≈ 59.64.
Finally, the p-value for the F-statistic is derived from a table of F-distributions using k numerator
and (n-k-1) denominator degrees of freedom.
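The same numbers can be reproduced with a statistics library instead of an F table. The following sketch uses SciPy, an assumption on our part rather than the handbook's software, with k = 3 predictors and n = 32 observations.

```python
from scipy.stats import f

k, n = 3, 32
msr = 1559.18 / k                    # regression mean square, about 519.73
mse = 244.01 / (n - k - 1)           # error mean square, about 8.71
F = msr / mse                        # about 59.6
p_value = f.sf(F, k, n - k - 1)      # upper-tail probability of the F distribution
print(F, p_value)                    # p-value is essentially zero, matching Exhibit 6
```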
Exhibit 7. A Generic ANOVA Table

Analysis of Variance

SOURCE        Degrees of Freedom: DF    Sum of Squares: SS    Mean Square: MS            F
Regression    k                         RSS                   MSR = RSS/k                MSR/MSE
Error         n - k - 1                 ESS                   MSE = ESS/(n - k - 1)
Total         n - 1                     TSS

Summary
Multiple regression is a straightforward extension of simple regression analysis, allowing many
predictors to serve as explanatory factors for the response variable. As in simple regression, we can
test the individual influence of each predictor, prepare forecasts and forecast intervals, calculate the
percentage of variation in the response variable attributable to the model, and prepare diagnostic
residual tests. A further test that is often undertaken with multiple regression is the test that all the
variables jointly influence the response variable. The ANOVA table decomposes the total sum of
squares into the regression and error sum of squares in order to test the hypothesis of overall model
adequacy.
There is a bit more to multiple regression than first meets the eye. This tutorial has set aside
several important questions about multiple regression models. For instance, in the presence of
many potential predictor variables, which ones belong in the model? Do all of them belong, or just
some subset of them? What consequences arise when there are interactions between the various
predictor variables, and how does this influence our choice of a model and its forecasting ability?
Variable Selection addresses these and other questions. Selecting appropriate predictors is
at the core of building multiple regression models.


Further Readings
Graybill, F.A., and H.K. Iyer (1994), Regression Analysis: Concepts and Applications, Duxbury
Press, Belmont, CA.
Mosteller, F., and J.W. Tukey (1977), Data Analysis and Regression, Addison Wesley, Reading,
MA.
Related Topics in the Handbook
Least Squares
Simple Regression
Statistical Models
Variable Selection



Appendix

U.S. West Data Set

     Year.Qtr   Stock_$   Sales    Liability   Capital
 1   84.1       14.44     1735.7   61.64        327.6
 2   84.2       14.44     1807.2   61.00        389.7
 3   84.3       15.75     1860.8   61.04        476.7
 4   84.4       17.63     1875.9   60.94        541.2
 5   85.1       18.72     1894.9   61.25        377.1
 6   85.2       20.25     1949.2   60.70        491.8
 7   85.3       18.41     1969.4   61.55        537.7
 8   85.4       22.25     1999.1   61.37        591.2
 9   86.1       25.09     2029.9   61.34        435.5
10   86.2       27.56     2053.8   60.85        427.9
11   86.3       26.63     2140.2   60.87        469.5
12   86.4       27.00     2084.5   61.18        474.6
13   87.1       27.50     2040.6   61.01        297.2
14   87.2       26.31     2106.8   60.55        397.8
15   87.3       29.44     2225.5   60.53        464.5
16   87.4       25.56     2323.6   64.46        624.0
17   88.1       26.19     2193.5   60.96        321.2
18   88.2       28.19     2250.7   60.34        377.1
19   88.3       29.31     2427.7   61.37        439.1
20   88.4       28.88     2348.7   65.27       1141.3
21   89.1       31.19     2393.6   65.19        402.4
22   89.2       34.50     2434.6   67.18        475.0
23   89.3       35.75     2323.0   66.91        521.3
24   89.4       40.06     2539.4   68.26        786.8
25   90.1       36.88     2426.1   68.02        507.3
26   90.2       36.00     2410.8   66.30        614.1
27   90.3       34.00     2482.6   66.08        611.3
28   90.4       38.63     2637.8   65.84        826.2
29   91.1       39.13     2449.8   66.34        600.4
30   91.2       35.63     2501.2   66.01        620.2
31   91.3       35.13     2616.9   64.72        586.3
32   91.4       37.88     3009.3   65.58        847.5

