
An Example of Multiple Linear Regression

- by Ravindra Gokhale


Reference: Montgomery, D. C., and Runger, G. C., Applied Statistics and Probability
for Engineers, Third Edition, John Wiley and Sons, Inc.

The table below (on the next page) presents data from taste-testing 38 brands of pinot noir wine
(the data were first reported by Kwan, Kowalski, and Skogenboe in the Journal of Agricultural
and Food Chemistry, Vol. 27, 1979, and the set also appears as one of the default data sets in
the Minitab software). The response variable is y (that is, quality), and we wish to find the best
regression equation relating quality to the other five parameters.

Different aspects of multiple linear regression discussed in class are illustrated in this exercise,
with some additional insights.


The data: (number of data points is 38)
x1 x2 x3 x4 x5 y
Clarity Aroma Body Flavor Oakiness Quality
1 3.3 2.8 3.1 4.1 9.8
1 4.4 4.9 3.5 3.9 12.6
1 3.9 5.3 4.8 4.7 11.9
1 3.9 2.6 3.1 3.6 11.1
1 5.6 5.1 5.5 5.1 13.3
1 4.6 4.7 5 4.1 12.8
1 4.8 4.8 4.8 3.3 12.8
1 5.3 4.5 4.3 5.2 12
1 4.3 4.3 3.9 2.9 13.6
1 4.3 3.9 4.7 3.9 13.9
1 5.1 4.3 4.5 3.6 14.4
0.5 3.3 5.4 4.3 3.6 12.3
0.8 5.9 5.7 7 4.1 16.1
0.7 7.7 6.6 6.7 3.7 16.1
1 7.1 4.4 5.8 4.1 15.5
0.9 5.5 5.6 5.6 4.4 15.5
1 6.3 5.4 4.8 4.6 13.8
1 5 5.5 5.5 4.1 13.8
1 4.6 4.1 4.3 3.1 11.3
0.9 3.4 5 3.4 3.4 7.9
0.9 6.4 5.4 6.6 4.8 15.1
1 5.5 5.3 5.3 3.8 13.5
0.7 4.7 4.1 5 3.7 10.8
0.7 4.1 4 4.1 4 9.5
1 6 5.4 5.7 4.7 12.7
1 4.3 4.6 4.7 4.9 11.6
1 3.9 4 5.1 5.1 11.7
1 5.1 4.9 5 5.1 11.9
1 3.9 4.4 5 4.4 10.8
1 4.5 3.7 2.9 3.9 8.5
1 5.2 4.3 5 6 10.7
0.8 4.2 3.8 3 4.7 9.1
1 3.3 3.5 4.3 4.5 12.1
1 6.8 5 6 5.2 14.9
0.8 5 5.7 5.5 4.8 13.5
0.8 3.5 4.7 4.2 3.3 12.2
0.8 4.3 5.5 3.5 5.8 10.3
0.8 5.2 4.8 5.7 3.5 13.2
The figure below is the matrix of scatter plots for the wine quality data, as constructed by SPSS.
We notice some indications of possible linear relationships between quality and the regressors,
but there is no obvious visual impression of which regressors would be appropriate.
(Of course, this is not relevant from the exam point of view, since it was not discussed in class.)
[Figure: matrix of scatter plots for Quality, Clarity, Aroma, Body, Flavour, and Oakiness; not reproduced here.]
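For readers who want to reproduce the plot outside SPSS, here is a minimal sketch in Python (pandas/matplotlib as a stand-in for SPSS). It assumes the table above has been saved as a file named wine.csv with the column names Clarity, Aroma, Body, Flavour, Oakiness, Quality; the file name and column names are illustrative.

```python
# A minimal sketch of reproducing the scatter-plot matrix (the note used SPSS).
# Assumes the data table above is saved as wine.csv with the column names shown.
import pandas as pd
import matplotlib.pyplot as plt

wine = pd.read_csv("wine.csv")
pd.plotting.scatter_matrix(wine, figsize=(10, 10), diagonal="hist")
plt.show()
```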

We first fit a model with all five explanatory variables. Call this MODEL 1. The SPSS
output is as follows:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .849(a)   .721       .677                1.16254

a  Predictors: (Constant), Oakiness, Body, Clarity, Aroma, Flavour

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  111.540           5   22.308        16.506   .000(a)
   Residual     43.248          32    1.352
   Total       154.788          37

a  Predictors: (Constant), Oakiness, Body, Clarity, Aroma, Flavour
b  Dependent Variable: Quality


Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients                   Collinearity Statistics
Model          B        Std. Error           Beta                        t        Sig.   Tolerance   VIF
1  (Constant)   3.997   2.232                                             1.791   .083
   Clarity      2.339   1.735                 .142                        1.349   .187    .790      1.266
   Aroma         .483    .272                 .255                        1.771   .086    .420      2.381
   Body          .273    .333                 .110                         .821   .418    .486      2.056
   Flavour      1.168    .304                 .587                        3.837   .001    .373      2.682
   Oakiness     -.684    .271                -.247                       -2.522   .017    .912      1.097

a  Dependent Variable: Quality


Points to be noted:
R-square = 0.721 = 72.1%
Adjusted R-square = 0.677 = 67.7%
SSR = 111.540, SSE = 43.248, SST = 154.788
Degrees of freedom: for SSR, k = 5; for SSE, n - k - 1 = 38 - 5 - 1 = 32;
for SST, n - 1 = 37.
Also note MSR, MSE, and the F-value.
Manually, the F-ratio has to be compared against the F distribution with numerator df = 5 and denominator df = 32.
Note that the p-value is 0.000 (that is, almost negligible). The model is useful.
Verify the value of R-square = SSR/SST = 111.540 / 154.788 = 0.721.
Also verify adjusted R-square = 1 - MSE/[SST/(n - 1)] = 1 - 1.352 / [154.788 / 37] = 0.677.
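These verification steps are plain arithmetic; here is a minimal sketch in Python, with every number copied from the tables above:

```python
# Reproducing the MODEL 1 summary statistics directly from the ANOVA table.
n, k = 38, 5                          # observations, regressors
SSR, SSE = 111.540, 43.248
SST = SSR + SSE                       # 154.788

MSR = SSR / k                         # 22.308
MSE = SSE / (n - k - 1)               # 1.352
F = MSR / MSE                         # 16.506 -> compare with F(5, 32)

R2 = SSR / SST                        # 0.721
adj_R2 = 1 - MSE / (SST / (n - 1))    # 0.677

print(F, R2, adj_R2)
```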

Although the model is useful and the R-square (and adjusted R-square) is good, the third table
indicates that not all coefficients are significant. For example, clarity, aroma, and body are
not significant (p-value > 0.05). Can they be eliminated from the model to form a
simpler model, without compromising much on the R-square?
Also note that the problem of multicollinearity is absent. [An indicator of this is that all the VIF
values given in the third table are well below 10.] This is good.
The standardized coefficients are obtained by normalizing each variable in the data
(the dependent as well as the independent variables) and then fitting the regression on the
normalized data.
[A variable is normalized by transforming it using the equation:
normalized value = (actual value - mean of that variable) / (standard deviation of that variable).
Since this was not done in class, you need not worry about the standardized coefficients from
the exam point of view. However, they are used in this note to select the next variable.
I repeat, in case you are not able to understand the standardized coefficients, DO NOT worry
about them for the exam, since you will not be expected to build a model. The exam will involve
interpreting a model and the calculations required for that.]
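For readers following along outside SPSS, here is a minimal sketch of fitting MODEL 1 with statsmodels and reproducing the third table, including the standardized coefficients and the VIFs. It assumes the `wine` DataFrame from the earlier sketch; the column names are illustrative.

```python
# A minimal sketch of fitting MODEL 1 and reproducing the coefficients table.
# Assumes the `wine` DataFrame from the earlier sketch.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

regressors = ["Clarity", "Aroma", "Body", "Flavour", "Oakiness"]
X = wine[regressors]
y = wine["Quality"]

model1 = sm.OLS(y, sm.add_constant(X)).fit()
print(model1.summary())               # B, std. errors, t, Sig., R-square

# Standardized coefficients: z-score every variable, then refit (no constant
# is needed because all the variables are centred).
z = (wine - wine.mean()) / wine.std()
print(sm.OLS(z["Quality"], z[regressors]).fit().params)

# VIF for each regressor; all should come out well below 10 here.
Xc = sm.add_constant(X)
for i, name in enumerate(regressors, start=1):
    print(name, variance_inflation_factor(Xc.values, i))
```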
Let us have a clean slate again and start including variables one by one. Among the absolute
values of the standardized coefficients in the third table, the highest is for flavour. So let us
start by including just this one variable (this is simple linear regression). Call this
MODEL 2. The output obtained from SPSS is as follows:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .790(a)   .624       .614                1.27119

a  Predictors: (Constant), Flavour
ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression   96.615           1   96.615        59.789   .000(a)
   Residual     58.173          36    1.616
   Total       154.788          37

a  Predictors: (Constant), Flavour
b  Dependent Variable: Quality


Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients                   Collinearity Statistics
Model          B        Std. Error           Beta                        t        Sig.   Tolerance   VIF
1  (Constant)   4.941    .991                                             4.986   .000
   Flavour      1.572    .203                 .790                        7.732   .000   1.000       1.000

a  Dependent Variable: Quality


Points to be noted:
R-square = 0.624 = 62.4%
Adjusted R-square = 0.614 = 61.4%
SSR = 96.615, SSE = 58.173, SST = 154.788
(Important: SST has not changed; it is the same as in the previous model. A little thought about
the formula used for SST will show that this is obvious.)
Degrees of freedom: for SSR, k = 1; for SSE, n - k - 1 = 38 - 1 - 1 = 36;
for SST, n - 1 = 37.
Also note MSR, MSE, and the F-value.
Manually, the F-ratio has to be compared against the F distribution with numerator df = 1 and denominator df = 36.
Note that the p-value is 0.000 (that is, almost negligible). The model is useful.
Verify the value of R-square = SSR/SST = 96.615 / 154.788 = 0.624.
Also verify adjusted R-square = 1 - MSE/[SST/(n - 1)] = 1 - 1.616 / [154.788 / 37] = 0.614.

The third table shows the significance of the coefficients. As this is a simple linear regression,
the t-test is equivalent to the F-test of the second table [as noted in class]. Thus the t-test
shows that the coefficient of flavour is significant.
Note that the test statistic in the t-test is 7.732 and the test statistic in the F-test is 59.789,
and we verify that 7.732 x 7.732 = 59.784 ≈ 59.789 [since the two tests are equivalent].
The regression equation can be written as:
quality = 4.941 + 1.572 x flavour
(that is, given a value of flavour, we predict the value of quality using this equation).
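As a small worked example (the flavour score of 5.0 is made up purely for illustration), here is a sketch that applies the equation and also verifies the t-versus-F equivalence noted above:

```python
# Using the MODEL 2 equation to predict quality from flavour, and checking
# the t-versus-F equivalence. All coefficients come from the tables above.
b0, b1 = 4.941, 1.572                 # intercept and slope of MODEL 2

def predict_quality(flavour):
    return b0 + b1 * flavour

print(predict_quality(5.0))           # 4.941 + 1.572 * 5.0 = 12.801

t = 7.732
print(t ** 2)                         # 59.78, i.e. the F statistic up to rounding
```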

The above equation, although simple, has low explanatory power, as indicated by the R-square.
Can any other variable be added? Going by the same logic as for taking the flavour variable first,
the next variable to try is aroma. Note that in the presence of all the other variables,
aroma was not significant (the p-value of the t-test for aroma in the first model was
0.086, which is greater than 0.05). However, let us try it.
The output of the model including aroma along with flavour (call this MODEL 3) is:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .812(a)   .659       .639                1.22885

a  Predictors: (Constant), Aroma, Flavour

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  101.936           2   50.968        33.752   .000(a)
   Residual     52.852          35    1.510
   Total       154.788          37

a  Predictors: (Constant), Aroma, Flavour
b  Dependent Variable: Quality
Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients                   Collinearity Statistics
Model          B        Std. Error           Beta                        t        Sig.   Tolerance   VIF
1  (Constant)   4.346   1.009                                             4.307   .000
   Flavour      1.170    .291                 .588                        4.027   .000    .457      2.186
   Aroma         .518    .276                 .274                        1.877   .069    .457      2.186

a  Dependent Variable: Quality


By now, you should be able to interpret the model yourself. The important feature of MODEL 3
over MODEL 2 is some improvement in R-square. Flavour remains significant by the t-test,
while aroma is marginally insignificant (p-value 0.069, slightly greater than 0.05).

Going back to MODEL 2, a close contender for the next variable to include (going by our
standardized-coefficients logic, instead of aroma) is oakiness. Let us try that. Thus we fit a
model with flavour and oakiness only. Call this MODEL 4. The SPSS output is:

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .813(a)   .661       .642                1.22423

a  Predictors: (Constant), Oakiness, Flavour

ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression  102.332           2   51.166        34.139   .000(a)
   Residual     52.456          35    1.499
   Total       154.788          37

a  Predictors: (Constant), Oakiness, Flavour
b  Dependent Variable: Quality
Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients                   Collinearity Statistics
Model          B        Std. Error           Beta                        t        Sig.   Tolerance   VIF
1  (Constant)   6.912   1.389                                             4.977   .000
   Flavour      1.642    .199                 .825                        8.249   .000    .968      1.033
   Oakiness     -.541    .277                -.195                       -1.953   .059    .968      1.033

a  Dependent Variable: Quality


Note that MODEL 4 is similar to, but a little better than, MODEL 3 (with respect to R-square
and adjusted R-square). The coefficient of oakiness is marginally insignificant (the p-value of the
t-test is 0.059, slightly greater than 0.05), but that is okay. Notice that the VIF values for
MODEL 4 are lower than those of MODEL 3. In that sense too, MODEL 4 is comparatively better.
This may also be a reason for the comparatively lower performance of MODEL 3: even though
the standardized-coefficient magnitude (the criterion we use for selecting the next variable,
taken from MODEL 1) was higher for aroma than for oakiness, aroma is more strongly correlated
with flavour than oakiness is.
[Therefore, the standardized coefficients (the criterion we used from MODEL 1 to determine
which variables to include) are just guidelines and may not always work. However, they are
good guidelines, as revealed in building MODEL 2, and then in the transition from MODEL 2 to
MODEL 3, as well as from MODEL 2 to MODEL 4.]
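This can be checked from the tables themselves: with only two regressors, tolerance = 1 - r², so MODEL 3's tolerance of .457 implies |r| ≈ 0.74 between aroma and flavour, whereas MODEL 4's tolerance of .968 implies only |r| ≈ 0.18 between oakiness and flavour. A one-line sketch to confirm this from the data (assuming the `wine` DataFrame from the earlier sketch):

```python
# Pairwise correlations among the regressors discussed above; aroma should
# correlate with flavour far more strongly than oakiness does.
print(wine[["Flavour", "Aroma", "Oakiness"]].corr().round(3))
```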

As a final exercise, we again take a clean slate and build MODEL 5 with the explanatory
variable body alone. The reason for choosing this variable is that it had the lowest standardized
coefficient in MODEL 1. Also, in MODEL 1, this variable was highly insignificant (p-value of
0.418) in the presence of the remaining variables. The output of MODEL 5 is as follows:
Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .549(a)   .301       .282                1.73354

a  Predictors: (Constant), Body
ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression   46.603           1   46.603        15.508   .000(a)
   Residual    108.186          36    3.005
   Total       154.788          37

a  Predictors: (Constant), Body
b  Dependent Variable: Quality

Coefficients(a)

               Unstandardized Coefficients   Standardized Coefficients                   Collinearity Statistics
Model          B        Std. Error           Beta                        t        Sig.   Tolerance   VIF
1  (Constant)   6.058   1.644                                             3.685   .001
   Body         1.362    .346                 .549                        3.938   .000   1.000       1.000

a  Dependent Variable: Quality


The output reveals that although body is significant when considered alone (by the t-test and
the equivalent F-test), it is a poor choice, as revealed by the low values of R-square (0.301)
and adjusted R-square (0.282). Thus, as mentioned in class, statistical significance does not
necessarily imply usefulness. This also confirms that the standardized coefficients are a good
guideline for selecting variables.

However, in real life, when building models with good software, the software can select the best
variables for you. As mentioned in class, choose a model that is simple (that is, has as few
variables as possible) but does not compromise much on R-square (and adjusted R-square).
Thus one can try many different combinations and finally choose the one that gives the best
tradeoff between simplicity and R-square. Moreover, practical aspects such as which variables
are easy to collect data for, and the cost of data collection, will also play an important role
in real life.
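For the curious, the kind of automatic search such software performs can be sketched in a few lines. The following is an illustrative greedy forward selection on adjusted R-square, not necessarily what any particular package does, again assuming the `wine` DataFrame from the earlier sketches:

```python
# An illustrative forward-selection sketch: greedily add the variable that
# most improves adjusted R-square, and stop when no addition improves it.
# Assumes the `wine` DataFrame from the earlier sketches.
import statsmodels.api as sm

candidates = ["Clarity", "Aroma", "Body", "Flavour", "Oakiness"]
selected = []
y = wine["Quality"]

while candidates:
    current = (sm.OLS(y, sm.add_constant(wine[selected])).fit().rsquared_adj
               if selected else 0.0)
    # Adjusted R-square of each one-variable extension of the current model.
    scores = {var: sm.OLS(y, sm.add_constant(wine[selected + [var]])).fit()
                         .rsquared_adj
              for var in candidates}
    best = max(scores, key=scores.get)
    if scores[best] <= current:       # no improvement: stop
        break
    selected.append(best)
    candidates.remove(best)

print(selected)                       # flavour should enter first, as in MODEL 2
```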
