
STAT 704: Homework 4 Solutions

October 30, 2013

Carry out all hypothesis tests at the 5% significance level.

1. Consider the brand preference data of Problem 6.5.

(a) Obtain and report the scatterplot matrix; what does it tell you about
the relationship between liking $Y$ and each of the predictors, $x_1$ (moisture)
and $x_2$ (sweetness)?

Interpretation: There is a clear linear relationship between moisture
and liking, but only a weak linear relationship between sweetness and
liking.
(b) Fit the regression model $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$. Report the table of
regression effects.
> summary(brand.lm)

Call:
lm(formula = brandlike ~ moisture + sweet)

Residuals:
   Min     1Q Median     3Q    Max
-4.400 -1.762  0.025  1.587  4.200

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.6500     2.9961  12.566 1.20e-08 ***
moisture      4.4250     0.3011  14.695 1.78e-09 ***
sweet         4.3750     0.6733   6.498 2.01e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.693 on 13 degrees of freedom
Multiple R-squared:  0.9521,  Adjusted R-squared:  0.9447
F-statistic: 129.1 on 2 and 13 DF,  p-value: 2.658e-09

(i) Using the p-value from the F-statistic, test $H_0: \beta_1 = \beta_2 = 0$. What
does this imply about $\beta_1$ and $\beta_2$?
Hypothesis:
$H_0: \beta_1 = \beta_2 = 0$ vs. $H_a$: at least one $\beta_j \neq 0$.
Decision rule: p-value $= 2.658 \times 10^{-9} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that the regression
is significant, that is, at least one of $\beta_1$ and $\beta_2$ is not 0.
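
As a quick check on the overall F test, the F statistic can be written in terms of $R^2$ alone. The sketch below is illustrative only: the function name `overall_f` is hypothetical, and the inputs are copied from the R output above. Because $R^2$ is rounded to four decimals there, the result should only agree with the reported 129.1 up to rounding.

```python
# Values copied from the summary(brand.lm) output above.
r_squared = 0.9521    # Multiple R-squared (rounded in the output)
n_predictors = 2      # moisture and sweet
error_df = 13         # residual degrees of freedom


def overall_f(r2, p, df_error):
    """Overall F statistic for H0: all slopes are 0, expressed through R^2."""
    return (r2 / p) / ((1.0 - r2) / df_error)


f_stat = overall_f(r_squared, n_predictors, error_df)
print(round(f_stat, 1))
```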

(ii) Report each of $b_1$ and $b_2$ along with two tests of $H_0: \beta_1 = 0$ and
$H_0: \beta_2 = 0$. Can either predictor be dropped in the presence of the
other?
Hypothesis: $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$ ($b_1 = 4.425$).
Decision rule: p-value $= 1.78 \times 10^{-9} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that moisture is a
significant predictor in our model.
Hypothesis: $H_0: \beta_2 = 0$ vs $H_a: \beta_2 \neq 0$ ($b_2 = 4.375$).
Decision rule: p-value $= 2.01 \times 10^{-5} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that sweetness is a
significant predictor in our model.
Neither moisture nor sweetness can be dropped from the model.

(iii) Interpret both estimated coefficients.

$b_1 = 4.425$: for a fixed level of sweetness, the mean liking of the candy
increases by 4.425 units for each one-unit increase in moisture.
$b_2 = 4.375$: for a fixed level of moisture, the mean liking of the candy
increases by 4.375 units for each one-unit increase in sweetness.
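
To make the "holding the other predictor fixed" reading concrete, here is a minimal sketch that evaluates the fitted equation using the estimates from the coefficient table. The function name `fitted_liking` is an illustrative choice, not part of the original solution.

```python
# Estimates copied from the coefficient table of brand.lm.
B0, B1, B2 = 37.6500, 4.4250, 4.3750


def fitted_liking(moisture, sweetness):
    """Estimated mean liking at the given moisture and sweetness levels."""
    return B0 + B1 * moisture + B2 * sweetness


# Fitted mean liking at moisture 5 and sweetness 4.
fit_at_5_4 = fitted_liking(5, 4)

# Raising moisture by one unit with sweetness held at 4 changes the fit by b1.
moisture_effect = fitted_liking(6, 4) - fitted_liking(5, 4)

print(fit_at_5_4, moisture_effect)
```

The value at moisture 5 and sweetness 4 should match the `fit` column that `predict()` reports for the same point in parts (f) and (g).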

(c) Obtain residual plots of $e_i$ vs $\hat{Y}_i$, $e_i$ vs $x_{i1}$ and $e_i$ vs $x_{i2}$. Obtain the
normal probability plot and a histogram of the residuals. What do these
plots tell you?

Interpretation: The residuals appear to be scattered evenly around 0,
and there is no pattern in the residuals against $\hat{Y}$, $x_1$ or $x_2$. This
suggests that the errors are independent, with mean 0 and constant
variance. Moreover, the normal probability plot indicates that the
residuals are approximately normally distributed.

(d) Use R to conduct the Breusch-Pagan test of $H_0: \alpha_1 = \alpha_2 = 0$ in the
variance model $\sigma_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2}$.

> library(lmtest)
> bptest(liking ~ moisture + sweetness, studentize=FALSE)
Breusch-Pagan test
data: liking ~ moisture + sweetness
BP = 1.0422, df = 2, p-value = 0.5939

Hypothesis: $H_0: \alpha_1 = \alpha_2 = 0$ vs. $H_a$: at least one $\alpha_j \neq 0$.
Decision rule: p-value $= .5939 > .05$, fail to reject $H_0$.
Conclusion: We have insufficient evidence to conclude that the variance
in our model depends on the predictor variables. That is, the data are
consistent with the constant variance assumption.

(e) Report $R^2$: how is it interpreted here?

$R^2 = 0.9521$: 95.21% of the variability in how much a brand of candy
is liked is explained by the two predictor variables, moisture and
sweetness.

(f) Obtain and interpret a 95% interval estimate of $E(Y_h)$ when $x_{h1} = 5$ and
$x_{h2} = 4$.

> predict(brand.lm,new,interval="confidence",level=.99)
fit lwr upr
1 77.275 73.88111 80.66889

Interpretation: The call above uses level=.99, so this is a 99% interval
rather than the 95% interval requested. We are 99% confident that the
interval (73.88, 80.67) contains the true mean degree to which the candy
is liked for a moisture level of 5 and a sweetness level of 4.

(g) Obtain and interpret a 95% prediction interval for a new $Y_h$ when $x_{h1} =
5$ and $x_{h2} = 4$.

> predict(brand.lm,new,interval="prediction",level=.99)
fit lwr upr
1 77.275 68.48077 86.06923

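Interpretation: The call above uses level=.99, so this is a 99% prediction
interval rather than the 95% interval requested. We are 99% confident
that the interval (68.48, 86.07) contains the degree to which a new
individual observation of the candy is liked for a moisture level of 5
and a sweetness level of 4.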
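
Parts (f) and (g) ask for 95% intervals, while the `predict()` calls shown use `level=.99`. Because both interval types have the form fit ± t × (standard error), a 95% interval can be approximated from the reported 99% one by rescaling the half-width. The sketch below assumes the standard t-table quantiles for 13 degrees of freedom (3.0123 and 2.1604, entered by hand); the helper name `rescale_interval` is illustrative. Rerunning `predict()` with `level=.95` would be the direct route.

```python
# t quantiles with 13 error degrees of freedom, taken from a t table (assumed inputs).
T_995_13 = 3.0123   # t(0.995; 13), used by the 99% intervals
T_975_13 = 2.1604   # t(0.975; 13), needed for 95% intervals


def rescale_interval(fit, lwr, upr, t_old, t_new):
    """Convert a t-based interval from one confidence level to another."""
    half_width = (upr - lwr) / 2.0
    std_error = half_width / t_old      # standard error implied by the old interval
    new_half = t_new * std_error
    return fit - new_half, fit + new_half


# 99% intervals as reported by predict(..., level=.99).
ci95 = rescale_interval(77.275, 73.88111, 80.66889, T_995_13, T_975_13)
pi95 = rescale_interval(77.275, 68.48077, 86.06923, T_995_13, T_975_13)

print([round(v, 2) for v in ci95])
print([round(v, 2) for v in pi95])
```

Under these assumptions the 95% confidence interval comes out at roughly (74.84, 79.71) and the 95% prediction interval at roughly (70.97, 83.58), both narrower than their 99% counterparts, as expected.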
SAS output

(h) Obtain and interpret $SSR(x_1|x_2)$ and $SSR(x_2|x_1)$.

$SSR(x_1|x_2) = 1566.45$. This means that adding moisture to a model that
already contains sweetness increases the regression sum of squares by
1566.45.
$SSR(x_2|x_1) = 306.25$. This means that adding sweetness to a model that
already contains moisture increases the regression sum of squares by
306.25.

(i) Obtain $SSR(x_1)$, $SSR(x_2|x_1)$, and verify $SSR(x_1, x_2) = SSR(x_1) +
SSR(x_2|x_1)$.
$SSR(x_1) = 1566.45$ and $SSR(x_2|x_1) = 306.25$.
$SSR(x_1) + SSR(x_2|x_1) = 1872.7 = SSR(x_1, x_2)$.
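
The decomposition can also be cross-checked against the reported $R^2$. The sketch below is a rough check under one assumption: SSE is not quoted in the text, so it is reconstructed here from the residual standard error (2.693 on 13 degrees of freedom) and is therefore only accurate to rounding.

```python
# Sequential sums of squares reported in the solution.
ssr_x1 = 1566.45
ssr_x2_given_x1 = 306.25
ssr_full = ssr_x1 + ssr_x2_given_x1          # SSR(x1, x2)

# Assumption: SSE recovered as s^2 * error df from the R output (s = 2.693, df = 13).
sse = 2.693 ** 2 * 13

ssto = ssr_full + sse
r_squared = ssr_full / ssto

print(round(ssr_full, 2), round(r_squared, 4))
```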
SAS output

(j) Obtain and interpret $R^2_{Y1|2}$ and $R^2_{Y2|1}$.

$R^2_{Y1|2} = .943$. This indicates that 94.3% of the variability in liking
that is not explained by sweetness is explained by adding moisture to
the model.
$R^2_{Y2|1} = .765$. This indicates that 76.5% of the variability in liking
that is not explained by moisture is explained by adding sweetness to
the model.
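
The coefficients of partial determination follow from the extra sums of squares in (h), since $R^2_{Y1|2} = SSR(x_1|x_2) / SSE(x_2)$ and $SSE(x_2) = SSR(x_1|x_2) + SSE(x_1, x_2)$. A minimal sketch, assuming the full-model SSE is reconstructed from the residual standard error in the R output (it is not quoted directly) and using a hypothetical helper name:

```python
# Assumption: full-model SSE recovered as s^2 * error df (s = 2.693, df = 13).
sse_full = 2.693 ** 2 * 13


def partial_r2(extra_ssr, sse_full_model):
    """Share of the variation left by the reduced model that the added predictor explains."""
    sse_reduced = extra_ssr + sse_full_model
    return extra_ssr / sse_reduced


r2_y1_given_2 = partial_r2(1566.45, sse_full)   # moisture added to sweetness
r2_y2_given_1 = partial_r2(306.25, sse_full)    # sweetness added to moisture

print(round(r2_y1_given_2, 3), round(r2_y2_given_1, 3))
```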

(k) Obtain and interpret the variance inflation factors $VIF_1$ and $VIF_2$.
$VIF_1 = VIF_2 = 1$. The variance inflation factors for both predictors are
one, which indicates that there is no correlation between the predictors.

2. Consider the commercial properties data of Problem 6.18.

(a) Obtain and report the scatterplot matrix; what does it tell you about
the relationship between rental rate $Y$ and each of the predictors $x_1$
age, $x_2$ operating expense, $x_3$ vacancy and $x_4$ square footage?

Interpretation: In the scatterplot matrix, the only predictor that
appears to have a clear linear relationship with rental rate is total
square footage. The relationship between rental rate and operating
expense looks quadratic. The plot of rental rate against age has a
dumbbell shape. In addition, vacancy rate is heavily right-skewed in the
scatterplot of vacancy rate versus rental rate.
(b) Fit the regression model $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \epsilon_i$.
Report the table of regression effects.

> summary(commercial.lm)

Call:
lm(formula = rent ~ age + expense + vacancy + square)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.220e+01  5.780e-01  21.110  < 2e-16 ***
age         -1.420e-01  2.134e-02  -6.655 3.89e-09 ***
expense      2.820e-01  6.317e-02   4.464 2.75e-05 ***
vacancy      6.193e-01  1.087e+00   0.570     0.57
square       7.924e-06  1.385e-06   5.722 1.98e-07 ***

Residual standard error: 1.137 on 76 degrees of freedom
Multiple R-squared:  0.5847,  Adjusted R-squared:  0.5629
F-statistic: 26.76 on 4 and 76 DF,  p-value: 7.272e-14
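
Each t value in the table is the estimate divided by its standard error, and at the 5% level it is compared with the critical value $t(0.975; 76)$. The sketch below recomputes the ratios from the rounded table entries; the critical value 1.992 is an approximate t-table figure entered by hand (an assumption), and small differences from the printed t values come from rounding in the output.

```python
# (estimate, standard error) pairs copied from the coefficient table above.
coef_table = {
    "(Intercept)": (1.220e+01, 5.780e-01),
    "age": (-1.420e-01, 2.134e-02),
    "expense": (2.820e-01, 6.317e-02),
    "vacancy": (6.193e-01, 1.087e+00),
    "square": (7.924e-06, 1.385e-06),
}

T_CRIT = 1.992  # approximate t(0.975; 76) from a t table (assumed value)

# t statistic for H0: beta_j = 0 is estimate / standard error.
t_values = {name: est / se for name, (est, se) in coef_table.items()}

# Two-sided test at the 5% level: reject H0 when |t| exceeds the critical value.
significant = {name: abs(t) > T_CRIT for name, t in t_values.items()}

for name, t in t_values.items():
    print(f"{name:12s} t = {t:7.3f}  reject H0: {significant[name]}")
```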

(i) Using the p-value from the F-statistic, test $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$.
What does this imply about $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$?
Hypothesis:
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$ vs. $H_a$: at least one $\beta_j \neq 0$.
Decision rule: p-value $= 7.27 \times 10^{-14} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that the regression
is significant, that is, at least one of these four predictor variables
is significant in explaining the response variable.

(ii) Report each of $b_1$, $b_2$, $b_3$ and $b_4$ along with tests of $H_0: \beta_j = 0$,
$j = 1, 2, 3, 4$. Can any predictor be dropped in the presence of the
other three?
Hypothesis: $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$ ($b_1 = -0.142$).
Decision rule: p-value $= 3.89 \times 10^{-9} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that age is a
significant predictor of rental rate in the presence of the other three
predictors.
Hypothesis: $H_0: \beta_2 = 0$ vs $H_a: \beta_2 \neq 0$ ($b_2 = 0.282$).
Decision rule: p-value $= 2.75 \times 10^{-5} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that operating
expense is a significant predictor of rental rate in the presence of the
other three predictors.
Hypothesis: $H_0: \beta_3 = 0$ vs $H_a: \beta_3 \neq 0$ ($b_3 = 0.619$).
Decision rule: p-value $= 0.57 > 0.05$, do not reject $H_0$.
Conclusion: We have insufficient evidence to conclude that vacancy is a
significant predictor of rental rate in the presence of the other three
predictors. In other words, vacancy can be dropped from the full model.
Hypothesis: $H_0: \beta_4 = 0$ vs $H_a: \beta_4 \neq 0$ ($b_4 = 7.92 \times 10^{-6}$).
Decision rule: p-value $= 1.98 \times 10^{-7} < 0.05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that square footage
is a significant predictor of rental rate in the presence of the other
three predictors.

(iii) Interpret all four estimated coefficients.

The estimated regression equation:
$\hat{Y} = 12.20 - 0.142 X_1 + 0.282 X_2 + 0.619 X_3 + 7.92 \times 10^{-6} X_4$.
$b_1 = -0.142$: holding all the other predictors constant, the rental rate
decreases by 0.142 on average for each additional year of property age.
$b_2 = 0.282$: holding all the other predictors constant, the rental rate
increases by 0.282 on average for each one-unit increase in operating
expenses and taxes.
$b_3 = 0.619$: holding all the other predictors constant, the rental rate
increases by 0.619 on average for each one-unit increase in the vacancy
rate.
$b_4 = 7.92 \times 10^{-6}$: holding all the other predictors constant, the rental
rate increases by $7.92 \times 10^{-6}$ on average for each additional square
foot of total square footage.

(c) Obtain residual plots of $e_i$ vs $\hat{Y}_i$, $e_i$ vs $x_{i1}$, $e_i$ vs $x_{i2}$, $e_i$ vs $x_{i3}$ and $e_i$ vs
$x_{i4}$. Obtain the normal probability plot and a histogram of the residuals.
What do these plots tell you?

Interpretation: The residuals appear to show some systematic pattern in
each of these plots. The most troubling predictor is age: the residual
plot against age does not show constant variance. The concentration of
observations at low vacancy rates is also a cause for concern. In
addition, the Q-Q plot has heavy tails, indicating that the residuals
may not be normally distributed.
(d) Use R to conduct the Breusch-Pagan test of $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$
in the variance model $\sigma_i = \alpha_0 + \alpha_1 x_{i1} + \alpha_2 x_{i2} + \alpha_3 x_{i3} + \alpha_4 x_{i4}$.

> bptest(commercial.reg,studentize=FALSE)
Breusch-Pagan test

data: commercial.reg

BP = 16.5156, df = 4, p-value = 0.0024

Hypothesis: $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ vs. $H_a$: at least one
$\alpha_j \neq 0$.
Decision rule: p-value $= .0024 < .05$, reject $H_0$.
Conclusion: We have sufficient evidence to conclude that the variance
in our model depends on the predictor variables. That is, the constant
variance assumption does not hold for our model. We need to modify the
data (for example, by a transformation) to satisfy the Gauss-Markov
assumptions.
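
The p-values reported by `bptest` are upper-tail chi-square probabilities with degrees of freedom equal to the number of predictors in the variance model. For even degrees of freedom the tail probability has a closed form, so both Breusch-Pagan p-values in this homework can be reproduced by hand. This is a sketch of that calculation only (the function name is illustrative); it does not recompute the BP statistics themselves.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0      # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Commercial properties data: BP = 16.5156 on 4 df.
p_commercial = chi2_sf_even_df(16.5156, 4)

# Brand preference data (Problem 1d): BP = 1.0422 on 2 df.
p_brand = chi2_sf_even_df(1.0422, 2)

print(round(p_commercial, 4), round(p_brand, 4))
```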

(e) Report $R^2$: how is it interpreted here?

$R^2 = SSR/SSTO = 0.5847$ means that 58.47% of the variation in the response,
rental rate, is explained by the regression model based on the variables
age, operating expense, vacancy rate and square footage.
(f) Obtain the family of estimates using a 95 percent family confidence co-
efficient. Employ the most efficient procedure.

> predict(commercial.lm,new1,interval="confidence",level=.9875)
fit lwr upr
1 15.79813 15.08664 16.50962
2 16.02754 15.42391 16.63116
3 15.90072 15.33232 16.46913
4 15.84339 15.18040 16.50638

(g) Develop separate prediction intervals for the rental rates of these proper-
ties, using a 95 percent statement confidence coefficient in each case. Can
the rental rates of these three properties be predicted fairly precisely?
What is the family confidence level for the set of three predictions?

> predict(commercial.lm,new2,interval="prediction",level=.95)
fit lwr upr
1 15.14850 12.85249 17.44450
2 15.54249 13.24504 17.83994
3 16.91384 14.53469 19.29299

The rental rates cannot be predicted very precisely. Each prediction
interval is about as wide as the range of rental rates in the original
data. By the Bonferroni inequality, the family confidence level is at
least $1 - 3 \times 5\% = 85\%$.
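
Both (f) and (g) rest on the Bonferroni inequality: with $g$ intervals each at level $1 - \alpha$, the family level is at least $1 - g\alpha$. The sketch below shows the two directions of that arithmetic; the function names are hypothetical. It covers only the Bonferroni calculation that matches `level=.9875` in (f), not a comparison with other simultaneous procedures.

```python
def bonferroni_statement_level(family_alpha, g):
    """Per-interval confidence level that guarantees the family level 1 - family_alpha."""
    return 1.0 - family_alpha / g


def bonferroni_family_lower_bound(statement_alpha, g):
    """Guaranteed lower bound on the family level for g intervals at 1 - statement_alpha."""
    return 1.0 - g * statement_alpha


# (f): four mean-response intervals with a 95% family coefficient.
level_f = bonferroni_statement_level(0.05, 4)

# (g): three separate 95% prediction intervals.
bound_g = bonferroni_family_lower_bound(0.05, 3)

print(round(level_f, 4), round(bound_g, 2))
```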
SAS output

(h) Obtain and interpret $SSR(x_1)$, $SSR(x_2|x_1)$, $SSR(x_3|x_1, x_2)$ and $SSR(x_4|x_1, x_2, x_3)$.
$SSR(x_1) = 14.819$. This means that including age in the model, rather
than using only the mean of the rental rates, explains 14.819 of the
total sum of squares.
$SSR(x_2|x_1) = 72.802$. This means that adding operating expenses and
taxes to a model that already contains age increases the regression sum
of squares by 72.802.
$SSR(x_3|x_1, x_2) = 8.381$. This means that adding vacancy rate to a model
that already contains age and operating expenses/taxes increases the
regression sum of squares by 8.381.
$SSR(x_4|x_1, x_2, x_3) = 42.325$. This means that adding total square footage
to a model that already contains age, operating expenses/taxes and
vacancy rate increases the regression sum of squares by 42.325.

(i) Verify that the above extra sums of squares in (h) sum to $SSR(x_1, x_2, x_3, x_4)$.
$SSR(x_1, x_2, x_3, x_4) = 138.3269 = SSR(x_1) + SSR(x_2|x_1) + SSR(x_3|x_1, x_2) +
SSR(x_4|x_1, x_2, x_3)$.
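
The verification, and the way $R^2$ builds up as predictors enter in the order age, expense, vacancy, square, can be sketched as follows. One assumption is labeled in the code: SSE is not quoted in the text, so it is reconstructed from the residual standard error (1.137 on 76 degrees of freedom) and matches only up to rounding.

```python
from itertools import accumulate

# Sequential (Type I) extra sums of squares reported in (h), in order of entry.
extra_ss = [
    ("age", 14.819),
    ("expense | age", 72.802),
    ("vacancy | age, expense", 8.381),
    ("square | age, expense, vacancy", 42.325),
]

# Running regression sum of squares after each predictor enters.
running_ssr = list(accumulate(value for _, value in extra_ss))
ssr_full = running_ssr[-1]

# Assumption: SSE recovered as s^2 * error df from the R output (s = 1.137, df = 76).
sse = 1.137 ** 2 * 76
ssto = ssr_full + sse

# R^2 of each nested model along the sequence.
running_r2 = [ssr / ssto for ssr in running_ssr]

for (label, _), ssr, r2 in zip(extra_ss, running_ssr, running_r2):
    print(f"{label:32s} SSR = {ssr:8.3f}  R^2 = {r2:.4f}")
```

Note that this ordering is specific to the sequence in which predictors are added; a different order of entry would generally give different individual extra sums of squares with the same total.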
SAS output

(j) Obtain and interpret $R^2_{Y1|234}$, $R^2_{Y2|134}$, $R^2_{Y3|124}$ and $R^2_{Y4|123}$.

$R^2_{Y1|234} = .368$ indicates that 36.8% of the remaining variability is
explained by adding $x_1$ to the model that already had $x_2$, $x_3$ and $x_4$.
$R^2_{Y2|134} = .208$ indicates that 20.8% of the remaining variability is
explained by adding $x_2$ to the model that already had $x_1$, $x_3$ and $x_4$.
$R^2_{Y3|124} = .004$ indicates that 0.4% of the remaining variability is
explained by adding $x_3$ to the model that already had $x_1$, $x_2$ and $x_4$.
$R^2_{Y4|123} = .301$ indicates that 30.1% of the remaining variability is
explained by adding $x_4$ to the model that already had $x_1$, $x_2$ and $x_3$.
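
When a single predictor is added last, its coefficient of partial determination can be recovered from its t statistic in the full model, because $t^2 = SSR(x_j | \text{others}) / MSE$ and so $R^2_{Yj|\text{others}} = t^2 / (t^2 + df_E)$. The sketch below applies that identity to the t values printed in the R output for part (b); the function name is illustrative and the results agree with the values above only up to rounding.

```python
ERROR_DF = 76  # residual degrees of freedom of the full model

# t values copied from the coefficient table of commercial.lm.
t_values = {"age": -6.655, "expense": 4.464, "vacancy": 0.570, "square": 5.722}


def partial_r2_from_t(t_value, error_df):
    """Coefficient of partial determination for a predictor added last."""
    t_sq = t_value ** 2
    return t_sq / (t_sq + error_df)


partial_r2 = {name: partial_r2_from_t(t, ERROR_DF) for name, t in t_values.items()}

for name, value in partial_r2.items():
    print(f"{name:8s} {value:.3f}")
```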

(k) Obtain and interpret the variance inflation factors $VIF_j$ for $j = 1, 2, 3, 4$.
All variance inflation factors are relatively close to one, indicating that
there is no serious multicollinearity among the predictors affecting our
model.

3.7.1 (1) $SSR(X_1|X_2)$, d.f. = 1;
(2) $SSR(X_2|X_1, X_3)$, d.f. = 1;
(3) $SSR(X_1, X_2|X_3, X_4)$, d.f. = 2;
(4) $SSR(X_1, X_2, X_3|X_4, X_5)$, d.f. = 3.

3.7.2 $SSR(X_1)$ measures the explanation that $X_1$ adds to the model
$Y_i = \beta_0 + \epsilon_i$, which contains no predictor variables. In that
model $b_0 = \bar{Y}$, so the regression sum of squares is 0 and the error sum
of squares equals the total sum of squares. $SSR(X_1) = SSTO - SSE(X_1)$
is therefore the extra sum of squares gained by adding $X_1$ to the model
with no predictor variables.

3.7.28

(a) (1) $SSR(X_5|X_1) = SSR(X_1, X_5) - SSR(X_1)$;
(2) $SSR(X_3, X_4|X_1) = SSR(X_1, X_3, X_4) - SSR(X_1)$;
(3) $SSR(X_4|X_1, X_2, X_3) = SSR(X_1, X_2, X_3, X_4) - SSR(X_1, X_2, X_3)$.

(b) (1) $\beta_5 = 0$: $SSR(X_5|X_1, X_2, X_3, X_4)$;
(2) $\beta_2 = \beta_4 = 0$: $SSR(X_2, X_4|X_1, X_3, X_5)$.

3.7.29
(a) $SSR(X_1) + SSR(X_2, X_3|X_1) + SSR(X_4|X_1, X_2, X_3)$
$= SSR(X_1) + (SSR(X_1, X_2, X_3) - SSR(X_1)) + (SSR(X_1, X_2, X_3, X_4)
- SSR(X_1, X_2, X_3))$
$= SSR(X_1, X_2, X_3, X_4)$.
SAS Code
filename webdata url "http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/data/textdatasets/KutnerData/Chapter%20%206%20Data%20Sets/CH06PR05.txt";

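/* Brand preference data (Problem 6.5): liking, moisture, sweetness */
data one;
  infile webdata truncover;
  input like moisture sweet;
  put _infile_;   /* echo each raw record to the log */
run;

/* Scatterplot matrix */
proc sgscatter data=one;
  matrix like moisture sweet;
run;

/* Pairwise correlations */
proc corr data=one;
  var like moisture sweet;
run;

/* Parameter estimates and sums of squares */
proc glm data=one;
  model like=moisture sweet/solution;
run;

/* VIFs and Type I squared partial correlations, in both orders of entry */
proc reg data=one;
  model like=moisture sweet/pcorr1 vif;
  model like=sweet moisture/pcorr1;
run;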

filename webdata url "https://netfiles.umn.edu/users/nacht001/www/nachtsheim/5th/KutnerData/Chapter%20%206%20Data%20Sets/CH06PR18.txt";

data two;
infile webdata truncover;
input rental_rate age expense vacancy square;
put _infile_;
run;

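/* Full model for the commercial properties data, with parameter estimates */
proc glm data=two;
  model rental_rate=age expense vacancy square/solution;
run;

/* VIFs and Type II squared partial correlations */
proc reg data=two;
  model rental_rate=age expense vacancy square/vif pcorr2;
run;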
