
ENME392 Fall 2013

Homework 13 Solutions

Total number of points: 100

Assignment:

1. Analyze the following data, which are for houses on sale in the Joppa neighborhood (zip
code 21085). BR = number of bedrooms, BTH = number of bathrooms, SIZE = size in
square feet, and PRICE = price in $. How should you analyze these data, and what
conclusions can you draw?
(15pts)

BR  BTH  SIZE (ft^2)  PRICE ($)
4 2 1460 275,000
4 2 1499 265,000
3 2 1512 339,900
2 2 1700 254,900
3 2 1755 249,900
3 2.5 1794 325,000
3 2 1940 399,900
4 3 2090 289,000
3 2.5 2124 290,000
4 3 2200 389,000
3 2.5 2316 389,900
4 2 2400 539,900
4 2.5 2484 374,500
4 3 2780 749,900
4 3.5 2910 355,000
3 3.5 3113 399,900
4 3.5 3233 522,000
3 2.5 3500 349,900
5 3.5 3614 449,999
4 3.5 4460 589,900
6 4 4577 349,900
6 4 5396 629,900
6 3.5 6001 674,900

Start by doing some exploratory data plotting. Begin with histograms. The price data have a
peak around $400k, but have a long tail, and are perhaps even bimodal. The minimum price in
this neighborhood is ~$250k (recall that the bin labels are the upper limit on the bin). The size
data follow a similar skewed distribution. (Note that the shapes of the histogram change as you
change the bins, since we have so few points.) There is a minimum house size of ~1500 square
feet.
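These bin counts are easy to reproduce numerically. The following numpy sketch (an illustration, not part of the original Excel work) bins the 23 listed prices into $100k-wide bins whose labels are the upper limits:

```python
import numpy as np

# asking prices from the data table, in dollars
prices = np.array([275000, 265000, 339900, 254900, 249900, 325000, 399900,
                   289000, 290000, 389000, 389900, 539900, 374500, 749900,
                   355000, 399900, 522000, 349900, 449999, 589900, 349900,
                   629900, 674900])

# $100k-wide bins from $200k to $800k; the count for the edge pair
# (300k, 400k) corresponds to the bar labeled "$400,000" in the histogram
counts, edges = np.histogram(prices, bins=np.arange(200_000, 800_001, 100_000))
```

The modal bin is $300k-$400k with 10 of the 23 houses, consistent with the peak near $400k and the long right tail described above.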
[Figure: histogram of Price ($200,000-$800,000 in $100k bins), Frequency 0-12]

[Figure: histogram of Size (1,000-7,000 ft^2), Frequency 0-9]

[Figure: Price ($) vs. Size (ft^2) scatter plot with linear fit y = 75.622x + 197761, R^2 = 0.4594]

Next, a plot of price vs. size shows that these two variables are correlated, and that size accounts for about half of the variation in price. The intercept is nonzero, suggesting that a house of zero size (i.e., the raw land alone) has a value of about $200k. The slope is about 76, so the model predicts a $76 increase in price for each additional square foot.

Is regression a reasonable thing to do? Let's run a regression and look at the output.

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.67781
R Square 0.459427
Adjusted R Square 0.433685
Standard Error 106436.5
Observations 23
ANOVA
df SS MS F Significance F
Regression 1 2.02E+11 2.02E+11 17.84767 0.00038
Residual 21 2.38E+11 1.13E+10
Total 22 4.4E+11
             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    197761.5      55140.45        3.586504  0.001738  83090.62   312432.3
SIZE (ft^2)  75.6219       17.90016        4.22465   0.00038   38.39649   112.8473
[Figures: SIZE (ft^2) residual plot; normal probability plot of PRICE vs. sample percentile]

There aren't that many data points, but the residuals don't indicate any obvious problems. There is one point that might be an outlier. The price data are not normally distributed, as expected from our histogram. This is not an issue for regression, which only requires the residuals to be normally distributed.

Looking at the ANOVA part of the output, there is an F value of 17.8 corresponding to a P-value of 0.00038. This means that the amount of variation accounted for by the model is real: it is extremely unlikely to have arisen by chance.

The intercept has a standard error of 55,000, and the slope a standard error of 17.9. The upper and lower 95% columns tell us the range within which we expect the real population values α and β to lie with 95% certainty. Remember, we calculate a and b, which are estimates of α and β. We can be 95% sure that the population intercept is from $83k to $312k and that the population slope is between $38/sq ft and $112/sq ft. The estimate is so wide because of the huge scatter in the data.
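As a cross-check on the Excel output, here is a minimal numpy sketch (illustrative, not part of the assigned solution) that reproduces the slope, its standard error, and the 95% confidence interval from the textbook formulas; the critical value t(0.025, 21) = 2.080 is hardcoded:

```python
import numpy as np

size = np.array([1460, 1499, 1512, 1700, 1755, 1794, 1940, 2090, 2124, 2200,
                 2316, 2400, 2484, 2780, 2910, 3113, 3233, 3500, 3614, 4460,
                 4577, 5396, 6001], dtype=float)
price = np.array([275000, 265000, 339900, 254900, 249900, 325000, 399900,
                  289000, 290000, 389000, 389900, 539900, 374500, 749900,
                  355000, 399900, 522000, 349900, 449999, 589900, 349900,
                  629900, 674900], dtype=float)

n = len(size)
sxx = ((size - size.mean()) ** 2).sum()
b = ((size - size.mean()) * (price - price.mean())).sum() / sxx   # slope
a = price.mean() - b * size.mean()                                # intercept
sse = ((price - (a + b * size)) ** 2).sum()
r2 = 1 - sse / ((price - price.mean()) ** 2).sum()
s = np.sqrt(sse / (n - 2))       # residual standard error
se_b = s / np.sqrt(sxx)          # standard error of the slope
t_crit = 2.080                   # t(0.025, 21)
ci = (b - t_crit * se_b, b + t_crit * se_b)
```

The endpoints of `ci` match the Lower/Upper 95% columns of the Excel output.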

Now let's look at bedrooms and bathrooms. Start with histograms. The bedroom plot again looks bimodal: there is a peak at 3-4 bedrooms, but a few houses have as many as six. Bathrooms are fairly evenly distributed, or even slightly decreasing, between 2 and 4.5.
[Figures: histogram of Bedrooms (0-7), Frequency 0-12; histogram of Bathrooms (0-4.5), Frequency 0-8]


Next let's look at the effect of bedrooms and bathrooms on price. These variables are less well correlated with price than size is, but still account for some of the variation (28% and 27%, respectively). The intercepts of these fits are also nonzero.
[Figures: Price vs. Bedrooms with fit y = 70635x + 137684, R^2 = 0.279; Price vs. Bathrooms with fit y = 105034x + 116228, R^2 = 0.2743]

Let's try a multiple regression to see if including both bedrooms and bathrooms improves the R^2 value (see below). The intercept of this model is $77k, each additional bedroom increases the price by $42k, and each additional bath raises it by $60k. The multiple R^2 is only 0.32, not much higher than for either bedrooms or bathrooms alone, and the adjusted R^2 is only 0.26, slightly lower than fitting either alone! This is a bit surprising: one would have thought that bedrooms plus bathrooms would have increased the predictive power of the model, but it did not. The P-value for the multiple regression is 0.019, so we would expect this amount of correlation, when there is actually none, to occur by chance only 2% of the time. In other words, even though this model accounts for only a small part of the variation, the correlation does not appear to have occurred by chance. However, the upper and lower confidence intervals all span 0! So we cannot be 95% sure that the population even has a positive slope or intercept. The scatter in the data is too large for us to come up with a model in which we can be confident.
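The multiple regression can be reproduced with ordinary least squares in numpy. This sketch is illustrative; the design matrix simply prepends a column of ones, and the last BTH entry (printed as 3.55 in the table) is taken to be 3.5:

```python
import numpy as np

br  = np.array([4, 4, 3, 2, 3, 3, 3, 4, 3, 4, 3, 4, 4, 4, 4, 3, 4, 3,
                5, 4, 6, 6, 6], dtype=float)
bth = np.array([2, 2, 2, 2, 2, 2.5, 2, 3, 2.5, 3, 2.5, 2, 2.5, 3, 3.5,
                3.5, 3.5, 2.5, 3.5, 3.5, 4, 4, 3.5])
price = np.array([275000, 265000, 339900, 254900, 249900, 325000, 399900,
                  289000, 290000, 389000, 389900, 539900, 374500, 749900,
                  355000, 399900, 522000, 349900, 449999, 589900, 349900,
                  629900, 674900], dtype=float)

X = np.column_stack([np.ones(len(br)), br, bth])   # intercept, BR, BTH
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
sse = ((price - X @ beta) ** 2).sum()
r2 = 1 - sse / ((price - price.mean()) ** 2).sum()
n, p = len(price), 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)      # penalize the extra predictor
```

The adjusted R^2 formula makes explicit why adding a weak second predictor can lower it even as the raw R^2 rises.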

The residuals don't raise any obvious red flags.

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.570165
R Square 0.325088
Adjusted R Square 0.257597
Standard Error 121865.8
Observations 23
ANOVA
df SS MS F Significance F
Regression 2 1.43E+11 7.15E+10 4.81675 0.01961
Residual 20 2.97E+11 1.49E+10
Total 22 4.4E+11
             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    77507.5       111052.1        0.697938  0.493251  -154143    309158.2
BR           42328.08      34495.69        1.227054  0.234051  -29628.7   114284.8
BTH          60469.83      51733.31        1.168876  0.256197  -47444     168383.6


[Figures: BR residual plot; BTH residual plot]


Out of curiosity, let's plot the bathrooms vs. the bedrooms. One might imagine that these are related. The number of bedrooms accounts for half the variation in the number of bathrooms, but there is a lot of noise: a 4-bedroom house can have anywhere from 2 to 3.5 bathrooms.

[Figure: Bathrooms vs. Bedrooms with fit y = 0.4681x + 0.9951, R^2 = 0.4928]


Now let's plot the dependence of house size on the number of bedrooms and bathrooms.
[Figures: Size vs. Bedrooms with fit y = 917.06x - 728.72, R^2 = 0.5854; Size vs. Bathrooms with fit y = 1472.6x - 1312.9, R^2 = 0.6711]

As one might expect, these variables are correlated. The size is better correlated with the number of bedrooms and bathrooms than the price is, with bathrooms being better correlated than bedrooms. Again, this is somewhat surprising, and with such a small sample size, our uncertainties on the slope are going to be large. Let's see what they are by running a regression.

Looking at the regression output for size vs. bedrooms, we see that the P-value is 2x10^-5. The confidence interval on the slope is 567 to 1267, so we can be 95% sure that the house size increases with the number of bedrooms (the slope does not span zero), with the slope being in this range. The 95% confidence interval on the intercept is -2131 to +674, which does span zero, so we are not sure if the intercept is positive or negative.

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.765085
R Square 0.585356
Adjusted R Square 0.565611
Standard Error 835.5301
Observations 23
ANOVA
df SS MS F Significance F
Regression 1 20696052 20696052 29.64581 2.11E-05
Residual 21 14660322 698110.6
Total 22 35356374
             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    -728.724      674.6322        -1.08018  0.292312  -2131.7    674.2501
BR           917.0636      168.4294        5.444797  2.11E-05  566.7956   1267.332


Looking at the regression output for size vs. bathrooms, we see that the P-value is 2x10^-6. The confidence interval on the slope is 1005 to 1940, so we can be 95% sure that the house size increases with the number of bathrooms (the slope does not span zero), with the slope being in this range. The 95% confidence interval on the intercept again spans zero, so we are not 95% sure if the intercept is positive or negative. However, given that the range is -2665 to +39, we are almost 95% sure that the intercept is negative. This means that a house of size 0 would still have at least a fraction of a bathroom, which I would interpret to mean that no matter how tiny the house in this neighborhood, it will have at least one bathroom. This makes sense.

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.819183
R Square 0.671061
Adjusted R Square 0.655398
Standard Error 744.1862
Observations 23
ANOVA
df SS MS F Significance F
Regression 1 23726297 23726297 42.8417 1.75E-06
Residual 21 11630077 553813.2
Total 22 35356374
             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    -1312.89      650.1976        -2.01921  0.056412  -2665.05   39.27163
BTH          1472.571      224.9794        6.545357  1.75E-06  1004.7     1940.441


Does multiple regression do better in this case? The R^2 value is 0.74, and the adjusted R^2 = 0.72. So, bedrooms and bathrooms together do a slightly better job of accounting for the variation in house size than either does alone (compare R^2 of 0.58 and 0.67).

Looking at the normal probability plot tells us that the house sizes are not normally distributed, as we already knew from looking at the histogram above. The residuals again look OK.
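The same least-squares sketch, with house size as the response, reproduces the R^2 values quoted here (again illustrative; the last BTH entry is taken as 3.5):

```python
import numpy as np

br  = np.array([4, 4, 3, 2, 3, 3, 3, 4, 3, 4, 3, 4, 4, 4, 4, 3, 4, 3,
                5, 4, 6, 6, 6], dtype=float)
bth = np.array([2, 2, 2, 2, 2, 2.5, 2, 3, 2.5, 3, 2.5, 2, 2.5, 3, 3.5,
                3.5, 3.5, 2.5, 3.5, 3.5, 4, 4, 3.5])
size = np.array([1460, 1499, 1512, 1700, 1755, 1794, 1940, 2090, 2124, 2200,
                 2316, 2400, 2484, 2780, 2910, 3113, 3233, 3500, 3614, 4460,
                 4577, 5396, 6001], dtype=float)

X = np.column_stack([np.ones(len(br)), br, bth])   # intercept, BR, BTH
beta, *_ = np.linalg.lstsq(X, size, rcond=None)
sse = ((size - X @ beta) ** 2).sum()
r2 = 1 - sse / ((size - size.mean()) ** 2).sum()
n, p = len(size), 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```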

Regression Statistics
Multiple R 0.861534
R Square 0.74224
Adjusted R Square 0.716464
Standard Error 675.0351
Observations 23
ANOVA
df SS MS F Significance F
Regression 2 26242927 13121464 28.79583 1.29E-06
Residual 20 9113447 455672.3
Total 22 35356374
             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    -1723.67      615.1366        -2.80209  0.011006  -3006.82   -440.514
BR           449.0481      191.0775        2.350084  0.029146  50.46744   847.6287
BTH          999.7992      286.5596        3.488975  0.002313  402.0464   1597.552
[Figures: residual plots vs. BR and BTH; normal probability plot of SIZE (ft^2) vs. sample percentile]


We can therefore draw the following conclusions.
1. None of these variables is normally distributed.
2. House price is definitely correlated with house size, but size accounts for slightly less than
half of the variation in the price data.
3. The mean price of a lot without a house on it in this neighborhood can be extrapolated to be
between $83k and $312k (the 95% confidence interval on the intercept). The price per square
foot is between $38/sq ft and $112/sq ft with 95% confidence.
4. The number of bedrooms and bathrooms accounts for less than 30% of the variation in price,
but we can be confident that the correlation with price is real.
5. House size is more strongly related to the number of bedrooms and bathrooms, as makes
common sense, than house price is. Taking these two variables together, however, improves
prediction of house size only slightly.
6. Our model is missing one or more important variables that contribute to house size and price.
They say that in real estate, it's location, location, location. Perhaps we should try to identify
this factor and include it in our study of house prices. Other factors that may play a role are
the age of the house (houses are increasing in size over time) and the size of the surrounding
plot (bigger and more expensive houses sit on larger pieces of land).


2. Chapter 13: 13.6 (Edition 8) / 13.4 (Edition 9)
(10pts)

The hypotheses are
H0: μA = μB = μC,
H1: At least two of the means are not equal.
α = 0.01.

Computation:
Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed f
Tablets               158.867          2                    79.433        5.46
Error                 393.000          27                   14.556
Total                 551.867          29

with P-value = 0.0102.
Decision: Since the P-value (0.0102) is greater than α = 0.01, we fail to reject H0. However, this decision is very marginal, since the P-value is very close to the significance level.
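The f statistic and P-value follow directly from the table; a short scipy sketch (illustrative, not part of the assigned solution):

```python
from scipy.stats import f as f_dist

ms_treat = 158.867 / 2    # mean square for tablets
ms_error = 393.000 / 27   # mean square for error
f_stat = ms_treat / ms_error
p_value = f_dist.sf(f_stat, 2, 27)   # upper-tail area of F(2, 27)
```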


3. Chapter 13: 13.14 (Edition 8) / 13.15 (Edition 9)
(10pts)

The means of the treatments are:
y1. = 5.44, y2. = 7.90, y3. = 4.30, y4. = 2.98, and y5. = 6.96.
Since q(0.05, 5, 20) = 4.24, the critical difference is (4.24) sqrt(2.9766/5) = 3.27. Therefore,
Tukey's result may be summarized as follows (a bar joins means that are not significantly different):

y4.    y3.    y1.    y5.    y2.
2.98   4.30   5.44   6.96   7.90
------------------
       ------------------
              ------------------
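The Tukey critical difference is a one-line calculation (a sketch):

```python
import math

q = 4.24            # studentized range value q(0.05, 5, 20)
mse, n = 2.9766, 5  # error mean square and observations per treatment
crit = q * math.sqrt(mse / n)
# any two treatment means differing by more than crit are significantly different
```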


4. Chapter 13: 13.16
(10pts)

(a) The hypotheses are
H0: μ1 = μ2 = μ3 = μ4,
H1: At least two of the means are not equal.
α = 0.05.

Computation:
Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed f
Blends                119.649          3                    39.883        7.10
Error                 44.920           8                    5.615
Total                 164.569          11

with P-value= 0.0121.
Decision: Reject H0. There is a significant difference in mean yield reduction for
the 4 preselected blends.

(b) Since sqrt(s^2/3) = 1.368, we get
p     2      3      4
rp    3.261  3.399  3.475
Rp    4.46   4.65   4.75
Therefore (a bar joins means that are not significantly different),
y3.    y1.    y2.    y4.
23.23  25.93  26.17  31.90
------------------

(c) Since q(0.05, 4, 8) = 4.53, the critical difference is 6.20. Hence
y3.    y1.    y2.    y4.
23.23  25.93  26.17  31.90
------------------
       ------------------


5. Chapter 13: 13.28 (Edition 8) / 13.26 (Edition 9)
(10 pts)

The hypotheses are
H0: α1 = α2 = α3 = 0 (no differences among the varieties),
H1: At least one of the αi is not equal to zero.
α = 0.05.
Critical region: f > 5.14.

Computation:
Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed f
Treatments            24.500           2                    12.250        1.74
Blocks                171.333          3                    57.111
Error                 42.167           6                    7.028
Total                 238.000          11

P-value = 0.2535. Decision: Do not reject H0; we could not show that the varieties of potatoes differ in yield.


6. Chapter 14: 14.2
(10 pts)

Brand  Time  Ascorbic Acid
R      0     52.6
R      0     54.2
R      0     49.8
R      0     46.5
R      3     49.4
R      3     49.2
R      3     42.8
R      3     53.2
R      7     42.7
R      7     48.8
R      7     40.4
R      7     47.6
S      0     56.0
S      0     48.0
S      0     49.6
S      0     48.4
S      3     48.8
S      3     44.0
S      3     44.0
S      3     42.4
S      7     49.2
S      7     44.0
S      7     42.0
S      7     43.2
M      0     52.5
M      0     52.0
M      0     51.8
M      0     53.6
M      3     48.0
M      3     47.0
M      3     48.2
M      3     49.6
M      7     48.5
M      7     43.3
M      7     45.2
M      7     47.6

Make the table.

                Time (days)
Brand           0      3      7
Richfood        53     49     43
                54     49     49
                50     43     40
                47     53     48
Sealed-Sweet    56     49     49.2
                48     44     44
                50     44     42
                48     42     43.2
Minute Maid     52.5   48     48.5
                52     47     43.3
                51.8   48.2   45.2
                53.6   49.6   47.6


Plot the data.

[Figures: ascorbic-acid content vs. time (factor 1) with one series per brand, and vs. brand (factor 2: Richfood, Sealed-Sweet, Minute Maid) with one series per time (0, 3, 7 days)]


1. hypotheses: H0: all means are equal; H1: at least two of the means are not equal (tested separately for brands, times, and their interaction)

2. α = 0.05

3. calculate using Excel

Anova: Two-Factor With Replication
SUMMARY 0 3 7 Total
Richfood
Count 4 4 4 12
Sum 203.1 194.6 179.5 577.2
Average 50.775 48.65 44.875 48.1
Variance 11.42917 18.59667 15.8625 19.00909
Sealed-Sweet
Count 4 4 4 12
Sum 202 179.2 178.4 559.6
Average 50.5 44.8 44.6 46.63333
Variance 13.90667 7.68 10.08 16.79879
Minute Maid
Count 4 4 4 12
Sum 209.9 192.8 184.6 587.3
Average 52.475 48.2 46.15 48.94167
Variance 0.649167 1.146667 5.55 9.577197
Total
Count 12 12 12
Sum 615 566.6 542.5
Average 51.25 47.21667 45.20833
Variance 7.919091 10.70152 9.086288
ANOVA
Source of Variation  SS        df  MS        F         P-value   F crit
Sample 32.75167 2 16.37583 1.735937 0.195331 3.354131
Columns 227.2117 2 113.6058 12.0429 0.000183 3.354131
Interaction 17.32167 4 4.330417 0.45905 0.765025 2.727765
Within 254.7025 27 9.433426
Total 531.9875 35


5. decision for (a)
Part (a) asks about brands, which is the Sample factor in this case. (A is the factor in the vertical
column, B is the factor in the horizontal one.)
Since P(f = 1.7) = 0.19, we cannot reject H0 that all the means are the same.

decision for (b)
Part (b) asks about times, which is the Columns factor in this case.
Since P(f = 12) << 0.01, we reject H0: the means are not all the same.

decision for (c)
Part (c) asks about interactions.
Since P(f = 0.46) = 0.77, we cannot reject H0: there are no interactions.
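The sums of squares in the Excel output can be reproduced from the raw data; this numpy sketch (illustrative) computes them from the grand, row, column, and cell means:

```python
import numpy as np

# data[brand, time, replicate]; brands: Richfood, Sealed-Sweet, Minute Maid;
# times: 0, 3, 7 days
data = np.array([
    [[52.6, 54.2, 49.8, 46.5], [49.4, 49.2, 42.8, 53.2], [42.7, 48.8, 40.4, 47.6]],
    [[56.0, 48.0, 49.6, 48.4], [48.8, 44.0, 44.0, 42.4], [49.2, 44.0, 42.0, 43.2]],
    [[52.5, 52.0, 51.8, 53.6], [48.0, 47.0, 48.2, 49.6], [48.5, 43.3, 45.2, 47.6]],
])
a, b, n = data.shape
grand = data.mean()
ss_brand = b * n * ((data.mean(axis=(1, 2)) - grand) ** 2).sum()   # Sample
ss_time = a * n * ((data.mean(axis=(0, 2)) - grand) ** 2).sum()    # Columns
ss_within = ((data - data.mean(axis=2, keepdims=True)) ** 2).sum() # Within
ss_total = ((data - grand) ** 2).sum()
ss_inter = ss_total - ss_brand - ss_time - ss_within               # Interaction
```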

7. An engineer wants to determine the effect of 4 different chemicals used in the permanent
press finishing process on the strength of the fabric. Five fabric samples were selected.
Analyze the following fabric strength data.
(10 pts)

fabric
chemical 1 2 3 4 5
1 1.3 1.6 0.5 1.2 1.1
2 2.2 2.4 0.4 2 1.8
3 1.8 1.7 0.6 1.5 1.3
4 3.9 4.4 2 4.1 3.4

This problem describes a treatment at 4 levels, which requires a 1-factor ANOVA. Because of
the use of different fabrics, we need to block. The results of the 2-factor ANOVA without
replication are as follows. Based on f = 76, with a P-value of 10^-8, we reject the null
hypothesis that all treatments have the same effect on strength and conclude that they differ.
Notice that the effect of the blocks was substantial: f = 21.

Anova: Two-Factor Without Replication
SUMMARY Count Sum Average Variance
1 5 5.7 1.14 0.163
2 5 8.8 1.76 0.628
3 5 6.9 1.38 0.227
4 5 17.8 3.56 0.893
1 4 9.2 2.3 1.273333
2 4 10.1 2.525 1.689167
3 4 3.5 0.875 0.569167
4 4 8.8 2.2 1.713333
5 4 7.6 1.9 1.086667
ANOVA
Source of Variation SS df MS F P-value F crit
Rows 18.044 3 6.014667 75.89485 4.52E-08 3.490295
Columns 6.693 4 1.67325 21.11356 2.32E-05 3.259167
Error 0.951 12 0.07925
Total 25.688 19


One of the assumptions behind blocking is that the blocks and the treatments do not interact. If
we plot the data, we see that there might be a violation of that here: we would have expected the
green point for fabric 2 to be higher than the points for fabrics 1 and 3. However, the difference
is smaller than the square root of the error term, sqrt (0.079) = 0.28, shown by the error bars, so
this apparent interaction is within the noise. Also, there is no interaction for the other points.
Therefore, this assumption appears to be safe in this problem.
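The blocked ANOVA table is straightforward to reproduce from the data; a numpy sketch (illustrative):

```python
import numpy as np

# rows: chemicals 1-4 (treatments); columns: fabrics 1-5 (blocks)
y = np.array([[1.3, 1.6, 0.5, 1.2, 1.1],
              [2.2, 2.4, 0.4, 2.0, 1.8],
              [1.8, 1.7, 0.6, 1.5, 1.3],
              [3.9, 4.4, 2.0, 4.1, 3.4]])
k, b = y.shape
grand = y.mean()
ss_treat = b * ((y.mean(axis=1) - grand) ** 2).sum()   # chemicals (rows)
ss_block = k * ((y.mean(axis=0) - grand) ** 2).sum()   # fabrics (columns)
ss_total = ((y - grand) ** 2).sum()
ss_error = ss_total - ss_treat - ss_block
f_treat = (ss_treat / (k - 1)) / (ss_error / ((k - 1) * (b - 1)))
```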

[Figure: fabric strength vs. fabric (1-5), one series per chemical (1-4), with error bars of ±0.28]



8. In a chemical distillation plant, measurements are made of the percentage of
hydrocarbons present in the main condenser of the distillation unit and the purity of
oxygen produced. Analyze these data in a meaningful way.
(15 pts)

observation hydrocarbon
level (%)
oxygen
purity (%)
1 0.99 90.01
2 1.02 89.05
3 1.15 91.43
4 1.29 93.74
5 1.46 96.73
6 1.36 94.45
7 0.87 87.59
8 1.23 91.77
9 1.55 99.42
10 1.4 93.65
11 1.19 93.54
12 1.15 92.52
13 0.98 90.56
14 1.01 89.54
15 1.11 89.85
16 1.2 90.39
17 1.26 93.25
18 1.32 93.41
19 1.43 94.98
20 0.95 87.33

Presumably, the more hydrocarbons on the condenser, the higher the purity. To test this
relationship, first plot the data.

[Figure: Oxygen Purity (%) vs. Hydrocarbon Level (%) scatter plot]


Yes, there is clearly a relationship. Run a regression analysis to determine its strength.

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.936715
R Square 0.877436
Adjusted R Square 0.870627
Standard Error 1.086529
Observations 20
ANOVA
df SS MS F Significance F
Regression 1 152.1271 152.1271 128.8617 1.23E-09
Residual 18 21.24982 1.180545
Total 19 173.3769
             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept    74.28331      1.593473        46.61723  3.17E-20  70.93555   77.63108
X Variable 1 14.94748      1.316758        11.35173  1.23E-09  12.18107   17.71389


[Figures: X Variable 1 line fit plot; X Variable 1 residual plot]


The R^2 value is 0.87, and the residuals are randomly distributed. The latter tells us that the linear
fit does seem to be appropriate. There is still quite a lot of scatter, however, so while the linear
relationship accounts for a good deal of the variation, the random error is nevertheless
significant.

Looking at the ANOVA table in the regression analysis, what does this tell us? Recall that the
mean squared error s^2 is an estimate of the residual variance σ^2:

s^2 = Σ e_i^2 / (n - 2) = Σ (y_i - ŷ_i)^2 / (n - 2) ≈ σ^2

In the ANOVA output, Σ e_i^2 = SSE.

Also recall (linear regression lecture, slide 24) that we defined

R^2 = 1 - SSE/SST = SSR/SST

where SSR = regression sum of squares = Σ (ŷ_i - ȳ)^2. The ANOVA table gives us these
numbers. One can then also calculate MSR = SSR/dof and MSE = SSE/dof, and then f = MSR/MSE.
In this case, f = 128, and the P-value is ~10^-9. What this means is that such a linear relationship
is extremely unlikely to have occurred by chance.
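Numerically, from the regression ANOVA table (a worked sketch):

```python
ssr, sse = 152.1271, 21.24982  # regression and residual sums of squares
msr = ssr / 1                  # regression dof = 1
mse = sse / 18                 # error dof = n - 2 = 18
f_stat = msr / mse             # ratio of mean squares
```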

What about the standard errors on the intercept and the slope? What do they mean? Remember
that the intercept a and the slope b are estimates of the true population intercept α and slope β. We can put
confidence intervals on these. If the confidence interval on the slope does not overlap 0, then we
can reject the null hypothesis that there is no variation in y accounted for by x. If it does overlap
zero, then there may be no relationship between x and y. The standard error of the slope b is
given by

SE(b) = σ / sqrt(Sxx)

where Sxx = Σ (x_i - x̄)^2 (the denominator in the calculation of b), and the standard error of the
intercept a is given by

SE(a) = σ sqrt(1/n + x̄^2/Sxx).

These are the numbers given in the Standard Error column of the last table, next to the slope and
the intercept. The following column calculates the t-statistics, with b0 = 0 and a0 = 0:

t = (b - b0)/SE(b)   and   t = (a - a0)/SE(a).

These tell us the significance of the regression. Failure to reject H0: β = 0 means
that there is no linear relationship between x and y. In this example, the P-value for this test is
negligible, so we reject H0 and conclude that the value of x is of value in predicting the value of
y. The confidence intervals on the slope and intercept are given in the Lower 95% and Upper
95% columns.
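Applied to the hydrocarbon data above, these formulas reproduce the Standard Error column; a numpy sketch (illustrative, with σ estimated by the residual standard error s):

```python
import numpy as np

x = np.array([0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
              1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95])
y = np.array([90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42,
              93.65, 93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41,
              94.98, 87.33])

n = len(x)
sxx = ((x - x.mean()) ** 2).sum()
b = ((x - x.mean()) * (y - y.mean())).sum() / sxx      # slope
a = y.mean() - b * x.mean()                            # intercept
s = np.sqrt(((y - (a + b * x)) ** 2).sum() / (n - 2))  # estimates sigma
se_b = s / np.sqrt(sxx)                                # SE of the slope
se_a = s * np.sqrt(1 / n + x.mean() ** 2 / sxx)        # SE of the intercept
t_b = b / se_b                                         # t statistic with b0 = 0
```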


9. An engineer in the chemical distillation plant measures the oxygen purity when using
two types of filters and two brands of condenser units; each measurement is made 3 times.
The measurement data were as follows. Analyze the results.
(10 pts)

filter 1 and brand 1 94.74 97.23 95.45
filter 1 and brand 2 87.33 87.59 90.01
filter 2 and brand 1 89.05 89.85 92.52
filter 2 and brand 2 93.25 90.39 91.77

One way to do this problem is as a two-level DOE, where the two factors (type of filter and
brand of condenser) each appear at two levels. What is new in this problem is that there are
three replicates, rather than the two we are used to. That means we need to average over the 3
replicates to get each treatment mean and use

SS_effect = contrast^2 / (n·2^k)

with n = 3 and k = 2, where the contrast is computed from the treatment-combination totals, and

SSE = Σ_i Σ_j (y_ij - ȳ_i.)^2

summed over the 2^k treatment combinations i and the n replicates j, with 2^k(n - 1) d.o.f.
We need to calculate the SSE with this formula, rather than our usual one: note the last
column. Setting up the table and performing the calculations gives us this.

j  Combination  Treatment  Resp. 1  Resp. 2  Resp. 3  Mean    A      B      AB     Σ(y_ij - ȳ_i.)^2
1  f1, b1       (1)        94.74    97.23    95.45    95.81   -1     -1     +1     3.29
2  f1, b2       b          87.33    87.59    90.01    88.31   -1     +1     -1     4.37
3  f2, b1       a          89.05    89.85    92.52    90.47   +1     -1     -1     6.60
4  f2, b2       ab         93.25    90.39    91.77    91.80   +1     +1     +1     4.09
sums                       364.37   365.06   369.75   366.39  -1.84  -6.17  8.83   SSE = 18.35
divisors                   4        4        4        4       2      2      2
effects                    91.09    91.27    92.44    91.60   -0.92  -3.08  4.41
SS(effect)                                                    2.54   28.52  58.43
dof                                                           1      1      1      2^k(n-1) = 8
f = SS(effect)/s^2                                            1.11   12.43  25.47  s^2 = SSE/8 = 2.29
P-value                                                       0.32   7.78E-03  9.93E-04
                                                              not sig  sig    sig

Alternatively, compare each effect with t_{0.025}(4) x σ_effect = 2.776 x 1.07 = 2.97 (with
σ_Y = sqrt(s^2) = 1.51): the A (filter) effect, -0.92, is smaller in magnitude than 2.97 and so is
not significant, while the B effect (-3.08) and the AB effect (4.41) exceed it and are significant.


Both ways of calculating the significance of the effects give the same result: the type of filter
main effect appears not to be significant, while the brand of condenser unit is significant, and
there is a significant interaction. Compare the answers circled in various colors in this table with
the corresponding ones below.
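The contrasts, sums of squares, and f ratios can be reproduced in a few lines; a sketch using the treatment-combination totals (illustrative):

```python
# three replicates per treatment combination (filter x brand)
cells = {'(1)': [94.74, 97.23, 95.45],   # f1, b1
         'b':   [87.33, 87.59, 90.01],   # f1, b2
         'a':   [89.05, 89.85, 92.52],   # f2, b1
         'ab':  [93.25, 90.39, 91.77]}   # f2, b2
n, k = 3, 2
tot = {t: sum(v) for t, v in cells.items()}
contrast = {'A':  tot['a'] + tot['ab'] - tot['b'] - tot['(1)'],
            'B':  tot['b'] + tot['ab'] - tot['a'] - tot['(1)'],
            'AB': tot['(1)'] + tot['ab'] - tot['a'] - tot['b']}
ss = {e: c ** 2 / (n * 2 ** k) for e, c in contrast.items()}
# SSE from within-cell scatter about each cell mean, 2^k(n-1) = 8 dof
sse = sum((x - sum(v) / n) ** 2 for v in cells.values() for x in v)
s2 = sse / (2 ** k * (n - 1))
f_ratio = {e: v / s2 for e, v in ss.items()}
```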

Plotting the interaction data (two plots of the same data) shows us that the filter effect is masked
by the strong interaction.

[Figures: interaction plots of Oxygen Purity vs. Filter (series: B high, B low) and vs. Brand (series: F high, F low)]


Another way to do this problem is as a 2-factor ANOVA, since we have replicates. Setting up the
table like this:

brand
filter 1 2
1 94.74 87.33
97.23 87.59
95.45 90.01
2 89.05 93.25
89.85 90.39
92.52 91.77

Asking Excel for a 2-factor ANOVA with replication gives exactly the same answers.
