
STAT 526 - Spring 2011

Olga Vitek

Homework 4 - Solution
Each part of each problem is worth 5 points.

1. Project groups.
2. Agresti 2.3.

                             Injury
Safety Equipment in Use      Fatal     Nonfatal
None                         1,601     162,527
Seat belt                      510     412,368

Response variable: the number of accidents in 1988, as compiled by the Department of Highway Safety and Motor Vehicles in Florida.

Difference of proportions: $\hat\pi_1 - \hat\pi_2 = \frac{1601}{1601 + 162527} - \frac{510}{510 + 412368} = 0.00852$. The estimated probability of a fatal injury with no seat belt is 0.00852 higher than the probability of a fatal injury with a seat belt.

Relative risk: $\hat\pi_1 / \hat\pi_2 = 7.897$. The risk of a fatal injury with no seat belt is about 8 times the risk of a fatal injury with a seat belt.

Odds ratio: $\hat\theta = \frac{\hat\pi_{11}/\hat\pi_{12}}{\hat\pi_{21}/\hat\pi_{22}} = \frac{\hat\pi_{11}\hat\pi_{22}}{\hat\pi_{21}\hat\pi_{12}} = \frac{1601 \times 412368}{510 \times 162527} = 7.9649$. The odds of a fatal injury with no seat belt are about 8 times the odds of a fatal injury with a seat belt.

The relative risk and the odds ratio are approximately equal: relative risk $= \frac{\pi_1}{\pi_2}$, and odds ratio $= \frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)} = \frac{\pi_1}{\pi_2} \cdot \frac{1-\pi_2}{1-\pi_1} \approx \frac{\pi_1}{\pi_2}$ because $\frac{1-\pi_2}{1-\pi_1} \approx 1$. (The number of fatal injuries, with or without a seat belt, is much smaller than the number of nonfatal injuries. So $\pi_1$ and $\pi_2$ are quite small, and both $1-\pi_1$ and $1-\pi_2$ are close to 1.)
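The three measures can be checked numerically. The following is a minimal R sketch; the object names (`tab`, `p1`, `p2`, `link`) are illustrative and not part of the original solution.

```r
# Counts from the seat-belt table (rows: None, Seat belt; columns: Fatal, Nonfatal)
tab <- matrix(c(1601, 510, 162527, 412368), nrow = 2,
              dimnames = list(Equipment = c("None", "Seat belt"),
                              Injury = c("Fatal", "Nonfatal")))

# Conditional proportion of fatal injuries within each row
p1 <- tab["None", "Fatal"] / sum(tab["None", ])
p2 <- tab["Seat belt", "Fatal"] / sum(tab["Seat belt", ])

diff_prop  <- p1 - p2                            # difference of proportions
rel_risk   <- p1 / p2                            # relative risk
odds_ratio <- (p1 / (1 - p1)) / (p2 / (1 - p2))  # sample odds ratio

# Factor that links the two measures: odds_ratio = rel_risk * link
link <- (1 - p2) / (1 - p1)

print(round(c(diff = diff_prop, RR = rel_risk, OR = odds_ratio, link = link), 5))
```

Because `link` is very close to 1 for rare outcomes, the printed relative risk and odds ratio should be nearly the same.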

3. Agresti 3.1
(a) 95% confidence interval for the population odds ratio $\theta$:
$\log\hat\theta \sim N(\log\theta, \sigma^2)$,
$\hat\theta = \frac{1601 \times 412368}{510 \times 162527} = 7.9649$,
$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}} = \frac{1}{1601} + \frac{1}{162527} + \frac{1}{510} + \frac{1}{412368} = 0.00259$
95% confidence interval for $\log\theta$:
$\log\hat\theta \pm z_{\alpha/2}\, s\{\log\hat\theta\} = \log(7.9649) \pm (1.96)\sqrt{0.00259} = (1.975219, 2.17487) = (L, U)$
Therefore, the 95% confidence interval for $\theta$ is
$(e^L, e^U) = (e^{1.975219}, e^{2.17487}) = (7.208198, 8.801041)$
or from R, we can get
95 percent confidence interval:
 7.203380 8.817085
sample estimates:
odds ratio
  7.964552
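The hand computation in (a) can be reproduced with a short R sketch. The variable names are illustrative. The R output quoted above comes from `fisher.test`, which reports a conditional maximum likelihood estimate and an exact conditional interval, so it is expected to differ slightly from this Wald interval.

```r
# Cell counts: None (fatal, nonfatal), Seat belt (fatal, nonfatal)
n <- c(n11 = 1601, n12 = 162527, n21 = 510, n22 = 412368)

theta_hat <- (n[["n11"]] * n[["n22"]]) / (n[["n12"]] * n[["n21"]])  # sample odds ratio
se_log    <- sqrt(sum(1 / n))                                       # s{log theta_hat}
z         <- qnorm(0.975)                                           # about 1.96

ci_log <- log(theta_hat) + c(-1, 1) * z * se_log  # interval for log(theta)
ci_or  <- exp(ci_log)                             # back-transformed to the odds-ratio scale

print(round(c(estimate = theta_hat, lower = ci_or[1], upper = ci_or[2]), 3))
```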

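(b) 95% confidence interval for the population difference of proportions $\pi_1 - \pi_2$:
$\hat\pi_1 - \hat\pi_2 = \frac{y_1}{n_1} - \frac{y_2}{n_2} = \frac{1601}{1601 + 162527} - \frac{510}{510 + 412368} = 0.009754582 - 0.001235232 = 0.00852$
$s\{\hat\pi_1 - \hat\pi_2\} = \sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n_1} + \frac{\hat\pi_2(1-\hat\pi_2)}{n_2}} = \sqrt{\frac{0.00975 \times 0.990}{1601 + 162527} + \frac{0.00124 \times 0.999}{510 + 412368}} = 2.487 \times 10^{-4}$
95% confidence interval for $\pi_1 - \pi_2$:
$\hat\pi_1 - \hat\pi_2 \pm z_{\alpha/2}\, s\{\hat\pi_1 - \hat\pi_2\} = 0.00852 \pm (1.96)(2.487 \times 10^{-4}) = (0.008, 0.009)$
or from R, we can get
95 percent confidence interval:
 0.008027691 0.009011009
sample estimates:
     prop 1      prop 2
0.009754582 0.001235232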
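As a check on (b), here is a small R sketch of the Wald interval. The names `y`, `n`, `p`, `se`, `ci`, and `ci_r` are illustrative. `prop.test` applies a continuity correction by default, which explains why the R output quoted above is slightly wider than the hand calculation; with `correct = FALSE` it should match the manual interval.

```r
y <- c(1601, 510)                    # fatal injuries: None, Seat belt
n <- c(1601 + 162527, 510 + 412368)  # row totals
p <- y / n                           # sample proportions

se <- sqrt(sum(p * (1 - p) / n))     # s{p1 - p2}
ci <- (p[1] - p[2]) + c(-1, 1) * qnorm(0.975) * se

# Same interval from prop.test without the continuity correction
ci_r <- as.numeric(prop.test(y, n, correct = FALSE)$conf.int)

print(signif(rbind(manual = ci, prop.test = ci_r), 4))
```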
(c) 95% confidence interval for the population relative risk between seat-belt use and type of injury (RR = relative risk):
$\widehat{RR} = \frac{\hat\pi_1}{\hat\pi_2} = 7.897$,
$s^2\{\log\widehat{RR}\} = \frac{1-\hat\pi_1}{\hat\pi_1 n_1} + \frac{1-\hat\pi_2}{\hat\pi_2 n_2} = \frac{1 - 0.00975}{0.00975(1601 + 162527)} + \frac{1 - 0.001235}{0.001235(510 + 412368)} = 2.5769 \times 10^{-3}$
95% confidence interval for $\log RR$:
$\log\widehat{RR} \pm z_{\alpha/2}\, s\{\log\widehat{RR}\} = \log(7.897) \pm (1.96)\sqrt{0.0025769} = (1.967, 2.166) = (L, U)$
Therefore, the 95% confidence interval for $RR$ is
$(e^L, e^U) = (e^{1.967}, e^{2.166}) = (7.149, 8.723)$
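R code
# 2x2 table: rows = safety equipment, columns = injury type
X <- matrix(c(1601, 510, 162527, 412368), nrow = 2,
            dimnames = list("Safety Equipment" = c("None", "Seat-belt"),
                            "Injury" = c("Fatal", "Non-fatal")))
prop.test(X)    # difference of proportions (continuity-corrected by default)
fisher.test(X)  # odds ratio (conditional MLE with an exact interval)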
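The R code above covers the difference of proportions and the odds ratio but not the relative risk in (c). A minimal sketch of that calculation follows; the variable names are illustrative.

```r
y <- c(1601, 510)                    # fatal injuries: None, Seat belt
n <- c(1601 + 162527, 510 + 412368)  # row totals
p <- y / n

rr         <- p[1] / p[2]             # sample relative risk
var_log_rr <- sum((1 - p) / (p * n))  # s^2{log RR}, equal to sum(1/y - 1/n)

ci_log <- log(rr) + c(-1, 1) * qnorm(0.975) * sqrt(var_log_rr)
ci_rr  <- exp(ci_log)                 # back-transformed to the RR scale

print(round(c(RR = rr, lower = ci_rr[1], upper = ci_rr[2]), 3))
```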

4. Agresti 3.4 (a), (b) and (d).

              Party Identification
Race      Democrat   Independent   Republican    Sum
Black        103          15            11       129
White        341         105           405       851
Sum          444         120           416       980

(a) Using $X^2$ and $G^2$, test the hypothesis of independence between party identification and race. Report the P-values and interpret.
i. Pearson test, $X^2$:
$X^2 = \sum_{i=1}^{3}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 79.43 > \chi^2_{(I-1)(J-1)} = \chi^2_{(3-1)(2-1)=2} = 5.991465$
p-value $\approx 0$, i.e. we reject the null hypothesis of independence between race and party.
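R code
# Counts in long format: one row per Race x Party cell
Y <- data.frame(y = c(103, 15, 11, 341, 105, 405),
                Party = rep(c("Democrat", "Independent", "Republican"), 2),
                Race = rep(c("Black", "White"), each = 3))
ov <- xtabs(y ~ Race + Party, data = Y)  # 2x3 contingency table
summary(ov)                              # Pearson chi-squared test of independence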

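ii. Likelihood ratio test, $G^2$:
$G^2 = -2[\log L(R) - \log L(F)] = 2\sum_{i=1}^{3}\sum_{j=1}^{2} n_{ij} \log\frac{n_{ij}\, n_{++}}{n_{i+}\, n_{+j}} = 90.33 > \chi^2_{(I-1)(J-1)=2} = 5.991465$,
where $L(R)$ and $L(F)$ are the maximized likelihoods under the reduced (independence) model and the full (saturated) model.
p-value $\approx 0$, i.e. we reject the null hypothesis of independence between race and party.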
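Both statistics in (a) can be computed directly from the observed and expected counts. The following R sketch is one way to do it; the matrix is rebuilt here so the block runs on its own, and the names `E`, `X2`, `G2`, `crit`, and `pvals` are illustrative.

```r
ov <- matrix(c(103, 15, 11, 341, 105, 405), nrow = 2, byrow = TRUE,
             dimnames = list(Race = c("Black", "White"),
                             Party = c("Democrat", "Independent", "Republican")))

E  <- outer(rowSums(ov), colSums(ov)) / sum(ov)  # expected counts under independence
X2 <- sum((ov - E)^2 / E)                        # Pearson statistic
G2 <- 2 * sum(ov * log(ov / E))                  # likelihood-ratio statistic
df <- (nrow(ov) - 1) * (ncol(ov) - 1)

crit  <- qchisq(0.95, df)                        # 5% critical value
pvals <- pchisq(c(X2 = X2, G2 = G2), df, lower.tail = FALSE)

print(round(c(X2 = X2, G2 = G2, df = df, crit = crit), 3))
print(pvals)
```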
(b) Use residuals to describe the evidence of association.
The residuals show that Black respondents are significantly more likely to identify as Democrats than as Republicans, and the opposite holds for White respondents: the Democrat and Republican residuals all exceed 2 in absolute value, while the Independent residuals are near 0. So race and party are not independent.
Pearson residuals:

         Democrat   Independent   Republican
Black      5.83       -0.20         -5.91
White     -2.27        0.08          2.30

[Figure: dot chart titled "Visualization the association". x-axis: Pearson Residual; groups: Black and White, each showing Republican, Independent, and Democrat.]

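R code
# Expected counts under independence: (row totals x column totals) / n
e <- apply(ov, 1, sum) %*% t(apply(ov, 2, sum)) / sum(ov)
pearsonResid <- (ov - e) / sqrt(e)
dotchart(t(pearsonResid), xlab = "Pearson Residual", main = "Visualization the association")
abline(v = c(-2, 2))  # reference lines at +/- 2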
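As a cross-check on the residuals in (b), `chisq.test` also returns them. The sketch below rebuilds the table so that it runs on its own. Reporting the standardized residuals (`stdres`) alongside the Pearson residuals is an addition here, not something the original solution does.

```r
ov <- matrix(c(103, 15, 11, 341, 105, 405), nrow = 2, byrow = TRUE,
             dimnames = list(Race = c("Black", "White"),
                             Party = c("Democrat", "Independent", "Republican")))

fit <- chisq.test(ov)

pearson <- fit$residuals  # (O - E) / sqrt(E)
stdres  <- fit$stdres     # Pearson residuals rescaled to have unit asymptotic variance

print(round(pearson, 2))
print(round(stdres, 2))
```

Standardized residuals are at least as large in absolute value as Pearson residuals, so cells flagged by the plus or minus 2 rule on the Pearson scale remain flagged.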

(d) Summarize association by constructing a 95% confidence interval for the odds ratio between race and whether a Democrat or Republican. Interpret.

          Party Identification
Race      Democrat   Republican    Sum
Black        103         11        114
White        341        405        746
Sum          444        416        860

$\hat\theta = \frac{103/11}{341/405} = 11.121$,
$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}} = \frac{1}{103} + \frac{1}{11} + \frac{1}{341} + \frac{1}{405} = 0.106$
95% confidence interval for $\log\theta$:
$\log\hat\theta \pm z_{\alpha/2}\, s\{\log\hat\theta\} = \log(11.121) \pm (1.96)\sqrt{0.106} = (1.771, 3.047) = (L, U)$
Therefore, the 95% confidence interval for $\theta$ is
$(e^L, e^U) = (e^{1.771}, e^{3.047}) = (5.875, 21.052)$
The estimated odds of being a Democrat rather than a Republican are about 11 times as high for Black respondents as for White respondents. With 95% confidence, the population odds ratio lies between about 6 and 21.
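The interval in (d) follows the same Wald recipe as in Problem 3(a). A minimal R sketch, with illustrative variable names, is:

```r
# Democrat / Republican counts for Black and White respondents
n <- c(black_dem = 103, black_rep = 11, white_dem = 341, white_rep = 405)

theta_hat <- (n[["black_dem"]] / n[["black_rep"]]) /
             (n[["white_dem"]] / n[["white_rep"]])  # sample odds ratio
se_log    <- sqrt(sum(1 / n))                       # s{log theta_hat}

ci <- exp(log(theta_hat) + c(-1, 1) * qnorm(0.975) * se_log)

print(round(c(estimate = theta_hat, lower = ci[1], upper = ci[2]), 3))
```

The interval is wide mainly because of the small Black Republican cell (11), which contributes 1/11 to the variance of the log odds ratio.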

5. Agresti 3.29
$X^2 = \sum_i\sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_i\sum_j \frac{(n p_{ij} - n p_{i+} p_{+j})^2}{n p_{i+} p_{+j}} = \sum_i\sum_j \frac{n^2 (p_{ij} - p_{i+} p_{+j})^2}{n p_{i+} p_{+j}} = n \sum_i\sum_j \frac{(p_{ij} - p_{i+} p_{+j})^2}{p_{i+} p_{+j}}$
When all the $p$s are fixed but $n$ increases, $X^2$ increases in proportion to $n$, while the degrees of freedom remain fixed. So the p-value decreases as $n$ grows, which means that with a large enough sample $H_0$ can be rejected even when the departure from independence is very small. Thus $X^2$ itself does not characterize the strength of the association.
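The dependence on $n$ can be illustrated numerically. In the R sketch below the 2x2 counts are hypothetical, chosen only to give a weak association, and `pearson_x2` is an illustrative helper rather than a library function.

```r
# Pearson statistic for a two-way table of counts
pearson_x2 <- function(tab) {
  E <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts under independence
  sum((tab - E)^2 / E)
}

# Hypothetical table with a weak association (cell proportions 0.26, 0.24, 0.24, 0.26)
base <- matrix(c(52, 48, 48, 52), nrow = 2)

# Same proportions, sample size multiplied by 1, 10 and 100
scale       <- c(1, 10, 100)
x2_by_scale <- sapply(scale, function(k) pearson_x2(k * base))
p_by_scale  <- pchisq(x2_by_scale, df = 1, lower.tail = FALSE)

print(data.frame(n = scale * sum(base), X2 = x2_by_scale, p_value = signif(p_by_scale, 3)))
```

The cell proportions, and therefore the strength of association, are identical in all three rows; only the sample size changes.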
6. [Methods qualifying exam, January 2010: use paper and pencil.] The National Institute of Standards and Technology develops standards for asbestos concentration in drinking water. In a designed experiment, asbestos dissolved in water was spread on a filter, and an operator counted the number of asbestos fibers on a section of the filter. The procedure was repeated with 200 filters, yielding an average of 27.7 asbestos fibers per filter and the following summary results:

Number of asbestos fibers       0-10   11-15   16-20   21-24   25-27   28-30     ≥31     Sum
Observed # of filters, O_i         2       1      36      52      50      39      20     200
Expected # of filters, E_i      0.12    2.43   34.62   57.51   45.36   32.62   27.34     200
(O_i - E_i)^2 / E_i            30.25    0.85    0.06    0.53    0.48    1.25    1.97   35.39

(a) The consulting statistician assumed a Poisson model for the observed number of fibers on a filter.
Provide the rationale for using the Poisson model for this dataset. State the model and the
assumptions, and provide the parameter estimate.
Answer:
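The Poisson distribution is appropriate because it models the count of events in a fixed interval of time or region of space. The model is $P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$, where $Y$ is the number of fibers on a filter. It assumes that fibers occur independently of one another at a constant average rate $\lambda$, and that counts on different filters are independent. The parameter estimate is $\hat\lambda = \bar{y} = 27.7$.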

(b) The statistician would like to test the adequacy of the model using the $\chi^2$ test. He uses the results in the table above to derive the test statistic of 35.39.
i. State the null and the alternative hypotheses, and the conclusion of the test at the significance level of 5%.
Answer:
$H_0: P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$, $H_a: P(Y = y) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_7)$, $\sum_{i=1}^{7} \pi_i = 1$.
The test statistic $35.39 > \chi^2_{6-1}(1 - 0.05) = 11.07$. p-value = 1.257677e-06. Therefore we reject $H_0$.
The degrees of freedom are calculated as the number of unconstrained counts (7 - 1 = 6, since the total number of filters is fixed), minus the number of parameters (1, for the mean number of particles).
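The test in (b)(i) can be reproduced from the tabulated contributions. The R sketch below takes the $(O_i - E_i)^2 / E_i$ row of the table as given rather than recomputing the expected counts; the variable names are illustrative.

```r
# (O - E)^2 / E for the seven fiber-count categories, copied from the table
contrib <- c(30.25, 0.85, 0.06, 0.53, 0.48, 1.25, 1.97)

X2   <- sum(contrib)            # test statistic
df   <- length(contrib) - 1 - 1 # categories - 1, minus one estimated parameter (lambda)
crit <- qchisq(0.95, df)        # 5% critical value
pval <- pchisq(X2, df, lower.tail = FALSE)

print(c(X2 = X2, df = df, crit = round(crit, 2), p_value = signif(pval, 4)))
```

Most of the statistic comes from the first category, whose expected count is far below the usual rule-of-thumb threshold of 5.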

ii. Is such use of the $\chi^2$ test adequate? If not, correct the computations and reassess the adequacy of the Poisson model.
Answer:
The test is inadequate because the first two cells have low expected counts (0.12 and 2.43). It is preferable to combine groups 1 and 2 with group 3.
Test $H_0: P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$, $H_a: P(Y = y) \sim \text{Multinomial}(\pi_1, \pi_2, \pi_3, \pi_4, \pi_5)$, $\sum_{i=1}^{5} \pi_i = 1$.
The new test statistic is 4.32, based on 4 - 1 = 3 df. The critical value is $\chi^2_{4-1}(1 - 0.05) = 7.814728$, the resulting p-value is 0.2289189, and we fail to reject $H_0$.
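The pooled test in (b)(ii) can be sketched in R from the observed and expected counts in the table. The grouping vector `grp` and the other names are illustrative. Because the tabulated expected counts are rounded, the statistic computed this way may differ from 4.32 in the second decimal place.

```r
O <- c(2, 1, 36, 52, 50, 39, 20)                        # observed counts
E <- c(0.12, 2.43, 34.62, 57.51, 45.36, 32.62, 27.34)   # expected counts (rounded, from the table)

grp <- c(1, 1, 1, 2, 3, 4, 5)          # pool the first three categories into one
O5  <- as.numeric(tapply(O, grp, sum))
E5  <- as.numeric(tapply(E, grp, sum))

X2   <- sum((O5 - E5)^2 / E5)
df   <- length(O5) - 1 - 1             # categories - 1, minus one estimated parameter
crit <- qchisq(0.95, df)
pval <- pchisq(X2, df, lower.tail = FALSE)

print(round(c(X2 = X2, df = df, crit = crit, p_value = pval), 4))
```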

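(c) Assume that the Poisson model is appropriate. Use Central Limit Theorem to construct an approximate 95% confidence interval for the average number of asbestos fibers on a filter.
Answer:
Using $-1.96 \le \frac{\bar{Y} - \lambda}{\sqrt{\lambda/n}} \le 1.96$, the 95% confidence interval for $\lambda$, the mean of $Y$, is
$\bar{Y} \pm (1.96)\frac{s}{\sqrt{n}} = 27.7 \pm (1.96)\frac{\sqrt{27.7}}{\sqrt{200}} = (26.97, 28.43)$,
where $s = \sqrt{\hat\lambda} = \sqrt{27.7}$ because the variance of a Poisson variable equals its mean.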
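The interval in (c) is a one-line calculation. A minimal R sketch, with illustrative names, is:

```r
ybar <- 27.7  # sample mean, also the estimate of lambda
n    <- 200   # number of filters

se <- sqrt(ybar / n)                        # Poisson variance equals the mean
ci <- ybar + c(-1, 1) * qnorm(0.975) * se   # CLT-based 95% interval

print(round(ci, 2))
```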

7. [Methods qualifying exam, January 2006: use paper and pencil.]


The following table classifies a random sample of the U.S. population by gender and vision:

                  Gender
Vision          Male   Female
Normal           442     514
Color-blind       38       6

A genetic model stipulates that these numbers should have corresponding population relative frequencies given by

                  Gender
Vision          Male   Female
Normal          p/2    p^2/2 + pq
Color-blind     q/2    q^2/2

where $q = 1 - p$ is the proportion of genes for color-blindness in the gene pool. You've been asked to test whether the data are consistent with the model.
(a) Describe two possible testing methods. For each, comment on the underlying assumptions of the
approach.
Answer:
We may use the likelihood ratio test statistic and the Pearson $\chi^2$ test statistic. The underlying assumption for both is that the probability table given by the model is correct, so that there is only one unknown parameter, $p$, to be estimated.

(b) Perform the test using these methods. Please show your work, making sure to comment on
assumptions and any remedies if the assumptions are violated.
Answer:
i. Likelihood ratio test statistic ($G^2$ statistic)
First, compute the MLE of $p$. The total number of observations is 1000. Thus, the log-likelihood based on the multinomial distribution is
$l(p) = \log\frac{1000!}{442!\,514!\,38!\,6!} + 442\log\frac{p}{2} + 514\log\left(\frac{p^2}{2} + pq\right) + 38\log\frac{q}{2} + 6\log\frac{q^2}{2}$
$= C + 442\log(p) + 514\log(2p - p^2) + 50\log(1 - p)$
$l'(p) = \frac{442}{p} + \frac{514(2 - 2p)}{2p - p^2} - \frac{50}{1 - p}$
Let $l'(p) = 0$. So,
$442(1 - p)(2 - p) + 1028(1 - p)^2 - 50p(2 - p) = 0$
$1520p^2 - 3482p + 1912 = 0$
Then we have $p = 0.9129$ or $p = 1.3778$. The second root is not a valid proportion, so we discard it. Thus, the MLE is $\hat{p} = 0.9129$. Based on the MLE, we have the following predicted counts: $\hat{n}_{11} = 456.47$, $\hat{n}_{12} = 496.21$, $\hat{n}_{21} = 43.53$, $\hat{n}_{22} = 3.79$. Therefore,
$G^2 = 2\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}\log\frac{n_{ij}}{\hat{n}_{ij}} = 2.92$
ii. Pearson $\chi^2$ statistic
$X^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} = 3.09$
Based on df = 2, $\chi^2_{2, 0.05} = 5.99$, and neither statistic is significant.
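The calculations in (b) can be checked numerically. The R sketch below maximizes the log-likelihood with `optimize` instead of solving the quadratic by hand, then computes the fitted counts and both statistics. The function names `model_probs` and `loglik` are illustrative.

```r
# Observed counts: male normal, female normal, male color-blind, female color-blind
obs <- c(442, 514, 38, 6)

# Cell probabilities under the genetic model as a function of p (q = 1 - p)
model_probs <- function(p) {
  q <- 1 - p
  c(p / 2, p^2 / 2 + p * q, q / 2, q^2 / 2)
}

# Multinomial log-likelihood, up to an additive constant
loglik <- function(p) sum(obs * log(model_probs(p)))

# Numerical MLE of p on (0, 1)
p_hat <- optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE, tol = 1e-10)$maximum

fitted <- sum(obs) * model_probs(p_hat)   # predicted counts
G2 <- 2 * sum(obs * log(obs / fitted))    # likelihood-ratio statistic
X2 <- sum((obs - fitted)^2 / fitted)      # Pearson statistic
df <- length(obs) - 1 - 1                 # 4 cells - 1, minus one estimated parameter

print(round(c(p_hat = p_hat, G2 = G2, X2 = X2, crit = qchisq(0.95, df)), 4))
print(round(fitted, 2))
```

One fitted count is below 5, so the chi-squared approximation should be read with some caution for both statistics.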

8. [Methods qualifying exam, August 2008: use paper and pencil.]


At the wholesale level, lettuce is sometimes graded into three categories: too small to be sold, large
enough to be sold at full price, and intermediate. When the lettuce is placed into the intermediate size
class it is put together with another of the same type and they are sold as a pair at full price. A plant
breeder wants to investigate whether a new lettuce variety is worth more at the wholesale level than
an old variety. He randomly collected 100 heads from each of these varieties and recorded the number of heads in each size class in the following table:
                   Variety
Size Class        Old    New    Total
Too small          11      8      19
Intermediate       25     10      35
Large enough       64     82     146
Total             100    100     200

Based on these data, what is your conclusion?


Answer:
$X^2 = \sum_{i=1}^{3}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 9.12 > \chi^2_{(I-1)(J-1)} = \chi^2_{(3-1)(2-1)=2} = 5.991465$
Reject $H_0$ of independence and conclude that variety and size are not independent. In other words, the distribution of sizes of lettuce heads differs between the varieties.
If the cost of one lettuce head is $a$ and the price of one lettuce head is $b$, the expected profit from 100 lettuce heads in each case is:
New: $(82 + 10/2)b - 100a = 87b - 100a$
Old: $(64 + 25/2)b - 100a = 76.5b - 100a$
Therefore the expected difference in profit from 100 lettuce heads is $(87b - 100a) - (76.5b - 100a) = 10.5b$, and the new lettuce variety is more profitable than the old lettuce variety.
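The test and the profit comparison can be sketched in R as follows. The object names are illustrative, and `full_price_units` counts the heads sold at full price per 100, with each intermediate head counted as half a unit because intermediates are sold in pairs.

```r
lettuce <- matrix(c(11, 25, 64, 8, 10, 82), nrow = 3,
                  dimnames = list(Size = c("Too small", "Intermediate", "Large enough"),
                                  Variety = c("Old", "New")))

# Pearson chi-squared test of independence (no continuity correction for a 3x2 table)
fit <- chisq.test(lettuce)

# Full-price units per 100 heads: large heads plus half of the intermediate heads
full_price_units <- lettuce["Large enough", ] + lettuce["Intermediate", ] / 2
gain <- unname(full_price_units["New"] - full_price_units["Old"])  # in units of the price b

print(fit)
print(full_price_units)
print(gain)
```

The cost term 100a cancels in the difference, so only the price b matters for the comparison.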
