Olga Vitek
Homework 4 - Solution
Each part of each problem is worth 5 points.
1. Project groups.
2. Agresti 2.3.
                        Injury
Seat belt use      Fatal     Nonfatal
No seat belt       1,601     162,527
Seat belt            510     412,368
response variable : the number of accidents in 1988 compiled by the Department of Highway
Safety and Motor Vehicles in Florida.
difference of proportions: $\hat\pi_1 - \hat\pi_2 = \frac{1{,}601}{1{,}601 + 162{,}527} - \frac{510}{510 + 412{,}368} = 0.00852$; the probability of a
fatal injury with no seat belt is 0.00852 higher than the probability of a fatal injury with a seat belt.
relative risk: $\hat\pi_1 / \hat\pi_2 = 7.897$; the risk of a fatal injury with no seat belt is about 8 times
that of a fatal injury with a seat belt.
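odds ratio: $\hat\theta = \frac{\hat\pi_{11}/\hat\pi_{12}}{\hat\pi_{21}/\hat\pi_{22}} = \frac{\hat\pi_{11}\hat\pi_{22}}{\hat\pi_{21}\hat\pi_{12}} = \frac{1601 \times 412{,}368}{510 \times 162{,}527} = 7.9649$; the odds of a fatal injury with no seat belt
are about 8 times the odds of a fatal injury with a seat belt.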
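relative risk $= \frac{\pi_1}{\pi_2}$, and odds ratio $= \frac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)} = \frac{\pi_1}{\pi_2} \cdot \frac{1-\pi_2}{1-\pi_1} \approx \frac{\pi_1}{\pi_2}$ because $\frac{1-\pi_2}{1-\pi_1} \approx 1$. (The numbers
of fatal injuries with or without a seat belt are much smaller than the numbers of nonfatal injuries,
so $\pi_1$ and $\pi_2$ are quite small and both $1-\pi_1$ and $1-\pi_2$ are close to 1.)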
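The three measures above can be checked numerically. The following is a minimal Python sketch, not part of the original solution; the function name `association_measures` and its argument names are illustrative assumptions.

```python
def association_measures(y1, m1, y2, m2):
    """Return (difference of proportions, relative risk, odds ratio).

    y1, m1: fatal and nonfatal counts in group 1 (no seat belt)
    y2, m2: fatal and nonfatal counts in group 2 (seat belt)
    """
    p1 = y1 / (y1 + m1)   # estimated probability of a fatal injury, group 1
    p2 = y2 / (y2 + m2)   # estimated probability of a fatal injury, group 2
    diff = p1 - p2
    relative_risk = p1 / p2
    odds_ratio = (y1 * m2) / (y2 * m1)   # cross-product ratio
    return diff, relative_risk, odds_ratio


diff, rr, theta = association_measures(1601, 162527, 510, 412368)
print(f"difference = {diff:.5f}, relative risk = {rr:.3f}, odds ratio = {theta:.4f}")
```

Because both fatality proportions are below 1%, the relative risk and the odds ratio returned here differ only in the second decimal place, which is the point made in the paragraph above.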
3. Agresti 3.1
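(a) 95% confidence interval for the population odds ratio $\theta$:
$\log\hat\theta \sim N(\log\theta, \sigma^2)$ approximately,
$\hat\theta = \frac{1601 \times 412{,}368}{510 \times 162{,}527} = 7.9649$,
$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}} = \frac{1}{1{,}601} + \frac{1}{162{,}527} + \frac{1}{510} + \frac{1}{412{,}368} = 0.00259$
95% confidence interval for $\log\theta$: $\log\hat\theta \pm z_{\alpha/2}\,\hat\sigma$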
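The interval for $\log\theta$ and its back-transformation to the odds-ratio scale can be evaluated with a short script. This is a hedged sketch using only the Python standard library; the function name `odds_ratio_wald_ci` is an assumption made for illustration.

```python
import math
from statistics import NormalDist


def odds_ratio_wald_ci(n11, n12, n21, n22, level=0.95):
    """Wald interval for the odds ratio, built on the log scale."""
    theta_hat = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    log_lo = math.log(theta_hat) - z * se_log
    log_hi = math.log(theta_hat) + z * se_log
    # Exponentiate the endpoints to return to the odds-ratio scale.
    return theta_hat, (log_lo, log_hi), (math.exp(log_lo), math.exp(log_hi))


theta_hat, log_ci, ci = odds_ratio_wald_ci(1601, 162527, 510, 412368)
print(f"theta_hat = {theta_hat:.4f}")
print(f"95% CI for log(theta): ({log_ci[0]:.3f}, {log_ci[1]:.3f})")
print(f"95% CI for theta:      ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval for $\theta$ lies well above 1, consistent with much higher odds of a fatal injury without a seat belt.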
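(b) 95% confidence interval for the difference of proportions $\pi_1 - \pi_2$:
$\hat\pi_1 - \hat\pi_2 = \frac{y_1}{n_1} - \frac{y_2}{n_2} = \frac{1{,}601}{1{,}601 + 162{,}527} - \frac{510}{510 + 412{,}368} = 0.009754582 - 0.001235232 = 0.00852$
$s\{\hat\pi_1 - \hat\pi_2\} = \sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n_1} + \frac{\hat\pi_2(1-\hat\pi_2)}{n_2}} = \sqrt{\frac{0.00975 \times 0.990}{1{,}601 + 162{,}527} + \frac{0.00124 \times 0.999}{510 + 412{,}368}} = 2.487 \times 10^{-4}$
95% confidence interval for $\pi_1 - \pi_2$:
$\hat\pi_1 - \hat\pi_2 \pm z_{\alpha/2}\, s\{\hat\pi_1 - \hat\pi_2\} = 0.00852 \pm (1.96)(2.487 \times 10^{-4}) = (0.008, 0.009)$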
or, from R, we get
95 percent confidence interval:
 0.008027691 0.009011009
sample estimates:
     prop 1      prop 2
0.009754582 0.001235232
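The hand computation and the R output differ slightly in the later decimals. The sketch below computes the plain Wald interval and, as an assumption about how the R interval was produced (the solution does not name the R function), a version widened by the continuity correction $\frac{1}{2}(1/n_1 + 1/n_2)$. The function name `diff_proportions_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def diff_proportions_ci(y1, n1, y2, n2, level=0.95, continuity=False):
    """Wald interval for pi1 - pi2, optionally widened by a continuity correction."""
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * se
    if continuity:
        # Assumed form of the correction: half of (1/n1 + 1/n2) added to the margin.
        half_width += 0.5 * (1 / n1 + 1 / n2)
    diff = p1 - p2
    return diff - half_width, diff + half_width


n1 = 1601 + 162527   # accidents with no seat belt
n2 = 510 + 412368    # accidents with a seat belt
wald = diff_proportions_ci(1601, n1, 510, n2)
corrected = diff_proportions_ci(1601, n1, 510, n2, continuity=True)
print("Wald:     ", wald)
print("corrected:", corrected)
```

With samples this large the correction changes the endpoints only in the sixth decimal place, so both versions round to (0.008, 0.009).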
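(c) 95% confidence interval for the population relative risk between seat-belt use and type of injury
(RR = relative risk):
$\widehat{RR} = \frac{\hat\pi_1}{\hat\pi_2} = 7.897$,
$s^2\{\log \widehat{RR}\} = \frac{1-\hat\pi_1}{\hat\pi_1 n_1} + \frac{1-\hat\pi_2}{\hat\pi_2 n_2} = \frac{1 - 0.00975}{0.00975\,(1{,}601 + 162{,}527)} + \frac{1 - 0.001235}{0.001235\,(510 + 412{,}368)} = 2.5769 \times 10^{-3}$
95% confidence interval for $\log RR$: $\log \widehat{RR} \pm z_{\alpha/2}\, s\{\log \widehat{RR}\}$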
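The interval on the log scale and its back-transformation can be evaluated as follows. This is a minimal sketch under the variance formula stated above; the function name `relative_risk_wald_ci` is an illustrative assumption.

```python
import math
from statistics import NormalDist


def relative_risk_wald_ci(y1, n1, y2, n2, level=0.95):
    """Wald interval for the relative risk, built on the log scale."""
    p1, p2 = y1 / n1, y2 / n2
    rr = p1 / p2
    var_log = (1 - p1) / (p1 * n1) + (1 - p2) / (p2 * n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * math.sqrt(var_log)
    log_rr = math.log(rr)
    return rr, var_log, (math.exp(log_rr - half_width), math.exp(log_rr + half_width))


rr, var_log, ci = relative_risk_wald_ci(1601, 1601 + 162527, 510, 510 + 412368)
print(f"RR = {rr:.3f}, var(log RR) = {var_log:.4e}")
print(f"95% CI for RR: ({ci[0]:.2f}, {ci[1]:.2f})")
```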
                  Party Identification
Race     Democrat   Independent   Republican    sum
Black       103          15            11       129
White       341         105           405       851
sum         444         120           416       980
(a) Using $X^2$ and $G^2$, test the hypothesis of independence between party identification and race.
Report the P-values and interpret.
i. Pearson test, $\chi^2$:
$\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 79.43 > \chi^2_{(I-1)(J-1)} = \chi^2_{(3-1)(2-1)=2} = 5.991465$
p-value $\approx$ 0, i.e., we reject the null hypothesis of independence between race and party.
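Both statistics requested in part (a) can be computed directly from the table. The sketch below is an illustration in Python rather than the R code used in the course; the function name `independence_tests` is hypothetical, and the p-values use the closed form $e^{-x/2}$, which holds only for 2 degrees of freedom.

```python
import math


def independence_tests(table):
    """Pearson X^2 and likelihood-ratio G^2 for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
            if observed > 0:   # a zero cell contributes nothing to G^2
                g2 += 2 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df


# Rows: Black, White; columns: Democrat, Independent, Republican.
party_by_race = [[103, 15, 11], [341, 105, 405]]
x2, g2, df = independence_tests(party_by_race)

# With df = 2 the chi-square upper-tail probability is exp(-x/2).
p_x2 = math.exp(-x2 / 2)
p_g2 = math.exp(-g2 / 2)
print(f"X^2 = {x2:.2f}, G^2 = {g2:.2f}, df = {df}")
print(f"p-values: {p_x2:.2e} (X^2), {p_g2:.2e} (G^2)")
```

Both p-values are far below any conventional significance level, so either statistic leads to rejecting independence.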
Pearson residuals:
               Black    White
Democrat        5.83    -2.27
Independent    -0.20     0.08
Republican     -5.91     2.30
[Figure: dot chart of Pearson residuals for Republican, Independent, and Democrat, grouped by race (Black, White); x-axis: Pearson Residual.]
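R code
# Observed counts: rows = race, columns = party identification
ov <- matrix(c(103, 15, 11, 341, 105, 405), nrow = 2, byrow = TRUE,
             dimnames = list(c("Black", "White"), c("Democrat", "Independent", "Republican")))
e <- apply(ov, 1, sum) %*% t(apply(ov, 2, sum)) / sum(ov)  # expected counts under independence
pearsonResid <- (ov - e) / sqrt(e)
dotchart(t(pearsonResid), xlab = "Pearson Residual", main = "Visualization of the association")
abline(v = c(-2, 2))  # reference lines at -2 and 2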
(d) Summarize the association by constructing a 95% confidence interval for the odds ratio between race
and whether a Democrat or Republican. Interpret.
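            Party Identification
Race     Democrat   Republican    sum
Black       103          11       114
White       341         405       746
sum         444         416       860
$\hat\theta = \frac{103/11}{341/405} = 11.121$,
$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}} = \frac{1}{103} + \frac{1}{11} + \frac{1}{341} + \frac{1}{405} = 0.106$
95% confidence interval for $\log\theta$: $\log\hat\theta \pm z_{\alpha/2}\,\hat\sigma$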
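The interval can be evaluated numerically from the four Democrat/Republican cell counts. This is a hedged sketch using the log-scale variance above; the function name `log_odds_ratio_interval` and the table layout are illustrative assumptions.

```python
import math
from statistics import NormalDist


def log_odds_ratio_interval(table, level=0.95):
    """Wald interval for the odds ratio of a 2x2 table [[n11, n12], [n21, n22]]."""
    (n11, n12), (n21, n22) = table
    theta_hat = (n11 / n12) / (n21 / n22)
    variance = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * math.sqrt(variance)
    center = math.log(theta_hat)
    return theta_hat, variance, (math.exp(center - half_width), math.exp(center + half_width))


# Rows: Black, White; columns: Democrat, Republican.
theta_hat, variance, ci = log_odds_ratio_interval([[103, 11], [341, 405]])
print(f"theta_hat = {theta_hat:.3f}, variance of log(theta_hat) = {variance:.3f}")
print(f"95% CI for theta: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The interval is wide because the Black/Republican cell holds only 11 observations, and it excludes 1, in line with the strong association seen in the Pearson residuals.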
5. Agresti 3.29
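$\chi^2 = \sum_i\sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_i\sum_j \frac{(np_{ij} - np_{i+}p_{+j})^2}{np_{i+}p_{+j}} = \sum_i\sum_j \frac{n^2(p_{ij} - p_{i+}p_{+j})^2}{np_{i+}p_{+j}} = n\sum_i\sum_j \frac{(p_{ij} - p_{i+}p_{+j})^2}{p_{i+}p_{+j}}$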
When all $p_{ij}$ are fixed and $n$ increases, $\chi^2$ increases in proportion to $n$, while the degrees of freedom remain fixed. So
for large $n$ the p-value becomes small, which means that it is possible to reject $H_0$ even
when the association is very weak. Thus $\chi^2$ by itself does not characterize the strength of the association.
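A small numerical illustration of this scaling, as a sketch: the joint proportions below are hypothetical values chosen to represent a weak association, and `pearson_x2_from_proportions` is an illustrative name.

```python
def pearson_x2_from_proportions(p, n):
    """Pearson X^2 for a table whose cell counts are exactly n * p[i][j]."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    return n * sum(
        (p[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
        for i in range(len(p))
        for j in range(len(p[0]))
    )


# Hypothetical joint proportions with a weak association (illustrative only).
p = [[0.26, 0.24], [0.24, 0.26]]
for n in (100, 1000, 10000):
    print(n, round(pearson_x2_from_proportions(p, n), 2))
```

The statistic grows tenfold with each tenfold increase in $n$, so with 1 degree of freedom (critical value about 3.84 at the 5% level) the same weak association goes from clearly non-significant at $n = 100$ to highly significant at $n = 10000$.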
6. [Methods qualifying exam, January 2010: use paper and pencil.] The National Institute of Standards
and Technology develops standards for asbestos concentration in drinking water. In a designed
experiment, asbestos dissolved in water was spread on a filter, and an operator counted the number of
asbestos fibers on a section of the filter. The procedure was repeated with 200 filters, yielding an
average of 27.7 asbestos fibers per filter, and the following summary results:
# of fibers   Observed # of filters $O_i$   Expected # of filters $E_i$   $(O_i - E_i)^2 / E_i$
0-10                      2                          0.12                        30.25
11-15                     1                          2.43                         0.85
...
31                       20                         27.34                         1.97
Sum                     200                           200                        35.39
(a) The consulting statistician assumed a Poisson model for the observed number of fibers on a filter.
Provide the rationale for using the Poisson model for this dataset. State the model and the
assumptions, and provide the parameter estimate.
Answer:
The Poisson distribution is appropriate as it represents the count of events in a fixed time or
space. $P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$, where $Y$ is the number of fibers on a filter, and $\hat\lambda = \bar{y} = 27.7$.
(b) The statistician would like to test the adequacy of the model using the $\chi^2$ test. He uses the results
in the table above to derive the test statistic of 35.39.
i. State the null and the alternative hypotheses, and the conclusion of the test at the significance
level of 5%.
Answer:
$H_0: P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$, $H_a: P(Y = y) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_7)$, $\sum_{i=1}^{7} \pi_i = 1$.
Since $35.39 > \chi^2_{6-1}(1 - 0.05) = 11.07$, we reject $H_0$ at the 5% level.
ii. Is such use of the $\chi^2$ test adequate? If not, correct the computations and reassess the adequacy
of the Poisson model.
Answer:
The test is inadequate since the first two cells have low expected counts (0.12 and 2.43). It is preferable
to combine groups 1 and 2 with group 3.
Test $H_0: P(Y = y) = \frac{e^{-\lambda}\lambda^y}{y!}$, $H_a: P(Y = y) \sim \text{Multinomial}(\pi_1, \pi_2, \pi_3, \pi_4, \pi_5)$, $\sum_{i=1}^{5} \pi_i = 1$.
The new test statistic is 4.32, based on 4 - 1 = 3 df (5 - 1 cells, minus one estimated parameter). The critical value is $\chi^2_{4-1}(1 - 0.05) = 7.814728$,
the resulting p-value is 0.2289189, and we fail to reject $H_0$.
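The p-values for both versions of the goodness-of-fit test can be reproduced without a statistics package. The sketch below implements the chi-square upper-tail probability through the standard recurrence in the degrees of freedom; the function name `chi2_sf` is an assumption for illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x)."""
    if df % 2 == 1:
        k, tail = 1, math.erfc(math.sqrt(x / 2))   # closed form for df = 1
    else:
        k, tail = 2, math.exp(-x / 2)              # closed form for df = 2
    while k < df:
        # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
        tail += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return tail


p_original = chi2_sf(35.39, 5)   # 7 cells, df = 6 - 1
p_combined = chi2_sf(4.32, 3)    # 5 cells, df = 4 - 1
print(f"original table: p = {p_original:.2e}")
print(f"combined cells: p = {p_combined:.4f}")
```

The contrast between the two p-values shows how strongly the original statistic was driven by the two sparse cells.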
(c) Assume that the Poisson model is appropriate. Use the Central Limit Theorem to construct an
approximate 95% confidence interval for the average number of asbestos fibers on a filter.
Answer:
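Using $-1.96 \le \frac{\bar{Y} - \lambda}{\sqrt{\lambda/n}} \le 1.96$, the 95% confidence interval for $\lambda$ is
$\bar{Y} \pm (1.96)\sqrt{\frac{\hat\lambda}{n}} = 27.7 \pm (1.96)\sqrt{\frac{27.7}{200}} = (26.97, 28.43)$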
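The interval arithmetic can be checked with a few lines of Python. This is a minimal sketch that plugs the sample mean in for the Poisson variance, as in the answer above; `poisson_mean_ci` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def poisson_mean_ci(ybar, n, level=0.95):
    """CLT-based interval for a Poisson mean, with the variance estimated by ybar."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    half_width = z * math.sqrt(ybar / n)
    return ybar - half_width, ybar + half_width


lower, upper = poisson_mean_ci(27.7, 200)
print(f"({lower:.2f}, {upper:.2f})")
```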
                    Gender
Vision          Male    Female
Normal           442       514
Color-blind       38         6
A genetic model stipulates that these numbers should have corresponding population relative frequencies given by
                    Gender
Vision          Male    Female
Normal          $p/2$   $p^2/2 + pq$
Color-blind     $q/2$   $q^2/2$
where $q = 1 - p$ is the proportion of genes for color-blindness in the gene pool. You've been asked to
test whether the data are consistent with the model.
(a) Describe two possible testing methods. For each, comment on the underlying assumptions of the
approach.
Answer:
We may use the likelihood ratio test statistic and the Pearson $\chi^2$ test statistic. The underlying assumption for both of them is that the given probability table is correct. There is only one unknown parameter, $p$,
to be estimated.
(b) Perform the test using these methods. Please show your work, making sure to comment on
assumptions and any remedies if the assumptions are violated.
Answer:
i. Likelihood ratio test statistic ($G^2$ statistic)
First, compute the MLE of $p$. The total number of observations is 1000. Thus, the log-likelihood based on the multinomial distribution is
$l(p) = \log\frac{1000!}{442!\,514!\,38!\,6!} + 442\log\frac{p}{2} + 514\log\left(\frac{p^2}{2} + pq\right) + 38\log\frac{q}{2} + 6\log\frac{q^2}{2}$
$= C + 442\log(p) + 514\log(2p - p^2) + 50\log(1 - p)$
$l'(p) = \frac{442}{p} + \frac{514(2 - 2p)}{2p - p^2} - \frac{50}{1 - p} = 0$
$G^2 = 2\sum_{i=1}^{2}\sum_{j=1}^{2} n_{ij}\log\frac{n_{ij}}{\hat{n}_{ij}} = 2.92$
ii. Pearson $\chi^2$ statistic
$X^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(n_{ij} - \hat{n}_{ij})^2}{\hat{n}_{ij}} = 3.08$
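The steps of this part (solving the score equation, forming the fitted counts, and evaluating both statistics) can be traced numerically. The sketch below solves $l'(p) = 0$ by bisection, which is one possible method and not necessarily the one used in the original solution; all function and variable names are illustrative. The degrees of freedom are taken as $(4 - 1) - 1 = 2$, an assumption based on four cells and one estimated parameter.

```python
import math


def score(p):
    """Derivative l'(p) of the log-likelihood given above."""
    return 442 / p + 514 * (2 - 2 * p) / (2 * p - p * p) - 50 / (1 - p)


def solve_mle(lo=1e-6, hi=1 - 1e-6, tol=1e-12):
    """Bisection: l(p) is concave, so l'(p) decreases from positive to negative."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


p_hat = solve_mle()
q_hat = 1 - p_hat
n = 1000

# Order: male normal, female normal, male color-blind, female color-blind.
observed = [442, 514, 38, 6]
fitted = [n * p_hat / 2, n * (p_hat ** 2 / 2 + p_hat * q_hat),
          n * q_hat / 2, n * q_hat ** 2 / 2]

g2 = 2 * sum(o * math.log(o / f) for o, f in zip(observed, fitted))
x2 = sum((o - f) ** 2 / f for o, f in zip(observed, fitted))

# Assumed df = 2, for which the chi-square upper tail is exp(-x/2).
print(f"p_hat = {p_hat:.4f}")
print(f"G^2 = {g2:.2f} (p = {math.exp(-g2 / 2):.3f})")
print(f"X^2 = {x2:.2f} (p = {math.exp(-x2 / 2):.3f})")
```

The smallest fitted count, for color-blind females, is below 5, which is the kind of assumption violation that part (b) asks the reader to comment on.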
Size Class      Total
Too small          19
Intermediate       35
Large enough      146
Total             200