Assign 2
Due: Oct 30, 2013 in class
All relevant work must be shown for credit.
Note: In any question, if you are using R, all R code and R output must
be given in an appendix and your written answers must be given separately
in the main part of the answers. You should assume that the reader is not
familiar with R output and explain all your findings from the output in
plain English.
1. The data set packetdata.csv (on the course webpage) was obtained from the
Journal of Statistics Education Data Archive (but I have made some changes
to the data). The data set gives one hour's data on internet network traffic between a company and the rest of the world on March 8th, 1995. (Paxson
and Floyd (1995) and Sanchez and He (2005) also analyzed this data set.)
The actual data set is large. In the data file packetdata.csv, I have
included only a subset of it, but let's treat that subset as our population. This data set contains six variables. In this question we will focus
on the variable databytes (the last column of the data set).
Note 1: Before selecting your SRSWOR, use set.seed(124) to make sure
that you get the same sample as mine.
(a) (5 points) Make a histogram of databytes and describe its main features.
(b) (13 points) Use R to take a SRSWOR of size n = 300 from this population and use that sample to estimate the population mean and its
standard error. You may use the sampling and survey packages. Also
calculate an approximate 95% confidence interval for the population
mean. Comment on the theoretical validity and practical appropriateness (as a measure of center) of your estimates.
(c) (2 points) Calculate the population mean (i.e. the mean of databytes
in the population). Is it close to your estimate in part (b)?
Solution:
[Figure: "Histogram of pop". x-axis: pop (ticks at 500, 1000, 1500); y-axis: Frequency (ticks at 0, 10000, 25000).]
> srsdata$x = srsdata$data
> srsdesign <- svydesign(id=~1, fpc=rep(length(pop), nrow(srsdata)), data=srsdata)
> m <- svymean(~x, srsdesign)
> m
    mean     SE
x 180.46 13.857
> confint(m, level = 0.95)
     2.5 %   97.5 %
x 153.2997 207.6203
> t <- svytotal(~x, srsdesign)
> confint(t, level = 0.95)
    2.5 %   97.5 %
x 7640919 10348416
> summary(pop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0    15.0   185.3   512.0  1460.0
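The standard error and confidence interval in the output above follow the usual SRSWOR formulas, SE = sqrt((1 - n/N) s^2 / n) and mean ± z × SE. The sketch below applies them in base R to a made-up skewed population (`pop_demo` is an assumption for illustration, not the course data), and then applies the same interval formula to the mean and SE reported above. It assumes the interval uses the normal critical value, which is consistent with the reported limits.

```r
# Hypothetical skewed population standing in for databytes (NOT the course data)
set.seed(1)
pop_demo <- rexp(5000, rate = 1 / 180)
N <- length(pop_demo)
n <- 300

# SRSWOR of size n using base R
samp <- sample(pop_demo, size = n, replace = FALSE)

# Estimated mean and its standard error with the finite population correction
ybar <- mean(samp)
se <- sqrt((1 - n / N) * var(samp) / n)

# Approximate 95% confidence interval using the normal critical value
ci <- ybar + c(-1, 1) * qnorm(0.975) * se

# Same interval formula applied to the mean and SE reported in the output above
ci_reported <- 180.46 + c(-1, 1) * qnorm(0.975) * 13.857
print(round(ci_reported, 2))
```

The interval rebuilt from the rounded mean and SE agrees with the reported limits (153.2997, 207.6203) up to rounding. With a distribution as skewed as the summary suggests (median 15.0, mean 185.3), the normal approximation and the mean as a measure of center both deserve the comment that part (b) asks for.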
Note: Before selecting your SRSWOR, use set.seed(124) to make sure
that you get the same sample as mine.
Solution:
> pop=read.csv("C:/Users/Mahinda/Desktop/forest.csv", header=0)
> cover = pop[,15]
> pop$x = (cover > 1 )*(cover < 3)
> # pop$x = (cover == 2 ) is OK but the summary looks a bit different
> # but that is OK.
> n=1000 # sample size
> library(sampling)
Loading required package: MASS
Loading required package: lpSolve
> set.seed(124) #This gives the same sample
> s=srswor(n,nrow(pop))
> srsdata = getdata(pop,s)
> head(srsdata) # To see the first few data lines
     ID_unit   V1  V2 V3  V4  V5   V6  V7  V8  V9  V10 V11 V12 V13 V14 V15 x
143      143 2837 112  8 272  16 3649 235 231 128 6221   1   0   0   0   2 1
262      262 3168  25  4  30   0 4426 218 231 150 3180   1   0   0   0   1 0
406      406 2656 110 23  85  35 1350 252 208  72  636   1   0   0   0   5 0
1417    1417 2779  78 47 134 118  150 227 111   0 1318   1   0   0   0   5 0
2270    2270 2041 112 25  30   2  510 253 203  62  685   0   0   0   1   4 0
2516    2516 2097  26 21   0   0  300 204 188 113  601   0   0   0   1   3 0
> summary(srsdata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   0.498   1.000   1.000
> #----------------------------------------------------------
> library(survey)
Attaching package: survey
The following object is masked from package:graphics:
dotchart
> srsdesign <- svydesign(id=~1, fpc=rep(nrow(pop), nrow(srsdata)), data=srsdata)
> m <- svymean(~x, srsdesign)
> m
   mean     SE
x 0.498 0.0158
> confint(m, level = 0.95)
      2.5 %    97.5 %
x 0.4670217 0.5289783
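Because x is a 0/1 indicator, its sample mean is a sample proportion, and the reported SE can be checked by hand with SE = sqrt((1 - n/N) p(1 - p)/(n - 1)). The sketch below is a minimal base-R version of that formula. The helper name `srs_prop_se` is hypothetical, and since the population size is not shown in the output above, N is left as a parameter: the default ignores the finite population correction, and the N = 10000 call is a made-up value for illustration only.

```r
# Hypothetical helper: SE of a sample proportion under SRSWOR.
# N = Inf drops the finite population correction (n / Inf is 0).
srs_prop_se <- function(p_hat, n, N = Inf) {
  sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))
}

p_hat <- 0.498   # sample proportion reported above
n <- 1000        # sample size used in the solution

# SE without the finite population correction
se_no_fpc <- srs_prop_se(p_hat, n)

# SE with a made-up population size, to show the effect of the correction
se_demo <- srs_prop_se(p_hat, n, N = 10000)

# Approximate 95% confidence interval without the correction
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se_no_fpc
print(round(c(se_no_fpc, ci), 4))
```

Without the correction the SE already rounds to the reported 0.0158, which suggests the sampling fraction n/N is small for this population.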
Stratum   y
1         1
1         2
1         4
2         4
2         7
2         7
(a) (5 points) Write out all possible SRSs of size 2 from stratum 1. Do the
same for stratum 2.
Solution: All possible samples from stratum 1: (1, 2), (1, 3), (2, 3)
All possible samples from stratum 2: (4, 5), (4, 6), (5, 6)
(b) (5 points) Using your work in (a), find the sampling distribution of
$\bar{y}_{str}$.
Solution:
The SRSs          Sample means                          $\bar{y}_{str}$                                  p
(1, 2), (4, 5)    $\bar{y}_1$ = 1.5, $\bar{y}_2$ = 5.5    (1/2)(1.5) + (1/2)(5.5) = 3.50    (1/3)(1/3) = 1/9
(1, 2), (4, 6)    $\bar{y}_1$ = 1.5, $\bar{y}_2$ = 5.5    (1/2)(1.5) + (1/2)(5.5) = 3.50    (1/3)(1/3) = 1/9
(1, 2), (5, 6)    $\bar{y}_1$ = 1.5, $\bar{y}_2$ = 7.0    (1/2)(1.5) + (1/2)(7.0) = 4.25    (1/3)(1/3) = 1/9
(1, 3), (4, 5)    $\bar{y}_1$ = 2.5, $\bar{y}_2$ = 5.5    (1/2)(2.5) + (1/2)(5.5) = 4.00    (1/3)(1/3) = 1/9
(1, 3), (4, 6)    $\bar{y}_1$ = 2.5, $\bar{y}_2$ = 5.5    (1/2)(2.5) + (1/2)(5.5) = 4.00    (1/3)(1/3) = 1/9
(1, 3), (5, 6)    $\bar{y}_1$ = 2.5, $\bar{y}_2$ = 7.0    (1/2)(2.5) + (1/2)(7.0) = 4.75    (1/3)(1/3) = 1/9
(2, 3), (4, 5)    $\bar{y}_1$ = 3.0, $\bar{y}_2$ = 5.5    (1/2)(3.0) + (1/2)(5.5) = 4.25    (1/3)(1/3) = 1/9
(2, 3), (4, 6)    $\bar{y}_1$ = 3.0, $\bar{y}_2$ = 5.5    (1/2)(3.0) + (1/2)(5.5) = 4.25    (1/3)(1/3) = 1/9
(2, 3), (5, 6)    $\bar{y}_1$ = 3.0, $\bar{y}_2$ = 7.0    (1/2)(3.0) + (1/2)(7.0) = 4.75 + 0.25 = 5.00    (1/3)(1/3) = 1/9
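The hand enumeration can be cross-checked by machine. The sketch below is only a check, not a substitute for the written work the question asks for. It takes the six y values from the stratum table, labels the units 1 to 6 in table order (an assumption that matches the sample labels used in the solution), and lists every pairing of a stratum 1 sample with a stratum 2 sample.

```r
# Population from the stratum table: units 1-3 in stratum 1, units 4-6 in stratum 2
y <- c(1, 2, 4, 4, 7, 7)
stratum <- c(1, 1, 1, 2, 2, 2)

# All SRSs of size 2 within each stratum (each column holds one sample's unit labels)
s1 <- combn(which(stratum == 1), 2)
s2 <- combn(which(stratum == 2), 2)

# Pair every stratum-1 sample with every stratum-2 sample (3 x 3 = 9 samples)
grid <- expand.grid(i = seq_len(ncol(s1)), j = seq_len(ncol(s2)))
ybar1 <- apply(s1, 2, function(u) mean(y[u]))[grid$i]
ybar2 <- apply(s2, 2, function(u) mean(y[u]))[grid$j]

# Stratified mean with equal stratum weights W1 = W2 = 3/6
ybar_str <- 0.5 * ybar1 + 0.5 * ybar2

# Sampling distribution: each of the 9 samples has probability 1/9
dist <- table(ybar_str) / length(ybar_str)
print(dist)
```

Collapsing equal values this way shows that 4.25 is the most likely value of the stratified mean, with probability 3/9.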
(c) (5 points) Use your answer in part (b) to find the mean and variance
of $\bar{y}_{str}$ (i.e. $E(\bar{y}_{str})$ and $Var(\bar{y}_{str})$).
Note: Do not use R in this question. Show all your work clearly.
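Solution:
$$E(\bar{y}_{str}) = \sum \bar{y}_{str}\, p = 3.5 \cdot \tfrac{1}{9} + 3.5 \cdot \tfrac{1}{9} + 4.25 \cdot \tfrac{1}{9} + 4.00 \cdot \tfrac{1}{9} + 4.00 \cdot \tfrac{1}{9} + 4.75 \cdot \tfrac{1}{9} + 4.25 \cdot \tfrac{1}{9} + 4.25 \cdot \tfrac{1}{9} + 5.00 \cdot \tfrac{1}{9} = 4.166666667$$
$$E(\bar{y}_{str}^2) = \sum \bar{y}_{str}^2\, p = 3.5^2 \cdot \tfrac{1}{9} + 3.5^2 \cdot \tfrac{1}{9} + 4.25^2 \cdot \tfrac{1}{9} + 4.00^2 \cdot \tfrac{1}{9} + 4.00^2 \cdot \tfrac{1}{9} + 4.75^2 \cdot \tfrac{1}{9} + 4.25^2 \cdot \tfrac{1}{9} + 4.25^2 \cdot \tfrac{1}{9} + 5.00^2 \cdot \tfrac{1}{9} = 17.58333333$$
$$Var(\bar{y}_{str}) = E(\bar{y}_{str}^2) - (E(\bar{y}_{str}))^2 = 17.58333333 - 4.166666667^2 = 0.22222$$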
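As a cross-check on the enumeration, the same two numbers follow from the standard stratified-sampling formulas applied to the stratum table. This is a sketch under the assumption that $S_h^2$ denotes the stratum variance with divisor $N_h - 1$, and that $W_h = 1/2$, $N_h = 3$, $n_h = 2$ in both strata.

```latex
\begin{align*}
\bar{y}_{1U} &= \tfrac{1+2+4}{3} = \tfrac{7}{3}, &
S_1^2 &= \tfrac{(1-\frac{7}{3})^2 + (2-\frac{7}{3})^2 + (4-\frac{7}{3})^2}{2} = \tfrac{7}{3},\\
\bar{y}_{2U} &= \tfrac{4+7+7}{3} = 6, &
S_2^2 &= \tfrac{(4-6)^2 + (7-6)^2 + (7-6)^2}{2} = 3,\\
E(\bar{y}_{str}) &= \sum_{h=1}^{2} W_h\, \bar{y}_{hU}
  = \tfrac{1}{2}\cdot\tfrac{7}{3} + \tfrac{1}{2}\cdot 6 = \tfrac{25}{6} \approx 4.1667,\\
Var(\bar{y}_{str}) &= \sum_{h=1}^{2} W_h^2 \left(1 - \tfrac{n_h}{N_h}\right) \tfrac{S_h^2}{n_h}
  = \tfrac{1}{4}\cdot\tfrac{1}{3}\cdot\tfrac{7/3}{2} + \tfrac{1}{4}\cdot\tfrac{1}{3}\cdot\tfrac{3}{2}
  = \tfrac{7}{72} + \tfrac{9}{72} = \tfrac{2}{9} \approx 0.2222.
\end{align*}
```

Both values agree with the results obtained from the sampling distribution.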
4. (S. L. Lohr)Suppose that a city has 80,000 dwelling units, of which 30,000
are houses, 40,000 are apartments, and 10,000 are condominiums.
(a) (5 points) You believe that the mean electricity usage is about twice
as much for houses as for apartments or condominiums, and that the
standard deviation is proportional to the mean so that S1 = 2S2 = 2S3 .
The cost of taking an observation is the same for all dwelling units. We
want to take a stratified sample of n = 800 dwelling units in order to
estimate the mean electricity consumption per unit in the population.
How would you allocate a stratified sample of 800 observations if we
wanted to minimize the variance of the estimate?
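With equal sampling costs, the variance-minimizing allocation is the Neyman allocation, in which n_h is proportional to N_h S_h. The sketch below applies it to the figures in the question. Only the ratios of the standard deviations matter, so they are entered as relative values 2 : 1 : 1; the variable names are illustrative.

```r
# Stratum sizes from the question
N_h <- c(houses = 30000, apartments = 40000, condominiums = 10000)

# Relative standard deviations, from S1 = 2*S2 = 2*S3 (only ratios matter)
S_rel <- c(2, 1, 1)

n <- 800  # total sample size

# Neyman allocation: n_h proportional to N_h * S_h
alloc <- n * N_h * S_rel / sum(N_h * S_rel)
print(round(alloc))
```

Under these assumptions houses receive a larger share of the sample than their share of the population (30000 of 80000 units), because their electricity usage is assumed to be more variable.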
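Solution:
$$p = \frac{30000(0.40) + 40000(0.20) + 10000(0.05)}{80000} = \frac{20500}{80000} = 0.25625.$$
$$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h(1 - p_h)}{n_h}, \quad W_h = \frac{N_h}{N}, \quad n_h = W_h n, \quad f_h = \frac{n_h}{N_h} = \frac{n}{N},$$
with n = 800, N_1 = 30000, N_2 = 40000, N_3 = 10000,
p_1 = 0.40, p_2 = 0.20, p_3 = 0.05.
$$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h(1 - p_h)}{n_h} = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} W_h\, p_h (1 - p_h)$$
$$= \left(1 - \frac{800}{80000}\right) \frac{1}{800} \left( \frac{3}{8}(0.4)(1 - 0.4) + \frac{4}{8}(0.2)(1 - 0.2) + \frac{1}{8}(0.05)(1 - 0.05) \right) = 0.0002177226563.$$
$$V(\hat{p}_{SRSWOR}) = \frac{N - n}{N - 1} \cdot \frac{p(1 - p)}{n} = \frac{80000 - 800}{80000 - 1} \cdot \frac{0.25625(1 - 0.25625)}{800} = 0.0002358530458.$$
$$\frac{V_{prop}(\hat{p}_{str})}{V(\hat{p}_{SRSWOR})} = \frac{0.0002177226563}{0.0002358530458} \approx 0.92.$$
The variance of the estimator under stratified sampling with proportional allocation is about 92% of that under SRSWOR.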
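The arithmetic in this comparison is easy to reproduce. The sketch below recomputes both variances in base R from the stratum sizes and proportions used in the solution, under the same proportional allocation; the variable names are illustrative.

```r
# Stratum sizes and stratum proportions used in the solution
N_h <- c(30000, 40000, 10000)
p_h <- c(0.40, 0.20, 0.05)
n <- 800

N <- sum(N_h)
W_h <- N_h / N          # stratum weights
n_h <- W_h * n          # proportional allocation: 300, 400, 100

# Overall population proportion
p <- sum(W_h * p_h)

# Variance of the stratified estimator under proportional allocation
v_str <- sum(W_h^2 * (1 - n_h / N_h) * p_h * (1 - p_h) / n_h)

# Variance of the SRSWOR estimator of the same proportion
v_srs <- (N - n) / (N - 1) * p * (1 - p) / n

# Ratio of the two variances
ratio <- v_str / v_srs
print(signif(c(p = p, v_str = v_str, v_srs = v_srs, ratio = ratio), 6))
```

The gain from stratification is modest here because the stratum proportions (0.40, 0.20, 0.05) do not differ enough to remove much of the overall variability.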