
STAC50H3: Data Collection

Assignment 2
Due: Oct 30, 2013 in class
All relevant work must be shown for credit.
Note: In any question, if you are using R, all R code and R output must
be given in an appendix, and your written answers must be given separately
in the main part of the answers. You should assume that the reader is not
familiar with R output and explain all your findings from the output in
simple English.
1. The data set packetdata.csv (on the course webpage) was obtained from the
Journal of Statistics Education Data Archive (but I have made some changes
to the data). The data set gives one hour's data on internet network traffic
between a company and the rest of the world on March 8th, 1995. (Paxson
and Floyd (1995) and Sanchez and He (2005) also analyzed this data set.)
The actual data set is large. In the data file packetdata.csv, I have
included only a subset of that large data set, but let's assume that it is our
population. This data set contains six variables. In this question we will focus
on the variable databytes (the last column of the data set).
Note 1: Before selecting your SRSWOR, use set.seed(124) to make sure
that you get the same sample as mine.
(a) (5 points) Make a histogram of databytes and describe its main features.
(b) (13 points) Use R to take a SRSWOR of size n = 300 from this population and use that sample to estimate the population mean and its
standard error. You may use Sampling and Survey packages. Also
calculate an approximate 95% confidence interval for the population
mean. Comment on the theoretical validity and practical appropriateness (as a measure of center) of your estimates.
(c) (2 points) Calculate the population mean (i.e. the mean of databytes
in the population). Is that close enough to your estimate in part (b)?
Solution:
> # If there are non-numerical data in the variable of interest, we can
> # remove the non-numerical data first and then use the sampling and
> # survey packages as in this code.
> pop=read.csv("C:/Users/Mihinda/Desktop/packetdata.csv", header=1)
> x=pop$databytes # Variable to be analyzed
> #-----------------------------------------------------------
> # The following two lines will remove non-numerical data, if any


> pop$x <- as.numeric(as.character(x))
Warning message:
NAs introduced by coercion
> pop <- na.omit(pop$x) # Here we take pop to be the data on the
> # variable of interest.
> #------------------------------------------------------------
> n=300 # sample size
> library(sampling)
Loading required package: MASS
Loading required package: lpSolve
> set.seed(124) #This gives the same sample
> s=srswor(n,length(pop))
> srsdata = getdata(pop,s)
> srsdata
  ID_unit data
1      23  256
2     338  512
3     452  512
4     938    0
5    1390    0
6    1535    5
.
.
.
> # R assigns a name called data but we can change it as shown below.
> hist(pop)

a) The histogram of databytes is shown below. It is not symmetric,
but for this sample size (with the CLT) the effect of non-Normality will not
be a big problem. There appear to be a few outliers. It would be better to
remove these outliers and redo the analysis.

[Figure: "Histogram of pop"; x-axis: pop (ticks at 500, 1000, 1500); y-axis: Frequency (ticks at 0, 10000, 25000)]

> #-----------------------------------------------------------
> library(survey)

Attaching package: survey

The following object is masked from package:graphics:
dotchart

> srsdata$x=srsdata$data
> srsdesign <- svydesign(id=~1, fpc=rep(length(pop), nrow(srsdata)), data=srsdata)
> m <- svymean(~x, srsdesign)
> m
    mean     SE
x 180.46 13.857
> confint(m , level = 0.95)
     2.5 %   97.5 %
x 153.2997 207.6203
> t <- svytotal(~x, srsdesign)
> confint(t , level = 0.95)
    2.5 %   97.5 %
x 7640919 10348416
> summary(pop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0     0.0    15.0   185.3   512.0  1460.0

b) The estimated population mean is 180.46 with a standard error of 13.857,
and the 95 percent confidence interval for the population mean is
(153.2997, 207.6203). Even though the distribution of databytes is skewed
to the right, this confidence interval is still valid (theoretically
approximately correct) since our sample size (n = 300) is large (CLT), but
the median is a more appropriate measure of center for skewed distributions.
(c) The population mean is 185.3, which lies inside the confidence interval
we calculated, (153.2997, 207.6203), and so yes, it is close enough to our
estimate.
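The survey-package numbers in part (b) come from the standard SRSWOR formulas, which can also be computed by hand. The following is a minimal base-R sketch of that calculation. It does not read packetdata.csv: `pop_demo` is a hypothetical right-skewed stand-in population, so its numbers will not match the output above.

```r
# Hypothetical stand-in population (NOT the real databytes values)
set.seed(1)
pop_demo <- rexp(5000, rate = 1 / 180)    # right-skewed, loosely like databytes

N <- length(pop_demo)                     # population size
n <- 300                                  # sample size
srs <- sample(pop_demo, n)                # SRS without replacement (the default)

ybar <- mean(srs)                         # estimate of the population mean
se   <- sqrt((1 - n / N) * var(srs) / n)  # standard error with the finite population correction
ci   <- ybar + c(-1, 1) * qnorm(0.975) * se   # approximate 95% confidence interval

c(estimate = ybar, se = se, lower = ci[1], upper = ci[2])
```

Comparing `mean(pop_demo)` with the interval gives the same kind of check as part (c).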
2. (10 points) Forest data. The data in file forest.dat are from
kdd.ics.uci.edu/databases/covertype/covertype.data.html (Blackard, 1998).
This link also provides a description of the data set (please read it). This
data set consists of a subset of the measurements from 581,012 30 × 30 m
cells from Region 2 of the U.S. Forest Service Resource Information System.
These cells are classified into 7 cover types (denoted by 1 to 7). The original
data were used in a data mining application, predicting forest cover
type from covariates. Data-mining methods are often used to explore
relationships in very large data sets; in many cases, the data sets are so large
that statistical software packages cannot analyze them. Many data-mining
problems, however, can be alternatively approached by analyzing probability
samples from the population. In this exercise, we treat forest.csv as
a population. Select an SRSWOR of size 1000 from this population. Using
your SRSWOR, estimate the percentage of cells in the forest of cover type 2,
along with a 95% CI.

Note : Before selecting your SRSWOR, use set.seed(124) to make sure
that you get the same sample as mine.
Solution:
> pop=read.csv("C:/Users/Mahinda/Desktop/forest.csv", header=0)
> cover = pop[,15]
> pop$x = (cover > 1 )*(cover < 3)
> # pop$x = (cover == 2 ) is OK but the summary looks a bit different
> # but that is OK.
> n=1000 # sample size
> library(sampling)
Loading required package: MASS
Loading required package: lpSolve
> set.seed(124) #This gives the same sample
> s=srswor(n,nrow(pop))
> srsdata = getdata(pop,s)
> head(srsdata) # To see the first few data lines
     ID_unit   V1  V2 V3  V4  V5   V6  V7  V8  V9  V10 V11 V12 V13 V14 V15 x
143      143 2837 112  8 272  16 3649 235 231 128 6221   1   0   0   0   2 1
262      262 3168  25  4  30   0 4426 218 231 150 3180   1   0   0   0   1 0
406      406 2656 110 23  85  35 1350 252 208  72  636   1   0   0   0   5 0
1417    1417 2779  78 47 134 118  150 227 111   0 1318   1   0   0   0   5 0
2270    2270 2041 112 25  30   2  510 253 203  62  685   0   0   0   1   4 0
2516    2516 2097  26 21   0   0  300 204 188 113  601   0   0   0   1   3 0
> summary(srsdata$x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   0.000   0.498   1.000   1.000
> #-----------------------------------------------------------
> library(survey)
Attaching package: survey
The following object is masked from package:graphics:
dotchart
> srsdesign <- svydesign(id=~1, fpc=rep(nrow(pop), nrow(srsdata)), data=srsdata)
> m <- svymean(~x, srsdesign)
> m
   mean     SE
x 0.498 0.0158
> confint(m , level = 0.95)
      2.5 %    97.5 %
x 0.4670217 0.5289783
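In plain terms, the output says that about 49.8% of the cells are of cover type 2, with a 95% confidence interval of roughly 46.7% to 52.9%. As a hedged cross-check, the same interval can be rebuilt in base R from the sample proportion alone. The population size `N` below is an assumption (the row count of forest.csv is not shown in the output); 581,012 is used only because that figure appears in the problem statement.

```r
# Values taken from the output above; N is an ASSUMED population size
p_hat <- 0.498     # sample proportion of cover type 2
n     <- 1000      # sample size
N     <- 581012    # assumption: number of rows in forest.csv

# SRSWOR standard error of a proportion, with finite population correction
se <- sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

# Report as percentages
round(100 * c(estimate = p_hat, lower = ci[1], upper = ci[2]), 1)
```

Because n is tiny relative to any plausible N here, the finite population correction has very little effect on the interval.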

3. Consider the hypothetical population below:


Unit number   Stratum   y
1             1         1
2             1         2
3             1         4
4             2         4
5             2         7
6             2         7

(a) (5 points) Write out all possible SRSs of size 2 from stratum 1. Do the
same for stratum 2.
Solution: All possible samples from stratum 1: (1, 2), (1, 3), (2, 3)
All possible samples from stratum 2: (4, 5), (4, 6), (5, 6)
(b) (5 points) Using your work in (a), find the sampling distribution of
$\bar{y}_{str}$.
Solution:
The SRSs          Measures on the samples   Sample means                          $\bar{y}_{str} = \frac{1}{2}\bar{y}_1 + \frac{1}{2}\bar{y}_2$   $p$
(1, 2), (4, 5)    (1, 2), (4, 7)            $\bar{y}_1 = 1.5$, $\bar{y}_2 = 5.5$  (1/2)(1.5) + (1/2)(5.5) = 3.50                                  (1/3)(1/3) = 1/9
(1, 2), (4, 6)    (1, 2), (4, 7)            $\bar{y}_1 = 1.5$, $\bar{y}_2 = 5.5$  (1/2)(1.5) + (1/2)(5.5) = 3.50                                  (1/3)(1/3) = 1/9
(1, 2), (5, 6)    (1, 2), (7, 7)            $\bar{y}_1 = 1.5$, $\bar{y}_2 = 7.0$  (1/2)(1.5) + (1/2)(7.0) = 4.25                                  (1/3)(1/3) = 1/9
(1, 3), (4, 5)    (1, 4), (4, 7)            $\bar{y}_1 = 2.5$, $\bar{y}_2 = 5.5$  (1/2)(2.5) + (1/2)(5.5) = 4.00                                  (1/3)(1/3) = 1/9
(1, 3), (4, 6)    (1, 4), (4, 7)            $\bar{y}_1 = 2.5$, $\bar{y}_2 = 5.5$  (1/2)(2.5) + (1/2)(5.5) = 4.00                                  (1/3)(1/3) = 1/9
(1, 3), (5, 6)    (1, 4), (7, 7)            $\bar{y}_1 = 2.5$, $\bar{y}_2 = 7.0$  (1/2)(2.5) + (1/2)(7.0) = 4.75                                  (1/3)(1/3) = 1/9
(2, 3), (4, 5)    (2, 4), (4, 7)            $\bar{y}_1 = 3.0$, $\bar{y}_2 = 5.5$  (1/2)(3.0) + (1/2)(5.5) = 4.25                                  (1/3)(1/3) = 1/9
(2, 3), (4, 6)    (2, 4), (4, 7)            $\bar{y}_1 = 3.0$, $\bar{y}_2 = 5.5$  (1/2)(3.0) + (1/2)(5.5) = 4.25                                  (1/3)(1/3) = 1/9
(2, 3), (5, 6)    (2, 4), (7, 7)            $\bar{y}_1 = 3.0$, $\bar{y}_2 = 7.0$  (1/2)(3.0) + (1/2)(7.0) = 5.00                                  (1/3)(1/3) = 1/9


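(c) (5 points) Use your answer in part (b) to find the mean and variance
of $\bar{y}_{str}$ (i.e. $E(\bar{y}_{str})$ and $Var(\bar{y}_{str})$).
Note: Do not use R in this question. Show all your work clearly.
Solution:
$E(\bar{y}_{str}) = \sum \bar{y}_{str}\, p = 3.5 \cdot \frac{1}{9} + 3.5 \cdot \frac{1}{9} + 4.25 \cdot \frac{1}{9} + 4.00 \cdot \frac{1}{9} + 4.00 \cdot \frac{1}{9} + 4.75 \cdot \frac{1}{9} + 4.25 \cdot \frac{1}{9} + 4.25 \cdot \frac{1}{9} + 5.00 \cdot \frac{1}{9} = 4.166666667$
$E(\bar{y}_{str}^2) = \sum \bar{y}_{str}^2\, p = 3.5^2 \cdot \frac{1}{9} + 3.5^2 \cdot \frac{1}{9} + 4.25^2 \cdot \frac{1}{9} + 4.00^2 \cdot \frac{1}{9} + 4.00^2 \cdot \frac{1}{9} + 4.75^2 \cdot \frac{1}{9} + 4.25^2 \cdot \frac{1}{9} + 4.25^2 \cdot \frac{1}{9} + 5.00^2 \cdot \frac{1}{9} = 17.58333333$
$Var(\bar{y}_{str}) = E(\bar{y}_{str}^2) - (E(\bar{y}_{str}))^2 = 17.58333333 - 4.166666667^2 = 0.22222$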
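The question asks for a hand calculation, so the following is offered only as an independent check of the arithmetic above, not as part of the required answer. It is a minimal base-R sketch that enumerates the nine equally likely stratified samples; the variable names are illustrative.

```r
y1 <- c(1, 2, 4)   # y values in stratum 1 (units 1-3)
y2 <- c(4, 7, 7)   # y values in stratum 2 (units 4-6)

m1 <- combn(y1, 2, mean)   # means of all SRSs of size 2 from stratum 1
m2 <- combn(y2, 2, mean)   # means of all SRSs of size 2 from stratum 2

# Pair every stratum-1 sample with every stratum-2 sample; weights W1 = W2 = 1/2
ybar_str <- as.vector(outer(m1, m2, function(a, b) 0.5 * a + 0.5 * b))
p <- rep(1 / length(ybar_str), length(ybar_str))   # each pairing has probability 1/9

E <- sum(ybar_str * p)          # E(ybar_str)
V <- sum(ybar_str^2 * p) - E^2  # Var(ybar_str)
c(mean = E, variance = V)
```

The result should agree with the hand-computed values 4.1667 and 0.2222 (that is, 25/6 and 2/9).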

4. (S. L. Lohr) Suppose that a city has 80,000 dwelling units, of which 30,000
are houses, 40,000 are apartments, and 10,000 are condominiums.
(a) (5 points) You believe that the mean electricity usage is about twice
as much for houses as for apartments or condominiums, and that the
standard deviation is proportional to the mean so that S1 = 2S2 = 2S3 .
The cost of taking an observation is the same for all dwelling units. We
want to take a stratified sample of n = 800 dwelling units in order to
estimate the mean electricity consumption per unit in the population.
How would you allocate a stratified sample of 800 observations if we
wanted to minimize the variance of the estimate?


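Solution: Using $n_h = \frac{N_h S_h}{\sum_h N_h S_h}\, n$ for optimal allocation with equal
costs, and writing $S_2 = S_3 = k$, $S_1 = 2k$ (with the $N_h$ in units of 10,000),
$n_1 = \frac{3 \cdot 2k}{3 \cdot 2k + 4k + k} \cdot 800 \approx 436$, $n_2 = \frac{4}{11} \cdot 800 \approx 291$ and
$n_3 = \frac{1}{11} \cdot 800 \approx 73$.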
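The allocation in part (a) can be reproduced with a few lines of base R. This is a minimal sketch: only the ratios of the standard deviations matter, so `S_h` holds relative values (2, 1, 1) rather than actual standard deviations, and the vector names are illustrative.

```r
N_h <- c(houses = 30000, apartments = 40000, condominiums = 10000)
S_h <- c(2, 1, 1)   # relative standard deviations: S1 = 2*S2 = 2*S3
n   <- 800          # total sample size

# Optimal (Neyman) allocation with equal costs: n_h proportional to N_h * S_h
n_h <- n * N_h * S_h / sum(N_h * S_h)
round(n_h)
```

Houses receive more than half of the sample even though they make up only 3/8 of the population, because their larger variability makes each observation there more valuable.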
(b) (8 points) Now suppose that you take a stratified random sample with
proportional allocation and want to estimate the overall proportion of
households in which energy conservation is practiced. If 40% of house
dwellers, 20% of apartment dwellers, and 5% of condominium residents
practice energy conservation, what is the value of $p$ for this population?
What gain would the stratified sample with proportional allocation offer
over an SRSWOR? i.e., what is $V_{prop}(\hat{p}_{str}) / V(\hat{p}_{SRSWOR})$?

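Solution: $p = \frac{30000 \cdot 0.40 + 40000 \cdot 0.20 + 10000 \cdot 0.05}{80000} = \frac{20500}{80000} = 0.25625$.

$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h (1 - p_h)}{n_h}$, where $W_h = \frac{N_h}{N}$, $n_h = W_h n$,
$f_h = \frac{n_h}{N_h} = \frac{n}{N}$, $n = 800$, $N_1 = 30000$, $N_2 = 40000$, $N_3 = 10000$,
$p_1 = 0.40$, $p_2 = 0.20$, $p_3 = 0.05$.

$V(\hat{p}_{str}) = \sum_{h=1}^{H} W_h^2 (1 - f_h) \frac{p_h (1 - p_h)}{n_h} = \left(1 - \frac{n}{N}\right) \frac{1}{n} \sum_{h=1}^{H} W_h\, p_h (1 - p_h)$
$= \left(1 - \frac{800}{80000}\right) \frac{1}{800} \left( \frac{3}{8} \cdot 0.4 (1 - 0.4) + \frac{4}{8} \cdot 0.2 (1 - 0.2) + \frac{1}{8} \cdot 0.05 (1 - 0.05) \right) = 0.0002177226563$.

$V(\hat{p}_{SRSWOR}) = \frac{N - n}{N - 1} \cdot \frac{p (1 - p)}{n} = \frac{80000 - 800}{80000 - 1} \cdot \frac{0.25625 (1 - 0.25625)}{800} = 0.0002358530458$.

$\frac{V_{prop}(\hat{p}_{str})}{V(\hat{p}_{SRSWOR})} = \frac{0.0002177226563}{0.0002358530458} \approx 0.92$. The variance of the estimator
with stratified sampling is about 92% of that with SRSWOR.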
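The arithmetic in part (b) is easy to mistype, so here is a minimal base-R sketch that recomputes the three quantities from the same formulas used in the solution. The variable names are illustrative.

```r
N_h <- c(30000, 40000, 10000)   # houses, apartments, condominiums
p_h <- c(0.40, 0.20, 0.05)      # stratum proportions practicing conservation
N   <- sum(N_h)
n   <- 800
W_h <- N_h / N                  # stratum weights

p <- sum(W_h * p_h)             # overall population proportion

# Proportional allocation: n_h = W_h * n, so f_h = n / N in every stratum
v_str <- (1 - n / N) / n * sum(W_h * p_h * (1 - p_h))

# SRSWOR variance of the sample proportion
v_srs <- (N - n) / (N - 1) * p * (1 - p) / n

c(p = p, v_str = v_str, v_srs = v_srs, ratio = v_str / v_srs)
```

The ratio comes out at about 0.92, matching the hand calculation: stratifying with proportional allocation removes roughly 8% of the variance here.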
