
PHP 2510

Central limit theorem, confidence intervals

PHP 2510 October 20, 2008

Distribution of the sample mean


Case 1: Population distribution is normal
For an individual in the population, $X_i \sim N(\mu, \sigma^2)$ for
$i = 1, 2, \ldots, n$. Then, for a sample of size $n$, the sample mean
also has a normal distribution:
$\bar{X} \sim N(\mu, \sigma^2/n)$
Case 2: Population distribution is not normal, e.g. Poisson or
binomial. Then, for large samples, the sample mean is approximately
normally distributed, with mean equal to $E(X)$ and variance
equal to $\mathrm{var}(X)/n$.

This is known as the central limit theorem



Central Limit Theorem


Characterizes the distribution of $\bar{X}$ in large samples.
Suppose a sample $X_1, \ldots, X_n$ comes from a distribution with
mean $E(X)$ and variance $\mathrm{var}(X)$. This can be almost any
distribution (binomial, Poisson, etc.).

When $n$ is large, the sample mean $\bar{X}$ is approximately normally
distributed. Its mean is equal to the population mean, and its variance is
$\mathrm{var}(X)/n$. We can write

$\bar{X} \sim N\left(E(X), \frac{\mathrm{var}(X)}{n}\right)$


Example 1: Throw a fair coin.


Use the sample mean to estimate the probability of a head.
Let $X$ be the outcome of throwing a fair coin once (1 for a head, 0 for a tail).
I Throw a fair coin $n$ times. Let $X_1, X_2, \ldots, X_n$ be the outcomes.
II Compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
III The CLT says $\bar{X}$ is approximately normally distributed for large $n$.
To illustrate this, let's repeat Steps I and II 1000 times. Plot $\bar{X}$
versus its relative frequency.
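
Steps I to III can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the slides: the function names, the fixed seed, and the coding of a head as 1 are assumptions made here.

```python
import random
from collections import Counter

def simulate_sample_means(n, reps=1000, p=0.5, seed=0):
    """Repeat Steps I and II `reps` times; return the list of sample means."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    means = []
    for _ in range(reps):
        # Step I: n throws, coded 1 for a head and 0 for a tail.
        throws = [1 if rng.random() < p else 0 for _ in range(n)]
        # Step II: sample mean of the n outcomes.
        means.append(sum(throws) / n)
    return means

def relative_frequencies(means):
    """Map each distinct sample-mean value to its relative frequency."""
    counts = Counter(means)
    return {value: count / len(means) for value, count in sorted(counts.items())}

# Step III: tabulate the sample mean against its relative frequency.
for n in (5, 15, 40, 100):
    freqs = relative_frequencies(simulate_sample_means(n))
    peak = max(freqs, key=freqs.get)
    print(f"n={n:3d}: {len(freqs)} distinct values, most frequent mean = {peak:.3f}")
```

Plotting the keys of `relative_frequencies(...)` against its values gives a histogram of the sample mean for each n.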

[Figure: relative frequency (y-axis) of the sample mean (x-axis) over 1000 repetitions of the coin experiment, in four panels: n=5, n=15, n=40, n=100.]


Example 2: Throw a fair die.


Use the sample mean to estimate the probability of a six. Let
$X$ record the outcome of throwing a fair die once (1 for a six, 0 otherwise).
I Throw a fair die $n$ times. Let $X_1, X_2, \ldots, X_n$ be the outcomes.
II Compute the sample mean $\bar{X} = \sum_{i=1}^{n} X_i / n$.
III The CLT says $\bar{X}$ is approximately normally distributed for large $n$.
To illustrate this, let's repeat Steps I and II 1000 times. Plot $\bar{X}$
versus its relative frequency.
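
As a rough check on the variance claim in the CLT, the spread of the simulated sample means can be compared with $\sqrt{p(1-p)/n}$ for $p = 1/6$. The sketch below assumes each throw is coded 1 for a six and 0 otherwise; the function name and seed are illustrative choices, not part of the lecture.

```python
import math
import random

def six_indicator_means(n, reps=1000, seed=0):
    """Sample means of the 'rolled a six' indicator over `reps` repetitions."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    means = []
    for _ in range(reps):
        rolls = [rng.randint(1, 6) for _ in range(n)]
        # Proportion of sixes among the n rolls.
        means.append(sum(1 for r in rolls if r == 6) / n)
    return means

p = 1 / 6
for n in (5, 15, 40, 100):
    means = six_indicator_means(n)
    avg = sum(means) / len(means)
    # Empirical standard deviation of the simulated sample means.
    spread = math.sqrt(sum((m - avg) ** 2 for m in means) / (len(means) - 1))
    # Value suggested by the CLT: sqrt(var(X) / n) with var(X) = p(1 - p).
    theory = math.sqrt(p * (1 - p) / n)
    print(f"n={n:3d}: mean={avg:.3f}, empirical SD={spread:.4f}, theoretical SD={theory:.4f}")
```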

[Figure: relative frequency (y-axis) of the sample mean (x-axis) over 1000 repetitions of the die experiment, in four panels: n=5, n=15, n=40, n=100.]


Confidence intervals
Confidence intervals can be used to convey uncertainty about the
estimate of any parameter.
A confidence interval consists of two random variables (a lower
and an upper bound) and covers the true mean with some pre-specified
probability.
Because the bounds are computed from the sample, they are themselves
random variables.


What question does a CI answer?


Example 1: Incidence of pre-eclampsia.
A random sample of 1249 women is selected and followed through
pregnancy. 250 get pre-eclampsia.

Estimate the incidence by the sample mean:

$250 / 1249 = 0.20$ (20%)

We would like to find an interval that contains, with 95%
probability, the true incidence of pre-eclampsia.

Example 2: Hospitalization rate of HIV-infected women during a
6-month period.
A sample of 787 women is followed for 6 months, resulting in the
following data.

We would like to construct an interval that contains the true
rate of hospitalization with 90% probability.

    numhosp |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        508       64.55       64.55
          1 |        176       22.36       86.91
          2 |         61        7.75       94.66
          3 |         20        2.54       97.20
          4 |         13        1.65       98.86
          5 |          5        0.64       99.49
          6 |          1        0.13       99.62
          7 |          3        0.38      100.00
------------+-----------------------------------
      Total |        787      100.00

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 numhosp |     787    .5870394    1.036723         0          7
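
The summary statistics can be recovered from the frequency table alone. The sketch below is a Python illustration, not the Stata commands that produced the output; it assumes the reported standard deviation uses the usual $n - 1$ divisor.

```python
import math

# Frequency table from the slide: number of hospitalizations -> number of women.
freq = {0: 508, 1: 176, 2: 61, 3: 20, 4: 13, 5: 5, 6: 1, 7: 3}

n = sum(freq.values())                           # number of observations
mean = sum(x * f for x, f in freq.items()) / n   # sample mean
# Sample variance with the n - 1 divisor (assumed to match the reported Std. Dev.).
var = sum(f * (x - mean) ** 2 for x, f in freq.items()) / (n - 1)
sd = math.sqrt(var)

print(f"Obs = {n}, Mean = {mean:.7f}, Std. Dev. = {sd:.6f}")
```

The printed values should agree with the Obs, Mean, and Std. Dev. columns of the summary table.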


Constructing a confidence interval


Central limit theorem: says the sample mean is normally
distributed in large samples:
$\bar{X} \sim N(E(X), \mathrm{var}(X)/n)$

Writing it this way is a little tedious, so we use $\mu$ and $\sigma^2$ to
generically denote $E(X)$ and $\mathrm{var}(X)$; i.e.
$\bar{X} \sim N(\mu, \sigma^2/n)$

This implies that the sample mean can be rescaled to a standard
normal:
$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$

Applying the CLT to form confidence intervals


To form a 90% confidence interval, we want an interval that
contains the true mean with probability 0.90.
Logic: For a large sample size, the sample mean is a normally
distributed random variable.
Find an interval that contains a standard normal random
variable with some pre-specified probability.
Center it using the sample mean, and scale it using the
standard error.


Step 1. Determine which two values contain 90% of the area
under the standard normal curve.
Ans: $-1.65$ and $1.65$

Step 2. Then with 90% probability, the standardized mean will
fall between $-1.65$ and $1.65$:
$-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65$
In other words,

$\Pr\left(-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65\right) = 0.90$


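Step 3. Rearrange the inequality so that $\mu$ stands alone in the middle:

$\Pr\left(-1.65 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.65\right) = 0.90$

$\Pr\left(\bar{X} - 1.65(\sigma/\sqrt{n}) < \mu < \bar{X} + 1.65(\sigma/\sqrt{n})\right) = 0.90$

In words: start with $\bar{X}$, then add and subtract 1.65 standard
errors:

$\bar{X} \pm 1.65\,(\sigma/\sqrt{n})$
In large samples, we can replace $\sigma$ with the sample SD $S$.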
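
The three steps can be collected into a small helper. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the function names `z_multiplier` and `normal_ci` are made up for this illustration.

```python
import math
from statistics import NormalDist

def z_multiplier(level):
    """Number of standard errors needed for two-sided coverage `level` (Step 1)."""
    # A central area of `level` leaves (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf(1 - (1 - level) / 2)

def normal_ci(xbar, sd, n, level=0.90):
    """Large-sample interval: xbar plus or minus z standard errors (Steps 2 and 3)."""
    se = sd / math.sqrt(n)   # standard error of the sample mean
    z = z_multiplier(level)
    return xbar - z * se, xbar + z * se

print(round(z_multiplier(0.90), 3))  # 1.645, which the slides round to 1.65
print(round(z_multiplier(0.95), 3))  # 1.96
```

Passing the sample SD $S$ as `sd` gives the large-sample version of the interval.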


Properties of the confidence interval


Covers the true mean with pre-specified probability.

Increase this probability by increasing the number of standard errors
to add and subtract.
For 95% coverage, add and subtract 1.96 std. errors:

$\bar{X} \pm 1.96\,(\sigma/\sqrt{n})$

Width of an interval is determined by
Population variance $\sigma^2$
Sample size $n$
Nominal coverage probability
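
The coverage property can be checked by simulation: draw many samples, build an interval from each, and count how often the interval contains the true mean. The sketch below assumes a Bernoulli population with $p = 0.2$ and $n = 200$; these values, the seed, and the function name are illustrative choices, not taken from the lecture.

```python
import math
import random

def coverage(p=0.2, n=200, z=1.65, reps=2000, seed=0):
    """Fraction of simulated intervals xbar +/- z * S / sqrt(n) that contain p."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    hits = 0
    for _ in range(reps):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        xbar = sum(xs) / n
        # Sample SD S, used in place of the unknown sigma.
        s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        half_width = z * s / math.sqrt(n)
        if xbar - half_width < p < xbar + half_width:
            hits += 1
    return hits / reps

print("z = 1.65:", coverage(z=1.65))  # should land near 0.90
print("z = 1.96:", coverage(z=1.96))  # should land near 0.95
```

Because both calls use the same seed, they see the same samples, so the wider 1.96 intervals cover the true mean at least as often as the 1.65 intervals.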

[Figure: confidence intervals drawn on an axis running from 1.4 to 2.6, in two panels: 95% CI and 90% CI.]


Example 1. Incidence of pre-eclampsia


Sample 1249 women, 250 get pre-eclampsia. Find an interval that
contains the true incidence with 95% probability.
Step 1. Let $X$ be the pre-eclampsia status.
$X \sim \mathrm{Bernoulli}(p)$
where $E(X) = p$ and $\sigma^2 = \mathrm{var}(X) = p(1 - p)$.
Sample mean: $\bar{X} = 250/1249 = 0.20$.
We estimate $p$ by
$\hat{p} = \bar{X}$,
and $\sigma^2$ by
$\hat{\sigma}^2 = \hat{p}(1 - \hat{p}) = (0.2)(0.8) = 0.16$.


Step 2. Find the number of std. errors needed for 95% coverage:
1.96

Step 3. Add and subtract 1.96 std. errors from the sample mean.
Std. error $= \sqrt{\hat{\sigma}^2/n} = \sqrt{0.16/1249} = 0.011$
Lower limit $= 0.20 - (1.96)(0.011) = 0.18$
Upper limit $= 0.20 + (1.96)(0.011) = 0.22$

Confidence interval: (0.18, 0.22)

How to make this a 90% interval?
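
The calculation can be reproduced with unrounded inputs. The helper `proportion_ci` below is a hypothetical name for this sketch. Swapping the multiplier 1.96 for 1.65 gives the 90% interval, which is slightly narrower.

```python
import math

def proportion_ci(cases, n, z):
    """Interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n) for a Bernoulli sample."""
    p_hat = cases / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated std. error of the sample mean
    return p_hat - z * se, p_hat + z * se

for label, z in (("95%", 1.96), ("90%", 1.65)):
    lower, upper = proportion_ci(250, 1249, z)
    print(f"{label} CI: ({lower:.3f}, {upper:.3f})")
```

To two decimals both intervals print as (0.18, 0.22), so the sketch shows three decimals to make the difference in width visible.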


Example 2: Hospitalization data


Find a 90% confidence interval for mean number of hospitalizations
Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
 numhosp |     787    .5870394    1.036723         0          7

Step 1: Use summary statistics to obtain key values
Sample mean = 0.59
Sample SD = 1.04
Std error of sample mean $= 1.04/\sqrt{787} = 0.037$

Step 2: Coverage probability is 90%. Add and subtract 1.65 SEs
Step 3: Compute interval
$0.59 \pm (1.65)(0.037) \Rightarrow (0.53, 0.65)$
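
The same steps can be carried out with the unrounded Stata values. This is only a sketch; the variable names are illustrative. The unrounded standard error is about 0.037, so results may differ slightly from figures rounded by hand at intermediate steps.

```python
import math

# Summary statistics copied from the Stata output.
n, mean, sd = 787, 0.5870394, 1.036723
z = 1.65  # multiplier for 90% coverage, as used in the slides

se = sd / math.sqrt(n)                     # standard error of the sample mean
lower, upper = mean - z * se, mean + z * se

print(f"SE = {se:.4f}")
print(f"90% CI: ({lower:.2f}, {upper:.2f})")
```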