CVEN2002 Week8

Statistics
CVEN2002/2702
Week 8
This lecture
7. Inferences concerning a mean

7.6 Confidence interval on the mean of a distribution, variance
unknown
7.7 Prediction intervals
7.8 Confidence interval on a proportion
Additional reading: Sections 5.6 (pp. 234-235), 7.2, 7.3 (pp. 303-306),
7.4 in the textbook
CVEN2002/2702 (Statistics)
Dr Justin Wishart
Session 2, 2012 - Week 8
7.6 CI on the mean, unknown $\sigma^2$
Confidence interval on the mean of a distribution, variance unknown

Previously we showed how to build confidence intervals for the mean $\mu$
of a distribution, assuming that the population variance $\sigma^2$ was known
→ this is probably not very realistic!
Suppose now that the population variance $\sigma^2$ is not known
→ we can no longer make practical use of the core result
$$Z = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \overset{(a)}{\sim} \mathcal{N}(0, 1)$$
However, from the random sample $X_1, X_2, \ldots, X_n$ we have a natural
estimator of the unknown $\sigma^2$: the sample variance (Slide 38, Week 2)
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2,$$
which will provide an estimated sample variance
$s^2 = \frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2$
upon observation of a sample $x_1, x_2, \ldots, x_n$
Confidence interval on the mean of a normal distribution, variance unknown

A natural procedure is thus to replace $\sigma$ with the sample standard
deviation $S$, and to work with the random variable
$$T = \sqrt{n}\,\frac{\bar{X} - \mu}{S}$$
In the case of a normal population, $Z$ was just a standardised version
of a normal r.v. $\bar{X}$ and was therefore $\mathcal{N}(0, 1)$-distributed
However, $T$ is now a ratio of two random variables ($\bar{X}$ and $S$)
→ $T$ is not $\mathcal{N}(0, 1)$-distributed!
Indeed, $T$ cannot have exactly the same distribution as $Z$, as the
approximation of the constant $\sigma$ by a random variable $S$ introduces
some extra variability
→ the random variable $T$ varies more in value from sample to sample
than $Z$ (i.e. $\mathrm{Var}(T) > \mathrm{Var}(Z)$)
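A quick simulation can make the claim $\mathrm{Var}(T) > \mathrm{Var}(Z)$ concrete. The sketch below is illustrative only: the sample size n = 5, the population values mu = 10 and sigma = 2, the number of replications and the seed are arbitrary assumptions, not values from the lecture.

```python
import math
import random
import statistics


def simulate_z_and_t(n=5, mu=10.0, sigma=2.0, reps=20000, seed=1):
    """Draw `reps` normal samples of size n and return the Z and T values.

    All default values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    z_values, t_values = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
        z_values.append(math.sqrt(n) * (xbar - mu) / sigma)  # sigma known
        t_values.append(math.sqrt(n) * (xbar - mu) / s)      # sigma replaced by s
    return z_values, t_values


z_values, t_values = simulate_z_and_t()
var_z = statistics.variance(z_values)
var_t = statistics.variance(t_values)
print(f"simulated Var(Z) = {var_z:.2f}, simulated Var(T) = {var_t:.2f}")
```

With small n the simulated variance of T should come out clearly larger than that of Z, which stays close to 1.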
The Student's t-distribution

The first person who realised that replacing $\sigma$ with an estimate did
affect the distribution of $Z$ was William Gosset (1876-1937), a British
chemist and mathematician who, in the early 20th century, worked at
the Guinness Brewery in Dublin
As the story goes, another researcher at Guinness had previously
published a paper containing trade secrets of the Guinness brewery,
so that Guinness prohibited its employees from publishing any
scientific papers regardless of the contained information
→ Gosset used the pseudonym "Student" for his publications to avoid
their detection by his employer
He showed that, in a normal population, the exact distribution of $T$ is
the so-called t-distribution with $n - 1$ degrees of freedom:
$$T \sim t_{n-1}$$
This distribution is now referred to as Student's t-distribution (which
might otherwise have been Gosset's t-distribution)
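
A random variable, say $T$, is said to follow the Student's t-distribution
with $\nu$ degrees of freedom, i.e.
$$T \sim t_\nu$$
Its probability density function is given by
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \qquad \to \quad S_T = \mathbb{R}$$
for some integer $\nu$

Note: the Gamma function is given by
$$\Gamma(y) = \int_0^{+\infty} x^{y-1} e^{-x}\,dx, \qquad \text{for } y > 0$$
It can be shown that $\Gamma(y) = (y-1)\,\Gamma(y-1)$, so that, if $y$ is a positive
integer $n$,
$$\Gamma(n) = (n-1)!$$
There is no simple expression for the Student's t-cdf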
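The density above can be evaluated directly with the Gamma function from Python's standard library. This is a minimal sketch; the function name `t_pdf` is an illustrative choice, and `math.gamma` overflows for very large arguments, so it is only suitable for moderate degrees of freedom.

```python
import math


def t_pdf(t, nu):
    """Density of the Student's t-distribution with nu degrees of freedom."""
    const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return const * (1 + t * t / nu) ** (-(nu + 1) / 2)


# Gamma(n) = (n - 1)! for a positive integer n, e.g. Gamma(5) = 4! = 24
gamma_5 = math.gamma(5)
factorial_4 = math.factorial(4)

# For nu = 1 the constant reduces to 1 / pi, so f(0) = 1 / pi
density_at_zero = t_pdf(0.0, 1)
print(gamma_5, factorial_4, density_at_zero)
```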
[Figure: Student's t distribution with 1 degree of freedom; panels show the pdf $f(t) = F'(t)$ and the cdf $F(t)$]

[Figure: Student's distributions and standard normal; densities $f(t)$ for $t_1$, $t_2$, $t_5$, $t_{10}$, $t_{50}$ and $\mathcal{N}(0,1)$]

It can be shown that the mean and the variance of the $t_\nu$-distribution
are
$$\mathbb{E}(T) = 0 \qquad \text{and} \qquad \mathrm{Var}(T) = \frac{\nu}{\nu - 2} \qquad (\text{for } \nu > 2)$$
The Student's t distribution is similar in shape to the standard normal
distribution in that both densities are symmetric, unimodal and
bell-shaped, and the maximum value is reached at 0
However, the Student's t distribution has heavier tails than the normal
→ there is more probability to find the random variable $T$ far away
from 0 than there is for $Z$
This is more marked for small values of $\nu$
As the number $\nu$ of degrees of freedom increases, $t_\nu$-distributions look
more and more like the standard normal distribution
In fact, it can be shown that the Student's t distribution with $\nu$ degrees
of freedom approaches the standard normal distribution as $\nu \to \infty$
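The variance formula and the heavier tails can be checked numerically. The sketch below is only an illustration: the function names are assumptions, the density is computed on the log scale with `math.lgamma` to avoid overflow, and the comparison point t = 3 is an arbitrary choice.

```python
import math


def t_variance(nu):
    """Variance nu / (nu - 2) of the t-distribution, defined for nu > 2."""
    if nu <= 2:
        raise ValueError("the variance formula nu / (nu - 2) requires nu > 2")
    return nu / (nu - 2)


def t_pdf(t, nu):
    """Student's t density, evaluated on the log scale for numerical stability."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(t * t / nu))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


for nu in (3, 5, 10, 50):
    heavier_tail = t_pdf(3.0, nu) > normal_pdf(3.0)
    print(f"nu = {nu:2d}: Var(T) = {t_variance(nu):.3f}, heavier tail at t = 3: {heavier_tail}")
```

The variance decreases towards 1 as nu grows, in line with the convergence to the standard normal distribution.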
The Student's t-distribution: quantiles

Similarly to what we did for the Normal distribution, we can define the
quantiles of any Student's t-distribution:
Let $t_{\nu;\alpha}$ be the value such that
$$\mathbb{P}(T > t_{\nu;\alpha}) = 1 - \alpha$$
for $T \sim t_\nu$
Like the standard normal distribution, the symmetry of any
$t_\nu$-distribution implies that
$$t_{\nu;1-\alpha} = -t_{\nu;\alpha}$$
[Figure: pdf $f(t)$ of a t-distribution, with the quantile $t_{\nu;\alpha}$ and the probability $1 - \alpha$ marked]
For any $\nu$, the main quantiles of interest may be found in the
t-distribution critical values tables
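When no table is at hand, a quantile can be approximated numerically. The sketch below is one possible approach, not the method used to build the tables: it integrates the density with Simpson's rule and inverts the cdf by bisection. The function names, the number of integration steps and the number of bisection iterations are all assumptions.

```python
import math


def t_pdf(t, nu):
    """Student's t density with nu degrees of freedom (log scale for stability)."""
    log_const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                 - 0.5 * math.log(nu * math.pi))
    return math.exp(log_const - (nu + 1) / 2 * math.log1p(t * t / nu))


def t_cdf(t, nu, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and the symmetry about 0."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def t_quantile(p, nu):
    """Value q with P(T <= q) = p, i.e. t_{nu;p} in the notation of the slides."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if p < 0.5:
        return -t_quantile(1 - p, nu)  # symmetry: t_{nu;1-p} = -t_{nu;p}
    lo, hi = 0.0, 1.0
    while t_cdf(hi, nu) < p:  # widen the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):       # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


q = t_quantile(0.995, 21)
print(f"t_(21;0.995) is approximately {q:.3f}")
```

The result can be compared with the tabulated value $t_{21;0.995} = 2.831$ used in the example later in this lecture.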
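Confidence interval on the mean of a normal distribution, variance unknown

So we have, for any $n \geq 2$,
$$T = \sqrt{n}\,\frac{\bar{X} - \mu}{S} \sim t_{n-1}$$
Note: the number of degrees of freedom for the t-distribution is the
number of degrees of freedom associated with the estimated variance
$S^2$ (Slide 38, Week 2)
It is now easy to find a $100 \times (1-\alpha)\%$ confidence interval for $\mu$ by
proceeding essentially as we did when $\sigma^2$ was known
We may write
$$\mathbb{P}\left(-t_{n-1;1-\alpha/2} \leq \sqrt{n}\,\frac{\bar{X} - \mu}{S} \leq t_{n-1;1-\alpha/2}\right) = 1 - \alpha$$
or
$$\mathbb{P}\left(\bar{X} - t_{n-1;1-\alpha/2}\,\frac{S}{\sqrt{n}} \leq \mu \leq \bar{X} + t_{n-1;1-\alpha/2}\,\frac{S}{\sqrt{n}}\right) = 1 - \alpha$$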
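t-confidence interval on the mean of a normal distribution

→ if $\bar{x}$ and $s$ are the sample mean and sample standard deviation of
an observed random sample of size $n$ from a normal distribution, a
confidence interval of level $100 \times (1-\alpha)\%$ for $\mu$ is given by
$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$$
This confidence interval is sometimes called a t-confidence interval, as
opposed to $\left[\bar{x} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$ (z-confidence interval)
Because $t_{n-1}$ has heavier tails than $\mathcal{N}(0, 1)$, $t_{n-1;1-\alpha/2} > z_{1-\alpha/2}$ for any $n$
→ this reflects the extra variability introduced by the estimation of
$\sigma$ (less accuracy)
Note: One can also define one-sided $100 \times (1-\alpha)\%$ t-confidence
intervals
$\left(-\infty,\ \bar{x} + t_{n-1;1-\alpha}\,\frac{s}{\sqrt{n}}\right]$ and $\left[\bar{x} - t_{n-1;1-\alpha}\,\frac{s}{\sqrt{n}},\ +\infty\right)$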
t-confidence interval: example

Example
An article in Materials Engineering describes the results of tensile adhesion
tests on 22 U-700 alloy specimens. The load at specimen failure is as
follows (in megapascals):
7.6, 8.1, 11.7, 14.3, 14.3, 14.1, 8.3, 12.3, 15.9, 16.4,
11.3, 12.0, 12.9, 15.0, 13.2, 14.6, 13.5, 10.4, 13.8,
15.6, 12.2, 11.2
Construct a 99% confidence interval for the true average load at failure for
this type of alloy
Elementary computations give
$\bar{x} = 12.67$ MPa
and $s = 2.47$ MPa
t-confidence interval: example

The quantile plot below provides good support for the assumption that the
population is normally distributed
[Figure: Normal Q-Q plot of the data; axes: sample quantiles and theoretical quantiles]
Since $n = 22$, we have $n - 1 = 21$ degrees of freedom for $t$. In the table, we
find $t_{21;0.995} = 2.831$. The resulting CI is
$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}}\right] = \left[12.67 \pm 2.831 \times \frac{2.47}{\sqrt{22}}\right] = [11.18, 14.16]$$
→ we are 99% confident that the true average load at failure for this type of
alloy lies between 11.18 MPa and 14.16 MPa
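The example can be reproduced from the raw data. In the sketch below the critical value $t_{21;0.995} = 2.831$ is taken from the table, as in the slides, rather than computed; the function name `t_confidence_interval` is an illustrative choice.

```python
import math
import statistics

# Load at specimen failure (MPa) for the 22 specimens of the example
loads = [7.6, 8.1, 11.7, 14.3, 14.3, 14.1, 8.3, 12.3, 15.9, 16.4,
         11.3, 12.0, 12.9, 15.0, 13.2, 14.6, 13.5, 10.4, 13.8,
         15.6, 12.2, 11.2]


def t_confidence_interval(sample, t_crit):
    """Two-sided t-confidence interval for the mean, given the critical value."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


T_21_0995 = 2.831  # tabulated critical value for 21 degrees of freedom, 99% level
lower, upper = t_confidence_interval(loads, T_21_0995)
print(f"99% CI for the mean load: [{lower:.2f}, {upper:.2f}] MPa")
```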
Confidence interval on the mean of an arbitrary distribution

What if the population is not normal?
As in the case $\sigma^2$ known, we can rely on the Central Limit Theorem
which asserts that, for $n$ large, $Z = \sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \overset{a}{\sim} \mathcal{N}(0, 1)$, to deduce a
result like
$$T = \sqrt{n}\,\frac{\bar{X} - \mu}{S} \overset{a}{\sim} t_{n-1}$$
from which we could find a CI on $\mu$ for $n$ large enough
However, recall that, when $\nu$ is large, $t_\nu$ is very much like $\mathcal{N}(0, 1)$
→ in large samples, estimating $\sigma$ with $S$ has very little effect on the
distribution of $T$, for which the standard normal approximation
is more than adequate:
$$T \overset{a}{\sim} \mathcal{N}(0, 1)$$

Confidence interval on the mean of an arbitrary distribution

Consequently, if $\bar{x}$ and $s$ are the sample mean and standard deviation
of an observed random sample of large size $n$ from any distribution, an
approximate confidence interval of level $100 \times (1-\alpha)\%$ for $\mu$ is
$$\left[\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$$
This expression holds regardless of the population distribution, as long
as $n$ is large enough → it is called a large-sample confidence interval
Generally, $n$ should be at least 40 to use this result reliably (the CLT
usually holds for $n \geq 30$, but a larger sample size is recommended
because replacing $\sigma$ by $S$ still results in some additional variability)
As usual, corresponding one-sided confidence intervals could be
defined: $\left(-\infty,\ \bar{x} + z_{1-\alpha}\,\frac{s}{\sqrt{n}}\right]$ and $\left[\bar{x} - z_{1-\alpha}\,\frac{s}{\sqrt{n}},\ +\infty\right)$

Confidence interval on the mean of an arbitrary distribution: example

Example
An article in Transactions of the American Fisheries Society reports the
results of a study to investigate the mercury contamination in largemouth
bass. A sample of 53 fish was selected from some Florida lakes, and
mercury concentration in the muscle tissue was measured (in ppm):
1.23, 0.49, 1.08, . . ., 0.16, 0.27
Find a confidence interval on $\mu$, the mean mercury concentration in the
muscle tissue of fish
A histogram and a quantile plot for the data are displayed below
→ both plots indicate that mercury concentration may not
be normally distributed (positively skewed)
But anyway, the sample is large enough ($n = 53$) to use the Central Limit
Theorem and compute an approximate confidence interval for $\mu$

Confidence interval on the mean of an arbitrary distribution: example

[Figure: histogram of mercury concentration (density scale) and Normal Q-Q plot (sample quantiles against theoretical quantiles)]
Elementary computations give $\bar{x} = 0.525$ ppm and $s = 0.3486$ ppm. A large
sample confidence interval is given by $\left[\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$
With $z_{0.975} = 1.96$ and the above values, we have
$$\left[0.525 - 1.96 \times \frac{0.3486}{\sqrt{53}},\ 0.525 + 1.96 \times \frac{0.3486}{\sqrt{53}}\right] = [0.4311, 0.6189]$$
→ we are 95% confident that the true average mercury concentration in the
muscle tissue of the fish is between 0.4311 and 0.6189 ppm
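The same computation can be written as a small function working from the summary statistics. This is a sketch: the function name and the default level of 95% are assumptions, and the normal quantile comes from `statistics.NormalDist` instead of a table, so the last digits may differ slightly from a calculation with 1.96.

```python
import math
from statistics import NormalDist


def large_sample_ci(xbar, s, n, level=0.95):
    """Approximate two-sided CI for the mean, valid for large n (CLT)."""
    alpha = 1 - level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# Summary statistics of the mercury example: xbar = 0.525 ppm, s = 0.3486 ppm, n = 53
lower, upper = large_sample_ci(0.525, 0.3486, 53)
print(f"approximate 95% CI: [{lower:.4f}, {upper:.4f}] ppm")
```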
Confidence intervals for the mean: summary

The several situations leading to different confidence intervals for the
mean can be summarised as follows:
The first question is: Is the population normal? (check from a
histogram and a quantile plot, for instance)
If the population is normal, is $\sigma$ known?
If $\sigma$ is known, use an exact z-confidence interval:
$$\left[\bar{x} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$$
If $\sigma$ is not known, use an exact t-confidence interval:
$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$$
If the population is not normal, use an approximate large sample confidence interval:
$$\left[\bar{x} - z_{1-\alpha/2}\,\frac{s}{\sqrt{n}},\ \bar{x} + z_{1-\alpha/2}\,\frac{s}{\sqrt{n}}\right]$$
(provided the sample size is large, say $n \geq 40$)
What if the sample size is small and the population is not normal?
→ check on a case by case basis (beyond the scope of this course)
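The decision rules of this summary can be written as a short function. The sketch below only encodes the rules stated on the slide; the function name, the returned labels and the `large_n` parameter (set to the suggested threshold of 40) are illustrative assumptions.

```python
def choose_interval(population_normal, sigma_known, n, large_n=40):
    """Return which confidence interval for the mean the summary recommends."""
    if population_normal:
        if sigma_known:
            return "exact z-confidence interval (uses sigma)"
        return "exact t-confidence interval (uses s and t with n - 1 d.f.)"
    if n >= large_n:
        return "approximate large-sample confidence interval (uses s and z)"
    return "case by case (beyond the scope of this course)"


print(choose_interval(population_normal=True, sigma_known=False, n=22))
print(choose_interval(population_normal=False, sigma_known=False, n=53))
```

The two calls correspond to the tensile adhesion example (normal population, unknown variance, n = 22) and the mercury example (skewed population, n = 53).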
Prediction interval for a future observation

In some situations, we may be interested in predicting a future
observation of a variable
→ different from estimating the mean of the variable!
→ instead of confidence intervals, we are after a
$100 \times (1-\alpha)\%$ prediction interval on a future observation
As an illustration, suppose that $X_1, X_2, \ldots, X_n$ is a random sample from
a normal population with mean $\mu$ and standard deviation $\sigma$
→ we wish to predict the value $X_{n+1}$, a single future observation
As $X_{n+1}$ comes from the same population as $X_1, X_2, \ldots, X_n$,
information contained in the sample should be used to predict $X_{n+1}$
→ the predictor of $X_{n+1}$, say $X^*$, should be a statistic

We desire the predictor to have expected prediction error equal to 0:
$$\mathbb{E}(X_{n+1} - X^*) = 0 \iff \mathbb{E}(X^*) = \mu$$
→ the predictor $X^*$ must be an unbiased estimator for $\mu$!
We said that an efficient unbiased estimator for $\mu$ was the sample
mean, so we take it as predictor:
$$X^* = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Now, the variance of the prediction error is
$$\mathrm{Var}(X_{n+1} - X^*) = \mathrm{Var}(X_{n+1} - \bar{X}) = \mathrm{Var}(X_{n+1}) + \mathrm{Var}(\bar{X}) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2 \left(1 + \frac{1}{n}\right)$$
(because $X_{n+1}$ is independent of $X_1, X_2, \ldots, X_n$ and so of $\bar{X}$)
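
Finally, because both $X_{n+1}$ and $\bar{X}$ are normally distributed (normal
population), the prediction error $X_{n+1} - \bar{X}$ is also normally distributed
Hence,
$$Z = \frac{X_{n+1} - \bar{X}}{\sigma \sqrt{1 + \frac{1}{n}}} \sim \mathcal{N}(0, 1)$$
Replacing the possibly unknown $\sigma$ with the sample standard deviation
$S$ yields
$$T = \frac{X_{n+1} - \bar{X}}{S \sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$$
Manipulating $Z$ and $T$ as we did previously for CI leads to the
$100 \times (1-\alpha)\%$ z- and t-prediction intervals on the future observation:
$$\left[\bar{x} - z_{1-\alpha/2}\,\sigma \sqrt{1 + \frac{1}{n}},\ \bar{x} + z_{1-\alpha/2}\,\sigma \sqrt{1 + \frac{1}{n}}\right]$$
$$\left[\bar{x} - t_{n-1;1-\alpha/2}\,s \sqrt{1 + \frac{1}{n}},\ \bar{x} + t_{n-1;1-\alpha/2}\,s \sqrt{1 + \frac{1}{n}}\right]$$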
Prediction interval for a future observation: remarks

Remark 1:
The length of a confidence interval on $\mu$ of level $100 \times (1-\alpha)\%$ is
$$2 \times z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
(see Slide 25 Week 7)
The length of a prediction interval on $X_{n+1}$ of level $100 \times (1-\alpha)\%$ is
$$2 \times z_{1-\alpha/2}\,\sigma \sqrt{1 + \frac{1}{n}}$$
Prediction intervals for a single observation will always be longer than
confidence intervals for $\mu$, because there is more variability associated
with one observation than with an average
Remark 2:
As $n$ gets larger ($n \to \infty$), the length of the CI for $\mu$ decreases to 0 (we
are more and more accurate when estimating $\mu$), but this is not the
case for a prediction interval: the inherent variability of $X_{n+1}$ never
vanishes, even when we have made many other observations
before!
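The two remarks can be illustrated by tabulating both lengths for increasing n. In the sketch below the value sigma = 2.47 is an assumed known standard deviation used only for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def ci_length(sigma, n, level=0.95):
    """Length of the z-confidence interval on the mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sigma / math.sqrt(n)


def pi_length(sigma, n, level=0.95):
    """Length of the z-prediction interval on one future observation."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sigma * math.sqrt(1 + 1 / n)


sigma = 2.47  # assumed known standard deviation, for illustration only
for n in (5, 50, 500, 5000):
    print(f"n = {n:4d}: CI length = {ci_length(sigma, n):.3f}, "
          f"PI length = {pi_length(sigma, n):.3f}")
```

The CI length shrinks towards 0 while the PI length levels off at $2 \times z_{1-\alpha/2}\,\sigma$.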
Prediction interval for a future observation: example

Example
Reconsider the example on Slide 13. Find a 99% confidence interval for the
true average load at failure. We plan to test a 23rd specimen. Find a 99%
prediction interval on the load at failure for this specimen.
From the data ($n = 22$) we had found $\bar{x} = 12.67$ MPa and $s = 2.47$ MPa, and
a 99% confidence interval for $\mu$ was $[11.18, 14.16]$
Now, $t_{21;0.995} = 2.831$ (from the table), so that a 99% prediction interval for the
next observation is
$$\left[\bar{x} \pm t_{n-1;1-\alpha/2}\,s \sqrt{1 + \frac{1}{n}}\right] = \left[12.67 \pm 2.831 \times 2.47 \times \sqrt{1 + \frac{1}{22}}\right] = [5.52, 19.82]$$
→ we are 99% confident that the failure load for the next specimen will be
between 5.52 and 19.82 MPa
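The prediction interval of this example can be checked with a few lines of code. This sketch works from the summary statistics and the tabulated critical value given above; the function name is an illustrative choice.

```python
import math


def t_prediction_interval(xbar, s, n, t_crit):
    """Two-sided t-prediction interval for a single future observation."""
    half_width = t_crit * s * math.sqrt(1 + 1 / n)
    return xbar - half_width, xbar + half_width


# xbar = 12.67 MPa, s = 2.47 MPa, n = 22, t_{21;0.995} = 2.831 (from the table)
lower, upper = t_prediction_interval(12.67, 2.47, 22, 2.831)
print(f"99% prediction interval: [{lower:.2f}, {upper:.2f}] MPa")
```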
Inferences concerning proportions

Many engineering problems deal with proportions, percentages or
probabilities:
we are concerned with the proportion of defectives in a lot, with the
percentage of certain components which will perform satisfactorily
during a stated period of time, or with the probability that a newly
produced item meets some quality standards
→ qualitative information can also be included in statistical studies!
It should be clear that problems concerning proportions, percentages
or probabilities are really equivalent: a percentage is merely a
proportion multiplied by 100, and a probability is a proportion in a
(infinitely) long series of trials (Slide 13, Week 3)
We would like to learn about $\pi$, the proportion of the population that
has a characteristic of interest, but as usual all we have is just a
sample of size $n$ from that population
→ inference about $\pi$
→ confidence interval for $\pi$
Estimation of a proportion

In this situation, the random variable to study is
$$X = \begin{cases} 1 & \text{if the individual has the characteristic of interest} \\ 0 & \text{if not} \end{cases}$$
which is Bernoulli distributed (see Slide 9, Week 5), with parameter $\pi$
being the value of interest:
$$X \sim \mathrm{Bern}(\pi)$$
The random sample $X_1, X_2, \ldots, X_n$ is a set of $n$ independent $\mathrm{Bern}(\pi)$
random variables
→ the number $Y$ of individuals of the sample with the characteristic is
$$Y = \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, \pi)$$
and the sample proportion is
$$\hat{P} = \frac{Y}{n}$$
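Estimation of a proportion

This sample proportion $\hat{P}$ is obviously a natural candidate for
estimating the population proportion $\pi$
From the properties of the Binomial distribution, we know that
$$\mathbb{E}(Y) = n\pi \qquad \text{and} \qquad \mathrm{Var}(Y) = n\pi(1-\pi)$$
so that
$$\mathbb{E}(\hat{P}) = \frac{1}{n}\,\mathbb{E}(Y) = \pi \qquad \text{and} \qquad \mathrm{Var}(\hat{P}) = \frac{1}{n^2}\,\mathrm{Var}(Y) = \frac{n\pi(1-\pi)}{n^2} = \frac{\pi(1-\pi)}{n}$$
Hence, $\hat{P}$ is an unbiased and consistent estimator for $\pi$:
$$\mathbb{E}(\hat{P}) = \pi \qquad \text{and} \qquad \mathrm{Var}(\hat{P}) = \frac{\pi(1-\pi)}{n} \quad (\to 0 \text{ as } n \to \infty)$$
→ the standard error of $\hat{P}$ is thus $\mathrm{sd}(\hat{P}) = \sqrt{\frac{\pi(1-\pi)}{n}}$
Upon observation of a random sample $x_1, x_2, \ldots, x_n$, in which
$y = \sum_{i=1}^{n} x_i$ individuals have the characteristic, an estimate of $\pi$ is
$$\hat{p} = \frac{y}{n}$$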
Sampling distribution

We could make inference about $\pi$ from $\hat{p}$ using the Binomial
distribution of $Y$. However, it is probably easier to use the Central
Limit Theorem (Slides 33-34, Week 7). Indeed:
$$\hat{P} = \frac{Y}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i,$$
so that $\hat{P}$ is actually a (particular) sample mean, for which the CLT
guarantees that
$$\sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \overset{a}{\sim} \mathcal{N}(0, 1)$$
if $n$ is large ($\overset{a}{\sim}$ again stands for "approximately follows")
We also know that the quality of the approximation depends on the
symmetry of the initial distribution of the $X_i$'s, here $\mathrm{Bern}(\pi)$
→ $\pi$ should not be too close to 0 or 1 → empirical rule: $n\hat{p}(1-\hat{p}) > 5$
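Confidence interval for a proportion

As the sampling distribution
$$\sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \overset{a}{\sim} \mathcal{N}(0, 1)$$
is just a particular case of $\sqrt{n}\,\frac{\bar{X} - \mu}{\sigma} \overset{a}{\sim} \mathcal{N}(0, 1)$, we can use (almost)
directly the large-sample confidence interval we derived for a mean
Specifically, we have that
$$\mathbb{P}\left(-z_{1-\alpha/2} \leq \sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \leq z_{1-\alpha/2}\right) \simeq 1 - \alpha$$
or
$$\mathbb{P}\left(\hat{P} - z_{1-\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}} \leq \pi \leq \hat{P} + z_{1-\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}}\right) \simeq 1 - \alpha$$
→ a confidence interval for $\pi$ takes shape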
Confidence interval for a proportion

Unfortunately, the standard error of $\hat{P}$, that is the factor $\sqrt{\frac{\pi(1-\pi)}{n}}$,
contains the unknown $\pi$
In such a situation, we may replace the unknown value by its estimate,
that is, use the estimated standard error of the estimator (see Slide
12, Week 7)
$$\widehat{\mathrm{sd}}(\hat{P}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
in the expression of the confidence interval
Consequently, if $\hat{p}$ is the sample proportion in an observed random
sample of size $n$, an approximate two-sided confidence interval of level
$100 \times (1-\alpha)\%$ for $\pi$ is given by
$$\left[\hat{p} - z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$
As this is based on the CLT and requires $n$ large, it is a large sample
confidence interval for $\pi$
One-sided confidence intervals for a proportion

We may also find one-sided large-sample confidence intervals for the
proportion by a simple modification of the previous development
We find:
$$\left[0,\ \hat{p} + z_{1-\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$
and
$$\left[\hat{p} - z_{1-\alpha} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ 1\right]$$
Choice of the sample size

Since $\hat{p}$ is the estimate of $\pi$, we can define the error in estimating $\pi$ by $\hat{p}$
as $e = |\hat{p} - \pi|$. From Slide 29, we are approximately $100 \times (1-\alpha)\%$ confident
that this error is less than
$$z_{1-\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}}$$
In situations where the sample size can be selected, we may choose $n$ to be
$100 \times (1-\alpha)\%$ confident that the error is less than any specified value $e$:
$$n = \left(\frac{z_{1-\alpha/2}}{e}\right)^2 \pi(1-\pi) \qquad \text{(compare Slide 26, Week 7)}$$
→ this depends on $\pi$, for which no information is available at this point
Idea: use an upper bound which holds for any value of $\pi$
Actually, $\pi(1-\pi) \leq 1/4$, with equality for $\pi = 1/2$, thus with
$$n = \left(\frac{z_{1-\alpha/2}}{2e}\right)^2$$
we are at least $100 \times (1-\alpha)\%$ confident that this error is less than $e$ and this,
regardless of the value of $\pi$ (this is very conservative, though)
Confidence interval for a proportion: example

Example
In a random sample of 85 car engine crankshaft bearings, 10 have a surface
finish that is rougher than the specifications allow. a) Find a 95% confidence
interval on the true proportion of produced bearings that exceed the
roughness specification; b) How large a sample is required if we want to be
95% confident that the error in estimating $\pi$ is less than 0.05?
a) The estimate of $\pi$ is $\hat{p} = \frac{y}{n} = \frac{10}{85} = 0.118$. Thus, the estimated standard
error is
$$\widehat{\mathrm{sd}}(\hat{P}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.118 \times (1 - 0.118)}{85}} = 0.035$$
and an approximate two-sided 95% confidence interval for $\pi$ is
$$\left[\hat{p} \pm z_{0.975}\,\widehat{\mathrm{sd}}(\hat{P})\right] = [0.118 \pm 1.96 \times 0.035] = [0.049, 0.186]$$
→ we are 95% confident that the true proportion of produced bearings
outside specifications is between 0.049 and 0.186
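Part a) can be reproduced numerically. The sketch below uses the unrounded value of $\hat{p}$ and takes the normal quantile from `statistics.NormalDist`; the function names are illustrative, and the rule-of-thumb check is the empirical rule $n\hat{p}(1-\hat{p}) > 5$ stated earlier.

```python
import math
from statistics import NormalDist


def proportion_ci(y, n, level=0.95):
    """Large-sample two-sided confidence interval for a proportion."""
    p_hat = y / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat - z * se_hat, p_hat + z * se_hat


def rule_of_thumb_ok(y, n):
    """Empirical rule for the normal approximation: n * p_hat * (1 - p_hat) > 5."""
    p_hat = y / n
    return n * p_hat * (1 - p_hat) > 5


# 10 bearings out of 85 exceed the roughness specification
lower, upper = proportion_ci(10, 85)
print(f"rule of thumb satisfied: {rule_of_thumb_ok(10, 85)}")
print(f"approximate 95% CI: [{lower:.3f}, {upper:.3f}]")
```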
Confidence interval for a proportion: example

b) the previous CI has width 0.137, which is quite large for a CI for $\pi$. If we
want a CI of width at most $2 \times 0.05$, we need
$$n = \left(\frac{z_{1-\alpha/2}}{2e}\right)^2 = \left(\frac{1.96}{2 \times 0.05}\right)^2 = 384.16$$
→ we need at least 385 observations
Note that this number would guarantee the required accuracy, regardless of
the true value of $\pi$ → this is why it is so high (conservative)
(Using $\hat{p} = 0.118$ as preliminary estimate of $\pi$, we would have
$n \simeq \left(z_{1-\alpha/2}/e\right)^2 \hat{p}(1-\hat{p}) = (1.96/0.05)^2 \times 0.118 \times 0.882 = 159.93$)
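Both sample-size calculations of part b) fit in one function. This is a sketch: the function name and the optional `pi_guess` argument are hypothetical, and the result is rounded up to the next integer since a sample size must be a whole number.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(e, level=0.95, pi_guess=None):
    """Sample size so that the error in estimating a proportion is below e.

    Without a preliminary guess, the conservative bound pi(1 - pi) <= 1/4 is used.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    bound = 0.25 if pi_guess is None else pi_guess * (1 - pi_guess)
    return math.ceil((z / e) ** 2 * bound)


n_conservative = sample_size_for_proportion(0.05)
n_with_guess = sample_size_for_proportion(0.05, pi_guess=0.118)
print(n_conservative, n_with_guess)
```

The first value is the conservative answer of the example; the second uses $\hat{p} = 0.118$ as a preliminary estimate and rounds 159.93 up to the next integer.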
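Alternative confidence interval for a proportion

The previous confidence interval does not work well near the boundary, i.e.,
when $\pi$ is near 0 or 1. Recall the probability statement
$$\mathbb{P}\left(-z_{1-\alpha/2} \leq \sqrt{n}\,\frac{\hat{P} - \pi}{\sqrt{\pi(1-\pi)}} \leq z_{1-\alpha/2}\right) \simeq 1 - \alpha$$
The three-way inequality is equivalent to a quadratic inequality in $\pi$; solving
it gives the following confidence interval
$$\left[(1-w)\,\hat{p} + w\,\frac{1}{2} \pm z_{1-\alpha/2} \sqrt{(1-w)\,\frac{\hat{p}(1-\hat{p})}{n + z_{1-\alpha/2}^2} + w\,\frac{1}{4\left(n + z_{1-\alpha/2}^2\right)}}\right]$$
where
$$w = \frac{z_{1-\alpha/2}^2}{n + z_{1-\alpha/2}^2}$$
* First derived by EB Wilson (1927)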
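The alternative interval is straightforward to compute from the formula with the weight $w$. The sketch below applies it to the bearings data (10 out of 85) purely as an illustration; the function name is an assumption.

```python
import math
from statistics import NormalDist


def wilson_interval(y, n, level=0.95):
    """Alternative (Wilson) confidence interval for a proportion."""
    p_hat = y / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    z2 = z * z
    w = z2 / (n + z2)                    # weight given to the value 1/2
    centre = (1 - w) * p_hat + w / 2     # shrinks p_hat towards 1/2
    half_width = z * math.sqrt((1 - w) * p_hat * (1 - p_hat) / (n + z2)
                               + w / (4 * (n + z2)))
    return centre - half_width, centre + half_width


lower, upper = wilson_interval(10, 85)
print(f"alternative 95% CI: [{lower:.3f}, {upper:.3f}]")

# With no individuals showing the characteristic, the lower bound sits at 0
# (up to rounding error) instead of falling below it
lower_zero, upper_zero = wilson_interval(0, 85)
print(f"y = 0: [{lower_zero:.3f}, {upper_zero:.3f}]")
```

Unlike the large-sample interval, this interval is not centred at $\hat{p}$ and stays inside $[0, 1]$.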
Objectives
Now you should be able to:
Construct z- and t-confidence intervals on the mean of a normal
distribution, using either the normal distribution or the
Student's t distribution as appropriate
Construct large sample confidence intervals on a mean of an
arbitrary distribution with unknown variance
Explain the difference between a confidence interval and a
prediction interval
Construct prediction intervals for a future observation in a normal
population
Construct confidence intervals on a population proportion
Recommended exercises: → Q7, Q9 p.301, Q13, Q15 p.302, Q20
p.303, Q35 p.319, Q39 p.320, Q43(a-b) p.320, Q55 p.328, (optional)
Q71, Q73 p.340, Q55 p.238
