
2. SAMPLING DISTRIBUTION

2.1 Preliminaries
• Numerical descriptive measures computed from the population measurements
  are called parameters. A statistic is a quantity calculated from the
  observations in a sample.

• Population mean: µ; sample mean: x̄

• Population variance: σ² = Σ_{i=1}^{N} (xi − µ)² / N

• Sample variance: s² = Σ_{i=1}^{n} (xi − x̄)² / (n − 1)

• The standard error of a statistic is the standard deviation of the sampling
  distribution of that statistic.

2.1.1 Introduction
In M25A, you were introduced to some useful random variables and their
probability distributions. In practical sampling situations, we select a sample of n
observations and use these measurements to calculate statistics such as the sample
mean and variance. These statistics are used to make inferences about the
corresponding parameters in the sampled population. Since the value of a statistic
depends upon the observed values in the sample, a statistic is itself a random
variable that may be discrete or continuous. The probability distribution of a
statistic is called its sampling distribution, since it describes the behaviour of the
statistic in repeated sampling.

The sampling distribution of a statistic is the probability distribution for the
values of the statistic that result when random samples of size n are repeatedly
drawn from the population.
The sampling distribution may be derived mathematically or approximated
empirically. Empirical approximations are found by drawing a large number of
samples of size n from the specified population, calculating the value of the
statistic for each sample, and tabulating the results in a relative frequency
histogram. When the number of samples is large, the relative frequency histogram
should closely approximate the theoretical sampling distribution.
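The empirical procedure just described can be sketched in Python (an illustrative simulation, not part of the original notes); the population values are those used in Section 2.2:

```python
# Empirical approximation of a sampling distribution: draw many samples of
# size n, compute the statistic (here the mean) for each, and tabulate the
# relative frequencies.
import random
from collections import Counter

population = [3, 6, 9, 12, 15]   # population from Section 2.2
n, num_samples = 3, 100_000

random.seed(1)
# random.sample draws without replacement, matching simple random sampling
means = [sum(random.sample(population, n)) / n for _ in range(num_samples)]

rel_freq = {m: round(c / num_samples, 3) for m, c in sorted(Counter(means).items())}
print(rel_freq)  # relative frequencies approximate p(x̄)
```

With a large number of samples the relative frequencies settle close to the exact probabilities derived in Section 2.2.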

2.2 Sampling Distributions
Consider a random sample of size n = 3 drawn without replacement from a
population of N = 5 elements. A simple random sample of size n is selected in
such a way that every sample of size n has the same probability of being selected,
equal to 1/(NCn), where NCn = C(N, n) is the number of possible samples.

Suppose we have a population of N = 5 elements whose values are 3, 6, 9, 12 and
15. Since the five elements are distinct, the population probability distribution is:

p(x) = 1/5 for x = 3, 6, 9, 12, 15

Sample   Sample values   x̄    m
1        3, 6, 9          6    6
2        3, 6, 12         7    6
3        3, 6, 15         8    6
4        3, 9, 12         8    9
5        3, 9, 15         9    9
6        3, 12, 15       10   12
7        6, 9, 12         9    9
8        6, 9, 15        10    9
9        6, 12, 15       11   12
10       9, 12, 15       12   12

The table above shows the values of x̄ and m (the median) associated with each
sample; each sample is assigned probability 1/10. So we will observe a value of
x̄ = 6 only if sample 1 is selected, and this occurs with probability 0.1. A value of
x̄ = 8 will occur if sample 3 or sample 4 is drawn; therefore, the probability of
observing x̄ = 8 is 0.2.
Hence, the sampling distribution of x̄ is shown below.

x̄     p(x̄)
6     0.1
7     0.1
8     0.2
9     0.2
10    0.2
11    0.1
12    0.1
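For a small population the sampling distribution can also be obtained exactly by enumerating all C(5, 3) = 10 samples; the following Python sketch reproduces the distribution above:

```python
# Exact sampling distribution of x̄ by enumerating every possible sample.
from itertools import combinations
from collections import Counter
from fractions import Fraction

population = [3, 6, 9, 12, 15]
samples = list(combinations(population, 3))      # all C(5, 3) = 10 samples
counts = Counter(sum(s) / 3 for s in samples)    # count each sample mean

# each sample has probability 1/10, so p(x̄) = count / 10
p_xbar = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}
for m, p in p_xbar.items():
    print(m, float(p))
```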

2.3 Central Limit Theorem


If random samples of n observations are drawn from a non-normal population
with finite mean µ and standard deviation σ, then when n is large, the sampling
distribution of the sample mean x̄ is approximately normal, with mean and
standard deviation:

µ_x̄ = µ and σ_x̄ = σ/√n

Diagrams
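A minimal simulation illustrating the theorem (the exponential population is an assumed example, chosen because it is clearly non-normal; it has mean and standard deviation both equal to 1):

```python
# CLT check: means of samples from a skewed exponential(1) population are
# approximately normal with mean µ = 1 and sd σ/√n = 1/√50 ≈ 0.141.
import random
from math import sqrt

random.seed(2)
n, reps = 50, 20_000

means = [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
grand_mean = sum(means) / reps
sd_of_means = sqrt(sum((m - grand_mean) ** 2 for m in means) / reps)

print(round(grand_mean, 3), round(sd_of_means, 3))
```

The simulated mean and standard deviation of x̄ agree closely with µ and σ/√n, even though the population itself is far from normal.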

2.4 Sampling distribution: Sample Mean
The standard deviation of a statistic used as an estimator of a population parameter is often
called the standard error of the estimator, since it refers to the precision of the estimator.
Thus, the standard deviation of x is referred to as the standard error (s.e.) of the mean.

Example 2.1
Suppose that you select a random sample of n = 25 observations from a
population with mean µ = 8 and σ = 0.6 . Find the probability that the sample
mean x will
a) be less than 7.9   b) exceed 7.9   c) lie within 0.1 of µ = 8

Solution
a) Since n = 25 is relatively large, the sampling distribution of x̄ is
approximately normal by the CLT.

Now, σ_x̄ = σ/√n = 0.6/√25 = 0.12

P(x̄ < 7.9) = P( (x̄ − µ)/σ_x̄ < (7.9 − 8.0)/0.12 )

⇒ P(Z < −0.83) = 1 − 0.7967 = 0.2033

[Diagram: the probability that x̄ is less than 7.9]

b) P ( x > 7.9 ) = P ( Z > −0.83) = 0.7967

c) P ( 7.9 < x < 8.1) = P ( −0.83 < Z < 0.83) = 0.7967 − 0.2033 = 0.5934

[Diagram: the probability that x̄ lies within 0.1 of µ]
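The three probabilities can be checked in Python using the exact standard normal CDF (built from `math.erf`) rather than table values, so the answers differ slightly from those obtained with the rounded value z = −0.83:

```python
# Reproducing Example 2.1 with the exact normal CDF.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 8.0, 0.6, 25
se = sigma / sqrt(n)                                 # σ_x̄ = 0.12

p_a = phi((7.9 - mu) / se)                           # P(x̄ < 7.9)
p_b = 1 - p_a                                        # P(x̄ > 7.9)
p_c = phi((8.1 - mu) / se) - phi((7.9 - mu) / se)    # P(7.9 < x̄ < 8.1)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```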

2.5 Sampling distribution: Sample Proportion


Consider a sampling problem involving consumer preference or an opinion poll;
we are concerned with estimating the proportion p of people in the population
who possess some specific characteristic. These are practical examples of
binomial experiments, provided the sampling procedure has been conducted in
the appropriate manner.

(i) If a random sample of n observations is selected from a binomial
population with parameter p, then the sampling distribution of the
sample proportion

p̂ = x/n will have: µ_p̂ = p and σ_p̂ = √(pq/n)

(ii) When the sample size is large, the sampling distribution of p̂ can be
approximated by a normal distribution. The approximation will be
adequate if µ pˆ − 2σ pˆ and µ pˆ + 2σ pˆ fall in the interval 0 to 1.

(iii) A rule of thumb for the approximation to be satisfactory is that np > 5
and nq > 5

(iv) We use this normal approximation to evaluate the probability that the
binomial variable Y is less than or greater than a particular value y. This
y is an integer, so we must take account of the fact that we are approximating a
discrete random variable Y by a continuous random variable X. So, we
think of the probability mass corresponding to the value y as being spread
over the interval (y − ½, y + ½).

Hence, using a continuity correction:

P(Y ≤ y) ≈ P(X ≤ y + ½), i.e. adding a half

P(Y ≥ y) ≈ P(X ≥ y − ½), i.e. subtracting a half

When X is continuous, P(X = x) = 0 for any x, so we can specify probabilities
over intervals only, not at points. However, using a normal approximation we
can take P(Y = y) to equal P(y − ½ ≤ X ≤ y + ½). So P(Y ≥ y) and
P(Y > y) are not the same; they differ by an amount equal to P(Y = y).

Hence, using the continuity correction:

P(Y < y) ≈ P(X ≤ y − ½)

P(Y > y) ≈ P(X ≥ y + ½)
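A short check of the continuity correction against an exact binomial tail (the values n = 20, p = 0.5, y = 13 are illustrative assumptions, not from the text):

```python
# Exact binomial tail P(Y >= y) vs its continuity-corrected normal
# approximation P(X >= y - 1/2).
from math import comb, erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, y = 20, 0.5, 13
q = 1 - p
mu, sd = n * p, sqrt(n * p * q)

exact = sum(comb(n, k) * p**k * q**(n - k) for k in range(y, n + 1))
approx = 1 - phi((y - 0.5 - mu) / sd)    # subtract a half for P(Y >= y)

print(round(exact, 4), round(approx, 4))
```

Even at this modest sample size, the corrected approximation agrees with the exact tail to about three decimal places.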

Example 2.2
A survey was taken of 313 children, ages 14 to 22, of the nation's top corporate
executives. When asked to identify the best aspect of being privileged in this
group, 55% mentioned material and financial gains.
a) describe the sampling distribution of the sample proportion p̂
b) assuming that the population proportion is 0.5, what is the probability of
observing a sample proportion as large as or larger than p̂ = 0.55?

Solution
a) Since the sample size is large, the distribution of p̂ is approximately normal,
with mean p and σ_p̂ = √(pq/n) ≈ √(p̂q̂/n) = √(0.55 × 0.45/313) = 0.028.

Therefore, we know that approximately 95% of the time p̂ will fall within
2σ_p̂ ≈ 0.056 of the unknown value of p.

One can check the condition that allows the normal approximation to the distribution of p̂; i.e.
p̂ ± 2σ_p̂ = 0.55 ± 0.056, or 0.494 to 0.606, which falls in the interval 0 to 1.

b) We are given µ_p̂ = p = 0.5 and σ_p̂ = √(pq/n) = √(0.5 × 0.5/313) = 0.0283

[Diagram: the sampling distribution of p̂ based on a sample of n = 313 children]

P(p̂ ≥ 0.55) = P(Z ≥ (0.55 − 0.5)/0.0283)

⇒ P(Z ≥ 1.77) = 0.0384

This tells us that if we were to select a random sample of n = 313 observations from a
population with proportion p = 0.5 , the probability that the sample proportion p̂ would be as
large or larger than 0.55 is only 4%.

Alternatively, using the continuity correction, the equivalent of ±0.5 on the count
scale is ±1/(2n) on the proportion scale. So:

P(Z ≥ ((0.55 − 0.0016) − 0.5)/0.0283) = P(Z ≥ 1.71) = 0.0436
When n is large, the effect of using the correction is generally negligible.
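Both calculations in part (b) can be reproduced in Python (`phi` is the standard normal CDF built from `math.erf`):

```python
# Example 2.2(b): tail probability with and without continuity correction.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, p_hat = 313, 0.5, 0.55
se = sqrt(p * (1 - p) / n)                           # σ_p̂ ≈ 0.0283

p_tail = 1 - phi((p_hat - p) / se)                   # P(p̂ >= 0.55)
p_tail_cc = 1 - phi((p_hat - 1 / (2 * n) - p) / se)  # with ±1/(2n) correction

print(round(se, 4), round(p_tail, 4), round(p_tail_cc, 4))
```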

2.6 Sampling distribution: Sum or Difference between Two Sample Means


When independent random samples of n1 and n2 observations have been
selected from populations with means µ1 and µ2, and variances σ1² and σ2²
respectively, the sampling distribution of the sum or difference will have the
following properties:

(a) The mean and standard deviation of (x̄1 ± x̄2):

µ_(x̄1 ± x̄2) = µ1 ± µ2 and σ_(x̄1 ± x̄2) = √(σ1²/n1 + σ2²/n2)

(b) If the sampled populations are normally distributed, then the sampling
distribution is exactly normally distributed regardless of the sample size

(c) If the sampled populations are not normally distributed, then the sampling
distribution is approximately normally distributed when the sample sizes are
large, by the CLT

Example 2.3
Random samples of 40 teachers each were selected from high schools in Kingston
and in St Ann. What is the probability that the sample mean salary from Kingston
will exceed the sample mean salary from St Ann by $1500 or more, given that the
Kingston population mean salary is $29,000, the St Ann population mean salary is
$28,621, and the standard deviations for the two population salaries are $5000
and $4700 respectively?

Solution
Let x̄1 be the sample mean salary for Kingston and x̄2 the sample mean salary
for St Ann; also, let σ1 and σ2 be the population standard deviations respectively.
Given that µ1 = 29,000, µ2 = 28,621 and σ1 = 5000, σ2 = 4700,
then

µ_(x̄1 − x̄2) = µ1 − µ2 = 29,000 − 28,621 = 379

σ_(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2) = √(5000²/40 + 4700²/40) = 1085.0115

Since the sample sizes are large, we can use the normal approximation:

P[(x̄1 − x̄2) > 1500] = P(Z > (1500 − 379)/1085.0115)

⇒ P(Z > 1.03) = 1 − Φ(1.03) = 1 − 0.8485 = 0.1515

[Diagram: the sampling distribution of (x̄1 − x̄2)]

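The calculation can be verified in Python; any small difference from 0.1515 comes from using the unrounded z value:

```python
# Example 2.3: P(x̄1 − x̄2 > 1500) under the normal approximation.
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu1, mu2 = 29_000, 28_621
s1, s2 = 5_000, 4_700
n1 = n2 = 40

mean_diff = mu1 - mu2                       # 379
se = sqrt(s1**2 / n1 + s2**2 / n2)          # ≈ 1085.01

prob = 1 - phi((1500 - mean_diff) / se)
print(round(se, 2), round(prob, 4))
```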
2.7 Sampling distribution: Difference between two sample proportions
Assume that independent random samples of size n1 and n2 observations have
been selected from binomial populations with parameters p1 and p2 respectively.
Then the sampling distribution of the difference between the sample proportions

(p̂1 − p̂2) = (x1/n1 − x2/n2)

will have the following properties:

(a) The mean and standard deviation of (p̂1 − p̂2):

µ_(p̂1 − p̂2) = p1 − p2 and σ_(p̂1 − p̂2) = √(p1q1/n1 + p2q2/n2)

(b) The sampling distribution of (p̂1 − p̂2) can be approximated by a normal
distribution when both sample sizes are large, by the CLT.

(c) When we use a normal distribution to approximate binomial probabilities,
the interval (p1 − p2) ± 2σ_(p̂1 − p̂2) should fall within the interval −1 to 1.

Example 2.4
A local newspaper reported that 75% of the residents in the developing section
and 60% of the residents in other parts of the city favour passage of a proposed
bond issue to build a new school. Random samples of n1 = 50 residents in
developing section of the city and n2 = 100 residents in other parts of the city are
selected, and the residents in the sample are asked whether they favour the bond
proposal. What is the probability that the difference in magnitude between the
sample proportions favouring the bond proposal does not exceed 10%?

Solution
Let us assume that p1 = 0.75 and p2 = 0.60, and that the sampling distribution
of the difference between the proportions is approximately normal.

So, µ_(p̂1 − p̂2) = (p1 − p2) = 0.75 − 0.60 = 0.15
and
σ_(p̂1 − p̂2) = √(p1q1/n1 + p2q2/n2) = √(0.75 × 0.25/50 + 0.6 × 0.4/100) = 0.0784

We wish to find P ( −0.1 < pˆ1 − pˆ 2 < 0.1)


Hence,

P(−0.1 < p̂1 − p̂2 < 0.1) = P( (−0.1 − 0.15)/0.0784 < Z < (0.1 − 0.15)/0.0784 )

⇒ P(−3.19 < Z < −0.64) = Φ(−0.64) − Φ(−3.19) = 0.2611 − 0.0007 = 0.2604

[Diagram: the sampling distribution of (p̂1 − p̂2) based on sample sizes n1 = 50
and n2 = 100]

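A Python check of this calculation (using the exact normal CDF rather than rounded z values):

```python
# Example 2.4: P(−0.1 < p̂1 − p̂2 < 0.1).
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p1, p2 = 0.75, 0.60
n1, n2 = 50, 100

mu_diff = p1 - p2                                    # 0.15
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # ≈ 0.0784

prob = phi((0.1 - mu_diff) / se) - phi((-0.1 - mu_diff) / se)
print(round(se, 4), round(prob, 4))
```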
2.8 Large Sample Estimation
Since populations are characterised by numerical descriptive measures called
parameters, statistical inference is concerned with making inferences about
population parameters. Methods for making inferences about parameters fall into
one of two categories. We may make decisions concerning the value of the
parameter, or we may estimate or predict the value of the parameter. Which
method of inference should be used; that is, should the parameter be estimated or
should we test a hypothesis concerning its value?

Estimation procedures can be divided into two types, point estimation and
interval estimation.

An estimator is a statistic used to estimate a population parameter; it is a function
of the sample observations.
An estimate is the value an estimator takes for a particular sample; it is also
called a point estimate.

An interval estimator of a population parameter tells us how to calculate two


numbers based on sample data, forming an interval within which the parameter is
expected to lie. This pair of numbers is called an interval estimate or confidence
interval.

Suppose we let θ̂ denote an estimator of the population parameter θ (µ, σ, or
any other parameter). We would like our estimator to be unbiased and the spread
of the sampling distribution of the estimator to be as small as possible.

The distance between the estimate and the parameter is called the error of
estimation.

The probability that a confidence interval will enclose the estimated parameter is
called the confidence coefficient.
A good confidence interval is one that is as narrow as possible and has a large
confidence coefficient, near 1. The narrower the interval, the more exactly we
have located the estimated parameter.

Suppose we want to estimate the mean number of bacteria per cubic centimetre in
a polluted stream. If we draw 10 samples, each containing n = 30 observations,
and construct a confidence interval for the population mean µ from each sample,
the intervals might appear as shown in the diagram below.

[Diagram: ten confidence intervals for the mean number of bacteria per cubic cm,
each based on a sample of n = 30 observations]

The horizontal line segments represent the ten intervals and the vertical line
represents the location of the true mean number of bacteria per cubic cm. The
parameter is fixed, while the interval location and width may vary from sample
to sample. Thus, we speak of "the probability that the interval encloses µ", not
"the probability that µ falls in the interval", because µ is fixed; it is the interval
that is random.

A (1 − α)100% confidence interval for θ: θ̂ ± z_{α/2} σ_θ̂

where z_{α/2} is the z value corresponding to an area α/2 in the upper tail of the
standard normal distribution.

Also, z_{α/2} σ_θ̂ is the bound on the error of estimation.

θ̂ + z_{α/2} σ_θ̂ is called the upper confidence limit and θ̂ − z_{α/2} σ_θ̂ is called
the lower confidence limit.

2.9 Confidence Interval (CI) for Population Mean
A (1 − α)100% confidence interval for µ: x̄ ± z_{α/2} (σ/√n)

Note: If σ is unknown, it can be approximated by the sample standard deviation s
when the sample size is large.

Remark: If you want a confidence coefficient (1 − α) equal to 0.95, then the
tail-end area is α = 0.05 and half of α is placed in each tail of the distribution.
Some of the commonly used confidence coefficients are shown in the table below.

Confidence coefficient

(1 − α)   α      z_{α/2}   LCL                UCL
0.90      0.10   1.645     x̄ − 1.645 σ/√n    x̄ + 1.645 σ/√n
0.95      0.05   1.96      x̄ − 1.96 σ/√n     x̄ + 1.96 σ/√n
0.99      0.01   2.58      x̄ − 2.58 σ/√n     x̄ + 2.58 σ/√n

[Diagram: location of z_{α/2}]

Example 2.5
Suppose that we wish to estimate the mean daily yield of a chemical manufactured
in a chemical plant. The daily yield, recorded for 50 days, produced a mean and
standard deviation of x̄ = 871 tons and σ = 21 tons. Find a 90% confidence
interval for the population mean.

Solution
A 90% CI for µ: x̄ ± z_{α/2} (σ/√n), where α = 0.1 ⇒ z_{α/2} = 1.645

hence, 871 ± 1.645 (21/√50) or 871 ± 4.89

Interpretation:
i. We estimate the mean daily yield to be no more than 875.89 tons and no less
than 866.11 tons.

ii. In repeated sampling, 90% of the confidence intervals similarly formed will
enclose the true value of µ.

iii. We estimate that the mean daily yield µ lies in the interval from 866.11 to
875.89 tons.

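The interval in Example 2.5 can be reproduced in Python:

```python
# Example 2.5: a 90% confidence interval for the mean daily yield.
from math import sqrt

xbar, sigma, n = 871, 21, 50
z = 1.645                            # z_{α/2} for a 90% CI

half_width = z * sigma / sqrt(n)     # bound on the error of estimation
lcl, ucl = xbar - half_width, xbar + half_width
print(round(lcl, 2), round(ucl, 2))
```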
Confidence Interval for difference between two means

A (1 − α)100% confidence interval for (µ1 − µ2):

(x̄1 − x̄2) ± z_{α/2} √(σ1²/n1 + σ2²/n2)

Note: If σ 12 and σ 22 are unknown, but both n1 and n2 are greater than or equal to
30, you can use the sample variances s12 and s22 to estimate σ 12 and σ 22 .

Example 2.6
A comparison of the wearing quality of two types of automobile tyres was
obtained by road-testing samples of n1 = n2 = 100 tyres of each type. The number
of miles until wear-out was recorded. Estimate the difference in mean miles to
wear-out, the standard error, and find a 99% CI.
Tyre 1: x̄1 = 26,400 miles and s1² = 1,440,000
Tyre 2: x̄2 = 25,100 miles and s2² = 1,960,000

Solution
The point estimate of ( µ1 − µ2 ) is ( x1 − x2 ) = 26,400 − 25,100 = 1300 miles

The standard error (s.e.) of (x̄1 − x̄2) is

√(σ1²/n1 + σ2²/n2) ≈ √(s1²/n1 + s2²/n2) = √(1440000/100 + 1960000/100) = 184.4

A 99% CI for (µ1 − µ2): (x̄1 − x̄2) ± z_{α/2} √(σ1²/n1 + σ2²/n2)
i.e.
1300 ± ( 2.58)(184.4 ) ⇒ 1300 ± 475.752

Hence, we are 99% confident that the mean difference in miles to wear-out lies
between 824.2 and 1775.8 miles.

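A Python check of Example 2.6:

```python
# Example 2.6: 99% CI for the difference in mean miles to wear-out.
from math import sqrt

x1, s1_sq, x2, s2_sq = 26_400, 1_440_000, 25_100, 1_960_000
n1 = n2 = 100
z = 2.58                                # z_{α/2} for a 99% CI

diff = x1 - x2                          # 1300 miles
se = sqrt(s1_sq / n1 + s2_sq / n2)      # ≈ 184.4
lcl, ucl = diff - z * se, diff + z * se
print(round(se, 1), round(lcl, 1), round(ucl, 1))
```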
Confidence Interval for Proportion

A (1 − α)100% confidence interval for p: p̂ ± z_{α/2} √(p̂q̂/n)

Example 2.7
A random sample of 100 voters in a community produced x = 59 voters favouring
candidate J. Find a point estimate of the population proportion who favour
candidate J; also find a 95% confidence interval for the population proportion.

Solution
A point estimate for p is p̂ = x/n = 59/100 = 0.59

A 95% CI for p: p̂ ± z_{α/2} √(p̂q̂/n)

i.e. 0.59 ± 1.96(0.049) or 0.59 ± 0.09604

Therefore, in repeated sampling, 95% of the confidence intervals calculated this
way will enclose the true value of p.

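A Python check of Example 2.7:

```python
# Example 2.7: 95% CI for the proportion favouring candidate J.
from math import sqrt

x, n = 59, 100
z = 1.96                                 # z_{α/2} for a 95% CI

p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)       # ≈ 0.049
lcl, ucl = p_hat - z * se, p_hat + z * se
print(round(se, 3), round(lcl, 3), round(ucl, 3))
```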
Confidence Interval for Proportion Differences
A (1 − α)100% confidence interval for (p1 − p2):

(p̂1 − p̂2) ± z_{α/2} √(p̂1q̂1/n1 + p̂2q̂2/n2)

Assumption: n1 and n2 must be sufficiently large that the sampling distribution of
(p̂1 − p̂2) can be approximated by a normal distribution.

Example 2.8
A manufacturer of fly spray wished to compare two new sprays, 1 and 2. Two
rooms of equal size, each containing 1000 flies, were used in the experiment.
Room A was treated with spray 1 and room B with spray 2. A total of 825 and
760 flies succumbed to sprays 1 and 2 respectively. Estimate the difference in the
rate of kill for the two sprays and find a 90% confidence interval.

Solution
The point estimate of (p1 − p2): (p̂1 − p̂2) = 0.825 − 0.760 = 0.065

The standard error:

√(p̂1q̂1/n1 + p̂2q̂2/n2) = √((0.825)(0.175)/1000 + (0.76)(0.24)/1000) = 0.0181

A 90% CI for (p1 − p2): (p̂1 − p̂2) ± z_{α/2} √(p̂1q̂1/n1 + p̂2q̂2/n2)

i.e. 0.065 ± 1.645(0.0181) or 0.065 ± 0.0297

Hence, we are 90% confident that the difference between the rates of kill lies
between 0.035 and 0.095.

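A Python check of Example 2.8, computing the standard error directly from the sample proportions:

```python
# Example 2.8: 90% CI for the difference in kill rates of two sprays.
from math import sqrt

p1_hat, p2_hat = 825 / 1000, 760 / 1000
n1 = n2 = 1000
z = 1.645                                # z_{α/2} for a 90% CI

diff = p1_hat - p2_hat                   # 0.065
se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
lcl, ucl = diff - z * se, diff + z * se
print(round(se, 4), round(lcl, 3), round(ucl, 3))
```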
2.10 Sample size
How many observations should be included in the sample? Unfortunately, we
cannot answer this question without knowing how much information the experimenter wishes to
buy.

Suppose we wish to estimate the mean daily yield µ and we would like the error of
estimation to be less than 4 tons with a probability of 0.95.
Now 95% of the sample means will lie within 1.96σ x of µ in repeated sampling;
hence, we are asking that 1.96σ x equal 4 tons.
Thus,

1.96 σ_x̄ = 4, i.e. 1.96 σ/√n = 4

Solving for n, we obtain n = (1.96/4)² σ², or n = 0.2401 σ².

We assume that n is large and that σ ≈ s = 21 (from Example 2.5).

Thus, n = 0.2401(21)² = 105.9; hence, a sample size of 106.

Procedure
Let θ be the parameter to be estimated and let σ θˆ be the standard deviation of
the point estimator. Then proceed as follows:

1) Choose B, the bound on the error of estimation, and a confidence
coefficient (1 − α).

2) Assume that n is large; solve the following equation for the sample size n:

z_{α/2} σ_θ̂ = B

where z_{α/2} is the value of z having α/2 to its right.

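The procedure can be wrapped in a small Python helper (a sketch; `sample_size` is a hypothetical function name, not from the text):

```python
# Sample-size calculation: smallest n such that z_{α/2}·σ/√n <= B.
from math import ceil

def sample_size(sigma, bound, z=1.96):
    """Smallest n with z * sigma / sqrt(n) <= bound."""
    return ceil((z * sigma / bound) ** 2)

# The chemical-yield example: σ ≈ 21 tons, B = 4 tons, 95% confidence.
print(sample_size(21, 4))
```

Rounding up with `ceil` guarantees the bound is met; here it returns 106, matching the worked calculation.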
2.11 The p-value
The smallest value of α for which the test results are statistically significant is
often called the p-value or the observed significance level.
More formally, the p-value (probability value) is the probability of obtaining a
result at least as extreme as the one that was observed assuming that H 0 is true.
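As a sketch, the one-sided p-value for the z test implicit in Example 2.2(b) can be computed as the tail area beyond the observed statistic:

```python
# One-sided p-value for a z test of H0: p = 0.5 vs the observed p̂ = 0.55
# (data from Example 2.2; the test framing here is illustrative).
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p0, p_hat = 313, 0.5, 0.55
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 1 - phi(z)            # P(result at least as extreme, given H0)

print(round(z, 2), round(p_value, 4))
```

A test at level α rejects H0 exactly when the p-value is smaller than α, which is why the p-value is the smallest α giving a significant result.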

Example
