Brief lecture notes

Sampling, Sampling Distribution, and Estimation

We often use samples instead of the entire population because measuring every
item in the population would cost too much and take too long. Also, in some
cases measurement requires the destruction of individual items. In general, we
achieve greater accuracy by carefully obtaining a random sample of the
population than by spending the resources to measure every item. An important
reason for this result is that it is often very difficult to obtain and measure
every item in a population, and even if it is possible, the cost would be very
high for a large population.

Sampling
Sampling is the process of selecting a sample.

Simple Random Sampling


Suppose that we want to select a sample of size n objects from a population of
N objects. A simple random sample is selected such that every object has an
equal probability of being selected and the objects are selected independently:
the selection of one object does not change the probability of selecting any
other object.

Simple random sampling can be implemented in many ways. We can place the
N population items—for example, colored balls—in a large barrel and mix them
thoroughly. Then from this well-mixed barrel we can select individual balls
from different parts of the barrel. In practice, we often use random numbers to
select objects that can be assigned some numerical value. Various statistical
computer software and spreadsheets have routines for obtaining random
numbers, and these are generally used for most sampling studies.

To see how to use a random number table, suppose that we have 100 employees
in a company and wish to interview a randomly chosen sample of 10. We could
get such a random sample by assigning every employee a number from 00 to 99,
consulting a random number table, and picking a systematic method of
selecting two-digit numbers. In this case, let’s do the following:

Go from the top to the bottom of the columns beginning with the left-hand
column, and read only the first two digits in each row.
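
In software, the random number table is usually replaced by a pseudo-random number generator. The following is a minimal sketch of the employee example, assuming the employees are labeled 00 to 99; the function name and the seed value are illustrative and not part of the original notes.

```python
import random


def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return `sample_size` distinct labels drawn from 0 .. population_size - 1."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no employee is chosen twice.
    return sorted(rng.sample(range(population_size), sample_size))


# 100 employees numbered 00-99, sample of 10 (the seed is arbitrary).
chosen = draw_simple_random_sample(100, 10, seed=2016)
# Print two digits per label to match the 00-99 numbering scheme.
print([f"{number:02d}" for number in chosen])
```

Fixing the seed makes the draw reproducible; leaving it as None gives a different sample on each run.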

Systematic Sampling
In systematic sampling, elements are selected from the population at a uniform
interval that is measured in time, order, or space. If we want to interview every
twentieth student on a college campus, we would choose a random starting
point in the first 20 names in the student directory and then pick every twentieth
name thereafter.
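
The student-directory procedure can be sketched as follows. The directory contents and its size (1,000 names) are hypothetical; only the rule "random start within the first 20 names, then every twentieth name" comes from the text.

```python
import random


def systematic_sample(items, interval, seed=None):
    """Pick a random start in the first `interval` items, then every `interval`-th item."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # index 0 .. interval - 1
    return items[start::interval]


# Hypothetical directory of 1,000 student names.
directory = [f"student_{i:04d}" for i in range(1, 1001)]
picked = systematic_sample(directory, 20, seed=7)
print(len(picked), picked[:3])
```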

Stratified Sampling
To use stratified sampling, we divide the population into relatively
homogeneous groups, called strata. Then we use one of two approaches. Either
we select at random from each stratum a specified number of elements
corresponding to the proportion of that stratum in the population as a whole,
or we draw an equal number of elements from each stratum and weight the
results according to the stratum’s proportion of the total population. With
either approach, stratified sampling guarantees that every element in the
population has a chance of being selected.

Stratified sampling is appropriate when the population is already divided into
groups of different sizes and we wish to acknowledge this fact. Suppose that a
physician’s patients are divided into four groups according to age.

Age group             Percentage of total
Birth to 19 years     30
20 to 39 years        40
40 to 59 years        20
60 years and older    10

The physician wants to find out how many hours his patients sleep. To obtain an
estimate of this characteristic of the population, he could take a random sample
from each of the four age groups and give weight to the sample according to the
percentage of patients in that group. This would be an example of a stratified
sample.
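
Both stratified approaches can be sketched in a few lines. The stratum percentages come from the table of age groups; the total sample size of 200 and the per-stratum mean hours of sleep are hypothetical values chosen only to illustrate the arithmetic.

```python
# Stratum shares from the physician's age-group table.
strata_share = {"Birth-19": 0.30, "20-39": 0.40, "40-59": 0.20, "60+": 0.10}


def proportional_allocation(shares, total_sample_size):
    """Approach 1: sample each stratum in proportion to its population share."""
    return {group: round(share * total_sample_size) for group, share in shares.items()}


def weighted_mean(stratum_means, shares):
    """Approach 2: weight each stratum's sample mean by its population share."""
    return sum(stratum_means[group] * shares[group] for group in shares)


allocation = proportional_allocation(strata_share, 200)  # hypothetical total of 200

# Hypothetical sample means (hours of sleep) from equal-sized stratum samples.
sample_means = {"Birth-19": 9.0, "20-39": 7.0, "40-59": 6.5, "60+": 6.0}
estimate = weighted_mean(sample_means, strata_share)
print(allocation, estimate)
```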

Cluster Sampling
In cluster sampling, we divide the population into groups, or clusters, and then
select a random sample of these clusters. We assume that these individual
clusters are representative of the population as a whole. If a market research
team is attempting to determine by sampling the average number of television
sets per household in a large city, they could use a city map to divide the
territory into blocks and then choose a certain number of blocks (clusters) for
interviewing. Every household in each of these blocks would be interviewed. A
well-designed cluster sampling procedure can produce a more precise sample at
considerably less cost than simple random sampling.
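
A minimal sketch of the television-set example follows. The city (40 blocks with 5 to 15 households each, 0 to 4 sets per household) is simulated and entirely hypothetical; the point is only that whole blocks are sampled and every household in a chosen block is included.

```python
import random


def cluster_sample_mean(clusters, n_clusters, seed=None):
    """Choose `n_clusters` whole clusters at random and average all their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every household in each selected block is "interviewed".
    households = [tv for block in chosen for tv in clusters[block]]
    return chosen, sum(households) / len(households)


# Hypothetical city: block name -> television sets per household in that block.
city_rng = random.Random(1)
city = {
    f"block_{b:02d}": [city_rng.randint(0, 4) for _ in range(city_rng.randint(5, 15))]
    for b in range(40)
}

blocks, mean_tv = cluster_sample_mean(city, 8, seed=2)
print(blocks, round(mean_tv, 2))
```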

Sampling Distributions
Consider a random sample selected from a population that is used to make an
inference about some population characteristic, such as the population mean,
$\mu$, using a sample statistic, such as the sample mean, $\bar{x}$. The
inference is based on the realization that every random sample has a different
number for $\bar{x}$, and, thus, $\bar{x}$ is a random variable. The sampling
distribution of this statistic is the probability distribution of the sample
means obtained from all possible samples of the same number of observations
drawn from the population.

We illustrate the concept of a sampling distribution by considering a
supervisor with six employees, whose years of experience are

2 4 6 6 7 8

Two of these employees are to be chosen randomly for a particular work group.
The mean of the years of experience for this population of six employees is

\[ \mu = \frac{2 + 4 + 6 + 6 + 7 + 8}{6} = 5.5 \]

Now, let us consider the mean number of years of experience of the two
employees chosen randomly from the population of six. Fifteen possible
different random samples could be selected:

\[ {}^{6}C_{2} = \frac{6!}{2!\,4!} = \frac{6 \cdot 5 \cdot 4!}{2!\,4!} = 15 \]

Table 1 shows all of the possible samples and associated sample means.

Table 1: Samples and sample means from the worker population, sample size n = 2.

Sample   Sample mean     Sample   Sample mean
2, 4     3.0             4, 8     6.0
2, 6     4.0             6, 6     6.0
2, 6     4.0             6, 7     6.5
2, 7     4.5             6, 8     7.0
2, 8     5.0             6, 7     6.5
4, 6     5.0             6, 8     7.0
4, 6     5.0             7, 8     7.5
4, 7     5.5

Each of the 15 samples in Table 1 has the same probability, 1/15, of being
selected. Note that there are several occurrences of the same sample mean. For
example, the sample mean 5.0 occurs three times, and, thus, the probability of
obtaining a sample mean of 5.0 is 3/15. Table 2 represents the sampling distribution for
the various sample means from the population, and the probability function is
graphed in Figure 1.

Table 2: Sampling distribution of the sample means from the worker population,
sample size n = 2.

Sample mean $\bar{x}$    Probability of $\bar{x}$
3.0                      1/15
4.0                      2/15
4.5                      1/15
5.0                      3/15
5.5                      1/15
6.0                      2/15
6.5                      2/15
7.0                      2/15
7.5                      1/15

We see that, while the number of years of experience for the six workers ranges
from 2 to 8, the possible values of the sample mean have a range from only 3.0
to 7.5. In addition, more of the values lie in the central portion of the range.
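
The sampling distribution for n = 2 can be checked by enumerating all 15 samples. This is a minimal sketch using exact fractions; the function name is illustrative, and the two employees with 6 years of experience are treated as distinct people, as in the table of samples.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

years = [2, 4, 6, 6, 7, 8]  # years of experience of the six employees


def sampling_distribution(population, n):
    """Map each possible sample mean to its probability under equally likely samples."""
    samples = list(combinations(population, n))  # position-based, so both 6s count
    counts = Counter(Fraction(sum(sample), n) for sample in samples)
    return {mean: Fraction(count, len(samples)) for mean, count in sorted(counts.items())}


distribution = sampling_distribution(years, 2)
for mean, probability in distribution.items():
    # Fractions print in lowest terms, e.g. 3/15 appears as 1/5.
    print(float(mean), probability)
```

Calling `sampling_distribution(years, 5)` gives the corresponding distribution for samples of five employees.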

Table 3 presents similar results for the sampling distribution with sample
size n = 5. Notice that the means are concentrated over a narrower range. These
sample means are all close to the population mean $\mu = 5.5$. We will always
find this to be true: the sampling distribution becomes concentrated closer to
the population mean as the sample size increases. This important result
provides a foundation for statistical inference.

Table 3: Samples and sample means from the worker population, sample size n = 5.

Sample           $\bar{x}$   Probability
2, 4, 6, 6, 7    5.0         1/6
2, 4, 6, 6, 8    5.2         1/6
2, 4, 6, 7, 8    5.4         1/6
2, 4, 6, 7, 8    5.4         1/6
2, 6, 6, 7, 8    5.8         1/6
4, 6, 6, 7, 8    6.2         1/6

Sample Mean
Let the random variables $X_1, X_2, \ldots, X_n$ denote a random sample from a
population. The sample mean value of these random variables is defined as

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

Consider the sampling distribution of the random variable $\bar{X}$. At this
point we cannot determine the shape of the sampling distribution, but we can
determine the mean and variance of the sampling distribution. We know that the
expectation of a linear combination of random variables is the linear
combination of the expectations:

\[ E(\bar{X}) = E\left[\frac{1}{n}(X_1 + X_2 + \cdots + X_n)\right]
= \frac{E(X_1) + E(X_2) + \cdots + E(X_n)}{n}
= \frac{\mu + \mu + \cdots + \mu}{n} = \frac{n\mu}{n} = \mu \]

Thus, the mean of the sampling distribution of the sample means is the
population mean. If samples of n random and independent observations are
repeatedly and independently drawn from a population, then as the number of
samples becomes very large, the mean of the sample means approaches the true
population mean.

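The variance of the sample mean is denoted by $\sigma_{\bar{X}}^2$ and, for n
independent observations from a population with variance $\sigma^2$, is

\[ \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \]

The corresponding standard deviation, called the standard error of $\bar{X}$,
is given by

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]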
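
A small simulation can make these two results concrete: the mean of many sample means should be close to the population mean, and their standard deviation close to the population standard deviation divided by the square root of n. The population below (normal with mean 50 and standard deviation 10), the sample size 25, and the number of repetitions are hypothetical choices for illustration only.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, n_samples, seed=0):
    """Draw `n_samples` independent samples of size n and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(n_samples)
    ]


# Hypothetical normal population: mu = 50, sigma = 10; samples of size n = 25.
means = simulate_sample_means(mu=50.0, sigma=10.0, n=25, n_samples=5000)

mean_of_means = statistics.fmean(means)     # expected to be near mu = 50
spread_of_means = statistics.pstdev(means)  # expected to be near 10 / sqrt(25) = 2
print(round(mean_of_means, 3), round(spread_of_means, 3))
```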

Standard Normal Distribution for the Sample Means


Whenever the sampling distribution of the sample means is a normal
distribution, we can compute a standardized normal random variable, Z, that has
mean 0 and variance 1:

\[ Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]

Example: Suppose that the annual percentage salary increases for the chief
executive officers of all midsize corporations are normally distributed with
mean 12.2% and standard deviation 3.6%. A random sample of nine
observations is obtained from this population and the sample mean computed.
What is the probability that the sample mean will be less than 10%?

Solution
We know that

\[ \mu = 12.2, \qquad \sigma = 3.6, \qquad n = 9 \]

Let $\bar{x}$ denote the sample mean, and compute the standard error of the
sample mean

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3.6}{\sqrt{9}} = 1.2 \]

Then we compute

\[ P(\bar{x} < 10) = P\left(\frac{\bar{x} - \mu}{\sigma_{\bar{x}}} < \frac{10 - 12.2}{1.2}\right) = P(Z < -1.83) = 0.0336 \]

From this analysis we conclude that the probability that the sample mean will be
less than 10% is only 0.0336.
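
The same calculation can be done without a normal table. This is a minimal sketch using Python's standard library `statistics.NormalDist`; the function name is illustrative. Because z is not rounded to two decimals here, the result can differ from the table-based 0.0336 in the later decimal places.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_below(threshold, mu, sigma, n):
    """Return (z, P(sample mean < threshold)) for a normal population."""
    standard_error = sigma / sqrt(n)
    z = (threshold - mu) / standard_error
    return z, NormalDist().cdf(z)  # standard normal cumulative probability


# Salary-increase example: mu = 12.2, sigma = 3.6, n = 9, threshold = 10.
z_value, probability = prob_sample_mean_below(10.0, mu=12.2, sigma=3.6, n=9)
print(round(z_value, 2), round(probability, 4))
```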

Example
A spark plug manufacturer claims that the lives of its plugs are normally
distributed with mean 36,000 miles and standard deviation 4,000 miles. A
random sample of 16 plugs had an average life of 34,500 miles. If the
manufacturer’s claim is correct, what is the probability of finding a sample mean
of 34,500 or less?

Solution
To compute the probability, we need to first obtain the standard error of the
sample mean

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4{,}000}{\sqrt{16}} = 1{,}000 \]

The desired probability is

\[ P(\bar{x} \le 34{,}500) = P\left(\frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \le \frac{34{,}500 - 36{,}000}{1{,}000}\right) = P(Z \le -1.50) = 0.0668 \]

We find that the probability that the sample mean is 34,500 or less is 0.0668.
This probability suggests that, if the manufacturer’s claims $\mu = 36{,}000$
and $\sigma = 4{,}000$ are true, then a sample mean of 34,500 or less has a
small probability. As a result we are doubtful about the manufacturer’s claims.
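
As a cross-check on the table-based answer, the probability can also be approximated by simulation: repeatedly draw 16 plug lives from the claimed normal distribution and count how often the sample mean is 34,500 or less. The number of trials and the seed are arbitrary choices, and the estimate will vary slightly around the theoretical value.

```python
import random


def simulate_prob_mean_at_most(threshold, mu, sigma, n, trials=20000, seed=1):
    """Monte Carlo estimate of P(sample mean <= threshold) for a normal population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean <= threshold:
            hits += 1
    return hits / trials


# Spark plug example: claimed mu = 36,000 and sigma = 4,000 miles, n = 16.
estimate = simulate_prob_mean_at_most(34_500, mu=36_000, sigma=4_000, n=16)
print(round(estimate, 4))
```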

Sampling Distributions of Sample Proportions


Let X be the number of successes in a binomial sample of n observations with
parameter P. The parameter is the proportion of the population members that
have a characteristic of interest. We define the sample proportion as

\[ \hat{P} = \frac{X}{n} \]

X is the sum of a set of n independent Bernoulli random variables, each with
probability of success P. As a result, $\hat{P}$ is the mean of a set of
independent random variables. The central limit theorem can be used to argue
that the probability distribution for $\hat{P}$ can be modeled as a normally
distributed random variable.
The mean and variance of the sampling distribution of the sample proportion
$\hat{P}$ can be obtained from the mean and variance of the number of
successes, X.

\[ E(X) = nP, \qquad \mathrm{Var}(X) = nP(1 - P) \]

And, thus,

\[ E(\hat{P}) = E\left(\frac{X}{n}\right) = \frac{1}{n} E(X) = P \]
We see that the mean of the distribution of $\hat{P}$ is the population
proportion, P.

The variance of $\hat{P}$ is the variance of the population distribution of the
Bernoulli random variables divided by n,

\[ \sigma_{\hat{P}}^2 = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{1}{n^2} \mathrm{Var}(X) = \frac{P(1 - P)}{n} \]

The standard deviation of P̂ , which is the square root of the variance, is called
its standard error.

Sampling Distribution of the Sample Proportion


Let $\hat{P}$ be the sample proportion of successes in a random sample from a
population with proportion of success P. Then

1. The sampling distribution of $\hat{P}$ has mean P:

\[ E(\hat{P}) = P \]

2. The sampling distribution of $\hat{P}$ has standard deviation

\[ \sigma_{\hat{P}} = \sqrt{\frac{P(1 - P)}{n}} \]

3. If the sample size is large, the random variable

\[ Z = \frac{\hat{P} - P}{\sigma_{\hat{P}}} \]

is approximately distributed as a standard normal. This
approximation is good if

\[ nP(1 - P) > 9 \]

Example: A random sample of 250 homes was taken from a large population
of older homes to estimate the proportion of homes with unsafe wiring. If, in
fact, 30% of the homes have unsafe wiring, what is the probability that the
sample proportion will be between 25% and 35% of homes with unsafe
wiring?

Solution: For this problem we have

\[ P = 0.30, \qquad n = 250 \]

We can compute the standard deviation of the sample proportion, $\hat{P}$, as

\[ \sigma_{\hat{P}} = \sqrt{\frac{P(1 - P)}{n}} = \sqrt{\frac{0.30(1 - 0.30)}{250}} = 0.029 \]

The required probability is

\[ \begin{aligned}
P(0.25 < \hat{P} < 0.35)
&= P\left(\frac{0.25 - P}{\sigma_{\hat{P}}} < \frac{\hat{P} - P}{\sigma_{\hat{P}}} < \frac{0.35 - P}{\sigma_{\hat{P}}}\right) \\
&= P\left(\frac{0.25 - 0.30}{0.029} < Z < \frac{0.35 - 0.30}{0.029}\right) \\
&= P(-1.72 < Z < 1.72) \\
&= F(1.72) - [1 - F(1.72)] \\
&= 0.9573 - [1 - 0.9573] \\
&= 0.9573 - 0.0427 \\
&= 0.9146
\end{aligned} \]

Thus, we see that the probability that the sample proportion is within the
interval 0.25 to 0.35, given P = 0.30, is 0.9146. This interval is called a
91.46% acceptance interval.
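
The unsafe-wiring calculation can be reproduced as follows. This is a minimal sketch of the normal approximation using the standard library; the function name is illustrative. Since the standard error and z values are not rounded, the result can differ from the table-based 0.9146 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist


def prob_proportion_between(low, high, p, n):
    """Normal approximation to P(low < sample proportion < high)."""
    standard_error = sqrt(p * (1 - p) / n)
    std_normal = NormalDist()
    z_low = (low - p) / standard_error
    z_high = (high - p) / standard_error
    return standard_error, std_normal.cdf(z_high) - std_normal.cdf(z_low)


# Unsafe-wiring example: P = 0.30, n = 250, interval 0.25 to 0.35.
se, prob = prob_proportion_between(0.25, 0.35, p=0.30, n=250)
print(round(se, 3), round(prob, 4))
```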

Example: It has been estimated that 43% of business graduates believe that
a course in business ethics is very important for imparting ethical values to
students. Find the probability that more than one-half of a random sample of
80 business graduates have this belief.

Solution: We are given that

\[ P = 0.43, \qquad n = 80 \]

We will first compute the standard deviation of the sample proportion:

\[ \sigma_{\hat{P}} = \sqrt{\frac{P(1 - P)}{n}} = \sqrt{\frac{0.43(1 - 0.43)}{80}} = 0.055 \]

Then the required probability can be computed as

\[ \begin{aligned}
P(\hat{P} > 0.50)
&= P\left(\frac{\hat{P} - P}{\sigma_{\hat{P}}} > \frac{0.50 - P}{\sigma_{\hat{P}}}\right) \\
&= P(Z > 1.27) \\
&= 1 - P(Z < 1.27) \\
&= 1 - 0.8980 \\
&= 0.1020
\end{aligned} \]

The probability that more than one-half of the sample believe in the value of
business ethics courses is approximately 0.1.
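
Because the number of graduates holding the belief is binomial, the normal approximation can be compared with an exact binomial tail. The sketch below is an added cross-check, not the method used in the notes; "more than one-half" of 80 is taken to mean 41 or more graduates. The exact value should be of the same order as the approximate 0.1020 but need not match it, since the approximation above uses no continuity correction.

```python
from math import comb


def binomial_tail_more_than(k, n, p):
    """Exact P(X > k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1, n + 1))


# Business-ethics example: P = 0.43, n = 80, more than 40 graduates in the sample.
exact_probability = binomial_tail_more_than(40, n=80, p=0.43)
print(round(exact_probability, 4))
```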

Estimation
Investigating the whole population (the totality of all elements) is often not
feasible, because it may be time consuming, costly, and may require a large
number of skilled persons.

In business, economics and managerial problems we frequently deal with
population parameters such as the population mean ($\mu$) and the population
variance ($\sigma^2$). In most business and economics problems such information
is not available. In that case sampling is used to estimate these unknown
parameters from sample information, in order to make inferences and formulate
policy.
For example, we draw a random sample from a population and use the
sample mean ($\bar{x}$) to estimate the population mean $\mu$.

Thus, estimation is a statistical technique which applies the sample
observations to find the answer of a particular question regarding population
parameters.

Parameter
Numerical measures that describe a population, such as the population mean
($\mu$) and the population variance ($\sigma^2$), are called parameters. Thus
any parameter is a function of the population observations.
For example, the average monthly salary of all the teachers of BRAC in 2016 is
a parameter.

Statistic
On the other hand, numerical measures that describe a sample, such as the
sample mean ($\bar{x}$) and the sample variance ($s^2$), are called statistics.
Thus any statistic is a function of the sample observations.
For example, if the teachers of the Business School are regarded as a sample of
all the teachers of BRAC, their average monthly salary in 2016 is a statistic.

Estimator
Any statistic, that is, any function of the sample observations, that is used
to estimate a parameter of the population from which the sample is drawn is
called an estimator of the parameter. Thus, an estimator is a random variable
because its value varies from sample to sample.
For example, the sample mean $\bar{x}$ and the sample variance $s^2$ are the
estimators that are used to estimate the population mean $\mu$ and the
population variance $\sigma^2$ respectively.

Estimate
Any specific value of an estimator computed from a particular sample is called
an estimate of the parameter.
Example:
Let us consider a random sample of the Statistics marks obtained by 25 students
in a final exam of BUP: 65, 45, 46, 47, 48, 65, 70, 71, 75, 45, 35, 32, 68, 65,
63, 55, 48, 49, 44, 56, 44, 53, 60, 70, and 67. Here all the students are the
population and X indicates the marks of students; $\mu$ and $\sigma^2$ are the
mean and variance of the population and can be estimated by using the sample
mean $\bar{x}$ and the sample variance $s^2$. The sample mean and variance are
the estimators. The specific values of these estimators are called the estimates.
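
The estimates for the 25 marks can be computed directly. This is a minimal sketch using the standard library; note that `statistics.variance` divides by n - 1, which is an assumption here because the notes do not show which divisor their sample variance uses.

```python
import statistics

# Statistics marks of the 25 sampled students, as listed in the example.
marks = [65, 45, 46, 47, 48, 65, 70, 71, 75, 45, 35, 32, 68, 65,
         63, 55, 48, 49, 44, 56, 44, 53, 60, 70, 67]

sample_mean = statistics.mean(marks)          # point estimate of the population mean
sample_variance = statistics.variance(marks)  # n - 1 divisor (assumed convention)

print(len(marks), sample_mean, round(sample_variance, 2))
```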

Point estimate
A point estimate is a single value of an estimator, computed from a particular
sample, that is used to estimate the parameter of the population from which
the sample is drawn at random.

Interval estimate
An interval estimate is a range of numbers, computed from a particular sample,
that has a specified probability of containing the true value of the
population parameter from which the sample is drawn. The probability is
called the confidence level.

Confidence interval and confidence level


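A confidence interval is a range of values computed from a particular sample so
that the population parameter from which the sample is drawn at random will
lie within that range with a specified probability. The specified probability
is called the confidence level.
Example
Let $x_1, x_2, \ldots, x_n$ be a random sample of size n which is drawn
from a normal population with mean $\mu$ and variance $\sigma^2$. The 95%
confidence interval of the population mean $\mu$ is given by

\[ P\left(\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95 \]

Therefore, the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ is called the 95%
confidence interval of the population parameter $\mu$. If $\sigma$ is unknown
to us, we have to replace it by its sample estimate s. Therefore, the 95%
confidence interval for the population parameter $\mu$ is given by
$\bar{x} \pm t_{n-1,\,0.025}\, s/\sqrt{n}$, where $t_{n-1,\,0.025}$ is the
upper 2.5% point of the t distribution with n - 1 degrees of freedom (close to
1.96 when n is large). The confidence level is 0.95 or 95%.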
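
For the case where the population standard deviation is known, the interval is the sample mean plus or minus a standard normal quantile times the standard error. The sketch below applies this to the 25 Statistics marks listed earlier in the notes; the value sigma = 12 is purely hypothetical and is assumed known only for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_confidence_interval(sample, sigma, level=0.95):
    """Confidence interval for the mean of a normal population with known sigma."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    center = fmean(sample)
    half_width = z * sigma / sqrt(len(sample))
    return center - half_width, center + half_width


marks = [65, 45, 46, 47, 48, 65, 70, 71, 75, 45, 35, 32, 68, 65,
         63, 55, 48, 49, 44, 56, 44, 53, 60, 70, 67]

# sigma = 12.0 is a hypothetical "known" population standard deviation.
lower, upper = z_confidence_interval(marks, sigma=12.0)
print(round(lower, 2), round(upper, 2))
```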

Properties of a good estimator


Any estimator is said to be good, if it satisfies the following properties.
- Unbiasedness
- Efficiency
- Consistency
- Sufficiency

Estimation methods
Several methods have been developed for constructing good estimators of
unknown population parameters. The most popular and widely used methods of
estimation are given below:
- Least squares method
- Maximum likelihood method
- Method of moments

Problem
The monthly consumption expenditure (Tk.) and the level of monthly income (Tk.) of 25
employees of a garment factory are given below:

Consumption       Level of       Consumption       Level of
expenditure (y)   income (x)     expenditure (y)   income (x)
 7880              8750          11589             12520
 8025              8900          11969             12990
 8055              9000          11905             13500
 8225              9100          12545             13580
 8435              9275          12869             13975
 8725              9550          13255             14200
 9205             10200          13518             14525
 9578             10515          13689             15250
 9855             10825          13700             15500
10513             11025          14255             15980
10599             11450          14555             16700
10719             11650          15500             16925
10987             12100

Fit the equation between income and consumption expenditure,
$Y = \alpha + \beta X + \varepsilon$, using the least squares method.

Solution

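The equation between monthly consumption expenditure and income of the
employees of a garment factory is given by:

\[ Y = \alpha + \beta X + \varepsilon \]

Here, the variable Y indicates the monthly consumption expenditure and the
variable X indicates the income level of the employees, $\alpha$ is the
regression constant, $\beta$ is the regression coefficient, which indicates
the impact of a one-unit change in income on consumption expenditure, and
$\varepsilon$ is the random error term.
For sample observations the equation is given by

\[ y_i = \hat{\alpha} + \hat{\beta} x_i + e_i \]

where $\hat{\alpha}$ and $\hat{\beta}$ are the least squares estimators of
$\alpha$ and $\beta$ respectively.

If we apply the least squares method we have

\[ \hat{\beta} = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2} \qquad (1) \]

and

\[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \qquad (2) \]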

Table for calculations:

Consumption       Level of
expenditure (y)   income (x)     xy             x^2
  7880              8750           68950000       76562500
  8025              8900           71422500       79210000
  8055              9000           72495000       81000000
  8225              9100           74847500       82810000
  8435              9275           78234625       86025625
  8725              9550           83323750       91202500
  9205             10200           93891000      104040000
  9578             10515          100712670      110565225
  9855             10825          106680375      117180625
 10513             11025          115905825      121550625
 10599             11450          121358550      131102500
 10719             11650          124876350      135722500
 10987             12100          132942700      146410000
 11589             12520          145094280      156750400
 11969             12990          155477310      168740100
 11905             13500          160717500      182250000
 12545             13580          170361100      184416400
 12869             13975          179844275      195300625
 13255             14200          188221000      201640000
 13518             14525          196348950      210975625
 13689             15250          208757250      232562500
 13700             15500          212350000      240250000
 14255             15980          227794900      255360400
 14555             16700          243068500      278890000
 15500             16925          262337500      286455625
Total
280150            307985         3596013410     3956973775

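Now, $\bar{x} = 307985 / 25 = 12319.4$ and $\bar{y} = 280150 / 25 = 11206$.
Putting the values, with the column totals $\sum xy = 3596013410$ and
$\sum x^2 = 3956973775$, in the least squares formulas for the slope and the
intercept, we have

\[ \hat{\beta} = \frac{3596013410 - 25(12319.4)(11206)}{3956973775 - 25(12319.4)^2} = \frac{144733500}{162783366} \approx 0.8891 \]

\[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} = 11206 - \hat{\beta}(12319.4) \approx 252.6 \quad \text{(using the unrounded slope)} \]

Therefore, the fitted regression equation is:

\[ \hat{y} = 252.6 + 0.8891 x \]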
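
The fit can be reproduced from the 25 observations with a short script. This is a minimal sketch of simple least squares in pure Python; the function and variable names are illustrative, and the slope formula is written in the algebraically equivalent form that uses the column totals.

```python
# Monthly consumption expenditure (y) and income (x) of the 25 employees, in Tk.
consumption = [7880, 8025, 8055, 8225, 8435, 8725, 9205, 9578, 9855, 10513,
               10599, 10719, 10987, 11589, 11969, 11905, 12545, 12869, 13255,
               13518, 13689, 13700, 14255, 14555, 15500]
income = [8750, 8900, 9000, 9100, 9275, 9550, 10200, 10515, 10825, 11025,
          11450, 11650, 12100, 12520, 12990, 13500, 13580, 13975, 14200,
          14525, 15250, 15500, 15980, 16700, 16925]


def fit_least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


intercept, slope = fit_least_squares(income, consumption)
print(round(intercept, 2), round(slope, 4))

# Predicted consumption for a hypothetical income of Tk. 12,000.
print(round(intercept + slope * 12000, 2))
```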

Exercises

1. The Bradford Electric Illuminating Company is studying the
relationship between kilowatt-hours (thousands) and number of rooms in a
private single-family residence. A random sample of 10 homes yielded the
following.

Number of rooms              12   9  14   6  10   8  10  10   5   7
Kilowatt-hours (thousands)    9   7  10   5   8   6   8  10   4   7

(i) Determine the regression equation.
(ii) Determine the number of kilowatt-hours, in thousands, for a six-room
house.
2. Mr. James McWhinney, president of Daniel-James Financial Services, believes there is a
relationship between the number of client contacts and the dollar amount of sales. To
document this assertion, Mr. McWhinney gathered the following sample information. The X
column indicates the number of client contacts last month, and the Y column shows the value
of sales ($ thousands) last month for each client sampled.

Number of contacts (X)     14  12  20  16  46  23  48  50   55   50
Sales, $ thousands (Y)     24  14  28  30  80  30  90  85  120  110

(i) Determine the regression equation.
(ii) Determine the estimated sales if 40 contacts are made.
