
Fundamentals of Statistical Inference
compiled by
Srilakshminarayana, G., M.Sc., Ph.D.
Shri Dharmasthala Manjunatheswara Institute
for Management Development
#1 Chamundi Hill Road, Siddhartha Nagar, Mysore-570011
(Private Circulation Only-September 2012)
Table of Contents
Table of Contents i
Important note about the material 1
1 Estimation 2
1.1 Importance of estimation in management . . . . . . . . . . . . . . . . 2
1.2 Key terms in estimation . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Determination of sample size . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Point estimator for population mean . . . . . . . . . . . . . . . . . . 9
1.4.1 Steps in obtaining an estimate of population mean . . . . . . 11
1.5 Point estimator for population variance . . . . . . . . . . . . . . . . . 11
1.5.1 Steps in calculating an estimate of population variance . . . . 11
1.6 Role of sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7 Sampling distribution of a Statistic . . . . . . . . . . . . . . . . . . . 12
1.8 Sampling error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9 Point estimator for a population proportion . . . . . . . . . . . . . . 14
1.10 Finding the best estimator . . . . . . . . . . . . . . . . . . . . . . . . 14
1.11 Drawback of point estimate . . . . . . . . . . . . . . . . . . . . . . . 15
1.12 Interval estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.13 Probability of the true population parameter falling within the interval
estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.14 Interval estimates and confidence intervals . . . . . . . . . . . . . 18
1.15 Relationship between confidence level and confidence interval . . . . 19
1.16 Using sampling and confidence interval estimation . . . . . . . . . . 19
1.17 Interval estimation of population mean (σ known) . . . . . . . . . . 20
1.18 Using the Z statistic for estimating population mean . . . . . . . . . 20
1.19 Using finite correction factor for the finite population . . . . . . . 25
1.20 Interval estimation for difference of two means . . . . . . . . . . . 27
1.21 Confidence interval estimation of the population mean (σ unknown) . 28
1.22 Checking the assumptions . . . . . . . . . . . . . . . . . . . . . . . . 29
1.23 Concept of degrees of freedom . . . . . . . . . . . . . . . . . . . . . . 30
1.24 Confidence interval estimation for population proportion . . . . . . 32
1.25 Estimation of the sample size . . . . . . . . . . . . . . . . . . . . . . 33
1.26 Sample size for estimating population mean . . . . . . . . . . . . . . 35
1.27 Sample size for estimating population proportion . . . . . . . . . . 40
1.28 Sample size for an interval estimate of a population proportion . . . . 41
1.29 Further discussion of sample size determination for a proportion . . . 42
2 Testing of Hypothesis-Fundamentals 44
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.2 Formats of Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.3 The rationale for hypothesis testing . . . . . . . . . . . . . . . . . . 48
2.4 Steps in hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5 One tail and two tail tests . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.1 One tailed test . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.5.2 Two tailed test . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6 Critical region and non-critical region . . . . . . . . . . . . . . . . . . 56
2.7 Errors in hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . 57
2.8 Test for single mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.8.1 Z-test for single mean: σ known case . . . . . . . . . . . . . 59
2.8.2 Testing Using Excel . . . . . . . . . . . . . . . . . . . . . . . . 60
2.8.3 t-test for single mean: σ unknown case . . . . . . . . . . . . 61
2.8.4 Testing Using Excel . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9 Test for single proportion . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.9.1 Testing Using Excel . . . . . . . . . . . . . . . . . . . . . . . . 64
2.10 Comparison and conclusion . . . . . . . . . . . . . . . . . . . . . . . 65
3 Testing of hypothesis-Two sample problem 67
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Test for difference of means: Z-test . . . . . . . . . . . . . . . . . 68
3.3.1 Testing Using Excel: σ₁² = σ₂² = σ² (Known) . . . . . . . . . 69
3.3.2 Testing Using Excel: Unequal Variances (Known) . . . . . . . 70
3.4 Test for difference of means: t-test . . . . . . . . . . . . . . . . . 71
3.4.1 Testing Using Excel: σ₁² = σ₂² = σ² (Unknown) . . . . . . . . 72
3.4.2 Testing Using Excel: Unequal Variances (Unknown) . . . . . . 73
3.5 Test for difference of two proportions . . . . . . . . . . . . . . . . 74
3.5.1 Testing Using Excel: Test for Difference of Proportions . . . . 75
3.6 Test for dependent samples . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6.1 Testing Using Excel . . . . . . . . . . . . . . . . . . . . . . . . 77
3.7 Test for difference of variances: F Test . . . . . . . . . . . . . . . 78
3.8 Comparison and conclusion . . . . . . . . . . . . . . . . . . . . . . . 78
4 Chi-Square tests 80
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.1 Chi-square test for significance of a population variance . . . . 81
4.1.2 Chi-square test for goodness of fit . . . . . . . . . . . . . . . 81
4.1.3 Chi-square test for independence of attributes . . . . . . . . . 82
4.2 Comparison and conclusion . . . . . . . . . . . . . . . . . . . . . . . 82
5 Analysis of Variance (ANOVA) 84
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 One way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.2 Steps for computing the F test value for ANOVA . . . . . . . 86
5.3 Two-Way Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Assumptions for the Two-Way ANOVA . . . . . . . . . . . . . 90
5.4 The Scheffé Test and the Tukey Test . . . . . . . . . . . . . . . . . 91
5.4.1 Scheffé Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Tukey Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Correlation and Regression 93
6.1 Testing significance of Correlation ρ = 0 . . . . . . . . . . . . . . . 93
6.2 Testing significance of Correlation ρ = ρ₀ . . . . . . . . . . . . . . 94
6.3 Testing significance of correlation ρ₁ = ρ₂ . . . . . . . . . . . . . . 95
6.4 Testing significance of regression model . . . . . . . . . . . . . . . 96
References 97
Important note about the material
This material is for internal circulation only and not a substitute for a text book.
It contains only fundamental steps to be followed when inferential tools are used to
analyze the data. It is restricted to the needs of the present batch and does not contain
entire information about the topic. The complete information can be found in the
text book prescribed and in other references.
Chapter 1
Estimation
1.1 Importance of estimation in management
Everyone makes estimates. When you are ready to cross a street, you estimate the
speed of any car that is approaching, the distance between you and that car, and
your own speed. Having made these quick estimates, you decide whether to wait,
walk, or run. All managers must make quick estimates too. The outcome of these
estimates can affect their organizations as seriously as the outcome of your decision as
to whether to cross the street. University department heads make estimates of next
session's enrollment in Statistics. Credit managers estimate whether a purchaser will
eventually pay his bills. Prospective home buyers make estimates concerning the
behavior of interest rates in the mortgage market. All these people make estimates
without worrying about whether they are scientific, but with the hope that the
estimates bear a reasonable resemblance to the outcome.
but the most trivial decisions, they must make rational decisions without complete
information and with a great deal of uncertainty about what the future will bring.
How do managers use sample statistics to estimate population parameters? The
department head attempts to estimate enrollments next fall from current enrollments
in the same courses. The credit manager attempts to estimate the creditworthiness of
prospective customers from a sample of their past payment habits. The home buyer
attempts to estimate the future course of interest rates by observing the current
behavior of those rates. In each case, somebody is trying to infer something about a
population from information taken from a sample. This chapter introduces methods
that enable us to estimate with reasonable accuracy the population proportion (the
proportion of the population that possesses a given characteristic) and population
mean. To calculate the exact proportion or the exact mean would be an impossible goal.
Even so, we will be able to make an estimate, make a statement about the error that
will probably accompany this estimate, and implement some controls to avoid as much
of the error as possible. As decision makers, we will be forced at times to rely on
blind hunches. Yet in other situations, in which information is available and we apply
statistical concepts, we can do better than that.
Let us start with a small discussion on why a management student should study
statistical methods to estimate unknown quantities. Estimates are made at the
low, middle, and high levels of management. At any level, we first understand
the present carefully, then look into the past to see what has happened, list all
the options the past suggests, and choose the best among them. The option that
best suits the present is taken as the solution.
For example, a manager of a production unit wishes to estimate the number of items to be
produced for the current year and depending on his estimate he wishes to place an
order for the raw materials. He takes his records and looks into the items produced,
and the raw materials used to produce them. Finally he uses his experience and takes
a decision on the items to be produced for the current year and depending on the
estimate he prepares an order for the raw materials. But what is the guarantee that
the value he estimated is free of error? How can he justify that the actual requirement
is close to the value he estimated? There is a chance that the value he estimated using
his experience may be an overestimate or an underestimate. How can he convince his
boss that the value he chose will yield the organization better profits? If everything
goes fine, then no one will blame him. It would have been better if life were free of
uncertainty, but it is not so. The manager should take care of the uncertainty
associated with the estimate obtained from his experience. At this stage one can
argue that since he is experienced, he can specify a range instead of a single
value. The statement could be: "Maybe this time the requirement will lie between
10000 and 15000." Even now there is an amount of uncertainty associated with
this, because the word "maybe" itself signals uncertainty. One can continue the
argument, but finally, what is it that we want? We want a statement that the
requirement for the current year may lie between 10000 and 15000, and that the
chance it lies outside this range is 0.05. How could we get this chance of 0.05? It
comes from the systematic procedures available in statistics. The manager needs
it because he has to report to his boss that the requirement for the current year
lies between limits A and B. The boss is much concerned about satisfying the
needs of the customers, and if anything goes wrong it is the manager who will be
targeted first. To avoid this the manager can use the statistical techniques available
and provide the range along with a chance. This is again done taking the past data
into consideration. Systematic construction requires understanding the past carefully
and choosing an appropriate tool for the given situation. One has to choose the
technique based on the study variable under consideration. This is because the tools
used for a quantitative variable cannot be used for a qualitative variable without
proper adjustment.
The example discussed above is from a production unit. Similarly, let us consider
marketing. The sales executive has to report to his boss the number of packets of
oil he will sell this month. What will he do when his boss asks him about this? He
immediately says that he will sell 150 packets of oil this month. How did he say this?
He didn't use statistics to do this; he just used his experience. There is
no point in taking an Excel sheet and using a statistical procedure to give this number.
He used his common sense, past experience, and knowledge of market conditions. He is sure that
the market needs at least 150 packets this month, and he already sold 145 packets last
month. He also has complete knowledge about his competitors' sales in the market.
Taking all these factors into consideration, he could easily estimate the current month's
sales. Let us consider another example. Suppose that this time the sales executive
has been promoted as sales manager. Now he has to estimate the sales of the entire
region. Now the problem is he is the manager but not a sales executive. He has to
take the data from the sales executives of the entire region and then he has to estimate
the sales for the current year. Depending on this estimate he has to build a strategy
to increase the sales. In the previous case he could get by on his experience and
common sense. But now he is a manager and he can't take any risk. He can still
use his experience, but this time only to develop a proper strategy. He should take
the help of statistical methods in order to give a proper estimate and to construct a better
strategy. What will he do? He will take the data from the sales executives, take
the average of all the values, adjust it according to the market conditions, and
finally give an estimate. Is it a good estimate? What adjustment should he make to
the average to give an estimate that convinces his boss? The answer is very simple.
According to the statistical theory, the sample average best estimates the population
average. Here the population average is the sales for the current year and the sample
average is the value he calculated after obtaining the data from his executives. What
about the adjustment? The adjustment is to construct an interval associated with a
probability value, take a value within the interval and consider it as an estimate.
Let us consider the case in Human resource management (HRM). Suppose that
the HR manager wishes to know about the performance of a new appraisal system
developed to appraise the employees. Since the organization has thousands of em-
ployees, it is apparent that she can't take the opinion of all the employees. She has
to take a sample of employees and consider their opinion. Here the estimate is the
proportion of employees who are against the system, which is an estimator of the
population proportion. The variable under consideration is a qualitative variable and
the appropriate estimator is the sample proportion.
Management is a discipline which uses statistical tools to support the decisions
relating to various business situations. Most of the time the decision maker will be
left with some amount of data relating to the given situation, on which he is supposed
to take a decision. It is always desirable to use the data obtained and take an appropriate
decision. One important aspect of decision making is estimation. This is a part of
statistical inference. Estimation is a systematic way of understanding the behavior of
unknown population characteristics based on a sample. These characteristics include
all the descriptive statistics related to a properly dened population. But most of the
time we are interested in population mean and variance. These are the characteristics
which play an important role in making decisions. It is very important to note that
the mean should always be reported along with the variance: the mean measures central
tendency and the variance measures dispersion. To estimate these characteristics, we use the
sample data gathered from the dened population. The sample is selected as the
true representative of the population selected for the study. Note that care has to be
taken while selecting the sample. Coming back to the estimation, we use the sample
characteristics to estimate the population characteristics. Sample mean and variance
are used to estimate the population mean and variance. Two types of estimation
have been studied formally by researchers. They are point estimates and interval
estimates. A point estimate is the value of the statistic for a given sample. We
use sample statistics as estimators to estimate the population parameters. These
estimators are functions of the sample, i.e., they produce different values for different
samples. Each value is considered as the estimate of the parameter. Point estimates
obtained for different samples, put together, constitute the sampling distribution of the
statistic. The usual understanding in estimation is that for sufficiently large samples
these sample means, when plotted, produce a normal curve. This basic assumption
is very important for constructing an interval estimate. Another important aspect of
point estimation is the associated sampling error of the statistic. When we obtain
the point estimate from a sample, it is equally important to obtain the sampling
error, or standard deviation, of the statistic. This sampling error gives the amount of
fluctuation that can be allowed below and above the estimate.
The purpose of any random sample is to estimate the properties of a popula-
tion from the data observed in the sample. The mathematical procedures appropriate
for performing this estimation depend on which properties are of interest and which
type of random sampling scheme is used. Note that the sampling scheme has to be
selected appropriately for a given situation. The decision maker has to take care of
the assumptions made at the time of selecting the sampling scheme. This is very im-
portant because the assumptions of the mathematical model that will be used in the
later stages should coincide with the assumptions made at the time of selecting the
sample. If this is not taken care of, then the results obtained may not be reliable. Along
with this, another aspect that plays an important role is sampling error. Sampling
error is the inevitable result of basing an inference on a random sample rather than
on the entire population.
1.2 Key terms in estimation
1. Population: Group of objects or individuals that possess the assumed charac-
teristics under study. This group can be finite or infinite.
2. Sample: Group of objects or individuals that possess the same characteristics
as that of population, taken for enumeration and further analysis. This group
is considered as the true representative of the entire population under study.
3. Parameter: Unknown characteristics of the population under study such as
population mean, median, mode, standard deviation etc.
4. Statistic: Characteristics of the sample such as sample mean, median, mode,
standard deviation etc.
5. Estimator: Any statistic, which is a function of sample values, used to estimate
a population parameter.
6. Estimate: An estimate is a specic value of the estimator for a given sample.
7. Point estimate: A point estimate is a numerical value, a best guess of a
population parameter, based on the data in a sample.
8. Estimation error: The estimation error is the difference between the point
estimate and the true value of the population parameter being estimated.
9. Interval estimate: An interval estimate is an interval around the point esti-
mate, calculated from the sample data, where we strongly believe the true value
of the population parameter lies.
10. Unbiased estimate: An unbiased estimate is a point estimate such that the
mean of its sampling distribution is equal to the true value of the population
parameter being estimated.
11. Efficiency: Another desirable property of a good estimator is that it be ef-
ficient. Efficiency refers to the size of the standard error of the statistic. If
we compare two statistics computed from samples of the same size and try to decide
which one is the more efficient estimator, we would pick the statistic that has
the smaller standard error, or standard deviation of the sampling distribution.
12. Sufficiency: An estimator is sufficient if it makes so much use of the infor-
mation in the sample that no other estimator could extract from the sample
additional information about the population parameter being estimated.
13. Consistency: A point estimator is said to be consistent if its value tends to
become closer to the population parameter as the sample size increases.
1.3 Determination of sample size
There are several ways to estimate an unknown characteristic of the population. In
this compiled work we only discuss parametric estimation. Interested readers can look into
standard books for other methods like non-parametric estimation, robust estimation,
etc. In parametric estimation we mainly talk about population characteristics like the
mean, variance/standard deviation, and proportion. We first discuss determination
of sample size in detail and then proceed to estimation procedures. At an intermedi-
ate stage, i.e. after collecting the sample from the population under study, we look
forward to understanding the behavior of the population through the characteristics
estimated from the sample. Hence one has to note at this point that the sample
taken plays an important role in studying the population. Now the question is what
the sample size should be. This is an interesting question, which does not have a ready-
made answer. It is an important step before the survey. Note that sampling error
decreases as the sample size increases. So we use the fact that the smaller the variance desired,
the larger the sample size needed to achieve a given degree of accuracy.
Determining the best sample size is not just a statistical decision. Statisticians
can tell you how the standard error behaves as you increase or decrease the sample
size, and the market researchers can tell you what the cost of taking more or larger
samples will be. But it's the decision maker who must use judgement to combine
these two inputs to make a sound managerial decision.
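As a sketch of how these inputs combine, the usual formula for the sample size needed to estimate a mean when σ is known (taken up again in the later sections on sample size) is n = (zσ/E)². The z value, σ, and maximum tolerable error E below are illustrative assumptions, not values from the text:

```python
import math

# Sketch of the standard sample-size formula for estimating a
# population mean when sigma is known: n = (z * sigma / E)^2,
# rounded up to the next whole unit.
# All three inputs below are illustrative assumptions.
z = 1.96        # z value for a 95% confidence level
sigma = 40.0    # assumed population standard deviation
E = 5.0         # maximum tolerable sampling error

n = math.ceil((z * sigma / E) ** 2)
print(n)  # required sample size
```

Note how the required n grows with the square of σ/E: halving the tolerable error quadruples the sample size, which is exactly the cost trade-off described above.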
1.4 Point estimator for population mean
Denition 1. Point estimator:
A sample statistic that is calculated using sample data to estimate the most likely value of
the corresponding unknown population parameter is termed as point estimator, and the
numerical value of the estimator is termed as point estimate. A point estimate consists
of a single sample statistic that is used to estimate the true value of a population
parameter.
For example, the sample mean X̄ is a point estimate of the population mean μ, and
the sample variance S² is a point estimate of the population variance σ². On
many occasions estimating the population mean is useful in business research. For
example,
1. The manager of human resources in a company might want to estimate the
average number of days of work an employee misses per year because of illness.
If the rm has thousands of employees, direct calculation of a population mean
such as this may be practically impossible. Instead, a random sample of em-
ployees can be taken, and the sample mean number of sick days can be used to
estimate the population mean.
2. Suppose that another company developed a new process for prolonging the
shelf life of a loaf of bread. The company wants to be able to date each loaf for
freshness, but company officials do not know exactly how long the bread will
stay fresh. By taking a random sample and determining the sample mean shelf
life, they can estimate the average shelf life for the population of bread.
3. As the cellular telephone industry matures, a cellular telephone company is
rethinking its pricing structure. Users appear to be spending more time on
the phone and are shopping around for the best deals. To do better planning,
the cellular company wants to ascertain the average number of minutes of time
used per month by each of its residential users but does not have the resources
available to examine all monthly bills and extract the information. The company
decides to take a random sample of customer bills and estimate the population
mean from sample data. A researcher for the company takes a random sample
of 85 bills for a recent month and from these bills computes a sample mean
of 510 min. This sample mean, which is a statistic, is used to estimate the
population mean, which is a parameter. If the company uses the sample mean
of 510 min as an estimate for the population mean, then the sample mean is
used as a point estimate.
4. A tire manufacturer developed a new tire designed to provide an increase in
mileage over the firm's current line of tires. To estimate the mean number of
miles provided by the new tires, the manufacturer selected a sample of 120 new
tires and observed a sample mean of 36,500 miles.
In all the above examples, note the statistic (sample mean) is a function of the
sample drawn from the population under study and the numerical value assumed by
this statistic is an estimate of the population mean. (Observe the difference between
an estimator and an estimate).
1.4.1 Steps in obtaining an estimate of population mean
1. Draw a sample from the population under study.
2. Find the total of all the observations in the sample.
3. Divide the total with the number of observations.
4. The resultant value is the sample mean, which is taken as the estimate of the
population mean.
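The steps above can be sketched in Python; the sample values are hypothetical and serve only to illustrate the calculation:

```python
# Step 1: a sample drawn from the population (hypothetical values,
# e.g. sick days missed by seven randomly chosen employees).
sample = [4, 7, 2, 9, 5, 3, 6]

# Step 2: total of all the observations in the sample.
total = sum(sample)

# Step 3: divide the total by the number of observations.
n = len(sample)
sample_mean = total / n

# Step 4: the result is the point estimate of the population mean.
print(sample_mean)
```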
1.5 Point estimator for population variance
The estimation of population variance is an important step in analyzing the sample
drawn from the population under study. We use sample variance to estimate the
population variance. But sample variance is not an unbiased estimator of population
variance. So we modify the formula used to calculate the sample variance. The
formula to calculate the sample variance is given by

σ̂² = s² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)²

In order to get an unbiased estimator, one has to change 1/n to 1/(n − 1). The resultant is
called the mean square error, which gives an unbiased estimator of the population variance.
1.5.1 Steps in calculating an estimate of population variance
1. Calculate the mean of the sample drawn.
2. Compute the deviation of all the observations from the mean.
3. Square the deviations and obtain the total.
4. Divide the total obtained in step 3 by n − 1.
Note 1. Note that the above formulae to calculate mean and variance are used when
individual observations are taken. If one is using a frequency distribution then they
have to include the frequencies to calculate mean and variance.
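The steps above can be sketched in Python for individual observations (hypothetical data; with a frequency distribution the frequencies would have to be brought in, as the note says):

```python
# Unbiased estimate of the population variance from a hypothetical sample.
sample = [4, 7, 2, 9, 5, 3, 6]
n = len(sample)

# Step 1: mean of the sample drawn.
mean = sum(sample) / n

# Step 2: deviations of all observations from the mean.
deviations = [x - mean for x in sample]

# Step 3: square the deviations and obtain the total.
total_sq = sum(d ** 2 for d in deviations)

# Step 4: divide by n - 1 (not n) for the unbiased estimate.
variance = total_sq / (n - 1)
print(variance)
```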
1.6 Role of sampling
In order to understand the population characteristics like mean, variance etc. it is
very important to draw a sample which is a true representative of the population.
A proper sampling design has to be adopted before drawing a sample. A sampling frame
should be prepared and then checked against the population. Care should be taken
to decrease the non-response rate. It should be noted that a random sample will
better estimate the population parameters than a non-random sample. In order to
get a better estimate, it is also important to ensure that the sample is free of any sort of
bias. The questionnaire framed to collect the responses should be tested using a pilot
survey before the actual survey. One has to note that pilot survey has to be framed
in such a way that it resembles the actual survey and should give better insights
about the resources needed to conduct the actual survey. An interesting point is
that a smaller sample which is a true representative gives more satisfactory
results than a larger sample which is not a true representative
of the population. Another interesting aspect of sampling is that the belief that larger
populations need larger samples is not always valid. The sample should be taken
depending on the situation and the objectives of the study.
1.7 Sampling distribution of a Statistic
Sampling distribution is the underlying probability distribution of the statistic used
for the study. This is constructed by taking several samples from the population.
For example, a sampling distribution of sample mean is constructed by taking as
many samples as possible from the population and by calculating sample mean for
all the samples. The set of all values constitute a sampling distribution of sample
mean. Theoretically, it has been shown that the sampling distribution of mean is
either normal (central limit theorem: finite known variance, larger sample sizes) or a
t-distribution (when the assumption of normality is satisfied: small sample sizes). When
the assumption of normality is not satisfied, the sampling distribution of the sample
mean can be approximated by the normal law using the central limit theorem for sufficiently
large sample sizes. The sampling distribution of sample variance or mean square error
is the chi-square distribution (it is discussed in detail in Chapter 4).
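The idea of a sampling distribution can be illustrated by simulation: repeatedly draw samples from a population, record each sample mean, and examine the collection of means. The population below is an artificial, deliberately non-normal one, chosen only for illustration:

```python
import random

random.seed(1)

# An artificial, right-skewed (clearly non-normal) population.
population = [random.expovariate(1.0) for _ in range(20_000)]

# Draw many samples and record the mean of each one; the collection
# of these means is an empirical sampling distribution of the mean.
sample_size = 50
num_samples = 1_000
sample_means = [
    sum(random.sample(population, sample_size)) / sample_size
    for _ in range(num_samples)
]

# By the central limit theorem the sample means cluster tightly
# around the population mean, even though the population is skewed.
pop_mean = sum(population) / len(population)
mean_of_means = sum(sample_means) / num_samples
print(pop_mean, mean_of_means)
```

Plotting `sample_means` as a histogram would show the roughly normal, bell-shaped curve the text describes, despite the skewness of the raw population.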
1.8 Sampling error
After drawing a sample, it is important to study the sampling error. For this, the
decision maker has to find the standard error of the estimator used to estimate the
population parameter. The question is: what is the relation between the standard error
and the sampling error? Note that the sample is drawn to understand the behaviour of
the population characteristics (like mean, median, etc.), which are studied using their
estimators from a sample. Obviously, if the sampling error is more, then it will be
reflected in the standard error of the estimator. Also note that the reciprocal of the standard
error gives the precision of the estimator. This is because it is expected that the
absolute difference between the true population characteristic and the sample estimator
is less than ε, where ε depends on the standard error. Refer to the determination of sample
size section to understand this better.
Sampling variation is the price we pay for working with a sample rather than the
population.
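As a sketch of the quantity discussed here, the standard error of the sample mean is s/√n, where s is the sample standard deviation computed with the n − 1 divisor. The data below are hypothetical:

```python
import math

# Standard error of the sample mean: s / sqrt(n).
# Hypothetical sample (e.g. monthly minutes used by eight customers).
sample = [510, 480, 530, 495, 520, 505, 515, 490]
n = len(sample)

mean = sum(sample) / n
# Sample standard deviation with the unbiased (n - 1) divisor.
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Larger n shrinks the standard error, i.e. reduces sampling error.
standard_error = s / math.sqrt(n)
print(standard_error)
```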
1.9 Point estimator for a population proportion
When the underlying variable is a qualitative variable, one is interested in studying
the proportion of individuals who satisfy a particular attribute. For example, the
sales manager may be interested in studying the proportion of individuals who give
more importance to quality than to cost. Here, he may confine himself to the customers who
are regular in purchasing from his store. For this properly dened population, the
parameter is the proportion (denoted by P) and the sample statistic (denoted by p or
p̂) is the unbiased estimator. To calculate the sample proportion, one has to define
the random variable under study properly. Then, identify the individuals who satisfy
the attribute (denoted by X) and take the ratio of X and n, the sample size, to
get the estimate. Note that the sampling distribution of the sample proportion can be
approximated to normal distribution. But the exact probability distribution used
to model the number of individuals who fall under a particular category is binomial
distribution.
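The calculation described here can be sketched in Python; the responses are hypothetical, coded 1 if the customer gives more importance to quality than to cost and 0 otherwise:

```python
# Sample proportion as a point estimate of the population proportion P.
# Hypothetical coded responses: 1 = has the attribute, 0 = does not.
responses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
             1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

x = sum(responses)   # X: number of individuals with the attribute
n = len(responses)   # n: sample size
p_hat = x / n        # sample proportion, the point estimate of P
print(p_hat)
```

Under the binomial model mentioned in the text, `x` is the binomial count and `p_hat` estimates its success probability.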
1.10 Finding the best estimator
A given sample statistic is not always the best estimator of its analogous population
parameter. Consider a symmetrically distributed population in which the values of
the median and the mean coincide. In this instance, the sample mean would be
an unbiased estimator of the population median. The sample mean would also be a
consistent estimator of the population median because, as the sample size increases,
the value of the sample mean tends to come very close to the population median.
And the sample mean would be a more efficient estimator of the population median
than the sample median itself because, in large samples, the sample mean has a smaller
standard error than the sample median. At the same time, the sample median of a
symmetrically distributed population would be an unbiased and consistent estimator
of the population mean, but not the most efficient estimator, because in large samples
its standard error is larger than that of the sample mean.
1.11 Drawback of point estimate
The drawback of a point estimate is that no information is available regarding its
reliability, i.e., how close it is to the true population parameter. In fact, the probability
that a single sample statistic exactly equals the population parameter is extremely
small. For this reason, point estimates are rarely used alone to estimate population
parameters. It is better to offer a range of values within which the population
parameter is expected to fall, so that the reliability (probability) of the estimate can be
measured. This is the purpose of interval estimation.
1.12 Interval estimation
In most cases, a point estimate does not provide information about how close
the estimate is to the population parameter unless accompanied by a statement of the
possible sampling error involved, based on the sampling distribution of the statistic.
It is therefore important to know the precision of an estimate before depending on
it to make a decision. Thus, decision-makers prefer to use an interval estimate (i.e.,
a range of values defined around a sample statistic) that is likely to contain the
population parameter value.
Interval estimation is a rule for calculating two numerical values, say a and b, that
create an interval which contains the population parameter of interest. It is also
important to state how confident one should be that the interval estimate
contains the parameter value; this probability is called the confidence coefficient.
Hence an interval estimate of a population parameter
is a confidence interval with a statement of confidence (probability) that the interval
contains the parameter value. In other words, a confidence interval estimate is
an interval of values, computed from sample data, that is likely to contain the true
population parameter value.
Suppose the marketing research director needs an estimate of the average life in
months of the car batteries his company manufactures. We select a random sample of
200 batteries, record the car owners' names and addresses as listed in store records,
and interview these owners about the battery life they have experienced. Our sample
of 200 users has a mean battery life of 36 months. If we use the sample mean as the
best point estimator of the population mean, we would report that
the mean life of the company's batteries is 36 months. But the director also asks for a
statement about the uncertainty that will be likely to accompany this estimate, that
is, a statement about the range within which the unknown population mean is likely
to lie. To provide such a statement, we need to find the standard error of the mean.
The general form of an interval estimate is as follows:

Point estimate ± Margin of error

The purpose of an interval estimate is to provide information about how close the
point estimate is to the value of the population parameter. The general form of an
interval estimate of a population mean is

X̄ ± Margin of error

and the general form of an interval estimate of a population proportion is

p̂ ± Margin of error

The sampling distributions of X̄ and p̂ play key roles in computing these interval
estimates.
1.13 Probability of the true population parameter
falling within the interval estimate
To begin to solve this problem, we should review the relevant properties of the
normal probability distribution: specific portions of the area
under the normal curve are located between plus and minus any given number of
standard deviations from the mean. Fortunately, we can apply these properties to
the standard error of the mean and make a statement about the range of values used to
make an interval estimate. Note that if we select and plot a large number of sample
means from a population, the distribution of these means will approximate a normal
curve. Furthermore, the mean of the sample means will be the same as the population
mean. Our sample size of 200 (in the battery example) is large enough that we can apply
the central limit theorem. To measure the spread, or dispersion, in our distribution
of sample means, we can use the following formula and calculate the standard error
of the mean:
Standard error of the mean for an infinite population:

σ_X̄ = σ/√n,

where σ is the standard deviation of the population.
Suppose we have already estimated the standard deviation of the population of
batteries and reported that it is 10 months. Using this standard deviation, we can
calculate the standard error of the mean:

σ_X̄ = σ/√n = 10/√200 = 0.707 month
We could now report to the director that our estimate of the life of the company's
batteries is 36 months, and the standard error that accompanies this estimate is
0.707 month. In other words, the actual mean life for all the batteries may lie somewhere
in the interval estimate of 35.293 to 36.707 months. This is helpful but insufficient
information for the director. Next we need to calculate the chance that the actual mean life
will lie in this interval, or in other intervals of different widths that we might choose,
such as ±2σ_X̄ (2 × 0.707), ±3σ_X̄ (3 × 0.707), and so on.
The probability is 0.955 that the mean of a sample of size 200 will be within
±2 standard errors of the population mean. Stated differently, 95.5 percent of all
sample means are within ±2 standard errors of μ, and hence μ is within ±2
standard errors of 95.5 percent of all sample means. Theoretically, if we select
1,000 samples at random from a given population and then construct an interval of
±2 standard errors around the mean of each of these samples, about 955 of these
intervals will include the population mean. Similarly, the probability is 0.683 that
the mean of a sample will be within ±1 standard error of the population mean,
and so forth. This theoretical concept is basic to interval construction and
statistical inference. Applying this to the battery example, we can now report to the
director that our best estimate of the life of the company's batteries is 36 months, and
we are 68.3 percent confident that the mean life lies in the interval from 35.293 to 36.707
months (36 ± 1σ_X̄). Similarly, we are 95.5 percent confident that it falls within
the interval of 34.586 to 37.414 months (36 ± 2σ_X̄), and we are 99.7 percent confident
that battery life falls within the interval of 33.879 to 38.121 months (36 ± 3σ_X̄).
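The battery intervals above can be reproduced directly from the standard-error formula. A short Python sketch, using the figures given in the example:

```python
import math

# Battery example: sample mean 36 months, sigma = 10, n = 200.
sigma, n, x_bar = 10, 200, 36
se = sigma / math.sqrt(n)  # standard error of the mean, about 0.707

# Intervals of +/- 1, 2, 3 standard errors around the sample mean.
for k, conf in [(1, "68.3%"), (2, "95.5%"), (3, "99.7%")]:
    low, high = x_bar - k * se, x_bar + k * se
    print(f"{conf}: {low:.3f} to {high:.3f} months")
```

Running this prints the same three intervals quoted in the text (35.293–36.707, 34.586–37.414, 33.879–38.121).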
1.14 Interval estimates and condence intervals
In using interval estimates, we are not confined to ±1, ±2, and ±3 standard errors. For
example, ±1.64 standard errors includes about 90 percent of the area under the curve;
it includes 0.4495 of the area on either side of the mean in a normal distribution.
Similarly, ±2.58 standard errors includes about 99 percent of the area, or 49.51 percent on
each side of the mean.
In statistics, the probability that we associate with an interval estimate is called
the confidence level. This probability indicates how confident we are that the interval
estimate will include the population parameter. A higher probability means more
confidence. In estimation, the most commonly used confidence levels are 90 percent,
95 percent, and 99 percent, but we are free to apply any confidence level.
The confidence interval is the range of the estimate we are making. If we report
that we are 90 percent confident that the mean of the population of incomes of people
in a certain community lies between 8,000 and 24,000, then the range 8,000–24,000
is our confidence interval. Often, however, we will express the confidence interval in
standard errors rather than in numerical values. Thus, we will often express
confidence intervals like this: X̄ ± 1.64σ_X̄, where

X̄ + 1.64σ_X̄ = upper limit of the confidence interval
X̄ − 1.64σ_X̄ = lower limit of the confidence interval

Thus, confidence limits are the upper and lower limits of the confidence interval. In
this case, X̄ + 1.64σ_X̄ is called the upper confidence limit (UCL) and X̄ − 1.64σ_X̄ is
the lower confidence limit (LCL).
1.15 Relationship between condence level and
condence interval
You may think that we should use a high confidence level, such as 99%, in all
estimation problems. After all, a high confidence level seems to signify a high degree
of accuracy in the estimate. In practice, however, high confidence levels produce
large confidence intervals, and such large intervals are not precise; they give very fuzzy
estimates.
1.16 Using sampling and condence interval
estimation
We described samples being drawn repeatedly from a given population in order
to estimate a population parameter. We also mentioned selecting a large number of
sample means from a population. In practice, however, it is often difficult or expensive
to take more than one sample from a population. Based on just one sample, we
estimate the population parameter. We must be careful, then, about interpreting the
results of such a process.
Suppose we calculate from one sample in our battery example the following
confidence interval and confidence level: we are 95 percent confident that the mean
battery life of the population lies between 30 and 42 months. This statement does not
mean that the chance is 0.95 that the mean life of all our batteries falls within the
interval established from this one sample. Instead, it means that if we select many
random samples of the same size and calculate a confidence interval for each of these
samples, then in about 95 percent of these cases the population mean will lie within
that interval.
1.17 Interval estimation of population mean (σ known)
In order to develop an interval estimate of a population mean, either the population
standard deviation σ or the sample standard deviation must be used to compute the
margin of error. Although σ is rarely known exactly, historical data or other information
available in some applications permit us to obtain a good estimate of the population
standard deviation prior to sampling. In such cases, the population standard deviation
can, for all practical purposes, be considered known. We refer to such cases as the
σ-known case.
1.18 Using the Z statistic for estimating population mean
Note that a complete census is neither a feasible nor a practical option. In order to
draw an inference about the population, a researcher has to take a sample and
apply statistical techniques to estimate the population parameter on the basis of the
sample statistic. For example, a researcher can use two methods to find the rate
of absenteeism in a manufacturing company with 500,000 employees. The first method
is to go in for a census and calculate the rate of absenteeism based on information from
all 500,000 employees. This would be extremely difficult in terms of execution
and would be time-consuming and costly. Instead of this, a researcher can take a
sample of any size (keeping in mind the definition of small- and large-sized samples)
and make an estimate based on the information obtained from the sample. The
possibility of committing non-sampling errors will also be minimized if this method
is used. We need a statistical tool that provides a good estimate of the
population parameter on the basis of the sample statistic. The Z statistic can be
used for estimating the population parameter on the basis of the sample statistic.
According to the central limit theorem, the sample means of sufficiently large
samples (n ≥ 30) are approximately normally distributed, regardless of the shape
of the population distribution. For a normally distributed population, sample means
are normally distributed for any size of sample.
Suppose the population mean μ is unknown and the true population standard
deviation σ is known. Then, for a large sample size (n ≥ 30), the sample mean X̄ is
the best point estimator for the population mean μ. Since the sampling distribution
is approximately normal, it can be used to compute a confidence interval for the population
mean as follows:

X̄ ± Z_{α/2} σ/√n    or    X̄ − Z_{α/2} σ/√n ≤ μ ≤ X̄ + Z_{α/2} σ/√n,

where Z_{α/2} is the Z-value representing an area α/2 in the right tail of the standard
normal probability distribution, and (1 − α) is the level of confidence.
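The interval just stated translates directly into code. A minimal sketch using Python's standard library, where `NormalDist().inv_cdf` supplies the critical value Z_{α/2}; the function name and default confidence level are ours:

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, conf=0.95):
    """Large-sample CI for the mean when sigma is known:
    x_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)  # right-tail critical value
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Battery example at 95% confidence: 36 +/- 1.96 * 10 / sqrt(200)
low, high = z_confidence_interval(36, 10, 200)
print(round(low, 3), round(high, 3))
```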
Alternative approach:
A (1 − α)100% large-sample confidence interval for a population mean can also be
found by using the statistic

Z = (X̄ − μ)/(σ/√n),

which has a standard normal distribution (i.e., Z ~ N(0, 1)). This formula can be
rearranged algebraically for the population mean:

μ = X̄ ± Z σ/√n
The sample mean can be greater than or less than the population mean; hence the
formula takes the form μ = X̄ ± Z σ/√n. Here α is the area under the normal curve that
lies outside the confidence interval, located in the tails of the curve. The confidence
interval is the range within which we can say, with some confidence, that the population
mean is located. We can say this with some confidence; however, we are not absolutely
sure that the population mean is within the confidence interval. In order to be 100%
sure that the population mean is within the confidence interval, the confidence level
would have to be 100%, that is, the interval would be indefinitely wide, which would be
meaningless. We use the concept of probability in order to express some degree of
certainty: we assign a probability that the population mean is located within the
confidence interval.
If Z_{α/2} is the Z-value with an area α/2 in the right tail of the normal curve, then we can
write

P( −Z_{α/2} < (X̄ − μ)/(σ/√n) < Z_{α/2} ) = 1 − α

or

−Z_{α/2} σ/√n < X̄ − μ < Z_{α/2} σ/√n

X̄ − Z_{α/2} σ/√n < μ < X̄ + Z_{α/2} σ/√n

so that

P( X̄ − Z_{α/2} σ/√n < μ < X̄ + Z_{α/2} σ/√n ) = 1 − α.

Both the lower limit X̄ − Z_{α/2} σ/√n and the upper limit X̄ + Z_{α/2} σ/√n depend on the
sample mean X̄. Thus, in repeated sampling the interval X̄ ± Z_{α/2} σ/√n will contain the
population mean μ with probability 1 − α.
The values of Z that have tail area α/2 to the right and to the left are called critical values
and are represented by Z_{α/2} and −Z_{α/2}, respectively. The area between −Z_{α/2} and Z_{α/2} is the
confidence coefficient (1 − α). Here n is the sample size, σ the population standard
deviation, α the area under the normal curve outside the confidence interval,
and α/2 the one-tail area under the normal curve outside the confidence
interval.
Here we need to understand the meaning of (1 − α). We have seen that α/2 is the
one-tail area under the normal curve outside the confidence interval. Over
both tails, this area is α/2 + α/2 = α. The total area under the normal curve is 1,
and α is the area indicating that the population mean lies outside the confidence
interval. The total area 1 under the normal curve includes 0.5 of
the area between the middle of the curve and the right tail and 0.5 of the
area between the middle of the curve and the left tail. So, the confidence-interval
area between the middle of the distribution and the right tail is (0.5 − α/2),
and the confidence-interval area between the middle of the distribution and the left tail is
(0.5 − α/2).
The probability associated with a confidence interval indicates how confident we
are that the confidence interval will include the population parameter. A higher
probability indicates higher confidence. In estimation, any confidence level can be
applied; however, the most widely used levels are 90%, 95%, and 99%.
For understanding the concept of confidence level, we take the example of the 99%
confidence level.
In case of 99% condence level, the probability statement indicates that the prob-
ability is 0.99 (99%) that the population parameter will be within the condence
Estimation 24
interval. It means that if 100 such intervals are constructed by taking a random
sample the population, it is very likely that 99 condence intervals will include the
population parameter and only one will not include the population parameter. Sim-
ilarly, if we take 95% condence interval, probability that the population parameter
is within the condence interval is 0.95.
For 99% condence interval, = 0.01 and

2
= 0.005. The area between the
middle of the normal curve and +Z

2
= +Z
0.005
can be obtained by subtracting 0.005
from the total area on the right side of the normal curve, that is, 0.5. So, the area
where the population mean is likely to lie on the right side of the normal curve is
_
0.5

2
_
= 0.495. Similarly, the area where population mean is likely to lie on
the left side of the normal curve is 0.495. This area is associated with a Z value of
2.575 (from table). While dealing with estimation, a simple question might strike a
researcher. Why cant we select the highest condence and always use that level? In
order to answer this question, we need to understand the tradeo between sample size,
interval width, and the level of condence. For example, as the level of condence
increases, the condence interval increases in width, provided the sample size and the
standard deviation remains constant.
If n = 100 and σ = 25, then σ_X̄ = σ/√n = 2.5. Using a table of areas for the
standard normal probability distribution, 95% of the values of a normally distributed
population are within ±1.96σ_X̄. Hence 95% of the sample means will be within ±4.90 of
the population mean μ. In other words, there is a 0.95 probability that the sample
mean would provide a sampling error |X̄ − μ| = 4.90 or less. The value 4.90
also provides an upper limit on the sampling error, also called the margin of error. The
value 0.95 is called the confidence coefficient, and the interval estimate X̄ ± 4.90 is called
a 95% confidence interval.
In general, a 95% confidence interval estimate implies that if all possible samples
of the same size were drawn, then 95% of the intervals X̄ ± Z_{α/2} σ/√n would contain
the true population mean μ, and for 5% of the samples the interval would not contain
the value of μ. The values Z_{α/2} for the most commonly used as well as other
confidence levels can be seen from the standard normal probability table, as shown
in the following table:

Confidence level (1 − α)%   α      α/2     Z_{α/2}
90%                         0.10   0.050   1.64
95%                         0.05   0.025   1.96
98%                         0.02   0.010   2.33
99%                         0.01   0.005   2.58
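The Z_{α/2} column of the table can be reproduced from the inverse of the standard normal CDF, available in Python's standard library:

```python
from statistics import NormalDist

# Z_{alpha/2} is the point with area alpha/2 in the right tail,
# i.e. the inverse CDF evaluated at 1 - alpha/2.
for conf, alpha in [(0.90, 0.10), (0.95, 0.05), (0.98, 0.02), (0.99, 0.01)]:
    z = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"{conf:.0%}: Z = {z:.2f}")  # 1.64, 1.96, 2.33, 2.58
```

(The table's 2.58 is the two-decimal rounding of 2.5758; the text elsewhere quotes 2.575 or 2.576 for the same quantity.)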
1.19 Using the finite correction factor for a finite population
For finite populations, we need to apply a finite correction factor to increase the
accuracy of the solution. When the sample size is less than 5% of the population, the
finite correction factor does not significantly increase the accuracy of the solution. In
the case of a finite population, the confidence interval formula takes the following shape:
Confidence interval for estimating the population mean (case of a finite population):

X̄ − Z_{α/2} (σ/√n) √((N − n)/(N − 1)) ≤ μ ≤ X̄ + Z_{α/2} (σ/√n) √((N − n)/(N − 1))
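The corrected interval can be sketched as follows. This assumes hypothetical numbers (N = 1000, n = 100), chosen only to show that the correction shrinks the margin; the function name is ours:

```python
import math
from statistics import NormalDist

def fpc_interval(x_bar, sigma, n, N, conf=0.95):
    """CI for the mean of a finite population of size N, applying the
    finite population correction sqrt((N - n) / (N - 1))."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    fpc = math.sqrt((N - n) / (N - 1))
    margin = z * (sigma / math.sqrt(n)) * fpc
    return x_bar - margin, x_bar + margin

# Hypothetical: n = 100 drawn from N = 1000 (10% of the population),
# so the correction factor is about 0.95 and narrows the interval.
print(fpc_interval(1250, 150, 100, 1000))
```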
Examples
Example 1:
The average monthly electricity consumption for a sample of 100 families is 1250 units.
Assuming the standard deviation of electricity consumption of all families is 150 units,
construct a 95% confidence interval estimate of the actual mean electricity consumption.
Solution
The information given is: X̄ = 1250, σ = 150, n = 100, and confidence level (1 − α)
= 95%. Using the standard normal curve, we find that for 95% confidence the
confidence coefficient is Z_{α/2} = 1.96. Thus the confidence limits for 95% confidence are
given by

X̄ ± Z_{α/2} σ/√n = 1250 ± 1.96 × 150/√100 = 1250 ± 29.40 units.

Thus, at the 95% level of confidence, the population mean is likely to fall between 1220.60
units and 1279.40 units.
Example 2:
A manufacturer of computer paper has a production process that operates continuously
throughout an entire production shift. The paper is expected to have a mean
length of 11 inches, and the standard deviation of the length is known to be 0.02
inch. At periodic intervals, samples are selected to determine whether the mean paper
length is still equal to 11 inches or whether something has gone wrong in the
production process to change the length of the paper produced. If such a situation
has occurred, corrective action is needed. Suppose a random sample of 100 sheets
is selected, and the mean paper length is found to be 10.998 inches. Set up a 95%
confidence interval estimate of the population mean paper length.
Solution
With Z = 1.96, the 95% confidence interval is given by

X̄ ± Z σ/√n = 10.998 ± 1.96 × 0.02/√100 = 10.998 ± 0.00392

Thus, with 95% confidence, the population mean is estimated to be between 10.99408
and 11.00192 inches. Because 11, the value indicating that the production process is
working properly, is included within the interval, there is no reason to believe that
anything is wrong with the production process.
Estimating the mean paper length with 99% confidence:
With Z = 2.58, the 99% confidence interval is given by

X̄ ± Z σ/√n = 10.998 ± 2.58 × 0.02/√100 = 10.998 ± 0.00516

Once again, because 11 is included within this wider interval, there is no reason to
believe that anything is wrong with the production process.
1.20 Interval estimation for difference of two means
If all possible samples of large sizes n₁ and n₂ are drawn from two different populations,
then the sampling distribution of the difference between the two means X̄₁ and X̄₂ is
approximately normal with mean (μ₁ − μ₂) and standard deviation

σ_{X̄₁−X̄₂} = √(σ₁²/n₁ + σ₂²/n₂)

For a desired confidence level, the confidence interval limits for the difference of the
population means are given by

(X̄₁ − X̄₂) ± Z_{α/2} σ_{X̄₁−X̄₂}
Example
The strength of the wire produced by company A has a mean of 4500 kg and a
standard deviation of 200 kg. Company B has a mean of 4000 kg and a standard
deviation of 300 kg. A sample of 50 wires of company A and 100 wires of company
B are selected at random for testing the strength. Find 99% confidence limits on
the difference in the average strength of the populations of wires produced by the two
companies.
Solution
The following information is given:
Company A: X̄₁ = 4500, σ₁ = 200, n₁ = 50
Company B: X̄₂ = 4000, σ₂ = 300, n₂ = 100
Therefore X̄₁ − X̄₂ = 500 and Z_{α/2} = 2.576, and

σ_{X̄₁−X̄₂} = √(σ₁²/n₁ + σ₂²/n₂) = √(200²/50 + 300²/100) = 41.23

The required 99% confidence interval limits are given by

500 ± 2.576 × 41.23 = 500 ± 106.20.

Hence, the 99% confidence limits on the difference in the average strength of wires
produced by the two companies are

393.80 ≤ μ₁ − μ₂ ≤ 606.20.
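The arithmetic in this example can be verified in a few lines:

```python
import math

# Wire-strength example: verify the standard error and the 99% margin.
x1, s1, n1 = 4500, 200, 50    # company A
x2, s2, n2 = 4000, 300, 100   # company B

se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # sqrt(800 + 900) = sqrt(1700)
margin = 2.576 * se                        # z for 99% confidence
print(round(se, 2), round(margin, 1))      # 41.23 106.2
low, high = (x1 - x2) - margin, (x1 - x2) + margin
```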
1.21 Confidence interval estimation of the population mean (σ unknown)
If the standard deviation σ of a population is not known, then it can be approximated
by the sample standard deviation s when the sample size n (≥ 30) is large. So the
interval estimator of a population mean for a large sample (n ≥ 30) with confidence
coefficient 1 − α is given by

X̄ ± Z_{α/2} s_X̄ = X̄ ± Z_{α/2} s/√n

When the population standard deviation is not known and the sample size is small,
the procedure of interval estimation of the population mean is based on a probability
distribution known as the t-distribution.
Student's t distribution
We have seen that when the population standard deviation is unknown, the sample
standard deviation can be used for estimating the confidence interval for large samples
(n ≥ 30). In real-life situations, a sample size of less than 30 is not very uncommon.
In the case of a small sample size (n < 30), the Z formula discussed earlier is not
applicable. The problem can be solved by using the t-statistic, developed by the British
statistician William S. Gosset. At the beginning of the 20th century, Gosset,
an employee of Guinness Breweries in Ireland, was interested in making inferences
about the mean when σ was unknown. Because Guinness employees were not permitted to
publish research work under their own names, Gosset adopted the pseudonym
"Student". The distribution that he developed has come to be known as Student's t
distribution.
If the random variable X is normally distributed, then the following statistic has
a t distribution with (n − 1) degrees of freedom:

t = (X̄ − μ)/(S/√n)
Notice that this expression has the same form as the Z statistic, except that S is used
to estimate σ, which is unknown in this case. In appearance, the t distribution is very
similar to the standard normal distribution. Both distributions are bell shaped and
symmetrical. However, the t distribution has more area in the tails and less in the
center than the standard normal distribution. Because the value of σ is unknown
and S is used to estimate it, the observed values of t will be more variable
than those of Z.
As the number of degrees of freedom increases, the t distribution gradually
approaches the standardized normal distribution until the two are virtually identical,
since S becomes a better estimate of σ as the sample size gets larger. With a sample
size of about 120 or more, S estimates σ precisely enough that there is little difference
between the t and Z distributions. For this reason, most statisticians use Z instead
of t when the sample size is greater than 120.
1.22 Checking the assumptions
Recall that the t distribution assumes that the random variable X being studied is
normally distributed. In practice, however, as long as the sample size is large enough
and the population is not very skewed, the t distribution can be used to estimate
the population mean when σ is unknown. You should be concerned about the validity
of the confidence interval primarily when dealing with a small sample size and a
skewed population distribution. The assumption of normality in the population can
be assessed by evaluating the shape of the sample data using a histogram, stem-and-leaf
display, box-and-whisker plot, or normal probability plot. A t-distribution
is lower at the mean and higher at the tails than a normal distribution.
(The critical values of t for the appropriate degrees of freedom are obtained from
the table of the t distribution. The top of each column of the table indicates the area
in the upper tail of the t distribution (because positive entries for t are supplied, the
values of t are for the upper tail). Each row gives the t value for a specific
number of degrees of freedom.)
(Characteristics of the t distribution: without deriving the t distribution
mathematically, we can gain an intuitive understanding of the relationship between the t
distribution and the normal distribution. Both are symmetrical. In general, the t
distribution is flatter than the normal distribution, and there is a different t distribution
for every possible sample size. Even so, as the sample size gets larger, the shape of
the t distribution loses its flatness and becomes approximately equal to the normal
distribution. In fact, for sample sizes of more than 30, the t distribution is so close
to the normal distribution that we will use the normal to approximate the t.)
1.23 Concept of degrees of freedom
Recall that the numerator of the sample variance S² requires the computation of

Σᵢ₌₁ⁿ (Xᵢ − X̄)²

In order to compute S², X̄ needs to be known. Therefore, only (n − 1) of the sample
values are free to vary. This means that there are (n − 1) degrees of freedom. For
example, suppose a sample of five values has a mean of 20. How many distinct values
need to be known before the remainder can be obtained? The fact that n = 5 and X̄ = 20
also tells you that

Σᵢ₌₁ⁿ Xᵢ = 100,

because

(Σᵢ₌₁ⁿ Xᵢ)/n = X̄.

Thus, when four of the values are known, the fifth one will not be free to vary because
the sum must add to 100. For example, if four of the values are 18, 24, 19 and 16,
the fifth value must be 23 so that the sum equals 100.
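The five-value example above can be checked in one line of Python:

```python
# Degrees of freedom: with n = 5 and a fixed mean of 20, the sum must
# be n * mean = 100, so once four values are known the fifth is forced.
known = [18, 24, 19, 16]
n, mean = 5, 20
fifth = n * mean - sum(known)
print(fifth)  # 23
```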
For different degrees of freedom, the t distribution is unique. For example, for
a t distribution with 10 degrees of freedom and α/2 = 0.05, the value from the t
distribution table is t_{0.05} = 1.812. For a t distribution with 20 degrees of freedom
and α/2 = 0.05, the tabular value is t_{0.05} = 1.725. From the t distribution table,
it is also clear that as the degrees of freedom continue to increase, t_{0.05} approaches
Z_{0.05} = 1.645. Here, we need to understand the concept of degrees of freedom.
The shape of the distribution varies with the degrees of freedom (df) rather than the sample
size directly. As the sample size increases, the degrees of freedom also increase (the degrees of
freedom being (n − 1), where n is the size of the sample). The number of degrees of freedom
indicates the number of values that are free to vary in a random sample. The degrees
of freedom can be understood as the number of independent observations for a source
of variation minus the number of independent parameters estimated in computing the
variation. Here one independent parameter, the population mean μ, is being estimated by
the sample mean X̄. So the degrees of freedom equal the number of independent observations
n minus the one independent parameter being estimated; in this case, the degrees of
freedom will be n − 1.
Confidence interval to estimate the population mean μ, when the population standard
deviation is unknown and the population is normally distributed:

X̄ ± t_{α/2, n−1} s/√n

X̄ − t_{α/2, n−1} s/√n ≤ μ ≤ X̄ + t_{α/2, n−1} s/√n

s = √( Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) )

where X̄ is the sample mean, n the sample size, s the sample standard deviation, α
the area under the curve that is outside the confidence interval, α/2 the one-tail
area under the curve outside the confidence interval, and the degrees of
freedom = n − 1.
The procedure for confidence interval estimation of the population mean, when the
population standard deviation is known or unknown and the sample size is large or small,
is summarized in the following table:

Sample size       σ known              σ unknown (estimated by s)
Large (n ≥ 30)    X̄ ± Z_{α/2} σ/√n    X̄ ± Z_{α/2} s/√n
Small (n < 30)    X̄ ± Z_{α/2} σ/√n    X̄ ± t_{α/2, n−1} s/√n
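The decision rule in the table can be sketched as follows. The t critical values here are hardcoded from a t table (the text quotes 1.812 for df = 10 and 1.725 for df = 20 at upper-tail area 0.05); a fuller implementation would look up, or compute, values for any df and tail area. The function name and the sample figures are ours:

```python
import math
from statistics import NormalDist

# Tiny excerpt of a t table (upper-tail area 0.05), values quoted in the text.
T_TABLE_005 = {10: 1.812, 20: 1.725}

def mean_interval(x_bar, s, n, conf=0.90):
    """Pick the critical value per the summary table: Z for n >= 30,
    t_{alpha/2, n-1} for small samples (sigma unknown in both cases)."""
    alpha = 1 - conf
    if n >= 30:
        crit = NormalDist().inv_cdf(1 - alpha / 2)
    else:
        crit = T_TABLE_005[n - 1]  # only alpha/2 = 0.05 rows included here
    margin = crit * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Small sample, n = 11 (df = 10): uses t = 1.812.
print(mean_interval(50, 6, 11, conf=0.90))
```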
1.24 Confidence interval estimation for population proportion
For estimating the population proportion, the central limit theorem for the sample
proportion can be used. The Z formula for the sample proportion, for np ≥ 5, is given by

Z = (p̂ − P)/√( p̂(1 − p̂)/n ) ~ N(0, 1)

where p̂ is the sample proportion, n the sample size, and P the population proportion.
When the sample size is large, the sample proportion p̂ = x/n (where x is the number of
successes) is the best point estimator for the population proportion P. Since the
sampling distribution of the sample proportion p̂ is approximately normal with mean
P and standard error √(P(1 − P)/n), the confidence interval for a population proportion at
confidence coefficient (1 − α) is given by

p̂ ± Z_{α/2} σ_p̂ = p̂ ± Z_{α/2} √( p̂(1 − p̂)/n )

or

p̂ − Z_{α/2} σ_p̂ ≤ P ≤ p̂ + Z_{α/2} σ_p̂

where Z_{α/2} is the Z-value corresponding to an area of α/2 in the right tail of the standard
normal probability distribution, and the quantity Z_{α/2} σ_p̂ is the margin of error (or error
of the estimation). Since P is unknown, it is estimated using the point estimator p̂. Thus,
for a sample proportion, the standard error, denoted by S.E.(p̂) or σ_p̂, is given by

σ_p̂ = √( p̂(1 − p̂)/n ).

Hence the confidence interval is given by

p̂ − Z_{α/2} √( p̂(1 − p̂)/n ) ≤ P ≤ p̂ + Z_{α/2} √( p̂(1 − p̂)/n ).
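This interval translates directly into code. A minimal sketch with hypothetical figures (60 successes in 100 trials at 95% confidence); the function name is ours:

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, conf=0.95):
    """CI for a population proportion:
    p_hat +/- z_{alpha/2} * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Hypothetical: 60 successes in 100 trials at 95% confidence.
low, high = proportion_interval(60, 100)
print(round(low, 3), round(high, 3))  # 0.504 0.696
```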
1.25 Estimation of the sample size
For sample statistics to be used to infer about the population, it is important to decide
the size of the sample. The standard errors σ_X̄ = σ/√n and σ_p̂ = √(p̂(1 − p̂)/n) of the
sampling distributions of the sample statistics X̄ and p̂ are both inversely proportional
to the square root of the sample size n, which in turn determines the widths of the
confidence intervals X̄ ± Z_{α/2} σ_X̄ and p̂ ± Z_{α/2} σ_p̂; the
width or range of the confidence interval can therefore be decreased by increasing the
sample size n.
In each example concerning confidence interval estimation, the sample size was
selected without regard to the width of the resulting confidence interval. In the business
world, determining the proper sample size is a complicated procedure, subject
to the constraints of budget, time, and ease of selection.
While conducting any kind of research, the most frequently asked question is:
what should the sample size be? Most researchers seem to be perplexed about the
sample size. What is an appropriate sample size? Should it be a predetermined
percentage of the population? Is there any formula which can produce the optimum
sample size? All these are very common questions in the mind of a researcher
undertaking research work based on sample selection.
From the previous discussion, it is clear that the standard errors σ/√n and √(p(1 − p)/n) of the sampling distributions of the sample statistics X̄ and p̂ are inversely proportional to the sample size n. From the formula of the confidence interval, it is clear that both σ/√n and √(p(1 − p)/n) are related to the widths of the confidence intervals X̄ ± Z σ_X̄ and p̂ ± Z σ_p̂, respectively. This relationship indicates that the width of the confidence interval decreases as the sample size increases. Apart from this, the selection of sample size depends on factors such as time, cost, convenience of sample selection, etc. This information must also be kept in mind while determining the appropriate sample size.
Precision of confidence interval:
The precision with which a confidence interval estimates the true population parameter is determined by the width of the confidence interval: the narrower the confidence interval, the more precise the estimate, and vice versa. The width of a confidence interval is influenced by

1. The specified level of confidence.

2. The sample size.

3. The population standard deviation.
To gain more precision, or confidence, or both, the sample size needs to be increased, provided there is little variability in the population itself. If the sample size cannot be increased, for instance because the investigator cannot afford the increased cost of sampling, then with the same sample size the only way to maintain the desired level of precision is to lower the confidence level.

The decision regarding the appropriate size of sample depends on:

I. The precision level needed in estimating the characteristics of the population, i.e., what margin of error is acceptable?

II. The confidence level needed, i.e., how much chance of error in estimating the population parameter can be tolerated?

III. The extent of variability in the population investigated.

IV. A cost-benefit analysis of increasing the sample size.
For example, an insurance company wants to estimate the proportion of claims settled within 2 months of the receipt of the claim. For this purpose, the company must decide how much error it is willing to allow in estimating the population proportion of claims settled in a particular financial year; that is, whether accuracy is required to be within 80 claims, 100 claims, and so on. The company also needs to determine in advance the level of confidence for estimating the true population parameter. Hence, for determining the sample size for estimating a population mean or proportion, such requirements must be kept in mind along with information regarding the standard deviation.
1.26 Sample size for estimating population mean

While estimating the population mean μ, the sample size n can be determined from the Z formula for the sample mean. When the distribution of the sample mean X̄ is normal, the standard normal variable is

Z = (X̄ − μ) / (σ/√n)  or  X̄ − μ = Z σ/√n
The value of Z can be read from the standard normal table for a specified confidence coefficient 1 − α.

The value of Z in the above equation will be positive or negative, depending on whether the sample mean X̄ is larger or smaller than the population mean μ. This difference between X̄ and μ is called the sampling error or margin of error E. Thus, for estimating the population mean with the condition that the error in its estimation should not exceed a fixed value, say E, we require that the sample mean X̄ fall within the range μ ± E with a specified probability. The margin of error acceptable (i.e., the maximum tolerable difference between the unknown population mean and the sample estimate at a particular level of confidence) can be written as:

X̄ − μ = Z_{α/2} σ/√n  or  E = Z_{α/2} σ/√n
Solving for n yields a formula that can be used to determine the sample size:

√n = Z_{α/2} σ / E  or  n = (Z_{α/2} σ / E)²

This formula for the sample size n will provide the tolerable margin of error E at the chosen confidence level 1 − α (which determines the critical value of Z from the normal table), with a known or estimated population standard deviation σ.
In the equation to estimate the sample size, E is the margin of error that the user is willing to accept, and the value of Z_{α/2} follows directly from the confidence level to be used in developing the interval estimate. Although user preference must be considered, 95% confidence is the most frequently chosen value (Z_{0.025} = 1.96).
Finally, use of the equation requires a value for the population standard deviation σ. However, even if σ is unknown, we can use the equation provided we have a preliminary or planning value for σ. In practice, one of the following procedures can be used:

1. Use the estimate of the population standard deviation computed from data of previous studies as the planning value of σ.

2. Use a pilot study to select a preliminary sample. The sample standard deviation from the preliminary sample can be used as the planning value for σ.

3. Use judgment or a best guess for the value of σ. For example, we might begin by estimating the largest and smallest data values in the population. The difference between the largest and smallest values provides an estimate of the range of the data. Finally, the range divided by 4 is often suggested as a rough approximation of the standard deviation and thus an acceptable planning value for σ.
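Procedure 3 (range divided by 4 as a rough planning value for σ) can be sketched in Python as follows; the numbers are made up purely for illustration:

```python
import math

def planning_sigma_from_range(smallest, largest):
    # Range / 4 as a rough planning value for the population std. deviation
    return (largest - smallest) / 4

def sample_size_for_mean(sigma, E, z=1.96):
    # n = (z * sigma / E)^2, rounded up to the next integer
    return math.ceil((z * sigma / E) ** 2)

# Suppose values are believed to range from 20 to 100, and we want E = 5
sigma = planning_sigma_from_range(20, 100)  # (100 - 20) / 4 = 20
print(sample_size_for_mean(sigma, E=5))     # (1.96 * 20 / 5)^2 = 61.47 -> 62
```

Because the range/4 rule is only a rough approximation, the resulting n should be treated as a planning figure, not a guarantee of the stated margin of error.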
(To determine the sample size, you must know three factors:

1. The desired confidence level, which determines the value of Z, the critical value from the standardized normal distribution.

2. The acceptable sampling error E.

3. The standard deviation σ.)
In some business-to-business relationships requiring estimation of important parameters, acceptable levels of sampling error and the required confidence level are specified in legal contracts. For companies in the food or drug sectors, sampling errors and confidence levels are often specified by government regulations. In general, however, it is usually not easy to specify the two factors needed to determine the sample size. How can you determine the level of confidence and the sampling error? Typically, these questions are answered only by the subject matter expert (i.e., the individual most familiar with the variables to be analyzed). Although 95% is the most common confidence level used (in which case Z = 1.96), 99% might be more appropriate if more confidence is needed. If less confidence is deemed acceptable, then 90% might be used. For the sampling error, you should be thinking not of how much sampling error you would like to have (you really do not want any error) but of how much can be tolerated while still permitting you to draw adequate conclusions from the data.

In addition to specifying the confidence level and the sampling error, an estimate of the standard deviation must be available. Unfortunately, the population standard deviation is rarely known. In some instances, the standard deviation can be estimated from past data. In other situations, an educated guess can be made by taking into account that the range is approximately equal to 6σ (i.e., ±3σ around the mean), so that σ is estimated as range/6. If σ cannot be estimated in this manner, a pilot study can be conducted and the standard deviation estimated from the resulting data.
Let us demonstrate the use of the equation to determine the sample size by considering the following example. A previous study that investigated the cost of renting automobiles in the United States found a mean cost of approximately $55 per day for renting a mid-size automobile. Suppose that the organization that conducted this study would like to conduct a new study in order to estimate the population mean daily rental cost for a mid-size automobile in the United States. In designing the new study, the project director specifies that the population mean daily rental cost be estimated with a margin of error of $2 and a 95% level of confidence.

The project director specified a desired margin of error of E = 2, and the 95% level of confidence indicates Z_{0.025} = 1.96. Thus we only need a planning value for the population standard deviation in order to compute the required sample size. At this point, an analyst reviewed the sample data from the previous study and found that the sample standard deviation for the daily rental cost was $9.65. Using 9.65 as the planning value for σ, we obtain
n = (Z_{α/2})² σ² / E² = (1.96)² (9.65)² / 4 = 89.43
Thus, the sample size for the new study needs to be at least 89.43 mid-size automobile rentals in order to satisfy the project director's $2 margin-of-error requirement. In cases where the computed n is not an integer, we round up to the next integer value; hence, the recommended sample size is 90 mid-size automobile rentals.
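The rental-cost calculation can be checked numerically; this short sketch (our variable names) applies n = (Z_{α/2} σ / E)² and rounds up:

```python
import math

z = 1.96       # Z_{alpha/2} for 95% confidence
sigma = 9.65   # planning value for sigma from the earlier study
E = 2          # desired margin of error ($)

n_exact = (z * sigma / E) ** 2  # exact value of the sample-size formula
n = math.ceil(n_exact)          # round up to the next integer
print(round(n_exact, 2), n)     # 89.43 90
```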
Example 1:
Suppose the sample standard deviation of P/E ratios for stocks listed on the Bombay Stock Exchange (BSE) is s = 7.8. Assume that we are interested in estimating the population mean P/E ratio for all stocks listed on the BSE with 95% confidence. How many stocks should be included in the sample if we desire a margin of error of 2?

Solution:
The information given is E = 2, s = 7.8, and Z_{α/2} = 1.96 at the 95% level of confidence. Using the formula for n and substituting the given values, we have

n = (Z_{α/2})² σ² / E² = (1.96)² (7.8)² / 4 = 58.43 ≈ 59

Thus a sample of size n = 59 should be chosen to estimate the population mean P/E ratio for all stocks on the BSE.
Example 2:
A person wants to buy a machine component in large quantity from a company. The company's sales manager is requested to provide data for the mean life of the component. The manager considers it worth Rs. 800 to obtain an estimate that has 19 chances in 20 (i.e., 95% confidence) of being within 0.05 of the correct value. The cost of setting up the equipment to test the component is Rs. 500 and the cost of testing a component is Rs. 2.50. It is known from past records that the standard deviation of the life of the component is 0.80. Will the manager be able to obtain the required estimate? What is the minimum cost of obtaining the necessary estimate?

Solution:
The money available to obtain the estimate is Rs. 800; the money required for setting up the test equipment is Rs. 500, and the cost of testing a component is Rs. 2.50. Thus the affordable sample size is

n = (800 − 500)/2.5 = 120

Also given: sample standard deviation s = 0.8 and Z_{α/2} = 1.96 at the 95% confidence level. Thus the confidence interval for the population mean is

X̄ ± Z_{α/2} σ_X̄ = X̄ ± Z_{α/2} s/√n = X̄ ± 1.96(0.073) = X̄ ± 0.143

that is,

X̄ − 0.143 ≤ μ ≤ X̄ + 0.143

Since the margin of error is required to be 0.05, and 0.143 > 0.05, the budget of Rs. 800 is not sufficient to obtain the required estimate. The sample size actually required is found from

E = Z_{α/2} s/√n, which gives n = (1.96 × 0.8/0.05)² ≈ 984

Thus the cost of testing the 984 components in the sample will be 2.5 × 984 = Rs. 2460, and the minimum cost of obtaining the necessary estimate is 500 + 2460 = Rs. 2960.
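The arithmetic of this example can be reproduced with a short script (variable names are ours); note that testing 984 components at Rs. 2.50 each costs Rs. 2460, plus Rs. 500 for setup:

```python
import math

budget, setup_cost, unit_cost = 800, 500, 2.50
s, z, E = 0.80, 1.96, 0.05

# Sample size the Rs. 800 budget allows after the Rs. 500 setup cost
n_affordable = int((budget - setup_cost) / unit_cost)  # 120
half_width = z * s / math.sqrt(n_affordable)           # 1.96 * 0.073 = 0.143

# Sample size actually required for a margin of error E = 0.05
n_required = math.ceil((z * s / E) ** 2)               # 984
min_cost = setup_cost + unit_cost * n_required         # 500 + 2460 = 2960

print(n_affordable, round(half_width, 3), n_required, min_cost)
```

Since 0.143 exceeds the required 0.05, the script confirms that the budget supports only 120 tests, while 984 are needed.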
Remark 1.26.1. The general rule used in determining sample size is always to round up to the next integer value.
1.27 Sample size for estimating a population proportion

The rationale for determining the sample size for developing an interval estimate of a population proportion at a specified level of precision is similar to that used for the population mean in the previous section.
Note that the margin of error associated with an interval estimate of a population proportion is Z_{α/2} √(p̂(1 − p̂)/n). The margin of error is based on the value of Z_{α/2}, the sample proportion p̂, and the sample size n. Larger sample sizes provide a smaller margin of error and better precision.

Let E denote the desired margin of error:

E = Z_{α/2} √(p̂(1 − p̂)/n)

Solving this equation for n provides a formula for the sample size that will provide a margin of error of size E:

n = (Z_{α/2})² p̂(1 − p̂) / E²

Note, however, that we cannot use this formula directly to compute the sample size that will provide the desired margin of error, because p̂ will not be known until after we select the sample. What we need, then, is a planning value for p̂ that can be used to make the computation. Using p* to denote the planning value for p̂, the sample size that will provide a margin of error of size E is

n = (Z_{α/2})² p*(1 − p*) / E²
1.28 Sample size for an interval estimate of a population proportion

In practice, the planning value p* can be chosen by one of the following procedures:

1. Use the sample proportion from a previous sample of the same or similar units.

2. Use a pilot study to select a preliminary sample. The sample proportion from this sample can be used as the planning value p*.

3. Use judgment or a best guess for the value of p*.

4. If none of the preceding alternatives apply, use a planning value of p* = 0.5.
1.29 Further discussion of sample size determination for a proportion

To determine the sample size for estimating a proportion, three unknowns must be defined:

1. The desired level of confidence, which determines the value of Z.

2. The acceptable sampling error E.

3. The true proportion of successes, p.

In practice, the selection of these quantities requires some planning. Once the desired level of confidence is determined, you can obtain the appropriate Z value from the standardized normal distribution. The sampling error E indicates the amount of error that you are willing to accept or tolerate in estimating the population proportion. The third quantity, the true proportion of successes p, is actually the population parameter that you want to find. Thus, how do you state a value for the very thing that you are taking a sample in order to determine?

Here there are two alternatives. First, in many situations, past information or relevant experience may be available that provides an educated estimate of p. Second, if past information or relevant experience is not available, you can try to provide a value for p that would never underestimate the sample size needed. Referring to the equation used to estimate n, observe that the quantity p(1 − p) appears in the numerator; at p = 0.5 the product p(1 − p) achieves its maximum value. Several values of p along with the accompanying products p(1 − p) are:

When p = 0.9, then p(1 − p) = 0.09
When p = 0.7, then p(1 − p) = 0.21
When p = 0.5, then p(1 − p) = 0.25
When p = 0.3, then p(1 − p) = 0.21
When p = 0.1, then p(1 − p) = 0.09

Therefore, when there is no prior knowledge or estimate of the true proportion p, you should use p = 0.5 as the most conservative way of determining the sample size. This produces the largest possible sample size, but it also results in the highest possible cost of sampling. The use of p = 0.5 may overestimate the sample size needed, because the actual sample proportion is used in developing the confidence interval: if the actual sample proportion is very different from 0.5, the width of the confidence interval may be substantially narrower than originally intended. The increased precision comes at the cost of spending more time and money on an increased sample size.
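The conservatism of p = 0.5 can be illustrated by comparing the resulting sample sizes; this sketch uses an arbitrary E = 0.05 at 95% confidence:

```python
import math

def sample_size_for_proportion(p_star, E, z=1.96):
    # n = z^2 * p*(1 - p*) / E^2, rounded up to the next integer
    return math.ceil(z ** 2 * p_star * (1 - p_star) / E ** 2)

# p* = 0.5 maximizes p*(1 - p*) and hence yields the largest n
for p_star in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p_star, sample_size_for_proportion(p_star, E=0.05))
```

Running this shows n growing from 139 at p* = 0.1 or 0.9 to 385 at p* = 0.5, mirroring the p(1 − p) table above.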
Chapter 2
Testing of Hypothesis-Fundamentals
2.1 Introduction

Inferential statistics is concerned with estimating the true value of population parameters using sample statistics. Chapter 1 covered three related ideas of inferential statistics, namely:

1. A point estimate.

2. A confidence interval that is likely to contain the true parameter value.

3. The degree of confidence associated with the parameter value lying within that interval.

This information helps decision-makers determine an interval estimate of a population parameter value with the help of a sample statistic, along with a certain level of confidence that the interval contains the parameter value. Such an estimate is helpful for drawing statistical inferences about the characteristic (or feature) of the population of interest.
Another way of assessing the true value of a population parameter is to test the validity of a claim (assertion or statement) made about this true value using sample statistics.

In diverse fields such as marketing, personnel management, financial management, etc., decision makers might need answers to certain questions in order to take optimum decisions. For example, a marketing manager might be interested in assessing customer loyalty for a particular product; a personnel manager might be interested in knowing the job satisfaction level of employees; a financial manager might be interested in understanding the financial aspects of the company's retirement scheme. In every case the concerned manager has to make decisions on the basis of the available information, and in most cases information is obtained through sampling. The sample statistic computed through sampling is used to make an inference about the population parameter.
In order to find the answers to such questions, a decision maker needs to collect sample data, compute the sample statistic, and use this information to ascertain the correctness of the hypothesized population parameter. For example, suppose the Vice President (HR) of a company wants to know the effectiveness of a training programme which the company has organized for all its 70,000 employees based at 130 different locations in the country. Contacting all these employees with an effectiveness-measurement questionnaire is not feasible, so the Vice President (HR) takes a sample of size 629 from all the different locations in the country. The result obtained would not be the result from the entire population but only from the sample. The Vice President (HR) will then set an assumption that training has not enhanced efficiency, and will accept or reject this assumption through a well-defined statistical procedure known as hypothesis testing. A statistical hypothesis is an assumption about an unknown population parameter. Hypothesis testing starts with an assumption, termed a hypothesis, made on the basis of intuition or of general information. Hypothesis testing is a well-defined procedure which helps us decide objectively whether to accept or reject the hypothesis based on the information available from the sample.
Now we need to understand the rationale of hypothesis testing. Drawing a random sample from the population is based on the assumption that the sample will resemble the population. Based on this philosophy, the known sample statistic is used for estimating the unknown population parameter. When a researcher sets a hypothesis or assumption, he assumes that the sample statistic will be close to the hypothesized population parameter. This is possible in cases where the hypothesized population parameter is correct and the sample statistic is a good estimate of the population parameter. In real life, we cannot expect the sample statistic to always be a good estimate of the population parameter. Differences are likely to occur due to sampling and non-sampling errors or due to chance. A large difference between the sample statistic and the hypothesized population parameter raises questions about the accuracy of the sampling technique. In statistical analysis, we use the concept of probability to specify a probability level at which a researcher concludes that the observed difference between the sample statistic and the population parameter is not due to chance.

The process that enables a decision maker to test the validity (or significance) of a claim by analyzing the difference between the value of the sample statistic and the corresponding hypothesized population parameter value is called hypothesis testing.

Hypothesis testing begins with an assumption, called a hypothesis, which we make about a population parameter. Then we collect sample data, produce sample statistics, and use this information to decide how likely it is that our hypothesized population parameter is correct. Say that we assume a certain value for a population mean.
To test the validity of our assumption, we gather sample data and determine the difference between the hypothesized value and the actual value of the sample mean. Then we judge whether the difference is significant. The smaller the difference, the greater the likelihood that our hypothesized value for the mean is correct; the larger the difference, the smaller the likelihood.

Unfortunately, the difference between the hypothesized population parameter and the actual statistic is more often neither so large that we automatically reject our hypothesis nor so small that we do not reject it. So in hypothesis testing, as in most significant real-life decisions, clear-cut solutions are the exception, not the rule.
2.2 Formats of Hypothesis

As stated earlier, a hypothesis is a statement to be tested about the true value of a population parameter using sample statistics. A hypothesis about whether there exists any significant difference between two or more populations with respect to any of their common parameters can also be tested. To examine whether any difference exists or not, a hypothesis can be stated in the form of an if-then statement. Consider, for instance, the nature of the following statements:

(A) If the inflation rate has decreased, then the wholesale price index will also decrease.

(B) If employees are healthy, then they will take sick leave less frequently.

If terms such as positive, negative, more than, less than, etc. are used to make a statement, then such a hypothesis is called a directional hypothesis, because it indicates the direction of the relationship between two or more populations under study with respect to a parameter value, as illustrated below:

(A) Side effects were experienced by less than 20 percent of people who take a particular medicine.

(B) The greater the stress experienced in the job, the lower the job satisfaction of employees.

A non-directional hypothesis indicates a relationship (or difference), but offers no indication of the direction of the relationship (or difference). In other words, though it may be obvious that there would be a significant relationship between two populations with respect to a parameter, we may not be able to say whether the relationship would be positive or negative. Similarly, even if we consider that two populations differ with respect to a parameter, it will not be easy to say which population will be more or less. The following examples illustrate non-directional hypotheses:

(A) There is a relationship between age and job satisfaction.

(B) There is a difference between the average pulse rates of men and women.
2.3 The rationale for hypothesis testing

Inferential statistics is concerned with estimating the unknown population parameter by using sample statistics. If a claim or assumption is made about a specific value of the population parameter, then it is expected that the corresponding sample statistic will be close to the hypothesized parameter value. This is possible only if the hypothesized parameter value is correct and the sample statistic turns out to be a good estimator of the parameter. The sample statistic used in this way to test a hypothesis is called a test statistic.

Since sample statistics are random variables, their sampling distributions show a tendency to vary. Consequently, we do not expect the sample statistic value to be exactly equal to the hypothesized parameter value. The difference, if any, may be due to chance and/or sampling error. But if the value of the sample statistic differs significantly from the hypothesized parameter value, then the question arises whether the hypothesized parameter value is correct or not. The greater the difference between the value of the sample statistic and the hypothesized parameter, the more doubt there is about the correctness of the hypothesis.

In statistical analysis, the difference between the value of the sample statistic and the hypothesized parameter is assessed at a given level of probability, to judge whether the particular level of difference is significant when the hypothesized parameter value is correct. The probability that a particular level of deviation occurs by chance can be calculated from the known sampling distribution of the test statistic.

The probability level at which the decision-maker concludes that the observed difference between the value of the test statistic and the hypothesized parameter value cannot be due to chance is called the level of significance of the test.
2.4 Steps in hypothesis testing

1. Null hypothesis: The null hypothesis, generally referred to as H₀, is the hypothesis which is tested for possible rejection under the assumption that it is true. Theoretically, a null hypothesis is set as "no difference" and considered true until and unless it is proved wrong by the collected sample data. The null hypothesis is always expressed in the form of an equation making a claim regarding the specific value of the population parameter. Symbolically, a null hypothesis is represented as

H₀: μ = μ₀

where μ is the population mean and μ₀ is the hypothesized value of the population mean. For example, to test whether a population mean is equal to 150, a null hypothesis can be set as "population mean is equal to 150". Symbolically,

H₀: μ = 150

The null hypothesis H₀ represents the claim or statement made about the value or range of values of the population parameter. The capital letter H stands for hypothesis and the subscript zero implies "no difference" between the sample statistic and the parameter value. Thus hypothesis testing requires that the null hypothesis be considered true until it is proved false on the basis of results observed from the sample data. The null hypothesis is always expressed as a mathematical statement which includes one of the signs (≤, =, ≥) making a claim regarding the specific value of the population parameter. Only one sign out of ≤, =, and ≥ will appear at a time when stating the null hypothesis.

(a) It is a hypothesis of no difference which is being tested for possible rejection.

(b) It is a claim regarding the parameters of the population.

(c) If the variable under study is quantitative, the sales manager of an organization may claim that the average sales for the current year will be 10,000 items.

(d) If the variable under study is qualitative, the HR manager may claim that the proportion of individuals who prefer the appraisal system will be 0.75.

(e) The null hypothesis is denoted by H₀.

(f) In most cases it will be of the "equal to" type. The exceptions are studies where the null hypothesis will be of the "≤" or "≥" type. Depending on the situation, one has to decide upon the null hypothesis.

(g) Suppose that one decides to test whether the level of medicine in the bottle meets exactly the specifications given by the doctor. In such cases, the null hypothesis will be of the "equal to" type.

(h) Suppose that one decides to test a researcher's claim that the average price of a stock will be greater than Rs. 25. In such cases, the null hypothesis will be of the "≤" type.

(i) Suppose that one decides to test the claim of a manufacturer that the average mileage will be at least 60 km/ltr. In such cases, the null hypothesis will be of the "≥" type.
2. Alternative hypothesis: An alternative hypothesis, H₁, is the counter-claim (statement) made against the value of the particular population parameter. That is, the alternative hypothesis must be true when the null hypothesis is found to be false. In other words, the alternative hypothesis states that the specific population parameter value is not equal to the value stated in the null hypothesis, and is written as:

H₁: μ ≠ μ₀  or  H₁: μ < μ₀  or  H₁: μ > μ₀.

It is the logical opposite of the null hypothesis. In other words, when the null hypothesis is found to be true, the alternative hypothesis must be false, and when the null hypothesis is found to be false, the alternative hypothesis must be true. The alternative hypothesis represents the conclusion reached by rejecting the null hypothesis if there is sufficient evidence from the sample information to decide that the null hypothesis is unlikely to be true. Hypothesis-testing methodology is designed so that the rejection of the null hypothesis is based on evidence from the sample that the alternative hypothesis is far more likely to be true. However, failure to reject the null hypothesis is not proof that it is true. One can never prove that the null hypothesis is correct, because the decision is based only on the sample information, not on the entire population. Therefore, if you fail to reject the null hypothesis, you can only conclude that there is insufficient evidence to warrant its rejection. A summary of the null and alternative hypotheses is presented below.

The null and alternative hypotheses:

(a) The null hypothesis H₀ represents the status quo or the current belief in a situation.

(b) The alternative hypothesis H₁ is the opposite of the null hypothesis and represents a research claim or specific inference you would like to prove.

(c) If you reject the null hypothesis, you have statistical proof that the alternative hypothesis is correct.

(d) If you do not reject the null hypothesis, then you have failed to prove the alternative hypothesis. The failure to prove the alternative, however, does not mean that you have proven the null hypothesis.

(e) The null hypothesis H₀ always refers to a specified value of the population parameter (such as μ), not a sample statistic (such as X̄).

(f) The statement of the null hypothesis always contains an equal sign regarding the specified value of the population parameter (e.g. H₀: μ = 368 grams).

(g) The statement of the alternative hypothesis never contains an equal sign regarding the specified value of the population parameter (e.g. H₁: μ ≠ 368 grams).

Each of the following pairs is an example of a null hypothesis and its alternative hypothesis:

H₀: μ = μ₀  versus  H₁: μ ≠ μ₀
H₀: μ ≤ μ₀  versus  H₁: μ > μ₀
H₀: μ ≥ μ₀  versus  H₁: μ < μ₀
(I) Directional hypotheses

(a) H₀: There is no difference between the average pulse rates of men and women.
    H₁: Men have lower average pulse rates than women do.

(b) H₀: There is no relationship between exercise intensity and the resulting aerobic benefit.
    H₁: Increasing exercise intensity increases the resulting aerobic benefit.

(c) H₀: The defendant is innocent.
    H₁: The defendant is guilty.

(II) Non-directional hypotheses

(a) H₀: Men and women have the same verbal abilities.
    H₁: Men and women have different verbal abilities.

(b) H₀: The average monthly salary for management graduates with 4 years of experience is Rs. 75,000.
    H₁: The average monthly salary is not Rs. 75,000.

(c) H₀: Older workers are more loyal to a company.
    H₁: Older workers may not be loyal to a company.
3. Determine the appropriate statistical test:
After setting the hypothesis, the researcher has to decide on an appropriate sta-
tistical test that will be used for statistical analysis. The tests of signicance or
test statistic are classied into two categories: parametric and non-parametric
tests. Parametric tests are more powerful because their data are derived from
interval and ratio measurements. Nonparametric tests are used to test hypothe-
ses with nominal and ordinal data. Parametric techniques are the tests of choice
provided certain assumptions are met. Assumptions for parametric tests are as
follows:
i. The selection of any element (or member) from the population should not
aect the chance for any other to be included in the sample to be drawn
from the population.
ii. The samples should be drawn from normally distributed population.
iii. Populations under study should have equal variances.
Non-parametric tests have few assumptions and do not specify normally dis-
tributed populations or homogeneity of variance.
Selection of a test:
In choosing a particular test of significance, the following three factors are considered:
a. Does the test involve one sample, two samples, or k samples?
b. Are the samples independent or related?
c. Is the measurement scale nominal, ordinal, interval, or ratio?
Further, it is also important to know: (i) the sample size, (ii) the number of samples and their sizes, and (iii) whether the data have been weighted. Such questions help in selecting an appropriate test statistic. One-sample tests are used for a single sample, to test the hypothesis that it comes from a specified population. The following questions need to be answered before using one-sample tests:
a. Is there a difference between the observed frequencies and the frequencies expected on the basis of a statistical theory?
b. Is there a difference between observed and expected proportions?
c. Is it reasonable to conclude that the sample is drawn from a population with some specific distribution (normal, Poisson, and so on)?
d. Is there a significant difference between some measure of central tendency and its population parameter?
The value of the test statistic is calculated from the distribution of the sample statistic using the formula

Test statistic = (value of sample statistic − hypothesized value of the population parameter) / (standard error of the sample statistic)

The choice of the probability distribution of a sample statistic is guided by the sample size n and by whether the population standard deviation σ is known, as shown below:

Sample size      σ known                 σ unknown
n > 30           Normal distribution     Normal distribution
n ≤ 30           Normal distribution     t-distribution
4. Level of significance: This is the admissible level of error at which we test the null hypothesis. The level of significance, generally denoted by α, is the probability of rejecting the null hypothesis even when it is true. The level of significance is also known as the size of the rejection region or the size of the critical region. It is very important that the level of significance be determined before drawing the samples, so that the obtained result is free from the choice bias of the decision maker. The levels of significance generally applied by researchers are 0.01, 0.05, and 0.10. It is specified in terms of the probability of the null hypothesis H0 being wrongly rejected. In other words, the level of significance defines the likelihood of rejecting a null hypothesis when it is true, i.e. it is the risk a decision maker takes of rejecting the null hypothesis when it is really true. The guide provided by statistical theory is that this probability must be small.
5. Test statistic: This is constructed using the statistic that estimates the population parameter on which the hypothesis is being tested. The value of the test statistic decides whether or not to reject the hypothesis.
6. Critical value: After constructing the test statistic, we need to obtain the critical value. This critical value divides the entire region into the critical and non-critical regions.
7. Conclusion: At this stage, the calculated value of the test statistic is compared with the critical value and a conclusion is drawn accordingly. In recent times, the p-value approach has become prominent; both methods are discussed in detail in the next section.
8. Power of the test: This measures the strength of the test in correctly rejecting the null hypothesis. Its calculation will be discussed for each test separately using an example.
2.5 One tail and two tail tests
The form of the alternative hypothesis can be either one-tailed or two-tailed, depending on what the analyst is trying to prove.
2.5.1 One tailed test
One-tailed tests are further classified as right-tailed and left-tailed tests. The alternative hypothesis decides whether a test is right-tailed or left-tailed. If the alternative hypothesis is of the type '>', the test is classified as a right-tailed test, and if the alternative hypothesis is of the type '<', the test is classified as a left-tailed test. Note that the '=' sign should always appear in the null hypothesis (let us accept this). This is because the test statistic is calculated under the assumption that the null hypothesis is true.
2.5.2 Two tailed test
When the alternative hypothesis is of the type '≠', the test is classified as a two-tailed test.
2.6 Critical region and non-critical region
The sampling distribution of the test statistic is divided into two regions, a region of rejection (sometimes called the critical region) and a region of non-rejection. If the test statistic falls into the region of non-rejection, you do not reject the null hypothesis. If the test statistic falls into the rejection region, you reject the null hypothesis.
The region of rejection consists of the values of the test statistic that are unlikely to occur if the null hypothesis is true. These values are much more likely to occur if the null hypothesis is false. Therefore, if a value of the test statistic falls into this rejection region, you reject the null hypothesis, because that value is unlikely if the null hypothesis is true. To make a decision concerning the null hypothesis, you first determine the critical value of the test statistic. The critical value divides the non-rejection region from the rejection region. Determining the critical value depends on the size of the rejection region. The size of the rejection region is directly related to the risks involved in using only sample evidence to make decisions about a population parameter.
2.7 Errors in hypothesis testing
A Type I error occurs if you reject the null hypothesis H0 when it is true and should not be rejected. A Type I error is a false alarm. The probability of a Type I error occurring is α.
A Type II error occurs if you do not reject the null hypothesis H0 when it is false and should be rejected. A Type II error represents a missed opportunity to take some corrective action. The probability of a Type II error occurring is β.
Whenever we reject a null hypothesis, there is a chance that we have made a mistake, i.e., that we have rejected a true statement. Rejecting a true null hypothesis is referred to as a Type I error, and our probability of making such an error is represented by the Greek letter alpha (α). This probability, which is referred to as the significance level of the test, is of primary concern in hypothesis testing.
On the other hand, we can also make the mistake of failing to reject a false null hypothesis; this is a Type II error. Our probability of making it is represented by the Greek letter beta (β). Naturally, if we either fail to reject a true null hypothesis or reject a false null hypothesis, we have acted correctly. The probability of rejecting a false null hypothesis is called the power of the test. The four possibilities are shown in the table below.
Actual situation
Statistical decision   | H0 true                                  | H0 false
Do not reject H0       | Correct decision, Confidence = (1 − α)   | Type II error, P(Type II error) = β
Reject H0              | Type I error, P(Type I error) = α        | Correct decision, Power = (1 − β)
In hypothesis testing, there is a necessary trade-off between Type I and Type II errors: for a given sample size, reducing the probability of a Type I error increases the probability of a Type II error, and vice versa. The only sure way to avoid accepting false claims is to never accept any claims. Likewise, the only sure way to avoid rejecting true claims is to never reject any claims. Of course, each of these extreme approaches is impractical, and we must usually compromise by accepting a reasonable risk of committing either type of error.
Complements of Type I and Type II errors
The confidence coefficient, 1 − α, is the probability that you will not reject the null hypothesis when it is true and should not be rejected.
The power of a statistical test, 1 − β, is the probability that you will reject the null hypothesis when it is false and should be rejected.
2.8 Test for single mean
In this section, we discuss two tests that are most common in testing a hypothesis built on the population mean μ. The first is the Z test and the second is the t test. We discuss these two tests in detail using appropriate examples. The selection of the test depends on the sample size of the study and on whether the population standard deviation is known.
Assumptions
1. The variable under study is ratio or interval.
2. The population follows a normal distribution.
3. Population variance σ²: known (Z-test), unknown (t-test).
4. Responses are independent within the samples.
2.8.1 Z-test for single mean: σ known case
The procedure for the Z-test is as follows:
1. Null hypothesis: H0 : μ = (≤, ≥) μ0.
2. Alternative hypothesis: H1 : μ ≠ (>, <) μ0.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
Two-tailed test:
Z = |X̄ − μ0| / (σ/√n) ~ N(0, 1)
One-tailed test:
Z = (X̄ − μ0) / (σ/√n) ~ N(0, 1)
5. Comparison and Conclusion.
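As a cross-check on the hand or Excel calculation, the steps above can be sketched in Python. This is a minimal illustration that is not part of the original notes: the function name and the example figures (X̄ = 52, μ0 = 50, σ = 5, n = 25) are ours. The standard normal CDF comes from the standard library's `statistics.NormalDist` (Python 3.8+).

```python
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, two_tailed=True):
    """Z statistic and p-value for H0: mu = mu0 when sigma is known."""
    se = sigma / n ** 0.5            # standard error of the mean, sigma/sqrt(n)
    z = (xbar - mu0) / se            # signed test statistic
    phi = NormalDist().cdf           # standard normal CDF
    if two_tailed:
        p = 2 * (1 - phi(abs(z)))    # area in both tails beyond |z|
    else:
        p = 1 - phi(z)               # right-tailed area; use phi(z) for a left tail
    return z, p

z, p = z_test_mean(xbar=52, mu0=50, sigma=5, n=25)
```

Here z = 2.0 and the two-tailed p-value is roughly 0.046, so at α = 0.05 the null hypothesis would be rejected.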
2.8.2 Testing Using Excel
A1 Null Hypothesis μ = μ0
A2 Level of Significance (α) 0.05
A3 Population Standard Deviation σ
A4 Sample Size n
A5 Sample Mean X̄
A6 Intermediate Calculations
A7 Standard Error of the Mean σ/√n = A3/SQRT(A4)
A8 Z Test Statistic Z = (X̄ − μ0)/(σ/√n) ~ N(0, 1) = (A5 − A1)/A7
A9 Two-Tailed Test, Alternative Hypothesis H1 : μ ≠ μ0
A10 Lower Critical Value =NORM.S.INV(A2/2)
A11 Upper Critical Value =NORM.S.INV(1-A2/2)
A12 p-Value =2*(1-NORM.S.DIST(ABS(A8), TRUE))
A13 Left-Tailed Test, Alternative Hypothesis H1 : μ < μ0
A14 Lower Critical Value =NORM.S.INV(A2)
A15 p-Value =NORM.S.DIST(A8, TRUE)
A16 Right-Tailed Test, Alternative Hypothesis H1 : μ > μ0
A17 Upper Critical Value =NORM.S.INV(1-A2)
A18 p-Value =1-NORM.S.DIST(A8, TRUE)
A19 Conclusion
A20 Reject or Do not reject H0: =IF(A12 < A2, "Reject H0", "Do not reject H0")
2.8.3 t-test for single mean: σ unknown case
The procedure for the t-test is as follows:
1. Null hypothesis: H0 : μ = (≤, ≥) μ0.
2. Alternative hypothesis: H1 : μ ≠ (>, <) μ0.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
Two-tailed test:
t = |X̄ − μ0| / (S/√n) ~ t with (n − 1) d.f.
One-tailed test:
t = (X̄ − μ0) / (S/√n) ~ t with (n − 1) d.f.
where
S = √( Σ(Xi − X̄)² / (n − 1) ), the sum running over the n sample observations.
5. Comparison and Conclusion.
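Since the t distribution's CDF is not in Python's standard library, the sketch below (ours, with made-up sample values) computes only the t statistic and its degrees of freedom; step 5 then compares |t| with a tabulated critical value t(α/2, n − 1), exactly as in the procedure above.

```python
from math import sqrt

def t_stat_mean(sample, mu0):
    """t statistic and d.f. for H0: mu = mu0 when sigma is unknown (S uses n - 1)."""
    n = len(sample)
    xbar = sum(sample) / n
    s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample standard deviation
    return (xbar - mu0) / (s / sqrt(n)), n - 1

t, df = t_stat_mean([48, 50, 52, 54], mu0=50)   # t is about 0.775 with 3 d.f.
```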
2.8.4 Testing Using Excel
A1 Null Hypothesis μ = μ0
A2 Level of Significance (α) 0.05
A3 Sample Standard Deviation S
A4 Sample Size n
A5 Degrees of Freedom (d.f.) n − 1
A6 Sample Mean X̄
A7 Intermediate Calculations
A8 Standard Error of the Mean S/√n = A3/SQRT(A4)
A9 t Test Statistic t = (X̄ − μ0)/(S/√n) ~ t with (n − 1) d.f. = (A6 − A1)/A8
A10 Two-Tailed Test, Alternative Hypothesis H1 : μ ≠ μ0
A11 Lower Critical Value =T.INV(A2/2, A5)
A12 Upper Critical Value =T.INV(1-A2/2, A5)
A13 p-Value =2*(1-T.DIST(ABS(A9), A5, TRUE))
A14 Left-Tailed Test, Alternative Hypothesis H1 : μ < μ0
A15 Lower Critical Value =T.INV(A2, A5)
A16 p-Value =T.DIST(A9, A5, TRUE)
A17 Right-Tailed Test, Alternative Hypothesis H1 : μ > μ0
A18 Upper Critical Value =T.INV(1-A2, A5)
A19 p-Value =1-T.DIST(A9, A5, TRUE)
A20 Conclusion
A21 Reject or Do not reject H0: =IF(A13 < A2, "Reject H0", "Do not reject H0")
2.9 Test for single proportion
In this section, we discuss the procedure used to test the significance of a single proportion.
Assumptions
1. The population follows a normal distribution.
2. The conditions np ≥ 5 and n(1 − p) ≥ 5 are satisfied. These conditions are necessary for the normal approximation to the sampling distribution of the statistic.
Steps in using the test
1. Null hypothesis: H0 : P = (≤, ≥) P0.
2. Alternative hypothesis: H1 : P ≠ (>, <) P0.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
Two-tailed test:
Z = |P̂ − P0| / √(P0(1 − P0)/n) ~ N(0, 1)
One-tailed test:
Z = (P̂ − P0) / √(P0(1 − P0)/n) ~ N(0, 1)
5. Comparison and Conclusion.
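The same steps can be sketched in Python; this is our own illustration (the counts x = 60, n = 100 and P0 = 0.5 are invented), using the standard library's `statistics.NormalDist` for the normal CDF.

```python
from statistics import NormalDist

def z_test_proportion(x, n, p0, two_tailed=True):
    """Z statistic and p-value for H0: P = P0; needs n*p0 >= 5 and n*(1 - p0) >= 5."""
    phat = x / n
    se = (p0 * (1 - p0) / n) ** 0.5          # standard error uses the hypothesised P0
    z = (phat - p0) / se
    phi = NormalDist().cdf
    p = 2 * (1 - phi(abs(z))) if two_tailed else 1 - phi(z)
    return z, p

z, p = z_test_proportion(x=60, n=100, p0=0.5)   # z = 2.0
```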
2.9.1 Testing Using Excel
A1 Null Hypothesis P = P0
A2 Level of Significance (α) 0.05
A3 Number of Items of Interest X
A4 Sample Size n
A5 Intermediate Calculations
A6 Sample Proportion P̂ = X/n = A3/A4
A7 Standard Error √(P0(1 − P0)/n) = SQRT((A1*(1-A1))/A4)
A8 Z Test Statistic Z = (P̂ − P0)/√(P0(1 − P0)/n) ~ N(0, 1) = (A6 − A1)/A7
A9 Two-Tailed Test, Alternative Hypothesis H1 : P ≠ P0
A10 Lower Critical Value =NORM.S.INV(A2/2)
A11 Upper Critical Value =NORM.S.INV(1-A2/2)
A12 p-Value =2*(1-NORM.S.DIST(ABS(A8), TRUE))
A13 Left-Tailed Test, Alternative Hypothesis H1 : P < P0
A14 Lower Critical Value =NORM.S.INV(A2)
A15 p-Value =NORM.S.DIST(A8, TRUE)
A16 Right-Tailed Test, Alternative Hypothesis H1 : P > P0
A17 Upper Critical Value =NORM.S.INV(1-A2)
A18 p-Value =1-NORM.S.DIST(A8, TRUE)
A19 Conclusion
A20 Reject or Do not reject H0: =IF(A12 < A2, "Reject H0", "Do not reject H0")
2.10 Comparison and conclusion
Two-tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
(a) If cal ≤ tab, do not reject the null hypothesis.
(b) If cal > tab, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p > α, do not reject the null hypothesis.
(b) If p ≤ α, reject the null hypothesis.
One-tailed test:
1. Right-tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
i. If cal ≤ tab, do not reject the null hypothesis.
ii. If cal > tab, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p > α, do not reject the null hypothesis.
ii. If p ≤ α, reject the null hypothesis.
2. Left-tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
i. If cal > tab, do not reject the null hypothesis.
ii. If cal ≤ tab, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p > α, do not reject the null hypothesis.
ii. If p ≤ α, reject the null hypothesis.
Chapter 3
Testing of hypothesis-Two sample
problem
3.1 Introduction
In this chapter, we discuss the testing procedures used to test for a significant difference between parameters belonging to two independent populations. We note that there are several cases, which will be discussed in the following sections.
3.2 Assumptions
In this section, we give some important points regarding the testing procedures used in the two-sample problem.
1. The variable under study is ratio or interval.
2. The populations follow normal distributions.
3. The population variances are equal, i.e. σ1² = σ2².
4. The samples are independent.
5. Responses are independent within the samples.
3.3 Test for difference of means: Z-test
1. Null hypothesis: H0 : μ1 = (≤, ≥) μ2.
2. Alternative hypothesis: H1 : μ1 ≠ (>, <) μ2.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
(a) When the assumption of equality of variances is satisfied (σ1² = σ2² = σ²) and σ² is known.
Two-tailed test:
Z = |X̄1 − X̄2| / (σ √(1/n1 + 1/n2)) ~ N(0, 1)
One-tailed test:
Z = (X̄1 − X̄2) / (σ √(1/n1 + 1/n2)) ~ N(0, 1)
(b) When the assumption of equality of variances is not satisfied (σ1² ≠ σ2²) and the variances are known.
Two-tailed test:
Z = |X̄1 − X̄2| / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)
One-tailed test:
Z = (X̄1 − X̄2) / √(σ1²/n1 + σ2²/n2) ~ N(0, 1)
5. Comparison and Conclusion.
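Case (b) above, with known and possibly unequal population standard deviations, can be sketched in Python as a check on the Excel sheets that follow. The function name and the illustrative figures are ours, not from the text.

```python
from statistics import NormalDist

def z_test_two_means(x1, x2, sd1, sd2, n1, n2):
    """Two-tailed Z test for H0: mu1 = mu2 with known, possibly unequal, sigmas."""
    se = (sd1 ** 2 / n1 + sd2 ** 2 / n2) ** 0.5   # SE of the difference of means
    z = (x1 - x2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = z_test_two_means(x1=105, x2=100, sd1=10, sd2=10, n1=64, n2=36)   # z = 2.4
```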
3.3.1 Testing Using Excel: σ1² = σ2² = σ² (known)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Population Standard Deviation σ
A8 Intermediate Calculations
A9 S.E. of Difference of Means σ√(1/n1 + 1/n2) = A7*SQRT((1/A5)+(1/A6))
A10 Z Test Statistic Z = (X̄1 − X̄2)/(σ√(1/n1 + 1/n2)) ~ N(0, 1) = (A3 − A4)/A9
A11 Two-Tailed Test, Alternative Hypothesis H1 : μ1 ≠ μ2
A12 Lower Critical Value =NORM.S.INV(A2/2)
A13 Upper Critical Value =NORM.S.INV(1-A2/2)
A15 p-Value =2*(1-NORM.S.DIST(ABS(A10), TRUE))
A16 Left-Tailed Test, Alternative Hypothesis H1 : μ1 < μ2
A17 Lower Critical Value =NORM.S.INV(A2)
A18 p-Value =NORM.S.DIST(A10, TRUE)
A19 Right-Tailed Test, Alternative Hypothesis H1 : μ1 > μ2
A20 Upper Critical Value =NORM.S.INV(1-A2)
A21 p-Value =1-NORM.S.DIST(A10, TRUE)
A22 Conclusion Reject or Do not reject H0
3.3.2 Testing Using Excel: Unequal Variances (Known)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Population Standard Deviation 1 σ1
A8 Population Standard Deviation 2 σ2
A9 Intermediate Calculations
A10 S.E. of Difference of Means √(σ1²/n1 + σ2²/n2) = SQRT((A7^2/A5)+(A8^2/A6))
A11 Z Test Statistic Z = (X̄1 − X̄2)/√(σ1²/n1 + σ2²/n2) ~ N(0, 1) = (A3 − A4)/A10
A12 Two-Tailed Test, Alternative Hypothesis H1 : μ1 ≠ μ2
A13 Lower Critical Value =NORM.S.INV(A2/2)
A14 Upper Critical Value =NORM.S.INV(1-A2/2)
A15 p-Value =2*(1-NORM.S.DIST(ABS(A11), TRUE))
A16 Left-Tailed Test, Alternative Hypothesis H1 : μ1 < μ2
A17 Lower Critical Value =NORM.S.INV(A2)
A18 p-Value =NORM.S.DIST(A11, TRUE)
A19 Right-Tailed Test, Alternative Hypothesis H1 : μ1 > μ2
A20 Upper Critical Value =NORM.S.INV(1-A2)
A21 p-Value =1-NORM.S.DIST(A11, TRUE)
A22 Conclusion Reject or Do not reject H0
3.4 Test for difference of means: t-test
1. Null hypothesis: H0 : μ1 = (≤, ≥) μ2.
2. Alternative hypothesis: H1 : μ1 ≠ (>, <) μ2.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
(a) When the assumption of equality of variances is satisfied (σ1² = σ2² = σ²) and σ² is unknown.
Two-tailed test:
t = |X̄1 − X̄2| / (S √(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.
One-tailed test:
t = (X̄1 − X̄2) / (S √(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f.
Here S is the pooled standard deviation, S = √(((n1 − 1)S1² + (n2 − 1)S2²)/(n1 + n2 − 2)).
(b) When the assumption of equality of variances is not satisfied (σ1² ≠ σ2²).
Two-tailed test:
t = |X̄1 − X̄2| / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.
One-tailed test:
t = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f.
where
S1² = Σ(Xi − X̄1)² / (n1 − 1), summing over the n1 observations in the first sample, and S2² = Σ(Yi − X̄2)² / (n2 − 1), summing over the n2 observations in the second sample.
5. Comparison and Conclusion.
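Case (a), the pooled t test, can be sketched in Python as follows. This is our own minimal illustration (the summary figures are invented); the t statistic is then compared with a tabulated t critical value with n1 + n2 − 2 degrees of freedom.

```python
from math import sqrt

def pooled_t(x1, x2, s1, s2, n1, n2):
    """Pooled t statistic and d.f. for H0: mu1 = mu2 (equal but unknown variances)."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))                               # SE of the difference
    return (x1 - x2) / se, n1 + n2 - 2

t, df = pooled_t(x1=20, x2=18, s1=4, s2=4, n1=16, n2=16)   # t is about 1.414 with 30 d.f.
```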
3.4.1 Testing Using Excel: σ1² = σ2² = σ² (Unknown)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Sample Standard Deviation 1 S1
A8 Sample Standard Deviation 2 S2
A9 Pooled Estimate S = SQRT(((A5-1)*A7^2 + (A6-1)*A8^2)/(A5+A6-2))
A10 Intermediate Calculations
A11 S.E. of Difference of Means S√(1/n1 + 1/n2) = A9*SQRT((1/A5)+(1/A6))
A12 t Test Statistic t = (X̄1 − X̄2)/(S√(1/n1 + 1/n2)) ~ t with (n1 + n2 − 2) d.f. = (A3 − A4)/A11
A13 Two-Tailed Test, Alternative Hypothesis H1 : μ1 ≠ μ2
A14 Lower Critical Value =T.INV(A2/2, (A5+A6-2))
A15 Upper Critical Value =T.INV(1-A2/2, (A5+A6-2))
A16 p-Value =2*(1-T.DIST(ABS(A12), (A5+A6-2), TRUE))
A17 Left-Tailed Test, Alternative Hypothesis H1 : μ1 < μ2
A18 Lower Critical Value =T.INV(A2, (A5+A6-2))
A19 p-Value =T.DIST(A12, (A5+A6-2), TRUE)
A20 Right-Tailed Test, Alternative Hypothesis H1 : μ1 > μ2
A21 Upper Critical Value =T.INV(1-A2, (A5+A6-2))
A22 p-Value =1-T.DIST(A12, (A5+A6-2), TRUE)
A23 Conclusion Reject or Do not reject H0
3.4.2 Testing Using Excel: Unequal Variances (Unknown)
A1 Null Hypothesis μ1 = μ2
A2 Level of Significance (α) 0.05
A3 Sample Mean 1 X̄1
A4 Sample Mean 2 X̄2
A5 Sample Size 1 n1
A6 Sample Size 2 n2
A7 Sample Standard Deviation 1 S1
A8 Sample Standard Deviation 2 S2
A9 Intermediate Calculations
A10 S.E. of Difference of Means √(S1²/n1 + S2²/n2) = SQRT((A7^2/A5)+(A8^2/A6))
A11 t Test Statistic t = (X̄1 − X̄2)/√(S1²/n1 + S2²/n2) ~ t with (n1 + n2 − 2) d.f. = (A3 − A4)/A10
A12 Two-Tailed Test, Alternative Hypothesis H1 : μ1 ≠ μ2
A13 Lower Critical Value =T.INV(A2/2, (A5+A6-2))
A14 Upper Critical Value =T.INV(1-A2/2, (A5+A6-2))
A15 p-Value =2*(1-T.DIST(ABS(A11), (A5+A6-2), TRUE))
A16 Left-Tailed Test, Alternative Hypothesis H1 : μ1 < μ2
A17 Lower Critical Value =T.INV(A2, (A5+A6-2))
A18 p-Value =T.DIST(A11, (A5+A6-2), TRUE)
A19 Right-Tailed Test, Alternative Hypothesis H1 : μ1 > μ2
A20 Upper Critical Value =T.INV(1-A2, (A5+A6-2))
A21 p-Value =1-T.DIST(A11, (A5+A6-2), TRUE)
A22 Conclusion Reject or Do not reject H0
3.5 Test for difference of two proportions
1. Null hypothesis: H0 : P1 = (≤, ≥) P2.
2. Alternative hypothesis: H1 : P1 ≠ (>, <) P2.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
Two-tailed test:
Z = |P̂1 − P̂2| / √(P1(1 − P1)/n1 + P2(1 − P2)/n2) ~ N(0, 1)
One-tailed test:
Z = (P̂1 − P̂2) / √(P1(1 − P1)/n1 + P2(1 − P2)/n2) ~ N(0, 1)
Test statistic using a pooled estimate:
Two-tailed test:
Z = |P̂1 − P̂2| / √(P̂(1 − P̂)(1/n1 + 1/n2)) ~ N(0, 1)
One-tailed test:
Z = (P̂1 − P̂2) / √(P̂(1 − P̂)(1/n1 + 1/n2)) ~ N(0, 1)
where
P̂ = (n1 P̂1 + n2 P̂2)/(n1 + n2)
5. Comparison and Conclusion.
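The pooled-estimate version of the test can be sketched in Python as follows; the function name and the counts (40 of 100 versus 30 of 100) are our own illustration, not taken from the text.

```python
from statistics import NormalDist

def z_test_two_proportions(x1, n1, x2, n2):
    """Two-tailed Z test for H0: P1 = P2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pbar = (x1 + x2) / (n1 + n2)                        # pooled estimate of P
    se = (pbar * (1 - pbar) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = z_test_two_proportions(x1=40, n1=100, x2=30, n2=100)
```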
3.5.1 Testing Using Excel: Test for Difference of Proportions
A1 Null Hypothesis P1 = P2
A2 Level of Significance (α) 0.05
A3 Sample Size 1 n1
A4 Sample Size 2 n2
A5 Number of Items of Interest 1 X1
A6 Number of Items of Interest 2 X2
A7 Sample Proportion 1 P̂1 = A5/A3
A8 Sample Proportion 2 P̂2 = A6/A4
A9 Pooled Estimate P̂ = (A3*A7 + A4*A8)/(A3+A4)
A10 Intermediate Calculations
A11 S.E. =SQRT(A9*(1-A9)*((1/A3)+(1/A4)))
A12 Z Test Statistic Z = (A7 − A8)/A11
A13 Two-Tailed Test, Alternative Hypothesis H1 : P1 ≠ P2
A14 Lower Critical Value =NORM.S.INV(A2/2)
A15 Upper Critical Value =NORM.S.INV(1-A2/2)
A16 p-Value =2*(1-NORM.S.DIST(ABS(A12), TRUE))
A17 Left-Tailed Test, Alternative Hypothesis H1 : P1 < P2
A18 Lower Critical Value =NORM.S.INV(A2)
A19 p-Value =NORM.S.DIST(A12, TRUE)
A20 Right-Tailed Test, Alternative Hypothesis H1 : P1 > P2
A21 Upper Critical Value =NORM.S.INV(1-A2)
A22 p-Value =1-NORM.S.DIST(A12, TRUE)
A23 Conclusion Reject or Do not reject H0
3.6 Test for dependent samples
In the previous section, we discussed the testing procedure used to test the hypothesis constructed on the difference between two population means when the samples are independent. In this section, we discuss a testing procedure for the case when the samples are dependent. This is the case where the responses are taken from the same set of individuals before and after an experiment. It is also used when the samples are matched samples.
Suppose that, in a marketing study, a researcher wants to know the effect of his company's product on customers. He selects a sample of n customers from a population and records their responses, then introduces the same product with some additions to it. He requests the customers to use the modified product and again collects their responses after a month. Suppose the variable measured is the weight of the customers. In this case the researcher is interested in testing the hypothesis: Is there any significant difference between the average weight of the customers before and after the additions to the product?
1. Null hypothesis: H0 : μ1 = (≤, ≥) μ2, i.e. μD = μ1 − μ2 = (≤, ≥) 0.
2. Alternative hypothesis: H1 : μ1 ≠ (>, <) μ2, i.e. μD = μ1 − μ2 ≠ (>, <) 0.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
Two-tailed test:
t = |d̄| / (Sd/√n) ~ t with (n − 1) d.f.
One-tailed test:
t = d̄ / (Sd/√n) ~ t with (n − 1) d.f.
where
d̄ = Σdi / n, di = Xi − Yi, and Sd² = Σ(di − d̄)² / (n − 1), the sums running over the n pairs.
5. Comparison and Conclusion.
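The paired-difference calculation can be sketched in Python; the before/after values below are our own illustration. The resulting t statistic is compared with a t critical value with n − 1 degrees of freedom.

```python
from math import sqrt

def paired_t(before, after):
    """Paired t statistic and d.f. for H0: mu_D = 0, with d_i = before_i - after_i."""
    d = [b - a for b, a in zip(before, after)]
    n = len(d)
    dbar = sum(d) / n
    sd = sqrt(sum((di - dbar) ** 2 for di in d) / (n - 1))  # s.d. of the differences
    return dbar / (sd / sqrt(n)), n - 1

t, df = paired_t(before=[72, 75, 70, 78], after=[70, 72, 69, 74])
```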
3.6.1 Testing Using Excel
A1 Null Hypothesis μD = μ1 − μ2 = 0
A2 Level of Significance (α) 0.05
A3 Sample Mean of Differences d̄
A4 Sample Size (number of pairs) n
A5 Degrees of Freedom (d.f.) n − 1
A6 Sample Standard Deviation of Differences Sd
A7 Intermediate Calculations
A8 S.E. Sd/√n = A6/SQRT(A4)
A9 t Test Statistic t = d̄/(Sd/√n) ~ t with (n − 1) d.f. = A3/A8
A10 Two-Tailed Test, Alternative Hypothesis H1 : μD ≠ 0
A11 Lower Critical Value =T.INV(A2/2, A5)
A12 Upper Critical Value =T.INV(1-A2/2, A5)
A13 p-Value =2*(1-T.DIST(ABS(A9), A5, TRUE))
A14 Left-Tailed Test, Alternative Hypothesis H1 : μD < 0
A15 Lower Critical Value =T.INV(A2, A5)
A16 p-Value =T.DIST(A9, A5, TRUE)
A17 Right-Tailed Test, Alternative Hypothesis H1 : μD > 0
A18 Upper Critical Value =T.INV(1-A2, A5)
A19 p-Value =1-T.DIST(A9, A5, TRUE)
A20 Conclusion Reject or Do not reject H0
3.7 Test for difference of variances: F-test
This is an important test, used to test the hypothesis H0 : σ1² = σ2².
1. Null hypothesis: H0 : σ1² = (≤, ≥) σ2².
2. Alternative hypothesis: H1 : σ1² ≠ (>, <) σ2².
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
F = S1²/S2² ~ F with (n1 − 1, n2 − 1) d.f.
5. Comparison and Conclusion.
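The F statistic is a simple ratio of the two sample variances; a minimal Python sketch (our own, with illustrative numbers) is:

```python
def f_statistic(s1_sq, s2_sq, n1, n2):
    """F = S1^2 / S2^2 for H0: sigma1^2 = sigma2^2, with (n1 - 1, n2 - 1) d.f."""
    return s1_sq / s2_sq, (n1 - 1, n2 - 1)

f, dfs = f_statistic(s1_sq=25.0, s2_sq=16.0, n1=10, n2=12)   # F = 1.5625, d.f. (9, 11)
```

The computed F is then compared with the tabulated F critical value at the chosen level of significance.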
3.8 Comparison and conclusion
Two-tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
(a) If cal ≤ tab, do not reject the null hypothesis.
(b) If cal > tab, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p > α, do not reject the null hypothesis.
(b) If p ≤ α, reject the null hypothesis.
One-tailed test:
1. Right-tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
i. If cal ≤ tab, do not reject the null hypothesis.
ii. If cal > tab, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p > α, do not reject the null hypothesis.
ii. If p ≤ α, reject the null hypothesis.
2. Left-tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
i. If cal > tab, do not reject the null hypothesis.
ii. If cal ≤ tab, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p > α, do not reject the null hypothesis.
ii. If p ≤ α, reject the null hypothesis.
Chapter 4
Chi-Square tests
4.1 Introduction
The statistical-inference techniques presented so far have dealt exclusively with hypothesis tests and confidence intervals for population parameters, such as population means and population proportions. In this chapter, we consider three widely used inferential procedures that are not concerned with population parameters. These three procedures are often called chi-square procedures because they rely on a distribution called the chi-square distribution.
The distribution is also important in discrete hedging of options in finance, as well as in option pricing. This distribution is used to construct the confidence interval for the population variance σ². Also note that this distribution is derived from the normal distribution: the square of a standard normal variate gives a chi-square random variable with 1 degree of freedom. Similarly, if we square n standard normal random variables and add them, we get a chi-square distribution with n degrees of freedom.
The tests discussed in this chapter have wide applicability when the variable under study is a qualitative variable. These tests are used to test (i) the significance of a population variance, (ii) the goodness of fit of a model for the situation under study, and (iii) the independence of attributes. These tests will be discussed one after the
other along with the prerequisites to use the test.
4.1.1 Chi-square test for significance of a population variance
1. Null hypothesis: H0 : σ² = (≤, ≥) σ0².
2. Alternative hypothesis: H1 : σ² ≠ (>, <) σ0².
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
χ² = (n − 1)S²/σ0² ~ χ² with (n − 1) d.f., where S² = Σ(Xi − X̄)² / (n − 1)
5. Comparison and Conclusion.
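The variance test statistic can be sketched in Python; this is our own illustration (the sample and σ0² = 2 are invented), and the result is compared with a chi-square table value with n − 1 degrees of freedom.

```python
def chi_sq_variance(sample, sigma0_sq):
    """(n - 1) S^2 / sigma0^2 for H0: sigma^2 = sigma0^2, with S^2 using n - 1."""
    n = len(sample)
    xbar = sum(sample) / n
    s_sq = sum((x - xbar) ** 2 for x in sample) / (n - 1)   # sample variance
    return (n - 1) * s_sq / sigma0_sq, n - 1

stat, df = chi_sq_variance([4, 6, 8, 10], sigma0_sq=2.0)   # statistic 10.0 with 3 d.f.
```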
4.1.2 Chi-square test for goodness of fit
Assumptions
1. The sample information is obtained using a random sample drawn from a population in which each individual is classified according to the categorical variable(s) involved in the test.
2. The expected cell frequency for each category must be 5 or more.
Procedure
1. Null hypothesis: The proposed model is a good fit to the situation under study.
2. Alternative hypothesis: The proposed model is not a good fit to the situation under study.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
χ² = Σ (Oi − Ei)²/Ei ~ χ² with (c − 1) d.f.
where c: number of classes (columns), Oi: observed frequencies and Ei: expected frequencies (refer to the class slides).
5. Comparison and Conclusion.
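The goodness-of-fit statistic is a short sum; a minimal Python sketch (ours, with invented counts testing a uniform model over three classes) is:

```python
def chi_sq_goodness_of_fit(observed, expected):
    """Chi-square statistic sum((O - E)^2 / E), compared with (c - 1) d.f."""
    assert all(e >= 5 for e in expected), "each expected count should be >= 5"
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

stat, df = chi_sq_goodness_of_fit(observed=[18, 22, 20], expected=[20, 20, 20])
```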
4.1.3 Chi-square test for independence of attributes
Assumptions
1. The sample information is obtained using a random sample drawn from a population in which each individual is classified according to the categorical variable(s) involved in the test.
2. The expected cell frequency for each category must be 5 or more.
Procedure
1. Null hypothesis: The two attributes under study are independent.
2. Alternative hypothesis: The two attributes under study are not independent.
3. Level of significance: α = 0.05 (0.01, 0.02, 0.10).
4. Test statistic: Under H0,
χ² = Σi Σj (Oij − Eij)²/Eij ~ χ² with (r − 1)(c − 1) d.f.
where r: number of rows, c: number of columns, Oij: observed frequencies and Eij: expected frequencies (refer to the class slides).
5. Comparison and Conclusion.
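The independence test, including the computation of expected frequencies from row and column totals, can be sketched in Python. This is our own illustration on an invented 2 × 2 table.

```python
def chi_sq_independence(table):
    """Chi-square independence statistic for an r x c table, (r-1)(c-1) d.f."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total   # expected frequency E_ij
            stat += (obs - exp) ** 2 / exp
    return stat, (len(table) - 1) * (len(table[0]) - 1)

stat, df = chi_sq_independence([[20, 30], [30, 20]])   # statistic 4.0 with 1 d.f.
```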
4.2 Comparison and conclusion
Two-tailed test:
1. Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
(a) If cal ≤ tab, do not reject the null hypothesis.
(b) If cal > tab, reject the null hypothesis.
2. p-value approach:
Compute the p-value and compare it with the chosen level of significance α.
(a) If p > α, do not reject the null hypothesis.
(b) If p ≤ α, reject the null hypothesis.
One-tailed test:
1. Right-tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
i. If cal ≤ tab, do not reject the null hypothesis.
ii. If cal > tab, reject the null hypothesis.
(b) p-value approach:
Compute the p-value and compare it with α.
i. If p > α, do not reject the null hypothesis.
ii. If p ≤ α, reject the null hypothesis.
2. Left-tailed test:
(a) Critical value approach:
Find the table value at the chosen level of significance α and compare it with the calculated value.
i. If cal > tab, do not reject the null hypothesis.
ii. If cal ≤ tab, reject the null hypothesis.
(b) p-value approach: Compute the p-value and compare it with α.
i. If p > α, do not reject the null hypothesis.
ii. If p ≤ α, reject the null hypothesis.
Chapter 5
Analysis of Variance (ANOVA)
5.1 Introduction
The F test, used to compare two variances, can also be used to compare three or more means. This technique is called analysis of variance, or ANOVA. It is used to test claims involving three or more means. (Note: The F test can also be used to test the equality of two means. But since it is equivalent to the t test in this case, the t test is usually used instead of the F test when there are only two means.) For example, suppose a researcher wishes to see whether the mean times it takes three groups of students to solve a computer problem using Fortran, Basic, and Pascal are different. The researcher will use the ANOVA technique for this test.
For three groups, the F test can only show whether a difference exists among the three means. It cannot reveal where the difference lies, that is, between X̄1 and X̄2, or X̄1 and X̄3, or X̄2 and X̄3. If the F test indicates that there is a difference among the means, other statistical tests are used to find where the difference exists. The most commonly used tests are the Scheffé test and the Tukey test, which are also explained in this chapter.
The analysis of variance that is used to compare three or more means is called a one-way analysis of variance, since it involves only one variable.
5.2 One-way ANOVA
When an F test is used to test a hypothesis concerning the means of three or more
populations, the technique is called analysis of variance (commonly abbreviated as
ANOVA). At first glance, you might think that to compare the means of three or
more samples, you can use the t test, comparing two means at a time. But there are
several reasons why the t test should not be done.
First, when you are comparing two means at a time, the rest of the means under
study are ignored. With the F test, all the means are compared simultaneously.
Second, when you are comparing two means at a time and making all pairwise
comparisons, the probability of rejecting the null hypothesis when it is true is increased,
since the more t tests that are conducted, the greater is the likelihood of getting
significant differences by chance alone. Third, the more means there are to compare,
the more t tests are needed. For example, for the comparison of 3 means two at a
time, 3 t tests are required. For the comparison of 5 means two at a time, 10 tests are
required. And for the comparison of 10 means two at a time, 45 tests are required.
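The counts quoted above are just the number of distinct pairs that can be formed from k means, k(k − 1)/2. A quick check in Python (illustrative, not part of the original text):

```python
# Number of pairwise t tests needed to compare k means two at a time
from math import comb

for k in (3, 5, 10):
    print(k, comb(k, 2))  # 3 -> 3, 5 -> 10, 10 -> 45
```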
5.2.1 Assumptions
1. The populations from which the samples were obtained must be normally or
approximately normally distributed.
2. The samples must be independent of one another.
3. The variances of the populations must be equal.
Even though you are comparing three or more means in this use of the F test, variances
are used in the test instead of means.
With the F test, two different estimates of the population variance are made. The
first estimate is called the between-group variance, and it involves finding the
variance of the means. The second estimate, the within-group variance, is made
by computing the variance using all the data and is not affected by differences in the
means. If there is no difference in the means, the between-group variance estimate will
be approximately equal to the within-group variance estimate, and the F test value
will be approximately equal to 1. The null hypothesis will not be rejected. However,
when the means differ significantly, the between-group variance will be much larger
than the within-group variance; the F test value will be significantly greater than 1;
and the null hypothesis will be rejected. Since variances are compared, this procedure
is called analysis of variance (ANOVA).
For a test of the difference among three or more means, the following hypotheses
should be used:
$$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$$
$$H_1: \text{At least one mean is different from the others.}$$
As stated previously, a significant test value means that there is a high probability
that this difference in means is not due to chance, but it does not indicate where the
difference lies.
The degrees of freedom for this F test are d.f.N. = k − 1, where k is the number
of groups, and d.f.D. = N − k, where N is the sum of the sample sizes of the groups,
N = n_1 + n_2 + ... + n_k. The sample sizes need not be equal. The F test to compare
means is always right-tailed.
5.2.2 Steps for computing the F test value for ANOVA
1. State the hypotheses and identify the claim:
$$H_0: \mu_1 = \mu_2 = \ldots = \mu_k$$
$$H_1: \text{At least one mean is different from the others.}$$
2. Find the critical value.
3. Find the mean and variance of each sample:
$$(\bar{X}_1, s_1^2), (\bar{X}_2, s_2^2), \ldots, (\bar{X}_k, s_k^2).$$
4. Find the grand mean, the mean of all N observations combined:
$$\bar{X}_{GM} = \frac{\sum X}{N}$$
5. Find the between-group variance:
$$s_B^2 = \frac{\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X}_{GM})^2}{k-1}$$
Note: This formula finds the variance among the means by using the sample
sizes as weights and considers the differences in the means.
6. Find the within-group variance:
$$s_W^2 = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{\sum_{i=1}^{k} (n_i - 1)}$$
Note: This formula finds an overall variance by calculating a weighted average
of the individual variances. It does not involve using differences of the means.
7. Find the F test value:
$$F = \frac{s_B^2}{s_W^2}$$
The degrees of freedom are d.f.N. = k − 1, where k is the number of groups,
and d.f.D. = N − k, where N is the sum of the sample sizes of the groups,
N = n_1 + n_2 + ... + n_k.
8. Make the decision by comparing the calculated F value with the F table value.
9. Summarize the results. The ANOVA summary table is shown below.
10. Important note:
The numerator of the fraction obtained in step 5 of the computational procedure
is called the sum of squares between groups, denoted by SS_B. The
numerator of the fraction obtained in step 6 of the computational procedure
is called the sum of squares within groups, denoted by SS_W. This
statistic is also called the sum of squares for the error. SS_B is divided by
k − 1 to obtain the between-group variance. SS_W is divided by N − k to
obtain the within-group or error variance. These two variances are sometimes
called mean squares, denoted by MS_B and MS_W. These terms are used to
summarize the analysis of variance and are placed in a summary table, as shown
below.
Analysis of Variance Summary Table

Source           Sum of squares   d.f.    Mean square              F
Between          SS_B             k - 1   MS_B = SS_B/(k - 1)      MS_B/MS_W
Within (error)   SS_W             N - k   MS_W = SS_W/(N - k)
Total            SS_B + SS_W      N - 1
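Steps 3 through 7 above can be carried out directly. A minimal sketch in Python; the three groups below are made-up data for illustration:

```python
# One-way ANOVA: compute s_B^2, s_W^2 and the F test value
# following steps 3-7 above. The three groups are hypothetical data.

groups = [
    [10, 12, 9, 15, 13],   # group 1
    [14, 18, 12, 16],      # group 2
    [20, 17, 21, 19, 23],  # group 3
]

k = len(groups)
sizes = [len(g) for g in groups]
N = sum(sizes)

means = [sum(g) / len(g) for g in groups]
variances = [sum((x - m) ** 2 for x in g) / (len(g) - 1)
             for g, m in zip(groups, means)]

# Step 4: grand mean of all N observations
grand_mean = sum(x for g in groups for x in g) / N

# Step 5: between-group variance (MS_B)
ss_b = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
ms_b = ss_b / (k - 1)

# Step 6: within-group variance (MS_W)
ss_w = sum((n - 1) * s2 for n, s2 in zip(sizes, variances))
ms_w = ss_w / (N - k)

# Step 7: F test value with d.f.N. = k-1 and d.f.D. = N-k
F = ms_b / ms_w
print(k - 1, N - k)   # degrees of freedom: 2 11
print(round(F, 3))
```

The resulting F is then compared with the F table value for (k − 1, N − k) degrees of freedom, as in step 8.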
5.3 Two-Way Analysis of Variance
The analysis of variance technique shown previously is called a one-way ANOVA
since there is only one independent variable. The two-way ANOVA is an extension
of the one-way analysis of variance; it involves two independent variables. The
independent variables are also called factors.
The two-way analysis of variance is quite complicated, and many aspects of the
subject should be considered when you are using a research design involving a two-way
ANOVA. For the purposes of this textbook, only a brief introduction to the
subject will be given.
In doing a study that involves a two-way analysis of variance, the researcher is able
to test the effects of two independent variables or factors on one dependent variable.
In addition, the interaction effect of the two variables can be tested.
The two-way ANOVA design has several null hypotheses. There is one for each
independent variable and one for the interaction. In the plant food and soil type problem,
the hypotheses are as follows:
1. H_0: There is no interaction effect between factor A and factor B.
H_1: There is an interaction effect between factor A and factor B.
2. H_0: There is no difference in the means of factor A.
H_1: There is a difference in the means of factor A.
3. H_0: There is no difference in the means of factor B.
H_1: There is a difference in the means of factor B.
The first set of hypotheses concerns the interaction effect; the second and third sets
test the effects of the independent variables, which are sometimes called the main
effects.
As with the one-way ANOVA, a between-group variance estimate and a within-group
variance estimate are calculated. An F test is then performed for each of the
independent variables and for the interaction. The results of the two-way
ANOVA are summarized in the table below.
Analysis of Variance Summary Table

Source            Sum of squares   d.f.             Mean square   F
Factor A          SS_A             a - 1            MS_A          F_A
Factor B          SS_B             b - 1            MS_B          F_B
Factor A*B        SS_AB            (a - 1)(b - 1)   MS_AB         F_AB
Within (error)    SS_W             ab(n - 1)        MS_W
Total
where
1. SS_A = sum of squares for factor A.
2. SS_B = sum of squares for factor B.
3. SS_AB = sum of squares for the interaction.
4. SS_W = sum of squares for the error term (within-group).
5. a = number of levels of factor A.
6. b = number of levels of factor B.
7. n = number of subjects in each group.
8. $MS_A = \dfrac{SS_A}{a-1}$.
9. $MS_B = \dfrac{SS_B}{b-1}$.
10. $MS_{AB} = \dfrac{SS_{AB}}{(a-1)(b-1)}$.
11. $MS_W = \dfrac{SS_W}{ab(n-1)}$.
12. $F_A = \dfrac{MS_A}{MS_W}$ with d.f.N. = a − 1, d.f.D. = ab(n − 1).
13. $F_B = \dfrac{MS_B}{MS_W}$ with d.f.N. = b − 1, d.f.D. = ab(n − 1).
14. $F_{AB} = \dfrac{MS_{AB}}{MS_W}$ with d.f.N. = (a − 1)(b − 1), d.f.D. = ab(n − 1).
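For a balanced design, the sums of squares behind the quantities defined above can be computed directly from the cell data. A minimal sketch in Python; the a = 2, b = 2, n = 3 data are made up for illustration:

```python
# Two-way ANOVA for a balanced a x b design with n observations per cell.
# data[i][j] holds the n responses for level i of factor A and level j of B.

data = [
    [[8, 9, 7],  [12, 11, 13]],   # A level 1, B levels 1 and 2
    [[10, 9, 11], [15, 14, 16]],  # A level 2, B levels 1 and 2
]

a, b, n = len(data), len(data[0]), len(data[0][0])
N = a * b * n

grand = sum(x for row in data for cell in row for x in cell) / N
a_means = [sum(x for cell in row for x in cell) / (b * n) for row in data]
b_means = [sum(x for row in data for x in row[j]) / (a * n) for j in range(b)]
cell_means = [[sum(cell) / n for cell in row] for row in data]

# Sums of squares for the main effects, the interaction and the error
ss_a = b * n * sum((m - grand) ** 2 for m in a_means)
ss_b = a * n * sum((m - grand) ** 2 for m in b_means)
ss_ab = n * sum((cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
                for i in range(a) for j in range(b))
ss_w = sum((x - cell_means[i][j]) ** 2
           for i in range(a) for j in range(b) for x in data[i][j])

# Mean squares and the three F test values
ms_a = ss_a / (a - 1)
ms_b = ss_b / (b - 1)
ms_ab = ss_ab / ((a - 1) * (b - 1))
ms_w = ss_w / (a * b * (n - 1))

f_a, f_b, f_ab = ms_a / ms_w, ms_b / ms_w, ms_ab / ms_w
print(round(f_a, 2), round(f_b, 2), round(f_ab, 2))  # 18.75 60.75 0.75
```

Each F value is compared with its own critical value, using the degrees of freedom listed in items 12 to 14 above.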
The assumptions for the two-way analysis of variance are basically the same as those
for the one-way ANOVA, except for sample size.
5.3.1 Assumptions for the Two-Way ANOVA
1. The populations from which the samples were obtained must be normally or
approximately normally distributed.
2. The samples must be independent.
3. The variances of the populations from which the samples were selected must be
equal.
4. The groups must be equal in sample size.
5.4 The Scheffé Test and the Tukey Test
When the null hypothesis is rejected using the F test, the researcher may want to know
where the difference among the means is. Several procedures have been developed
to determine where the significant differences in the means lie after the ANOVA
procedure has been performed. Among the most commonly used tests are the Scheffé
test and the Tukey test.
5.4.1 Scheffé Test
To conduct the Scheffé test, you must compare the means two at a time, using all
possible combinations of means. For example, if there are three means, the following
comparisons must be done:
X̄_1 versus X̄_2, X̄_1 versus X̄_3, X̄_2 versus X̄_3.
Formula for the Scheffé test:
$$F_S = \frac{(\bar{X}_i - \bar{X}_j)^2}{s_W^2\left(\dfrac{1}{n_i} + \dfrac{1}{n_j}\right)}$$
where X̄_i and X̄_j are the means of the samples being compared, n_i and
n_j are the respective sample sizes, and s_W^2 is the within-group variance.
To find the critical value F′ for the Scheffé test, multiply the critical value for the
F test by k − 1:
$$F' = (k-1)(F_{\text{C.V.}})$$
There is a significant difference between the two means being compared when F_S is
greater than F′.
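A minimal sketch of the Scheffé comparisons in Python. The means, sample sizes, within-group variance and F critical value below are illustrative assumptions, not values from the text:

```python
# Scheffe test for all pairwise comparisons after a one-way ANOVA.
# means, sizes and the within-group variance s_W^2 are assumed to come
# from a previously computed ANOVA; the numbers here are made up.

from itertools import combinations

means = [11.8, 15.0, 20.0]   # sample means
sizes = [5, 4, 5]            # sample sizes
ms_w = 5.709                 # within-group variance s_W^2
k = len(means)
f_cv = 3.98                  # assumed F critical value for d.f. (2, 11), alpha = 0.05

f_prime = (k - 1) * f_cv     # Scheffe critical value F'

for i, j in combinations(range(k), 2):
    f_s = (means[i] - means[j]) ** 2 / (ms_w * (1 / sizes[i] + 1 / sizes[j]))
    print(f"X{i+1} vs X{j+1}: F_S = {f_s:.2f}, significant = {f_s > f_prime}")
```

With these numbers, X̄_1 versus X̄_2 gives F_S of about 3.99, below F′ = 7.96, so that pair is not significantly different, while the other two pairs are.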
5.5 Tukey Test
The Tukey test can also be used after the analysis of variance has been completed to
make pairwise comparisons between means when the groups have the same sample
size. The symbol for the test value in the Tukey test is q.
Formula for the Tukey test:
$$q = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{s_W^2 / n}}$$
where X̄_i and X̄_j are the means of the samples being compared, n is the size of each
sample, and s_W^2 is the within-group variance.
When the absolute value of q is greater than the critical value for the Tukey test,
there is a significant difference between the two means being compared.
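The q computation can be sketched in the same way, assuming equal group sizes. The means, within-group variance and studentized-range critical value below are illustrative assumptions:

```python
# Tukey test values q for all pairs of equal-sized groups.
from itertools import combinations
from math import sqrt

means = [9.2, 11.5, 14.8]
n = 6                      # common sample size
ms_w = 4.1                 # within-group variance s_W^2
q_cv = 3.67                # assumed studentized-range critical value

denom = sqrt(ms_w / n)
for i, j in combinations(range(len(means)), 2):
    q = (means[i] - means[j]) / denom
    print(f"X{i+1} vs X{j+1}: q = {q:.2f}, significant = {abs(q) > q_cv}")
```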
You might wonder why there are two different tests that can be used after the
ANOVA. Actually, there are several other tests that can be used in addition to the
Scheffé and Tukey tests. It is up to the researcher to select the most appropriate
test. The Scheffé test is the most general, and it can be used when the samples are of
different sizes. Furthermore, the Scheffé test can be used to make comparisons such
as the average of means 1 and 2 compared with mean 3. However, the Tukey test is
more powerful than the Scheffé test for making pairwise comparisons of the means.
A rule of thumb for pairwise comparisons is to use the Tukey test when the samples
are equal in size and the Scheffé test when the samples differ in size.
Chapter 6
Correlation and Regression
6.1 Testing significance of correlation: ρ = 0
1. Null hypothesis: ρ = 0, the correlation coefficient is not significantly different
from 0.
2. Alternative hypothesis: ρ ≠ 0, the correlation coefficient is significantly different
from 0.
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \text{ d.f.}$$
5. Comparison and conclusion.
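The test statistic above is easy to compute directly. A minimal sketch in Python; r, n and the t critical value are illustrative assumptions:

```python
# t test for H0: rho = 0, using t = r * sqrt(n - 2) / sqrt(1 - r^2).
from math import sqrt

r = 0.62     # sample correlation coefficient (illustrative)
n = 18       # number of paired observations
t_cv = 2.12  # assumed two-tailed t critical value, 16 d.f., alpha = 0.05

t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(round(t, 3), abs(t) > t_cv)
```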
6.2 Testing significance of correlation: ρ = ρ_0
1. Null hypothesis: ρ = ρ_0, the correlation coefficient is not significantly different
from ρ_0.
2. Alternative hypothesis: ρ ≠ ρ_0, the correlation coefficient is significantly different
from ρ_0.
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test statistic (Fisher Z transformation):
$$Z = \frac{\frac{1}{2}\log_e\left(\frac{1+r}{1-r}\right) - \frac{1}{2}\log_e\left(\frac{1+\rho_0}{1-\rho_0}\right)}{\frac{1}{\sqrt{n-3}}} \sim N(0,1)$$
5. Comparison and conclusion.
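The Fisher Z statistic above can be sketched as follows; r, ρ_0 and n are illustrative assumptions:

```python
# Fisher Z test for H0: rho = rho0.
from math import log, sqrt

def fisher_z(r):
    # Fisher transformation: (1/2) ln((1 + r) / (1 - r))
    return 0.5 * log((1 + r) / (1 - r))

r, rho0, n = 0.70, 0.50, 28   # illustrative values
z_cv = 1.96                   # two-tailed N(0,1) critical value, alpha = 0.05

z = (fisher_z(r) - fisher_z(rho0)) / (1 / sqrt(n - 3))
print(round(z, 3), abs(z) > z_cv)
```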
6.3 Testing significance of correlation: ρ_1 = ρ_2
1. Null hypothesis: ρ_1 = ρ_2, there is no significant difference between the two
population correlation coefficients.
2. Alternative hypothesis: ρ_1 ≠ ρ_2, there is a significant difference between the two
population correlation coefficients.
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test statistic (Fisher Z transformation):
$$Z = \frac{\left[\frac{1}{2}\log_e\left(\frac{1+r_1}{1-r_1}\right) - \frac{1}{2}\log_e\left(\frac{1+r_2}{1-r_2}\right)\right] - \left[\frac{1}{2}\log_e\left(\frac{1+\rho_1}{1-\rho_1}\right) - \frac{1}{2}\log_e\left(\frac{1+\rho_2}{1-\rho_2}\right)\right]}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}} \sim N(0,1)$$
Under H_0, the second bracketed term in the numerator is zero.
5. Comparison and conclusion.
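The two-sample version can be sketched the same way; the sample correlations and sizes below are illustrative assumptions:

```python
# Fisher Z test for H0: rho1 = rho2 (two independent samples).
from math import log, sqrt

def fisher_z(r):
    # Fisher transformation: (1/2) ln((1 + r) / (1 - r))
    return 0.5 * log((1 + r) / (1 - r))

r1, n1 = 0.75, 39   # illustrative sample 1
r2, n2 = 0.52, 48   # illustrative sample 2
z_cv = 1.96         # two-tailed N(0,1) critical value, alpha = 0.05

se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z = (fisher_z(r1) - fisher_z(r2)) / se
print(round(z, 3), abs(z) > z_cv)
```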
6.4 Testing significance of the regression model
1. Null hypothesis: the regression model is not significant, i.e., H_0: β_1 = 0.
2. Alternative hypothesis: the regression model is significant, i.e., H_1: β_1 ≠ 0.
3. Level of significance: α = 0.05 (or 0.01, 0.02, 0.10).
4. Test statistic:
$$F = \frac{MSR}{MSE} \sim F_{1,\,n-2} \text{ d.f.}$$
For details refer to the class slides.
5. Comparison and conclusion.
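For simple linear regression, MSR and MSE come from the regression and error sums of squares. A minimal sketch in Python; the (x, y) data are made up and follow a nearly perfect line, so the resulting F is very large:

```python
# F test for the significance of a simple linear regression:
# F = MSR / MSE with (1, n - 2) degrees of freedom.

x = [1, 2, 3, 4, 5, 6, 7, 8]                  # illustrative data
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)

b1 = sxy / sxx           # least-squares slope
b0 = my - b1 * mx        # least-squares intercept

# Regression and error sums of squares
ssr = sum((b0 + b1 * a - my) ** 2 for a in x)
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

msr, mse = ssr / 1, sse / (n - 2)
F = msr / mse
print(round(F, 1))
```

F is then compared with the F table value for (1, n − 2) degrees of freedom.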
References
1. Albright, C., Winston, W.L. and Zappe, C.J. (2011): Data analysis and decision
making. 4th edition. South-Western Cengage Learning.
2. Berenson, M.L., Levine, D.M. and Krehbiel, T.C. (2012): Basic business statistics:
Concepts and applications. 12th edition. Prentice Hall.
3. Bluman, A. (2012): Elementary statistics: A step by step approach. 8th edition.
McGraw-Hill.
4. Ken Black (2010): Business statistics for contemporary decision making. 6th
edition. John Wiley and Sons, Inc.
5. Mann, P.S. (2010): Introductory Statistics. 7th edition. John Wiley and Sons,
Inc.
6. Ross, S.M. (2012): Introductory Statistics. 3rd edition. Elsevier.