
QUANTITATIVE METHODS 1

LECTURE NOTES
BY G.K. ABLEDU

DATA COLLECTION

INTRODUCTION: Collecting, understanding, and analyzing data are central to the


success of many businesses. Obtaining appropriate information is essential to conducting
business.

We may think of data as the information needed to help us make a more informed
decision in a particular situation. A set of measurements obtained on some variable is
called a data set. For example, heart rate measurements for 10 patients may constitute a
data set.

There are six main reasons for data collection:


i. Data are needed to provide the necessary input to a survey.
ii. Data are needed to provide the necessary input to a study.
iii. Data are needed to measure performance of an ongoing service or production
process.
iv. Data are needed to evaluate conformance to standards.
v. Data are needed to assist in formulating alternative courses of action in a
decision-making process.
vi. Data are needed to satisfy our curiosity

SOURCES OF DATA: There are four key data collection sources. Data collectors are
labeled primary sources, while data compilers are called secondary sources.
i. The first method (source) of obtaining data is via governmental, industrial, or
individual(published) sources. Of these three, the government is the major
collector and compiler of data for both public and private purposes. Many
governmental agencies, for example the statistical services, facilitate this
work.
ii. The second data collection source is through experimentation. In an
experiment, strict control is exercised over the treatments given to the

participants. For example, in a study testing the effectiveness of toothpaste,
the researcher would determine which participants in the study would use the
new brand and which would not, instead of leaving the choice to the subject.
iii. The third data collection source is by conducting a survey. Here no control is
exercised over the behaviour of the people being surveyed. They are merely
asked questions about their attitudes, beliefs, behaviour, and other
characteristics. Responses are then edited, coded, and tabulated for analysis.
iv. The fourth method for obtaining data is through observational study. Here,
the researcher observes the behaviour of the subjects being studied, usually in
their natural settings.

ILLUSTRATION 1

DATA SOURCE
  Primary source:    Experiment, Survey, Observation
  Secondary source:  Published

Sources of primary data include observation, group discussion, and the use of
questionnaires. The distinguishing feature of primary data is that it is collected for a
specific project. As a result, primary data can take a long time to collect and can be
expensive. Secondary data, in contrast, has been collected for some other purpose. It is
usually available at low cost but may be inadequate for the purposes of the enquiry.

TYPES OF DATA:
Statisticians develop a survey to deal with a variety of phenomena or characteristics.
These phenomena or characteristics are called random variables. The data, which are the
observed outcomes of these random variables, will undoubtedly differ from response to
response. There are two basic types of data:
i. Qualitative variable: When the characteristic or variable being studied is
non-numeric, it is called a qualitative variable, an attribute, or categorical data.
Examples of qualitative variables are gender, religious affiliation, type of
automobile owned, colour, nationality, etc. When the data are qualitative, we
are usually interested in how many or what proportion fall into each category.
For example, "What percentage of the population is male/female?", "How many
Protestants are in Ghana?", etc. Categorical variables yield responses such as
yes or no answers. Qualitative data are often summarized in charts and bar
graphs.

A set of data is said to be categorical if the values or observations belonging to it can be


sorted according to category. Each value is chosen from a set of non-overlapping
categories. For example, shoes in a cupboard can be sorted according to colour: the
characteristic 'colour' can have non-overlapping categories 'black', 'brown', 'red' and
'other'. People have the characteristic of 'gender' with categories 'male' and 'female'.

Categories should be chosen carefully since a bad choice can prejudice the outcome of an
investigation. Every value should belong to one and only one category, and there should
be no doubt as to which one.

ii. Quantitative variable: When the variables being studied can be reported
numerically(measurable or countable), the variable is called quantitative
variable. Examples of quantitative variables are “the balance in your savings
account”, “the number of children in a family”, “the height of students in a
class”, etc. A quantitative variable can be described by a number for which
arithmetic operations such as average make sense. Quantitative variables are
either Discrete or Continuous.

Discrete variables can assume only certain values, and there are usually "gaps" between
the values. For example, a discrete variable may take the values 1 and 2 but none of the
values in between, such as 1.2 or 1.3. A set of data is said to be discrete if the values / observations belonging to it
are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the
number of kittens in a litter; the number of patients in a doctor's surgery; the number of
flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Typically, discrete variables produce numerical responses that arise from a counting
process. For example the “number of magazines subscribed to”, “number of bedrooms in
a house”, etc,…, can be determined by counting.

Continuous random variables can assume any value within a specific range. They
produce numerical responses that arise from a measurement process. A set of data is said
to be continuous if the values / observations belonging to it may take on any value within
a finite or infinite interval. You can count, order and measure continuous data. Your
height is an example of continuous numerical variable because the response takes on any
value within a continuum or interval depending on the measuring instrument. For
example, your height may be 1.67metres, 1.68metres, 1.69metres, 1.70metres…etc,
depending on the precision of the available instruments.

Other examples include height, weight, temperature, the amount of sugar in an orange,
the time required to run a mile.

ILLUSTRATION:

Data
  Qualitative (attribute / categorical)
  Quantitative (numeric)
    Discrete
    Continuous

DATA TYPE                 QUESTION                                             RESPONSES
Categorical               Do you currently own a life insurance policy?        Yes / No
Numerical (Discrete)      To how many magazines do you currently subscribe?    ............ number
Numerical (Continuous)    How tall are you?                                    ............ metres

LEVELS OF MEASUREMENT
Different statistics require the use of different mathematical operations and, therefore,
level-of-measurement considerations are the logical first guidelines to use in selecting a
statistic. For example, computation of the mean (or arithmetic average) requires that all
observations be added together and then divided by the number of observations. Thus,
computation of the mean is truly justified only when a variable is measured on the
interval-ratio scale. Ideally, the researcher would utilize only those statistics that are fully
justified by the level-of-measurement criterion. The most powerful and useful statistics
(such as the mean) require interval-ratio variables, while most of the variables of interest
to the social sciences are only nominal (sex, race, marital status, etc.) or at best ordinal
(attitude scales).

For example, many statistical techniques require that the scores be added. These
techniques could be legitimately used only when the variable is measured in a way that
permits addition as a mathematical operation. Thus, the researcher’s choice of statistical
techniques is heavily dependent on the level at which the variables have been measured.
The four levels of measurement are, in order of increasing sophistication, nominal,
ordinal, interval and ratio.

1. Nominal level: The most basic and the only universal measurement procedure is to
classify cases into the pre-established categories of a variable. In nominal measurements,
classification into category is the only measurement procedure permitted. The categories
themselves are not numerical and can be compared to others only in terms of the number
of cases classified in them. In no sense can the categories be thought of as “higher” or
“lower’ than each other along some numerical scale.

A set of data is said to be nominal if the values / observations belonging to it can be


assigned a code in the form of a number where the numbers are simply labels. You can
count but not order or measure nominal data.

Some examples are:

i. Sex: male, female


ii. Colour of shoe: black, blue etc.
iii. Ethnic group: Hausa, Yoruba, etc.
iv. Religious affiliation: Christianity, Islam, Traditional, Buddhism, Judaism, etc.
Measurement on this scale involves labeling only. There is no particular order to the
labels. There is no natural order. Each member of a group is labeled according to the
relevant attribute possessed by him/her. The attributes do not have any special order of
magnitude. Numerical labels are sometimes used to identify the categories of a variable
measured at the nominal level. This procedure is especially common when data are being
prepared for computer analysis. For example the variable “sex” might be labeled or coded
as follows:
“Male = 1”
“Female = 2”

You should understand that these numbers are merely labels or names and have no
numerical quantity to them. They cannot be added, subtracted, multiplied or divided. The
only mathematical operation permissible with nominal variable is counting the number of
occurrences that have been classified into the various categories of the variable.
To summarize, the nominal level data has the following properties:
i. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
ii. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.
iii. Data categories have no logical order, meaning, we could report any label
first

EXAMPLES OF NOMINAL SCALING (MEASUREMENT)


CATEGORICAL VARIABLE CATEGORIES
Automobile ownership Yes No
Type of life insurance Term Endowment Straight life Other None
Political party affiliation Democrat Republican Independent None

2. Ordinal level: In addition to classifying cases into categories, variables measured


at the ordinal level allow the categories to be ranked with respect to how much of the
trait being measured they possess. The categories form a kind of numerical scale that
can be ordered from “high” to “low”. Thus, variables measured at the ordinal level
are more sophisticated than nominal-level variables because, in addition to counting
the number of cases in a category, we can rank the cases with respect to each other.
Not only can we say that one case is different from another, we can also say that one
case is higher or lower than, more or less than, another. Some examples are:
i. Grade in an examination: A, B, C, etc or 1, 2, 3, etc.
ii. Size of shoes: 7, 8, 9, 10, 11, 12, etc.
iii. The position of a student in a class test:

A set of data is said to be ordinal if the values / observations belonging to it can be ranked
(put in order) or have a rating scale attached. You can count and order, but not measure,
ordinal data.

The categories for an ordinal set of data have a natural order, for example, suppose a
group of people were asked to taste varieties of biscuit and classify each biscuit on a
rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A
rating of 5 indicates more enjoyment than a rating of 4, for example, so such data are
ordinal.

Consider the grades received by two students in a mathematics examination. Grade 1 is
higher than grade 2, but we cannot say that the student who obtained grade 1 did twice as
well as the one who obtained grade 2. Similarly, one who obtained grade 2 is not
necessarily twice as good as one who obtained grade 4. In other words, equal intervals do
not represent equal quantities.

However, the distinction between neighbouring points on the scale is not necessarily
always the same. For instance, the difference in enjoyment expressed by giving a rating
of 2 rather than 1 might be much less than the difference in enjoyment expressed by
giving a rating of 4 rather than 3.

Consider for example, the variable “socioeconomic status (SES)” which is usually
measured at the ordinal scale in the social sciences. The categories of the variable are
often ordered according to the following scheme:
4 = Upper class
3 = middle class
2 = Working class
1 = Lower class

Individual cases can be compared in terms of the categories into which they are
classified. Thus, an individual classified as 4(upper class) would be ranked higher than an
individual classified as a 2(working class).
To summarize, the ordinal level data has the following properties:
a. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
b. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.

c. Data categories are ranked or ordered according to the particular trait they
possess.
Note: Data obtained from a categorical variable are said to have been measured on a
nominal scale or on an ordinal scale.

EXAMPLES OF ORDINAL SCALING (MEASUREMENT)


CATEGORICAL VARIABLE ORDERED CATEGORIES (Lowest – Highest)
Student class designation First year, Second year, Third year,
Product satisfaction Very unsatisfied, Unsatisfied, Fairly satisfied, Very satisfied
Movie classification Children, Adult, General
Faculty rank Assistant Lecturer, Lecturer, Senior Lecturer, Professor
Student grades F, D, D+, C, C+, B, B+, A, A+

3. Interval –ratio level: The categories of nominal-level variables have no numerical


quality to them. Ordinal –level variables have categories that can be arrayed along a scale
from high to low, but the exact distances between categories are unknown. Variables
measured at the interval-ratio level not only permit classification and ranking but also
allow the distance from category to category (or score to score) to be exactly defined.
Interval ratio variables are measured in units that have equal intervals and a true zero
point.

For example, recording the ages of your respondents is a measurement procedure that
would produce interval-ratio data because the unit of measure (years) has equal intervals
(the distance from year to year is 365 days) and a true zero point. Other examples of
interval-ratio variables would be income, number of children, weight, years of marriage,
etc. All mathematical operations are permitted for data measured at these levels.

The interval level of measurement is the next highest level. It includes all the
characteristics of the ordinal level, but in addition, the difference between values is a
constant size. An interval scale is a scale of measurement where the distance between any
two adjacent units of measurement (or 'intervals') is the same but the zero point is
arbitrary. Scores on an interval scale can be added and subtracted but can not be
meaningfully multiplied or divided. For example, the time interval between the starts of
years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The

zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval
scales include the heights of tides, and the measurement of longitude.

Another example of the interval level of measurement is temperature. Suppose the high
temperatures on three consecutive days in Accra are 28, 31, and 20 degrees Fahrenheit.
These temperatures can be easily ranked, but we can also determine the difference
between the temperatures. This is possible because 1 degree Fahrenheit represents a
constant unit of measurement. Equal differences between two temperatures are the same,
regardless of their position on the scale. That is, the difference between 10 degrees
Fahrenheit and 15 degrees Fahrenheit is 5; the difference between 50 degrees Fahrenheit
and 55 degrees Fahrenheit is also 5 degrees. It is important to note that zero (0) is just a
point on the scale. It does not represent the absence of the condition. Zero (0) degree
Fahrenheit does not represent the absence of heat, just that it is cold. In fact, zero (0)
degree Fahrenheit is about –18 degrees on the Celsius scale. It has no true zero (0).
To summarize, the interval level data has the following properties:
a. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
b. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.
c. Equal differences in the characteristics are represented by equal
differences in the numbers assigned to the categories.
d. The zero point is arbitrary.

The ratio level is the highest level of measurement. The ratio level of measurement has
all the characteristics of the interval level, but in addition, the zero (0) point is meaningful
and ratio between two numbers is meaningful. If you have zero(0) dollars, then you have
no money.
To summarize, the ratio level data has the following properties:
a. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
b. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.

c. Equal differences in the characteristics are represented by equal
differences in the numbers assigned to the categories.
d. The point zero (0) reflects the absence of the characteristics.

EXAMPLES OF INTERVAL – RATIO SCALE (MEASUREMENT)


NUMERICAL VARIABLE LEVEL OF MEASUREMENT
Temperature Interval
Calendar time (Gregorian, Hebrew, or Islamic) Interval
Height (in inches or centimeters) Ratio
Weight (in pounds or kilogram) Ratio
Age (in years or days) Ratio
Salary (in cedis, pounds or dollars) Ratio

QUESTIONS
QUESTION 1:
For each of the following random variables determine:
i. Whether the variable is categorical or numerical. If the variable is numerical,
determine whether the phenomenon of interest is discrete or continuous.
ii. The level of measurement
a. Number of telephones per household
b. Type of telephone primarily used.
c. Number of long distance calls made per month
d. Length (in minutes) of longest distance call made per month
e. Colour of telephone primarily used
f. Monthly charge (in dollars and cents) for long distance calls made
g. Number of local calls made per month
h. Length (in minutes) of longest local call per month
i. Ownership of a cellular phone
j. Whether there is a telephone line connected to a computer modem in the
household
k. Whether there is a FAX machine in the household.
QUESTION 2:
Suppose that the following information is obtained about students from the campus
bookstore during the first week of classes:
a. Amount of money spent on books
b. Number of books purchased
c. Amount of time spent shopping in the bookshop
d. Academic major
e. Gender
f. Ownership
g. Ownership of a videocassette recorder
h. Number of credits registered for in the current semester
i. Whether or not any clothing items were purchased at the bookstore
j. Method of payment
a. Classify each of these variables as categorical or numerical. If the variable is
numerical, determine whether the variable is discrete or continuous.
b. Provide the level of measurement

QUESTION 3:
For each of the following random variables determine:
a. Whether the variable is categorical or numerical. If the variable is numerical,
determine whether the phenomenon of interest is discrete or continuous.
b. The level of measurement
i. Brand of personal computer primarily used
ii. Cost of personal computer system
iii. Amount of time the personal computer is used per week
iv. Primary use of the personal computer
v. Number of persons in the household who use the personal computer
vi. Number of computer magazine subscriptions
vii. Word processing package primarily used
viii. Whether the personal computer is connected to the internet

QUESTION 4:
For each of the following random variables determine:
i. Whether the variable is categorical or numerical. If the variable is numerical,
determine whether the phenomenon of interest is discrete or continuous.
ii. The level of measurement

a. Amount of money spent on clothing in the last month
b. Number of winter coats owned
c. Favourite department store
d. Amount of time spent shopping for clothing in the last month
e. Most likely time period during which shopping for clothing takes place
f. Number of pairs of gloves owned
g. Primary type of transportation used when shopping for clothing

QUESTION 5:
i. . Determine whether each of the following variables is categorical or numerical. If the
variable is numerical, determine whether the phenomenon of interest is discrete or
continuous.
ii. What is the level of measurement for each of the following variables?
a. Student IQ scores.
b. Distance student travel to class
c. Student scores on the first quantitative studies test
d. A classification of students by state of birth.
e. A ranking of students by freshman/women and continuing students..
f. Number of hours students study per week.

QUESTION 6:
Suppose the following information is obtained from Robert Keeler on his application for
a home mortgage loan at the Metro County Savings and Loan Association:
a. Place of residence: Stony Brook, New York
b. Type of residence: Single-family home
c. Date of birth: April 9,1962
d. Monthly payments: $1,427
e. Occupation: Newspaper reporter/author
f. Employer: Daily newspaper
g. Number of years at job: 14
h. Number of jobs in past 10 years: 1
i. Annual family salary income: $66,000
j. Other income: $16,000

k. Marital status: Married
l. Number of children: 3
m. Mortgage requested: $120,000
n. Term of mortgage: 30 years
o. Other loans: Car
p. Amount of other loans: $8,000
Classify each of the responses by type of data and level of measurement

QUESTION 7:

A survey by an electric company contains questions on the following:

i. Age of household head


ii. Sex of household head

iii. Number of people in household

iv. Use of electric heating (yes/no)

v. Number of large appliances used daily

vi. Thermostat setting in winter

vii. Average number of hours heating is on

viii. Average number of heating days

ix. Household income

x. Average monthly electric bill

xi. Ranking of this electric company as compared with two previous electricity
suppliers

a. Describe the variables implicit in these 11 items as quantitative or


qualitative

b. Describe the scale of measurement

c. If a variable is quantitative, state whether it is discrete or continuous.

POPULATION SAMPLE AND SAMPLING TECHNIQUES


Population:

A population is any entire collection of people, animals, plants or things from which we
may collect data. It is the entire group we are interested in, which we wish to describe or
draw conclusions about.

In order to make any generalisations about a population, a sample, that is meant to be


representative of the population, is often studied. For each population there are many
possible samples. A sample statistic gives information about a corresponding population
parameter. For example, the sample mean for a set of data would give information about
the overall population mean.

Target Population

The target population is the entire group a researcher is interested in; the group about
which the researcher wishes to draw conclusions.

Example
Suppose we take a group of men aged 35-40 who have suffered an initial heart attack.
The purpose of this study could be to compare the effectiveness of two drug regimes for
delaying or preventing further attacks. The target population here would be all men
meeting the same general conditions as those actually included in the study.

Sample

A sample is a group of units selected from a larger group (the population). By studying
the sample it is hoped to draw valid conclusions about the larger group.

A sample is generally selected for study because the population is too large to study in its
entirety. The sample should be representative of the general population. This is often best

achieved by random sampling. Also, before collecting the sample, it is important that the
researcher carefully and completely defines the population, including a description of the
members to be included.

Example
The population for a study of infant health might be all children born in Ghana in the
1980's. The sample might be all babies born on 7th May in any of the years.

Parameter

A descriptive measure of a population is called a parameter. A parameter is a value,


usually unknown (and which therefore has to be estimated), used to represent a certain
population characteristic. For example, the population mean is a parameter that is often
used to indicate the average value of a quantity. Within a population, a parameter is a
fixed value which does not vary. Parameters are often assigned Greek letters (e.g. μ and σ).

Statistic

A statistic is a quantity that is calculated from a sample of data. It is a descriptive measure


of a sample. It is used to give information about unknown values in the corresponding
population. For example, the average of the data in a sample is used to give information
about the overall average in the population from which that sample was drawn.

It is possible to draw more than one sample from the same population, and the value of a
statistic will in general vary from sample to sample. For example, the average value in a
sample is a statistic. The average values in more than one sample, drawn from the same
population, will not necessarily be equal. Statistics are often assigned Roman letters such as x̄ and s.

Data Collection
Primary data is collected either through census or by sample selection:
i. Census: This does not require a selection procedure. It involves a complete
enumeration of the identified population
ii. Where the identified population is too large for a cost effective census to be
conducted, a sample of that population must be selected
In many cases, sampling is the only way to determine something about the population.
Some of the major reasons why sampling is necessary are:
i. The cost of studying all the items in a population is often prohibitive.
ii. To contact the whole population is often time consuming.
iii. The destructive nature of certain tests.
iv. The physical impossibility of checking all items in the population.

Sampling Techniques
There are two types of sampling: Probability sampling and non-probability sampling.

Non-probability sampling is one in which the items or individuals included in a sample
are chosen without regard to their probability of occurrence. It includes judgment
sampling and quota sampling. Probability sampling is one in which the subjects are
chosen on the basis of known probabilities. It includes simple random sampling and its
extensions (stratified, systematic and cluster sampling).

A simple random sample is one in which every individual or item from the population
has the same chance of selection as every other individual or item. In addition, every
sample of a fixed size has the same chance of selection as every other sample of that size.
With simple random sampling, we use n to represent the sample size and N to represent
the population size. Samples can be selected with replacement (once a person or item is
selected, it is returned to the population, where it has the same probability of being
selected again) or without replacement (a person or item, once selected, is not returned to
the population and therefore cannot be selected again). By using random sampling, the
likelihood of bias is reduced.

Simple random sampling is the most elementary random sampling technique and as such
forms the basis for the other sampling techniques.
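
As an added illustration (not part of the original notes), the short sketch below draws a simple random sample with and without replacement; the population of 100 numbered files is only an assumed example.

```python
import random

population = list(range(1, 101))   # assumed population: files numbered 1..100 (N = 100)
n = 10                             # desired sample size

random.seed(1)                                       # fixed seed so the illustration is reproducible
without_replacement = random.sample(population, n)   # each item can be chosen at most once
with_replacement = random.choices(population, k=n)   # an item may be chosen more than once

print(without_replacement)
print(with_replacement)
```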

Systematic Sample
In a systematic sample, the N individuals or items in the population frame are
partitioned into k groups by dividing the size of the population frame N by the desired
sample size n, so that k = N / n. The items or individuals of the population are arranged in
some manner. A random starting point is selected, and then every kth member of the
population is selected for the sample. For example, if a sample of 10 is to be selected from
100 files, the first file should be chosen using a simple random process. If the first file is
the 10th file, then every 10th file will be selected for the sample. That is, the sample will
consist of the 10th, 20th, 30th, ..., 100th files.
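
The sketch below (added for illustration, not a prescribed procedure) implements the systematic selection just described: compute k = N / n, pick a random starting point within the first group, and then take every kth item.

```python
import random

def systematic_sample(frame, n):
    """Select every k-th item of 'frame' after a random start, where k = N // n."""
    N = len(frame)
    k = N // n                         # number of items per group
    start = random.randint(0, k - 1)   # random starting point within the first group
    return frame[start::k][:n]

random.seed(2)
files = list(range(1, 101))            # 100 files, as in the example above
print(systematic_sample(files, 10))    # e.g. every 10th file from the random start
```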

Stratified Sample
In a stratified sample, the N individuals or items in the population are first subdivided
into separate subpopulations, or strata, according to some common characteristic. A
simple random sample is then taken within each of the strata, and the results from the
separate simple random samples are combined. Either a proportional or a nonproportional
sample can be selected.

There may often be factors which divide up the population into sub-populations (groups /
strata) and we may expect the measurement of interest to vary among the different sub-
populations. This has to be accounted for when we select a sample from the population in
order that we obtain a sample that is representative of the population. This is achieved by
stratified sampling.

Stratified sampling techniques are generally used when the population is heterogeneous,
or dissimilar, where certain homogeneous, or similar, sub-populations can be isolated
(strata). Simple random sampling is most appropriate when the entire population from
which the sample is taken is homogeneous. Some reasons for using stratified sampling
over simple random sampling are:

a. the cost per observation in the survey may be reduced;


b. estimates of the population parameters may be wanted for each sub-population;

c. increased accuracy at given cost.

A stratified sample is obtained by taking samples from each stratum or sub-group of a


population. When we sample a population with several strata, we generally require that
the proportion of each stratum in the sample should be the same as in the population.
Suppose a farmer wishes to work out the average milk yield of each cow type in his herd
which consists of Ayrshire, Friesian, Galloway and Jersey cows. He could divide up his
herd into the four sub-groups and take samples from these.

A proportional sampling procedure requires that the number of items in each stratum be
in the same proportion as in the population. For example, the problem might be to study
the advertising expenditures of the 352 largest companies in Ghana. Suppose the
objective of the study is to determine whether firms with high returns on equity (a
measure of profitability) spent more of each sales dollar on advertising than firms with
low returns or a deficit. Assume that the 352 firms were divided into five strata. If, say, 50
firms were to be selected for intensive study, the selection would be done as shown below:
Stratum Profitability Number of firms Number sampled Procedure
1 30% and over 8 1 (8/352)*50
2 20% to 30% 35 5 (35/352)*50
3 10% to 20% 189 27 (189/352)*50
4 0% to 10% 115 16 (115/352)*50
5 Deficit 5 1 (5/352)*50
Total 352 50
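
The allocation in the table can be reproduced with the short sketch below (added for illustration); it simply computes (stratum size / 352) × 50 for each stratum and rounds to the nearest whole firm.

```python
strata = {"30% and over": 8, "20% to 30%": 35, "10% to 20%": 189,
          "0% to 10%": 115, "Deficit": 5}    # number of firms in each stratum

N = sum(strata.values())                     # 352 firms in total
n = 50                                       # total sample size

for name, size in strata.items():
    sampled = round(size / N * n)            # proportional allocation
    print(f"{name}: {sampled} firms sampled")
```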

In nonproportional stratified sampling, the number of items chosen in each stratum is
disproportionate to the respective numbers in the population. Regardless of whether a
proportional or a nonproportional sampling procedure is used, every item or person in the
population has a chance of being selected for the sample. Such sampling methods are
more efficient than either simple random sampling or systematic sampling because they
ensure representation of individuals or items across the entire population, which ensures
greater precision in the estimates of the underlying parameters. It is the homogeneity of
individuals or items within each stratum that, when combined across the strata, provides
this precision.
Cluster Sample
In a cluster sample, the N individuals or items in the population are divided into several
clusters so that each cluster is representative of the entire population. A random sample
of clusters is then taken, and all individuals or items in each selected cluster are studied.

Cluster sampling is typically used when the researcher cannot get a complete list of the
members of a population they wish to study but can get a complete list of groups or
'clusters' of the population. It is also used when a random sample would produce a list of
subjects so widely scattered that surveying them would prove to be far too expensive, for
example, people who live in different postal districts in Ghana.

Cluster sampling methods can be more cost effective than simple random sampling
methods, particularly if the underlying population is spread over a wide geographic
region. However, cluster sampling methods tend to be less efficient than either simple
random sampling methods or stratified sampling methods and would require a larger
overall sample size to obtain results as precise as those that would be obtained from the more
efficient procedures.

Example
1. Suppose that the Department of Agriculture wishes to investigate the use of pesticides
by farmers in England. A cluster sample could be taken by identifying the different
counties in England as clusters. A sample of these counties (clusters) would then be
chosen at random, so all farmers in those counties selected would be included in the
sample. It can be seen here then that it is easier to visit several farmers in the same county
than it is to travel to each farm in a random sample to observe the use of pesticides.

2. Suppose you want to determine the views of industrialists in Ghana about the state and
environmental protection policies. Selecting a random sample of industrialists in Ghana
and personally contacting each one would be time consuming and very expensive.
Instead, you could subdivide the country into regions or small units. These are called
primary units. You then select 4 or 5 units. From these units you could take a random
sample of industrialists in each of these units and interview them.
Quota Sampling

Quota sampling is a method of sampling widely used in opinion polling and market
research. Interviewers are each given a quota of subjects of specified type to attempt to

recruit. For example, an interviewer might be told to go out and select 20 adult men and 20
adult women, 10 teenage girls and 10 teenage boys, so that they could be interviewed
about their television viewing.

It suffers from a number of methodological flaws, the most basic of which is that the
sample is not a random sample and therefore the sampling distributions of any statistics
are unknown.

TUTORIAL QUESTIONS
1. Define and give example of each of the statistical terms:
i. Population
ii. Sample
iii. Parameter
iv. Statistic
2. A politician who is running for the office of the mayor of a city with 25,000 registered
voters commissions a survey. In the survey, 48% of the 200 voters interviewed say they
planned to vote for her.
i. What is the population of interest?
ii. What is the sample?
iii. Is the value 48% a parameter or statistic?
3. A manufacturer of computer chips claims that less than 10% of his products are
defective. When 1000 chips were drawn from a large production run, 7.5% were found to
be defective
a. What is the population of interest?
b. What is the sample?
c. What is the parameter?
d. What is the statistic?
4. The owner of a large fleet of taxis is trying to estimate his costs for next year's
operations. One major cost is fuel. To estimate fuel purchases, the owner needs to know
the total distance his taxis will travel next year, the cost of a gallon of fuel, and the fuel
mileage of his taxis. The owner has been provided with the first two figures (the distance
estimate and the cost of fuel). However, because of the high cost of gasoline, the owner has
recently converted his taxis to operate on propane, and he has measured the propane
mileage (in miles per gallon) for 50 taxis.
a. What is the population of interest?
b. What is the sample?
c. What is the parameter the owner needs?
d. What is the statistic?

SUMMARISING DATA

Frequency Table

A frequency table is a way of summarising a set of data. It is a record of how often each
value (or set of values) of the variable in question occurs. It may be enhanced by the
addition of percentages that fall into each category.

A frequency table is used to summarise categorical, nominal, and ordinal data. It may
also be used to summarise continuous data once the data set has been divided up into
sensible groups.

When we have more than one categorical variable in our data set, a frequency table is
sometimes called a contingency table because the figures found in the rows are
contingent upon (dependent upon) those found in the columns.

Example
Suppose that in thirty shots at a target, a marksman makes the following scores:
52234 43203 03215
13155 24004 54455
The frequencies of the different scores can be summarised as:
Score    Frequency    Frequency (%)
0        4            13%
1        3            10%
2        5            17%
3        5            17%
4        6            20%
5        7            23%
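
For readers who wish to check the tally, the sketch below (added for illustration) counts the thirty scores with Python's collections.Counter and reproduces the frequencies and percentages in the table.

```python
from collections import Counter

scores = [5, 2, 2, 3, 4,  4, 3, 2, 0, 3,  0, 3, 2, 1, 5,
          1, 3, 1, 5, 5,  2, 4, 0, 0, 4,  5, 4, 4, 5, 5]   # the thirty shots listed above

counts = Counter(scores)
total = len(scores)
for score in sorted(counts):
    f = counts[score]
    print(f"Score {score}: frequency {f} ({100 * f / total:.0f}%)")
```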

Pie Chart

A pie chart is a way of summarising a set of categorical data. It is a circle which is


divided into segments. Each segment represents a particular category. The area of each
segment is proportional to the number of cases in that category.

Example
Suppose that, in the last year, a sportswear manufacturer has spent 6 million pounds on
advertising its products: 3 million has been spent on television adverts, 2 million on
sponsorship, 1 million on newspaper adverts, and half a million on posters. This spending
can be summarised using a pie chart:
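
Since the chart itself is not reproduced in these notes, the sketch below (added for illustration) shows how the pie segments would be worked out: each category's share of the listed amounts, converted to an angle out of 360 degrees.

```python
spend = {"Television": 3.0, "Sponsorship": 2.0,
         "Newspapers": 1.0, "Posters": 0.5}    # amounts in millions of pounds, as listed above

total = sum(spend.values())
for category, amount in spend.items():
    share = amount / total
    print(f"{category}: {share:.1%} of spend, segment angle {share * 360:.0f} degrees")
```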

Bar Chart

A bar chart is a way of summarising a set of categorical data. It is often used in


exploratory data analysis to illustrate the major features of the distribution of the data in a
convenient form. It displays the data using a number of rectangles, of the same width,
each of which represents a particular category. The length (and hence area) of each
rectangle is proportional to the number of cases in the category it represents, for example,
age group, religious affiliation. Bar charts are used to summarise nominal or ordinal data.

Bar charts can be displayed horizontally or vertically and they are usually drawn with a
gap between the bars (rectangles), whereas the bars of a histogram are drawn
immediately next to each other.

Histogram

A histogram is a way of summarising data that are measured on an interval scale (either
discrete or continuous). It is often used in exploratory data analysis to illustrate the major
features of the distribution of the data in a convenient form. It divides up the range of

possible values in a data set into classes or groups. For each group, a rectangle is
constructed with a base length equal to the range of values in that specific group, and an
area proportional to the number of observations falling into that group. This means that
the rectangles might be drawn of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured
on an interval scale. It is generally used when dealing with large data sets (>100
observations)

MEASURES OF CENTRAL TENDENCY

The central tendency of a set of measurements is the tendency of the data to cluster about
certain numerical values, e.g. the mean, the mode, etc.

Mean

The sample mean is an estimator available for estimating the population mean μ. It is a
measure of location, commonly called the average, often symbolised x̄.

Its value depends equally on all of the data which may include outliers. It may not appear
representative of the central region for skewed data sets.

It is especially useful as being representative of the whole sample for use in subsequent
calculations.

The sample mean is calculated by taking the sum of all the data values and dividing by
the total number of data values. For n observations x1, x2, x3, ..., xn the sample mean is

x̄ = (x1 + x2 + x3 + ... + xn) / n = (∑x) / n

Example 1
Let's say our data set is: 5 3 54 93 83 22 17 19.
Then x̄ = (5 + 3 + 54 + 93 + 83 + 22 + 17 + 19) / 8 = 296 / 8 = 37.

For a frequency table, the mean is given as:

x̄ = ∑fx / ∑f
Example 2

x f fx
1 2 2
2 3 6
3 5 15
4 3 12
5 2 10
∑f = 15 ∑fx = 45

Mean = x̄ = ∑fx / ∑f = 45 / 15 = 3

Example 3

Mark f Midpoint(x) fx
1-5 2 3 6
6-10 2 8 16
11-15 3 13 39
16-20 2 18 36
21-25 1 23 23
∑f = 10 ∑fx = 120

Mean = x̄ = ∑fx / ∑f = 120 / 10 = 12
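
The grouped calculation above can be checked with the short sketch below (added for illustration), which multiplies each class midpoint by its class frequency.

```python
classes = [((1, 5), 2), ((6, 10), 2), ((11, 15), 3), ((16, 20), 2), ((21, 25), 1)]

sum_fx = 0
sum_f = 0
for (low, high), f in classes:
    midpoint = (low + high) / 2    # e.g. (1 + 5) / 2 = 3
    sum_fx += f * midpoint
    sum_f += f

print(sum_fx / sum_f)              # 120 / 10 = 12.0
```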

Median

The median is the value halfway through the ordered data set, below and above which
there lies an equal number of data values.

It is generally a good descriptive measure of the location which works well for skewed
data, or data with outliers. The median is the 0.5 quantile.

Example 1
With an odd number of data values, for example 21, we have:

Data 96 48 27 72 39 70 7 68 99 36 95 4 6 13 34 74 65 42 28 54 69
Ordered Data 4 6 7 13 27 28 34 36 39 42 48 54 65 68 69 70 72 74 95 96 99
Median 48, leaving ten values below and ten values above

With an even number of data values, for example 20, we have:


Data 57 55 85 24 33 49 94 2 8 51 71 30 91 6 47 50 65 43 41 7
Ordered
2 6 7 8 24 30 33 41 43 47 49 50 51 55 57 65 71 85 91 94
Data
Median Halfway between the two 'middle' data points - in this case halfway
between 47 and 49, and so the median is 48
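
As an added illustration, the function below finds the median of a list for both the odd and the even case described above.

```python
def median(values):
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                       # odd n: the single middle value
    return (data[mid - 1] + data[mid]) / 2     # even n: halfway between the two middle values

odd_data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
            4, 6, 13, 34, 74, 65, 42, 28, 54, 69]
even_data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51,
             71, 30, 91, 6, 47, 50, 65, 43, 41, 7]
print(median(odd_data))    # 48
print(median(even_data))   # 48.0
```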

Example 2

x f cf
1 2 2
2 3 5
3 5 10
4 3 13
5 2 15

Median position = (n + 1) / 2 = (15 + 1) / 2 = 8th position

From the table, the 8th observation falls in the class whose cumulative frequency first reaches or exceeds 8 (cf = 10), i.e. x = 3. Therefore, the median value is 3.

Example 3

Mark f cf
1-5 2 2
6-10 2 4
11-15 3 7
16-20 2 9
21-25 1 10

Median position = n / 2 = 10 / 2 = 5, i.e. between the 5th and 6th positions
From the table, the 5th and 6th observations fall in the class whose cumulative frequency first reaches these positions (cf = 7). Therefore, the median class is 11 - 15. The median value itself can only be estimated using a formula or a graph.
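
The formula referred to above is the usual linear-interpolation estimate for the median of grouped data. The sketch below (added for illustration, and assuming class boundaries of 10.5 and 15.5 for the 11-15 class) applies it to this table.

```python
# Grouped median estimate: L + ((n/2 - CF) / f) * h, where
#   L  = lower boundary of the median class,
#   CF = cumulative frequency of the classes before the median class,
#   f  = frequency of the median class, h = class width.
L, CF, f, h, n = 10.5, 4, 3, 5, 10
median_estimate = L + ((n / 2 - CF) / f) * h
print(median_estimate)    # 10.5 + (1 / 3) * 5 = 12.17 (approximately)
```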

Mode

The mode is the most frequently occurring value in a set of discrete data. There can be
more than one mode if two or more values are equally common.

Example 1
Suppose the results of an end of term Statistics exam were distributed as follows:
Student    Score
1 94
2 81
3 56
4 90
5 70
6 65
7 90
8 90
9 30
Then the mode (most common score) is 90, and the median (middle score) is 81.

Example 2
x f
1 2
2 3
3 5
4 3
5 2

The highest frequency is 5, and the value that has this frequency is x = 3. Therefore, the mode is 3.

Example 3

Mark f
1-5 2
6-10 2
11-15 3
16-20 2
21-25 1

The highest frequency is 3, and the class that has this is 11-15. Therefore, the modal class
is 11-15. The modal value itself can only be estimated using a formula or a graph.
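
One commonly used interpolation formula for the mode of grouped data (added here for illustration, not taken from these notes) is L + d1 / (d1 + d2) × h, where d1 and d2 are the differences between the modal-class frequency and the frequencies of the neighbouring classes.

```python
# Grouped mode estimate for the table above (modal class 11-15):
L = 10.5       # assumed lower boundary of the modal class
d1 = 3 - 2     # modal frequency minus the frequency of the class before
d2 = 3 - 2     # modal frequency minus the frequency of the class after
h = 5          # class width
mode_estimate = L + d1 / (d1 + d2) * h
print(mode_estimate)    # 10.5 + 0.5 * 5 = 13.0
```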

Dispersion

The data values in a sample are not all the same. This variation between values is called
dispersion.

When the dispersion is large, the values are widely scattered; when it is small they are
tightly clustered. There are several measures of dispersion, the most common being the
standard deviation. These measures indicate to what degree the individual observations of
a data set are dispersed or 'spread out' around their mean.

In manufacturing or measurement, high precision is associated with low dispersion.

Range

The range of a sample (or a data set) is a measure of the spread or the dispersion of the
observations. It is the difference between the largest and the smallest observed value of
some quantitative characteristic and is very easy to calculate.

A great deal of information is ignored when computing the range since only the largest
and the smallest data values are considered; the remaining data are ignored.

The range value of a data set is greatly influenced by the presence of just one unusually
large or small value in the sample (outlier).

Examples

1. The range of 65,73,89,56,73,52,47 is 89-47 = 42.


2. If the highest score in a 1st year statistics exam was 98 and the lowest 48, then the
range would be 98-48 = 50.

Inter-Quartile Range (IQR)

The inter-quartile range is a measure of the spread of or dispersion within a data set.

It is calculated by taking the difference between the upper and the lower quartiles. For
example:

Data             2 3 4 5 6 6 6 7 7 8 9
Upper quartile   7
Lower quartile   4
IQR              7 - 4 = 3

The IQR is the width of an interval which contains the middle 50% of the sample, so it is
smaller than the range and its value is less affected by outliers.
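
Quartile conventions differ slightly between textbooks and software. The sketch below (added for illustration) uses Python's statistics.quantiles with its default 'exclusive' method, which happens to reproduce the quartiles quoted for this small data set.

```python
import statistics

data = [2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 9]

q1, q2, q3 = statistics.quantiles(data, n=4)   # default method='exclusive'
print(q1, q2, q3)        # 4.0 6.0 7.0
print("IQR =", q3 - q1)  # 3.0
```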

Quantile

Quantiles are a set of 'cut points' that divide a sample of data into groups containing (as
far as possible) equal numbers of observations.

Examples of quantiles include quartile, quintile, percentile.

Percentile

Percentiles are values that divide a sample of data into one hundred groups containing (as
far as possible) equal numbers of observations. For example, 30% of the data values lie
below the 30th percentile.

Quartile

Quartiles are values that divide a sample of data into four groups containing (as far as
possible) equal numbers of observations.

A data set has three quartiles. References to quartiles often relate to just the outer two, the
upper and the lower quartiles; the second quartile being equal to the median. The lower
quartile is the data value a quarter way up through the ordered data set; the upper quartile
is the data value a quarter way down through the ordered data set.

Example
Data 6 47 49 15 43 41 7 39 43 41 36
Ordered Data 6 7 15 36 39 41 41 43 43 47 49
Median(Q2) 41
Upper quartile 43
Lower quartile 15


Quintile

Quintiles are values that divide a sample of data into five groups containing (as far as
possible) equal numbers of observations.

Sample Variance

Sample variance is a measure of the spread of or dispersion within a set of sample data.

The sample variance is the sum of the squared deviations from their average divided by
one less than the number of observations in the data set. For example, for n observations
x1, x2, x3, ..., xn with sample mean x̄, the sample variance is given by

s² = (1 / (n - 1)) ∑ (xi - x̄)²

For data summarised in a frequency table, with n = ∑f, this becomes

s² = (1 / (n - 1)) ∑ f(xi - x̄)²
Standard Deviation

Standard deviation is a measure of the spread or dispersion of a set of data.

It is calculated by taking the square root of the variance and is symbolised by s.d. or s. In
other words, s = √(s²).

The more widely the values are spread out, the larger the standard deviation. For
example, say we have two separate lists of exam results from a class of 30 students; one
ranges from 31% to 98%, the other from 82% to 93%, then the standard deviation would
be larger for the results of the first exam.

Example

x       f        x - x̄ = d     d²      fd²
1       2        -2            4       8
2       3        -1            1       3
3       5        0             0       0
4       3        1             1       3
5       2        2             4       8
        ∑f = 15                        ∑fd² = 22

s² = (1 / (n - 1)) ∑ f(xi - x̄)² = (1 / 14) × 22 ≈ 1.57

Standard deviation = √1.57 ≈ 1.25
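
These figures can be checked with the sketch below (added for illustration), which expands the frequency table into its fifteen raw observations and uses Python's statistics module (sample variance and standard deviation use the n - 1 divisor).

```python
import statistics

freq = {1: 2, 2: 3, 3: 5, 4: 3, 5: 2}
data = [x for x, f in freq.items() for _ in range(f)]   # the 15 underlying observations

print(statistics.mean(data))       # 3
print(statistics.variance(data))   # about 1.571 (= 22 / 14)
print(statistics.stdev(data))      # about 1.25
```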

EXERCISES
QUESTION 1
The data below show prices of vehicles sold in 1999 at DH Ltd (prices are in dollars).
20,197 20,372 17,454 20,591 23,651 24,453 14,266 15,021 25,683 27,872
16,587 20,169 32,851 16,251 17,047 21,285 21,324 21,609 25,670 12,546*
12,935 16,873 22,251 22,277 25,034 21,533 24,443 16,889 17,004 14,357
17,155 16,688 20,657 23,613 17,895 17,203 20,765 22,783 23,661 29,277
17,642 18,981 21,052 22,799 12,794 15,263 32,925# 14,399 14,968 17,356
18,442 18,722 16,331 19,817 16,766 17,633 17,962 19,845 23,285 24,896
26,076 29,492 15,890 18,740 19,374 21,571 22,449 25,337 17,642 20,613
21,220 27,655 19,442 14,891 17,818 23,237 17,445 18,556 18,639 21,296
Tally the data into a frequency distribution. Draw an appropriate chart for the distribution.
Find the mean, modal and median prices.
QUESTION 2:
A recent survey showed that the typical American car owner spends $2,950 per year on
operating expenses. Below is a breakdown of the various items.
EXPENDITURE ITEM AMOUNT($)
Fuel 603
Interest on car loan 279
Repairs 930
Insurance and license 646
Depreciation 492
Total 2,950
Draw an appropriate chart to portray the data and summaries your finding in a brief
report.

QUESTION 3:
The Midland National Bank selected a sample of 40 students' accounts. Below are their
end-of-month balances in thousands of dollars.
404 74 234 149 279 215 123 55 43 321
87 234 68 489 57 185 141 758 72 863
703 125 350 440 37 252 27 521 302 127
968 712 503 489 327 608 358 425 303 203

a. Tally the data into a frequency distribution using $100 as class interval and
$0 as the starting point
b. Draw a less than cumulative polygon
c. The bank considers any student with ending balance of $400 or more a
“preferred customers”. Estimate the percentage of preferred customers.
d. The bank is also considering a service charge to the lowest 10 percent of
the ending balances. What would you recommend as the cutoff point
between those who have to pay a service charge and those who do not?

QUESTION 4:
Annual revenues by type of tax for the state are as follows:
TYPE OF TAX AMOUNT ($)
Sales 2,812,473
Income(individual) 2,732,045
License 185,198
Corporate 525,015
Property 22,647
Gift 37,326
Total 6,314,704
Develop an appropriate chart or graph and write a brief report to summarize the
information.

QUESTION 5:
Annual imports from selected Canadian trading partners are listed below.
PARTNER ANNUAL IMPORT ($MILLIONS)
Japan 9,550
U. Kingdom 4,556
S. Korea 2,441
China 1,182
Australia 618
Develop an appropriate chart or graph and write a brief report to summarize the
information.
QUESTION 6:
A breakfast cereal is supposed to include 200 raisins in each box. A sample of 60 boxes
produced showed the following number of raisins in each box.
200 193 198 203 196 202 200 196 203 201 205 201
193 203 198 201 200 202 204 195 198 202 201 204
204 200 206 198 202 206 197 199 200 205 191 206
202 199 207 205 202 199 200 200 205 196 200 197
200 199 206 206 204 195 197 200 195 197 200 198
i. Organize the data into a frequency distribution
ii. Draw an appropriate chart for the distribution
QUESTION 7:
The Marketing Research Department is investigating the performance of several
corporations in the coal, mining and gas industries. The fourth-quarter sales in 1997(in
millions of dollars) for these corporations are:
CORPORATION FOURTH-QUARTER SALES ($MILLIONS)
American Hess 1,645.2
Atlantic Richfield 4,757.0
Chevron 8,913.0
Diamond Shamrock 627.1
Exxon 24,612.0
Quaker State 191.9
The Department wants to include a chart in their report comparing the fourth-quarter
sales of six corporations. Draw an appropriate chart to compare the fourth-quarter sales of
these corporations. And write a brief report summarizing the chart.
QUESTION 8:
The distance (in thousand kilometers) covered by the sales agents of Marketing Company
during the year 1998 and the years 2000 were as follows.
Distance Covered    1998                   2000
(000 Km)            No. of Sales Agents    No. of Sales Agents
Under 15 9 3
15-25 15 7
25-40 18 12
40-60 20 20
60-80 12 14
80-100 8 10
100 and over 0 4
(a) Compute the mean distance covered by a sales agent in the year (i) 1998 (ii) 2000
(b) Compute the standard deviation, to the nearest whole number, of the distance covered
by a sales agent in the year (i) 1998 (ii) 2000
QUESTION 9:
The data below gives the wages of 50 operatives in a construction firm. The wages are
measured in thousands of Cedis.
167 168 177 171 170 166 168 173 166 170
173 167 169 170 172 172 169 170 171 175
171 172 171 164 167 172 168 176 167 170
174 169 174 165 170 166 169 170 167 172
161 171 166 170 171 169 168 172 170 173
(a) Using the classes 160-162, 163-165, etc., construct a cumulative frequency table for the data
(b) Draw a cumulative frequency curve for the data
(c) Use your cumulative frequency curve in (b) to estimate
(i) the median wage (ii) the semi interquartile range
(d) Determine the proportion of the operatives in the sample whose wages are
(i) between ¢170,000 and ¢175,000 (ii) less than or equal to the median wage
QUESTION 10:
A market research analyst conducted a household survey in a suburb of Accra. One of
the questions asked was “how many rooms do you have in your house?” The data below
gives the results.
8 4 3 5 5 2 7 4 7 7 7 7
6 6 8 6 3 8 6 5 7 9 5 8
8 5 4 7 7 5 4 7 6 6 7 7
8 6 6 3 6 9 7 7 4 8 8 6
6 7 2 6 7 6 5 4 5 4 7 5
(a) Construct a frequency table for the data

(b) Find (i) The mean (ii) the variance (iii) the median of the distribution
(c) If a household is selected at random from the sample, determine the probability that it
has at least 6 rooms.
QUESTION 11:
The following data gives the age distribution of a random sample of 50 workers in the
catering industry in a large city.
32 32 35 36 38 31 37 37 31 34
37 35 34 32 38 32 33 42 36 35
36 37 36 29 32 39 34 39 30 35
33 34 33 34 33 26 36 31 35 36
35 41 35 37 31 33 35 37 40 38
(a) Using the class limits 25-27, 28-30, etc, construct a frequency table for the data. Draw
a histogram for the grouped data.
(b) Determine the mean, mode and median from the grouped data
(c) Calculate the standard deviation from the grouped data
QUESTION 12:
The table below shows the distribution of monthly salary in thousand cedis, of 200
employees of KADO Industries:
Monthly Salary Number of Employees
150-200 8
200-250 30
250-300 41
300-350 52
350-400 33
400-450 24
450-500 12
(a) Draw a cumulative frequency curve of the distribution
(b) If the starting monthly salary of a manager in the firm is ¢425,000 estimate from
graph the number of workers who are managers.
(c) Use your graph to estimate the median and the quartiles of the distribution.
(d) Comment on the shape of the distribution in the light of your values.
QUESTION 13:
Loans granted by a branch of a corporate bank in 1994 are distributed as follows:
Value of Loan (¢million) Number of Loans
21-45 10
46-70 25

71-95 55
96-120 60
121-145 35
146-170 10
171-195 5
(a) Draw a cumulative frequency curve for the distribution
(b) If loans of less than ¢75m are referred to the headquarters of the bank for approval,
estimate from your graph the number of loans that were referred to the headquarters for
approval
(c) Estimate from your graph the median value of the loans. Explain what this value
means.
(d) Compute the mean value of the loans.
(e) Comment on the shape of the distribution in the light of your values.

QUESTION 14:
The table below shows the distribution of number of days elapsing between date of
purchases and date of return of items returned to a department store during the current
fiscal year.
Number of days Number of items
Less than 5 8
5-9 22
10-14 38
15-19 16
20-24 15
25-29 8
30 or more 3
Total 110
(a) How is the information in the frequency distribution represented in a histogram?
(b) (i) Estimate the mean, mode, median and the semi-interquartile range for the data
(ii) Comment on the shape of the distribution.
QUESTION 15:
The administrator of a teaching Hospital has been collecting data regarding the number of
patients treated in the emergency ward on week-ends. The frequency distribution for the
numbers treated over a 20-week period is shown below.
No. of patients Treated No. of Weeks
25-34 4
35-44 5
45-54 7

55-64 2
65-74 1
75-84 1

(a) Compute (i) the mean, mode and median (ii) the semi-interquartile range for the distribution

(b) Draw an appropriate diagram using the data


QUESTION 16
Students in the Department of Accountancy are asked to fill out a course evaluation
questionnaire upon completion of their courses. It consists of a variety of questions that
have a five-category response scale. One of the questions follows:
“ Compared to other courses that you have taken, what is the overall quality of the course
you are now completing?”
Poor Fair Good Very Good Excellent
A sample of 60 students completing the course provided the following responses. To aid
in computer processing of the questionnaire results, a numeric scale was used with
1=Poor, 2=Fair, 3=Good, 4=Very Good, and 5=Excellent
3 4 4 5 1 5 3 4 5 2 4 5 3 4 4
4 5 5 4 1 4 5 4 2 5 4 2 4 4 4
5 5 3 4 5 5 2 4 3 4 5 4 3 5 4
4 3 5 4 5 4 3 5 3 4 4 3 5 3 3

a. State whether or not these are qualitative data. Give reason.


b. Construct a frequency distribution
c. Draw an appropriate graph for the distribution table.
d. On the basis of your summaries, comment on the students’ overall
evaluation of the course.
QUESTION 17
The following data represent daily water consumption per household in the Koforidua
Municipality.
Use                      Gallons per day
Bathing and showering
Dish washing             100
Drinking and cooking     20
Laundering               15
Lawn watering            40
Toilet                   200
Miscellaneous            15

(b) Draw an appropriate diagram using the data

PROBABILITY
Background to Probability

Historically, probability originated from the study of games of chance, and early
applications of the theory of probability were in such games. In the middle of the 17th
century, a French courtier, the Chevalier de Méré, wanted to know how to adjust the stakes
in gambling so that in the long run the advantage would be his. He presented the problem
to Blaise Pascal. It was in the correspondence between Pascal and Pierre Fermat, another
French mathematician, that the theory of probability had its beginning. Many of the
probability calculations were based on objects of gambling: the coin, the die, and cards.

In order to understand probability, it is useful to have some familiarity with sets and
operations involving sets. This is because in probability theory, we make use of the idea
of set and operations involving sets.

Set: A set is a collection of elements. The elements of a set may be people, horses, desks,
files in a cabinet, or even numbers.
Universal Set: It is a set containing everything in a given context. We denote the
universal set by S.
Intersection of Sets: If sets A and B have elements in common, we say they intersect.
The intersection of A and B is denoted A ∩ B. Example: Given that A = {2, 3, 4, 5, 6}
and B = {2, 4, 5, 7, 8}, then A ∩ B = {2, 4, 5}.
Union of Sets: The union of A and B is the set containing all elements that are members
of A or B or both. This is denoted A ∪ B. Example: Given that A = {2, 3, 4, 5, 6} and
B = {2, 4, 5, 7, 8}, then A ∪ B = {2, 3, 4, 5, 6, 7, 8}.
Disjoint Sets: The sets A and B are said to be disjoint if they have no elements in
common. Thus, A ∩ B = ∅ (the empty set).
Complement of a Set: Given a set A, its complement, denoted A', is the set containing all
the elements in the universal set S that are not members of set A. The set A' is often
called "not A".

Basic definitions of terms relevant to the computation of probability
Experiment
An experiment is a process that leads to one of several possible outcomes. Examples of
experiments include tossing a coin, checking whether a switch is turned on or off, and
counting the imperfections in a piece of cloth.

Outcome

An outcome is the result of an experiment or other situation involving uncertainty. An


outcome of an experiment is some observation or measurement. Drawing a card out of a
deck of 52 playing cards is an experiment. One of the outcomes of the experiment may be
that the queen of diamonds is drawn.

Sample Space

The set of all possible outcomes of a probability experiment is called a sample space. It is
an exhaustive list of all the possible outcomes of an experiment. Each possible result of
such a study is represented by one and only one point in the sample space, which is
usually denoted by S.

Examples
Experiment Rolling a die once:
Sample space S = {1,2,3,4,5,6}
Experiment Tossing a coin:
Sample space S = {Heads,Tails}
Experiment Measuring the height (cms) of a girl on her first day at school:
Sample space S = the set of all possible real numbers

Deterministic Experiment
An experiment is deterministic if its observed results are not subject to chance. In a
deterministic experiment, if the experiment is repeated a number of times under exactly

the same conditions, we expect the same results. For example, if we measure the distance
between the points P and Q(in Km.) many times under the same conditions, we expect to
have the same results.

Random Experiment
An experiment is random if its outcomes are uncertain. If a random experiment is
repeated under identical conditions, the outcomes may be different as there may be some
random phenomena or chance mechanism at work affecting the outcomes. For example,
tossing a coin or rolling a die is a random experiment since in each case the process can
lead to more than one possible outcome. A synonym for random experiment is stochastic
experiment.

Trial
Each repetition of an experiment is a trial. For example, if a coin is tossed four times,
each single toss is a trial.

Sample Space
Sample space is the universal set S pertinent to a given experiment. It is the set of all
possible outcomes of an experiment.

Sample Point
Each outcome in a sample space is an element or sample point. For example, when an
experiment is performed, it gives rise to certain outcomes. These outcomes are called
sample points( or elementary events). A collection of all possible outcomes or sample
points is called sample space. A coin tossed once may either result in a head(H) or a
tail (T), so there are only two outcomes of this experiment. Here, the sample space, S,
consists of only two sample points. Thus, S = {H, T}. The number of sample points is 2.
This is denoted n(S) = 2. A fair die thrown once may show up on its face any of the six
numbers 1, 2, 3, 4, 5, and 6. There are six possible outcomes of this experiment and so
the sample space is S = {1, 2, 3, 4, 5, 6}. The number of sample points is n(S) = 6.

Equally Likely Outcomes:
When any outcome of an experiment has the same chance of occurrence as any other
outcome, then the outcomes are said to be equally likely. When a die is tossed once, the
outcomes 1, 2, 3, 4, 5, and 6 are all equally likely as long as the die is fair.

Event

An event is any collection of outcomes of an experiment. Formally, any subset of the


sample space is an event. It is the set of basic outcomes. For example, the event “an ace is
drawn out of a deck of card” is the set of four aces within the sample consisting of 52
cards. This event occurs whenever one of the four aces (the basic outcomes) is drawn. We
denote an event by a capital letter. If we roll a die once, the event of rolling a "4" can be
satisfied by only one outcome, that is, the 4 itself. If we let A represent this event, then
A = {4}. The event of rolling an odd number can be satisfied by any one of three
outcomes: 1, 3, and 5. If we let B represent this event, then B = {1, 3, 5}.

Any event which consists of a single outcome in the sample space is called an elementary
or simple event. Events which consist of more than one outcome are called compound
events.

Set theory is used to represent relationships among events. In general, if A and B are two
events in the sample space S, then
A ∪ B (A union B) = 'either A or B occurs or both occur'
A ∩ B (A intersection B) = 'both A and B occur'
A ⊂ B (A is a subset of B) = 'if A occurs, so does B'
A' = 'event A does not occur'
∅ (the empty set) = an impossible event
S (the sample space) = an event that is certain to occur

Example
Experiment: rolling a die once
Sample space S = {1,2,3,4,5,6}
Events A = 'score < 4' = {1,2,3}
B = 'score is even' = {2,4,6}
C = 'score is 7' = ∅
A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}
A ∩ B = 'the score is < 4 and even' = {2}
A' = 'event A does not occur' = {4,5,6}

Mutually Exclusive Events


Two events, A and B, are said to be mutually exclusive if they cannot occur together. That
is, two events, A and B, are mutually exclusive if the occurrence of A implies the non-
occurrence of B and vice versa. Synonyms for mutually exclusive events are disjoint
events, incompatible events, or non-overlapping events. For example, if a die is tossed
once, the numbers 4 and 5 cannot occur together, and hence the events A = {4} and
B = {5} are mutually exclusive. Note that for mutually exclusive events, A ∩ B = ∅.

Mutually Inclusive Events


Two events, A and B, are said to be mutually inclusive if they can occur together.
Synonyms for mutually inclusive events are compatible events, or overlapping events.
For example, if a die is tossed once, the events A = {odd numbers} and
B = {prime numbers} are mutually inclusive. Note that for mutually inclusive events,
A ∩ B ≠ ∅.

Collectively Exhaustive Events


Two or more events defined on the sample space are said to be collectively exhaustive if
their union is equal to the sample space S. The events A = {odd numbers} and
B = {even numbers} include every possible outcome in the sample space in the experiment
of tossing a die once. Note that A = {1, 3, 5} and B = {2, 4, 6}. Thus A ∪ B = {1, 2, 3, 4,
5, 6}. In an experiment, if the events are collectively exhaustive, it means that at least one
of the events must occur. If the set of events is collectively exhaustive and the events are

mutually exclusive, the sum of the probabilities is equal to one. Note that in the
experiment of tossing a coin once, S = {H, T} and the events A = {H} and B = {T}.
The events are mutually exclusive because A and B cannot both occur at the same time.
Also, the events A and B are collectively exhaustive because A ∪ B = {H, T}. Thus,
the sum of the probabilities of the events A and B is equal to 1.

Independent Events
Two events A and B are said to be independent if the occurrence (or non-occurrence) of
one of them is not affected by the occurrence (or non-occurrence) of the other. For
example, if two coins are tossed, the event "Head" on the first coin and "Tail" on the
second coin are independent.

Dependent Events
Two events A and B are said to be dependent if the occurrence (or non-occurrence) of
one of them is affected by the occurrence (or non-occurrence) of the other. For example,
suppose a box contains 2 red pens and 3 blue pens and two are picked at random
successively. The event "blue pen" in the second picking and the event "red pen" in the
first picking are dependent.

Concept
Probability is a concept that most people understand naturally, since words such as
“chance’, “likelihood’, “possibility’, and “proportion” are used as part of everyday
speech. For example, most of the following, which might be heard in business situations
are in fact statements of probability.
i. There is a 30% chance that the job will not be completed on time.
ii. There is no probability of delivering the goods before Monday. Etc…

Probability is the likelihood or chance that a particular event will occur. It could refer to
the chance of picking a black card from a deck of cards, or the chance that new consumer
product on the market will be successful, etc. Probability is a numerical measure of the
likelihood that an event will occur. These values are always assigned on a scale from 0 to
1. A probability near 0 indicates an event is very unlikely to occur(no chance of
occurring); a probability near 1 indicates an event is almost certain. Other probabilities

between 0 and 1 represent degrees of likelihood that an event will occur. For example, if
we consider the event “rain tomorrow” we understand that when the weather report
indicates ‘a near-zero probability of rain’ there is almost no chance of rain. However, if
a 0.90 probability of rain is reported, we know that rain is likely to occur. A 0.50
probability indicates that rain is just as likely to occur as not.

Chance and the assessment of risk play a part in everyone’s life. It might be something as
simple as tossing a coin at the start of a game, playing cards, owning premium bonds or
playing the National Lottery etc…. Probability has found a wide range of business
application. In addition to the calculation of risk in the banking and insurance industries,
probability provides the basis of many of the sampling procedures used in market
research and quality control. Investment appraisal requires an assessment of risk and a
measure of expected outcomes. The planning of major projects needs to take account of
uncertainties. Given that the outcomes of most activities are not known with certainty
(not deterministic), it is useful to understand them in probabilistic terms.

Approaches to Probability

A Priori Classical Probability: The probability is based on the assumption that the
outcomes are equally likely. The probability of success is based on prior knowledge
of the process involved. In the simplest case where each event is equally likely, the
chance of occurrence of the event is defined as follows
Probability of occurrence = X / T
Where:
X = Number of outcomes in which the event we are looking for occurs
T = Total number of possible outcomes

The above formula can also be stated as follows:


Probability of occurrence P(A) = n(A) / n(S)
Where:
P(A) = Probability of event A
n(A) = number of elements in the set of event A

n(S) = number of elements in the set of the sample space, S

In some experiments, all outcomes are equally likely. For example if you were to choose
one winner in a raffle from a hat, all raffle ticket holders are equally likely to win, that is,
they have the same probability of their ticket being chosen. This is the equally-likely
outcomes model and is defined to be:

P(E) = (number of outcomes corresponding to event E) / (total number of outcomes)

Examples

1. The probability of drawing a spade from a pack of 52 well-shuffled playing cards


is 13/52 = 1/4 = 0.25 since
event E = 'a spade is drawn';
the number of outcomes corresponding to E = 13 (spades);
the total number of outcomes = 52 (cards).

Example: Consider the experiment of rolling a six-sided die once. What is the
probability of the event ‘an even number appears face up”?
Solution:
Possible outcomes: S = {1, 2, 3, 4, 5, 6}, thus n(S) = 6
Favourable outcomes (event A) = {2, 4, 6}, thus n(A) = 3
∴ Probability of occurrence P(A) = n(A)/n(S) = 3/6 = 1/2
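For readers who want to check such classical probabilities by direct counting, a minimal Python sketch is given below. The sample space and event are the ones from the example; the code itself is only an illustration and is not part of the original notes.

# Classical (a priori) probability: P(A) = n(A) / n(S)
sample_space = [1, 2, 3, 4, 5, 6]                    # S: rolling a die once
event_A = [x for x in sample_space if x % 2 == 0]    # A: an even number appears face up

p_A = len(event_A) / len(sample_space)               # n(A) / n(S)
print(p_A)                                           # 0.5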

Empirical (Relative Frequency) Probability


Here the probability is again defined as the ratio of the number of favourable outcomes to the
total number of outcomes, but these counts are based on observed historical (past) data. This type of probability
could refer to the proportion of individuals who actually purchase a television set, who
prefer a certain political party, or who have a part time job while attending school. The
probability of an event happening in the long run is determined by observing what
fraction of similar events happened in the past. In terms of a formula, we have:

Probability of event happening = (Number of times the event occurred in the past) / (Total number of observations)

Example: A study of 751 Business Administration students at a university revealed that


383 of the 751 were not employed in their major area of study in the university. What
is the probability that a particular student will be employed in an area other than
his/her major?
Solution:
Probability of event happening = 383/751 = 0.51

Subjective Probability
Whereas the probability of a favourable event in the two previous approaches was computed
objectively, either from prior knowledge or from actual data, subjective probability refers
to the chance occurrence assigned to an event by a particular individual based on his
degree of belief or strength of conviction that the event will occur. This chance may be
different from the subjective probability assigned by another individual. For example, the
inventor of a new toy may assign quite a different probability to the chance of success for
the toy than the managing director of the company. The assignment of subjective
probabilities to various events is usually based on a combination of an individual’s past
experience, personal opinion, and analysis of a particular situation. Subjective probability
is especially useful in making decisions in situations in which the probability of various
events cannot be determined empirically.

A subjective probability describes an individual's personal judgement about how likely a


particular event is to occur. It is not based on any precise computation but is often a
reasonable assessment by a knowledgeable person.

Like all probabilities, a subjective probability is conventionally expressed on a scale from


0 to 1; a rare event has a subjective probability close to 0, a very common event has a
subjective probability close to 1.

Example
A Rangers supporter might say, "I believe that Rangers have probability of 0.9 of winning
the Premier Division this year since they have been playing really well."
Rules of Probability

Addition Rule: This applies when we are considering two or more events and wish to
determine the probability that at least one of the events will take place. In other words,
the addition rule is a result used to determine the probability that event A or event B
occurs or both occur
The result is often written as follows, using set notation:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∪ B) = probability that event A or event B occurs
P(A ∩ B) = probability that event A and event B both occur

There are two variations of this rule:

1. Mutually exclusive events: If A and B are two mutually exclusive events, that is, events
which cannot occur together, then the probability of obtaining either A or B is equal to the
probability of obtaining A plus the probability of obtaining B. The sets corresponding to
the events are disjoint. In general, we have: P(A or B) = P(A) + P(B).

Two events are mutually exclusive (or disjoint) if it is impossible for them to occur
together.

Formally, two events A and B are mutually exclusive if and only if A ∩ B = ∅.
If two events are mutually exclusive, they cannot be independent and vice versa.

Examples

1. Experiment: Rolling a die once


Sample space S = {1,2,3,4,5,6}
Events A = 'observe an odd number' = {1,3,5}
B = 'observe an even number' = {2,4,6}
A ∩ B = ∅ (the empty set), so A and B are mutually exclusive.
2. A subject in a study cannot be both male and female, nor can they be aged 20 and
30. A subject could however be both male and 20, or both female and 30.

Worked Example: An automatic Shaw machine fills plastic bags with a mixture of
beans, broccoli, and other vegetables. Most of the bags contain the correct weight, but
because of the slight variations in the size of the beans and other vegetables, a package
might be slightly underweight or overweight. A check of 4,000 packages filled in the past
month revealed the following:

Weight Event No. of packages


Underweight A 100
Satisfactory B 3,600
Overweight C 300
Total 4,000

What is the probability that a particular package will be either underweight or


overweight?

Solution:
Let event A = underweight
Let event C = overweight
The corresponding probabilities are P(A) = 100/4,000 = 0.025 and P(C) = 300/4,000 = 0.075
∴ P(A or C) = P(A) + P(C) = 0.025 + 0.075 = 0.10

2. Non-mutually exclusive events: This rule is also known as the Joint Addition Rule of
probability. It gives the likelihood that at least one of two or more overlapping events occurs.
For two events A and B, P(A or B) = P(A) + P(B) – P(A and B). Note that P(A and B) is the
same as P(A ∩ B). Also P(A or B) is the same as P(A ∪ B).

Example 1: What is the probability that a card chosen at random from a standard deck of
playing cards will be either a king or a heart?
Solution:
EVENT PROBABILITY EXPLANATION
A(King) P(A) = 4/52 4 kings in a deck of 52 cards
B(Heart) P(B) = 13/52 13 Hearts in a deck of 52 cards
C(King of Heart) P( C ) = 1/52 1 King of Hearts in a deck of 52 cards
P(A or B) = P(A) + P(B) - P(A ∩ B) = 4/52 + 13/52 – 1/52 = 16/52 = 0.3077
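The same arithmetic can be checked in a few lines of Python; the sketch below simply restates the addition rule for the king-or-heart example and is purely illustrative.

from fractions import Fraction

p_king = Fraction(4, 52)              # 4 kings in a deck of 52 cards
p_heart = Fraction(13, 52)            # 13 hearts in a deck of 52 cards
p_king_of_hearts = Fraction(1, 52)    # the king of hearts is counted in both events

p_king_or_heart = p_king + p_heart - p_king_of_hearts
print(p_king_or_heart, float(p_king_or_heart))       # 4/13, approximately 0.3077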

Example 2: A die is tossed once. What is the probability that the number that shows up is
even or prime or both?

Conditional Probability

In many situations, once more information becomes available, we are able to revise our
estimates for the probability of further outcomes or events happening. For example,
suppose you go out for lunch at the same place and time every Friday and you are served
lunch within 15 minutes with probability 0.9. However, given that you notice that the
restaurant is exceptionally busy, the probability of being served lunch within 15 minutes
may reduce to 0.7. This is the conditional probability of being served lunch within 15
minutes given that the restaurant is exceptionally busy.

The usual notation for "event A occurs given that event B has occurred" is "A | B" (A
given B). The symbol | is a vertical line and does not imply division. P(A | B) denotes the
probability that event A will occur given that event B has occurred already.

A rule that can be used to determine a conditional probability from unconditional


probabilities is:

P(A | B) = P(A ∩ B) / P(B)

where:
P(A | B) = the (conditional) probability that event A will occur given that event B has
occurred already
P(A ∩ B) = the (unconditional) probability that event A and event B both occur
P(B) = the (unconditional) probability that event B occurs
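A quick numerical sketch of this rule follows; the two probabilities used are assumed values chosen only for illustration.

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_and_b = 0.15     # assumed joint probability of A and B
p_b = 0.30           # assumed probability of B

p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)   # 0.5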

MULTIPLICATION RULE: The multiplication rule is a result used to determine the


probability that two events, A and B, both occur.

There are two versions of this rule:


Independent events: Two events are independent when the occurrence (or non-
occurrence) of one event has no effect on the probability of the other event. Given that
events A and B are independent, P(A and B) = P(A)*P(B). Note that for independent
events, P(A/B) = P(A)


The multiplication rule follows from the definition of conditional probability.

The result is often written as follows, using set notation:

P(A ∩ B) = P(A | B).P(B) = P(B | A).P(A)

where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∩ B) = probability that event A and event B both occur
P(A | B) = the conditional probability that event A occurs given that event B has
occurred already
P(B | A) = the conditional probability that event B occurs given that event A has
occurred already
For independent events, that is events which have no influence on one another, the rule
simplifies to:

P(A ∩ B) = P(A).P(B)
That is, the probability of the joint events A and B is equal to the product of the
individual probabilities for the two events.

In other words, two events are independent if the occurrence of one of the events gives us
no information about whether or not the other event will occur; that is, the events have no
influence on each other.

Example 1
Suppose that a man and a woman each have a pack of 52 playing cards. Each draws a
card from his/her pack. Find the probability that they each draw the ace of clubs.
We define the events:
A = the event that the man draws the ace of clubs, with P(A) = 1/52
B = the event that the woman draws the ace of clubs, with P(B) = 1/52
Clearly events A and B are independent, so:

P(A ∩ B) = P(A).P(B) = 1/52 × 1/52 = 0.00037


That is, there is a very small chance that the man and the woman will both draw the ace
of clubs.

Example 2: An oil exploration company assesses that the probability of a successful


strike in the North Sea is 0.6 while in the Irish Sea the probability is 0.4. If the oil
company drills two bore holes, one in each sea, simultaneously, what is the probability
that it will strike oil in both cases?
Solution:
P(A) = 0.6, P(B) = 0.4
∴ P(A and B) = P(A)×P(B) = 0.6 × 0.4 = 0.24

Dependent events: Two events are dependent when the occurrence (or non-occurrence)
of one event has effect on the probability of the other event. If events A and B are
dependent, then P(A and B) = P(A)×P(B/A), or equivalently P(A and B) = P(B)×P(A/B).
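This rule can also be illustrated with a small simulation. The sketch below (illustrative only) uses the pen example from the definition above: 2 red pens and 3 blue pens picked successively without replacement, for which the rule gives P(both blue) = 3/5 × 2/4 = 0.3.

import random

box = ['blue'] * 3 + ['red'] * 2      # 3 blue pens and 2 red pens
trials = 100_000
both_blue = 0

for _ in range(trials):
    first, second = random.sample(box, 2)   # two picks without replacement
    if first == 'blue' and second == 'blue':
        both_blue += 1

print(both_blue / trials)             # close to 3/5 * 2/4 = 0.3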

Example: A box contains 10 roll of films, 3 of which are defective. Two rolls are
selected at random one after the other without replacement. What is the probability of
selecting a defective roll followed by another defective roll?
Solution:
Let A = event that the first roll is defective
Let B = event that the second roll is defective
P(A) = 3/10 and P(B/A) = 2/9
∴ P(A and B) = P(A)×P(B/A) = 3/10 × 2/9 = 6/90 = 1/15 ≈ 0.07

Law of Total Probability


The result is often written as follows, using set notation:

P(A) = P(A ∩ B) + P(A ∩ B')

where:
P(A) = probability that event A occurs
P(A ∩ B) = probability that event A and event B both occur
P(A ∩ B') = probability that event A and event B' both occur, i.e. A occurs and B
does not.
Using the multiplication rule, this can be expressed as
P(A) = P(A | B).P(B) + P(A | B').P(B')

Bayes' Theorem

Bayes' Theorem is a result that allows new information to be used to update the
conditional probability of an event.

Using the multiplication rule gives Bayes' Theorem in its simplest form:

P(A | B) = P(B | A).P(A) / P(B)

Using the Law of Total Probability for P(B), this becomes:

P(A | B) = P(B | A).P(A) / [P(B | A).P(A) + P(B | A').P(A')]
where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A') = probability that event A does not occur
P(A | B) = probability that event A occurs given that event B has occurred already
P(B | A) = probability that event B occurs given that event A has occurred already

P(B | A') = probability that event B occurs given that event A has not occurred
already
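A short numerical sketch of Bayes' Theorem is given below. The three input probabilities are invented purely for illustration.

# Bayes' Theorem: P(A | B) = P(B | A).P(A) / [P(B | A).P(A) + P(B | A').P(A')]
p_a = 0.01               # assumed prior probability of A
p_b_given_a = 0.95       # assumed probability of B given A
p_b_given_not_a = 0.10   # assumed probability of B given A'

numerator = p_b_given_a * p_a
denominator = numerator + p_b_given_not_a * (1 - p_a)
p_a_given_b = numerator / denominator
print(round(p_a_given_b, 4))   # 0.0876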

TUTORIAL QUESTIONS

1. The chairman of the board of Rudd Industries is delivering a speech to the


company stockholders explaining his position that the company should merge
with Zimmerman Plastics. He has received six pieces of mail on the issue and is
interested in the number of writers who agree with him.
a. What is the experiment?
b. What are some of the possible events?
c. List two possible outcomes.

2. A survey of a class of 34 students in a Business class showed the following


selection of courses
Courses No. of students
Accountancy 10
Finance 5
Info. Systems 3
Management 6
Marketing 15
Suppose you select a student at random. What is the probability that she/he takes
management?

3. The Department of Highways of New Juabeng Municipality is considering


widening the streets in the town. Before a final decision is made, 500 citizens
are asked if they support the idea.
a. What is the experiment?
b. What are some of the possible events?
c. List two possible outcomes
5. A large number of automobile drivers were selected at random and the number of
traffic violations they had, if any, were recorded

No. of violations No. of drivers
0 1,910
1 46
2 18
3 12
4 9
5 or more 5
a. What is the experiment?
b. List one possible event
c. What is the probability that a particular driver had exactly two violations?

6. Before a nationwide survey was conducted, 40 people were selected to test the
questionnaire. One question about whether abortions should be legal required a yes or
no answer.
a. What is the experiment?
b. List one possible event
c. Ten of the 40 people favoured the legalization of abortions. Based on these
sample responses, what is the probability that a particular person will be in
favour of the legalization of abortions?
d. Are each of the possible outcomes equally likely and mutually exclusive?

7. An automatic Shaw machine fills plastic bags with a mixture of beans, broccoli and
other vegetables. Most of the bags contain the correct weight, but because of the slight
variation in the size of the beans and other vegetables, a package might be slightly
underweight or overweight. A check of 10,000 packages filled in the past month revealed
the following:

Weight Event No. of packages


Underweight A 150
Satisfactory B 9,500
Overweight C 350
Total 10,000
i. What is the experiment
ii. What is the probability that a particular package will be either underweight or
overweight?
iii. What are the events?
iv. Are the events mutually exclusive or non-mutually exclusive?

8. What is the probability that a randomly chosen card from a standard deck of
playing cards is either a
a. king or a heart?
b. club or a heart?
c. queen or a jack?

9. The probabilities of two events A and B are 0.20 and 0.30 respectively. The probability
that both A and B occur is 0.15. What is the probability that either A or B will occur?

10. The probabilities of two events A and B are 0.35 and 0.65 respectively. The
probability that both A and B occur is 0.20. What is the probability that either A or B will
occur? Are the events mutually exclusive/inclusive? Give reason

11. A survey of executives dealt with their loyalty to the company. That is, whether or not
they would leave if offered a position higher than their present one by another company.
The table below shows the results of the survey.
LENGTH OF SERVICE
LOYALTY Less than 1 1 to 5 years 6 to 10 More than Total
year years 10 years
Would remain 10 30 5 75 120
Would not remain 25 15 10 30 80
Total 35 45 15 105 200

What is the probability of randomly selecting an executive who is loyal to the company
(would remain) and who has more than 10 years of service?

12. Suppose that the following contingency table was set up

B B1
A 10 20
A1 20 40
What is the probability of:
d. Event A?
e. Event B?
f. Event A1?

g. Event A and B?
h. Event A and B1?
i. Event A1 and B1?
j. Event A or B?
k. Event A or B1?
l. Event A1 or B1?

13. In the past several years, credit card companies have made an aggressive effort to
solicit new accounts from college students. Suppose that a sample of 200 students at a
college indicated the following information as to whether the student possessed a bank
and/or a travel and entertainment credit card.
Travel and entertainment credit card
Bank credit card Yes No
Yes 60 60
No 15 65
If a student is selected at random, what is the probability that the student:
i. Had a bank credit card?
ii. Had a travel and entertainment credit card?
iii. Had a bank credit card and travel and entertainment credit card?
iv. Had a bank credit card or travel and entertainment credit card?
v. Does not have a bank credit card and travel and entertainment credit card?
vi. Neither a bank credit card nor travel and entertainment credit card?
14. A sample of 500 respondents was selected in a large metropolitan area to determine
previous information concerning consumer behaviour. Among the questions asked was,
“Do you enjoy shopping for clothing?” Of 240 males, 136 answered yes. Of 260 females,
244 answered yes.
m. Set up a 2x2 contingency table to evaluate the probabilities.
n. What is the probability that a respondent chosen at random:
i. Is a male?
ii. Enjoys shopping for clothing?
iii. Is a female and enjoys shopping for clothing?
iv. Is a male and does not enjoy shopping for clothing?
v. Is a female or enjoys shopping for clothing?
vi. Is a male or does not enjoy shopping for clothing?
vii. Is a male or female?

15. Suppose that the following contingency table was set up


B B1
A 10 30
A1 25 35

i. Find the following:


a. P(A/B)
b. P(A/B1)
c. P(A1/B1)

d. P(A1/B)

ii. Are A and B statistically independent?

16. If P(A and B) = 0.40 and P(B) = 0.80, find P(A/B)

17. If P(A) = 0.70, P(B) = 0.60 and A and B are statistically independent, find P(A and B).

18. If P(A) = 0.3, P(B) = 0.4 and P(A and B) = 0.20, are A and B statistically independent?

19. A bag contains 6 red balls and 4 black balls. Two balls are drawn at random one at a
time. Find the probability that:
i. Both balls are red.
j. Both balls are black.
k. The second ball is red given that the first is black.
l. The second ball is black given that the first is red.
COMBINATORIAL CONCEPTS
This deals with combination and permutation. These techniques are used to determine
the total number of possible outcomes.

FACTORIAL: For any positive integer n, we define n factorial denoted n!,


as n(n-1)(n-2)...1. For example, 4! = 4(4-1)(4-2)(4-3) = 4*3*2*1 = 24.
By definition, 1! = 1 and 0! = 1

PERMUTATION: This refers to the number of ways in which a set of objects can
be arranged in order. Order is very important. The number of permutations of n
objects taking r at a time denoted nPr is given by

nPr = n! / (n-r)!

Example: Suppose 4 people are to be randomly chosen out of 10 people who agreed
to be interviewed in a market survey. The four people are to be assigned to four
interviewers. How many possibilities are there?
Solution: The possibilities are given by the relation nPr.

n=10 and r = 4

∴ 10P4 = 10! / (10-4)! = (10*9*8*7*6*5*4*3*2*1) / (6*5*4*3*2*1) = 5040

COMBINATION: These are the possible selections of r items from a group of n


items regardless of the order of selection. The number of combinations denoted by
nCr is given as:

nCr = n! / (r!(n-r)!)

Example: Suppose that 3 out of 10 members of the board of directors of a large


cooperation are to be randomly selected to serve on a particular committee. How
many possible selections are there?
Solution: n = 10 and r = 3

∴ 10C3 = 10! / (3!(10-3)!) = (10*9*8*7*6*5*4*3*2*1) / (3*2*1 × 7*6*5*4*3*2*1) = 120
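Both of the worked examples above can be reproduced with Python's standard library, as in the sketch below (an illustration, not part of the original notes).

import math

# Permutations: 4 interviewees chosen and assigned (ordered) from 10 people
print(math.perm(10, 4))   # 5040 = 10! / (10-4)!

# Combinations: committee of 3 chosen from 10 directors, order irrelevant
print(math.comb(10, 3))   # 120 = 10! / (3! * 7!)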

EXERCISE
1. Three electronic parts are to be assembled into a plug-in unit for a television set. The
parts can be assembled in any order. How many different ways can the 3 parts be
assembled?
2. Suppose the GHAMOT machine shop has 8 screw machines available but only 3 spaces
available in the production area of the shop. In how many different ways can the 8
machines be arranged in the available spaces?
3. The marketing department has been given the assignment of designing colour codes for
42 different lines of compact disk sold by Goody Records. Three colours are to be used
on each CD, but a combination of 3 colours used on one CD cannot be rearranged and
used to identify a different CD. Would 7 colours taking 3 at a time be adequate to colour
code the 42 lines?
4. A transport manager has to plan routes for his drivers. There are 3 deliveries to be
made to customers X, Y and Z. How many different routes are possible?

5. A company has 4 training officers A,B,C, D and two training sections. In how many
different ways may the 4 officers be assigned to the two sections X and Y?
6. A committee of 5 is to be chosen from 4 men and 5 women to work on a project.
a) In how many ways can the team be chosen?
b) In how many ways can the committee be chosen to include just 3 women?
c) What is the probability that the committee includes
i. at least 3 women?
ii. more than 3 women?

7. (a) State the addition and multiplication laws of probability.

b) In a stock room, 6 adjacent shelves are available for storing 6 different items. The
stock of each item can be stored satisfactorily on any shelf.

(i) In how many ways can the 6 items be stored on the 6 shelves?

(ii) If there are 7 different items to be stored, but only 5 shelves are available, how many
arrangements are possible.

Binomial Distribution
In cases where the variable of interest is dichotomous (has two parts or outcomes),
the variable is binomial. Many business situations give rise to the compilation of
simple “yes” or “no” type of answers to particular questions. Examples of these
situations include:
i. In sampling the output of a production line, we could record for each item
coming off the line whether or not it is defective (that is defective and
nondefective)
ii. A salesgirl may or may not succeed in obtaining an order
iii. A consumer survey may indicate whether or not people are likely to buy a
product.
In each of the above cases, only two outcomes are possible. Statistical analysis of these
types of situation may be referred to as binomial experiment. We classify the two possible
outcomes as “success” and “failure”. In binomial experiment, we are interested in the
number of successes(or failures) occurring in n independent trials( such as x items
inspected on a production line). Typically, a binomial random variable is the number of
successes in a series of trials, for example, the number of 'heads' occurring when a coin is
tossed 50 times.

If we let x represent the random variable of the number of successes occurring in n such
trials, then x can take on any of the discrete values 0, 1, 2, .., n. The probabilities
associated with each of the possible outcomes have a special frequency distribution called
the binomial probability distribution.
A discrete random variable X is said to follow a Binomial distribution with parameters n
and p, written X ~ Bi(n,p) or X ~ B(n,p), if it has probability distribution

P(X = x) = nCx p^x (1 - p)^(n-x)

where
x = 0, 1, 2, ......., n
n = 1, 2, 3, .......
p = success probability; 0 < p < 1

The Binomial distribution has expected value E(X) = np and variance V(X) = np(1-p).

Characteristics of binomial distribution:


i. Any outcome on each trial of an experiment is classified into two mutually
exclusive and exhaustive categories: success or failure. For example, the
answer to a true or false question is either true or false. The answer cannot be
both true and false at the same time.
ii. The probability of a success (p) remains the same (is constant) from one trial to
another. So also is the probability of a failure (1-p = q). For example, the
probability that you will guess the first question of a true/false test correctly (a
success) is ½. The probability that you will guess right on the second
question (second trial) is also ½.
iii. The trials are independent, meaning that the outcome of one trial does not
affect the outcome of another trial.
Example: A quality inspector selects 5 components at random from a production line
in order to test whether the machinery is functioning normally. It is known that when

it is functioning normally, the probability of a component selected at random being
defective is .05. What is the probability that there will be 2 defectives?
Solution: P(success) = 0.05, P(failure) = 1 - 0.05 = 0.95, n = 5, and x = 2

P(x = 2) = nCx p^x q^(n-x) = 5C2 (0.05)^2 (0.95)^3 = 0.0214
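The same calculation, and the mean and variance quoted earlier, can be reproduced with the short sketch below (illustrative only).

import math

def binomial_pmf(x, n, p):
    # P(X = x) for X ~ B(n, p): nCx * p^x * (1-p)^(n-x)
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Quality inspection example: n = 5 components, p = 0.05 defective
print(round(binomial_pmf(2, 5, 0.05), 4))    # 0.0214

# Expected value and variance: E(X) = np and V(X) = np(1-p)
n, p = 5, 0.05
print(n * p, n * p * (1 - p))                # 0.25 0.2375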

TUTORIAL QUESTIONS
QUESTION 1
Let X be a binomial random variable. Compute the following probabilities:
a. P(x=3) if n=5 and p=.2
b. P(x=2) if n=6 and p=.3
c. P(x=5) if n=5 and p=.75

QUESTION 2
A shoe store’s records show that 30% of customers making a purchase use a credit card to
make payment. On a particular day, 20 customers purchased shoes from the store.
o. Find the probability that at least customers use a credit card.
p. What is the expected number of customers who use a credit card?

QUESTION 3
An auditor is preparing for a physical count of inventory as a means of verifying its
values. Items counted are reconciled with a list prepared by the storeroom supervisor.
Normally, 20% of the items counted cannot be reconciled without reviewing invoices.
The auditor selects 10 items.

Find the probability of each of the following:
i. Up to 4 items cannot be reconciled.
ii. At least 6 items cannot be reconciled.
iii. Between 4 and 6 items (inclusive) cannot be reconciled.

QUESTION 4
A student majoring in Accounting is trying to decide upon the number of firms to which
she should apply. Given her work experience, grade and extracurricular activities, she has
been told by a placement counselor that she cannot expect to receive a job offer from
80% of the firms to which she applies. Wanting to save time, the student applies to only
five firms. Assuming the counselor’s estimate is correct, find the probability that the
student receives the following:
i. No offers
ii. At most 2 offers
iii. Between 2 and 4 offers (inclusive)
iv. 5 offers
QUESTION 5
A multiple-choice quiz has 15 questions. Each question has five possible answers, of
which only one is correct.
a. What is the probability that sheer guesswork will yield at least seven
correct answers?
b. What is the expected number of correct answers by sheer guesswork?
QUESTION 6
A sign on the gas pumps of a certain chain of gasoline station encourages customers
to have their oil checked, claiming that one out of every four cars should have its oil
topped up.
a. What is the probability that exactly 3 of the next 10 cars entering a station
should have their oil topped up?
b. What is the probability that at least half of the next 10 cars entering a station
should have their oil topped up?
c. What is the probability that at least half of the next 20 cars entering a
station should have their oil topped up?
QUESTION 7

A company Minibus has 7 passenger seats and on a routine run it is estimated that any
passenger seat will be filled with probability 0.42.
a. What is the mean and variance of the binomial distribution of the number of
passengers on a routine run?
b. Calculate the probability (to 3 decimal places) that on a routine run:
i. There will be no passengers
ii. There will be just one passenger
iii. There will be exactly two passengers
iv. There will be at least three passengers
QUESTION 8
Suppose a poll of 20 voters is taken in a large city. The purpose is to determine the
number who favour a candidate for mayor. Suppose that 60% of all the city voters
favour the candidate.
a. find the mean and standard deviation of x.
b. find the probability that:
i. x ≤ 3
ii. x > 17
Poisson Distribution
While binomial random variable counts the number of successes that occur in a fixed
number of trials, a Poisson random variable counts the number of rare events (successes)
that occur in a specified time interval or a specified region. Activities to which the
Poisson distribution can be successfully applied include counting the number of
telephone calls received by a switchboard in a specified time period, counting the number
of arrivals at a service location (such as tollbooth or counter etc) in a given time period.

Typically, a Poisson random variable is a count of the number of events that occur in a
certain time interval or spatial area. For example, the number of cars passing a fixed point
in a 5 minute interval, or the number of calls received by a switchboard during a given
period of time.

A discrete random variable X is said to follow a Poisson distribution with parameter m,


written X ~ Po(m), if it has probability distribution

P(X = x) = e^(-m) m^x / x!

where
x = 0, 1, 2, ...
m > 0.

A Poisson distribution possesses the following properties:


i. The number of successes that occur in any given interval is independent of the
number of successes that occur in any other interval
ii. The probability that a success will occur in an interval is the same for all
intervals of equal size and is proportional to the size of the interval
iii. The probability that two or more successes will occur in an interval
approaches zero as the interval becomes smaller.
iv. The length of the observation period is fixed in advance;

In Poisson experiment, success refers to occurrence of the event of interest, and interval
refers to either an interval of time or an interval of space.
.

The Poisson distribution has expected value E(X) = m and variance V(X) = m; i.e. E(X) =
V(X) = m.

The Poisson distribution can sometimes be used to approximate the Binomial distribution
with parameters n and p. When the number of observations n is large, and the success
probability p is small, the Bi(n,p) distribution approaches the Poisson distribution with
the parameter given by m = np. This is useful since the computations involved in
calculating binomial probabilities are greatly reduced.
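The Poisson probability function, and the approximation to the Binomial just described, can be sketched as follows; the parameter values used are assumptions chosen only for illustration.

import math

def poisson_pmf(x, m):
    # P(X = x) for X ~ Po(m): e^(-m) * m^x / x!
    return math.exp(-m) * m**x / math.factorial(x)

# Example with an assumed mean of m = 2 successes per interval
print(round(poisson_pmf(0, 2), 4))           # 0.1353

# Binomial with large n and small p compared with Po(m = np)
n, p = 1000, 0.002                           # assumed values, so m = np = 2
binom = math.comb(n, 3) * p**3 * (1 - p)**(n - 3)
print(round(binom, 4), round(poisson_pmf(3, 2), 4))   # 0.1806 and 0.1804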

QUESTION 1
The manager of a company has noted that she usually receives 10 complaint calls from
customers during a week(consisting of 5 working days) and that the calls occur at
random. Find the probability of her receiving exactly five such calls in a single day.

QUESTION 2
The number of calls received by a switchboard operator between 9am. And 10 am. has a
Poisson distribution with mean 12. Find the probability that the operator received
a. 1 call

b. at least five calls during the periods

QUESTION 3
The number of accidents that occur on an assembly line have a Poisson distribution, with
an average of three accidents a week.
a. Find the probability that a particular week will be accident free.
b. Find the probability that at least 3 accidents will occur in a week.
c. Find the probability that exactly 5 accidents will occur in a week.

CORRELATION
When the value of one variable is related to the value of another, they are said to be
correlated. Thus, correlation measures the relationship (association) between quantitative
variables. The coefficient of correlation is the measure of this association.

A correlation coefficient is a number between -1 and 1 which measures the degree to


which two variables are linearly related.

If there is perfect linear relationship with positive slope between the two variables, we
have a correlation coefficient of 1; if there is positive correlation, whenever one variable
has a high (low) value, so does the other. If there is a perfect linear relationship with
negative slope between the two variables, we have a correlation coefficient of -1.
Movements in one variable may cause movement in the same direction in the other
variable. For example, there is likely to be some correlation between a person’s height
and weight, or between the price of an item and the quantity supplied.

If there is negative correlation, whenever one variable has a high (low) value, the other
has a low (high) value. Movements in one variable may cause movement in the opposite

direction in the other variable. For example, price and quantity demanded of an item.
A correlation of 0 indicates no linear relationship between the variables.

A correlation coefficient of 0 means that there is no linear relationship between the


variables.

There are a number of different correlation coefficients that might be appropriate


depending on the kinds of variables being studied. For the purpose of this course, two of
these will be studied. These are:

i. Product moment coefficient of correlation( r ).


ii. Rank correlation coefficient( R ).

Product moment coefficient of correlation( r ).

Pearson's product moment correlation coefficient, usually denoted by r, is one example of


a correlation coefficient. It is a measure of the linear association between two variables
that have been measured on interval or ratio scales, such as the relationship between
height in inches and weight in pounds. However, it can be misleadingly small when there
is a relationship between the variables but it is a non-linear one.
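As an illustration (with made-up data, not taken from the notes), the sketch below computes r directly from the summation formula that is stated at the end of this subsection.

import math

x = [1, 2, 3, 4, 5]      # hypothetical values of the first variable
y = [2, 4, 5, 4, 5]      # hypothetical values of the second variable
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r = numerator / denominator
print(round(r, 4))       # 0.7746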

The Pearson correlation coefficient formula makes several assumptions about the nature of
data to which it is applied. First of all, the two variables are assumed to have been
measured using interval or ratio scales. If this is not the case, there are other types of
coefficients that can be computed that match the data on hand. A second implicit
assumption is that the nature of the relationship that we are trying to measure is linear.
The use of Pearson correlation coefficient also assumes that the variables you want to
analyse come from a bivariate normally distributed population. That is, the population is
such that all the observations with a given value of one variable have values of the second
variable that are normally distributed.

When this assumption is not justified, a non-parametric measure such as the Spearman
Rank Correlation Coefficient might be more appropriate.

If the variables are denoted by X and Y and the data is in the form (X1,Y1), (X2,Y2),
(X3,Y3), (X4,Y4), …, (Xn,Yn), then r is given by the formula

r = [nΣxy - (Σx)(Σy)] / √{[nΣx² - (Σx)²][nΣy² - (Σy)²]}

Rank correlation coefficient( R ).

The Spearman rank correlation coefficient is one example of a correlation coefficient. It


is usually calculated on occasions when it is not convenient, economic, or even possible
to give actual values to variables, but only to assign a rank order to instances of each
variable. It may also be a better indicator that a relationship exists between two variables
when the relationship is non-linear.

Commonly used procedures, based on the Pearson's Product Moment Correlation


Coefficient, for making inferences about the population correlation coefficient make the
implicit assumption that the two variables are jointly normally distributed. When this
assumption is not justified, a non-parametric measure such as the Spearman Rank
Correlation Coefficient might be more appropriate.

In computing this coefficient the actual values of the variables are not taken into
consideration but only their relative magnitudes. For each of the series of values x and y
we note the magnitudes of the items and assign them ranks in the descending order of
their magnitudes. In case of a tie, when two or more items have the same value, we adopt
the following methods:
i. All such items are given the same rank, that is, the rank in which the tie
occurred.

ii. They are given the same rank, namely the average of the ranks of places
occupied by these items. In this case, the ranks of some items may be
fractional.
iii. The next item gets as usual the rank which it would have but for the tie.
iv. The difference d in the two ranks of the corresponding items of the two series
is then ascertained.
v. The coefficient of correlation R is then given by:

R = 1 - (6Σd²) / (n(n²-1))

Where n is the number of items in each series. The value of R so obtained, also ranges
from -1 to +1. Since this coefficient is based on the ranks of the items in the two series, it
is known as Spearman’s rank correlation coefficient

In case m items in a series have the same rank, we add M = (1/12)(m³ - m) to the value of Σd²
as a correction factor. The coefficient of correlation is then given as:

R = 1 - 6(Σd² + M) / (n(n²-1))

There may be as many corrections as the number of groups of items having the same
rank. Thus, if there is more than one such group of items with common ranks, the
above correction (m may be different for each group) is added as many times as the
number of such groups. This method has the advantage that only the ranks of the items
are required to be known and not their actual values. Hence it can be profitably used in
those cases where the variables cannot be given numerical values, but only positions, for
example, in cases when the variables are attributes like honesty, intelligence, character,
etc.. It is also easier to compute than the product moment correlation coefficient.

The Spearman rank correlation coefficient is the recommended statistic to use when the
two variables have been measured using ordinal scale. If either one of the variables is
represented by rank order data, the best approach is to use the Spearman rank order
correlation rather than the Pearson product moment correlation coefficient.

Example 1: Calculate the coefficient of correlation for the following by the method of
rank difference.
x 75 88 95 70 60 80 81 50
y 120 134 150 115 110 140 142 100
Solution:
x y Rank of x Rank of y d d2
75 120 5 5 0 0
88 134 2 4 -2 4
95 150 1 1 0 0
70 115 6 6 0 0
60 110 7 7 0 0
80 140 4 3 1 1
81 142 3 2 1 1
50 100 8 8 0 0
Σd² = 6

The rank correlation coefficient is:

R = 1 - (6Σd²) / (n(n²-1)) = 1 - (6×6) / (8(8²-1)) = 1 - 36/504 = 0.93
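The same value can be obtained programmatically. The sketch below (illustrative only) ranks each series in descending order, as in the worked example, and applies the rank-difference formula; there are no tied values in this data set.

def ranks_descending(values):
    # Rank 1 goes to the largest value (no ties in this data)
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

x = [75, 88, 95, 70, 60, 80, 81, 50]
y = [120, 134, 150, 115, 110, 140, 142, 100]
n = len(x)

d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_descending(x), ranks_descending(y)))
R = 1 - 6 * d2 / (n * (n**2 - 1))
print(d2, round(R, 2))   # 6 and 0.93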

Example 2: The following table shows the marks obtained by ten students in
accountancy and statistics. Find Spearman’s coefficient of rank correlation.

No 1 2 3 4 5 6 7 8 9 10
x 45 70 65 30 90 40 50 75 85 60
y 35 90 70 40 95 40 60 80 80 50

Solution:
Student number   x    y    Rank of x   Rank of y   Rank difference (d)   d²
1 45 35 8 10 -2 4.00
2 70 90 4 2 2 4.00
3 65 70 5 5 0 0
4 30 40 10 8.5 1.5 2.25
5 90 95 1 1 0 0
6 40 40 9 8.5 0.5 0.25
7 50 60 7 6 1 1.00

8 75 80 3 3.5 -0.5 0.25
9 85 80 2 3.5 -1.5 2.25
10 60 50 6 7 -1 1
Total d =15
2

In statistics, the highest number of marks is 95. Hence this student gets rank 1. The rank
2 goes to the student with 90 marks. Now, there are two students who got 80 marks each.
They should get the 3rd and 4th ranks. Since their marks are equal, their ranks should also
be equal. Therefore, each of them is given the rank (3+4)/2 = 3.5. In a similar manner, the
two students who got 40 marks each are given the rank (8+9)/2 = 8.5 each.
The rank correlation coefficient is:
R = 1 - 6(Σd² + M) / (n(n²-1))

Where M = 1/1/(m3 – m)
For the given data, there are two groups of two figures having
the same item. So, m1(number of student having the same
marks, i.e. 80) = 2 and m1(number of student having the same
marks, i.e. 40) = 2
Therefore, m1 = 2 and m2 = 2,

Hence, M = 1/12(23-2)+1/12(23-2) =1

R = 1 - 6(15+1) / (10(10²-1)) = 1 - 96/990 = 0.9
SIMPLE LINEAR REGRESSION

Simple linear regression aims to find a linear relationship between a response variable
and a possible predictor variable by the method of least squares.

Least Squares
The method of least squares is a criterion for fitting a specified model to observed data.
For example, it is the most commonly used method of defining a straight line through a
set of points on a scatterplot

Regression Equation

A regression equation allows us to express the relationship between two (or more)
variables algebraically. It indicates the nature of the relationship between two (or more)
variables. In particular, it indicates the extent to which you can predict some variables by
knowing others, or the extent to which some are associated with others.

A linear regression equation is usually written


Y = a + bX + e
where
Y is the dependent variable
a is the intercept
b is the slope or regression coefficient
X is the independent variable (or covariate)
e is the error term

The values a and b are determined by the following equations:


b = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²]

a = (Σy)/n - b(Σx)/n
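These two formulas translate directly into code. The sketch below uses made-up data points (chosen to lie exactly on a straight line) purely to illustrate the calculation.

# Least squares: b = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²],  a = (Σy)/n - b(Σx)/n
x = [1, 2, 3, 4, 5]          # hypothetical independent variable
y = [3, 5, 7, 9, 11]         # hypothetical dependent variable
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
a = sum_y / n - b * sum_x / n
print(a, b)                  # 1.0 2.0, i.e. the fitted line is Y = 1 + 2X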

Regression Line

A regression line is a line drawn through the points on a scatter plot to summarise the
relationship between the variables being studied. When it slopes down (from top left to
bottom right), this indicates a negative or inverse relationship between the variables;
when it slopes up (from bottom left to top right), a positive or direct relationship is
indicated.

The regression line often represents the regression equation on a scatter plot.

TIME SERIES
A time series is a collection of data recorded over a period of time: usually weekly, monthly,
quarterly or yearly. Examples include quarterly earnings reports of a firm, monthly
shipments of cement from a harbour, annual consumer price indices, etc.

Unless the time series data are subject to a constant rate of change in an upward or
downward direction, which in fact is not very likely to obtain, all time series are subject
to variations whose pattern and amplitude vary from one period to another.

Given this, the objective of dealing with a time series is to study and analyse these
variations with a view to knowing how the time series has behaved in the past. This
knowledge about the past behaviour is useful for drawing inferences about the future.

TIME SERIES ANALYSIS: The object of time series analysis is to discover the magnitude
and direction of the trend that may exist in the time series, the nature and amplitude of
cycles, the effect of seasonal change and the size of random movements. The analysis is
done to estimate and separate the four types of variations and to bring out the respective
impact of each on the overall behaviour of the time series.

Analysis of the time series essentially, involves decomposition of the time series into its
components.

COMPONENTS OF TIME SERIES


Trend: Is the gradual upward or downward movement of the data over time. Changes in
income, population, age distribution or cultural views may account for movement in trend
Seasonal: Is a data pattern that repeats itself after a period of days, weeks, months, or
quarters. There are six common seasonality patterns:

Period of pattern Season length Number of seasons in pattern


Week Day 7
Month Week 4–4½
Month Day 28 – 31
Year Quarter 4
Year Month 12
Year Week 52
Restaurants and barber shops, for example, experience weekly seasons, with Saturday
being the peak of business. Beer distribution forecast yearly patterns, with monthly
seasons.
Cyclical: These are patterns in data that occur every several years. They are usually tied to
the business cycle and are of major importance in business analysis and planning. Predicting
business cycles is difficult because they may be affected by political events or by
international turmoil.

Irregular (random) variations: These are abrupt but infrequent variations that go
extremely deep downwards or very high upwards. They are caused by chance and
unusual situations. They follow no discernible pattern and so they cannot be predicted.

MODELS OF TIME SERIES ANALYSIS: There are two models of decomposition of


time series. These are:
i. The additive model
ii. The multiplicative model.

Which of these two models to be used in decomposition depends on the assumption that
we might make about the nature of relationship among the four components.

The additive model: The additive model is used where it is assumed that the four
components are independent of one another. Under this assumption, the four components
are additive in the sense that the magnitude of the time series are the sum of separate
influence of the four components.

Thus, if Yt is taken to represent the magnitude of the time series data at that period t, then
Yt can be expressed as:
Yt = Tt + Ct + St + Rt

Where Tt = Trend variation


Ct = Cyclical variation
St = Seasonal variation
Rt = Random variation

When the time series data are recorded against years, the seasonal component would
vanish, and in that case the additive model will take the form:

Yt = Tt + Ct + Rt

The multiplicative model: It is used when it is assumed that the forces giving rise to the
four types of variations are interdependent, so that the overall pattern of variations in the

time series is the combined result of the interaction of the forces operating on the time
series.

According to this assumption, the original magnitudes of the time series are the product
of its four components. That is:

Yt = Tt × Ct × St × Rt

and, when the data are yearly so that the seasonal component vanishes, Yt = Tt × Ct × Rt

As regards the choice between the two models, it is generally the multiplicative model
which is used more frequently. The reason is that most business and economic time
series data are the result of the interaction of a large number of forces which, individually,
cannot be held responsible for generating any particular type of variation. Since the
forces responsible for one type of variations are also responsible for the other types of
variations, it is the multiplicative model, which is ideally suited for the purpose of
decomposition of a time series.

DECOMPOSITION OF TIME SERIES: Decomposition of a time series requires


estimation of its four components and then separating them from each other, so as to be
able to understand the pattern of variation in each component independently.

For this purpose, Pearson's approach based on the multiplicative model is used. The first step in this approach is to estimate the trend variations by fitting an appropriate trend to the time series. After estimating the trend variations, these are separated from the time series data by dividing the original magnitudes by the corresponding trend values. Separating the trend variations from a time series is known as detrending. This is given as:

Yt / Tt = Ct × St × Rt

Estimation and separation of seasonal variations, which involve computation of a seasonal index, follow the estimation of trend values. If the original time series data are recorded against months, the multiplicative model would now be of the form:
Yt / (Tt × St) = Ct × Rt

Alternatively, the seasonal variation can be separated by dividing the monthly data by the seasonal index. This is known as deseasonalising a time series, and the resultant series is called the deseasonalised time series.

The trend variation and seasonal variation having been removed, only the cyclical and
irregular variations would be left which can easily be examined with reference to the
pattern of their occurrences and amplitude.

Generally, the random variations are neither very important nor easily eliminated. However, to the extent that their elimination is possible, they tend to become marginal in the process of deseasonalisation.

ESTIMATION OF TREND: The essence of trend estimation lies in fitting a trend line on
the time series data in such a way that it passes through ( as nearly as possible ) the
middle of the high and low turning points of the time series graph.

The trend can take many possible shapes. A straight line trend is frequently encountered
because most business and economic time series either consistently tend to increase or
decline over a long period of time.

METHODS OF ESTIMATING STRAIGHT LINE TREND: A trend may be


estimated/determined by the following methods:
i. Moving average method
ii. Least square method

Moving average method: This method of obtaining a time series trend involves
calculating a set of averages, each one corresponding to a trend(t) value for a time point
of the series. These are known as moving averages, since each average is calculated by
moving from one overlapping set of values to the next. The number of values in each set
is always the same and is known as the period of the moving average.

To demonstrate the technique, a set of moving averages of period 5 has been calculated
for a set of values:
Original values: 12 10 11 11 9 11 10 10 11 10
Moving totals: 53 52 52 51 51 52
Moving averages: 10.6 10.4 10.4 10.2 10.2 10.4
The first total, 53, is formed by adding the first 5 items, i.e. 12+10+11+11+9 = 53. Similarly, the second total is given by 10+11+11+9+11 = 52. The rest are found in like manner. The averages are then obtained by dividing each total by 5. Notice that the totals and the averages are written down in line with the middle value of the set being worked on. These averages are the trend (T) values required. It should also be noticed that there are no trend values corresponding to the first two and last two original values. This is always the case with moving averages and is a disadvantage of this particular method of obtaining a trend. Another disadvantage is that the averages do not yield an equation which could be used for forecasting future values of the time series variable.
The choice of the length of the period for a moving average is important because this determines the extent to which the variations are smoothed in the process of averaging. Cycles of uniform length and height can be more easily eliminated by
choosing a moving average period equal to or a multiple of the cycle. In other words, if
the period of the moving average is equal to the period of the cycle or is a multiple of it,
the smoothing is perfect and we have a straight line trend. In general, a shorter period is
more useful in averaging out the cycles.
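
The moving-average calculation above can be checked with a few lines of Python (standard library only; the values are those of the worked illustration):

    values = [12, 10, 11, 11, 9, 11, 10, 10, 11, 10]
    period = 5

    # Each total (and its average) corresponds to the middle value of the set averaged
    moving_totals = [sum(values[i:i + period]) for i in range(len(values) - period + 1)]
    moving_averages = [total / period for total in moving_totals]

    print(moving_totals)      # [53, 52, 52, 51, 51, 52]
    print(moving_averages)    # [10.6, 10.4, 10.4, 10.2, 10.2, 10.4]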

Least square method: Estimation of trend values by the method of least square makes
use of the regression equation. That is:
Tt = a+bt
Where Tt = the trend value of the time series variable T in time, t.
a = trend value at the point of origin.
b = amount by which the trend value changes per unit of time.
t =the value of the independent variable, that is time.
The values of the two constants, a and b, in the regression equation are determined by solving the two normal equations:

Σy = na + bΣx

xy =ax + bx2
Or using the identities:
b = nxy - xy
nx2 - ( x ) 2

a = y - bx
n n
Example 1 : Sales of food since 1987 are shown below:
YEAR 1987 1988 1989 1990 1991
SALES($M) 7 10 9 11 13
Determine the least square trend line equation.
Solution:
YEAR     SALES (y)   x (coded)    xy    x²
1987          7           0         0     0
1988         10           1        10     1
1989          9           2        18     4
1990         11           3        33     9
1991         13           4        52    16
TOTAL        50          10       113    30
The values for the years have been coded i.e. 1987=0, 1988=1, e.t.c.

b = nxy - xy = 5(113) –50(10) = 1.3


nx2 - ( x ) 2 5(30) – (10) 2

a = y - bx = 50 -.3(10) = 7.40


n n 5 5
Hence, Y’ =7.4 + 1.3t

The problem can be simplified by making the following change of scale (coding): let x be the variable which measures time and take the origin (the zero) of the new scale at the middle of the series, that is, at the middle of the x's, so that in the new scale Σx = 0. If the series has an odd number of years, we assign x = 0 to the middle year and number the years ..., -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, .... If the series has an even number of years, there is no middle year, and we assign successive years the numbers ..., -7, -5, -3, -1, 1, 3, 5, 7, ..., with -1 and 1 assigned to the two middle years. Substituting Σx = 0 into the identities above we have:

b = Σxy / Σx²

a = y
n
Example 2: Fit a least square trend line to the data below (total annual revenues, in $ millions).

Year      Total annual revenues
1966        789
1967        542
1968        769
1969      1,093
1970      1,175
1971      1,067
1972      1,166
1973      1,426
1974      1,692

Solution: Since we have figures for nine (an odd number of ) years, we label them –4, -3,
-2, -1, 0, 1, 2, 3, 4, and the sums needed for substitution into the formulas for a and b are
obtained in the following table.

Year      x        y        xy       x²
1966     -4      789     -3,156     16
1967     -3      542     -1,626      9
1968     -2      769     -1,538      4
1969     -1    1,093     -1,093      1
1970      0    1,175          0      0
1971      1    1,067      1,067      1
1972      2    1,166      2,332      4
1973      3    1,426      4,278      9
1974      4    1,692      6,768     16
Total     0    9,719      7,032     60

From the table, n = 9, Σx = 0, Σy = 9,719, Σxy = 7,032, Σx² = 60.

b = xy = 9,719 = 1,079.9


x2 9

a = y = 7,032 = 117.2
n 60

Therefore y’ = 1,079.9 + 117.2x

This makes it clear that 1,079.9 is the trend value for 1970 and that the annual trend
increment(the year-to-year growth) in the annual revenues is estimated as $117.2 million
for the given period of time.

Using the equation, we can now determine the trend value for any year by substituting the
corresponding value of x. For instance, for 1966 we substitute x = -4 and get a trend value
of y’= 1,079.9 + 117.2(-4) = 611.1, and for 1974 we substitute x=4 and get a trend value
of y’=1,079.9+117.2(4) =1,548.7 Plotting these two trend values and joining them with a
straight line, we obtain the least-square trend line.(see graph)
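
The centred coding can be checked with a short Python sketch (using the figures in the table above; since Σx = 0, the shortcut formulas apply):

    y = [789, 542, 769, 1093, 1175, 1067, 1166, 1426, 1692]   # annual revenues
    x = list(range(-4, 5))                                     # centred coding, sum(x) == 0

    a = sum(y) / len(y)                                        # 9,719 / 9  = 1,079.9
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)   # 7,032 / 60 = 117.2

    print(a + b * (-4))    # trend value for 1966, about 611.1
    print(a + b * 4)       # trend value for 1974, about 1,548.7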

DETRENDING A TIME SERIES: Start from Yt = Tt × Ct × Rt (with yearly data, seasonal variations do not appear because all seasons are represented in the annual figures). Detrending requires dividing both sides by Tt:

Yt / Tt = Ct × Rt

Consider the data below.


Year      Yt       Trend values Tt     Detrended values Yt/Tt (= Ct × Rt)     Cyclical relatives
(1)       (2)            (3)                        (4)                              (5)
1957 3512 3632.32 0.9669 96.69
1958 3472 3478.28 0.9982 99.82
1959 3464 3324.24 1.0420 104.20
1960 3174 3170.20 1.0012 100.12
1961 2969 3016.16 0.9844 98.44
1962 2960 2862.12 1.0342 103.42
1963 2715 2708.08 1.0026 100.26
1964 2460 2554.04 0.9632 96.32
1965 2300 2400.00 0.9558 95.58
1966 2334 2245.96 1.0347 103.47
1967 2250 2091.92 1.0756 107.56
1968 1960 1937.88 1.0114 101.14
1969 1635 1783.84 0.9166 91.66

The values in columns 1 and 2 have been used to calculate the trend equation. Column 1 values have been coded. The trend equation is y' = 2708.08 - 154.04x (please verify). The trend values are, for 1957,
y' = 2708.08 - 154.04(-6) = 3632.32
The rest are computed in like manner and put in column 3. (Please check the values.)

Dividing column 2 by column 3 (i.e. Yt / Tt = Ct × Rt) gives the detrended values.

For 1957, we have 3512 / 3632.32 = 0.9669 (check the rest in column 4).

When these values are multiplied by 100, we have what are known as cyclical relatives. These are put in column 5. It is assumed that random variations do not have much influence on annual data. Therefore what remains after detrending the annual time series data is the cyclical variation only.
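
The detrending step can be sketched in Python as follows (the values and the trend equation are those of the table above):

    y = [3512, 3472, 3464, 3174, 2969, 2960, 2715,
         2460, 2300, 2334, 2250, 1960, 1635]            # annual values, 1957-1969
    x = list(range(-6, 7))                              # coded years: 1957 = -6, ..., 1969 = 6

    trend = [2708.08 - 154.04 * xi for xi in x]         # fitted trend values Tt
    cyclical_relatives = [round(100 * yt / tt, 2) for yt, tt in zip(y, trend)]

    print(round(trend[0], 2))      # 3632.32 for 1957
    print(cyclical_relatives[0])   # 96.69 for 1957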

ESTIMATION OF SEASONAL INDEX (VARIATION): The estimation of seasonal variations requires determining the seasonal component St. It is necessary to estimate the seasonal component because it indicates how a given time series varies from month to month in a given year.

A set of data showing the relative values of a variable during all the months of the year is known as a "seasonal index". For example, if the production of a particular item during the months of January, February, March, ..., is 60, 90, 120, ..., percent of the average monthly production for the whole year, then 60, 90, 120, ..., constitute the seasonal index for that year. These are sometimes referred to as "seasonal index numbers". The average seasonal index for the whole year should be 100%, or the sum of the index numbers should be 1200%.

A typical index of 96.0 for January indicates that sales(or whatever the variable is ) are
4% below the average for the year. An index of 107.2 for October means that the variable
is typically 7.2% above the annual average.

The method most commonly used to compute the typical seasonal pattern is called the ratio-to-moving-average method. It eliminates the trend, cyclical and irregular components from the original data.

Example 3: Toys International takes an inventory of its dolls, mechanical toys, and other products on hand every quarter. The value of the inventory, in millions of dollars, at the beginning of each quarter since 1987 is indicated below:

YEAR QUARTER
WINTER SPRING SUMMER FALL
1987 6.7 4.9 10.0 12.7
1988 6.5 4.8 9.8 13.6
1989 6.7 4.3 10.4 13.1
1990 7.0 5.5 10.8 15.0
1991 7.1 4.4 11.1 14.5
1992 8.0 4.2 11.4 14.9

What are the typical quarterly indexes using the ratio-to-moving-average method?
Solution:
Step 1. Determine a four quarterly moving total.
Starting with the Winter quarter of 1987, we add 6.7, 4.9, 10.0 and 12.7. The total is 34.3. The four-quarter total in column 2 is "moved along" by adding the Spring, Summer and Fall inventories of 1987 and the Winter inventory of 1988. The total is $34.1, found by 4.9 + 10.0 + 12.7 + 6.5.

Note: Instead of adding the four inventory values, we can subtract the Winter 1987
inventory(6.7) from the initial total of $34.3 and add the Winter 1988 inventory(6.5)

That is: 34.3 - 6.7 + 6.5 = 34.1


This procedure is continued until all the quarterly inventories are accounted for. These four-quarter moving totals are given in column 2. Note that the first moving total is positioned between the Spring and Summer of 1987, and the rest also follow in like manner.

Step 2. Determine the four-quarterly moving average. Each quarterly moving total in
column 2 is divided by 4 to give the four-quarterly moving average. All the moving
averages are positioned between quarters

Step 3. Centre the four-quarterly moving average. The first centered moving average is
determined as follows:

(8.575 + 8.525) / 2 = 8.550

The rest are computed in like manner.

Quarter            Inventory   4-quarter moving total   4-quarter moving average   2-period total   Centred moving average   Specific seasonal index
(in the original layout the moving totals and averages are positioned between quarters; figures are rounded)
First Quarter 6.7
Second Quarter 4.9 34.3 8.6
Third Quarter 10.0 34.1 8.5 17.1 8.6 117.0
Fourth Quarter 12.7 34 8.5 17.0 8.5 149.2
First Quarter 6.5 33.8 8.5 17.0 8.5 76.7
Second Quarter 4.8 34.7 8.7 17.1 8.6 56.1
Third Quarter 9.8 34.9 8.7 17.4 8.7 112.6
Fourth Quarter 13.6 34.4 8.6 17.3 8.7 157.0
First Quarter 6.7 35 8.8 17.4 8.7 77.2
SecondQuarter 4.3 34.5 8.6 17.4 8.7 49.5
Third Quarter 10.4 34.8 8.7 17.3 8.7 120.1
Fourth Quarter 13.1 36 9.0 17.7 8.9 148.0
First Quarter 7.0 36.4 9.1 18.1 9.1 77.3
Second Quarter 5.5 38.3 9.6 18.7 9.3 58.9
Third Quarter 10.8 38.4 9.6 19.2 9.6 112.6
Fourth Quarter 15.0 37.3 9.3 18.9 9.5 158.5
First Quarter 7.1 37.6 9.4 18.7 9.4 75.8
Second Quarter 4.4 37.1 9.3 18.7 9.3 47.1
Third Quarter 11.1 38 9.5 18.8 9.4 118.2
Fourth Quarter 14.5 37.8 9.5 19.0 9.5 153.0
First Quarter 8.0 38.1 9.5 19.0 9.5 84.3
Second Quarter 4.2 38.5 9.6 19.2 9.6 43.9
Third Quarter 11.4
Fourth Quarter 14.9

Step 4. Determine the specific seasonal indexes. A specific seasonal index for each
quarter is then computed by dividing the value of the inventory in column 1 by the

centered moving average in column 4. Each quotient is multiplied by 100 to convert it to
an index.
The first specific seasonal index is:
(10.0 / 8.550) × 100 = 117.0
The others are computed in like manner
Step 5. Organize the specific seasonal indexes. The specific seasonal indexes are
organized in a table
                          QUARTER
YEAR        Winter     Spring     Summer     Fall
1987           -          -        117.0     149.2
1988         76.7       56.1       112.3     156.1
1989         79.1       49.2       119.7     148.6
1990         77.3       58.9       112.6     158.5
1991         75.8       47.1       118.2     153.0
1992         84.3       43.9         -         -
TOTAL       393.2      255.2       579.8     764.8
MEAN         78.64      51.04      115.96    152.96
TYPICAL      78.92      51.22      116.27    153.50
INDEX
(Adjusted index)

Step 6. Adjust the seasonal indexes. The four quarterly means (78.64, 51.04, 115.96 and 152.96) should theoretically total 400 because the average is set at 100. The total may not equal 400 due to rounding. In this problem, the total is 398.6. A correction factor is therefore applied to each of the four means to force them to total 400.

The formula to use is:

Correction factor = 400 / (total of the four means) = 400 / 398.6 = 1.00351

To adjust, we have: (1.00351)(152.96) = 153.50 for the Fall. The rest are computed in like manner (see the table below).

(ii) An alternative adjustment, working from the quarterly totals of the specific seasonal indexes, gives the slightly different typical indexes that are used in the deseasonalising table which follows:

Quarter            Total      Average     Typical (adjusted) index
First Quarter      391.4       97.86      78.55  = (400/498.3) × 97.86
Second Quarter     255.4       63.86      51.26  = (400/498.3) × 63.86
Third Quarter      580.5      145.14     116.51  = (400/498.3) × 145.14
Fourth Quarter     765.8      191.44     153.68  = (400/498.3) × 191.44
                              498.30
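
The whole ratio-to-moving-average procedure for Example 3 can be sketched in a few lines of Python (assuming pandas is available; small differences from the table come from rounding):

    import pandas as pd

    inventory = pd.Series([6.7, 4.9, 10.0, 12.7,  6.5, 4.8,  9.8, 13.6,
                           6.7, 4.3, 10.4, 13.1,  7.0, 5.5, 10.8, 15.0,
                           7.1, 4.4, 11.1, 14.5,  8.0, 4.2, 11.4, 14.9])

    # Steps 1-3: four-quarter moving average, then centre it on each quarter
    centred = inventory.rolling(4).mean().rolling(2).mean().shift(-2)

    # Step 4: specific seasonal indexes
    specific = 100 * inventory / centred

    # Steps 5-6: average the specific indexes by quarter, then rescale to total 400
    quarter = pd.Series(range(len(inventory))) % 4      # 0 = Winter, ..., 3 = Fall
    means = specific.groupby(quarter).mean()
    typical = means * 400 / means.sum()
    print(typical.round(2))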

DESEASONALISING DATA: A set of typical indexes is useful in adjusting a sales


series, for example, for seasonal fluctuations. The resulting series is called deseasonalised
sales or seasonally adjusted sales. The reason for deseasonalising the sales series is to
remove the seasonal fluctuations so that the trend and cycle can be studied.

To remove the seasonal variation, the inventory for each quarter (which contains trend,
cyclical, irregular and seasonal variations ) is divided by the seasonal index for that
quarter. That is:

(Tt × Ct × St × Rt) / St = Tt × Ct × Rt

For example, the actual inventory for the first quarter of 1987 was $6.7 million. The seasonal index for the first quarter is 78.92. The index of 78.92 indicates that inventory in the first quarter is typically 21.08% below the average for a typical quarter. By dividing the actual inventory of $6.7 million by 78.92 and multiplying by 100, the deseasonalised value of inventory of $8,489,610 is obtained for the first quarter of 1987. (This is put in column 5 of the table below, which uses the typical indexes from table (ii); these give essentially the same result.)

The rest are computed in like manner


Year    Quarter            Inventory    Seasonal index    Deseasonalised inventory
1987 First Quarter 6.7 78.55 8.5

Second Quarter 4.9 51.26 9.6
Third Quarter 10.0 116.51 8.6
Fourth Quarter 12.7 153.68 8.3
1988 First Quarter 6.5 78.55 8.3
Second Quarter 4.8 51.26 9.4
Third Quarter 9.8 116.51 8.4
Fourth Quarter 13.6 153.68 8.8
1989 First Quarter 6.7 78.55 8.5
Second Quarter 4.3 51.26 8.4
Third Quarter 10.4 116.51 8.9
Fourth Quarter 13.1 153.68 8.5
1990 First Quarter 7.0 78.55 8.9
Second Quarter 5.5 51.26 10.7
Third Quarter 10.8 116.51 9.3
Fourth Quarter 15.0 153.68 9.8
1991 First Quarter 7.1 78.55 9.0
Second Quarter 4.4 51.26 8.6
Third Quarter 11.1 116.51 9.5
Fourth Quarter 14.5 153.68 9.4
1992 First Quarter 8.0 78.55 10.2
Second Quarter 4.2 51.26 8.2
Third Quarter 11.4 116.51 9.8
Fourth Quarter 14.9 153.68 9.7

Since the seasonal component has been removed (divided out) from the quarterly inventory, the deseasonalised inventory contains only the trend (T), cyclical (C), and irregular/random (R) components.
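
Deseasonalising is then a single division by the typical index; a short Python sketch (using the quarterly inventories of Example 3 and the typical indexes of table (ii)):

    inventory = [6.7, 4.9, 10.0, 12.7, 6.5, 4.8, 9.8, 13.6,
                 6.7, 4.3, 10.4, 13.1, 7.0, 5.5, 10.8, 15.0,
                 7.1, 4.4, 11.1, 14.5, 8.0, 4.2, 11.4, 14.9]
    typical = [78.55, 51.26, 116.51, 153.68]        # Winter, Spring, Summer, Fall

    deseasonalised = [100 * value / typical[i % 4] for i, value in enumerate(inventory)]
    print(round(deseasonalised[0], 1))              # 8.5 for Winter 1987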

USING DESEASONALIZED DATA TO FORECAST: The procedure for identifying


trend and the seasonal adjustment can be combined to yield a seasonally adjusted forecast.
To identify the trend, we determine the least square trend equation. Then we project this
trend into future periods, and finally, we adjust these trend values to account for the
seasonal factors.

Example: Toys International would like to forecast their inventory for each quarter of 1993. Use the deseasonalised inventory data above to determine the forecast.

Solution: The deseasonalized trend equation is determined as follows: The winter quarter
of 1987 is the period t = 1, and t = 24 corresponds to the fourth quarter of 1992.

Using this coding, the constants a and b in the equation y' = a + bt are determined. The trend equation is y' = 8.5169 + 0.0425t.

The slope of the trend line is 0.0425. This indicates that, over the 24 quarters, the deseasonalised inventory grew at a rate of 0.0425 ($ million), or $42,500, per quarter. We can
estimate the inventory for 1993 as follows:

Y’ = 8.5169 + 0.0425 ( 25 ) i.e. t=25


= 9.5794.

Using the trend equation, we can forecast inventories for the other quarters of 1993 using
t = 26, 27, 28.
Y’ = 8.5169 + 0.0425 ( 26) =85180.1
Y’ = 8.5169 + 0.0425 ( 27) =85180.5
Y’ = 8.5169 + 0.0425 ( 28) =85180.9

QUESTIONS
Q.1. Wyoming park and Yellowstone park contain shops, restaurants and motels. They
have two peak seasons: winter for skiing and summer for tourists visiting the parks. The
specific seasonals with respect to the total sales volume for recent years are:
QUARTER
Winter Summer Spring Fall

YEAR
1986 117.0 80.7 129.6 76.1
1987 118.6 82.5 121.4 77.0
1988 114.0 84.3 119.9 75.0
1989 120.7 79.6 130.7 69.6
1990 125.2 80.2 127.6 72.0

i. Calculate the quarterly indexes using the ratio to moving average method.
ii. Estimate the typical quarterly indexes.
iii. Use the multiplicative model to deseasonalize the typical quarterly indexes.
iv. Estimate the trend equation.
v. Use the trend equation to forecast the inventory for 1991

Q.2
QUARTER
1 2 3 4

YEAR
1 2.2 5.0 7.9 3.2
2 2.9 5.2 8.2 3.8
3 3.2 5.8 9.1 4.1

i. Calculate the trend values using four quarterly moving averages.


ii. Plot the original data with the trend superimposed.

Q3

WEEK
1 2 3 4

DAY
Monday 22 22 24 26
Tuesday 36 34 38 38
Wednesday 40 42 43 45
Thursday 48 49 49 50
Friday 61 58 62 64
Saturday 58 59 58 58
c. Calculate the daily indexes using the ratio-to-moving-average method.
d. Estimate the typical daily indexes.
e. Use the multiplicative model to deseasonalise the data.
f. Estimate the trend equation.
g. Use the trend equation to forecast the values for week 5.

CONFIDENCE INTERVAL

A confidence interval gives an estimated range of values which is likely to include an


unknown population parameter, the estimated range being calculated from a given set of
sample data.

If independent samples are taken repeatedly from the same population, and a confidence
interval calculated for each sample, then a certain percentage (confidence level) of the
intervals will include the unknown population parameter. Confidence intervals are

usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9%
(or whatever) confidence intervals for the unknown parameter.

The width of the confidence interval gives us some idea about how uncertain we are
about the unknown parameter (see precision). A very wide interval may indicate that
more data should be collected before anything very definite can be said about the
parameter.

Confidence intervals are more informative than the simple results of hypothesis tests
(where we decide "reject H0" or "don't reject H0") since they provide a range of plausible
values for the unknown parameter.

Confidence Limits

Confidence limits are the lower and upper boundaries / values of a confidence interval,
that is, the values which define the range of a confidence interval.

The upper and lower bounds of a 95% confidence interval are the 95% confidence limits.
These limits may be taken for other confidence levels, for example, 90%, 99%, 99.9%.

Confidence Level

The confidence level is the probability value associated with a confidence


interval.

It is often expressed as a percentage. For example, if the significance level is α = 0.05, then the confidence level is equal to (1 - 0.05) = 0.95, i.e. a 95% confidence level.

Example
Suppose an opinion poll predicted that, if the election were held today, the Conservative

party would win 60% of the vote. The pollster might attach a 95% confidence level to the
interval 60% plus or minus 3%. That is, he thinks it very likely that the Conservative
party would get between 57% and 63% of the total vote.

Confidence Interval for a Mean

A confidence interval for a mean specifies a range of values within which the unknown
population parameter, in this case the mean, may lie. These intervals may be calculated
by, for example, a producer who wishes to estimate his mean daily output; a medical
researcher who wishes to estimate the mean response by patients to a new drug; etc.

The (two-sided) confidence interval for a mean contains all the values of µ0 (the hypothesised population mean) which would not be rejected in the two-sided hypothesis test of:
H0: µ = µ0
against
H1: µ not equal to µ0

The width of the confidence interval gives us some idea about how uncertain we are
about the unknown population parameter, in this case the mean. A very wide interval may
indicate that more data should be collected before anything very definite can be said
about the parameter.

We calculate these intervals for different confidence levels, depending on how precise we
want to be. We interpret an interval calculated at a 95% level as, we are 95% confident
that the interval contains the true population mean. We could also say that 95% of all
confidence intervals formed in this manner (from different samples of the population)
will include the true population mean.
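
A minimal sketch in Python of a 95% confidence interval for a mean, assuming scipy is available and the data are a small random sample from a normal population (the figures are hypothetical):

    import numpy as np
    from scipy import stats

    sample = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])

    mean = sample.mean()
    sem = stats.sem(sample)                                   # standard error of the mean
    lower, upper = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
    print(mean, (lower, upper))                               # estimate and 95% confidence limits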

Confidence Interval for the Difference Between Two Means

A confidence interval for the difference between two means specifies a range of values
within which the difference between the means of the two populations may lie. These

intervals may be calculated by, for example, a producer who wishes to estimate the
difference in mean daily output from two machines; a medical researcher who wishes to
estimate the difference in mean response by patients who are receiving two different
drugs; etc.

The confidence interval for the difference between two means contains all the values of
µ1 - µ2 (the difference between the two population means) which would not be rejected in
the two-sided hypothesis test of:
H0: µ1 = µ2
against
H1: µ1 not equal to µ2
i.e.
H0: µ1 - µ2 = 0
against
H1: µ1 - µ2 not equal to 0

If the confidence interval includes 0 we can say that there is no significant difference
between the means of the two populations, at a given level of confidence.

The width of the confidence interval gives us some idea about how uncertain we are
about the difference in the means. A very wide interval may indicate that more data
should be collected before anything definite can be said.

We calculate these intervals for different confidence levels, depending on how precise we
want to be. We interpret an interval calculated at a 95% level as, we are 95% confident
that the interval contains the true difference between the two population means. We could
also say that 95% of all confidence intervals formed in this manner (from different
samples of the population) will include the true difference.
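
A sketch of such an interval in Python (hypothetical samples, pooled variance under an equal-variance assumption):

    import numpy as np
    from scipy import stats

    x1 = np.array([23.1, 24.0, 22.8, 23.5, 24.2, 23.9])   # hypothetical sample, machine 1
    x2 = np.array([22.0, 22.9, 21.8, 22.4, 23.0, 22.5])   # hypothetical sample, machine 2

    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))                  # standard error of the difference
    diff = x1.mean() - x2.mean()
    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)            # two-sided 95% critical value

    print(diff - t_crit * se, diff + t_crit * se)          # 95% limits for mu1 - mu2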

HYPOTHESIS TEST

Setting up and testing hypotheses is an essential part of statistical inference. In order to


formulate such a test, usually some theory has been put forward, either because it is
believed to be true or because it is to be used as a basis for argument, but has not been

proved, for example, claiming that a new drug is better than the current drug for
treatment of the same symptoms.

In each problem considered, the question of interest is simplified into two competing
claims / hypotheses between which we have a choice; the null hypothesis, denoted H0,
against the alternative hypothesis, denoted H1. These two competing claims / hypotheses
are not however treated on an equal basis. Special consideration is given to the null
hypothesis.

We have two common situations:

1. The experiment has been carried out in an attempt to disprove or reject a


particular hypothesis, the null hypothesis, thus we give that one priority so it
cannot be rejected unless the evidence against it is sufficiently strong. For
example,
H0: there is no difference in taste between coke and diet coke
against
H1: there is a difference.
2. If one of the two hypotheses is 'simpler' we give it priority so that a more
'complicated' theory is not adopted unless there is sufficient evidence against the
simpler one. For example, it is 'simpler' to claim that there is no difference in
flavour between coke and diet coke than it is to say that there is a difference.

The hypotheses are often statements about population parameters like expected value and
variance; for example H0 might be that the expected value of the height of ten year old
boys in the Ghanaian population is not different from that of ten year old girls. A
hypothesis might also be a statement about the distributional form of a characteristic of
interest, for example that the height of ten year old boys is normally distributed within the
Ghanaian population.

The outcome of a hypothesis test is "Reject H0 in favour of H1" or "Do not reject H0".

Null Hypothesis

The null hypothesis, H0, represents a theory that has been put forward, either because it is
believed to be true or because it is to be used as a basis for argument, but has not been
proved. For example, in a clinical trial of a new drug, the null hypothesis might be that
the new drug is no better, on average, than the current drug. We would write
H0: there is no difference between the two drugs on average.

We give special consideration to the null hypothesis. This is due to the fact that the null
hypothesis relates to the statement being tested, whereas the alternative hypothesis relates
to the statement to be accepted if / when the null is rejected.

The final conclusion once the test has been carried out is always given in terms of the
null hypothesis. We either "Reject H0 in favour of H1" or "Do not reject H0"; we never
conclude "Reject H1", or even "Accept H1".

If we conclude "Do not reject H0", this does not necessarily mean that the null hypothesis
is true, it only suggests that there is not sufficient evidence against H0 in favour of H1.
Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.

Alternative Hypothesis
The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up
to establish. For example, in a clinical trial of a new drug, the alternative hypothesis
might be that the new drug has a different effect, on average, compared to that of the
current drug. We would write
H1: the two drugs have different effects, on average.
The alternative hypothesis might also be that the new drug is better, on average, than the
current drug. In this case we would write
H1: the new drug is better than the current drug, on average.

The final conclusion once the test has been carried out is always given in terms of the
null hypothesis. We either "Reject H0 in favour of H1" or "Do not reject H0". We never
conclude "Reject H1", or even "Accept H1".

If we conclude "Do not reject H0", this does not necessarily mean that the null hypothesis
is true, it only suggests that there is not sufficient evidence against H0 in favour of H1.
Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.

Type I Error

In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in
fact true; that is, H0 is wrongly rejected.

For example, in a clinical trial of a new drug, the null hypothesis might be that the new
drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type I error would occur if we concluded that the two drugs produced different effects
when in fact there was no difference between them.
The following table gives a summary of the possible results of any hypothesis test:

                           Decision
                 Reject H0          Don't reject H0
Truth    H0      Type I error       Right decision
         H1      Right decision     Type II error
A type I error is often considered to be more serious, and therefore more important to
avoid, than a type II error. The hypothesis test procedure is therefore adjusted so that
there is a guaranteed 'low' probability of rejecting the null hypothesis wrongly; this
probability is never 0. This probability of a type I error can be precisely computed as
P(type I error) = significance level = α

The exact probability of a type II error is generally unknown.

If we do not reject the null hypothesis, it may still be false (a type II error) as the sample
may not be big enough to identify the falseness of the null hypothesis (especially if the
truth is very close to hypothesis).

For any given set of data, type I and type II errors are inversely related; the smaller the
risk of one, the higher the risk of the other.

A type I error can also be referred to as an error of the first kind.

Type II Error
In a hypothesis test, a type II error occurs when the null hypothesis H0, is not rejected
when it is in fact false. For example, in a clinical trial of a new drug, the null hypothesis
might be that the new drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type II error would occur if it was concluded that the two drugs produced the same
effect, i.e. there is no difference between the two drugs on average, when in fact they
produced different ones.

A type II error is frequently due to sample sizes being too small.

The probability of a type II error is generally unknown, but is symbolised by β and written
P(type II error) = β

A type II error can also be referred to as an error of the second kind.

Test Statistic

A test statistic is a quantity calculated from our sample of data. Its value is used to decide
whether or not the null hypothesis should be rejected in our hypothesis test.

The choice of a test statistic will depend on the assumed probability model and the
hypotheses under question.

Critical Value(s)

The critical value(s) for a hypothesis test is a threshold to which the value of the test
statistic in a sample is compared to determine whether or not the null hypothesis is
rejected.

The critical value for any hypothesis test depends on the significance level at which the
test is carried out, and whether the test is one-sided or two-sided.

Critical Region

The critical region CR, or rejection region RR, is a set of values of the test statistic for
which the null hypothesis is rejected in a hypothesis test. That is, the sample space for the
test statistic is partitioned into two regions; one region (the critical region) will lead us to
reject the null hypothesis H0, the other will not. So, if the observed value of the test
statistic is a member of the critical region, we conclude "Reject H0"; if it is not a member
of the critical region then we conclude "Do not reject H0".

Significance Level

The significance level of a statistical hypothesis test is a fixed probability of wrongly


rejecting the null hypothesis H0, if it is in fact true.

It is the probability of a type I error and is set by the investigator in relation to the
consequences of such an error. That is, we want to make the significance level as small as
possible in order to protect the null hypothesis and to prevent, as far as possible, the
investigator from inadvertently making false claims.

The significance level is usually denoted by α:

Significance level = P(type I error) = α

Usually, the significance level is chosen to be 0.05 (or equivalently, 5%).

One-sided Test

A one-sided test is a statistical hypothesis test in which the values for which we can reject
the null hypothesis, H0 are located entirely in one tail of the probability distribution.

In other words, the critical region for a one-sided test is the set of values less than the
critical value of the test, or the set of values greater than the critical value of the test.

A one-sided test is also referred to as a one-tailed test of significance.

The choice between a one-sided and a two-sided test is determined by the purpose of the
investigation or prior reasons for using a one-sided test.

Example

Suppose we wanted to test a manufacturer's claim that there are, on average, 50 matches
in a box. We could set up the following hypotheses
H0: µ = 50,
against
H1: µ < 50 or H1: µ > 50
Either of these two alternative hypotheses would lead to a one-sided test. Presumably, we
would want to test the null hypothesis against the first alternative hypothesis since it
would be useful to know if there is likely to be less than 50 matches, on average, in a box
(no one would complain if they get the correct number of matches in a box or more).
Yet another alternative hypothesis could be tested against the same null, leading this time
to a two-sided test:
H0: µ = 50,
against
H1: µ not equal to 50
Here, nothing specific can be said about the average number of matches in a box; only
that, if we could reject the null hypothesis in our test, we would know that the average
number of matches in a box is likely to be less than or greater than 50.

Two-Sided Test

A two-sided test is a statistical hypothesis test in which the values for which we can reject
the null hypothesis, H0 are located in both tails of the probability distribution.

In other words, the critical region for a two-sided test is the set of values less than a first
critical value of the test and the set of values greater than a second critical value of the
test.

A two-sided test is also referred to as a two-tailed test of significance.

The choice between a one-sided test and a two-sided test is determined by the purpose of
the investigation or prior reasons for using a one-sided test.

Example

Consider again the manufacturer's claim that there are, on average, 50 matches in a box. Testing the null hypothesis

H0: µ = 50

against the alternative

H1: µ not equal to 50

leads to a two-sided test. Here, nothing specific can be said about the average number of matches in a box; only that, if we could reject the null hypothesis in our test, we would know that the average number of matches in a box is likely to be less than or greater than 50.

One Sample t-test

A one sample t-test is a hypothesis test for answering questions about the mean where the
data are a random sample of independent observations from an underlying normal

distribution N(µ, σ²), where σ² is unknown.

The null hypothesis for the one sample t-test is:


H0: µ = µ0, where µ0 is known.

That is, the sample has been drawn from a population of given mean and unknown
variance (which therefore has to be estimated from the sample).

This null hypothesis, H0 is tested against one of the following alternative hypotheses,
depending on the question posed:
H1: µ is not equal to µ0
H1: µ > µ0
H1: µ < µ0
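
In practice this test is often carried out with scipy; a sketch with hypothetical data, taking µ0 = 50 as in the matchbox example:

    import numpy as np
    from scipy import stats

    matches = np.array([48, 50, 49, 51, 47, 50, 49, 48, 50, 49])   # hypothetical counts per box

    t_stat, p_value = stats.ttest_1samp(matches, popmean=50)       # two-sided test of H0: mu = 50
    print(t_stat, p_value)     # reject H0 at the 5% level if p_value < 0.05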

Two Sample t-test


A two sample t-test is a hypothesis test for answering questions about the mean where the
data are collected from two random samples of independent observations, each from an
underlying normal distribution:

When carrying out a two sample t-test, it is usual to assume that the variances for the two
populations are equal, i.e.

The null hypothesis for the two sample t-test is:


H0: µ1 = µ2

That is, the two samples have both been drawn from the same population. This null
hypothesis is tested against one of the following alternative hypotheses, depending on the
question posed.

H1: µ1 is not equal to µ2
H1: µ1 > µ2
H1: µ1 < µ2
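
A corresponding sketch with scipy (hypothetical samples; equal_var=True reflects the equal-variance assumption above):

    import numpy as np
    from scipy import stats

    sample1 = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])   # hypothetical sample 1
    sample2 = np.array([4.7, 4.6, 4.9, 4.8, 4.5, 4.7])   # hypothetical sample 2

    t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)   # H0: mu1 = mu2
    print(t_stat, p_value)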

Paired Sample t-test

A paired sample t-test is used to determine whether there is a significant difference


between the average values of the same measurement made under two different
conditions. Both measurements are made on each unit in a sample, and the test is based
on the paired differences between these two values. The usual null hypothesis is that the
difference in the mean values is zero. For example, the yield of two strains of barley is
measured in successive years in twenty different plots of agricultural land (the units) to
investigate whether one crop gives a significantly greater yield than the other, on average.

The null hypothesis for the paired sample t-test is


H0: d = µ1 - µ2 = 0
where d is the mean value of the difference.
This null hypothesis is tested against one of the following alternative hypotheses,
depending on the question posed:
H1: d not equal to 0
H1: d > 0
H1: d < 0

The paired sample t-test is a more powerful alternative to a two sample procedure, such
as the two sample t-test, but can only be used when we have matched samples.
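
A sketch with scipy for the barley illustration (the yields below are hypothetical and stand in for a few of the twenty plots):

    import numpy as np
    from scipy import stats

    strain_a = np.array([4.2, 3.9, 4.5, 4.1, 4.4, 4.0])   # hypothetical yields, strain A
    strain_b = np.array([4.0, 3.8, 4.1, 4.0, 4.2, 3.9])   # same plots, strain B

    t_stat, p_value = stats.ttest_rel(strain_a, strain_b)  # H0: mean paired difference = 0
    print(t_stat, p_value)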

