LECTURE NOTES
BY G.K. ABLEDU
DATA COLLECTION
We may think of data as the information needed to help us make a more informed
decision in a particular situation. A set of measurements obtained on some variable is
called a data set. For example, heart rate measurements for 10 patients may constitute a
data set.
SOURCES OF DATA: There are four key data collection sources. Data collectors are
labeled primary sources, while data compilers are called secondary sources.
i. The first method (source) of obtaining data is via governmental, industrial, or
individual (published) sources. Of these three, the government is the major
collector and compiler of data for both public and private purposes. Many
governmental agencies, for example the statistical services, facilitate this
work.
ii. The second data collection source is through experimentation. In an
experiment, strict control is exercised over the treatments given to the
participants. For example, in a study testing the effectiveness of toothpaste,
the researcher would determine which participants in the study would use the
new brand and which would not, instead of leaving the choice to the subject.
iii. The third data collection source is by conducting a survey. Here no control is
exercised over the behaviour of the people being surveyed. They are merely
asked questions about their attitudes, beliefs, behaviour, and other
characteristics. Responses are then edited, coded, and tabulated for analysis.
iv. The fourth method for obtaining data is through observational study. Here,
the researcher observes the behaviour of the subjects being studied, usually in
their natural settings.
ILLUSTRATION 1: DATA SOURCES
Sources of primary data include observation, group discussion, and the use of
questionnaires. The distinguishing feature of primary data is that it is collected for a specific
project. As a result, primary data can take a long time to collect and can be expensive.
Secondary data, in contrast, has been collected for some other purpose. It is usually
available at low cost but may be inadequate for the purposes of the enquiry.
TYPES OF DATA:
Statisticians develop a survey to deal with a variety of phenomena or characteristics.
These phenomena or characteristics are called random variables. The data, which are the
observed outcomes of these random variables, will undoubtedly differ from response to
response. There are two basic types of data:
i. Qualitative variable: When the characteristic or variable being studied is
non-numeric, it is called a qualitative variable, an attribute, or categorical data.
Examples of qualitative variables are gender, religious affiliation, type of
automobile owned, colour, nationality, etc. When the data are qualitative, we
are usually interested in how many or what proportion fall in each category.
For example, "What percentage of the population is male/female?", "How many
Protestants are in Ghana?", etc. Categorical variables yield responses such as
yes or no answers. Qualitative data are often summarised in charts and bar
graphs.
Categories should be chosen carefully since a bad choice can prejudice the outcome of an
investigation. Every value should belong to one and only one category, and there should
be no doubt as to which one.
ii. Quantitative variable: When the variables being studied can be reported
numerically(measurable or countable), the variable is called quantitative
variable. Examples of quantitative variables are “the balance in your savings
account”, “the number of children in a family”, “the height of students in a
class”, etc. A quantitative variable can be described by a number for which
arithmetic operations such as average make sense. Quantitative variables are
either Discrete or Continuous.
Discrete variables can assume only certain values, and there are usually gaps between
the values. For example, between the whole numbers 1 and 2 lie values such as
1.2 and 1.3 that a discrete variable cannot take. A set of data is said to be discrete if the values / observations belonging to it
are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the
number of kittens in a litter, the number of patients in a doctor's surgery, and the number of
flaws in one metre of cloth.
Typically, discrete variables produce numerical responses that arise from a counting
process. For example the “number of magazines subscribed to”, “number of bedrooms in
a house”, etc,…, can be determined by counting.
Continuous random variables can assume any value within a specific range. They
produce numerical responses that arise from a measurement process. A set of data is said
to be continuous if the values / observations belonging to it may take on any value within
a finite or infinite interval. You can count, order and measure continuous data. Your
height is an example of continuous numerical variable because the response takes on any
value within a continuum or interval depending on the measuring instrument. For
example, your height may be 1.67 metres, 1.68 metres, 1.69 metres, 1.70 metres, etc.,
depending on the precision of the available instruments.
Other examples include height, weight, temperature, the amount of sugar in an orange,
the time required to run a mile.
ILLUSTRATION: Data is divided into Discrete and Continuous.
LEVELS OF MEASUREMENT
Different statistics require the use of different mathematical operations, and therefore
level-of-measurement considerations are the logical first guidelines to use in selecting a
statistic. For example, computation of the mean (or arithmetic average) requires that all
observations be added together and then divided by the number of observations. Thus,
computation of the mean is truly justified only when a variable is measured at the
interval-ratio scale. Ideally, the researcher would utilize only those statistics that are fully
justified by the level of measurement criterion. The most powerful and useful statistics
(such as the mean) require interval-ratio variables, while most of the variables of interest
to the social sciences are only nominal (sex, race, marital status etc.) or at best ordinal
(attitude scale).
For example, many statistical techniques require that the scores be added. These
techniques could be legitimately used only when the variable is measured in a way that
permits addition as a mathematical operation. Thus, the researcher’s choice of statistical
techniques is heavily dependent on the level at which the variables have been measured.
The four levels of measurement are, in order of increasing sophistication, nominal,
ordinal, interval and ratio.
1. Nominal level: The most basic and the only universal measurement procedure is to
classify cases into the pre-established categories of a variable. In nominal measurements,
classification into category is the only measurement procedure permitted. The categories
themselves are not numerical and can be compared to others only in terms of the number
of cases classified in them. In no sense can the categories be thought of as “higher” or
“lower” than each other along some numerical scale.
You should understand that these numbers are merely labels or names and have no
numerical quantity to them. They cannot be added, subtracted, multiplied or divided. The
only mathematical operation permissible with nominal variable is counting the number of
occurrences that have been classified into the various categories of the variable.
To summarize, the nominal level data has the following properties:
i. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
ii. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.
iii. Data categories have no logical order, meaning, we could report any label
first.
2. Ordinal level: A set of data is said to be ordinal if the values / observations belonging to it can be ranked
(put in order) or have a rating scale attached. You can count and order, but not measure,
ordinal data.
The categories for an ordinal set of data have a natural order, for example, suppose a
group of people were asked to taste varieties of biscuit and classify each biscuit on a
rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A
rating of 5 indicates more enjoyment than a rating of 4, for example, so such data are
ordinal.
Consider the grades received by two students in a mathematics examination. A grade of
1 is higher than a grade of 2, but we cannot say that the student who obtained grade 1 did
twice as well as the one who obtained grade 2. Similarly, one who obtained grade 2 is not
necessarily twice as good as one who obtained grade 4. In other words, equal intervals do
not represent equal quantities.
However, the distinction between neighbouring points on the scale is not necessarily
always the same. For instance, the difference in enjoyment expressed by giving a rating
of 2 rather than 1 might be much less than the difference in enjoyment expressed by
giving a rating of 4 rather than 3.
Consider for example, the variable “socioeconomic status (SES)” which is usually
measured at the ordinal scale in the social sciences. The categories of the variable are
often ordered according to the following scheme:
4 = Upper class
3 = middle class
2 = Working class
1 = Lower class
Individual cases can be compared in terms of the categories into which they are
classified. Thus, an individual classified as 4(upper class) would be ranked higher than an
individual classified as a 2(working class).
To summarize, the ordinal level data has the following properties:
a. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
b. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.
c. Data categories are ranked or ordered according to the particular trait they
possess.
Note: Data obtained from a categorical variable are said to have been measured on a nominal
scale or on an ordinal scale.
For example, recording the ages of your respondents is a measurement procedure that
would produce interval-ratio data because the unit of measure (years) has equal intervals
(the distance from year to year is 365 days) and a true zero point. Other examples of
interval-ratio variables would be income, number of children, weight, years of marriage,
etc. All mathematical operations are permitted for data measured at these levels.
3. Interval level: The interval level of measurement is the next highest level. It includes all the
characteristics of the ordinal level, but in addition, the difference between values is a
constant size. An interval scale is a scale of measurement where the distance between any
two adjacent units of measurement (or 'intervals') is the same but the zero point is
arbitrary. Scores on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided. For example, the time interval between the starts of
years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The
zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval
scales include the heights of tides, and the measurement of longitude.
Another example of the interval level of measurement is temperature. Suppose the high
temperatures on three consecutive days in Accra are 28, 31, and 20 degrees Fahrenheit.
These temperatures can be easily ranked, but we can also determine the difference
between the temperatures. This is possible because 1 degree Fahrenheit represents a
constant unit of measurement. A given difference between two temperatures means the same
amount, regardless of their position on the scale. That is, the difference between 10 degrees
Fahrenheit and 15 degrees Fahrenheit is 5 degrees; the difference between 50 degrees Fahrenheit
and 55 degrees Fahrenheit is also 5 degrees. It is important to note that zero (0) is just a
point on the scale. It does not represent the absence of the condition. Zero (0) degree
Fahrenheit does not represent the absence of heat, just that it is cold. In fact, zero (0)
degree Fahrenheit is about –18 degrees on the Celsius scale. It has no true zero (0).
To summarize, the interval level data has the following properties:
a. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
b. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.
c. Equal differences in the characteristics are represented by equal
differences in the numbers assigned to the categories.
d. The zero point is arbitrary.
4. Ratio level: The ratio level is the highest level of measurement. The ratio level of measurement has
all the characteristics of the interval level, but in addition, the zero (0) point is meaningful
and the ratio between two numbers is meaningful. If you have zero (0) dollars, then you have
no money.
To summarize, the ratio level data has the following properties:
a. Data categories are mutually exclusive, meaning, each individual, object,
or measurement is included in only one category.
b. Data categories are mutually exhaustive, meaning, each individual, object,
or measurement must appear in a category.
c. Equal differences in the characteristics are represented by equal
differences in the numbers assigned to the categories.
d. The point zero (0) reflects the absence of the characteristics.
QUESTIONS
QUESTION 1:
For each of the following random variables determine:
i. Whether the variable is categorical or numerical. If the variable is numerical,
determine whether the phenomenon of interest is discrete or continuous.
ii. The level of measurement
a. Number of telephones per household
b. Type of telephone primarily used.
c. Number of long distance calls made per month
d. Length (in minutes) of longest distance call made per month
e. Colour of telephone primarily used
f. Monthly charge (in dollars and cents) for long distance calls made
g. Number of local calls made per month
h. Length (in minutes) of longest local call per month
i. Ownership of a cellular phone
j. Whether there is a telephone line connected to a computer modem in the
household
k. Whether there is a FAX machine in the household.
QUESTION 2:
Suppose that the following information is obtained about students from the campus
bookstore during the first week of classes:
a. Amount of money spent on books
b. Number of books purchased
c. Amount of time spent shopping in the bookshop
d. Academic major
e. Gender
f. Ownership
g. Ownership of a videocassette recorder
h. Number of credits registered for in the current semester
i. Whether or not any clothing items were purchased at the bookstore
j. Method of payment
a. Classify each of these variables as categorical or numerical. If the variable is
numerical, determine whether the variable is discrete or continuous.
b. Provide the level of measurement.
QUESTION 3:
For each of the following random variables determine:
a. Whether the variable is categorical or numerical. If the variable is numerical,
determine whether the phenomenon of interest is discrete or continuous.
b. The level of measurement
i. Brand of personal computer primarily used
ii. Cost of personal computer system
iii. Amount of time the personal computer is used per week
iv. Primary use of the personal computer
v. Number of persons in the household who use the personal computer
vi. Number of computer magazine subscriptions
vii. Word processing package primarily used
viii. Whether the personal computer is connected to the internet
QUESTION 4:
For each of the following random variables determine:
i. Whether the variable is categorical or numerical. If the variable is numerical,
determine whether the phenomenon of interest is discrete or continuous.
ii. The level of measurement
a. Amount of money spent on clothing in the last month
b. Number of winter coats owned
c. Favourite department store
d. Amount of time spent shopping for clothing in the last month
e. Most likely time period during which shopping for clothing takes place
f. Number of pairs of gloves owned
g. Primary type of transportation used when shopping for clothing
QUESTION 5:
i. Determine whether each of the following variables is categorical or numerical. If the
variable is numerical, determine whether the phenomenon of interest is discrete or
continuous.
ii. What is the level of measurement for each of the following variables?
a. Student IQ scores.
b. Distance student travel to class
c. Student scores on the first quantitative studies test
d. A classification of students by state of birth.
e. A ranking of students by freshman/women and continuing students.
f. Number of hours students study per week.
QUESTION 6:
Suppose the following information is obtained from Robert Keeler on his application for
a home mortgage loan at the Metro County Savings and Loan Association:
a. Place of residence: Stony Brook, New York
b. Type of residence: Single-family home
c. Date of birth: April 9, 1962
d. Monthly payments: $1,427
e. Occupation: Newspaper reporter/author
f. Employer: Daily newspaper
g. Number of years at job: 14
h. Number of jobs in past 10 years: 1
i. Annual family salary income: $66,000
j. Other income: $16,000
k. Marital status: Married
l. Number of children: 3
m. Mortgage requested: $120,000
n. Term of mortgage: 30 years
o. Other loans: Car
p. Amount of other loans: $8,000
Classify each of the responses by type of data and level of measurement
QUESTION 7:
xi. Ranking of this electric company as compared with two previous electricity
suppliers
c. If a variable is quantitative, state whether it is discrete or continuous.
Population
A population is any entire collection of people, animals, plants or things from which we
may collect data. It is the entire group we are interested in, which we wish to describe or
draw conclusions about.
Target Population
The target population is the entire group a researcher is interested in; the group about
which the researcher wishes to draw conclusions.
Example
Suppose we take a group of men aged 35-40 who have suffered an initial heart attack.
The purpose of this study could be to compare the effectiveness of two drug regimes for
delaying or preventing further attacks. The target population here would be all men
meeting the same general conditions as those actually included in the study.
Sample
A sample is a group of units selected from a larger group (the population). By studying
the sample it is hoped to draw valid conclusions about the larger group.
A sample is generally selected for study because the population is too large to study in its
entirety. The sample should be representative of the general population. This is often best
achieved by random sampling. Also, before collecting the sample, it is important that the
researcher carefully and completely defines the population, including a description of the
members to be included.
Example
The population for a study of infant health might be all children born in Ghana in the
1980s. The sample might be all babies born on 7th May in any of those years.
Parameter
A parameter is a value, usually unknown (and which therefore has to be estimated), used
to represent a certain characteristic of the population, for example the population mean.
Statistic
A statistic is a quantity that is calculated from a sample of data. It is used to give
information about unknown values in the corresponding population.
It is possible to draw more than one sample from the same population and the value of a
statistic will in general vary from sample to sample. For example, the average value in a
sample is a statistic. The average values in more than one sample, drawn from the same
population, will not necessarily be equal. Statistics are often assigned Roman letters, e.g.
x̄ (the sample mean) and s (the sample standard deviation).
Data Collection
Primary data is collected either through census or by sample selection:
i. Census: This does not require a selection procedure. It involves a complete
enumeration of the identified population
ii. Where the identified population is too large for a cost effective census to be
conducted, a sample of that population must be selected
In many cases, sampling is the only way to determine something about the population.
Some of the major reasons why sampling is necessary are:
i. The cost of studying all the items in a population is often prohibitive.
ii. To contact the whole population is often time consuming.
iii. The destructive nature of certain tests.
iv. The physical impossibility of checking all items in the population.
Sampling Techniques
There are two types of sampling: Probability sampling and non-probability sampling.
A simple random sample is one in which every individual or item from the population
has the same chance of selection as every other individual or item. In addition, every
sample of a fixed size has the same chance of selection as every other sample of that size.
With simple random sampling, we use n to represent the sample size and N to represent
the population size. Samples can be selected with replacement (once a person or item is
selected, it is returned to the population, where it has the same probability of being selected
again) or without replacement (a person or item, once selected, is not returned to the population
and therefore cannot be selected again). By using random sampling, the likelihood of bias is
reduced.
Simple random sampling is the most elementary random sampling technique, and as such
forms the basis for the other sampling techniques.
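As a sketch of the with/without replacement distinction, Python's standard random module offers both modes; the population frame and sample size here are hypothetical:

```python
import random

population = list(range(1, 101))  # hypothetical frame: items numbered 1..N with N = 100
n = 10                            # desired sample size

random.seed(42)  # fixed seed so the sketch is reproducible

# Without replacement: an item, once selected, cannot be selected again
sample_without = random.sample(population, n)

# With replacement: a selected item is returned and may be drawn again
sample_with = random.choices(population, k=n)

print(sorted(sample_without))
print(sorted(sample_with))
```

Every item in the frame has the same chance of selection; `random.sample` also guarantees the n selected items are all distinct, which is exactly the without-replacement property.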
Systematic Sample
In a systematic sample, the N individuals or items in the population frame are
partitioned into k groups by dividing the size of the population frame N by the desired
sample size n, i.e. k = N/n. The items or individuals of the population are arranged in some manner. A
random starting point is selected, and then every kth member of the population is selected
for the sample. For example, if a sample of 10 is to be selected from 100 files, the first file
should be chosen using a simple random process. If the first file is the 10th file, then every
10th file will be selected for the sample. That is, the sample will consist of the 10th, 20th,
30th, ..., 100th files.
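The file-selection example can be sketched in Python; the frame and sample sizes come from the example, and the random start within the first interval is the only assumption:

```python
import random

N = 100        # size of the population frame (100 files)
n = 10         # desired sample size
k = N // n     # sampling interval: every kth file is taken

random.seed(0)
start = random.randint(1, k)           # random start within the first k files
sample = list(range(start, N + 1, k))  # start, start + k, start + 2k, ...

print(sample)  # if the start happens to be 10: [10, 20, 30, ..., 100]
```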
Stratified Sample
In a stratified sample, the N individuals or items in the population are first subdivided into separate
subpopulations, or strata, according to some common characteristic. A simple random
sample is conducted within each of the strata, and the results from the separate simple
random samples are then combined. Either a proportional or a nonproportional sample
can be selected.
There may often be factors which divide up the population into sub-populations (groups /
strata) and we may expect the measurement of interest to vary among the different sub-
populations. This has to be accounted for when we select a sample from the population in
order that we obtain a sample that is representative of the population. This is achieved by
stratified sampling.
Stratified sampling techniques are generally used when the population is heterogeneous,
or dissimilar, where certain homogeneous, or similar, sub-populations can be isolated
(strata). Simple random sampling is most appropriate when the entire population from
which the sample is taken is homogeneous. Some reasons for using stratified sampling
over simple random sampling are:
c. Increased accuracy at a given cost.
A proportional sampling procedure requires that the number of items in each stratum be
in the same proportion as in the population. For example, the problem might be to study
the advertising expenditures of the 352 largest companies in Ghana. Suppose the
objective of the study is to determine whether firms with high returns on equity(a
measure of profitability) spent more of each sales dollar on advertising than firms with
low return or a deficit. Assume that the 352 firms were divided into five strata. If, say, 50
firms were to be selected for intensive study, the selection will be done as shown:
Stratum Profitability Number of firms Number sampled Procedure
1 30% and over 8 1 (8/352)*50
2 20% to 30% 35 5 (35/352)*50
3 10% to 20% 189 27 (189/352)*50
4 0% to 10% 115 16 (115/352)*50
5 Deficit 5 1 (5/352)*50
Total 352 50
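The "Number sampled" column can be reproduced by rounding each stratum's proportional share. A short sketch (note that rounding does not always sum exactly to the sample size, though it happens to here):

```python
N = 352  # total number of firms
n = 50   # firms to be selected

strata = {
    "30% and over": 8,
    "20% to 30%": 35,
    "10% to 20%": 189,
    "0% to 10%": 115,
    "Deficit": 5,
}

# Proportional allocation: each stratum contributes round((N_h / N) * n) firms
allocation = {name: round(size / N * n) for name, size in strata.items()}

print(allocation)                # e.g. '10% to 20%' contributes 27 firms
print(sum(allocation.values()))  # 50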
Cluster sampling is typically used when the researcher cannot get a complete list of the
members of a population they wish to study but can get a complete list of groups or
'clusters' of the population. It is also used when a random sample would produce a list of
subjects so widely scattered that surveying them would prove to be far too expensive, for
example, people who live in different postal districts in Ghana.
Cluster sampling methods can be more cost effective than simple random sampling
methods, particularly if the underlying population is spread over a wide geographic
region. However, cluster sampling methods tend to be less efficient than either simple
random sampling methods or stratified sampling methods, and would require a larger
overall sample size to obtain results as precise as those that would be obtained from the more
efficient procedures.
Example
1. Suppose that the Department of Agriculture wishes to investigate the use of pesticides
by farmers in England. A cluster sample could be taken by identifying the different
counties in England as clusters. A sample of these counties (clusters) would then be
chosen at random, so all farmers in those counties selected would be included in the
sample. It can be seen here then that it is easier to visit several farmers in the same county
than it is to travel to each farm in a random sample to observe the use of pesticides.
2. Suppose you want to determine the views of industrialists in Ghana about the state and
environmental protection policies. Selecting a random sample of industrialists in Ghana
and personally contacting each one would be time consuming and very expensive.
Instead, you could subdivide the country into regions or small units. These are called
primary units. You then select 4 or 5 units. From these units you could take a random
sample of industrialists in each of these units and interview them.
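The two-stage procedure in example 2 can be sketched as follows; the regions, firm names, and stage sizes are all hypothetical:

```python
import random

# Stage 0 (hypothetical frame): 8 primary units (regions), each listing its industrialists
clusters = {f"region_{i}": [f"firm_{i}_{j}" for j in range(20)] for i in range(8)}

random.seed(3)

# Stage 1: select 4 primary units (regions) at random
chosen_regions = random.sample(sorted(clusters), 4)

# Stage 2: take a simple random sample of industrialists within each chosen region
sample = []
for region in chosen_regions:
    sample.extend(random.sample(clusters[region], 5))

print(len(sample))  # 4 regions x 5 interviews = 20
```

Only the industrialists inside the chosen regions need to be listed and visited, which is the cost saving the text describes.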
Quota Sampling
Quota sampling is a method of sampling widely used in opinion polling and market
research. Interviewers are each given a quota of subjects of specified type to attempt to
recruit. For example, an interviewer might be told to go out and select 20 adult men, 20
adult women, 10 teenage girls and 10 teenage boys, so that they could interview them
about their television viewing.
It suffers from a number of methodological flaws, the most basic of which is that the
sample is not a random sample and therefore the sampling distributions of any statistics
are unknown.
TUTORIAL QUESTIONS
1. Define and give an example of each of the following statistical terms:
i. Population
ii. Sample
iii. Parameter
iv. Statistic
2. A politician who is running for the office of the mayor of a city with 25,000 registered
voters commissions a survey. In the survey, 48% of the 200 voters interviewed said they
planned to vote for her.
i. What is the population of interest?
ii. What is the sample?
iii. Is the value 48% a parameter or statistic?
3. A manufacturer of computer chips claims that less than 10% of his products are
defective. When 1000 chips were drawn from a large production run, 7.5% were found to
be defective
a. What is the population of interest?
b. What is the sample?
c. What is the parameter?
d. What is the statistic?
4. The owner of a large fleet of taxis is trying to estimate his cost for next year's
operations. One major cost is fuel. To estimate fuel purchases, the owner needs to know
the total distance his taxis will travel next year, the cost of a gallon of fuel, and the fuel
mileage of his taxis. The owner has been provided with the first two figures (distance
estimate and cost). However, because of the high cost of gasoline, the owner has recently
converted his taxis to operate on propane; he has therefore measured the propane
mileage (in miles per gallon) for 50 taxis.
a. What is the population of interest?
b. What is the sample?
c. What is the parameter the owner needs?
d. What is the statistic?
SUMMARISING DATA
Frequency Table
A frequency table is a way of summarising a set of data. It is a record of how often each
value (or set of values) of the variable in question occurs. It may be enhanced by the
addition of percentages that fall into each category.
A frequency table is used to summarise categorical data (nominal and ordinal). It may
also be used to summarise continuous data once the data set has been divided up into
sensible groups.
When we have more than one categorical variable in our data set, a frequency table is
sometimes called a contingency table because the figures found in the rows are
contingent upon (dependent upon) those found in the columns.
Example
Suppose that in thirty shots at a target, a marksman makes the following scores:
52234 43203 03215
13155 24004 54455
The frequencies of the different scores can be summarised as:
Score   Frequency   Frequency (%)
0       4           13%
1       3           10%
2       5           17%
3       5           17%
4       6           20%
5       7           23%
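The table can be checked with a few lines of Python using collections.Counter; the scores are those of the thirty shots:

```python
from collections import Counter

shots = [5, 2, 2, 3, 4,  4, 3, 2, 0, 3,  0, 3, 2, 1, 5,
         1, 3, 1, 5, 5,  2, 4, 0, 0, 4,  5, 4, 4, 5, 5]

freq = Counter(shots)
for score in sorted(freq):
    pct = 100 * freq[score] / len(shots)
    print(f"score {score}: frequency {freq[score]:2d} ({pct:.0f}%)")
```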
Pie Chart
Example
Suppose that, in the last year, a sportswear manufacturer has spent 6 million pounds on
advertising their products: 3 million has been spent on television adverts, 2 million on
sponsorship, 1 million on newspaper adverts, and a half million on posters. This spending
can be summarised using a pie chart.
Bar Chart
Bar charts can be displayed horizontally or vertically and they are usually drawn with a
gap between the bars (rectangles), whereas the bars of a histogram are drawn
immediately next to each other.
Histogram
A histogram is a way of summarising data that are measured on an interval scale (either
discrete or continuous). It is often used in exploratory data analysis to illustrate the major
features of the distribution of the data in a convenient form. It divides up the range of
possible values in a data set into classes or groups. For each group, a rectangle is
constructed with a base length equal to the range of values in that specific group, and an
area proportional to the number of observations falling into that group. This means that
the rectangles might be drawn of non-uniform height.
The histogram is only appropriate for variables whose values are numerical and measured
on an interval scale. It is generally used when dealing with large data sets (more than 100
observations).
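For equal-width classes, the grouping step behind a histogram is just binning and counting; a minimal sketch with a hypothetical data set and class width:

```python
from collections import Counter

data = [12, 15, 21, 22, 22, 25, 31, 33, 34, 35, 41, 47]  # hypothetical observations
width = 10                                               # equal class width

# Assign each value to the class [lower, lower + width)
bins = Counter((x // width) * width for x in data)
for lower in sorted(bins):
    print(f"{lower}-{lower + width - 1}: {'#' * bins[lower]}")
```

With equal widths, the count in each class is proportional to the rectangle's area, so the counts alone determine the histogram's shape; unequal widths would require dividing each count by its class width.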
CENTRAL TENDENCY
The central tendency of a set of measurements is the tendency of the data to cluster about
certain numerical values, e.g. the mean, the mode, etc.
Mean
The sample mean is an estimator available for estimating the population mean.
Its value depends equally on all of the data, which may include outliers. It may not appear
representative of the central region for skewed data sets.
It is especially useful as being representative of the whole sample for use in subsequent
calculations.
The sample mean is calculated by taking the sum of all the data values and dividing by
the total number of data values:
x̄ = Σx / n   (or, for a frequency distribution, x̄ = Σfx / Σf)
Example 1
Let's say our data set is: 5, 3, 54, 93, 83, 22, 17, 19. Then
x̄ = (5 + 3 + 54 + 93 + 83 + 22 + 17 + 19) / 8 = 296 / 8 = 37
Example 2
x f fx
1 2 2
2 3 6
3 5 15
4 3 12
5 2 10
∑f = 15 ∑fx = 45
Mean = x̄ = Σfx / Σf = 45 / 15 = 3
Example 3
Mark f Midpoint(x) fx
1-5 2 3 6
6-10 2 8 16
11-15 3 13 39
16-20 2 18 36
21-25 1 23 23
∑f = 10 ∑fx = 120
Mean = x̄ = Σfx / Σf = 120 / 10 = 12
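The grouped-mean calculation above, restated in Python:

```python
midpoints = [3, 8, 13, 18, 23]  # class midpoints of 1-5, 6-10, 11-15, 16-20, 21-25
freqs     = [2, 2, 3, 2, 1]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))  # Σfx
sum_f  = sum(freqs)                                    # Σf
mean = sum_fx / sum_f

print(sum_fx, sum_f, mean)  # 120 10 12.0
```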
Median
The median is the value halfway through the ordered data set, below and above which
there lies an equal number of data values.
It is generally a good descriptive measure of the location which works well for skewed
data, or data with outliers. The median is the 0.5 quantile.
Example 1
With an odd number of data values, for example 21, we have:
Data 96 48 27 72 39 70 7 68 99 36 95 4 6 13 34 74 65 42 28 54 69
Ordered Data 4 6 7 13 27 28 34 36 39 42 48 54 65 68 69 70 72 74 95 96 99
Median 48, leaving ten values below and ten values above
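Example 1 can be verified with the statistics module from Python's standard library:

```python
import statistics

data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
        4, 6, 13, 34, 74, 65, 42, 28, 54, 69]

# With an odd number of values (21), the median is the middle ordered value
middle = sorted(data)[len(data) // 2]

print(middle, statistics.median(data))  # 48 48
```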
Example 2
x f cf
1 2 2
2 3 5
3 5 10
4 3 13
5 2 15
Median position = (n + 1) / 2 = (15 + 1) / 2 = 8th position
From the table, the 8th position falls within cumulative frequency 10, i.e. at x = 3.
Therefore, the median value is 3.
Example 3
Mark f cf
1-5 2 2
6-10 2 4
11-15 3 7
16-20 2 9
21-25 1 10
Median position = n/2 = 10/2 = 5, so the median lies between the 5th and 6th positions.
From the table, the 5th and 6th positions fall within the cumulative frequency of 7, which
corresponds to the class 11-15. Therefore, the median class is 11-15. The median value
itself can only be found using a formula or a graph.
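One common interpolation formula for the grouped median (an assumption here, since the notes do not state which formula they intend) is median = L + ((n/2 − CF)/f)·h, where L is the lower class boundary of the median class, CF the cumulative frequency before it, f its frequency and h the class width. A sketch:

```python
def grouped_median(lower_boundary, n, cf_before, f_median, width):
    """Interpolated median of a grouped distribution: L + ((n/2 - CF) / f) * h."""
    return lower_boundary + ((n / 2 - cf_before) / f_median) * width

# Median class 11-15: L = 10.5, n = 10, CF before the class = 4, f = 3, h = 5
m = grouped_median(10.5, 10, 4, 3, 5)
print(round(m, 2))  # 12.17
```

This assumes the values are spread evenly across the median class, which is the same assumption a cumulative frequency graph makes.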
Mode
The mode is the most frequently occurring value in a set of discrete data. There can be
more than one mode if two or more values are equally common.
Example 1
Suppose the results of an end of term Statistics exam were distributed as follows:
Student Score
1 94
2 81
3 56
4 90
5 70
6 65
7 90
8 90
9 30
Then the mode (most common score) is 90, and the median (middle score) is 81.
Example 2
x f
1 2
2 3
3 5
4 3
5 2
The highest frequency is 5, and the class that has this is 3. Therefore, the mode is 3.
Example 3
Mark f
1-5 2
6-10 2
11-15 3
16-20 2
21-25 1
The highest frequency is 3, and the class that has this is 11-15. Therefore, the modal class
is 11-15. The modal value itself can only be found using a formula or a graph.
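One common interpolation formula for the grouped mode (again an assumption, as the notes do not state one) is mode = L + (d1/(d1 + d2))·h, where d1 and d2 are the differences between the modal class frequency and the frequencies of the classes before and after it:

```python
def grouped_mode(lower_boundary, f_modal, f_before, f_after, width):
    """Interpolated mode of a grouped distribution: L + (d1 / (d1 + d2)) * h."""
    d1 = f_modal - f_before  # excess over the class before
    d2 = f_modal - f_after   # excess over the class after
    return lower_boundary + (d1 / (d1 + d2)) * width

# Modal class 11-15: L = 10.5, frequencies 2, 3, 2 around the modal class, h = 5
print(grouped_mode(10.5, 3, 2, 2, 5))  # 13.0
```

Because the neighbouring frequencies are equal here, the mode lands at the centre of the modal class.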
Dispersion
The data values in a sample are not all the same. This variation between values is called
dispersion.
When the dispersion is large, the values are widely scattered; when it is small they are
tightly clustered. There are several measures of dispersion, the most common being the
standard deviation. These measures indicate to what degree the individual observations of
a data set are dispersed or 'spread out' around their mean.
Range
The range of a sample (or a data set) is a measure of the spread or the dispersion of the
observations. It is the difference between the largest and the smallest observed value of
some quantitative characteristic and is very easy to calculate.
A great deal of information is ignored when computing the range since only the largest
and the smallest data values are considered; the remaining data are ignored.
The range value of a data set is greatly influenced by the presence of just one unusually
large or small value in the sample (outlier).
Inter-quartile Range
The inter-quartile range is a measure of the spread of or dispersion within a data set.
It is calculated by taking the difference between the upper and the lower quartiles. For
example:
Data 2 3 4 5 6 6 6 7 7 8 9
Upper quartile 7
Lower quartile 4
IQR 7-4=3
The IQR is the width of an interval which contains the middle 50% of the sample, so it is
smaller than the range and its value is less affected by outliers.
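Using the median-of-halves convention implied by the example above (quartile definitions vary between textbooks and software packages), the IQR can be computed as:

```python
def median(sorted_vals):
    """Median of an already-sorted list."""
    n = len(sorted_vals)
    mid = n // 2
    if n % 2:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def iqr(data):
    """IQR as median(upper half) - median(lower half),
    excluding the middle value when n is odd."""
    vals = sorted(data)
    n = len(vals)
    lower = vals[: n // 2]
    upper = vals[(n + 1) // 2:]
    return median(upper) - median(lower)

data = [2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 9]
print(iqr(data))  # 7 - 4 = 3
```

Other conventions (e.g. interpolated quantiles) can give slightly different quartile values on small samples.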
Quantile
Quantiles are a set of 'cut points' that divide a sample of data into groups containing (as
far as possible) equal numbers of observations.
Percentile
Percentiles are values that divide a sample of data into one hundred groups containing (as
far as possible) equal numbers of observations. For example, 30% of the data values lie
below the 30th percentile.
Quartile
Quartiles are values that divide a sample of data into four groups containing (as far as
possible) equal numbers of observations.
A data set has three quartiles. References to quartiles often relate to just the outer two, the
upper and the lower quartiles; the second quartile being equal to the median. The lower
quartile is the data value a quarter way up through the ordered data set; the upper quartile
is the data value a quarter way down through the ordered data set.
Example
Data 6 47 49 15 43 41 7 39 43 41 36
Ordered Data 6 7 15 36 39 41 41 43 43 47 49
Median(Q2) 41
Upper quartile 43
Lower quartile 15
Quintile
Quintiles are values that divide a sample of data into five groups containing (as far as
possible) equal numbers of observations.
Sample Variance
Sample variance is a measure of the spread of or dispersion within a set of sample data.
The sample variance is the sum of the squared deviations from their average divided by
one less than the number of observations in the data set. For n observations
x1, x2, x3, ..., xn with sample mean x̄:

s² = (1/(n − 1)) ∑(xᵢ − x̄)²

For a frequency distribution, s² = (1/(n − 1)) ∑f(xᵢ − x̄)², where n = ∑f.
Standard Deviation
It is calculated by taking the square root of the variance and is symbolised by s.d. or s. In
other words, s = √s².
The more widely the values are spread out, the larger the standard deviation. For
example, say we have two separate lists of exam results from a class of 30 students; one
ranges from 31% to 98%, the other from 82% to 93%, then the standard deviation would
be larger for the results of the first exam.
Example
x f d = x − x̄ d² fd²
1 2 −2 4 8
2 3 −1 1 3
3 5 0 0 0
4 3 1 1 3
5 2 2 4 8
∑f = 15 ∑fd² = 22

(Here x̄ = 3, as computed earlier for this distribution.)

s² = (1/(n − 1)) ∑f(xᵢ − x̄)² = 22/14 ≈ 1.57

Standard deviation = √1.57 ≈ 1.25
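The table arithmetic above can be verified in a few lines; here n is taken as ∑f, matching the divisor n − 1 = 14 used in the notes:

```python
import math

def freq_variance(pairs):
    """Sample variance of a frequency distribution:
    sum(f * (x - mean)**2) / (n - 1), with n = sum of frequencies."""
    n = sum(f for _, f in pairs)
    mean = sum(x * f for x, f in pairs) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in pairs)
    return sum_sq / (n - 1)

pairs = [(1, 2), (2, 3), (3, 5), (4, 3), (5, 2)]
var = freq_variance(pairs)   # 22 / 14 ≈ 1.571
sd = math.sqrt(var)          # ≈ 1.254
print(round(var, 2), round(sd, 2))
```
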
EXERCISES
QUESTION 1
The data below show prices of vehicles sold in 1999 at DH Ltd (prices are in dollars).
20,197 20,372 17,454 20,591 23,651 24,453 14,266 15,021 25,683 27,872
16,587 20,169 32,851 16,251 17,047 21,285 21,324 21,609 25,670 12,546*
12,935 16,873 22,251 22,277 25,034 21,533 24,443 16,889 17,004 14,357
17,155 16,688 20,657 23,613 17,895 17,203 20,765 22,783 23,661 29,277
17,642 18,981 21,052 22,799 12,794 15,263 32,925# 14,399 14,968 17,356
18,442 18,722 16,331 19,817 16,766 17,633 17,962 19,845 23,285 24,896
26,076 29,492 15,890 18,740 19,374 21,571 22,449 25,337 17,642 20,613
21,220 27,655 19,442 14,891 17,818 23,237 17,445 18,556 18,639 21,296
Tally the data into a frequency distribution. Draw an appropriate chart for the distribution
Find the mean, modal and median prices
QUESTION 2:
A recent survey showed that the typical American car owner spends $2,950 per year on
operating expenses. Below is a breakdown of the various items.
EXPENDITURE ITEM AMOUNT($)
Fuel 603
Interest on car loan 279
Repairs 930
Insurance and license 646
Depreciation 492
Total 2,950
Draw an appropriate chart to portray the data and summarize your findings in a brief
report.
QUESTION 3:
The Midland Natural Bank selected a sample of 40 students’ accounts. Below are their
end-of-month balances in thousands of dollars.
404 74 234 149 279 215 123 55 43 321
87 234 68 489 57 185 141 758 72 863
703 125 350 440 37 252 27 521 302 127
968 712 503 489 327 608 358 425 303 203
a. Tally the data into a frequency distribution using $100 as class interval and
$0 as the starting point
b. Draw a less than cumulative polygon
c. The bank considers any student with an ending balance of $400 or more a
“preferred customer”. Estimate the percentage of preferred customers.
d. The bank is also considering a service charge to the lowest 10 percent of
the ending balances. What would you recommend as the cutoff point
between those who have to pay a service charge and those who do not?
QUESTION 4:
Annual revenues by type of tax for the state are as follows:
TYPE OF TAX AMOUNT ($)
Sales 2,812,473
Income(individual) 2,732,045
License 185,198
Corporate 525,015
Property 22,647
Gift 37,326
Total 6,314,704
Develop an appropriate chart or graph and write a brief report to summarize the
information.
QUESTION 5:
Annual imports from selected Canadian trading partners are listed below.
PARTNER ANNUAL IMPORT ($MILLIONS)
Japan 9,550
U. Kingdom 4,556
S. Korea 2,441
China 1,182
Australia 618
Develop an appropriate chart or graph and write a brief report to summarize the
information.
QUESTION 6:
A breakfast cereal is supposed to include 200 raisins in each box. A sample of 60 boxes
produced showed the following number of raisins in each box.
200 193 198 203 196 202 200 196 203 201 205 201
193 203 198 201 200 202 204 195 198 202 201 204
204 200 206 198 202 206 197 199 200 205 191 206
202 199 207 205 202 199 200 200 205 196 200 197
200 199 206 206 204 195 197 200 195 197 200 198
i. Organize the data into a frequency distribution
ii. Draw an appropriate chart for the distribution
QUESTION 7:
The Marketing Research Department is investigating the performance of several
corporations in the coal, mining and gas industries. The fourth-quarter sales in 1997(in
millions of dollars) for these corporations are:
CORPORATION FOURTH-QUARTER SALES ($MILLIONS)
American Hess 1,645.2
Atlantic Richfield 4,757.0
Chevron 8,913.0
Diamond Shamrock 627.1
Exxon 24,612.0
Quaker State 191.9
The Department wants to include a chart in their report comparing the fourth-quarter
sales of six corporations. Draw an appropriate chart to compare the fourth-quarter sales of
these corporations. And write a brief report summarizing the chart.
QUESTION 8:
The distance (in thousand kilometers) covered by the sales agents of Marketing Company
during the year 1998 and the years 2000 were as follows.
Distance Covered 1998 2000
(000 Km) No. of Sales Agents No. of Sales Agents
Under 15 9 3
15-25 15 7
25-40 18 12
40-60 20 20
60-80 12 14
80-100 8 10
100 and over 0 4
(a) Compute the mean distance covered by a sales agent in the year (i) 1998 (ii) 2000
(b) Compute the standard deviation, to the nearest whole number, of distance covered by
sales agent in the year (i) 1998 (ii) 2000
QUESTION 9:
The data below gives the wages of 50 operatives in a construction firm. The wages are
measured in thousands of Cedis.
167 168 177 171 170 166 168 173 166 170
173 167 169 170 172 172 169 170 171 175
171 172 171 164 167 172 168 176 167 170
174 169 174 165 170 166 169 170 167 172
161 171 166 170 171 169 168 172 170 173
(a) Using the classes 160-162, 163-165, etc, construct a cumulative table for the data
(b) Draw a cumulative frequency curve for the data
(c) Use your cumulative frequency curve in (b) to estimate
(i) the median wage (ii) the semi interquartile range
(d) Determine the proportion of the operatives in the sample whose wages are
(i) between ¢170,000 and ¢175,000 (ii) less than or equal to the median wage
QUESTION 10:
A market research analyst conducted a household survey in a suburb in Accra. One of
the questions asked was “how many rooms do you have in your house?” The data below
gives the results.
8 4 3 5 5 2 7 4 7 7 7 7
6 6 8 6 3 8 6 5 7 9 5 8
8 5 4 7 7 5 4 7 6 6 7 7
8 6 6 3 6 9 7 7 4 8 8 6
6 7 2 6 7 6 5 4 5 4 7 5
(a) Construct a frequency table for the data
(b) Find (i) The mean (ii) the variance (iii) the median of the distribution
(c) If a household is selected at random from the sample, determine the probability that it
has at least 6 rooms.
QUESTION 11:
The following data gives the age distribution of a random sample of 50 workers in the
catering industry in a large city.
32 32 35 36 38 31 37 37 31 34
37 35 34 32 38 32 33 42 36 35
36 37 36 29 32 39 34 39 30 35
33 34 33 34 33 26 36 31 35 36
35 41 35 37 31 33 35 37 40 38
(a) Using the class limits 25-27, 28-30, etc, construct a frequency table for the data. Draw
a histogram for the grouped data.
(b) Determine the mean, mode and median from the grouped data
(c) Calculate the standard deviation from the grouped data
QUESTION 12:
The table below shows the distribution of monthly salary in thousand cedis, of 200
employees of KADO Industries:
Monthly Salary Number of Employees
150-200 8
200-250 30
250-300 41
300-350 52
350-400 33
400-450 24
450-500 12
(a) Draw a cumulative frequency curve of the distribution
(b) If the starting monthly salary of a manager in the firm is ¢425,000 estimate from
graph the number of workers who are managers.
(c) Use your graph to estimate the median and the quartiles of the distribution.
(d) Comment on the shape of the distribution in the light of your values.
QUESTION 13:
Loan granted by a branch of a corporate bank in 1994 are distributed as follows
Value of Loan (¢million) Number of Loans
21-45 10
46-70 25
71-95 55
96-120 60
121-145 35
146-170 10
171-195 5
(a) Draw a cumulative frequency curve for the distribution
(b) If loans less than ¢75m are referred to the headquarters of the bank for approval,
estimate from your graph the number of loans that were referred to the headquarters for
approval
(c) Estimate from your graph the median value of the loans. Explain what this value
means.
(d) Compute the mean value of the loans.
(e) Comment on the shape of the distribution in the light of your values.
QUESTION 14:
The table below shows the distribution of number of days elapsing between date of
purchases and date of return of items returned to a department store during the current
fiscal year.
Number of days Number of items
Less than 5 8
5-9 22
10-14 38
15-19 16
20-24 15
25-29 8
30 or more 3
Total 110
(a) How is the information in the frequency distribution represented in a histogram?
(b) (i) Estimate the mean, mode, median and the semi-interquartile range for the data
(ii) Comment on the shape of the distribution.
QUESTION 15:
The administrator of a teaching Hospital has been collecting data regarding the number of
patients treated in the emergency ward on week-ends. The frequency distribution for the
numbers treated over a 20-week period is shown below.
No. of patients Treated No. of Weeks
25-34 4
35-44 5
45-54 7
55-64 2
65-74 1
75-84 1
(a) Compute (i) the mean, mode and median (ii) the semi-interquartile range for the distribution
PROBABILITY
Background to Probability
Historically, probability originated from the study of games of chance and early
applications of the theory of probability were in such games. In the middle of the 17th
century, a French courtier, the Chevalier de Méré, wanted to know how to adjust the stakes
in gambling so that in the long run the advantage would be his. He presented the problem
to Blaise Pascal. It was in the correspondence between Pascal and Pierre Fermat, another
French mathematician, that the theory of probability had its beginning. Many of the
probability calculations were based on objects of gambling: the coin, the die, and cards.
In order to understand probability, it is useful to have some familiarity with sets and
operations involving sets. This is because in probability theory, we make use of the idea
of set and operations involving sets.
Set: A set is a collection of elements. The elements of a set may be people, horses, desks,
files in a cabinet, or even numbers.
Universal Set: It is a set containing everything in a given context. We denote the
universal set by S.
Intersection of Sets: If sets A and B have elements in common, we say they intersect.
The intersection of A and B is denoted A ∩ B. Example: Given that A = {2, 3, 4, 5, 6}
and B = {2, 4, 5, 7, 8}, then A ∩ B = {2, 4, 5}.
Union of Sets: The union of A and B is the set containing all elements that are members
of A or B or both. This is denoted A ∪ B. Example: Given that A = {2, 3, 4, 5, 6} and B
= {2, 4, 5, 7, 8}, then A ∪ B = {2, 3, 4, 5, 6, 7, 8}
Disjoint Sets: The sets A and B are said to be disjoint if they have no elements in
common. Thus, A ∩ B = ∅.
Complement of a Set: Given a set A, we define its complement, denoted A’, as the set
containing all the elements in the universal set S that are not members of set A. The set
A’ is often called “not A”
Basic definitions of terms relevant to the computation of probability
Experiment
An experiment is a process that leads to one of several possible outcomes. Examples of
experiments include the simple process of checking whether a switch is turned on or off,
counting the imperfections in a piece of cloth, etc.
Outcome
An outcome is a single possible result of an experiment.
Sample Space
The set of all possible outcomes of a probability experiment is called a sample space. It is
an exhaustive list of all the possible outcomes of an experiment. Each possible result of
such a study is represented by one and only one point in the sample space, which is
usually denoted by S.
Examples
Experiment Rolling a die once:
Sample space S = {1,2,3,4,5,6}
Experiment Tossing a coin:
Sample space S = {Heads,Tails}
Experiment Measuring the height (cm) of a girl on her first day at school:
Sample space S = the set of all possible real numbers
Deterministic Experiment
An experiment is deterministic if its observed results are not subject to chance. In a
deterministic experiment, if the experiment is repeated a number of times under exactly
the same conditions, we expect the same results. For example, if we measure the distance
between the points P and Q (in km) many times under the same conditions, we expect to
have the same results.
Random Experiment
An experiment is random if its outcomes are uncertain. If a random experiment is
repeated under identical conditions, the outcomes may be different as there may be some
random phenomena or chance mechanism at work affecting the outcomes. For example,
tossing a coin or rolling a die is a random experiment since in each case the process can
lead to more than one possible outcome. A synonym for random experiment is stochastic
experiment.
Trial
Each repetition of an experiment is a trial. For example, if a coin is tossed four times,
each single toss is a trial.
Sample Space
Sample space is the universal set S pertinent to a given experiment. It is the set of all
possible outcomes of an experiment.
Sample Point
Each outcome in a sample space is an element or sample point. For example, when an
experiment is performed, it gives rise to certain outcomes. These outcomes are called
sample points( or elementary events). A collection of all possible outcomes or sample
points is called sample space. A coin tossed once may either result in a head(H) or a
tail(T). so there are only two outcomes of this experiment. Here, the sample space ,S,
consists of only two sample points. Thus, S = H, T . The number of sample points is 2.
This is denoted n(S) = 2. A fair die thrown once may show up on its face either of the six
numbers 1, 2, 3, 4, 5, and 6. There are six possible outcomes of this experiment and so
the sample space is S =1, 2, 3, 4, 5, 6 .The number of sample points is n(S) = 6
Equally Likely Outcomes:
When any outcome of an experiment has the same chance of occurrence as any other
outcome, then the outcomes are said to be equally likely. When a die is tossed once, the
outcomes 1, 2, 3, 4, 5, and 6 are all equally likely as long as the die is fair.
Event
An event is any subset of the sample space. An event which consists of a single outcome
in the sample space is called an elementary or simple event. Events which consist of
more than one outcome are called compound events.
Set theory is used to represent relationships among events. In general, if A and B are two
events in the sample space S, then
A ∪ B (A union B) = 'either A or B occurs or both occur'
A ∩ B (A intersection B) = 'both A and B occur'
A ⊆ B (A is a subset of B) = 'if A occurs, so does B'
A' = 'event A does not occur'
∅ (the empty set) = an impossible event
S (the sample space) = an event that is certain to occur
Example
Experiment: rolling a die once
Sample space S = {1,2,3,4,5,6}
Events A = 'score < 4' = {1,2,3}
B = 'score is even' = {2,4,6}
C = 'score is 7' = ∅
A ∪ B = 'the score is < 4 or even or both' = {1,2,3,4,6}
A ∩ B = 'the score is < 4 and even' = {2}
A' = 'event A does not occur' = {4,5,6}
Mutually Exclusive and Collectively Exhaustive Events
When a set of events is both mutually exclusive and collectively exhaustive, the sum of
their probabilities is equal to one. Note that in the experiment of tossing a coin once,
S = {H, T}, and the events are A = {H} and B = {T}. The events are mutually exclusive
because A and B cannot both occur at the same time. Also, the events A and B are
collectively exhaustive because A ∪ B = {H, T}. Thus, the sum of the probabilities of the
events A and B is equal to 1.
Independent Events
Two events A and B are said to be independent if the occurrence (or non-occurrence) of
one of them is not affected by the occurrence (or non-occurrence) of the other. For
example, if two coins are tossed, the event “Head” on the first coin and the event “Tail”
on the second coin are independent.
Dependent Events
Two events A and B are said to be dependent if the occurrence (or non-occurrence) of
one of them is affected by the occurrence (or non-occurrence) of the other. For example,
suppose a box contains 2 red pens and 3 blue pens, and two pens are picked at random
successively. The event “blue pen” on the second pick and the event “red pen” on the
first pick are dependent.
Concept
Probability is a concept that most people understand naturally, since words such as
“chance’, “likelihood’, “possibility’, and “proportion” are used as part of everyday
speech. For example, most of the following, which might be heard in business situations
are in fact statements of probability.
i. There is a 30% chance that the job will not be completed on time.
ii. There is no probability of delivering the goods before Monday. Etc…
Probability is the likelihood or chance that a particular event will occur. It could refer to
the chance of picking a black card from a deck of cards, or the chance that new consumer
product on the market will be successful, etc. Probability is a numerical measure of the
likelihood that an event will occur. These values are always assigned on a scale from 0 to
1. A probability near 0 indicates an event is very unlikely to occur (almost no chance of
occurring); a probability near 1 indicates an event is almost certain. Other probabilities
between 0 and 1 represent degrees of likelihood that an event will occur. For example, if
we consider the event “rain tomorrow” we understand that when the weather report
indicates ‘a near-zero probability of rain’, there is almost no chance of rain. However, if
a 0.90 probability of rain is reported, we know that rain is likely to occur. A 0.50
probability indicates that rain is just as likely to occur as not.
Chance and the assessment of risk play a part in everyone’s life. It might be something as
simple as tossing a coin at the start of a game, playing cards, owning premium bonds or
playing the National Lottery etc…. Probability has found a wide range of business
application. In addition to the calculation of risk in the banking and insurance industries,
probability provides the basis of many of the sampling procedures used in market
research and quality control. Investment appraisal requires an assessment of risk and a
measure of expected outcomes. The planning of major projects needs to take account of
uncertainties. Given that the outcomes of most activities are not known with certainty
(not deterministic), it is useful to understand them in probabilistic terms.
Approaches to Probability
A Priori Classical Probability: The probability is based on the assumption that the
outcomes are equally likely. The probability of success is based on prior knowledge
of the process involved. In the simplest case, where each outcome is equally likely, the
chance of occurrence of the event is defined as follows:

Probability of occurrence = X / T

Where:
X = Number of outcomes in which the event we are looking for occurs
T = Total number of possible outcomes
In some experiments, all outcomes are equally likely. For example, if you were to choose
one winner in a raffle from a hat, all raffle ticket holders are equally likely to win; that is,
they have the same probability of their ticket being chosen. This is the equally-likely
outcomes model, and the probability of an event A is defined to be:

P(A) = n(A) / n(S)

where n(A) is the number of outcomes favourable to A and n(S) is the number of
elements in the sample space S.
Example
Consider the experiment of rolling a six-sided die once. What is the probability of the
event “an even number appears face up”?
Solution:
Possible outcomes: S = {1, 2, 3, 4, 5, 6}, thus n(S) = 6
Favourable outcomes (event A) = {2, 4, 6}, thus n(A) = 3
Probability of occurrence P(A) = n(A)/n(S) = 3/6 = 1/2
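The equally-likely computation above can be expressed directly by counting outcomes:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                  # rolling a fair die once
event_a = {x for x in sample_space if x % 2 == 0}  # "an even number appears face up"

p_a = Fraction(len(event_a), len(sample_space))
print(p_a)  # 1/2
```

Using Fraction keeps the answer exact rather than as a rounded decimal.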
Empirical Probability: Here the probability of an event is computed from observed data
as the relative frequency of its occurrence:

Probability of occurrence = Number of observations favourable to the event / Total number of observations
Subjective Probability
Whereas the probability of favourable event in the two approaches was computed
objectively, either from prior knowledge or from actual data, subjective probability refers
to the chance occurrence assigned to an event by a particular individual based on his
degree of belief or strength of conviction that the event will occur. This chance may be
different from the subjective probability assigned by another individual. For example, the
inventor of a new toy may assign quite a different probability to the chance of success for
the toy than the managing director of the company. The assignment of subjective
probabilities to various events is usually based on a combination of an individual’s past
experience, personal opinion, and analysis of a particular situation. Subjective probability
is especially useful in making decisions in situations in which the probability of various
events cannot be determined empirically.
Example
A Rangers supporter might say, "I believe that Rangers have a probability of 0.9 of
winning the Premier Division this year since they have been playing really well."
Rules of Probability
Addition Rule: This applies when we are considering two or more events and wish to
determine the probability that at least one of the events will take place. In other words,
the addition rule is a result used to determine the probability that event A or event B
occurs, or both occur.
The result is often written as follows, using set notation:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∪ B) = probability that event A or event B occurs
P(A ∩ B) = probability that event A and event B both occur
There are two variations of this rule:
1. Mutually exclusive events: Two events are mutually exclusive (or disjoint) if it is
impossible for them to occur together; the sets corresponding to the events are disjoint.
If A and B are two mutually exclusive events, then the probability of obtaining either A
or B is equal to the probability of obtaining A plus the probability of obtaining B. In
general, we have:

P(A or B) = P(A) + P(B)

If two events are mutually exclusive, they cannot be independent, and vice versa.
Examples
Worked Example: An automatic Shaw machine fills plastic bags with a mixture of
beans, broccoli, and other vegetables. Most of the bags contain the correct weight, but
because of the slight variations in the size of the beans and other vegetables, a package
might be slightly underweight or overweight. A check of 4,000 packages filled in the past
month revealed the following:
Solution:
Let event A = underweight
Let event B = overweight
The corresponding probabilities are P(A) = 0.025 and P(B) = 0.075
P(A or B) = P(A) + P(B) = 0.025 + 0.075 = 0.10
2. Non-mutually exclusive events: This rule is also known as the general addition rule of
probability. It allows for the possibility of the two events occurring together. For
two events A and B, P(A or B) = P(A) + P(B) − P(A and B). Note that P(A and B) is the
same as P(A ∩ B). Also, P(A or B) is the same as P(A ∪ B).
Example 1: What is the probability that a card randomly chosen from a standard deck
of playing cards will be either a king or a heart?
Solution:
EVENT PROBABILITY EXPLANATION
A(King) P(A) = 4/52 4 kings in a deck of 52 cards
B(Heart) P(B) = 13/52 13 Hearts in a deck of 52 cards
C(King of Hearts) P(C) = 1/52 1 King of Hearts in a deck of 52 cards
P(A or B) = P(A) + P(B) − P(A ∩ B) = 4/52 + 13/52 − 1/52 = 16/52 ≈ 0.3077
Example 2: A die is tossed once. What is the probability that the number that shows up is
even or prime or both?
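The card calculation can be checked by enumerating the deck (the rank/suit encoding below is illustrative, not from the notes):

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

# Event: the card is a king OR a heart (the king of hearts is counted once)
king_or_heart = [c for c in deck if c[0] == "K" or c[1] == "hearts"]
p = Fraction(len(king_or_heart), len(deck))
print(p)  # 4/13, i.e. 16/52 ≈ 0.3077
```

Enumerating and filtering the sample space applies the inclusion-exclusion idea automatically, since the overlapping card is never double-counted.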
Conditional Probability
In many situations, once more information becomes available, we are able to revise our
estimates for the probability of further outcomes or events happening. For example,
suppose you go out for lunch at the same place and time every Friday and you are served
lunch within 15 minutes with probability 0.9. However, given that you notice that the
restaurant is exceptionally busy, the probability of being served lunch within 15 minutes
may reduce to 0.7. This is the conditional probability of being served lunch within 15
minutes given that the restaurant is exceptionally busy.
The usual notation for "event A occurs given that event B has occurred" is "A | B" (A
given B). The symbol | is a vertical line and does not imply division. P(A | B) denotes the
probability that event A will occur given that event B has occurred already:

P(A | B) = P(A ∩ B) / P(B)

where:
P(A | B) = the (conditional) probability that event A will occur given that event B
has occurred already
P(A ∩ B) = the (unconditional) probability that event A and event B both occur
P(B) = the (unconditional) probability that event B occurs
Multiplication Rule
The multiplication rule is a result used to determine the probability that two events, A and
B, both occur:

P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A)

where:
P(A) = probability that event A occurs
P(B) = probability that event B occurs
P(A ∩ B) = probability that event A and event B both occur
P(A | B) = the conditional probability that event A occurs given that event B has
occurred already
P(B | A) = the conditional probability that event B occurs given that event A has
occurred already

For independent events, that is events which have no influence on one another, the rule
simplifies to:

P(A ∩ B) = P(A) · P(B)

That is, the probability of the joint events A and B is equal to the product of the
individual probabilities for the two events.
In other words, two events are independent if the occurrence of one of the events gives us
no information about whether or not the other event will occur; that is, the events have no
influence on each other.
Example 1
Suppose that a man and a woman each have a pack of 52 playing cards. Each draws a
card from his/her pack. Find the probability that they each draw the ace of clubs.
We define the events:
A = the man draws the ace of clubs, so P(A) = 1/52
B = the woman draws the ace of clubs, so P(B) = 1/52
Clearly events A and B are independent, so:

P(A ∩ B) = P(A) · P(B) = 1/52 × 1/52 = 1/2704
Dependent events: Two events are dependent when the occurrence (or non-occurrence)
of one event has an effect on the probability of the other event. If events A and B are
dependent, then P(A and B) = P(A) · P(B | A), or equivalently P(A ∩ B) = P(B) · P(A | B).
Example: A box contains 10 rolls of film, 3 of which are defective. Two rolls are
selected at random one after the other without replacement. What is the probability of
selecting a defective roll followed by another defective roll?
Solution:
Let A = event that the first roll defective
Let B = event that the second roll is defective
P(A) = 3/10 and P(B | A) = 2/9
P(A and B) = P(A) · P(B | A) = 3/10 × 2/9 = 6/90 = 1/15 ≈ 0.067
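The without-replacement calculation can be confirmed by enumerating every ordered way of drawing two rolls:

```python
from fractions import Fraction
from itertools import permutations

# 10 rolls of film: 3 defective (D) and 7 good (G)
rolls = ["D"] * 3 + ["G"] * 7

# All ordered draws of two rolls without replacement: 10 * 9 = 90 pairs
pairs = list(permutations(rolls, 2))
both_defective = [p for p in pairs if p == ("D", "D")]

prob = Fraction(len(both_defective), len(pairs))
print(prob)  # 6/90 = 1/15
```

Because permutations treats each physical roll as distinct, the 3 × 2 = 6 defective-defective orderings out of 90 reproduce the 3/10 × 2/9 product directly.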
Law of Total Probability
The probability of an event A can be written in terms of a second event B and its
complement B':

P(A) = P(A ∩ B) + P(A ∩ B')

where:
P(A) = probability that event A occurs
P(A ∩ B) = probability that event A and event B both occur
P(A ∩ B') = probability that event A and event B' both occur, i.e. A occurs and B
does not.
Using the multiplication rule, this can be expressed as
P(A) = P(A | B).P(B) + P(A | B').P(B')
Bayes' Theorem
Bayes' Theorem is a result that allows new information to be used to update the
conditional probability of an event. Using the multiplication rule, it can be written in its
simplest form as:

P(B | A) = P(A | B) · P(B) / P(A)

Expanding P(A) with the law of total probability gives

P(B | A) = P(A | B) · P(B) / [P(A | B) · P(B) + P(A | B') · P(B')]

where:
P(A | B') = probability that event A occurs given that event B has not occurred
already
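As a numerical sketch of Bayes' Theorem (the numbers below are invented for illustration, not from the notes): suppose 2% of items are defective, a test flags a defective item with probability 0.95, and it falsely flags a good item with probability 0.10.

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """P(B | A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B')P(B')]."""
    p_not_b = 1 - p_b
    numerator = p_a_given_b * p_b
    denominator = numerator + p_a_given_not_b * p_not_b
    return numerator / denominator

# Hypothetical example: B = "item is defective", A = "test flags the item"
posterior = bayes(p_a_given_b=0.95, p_b=0.02, p_a_given_not_b=0.10)
print(round(posterior, 3))  # ≈ 0.162
```

Even with a fairly accurate test, the posterior probability of a flagged item being defective is only about 16%, because defective items are rare to begin with.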
TUTORIAL QUESTIONS
No. of violations No. of drivers
0 1,910
1 46
2 18
3 12
4 9
5 or more 5
a. What is the experiment?
b. List one possible event
c. What is the probability that a particular driver had exactly two violations?
6. Before a nationwide survey was conducted, 40 people were selected to test the
questionnaire. One question about whether abortions should be legal required a yes or
no answer.
a. What is the experiment?
b. List one possible event
c. Ten of the 40 people favoured the legalization of abortions. Based on these
sample responses, what is the probability that a particular person will be in
favour of the legalization of abortions?
d. Are each of the possible outcomes equally likely and mutually exclusive?
7. An automatic Shaw machine fills plastic bags with a mixture of beans, broccoli and
other vegetables. Most of the bags contain the correct weight, but because of the slight
variation in the size of the beans and other vegetables, a package might be slightly
underweight or overweight. A check of 10,000 packages filled in the past month revealed
the following:
8. What is the probability that a randomly chosen card from a standard deck of
playing cards is either a
a. king or a heart?
b. club or a heart?
c. queen or a jack?
9. The probabilities of two events A and B are 0.20 and 0.30 respectively. The probability
that both A and B occur is 0.15. What is the probability that either A or B will occur?
10. The probabilities of two events A and B are 0.35 and 0.65 respectively. The
probability that both A and B occur is 0.20. What is the probability that either A or B will
occur? Are the events mutually exclusive? Give a reason.
11. A survey of executives dealt with their loyalty to the company, that is, whether or not
they would leave if another company offered them a position higher than their present
one. The table below shows the results of the survey.
LENGTH OF SERVICE
LOYALTY Less than 1 year 1 to 5 years 6 to 10 years More than 10 years Total
Would remain 10 30 5 75 120
Would not remain 25 15 10 30 80
Total 35 45 15 105 200
What is the probability of randomly selecting an executive who is loyal to the company
(would remain) and who has more than 10 years of service?
12. Consider the following contingency table of joint frequencies:

        B     B1
A      10     20
A1     20     40
What is the probability of:
d. Event A?
e. Event B?
f. Event A1?
g. Event A and B?
h. Event A and B1?
i. Event A1 and B1?
j. Event A or B?
k. Event A or B1?
l. Event A1 or B1?
13. In the past several years, credit card companies have made an aggressive effort to
solicit new accounts from college students. Suppose that a sample of 200 students at a
college indicated the following information as to whether the student possessed a bank
credit card and/or a travel and entertainment credit card:

                    Travel and entertainment credit card
Bank credit card        Yes      No
Yes                      60      60
No                       15      65
If a student is selected at random, what is the probability that the student:
i. Had a bank credit card?
ii. Had a travel and entertainment credit card?
iii. Had a bank credit card and travel and entertainment credit card?
iv. Had a bank credit card or travel and entertainment credit card?
v. Does not have a bank credit card and travel and entertainment credit card?
vi. Neither a bank credit card nor travel and entertainment credit card?
14. A sample of 500 respondents was selected in a large metropolitan area to obtain
information concerning consumer behaviour. Among the questions asked was,
"Do you enjoy shopping for clothing?" Of 240 males, 136 answered yes. Of 260 females,
244 answered yes.
m. Set up a 2x2 contingency table to evaluate the probabilities.
n. What is the probability that a respondent chosen at random:
i. Is a male?
ii. Enjoys shopping for clothing?
iii. Is a female and enjoys shopping for clothing?
iv. Is a male and does not enjoy shopping for clothing?
v. Is a female or enjoys shopping for clothing?
vi. Is a male or does not enjoy shopping for clothing?
vii. Is a male or female?
d. P(A1/B)
17. If P(A) = 0.70, P(B) = 0.60 and A and B are statistically independent, find P(A and
B).
18. If P(A) = 0.3, P(B) = 0.4 and P(A and B) = 0.20, are A and B statistically independent?
19. A bag contains 6 red balls and 4 black balls. Two balls are drawn at random one at a
time. Find the probability that:
i. Both balls are red.
j. Both balls are black.
k. The second ball is red given that the first is black.
l. The second ball is black given that the first is red.
COMBINATORIAL CONCEPTS
This deals with combinations and permutations. These techniques are used to determine
the total number of possible outcomes.
PERMUTATION: This refers to the number of ways in which a set of objects can
be arranged in order. Order is very important. The number of permutations of n
objects taking r at a time, denoted nPr, is given by

nPr = n! / (n - r)!
Example: Suppose 4 people are to be randomly chosen out of 10 people who agreed
to be interviewed in a market survey. The four people are to be assigned to four
interviewers. How many possibilities are there?
Solution: The possibilities are given by the relation nPr, with n = 10 and r = 4:

10P4 = 10! / (10 - 4)! = (10*9*8*7*6*5*4*3*2*1) / (6*5*4*3*2*1) = 5,040

COMBINATION: When order does not matter, the number of ways of selecting r objects
from n, denoted nCr, is given by nCr = n! / (r!(n - r)!). For example,

10C3 = 10! / (3!(10 - 3)!) = (10*9*8*7*6*5*4*3*2*1) / (3*2*1(7*6*5*4*3*2*1)) = 120
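Both counts can be verified directly with Python's standard library, which implements these two formulas as math.perm and math.comb (Python 3.8+):

```python
import math

# nPr: ordered arrangements of r objects chosen from n
print(math.perm(10, 4))   # 5040

# nCr: unordered selections of r objects from n
print(math.comb(10, 3))   # 120
```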
EXERCISE
1. Three electronic parts are to be assembled into a plug-in unit for a television set. The
parts can be assembled in any order. How many different ways can the 3 parts be
assembled?
2. Suppose GHAMOT machine shop has 8 screw machines available but only 3 spaces
available in the production area. In how many different ways can the 8 machines be
arranged in the 3 available spaces?
3. The marketing department has been given the assignment of designing colour codes for
42 different lines of compact disk sold by Goody Records. Three colours are to be used
on each CD, but a combination of 3 colours used on one CD cannot be rearranged and
used to identify a different CD. Would 7 colours taking 3 at a time be adequate to colour
code the 42 lines?
4. A transport manager has to plan routes for his drivers. There are 3 deliveries to be
made to customers X, Y and Z. How many different routes can be taken?
5. A company has 4 training officers A,B,C, D and two training sections. In how many
different ways may the 4 officers be assigned to the two sections X and Y?
6. A committee of 5 is to be chosen from 4 men and 5 women to work on a project.
a) In how many ways can the team be chosen?
b) In how many ways can the committee be chosen to include just 3 women?
c) What is the probability that the committee includes
i. at least 3 women?
ii. more than 3 women?
7. In a stock room, 6 adjacent shelves are available for storing 6 different items. The
stock of each item can be stored satisfactorily on any shelf.
(i) In how many ways can the 6 items be stored on the 6 shelves?
(ii) If there are 7 different items to be stored, but only 5 shelves are available, how many
arrangements are possible?
Binomial Distribution
In cases where the variable of interest is dichotomous (has two parts or outcomes),
the variable is binomial. Many business situations give rise to the compilation of
simple “yes” or “no” type of answers to particular questions. Examples of these
situations include:
i. In sampling the output of a production line, we could record for each item
coming off the line whether or not it is defective (that is, defective or
nondefective).
ii. A salesgirl may or may not succeed in obtaining an order
iii. A consumer survey may indicate whether or not people are likely to buy a
product.
In each of the above cases, only two outcomes are possible. Statistical analysis of these
types of situations may be referred to as a binomial experiment. We classify the two possible
outcomes as "success" and "failure". In a binomial experiment, we are interested in the
number of successes (or failures) occurring in n independent trials (such as n items
inspected on a production line). Typically, a binomial random variable is the number of
successes in a series of trials, for example, the number of 'heads' occurring when a coin is
tossed 50 times.
If we let x represent the random variable of the number of successes occurring in n such
trials, then x can take on any of the discrete values 0, 1, 2, ..., n. The probabilities
associated with each of the possible outcomes have a special frequency distribution called
the binomial probability distribution.
A discrete random variable X is said to follow a Binomial distribution with parameters n
and p, written X ~ Bi(n,p) or X ~ B(n,p), if it has probability distribution

P(X = x) = nCx * p^x * (1 - p)^(n - x)

where
x = 0, 1, 2, ......., n
n = 1, 2, 3, .......
p = success probability; 0 < p < 1
The Binomial distribution has expected value E(X) = np and variance V(X) = np(1-p).
it is functioning normally, the probability of a component selected at random being
defective is 0.05. What is the probability that there will be 2 defectives?
Solution: P(success) = 0.05, P(failure) = 1 - 0.05 = 0.95, n = 5, and x = 2.
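As a check, the binomial formula can be evaluated directly; this sketch computes P(X = 2) for n = 5 and p = 0.05:

```python
import math

def binom_pmf(x, n, p):
    # P(X = x) = nCx * p^x * (1 - p)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# probability of exactly 2 defectives in a sample of 5
print(round(binom_pmf(2, 5, 0.05), 4))   # 0.0214
```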
TUTORIAL QUESTIONS
QUESTION 1
Let X be a binomial random variable. Compute the following probabilities:
a. P(x=3) if n=5 and p=.2
b. P(x=2) if n=6 and p=.3
c. P(x=5) if n=5 and p=.75
QUESTION 2
A shoe store's records show that 30% of customers making a purchase use a credit card to
make payment. On a particular day, 20 customers purchased shoes from the store.
o. Find the probability that at least customers use a credit card.
p. What is the expected number of customers who used a credit card?
QUESTION 3
An auditor is preparing for a physical count of inventory as a means of verifying its
values. Items counted are reconciled with a list prepared by the storeroom supervisor.
Normally, 20% of the items counted cannot be reconciled without reviewing invoices.
The auditor selects 10 items.
Find the probability of each of the following:
i. Up to 4 items cannot be reconciled.
ii. At least 6 items cannot be reconciled.
iii. Between 4 and 6 items (inclusive) cannot be reconciled.
QUESTION 4
A student majoring in Accounting is trying to decide upon the number of firms to which
she should apply. Given her work experience, grade and extracurricular activities, she has
been told by a placement counselor that she cannot expect to receive a job offer from
80% of the firms to which she applies. Wanting to save time, the student applies to only
five firms. Assuming the counselor’s estimate is correct, find the probability that the
student receives the following:
i. No offers
ii. At most 2 offers
iii. Between 2 and 4 offers (inclusive)
iv. 5 offers
QUESTION 5
A multiple-choice quiz has 15 questions. Each question has five possible answers, of
which only one is correct.
a. What is the probability that sheer guesswork will yield at least seven
correct answers?
b. What is the expected number of correct answers by sheer guesswork?
QUESTION 6
A sign on the gas pumps of a certain chain of gasoline stations encourages customers
to have their oil checked, claiming that one out of every four cars should have its oil
topped up.
a. What is the probability that exactly 3 of the next 10 cars entering a station
should have their oil topped up?
b. What is the probability that at least half of the next 10 cars entering a station
should have their oil topped up?
c. What is the probability that at least half of the next 20 cars entering a
station should have their oil topped up?
QUESTION 7
A company Minibus has 7 passenger seats and on a routine run it is estimated that any
passenger seat will be filled with probability 0.42.
a. What is the mean and variance of the binomial distribution of the number of
passengers on a routine run?
b. Calculate the probability (to 3 decimal places) that on a routine run:
i. There will be no passengers
ii. There will be just one passenger
iii. There will be exactly two passengers
iv. There will be at least three passengers
QUESTION 8
Suppose a poll of 20 voters is taken in a large city. The purpose is to determine the
number who favour a candidate for mayor. Suppose that 60% of all the city voters
favour the candidate.
a. find the mean and standard deviation of x.
b. find the probability that:
i. x ≤ 3
ii. x > 17
Poisson Distribution
While a binomial random variable counts the number of successes that occur in a fixed
number of trials, a Poisson random variable counts the number of rare events (successes)
that occur in a specified time interval or a specified region. Activities to which the
Poisson distribution can be successfully applied include counting the number of
telephone calls received by a switchboard in a specified time period, and counting the number
of arrivals at a service location (such as a tollbooth or counter) in a given time period.
Typically, a Poisson random variable is a count of the number of events that occur in a
certain time interval or spatial area. For example, the number of cars passing a fixed point
in a 5 minute interval, or the number of calls received by a switchboard during a given
period of time.
A Poisson random variable X with mean m has probability distribution

P(X = x) = (e^(-m) * m^x) / x!

where
x = 0, 1, 2, ...
m > 0.

In a Poisson experiment, success refers to occurrence of the event of interest, and interval
refers to either an interval of time or an interval of space.
The Poisson distribution has expected value E(X) = m and variance V(X) = m; i.e. E(X) =
V(X) = m.
The Poisson distribution can sometimes be used to approximate the Binomial distribution
with parameters n and p. When the number of observations n is large, and the success
probability p is small, the Bi(n,p) distribution approaches the Poisson distribution with
the parameter given by m = np. This is useful since the computations involved in
calculating binomial probabilities are greatly reduced.
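The approximation can be illustrated numerically; the values of n and p below are hypothetical, chosen only so that n is large and p is small:

```python
import math

def poisson_pmf(x, m):
    # P(X = x) = e^(-m) * m^x / x!
    return math.exp(-m) * m ** x / math.factorial(x)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# for large n and small p, Bi(n, p) is close to Poisson with m = np
n, p = 1000, 0.002
m = n * p   # m = 2
print(round(binom_pmf(3, n, p), 4))    # exact binomial: 0.1806
print(round(poisson_pmf(3, m), 4))     # Poisson approximation: 0.1804
```

The two values agree to about three decimal places, which is why the approximation is useful when exact binomial arithmetic is tedious.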
QUESTION 1
The manager of a company has noted that she usually receives 10 complaint calls from
customers during a week (consisting of 5 working days) and that the calls occur at
random. Find the probability of her receiving exactly five such calls in a single day.
QUESTION 2
The number of calls received by a switchboard operator between 9 a.m. and 10 a.m. has a
Poisson distribution with mean 12. Find the probability that the operator received
a. 1 call
b. at least five calls during the period
QUESTION 3
The number of accidents that occur on an assembly line has a Poisson distribution, with
an average of three accidents a week.
a. Find the probability that a particular week will be accident-free.
b. Find the probability that at least 3 accidents will occur in a week.
c. Find the probability that exactly 5 accidents will occur in a week.
CORRELATION
When the value of one variable is related to the value of another, they are said to be
correlated. Thus, correlation measures the relationship (association) between quantitative
variables. The coefficient of correlation is the measure of this association.
If there is a perfect linear relationship with positive slope between the two variables, we
have a correlation coefficient of 1; if there is positive correlation, whenever one variable
has a high (low) value, so does the other. If there is a perfect linear relationship with
negative slope between the two variables, we have a correlation coefficient of -1.
Movements in one variable may cause movement in the same direction in the other
variable. For example, there is likely to be some correlation between a person's height
and weight, or between the price and quantity of an item supplied.
If there is negative correlation, whenever one variable has a high (low) value, the other
has a low (high) value. Movements in one variable may cause movement in the opposite
direction in the other variable. For example, price and quantity demanded of an item.
A correlation of 0 indicates no linear relationship; it does not by itself mean that the
variables are independent.
The Pearson correlation coefficient formula makes several assumptions about the nature of
the data to which it is applied. First of all, the two variables are assumed to have been
measured using interval or ratio scales. If this is not the case, there are other types of
coefficients that can be computed that match the data on hand. A second implicit
assumption is that the nature of the relationship that we are trying to measure is linear.
The use of Pearson correlation coefficient also assumes that the variables you want to
analyse come from a bivariate normally distributed population. That is, the population is
such that all the observations with a given value of one variable have values of the second
variable that are normally distributed.
When this assumption is not justified, a non-parametric measure such as the Spearman
Rank Correlation Coefficient might be more appropriate.
If the variables are denoted by X and Y and the data is in the form (X1,Y1), (X2,Y2),
(X3,Y3), ..., (Xn,Yn), then r is given by the formula

r = (nΣxy - ΣxΣy) / √[(nΣx² - (Σx)²)(nΣy² - (Σy)²)]
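The formula translates directly into code; the data below is made up purely to illustrate (a perfect positive linear relationship, so r should be 1):

```python
import math

def pearson_r(x, y):
    # r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²))
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

# hypothetical data with an exact positive linear relationship
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
```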
An alternative, Spearman's rank correlation coefficient, does not take the actual values of
the variables into consideration but only their relative magnitudes. For each of the series
of values x and y we note the magnitudes of the items and assign them ranks in the
descending order of their magnitudes. In case of a tie, when two or more items have the
same value, we adopt the following methods:
i. All such items are given the same rank, that is, the rank in which the tie
occurred.
ii. They are given the same rank, namely the average of the ranks of places
occupied by these items. In this case, the ranks of some items may be
fractional.
iii. The next item gets as usual the rank which it would have but for the tie.
iv. The difference d in the two ranks of the corresponding items of the two series
is then ascertained.
v. The coefficient of correlation R is then given by:
R = 1 - 6Σd² / (n(n² - 1))
where n is the number of items in each series. The value of R so obtained also ranges
from -1 to +1. Since this coefficient is based on the ranks of the items in the two series, it
is known as Spearman's rank correlation coefficient.
In case m items in a series have the same rank, we add M = (1/12)(m³ - m) to the value of
Σd² as a correction factor. The coefficient of correlation is then given as:

R = 1 - 6(Σd² + M) / (n(n² - 1))
There may be as many corrections as the number of groups of items having the same
rank. Thus, if there is more than one such group of items with common ranks, the
above correction (m may be different for each group) is added as many times as the
number of such groups. This method has the advantage that only the ranks of the items
are required to be known and not their actual values. Hence it can be profitably used in
those cases where the variables cannot be given numerical values, but only positions, for
example, in cases where the variables are attributes like honesty, intelligence, character,
etc. It is also easier to compute than the product moment correlation coefficient.
The Spearman rank correlation coefficient is the recommended statistic to use when the
two variables have been measured using an ordinal scale. If either one of the variables is
represented by rank-order data, the best approach is to use the Spearman rank-order
correlation rather than the Pearson product moment correlation coefficient.
Example 1: Calculate the coefficient of correlation for the following by the method of
rank difference.
x 75 88 95 70 60 80 81 50
y 120 134 150 115 110 140 142 100
Solution:
x y Rank of x Rank of y d d2
75 120 5 5 0 0
88 134 2 4 -2 4
95 150 1 1 0 0
70 115 6 6 0 0
60 110 7 7 0 0
80 140 4 3 1 1
81 142 3 2 1 1
50 100 8 8 0 0
Σd² = 6

R = 1 - (6 * 6) / (8(8² - 1)) = 1 - 36/504 = 0.93
Example 2: The following table shows the marks obtained by ten students in
accountancy and statistics. Find Spearman’s coefficient of rank correlation.
No 1 2 3 4 5 6 7 8 9 10
x 45 70 65 30 90 40 50 75 85 60
y 35 90 70 40 95 40 60 80 80 50
Solution:

Student    x     y    Rank x   Rank y      d      d²
1         45    35       8       10       -2     4.00
2         70    90       4        2        2     4.00
3         65    70       5        5        0     0
4         30    40      10        8.5      1.5   2.25
5         90    95       1        1        0     0
6         40    40       9        8.5      0.5   0.25
7         50    60       7        6        1     1.00
8         75    80       3        3.5     -0.5   0.25
9         85    80       2        3.5     -1.5   2.25
10        60    50       6        7       -1     1.00
Total                                     Σd² = 15
In statistics, the highest mark is 95, hence this student gets rank 1. Rank 2 goes to the
student with 90 marks. Now, there are two students who got 80 marks each. They should
get the 3rd and 4th ranks. Since their marks are equal, their ranks should also be equal.
Therefore, each of them is given the rank (3+4)/2 = 3.5. In a similar manner, the two
students who got 40 marks each are given the rank (8+9)/2 = 8.5 each.
The rank correlation coefficient is:
R = 1 - 6(Σd² + M) / (n(n² - 1))

where M = (1/12)(m³ - m) for each group of m tied items.

For the given data, there are two groups of tied figures: m1 (the number of students
having the same mark, i.e. 80) = 2 and m2 (the number of students having the same
mark, i.e. 40) = 2.

Hence, M = 1/12(2³ - 2) + 1/12(2³ - 2) = 1

R = 1 - 6(15 + 1) / (10(10² - 1)) = 1 - 96/990 = 0.90
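The tie-corrected calculation can be reproduced programmatically; this sketch ranks in descending order (rank 1 = largest), as in the worked example:

```python
from collections import Counter

def avg_ranks(values):
    # rank 1 = largest value; tied values share the average of their rank positions
    order = sorted(values, reverse=True)
    pos = {}
    for i, v in enumerate(order, start=1):
        pos.setdefault(v, []).append(i)
    return [sum(pos[v]) / len(pos[v]) for v in values]

def spearman_rank(x, y):
    rx, ry = avg_ranks(x), avg_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # tie correction: M = (1/12)(m^3 - m) for every tied group in x and in y
    M = sum((m ** 3 - m) / 12 for vals in (x, y) for m in Counter(vals).values() if m > 1)
    n = len(x)
    return 1 - 6 * (d2 + M) / (n * (n * n - 1))

x = [45, 70, 65, 30, 90, 40, 50, 75, 85, 60]
y = [35, 90, 70, 40, 95, 40, 60, 80, 80, 50]
print(round(spearman_rank(x, y), 2))   # 0.9
```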
SIMPLE LINEAR REGRESSION
Simple linear regression aims to find a linear relationship between a response variable
and a possible predictor variable by the method of least squares.
Least Squares
The method of least squares is a criterion for fitting a specified model to observed data.
For example, it is the most commonly used method of defining a straight line through a
set of points on a scatterplot.
Regression Equation
A regression equation allows us to express the relationship between two (or more)
variables algebraically. It indicates the nature of the relationship between two (or more)
variables. In particular, it indicates the extent to which you can predict some variables by
knowing others, or the extent to which some are associated with others.
a = Σy/n - b(Σx/n)
Regression Line
A regression line is a line drawn through the points on a scatter plot to summarise the
relationship between the variables being studied. When it slopes down (from top left to
bottom right), this indicates a negative or inverse relationship between the variables;
when it slopes up (from bottom left to top right), a positive or direct relationship is
indicated.
The regression line often represents the regression equation on a scatter plot.
TIME SERIES
A time series is a collection of data recorded over a period of time: usually weekly, monthly,
quarterly or yearly. Examples include quarterly earnings reports of a firm, monthly
shipments of cement from a harbour, annual consumer price indices, etc.
Unless the time series data are subject to a constant rate of change in an upward or
downward direction, which in fact is unlikely, all time series are subject
to variations whose pattern and amplitude vary from one period to another.
Given this, the objective of dealing with a time series is to study and analyse these
variations with a view to knowing how the time series has behaved in the past. This
knowledge about the past behaviour is useful for drawing inferences about the future.
TIME SERIES ANALYSIS: The object of time series analysis is to discover the magnitude and
direction of the trend that may exist in the time series, the nature and amplitude of
cycles, the effect of seasonal change and the size of random movements. The analysis is
done to estimate and separate the four types of variations and to bring out the selective
impact of each on the overall behaviour of the time series.
Analysis of a time series essentially involves decomposition of the series into its
components.
Irregular (random) variations: These are abrupt but infrequent variations that go
extremely deep downward or very high upward. They are caused by chance and
unusual situations. They follow no discernible pattern and so they cannot be predicted.
Which of the two models (additive or multiplicative) is used in decomposition depends on
the assumption that we might make about the nature of the relationship among the four
components.
The additive model: The additive model is used where it is assumed that the four
components are independent of one another. Under this assumption, the four components
are additive in the sense that the magnitude of the time series is the sum of the separate
influences of the four components.
Thus, if Yt is taken to represent the magnitude of the time series at period t, then
Yt can be expressed as:

Yt = Tt + Ct + St + Rt

where Tt, Ct, St and Rt denote the trend, cyclical, seasonal and random components
respectively.
When the time series data are recorded against years, the seasonal component would
vanish, and in that case the additive model will take the form:
Y t = T t + Ct + R t
The multiplicative model: This is used when it is assumed that the forces giving rise to the
four types of variations are interdependent, so that the overall pattern of variations in the
time series is the combined result of the interaction of the forces operating on the time
series.
According to this assumption, the original magnitudes of the time series are the product
of its four components. That is:

Yt = Tt x Ct x St x Rt

and, for annual data,

Yt = Tt x Ct x Rt
As regards the choice between the two models, the multiplicative model is generally
used more frequently. The reason is that most business and economic time
series data are the result of the interaction of a large number of forces which, individually,
cannot be held responsible for generating any particular type of variation. Since the
forces responsible for one type of variation are also responsible for the other types,
it is the multiplicative model which is ideally suited for the purpose of
decomposition of a time series.
For this purpose, the Pearson's approach based on the multiplicative model is used. The
first step in this approach is to estimate the trend variations by fitting an
appropriate trend to the time series. After estimating the trend variations, these are then
separated from the time series data by dividing the original magnitudes by the
corresponding trend values. Separating the trend variations from a time series is known as
detrending. This is given as:

Yt / Tt = Ct x St x Rt
Alternatively, the seasonal variation can be separated by dividing the monthly data by
the seasonal index. This is known as deseasonalising a time series, and the resultant series
is called the deseasonalised time series.
The trend variation and seasonal variation having been removed, only the cyclical and
irregular variations would be left, which can easily be examined with reference to the
pattern of their occurrence and amplitude.
Generally, the random variations are neither very important, nor can they be easily
eliminated. However, the extent to which their elimination is possible , they tend to
become marginal in the process of deseasonalisation.
ESTIMATION OF TREND: The essence of trend estimation lies in fitting a trend line on
the time series data in such a way that it passes through (as nearly as possible) the
middle of the high and low turning points of the time series graph.
The trend can take many possible shapes. A straight line trend is frequently encountered
because most business and economic time series either consistently tend to increase or
decline over a long period of time.
Moving average method: This method of obtaining a time series trend involves
calculating a set of averages, each one corresponding to a trend (t) value for a time point
of the series. These are known as moving averages, since each average is calculated by
moving from one overlapping set of values to the next. The number of values in each set
is always the same and is known as the period of the moving average.
To demonstrate the technique, a set of moving averages of period 5 has been calculated
for a set of values:
Original values: 12 10 11 11 9 11 10 10 11 10
Moving totals: 53 52 52 51 51 52
Moving averages: 10.6 10.4 10.4 10.2 10.2 10.4
The first total, 53, is formed by adding the first 5 items; i.e.
12+10+11+11+9 = 53. Similarly, the second total is given by 10+11+11+9+11. The rest are
found in like manner. The averages are then obtained by dividing each total by 5. Notice
that the totals and the averages are written down in line with the middle value of the set
being worked on. These averages are the trend (t) values required. It should also be
noticed that there are no trend values corresponding to the first and last two original
values. This is always the case with moving averages and is a disadvantage of this
particular method of obtaining a trend. Another disadvantage is that the averages do not
yield an equation which could be used for forecasting the values of a time series variable
for the future.
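The calculation above can be sketched in a few lines:

```python
def moving_averages(values, period):
    # each average is written against the middle value of its window
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]

data = [12, 10, 11, 11, 9, 11, 10, 10, 11, 10]
print(moving_averages(data, 5))   # [10.6, 10.4, 10.4, 10.2, 10.2, 10.4]
```

Note that a series of 10 values and a period of 5 yields only 6 trend values, reflecting the missing values at each end mentioned above.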
The choice of the length of the period for a moving average is important because
this determines the extent to which the variations are smoothed in the process
of averaging. Cycles of uniform length and height can be more easily eliminated by
choosing a moving average period equal to, or a multiple of, the length of the cycle. In other
words, if the period of the moving average is equal to the period of the cycle or is a
multiple of it, the smoothing is perfect and we have a straight line trend. In general, a
shorter period is more useful in averaging out the cycles.
Least squares method: Estimation of trend values by the method of least squares makes
use of the regression equation. That is:

Tt = a + bt

where Tt = the trend value of the time series variable T in time t,
a = trend value at the point of origin,
b = amount by which the trend value changes per unit of time,
t = the value of the independent variable, that is, time.
The values of the two constants, a and b, in the regression equation are determined by
solving the two normal equations:

Σy = na + bΣx
Σxy = aΣx + bΣx²

Or using the identities:

b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)

a = Σy/n - b(Σx/n)
Example 1 : Sales of food since 1987 are shown below:
YEAR 1987 1988 1989 1990 1991
SALES($M) 7 10 9 11 13
Determine the least square trend line equation.
Solution:
YEAR    SALES (y)    x    xy    x²
1987        7        0     0     0
1988       10        1    10     1
1989        9        2    18     4
1990       11        3    33     9
1991       13        4    52    16
TOTAL      50       10   113    30

The values for the years have been coded, i.e. 1987 = 0, 1988 = 1, etc. Then

b = (5(113) - 10(50)) / (5(30) - 10²) = 65/50 = 1.3
a = 50/5 - 1.3(10/5) = 7.4

so the trend line is y' = 7.4 + 1.3x.
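The identities can be applied to this example in code; using the coded x values, the fitted trend line works out to y' = 7.4 + 1.3x:

```python
x = [0, 1, 2, 3, 4]          # coded years: 1987 = 0, ..., 1991 = 4
y = [7, 10, 9, 11, 13]       # sales ($m)
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi * xi for xi in x)

# b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²);  a = Σy/n - b(Σx/n)
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = sy / n - b * (sx / n)
print(round(a, 2), round(b, 2))   # 7.4 1.3
```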
The problem can be simplified by making the following change of scale (coding): let x be
the variable which measures time, taking the origin (the zero) of the new scale at the
middle of the series, that is, at the middle of the x's, so that in the new scale Σx = 0. If the
series has an odd number of years, we assign x = 0 to the middle year and number the
years ..., -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, .... If the series has an even number of years,
there is no middle year, and we assign successive years the numbers ..., -7, -5, -3, -1, 1, 3,
5, 7, ..., with -1 and 1 assigned to the two middle years. Substituting Σx = 0 into the
identities above we have:
b = Σxy / Σx²

a = Σy / n
Example 2: Fit a least square trend line to the data below.
Year    Total annual revenues
1966      789
1967      542
1968      769
1969    1,093
1970    1,175
1971    1,067
1972    1,166
1973    1,426
1974    1,692
Solution: Since we have figures for nine (an odd number of) years, we label them -4, -3,
-2, -1, 0, 1, 2, 3, 4, and the sums needed for substitution into the formulas for a and b are
obtained in the following table.
Year     x       y        xy     x²
1966    -4     789    -3,156     16
1967    -3     542    -1,626      9
1968    -2     769    -1,538      4
1969    -1   1,093    -1,093      1
1970     0   1,175         0      0
1971     1   1,067     1,067      1
1972     2   1,166     2,332      4
1973     3   1,426     4,278      9
1974     4   1,692     6,768     16
Total    0   9,719     7,032     60
b = Σxy / Σx² = 7,032 / 60 = 117.2

a = Σy / n = 9,719 / 9 = 1,079.9

so the trend equation is y' = 1,079.9 + 117.2x.
This makes it clear that 1,079.9 is the trend value for 1970 and that the annual trend
increment (the year-to-year growth) in the annual revenues is estimated as $117.2 million
for the given period of time.
Using the equation, we can now determine the trend value for any year by substituting the
corresponding value of x. For instance, for 1966 we substitute x = -4 and get a trend value
of y' = 1,079.9 + 117.2(-4) = 611.1, and for 1974 we substitute x = 4 and get a trend value
of y' = 1,079.9 + 117.2(4) = 1,548.7. Plotting these two trend values and joining them with
a straight line, we obtain the least-squares trend line (see graph).
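The arithmetic can be confirmed from the column totals alone, since the coding makes Σx = 0:

```python
# totals from the table; the coded years make sum(x) = 0
n, sum_y, sum_xy, sum_x2 = 9, 9719, 7032, 60
a = sum_y / n          # trend value at the origin (1970)
b = sum_xy / sum_x2    # annual trend increment
print(round(a, 1), round(b, 1))      # 1079.9 117.2
print(round(a + b * (-4), 1))        # trend for 1966: 611.1
print(round(a + b * 4, 1))           # trend for 1974: 1548.7
```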
Yt / Tt = Ct x Rt
The values in columns 1 and 2 have been used to calculate the trend equation. Column 1
values have been coded. The trend equation is y' = 2,708.08 - 154.04x (please verify). The
trend values are: for 1957,

y' = 2,708.08 - 154.04(-6) = 3,632.32

The rest are computed in like manner and put in column 3. (Please check the values.)
When these values are multiplied by 100, we have what are known as cyclical relatives.
These are put in column 5. It is assumed that random variations do not have much
influence on annual data. Therefore what remains after detrending the annual time series
data are the cyclical variations only.
A set of data showing the relative values of a variable during all the months of the year
is known as a "seasonal index". For example, if the production of a particular item during
the months of January, February, March, ..., are 60, 90, 120, ..., percent of the average
monthly production for the whole year, then 60, 90, 120, ... constitute the seasonal index
for that year. These are sometimes referred to as "seasonal index numbers". The average
seasonal index for the whole year should be 100%, or the sum of the index numbers
should be 1,200%.
A typical index of 96.0 for January indicates that sales (or whatever the variable is) are
4% below the average for the year. An index of 107.2 for October means that the variable
is typically 7.2% above the annual average.
The method most commonly used to compute the typical seasonal pattern is called the
ratio-to-moving-average method. It eliminates the trend, cyclical and irregular
components from the original series.
Example 3: Toys International takes an inventory of its dolls, mechanical toys, and
other products on hand every quarter. The value of the inventory, in millions of dollars, at
the beginning of each quarter since 1987 is indicated below:
YEAR QUARTER
WINTER SPRING SUMMER FALL
1987 6.7 4.9 10.0 12.7
1988 6.5 4.8 9.8 13.6
1989 6.7 4.3 10.4 13.1
1990 7.0 5.5 10.8 15.0
1991 7.1 4.4 11.1 14.5
1992 8.0 4.2 11.4 14.9
What are the typical quarterly indexes using the ratio to moving average method?
Solution:
Step 1. Determine the four-quarter moving totals.
Starting with the Winter quarter of 1987, we add 6.7, 4.9, 10.0 and 12.7. The total is 34.3. The four-quarter total in column 2 is then “moved along” by adding the Spring, Summer and Fall inventories of 1987 and the Winter inventory of 1988. The total is 34.1, found by 4.9 + 10.0 + 12.7 + 6.5.
Note: Instead of adding the four inventory values afresh, we can subtract the Winter 1987 inventory (6.7) from the initial total of 34.3 and add the Winter 1988 inventory (6.5).
Step 2. Determine the four-quarter moving averages. Each moving total in column 2 is divided by 4 to give the four-quarter moving average. All the moving averages are positioned between quarters.
Step 3. Centre the four-quarter moving averages. The first centred moving average is determined as follows:
(8.575 + 8.525) / 2 = 8.550
Step 4. Determine the specific seasonal indexes. A specific seasonal index for each quarter is computed by dividing the value of the inventory in column 1 by the centred moving average in column 4. Each quotient is multiplied by 100 to convert it to an index.
The first specific seasonal index is:
(10.0 / 8.550) × 100 = 117.0
The others are computed in like manner.
Step 5. Organize the specific seasonal indexes. The specific seasonal indexes are
organized in a table
                          QUARTER
YEAR          Winter     Spring     Summer     Fall
1987             —          —       117.0     149.2
1988           76.7       56.1      112.3     156.1
1989           79.1       49.2      119.7     148.6
1990           77.3       58.9      112.6     158.5
1991           75.8       47.1      118.2     153.0
1992           84.3       43.9        —         —
TOTAL         393.2      255.2      579.8     764.8
MEAN           78.64      51.04     115.96    152.96
TYPICAL INDEX  78.92      51.22     116.27    153.50
(adjusted index)
Step 6. Adjust the seasonal indexes. The four quarterly means (78.64, 51.04, 115.96 and 152.96) should theoretically total 400, because the average is set at 100. The total may not equal 400 exactly because of rounding. In this problem the total is 398.60, so a correction factor is applied to each of the four means to force them to total 400:
Correction factor = 400 / 398.60 = 1.00351
To adjust, we compute (1.00351)(152.96) = 153.50 for the Fall. The rest are computed in like manner (see the TYPICAL INDEX row in the table above).
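The six steps above can be sketched in Python. This is a minimal illustration, not part of the original notes; the results may differ slightly from the table because the table rounds intermediate ratios.

```python
# Ratio-to-moving-average seasonal indexes (Steps 1-6) for the
# Toys International quarterly inventory data.
inventory = [6.7, 4.9, 10.0, 12.7,   # 1987 Winter..Fall
             6.5, 4.8, 9.8, 13.6,    # 1988
             6.7, 4.3, 10.4, 13.1,   # 1989
             7.0, 5.5, 10.8, 15.0,   # 1990
             7.1, 4.4, 11.1, 14.5,   # 1991
             8.0, 4.2, 11.4, 14.9]   # 1992

# Steps 1-3: four-quarter moving averages, then centre them.
ma = [sum(inventory[i:i + 4]) / 4 for i in range(len(inventory) - 3)]
centered = [(ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)]

# Step 4: specific seasonal index = (actual / centred moving average) x 100.
# The first centred average lines up with the THIRD observation (Summer 1987).
specific = [(inventory[i + 2] / centered[i]) * 100 for i in range(len(centered))]

# Step 5: group the specific indexes by quarter (0=Winter ... 3=Fall) and average.
by_quarter = {q: [] for q in range(4)}
for i, s in enumerate(specific):
    by_quarter[(i + 2) % 4].append(s)
means = [sum(v) / len(v) for _, v in sorted(by_quarter.items())]

# Step 6: force the four means to total 400 with a correction factor.
factor = 400 / sum(means)
typical = [m * factor for m in means]
print([round(t, 2) for t in typical])
```

The adjusted indexes always total exactly 400 by construction, whatever rounding was used upstream.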
(ii)
To remove the seasonal variation, the inventory for each quarter (which contains trend,
cyclical, irregular and seasonal variations ) is divided by the seasonal index for that
quarter. That is:
(Tt × Ct × Rt × St) / St = Tt × Ct × Rt
For example, the actual inventory for the first quarter of 1987 was $6.7 million. The seasonal index for the first quarter is 78.92, which indicates that inventory in the first quarter is typically 21.08% below the average for a typical quarter. Dividing the actual inventory of $6.7 million by 78.92 and multiplying by 100 gives a deseasonalized inventory of $8,489,610 for the first quarter of 1987. (This is put in column 5 in the table below.)
YEAR / QUARTER        INVENTORY   SEASONAL INDEX   DESEASONALIZED INVENTORY
1987 First Quarter       6.7         78.92              8.5
Second Quarter 4.9 51.26 9.6
Third Quarter 10.0 116.51 8.6
Fourth Quarter 12.7 153.68 8.3
1988 First Quarter 6.5 78.55 8.3
Second Quarter 4.8 51.26 9.4
Third Quarter 9.8 116.51 8.4
Fourth Quarter 13.6 153.68 8.8
1989 First Quarter 6.7 78.55 8.5
Second Quarter 4.3 51.26 8.4
Third Quarter 10.4 116.51 8.9
Fourth Quarter 13.1 153.68 8.5
1990 First Quarter 7.0 78.55 8.9
Second Quarter 5.5 51.26 10.7
Third Quarter 10.8 116.51 9.3
Fourth Quarter 15.0 153.68 9.8
1991 First Quarter 7.1 78.55 9.0
Second Quarter 4.4 51.26 8.6
Third Quarter 11.1 116.51 9.5
Fourth Quarter 14.5 153.68 9.4
1992 First Quarter 8.0 78.55 10.2
Second Quarter 4.2 51.26 8.2
Third Quarter 11.4 116.51 9.8
Fourth Quarter 14.9 153.68 9.7
Since the seasonal component has been removed (divided out) from the quarterly inventory, the deseasonalized inventory contains only the trend (T), cyclical (C) and irregular/random (R) components.
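The deseasonalizing step is a one-line computation; a minimal sketch, using the adjusted indexes quoted in the table above:

```python
# Deseasonalize a value: divide by its typical seasonal index (a percentage),
# so that T*C*R*S / S = T*C*R remains.
indexes = [78.55, 51.26, 116.51, 153.68]   # Winter, Spring, Summer, Fall


def deseasonalize(actual, quarter):
    """Return the deseasonalized value for a quarter (0=Winter ... 3=Fall)."""
    return actual / indexes[quarter] * 100


# Winter 1987: actual inventory of $6.7 million
print(round(deseasonalize(6.7, 0), 1))
# Spring 1987: actual inventory of $4.9 million
print(round(deseasonalize(4.9, 1), 1))
```

The results agree with column 5 of the table ($8.5 and $9.6 million, respectively).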
Example: Toys International would like to forecast its inventory for each quarter of 1993. Use Table Q* to determine the forecast.
Solution: The deseasonalized trend equation is determined as follows: The winter quarter
of 1987 is the period t = 1, and t = 24 corresponds to the fourth quarter of 1992.
Using this coding, the constants a and b in the equation y’ = a + bt are determined. The trend equation is y’ = 8.5169 + 0.0425t.
The slope of the trend line is 0.0425. This indicates that over the 24 quarters the deseasonalized inventory grew at a rate of 0.0425 ($ millions), or $42,500, per quarter. We can estimate the inventory for 1993 as follows:
Using the trend equation, we forecast the deseasonalized inventory for the four quarters of 1993 using t = 25, 26, 27 and 28:
y’ = 8.5169 + 0.0425(25) = 9.5794
y’ = 8.5169 + 0.0425(26) = 9.6219
y’ = 8.5169 + 0.0425(27) = 9.6644
y’ = 8.5169 + 0.0425(28) = 9.7069
(all values in $ millions)
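The forecasting step can be sketched directly from the fitted trend line (coefficients taken from the text; t = 25..28 index the four quarters of 1993):

```python
# Forecast deseasonalized inventory from the trend line y' = 8.5169 + 0.0425 t.
a, b = 8.5169, 0.0425

forecasts = {t: a + b * t for t in range(25, 29)}
for t, y in forecasts.items():
    print(t, round(y, 4))

# To convert a deseasonalized forecast back to an expected actual inventory,
# multiply by the quarter's seasonal index as a fraction, e.g. for Winter 1993:
winter_1993 = forecasts[25] * 78.92 / 100
```

Multiplying back by the seasonal index reseasonalizes the forecast, the reverse of the deseasonalizing division earlier.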
QUESTIONS
Q.1. Wyoming Park and Yellowstone Park contain shops, restaurants and motels. They
have two peak seasons: winter for skiing and summer for tourists visiting the parks. The
specific seasonals with respect to the total sales volume for recent years are:
QUARTER
Winter Spring Summer Fall
YEAR
1986 117.0 80.7 129.6 76.1
1987 118.6 82.5 121.4 77.0
1988 114.0 84.3 119.9 75.0
1989 120.7 79.6 130.7 69.6
1990 125.2 80.2 127.6 72.0
i. Calculate the quarterly indexes using the ratio to moving average method.
ii. Estimate the typical quarterly indexes.
iii. Use the multiplicative model to deseasonalize the typical quarterly indexes.
iv. Estimate the trend equation.
v. Use the trend equation to forecast sales for 1991.
Q.2
QUARTER
1 2 3 4
YEAR
1 2.2 5.0 7.9 3.2
2 2.9 5.2 8.2 3.8
3 3.2 5.8 9.1 4.1
Q3
WEEK
1 2 3 4
DAY
Monday 22 22 24 26
Tuesday 36 34 38 38
Wednesday 40 42 43 45
Thursday 48 49 49 50
Friday 61 58 62 64
Saturday 58 59 58 58
c. Calculate the quarterly indexes using the ratio to moving average method.
d. Estimate the typical quarterly indexes.
e. Use the multiplicative model to deseasonalize the typical quarterly
indexes.
f. Estimate the trend equation.
g. Use the trend equation to forecast the inventory for 1991
CONFIDENCE INTERVAL
If independent samples are taken repeatedly from the same population, and a confidence
interval calculated for each sample, then a certain percentage (confidence level) of the
intervals will include the unknown population parameter. Confidence intervals are
usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9%
(or whatever) confidence intervals for the unknown parameter.
The width of the confidence interval gives us some idea about how uncertain we are
about the unknown parameter (see precision). A very wide interval may indicate that
more data should be collected before anything very definite can be said about the
parameter.
Confidence intervals are more informative than the simple results of hypothesis tests
(where we decide "reject H0" or "don't reject H0") since they provide a range of plausible
values for the unknown parameter.
Confidence Limits
Confidence limits are the lower and upper boundaries / values of a confidence interval,
that is, the values which define the range of a confidence interval.
The upper and lower bounds of a 95% confidence interval are the 95% confidence limits.
These limits may be taken for other confidence levels, for example, 90%, 99%, 99.9%.
Confidence Level
The confidence level is the probability value (1 − α) associated with a confidence interval. It is often expressed as a percentage. For example, if α = 0.05, the confidence level is 1 − 0.05 = 0.95, i.e. a 95% confidence level.
Example
Suppose an opinion poll predicted that, if the election were held today, the Conservative
party would win 60% of the vote. The pollster might attach a 95% confidence level to the
interval 60% plus or minus 3%. That is, he thinks it very likely that the Conservative
party would get between 57% and 63% of the total vote.
A confidence interval for a mean specifies a range of values within which the unknown
population parameter, in this case the mean, may lie. These intervals may be calculated
by, for example, a producer who wishes to estimate his mean daily output; a medical
researcher who wishes to estimate the mean response by patients to a new drug; etc.
The (two-sided) confidence interval for a mean contains all the values of µ0 (the hypothesized population mean) which would not be rejected in the two-sided hypothesis test of:
H0: µ = µ0
against
H1: µ not equal to µ0
The width of the confidence interval gives us some idea about how uncertain we are
about the unknown population parameter, in this case the mean. A very wide interval may
indicate that more data should be collected before anything very definite can be said
about the parameter.
We calculate these intervals for different confidence levels, depending on how precise we
want to be. We interpret an interval calculated at a 95% level as, we are 95% confident
that the interval contains the true population mean. We could also say that 95% of all
confidence intervals formed in this manner (from different samples of the population)
will include the true population mean.
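The interpretation above can be made concrete with a short sketch: a large-sample 95% confidence interval for a mean, using only the Python standard library. The data here are hypothetical, and for small samples the normal quantile below should be replaced by a t quantile.

```python
# Large-sample 95% confidence interval for a population mean.
from statistics import NormalDist, mean, stdev

sample = [49.1, 50.2, 48.7, 50.5, 49.9, 50.3, 49.5, 50.0, 49.8, 50.4]  # hypothetical

n = len(sample)
xbar = mean(sample)
se = stdev(sample) / n ** 0.5           # estimated standard error of the mean
z = NormalDist().inv_cdf(0.975)         # ~1.96 for a 95% confidence level

lower, upper = xbar - z * se, xbar + z * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

A higher confidence level (say 99%, via `inv_cdf(0.995)`) widens the interval, trading precision for confidence, exactly as described above.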
A confidence interval for the difference between two means specifies a range of values
within which the difference between the means of the two populations may lie. These
intervals may be calculated by, for example, a producer who wishes to estimate the
difference in mean daily output from two machines; a medical researcher who wishes to
estimate the difference in mean response by patients who are receiving two different
drugs; etc.
The confidence interval for the difference between two means contains all the values of
µ1 - µ2 (the difference between the two population means) which would not be rejected in
the two-sided hypothesis test of:
H0: µ1 = µ2
against
H1: µ1 not equal to µ2
i.e.
H0: µ1 - µ2 = 0
against
H1: µ1 - µ2 not equal to 0
If the confidence interval includes 0 we can say that there is no significant difference
between the means of the two populations, at a given level of confidence.
The width of the confidence interval gives us some idea about how uncertain we are
about the difference in the means. A very wide interval may indicate that more data
should be collected before anything definite can be said.
We calculate these intervals for different confidence levels, depending on how precise we
want to be. We interpret an interval calculated at a 95% level as, we are 95% confident
that the interval contains the true difference between the two population means. We could
also say that 95% of all confidence intervals formed in this manner (from different
samples of the population) will include the true difference.
HYPOTHESIS TEST
Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved, for example, claiming that a new drug is better than the current drug for treatment of the same symptoms.
In each problem considered, the question of interest is simplified into two competing
claims / hypotheses between which we have a choice; the null hypothesis, denoted H0,
against the alternative hypothesis, denoted H1. These two competing claims / hypotheses
are not however treated on an equal basis. Special consideration is given to the null
hypothesis.
The hypotheses are often statements about population parameters like expected value and
variance; for example H0 might be that the expected value of the height of ten year old
boys in the Ghanaian population is not different from that of ten year old girls. A
hypothesis might also be a statement about the distributional form of a characteristic of
interest, for example that the height of ten year old boys is normally distributed within the
Ghanaian population.
The outcome of a hypothesis test is "Reject H0 in favour of H1" or "Do not reject H0".
Null Hypothesis
The null hypothesis, H0, represents a theory that has been put forward, either because it is
believed to be true or because it is to be used as a basis for argument, but has not been
proved. For example, in a clinical trial of a new drug, the null hypothesis might be that
the new drug is no better, on average, than the current drug. We would write
H0: there is no difference between the two drugs on average.
We give special consideration to the null hypothesis. This is due to the fact that the null
hypothesis relates to the statement being tested, whereas the alternative hypothesis relates
to the statement to be accepted if / when the null is rejected.
The final conclusion once the test has been carried out is always given in terms of the
null hypothesis. We either "Reject H0 in favour of H1" or "Do not reject H0"; we never
conclude "Reject H1", or even "Accept H1".
If we conclude "Do not reject H0", this does not necessarily mean that the null hypothesis
is true, it only suggests that there is not sufficient evidence against H0 in favour of H1.
Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.
Alternative Hypothesis
The alternative hypothesis, H1, is a statement of what a statistical hypothesis test is set up
to establish. For example, in a clinical trial of a new drug, the alternative hypothesis
might be that the new drug has a different effect, on average, compared to that of the
current drug. We would write
H1: the two drugs have different effects, on average.
The alternative hypothesis might also be that the new drug is better, on average, than the
current drug. In this case we would write
H1: the new drug is better than the current drug, on average.
The final conclusion once the test has been carried out is always given in terms of the
null hypothesis. We either "Reject H0 in favour of H1" or "Do not reject H0". We never
conclude "Reject H1", or even "Accept H1".
If we conclude "Do not reject H0", this does not necessarily mean that the null hypothesis
is true, it only suggests that there is not sufficient evidence against H0 in favour of H1.
Rejecting the null hypothesis then, suggests that the alternative hypothesis may be true.
Type I Error
In a hypothesis test, a type I error occurs when the null hypothesis is rejected when it is in
fact true; that is, H0 is wrongly rejected.
For example, in a clinical trial of a new drug, the null hypothesis might be that the new
drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type I error would occur if we concluded that the two drugs produced different effects
when in fact there was no difference between them.
The following table gives a summary of possible results of any hypothesis test:
                        Decision
              Reject H0           Don't reject H0
Truth  H0     Type I error        Right decision
       H1     Right decision      Type II error
A type I error is often considered to be more serious, and therefore more important to
avoid, than a type II error. The hypothesis test procedure is therefore adjusted so that
there is a guaranteed 'low' probability of rejecting the null hypothesis wrongly; this
probability is never 0. The probability of a type I error can be precisely computed as
P(type I error) = significance level = α
If we do not reject the null hypothesis, it may still be false (a type II error) as the sample
may not be big enough to identify the falseness of the null hypothesis (especially if the
truth is very close to hypothesis).
For any given set of data, type I and type II errors are inversely related; the smaller the
risk of one, the higher the risk of the other.
Type II Error
In a hypothesis test, a type II error occurs when the null hypothesis H0, is not rejected
when it is in fact false. For example, in a clinical trial of a new drug, the null hypothesis
might be that the new drug is no better, on average, than the current drug; i.e.
H0: there is no difference between the two drugs on average.
A type II error would occur if it was concluded that the two drugs produced the same effect, i.e. that there is no difference between the two drugs on average, when in fact they produced different effects.
Test Statistic
A test statistic is a quantity calculated from our sample of data. Its value is used to decide
whether or not the null hypothesis should be rejected in our hypothesis test.
The choice of a test statistic will depend on the assumed probability model and the
hypotheses under question.
Critical Value(s)
The critical value(s) for a hypothesis test is a threshold to which the value of the test
statistic in a sample is compared to determine whether or not the null hypothesis is
rejected.
The critical value for any hypothesis test depends on the significance level at which the
test is carried out, and whether the test is one-sided or two-sided.
Critical Region
The critical region CR, or rejection region RR, is a set of values of the test statistic for
which the null hypothesis is rejected in a hypothesis test. That is, the sample space for the
test statistic is partitioned into two regions; one region (the critical region) will lead us to
reject the null hypothesis H0, the other will not. So, if the observed value of the test
statistic is a member of the critical region, we conclude "Reject H0"; if it is not a member
of the critical region then we conclude "Do not reject H0".
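The decision rule described here can be sketched in a few lines. This assumes a two-sided z-test (normally distributed test statistic); the critical value would differ for a t-test or a one-sided test.

```python
# Decision rule: reject H0 when the test statistic falls in the critical region.
from statistics import NormalDist


def decide(z_stat, alpha=0.05):
    """Two-sided z-test at significance level alpha."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 when alpha = 0.05
    # Critical region: |z| > critical, i.e. both tails of the distribution.
    return "Reject H0" if abs(z_stat) > critical else "Do not reject H0"


print(decide(2.3))   # falls in the critical region at the 5% level
print(decide(1.2))   # does not
```

Shrinking alpha enlarges the critical value and thus shrinks the critical region, which is how the significance level protects the null hypothesis.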
Significance Level
The significance level of a test is the probability of a type I error and is set by the investigator in relation to the
consequences of such an error. That is, we want to make the significance level as small as
possible in order to protect the null hypothesis and to prevent, as far as possible, the
investigator from inadvertently making false claims.
One-sided Test
A one-sided test is a statistical hypothesis test in which the values for which we can reject
the null hypothesis, H0 are located entirely in one tail of the probability distribution.
In other words, the critical region for a one-sided test is the set of values less than the
critical value of the test, or the set of values greater than the critical value of the test.
The choice between a one-sided and a two-sided test is determined by the purpose of the
investigation or prior reasons for using a one-sided test.
Example
Suppose we wanted to test a manufacturer's claim that there are, on average, 50 matches in a box. We could set up the following hypotheses:
H0: µ = 50,
against
H1: µ < 50 or H1: µ > 50
Either of these two alternative hypotheses would lead to a one-sided test. Presumably, we would want to test the null hypothesis against the first alternative, since it would be useful to know whether there are likely to be fewer than 50 matches, on average, in a box (no one would complain about getting the correct number of matches or more).
Yet another alternative hypothesis could be tested against the same null, leading this time
to a two-sided test:
H0: µ = 50,
against
H1: µ not equal to 50
Here, nothing specific can be said about the average number of matches in a box; only
that, if we could reject the null hypothesis in our test, we would know that the average
number of matches in a box is likely to be less than or greater than 50.
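The matchbox test can be sketched as a one-sided z-test. The sample figures below are hypothetical, and the population standard deviation is assumed known; with an estimated standard deviation a t-test would be used instead.

```python
# One-sided z-test of H0: mu = 50 against H1: mu < 50 (the matchbox claim).
from statistics import NormalDist

mu0, sigma = 50, 2.0          # claimed mean; assumed known population sd
n, xbar = 36, 49.2            # hypothetical sample size and sample mean

z = (xbar - mu0) / (sigma / n ** 0.5)
critical = NormalDist().inv_cdf(0.05)    # lower-tail critical value, ~ -1.645

decision = "Reject H0" if z < critical else "Do not reject H0"
print(decision)
```

Because H1 points only at means below 50, the whole critical region sits in the lower tail, as the definition above requires.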
Two-Sided Test
A two-sided test is a statistical hypothesis test in which the values for which we can reject
the null hypothesis, H0 are located in both tails of the probability distribution.
In other words, the critical region for a two-sided test is the set of values less than a first
critical value of the test and the set of values greater than a second critical value of the
test.
The choice between a one-sided test and a two-sided test is determined by the purpose of
the investigation or prior reasons for using a one-sided test.
Example
Consider again the manufacturer's claim that there are, on average, 50 matches in a box. Testing
H0: µ = 50
against
H1: µ not equal to 50
leads to a two-sided test. Here nothing specific can be said about the direction of any departure from the claim; if we reject the null hypothesis, we conclude only that the average number of matches in a box is likely to be either less than or greater than 50.
A one sample t-test is a hypothesis test for answering questions about the mean where the data are a random sample of independent observations from an underlying normal distribution.
That is, the sample has been drawn from a population of given mean µ0 and unknown variance (which therefore has to be estimated from the sample). The null hypothesis is
H0: µ = µ0
This null hypothesis, H0, is tested against one of the following alternative hypotheses, depending on the question posed:
H1: µ is not equal to µ0
H1: µ > µ0
H1: µ < µ0
When carrying out a two sample t-test, it is usual to assume that the variances of the two populations are equal, i.e. σ1² = σ2². The null hypothesis is
H0: µ1 = µ2
That is, the two samples have both been drawn from the same population. This null
hypothesis is tested against one of the following alternative hypotheses, depending on the
question posed.
H1: µ1 is not equal to µ2
H1: µ1 > µ2
H1: µ1 < µ2
The paired sample t-test is a more powerful alternative to a two sample procedure, such
as the two sample t-test, but can only be used when we have matched samples.
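As a sketch, the paired t statistic is computed directly from the within-pair differences. The before/after figures below are hypothetical (e.g. the same patients measured before and after treatment).

```python
# Paired sample t-test statistic from matched observations.
from statistics import mean, stdev

before = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
after = [11.2, 10.9, 12.1, 12.0, 11.5, 11.8]

# Reduce the paired data to a single sample of differences.
d = [b - a for b, a in zip(before, after)]
n = len(d)

# t statistic for H0: mean difference = 0; compare with the t critical
# value on n - 1 degrees of freedom.
t_stat = mean(d) / (stdev(d) / n ** 0.5)
print(round(t_stat, 3))
```

Pairing removes the between-subject variation from the comparison, which is why the paired test is more powerful than the two sample t-test on the same data.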