
LESSON 1

STATISTICS FOR MANAGEMENT


Session 1 Duration: 1 hr
Meaning of Statistics
The term statistics refers both to numerical statements and to statistical methodology. When it
is used in the sense of statistical data, it refers to the quantitative aspects of things and is a numerical
description.
Example: Income of a family, production of the automobile industry, sales of cars etc. These quantities are
numerical. But there are some quantities which are not in themselves numerical but can be made so by
counting. The sex of a baby is not a number, but by counting the number of boys we can associate a
numerical description with the sex of all newborn babies, for example by saying that 60% of all live-born
babies are boys. This information then comes within the realm of statistics.
Definition
The word statistics can be used in two senses, viz. singular and plural. In the narrow, plural
sense, statistics denotes numerical data (statistical data). In the wide, singular sense, statistics
refers to statistical methods. Therefore, the definitions have been grouped under two heads: Statistics as
data and Statistics as a method.
Statistics as Data
Some definitions of statistics as data are
a) Statistics are numerical statements of facts in any department of enquiry placed
in relation to each other.
- A.L. Bowley
b) By statistics we mean quantitative data affected to a marked extent by
multiplicity of causes.
- Yule and Kendall
c) By statistics we mean aggregates of facts affected to a marked extent by
multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable
standards of accuracy, collected in a systematic manner for a predetermined purpose and placed in
relation to each other.
- H. Secrist
This definition is more comprehensive and exhaustive. It throws more light on the characteristics of
statistics and covers different aspects.
The characteristics that statistics should possess according to H. Secrist can be listed as follows.
Statistics are aggregates of facts.
Statistics are affected to a marked extent by multiplicity of causes.
Statistics are numerically expressed.
Statistics should be enumerated / estimated.
Statistics should be collected with a reasonable standard of accuracy.
Statistics should be collected in a systematic manner for a predetermined purpose.
Statistics should be placed in relation to each other.
Statistics as a method
Definition
a) Statistics may be called the science of counting.
- A.L. Bowley
b) Statistics is the science of estimates and probabilities.
- Boddington
c) Croxton and Cowden have given a clear and concise definition:
Statistics may be defined as the collection, presentation, analysis and interpretation of numerical
data.
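According to Croxton and Cowden there are four stages: collection, presentation, analysis and
interpretation of data. These stages, together with the organisation of data, are described below.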
a) Collection of Data
The structure of a statistical investigation is based on a systematic collection of data. The data are
classified into two groups
i) Internal data and
ii) External data
Internal data are obtained from internal records related to the operations of a business organisation, such
as production, sources of income and expenditure, inventory, purchases and accounts.
External data are collected or purchased from external agencies. External data could be
either primary data or secondary data. Primary data are original and collected for the first time, while
secondary data have already been collected and published by some agencies.
b) Organisation of data
The collected data are a large mass of figures that needs to be organised. The collected data must
be edited to rectify any omissions, irrelevant answers and wrong computations. The edited data must
then be classified and tabulated to suit further analysis.
c) Presentation of data
A large body of collected data cannot be understood and analysed easily and quickly. Therefore,
the collected data need to be presented in tabular or graphic form. This systematic ordering and graphical
presentation helps further analysis.
d) Analysis of data
Analysis requires establishing the relationship between one or more variables. Analysis of
data includes condensation, abstraction, summarization, conclusion etc. The analysis can be done with the
help of statistical tools and techniques such as measures of central tendency, measures of dispersion,
correlation, variance analysis etc.
e) Interpretation of data
Interpretation requires deep insight into the subject. It involves drawing valid
conclusions on the basis of the analysis of data. This work requires good experience and skill. This
process is very important because the conclusions of the study are based on the interpretation.
Following Seligman, we can define statistics as follows.
Statistics is a science which deals with the methods of collecting, classifying, presenting,
comparing and interpreting numerical data collected to throw light on an enquiry.
Importance of statistics
In today's context statistics is indispensable. As the use of statistics has been extended to various fields of
experiment to draw valid conclusions, its importance and usage have increased. Research
investigations in the fields of economics and commerce are largely statistical. Further, the
importance of statistics in various fields is listed below.
a) State Affairs: In state affairs, statistics is useful in the following ways
1. To collect information and study the economic condition of people in the states.
2. To assess the resources available in the states.
3. To help the state decide whether to accept or reject a policy on the basis of statistics.
4. To provide information and analysis on various factors of state like wealth, crimes, agriculture
experts, education etc.
b) Economics: In economics, statistics is useful in following ways
1. Helps in formulation of economic laws and policies
2. Helps in studying economic problems
3. Helps in compiling the national income accounts.
4. Helps in economic planning.
c) Business
1. Helps to take decisions on location and size
2. Helps to study demand and supply
3. Helps in forecasting and planning
4. Helps controlling the quality of the product or process
5. Helps in making marketing decisions
6. Helps for production, planning and inventory management.
7. Helps in business risk analysis
8. Helps in estimating long-term resource requirements and consumers' preferences, and helps in
business research.
d) Education: Statistics is necessary to formulate policies regarding the start of new courses and the
consideration of facilities available for proposed courses.
e) Accounts and Audits:
1. Helps to study the correlation between profits and dividends, which makes it possible to judge the trend of future
profits.
2. In auditing, sampling techniques are followed.
Functions of statistics
Some important functions of statistics are as follows
1. To collect and present facts in a systematic manner.
2. Helps in formulation and testing of hypothesis.
3. Helps in facilitating the comparison of data.
4. Helps in predicting future trends.
5. Helps to find the relationship between variables.
6. Simplifies the mass of complex data.
7. Helps to formulate policies.
8. Helps Government to take decisions.
Limitations of statistics
1. Does not study qualitative phenomenon.
2. Does not deal with individual items.
3. Statistical results are true only on an average.
4. Statistical data should be uniform and homogeneous.
5. Statistical results depend on the accuracy of data.
6. Statistical conclusions are not universally true.
7. Statistical results can be interpreted only by a person who has sound knowledge of statistics.
Distrust of Statistics
Distrust of statistics is due to lack of knowledge and the limitations of its uses, not to
statistical science itself.
Distrust of statistics is due to the following reasons.
a) Figures are manipulated or incomplete.
b) Quoting figures without their context.
c) Inconsistent definitions.
d) Selection of non-representative statistical units.
e) Inappropriate comparison
f) Wrong inference drawn.
g) Errors in data collection.
Statistical Data
Statistical investigation is a long and comprehensive process and requires systematic collection of
large amounts of data. The validity and accuracy of the conclusions or results of the study depend upon how
well the data were gathered. The quality of data greatly influences the conclusions of the study, and
hence importance must be given to the data collection process.
Statistical data may be classified as Primary Data and Secondary Data based on the sources of data
collection.
Primary data
Primary data are those which are collected for the first time by the investigator / researcher and
are thus original in character. They are collected by the investigator for the specific purpose / study
at hand. Primary data are usually in the shape of raw material to which statistical methods are applied for
the purpose of analysis and interpretation.
Secondary data
Secondary data have already been collected for a purpose other than the problem at hand. These data
are those which have already been collected by some other persons and which have passed through
statistical analysis at least once. Secondary data are usually in the shape of finished products since they
have already been treated statistically in one form or another. After statistical treatment the primary data
lose their original shape and become secondary data. Secondary data of one organisation are the
primary data of the organisation that first collected and published them.
Primary Vs Secondary Data
Primary data are originated by the researcher for the specific purpose / study at hand, while secondary data
have already been collected for a purpose other than the research work at hand.
Primary data collection requires considerably more time and is relatively expensive, while
secondary data are easily accessible, inexpensive and quickly obtained.
Table: A comparison of primary and secondary data
                     Primary data                 Secondary data
Collection purpose   For the problem at hand      For other problems
Collection process   Very involved                Rapid and easy
Collection cost      High                         Relatively low
Collection time      Long                         Short
Suitability          Its suitability is positive  It may or may not suit the object of survey
Originality          It is original               It is not original
Precautions          No extra precautions         It should be used with extra care
                     required to use the data
Limitations of secondary data
a) Since secondary data are collected for some other purpose, their usefulness to the current problem may
be limited in several important ways, including relevance and accuracy.
b) The objectives, nature and methods used to collect secondary data may not be appropriate to
present situation.
c) The secondary data may not be accurate, or they may not be completely current or dependable.
Criteria for evaluating secondary data
Before using the secondary data it is important to evaluate them on following factors
a) Specification and methodology used to collect the data
b) Error and accuracy of the data
c) The currency
d) The objective: the purpose for which the data were collected
e) The nature: the content of the data
f) The dependability
Sources of data
Primary source: the methods of collecting primary data
When data are neither internally available nor exist as a secondary source, the primary sources
of data would be appropriate.
The various methods of collection of primary data are as follows
a) Direct personal investigation
- Interview
- Observation
b) Indirect or oral investigation
c) Information from local agents and correspondents
d) Mailed questionnaires and schedules
e) Through enumerations

Secondary source: the methods of collecting secondary data
i) Published statistics
a) Official publications of the Central Government
Ex:
- Central Statistical Organisation (CSO), Ministry of Planning
- National Sample Survey Organisation (NSSO)
- Office of the Registrar General and Census Commissioner, GOI
- Directorate of Economics and Statistics, Ministry of Agriculture
- Labour Bureau, Ministry of Labour etc.
ii) Publications of Semi-government organisation
Ex:
- The institute of foreign trade, New Delhi
- The institute of economic growth, New Delhi.
iii) Publication of research institutes
Ex:
- Indian Statistical Institute
- Indian Agriculture Statistical Institute
- NCERT Publications
- Indian Standards Institute etc.
iv) Publication of Business and Financial Institutions
Ex:
- Trade Association Publications like Sugar factory, Textile mill, Indian chamber of Industry
and Commerce.
- Stock exchange reports, Co-operative society reports etc.
v) News papers and periodicals
Ex:
- The Financial Express, Eastern Economist, Economic Times, Indian Finance, etc.
vi) Reports of various committees and commissions
Ex:
- Kothari commission report on education
- Pay commission reports
- Land reform committee reports etc.
vii) Unpublished statistics
- Internal and administrative data like Periodical Loss, Profit, Sales, Production Rate,
Balance Sheet, Labour Turnover, Budgets, etc.
Classification and Tabulation
The data collected for the purpose of a statistical inquiry sometimes consist of a few fairly
simple figures which can be easily understood without any special treatment. But more often there is an
overwhelming mass of raw data without any structure. Such an unwieldy, unorganised and shapeless mass
of collected data is not capable of being rapidly or easily assimilated or interpreted. Unorganised data are not
fit for further analysis and interpretation. In order to make the data simple and easily understandable, the
first task is to condense and simplify them in such a way that irrelevant data are removed and their
significant features stand out prominently. The procedure adopted for this purpose is known as the
method of classification and tabulation. Classification helps proper tabulation.
"Classified and arranged facts speak for themselves; unarranged, unorganised they are dead as
mutton."
- Prof. J.R. Hicks
Meaning of Classification
Classification is a process of arranging things or data in groups or classes according to their
resemblances and affinities, and it gives expression to the unity of attributes that may subsist among a
diversity of individuals.
Definition of Classification
Classification is the process of arranging data into sequences and groups according to their
common characteristics or separating them into different but related parts.
- Secrist
The process of grouping large number of individual facts and observations on the basis of
similarity among the items, is called classification.
- Stockton & Clark
Characteristics of classification
a) Classification performs homogeneous grouping of data
b) It brings out points of similarity and dissimilarity
c) The classification may be either real or imaginary
d) Classification is flexible to accommodate adjustments
Objectives / purposes of classifications
i) To simplify and condense the large data
ii) To present the facts in an easily understandable form
iii) To allow comparisons
iv) To help to draw valid inferences
v) To relate the variables among the data
vi) To help further analysis
vii) To eliminate unwanted data
viii) To prepare tabulation
Guiding principles (rules) of classifications
Following are the general guiding principles for good classifications
a) Exhaustive: Classification should be exhaustive. Each and every item in the data must belong
to one of the classes. Introduction of a residual class (i.e. others, miscellaneous etc.) should be
avoided.
b) Mutually exclusive: Each item should be placed in only one class.
c) Suitability: The classification should conform to the object of inquiry.
d) Stability: Only one principle must be maintained throughout the classification and
analysis.
e) Homogeneity: The items included in each class must be homogeneous.
f) Flexibility: A good classification should be flexible enough to accommodate new situation
or changed situations.
Modes / Types of Classification
Modes / types of classification refer to the class categories into which the data could be sorted
out and tabulated. These categories depend on the nature of the data and the purpose for which the data are being
sought.
Important types of classification
a) Geographical (i.e. on the basis of area or region wise)
b) Chronological (On the basis of Temporal / Historical, i.e. with respect to time)
c) Qualitative (on the basis of character / attributes)
d) Numerical, quantitative (on the basis of magnitude)
a) Geographical Classification
In geographical classification, the classification is based on the geographical regions.
Ex: Sales of the company (In Million Rupees) (region wise)
Region Sales
North 285
South 300
East 185
West 235
b) Chronological Classification
If the statistical data are classified according to the time of its occurrence, the type of classification
is called chronological classification.
Sales reported by a departmental store
Month Sales (Rs. in lakhs)
January 22
February 26
March 32
April 25
May 27
June 29
July 30
August 30
c) Qualitative Classification
In qualitative classifications, the data are classified according to the presence or absence of
attributes in given units. Thus, the classification is based on some quality characteristics / attributes.
Ex: Sex, Literacy, Education, Class grade etc.
Further, qualitative classification may be divided into
i) Simple classification and ii) Manifold classification
i) Simple classification: If the data are classified into only two classes on the basis of a single attribute, the classification is
known as simple classification.
Ex: a) Population into Male / Female
b) Population into Educated / Uneducated
ii) Manifold classification: In this classification, the data are classified on the basis of more than one attribute
at a time.
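Ex: A population may be classified first into smokers and non-smokers, then each group into
literate and illiterate, and finally each of these into male and female.
[Figure: tree diagram of this manifold classification: Population; Smokers / Non-smokers; Literate / Illiterate; Male / Female]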
d) Quantitative Classification: In quantitative classification, the classification is based on
quantitative measurements of some characteristic, such as age, marks, income, production, sales etc.
The quantitative phenomenon under study is known as a variable, and hence this classification is also
called classification by variable.
Ex:
For a 50 marks test, the marks obtained by students are classified as follows
Marks      No. of students
0 - 10     5
10 - 20    7
20 - 30    10
30 - 40    25
40 - 50    3
Total      50
In this classification the marks obtained by students is the variable and the number of students in each class
represents the frequency.
Meaning and Definition of Tabulation
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed
to simplify the presentation of data for the purpose of analysis and statistical inference.

Major Objectives of Tabulation
1. To simplify the complex data
2. To facilitate comparison
3. To economise the space
4. To draw valid inference / conclusions
5. To help for further analysis
Differences between Classification and Tabulation
1. Data are first classified and then presented in tables; classification is the basis for tabulation.
2. Tabulation is a mechanical function of classification because in tabulation classified data are
placed in rows and columns.
3. Classification is a process of statistical analysis while tabulation is a process of presenting data in a
suitable structure.
Classification of tables
Tables are classified based on
1. Coverage (simple and complex tables)
2. Objective / purpose (general purpose or reference table / special purpose or summary table)
3. Nature of inquiry (primary and derived tables).
Ex:
a) Simple table: Data are classified based on only one characteristic
Distribution of marks
Class marks   No. of students
30 - 40       20
40 - 50       20
50 - 60       10
Total         50
b) Two-way table: Classification is based on two characteristics
Class marks   No. of students
              Boys   Girls   Total
30 - 40       10     10      20
40 - 50       15     5       20
50 - 60       3      7       10
Total         28     22      50
Frequency Distribution
A frequency distribution is a table used to organize data. The left column (called classes or
groups) lists numerical intervals of the variable under study. The right column contains the
frequencies, i.e. the number of occurrences in each class / group. Intervals are normally of equal size and
cover the range of the sample observations.
It is simply a table in which the gathered data are grouped into classes and the number of
occurrences which fall in each class is recorded.
Definition
A frequency distribution is a statistical table which shows the set of all distinct values of the
variable arranged in order of magnitude, either individually or in groups with their corresponding
frequencies.
- Croxton and Cowden
A frequency distribution can be classified as
a) Series of individual observation
b) Discrete frequency distribution
c) Continuous frequency distribution
a) Series of individual observation
A series of individual observations is a series in which the items are listed one after another, each as a
separate observation. For statistical calculations, these observations could be arranged in either ascending or
descending order. This is called an array.
Ex:
Roll No.   Marks obtained in statistics paper
1          83
2          80
3          75
4          92
5          65
The above list is raw data. Presented in this form, the data do not readily reveal any
information. Arranging the data in ascending or descending order of magnitude gives a
better presentation; this is called arraying of data.
Discrete (ungrouped) Frequency Distribution
If the data series is presented in such a way that the exact measurement of each unit is indicated, then it
is called a discrete frequency distribution. A discrete variable is one whose values differ from each
other by definite amounts.
Ex:
Assume that a survey has been made to find the number of post-graduates in 10 families chosen at random.
The resulting raw data could be as follows.
0, 1, 3, 1, 0, 2, 2, 2, 2, 4
These data can be classified into an ungrouped frequency distribution. The number of post-graduates
becomes the variable (x), for which we can list the frequency of occurrence (f) in tabular form as
follows:
Number of post-graduates (x)   Frequency (f)
0                              2
1                              2
2                              4
3                              1
4                              1
The above example shows a discrete frequency distribution, where the variable takes discrete
numerical values.
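The tally in the example above can be reproduced with a short script. This is a minimal sketch in Python (a language the notes themselves do not use); the variable names `observations` and `frequency` are illustrative assumptions.

```python
from collections import Counter

# Raw data from the example: number of post-graduates in 10 families.
observations = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

# Counter tallies how many times each distinct value (x) occurs (f).
frequency = Counter(observations)

# Print the ungrouped frequency distribution in ascending order of x.
print("x  f")
for x in sorted(frequency):
    print(x, "", frequency[x])
```

The frequencies add up to the number of families surveyed, which is a quick check that no observation was missed.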
Continuous frequency distribution (grouped frequency distribution)
A continuous data series is one where the measurements are only approximations and are expressed
in class intervals within certain limits. In a continuous frequency distribution the class intervals theoretically
continue from the start of the frequency distribution to the end without a break. According to
Boddington, a continuous frequency distribution is one in which the variable can take any intermediate
value between the smallest and largest values in the distribution.
Ex:
Marks obtained by 20 students in an exam for 50 marks are given below. Convert the
data into continuous frequency distribution form.
18 23 28 29 44 28 48 33 32 43
24 29 32 39 49 42 27 33 28 29
By grouping the marks into class intervals of 5, the following frequency distribution table can be
formed.
Marks      No. of students
0 - 5      0
5 - 10     0
10 - 15    0
15 - 20    1
20 - 25    2
25 - 30    7
30 - 35    4
35 - 40    1
40 - 45    3
45 - 50    2
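The grouping in the table above can be checked mechanically. The following is a small sketch, assuming classes of width 5 that include their lower limit and exclude their upper limit; the function name `grouped_frequency` is a hypothetical helper, not something defined in these notes.

```python
# Marks of the 20 students from the example.
marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]


def grouped_frequency(values, width, upper):
    """Count values in classes [0, width), [width, 2*width), ... up to `upper`.

    Each class is keyed by its lower limit. A value equal to `upper`
    is placed in the last class (an assumption made for this sketch).
    """
    counts = {low: 0 for low in range(0, upper, width)}
    for value in values:
        low = min((value // width) * width, upper - width)
        counts[low] += 1
    return counts


table = grouped_frequency(marks, width=5, upper=50)
for low, count in table.items():
    print(f"{low:>2} - {low + 5:<2}  {count}")
```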
Technical terms used in formulating a frequency distribution
a) Class limits:
The class limits are the smallest and largest values in the class.
Ex:
0 - 10: in this class, the lowest value is zero and the highest value is 10. The two boundaries of the
class are called the lower and upper limits of the class. Class limits are also called class boundaries.
b) Class intervals
The difference between upper and lower limit of class is known as class interval.
Ex:
In the class 0 - 10, the class interval is (10 - 0) = 10.
The formula to find the class interval is given below
i = (L - S) / R
L = Largest value
S = Smallest value
R = the no. of classes
Ex:
If the marks of 60 students in a class vary between 40 and 100 and we want to form 6 classes,
the class interval would be
i = (L - S) / R = (100 - 40) / 6 = 60 / 6 = 10
where L = 100, S = 40 and R = 6.
Therefore, the class intervals would be 40 - 50, 50 - 60, 60 - 70, 70 - 80, 80 - 90 and 90 - 100.
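As an illustration, the class interval calculation and the resulting class limits can be sketched as follows. The function names `class_interval` and `class_limits` are hypothetical; the arguments correspond to the largest value (L), the smallest value (S) and the number of classes in the formula above.

```python
def class_interval(largest, smallest, num_classes):
    """Class interval i = (L - S) / number of classes."""
    return (largest - smallest) / num_classes


def class_limits(smallest, width, num_classes):
    """Return (lower, upper) limits for consecutive classes of equal width."""
    return [(smallest + k * width, smallest + (k + 1) * width)
            for k in range(num_classes)]


# Marks vary between 40 and 100 and 6 classes are wanted.
i = class_interval(largest=100, smallest=40, num_classes=6)
limits = class_limits(smallest=40, width=i, num_classes=6)

print(i)
for lower, upper in limits:
    print(f"{lower:g} - {upper:g}")
```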
Methods of forming class-interval
a) Exclusive method (overlapping)
In this method, the upper limit of one class interval is the lower limit of the next class. This method
maintains continuity of data.
Ex:
Marks      No. of students
20 - 30    5
30 - 40    15
40 - 50    25
A student whose mark is 20 or more but less than 30 (for example 29.9) is included in the 20 - 30 class;
a mark of exactly 30 goes into the 30 - 40 class.
A better way of expressing this is
Marks                No. of students
20 to less than 30   5
30 to less than 40   15
40 to less than 50   25
Total students       45
b) Inclusive method (non-overlapping)
In this method, both the lower and the upper limit of a class are included in that class.
Ex:
Marks      No. of students
20 - 29    5
30 - 39    15
40 - 49    25
A student whose mark is 29 is included in the 20 - 29 class interval and a student whose mark is 39 is
included in the 30 - 39 class interval.
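The difference between the two methods is easiest to see at a class boundary. The sketch below is illustrative only; `exclusive_class` and `inclusive_class` are hypothetical helper names, and the class limits are taken from the two examples above.

```python
def exclusive_class(mark, classes):
    """Exclusive method: lower limit included, upper limit excluded."""
    for lower, upper in classes:
        if lower <= mark < upper:
            return (lower, upper)
    return None  # mark lies outside every class


def inclusive_class(mark, classes):
    """Inclusive method: both limits belong to the class."""
    for lower, upper in classes:
        if lower <= mark <= upper:
            return (lower, upper)
    return None  # mark lies outside every class (or in a gap)


exclusive = [(20, 30), (30, 40), (40, 50)]
inclusive = [(20, 29), (30, 39), (40, 49)]

print(exclusive_class(30, exclusive))    # boundary value goes to the next class
print(inclusive_class(29, inclusive))    # upper limit stays in its own class
print(inclusive_class(29.5, inclusive))  # falls in the gap between classes
```

The last call shows why the exclusive method suits continuous data: with inclusive limits, a fractional value such as 29.5 belongs to no class.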
Class Frequency
The number of observations falling within a class interval is called its class frequency.
Ex: A class frequency of 5 for the class 90 - 100 means that 5 students scored between 90 and 100. If
we add the frequencies of all individual classes, the total frequency represents the total number of items
studied.
Magnitude of class interval
The magnitude of the class interval depends on the range and the number of classes. The range is the
difference between the highest and smallest values in the data series. A class interval is generally taken in
multiples of 5, such as 5, 10, 15 and 20.
Sturges' formula to find the number of classes is given below
K = 1 + 3.322 log N
K = No. of classes
log N = Logarithm (base 10) of the total no. of observations
Ex: If the total number of observations is 100, then the number of classes could be
K = 1 + 3.322 log 100
K = 1 + 3.322 x 2
K = 1 + 6.644
K = 7.644, i.e. 8 (rounded off)
NOTE: Under this formula the number of classes cannot be less than 4 or greater than 20.
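Sturges' formula is straightforward to evaluate in code. The sketch below assumes a base-10 logarithm (consistent with log 100 = 2 in the example) and rounds the result up to the next whole number; rounding up rather than to the nearest integer is an assumption of this sketch, and `sturges_classes` is a hypothetical function name.

```python
import math


def sturges_classes(n):
    """Number of classes K = 1 + 3.322 * log10(N), rounded up (assumed)."""
    return math.ceil(1 + 3.322 * math.log10(n))


print(sturges_classes(100))
print(sturges_classes(30))
```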
Class mid point or class marks
The mid value or central value of the class interval is called mid point.
Mid point of a class = (lower limit of class + upper limit of class) / 2
Sturges' formula to find the size of class interval
Size of class interval (h) = Range / (1 + 3.322 log N)
Ex: In a group of 50 workers, the highest wage is Rs. 250 and the lowest wage is Rs. 100 per day. Find the size of
the interval.
h = Range / (1 + 3.322 log N)
= (250 - 100) / (1 + 3.322 log 50)
= 150 / 6.644
= 22.58, i.e. about 23
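The wage example can be verified numerically. This is a minimal sketch assuming N = 50 workers (the value used in the formula line) and a base-10 logarithm; `sturges_width` is a hypothetical function name. Evaluating the full denominator, including the factor 3.322, gives a width of about 22.58.

```python
import math


def sturges_width(highest, lowest, n):
    """Size of class interval h = Range / (1 + 3.322 * log10(N))."""
    return (highest - lowest) / (1 + 3.322 * math.log10(n))


h = sturges_width(highest=250, lowest=100, n=50)
print(round(h, 2))   # size of class interval
print(math.ceil(h))  # rounded up to a convenient whole number
```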
Constructing a frequency distribution
The following guidelines may be considered for the construction of frequency distribution.
a) The classes should be clearly defined and each observation must belong to one and only one
class interval. Class intervals must be all-inclusive and non-overlapping.
b) The number of classes should be neither too large nor too small.
Too few classes result in greater interval width with loss of accuracy. Too many class intervals
result in complexity.
c) All intervals should be of the same width. This is preferred for easy computations.
The width of interval = Range / Number of classes
d) Open-end classes should be avoided since they create difficulty in analysis and interpretation.
e) Intervals should be continuous throughout the distribution. This is important for continuous
distributions.
f) The lower limits of the class intervals should be simple multiples of the interval.
Ex: The weights of a sample of 30 students of a particular class are as follows. Construct a frequency
distribution for the given data.
62 58 58 52 48 53 54 63 69 63
57 56 46 48 53 56 57 59 58 53
52 56 57 52 52 53 54 58 61 63
Steps of construction
Step 1
Find the range of data
(H) Highest value = 69
(L) Lowest value = 46
Range = H - L = 69 - 46 = 23
Step 2
Find the number of class intervals.
Sturges' formula
K = 1 + 3.322 log N
K = 1 + 3.322 log 30
K = 5.91, say K = 6
No. of classes = 6
Step 3
Width of class interval
Width of class interval = Range / Number of classes = 23 / 6 = 3.83, i.e. about 4
Step 4
Count the observations belonging to each class interval and assign this total frequency to
the corresponding class interval as follows. Each class includes its lower limit and excludes its upper limit.
Class interval   Tally bars   Frequency
46 - 50          |||          3
50 - 54          ||||| |||    8
54 - 58          ||||| |||    8
58 - 62          ||||| |      6
62 - 66          ||||         4
66 - 70          |            1
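The four steps above can be put together in one short script. This is a sketch under stated assumptions: Sturges' formula with a base-10 logarithm, the number of classes and the width both rounded up, classes starting at the lowest observation, and each class including its lower limit but excluding its upper limit. All variable names are illustrative.

```python
import math

# Weights of the 30 students from the example.
weights = [62, 58, 58, 52, 48, 53, 54, 63, 69, 63,
           57, 56, 46, 48, 53, 56, 57, 59, 58, 53,
           52, 56, 57, 52, 52, 53, 54, 58, 61, 63]

# Step 1: range of the data.
data_range = max(weights) - min(weights)

# Step 2: number of classes by Sturges' formula, rounded up.
k = math.ceil(1 + 3.322 * math.log10(len(weights)))

# Step 3: width of each class interval, rounded up.
width = math.ceil(data_range / k)

# Step 4: count the observations falling in each class [low, high).
table = []
for j in range(k):
    low = min(weights) + j * width
    high = low + width
    f = sum(1 for w in weights if low <= w < high)
    table.append((low, high, f))

for low, high, f in table:
    print(f"{low} - {high}  {'|' * f:<9} {f}")
```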
Cumulative frequency distribution
A cumulative frequency distribution indicates directly the number of units that lie above or below
the specified values of the class intervals. When the interest of the investigator is in the number of cases
below a specified value, then the specified value represents the upper limit of the class interval. This is
known as a less than cumulative frequency distribution. When the interest lies in finding the number of
cases above a specified value, then this value is taken as the lower limit of the class interval. This
is known as a more than cumulative frequency distribution.
The cumulative frequency is simply the running total of the consecutive frequencies.
Ex:
Marks      No. of students   Less than cumulative frequency
0 - 10     5                 5
10 - 20    3                 8
20 - 30    10                18
30 - 40    20                38
40 - 50    12                50
In the above less than cumulative frequency distribution, 5 students scored less than 10, 8 less
than 20, 18 less than 30 and so on.
Similarly, the following table shows the more than cumulative frequency distribution.
Ex:
Marks      No. of students   More than cumulative frequency
0 - 10     5                 50
10 - 20    3                 45
20 - 30    10                42
30 - 40    20                32
40 - 50    12                12
In the above more than cumulative frequency distribution, 50 students scored 0 or more,
45 scored 10 or more, 42 scored 20 or more and so on.
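Both kinds of cumulative frequency are running totals, taken from the top of the table for "less than" and from the bottom for "more than". A minimal sketch using the class frequencies of the example (variable names are illustrative):

```python
from itertools import accumulate

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 3, 10, 20, 12]

# Less than: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# More than: running total from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for (low, high), f, lt, mt in zip(classes, frequencies, less_than, more_than):
    print(f"{low:>2} - {high:<2}  f={f:<3} less than {high}: {lt:<3} {low} or more: {mt}")
```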
Diagrammatic and Graphic Representation
The data collected can be presented graphically or pictorially for easy understanding and
quick interpretation. Diagrams and graphs give visual indications of magnitudes, groupings, trends and
patterns in the data. These features can be presented more simply in graphical form. Diagrams
and graphs also help in the comparison of variables.
Diagrammatic presentation
A diagram is a visual form for the presentation of statistical data. The term diagram refers to various types of
devices such as bars, circles, maps, pictorials and cartograms etc.
Importance of Diagrams
1. They are simple, attractive and easily understandable
2. They give quick information
3. They help to compare the variables
4. They are more suitable for illustrating discrete data
5. They leave a more lasting impression on the reader's mind.
Limitations of diagrams
1. Diagrams show approximate values
2. Diagrams are not suitable for further analysis
3. Some diagrams (multidimensional) can be understood only by experts
4. Details cannot be provided fully
5. They are useful only for comparison
General Rules for drawing the diagrams
i) Each diagram should have a suitable title, placed at the top or bottom, indicating the theme
of the diagram.
ii) The size of the diagram should emphasize the important characteristics of the data.
iii) An appropriate proportion should be maintained between the length and breadth of the diagram.
iv) A proper / suitable scale should be adopted for the diagram.
v) Selection of the appropriate diagram is important; a wrong selection may mislead the reader.
vi) The source of data should be mentioned at the bottom.
vii) The diagram should be simple and attractive.
viii) The diagram should be effective rather than complex.
Some important types of diagrams
a) One dimensional diagrams (line and bar)
b) Two-dimensional diagram (rectangle, square, circle)
c) Three dimensional diagram (cube, sphere, cylinder etc.)
d) pictogram
e) Cartogram
a) One dimensional diagrams (line and bar)
In one dimensional diagrams, only the length of the bars or lines is taken into account. The width of the
bars is not considered. One dimensional diagrams are classified mainly as follows.
i) Line diagram
ii) Bar diagram
- Vertical bar diagram
- Horizontal bar diagram
- Multiple (compound) bar diagram
- Sub-divided (component) bar diagram
- Percentage subdivided bar diagram
i) Line diagram
This is the simplest type of one dimensional diagram. The heights of
the bars / lines are drawn on the basis of the size of the figures. The distance between bars is kept uniform. The limitations of this diagram are
that it is not attractive and cannot present more than one piece of information.
Ex: Draw the line diagram for the following data
Year                                                     2001  2002  2003  2004  2005  2006
No. of students passed in first class with distinction   5     7     12    5     13    15
[Figure: Line diagram. X-axis: Year (2001 to 2006); Y-axis: No. of students passed in FCD. Plotted values: 5, 7, 12, 5, 13, 15.]
Indication of diagram: Highest FCD is at 2006 and lowest FCD are at 2001 and 2004.
ii) Simple bar diagram
A simple bar diagram can be drawn using horizontal or vertical bars. It is a very common diagram in business and economics.
Vertical bar diagram
The annual expenses of maintaining cars of various makes are given below. Draw the vertical bar diagram. The annual expenses include fuel, maintenance, repair, assistance and insurance.
Type of the car    Expense in Rs. / Year
Maruthi Udyog      47533
Hyundai            59230
Tata Motors        63270
Source: 2005 TNS TCS Study
Published at: Vijaya Karnataka, dated: 03.08.2006
[Figure: vertical bar diagram. x-axis: type of car (Maruthi Udyog, Hyundai, Tata Motors); y-axis: annual expense in Rs., scale 30000 to 70000; bar heights 47533, 59230 and 63270.]
Interpretation of the diagram
a) The annual expenses of the Maruthi Udyog car are lower than those of the other brands depicted.
b) The Tata Motors car has the highest annual expenses.
Horizontal bar diagram
Production data for the world's top 10 steel makers are given below. Draw the horizontal bar diagram.
Steel maker       Production in million tonnes
Arcelor Mittal    110
Nippon            32
POSCO             31
JFE               30
BAO Steel         24
US Steel          20
NUCOR             18
RIVA              18
Thyssen-krupp     17
Tangshan          16
[Figure: horizontal bar diagram. y-axis: Top-10 Steel Makers; x-axis: Production of Steel (Million Tonnes), scale 0 to 120; bar lengths as in the table above.]
Source: ISSB Published by India Today

Compound bar diagram (Multiple bar diagram)
Multiple bar diagrams present more information than a simple bar diagram. A multiple bar diagram depicts more than one phenomenon and is highly useful for direct comparison. The bars are drawn side by side, and different colours, shades or hatches can be used to indicate each variable.
Ex: Draw the bar diagram for the following data. Resale values of the cars (in Rs. '000) are as follows.
Year (Model)   Santro   Zen   WagonR
2003           208      252   248
2004           240      278   274
2005           261      296   302
[Figure: multiple bar diagram. x-axis: Model of Car (2003, 2004, 2005); y-axis: Value in Rs. '000, scale 0 to 350; three bars per year for Santro, Zen and WagonR, with heights as in the table above.]
Source: True value used car purchase data
Published by: Vijaya Karnataka, dated: 03.08.2006
Ex: Represent the following data in a suitable diagram.
Class     A      B      C
Male      1000   1500   1500
Female    500    800    1000
Total     1500   2300   2500
[Figure: sub-divided (component) bar diagram. x-axis: Class (A, B, C); y-axis: Population (in Nos.), scale 0 to 2500; each bar is divided into Male and Female components and its full height is the class total.]
Ex: Draw a suitable diagram for the following data.
Mode of investment   Investment in 2004       Investment in 2005
                     Rs.        %age          Rs.        %age
NSC                  25000      43.10         30000      45.45
MIS                  15000      25.86         10000      15.15
Mutual Fund          15000      25.86         25000      37.88
LIC                  3000       5.17          1000       1.52
Total                58000      100           66000      100
[Figure: percentage sub-divided bar diagram. x-axis: Year (2004, 2005); y-axis: % of Investment, scale 0 to 100; each bar is divided into NSC, MIS, Mutual Fund and LIC components according to the percentages in the table above.]
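The percentage columns used for a percentage sub-divided bar diagram can be reproduced with a few lines of Python. This is only a minimal sketch: the function name percentage_shares and the dictionary layout are illustrative assumptions and are not part of the original notes.

```python
def percentage_shares(amounts):
    """Express each component as a percentage of the total (2 decimal places)."""
    total = sum(amounts.values())
    return {name: round(100 * value / total, 2) for name, value in amounts.items()}

# Investment figures from the example, in Rs.
investment_2004 = {"NSC": 25000, "MIS": 15000, "Mutual Fund": 15000, "LIC": 3000}
investment_2005 = {"NSC": 30000, "MIS": 10000, "Mutual Fund": 25000, "LIC": 1000}

shares_2004 = percentage_shares(investment_2004)
shares_2005 = percentage_shares(investment_2005)
print(shares_2004)
print(shares_2005)
```

Each bar is then drawn to a common height of 100, and the components are stacked in proportion to these shares.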
Two-dimensional diagram
In a two-dimensional diagram both the breadth and the length of the diagram are considered, since the area of the diagram represents the data. The important two-dimensional diagrams are
a) Rectangular diagram
b) Square diagram
a) Rectangular diagram
Rectangular diagrams are used to depict two or more variables and help in direct comparison. The areas of the rectangles are kept in proportion to the values. They may be of two types.
i) Percentage sub-divided rectangular diagram
ii) Sub-divided rectangular diagram
In the former case, the widths of the rectangles are proportional to the values, the various components of the values are converted into percentages, and the rectangles are divided according to them. The latter case is used to show related phenomena such as cost per unit, quality of production etc.
Ex: Draw the rectangular diagram for the following data.
Item of expenditure   Expenditure in Rs.
                      Family A   Family B
Provisional stores    1000       2000
Education             250        500
Electricity           300        700
House Rent            1500       2800
Vehicle Fuel          450        1000
Total                 3500       7000
The total expenditure of each family is taken as 100, and the expenditure on each item is expressed as a percentage of it. The widths of the two rectangles are in proportion to the total expenses of the two families, i.e. 3500 : 7000 or 1 : 2. The heights of the components within each rectangle follow the percentages of expenses.
Item of expenditure   Monthly expenditure
                      Family A (Rs. 3500)    Family B (Rs. 7000)
                      Rs.      %age          Rs.      %age
Provisional stores    1000     28.57         2000     28.57
Education             250      7.14          500      7.14
Electricity           300      8.57          700      10.00
House Rent            1500     42.86         2800     40.00
Vehicle Fuel          450      12.86         1000     14.29
Total                 3500     100           7000     100
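[Figure: percentage sub-divided rectangular diagram. x-axis: Family (A, B), with rectangle widths in the ratio 1 : 2; y-axis: % of Expenditure, scale 0 to 100; each rectangle is divided into Provisional Stores, Education, Electricity, House Rent and Vehicle Fuel.]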
b) Square diagram
To draw square diagrams, the square roots of the values of the various items are taken, because the area of each square must be proportional to the value it represents. The sides of the squares are then drawn in the ratio of these square roots, using a suitable scale.
Ex: Draw the square diagram for the following data: 4900, 2500, 1600.
Solution: The square roots of the items are 70, 50 and 40. Dividing each by 10 gives sides of 7, 5 and 4 units.
[Figure: square diagram showing three squares with sides of 7, 5 and 4 units, representing the values 4900, 2500 and 1600.]
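The side lengths in the solution can be checked with a short Python sketch. The function name square_sides and the default scale of 10 are assumptions chosen to mirror the worked example.

```python
import math

def square_sides(values, scale=10):
    """Side of each square: square root of the value, divided by a common scale."""
    return [math.sqrt(v) / scale for v in values]

sides = square_sides([4900, 2500, 1600])
print(sides)
```

Because the side is the square root of the value, the area of each square (side x side) stays proportional to the value it represents.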
Pie diagram
A pie diagram shows the partitioning of a total into its component parts. It is used to show classes or groups of data in proportion to the whole data set. The entire pie represents all the data, while each slice represents a different class or group within the whole. The following illustration shows the construction of a pie diagram.
Ex: Revenue collections from petroleum products by the government for the year 2005-2006, in Rs. crore, are as follows. Draw the pie diagram.
Customs                        9600
Excise                         49300
Corporate Tax and dividend     18900
States taking                  48800
Total                          126600
Solution: The angle for each item is (value / total) x 360°.
Item / Source                 Value in crores   Angle of circle                       %age
Customs                       9600              (9600 / 126600) x 360 = 27.30°        7.58
Excise                        49300             (49300 / 126600) x 360 = 140.19°      38.94
Corporate Tax and Dividend    18900             (18900 / 126600) x 360 = 53.74°       14.93
States taking                 48800             (48800 / 126600) x 360 = 138.77°      38.55
Total                         126600            360°                                  100
[Figure: pie diagram with four slices (Customs, Excise, Corporate Tax and Dividend, States taking), sized according to the percentages in the table above.]
Source: India Today 19 June, 2006
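The angle and percentage of each slice follow directly from the rule angle = (value / total) x 360. The sketch below is one possible way to compute them; the function name pie_slices is an assumption. Values are rounded to two decimals, so they may differ in the last digit from hand-rounded figures.

```python
def pie_slices(values):
    """Return {item: (angle in degrees, percentage)} for each component of the total."""
    total = sum(values.values())
    return {
        item: (round(360 * v / total, 2), round(100 * v / total, 2))
        for item, v in values.items()
    }

# Revenue collections for 2005-2006 in Rs. crore, from the example
revenue = {
    "Customs": 9600,
    "Excise": 49300,
    "Corporate Tax and Dividend": 18900,
    "States taking": 48800,
}

slices = pie_slices(revenue)
for item, (angle, pct) in slices.items():
    print(f"{item}: {angle} degrees, {pct}%")
```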
Choice or selection of diagram
There are many methods of depicting statistical data through diagrams. No single diagram is suited for all purposes. The choice of a diagram to suit a given set of data requires skill, knowledge and experience. Primarily, the choice depends upon the nature of the data, the purpose of the presentation and the audience for whom it is meant. The nature of the data helps in deciding between a one-dimensional, two-dimensional or three-dimensional diagram. It is also necessary to know the audience for whom the diagram is drawn.
The following points are to be kept in mind for the choice of diagram.
1. For the common man, who has less knowledge of statistics, cartograms and pictograms are suitable.
2. To present the components apart from the magnitude of values, a sub-divided bar diagram can be used.
3. When a large number of components is to be shown, a pie diagram is suitable.
Graphic presentation
A graphic presentation is a visual form of presentation. Graphs are drawn on a special type of paper known as graph paper.
Common graphic representations are
a) Histogram
b) Frequency polygon
c) Cumulative frequency curve (ogive)
Advantages of graphic presentation
1. It provides an attractive and impressive view.
2. It simplifies the complexity of data.
3. It helps in direct comparison.
4. It helps in further statistical analysis.
5. It is the simplest method of presentation of data.
6. It shows the trend and pattern of data.
Difference between graph and diagram
Diagram                                                Graph
1. Ordinary paper can be used                          1. Graph paper is required
2. It is attractive and easily understandable          2. It needs some effort to understand
3. It is appropriate and effective for measuring       3. It creates problems with more variables
   more variables
4. It cannot be used for further analysis              4. It can be used for further analysis
5. It gives comparison                                 5. It shows the relationship between variables
6. Data are represented by bars and rectangles         6. Points and lines are used to represent data
Frequency Histogram
In this type of representation the given data are plotted in the form of a series of rectangles. Class intervals are marked along the x-axis and the frequencies along the y-axis, according to a suitable scale. Unlike the bar chart, which is one-dimensional, a histogram is two-dimensional: both the length and the width are important. A histogram is constructed from a frequency distribution of grouped data, where the height of each rectangle is proportional to the respective frequency and its width represents the class interval. Each rectangle adjoins the next, so a blank space between rectangles means that the category is empty and there are no values in that class interval.
Ex: Construct a histogram for the following data.
Marks obtained (x)   No. of students (f)   Mid-point
15 - 25              5                     20
25 - 35              3                     30
35 - 45              7                     40
45 - 55              5                     50
55 - 65              3                     60
65 - 75              7                     70
Total                30
For convenience, the frequency distribution is presented along with the mid-point of each class interval, where the mid-point is simply the average of the lower and upper boundaries of the class interval.
[Figure: histogram. x-axis: Class Interval (Marks), boundaries 15 to 75; y-axis: Frequency (No. of students), scale 0 to 7.]
Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either the values of a discrete variable or the mid-points of the class intervals are plotted against the frequencies, and the plotted points are joined by straight lines. Since the frequencies do not start or end at zero, the diagram as such would not touch the horizontal axis. However, since the area under the entire curve is the same as that of the histogram, which is 100%, the curve must be enclosed. The first mid-point is therefore joined to a fictitious preceding mid-point whose frequency is zero, so that the beginning of the curve touches the horizontal axis, and the last mid-point is joined to a fictitious succeeding mid-point whose frequency is also zero, so that the curve ends at the horizontal axis. This enclosed diagram is known as a frequency polygon.
Ex: Construct a frequency polygon for the following data.
Marks (CI)   Frequency (f)   Mid-point
15 - 25      5               20
25 - 35      3               30
35 - 45      7               40
45 - 55      5               50
55 - 65      3               60
65 - 75      7               70
[Figure: A frequency polygon. x-axis: Mid point (x), scale 0 to 100; y-axis: Frequency, scale 0 to 10.]
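The points to be joined in a frequency polygon, including the two fictitious end points of zero frequency, can be listed as in the following sketch. The function name polygon_points is an assumption, and the code assumes that all class intervals have the same width.

```python
def polygon_points(classes, freqs):
    """Return (mid-point, frequency) pairs, closed at both ends with zero frequency."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    width = classes[0][1] - classes[0][0]      # common class width (assumed equal)
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

points = polygon_points(classes, freqs)
print(points)
```

The first and last pairs are the fictitious mid-points (10 and 80 here) at which the polygon touches the horizontal axis.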
Cumulative frequency curve (ogive)
Ogives are the graphic representations of a cumulative frequency distribution. They are classified as 'less than' and 'more than' ogives. For a 'less than' ogive, cumulative frequencies are plotted against the upper boundaries of their respective class intervals. For a 'more than' ogive, cumulative frequencies are plotted against the lower boundaries of their respective class intervals. Ogives are used for comparison purposes. Several ogives can be drawn on the same grid in different colours for easier visualisation and differentiation.
Ex:
Marks (CI)   Frequency (f)   Mid-point   Cum. Freq. (Less than)   Cum. Freq. (More than)
15 - 25      5               20          5                        30
25 - 35      3               30          8                        25
35 - 45      7               40          15                       22
45 - 55      5               50          20                       15
55 - 65      3               60          23                       10
65 - 75      7               70          30                       7
Less than ogive diagram
[Figure: 'Less than' ogive. x-axis: Upper Boundary (CI); y-axis: Less than Cumulative Frequency, scale 5 to 30.]
More than ogive diagram
[Figure: 'More than' ogive. x-axis: Lower Boundary (CI); y-axis: More than Cumulative Frequency, scale 10 to 35.]
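The coordinates plotted in the two ogives can be generated from the class intervals and frequencies. The sketch below is illustrative only; the function name ogive_points is an assumption.

```python
from itertools import accumulate

def ogive_points(classes, freqs):
    """Return plotting points for the 'less than' and 'more than' ogives."""
    less_cum = list(accumulate(freqs))                    # running total from the first class
    more_cum = list(accumulate(reversed(freqs)))[::-1]    # running total from the last class
    less_than = [(hi, c) for (_, hi), c in zip(classes, less_cum)]   # against upper boundaries
    more_than = [(lo, c) for (lo, _), c in zip(classes, more_cum)]   # against lower boundaries
    return less_than, more_than

classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freqs = [5, 3, 7, 5, 3, 7]

less_than, more_than = ogive_points(classes, freqs)
print(less_than)
print(more_than)
```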
LESSON 1
STATISTICS FOR MANAGEMENT
Session 2 Duration: 1 hr
Classification and Tabulation
The data collected for the purpose of a statistical inquiry sometimes consist of a few fairly simple figures, which can be easily understood without any special treatment. But more often there is an overwhelming mass of raw data without any structure. Such an unwieldy, unorganised and shapeless mass of collected data is not capable of being rapidly or easily assimilated or interpreted. Unorganised data are not fit for further analysis and interpretation. In order to make the data simple and easily understandable, the first task is to condense and simplify them in such a way that irrelevant data are removed and their significant features stand out prominently. The procedure adopted for this purpose is known as the method of classification and tabulation. Classification helps proper tabulation.
"Classified and arranged facts speak for themselves; unarranged, unorganised they are dead as mutton."
- Prof. J.R. Hicks
Meaning of Classification
Classification is a process of arranging things or data in groups or classes according to their resemblances and affinities, and it gives expression to the unity of attributes that may subsist among a diversity of individuals.
Definition of Classification
"Classification is the process of arranging data into sequences and groups according to their common characteristics or separating them into different but related parts."
- Secrist
"The process of grouping a large number of individual facts and observations on the basis of similarity among the items is called classification."
- Stockton & Clark
Characteristics of classification
a) Classification performs homogeneous grouping of data.
b) It brings out points of similarity and dissimilarity.
c) The classification may be either real or imaginary.
d) Classification is flexible enough to accommodate adjustments.
Objectives / purposes of classification
i) To simplify and condense large data
ii) To present the facts in an easily understandable form
iii) To allow comparisons
iv) To help draw valid inferences
v) To relate the variables among the data
vi) To help further analysis
vii) To eliminate unwanted data
viii) To prepare for tabulation
Guiding principles (rules) of classifications
Following are the general guiding principles for good classifications
a) Exhaustive: Classification should be exhaustive. Each and every item in the data must belong to one of the classes. Introduction of a residual class (e.g. others, miscellaneous etc.) should be avoided.
b) Mutually exclusive: Each item should be placed in only one class.
c) Suitability: The classification should conform to the object of the inquiry.
d) Stability: Only one principle must be maintained throughout the classification and analysis.
e) Homogeneity: The items included in each class must be homogeneous.
f) Flexibility: A good classification should be flexible enough to accommodate new or changed situations.
Modes / Types of Classification
Modes / types of classification refer to the class categories into which the data can be sorted and tabulated. These categories depend on the nature of the data and the purpose for which the data are being sought.
Important types of classification
a) Geographical (i.e. on the basis of area or region)
b) Chronological (on a temporal / historical basis, i.e. with respect to time)
c) Qualitative (on the basis of character / attributes)
d) Numerical, quantitative (on the basis of magnitude)
a) Geographical Classification
In geographical classification, the classification is based on geographical regions.
Ex: Sales of the company (in million rupees), region wise
Region   Sales
North    285
South    300
East     185
West     235
b) Chronological Classification
If the statistical data are classified according to the time of their occurrence, the classification is called chronological classification.
Ex: Sales reported by a departmental store
Month      Sales (Rs. in lakhs)
January    22
February   26
March      32
April      25
May        27
June       30
c) Qualitative Classification
In qualitative classification, the data are classified according to the presence or absence of attributes in the given units. Thus, the classification is based on some quality characteristic / attribute.
Ex: Sex, Literacy, Education, Class grade etc.
Further, it may be classified as
i) Simple classification
ii) Manifold classification
i) Simple classification: If the classification is done into only two classes, it is known as simple classification.
Ex: a) Population into Male / Female
b) Population into Educated / Uneducated
ii) Manifold classification: In this classification, the classification is based on more than one attribute at a time.
Ex:
[Figure: tree diagram of a manifold classification. Population is divided into Smokers and Non-smokers; each of these is divided into Literate and Illiterate; each of these is in turn divided into Male and Female.]
d) Quantitative Classification: In quantitative classification, the classification is based on quantitative measurements of some characteristic, such as age, marks, income, production, sales etc. The quantitative phenomenon under study is known as a variable, and hence this classification is also called classification by variable.
Ex: For a 50 marks test, the marks obtained by students are classified as follows.
Marks      No. of students
0 - 10     5
10 - 20    7
20 - 30    10
30 - 40    25
40 - 50    3
Total students = 50
In this classification, the marks obtained by students is the variable and the number of students in each class represents the frequency.
Tabulation
Meaning and Definition of Tabulation
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purpose of analysis and statistical inference.
Major Objectives of Tabulation
1. To simplify complex data
2. To facilitate comparison
3. To economize space
4. To draw valid inferences / conclusions
5. To help further analysis
Differences between Classification and Tabulation
1. Data are first classified and then presented in tables; classification is the basis for tabulation.
2. Tabulation is a mechanical function of classification, because in tabulation the classified data are placed in rows and columns.
3. Classification is a process of statistical analysis, while tabulation is a process of presenting data in a suitable structure.
Classification of tables
Classification is done based on
1. Coverage (simple and complex table)
2. Objective / purpose (general purpose or reference table / special purpose or summary table)
3. Nature of inquiry (primary and derived table)
Ex:
a) Simple table: Data are classified based on only one characteristic.
Distribution of marks
Class Marks   No. of students
30 - 40       20
40 - 50       20
50 - 60       10
Total         50
b) Two-way table: Classification is based on two characteristics.
Class Marks   No. of students
              Boys   Girls   Total
30 - 40       10     10      20
40 - 50       15     5       20
50 - 60       3      7       10
Total         28     22      50
Frequency Distribution
A frequency distribution is a table used to organize data. The left column (called classes or groups) lists numerical intervals of the variable under study. The right column lists the frequencies, i.e. the number of occurrences in each class or group. The intervals are normally of equal size and cover the range of the sample observations.
It is simply a table in which the gathered data are grouped into classes and the number of occurrences falling in each class is recorded.
Definition
"A frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies."
- Croxton and Cowden
A frequency distribution can be classified as
a) Series of individual observations
b) Discrete frequency distribution
c) Continuous frequency distribution
a) Series of individual observations
A series of individual observations is a series in which the items are listed individually, one after another. For statistical calculations, these observations can be arranged in either ascending or descending order. Such an arrangement is called an array.
Ex:
Roll No.   Marks obtained in statistics paper
1          83
2          80
3          75
4          92
5          65
The above list is raw data. Presented in this form, the data do not reveal much information. Arranging the data in ascending or descending order of magnitude gives a better presentation and is called arraying of data.
Discrete (ungrouped) Frequency Distribution
If the data series is presented in such a way that it indicates the exact measurement of the units, it is called a discrete frequency distribution. A discrete variable is one whose values differ from each other by definite amounts.
Ex:
Assume that a survey has been made to find the number of post-graduates in 10 families chosen at random; the resulting raw data could be as follows.
0, 1, 3, 1, 0, 2, 2, 2, 2, 4
These data can be classified into an ungrouped frequency distribution. The number of post-graduates becomes the variable (x), for which we can list the frequency of occurrence (f) in tabular form as follows.
Number of post-graduates (x)   Frequency (f)
0                              2
1                              2
2                              4
3                              1
4                              1
The above example shows a discrete frequency distribution, where the variable has discrete numerical values.
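A discrete (ungrouped) frequency distribution is simply a count of how often each distinct value occurs. As a minimal sketch, Python's standard collections.Counter can build the table for the survey data; the variable names are illustrative.

```python
from collections import Counter

# Number of post-graduates in each of the 10 surveyed families
raw = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

freq = Counter(raw)              # maps each value x to its frequency f
table = sorted(freq.items())     # arrange values in ascending order

for x, f in table:
    print(x, f)
```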
Continuous frequency distribution (grouped frequency distribution)
A continuous data series is one where the measurements are only approximations and are expressed in class intervals within certain limits. In a continuous frequency distribution the class intervals theoretically continue from the start of the frequency distribution to its end without a break. According to Boddington, a variable which can take every intermediate value between the smallest and largest values in the distribution gives a continuous frequency distribution.
Ex:
Marks obtained by 20 students in an exam for 50 marks are given below. Convert the data into a continuous frequency distribution.
18 23 28 29 44 28 48 33 32 43
24 29 32 39 49 42 27 33 28 29
By grouping the marks into class intervals of 5, the following frequency distribution table can be formed.
Marks      No. of students
0 - 5      0
5 - 10     0
10 - 15    0
15 - 20    1
20 - 25    2
25 - 30    7
30 - 35    4
35 - 40    1
40 - 45    3
45 - 50    2
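The grouping shown in the table can be sketched in Python as follows. The function name grouped_frequency is an assumption; the classes are treated as exclusive (lower limit included, upper limit excluded), and every value is assumed to lie between start (inclusive) and stop (exclusive).

```python
def grouped_frequency(data, start, stop, width):
    """Count values in classes [lower, lower + width); the upper limit is excluded."""
    counts = {(lo, lo + width): 0 for lo in range(start, stop, width)}
    for x in data:
        lo = start + ((x - start) // width) * width   # lower limit of the class containing x
        counts[(lo, lo + width)] += 1
    return counts

marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]

dist = grouped_frequency(marks, 0, 50, 5)
for (lo, hi), f in dist.items():
    print(f"{lo} - {hi}: {f}")
```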
LESSON 1
STATISTICS FOR MANAGEMENT
Session 3 Duration: 1 hr
Technical terms used in formulating a frequency distribution
a) Class limits
The class limits are the smallest and largest values in the class.
Ex:
In the class 0 - 10, the lowest value is 0 and the highest value is 10. These two boundaries of the class are called the lower and upper limits of the class. Class limits are also called class boundaries.
b) Class interval
The difference between the upper and lower limits of a class is known as the class interval.
Ex:
In the class 0 - 10, the class interval is (10 - 0) = 10.
The formula to find the class interval is given below.
i = (L - S) / K
where
L = Largest value
S = Smallest value
K = the number of classes
Ex:
If the marks of 60 students in a class vary between 40 and 100 and we want to form 6 classes, the class interval would be
i = (L - S) / K = (100 - 40) / 6 = 60 / 6 = 10, where L = 100, S = 40 and K = 6.
Therefore, the class intervals would be 40 - 50, 50 - 60, 60 - 70, 70 - 80, 80 - 90 and 90 - 100.
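The class-interval formula and the resulting list of classes can be written as a short Python sketch. The function name class_intervals is an assumption introduced for illustration.

```python
def class_intervals(largest, smallest, k):
    """Return the class interval i = (L - S) / K and the K resulting classes."""
    i = (largest - smallest) / k
    classes = [(smallest + n * i, smallest + (n + 1) * i) for n in range(k)]
    return i, classes

width, classes = class_intervals(100, 40, 6)
print(width)
print(classes)
```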
Methods of forming class-interval
a) Exclusive method (overlapping)
In this method, the upper limit of one class interval is the lower limit of the next class. This method maintains continuity of the data.
Ex:
Marks      No. of students
20 - 30    5
30 - 40    15
40 - 50    25
A student whose mark is between 20 and 29.9 will be included in the 20 - 30 class.
A better way of expressing this is
Marks                                                    No. of students
20 to less than 30 (20 or more but less than 30)         5
30 to less than 40                                       15
40 to less than 50                                       25
Total students                                           45
b) Inclusive method (non-overlapping)
In this method, both the lower and the upper limit are included in the class itself.
Ex:
Marks      No. of students
20 - 29    5
30 - 39    15
40 - 49    25
A student whose mark is 29 is included in the 20 - 29 class interval, and a student whose mark is 39 is included in the 30 - 39 class interval.
Class Frequency
The number of observations falling within a class interval is called its class frequency.
Ex: A class frequency of 5 for the class 90 - 100 means that 5 students scored between 90 and 100. If we add the frequencies of all the individual classes, the total frequency represents the total number of items studied.
Magnitude of class interval
The magnitude of the class interval depends on the range and the number of classes. The range is the difference between the highest and the smallest values in the data series. A class interval is generally taken in multiples of 5, 10, 15 and 20.
Sturges' formula to find the number of classes is given below.
K = 1 + 3.322 log N
K = No. of classes
log N = Logarithm (to base 10) of the total no. of observations
Ex: If the total number of observations is 100, then the number of classes would be
K = 1 + 3.322 log 100
K = 1 + 3.322 x 2
K = 1 + 6.644
K = 7.644, i.e. 8 (rounded off)
NOTE: Under this formula the number of classes cannot be less than 4 or greater than 20.
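Sturges' rule is easy to evaluate in code. The sketch below returns both the raw value of K and the rounded number of classes; the function name sturges_classes is an assumption, and the clamp to the range 4 to 20 is one way of applying the NOTE above.

```python
import math

def sturges_classes(n):
    """Return (K, number of classes) using K = 1 + 3.322 log10(N)."""
    k = 1 + 3.322 * math.log10(n)
    classes = max(4, min(20, round(k)))   # keep the result between 4 and 20
    return k, classes

k, classes = sturges_classes(100)
print(k, classes)
```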
Class mid point or class mark
The mid value or central value of the class interval is called the mid point.
Mid point of a class = (lower limit of class + upper limit of class) / 2
Sturges' formula to find size of class interval
Size of class interval (h) = Range / (1 + 3.322 log N)
Ex: In a group of 50 workers, the highest wage is Rs. 250 and the lowest wage is Rs. 100 per day. Find the size of the interval.
h = Range / (1 + 3.322 log N) = (250 - 100) / (1 + 3.322 log 50) = 150 / 6.644 = 22.58, i.e. about 23
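The size-of-interval formula can be evaluated directly as a check on the arithmetic. This is a sketch only: the function name sturges_interval_size is an assumption, and N = 50 is taken from the "log 50" term in the worked formula.

```python
import math

def sturges_interval_size(highest, lowest, n):
    """h = Range / (1 + 3.322 log10(N)), where Range = highest - lowest."""
    return (highest - lowest) / (1 + 3.322 * math.log10(n))

# Wage example: highest Rs. 250, lowest Rs. 100, N = 50 workers
h = sturges_interval_size(250, 100, 50)
print(round(h, 2))
```

Note that the denominator includes the factor 3.322; leaving it out (dividing by 1 + log N) gives a much larger and incorrect interval size.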
Constructing a frequency distribution
The following guidelines may be considered for the construction of frequency distribution.
a) The classes should be clearly defined and each observation must belong to one and only one class interval. The class intervals must be all-inclusive and non-overlapping.
b) The number of classes should be neither too large nor too small. Too few classes result in a greater interval width and a loss of accuracy. Too many class intervals result in complexity.
c) All intervals should be of the same width. This is preferred for easy computation.
The width of interval = Range / Number of classes
d) Open-end classes should be avoided, since they create difficulty in analysis and interpretation.
e) Intervals should be continuous throughout the distribution. This is important for a continuous distribution.
f) The lower limits of the class intervals should be simple multiples of the interval.
Ex: The weights of a sample of 30 students of a particular class are as follows. Construct a frequency distribution for the given data.
62 58 58 52 48 53 54 63 69 63
57 56 46 48 53 56 57 59 58 53
52 56 57 52 52 53 54 58 61 63
Steps of construction
Step 1
Find the range of the data.
Highest value (H) = 69
Lowest value (L) = 46
Range = H - L = 69 - 46 = 23
Step 2
Find the number of class intervals using Sturges' formula.
K = 1 + 3.322 log N
K = 1 + 3.322 log 30
K = 5.91, say K = 6
No. of classes = 6
Step 3
Find the width of the class interval.
Width of class interval = Range / Number of classes = 23 / 6 = 3.833, i.e. about 4
Step 4
Count all the observations belonging to each class interval and assign this total frequency to the corresponding class interval as follows. Each class includes its lower limit and excludes its upper limit.
Class interval   Tally bars    Frequency
46 - 50          |||           3
50 - 54          ||||| |||     8
54 - 58          ||||| |||     8
58 - 62          ||||| |       6
62 - 66          ||||          4
66 - 70          |             1
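The four steps above can be sketched as one small Python function. The name frequency_distribution is an assumption; the classes start at the lowest value, the width is rounded up to a whole number, and each class excludes its upper limit, which works here because the highest value (69) lies below the last upper limit (70).

```python
import math

weights = [62, 58, 58, 52, 48, 53, 54, 63, 69, 63,
           57, 56, 46, 48, 53, 56, 57, 59, 58, 53,
           52, 56, 57, 52, 52, 53, 54, 58, 61, 63]

def frequency_distribution(data):
    low, high = min(data), max(data)                  # Step 1: range = high - low
    k = round(1 + 3.322 * math.log10(len(data)))      # Step 2: Sturges' formula
    width = math.ceil((high - low) / k)               # Step 3: width, rounded up
    table = []
    for n in range(k):                                # Step 4: count per class
        lo = low + n * width
        hi = lo + width
        table.append(((lo, hi), sum(lo <= x < hi for x in data)))
    return table

table = frequency_distribution(weights)
for (lo, hi), f in table:
    print(f"{lo} - {hi}: {f}")
```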
Cumulative frequency distribution
A cumulative frequency distribution indicates directly the number of units that lie above or below the specified values of the class intervals. When the interest of the investigator is in the number of cases below a specified value, the specified value is taken as the upper limit of the class interval. This is known as a 'less than' cumulative frequency distribution. When the interest lies in finding the number of cases above a specified value, this value is taken as the lower limit of the class interval. This is known as a 'more than' cumulative frequency distribution.
The cumulative frequency is simply the running total of the consecutive frequencies.
Ex:
Marks      No. of students   Less than cumulative frequency
0 - 10     5                 5
10 - 20    3                 8
20 - 30    10                18
30 - 40    20                38
40 - 50    12                50
In the above 'less than' cumulative frequency distribution, 5 students scored less than 10, 8 scored less than 20, 18 scored less than 30 and so on.
Similarly, the following table shows a 'more than' (greater than) cumulative frequency distribution.
Ex:
Marks      No. of students   More than cumulative frequency
0 - 10     5                 50
10 - 20    3                 45
20 - 30    10                42
30 - 40    20                32
40 - 50    12                12
In the above 'more than' cumulative frequency distribution, 50 students scored 0 or more, 45 scored 10 or more, 42 scored 20 or more and so on.
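Both cumulative columns can be derived from the class frequencies alone. In this minimal sketch the 'less than' column is a running total, and the 'more than' column is obtained from it as total - running total + own frequency; the variable names are illustrative.

```python
from itertools import accumulate

freqs = [5, 3, 10, 20, 12]          # no. of students in classes 0-10, ..., 40-50

less_than = list(accumulate(freqs))  # running total from the first class
total = sum(freqs)
# Students in this class or any higher class
more_than = [total - c + f for c, f in zip(less_than, freqs)]

print(less_than)
print(more_than)
```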
Diagrammatic and Graphic Representation
The data collected can be presented graphically or pictorially for easy understanding and quick interpretation. Diagrams and graphs give visual indications of magnitudes, groupings, trends and patterns in the data. These features can be presented more simply in graphical form. Diagrams and graphs also help in comparing variables.
Diagrammatic presentation
A diagram is a visual form of presentation of statistical data. The term diagram refers to various types of devices such as bars, circles, maps, pictorials and cartograms.
Importance of Diagrams
1. They are simple, attractive and easily understandable.
2. They give quick information.
3. They help to compare the variables.
4. Diagrams are more suitable for illustrating discrete data.
5. They leave a more lasting impression on the reader's mind.
Limitations of diagrams
1. Diagrams show only approximate values.
2. Diagrams are not suitable for further analysis.
3. Some diagrams (e.g. multidimensional ones) can be understood only by experts.
4. Details cannot be provided fully.
5. They are useful only for comparison.
General Rules for drawing the diagrams
i) Each diagram should have a suitable title, placed at the top or bottom, indicating the theme it is intended to convey.
ii) The size of the diagram should emphasize the important characteristics of the data.
iii) An appropriate proportion should be maintained between the length and breadth of the diagram.
iv) A proper and suitable scale should be adopted for the diagram.
v) Selecting an appropriate diagram is important; a wrong selection may mislead the reader.
vi) The source of the data should be mentioned at the bottom.
vii) The diagram should be simple and attractive.
viii) The diagram should be effective rather than complex.
Some important types of diagrams
a) One-dimensional diagrams (line and bar)
b) Two-dimensional diagrams (rectangle, square, circle)
c) Three-dimensional diagrams (cube, sphere, cylinder etc.)
d) Pictogram
e) Cartogram
a) One-dimensional diagrams (line and bar)
In one-dimensional diagrams, only the length of the bars or lines is taken into account; the width of the bars is not considered. One-dimensional diagrams are classified mainly as follows.
i) Line diagram
ii) Bar diagram
- Vertical bar diagram
- Horizontal bar diagram
- Multiple (compound) bar diagram
- Sub-divided (component) bar diagram
- Percentage subdivided bar diagram
i) Line diagram
This is the simplest type of one-dimensional diagram. The heights of the lines are drawn in proportion to the size of the figures, and the distance between the lines is kept uniform. The limitations of this diagram are that it is not attractive and cannot present more than one item of information.
Ex: Draw the line diagram for the following data.
Year                                                      2001  2002  2003  2004  2005  2006
No. of students passed in first class with distinction       5     7    12     5    13    15
[Figure: line diagram. x-axis: Year (2001 to 2006); y-axis: No. of students passed in FCD (first class with distinction); line heights 5, 7, 12, 5, 13 and 15.]
Interpretation of the diagram: The number of students passing in FCD is highest in 2006 and lowest in 2001 and 2004.
ii) Simple bar diagram
A simple bar diagram can be drawn using horizontal or vertical bars. It is a very common diagram in business and economics.
Vertical bar diagram
The annual expenses of maintaining cars of various makes are given below. Draw the vertical bar diagram. The annual expenses include fuel, maintenance, repair, assistance and insurance.
Type of the car    Expense in Rs. / Year
Maruthi Udyog      47533
Hyundai            59230
Tata Motors        63270
Source: 2005 TNS TCS Study
Published at: Vijaya Karnataka, dated: 03.08.2006
[Figure: vertical bar diagram. x-axis: type of car (Maruthi Udyog, Hyundai, Tata Motors); y-axis: annual expense in Rs., scale 30000 to 70000; bar heights 47533, 59230 and 63270.]
Interpretation of the diagram
a) The annual expenses of the Maruthi Udyog car are lower than those of the other brands depicted.
b) The Tata Motors car has the highest annual expenses.
Horizontal bar diagram
Production data for the world's ten biggest steel makers are given below. Draw the horizontal bar diagram.
Steel maker       Production in million tonnes
Arcelor Mittal    110
Nippon            32
POSCO             31
JFE               30
BAO Steel         24
US Steel          20
NUCOR             18
RIVA              18
Thyssen-krupp     17
Tangshan          16
[Figure: horizontal bar diagram. Y-axis: top 10 steel makers; X-axis: production of steel (million tonnes), scale 0 to 120.]
Source: ISSB Published by India Today

Compound bar diagram (Multiple bar diagram)
Multiple bar diagrams are used to provide more information than a simple bar diagram. A multiple
bar diagram presents more than one phenomenon and is highly useful for direct comparison. The bars are
drawn side-by-side, and different colours, shades or hatches can be used to indicate each variable.
Ex: Draw the bar diagram for the following data. Resale value of the cars (Rs. 000) is as follows.
Year (Model) Santro Zen Wagonr
2003 208 252 248
2004 240 278 274
2005 261 296 302
[Figure: multiple bar diagram. X-axis: model year of car (2003, 2004, 2005); Y-axis: value in Rs. (000); bars for Santro, Zen and Wagonr.]
Source: True value used car purchase data
Published by: Vijaya Karnataka, dated: 03.08.2006
Ex: Represent following in suitable diagram
Class A B C
Male 1000 1500 1500
Female 500 800 1000
Total 1500 2300 2500
[Figure: multiple bar diagram. X-axis: Class (A, B, C); Y-axis: Population (in Nos.); bars for Male and Female.]
Ex: Draw the suitable diagram for the following data.
Mode of investment    Investment in 2004      Investment in 2005
                      Rs.        %age         Rs.        %age
NSC                   25000      43.10        30000      45.45
MIS                   15000      25.86        10000      15.15
Mutual Fund           15000      25.86        25000      37.87
LIC                   3000       5.17         1000       1.52
Total                 58000      100          66000      100
[Figure: percentage sub-divided bar diagram. X-axis: Year (2004, 2005); Y-axis: % of investment, scale 0 to 100; components: NSC, MIS, Mutual Fund, LIC.]
Two-dimensional diagram
In a two-dimensional diagram both the breadth and the length of the diagram are considered, because
the area of the diagram represents the data. The important two-dimensional diagrams are
a) Rectangular diagram
b) Square diagram
Rectangular diagram
Rectangular diagrams are used to depict two or more variables. This diagram helps direct
comparison. The areas of the rectangles are kept in proportion to the values. It may be of two types.
i) Percentage sub-divided rectangular diagram
ii) Sub-divided rectangular diagram
In the former case, the widths of the rectangles are proportional to the values, the various components of
the values are converted into percentages, and the rectangles are divided according to them. The latter
case is used to show related phenomena such as cost per unit, quality of production etc.
Ex: Draw the rectangular diagram for the following data.
Item of expenditure    Expenditure in Rs.
                       Family A    Family B
Provision stores       1000        2000
Education              250         500
Electricity            300         700
House Rent             1500        2800
Vehicle Fuel           450         1000
Total                  3500        7000
The total expenditure is taken as 100 and the expenditure on each item is expressed as a
percentage. The widths of the two rectangles are in proportion to the total expenses of the two families,
i.e. 3500 : 7000 or 1 : 2. The heights of the components within each rectangle follow the percentages of
expenses.
Item of expenditure    Monthly expenditure
                       Family A (Rs. 3500)      Family B (Rs. 7000)
                       Rs.       %age           Rs.       %age
Provision stores       1000      28.57          2000      28.57
Education              250       7.14           500       7.14
Electricity            300       8.57           700       10.00
House Rent             1500      42.86          2800      40.00
Vehicle Fuel           450       12.86          1000      14.29
Total                  3500      100            7000      100
[Figure: percentage sub-divided rectangular diagram. X-axis: Family (A, B); Y-axis: % of expenditure, scale 0 to 100; components: Provision Stores, Education, Electricity, House Rent, Vehicle Fuel.]
Square diagram
To draw square diagrams, the square root of the value of each item to be shown is taken, and the
sides of the squares are drawn in proportion to these square roots. A suitable scale may be used to depict
the diagram.
Ex: Draw the square diagram for the following data.
4900 2500 1600
Solution: The square roots of the items are 70, 50 and 40. Dividing each by 10 gives sides of 7, 5
and 4 units.
[Figure: square diagram. Three squares with sides of 7, 5 and 4 units representing the values 4900, 2500 and 1600.]
Pie diagram
Pie diagram helps us to show the partitioning of a total into its component parts. It is used to show
classes or groups of data in proportion to the whole data set. The entire pie represents all the data, while
each slice represents a different class or group within the whole. The following illustration shows the
construction of a pie diagram.
Ex: Revenue collections from petroleum products by the government for the year 2005-2006, in
Rs. crore, are as follows. Draw the pie diagram.
Customs 9600
Excise 49300
Corporate Tax and dividend 18900
States taking 48800
Total 126600
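Solution: The angle for each item is its share of the total multiplied by 360°.
Item / Source                 Value in crores   Angle of circle                                           %ge
Customs                       9600              $\frac{9600}{126600} \times 360^\circ = 27.30^\circ$      7.58
Excise                        49300             $\frac{49300}{126600} \times 360^\circ = 140.19^\circ$    38.94
Corporate Tax and Dividend    18900             $\frac{18900}{126600} \times 360^\circ = 53.74^\circ$     14.93
States taking                 48800             $\frac{48800}{126600} \times 360^\circ = 138.77^\circ$    38.55
Total                         126600            $360^\circ$                                               100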
[Figure: pie diagram with slices for Customs, Excise, Corporate Tax and Dividend, and States taking.]
Source: India Today 19 June, 2006
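The angle and percentage calculations for a pie diagram can be checked with a few lines of code. The following is only a minimal sketch using the revenue figures of the example above; the function name pie_shares is an illustrative assumption and not part of the text.

```python
# Revenue figures (Rs. crore) taken from the pie diagram example above.
revenue = {
    "Customs": 9600,
    "Excise": 49300,
    "Corporate Tax and Dividend": 18900,
    "States taking": 48800,
}


def pie_shares(values):
    """Return {item: (angle in degrees, percentage)} for a dict of positive values.

    pie_shares is a hypothetical helper name used only for this illustration.
    """
    total = sum(values.values())
    return {item: (v / total * 360, v / total * 100) for item, v in values.items()}


shares = pie_shares(revenue)
for item, (angle, pct) in shares.items():
    print(f"{item}: {angle:.2f} degrees, {pct:.2f}%")
```

The angles always add up to 360 degrees and the percentages to 100, which is a quick check on the hand calculation.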
Choice or selection of diagram
There are many methods to depict statistical data through diagrams. No single diagram is suited for
all purposes. The choice / selection of a diagram to suit a given set of data requires skill, knowledge and
experience. Primarily, the choice depends upon the nature of the data and the purpose of the presentation
for which it is meant. The nature of the data will help in deciding between a one-dimensional, two-dimensional
or three-dimensional diagram. It is also necessary to know the audience for whom the diagram is depicted.
The following points are to be kept in mind for the choice of diagram.
1. For the common man, who has less knowledge of statistics, cartograms and pictograms are suited.
2. To present the components apart from the magnitude of values, a sub-divided bar diagram can be used.
3. When a large number of components are to be shown, a pie diagram is suitable.
Graphic presentation
A graphic presentation is a visual form of presentation. Graphs are drawn on a special type of paper
known as graph paper.
Common graphic representations are
a) Histogram
b) Frequency polygon
c) Cumulative frequency curve (ogive)
Advantages of graphic presentation
1. It provides attractive and impressive view
2. Simplifies complexity of data
3. Helps for direct comparison
4. It helps for further statistical analysis
5. It is simplest method of presentation of data
6. It shows trend and pattern of data
Difference between graph and diagram
Diagram                                                    Graph
1. Ordinary paper can be used                              1. Graph paper is required
2. It is attractive and easily understandable              2. It needs some effort to understand
3. It is appropriate and effective for measuring           3. It creates problems when several variables
   several variables                                          are measured
4. It cannot be used for further analysis                  4. It can be used for further analysis
5. It gives comparison                                     5. It shows the relationship between variables
6. Data are represented by bars and rectangles             6. Data are represented by points and lines
Frequency Histogram
In this type of representation the given data are plotted in the form of a series of rectangles. Class
intervals are marked along the x-axis and the frequencies along the y-axis according to a suitable scale.
Unlike the bar chart, which is one-dimensional, a histogram is two-dimensional: both the length and the
width are important. A histogram is constructed from a frequency distribution of grouped data,
where the height of each rectangle is proportional to the respective frequency and the width represents the
class interval. Each rectangle is joined to the next; a blank space between rectangles would mean that
the category is empty and there are no values in that class interval.
Ex: Construct a histogram for the following data.
Marks obtained (x)    No. of students (f)    Mid-point
15 - 25               5                      20
25 - 35               3                      30
35 - 45               7                      40
45 - 55               5                      50
55 - 65               3                      60
65 - 75               7                      70
Total                 30
For convenience, the frequency distribution is presented along with the mid-point of each
class interval, where the mid-point is simply the average of the lower and upper boundaries of the
class interval.
[Figure: histogram. X-axis: Class Interval (Marks), 15 to 75; Y-axis: Frequency (No. of students), 0 to 7.]
Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either the values of discrete
variables or the mid-points of class intervals are plotted against the frequencies, and the plotted points are
joined together by straight lines. Since the frequencies do not start or end at zero, this diagram as
such would not touch the horizontal axis. However, the area under the entire curve is the same as that of
the histogram, which is 100%, so the curve must be enclosed. The first mid-point is therefore joined to a
fictitious preceding mid-point whose frequency is zero, so that the beginning of the curve touches the
horizontal axis, and the last mid-point is joined to a fictitious succeeding mid-point whose frequency is
also zero, so that the curve ends at the horizontal axis. This enclosed diagram is known as a frequency polygon.
Ex: For the following data construct a frequency polygon.
Marks (CI)    Frequency (f)    Mid-point
15 - 25       5                20
25 - 35       3                30
35 - 45       7                40
45 - 55       5                50
55 - 65       3                60
65 - 75       7                70
[Figure: a frequency polygon. X-axis: Mid-point (x), scale 0 to 100; Y-axis: Frequency, scale 0 to 10.]
Cumulative frequency curve (ogive)
Ogives are the graphic representations of a cumulative frequency distribution. Ogives are
classified as 'less than' and 'more than' ogives. In the case of 'less than', cumulative frequencies are
plotted against the upper boundaries of their respective class intervals. In the case of 'more than'
(greater than), cumulative frequencies are plotted against the lower boundaries of their respective class
intervals. Ogives are used for comparison purposes. Several ogives can be compared on the same grid with
different colours for easier visualisation and differentiation.
Ex:
Marks (CI)    Frequency (f)    Mid-point    Cum. Freq. (less than)    Cum. Freq. (more than)
15 - 25       5                20           5                         30
25 - 35       3                30           8                         25
35 - 45       7                40           15                        22
45 - 55       5                50           20                        15
55 - 65       3                60           23                        10
65 - 75       7                70           30                        7
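The two cumulative frequency columns can be generated mechanically from the frequencies. The sketch below is only an illustration using the marks data of the table above; the variable names are assumptions made for this example.

```python
from itertools import accumulate

# Class intervals and frequencies from the marks table above.
classes = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
freq = [5, 3, 7, 5, 3, 7]

# 'Less than' cumulative frequencies: running total from the first class.
less_than = list(accumulate(freq))
# 'More than' cumulative frequencies: running total from the last class.
more_than = list(accumulate(freq[::-1]))[::-1]

# Points to plot: 'less than' against upper boundaries,
# 'more than' against lower boundaries.
less_than_points = [(upper, cf) for (_, upper), cf in zip(classes, less_than)]
more_than_points = [(lower, cf) for (lower, _), cf in zip(classes, more_than)]

print(less_than_points)
print(more_than_points)
```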
[Figure: 'less than' ogive. X-axis: Upper Boundary (CI); Y-axis: 'less than' cumulative frequency.]
[Figure: 'more than' ogive. X-axis: Lower Boundary (CI); Y-axis: 'more than' cumulative frequency.]
Session 4
Measures of Central Tendency
Classified statistical data may sometimes be described as distributed around some value called
the central value or average in some sense. It gives the most representative value of the entire data.
Thus, the most important objective of statistical analysis is to determine a single value that
represents the characteristics of the entire raw data. This single value representing the entire data is called
the 'central value' or an 'average'. This value is the point around which all other values of the data cluster.
Therefore, it is known as a measure of location, and since this value is located at the central point nearest
to the other values of the data, it is also called a measure of central tendency.
Different methods give different central values and are referred to as the measures of central tendency.
The common measures of central tendency are a) Mean b) Median c) Mode.
These values are very useful not only in presenting an overall picture of the entire data, but also for the
purpose of making comparisons among two or more sets of data.
Average
Definition
Average is a value which is typical or representative of a set of data.
- Murray R. Spiegel
Average is an attempt to find one single figure to describe whole of figures.
- Clark & Schkade
From above definitions it is clear that average is a typical value of the entire data and is a measure
of central tendency.
Functions of an average
To represent complex or large data.
It facilitates comparative study of two variables.
Helps to study population from sample data.
Helps in decision making.
Represents single value for a series of data.
To establish mathematical relationship.
Characteristics of a typical average
It should be rigidly defined and easily understandable.
It should be simple to compute and in the form of mathematical formula.
It should be based on all the items in the data.
It should not be unduly influenced by any single item.
It should be capable of further mathematical treatment.
It should have sampling stability.
Types of average
Average or measures of central tendency are of following types.
1. Mathematical average
a. Arithmetical mean
i. Simple mean
ii. Weighted mean
b. Geometric mean
c. Harmonic mean
2. Positional Averages
a. Median
b. Mode
Arithmetic mean
Arithmetic mean is also called arithmetic average. It is the most commonly used measure of central
tendency. The arithmetic average of a series is the value obtained by dividing the total value of the various
items by their number.
Arithmetic averages are of two types
a. Simple arithmetic average
b. Weighted arithmetic average
Simple arithmetic average (Mean)
Arithmetic mean is sometimes referred to simply as 'Mean'. Ex: Mean income, Mean expenses,
Mean marks etc.
Unlike positional averages, the mean has to be computed by considering each and every observation in
the series. Hence, the mean cannot be found by inspection or observation of the items.
Simple arithmetic mean is equal to the sum of the values of the variable divided by the number of
observations in the sample.
Let $x_i$ be the variable which takes the values $x_1, x_2, x_3, \ldots, x_n$ over n items. Then the arithmetic
mean, or simply the mean of x, denoted by a bar over the variable x, is given by
$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$
where $\sum$ is the Greek symbol sigma, which denotes the summation of all $x_i$ values.
For direct observations of individual items, the arithmetic mean can be computed by the following
two methods.
a. Direct method
b. Short cut method.
The direct method uses the above equation; the steps of the short cut method are illustrated in the
subsequent topic.
Ex: (For Direct Method)
1. Calculate the mean for the following data.
Marks obtained by 6 students are given below:
20, 15, 23, 22, 25, 20.
Mean marks $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{20 + 15 + 23 + 22 + 25 + 20}{6} = \frac{125}{6}$
= 20.83
2. The six-month income of a departmental store is given below. Find the mean income of the store.
Month           Jan     Feb     Mar     Apr     May     June
Income (Rs.)    25000   30000   45000   20000   25000   20000
n = Total No. of items (observations) = 6
Total income = $\sum x_i$ = (25000 + 30000 + 45000 + 20000 + 25000 + 20000)
= 165000
Mean income = $\frac{\sum x_i}{n} = \frac{165000}{6}$ = Rs. 27500
The above example shows that if the data are large or contain large figures, the
computation required to get the mean is heavy. In order to reduce the computation one can use the
short-cut method. The method is illustrated below.
Shortcut method
The steps of this method are given below.
Step 1: Assume any one value as the mean; this is called the arbitrary average (A).
Step 2: Find the difference (deviation) of each value from the arbitrary average.
d = $x_i$ - A
Step 3: Add all the deviations (differences) to get $\sum d$.
Step 4: Use the following equation and compute the mean value.
$\bar{x} = A + \frac{\sum d}{n}$
n = Total No. of observations
$\sum d$ = Total deviation value
A = Arbitrary mean
Example: Find the mean marks obtained by the students for the following data.
20 25 20 22 20 21 23 25 22 18
Let A = 20 and n = 10
Marks    d = ($x_i$ - 20)
20       0
25       5
20       0
22       2
20       0
21       1
23       3
25       5
22       2
18       -2
         $\sum d$ = 16
$\bar{x} = A + \frac{\sum d}{n} = 20 + \frac{16}{10} = 20 + 1.6$
Mean marks $\bar{x}$ = 21.6
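The direct and short-cut methods always give the same mean, whatever arbitrary average is chosen. The following is a minimal sketch of both methods using the marks of the example above; the function names mean_direct and mean_shortcut are illustrative assumptions.

```python
def mean_direct(values):
    """Direct method: sum of the observations divided by their number."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed):
    """Short-cut method: assumed mean A plus the average deviation from A."""
    deviations = [x - assumed for x in values]  # d = x_i - A
    return assumed + sum(deviations) / len(values)


marks = [20, 25, 20, 22, 20, 21, 23, 25, 22, 18]
print(mean_direct(marks))        # direct method
print(mean_shortcut(marks, 20))  # short-cut method with A = 20
print(mean_shortcut(marks, 23))  # a different A gives the same mean
```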
1. Mathematical characteristics of mean
a. The algebraic sum of the deviations of all observations from their arithmetic mean is zero, i.e.
$\sum (x_i - \bar{x}) = 0$.
b. The sum of the squared deviations of the items from the mean is a minimum, that is, less than the sum
of the squared deviations of the items from any other value: $\sum d^2$ = minimum.
c. Since $\bar{x} = \frac{\sum x}{n}$, if any two of the three values $\bar{x}$, $\sum x$ and n are given, the third value can be
computed.
d. If all the items of a set are increased / decreased by any constant value, the arithmetic mean
also increases / decreases by the same constant.
2. Weighted arithmetic mean
The weighted mean is computed by considering the relative importance of each of the values to the
total value. The arithmetic mean gives equal importance to all the items of a distribution. In certain cases,
the relative importance of the items is not the same. To give relative importance, weightage may be given to
the variables depending on the case. Thus, weightage represents the relative importance of the items.
The weighted arithmetic mean is computed by the following equation.
Let $x_1, x_2, x_3, \ldots, x_n$ be the variables and $w_1, w_2, w_3, \ldots, w_n$ the respective weights assigned. Then the
weighted mean $\bar{x}_w$ is given by the equation below.
$\bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + x_3 w_3 + \cdots + x_n w_n}{w_1 + w_2 + w_3 + \cdots + w_n} = \frac{\sum xw}{\sum w}$
i.e., the weighted average is the ratio of the sum of the products of the values and their respective
weights to the sum of the weights.
Ex: Compute the simple and weighted arithmetic means and comment on them.
Designation        Monthly salary (Rs) (x)    Strength of cadre (w)    xw
General Manager    25000                      10                       250000
Managers           19000                      20                       380000
Supervisors        14000                      10                       140000
Office Assistant   10000                      50                       500000
Helpers            8000                       25                       200000
Total (N = 5)      $\sum x$ = 76000               $\sum w$ = 115               $\sum xw$ = 1470000
a. Simple arithmetic mean = $\frac{\sum x}{N} = \frac{76000}{5}$ = Rs. 15200
b. Weighted arithmetic mean = $\frac{\sum xw}{\sum w} = \frac{1470000}{115}$ = Rs. 12782.6
In this example, the simple arithmetic mean does not account for the difference in the number of staff
in each salary range; every designation is given equal importance. The salaries of the General Manager and
Managers have inflated the value of the simple mean. The weighted mean gives importance to the number of
persons in the various salary ranges.
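The comparison between the simple and weighted means can be reproduced as follows. This is only a sketch using the salary data of the example above; the function name weighted_mean is an illustrative assumption.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(x * w) / sum(w)."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


# Monthly salary (x) and strength of cadre (w) from the example above.
salaries = [25000, 19000, 14000, 10000, 8000]
strength = [10, 20, 10, 50, 25]

simple = sum(salaries) / len(salaries)
weighted = weighted_mean(salaries, strength)
print(f"Simple mean:   Rs. {simple:.1f}")
print(f"Weighted mean: Rs. {weighted:.1f}")
```

With equal weights the weighted mean reduces to the simple mean, which is a convenient check on the formula.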
Ex: Comment on the performance of the students of the two universities given below.
            Bombay University                                   Madras University
Course      % of pass (x)   No. of students (000) (w)   wx      % of pass (x)   No. of students (w)   wx
MBA         71              3                           213     81              5                     405
MCA         83              2                           166     76              3                     228
MA          73              5                           365     58              3                     174
M.Sc.       75              2                           150     76              1                     76
M.Com.      70              2                           140     81              2                     162
Total       $\sum x$ = 372      $\sum w$ = 14                   $\sum wx$ = 1034    $\sum x$ = 372      $\sum w$ = 14             $\sum wx$ = 1045
a. Since $\sum x$ is the same, the simple arithmetic average is the same for both universities:
$\bar{x} = \frac{\sum x}{N} = \frac{372}{5} = 74.4$
b. Weighted mean for Bombay University = $\frac{\sum wx}{\sum w} = \frac{1034}{14} = 73.86$
c. Weighted mean for Madras University = $\frac{\sum wx}{\sum w} = \frac{1045}{14} = 74.64$
Comment: On the basis of the weighted mean, the performance of Madras University students is better
than that of Bombay University students.
Discrete Series
In a discrete series, each value is multiplied by its frequency and the products are totalled; this
total is divided by the total of the frequencies to obtain the arithmetic mean.
This can be done by either the direct method or the short cut method.
Ex: Calculate the mean for the following data.
Value (x)        1    2    3    4    5
Frequency (f)    10   15   10   9    5
Steps:
1. Multiply each size of item by its frequency to get fx, and add the products to get $\sum fx$.
2. Add all the frequencies ($\sum f$ = N).
3. Use the formula $\bar{x} = \frac{\sum fx}{\sum f} = \frac{\sum fx}{N}$ to get the mean value.
Solution:
By direct method
Value (x)    Frequency (f)    fx
1            10               10
2            15               30
3            10               30
4            9                36
5            5                25
             $\sum f$ = 49        $\sum fx$ = 131
$\bar{x} = \frac{\sum fx}{N} = \frac{131}{49} = 2.67$
By short-cut method
Let A = 3 (assumed mean = 3)
Value (x)    Frequency (f)    d = (x - A)    fd
1            10               -2             -20
2            15               -1             -15
3            10               0              0
4            9                1              9
5            5                2              10
             $\sum f$ = 49                       $\sum fd$ = -16
$\bar{x} = A + \frac{\sum fd}{N} = 3 + \left(\frac{-16}{49}\right) = 2.67$
Continuous series
In a continuous frequency distribution, the individual value of each item in the frequency
distribution is not known. In a continuous series the mid-points of the various class intervals are written
down to replace the class intervals. In a continuous series the mean can be calculated by any of the
following methods.
a. Direct method
b. Short cut method
c. Step deviation method
a. Direct method
The steps of this method are as follows.
1. Find the mid value of each class.
Ex: For a class interval 20-30, the mid value is $\frac{20 + 30}{2} = \frac{50}{2} = 25$. The mid value is denoted by m.
2. Multiply the mid value m by the frequency f of each class and sum up to get $\sum fm$.
3. Use the formula $\bar{x} = \frac{\sum fm}{N}$, where N = $\sum f$, to get the mean value.
Ex: Compute the mean for the following data.
Age group (CI)    No. of persons (f)    Mid-point m    fm
0 - 10            5                     5              25
10 - 20           15                    15             225
20 - 30           25                    25             625
30 - 40           8                     35             280
40 - 50           7                     45             315
Total             $\sum f$ = 60 = N                        $\sum fm$ = 1470
Mean age $\bar{x} = \frac{\sum fm}{\sum f} = \frac{\sum fm}{N} = \frac{1470}{60}$
= 24.5
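b. Short cut method
The steps of this method are described below.
1. Find the mid value m of each class.
2. Assume any one of the mid values as the arbitrary average (A).
3. Find the deviation d = m - A of each mid value from A.
4. Multiply the deviation (difference) d by the frequency f of each class and sum up to get $\sum fd$.
5. Using the formula $\bar{x} = A + \frac{\sum fd}{N}$, find the mean value.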
Ex: Find the mean age of the patients visiting a hospital on a particular day using the following data.
Age group (CI)    No. of patients (f)    Mid value m    d = (m - 25)    fd
0 - 10            5                      5              -20             -100
10 - 20           15                     15             -10             -150
20 - 30           25                     25             0               0
30 - 40           8                      35             10              80
40 - 50           7                      45             20              140
Total             $\sum f$ = 60 = N                                         $\sum fd$ = -30
Let arbitrary average = A = 25
Mean age $\bar{x} = A + \frac{\sum fd}{N} = 25 + \left(\frac{-30}{60}\right) = 25 - \frac{1}{2}$
$\bar{x}$ = 24.5
c. Step deviation method
In this method, after finding the deviations from the arbitrary mean, they are divided by a common
factor. Scaling down the deviations by a step reduces the calculation to a minimum. The procedure of this
method is described below.
Steps of step deviation method
1. Find the mid value m of each class.
2. Select the arbitrary mean A.
3. Find the deviation d = m - A of the mid value of each class from A.
4. Divide the deviations d by a common factor C to get d' = d / C.
5. Multiply d' of each class by the frequency f to get fd' and sum up for all classes to get $\sum fd'$.
6. Using the formula $\bar{x} = A + \frac{\sum fd'}{N} \times C$ (where C is the common factor), calculate the mean value.
Ex: Find the mean age for the following data.
Age (CI)    No. of persons f    Mid value m    d = m - A = m - 25    d' = d / 10    fd'
0 - 10      5                   5              -20                   -2             -10
10 - 20     15                  15             -10                   -1             -15
20 - 30     25                  25             0                     0              0
30 - 40     8                   35             10                    1              8
40 - 50     7                   45             20                    2              14
Total       $\sum f$ = 60 = N                                                           $\sum fd'$ = -3
Let A = 25 and C = 10
$\bar{x} = A + \frac{\sum fd'}{N} \times C = 25 + \frac{(-3)}{60} \times 10 = 25 - \frac{1}{2}$
$\bar{x}$ = 24.5
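The direct and step deviation methods for a continuous series can be compared with the short sketch below, which uses the age-group data of the examples above. The function names are illustrative assumptions; the short cut method is the special case of the step deviation method with C = 1.

```python
def grouped_mean_direct(classes, freq):
    """Direct method: sum(f * m) / sum(f), with m the class mid value."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    return sum(f * m for f, m in zip(freq, mids)) / sum(freq)


def grouped_mean_step_deviation(classes, freq, assumed, factor):
    """Step deviation method: A + (sum(f * d') / N) * C, with d' = (m - A) / C."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    total_fd = sum(f * (m - assumed) / factor for f, m in zip(freq, mids))
    return assumed + total_fd / sum(freq) * factor


# Age groups and number of persons from the examples above.
age_classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
persons = [5, 15, 25, 8, 7]

print(grouped_mean_direct(age_classes, persons))
print(grouped_mean_step_deviation(age_classes, persons, assumed=25, factor=10))
print(grouped_mean_step_deviation(age_classes, persons, assumed=25, factor=1))
```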
Session 5
Measures of Central Tendency
Combined Mean
The combined arithmetic mean can be computed if we know the mean and the number of items in each
group of the data.
The following equation is used to compute the combined mean.
Let $\bar{x}_1$ and $\bar{x}_2$ be the means of the first and second groups of data, containing $N_1$ and $N_2$ items
respectively.
Then, combined mean $\bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$
If there are 3 groups, then
$\bar{x}_{123} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + N_3 \bar{x}_3}{N_1 + N_2 + N_3}$
Ex - 1:
a) Find the mean for the entire group of workers for the following data.
                  Group 1    Group 2
Mean wages        75         60
No. of workers    1000       1500
Given data: $N_1$ = 1000, $N_2$ = 1500, $\bar{x}_1$ = 75 and $\bar{x}_2$ = 60
Group mean $\bar{x}_{12} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2} = \frac{1000 \times 75 + 1500 \times 60}{1000 + 1500}$
$\bar{x}_{12}$ = Rs. 66
Ex - 2: Compute the mean for the entire group.
Medical examination    No. examined    Mean weight (pounds)
A                      50              113
B                      60              120
C                      90              115
Combined mean (grouped mean weight)
$\bar{x}_{123} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + N_3 \bar{x}_3}{N_1 + N_2 + N_3} = \frac{50 \times 113 + 60 \times 120 + 90 \times 115}{50 + 60 + 90}$
Mean weight $\bar{x}_{123}$ = 116 pounds
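The combined mean formula extends to any number of groups. The following is a minimal sketch using the two examples above; the function name combined_mean is an illustrative assumption.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(N_i * mean_i) / sum(N_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


# Ex - 1: two groups of workers (mean wages in Rs.).
wages = combined_mean([1000, 1500], [75, 60])
# Ex - 2: three medical examinations (mean weight in pounds).
weight = combined_mean([50, 60, 90], [113, 120, 115])
print(wages, weight)
```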

Merits of Arithmetic Mean
1. It is simple and easy to compute.
2. It is rigidly defined.
3. It can be used for further calculation.
4. It is based on all observations in the series.
5. It helps for direct comparison.
6. It is more stable measure of central tendency (ideal average).
Limitations / Demerits of Mean
1. It is unduly affected by extreme items.
2. It is sometimes un-realistic.
3. It may lead to confusion.
4. Suitable only for quantitative data (for variables).
5. It can not be located by graphical method or by observations.
Geometric Mean (GM)
The GM is the $n^{th}$ root of the product of the n quantities of the series. It is obtained by multiplying the
values of the items together and extracting the root of the product corresponding to the number of items.
Thus, the square root of the product of two items and the cube root of the product of three items are their
Geometric Means.
For positive values, the geometric mean is never larger than the arithmetic mean. If there are zeros or
negative numbers in the series, the geometric mean cannot be used. Logarithms can be used to find the
geometric mean, to reduce large numbers and to save time.
In the field of business management, various problems often arise relating to the average percentage
rate of change over a period of time. In such cases the arithmetic mean is not an appropriate average to
employ, and the geometric mean can be used instead. GMs are highly useful in the construction of
index numbers.
Geometric Mean (GM) = $\sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$
When the number of items in the series is larger than 3, the process of computing the GM is difficult.
To overcome this, the logarithm of each item is obtained. The logs of all the values are added up and divided
by the number of items. The antilog of the quotient obtained is the required GM.
GM = Antilog $\left[\frac{\log x_1 + \log x_2 + \cdots + \log x_n}{n}\right]$ = Antilog $\left[\frac{\sum_{i=1}^{n} \log x_i}{N}\right]$
Merits of GM
a. It is based on all the observations in the series.
b. It is rigidly defined.
c. It is best suited for averages and ratios.
d. It is less affected by extreme values.
e. It is useful for studying social and economics data.
Demerits of GM
a. It is not simple to understand.
b. It requires computational skill.
c. GM cannot be computed if any of item is zero or negative.
d. It has restricted application.
Ex - 1:
a. Find the GM of the data 2, 4, 8.
$x_1$ = 2, $x_2$ = 4, $x_3$ = 8, n = 3
GM = $\sqrt[n]{x_1 \times x_2 \times x_3}$
GM = $\sqrt[3]{2 \times 4 \times 8}$
GM = $\sqrt[3]{64}$
GM = 4
b. Find the GM of the data 2, 4, 8 using logarithms.
Data: $x_1$ = 2, $x_2$ = 4, $x_3$ = 8, N = 3
x    log x
2    0.301
4    0.602
8    0.903
     $\sum \log x$ = 1.806
GM = Antilog $\left[\frac{\sum \log x}{N}\right]$
GM = Antilog $\left[\frac{1.806}{3}\right]$
GM = Antilog (0.6020)
GM ≈ 4
Ex - 2:
Compared with the previous year, the overhead (OH) expenses went up by 32% in the year 2003,
then increased by 40% in the next year and by 50% in the following year. Calculate the average increase in
overhead expenses.
Let the OH expenses of the preceding year be 100%.
Year    OH Expenses (x)    log x
2002    Base year
2003    132                2.121
2004    140                2.146
2005    150                2.176
                           $\sum \log x$ = 6.443
GM = Antilog $\left[\frac{\sum \log x}{N}\right]$
GM = Antilog $\left[\frac{6.443}{3}\right]$ = Antilog (2.1477)
GM ≈ 140.5, i.e. the average increase in overhead expenses is about 40.5% per year.
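The logarithmic method for the geometric mean can be sketched as follows. This is only an illustration: it uses natural logarithms instead of the common logarithms of the tables, which gives the same result apart from rounding, and the function name geometric_mean is an assumption made for this example.

```python
import math


def geometric_mean(values):
    """GM via logarithms: antilog of the mean of the logs (natural logs here)."""
    if any(v <= 0 for v in values):
        raise ValueError("GM is not defined for zero or negative values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


gm_small = geometric_mean([2, 4, 8])            # Ex - 1
gm_overhead = geometric_mean([132, 140, 150])   # Ex - 2, expenses as % of previous year
print(round(gm_small, 4), round(gm_overhead, 2))
```

Because no intermediate rounding of logarithms takes place, the overhead figure comes out slightly different in the second decimal from a value worked with three-figure log tables.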
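GM for discrete series
With the usual notations, the GM for a series of individual observations is given by
GM = Antilog $\left[\frac{\sum_{i=1}^{n} \log x_i}{N}\right]$
and, when each value x occurs with frequency f, by
GM = Antilog $\left[\frac{\sum f \log x}{N}\right]$, where N = $\sum f$.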
Ex - 3:
Consider the following time series of monthly sales of ABC company for 4 months. Find the average
rate of change in monthly sales.
Month    Sales
I        10000
II       8000
III      12000
IV       15000
Let the sales of the preceding month = 100%.
Solution:
Month    Sales (Rs)    Increase / decrease over preceding month    Conversion (x)    log x
I        10000         (base)
II       8000          - 20%                                       80                1.903
III      12000         + 50%                                       150               2.176
IV       15000         + 25%                                       125               2.097
                                                                                     $\sum \log x$ = 6.176
GM = Antilog $\left[\frac{6.176}{3}\right]$ = Antilog (2.0587)
≈ 114.46
Average rate of change in sales = 114.46 - 100 = 14.46% per month
Ex - 4: Find the GM for the following data.
Marks (x)    No. of students (f)    log x    f log x
130          3                      2.113    6.339
135          4                      2.130    8.520
140          6                      2.146    12.876
145          6                      2.161    12.966
150          3                      2.176    6.528
             $\sum f$ = N = 22                   $\sum f \log x$ = 47.23
GM = Antilog $\left[\frac{\sum f \log x}{N}\right]$
GM = Antilog $\left[\frac{47.23}{22}\right]$ = Antilog (2.1468)
GM ≈ 140.2
Geometric Mean for continuous series
Steps:
1. Find the mid value m of each class and take the log of each mid value.
2. Multiply log m by the frequency f of each class to get f log m and sum up to obtain $\sum f \log m$.
3. Divide $\sum f \log m$ by N and take the antilog to get the GM.
Ex: Find the GM for the data given below.
Yield of wheat in MT    No. of farms (f)    Mid value m    log m    f log m
1 - 10                  3                   5.5            0.740    2.220
11 - 20                 16                  15.5           1.190    19.040
21 - 30                 26                  25.5           1.406    36.556
31 - 40                 31                  35.5           1.550    48.050
41 - 50                 16                  45.5           1.658    26.528
51 - 60                 8                   55.5           1.744    13.952
                        $\sum f$ = N = 100                              $\sum f \log m$ = 146.346
GM = Antilog $\left[\frac{\sum f \log m}{N}\right]$
GM = Antilog $\left[\frac{146.346}{100}\right]$
GM = 29.07
Harmonic Mean
It is the total number of items divided by the sum of the reciprocals of the values of the variable. It
is a specialised average which solves problems involving variables expressed as 'time rates', that is, rates
that vary according to time.
Ex: Speed in km/hr, min/day, price/unit.
Harmonic Mean (HM) is suitable only when the time factor is variable and the act being performed remains
constant.
HM = $\frac{N}{\sum \frac{1}{x}}$
Merits of Harmonic Mean
1. It is based on all observations.
2. It is rigidly defined.
3. It is suitable in case of series having wide dispersion.
4. It is suitable for further mathematical treatment.
Demerits of Harmonic Mean
1. It is not easy to compute.
2. It cannot be used when one of the items is zero.
3. It cannot represent distribution.
Ex:
1. The daily incomes of 5 families in a rural village are given below. Compute the HM.
Family    Income (x)    Reciprocal (1/x)
1         85            0.01176
2         90            0.01111
3         70            0.01429
4         50            0.02000
5         60            0.01667
                        $\sum \frac{1}{x}$ = 0.07383
HM = $\frac{N}{\sum \frac{1}{x}} = \frac{5}{0.07383}$
HM = 67.72
2. A man travels by car for 3 days and covers 480 km each day. On the first day he drives for 10 hrs at
the rate of 48 KMPH, on the second day for 12 hrs at the rate of 40 KMPH, and on the 3rd day for 15
hrs at 32 KMPH. Compute the HM and the weighted mean and compare them.
Data:
10 hrs @ 48 KMPH
12 hrs @ 40 KMPH
15 hrs @ 32 KMPH
Harmonic Mean
x     1/x
48    0.0208
40    0.0250
32    0.0313
      $\sum \frac{1}{x}$ = 0.0771
HM = $\frac{N}{\sum \frac{1}{x}} = \frac{3}{0.0771}$
HM = 38.92
Weighted Mean (weights = hours driven)
w     x     wx
10    48    480
12    40    480
15    32    480
$\sum w$ = 37     $\sum wx$ = 1440
Weighted Mean $\bar{x}_w = \frac{\sum wx}{\sum w} = \frac{1440}{37}$
$\bar{x}_w$ = 38.92
The HM and the weighted mean are the same.
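The agreement between the harmonic mean and the time-weighted mean in the travel example can be verified with the sketch below. The function name harmonic_mean is an illustrative assumption; the data are those of the two examples above.

```python
def harmonic_mean(values):
    """HM: number of items divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)


# Example 1: daily incomes of 5 families.
hm_income = harmonic_mean([85, 90, 70, 50, 60])

# Example 2: equal distances (480 km a day) covered at different speeds.
speeds = [48, 40, 32]
hours = [10, 12, 15]
hm_speed = harmonic_mean(speeds)
# Mean speed weighted by hours driven = total distance / total time.
weighted_speed = sum(w * x for w, x in zip(hours, speeds)) / sum(hours)

print(round(hm_income, 2), round(hm_speed, 2), round(weighted_speed, 2))
```

The two speed figures agree because the distance is the same on each day, so the harmonic mean of the speeds equals total distance divided by total time.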
3. Find the HM for the following data.
Class (CI)    Frequency (f)    Mid-point (m)    Reciprocal (1/m)    f (1/m)
0 - 10        5                5                0.2000              1.0000
10 - 20       15               15               0.0667              1.0000
20 - 30       25               25               0.0400              1.0000
30 - 40       8                35               0.0286              0.2286
40 - 50       7                45               0.0222              0.1556
              $\sum f$ = 60                                             $\sum f \left(\frac{1}{m}\right)$ = 3.3841
HM = $\frac{N}{\sum f \left(\frac{1}{m}\right)} = \frac{60}{3.3841}$
HM = 17.73
Relationship between Mean, Geometric Mean and Harmonic Mean.
1. If all the items in a variable are the same, the arithmetic mean, harmonic mean and geometric mean
are equal, i.e., $\bar{x} = GM = HM$.
2. If the sizes vary, the mean will be greater than the GM and the GM will be greater than the HM. This is
because of the property of the geometric mean to give larger weight to smaller items and of the HM to give
the largest weight to the smallest items.
Hence, $\bar{x} > GM > HM$.
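The relationship between the three averages can be checked numerically. The sketch below relies on the mean, geometric_mean and harmonic_mean functions of Python's standard statistics module (geometric_mean requires Python 3.8 or later); the helper name three_means and the sample values are assumptions chosen for illustration.

```python
import math
import statistics


def three_means(values):
    """Return (arithmetic, geometric, harmonic) means of positive values."""
    return (
        statistics.mean(values),
        statistics.geometric_mean(values),
        statistics.harmonic_mean(values),
    )


am, gm, hm = three_means([2, 4, 8])   # unequal items: AM > GM > HM
same = three_means([5, 5, 5])         # equal items: all three coincide
print(am, gm, hm)
print(same)
```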
Median
Median is the value of that item in a series which divides the array into two equal parts, one
consisting of all the values less than it and the other consisting of all the values more than it. Median is a
positional average. The number of items below it is equal to the number of items above it. It occupies the
central position.
Thus, the median is defined as the mid value of the variates. If the values are arranged in ascending
or descending order of magnitude, the median is the middle value if the number of variates is odd, and the
average of the two middle values if the number of variates is even.
Ex: If 9 students stand in order of their heights, the 5th student from either side is the one
whose height is the median height of the group. Thus, the median of a group is given by the
equation
Median = value of the $\left[\frac{N+1}{2}\right]^{th}$ item
Ex
1. Find the median for the following data.
22 20 25 31 26 24 23
Arrange the given data in array form (either in ascending or descending order).
20 22 23 24 25 26 31
Median is given by the $\left[\frac{N+1}{2}\right]^{th}$ item = $\left[\frac{7+1}{2}\right] = \frac{8}{2}$ = 4th item.
Median = value of the 4th item = 24
2. Find the median for the following data.
20 21 22 24 28 32
Median is given by the $\left[\frac{N+1}{2}\right]^{th}$ item = $\left[\frac{6+1}{2}\right]$ = 3.5th item.
The item lies between the 3rd and the 4th items, so there are two values, 22 and 24.
The median value is the mean of these two values.
Median = $\left[\frac{22 + 24}{2}\right]$ = 23
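The rule for odd and even numbers of items can be written as a small function. This is a minimal sketch using the two examples above; the function name median is an illustrative assumption (Python's standard statistics module offers an equivalent statistics.median).

```python
def median(values):
    """Median of individual observations: middle item of the ordered array,
    or the mean of the two middle items when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1) / 2)th item
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the two middle items


odd_case = median([22, 20, 25, 31, 26, 24, 23])
even_case = median([20, 21, 22, 24, 28, 32])
print(odd_case, even_case)
```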
Discrete Series Median
In a discrete series, the values are (already) in the form of an array and the frequencies are recorded
against each value. However, to determine the size of the median $\left[\frac{N+1}{2}\right]^{th}$ item, a separate column is to be
prepared for cumulative frequencies. The median size is first located with reference to the cumulative
frequency which covers the size first. Then, the value against that cumulative frequency is located
as the median.
77
Ex: Find the median of the marks obtained by students in statistics.
Marks (x)    No. of students (f)    Cumulative frequency
10           5                      5
20           5                      10
30           3                      13
40           15                     28
50           30                     58
60           10                     68
N = 68
Median = size of the $\left(\frac{N+1}{2}\right)$th item = $\frac{68+1}{2}$ = 34.5th item.
The cumulative frequency just above 34.5 is 58. Against c.f. 58 the value is 50, which is the median value.
Median = 50
Ex: In a class of 15 students, 5 students failed a test. The marks of the 10 students who passed
were 9, 6, 7, 8, 9, 6, 5, 4, 7, 8. Find the median marks of the 15 students.
Marks             No. of students (f)    cf
0 to 3 (failed)   5                      5
4                 1                      6
5                 1                      7
6                 2                      9
7                 2                      11
8                 2                      13
9                 2                      15
$\sum f$ = 15
Median = size of the $\left(\frac{N+1}{2}\right)$th item
Me = $\frac{15+1}{2}$ = 8th item
The 8th item is covered by cf 9. The marks against cf 9 are 6 and hence
Median = 6
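The cumulative-frequency lookup used for a discrete series can be sketched as follows. This is a minimal sketch, assuming the values are already listed in ascending order; the function name `discrete_median` is hypothetical, and the failed group of the second example is represented by the placeholder value 0.

```python
from itertools import accumulate


def discrete_median(values, freqs):
    """Median of a discrete series located through the cumulative frequencies.

    Assumes `values` are in ascending order with matching `freqs`.
    """
    n = sum(freqs)
    position = (n + 1) / 2                       # size of the median item
    for value, cf in zip(values, accumulate(freqs)):
        if cf >= position:                       # first c.f. that covers the position
            return value


# Marks example: N = 68, median item = 34.5th
print(discrete_median([10, 20, 30, 40, 50, 60], [5, 5, 3, 15, 30, 10]))
# Class of 15 students: the 5 failed students are grouped under the placeholder 0
print(discrete_median([0, 4, 5, 6, 7, 8, 9], [5, 1, 1, 2, 2, 2, 2]))
```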
Continuous Series
The procedure for finding the median is different in a continuous series. The class intervals are already
in the form of an array and the frequencies are recorded against each class interval. To determine the
size, we take the $\left(\frac{N}{2}\right)$th item, and the median class is located with reference to the cumulative
frequency that first covers this size. Once the median class is located, the median value is interpolated
using the formula given below.
Median = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
where $l = \frac{l_0 + l_1}{2}$, $l_0$ is the left end point of the N/2 (median) class and $l_1$ is the right end point of
the previous class,
h = class width, f = frequency of the median class,
C = cumulative frequency of the class preceding the median class.
Ex: Find the median for the following data. The class marks obtained by 50 students are as follows.
CI       Frequency (f)   Cum. frequency (cf)
10-15    6               6
15-20    18              24
20-25    9               33   (N/2 class)
25-30    10              43
30-35    4               47
35-40    3               50
$\sum f$ = N = 50
$\frac{N}{2} = \frac{50}{2} = 25$
The cumulative frequency just above 25 is 33 and hence 20-25 is the median class.
$l = \frac{l_0 + l_1}{2} = \frac{20 + 20}{2} = 20$
h = 25 - 20 = 5
f = 9
C = 24
Median = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
Median = $20 + \frac{5}{9}[25 - 24] = 20 + \frac{5}{9}$
Median = 20.56
Ex: Find the median for following data.
Mid values (m) 115 125 135 145 155 165 175 185 195
Frequencies (f) 6 25 48 72 116 60 38 22 3
The interval between successive mid-values and the magnitude of the class intervals are the same, i.e. 10.
So half of 10 is deducted from and added to each mid-value to give the lower and upper limits. Thus, the
classes are:
115 - 5 = 110 (lower limit)
115 + 5 = 120 (upper limit); similarly, the CI for every mid-value can be obtained.
CI         Frequency (f)   Cum. frequency (cf)
110-120    6               6
120-130    25              31
130-140    48              79
140-150    72              151
150-160    116             267   (N/2 class)
160-170    60              327
170-180    38              365
180-190    22              387
190-200    3               390
$\sum f$ = N = 390
$\frac{N}{2} = \frac{390}{2} = 195$
The cumulative frequency just above 195 is 267.
Median class = 150-160
$l = \frac{150 + 150}{2} = 150$
f = 116
N/2 = 195
C = 151
h = 10
Median = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
Median = $150 + \frac{10}{116}[195 - 151]$
Median = 153.8
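The interpolation for a continuous series can be checked with a short Python sketch. It assumes continuous classes of equal width given by their lower limits in ascending order; the function name `grouped_median` and its parameters are illustrative only.

```python
from itertools import accumulate


def grouped_median(lower_limits, width, freqs):
    """Median = l + (h / f) * (N / 2 - C) for a continuous series.

    Assumes equal class width and ascending, contiguous classes.
    """
    n = sum(freqs)
    half = n / 2
    cum = list(accumulate(freqs))
    for i, cf in enumerate(cum):
        if cf >= half:                            # median class: first c.f. covering N / 2
            c = cum[i - 1] if i > 0 else 0        # c.f. of the preceding class
            return lower_limits[i] + (width / freqs[i]) * (half - c)


# Marks of 50 students (classes 10-15, ..., 35-40)
print(round(grouped_median([10, 15, 20, 25, 30, 35], 5, [6, 18, 9, 10, 4, 3]), 2))
# Mid-value example (classes 110-120, ..., 190-200)
print(round(grouped_median(list(range(110, 200, 10)), 10,
                           [6, 25, 48, 72, 116, 60, 38, 22, 3]), 1))
```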
Merits of Median
a. It is simple, easy to compute and understand.
b. Its value is not affected by extreme variables.
c. It is capable for further algebraic treatment.
d. It can be determined by inspection for arrayed data.
e. It can be found graphically also.
f. It indicates the value of middle item.
Demerits of Median
a. It may not be a representative value, as it ignores extreme values.
b. It cannot be determined precisely when its size falls between two values.
c. It is not useful in cases where large weights are to be given to extreme values.
Session 6
Measures of Central Tendency
Mode
It is the value which occurs with the maximum frequency. It is the most typical or common value,
the one that receives the highest frequency. It represents fashion and is often used in business. Thus, it
corresponds to the value of the variable which occurs most frequently. The modal class of a frequency
distribution is the class with the highest frequency. It is denoted by z.
Mode is the value of the variable which is repeated the greatest number of times in the series. It is the
usual, and not casual, size of item in the series. It lies at the position of greatest density.
Ex: If we say the modal marks obtained by students in a class test is 42, it means that the largest number of
students have secured 42 marks.
If each observation occurs the same number of times, we say that there is no mode. If two
observations share the highest frequency, the series is bi-modal. If three or more
observations share the highest frequency, it is a multi-modal case. When a single
observation occurs the greatest number of times, it is a uni-modal case.
For grouped data, the mode can be computed by the following equation with the usual notation.
Mode = $l + \frac{h\,(f_m - f_1)}{2f_m - f_1 - f_2}$
where,
l = lower boundary of the modal class
$f_m$ = maximum frequency (modal class frequency)
$f_1$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class succeeding the modal class
h = class width.
or
Mode = $l + \frac{h f_2}{f_1 + f_2}$
Ex:
1. Find the mode for the following data.
Marks (CI)   No. of students (f)
1-10         3
11-20        16
21-30        26
31-40        31   (max. frequency)
41-50        16
51-60        8
$\sum f$ = N = 100
We identify the modal class as the class of maximum frequency, i.e. 31-40.
where,
$f_m$ = 31
$f_1$ = 26
$f_2$ = 16
h = 10
$l = \frac{30 + 31}{2} = 30.5$
Mode (z) = $l + \frac{h\,(f_m - f_1)}{2f_m - f_1 - f_2}$
Mode = $30.5 + \frac{10\,(31 - 26)}{2 \times 31 - 26 - 16}$
Mode = 33.
Or
Mode = $l + \frac{h f_2}{f_1 + f_2} = 30.5 + \frac{10 \times 16}{26 + 16}$
Mode = 34.31
It can be noted that the second method gives a slightly different mode value.
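Both grouped-data mode formulas can be compared with a small Python sketch. The function names `grouped_mode` and `grouped_mode_alt` are hypothetical; the arguments follow the notation of the formulas above, with `l` taken as the lower boundary of the modal class.

```python
def grouped_mode(l, h, f_m, f1, f2):
    """Mode = l + h (f_m - f1) / (2 f_m - f1 - f2)."""
    return l + h * (f_m - f1) / (2 * f_m - f1 - f2)


def grouped_mode_alt(l, h, f1, f2):
    """Alternative formula: Mode = l + h f2 / (f1 + f2)."""
    return l + h * f2 / (f1 + f2)


# Modal class 31-40: l = 30.5, h = 10, f_m = 31, f1 = 26, f2 = 16
print(grouped_mode(30.5, 10, 31, 26, 16))
print(round(grouped_mode_alt(30.5, 10, 26, 16), 2))
```

The second formula ignores the modal frequency itself, which is why the two results differ.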
Partition values
The median divides a series into two equal parts. There are other values that also divide a series into
equal parts; these are called partition values (PV).
Just as one point divides a series into two equal parts (halves), 3 points divide it into four parts
(quartiles), 9 points divide it into 10 parts (deciles) and 99 points divide it into 100 parts (percentiles). The
partition values are useful for knowing the exact composition of a series.
Quartiles
A measure which divides an array into four equal parts is known as a quartile. Each portion
contains an equal number of items. The first, second and third points are termed the first quartile ($Q_1$),
second quartile ($Q_2$) and third quartile ($Q_3$). The first quartile is also known as the lower quartile, as 25% of
the observations of the distribution lie below it and 75% above it. The third quartile is also known as the
upper quartile, as 75% of the observations of the distribution lie below it and 25% above it.
Calculation of quartiles
$Q_1$ = size of the $\left(\frac{N+1}{4}\right)$th item
$Q_3$ = size of the $\left(\frac{3(N+1)}{4}\right)$th item
$Q_2$ = (median) = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
Measures of quartiles
The quartile values are located on a principle similar to that used for locating the median value.
The following table shows the procedure for locating quartiles.
Measure   Individual and discrete series             Continuous series
$Q_1$     $\left(\frac{N+1}{4}\right)$th item        $\left(\frac{N}{4}\right)$th item
$Q_2$     $\left(\frac{2(N+1)}{4}\right)$th item     $\left(\frac{2N}{4}\right)$th item
$Q_3$     $\left(\frac{3(N+1)}{4}\right)$th item     $\left(\frac{3N}{4}\right)$th item
Ex - 1: From the following marks find $Q_1$, the median and $Q_3$.
23, 48, 34, 68, 15, 36, 24, 54, 65, 75, 92, 10, 70, 61, 20, 47, 83, 19, 77
Let us arrange the data in array form.
Sl. No.   x
1.        10
2.        15
3.        19
4.        20
5.        23   ($Q_1$)
6.        24
7.        34
8.        36
9.        47
10.       48   ($Q_2$)
11.       54
12.       61
13.       65
14.       68
15.       70   ($Q_3$)
16.       75
17.       77
18.       83
19.       92
Here, n = 19 items.
a. $Q_1$ = $\frac{1}{4}(n+1)$th item = $\frac{1}{4}(19+1) = \frac{1}{4} \times 20$ = 5th item
$Q_1$ = 23
b. $Q_2$ = $\frac{2}{4}(n+1)$th item = $\frac{2}{4} \times 20$ = 10th item
$Q_2$ = 48
c. $Q_3$ = $\frac{3}{4}(n+1)$th item = $\frac{3}{4} \times 20$ = 15th item
$Q_3$ = 70
Ex - 2: Locate the median and quartiles from the following data.
Size of shoes   4    4.5   5    5.5   6    6.5   7    7.5   8
Frequencies     20   36    44   50    80   30    30   16    14
X      f     cf
4      20    20
4.5    36    56
5      44    100   ($Q_1$)
5.5    50    150
6      80    230   ($Q_2$)
6.5    30    260   ($Q_3$)
7      30    290
7.5    16    306
8      14    320
N = $\sum f$ = 320
$Q_1$ = $\frac{1}{4}(n+1)$th item = $\frac{1}{4} \times 321$ = 80.25th item
Just above 80.25, the cf is 100. Against cf 100, the value is 5.
$Q_1$ = 5
$Q_2$ = $\frac{1}{2}(n+1)$th item = $\frac{1}{2} \times 321$ = 160.5th item
Just above 160.5, the cf is 230. Against cf 230, the value is 6.
$Q_2$ = 6
$Q_3$ = $\frac{3}{4}(n+1)$th item = $\frac{3}{4} \times 321$ = 240.75th item
Just above 240.75, the cf is 260. Against cf 260, the value is 6.5.
$Q_3$ = 6.5
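Ex - 3: Compute the quartiles from the following data.
CI              0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Frequency (f)   5      8       7       12      28      20      10      10
First quartile ($Q_1$) = $l + \frac{h}{f}\left[\frac{1}{4}N - C\right]$, $Q_3$ = $l + \frac{h}{f}\left[\frac{3}{4}N - C\right]$
and ($Q_2$) = Median = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
CI      f    cf
0-10    5    5
10-20   8    13
20-30   7    20
30-40   12   32    ($Q_1$)
40-50   28   60    ($Q_2$)
50-60   20   80    ($Q_3$)
60-70   10   90
70-80   10   100
N = $\sum f$ = 100
a. First locate $Q_1$, which corresponds to $\frac{1}{4}N$ = 25. The cf just above 25 is 32, so the $Q_1$ class is 30-40.
$l = \frac{30 + 30}{2} = 30$
h = 10
f = 12
C = 20
$Q_1$ = $l + \frac{h}{f}\left[\frac{1}{4}N - C\right]$
$Q_1$ = $30 + \frac{10}{12}[25 - 20]$
$Q_1$ = 34.17
b. Locate $Q_2$ (Median)
$Q_2$ corresponds to N/2 = 50. The cf just above 50 is 60, so the $Q_2$ class is 40-50 and $l = \frac{40 + 40}{2} = 40$, f = 28, C = 32.
$Q_2$ = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
$Q_2$ = $40 + \frac{10}{28}[50 - 32]$
$Q_2$ = 46.43
c. Locate $Q_3$
$Q_3$ corresponds to $\frac{3}{4}N$ = 75. The cf just above 75 is 80, so the $Q_3$ class is 50-60 and $l = \frac{50 + 50}{2} = 50$, f = 20, C = 60.
$Q_3$ = $l + \frac{h}{f}\left[\frac{3}{4}N - C\right]$
$Q_3$ = $50 + \frac{10}{20}[75 - 60]$
$Q_3$ = 57.5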
Deciles
The deciles divide the arrayed set of variates into ten portions of equal frequency and they are
sometimes used to characterize the data for some specific purpose. In this process, we get nine decile
values. The fifth decile is nothing but the median value. We can calculate the other deciles by following the
procedure used in computing the quartiles.
Formula to compute deciles:
$D_1 = l + \frac{h}{f}\left(\frac{1}{10}N - C\right)$, $D_2 = l + \frac{h}{f}\left(\frac{2}{10}N - C\right)$, and so on.
Percentiles
Percentile values divide the distribution into 100 parts of equal frequency. In this process, we get
ninety-nine percentile values. The 25th, 50th and 75th percentiles are nothing but the first quartile, the
median and the third quartile respectively.
Formula to compute percentiles is given below:
$P_{25} = l + \frac{h}{f}\left(\frac{25}{100}N - C\right)$
$P_{26} = l + \frac{h}{f}\left(\frac{26}{100}N - C\right)$
and so on.
Ex:
Find the 7th decile and the 60th percentile for the given data of patients who visited a hospital on a particular day.
CI      f    cf
10-20   1    1
20-30   3    4
30-40   11   15
40-50   21   36
50-60   43   79    ($P_{60}$)
60-70   32   111   ($D_7$)
70-80   9    120
$\sum f$ = N = 120
a. $D_7 = l + \frac{h}{f}\left(\frac{7}{10}N - C\right)$
$\frac{7}{10}N$ = 84; the cf just above 84 is 111, so the $D_7$ class is 60-70.
$l = \frac{60 + 60}{2} = 60$
h = 10, f = 32
C = 79
$D_7 = 60 + \frac{10}{32}(84 - 79)$
7th decile = $D_7$ = 61.56
b. 60th percentile
$P_{60} = l + \frac{h}{f}\left(\frac{60}{100}N - C\right)$
$\frac{60}{100}N$ = 72; the cf just above 72 is 79, so the $P_{60}$ class is 50-60.
$l = \frac{50 + 50}{2} = 50$
h = 10
f = 43
C = 36
$P_{60} = 50 + \frac{10}{43}(72 - 36)$
$P_{60}$ = 58.37
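Quartiles, deciles and percentiles of a continuous series all use the same interpolation, differing only in the fraction of N that is located. A minimal Python sketch under that reading follows; the function name `partition_value` and its `k`/`parts` parameters are assumptions made here for illustration, and equal class widths are assumed.

```python
from itertools import accumulate


def partition_value(lower_limits, width, freqs, k, parts):
    """k-th partition value out of `parts`: l + (h / f) * (k N / parts - C).

    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    """
    n = sum(freqs)
    target = k * n / parts
    cum = list(accumulate(freqs))
    for i, cf in enumerate(cum):
        if cf >= target:                          # class whose c.f. first covers the target
            c = cum[i - 1] if i > 0 else 0
            return lower_limits[i] + (width / freqs[i]) * (target - c)


lowers = [10, 20, 30, 40, 50, 60, 70]             # hospital data, classes 10-20 ... 70-80
freqs = [1, 3, 11, 21, 43, 32, 9]
print(round(partition_value(lowers, 10, freqs, 7, 10), 2))     # 7th decile
print(round(partition_value(lowers, 10, freqs, 60, 100), 2))   # 60th percentile
```

Passing `k=2, parts=4` reproduces the grouped median, since the second quartile is the median.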
SOME NUMERICAL EXAMPLES
1. Show that the following distribution is symmetrical about the average. Also show that the median is
mid-way between the lower and upper quartiles.
X           2   3   4    5    6    7    8    9   10
Frequency   2   9   29   57   80   57   29   9   2
To show that the given distribution is symmetrical, the mean, median and mode must be the same.
To show that the median is mid-way between the lower and upper quartiles, we need $Q_2 - Q_1 = Q_3 - Q_2$.
Mid-point x   Class interval CI   f    d = (x - 6)   fd    cf
2             1.5-2.5             2    -4            -8    2
3             2.5-3.5             9    -3            -27   11
4             3.5-4.5             29   -2            -58   40
5             4.5-5.5             57   -1            -57   97    ($Q_1$ class)
6             5.5-6.5             80   0             0     177   ($Q_2$ class)
7             6.5-7.5             57   1             57    234   ($Q_3$ class)
8             7.5-8.5             29   2             58    263
9             8.5-9.5             9    3             27    272
10            9.5-10.5            2    4             8     274
N = 274, $\sum fd$ = 0
Let A = 6
Mean = $A + \frac{h \sum fd}{N} = 6 + \frac{1 \times 0}{274}$
Mean = 6.
Median
$Q_2 = l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
$\frac{N}{2} = \frac{274}{2} = 137$
C = 97
$Q_2 = 5.5 + \frac{1}{80}[137 - 97]$
$Q_2$ = 5.5 + 0.5
Median = $Q_2$ = 6.
Mode
Mode = $l + \frac{h\,(f_m - f_1)}{2f_m - f_1 - f_2}$
Modal class: 5.5-6.5
Mode = $5.5 + \frac{1\,(80 - 57)}{2 \times 80 - 57 - 57}$
Mode = 6.
Since Mean = Median = Mode, the given distribution is symmetrical.
$Q_1$ and $Q_3$ calculation
$Q_1 = l + \frac{h}{f}\left[\frac{N}{4} - C\right] = 4.5 + \frac{1}{57}[68.5 - 40]$
$Q_1$ = 5.
$Q_3 = l + \frac{h}{f}\left[\frac{3N}{4} - C\right] = 6.5 + \frac{1}{57}[205.5 - 177]$
$Q_3$ = 7.
Now, $Q_2 - Q_1 = Q_3 - Q_2$
i.e. 6 - 5 = 7 - 6
1 = 1
2. Find the mean for the set of observations given below.
6, 7, 5, 4, 8
$\bar{x} = \frac{\sum x_i}{N} = \frac{6 + 7 + 5 + 4 + 8}{5} = \frac{30}{5} = 6$
3. Find the mean for the following data.
CI      f    $x_i$   fx
1-10    3    5.5     16.5
11-20   16   15.5    248
21-30   26   25.5    663
31-40   31   35.5    1100.5
41-50   16   45.5    728
51-60   8    55.5    444
N = $\sum f$ = 100, $\sum fx$ = 3200
$\bar{x} = \frac{\sum fx}{N} = \frac{3200}{100}$
$\bar{x}$ = 32
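4. Find the mean profit of the organisation for the data given below:
Profit CI   f    $x_i$   fx
100-200     10   150     1500
200-300     18   250     4500
300-400     20   350     7000
400-500     26   450     11700
500-600     30   550     16500
600-700     28   650     18200
700-800     18   750     13500
N = $\sum f$ = 150, $\sum fx$ = 72900
The mid-point of the first class is $x_1 = \frac{100 + 200}{2} = \frac{300}{2} = 150$; the other mid-points are obtained in the same way.
$\bar{x} = \frac{\sum fx}{N} = \frac{72900}{150}$
$\bar{x}$ = 486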
Step Deviation Method
$x = a + hd$, so $d = \frac{x - a}{h}$ and
$\bar{x} = a + h\,\frac{\sum fd}{N}$
a = arbitrary constant
h = class width
Profit CI   f    $x_i$   d    fd
100-200     10   150     -3   -30
200-300     18   250     -2   -36
300-400     20   350     -1   -20
400-500     26   450     0    0
500-600     30   550     +1   30
600-700     28   650     +2   56
700-800     18   750     +3   54
N = $\sum f$ = 150, $\sum fd$ = 54
$\bar{x} = a + h\,\frac{\sum fd}{N} = 450 + 100 \left(\frac{54}{150}\right)$
$\bar{x}$ = 486
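The step deviation method can be checked against the direct method with a short Python sketch. The function names are hypothetical; `a` is the arbitrary origin and `h` the class width, as in the formula above.

```python
def direct_mean(midpoints, freqs):
    """Direct method: mean = sum(f x) / N."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)


def step_deviation_mean(midpoints, freqs, a, h):
    """Step deviation method: mean = a + h * sum(f d) / N with d = (x - a) / h."""
    n = sum(freqs)
    sum_fd = sum(f * (x - a) / h for x, f in zip(midpoints, freqs))
    return a + h * sum_fd / n


mid = [150, 250, 350, 450, 550, 650, 750]        # profit class mid-points
f = [10, 18, 20, 26, 30, 28, 18]
print(direct_mean(mid, f))
print(step_deviation_mean(mid, f, a=450, h=100))
```

Both methods give the same mean for any choice of `a`; the step deviation method only shrinks the numbers handled by hand.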
5. In an office there are 84 employees and their salaries are given below.
Salary      2430   2590   2870   3390   4720   5160
Employees   4      28     31     16     3      2
1. Find the mean salary of the employees.
2. What is the total salary of the employees?
$\bar{x} = \frac{\sum fx}{N} = \frac{2430 \times 4 + 2590 \times 28 + 2870 \times 31 + 3390 \times 16 + 4720 \times 3 + 5160 \times 2}{84}$
$\bar{x} = \frac{249930}{84}$ = Rs. 2975.36
1. $\bar{x}$ = Rs. 2975.36
2. Total salary = Rs. 2,49,930
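6. The average marks secured by 36 students was 52, but it was discovered that one item, 64, was misread
as 46. Find the correct mean of the marks.
$\bar{x} = \frac{\sum x}{N}$
$52 = \frac{\sum x}{36}$
$\sum x$ = 52 x 36 = 1872
Correct $\sum x$ = $\sum x$ - incorrect + correct
Correct $\sum x$ = 1872 - 46 + 64 = 1890
Correct $\bar{x} = \frac{\text{correct } \sum x}{N} = \frac{1890}{36}$
$\bar{x}$ = 52.5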
7. The mean of 100 items is 46. Later it was discovered that an item 16 was misread as 61 and another
item 43 was misread as 34, and also that the total number of items is 90, not 100. Find the correct
mean value.
$\bar{x} = \frac{\sum x}{N}$
$46 = \frac{\sum x}{100}$
$\sum x$ = 4600
Correct $\sum x$ = $\sum x$ - incorrect + correct
= 4600 - 61 - 34 + 16 + 43
= 4564
Correct $\bar{x} = \frac{\text{correct } \sum x}{N} = \frac{4564}{90}$
= 50.71
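The correction procedure of the last two problems (rebuild the total, swap the misread items, divide by the true count) can be sketched as a small Python helper. The function name `corrected_mean` and its parameter names are illustrative assumptions.

```python
def corrected_mean(mean, n, wrong, right, true_n=None):
    """Correct a mean after misread items (and, optionally, a miscounted N).

    wrong  -- values as they were wrongly recorded
    right  -- the corresponding correct values
    true_n -- correct number of items, if it differs from n
    """
    total = mean * n - sum(wrong) + sum(right)    # corrected sum of items
    return total / (true_n if true_n is not None else n)


print(corrected_mean(52, 36, wrong=[46], right=[64]))                             # problem 6
print(round(corrected_mean(46, 100, wrong=[61, 34], right=[16, 43], true_n=90), 2))  # problem 7
```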
8. Calculate the mean for the following data.
Value   Frequency
< 10    4
< 20    10
< 30    15
< 40    25
< 50    30
CI      f    m (mid-point)   fm
0-10    4    5               20
10-20   10   15              150
20-30   15   25              375
30-40   25   35              875
40-50   30   45              1350
$\sum f$ = 84, $\sum fm$ = 2770
$\bar{x} = \frac{\sum fm}{N} = \frac{2770}{84}$
$\bar{x}$ = 32.97
9. For the given frequency table, find the missing frequencies. The average number of accidents is 1.46 and N = 200.
No. of accidents   Frequency
0                  46
1                  ?
2                  ?
3                  25
4                  10
5                  5
No. of accidents (x)   Frequency (f)   fx
0                      46              0
1                      $f_1$           $f_1$
2                      $f_2$           $2f_2$
3                      25              75
4                      10              40
5                      5               25
N = 200, $\sum fx = 140 + f_1 + 2f_2$
$1.46 = \frac{140 + f_1 + 2f_2}{200}$
$292 = 140 + f_1 + 2f_2$
$f_1 + 2f_2 = 152$ ----(1)
w.k.t. N = $\sum f$
$200 = 86 + f_1 + f_2$
$f_1 + f_2 = 114$ ----(2)
Subtracting (2) from (1):
$f_2 = 38$
Substituting in (2):
$f_1 = 114 - 38$
$f_1 = 76$
10. Find the missing value of the variate for the following data, given that the mean is 31.87.
$x_i$   f
12      8
20      16
27      48
33      90
?       30
54      8
N = 200
$x_i$   f    fx
12      8    96
20      16   320
27      48   1296
33      90   2970
x       30   30x
54      8    432
N = 200, $\sum fx$ = 5114 + 30x
$\bar{x}$ = 31.87
$\bar{x} = \frac{\sum fx}{N}$
$31.87 = \frac{\sum fx}{200}$
$\sum fx$ = 6374 ----(1)
$\sum fx$ = 5114 + 30x ----(2)
(1) = (2)
6374 = 5114 + 30x
6374 - 5114 = 30x
30x = 1260
x = 42.
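11. The average rainfall of a city from Monday to Saturday is 0.3 inches. Due to heavy rainfall on Sunday,
the average rainfall for the week increased to 0.5 inches. What is the rainfall on Sunday?
Given: average for Monday to Saturday = 0.3
average for Monday to Sunday = 0.5
$\bar{x}_1 = \frac{\sum x_1}{N}$, i.e. $0.3 = \frac{\sum x_1}{6}$
$\sum x_1$ = 1.8
$\bar{x}_2 = \frac{\sum x_2}{N}$, i.e. $0.5 = \frac{\sum x_2}{7}$
$\sum x_2$ = 3.5
Rainfall on Sunday = $\sum x_2 - \sum x_1$
= 3.5 - 1.8
= 1.7 inches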
12. The average salary of male employees in a firm was Rs. 520 and that of females Rs. 420. The mean
salary of all the employees as a whole is Rs. 500. Find the percentage of male and female employees.
Given: $\bar{x}_1 = 520$, $\bar{x}_2 = 420$, $\bar{x} = 500$
$n_1$ = number of male employees, $n_2$ = number of female employees.
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
$500 = \frac{520 n_1 + 420 n_2}{n_1 + n_2}$
$500 n_1 + 500 n_2 = 520 n_1 + 420 n_2$
$80 n_2 = 20 n_1$
$n_1 = 4 n_2$
Let $n_1 + n_2 = 100$
$4 n_2 + n_2 = 100$
$5 n_2 = 100$
$n_2$ = 20% female
$n_1$ = 80% male
80% of the employees in the firm are male and 20% are female.
13. The AM of two observations is 25 and their GM is 15. Find the HM.
Given:
AM = 25
$\bar{x} = \frac{a + b}{2}$
$25 = \frac{a + b}{2}$
a + b = 50
GM = 15
GM = $\sqrt{ab}$
$15 = \sqrt{ab}$
$(15)^2 = (\sqrt{ab})^2$
ab = 225
HM = ?
HM = $\frac{2}{\frac{1}{a} + \frac{1}{b}} = \frac{2ab}{a + b}$
HM = $\frac{2 \times 225}{50}$
HM = 9
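14. The GM is 60 and the HM is 28.24. Find the AM for two observations.
GM = $\sqrt{ab}$
$60 = \sqrt{ab}$
$60^2 = ab$
ab = 3600
HM = $\frac{2ab}{a + b}$
$28.24 = \frac{2ab}{a + b}$
$a + b = \frac{2ab}{28.24} = \frac{2 \times 3600}{28.24}$
a + b = 254.96
AM = $\bar{x} = \frac{a + b}{2} = \frac{254.96}{2}$
AM = 127.48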
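For two positive observations a and b, the working in the last two problems amounts to the identity AM x HM = GM^2, because $\frac{a+b}{2} \cdot \frac{2ab}{a+b} = ab$. A brief Python sketch of that shortcut follows; the function names are hypothetical and the identity holds only for exactly two observations.

```python
from math import sqrt


def hm_from_am_gm(am, gm):
    """HM of two observations from their AM and GM, using AM * HM = GM**2."""
    return gm ** 2 / am


def am_from_gm_hm(gm, hm):
    """AM of two observations from their GM and HM, using AM * HM = GM**2."""
    return gm ** 2 / hm


print(hm_from_am_gm(25, 15))                   # problem 13
print(round(am_from_gm_hm(60, 28.24), 2))      # problem 14

# Cross-check on an explicit pair: a = 45, b = 5 has AM = 25 and GM = 15
a, b = 45, 5
print((a + b) / 2, sqrt(a * b), 2 * a * b / (a + b))
```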
15. Calculate the missing frequency from the data if the median is 50.
CI      f       cf
10-20   2       2
20-30   8       10
30-40   6       16
40-50   $f_1$   $16 + f_1$
50-60   15      $31 + f_1$   (median class)
60-70   10      $41 + f_1$
N = $\sum f = 41 + f_1$
Median = $l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
$50 = 50 + \frac{10}{15}\left[\frac{N}{2} - (16 + f_1)\right]$
$50 - 50 = \frac{10}{15}\left[\frac{N}{2} - (16 + f_1)\right]$
$0 = \frac{N}{2} - (16 + f_1)$
$16 + f_1 = \frac{N}{2}$
$16 + f_1 = \frac{1}{2}(41 + f_1)$
$2(16 + f_1) = 41 + f_1$
$32 + 2f_1 = 41 + f_1$
$f_1 = 9$
Session 7
Measures of Dispersions
The measures of central tendency alone will not exhibit the various characteristics of frequency
distributions having the same total frequency. Two distributions can have the same mean but differ
significantly. We need to know the extent of variation or deviation of the values in comparison with the
central value or average, referred to as the measure of central tendency.
Measures of dispersion are averages of the second order. They are based on the average of the
deviations of the values from a measure of central tendency: $\bar{x}$, Me or z. Variability is a basic feature
of the values of variables. Such variation or dispersion refers to the lack of uniformity.
Definition: A measure of dispersion may be defined as a statistic signifying the extent of the
scatteredness of items around a measure of central tendency.
Absolute and Relative Measures of Dispersion:
A measure of dispersion may be expressed in an absolute form or in a relative form. It is said to
be in absolute form when it states the actual amount by which the value of an item on average deviates
from a measure of central tendency. Absolute measures are expressed in concrete units, i.e. the units in
which the data have been expressed (e.g. rupees, centimetres, kilograms), and are used to describe a
frequency distribution.
A relative measure of dispersion is a quotient obtained by dividing the absolute measure by the quantity
with respect to which the absolute deviation has been computed. It is as such a pure number and is usually
expressed in percentage form. Relative measures are used for making comparisons between two or
more distributions.
Thus, absolute measures are expressed in terms of the original units and are not suitable for
comparative studies. Relative measures are expressed as ratios or percentages and are suitable for
comparative studies.
Measures of Dispersion Types
Following are the common measures of dispersions.
a. The Range
b. The Quartile Deviation (QD)
c. The Mean Deviation (MD)
d. The Standard Deviation (SD)
Range
Range represents the difference between the extreme values. The range of any series is
the difference between the highest and the lowest values in the series.
The values between the two extremes are not taken into consideration at all. The range is a simple
indicator of the variability of a set of observations. It is denoted by R. In a frequency distribution, the
range is taken to be the difference between the lower limit of the class at the lower extreme of the
distribution and the upper limit of the class at the upper extreme.
Range can be computed using the following equations.
Range = Large value - Small value
Coefficient of Range = $\frac{\text{Large value} - \text{Small value}}{\text{Large value} + \text{Small value}}$
Problems
1. Compute the range and the coefficient of range of the given series, and state which one is more
dispersed and which is more uniform.
Series I: 9, 10, 15, 19, 21        Series II: 1, 15, 24, 28, 29
Series I: R = LV - SV = 21 - 9 = 12; CR = $\frac{R}{LV + SV} = \frac{12}{21 + 9} = \frac{12}{30}$ = 0.4
Series II: R = LV - SV = 29 - 1 = 28; CR = $\frac{R}{LV + SV} = \frac{28}{30}$ = 0.933
Series I is less dispersed and more uniform.
Series II is more dispersed and less uniform.
Evaluating Criteria
i. The smaller the CR, the less the dispersion and the more uniform the series.
ii. The larger the CR, the greater the dispersion and the less uniform the series.
Range Merits
i. It is the simplest measure to compute.
ii. It is rigidly defined.
iii. It is very useful in Statistical Quality Control (SQC).
iv. It is useful in studying variation in the prices of shares and stocks.
Limitations
i. It is not a stable measure of dispersion, as it is affected by extreme values.
ii. It does not consider class intervals and is not suitable for C.I. problems.
iii. It considers only the extreme values.
2. Find the range and coefficient of range from the following data.
A: 10 11 12 13 14
B: 40 41 42 43 44
C: 100 101 102 103 104
Series I (A): R = LV - SV = 14 - 10 = 4; CR = $\frac{R}{LV + SV} = \frac{4}{24}$ = 0.167
Series II (B): R = 44 - 40 = 4; CR = $\frac{R}{LV + SV} = \frac{4}{84}$ = 0.0476
Series III (C): R = 104 - 100 = 4; CR = $\frac{R}{LV + SV} = \frac{4}{204}$ = 0.0196
Series III is less dispersed and more uniform.
Series I is more dispersed and less uniform.
3. Compute the range and coefficient of range for the following data.
x: 6 12 18 24 30 36 42
f: 20 130 16 14 20 15 40
Range = LV - SV = 42 - 6 = 36
CR = $\frac{R}{LV + SV} = \frac{36}{48}$ = 0.75
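The range and its coefficient depend only on the two extreme values, which a few lines of Python make explicit. The function name `range_and_coefficient` is chosen here for illustration; frequencies play no part, as problem 3 shows.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) = (L - S, (L - S) / (L + S))."""
    largest, smallest = max(values), min(values)
    r = largest - smallest
    return r, r / (largest + smallest)


for name, series in [("Series I", [9, 10, 15, 19, 21]),
                     ("Series II", [1, 15, 24, 28, 29])]:
    r, cr = range_and_coefficient(series)
    print(name, r, round(cr, 3))
```

The three series of problem 2 all have range 4 but different coefficients, which is why the relative measure is the one used for comparison.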
Quartile Deviation
Quartiles divide the total frequency into four equal parts. The lower quartile $Q_1$ refers to the
value of the variate corresponding to the cumulative frequency N/4.
The upper quartile $Q_3$ refers to the value of the variate corresponding to the cumulative frequency 3N/4.
Quartile deviation is defined as QD = $\frac{1}{2}(Q_3 - Q_1)$. The middle quartile $Q_2$ is the median, as it
corresponds to the value of the variate with cumulative frequency $\frac{N}{2}$.
a) QD = $\frac{1}{2}(Q_3 - Q_1)$
b) Relative measure of dispersion: coefficient of QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
Problems
1. Find the quartile deviation and coefficient of quartile deviation for the given grouped data. Also compute
the middle quartile.
Class   f    cf
1-10    3    3
11-20   16   19
21-30   26   45    ($Q_1$ class)
31-40   31   76    ($Q_2$ and $Q_3$ class)
41-50   16   92
51-60   8    100
$\sum f$ = N = 100
For $Q_1$: $\frac{N}{4} = \frac{100}{4} = 25$
$Q_1 = l + \frac{h}{f}\left[\frac{N}{4} - C\right]$
$Q_1 = 20.5 + \frac{10}{26}[25 - 19]$
$Q_1$ = 22.81
$Q_2 = l + \frac{h}{f}\left[\frac{N}{2} - C\right]$
$Q_2 = 30.5 + \frac{10}{31}[50 - 45]$
$Q_2$ = 32.11
$Q_3 = l + \frac{h}{f}\left[\frac{3}{4}N - C\right]$
$Q_3 = 30.5 + \frac{10}{31}[75 - 45]$
$Q_3$ = 40.18
QD = $\frac{1}{2}(Q_3 - Q_1)$
= $\frac{1}{2}(40.18 - 22.81)$
= 8.685
Coef. QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{40.18 - 22.81}{40.18 + 22.81} = \frac{17.37}{62.99}$
= 0.276
2. Find the quartile deviation from the following marks of 12 students, and also the coefficient of
quartile deviation.
Sl. No.   Marks
1.        25
2.        30
3.        37
4.        43
5.        48
6.        54
7.        61
8.        67
9.        72
10.       80
11.       84
12.       89
$Q_1$ = $\frac{12 + 1}{4}$ = 3.25th item
= 3rd item + 0.25 (4th item - 3rd item)
= 37 + 0.25 (43 - 37)
$Q_1$ = 38.5
$Q_3$ = $\frac{3(12 + 1)}{4}$ = 9.75th item
= 9th item + 0.75 (10th item - 9th item)
= 72 + 0.75 (80 - 72)
$Q_3$ = 78
QD = $\frac{1}{2}(Q_3 - Q_1)$
= $\frac{1}{2}(78 - 38.5)$
QD = 19.75
Coef. QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{78 - 38.5}{78 + 38.5}$
= 0.339
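The fractional positions used above (3.25th and 9.75th items) can be handled by linear interpolation between neighbouring items. The following Python sketch follows that reading; the helper names `item_at` and `quartile_deviation` are hypothetical, and the series is assumed long enough that both positions fall inside it.

```python
def item_at(sorted_values, position):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    whole = int(position)
    frac = position - whole
    value = sorted_values[whole - 1]
    if frac:                                      # move part of the way to the next item
        value += frac * (sorted_values[whole] - sorted_values[whole - 1])
    return value


def quartile_deviation(values):
    """Return (Q1, Q3, QD, coefficient of QD) for an individual series."""
    data = sorted(values)
    n = len(data)
    q1 = item_at(data, (n + 1) / 4)
    q3 = item_at(data, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


marks = [25, 30, 37, 43, 48, 54, 61, 67, 72, 80, 84, 89]
q1, q3, qd, coef = quartile_deviation(marks)
print(q1, q3, qd, round(coef, 3))
```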
3. Compute the quartile deviation and its coefficient for the data given below:
x    f    cf
58   15   15
59   20   35
60   32   67    ($Q_1$ class)
61   35   102
62   33   135
63   22   157   ($Q_3$ class)
64   20   177
65   10   187
66   8    195
N = 195
$Q_1$ = size of the $\left(\frac{n + 1}{4}\right)$th item
= size of the $\left(\frac{195 + 1}{4}\right)$th item
$Q_1$ = size of the 49th item, which is covered by cf 67, which gives
$Q_1$ = 60
$Q_3$ = size of the $\frac{3}{4}(n + 1)$th item
= size of the $\frac{3}{4} \times 196$ = 147th item
It is covered by cf 157. Against cf 157, $Q_3$ = 63
QD = $\frac{1}{2}(Q_3 - Q_1)$
= $\frac{1}{2}(63 - 60)$
QD = 1.5
Coef. QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{63 - 60}{63 + 60} = \frac{3}{123}$
= 0.024
Merits of Quartile Deviation
It is very easy to compute.
It is not affected by extreme values of the variable.
It is not at all affected by open-end class intervals.
Demerits of Quartile Deviation
It completely ignores the portions below the lower quartile and above the upper
quartile.
It is not capable of further mathematical treatment.
It is greatly affected by fluctuations in sampling.
It is only a positional average, not a mathematical average.
Session 8
Measures of Dispersions
Mean Deviation
Mean deviation is the average difference between the items in a series and the mean,
median or mode of that series. It is concerned with the extent to which the values are dispersed about the
mean, median or mode. It is found by averaging all the deviations from the measure of central tendency. These
deviations are taken into the computation without regard to sign, i.e. negative signs are ignored. Theoretically, the
deviations are preferably taken from the median rather than from the mean or mode.
Merits of Mean Deviation
It is rigidly defined and easy to compute.
It takes all items into consideration and gives weight to deviations according to their
size.
It is less affected by extreme values.
It removes all irregularities by obtaining deviations and provides a correct measure.
Demerits of Mean Deviation
It is not suitable for algebraic treatment.
It ignores the signs of the deviations, which is not justified mathematically.
It is not a satisfactory measure when the deviations are taken from the mode.
It is not suitable when class intervals are open-ended.
Formula to compute Mean Deviation
If $x_i$ is a variate that takes the values $x_1, x_2, x_3, \ldots, x_n$ with average A (mean, median or mode),
then the mean deviation from the average A is defined by
MD = $\frac{\sum |x_i - A|}{N}$
For grouped data
MD = $\frac{\sum f\,|x_i - A|}{N}$
Coefficient of MD = $\frac{MD}{\text{Mean}}$
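1. Compute the MD and coefficient of MD from the mean for the data given below.
x    d = $|x_i - \bar{x}|$
21   26.55
32   15.55
38   9.55
41   6.55
49   1.45
54   6.45
59   11.45
66   18.45
68   20.45
$\sum x$ = 428, $\sum |x_i - \bar{x}|$ = $\sum d$ = 116.45
$\bar{x} = \frac{\sum x_i}{N} = \frac{428}{9}$ = 47.55
MD = $\frac{\sum |x_i - \bar{x}|}{N} = \frac{116.45}{9}$
MD = 12.938
Coefficient of MD = $\frac{MD}{\text{Avg}} = \frac{12.938}{47.55}$ = 0.272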
2. Following are the wages of workers. Find the mean deviation from the median and its coefficient.
x    Wages (arranged)   $|x_i - Me|$ = $|x_i - 47|$
59   17                 30
32   22                 25
67   25                 22
43   32                 15
22   43                 4
17   47 (Me)            0
64   55                 8
55   59                 12
47   64                 17
80   67                 20
25   80                 33
$\sum |x_i - Me|$ = 186
Median = $\left(\frac{11 + 1}{2}\right)$th item = 6th item
Me = 47
MD = $\frac{\sum |x_i - Me|}{N} = \frac{186}{11}$ = 16.91
Coefficient of MD = $\frac{MD}{\text{Median}} = \frac{16.91}{47}$ = 0.36
3. Compute the MD about the mode and its coefficient.
x     f    d = $|x_i - \text{Mode}|$   fd
20    6    100                         600
40    19   80                          1520
60    40   60                          2400
80    23   40                          920
100   65   20                          1300
120   83   0                           0      (modal class)
140   55   20                          1100
160   20   40                          800
180   9    60                          540
$\sum f$ = 320, $\sum f\,|x_i - \text{Mode}|$ = 9180
The highest frequency is 83 and hence
Z = 120
MD = $\frac{\sum f\,|x_i - \text{Mode}|}{N} = \frac{9180}{320}$ = 28.69
Coefficient of MD = $\frac{28.69}{120}$ = 0.239
4. Find the mean deviation about the median from the data given below.
Salaries           40   50   50-100   100-200   200-400
No. of employees   22   18   10       8         2
x         No. of employees (f)   x (mv)   cf   d = $|x_i - Me|$   fd
40        22                     40       22   10                 220
50        18                     50       40   0                  0
50-100    10                     75       50   25                 250
100-200   8                      150      58   100                800
200-400   2                      300      60   250                500
$\sum f$ = 60, $\sum f\,|x_i - Me|$ = 1770
Median = $\left(\frac{N + 1}{2}\right)$th item = $\frac{60 + 1}{2} = \frac{61}{2}$ = 30.5th item
It is covered by cf 40, and against cf 40 the discrete value is 50.
MD = $\frac{\sum f\,|x_i - \text{Median}|}{N} = \frac{1770}{60}$
MD = 29.5
Coefficient of MD = $\frac{MD}{\text{Median}} = \frac{29.5}{50}$ = 0.59
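The grouped mean deviation formula is the same whichever average the deviations are measured from, so one small function covers the mean, median and mode cases. This Python sketch uses hypothetical names (`mean_deviation`, `centre`); for raw data every frequency is simply 1.

```python
def mean_deviation(values, freqs, centre):
    """MD = sum(f * |x - A|) / N about a chosen average A (mean, median or mode)."""
    n = sum(freqs)
    return sum(f * abs(x - centre) for x, f in zip(values, freqs)) / n


# Problem 3: deviations about the mode (120)
x3 = [20, 40, 60, 80, 100, 120, 140, 160, 180]
f3 = [6, 19, 40, 23, 65, 83, 55, 20, 9]
md3 = mean_deviation(x3, f3, centre=120)
print(round(md3, 2), round(md3 / 120, 3))

# Problem 4: deviations about the median (50), using class mid-values
x4 = [40, 50, 75, 150, 300]
f4 = [22, 18, 10, 8, 2]
md4 = mean_deviation(x4, f4, centre=50)
print(md4, md4 / 50)
```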
Session 9
Measures of Dispersions
Standard Deviation
Standard deviation is the square root of the sum of the squares of the deviations divided by their number.
It is also called mean error deviation, mean square error deviation or root mean square
deviation. It is a second moment of dispersion. Since the sum of squares of deviations from the mean is a
minimum, the deviations are taken only from the mean (and not from the median or mode).
The standard deviation is the Root Mean Square (RMS) average of all the deviations from the mean.
It is denoted by sigma ($\sigma$).
Characteristics of standard deviation
1. Standard deviation and coefficient of variation possess all the properties which a good
measure of dispersion should possess.
2. The process of squaring the deviations eliminates negative signs and makes mathematical
computations easy.
Merits
1. It is based on all observations.
2. It can be smoothly handled algebraically.
3. It is a well defined and definite measure of dispersion.
4. It is of great importance when we are comparing the variability of two series.
Demerits
1. It is difficult to calculate and understand.
2. It gives more weightage to extreme values, as the deviations are squared.
3. It is not useful in economic studies.
Standard deviation
If the variate $x_i$ takes the values $x_1, x_2, \ldots, x_n$, the standard deviation is denoted by $\sigma$ and
is defined by
$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$
The quantity $\sigma^2$ is called the variance.
Alternate Expressions
For raw data
$\sigma^2 = \frac{\sum x^2}{N} - (\bar{x})^2$
For grouped data
$\sigma^2 = \frac{\sum f x^2}{N} - (\bar{x})^2$
For grouped data with the step deviation method
$\sigma = h \sqrt{\frac{\sum f d^2}{N} - \left(\frac{\sum f d}{N}\right)^2}$
Coefficient of variation
It is defined as the ratio of the standard deviation to the mean. The percentage form of
CV is given by CV = $\frac{\sigma}{\bar{x}} \times 100$
Problems
1. Ten students of a class have obtained the following marks in a particular subject out of 100. Calculate
the SD and CV for the data given below.
Sl. No.   Marks (x)   d = $(x_i - \bar{x})$ = $(x_i - 38.5)$   $(x_i - \bar{x})^2$
1.        5           -33.5                                    1122.25
2.        10          -28.5                                    812.25
3.        20          -18.5                                    342.25
4.        25          -13.5                                    182.25
5.        40          1.5                                      2.25
6.        42          3.5                                      12.25
7.        45          6.5                                      42.25
8.        48          9.5                                      90.25
9.        70          31.5                                     992.25
10.       80          41.5                                     1722.25
$\sum x$ = 385, $\sum (x_i - \bar{x})^2$ = $\sum d^2$ = 5320.50
$\bar{x} = \frac{\sum x}{N} = \frac{385}{10}$ = 38.5
$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} = \sqrt{\frac{5320.5}{10}}$ = 23.066
CV = $\frac{\sigma}{\bar{x}} \times 100$
CV = $\frac{23.066}{38.5} \times 100$
CV = 59.9%
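2. Compute the standard deviation and coefficient of variation for the following data of the marks of 100 students.
Class   f    Class boundaries   Mid-point x   d    fd    $fd^2$
1-10    3    0.5-10.5           5.5           -2   -6    12
11-20   16   10.5-20.5          15.5          -1   -16   16
21-30   26   20.5-30.5          25.5          0    0     0
31-40   31   30.5-40.5          35.5          1    31    31
41-50   16   40.5-50.5          45.5          2    32    64
51-60   8    50.5-60.5          55.5          3    24    72
N = $\sum f$ = 100, $\sum fd$ = 65, $\sum fd^2$ = 195
a = 25.5, h = 10
$d = \frac{x - a}{h} = \frac{x - 25.5}{10}$; for example, for x = 15.5, $d = \frac{15.5 - 25.5}{10} = \frac{-10}{10} = -1$
$\bar{x} = a + h\,\frac{\sum fd}{N} = 25.5 + 10 \left(\frac{65}{100}\right)$
= 25.5 + 6.5
$\bar{x}$ = 32
$\sigma = h \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = 10 \sqrt{\frac{195}{100} - \left(\frac{65}{100}\right)^2}$
$\sigma$ = 12.359
CV = $\frac{\sigma}{\bar{x}} \times 100$
CV = $\frac{12.359}{32} \times 100$ = 38.62%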
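The step deviation computation of the mean, standard deviation and CV can be reproduced with a short Python sketch. The function name `step_deviation_sd` is hypothetical; `a` and `h` are the assumed origin and class width, and the population form (division by N) is used, as in the formulas above.

```python
from math import sqrt


def step_deviation_sd(midpoints, freqs, a, h):
    """Return (mean, sd, CV%) using d = (x - a) / h and division by N."""
    n = sum(freqs)
    d = [(x - a) / h for x in midpoints]
    mean_d = sum(f * di for f, di in zip(freqs, d)) / n          # sum(fd) / N
    mean_d2 = sum(f * di * di for f, di in zip(freqs, d)) / n    # sum(fd^2) / N
    mean = a + h * mean_d
    sd = h * sqrt(mean_d2 - mean_d ** 2)
    return mean, sd, 100 * sd / mean


mid = [5.5, 15.5, 25.5, 35.5, 45.5, 55.5]
f = [3, 16, 26, 31, 16, 8]
mean, sd, cv = step_deviation_sd(mid, f, a=25.5, h=10)
print(round(mean, 2), round(sd, 3), round(cv, 2))
```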
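3. The AM and SD of a set of nine items are 43 and 5 respectively. If an item of value 63 is added, find
the new mean and SD.
$\bar{x} = \frac{\sum x_i}{N}$
$\sum x_i = \bar{x} \times N$
$\sum x_i$ = 43 x 9
$\sum x$ = 387 for 9 items
$\sum x$ = 387 + 63 for 10 items
$\sum x$ = 450
Modified mean $\bar{x} = \frac{\sum x}{N} = \frac{450}{10}$
$\bar{x}$ = 45
$\bar{x}$ = 43, $\sigma$ = 5 for 9 items
$\sigma^2 = \frac{\sum x^2}{N} - (\bar{x})^2$
$25 = \frac{\sum x^2}{9} - (43)^2$
$25 = \frac{\sum x^2}{9} - 1849$
$25 + 1849 = \frac{\sum x^2}{9}$
$\frac{\sum x^2}{9} = 1874$
$\sum x^2$ = 16866 for 9 items
If 63 is added,
$\sum x^2 = 16866 + (63)^2$ = 20835 for 10 items
Modified $\sigma^2 = \frac{\sum x^2}{N} - (\bar{x})^2$
$\sigma^2 = \frac{20835}{10} - (45)^2$ = 58.5
$\sigma$ = 7.65 is the modified SD.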
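The same bookkeeping (recover the sum and the sum of squares, add the new item, recompute) can be sketched in Python. The function name `add_item` is hypothetical, and the population form of the variance is assumed throughout, as in the worked solution.

```python
from math import sqrt


def add_item(mean, sd, n, new_value):
    """Mean and SD after adding one item to a set known only by (mean, sd, n)."""
    total = mean * n + new_value                               # new sum of x
    total_sq = n * (sd ** 2 + mean ** 2) + new_value ** 2      # new sum of x^2
    new_n = n + 1
    new_mean = total / new_n
    new_var = total_sq / new_n - new_mean ** 2
    return new_mean, sqrt(new_var)


new_mean, new_sd = add_item(mean=43, sd=5, n=9, new_value=63)
print(new_mean, round(new_sd, 2))
```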
4. The mean of 5 observations is 4.4 and the variance is 8.24. If three of the five observations are
1, 2 and 6, find the values of the other two observations.
w.k.t. $\bar{x} = \frac{\sum x}{N}$
$4.4 = \frac{\sum x}{5}$
$\sum x$ = 22
$\sigma^2 = \frac{\sum x^2}{N} - (\bar{x})^2$
$8.24 = \frac{\sum x^2}{5} - (4.4)^2$
$8.24 = \frac{\sum x^2}{5} - 19.36$
$8.24 + 19.36 = \frac{\sum x^2}{5}$
$\sum x^2$ = 138
$\sum x^2 = 1^2 + 2^2 + 6^2 + x_1^2 + x_2^2$
$138 = 1 + 4 + 36 + x_1^2 + x_2^2$
$x_1^2 + x_2^2 = 97$ ---- (1)
$\sum x = 1 + 2 + 6 + x_1 + x_2$
$22 = 9 + x_1 + x_2$
$x_1 + x_2 = 13$ ---- (2)
From (2), $x_2 = 13 - x_1$; substituting in (1):
$x_1^2 + (13 - x_1)^2 = 97$
$x_1^2 + 169 + x_1^2 - 26x_1 = 97$
$2x_1^2 - 26x_1 + 72 = 0$
$x_1^2 - 13x_1 + 36 = 0$
$x_1 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = \frac{-(-13) \pm \sqrt{169 - 4 \times 36}}{2} = \frac{13 \pm 5}{2}$
$x_1 = 6.5 \pm 2.5$
$x_1 = 9$ or $x_1 = 4$
If $x_1 = 9$ then $x_2 = 4$.
5. The mean and S.D. of the frequency distribution of a continuous random variable x are 40.604 and
7.92 respectively. After a change of origin and scale the distribution is as given below. Determine the actual class intervals.
d   -3   -2   -1   0    1    2    3    4
f   3    15   45   57   50   36   25   9
d    f    fd    $fd^2$   MV     CI
-3   3    -9    27       22.5   20-25
-2   15   -30   60       27.5   25-30
-1   45   -45   45       32.5   30-35
0    57   0     0        37.5   35-40
1    50   50    50       42.5   40-45
2    36   72    144      47.5   45-50
3    25   75    225      52.5   50-55
4    9    36    144      57.5   55-60
N = 240, $\sum fd$ = 149, $\sum fd^2$ = 695
$\bar{x} = a + h\,\frac{\sum fd}{N}$
$40.604 = a + h\,\frac{149}{240}$
40.604 = a + 0.62h ----- (1)
$\sigma = h \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2}$
$7.92 = h \sqrt{\frac{695}{240} - \left(\frac{149}{240}\right)^2} = h \sqrt{2.895 - (0.620)^2}$
7.92 = h x 1.584
h = 4.998
h = 5
Put h = 5 in equation (1)
40.604 = a + 0.62 x 5
a = 37.5
The mid-values are therefore MV = a + hd = 37.5 + 5d, and each class interval extends 2.5 on either side of its mid-value, as shown in the table.
Combined Standard Deviation
Suppose we have different samples of various sizes $n_1, n_2, n_3, \ldots$ having means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ and
standard deviations $\sigma_1, \sigma_2, \sigma_3, \ldots$; then the combined standard deviation can be computed by the following
formula.
$\sigma^2 (n_1 + n_2) = n_1 (\sigma_1^2 + d_1^2) + n_2 (\sigma_2^2 + d_2^2)$
$d_1 = \bar{x}_1 - \bar{x}$
$d_2 = \bar{x}_2 - \bar{x}$
where $\bar{x}$ is the combined mean.
1. The means of two samples of sizes 50 and 100 are 54.1 and 50.3 respectively, and their standard
deviations are 8 and 7 respectively. Obtain the SD for the combined group.
$n_1$ = 50, $\bar{x}_1$ = 54.1, $\sigma_1$ = 8
$n_2$ = 100, $\bar{x}_2$ = 50.3, $\sigma_2$ = 7
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
$\bar{x} = \frac{(50 \times 54.1) + (100 \times 50.3)}{50 + 100}$
$\bar{x}$ = 51.57
$\sigma^2 (n_1 + n_2) = n_1 (\sigma_1^2 + d_1^2) + n_2 (\sigma_2^2 + d_2^2)$
$d_1 = \bar{x}_1 - \bar{x}$
$d_2 = \bar{x}_2 - \bar{x}$
$d_1$ = 54.1 - 51.57
$d_1$ = 2.53, $d_1^2$ = 6.42
$d_2$ = 50.3 - 51.57
$d_2$ = -1.27, $d_2^2$ = 1.60
$150\,\sigma^2 = 50\,(8^2 + 6.42) + 100\,(7^2 + 1.60)$
$3\sigma^2 = (64 + 6.42) + 2\,(49 + 1.60)$
$3\sigma^2 = 70.42 + 2 \times 50.60$
$\sigma^2$ = 57.21
$\sigma$ = 7.56
2. The mean wage is Rs. 75 per day and the SD of wages is Rs. 5 per day for a group of 1000 workers;
the corresponding figures are Rs. 60 and Rs. 4.5 for another group of 1500 workers. Find the mean and
standard deviation for the entire group.

We have by data,
$\bar{x}_1 = 75$, $\sigma_1 = 5$, $n_1 = 1000$
$\bar{x}_2 = 60$, $\sigma_2 = 4.5$, $n_2 = 1500$

Let $\bar{x}$ and $\sigma$ be the mean and SD of the entire group.

Consider $\bar{x} = \dfrac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$

i.e., $\bar{x} = \dfrac{1000 \times 75 + 1500 \times 60}{1000 + 1500} = 66$

Also we have,
$(n_1 + n_2)\,\sigma^2 = n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)$,
where $d_1 = \bar{x}_1 - \bar{x} = 75 - 66 = 9$; $d_2 = \bar{x}_2 - \bar{x} = 60 - 66 = -6$

$(1000 + 1500)\,\sigma^2 = 1000\,(5^2 + 9^2) + 1500\,(4.5^2 + (-6)^2)$
$2500\,\sigma^2 = 106000 + 84375 = 190375$
$\sigma^2 = 76.15$ or $\sigma = 8.73$
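The combined mean and SD can also be computed with a small helper. This is a sketch under the formula stated above, generalised to any number of groups; the function name `combined_mean_sd` is hypothetical.

```python
from math import sqrt

def combined_mean_sd(groups):
    """groups: list of (n, mean, sd) tuples; returns (combined mean, combined SD)."""
    total_n = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / total_n
    # Each group contributes n * (sd^2 + d^2), with d = group mean - combined mean.
    var = sum(n * (s ** 2 + (m - mean) ** 2) for n, m, s in groups) / total_n
    return mean, sqrt(var)

# Problem 2: two groups of workers.
wage_mean, wage_sd = combined_mean_sd([(1000, 75, 5), (1500, 60, 4.5)])
# Problem 1: two samples of sizes 50 and 100.
sample_mean, sample_sd = combined_mean_sd([(50, 54.1, 8), (100, 50.3, 7)])
print(round(wage_mean, 2), round(wage_sd, 2), round(sample_mean, 2), round(sample_sd, 2))
```

Working from unrounded intermediate values avoids the small differences that hand rounding of $d_1^2$ and $d_2^2$ introduces.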
3. The arithmetic means of the runs scored by three batsmen A, B and C are 50, 48 and 12 respectively.
The SDs of their runs are 15, 12 and 2 respectively. Who is the most consistent of the three batsmen?
If one of the three is to be selected, who should be selected?

                    A     B     C
AM ($\bar{x}$)      50    48    12
SD ($\sigma$)       15    12     2

$CV_A = \dfrac{\sigma_A}{\bar{x}_A} \times 100 = \dfrac{15}{50} \times 100 = 30\%$
$CV_B = \dfrac{\sigma_B}{\bar{x}_B} \times 100 = \dfrac{12}{48} \times 100 = 25\%$
$CV_C = \dfrac{\sigma_C}{\bar{x}_C} \times 100 = \dfrac{2}{12} \times 100 = 16.67\%$

Evaluation Criteria
1. A lower CV indicates a more consistent player; hence the most consistent player is C.
2. The highest run scorer is A, with $\bar{x}_A = 50$, so A is the choice if average score is the criterion.
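The comparison above can be sketched in a few lines of Python. The dictionary layout and names are illustrative assumptions, not part of the original notes.

```python
def coefficient_of_variation(mean, sd):
    """CV as a percentage: SD relative to the mean."""
    return sd / mean * 100

# (mean, SD) of runs for each batsman, as given in the problem.
batsmen = {"A": (50, 15), "B": (48, 12), "C": (12, 2)}

cv = {name: coefficient_of_variation(m, s) for name, (m, s) in batsmen.items()}
most_consistent = min(cv, key=cv.get)                          # smallest CV
highest_scorer = max(batsmen, key=lambda k: batsmen[k][0])     # largest mean
print(cv, most_consistent, highest_scorer)
```

Note that the two criteria point to different players here, which is why the problem asks the two questions separately.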
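4. The coefficients of variation of two series are 75% and 90%, with SDs 15 and 18 respectively.
Compute their means.

$CV_A = 75\%$, $CV_B = 90\%$, $\sigma_A = 15$, $\sigma_B = 18$

$CV = \dfrac{\sigma}{\bar{x}} \times 100$

$75 = \dfrac{15}{\bar{x}_A} \times 100$, so $\bar{x}_A = 20$
$90 = \dfrac{18}{\bar{x}_B} \times 100$, so $\bar{x}_B = 20$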
5. Goals scored by two teams A and B in a football season are shown below. By calculating the CV
for each, find which team may be considered more consistent.

No. of goals    No. of matches (f)        fx                fx²
    (x)         Team A    Team B     Team A   Team B   Team A   Team B
     0            27        17          0        0        0        0
     1             9         9          9        9        9        9
     2             8         6         16       12       32       24
     3             5         5         15       15       45       45
     4             4         3         16       12       64       48
   Total          53        40         56       48      150      126

$\bar{x}_A = \dfrac{\sum fx}{N} = \dfrac{56}{53} = 1.056$
$\bar{x}_B = \dfrac{\sum fx}{N} = \dfrac{48}{40} = 1.2$

$\sigma_A^2 = \dfrac{\sum fx^2}{N} - (\bar{x}_A)^2 = \dfrac{150}{53} - (1.056)^2 = 1.715$, so $\sigma_A = 1.31$
$\sigma_B^2 = \dfrac{\sum fx^2}{N} - (\bar{x}_B)^2 = \dfrac{126}{40} - (1.2)^2 = 1.71$, so $\sigma_B = 1.31$

$CV_A = \dfrac{\sigma_A}{\bar{x}_A} \times 100 = \dfrac{1.31}{1.056} \times 100 \approx 123.9\%$
$CV_B = \dfrac{\sigma_B}{\bar{x}_B} \times 100 = \dfrac{1.31}{1.2} \times 100 \approx 109\%$

Since $CV_B < CV_A$, team B is the more consistent team.
6. The prices of two shares A and B are given below. State which share is more stable in its
value.

 Price A    $x_i - \bar{x}$    $(x_i - \bar{x})^2$     Price B    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
   (x)      ($\bar{x} = 53$)                             (x)      ($\bar{x} = 105$)
   55          2             4                 108          3             9
   54          1             1                 107          2             4
   52         -1             1                 105          0             0
   53          0             0                 105          0             0
   56          3             9                 106          1             1
   58          5            25                 107          2             4
   52         -1             1                 104         -1             1
   50         -3             9                 103         -2             4
   51         -2             4                 104         -1             1
   49         -4            16                 101         -4            16
$\sum x = 530$      $\sum (x_i - \bar{x})^2 = 70$        $\sum x = 1050$     $\sum (x_i - \bar{x})^2 = 40$

$\bar{x}_A = \dfrac{\sum x}{N} = \dfrac{530}{10} = 53$
$\bar{x}_B = \dfrac{\sum x}{N} = \dfrac{1050}{10} = 105$

$\sigma_A = \sqrt{\dfrac{70}{10}} = 2.65$
$\sigma_B = \sqrt{\dfrac{40}{10}} = 2$

$CV_A = \dfrac{\sigma_A}{\bar{x}_A} \times 100 = \dfrac{2.65}{53} \times 100 = 4.99\%$
$CV_B = \dfrac{\sigma_B}{\bar{x}_B} \times 100 = \dfrac{2}{105} \times 100 = 1.90\%$

Since $CV_B$ is less, share B is more stable.
7. A student, while computing the coefficient of variation, obtained the mean and SD of
100 observations as 40 and 5.1 respectively. It was later discovered that he had
wrongly copied an observation as 50 instead of 40. Calculate the correct coefficient of
variation.

$\bar{x} = \dfrac{\sum x}{n}$, i.e. $40 = \dfrac{\sum x}{100}$, so incorrect $\sum x = 4000$
Now correct $\sum x = 4000 - 50 + 40 = 3990$
Correct $\bar{x} = \dfrac{3990}{100} = 39.9$

Let us consider $\sigma^2 = \dfrac{\sum x^2}{n} - (\bar{x})^2$
$(5.1)^2 = \dfrac{\sum x^2}{100} - (40)^2$
i.e. $\dfrac{\sum x^2}{100} = (40)^2 + (5.1)^2 = 1626.01$
Incorrect $\sum x^2 = 100 \times 1626.01 = 162601$
Now correct $\sum x^2 = 162601 - (50)^2 + (40)^2 = 161701$

Correct $\sigma^2 = \dfrac{\text{correct } \sum x^2}{n} - (\text{correct } \bar{x})^2$
i.e., correct $\sigma^2 = \dfrac{161701}{100} - (39.9)^2 = 25$, so correct $\sigma = 5$

Now correct coefficient of variation $= \dfrac{\sigma}{\bar{x}} \times 100 = \dfrac{5}{39.9} \times 100 = 12.53\%$
Hence correct C.V. = 12.53%
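The correction procedure used here (recover the sums from the mean and SD, adjust them, recompute) can be written as one helper. This is a minimal sketch; the function name and its `wrong`/`right` parameters are hypothetical. Passing an empty `right` list drops an observation instead of replacing it.

```python
from math import sqrt

def corrected_mean_sd(n, mean, sd, wrong, right):
    """Remove the values in `wrong`, add those in `right`, return (mean, SD)."""
    sum_x = n * mean                       # recover sum of x
    sum_x2 = n * (sd ** 2 + mean ** 2)     # recover sum of x^2
    sum_x += sum(right) - sum(wrong)
    sum_x2 += sum(v * v for v in right) - sum(v * v for v in wrong)
    new_n = n - len(wrong) + len(right)
    new_mean = sum_x / new_n
    new_sd = sqrt(sum_x2 / new_n - new_mean ** 2)
    return new_mean, new_sd

# An observation copied as 50 instead of 40 (n = 100, mean 40, SD 5.1).
mean, sd = corrected_mean_sd(100, 40, 5.1, wrong=[50], right=[40])
cv = sd / mean * 100

# Omitting an incorrect observation 10 (n = 21, mean 30, SD 5).
mean_omit, sd_omit = corrected_mean_sd(21, 30, 5, wrong=[10], right=[])
print(round(mean, 2), round(sd, 2), round(cv, 2), round(mean_omit, 2), round(sd_omit, 2))
```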
8. The mean and SD of 21 observations are 30 and 5 respectively. It was subsequently
noted that one of the observations, 10, was incorrect. Omit it and determine the mean
and SD of the rest.

$\bar{x} = \dfrac{\sum x}{n}$, i.e. $30 = \dfrac{\sum x}{21}$, so incorrect $\sum x = 630$
Now omitting the incorrect value 10,
New $\sum x = 630 - 10 = 620$
$n = 21 - 1 = 20$
New $\bar{x} = \dfrac{620}{20} = 31$

Next consider $\sigma^2 = \dfrac{\sum x^2}{n} - (\bar{x})^2$
$(5)^2 = \dfrac{\sum x^2}{21} - (30)^2$
i.e. $\dfrac{\sum x^2}{21} = 25 + 900 = 925$
Incorrect $\sum x^2 = 925 \times 21 = 19425$
Again omitting the incorrect value 10,
New $\sum x^2 = 19425 - (10)^2 = 19325$, $n = 20$
Hence new $\sigma^2 = \dfrac{\text{new } \sum x^2}{20} - (\text{new } \bar{x})^2 = \dfrac{19325}{20} - (31)^2 = 5.25$
New $\sigma = \sqrt{5.25} = 2.29$
9. The mean of 200 items was 50. Later on it was discovered that two items were misread
as 92 and 8 instead of 192 and 88. Find out the correct mean.

$\bar{x} = \dfrac{\sum x}{n}$, i.e. $50 = \dfrac{\sum x}{200}$, so incorrect $\sum x = 10000$
Correct $\sum x = 10000 - 92 - 8 + 192 + 88 = 10180$
Correct mean $= \dfrac{10180}{200} = 50.9$
10. Find the missing frequencies in the following data given that the median is 137.2.

Class       100-110  110-120  120-130  130-140  140-150  150-160  160-170  170-180
Frequency      15       44      133      $f_1$     125      $f_2$      35       16      N = 600

We prepare the table with the column of cumulative frequencies and use the
formula for the median.

Class       Frequency    cf
100-110        15        15
110-120        44        59
120-130       133        192
130-140       $f_1$        $192 + f_1$          (median class)
140-150       125        $317 + f_1$
150-160       $f_2$        $317 + f_1 + f_2$
160-170        35        $352 + f_1 + f_2$
170-180        16        $368 + f_1 + f_2$
            N = 600

Median $= l + \dfrac{h}{f}\left(\dfrac{N}{2} - c\right)$

We can take the median class as 130-140 since the median is given to be 137.2. Here
$l = 130$, $h = 10$, $f = f_1$, $c = 192$ and $N/2 = 300$.

$137.2 = 130 + \dfrac{10}{f_1}(300 - 192)$
i.e., $137.2 - 130 = \dfrac{1080}{f_1}$
i.e., $7.2 f_1 = 1080$ or $f_1 = 150$

But the last cumulative frequency must be equal to N = 600,
i.e. $368 + f_1 + f_2 = 600$
$368 + 150 + f_2 = 600$, so $f_2 = 82$
Thus $f_1 = 150$, $f_2 = 82$
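The two steps of this solution (solve the median formula for the median-class frequency, then use the total) can be sketched as follows. The variable names are illustrative, and the median class 130-140 is taken from the solution above.

```python
# Known quantities from the problem; f1 (130-140) and f2 (150-160) are unknown.
N, median = 600, 137.2
lower, width = 130, 10        # median class 130-140
cf_before = 15 + 44 + 133     # cumulative frequency below the median class

# median = lower + (width / f1) * (N/2 - cf_before)  =>  solve for f1
f1 = width * (N / 2 - cf_before) / (median - lower)

# All frequencies must add up to N.
known_total = 15 + 44 + 133 + 125 + 35 + 16
f2 = N - known_total - f1
print(round(f1), round(f2))
```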
Relationship between various measures of dispersion
For a normal distribution, the following relationships hold among the various measures
of dispersion ($\sigma$ denotes the SD):
1. Mean $\pm$ QD covers 50% of the observations of the distribution
2. Mean $\pm$ MD covers 57.5% of the observations
3. Mean $\pm\ 1\sigma$ includes 68.27% of the observations
4. Mean $\pm\ 2\sigma$ includes 95.45% of the observations
5. Mean $\pm\ 3\sigma$ includes 99.73% of the observations
6. QD $= 0.6745\,\sigma \approx \dfrac{2}{3}\sigma$
7. MD $= \sqrt{\dfrac{2}{\pi}}\,\sigma \approx \dfrac{4}{5}\sigma$
8. QD $= \dfrac{5}{6}$ MD
9. Combining the results we get 3 QD = 2 SD and 5 MD = 4 SD, so that 6 QD = 5 MD = 4 SD.
10. Range = 6 times SD (approximately).
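These figures can be verified for the standard normal distribution with Python's `statistics.NormalDist`. The sketch below assumes normality throughout; the constants do not apply to other distributions.

```python
from math import sqrt, pi
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

qd = z.inv_cdf(0.75)    # quartile deviation, in units of SD
md = sqrt(2 / pi)       # mean deviation about the mean, in units of SD

def coverage(k):
    """Share of observations lying within mean +/- k SD."""
    return z.cdf(k) - z.cdf(-k)

print(round(qd, 4), round(md, 4), round(qd / md, 4))
print([round(coverage(k), 4) for k in (qd, md, 1, 2, 3)])
```

The ratio `qd / md` comes out close to, but not exactly, 5/6, which shows that relations 6 to 9 are convenient approximations.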
CORRELATION ANALYSIS
Concept and Importance of Correlation
We may come across certain series wherein there may be more than one variable. A
distribution in which two variables are measured on each unit, so that each unit assumes two
values, is called a Bivariate Distribution. If we measure more than two variables on each unit
of a distribution, it is called a Multivariate Distribution. In a bivariate distribution, we may be
interested to find if there is any relationship between the two variables under study.
Correlation is a statistical tool which studies the relationship between two variables, and
correlation analysis involves the various methods and techniques used for studying and
measuring the extent of that relationship.
"When the relationship is of a quantitative nature, the appropriate statistical tool for
discovering and measuring the relationship and expressing it in a brief formula is known as
correlation."
- Croxton & Cowden
"Correlation is an analysis of the covariation between two or more variables."
- A. M. Tuttle
"Correlation Analysis contributes to the understanding of economic behaviour, aids in
locating the critically important variables on which others depend, may reveal to the
economist the connections by which disturbances spread and suggest to him the paths
through which stabilizing forces may become effective."
- W. A. Neiswanger
"The effect of correlation is to reduce the range of uncertainty of our prediction."
- Tippett
The problem in analyzing the association between two variables can be broken down into
three steps.
o We try to know whether the two variables are related or independent of each other.
o If we find that there is a relationship between the two variables, we try to know its
nature and strength. This means whether these variables have a positive or a
negative relationship and how close that relationship is.
o We may like to know if there is a causal relationship between them. This means
that the variation in one variable causes variation in another.
When data regarding two or more variables are available, we may study the related
variation of these variables. For example, in data regarding the heights (x) and weights (y)
of students of a college, we find that students of greater height tend to have greater weight,
and students of lesser height tend to have lesser weight. This type of related variation among
variables is called correlation. Correlation may be (i) Simple correlation (ii) Multiple
correlation (iii) Partial correlation.
Simple correlation is concerned with related variation between two variables. Multiple
correlation and partial correlation are concerned with related variation among three or more
variables.
Two variables are said to be correlated when they vary such that
a. The higher values of one variable correspond to the higher values of the other and
the lower values of the variable correspond to the lower values of the other. or
b. The higher values of one variable correspond to the lower values of the other.
Generally, it can be seen that those who are tall will have greater weight, and those who are
short will have lesser weight. Thus height (x) and weight (y) of persons show related
variation. And so they are correlated. On the other hand production (x) and price (y) of
vegetables show variation in opposite directions. Here the higher the production the lower
would be the price.
In both the above examples, the variables x and y show related variation. And so they are
correlated.
TYPES OF CORRELATION
Correlation is positive (direct) if the variables vary in the same directions, that is, if they
increase and decrease together.
Height (x) and weight (y) of persons are positively correlated.
Correlation is negative (inverse) if the variables vary in the opposite directions, that is, if
one variable increases the other variable decreases. Production (x) and price (y) of
vegetables are negatively correlated.
If variables do not show related variation, they are said to be non correlated. If variables
show exact linear relationship, they are said to be perfectly correlated. Perfect correlation
may be positive or negative.
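Correlation and Causation
A correlation between two variables does not by itself establish that one causes the other,
for the following reasons.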
o The correlation may be due to chance particularly when the data pertain to a small
sample.
o It is possible that both the variables are influenced by one or more other variables.
o There may be another situation where both the variables may be influencing each
other so that we cannot say which is the cause and which is the effect.
Types of Correlation
o Positive and Negative: If the values of the two variables deviate in the same
direction i.e., if the increase in the values of one variable results, on an average, in a
corresponding increase in the values of the other variable or if a decrease in the
values of one variable results, on an average, in a corresponding decrease in the
values of the other variable, correlation is said to be positive or direct. For example:
Price & Supply of the commodity. On the other hand, correlation is said to be
negative or inverse if the variables deviate in the opposite direction i.e., if the
increase (decrease) in the values of one variable results, on the average, in a
corresponding decrease (increase) in the values of the other variable. For example:
Temperature and Sale of Woolen Garments.
o Linear and Non-Linear: The correlation between two variables is said to be linear
if corresponding to a unit change in one variable, there is a constant change in the
other variable over the entire range of the values. For example: y = ax + b. The
relationship between two variables is said to be non-linear or curvilinear if
corresponding to a unit change in one variable, the other variable does not change
at a constant rate but at a fluctuating rate. When this is plotted in the graph this will
not be a straight line.
o Simple, Partial and Multiple: The distinction amongst these three types of
correlation depends upon the number of variables involved in a study. If only two
variables are involved in a study, then the correlation is said to be simple
correlation. When three or more variables are involved in a study, then it is a
problem of either partial or multiple correlation. In multiple correlation, three or
more variables are studied simultaneously. But in partial correlation we consider
only two variables influencing each other while the effect of other variable is held
constant. For example: Let us suppose that we have three variables, number of
hours studied (x); IQ (y); marks obtained (z). In a multiple correlation we will study
the correlation between z with 2 variables x & y. In contrast, when we study the
relationship between x & z, keeping an average IQ as constant, it is said to be a
study involving partial correlation.
Methods of Correlation
[Figure: Methods of correlation. Graphic: scatter diagram. Algebraic: covariance method,
rank correlation, concurrent deviation method.]
Process of Calculating Coefficient of Correlation
o Calculate the means of the two series: X and Y.
o Take deviations in the two series from their respective means, indicated as x and y.
The deviation should be taken in each case as the value of the individual item
minus (-) the arithmetic mean.
o Square the deviations in both the series and obtain the sum of the deviation-squared
columns. This would give $\sum x^2$ and $\sum y^2$.
o Take the product of the deviations and sum it, that is, $\sum xy$. This means individual
deviations are to be multiplied by the corresponding deviations in the other series and then
their sum is obtained.
o The values thus obtained in the preceding steps, $\sum xy$, $\sum x^2$ and $\sum y^2$, are to be used in
the formula for correlation.
SCATTER DIAGRAM METHOD
Scatter diagram is a graphic presentation of bivariate data. Here, bivariate data with n pairs
of values is represented by n points on the xy plane. The two variables are taken along
the two axes, and every pair of values in the data is represented by a point on the graph.
The pattern of distribution of points on the graph can be made use of for the rough
estimation of degree of correlation between the variables.
In the scatter diagram
a. If the points form a line with positive slope (a line moving upwards), the variables
are positively and perfectly correlated.
b. If the points form a line with negative slope (a line moving downwards), the
variables are negatively and perfectly correlated.
c. If the points cluster around a line with positive slope, the variables are positively
correlated.
d. If the points cluster around a line with negative slope, the variables are negatively
correlated.
e. If the points are spread all over the graph, the variables are uncorrelated.
f. Any other curved form of spread of points indicates a curvilinear relation between
the variables.
The scatter diagram is one of the simplest ways of diagrammatic representation of a bivariate
distribution and provides us one of the simplest tools of ascertaining the correlation
between two variables. Suppose we are given n pairs of values of two variables X and Y.
For example, if the variables X and Y denote the height and weight respectively, then the
pairs may represent the heights and weights (in pairs) of n individuals. These n points may
be plotted as dots (.) in the x-y plane. (It is customary to take the dependent variable along
the y-axis and the independent variable along the x-axis.) The diagram of dots so obtained is
known as a scatter diagram. From the scatter diagram we can form a fairly good, though
rough, idea about the relationship between the two variables. The following points may be
borne in mind in interpreting the scatter diagram regarding the correlation between the two
variables:
1. If the points are very dense i.e very close to each other, a fairly good amount of
correlation may be expected between the two variables. On the other hand, if the
points are widely scattered, a poor correlation may be expected between them.
2. If the points on the scatter diagram reveal any trend (either upward or downward),
the variables are said to be correlated and if no trend is revealed, the variables are
uncorrelated.
3. If there is an upward trend rising from the lower left hand corner and going upward to
the upper right hand corner, the correlation is positive since this reveals that the
values of the two variables move in the same direction. If, on the other hand, the
points depict a downward trend from the upper left hand corner, the correlation is
negative since in this case the values of the two variables move in opposite
directions.
4. In particular , if all the points lie on a straight line starting from the left bottom and
going up towards the right top, the correlation is perfect and positive , and if all the
points lie on a straight line starting from the left top and coming down to right
bottom , the correlation is perfect and negative.
5. The method of scatter diagram is readily comprehensible and enables us to form a
rough idea of the nature of the relationship between the two variables merely by
inspection of the graph. Moreover, this method is not affected by extreme
observation whereas all mathematical formulae of ascertaining correlation between
two variables are affected by extreme observations. However, this method is not
suitable if the number of observations is fairly large.
6. The method of scatter diagram tells us about the nature of the relationship whether
it is positive or negative and whether it is high or low. It does not provide us exact
measure of the extent of the relationship between the two variables.
7. The scatter diagram enables us to obtain an approximate estimating line or line of
best fit by free hand method. The method generally consists in stretching a piece of
thread through the plotted points to locate the best possible line.
KARL PEARSON'S COEFFICIENT OF CORRELATION (COVARIANCE
METHOD; PRODUCT MOMENT)
This is a measure of linear relationship between the two variables. It indicates the degree of
correlation between the two variables. It is denoted by r.
INTERPRETATION OF COEFFICIENT OF CORRELATION
a. A positive value of r indicates positive correlation.
b. A negative value of r indicates negative correlation.
c. r = +1 means the correlation is perfect positive.
d. r = -1 means the correlation is perfect negative.
e. r = 0 (or low) means the variables are uncorrelated.
Karl Pearson's measure, known as the Pearsonian correlation coefficient between two
variables (series) X and Y and usually denoted by r, is a numerical measure of the linear
relationship between them. It is defined as the ratio of the covariance between X and Y,
written as Cov (x, y), to the product of the standard deviations of X and Y.
Assumptions of Karl Pearson's Correlation
o The two variables X and Y are linearly related.
o The two variables are affected by several causes, which are independent, so as to
form a normal distribution.
Coefficient of Determination
The strength of r is judged by the coefficient of determination, $r^2$. For r = 0.9, $r^2 = 0.81$. We
multiply it by 100, thus getting 81 per cent. This suggests that when r is 0.9 we can
say that 81 per cent of the total variation in the Y series can be attributed to the relationship
with X.
Rank Correlation
Limitations of Spearman's Method of Correlation
o Spearman's r is a distribution-free or non-parametric measure of correlation.
o As such, the result may not be as dependable as in the case of ordinary correlation
where the distribution is known.
o Another limitation of rank correlation is that it cannot be applied to a grouped
frequency distribution.
o When the number of observations is quite large and one has to assign ranks to the
observations in the two series, then such an exercise becomes rather tedious and
time-consuming. This becomes a major limitation of rank correlation.
Some Limitations of Correlation Analysis
o Correlation analysis cannot determine cause-and-effect relationship.
o Another mistake that occurs frequently is on account of misinterpretation of the
coefficient of correlation and the coefficient of determination.
o Another mistake in the interpretation of the coefficient of correlation occurs when
one concludes a positive or negative relationship even though the two variables are
actually unrelated.
Properties of Correlation Coefficient
Property 1 - Limits for Correlation Coefficient
The Pearsonian correlation coefficient cannot exceed 1 numerically. In other words, it lies
between -1 and +1. Symbolically: $-1 \le r \le 1$. r = +1 implies perfect positive correlation
between the variables.
Property 2 - Correlation Coefficient is independent of the change of origin and scale.
Mathematically, if X and Y are given and they are transformed to the new variables U and
V by the change of origin and scale, viz.,
$u = (x - A)/h$ and $v = (y - B)/k$; $h > 0$, $k > 0$,
where A, B, h and k are constants, then the correlation coefficient between x and y is the
same as the correlation coefficient between u and v, i.e.,
$r(x, y) = r(u, v)$, or $r_{xy} = r_{uv}$
Property 3 - Two independent variables are uncorrelated but the converse is not true.
Remarks: One should not confuse uncorrelatedness with independence.
$r_{xy} = 0$, i.e., uncorrelation between the variables x and y, simply implies the absence of any
linear (straight line) relationship between them. They may, however, be related in some
other form (other than a straight line), e.g., quadratic (as we have seen in the above example),
logarithmic or trigonometric form.
Property 5 - If the variables x and y are connected by a linear equation ax + by + c = 0, the
correlation coefficient between x and y is (+1) if the signs of a and b are different and (-1) if
the signs of a and b are alike.
Interpretation of r: the following general points may be borne in mind while interpreting an
observed value of the correlation coefficient r. If r = -1 there is perfect negative correlation
between the variables. In this case also the scatter diagram will be a straight line.
If r = 0, the variables are uncorrelated in other words there is no linear (straight line)
relationship between the variables. However, r = 0 does not imply that the variables are
independent.
For other values of r lying between +1 and -1 there are no set guidelines for its
interpretation. The most we can conclude is that the nearer the numerical value of r is to 1,
the closer is the relationship between the variables, and the nearer the value of r is to 0, the
less close is the relationship between them. One should be very careful in interpreting the
value of r as it is often misinterpreted.
The reliability or the significance of the value of the correlation depends on a number of
factors. One of the ways of testing the significance of r is finding its probable error, which
in addition to the value of r takes into account the size of the sample also.
Another more useful measure for interpreting the value of r is the coefficient of
determination. It is observed there that the closeness of the relationship between two
variables is not proportional to r.
In total the Properties are:
o Limits for Correlation Coefficient.
o Independent of the change of origin & scale.
o Two independent variables are uncorrelated but the converse is not true.
o If the variables x & y are connected by a linear equation ax+by+c=0, the correlation
coefficient between x & y is (+1) if the signs of a, b are different & (-1) if the signs of a, b are
alike.
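Two of these properties are easy to check numerically. The sketch below uses a small hypothetical data set (not from the text) and a plain-Python `pearson_r` helper written for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical illustrative data (an assumption, not from the notes).
x = [2, 4, 5, 7, 9]
y = [10, 14, 13, 19, 22]

# Change of origin and scale: u = (x - A)/h, v = (y - B)/k with h, k > 0.
u = [(xi - 5) / 2 for xi in x]
v = [(yi - 15) / 3 for yi in y]

# Exact linear relation a*x + b*y + c = 0.
y_alike = [(4 - 2 * xi) / 3 for xi in x]    # 2x + 3y - 4 = 0: signs of a, b alike
y_differ = [(2 * xi + 4) / 3 for xi in x]   # 2x - 3y + 4 = 0: signs of a, b differ

print(pearson_r(x, y), pearson_r(u, v))
print(pearson_r(x, y_alike), pearson_r(x, y_differ))
```

The first pair of values agrees, and the second pair comes out as -1 and +1 up to floating-point rounding.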
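Important Formulas:
Deviations from the actual means ($x = X - \bar{X}$, $y = Y - \bar{Y}$):
$r = \dfrac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}$
Deviations from assumed means ($dx = X - A$, $dy = Y - B$):
$r = \dfrac{n\sum dx\,dy - \sum dx \cdot \sum dy}{\sqrt{n\sum dx^2 - (\sum dx)^2} \cdot \sqrt{n\sum dy^2 - (\sum dy)^2}}$
Covariance form:
$r = \dfrac{\mathrm{Cov}(x, y)}{SD(x) \cdot SD(y)}$
Raw values:
$r = \dfrac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2} \cdot \sqrt{n\sum Y^2 - (\sum Y)^2}}$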
The choice of formula depends on the situation. The following problems are solved
using different formulas. We can notice that, irrespective of the formula used, the answer
remains the same.
Problems 1, 2 and 3 are solved with different formulas for the same data.

Problem 1 (deviations from the actual means, $\bar{X} = 65$, $\bar{Y} = 66$):

  X     Y    $x = X - \bar{X}$   $y = Y - \bar{Y}$    x²     y²     xy
  39    47      -26       -19       676    361    494
  65    53        0       -13         0    169      0
  62    58       -3        -8         9     64     24
  90    86       25        20       625    400    500
  82    62       17        -4       289     16    -68
  75    68       10         2       100      4     20
  25    60      -40        -6      1600     36    240
  98    91       33        25      1089    625    825
  36    51      -29       -15       841    225    435
  78    84       13        18       169    324    234
 650   660        0         0      5398   2224   2704

$r = \dfrac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} = \dfrac{2704}{\sqrt{5398 \times 2224}} = 0.7804$

Problem 2 (deviations from assumed means, A = 70 for X and B = 60 for Y):

  X     Y    dx = X - A   dy = Y - B    dx²    dy²    dxdy
  39    47      -31         -13         961    169     403
  65    53       -5          -7          25     49      35
  62    58       -8          -2          64      4      16
  90    86       20          26         400    676     520
  82    62       12           2         144      4      24
  75    68        5           8          25     64      40
  25    60      -45           0        2025      0       0
  98    91       28          31         784    961     868
  36    51      -34          -9        1156     81     306
  78    84        8          24          64    576     192
 650   660      -50          60        5648   2584    2404

$r = \dfrac{n\sum dx\,dy - \sum dx \cdot \sum dy}{\sqrt{n\sum dx^2 - (\sum dx)^2} \cdot \sqrt{n\sum dy^2 - (\sum dy)^2}}$
$r = \dfrac{10 \times 2404 - (-50)(60)}{\sqrt{10 \times 5648 - (-50)^2} \cdot \sqrt{10 \times 2584 - (60)^2}} = \dfrac{27040}{\sqrt{53980 \times 22240}}$
$r = 0.78$

Problem 3 (raw values):

  X     Y     X²      Y²      XY
  39    47   1521    2209    1833
  65    53   4225    2809    3445
  62    58   3844    3364    3596
  90    86   8100    7396    7740
  82    62   6724    3844    5084
  75    68   5625    4624    5100
  25    60    625    3600    1500
  98    91   9604    8281    8918
  36    51   1296    2601    1836
  78    84   6084    7056    6552
 650   660  47648   45784   45604

$r = \dfrac{n\sum XY - \sum X \cdot \sum Y}{\sqrt{n\sum X^2 - (\sum X)^2} \cdot \sqrt{n\sum Y^2 - (\sum Y)^2}}$
$r = \dfrac{10 \times 45604 - 650 \times 660}{\sqrt{10 \times 47648 - (650)^2} \cdot \sqrt{10 \times 45784 - (660)^2}} = \dfrac{27040}{\sqrt{53980 \times 22240}}$
$r = 0.7804$
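That the three formulas agree can be confirmed by computing all of them on the same ten pairs. In this sketch the assumed means 70 and 60 are inferred from the tabulated deviations (39 - 70 = -31, 47 - 60 = -13), and the helper names are illustrative.

```python
from math import sqrt

X = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
Y = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]
n = len(X)

def sums(a, b):
    """Return (sum a, sum b, sum a^2, sum b^2, sum ab)."""
    return (sum(a), sum(b), sum(p * p for p in a),
            sum(q * q for q in b), sum(p * q for p, q in zip(a, b)))

def r_from_sums(n, sa, sb, saa, sbb, sab):
    """Pearson r from sums; valid for raw values or for deviations from any origin."""
    return (n * sab - sa * sb) / sqrt((n * saa - sa ** 2) * (n * sbb - sb ** 2))

# Method 1: deviations from the actual means (65 and 66).
x = [xi - sum(X) / n for xi in X]
y = [yi - sum(Y) / n for yi in Y]
_, _, sxx, syy, sxy = sums(x, y)
r1 = sxy / sqrt(sxx * syy)

# Method 2: deviations from assumed means A = 70 and B = 60.
dx = [xi - 70 for xi in X]
dy = [yi - 60 for yi in Y]
r2 = r_from_sums(n, *sums(dx, dy))

# Method 3: raw values.
r3 = r_from_sums(n, *sums(X, Y))
print(round(r1, 4), round(r2, 4), round(r3, 4))
```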
Problem No 4:
From the following data calculate n: correlation coefficient = 0.8; sum of the products of
deviations $\sum xy = 60$; SD of y = 2.5; $\sum x^2 = 90$, where x & y are the deviations
from their arithmetic means.
Answer:
$r = \dfrac{\mathrm{Cov}(x, y)}{SD(x) \cdot SD(y)}$
$0.8 = \dfrac{60/n}{\sqrt{90/n} \times 2.5}$
Squaring both sides:
$0.8 \times 0.8 = \dfrac{(60/n)^2}{(90/n) \times 2.5 \times 2.5} = \dfrac{3600}{562.5\,n}$
$n = \dfrac{3600}{0.64 \times 562.5} = 10$
Problem 5:
A computer, while calculating the correlation coefficient between x & y from 25 pairs of
observations, obtained $\sum X = 125$, $\sum X^2 = 650$, $\sum Y = 100$, $\sum Y^2 = 460$ and
$\sum XY = 508$. Later it was observed that two pairs of observations were taken as (6, 14) and
(8, 6) instead of (8, 12) and (6, 8). Prove that the correct correlation coefficient is 0.67.
Answer:
Deduct the contribution of the wrong pairs (6, 14) and (8, 6) from each sum and add that of
the correct pairs (8, 12) and (6, 8):
$\sum X = 125 - 6 - 8 + 8 + 6 = 125$
$\sum Y = 100 - 14 - 6 + 12 + 8 = 100$
$\sum X^2 = 650 - 36 - 64 + 64 + 36 = 650$
$\sum Y^2 = 460 - 196 - 36 + 144 + 64 = 436$
$\sum XY = 508 - 84 - 48 + 96 + 48 = 520$
Now apply the corrected sums in the formula:
$r = \dfrac{25 \times 520 - 125 \times 100}{\sqrt{25 \times 650 - (125)^2} \cdot \sqrt{25 \times 436 - (100)^2}} = \dfrac{500}{25 \times 30} = \dfrac{2}{3} = 0.67$
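The deduct-and-add correction described in this answer can be sketched as below. The dictionary keys and the `adjust` helper are hypothetical names chosen for illustration.

```python
from math import sqrt

n = 25
# Sums as originally (wrongly) computed.
sums = {"x": 125, "y": 100, "xx": 650, "yy": 460, "xy": 508}
wrong = [(6, 14), (8, 6)]
right = [(8, 12), (6, 8)]

def adjust(sums, pairs, sign):
    """Add (sign=+1) or remove (sign=-1) the contribution of each pair."""
    for x, y in pairs:
        sums["x"] += sign * x
        sums["y"] += sign * y
        sums["xx"] += sign * x * x
        sums["yy"] += sign * y * y
        sums["xy"] += sign * x * y

adjust(sums, wrong, -1)   # deduct the wrong pairs
adjust(sums, right, +1)   # add the correct pairs

num = n * sums["xy"] - sums["x"] * sums["y"]
den = sqrt((n * sums["xx"] - sums["x"] ** 2) * (n * sums["yy"] - sums["y"] ** 2))
r = num / den
print(sums, round(r, 2))
```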
Problem 6:
If the relation between two random variables x & y is 2x+3y=4, then the correlation
coefficient is:
Answer:
-1, by the property on linear relations: here a = 2 and b = 3 have like signs.
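Problem 7:
In two sets of variables X & Y with 50 observations each, the following data were observed:
AM of X is 10; SD of X is 3; AM of Y is 6; SD of Y is 2; coefficient of correlation is 0.3.
However, after subsequent verification one pair (10, 6) was weeded out. What is the change
in the correlation coefficient with the remaining 49 pairs of values?
Answer:
As in Problem 5, first find all the terms in the formula from the given values:
$\sum X = 50 \times 10 = 500$, $\sum Y = 50 \times 6 = 300$
$\sum X^2 = 50\,(3^2 + 10^2) = 5450$, $\sum Y^2 = 50\,(2^2 + 6^2) = 2000$
$\sum XY = 50\,(0.3 \times 3 \times 2 + 10 \times 6) = 3090$
Then deduct the pair (10, 6) from those terms, with n = 49:
$\sum X = 490$, $\sum Y = 294$, $\sum X^2 = 5350$, $\sum Y^2 = 1964$, $\sum XY = 3030$
$r = \dfrac{49 \times 3030 - 490 \times 294}{\sqrt{49 \times 5350 - (490)^2} \cdot \sqrt{49 \times 1964 - (294)^2}} = \dfrac{4410}{\sqrt{22050 \times 9800}} = 0.3$
The removed pair coincides with the two means, so it contributes nothing to the deviations
and the correlation coefficient is unchanged.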
PROBABLE ERROR
After computing the value of the correlation coefficient, the next step is to find the extent
to which it is dependable. The probable error of the correlation coefficient, usually denoted
by P.E.(r), is an old measure of testing the reliability of an observed value of the correlation
coefficient in so far as it depends upon the conditions of random sampling.
If r is the observed correlation coefficient in a sample of n pairs of observations, then its
standard error, usually denoted by S.E.(r), is given by
$SE(r) = \dfrac{1 - r^2}{\sqrt{n}}$
and the probable error is
$PE(r) = 0.6745 \times SE(r)$
The reason for taking the factor 0.6745 is that in a normal distribution 50% of the
observations lie in the range $\mu \pm 0.6745\,\sigma$, where $\sigma$ is the s.d.
According to Secrist, "The probable error of the correlation coefficient is an amount which
if added to and subtracted from the mean correlation coefficient, produces amounts within
which the chances are even that a coefficient of correlation from a series selected at random
will fall."
Uses of probable error
The probable error of the correlation coefficient may be used to determine the limits within
which the population correlation coefficient may be expected to lie.
Limits for the population correlation coefficient are
1. $r \pm P.E.(r)$: This implies that if we take another random sample of the same size n
from the same population from which the first sample was taken, then the
observed value of the correlation coefficient, say $r_1$, in the second sample can be
expected to lie within the limits given.
2. P.E.(r) may be used to test if an observed value of the sample correlation coefficient is
significant of any correlation in the population. The following guidelines may be
used:
a. If r < P.E.(r), i.e., if the observed value of r is less than its P.E., then the
correlation is not at all significant.
b. If r > 6 P.E.(r), i.e., if the observed value of r is greater than 6 times its P.E.,
then r is definitely significant.
c. In other situations nothing can be concluded with certainty.
Important Remarks 1: Sometimes P.E. may lead to fallacious conclusions, particularly when
n, the number of pairs of observations, is small. In order to use P.E. effectively, n should be
fairly large. However, a rigorous test of the significance of an observed sample
correlation coefficient is provided by Student's t-test.
Important Remarks 2: P.E. can be used only under the following conditions
a. The data must have been drawn from a normal population.
b. The conditions of random sampling should prevail in selecting the sampled
observations.
In short: r < PE (r) means r is not at all significant; r > 6 PE (r) means r is significant; in other
cases nothing can be concluded with certainty.
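The rule can be sketched as a small function. The values r = 0.9 and n = 10 below are illustrative, and applying the rule to the absolute value of r (so that negative coefficients are handled) is an assumption of this sketch, not something stated in the notes.

```python
from math import sqrt

def probable_error(r, n):
    """PE(r) = 0.6745 * SE(r), with SE(r) = (1 - r^2) / sqrt(n)."""
    se = (1 - r ** 2) / sqrt(n)
    return 0.6745 * se

def significance(r, n):
    pe = probable_error(r, n)
    if abs(r) < pe:          # assumption: compare |r| so negative r is covered
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"

# Illustrative values: r = 0.9 from n = 10 pairs.
pe = probable_error(0.9, 10)
limits = (0.9 - pe, 0.9 + pe)   # r +/- PE(r)
verdict = significance(0.9, 10)
print(round(pe, 4), verdict)
```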
Problem 1: Comment whether the correlation coefficient is significant or not.

  X     Y    dx = (X-60)/5   dy = (Y-65)/5   dx²    dy²    dxdy
  45    35       -3              -6            9     36     18
  70    90        2               5            4     25     10
  65    70        1               1            1      1      1
  30    40       -6              -5           36     25     30
  90    95        6               6           36     36     36
  40    40       -4              -5           16     25     20
  50    60       -2              -1            4      1      2
  75    80        3               3            9      9      9
  85    80        5               3           25      9     15
  60    50        0              -3            0      9      0
Total             2              -2          140    176    141

Working Note:
$r = \dfrac{n\sum dx\,dy - \sum dx \cdot \sum dy}{\sqrt{n\sum dx^2 - (\sum dx)^2} \cdot \sqrt{n\sum dy^2 - (\sum dy)^2}}$
$r = \dfrac{10 \times 141 - (2)(-2)}{\sqrt{10 \times 140 - (2)^2} \cdot \sqrt{10 \times 176 - (-2)^2}}$
$r = 0.90$
$SE(r) = \dfrac{1 - r^2}{\sqrt{n}} = \dfrac{1 - (0.9)^2}{\sqrt{10}} = 0.0600$
$PE(r) = SE(r) \times 0.6745 = 0.06 \times 0.6745 = 0.0405$
$0.9 > 6\,PE(r)$ [i.e., 0.2432]
r is highly significant.

CORRELATION IN BIVARIATE FREQUENCY TABLE
If in a bivariate distribution the data are fairly large, they may be summarized in the form
of a two-way table. Here, for each variable, the values are grouped into various classes (not
necessarily the same for both the variables), keeping in view the same considerations as in
the case of a univariate distribution. For example, if there are m classes for the X variable
series and n classes for the Y variable series, then there will be m x n cells in the two-way
table. By going through the different pairs of values (x, y) and using tally marks
we can find the frequency for each cell and thus obtain the so-called bivariate frequency
table.
Food Expenditure              Family Income (Rs.)
(in %)            200-300  300-400  400-500  500-600  600-700
10-15                -        -        -        3        7
15-20                -        4        9        4        3
20-25                7        6       12        5        -
25-30                3       10       19        8        -

$r = \dfrac{N\sum fxy - (\sum fx)(\sum fy)}{\sqrt{[N\sum fx^2 - (\sum fx)^2]\,[N\sum fy^2 - (\sum fy)^2]}}$
or, in terms of step deviations u and v,
$r = \dfrac{N\sum fuv - (\sum fu)(\sum fv)}{\sqrt{[N\sum fu^2 - (\sum fu)^2]\,[N\sum fv^2 - (\sum fv)^2]}}$

With u = (x - 450)/100 and v = (y - 17.5)/5, where x and y are the class mid-values:

                      x:   250   350   450   550   650
                      u:    -2    -1     0     1     2
 CI       y      v                                        f     fv    fv²    fuv
10-15    12.5   -1          -     -     -     3     7     10    -10    10    -17
15-20    17.5    0          -     4     9     4     3     20      0     0      0
20-25    22.5    1          7     6    12     5     -     30     30    30    -15
25-30    27.5    2          3    10    19     8     -     40     80   160    -16
 f                         10    20    40    20    10    100    100   200    -48
 fu                       -20   -20     0    20    20      0
 fu²                       40    20     0    20    40    120
 fuv                      -26   -26     0    18   -14    -48

$r = \dfrac{100 \times (-48) - 0 \times 100}{\sqrt{[(100 \times 120) - 0]\,[(100 \times 200) - (100)^2]}}$
r = -0.4381

Problem 2:

Marks                 Age in Years
             18     19     20     21     22
22.5          3      2      -      -      -
17.5          -      5      4      -      -
12.5          -      -      7     10      -
7.5           -      -      -      3      2
2.5           -      -      -      3      1
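Both two-way tables above can be checked with a short script that applies the step-deviation formula directly to the cell frequencies. This is a sketch: the function name `grouped_r` is hypothetical, dashes are entered as zero frequencies, and the step deviations (u for columns, v for rows) are the ones implied by the class mid-values, with the middle income class and the 15-20 row as origins for the first table and age 20 and marks 12.5 as origins for the second.

```python
from math import sqrt

def grouped_r(table, u, v):
    """table[i][j] = frequency in row i (step v[i]) and column j (step u[j])."""
    n = sum(map(sum, table))
    row_f = [sum(row) for row in table]          # f for each row
    col_f = [sum(col) for col in zip(*table)]    # f for each column
    s_fu = sum(f * uj for f, uj in zip(col_f, u))
    s_fv = sum(f * vi for f, vi in zip(row_f, v))
    s_fu2 = sum(f * uj * uj for f, uj in zip(col_f, u))
    s_fv2 = sum(f * vi * vi for f, vi in zip(row_f, v))
    s_fuv = sum(f * uj * vi
                for row, vi in zip(table, v)
                for f, uj in zip(row, u))
    return (n * s_fuv - s_fu * s_fv) / sqrt(
        (n * s_fu2 - s_fu ** 2) * (n * s_fv2 - s_fv ** 2))

u = [-2, -1, 0, 1, 2]
# Food expenditure (rows) against family income (columns).
food = [[0, 0, 0, 3, 7], [0, 4, 9, 4, 3], [7, 6, 12, 5, 0], [3, 10, 19, 8, 0]]
# Marks (rows, from 22.5 down to 2.5) against age (columns, 18 to 22).
marks = [[3, 2, 0, 0, 0], [0, 5, 4, 0, 0], [0, 0, 7, 10, 0],
         [0, 0, 0, 3, 2], [0, 0, 0, 3, 1]]

r_food = grouped_r(food, u, v=[-1, 0, 1, 2])
r_marks = grouped_r(marks, u, v=[2, 1, 0, -1, -2])
print(round(r_food, 3), round(r_marks, 3))
```

Because r is unaffected by a change of origin and scale, any other choice of step deviations would give the same two coefficients.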
Working for Problem 2 (marks against age), with u = x - 20 and v = (y - 12.5)/5:

                 x:    18    19    20    21    22
                 u:    -2    -1     0     1     2
   y      v                                          f     fv    fv²    fuv
 22.5     2             3     2     -     -     -     5     10    20    -16
 17.5     1             -     5     4     -     -     9      9     9     -5
 12.5     0             -     -     7    10     -    17      0     0      0
  7.5    -1             -     -     -     3     2     5     -5     5     -7
  2.5    -2             -     -     -     3     1     4     -8    16    -10
  f                     3     7    11    16     3    40      6    50    -38
  fu                   -6    -7     0    16     6     9
  fu²                  12     7     0    16    12    47
  fuv                 -12    -9     0    -9    -8   -38

$r = \dfrac{40 \times (-38) - 9 \times 6}{\sqrt{[(40 \times 47) - (9)^2]\,[(40 \times 50) - (6)^2]}}$
r = -0.8373

RANK CORRELATION METHOD
Sometimes we come across statistical series in which the variables under consideration are
not capable of quantitative measurement but can be arranged in a serial order. This
happens when we are dealing with qualitative characteristics (attributes) such as honesty,
beauty, character, morality, etc., which cannot be measured quantitatively but can be
arranged serially. In such situations Karl Pearson's coefficient of correlation cannot be
used as such. Charles Edward Spearman, a British psychologist, developed a formula in
1904 which consists in obtaining the correlation coefficient between the ranks of n
individuals in the two attributes under study.
The Pearson correlation coefficient between the ranks X and Y is called the rank
correlation coefficient between the characteristics A and B for that group of individuals.
For example, students may be assigned ranks in Statistics according to their marks in Statistics, and ranks in Mathematics according to their marks in Mathematics. The correlation between these two sets of ranks is called rank correlation, and the coefficient of correlation computed for these ranks is called Spearman's coefficient of rank correlation.
In general, if the values of the variables in bivariate data are ranked in decreasing (or increasing) order, the correlation between these ranks is rank correlation. Spearman's coefficient of rank correlation is denoted by ρ (rho).
If R1 and R2 are the ranks in the two characteristics, and d = R1 - R2 is the difference between the ranks, the coefficient of rank correlation is

\[ \rho = 1 - \frac{6\sum d^2}{n^3 - n} \]

Since ρ is the product moment coefficient of correlation between the ranks, it is a value between -1 and +1.
Karl Pearson's coefficient of correlation can be calculated only if the characteristics under study are quantitative (they should be numerically measurable), but Spearman's coefficient of rank correlation can be calculated even if the characteristics under study are qualitative. If it is possible to assign ranks to the units with regard to the two characteristics, the coefficient of rank correlation can be calculated.
REPEATED RANKS
In the case of attributes, if there is a tie (i.e. two or more individuals are placed together in any classification with respect to an attribute), or if in the case of variable data there is more than one item with the same value in either or both series, then Spearman's formula for calculating the rank correlation coefficient breaks down. In this case the variables X (the ranks of individuals in characteristic A, the 1st series) and Y (the ranks of individuals in characteristic B, the 2nd series) do not each take the values from 1 to n, which was assumed in deriving the formula for ρ.
In such a case, all the values which are equal are assigned the same average rank, and then the coefficient of rank correlation is found. Corresponding to every such repeated rank (which repeats m times), a factor (m^3 - m)/12 is added to Σd^2.
The common rank is the arithmetic mean of the ranks which the tied items would have got if they were different from each other, and the next item gets the rank next to the ranks used in computing the common rank. For example, suppose two items tie at rank 4. Then the common rank assigned to each item is (4 + 5)/2, i.e. 4.5, which is the average of 4 and 5, the ranks these observations would have assumed if they were different. The next item will be assigned rank 6. If an item is repeated thrice at rank 7, then the common rank assigned to each value will be (7 + 8 + 9)/3, i.e. 8, which is the arithmetic mean of 7, 8 and 9, the ranks these observations would have got if they were different from each other. The next rank to be assigned will be 10.
If only a small proportion of the ranks are tied, this technique may be applied together with the ordinary formula. If a large proportion of ranks are tied, it is advisable to apply an adjustment or correction factor as follows: in the formula, add the factor m(m^2 - 1)/12 to Σd^2, where m is the number of times an item is repeated. This correction factor is to be added for each repeated value in both the series.
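The tie-handling rule above can be sketched as follows. This is an illustrative implementation, not the notes' own: the helper names `average_ranks`, `tie_correction` and `spearman_rho` are assumptions, and the ten (X, Y) pairs are those of the tied-ranks worked problem in this lesson (X contains 16 three times; Y contains 13 and 6 twice each).

```python
def average_ranks(values):
    """Rank in decreasing order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    counts = {}
    for x in values:
        counts[x] = counts.get(x, 0) + 1
    return sum(m * (m * m - 1) / 12 for m in counts.values() if m > 1)

def spearman_rho(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    sum_d2 += tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

X = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
Y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]
rho = spearman_rho(X, Y)
print(round(rho, 4))  # 0.7333
```

When no value is repeated, `tie_correction` returns 0 and the function reduces to the ordinary formula 1 - 6Σd^2 / (n(n^2 - 1)).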
REMARKS ON SPEARMAN'S RANK CORRELATION COEFFICIENT
1. Since Spearman's rank correlation coefficient is nothing but Pearson's correlation coefficient between the ranks, it can be interpreted in the same way as Karl Pearson's correlation coefficient.
2. Karl Pearson's correlation coefficient assumes that the parent population from which the sample observations are drawn is normal. If this assumption is violated, then we need a measure which is distribution free (or non-parametric). A distribution free measure is one which does not make any assumptions about the form of the population. Spearman's ρ is such a measure, since no strict assumptions are made about the form of the population from which the sample observations are drawn.
3. Spearman's formula is easy to understand and apply as compared with Karl Pearson's formula. The values obtained by the two formulae, viz. Pearsonian r and Spearman's ρ, are generally different. The difference arises because, when ranks are used instead of the full set of observations, there is always some loss of information. Unless many ties exist, the coefficient of rank correlation is usually slightly lower than the Pearsonian coefficient.
4. Spearman's formula is the only formula to be used for finding the correlation coefficient if we are dealing with qualitative characteristics which cannot be measured quantitatively but can be arranged serially. It can also be used where actual data are given. In the case of extreme observations, Spearman's formula is preferred to Pearson's formula.
5. Spearman's formula has its limitations also. It is not practicable in the case of a bivariate frequency distribution. For n > 30, this formula should not be used unless the ranks are given, since otherwise the calculations are quite time consuming.

When ranks are not repeated:

\[ \rho = 1 - \frac{6\sum D^2}{n(n^2 - 1)} \]

When ranks are repeated:

\[ \rho = 1 - \frac{6\left[\sum D^2 + \sum \frac{m(m^2 - 1)}{12}\right]}{n(n^2 - 1)} \]

Problem 1:

Rank in A (x)   Rank in B (y)   D = x - y   D^2
      1              10            -9        81
      2               7            -5        25
      3               2             1         1
      4               6            -2         4
      5               4             1         1
      6               8            -2         4
      7               3             4        16
      8               1             7        49
      9               9             0         0
     10               5             5        25
                                   Total    206

\[ \rho = 1 - \frac{6 \times 206}{10(100 - 1)} = -0.2485 \approx -0.25 \]

Problem 2:

Cost   Sales   X (rank of cost)   Y (rank of sales)    D    D^2
 39     47            8                  10            -2     4
 65     53            6                   8            -2     4
 62     58            7                   7             0     0
 90     86            2                   2             0     0
 82     62            3                   5            -2     4
 75     68            5                   4             1     1
 25     60           10                   6             4    16
 98     91            1                   1             0     0
 36     51            9                   9             0     0
 78     84            4                   3             1     1
                                                     Total   30

\[ \rho = 1 - \frac{6 \times 30}{10(100 - 1)} = 0.82 \]

Problem 3:

 X     Y     R1     R2      D      D^2
48    13      3     5.5   -2.5    6.25
33    13      5     5.5   -0.5    0.25
40    24      4     1      3      9
 9     6     10     8.5    1.5    2.25
16    15      8     4      4     16
16     4      8    10     -2      4
65    20      1     2     -1      1
24     9      6     7     -1      1
16     6      8     8.5   -0.5    0.25
57    19      2     3     -1      1
                          Total  41

In X the value 16 occurs three times (m = 3, correction 2); in Y the values 13 and 6 each occur twice (m = 2, correction 0.5 each).

\[ \rho = 1 - \frac{6\left(\sum D^2 + \sum \frac{m(m^2 - 1)}{12}\right)}{n(n^2 - 1)} = 1 - \frac{6(41 + 2 + 0.5 + 0.5)}{10(10^2 - 1)} = 0.7333 \]

ALGEBRAIC METHOD (CONCURRENT DEVIATIONS)
This is a quick, rough method of determining the correlation between two series when we are not very serious about its precision. It is based on the signs of the deviations (i.e. the direction of the change) of the values of the variable from its preceding value and does not take into account the exact magnitude of the values of the variable. Thus, we put a plus (+) sign, minus (-) sign, or equality (=) sign for the deviation if the value of the variable is greater than, less than or equal to the preceding value respectively. The deviations in the values of the two variables are said to be concurrent if they have the same sign, i.e. either both deviations are positive, both are negative or both are equal. The formula used for computing the correlation coefficient r by this method is given by
\[ r = \pm\sqrt{\pm\frac{2c - n}{n}} \]

where c is the number of pairs of concurrent deviations and n is the number of pairs of deviations. In the formula, the plus/minus sign to be taken inside and outside the square root is of fundamental importance.
Since -1 ≤ r ≤ 1, the quantity inside the square root, viz. ±(2c - n)/n, must be positive, otherwise r will be imaginary, which is not possible. Thus, if (2c - n) is positive, we take the positive sign inside and outside the square root, and if (2c - n) is negative, we take the negative sign inside and outside the square root.
Remark 1: It should be clearly noted that here n is not the number of pairs of observations but the number of pairs of deviations, and as such it is one less than the number of pairs of observations.
Remark 2: r computed by this formula is also known as the coefficient of concurrent deviations.
Remark 3: The coefficient of concurrent deviations is primarily based on the following principle:
If the short-time fluctuations of the time series are positively correlated or, in other words, if their deviations are concurrent, their curves would move in the same direction and would indicate positive correlation between them.

Year   Supply   x   Price   y   xy
1993    160          292
1994    164     +    280    -    -
1995    172     +    260    -    -
1996    182     +    234    -    -
1997    166     -    266    +    -
1998    170     +    254    -    -
1999    178     +    230    -    -
2000    192     +    190    -    -
2001    186     -    200    +    -

Here c = 0 and n = 8, so (2c - n) is negative and the negative sign is taken inside and outside the square root:

\[ r = -\sqrt{-\frac{2c - n}{n}} = -\sqrt{-\frac{0 - 8}{8}} = -1 \]
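The sign rule is easy to get wrong by hand, so here is a small sketch of the concurrent deviations calculation. The function names are illustrative assumptions; the data are the supply and price figures for 1993 to 2001 from the worked example.

```python
from math import sqrt

def sign(d):
    """Return +1, -1 or 0 for a positive, negative or zero deviation."""
    return (d > 0) - (d < 0)

def concurrent_deviation_coefficient(xs, ys):
    # Sign of each value's change from the preceding value
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(dx)                                   # number of pairs of deviations
    c = sum(1 for a, b in zip(dx, dy) if a == b)  # concurrent pairs
    ratio = (2 * c - n) / n
    # Same sign inside and outside the square root
    return sqrt(ratio) if ratio >= 0 else -sqrt(-ratio)

supply = [160, 164, 172, 182, 166, 170, 178, 192, 186]
price = [292, 280, 260, 234, 266, 254, 230, 190, 200]
r_cd = concurrent_deviation_coefficient(supply, price)
print(r_cd)  # -1.0
```

Note that nine observations give only eight deviations, which is why n = 8 in the hand calculation.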
REGRESSION
Literally the word regression means return to the origin. In statistics, the word is used in a
different sense. If two variables are correlated, the unknown value of one of the variables
can be estimated by using the known value of the other variable. The so estimated value
may not be equal to the actually observed value, but it will be close to the actual value.
Regression Analysis, in general sense, means the estimation or prediction of the unknown
value of one variable from the known value of the other variable.
The Regression Analysis confined to the study of only two variables at a time is termed as
Simple Regression. But quite often the values of a particular phenomenon may be affected
by multiplicity of causes. The Regression analysis for studying more than two variables at
a time is known as Multiple Regression.
In Regression Analysis there are two types of variables. The variable whose value is influenced or is to be predicted is called the dependent variable. The variable which influences the values or is used for prediction is called the independent variable. In Regression Analysis the independent variable is also known as the regressor, predictor or explanator, while the dependent variable is also known as the regressed or explained variable.
LINEAR & NON-LINEAR REGRESSION
If the given bivariate data are plotted on a graph, the points so obtained on the diagram will
more or less concentrate around a curve, called the Curve of Regression. The
mathematical equation of the Regression curve, is called the Regression Equation. If the
regression curve is a straight line, we say that there is linear regression between the
variables under study. If the curve of regression is not a straight line, the regression is
termed as curved or non-linear regression.
The property of the tendency of the actual value to lie close to the estimated value is called
regression. In a wider usage regression is the theory of estimation of unknown value of a
variable with the help of known values of the variables. The regression theory was first
introduced and developed by Sir Francis Galton in the field of Genetics.
Here, firstly, a mathematical relation between the two variables is framed. This relation, which is called the regression equation, is obtained by the method of least squares. It may be linear or non-linear.
For bivariate data on x and y, the regression equation obtained with the assumption that x is dependent on y is called the regression of x on y. The regression of x on y is:

\[ (x - \bar{x}) = b_{xy}\,(y - \bar{y}) \]

The regression equation obtained with the assumption that y is dependent on x is called the regression of y on x. The regression of y on x is:

\[ (y - \bar{y}) = b_{yx}\,(x - \bar{x}) \]

Here \bar{x} and \bar{y} are the arithmetic means (AM) of x and y, and the regression coefficients b_{xy} and b_{yx} are given by any of the following sets of formulas:

\[ b_{xy} = r\,\frac{\sigma_x}{\sigma_y}, \qquad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \]

\[ b_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_y^2}, \qquad b_{yx} = \frac{\mathrm{Cov}(x, y)}{\sigma_x^2} \]

\[ b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2}, \qquad b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \]

\[ b_{xy} = \frac{\sum dx\,dy}{\sum dy^2}, \qquad b_{yx} = \frac{\sum dx\,dy}{\sum dx^2}, \qquad dx = x - \bar{x},\ dy = y - \bar{y} \]

The regression of x on y is used for the estimation of x values and the regression of y on x is used for the estimation of y values. The graphs of the regression equations are the regression lines.
PROPERTIES OF REGRESSION
Regression coefficients are the coefficients of the independent variables in the regression equations.
1. The regression coefficient b_{xy} is the change occurring in x for a unit change in y. The regression coefficient b_{yx} is the change occurring in y for a unit change in x.
2. The regression coefficients are independent of the origin of measurement of the variables, but they are dependent on the scale.
3. The geometric mean of the regression coefficients is equal to the coefficient of correlation (numerically).
4. The regression coefficients cannot be of opposite signs. If r is positive, both the regression coefficients will be positive. If r is negative, both the regression coefficients will be negative. If r is zero, both the regression coefficients will be zero.
5. Since the coefficient of correlation cannot numerically be greater than 1, the product of the regression coefficients cannot be greater than 1.
PROPERTIES OF REGRESSION LINES
There are two regression lines.
1. The regression lines intersect at (\bar{x}, \bar{y}).
2. The regression lines have positive slope if the variables are positively correlated. They have negative slope if the variables are negatively correlated.
3. If there is perfect correlation, the regression lines coincide (there will be only one regression line).
LINES OF REGRESSION
A line of regression is the line which gives the best estimate of one variable for any given value of the other variable. In the case of two variables, say x and y, we shall have two regression equations: one of x on y and the other of y on x.
The line of regression of y on x is the line which gives the best estimate for the value of y for any specified value of x.
The line of regression of x on y is the line which gives the best estimate for the value of x for any specified value of y.
LINE OF REGRESSION OF y on x

\[ (y - \bar{y}) = r\,\frac{\sigma_y}{\sigma_x}\,(x - \bar{x}) \]

LINE OF REGRESSION OF x on y

\[ (x - \bar{x}) = r\,\frac{\sigma_x}{\sigma_y}\,(y - \bar{y}) \]

REMEMBER
a. When r = 0, i.e. when x and y are uncorrelated, the lines of regression of y on x and of x on y are given by y - \bar{y} = 0 and x - \bar{x} = 0. The lines are perpendicular to each other.
b. When r = ±1, the two lines coincide.
c. If the value of r is significant, we can use the lines of regression for estimation and prediction.
d. If r is not significant, then the linear model is not a good fit and hence the line of regression should not be used for prediction.
COEFFICIENTS OF REGRESSION
a. b_{xy} is the coefficient of regression of x on y.
b. b_{yx} is the coefficient of regression of y on x.
THEOREMS ON REGRESSION COEFFICIENTS
a. The correlation coefficient is the geometric mean between the regression coefficients, i.e. r^2 = b_{xy} b_{yx}.
b. The sign to be taken before the square root is the same as that of the regression coefficients.
c. If one of the regression coefficients is numerically greater than one, then the other must be numerically less than one.
d. The arithmetic mean of the absolute values of the regression coefficients is greater than or equal to the absolute value of the correlation coefficient.
e. Regression coefficients are independent of change of origin but not of scale.
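Problem 1:

  X     Y    dx = X - X̄   dy = Y - Ȳ    dx^2    dy^2   dx dy
  91    71        1            1           1       1       1
  97    75        7            5          49      25      35
 108    69       18           -1         324       1     -18
 121    97       31           27         961     729     837
  67    70      -23            0         529       0       0
 124    91       34           21        1156     441     714
  51    39      -39          -31        1521     961    1209
  73    61      -17           -9         289      81     153
 111    80       21           10         441     100     210
  57    47      -33          -23        1089     529     759
 900   700        0            0        6360    2868    3900

Answer:
Here X̄ = 900/10 = 90 and Ȳ = 700/10 = 70.

\[ b_{xy} = \frac{\sum dx\,dy}{\sum dy^2} = \frac{3900}{2868} = 1.3598, \qquad b_{yx} = \frac{\sum dx\,dy}{\sum dx^2} = \frac{3900}{6360} = 0.6132 \]

Regression of x on y, (x - \bar{x}) = b_{xy}(y - \bar{y}):
(x - 90) = 1.3598 (y - 70)
x = 1.3598y - 5.19
Regression of y on x, (y - \bar{y}) = b_{yx}(x - \bar{x}):
(y - 70) = 0.6132 (x - 90)
y = 0.6132x + 14.812

Problem 2:
The data about the sales and advertisement expenditure of a firm are given below:

                       Sales    Advertisement Expenditure
Means                   40              6
Standard Deviations     10              1.5

The coefficient of correlation is 0.9.
o Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.
o What should be the advertisement expenditure if the firm proposes a sales target of 60 crores of rupees?
Answer:
Let x denote sales and y the advertisement expenditure.

\[ b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{0.9 \times 10}{1.5} = 6, \qquad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{0.9 \times 1.5}{10} = 0.135 \]

(x - 40) = 6 (y - 6)
x = 6y + 4
For y = 10: x = 6*10 + 4 = 64, so the likely sales are Rs. 64 crores.
(y - 6) = 0.135 (x - 40)
y = 0.135x + 0.6
For x = 60: y = 0.135*60 + 0.6 = 8.7, so the advertisement expenditure should be Rs. 8.7 crores.

Problem 3:
Point out the inconsistency, if any, in the following statement: "The regression equation of y on x is 2y + 3x = 4 and the correlation coefficient between x and y is 0.8."
Answer:
Refer to the properties of regression coefficients. The equation gives y = 2 - 1.5x, so b_{yx} = -1.5 is negative, whereas r = 0.8 is positive. Since the regression coefficients and r must have the same sign, the statement is inconsistent.

Problem 4:
By using the following data, find the two lines of regression and from them compute Karl Pearson's coefficient of correlation: ΣX = 250; ΣY = 300; ΣXY = 7900; ΣX^2 = 6500; ΣY^2 = 10000; n = 10.
Answer:

\[ b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{10(7900) - 250(300)}{10(10000) - (300)^2} = 0.4 \]

\[ b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(7900) - 250(300)}{10(6500) - (250)^2} = 1.6 \]

With \bar{x} = 250/10 = 25 and \bar{y} = 300/10 = 30, the lines of regression are:
(x - 25) = 0.4 (y - 30), i.e. x = 0.4y + 13
(y - 30) = 1.6 (x - 25), i.e. y = 1.6x - 10

\[ r_{xy}^2 = b_{xy}\,b_{yx} = 0.4 \times 1.6 = 0.64, \qquad r_{xy} = 0.8 \]

Problem 5:
Find the two regression coefficients and hence r, given n = 5; X̄ = 10; Ȳ = 20; Σ(X - 4)^2 = 100; Σ(Y - 10)^2 = 160; Σ(X - 4)(Y - 10) = 80.
Answer:
Let U = X - 4 and V = Y - 10. Then Ū = X̄ - 4 = 6, so ΣU = nŪ = 30. Similarly ΣV = 50.

\[ b_{yx} = \frac{n\sum UV - \sum U \sum V}{n\sum U^2 - (\sum U)^2} = \frac{5(80) - 30(50)}{5(100) - (30)^2} = \frac{11}{4} \]

\[ b_{xy} = \frac{n\sum UV - \sum U \sum V}{n\sum V^2 - (\sum V)^2} = \frac{5(80) - 30(50)}{5(160) - (50)^2} = \frac{11}{17} \]

\[ r = \sqrt{\frac{11}{4} \times \frac{11}{17}} = 1.33 \]

This is impossible, since r cannot be numerically greater than 1; the given data are therefore inconsistent.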
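A short sketch of the deviation-based regression formulas, b_xy = Σdx·dy / Σdy^2 and b_yx = Σdx·dy / Σdx^2, applied to the ten (X, Y) pairs of Problem 1 in the regression problems. The function name and the returned dictionary layout are illustrative assumptions. The third X value is taken as 108, which is the value consistent with dx = 18 and ΣX = 900 in the working table.

```python
from math import sqrt

def regression_lines(xs, ys):
    """Both regression lines, each returned as (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_dxdy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s_dx2 = sum((x - mx) ** 2 for x in xs)
    s_dy2 = sum((y - my) ** 2 for y in ys)
    b_yx = s_dxdy / s_dx2  # change in y per unit change in x
    b_xy = s_dxdy / s_dy2  # change in x per unit change in y
    return {
        "b_yx": b_yx,
        "b_xy": b_xy,
        "y_on_x": (my - b_yx * mx, b_yx),  # y = intercept + slope * x
        "x_on_y": (mx - b_xy * my, b_xy),  # x = intercept + slope * y
    }

X = [91, 97, 108, 121, 67, 124, 51, 73, 111, 57]
Y = [71, 75, 69, 97, 70, 91, 39, 61, 80, 47]
fit = regression_lines(X, Y)
# Both coefficients are positive here, so r takes the positive root
r_est = sqrt(fit["b_xy"] * fit["b_yx"])
print(round(fit["b_yx"], 4), round(fit["b_xy"], 4), round(r_est, 3))
```

Taking the square root of the product of the two coefficients recovers r, with the sign of the coefficients, which is the geometric-mean theorem stated earlier.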
Time Series
Generally, planning of economic and business activities is based on predictions of
production, demand, sales etc. The future can be predicted by a detailed study of the past
variations. Thus, future demand can be predicted by studying the variations in the demand
for last few years. A time series may be defined as a collection of readings belonging to
different time periods, of some economic variable or composite of variables.
A series of observations of a phenomenon recorded at successive points of time is called
Time Series. It is a chronological arrangement of statistical data regarding the
phenomenon. Generally, time series are those of production, demand, sales, price, imports,
exports, bank rate, value of money, etc. Usually in time series equidistant points of time are
considered. There may be weekly, monthly, yearly, etc recordings. A graphical
presentation of a time series is called Historigram.
COMPONENTS OF A TIME SERIES
In a time series, the observations vary with time. The variation occurring in any period is the result of many factors. The effects of these factors may be summed up as four components. They are
a. Trend (Secular Trend, Long Term Movement)
b. Seasonal Variation
c. Cyclical Variation (Business Cycle)
d. Irregular Variation (Random Fluctuation, Erratic Variation)
An analytical study of the different components of a time series, the effects of these components, etc. is called analysis of time series.
The utility of such analysis is
a. Understanding the past behaviour of the variable
b. Knowing the existing nature of variation
c. Predicting the future trend
d. Comparison with other similar variables.
Trend (Secular Trend)
Trend is the overall change taking place in the time series over a long period of time. It is
the change taking place in a period of many years. Most of the time series show a general
tendency to increase, decrease or to remain constant over a long period of time. Such an
overall change occurring is the trend.
Examples
a. Steady increase in the population of India in the past many years is an upward
trend.
b. Steady increase in the price of gold in last many years is an upward trend.
c. Due to availability of greater medical facilities, death rate is decreasing. Thus,
death rate shows a downward trend.
d. Atmospheric temperature at a place, though it shows short-time variation, does not show a significant upward or downward trend.
The root causes of trend are technological advancement, growth of population, change in tastes, etc. Trend is measured mainly by the method of moving averages and by the method of least squares.
Seasonal Variation
The regular and periodic variation in a time series is called seasonal variation. Generally, the period of seasonal variation is within one year. The factors causing seasonal variation are (1) weather conditions and (2) customs, traditions and habits of people. Seasonal variation is predictable.
Examples
a. An increase in the sales of woollen clothes during winter.
b. An increase in the sales of note books during the months of June, July and August.
c. An increase in atmospheric temperature during summer.
Cyclical Variation (Business Cycle)
Cyclical Variation is an oscillatory variation which occurs in four stages viz prosperity,
recession, depression and recovery. Generally, such variation occurs in economic and
business activities. They occur in a gap of more than one year.
One cycle consisting of four stages occurs in a period of few years. The period is not
definite. Generally, the period is 5 to 10 years. Many Economists have explained the causes
of cyclical variation. Each of them is significant.
Irregular variation (Random Fluctuation)
Apart from the regular variations, most time series show variations which are totally unexpected. Irregular variations occur as a result of unexpected happenings such as wars, famines, strikes, floods, etc. They are unpredictable. Generally, the effect of such variation lasts for a short period.
Examples
a. An increase in the price of vegetables due to a strike by the railway employees.
b. A decrease in the number of passengers in the city buses, occurring as a result of
strike by public sector employees.
c. An increase in the number of deaths due to earthquakes.
Measurement Of Trend
o Graphic (or Free-hand Curve fitting) Method
o Method of Semi-Averages
o Method of Curve Fitting by the Principle of Least Squares
o Method of Moving Averages
METHOD OF SEMI-AVERAGES
Problem 1:
Estimate the value for 2000. If the actual sales figure is 35,000 units, how do you account for the difference between the figures obtained?

Years            1993   1994   1995   1996   1997   1998
Sales ('000s)     20     24     22     30     28     32

Answer:

Year   Sales ('000s)   3-yearly semi-total   Semi-average
1993       20
1994       24                 66                  22
1995       22
1996       30
1997       28                 90                  30
1998       32

Increase between the semi-averages (1994 to 1997): (30 - 22) = 8
Annual increment: 8/3 = 2.667

Year   Trend Values ('000s)
1993   22 - 2.667      = 19.333
1994   22              = 22
1995   22 + 2.667      = 24.667
1996   30 - 2.667      = 27.333
1997   30              = 30
1998   30 + 2.667      = 32.667
1999   32.667 + 2.667  = 35.334
2000   35.334 + 2.667  = 38

The estimate for 2000 is therefore 38,000 units. The difference from the actual figure arises because of the assumption that there is a linear relationship between the given time series values. Moreover, the effects of seasonal, cyclical and irregular variations have been completely neglected.
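The semi-average procedure can be sketched in a few lines. This is an illustrative version under stated assumptions: the function name is invented here, the series is split into two equal halves, and the middle observation is dropped when the number of years is odd. The data are the 1993 to 1998 sales figures of Problem 1.

```python
def semi_average_trend(years, values):
    """Return a function giving the semi-average trend value for any year."""
    n = len(values)
    half = n // 2  # middle observation is dropped when n is odd
    avg1 = sum(values[:half]) / half
    avg2 = sum(values[n - half:]) / half
    mid1 = sum(years[:half]) / half      # centre of the first half
    mid2 = sum(years[n - half:]) / half  # centre of the second half
    slope = (avg2 - avg1) / (mid2 - mid1)  # annual increment
    return lambda year: avg1 + slope * (year - mid1)

years = [1993, 1994, 1995, 1996, 1997, 1998]
sales = [20, 24, 22, 30, 28, 32]
trend = semi_average_trend(years, sales)
print(round(trend(2000), 3))  # 38.0
```

The straight line passes through the two semi-averages placed at the centres of their halves (1994 and 1997 here), so extrapolating it is only as reliable as the assumption of a linear trend.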

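Problem 2:
From the following series find the trend by the semi-average method. Estimate the value for the year 1999.

Year    1990  1991  1992  1993  1994  1995  1996  1997  1998
Value    170   231   261   267   278   302   299   298   340

Answer:

Year   Values   4-yearly semi-total   Semi-average
1990    170
1991    231
                       929                232
1992    261
1993    267
1994    278
1995    302
1996    299
                      1239                310
1997    298
1998    340

As there are nine years, the middle year (1994) is left out and each semi-average is centred between the second and third years of its half.
Difference between the semi-averages: (310 - 232) = 78, spread over 5 years, so the annual increment is 78/5.
Estimate for the year 1999:
310 + (5/2)*(78/5) = 349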
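METHOD OF CURVE FITTING: PRINCIPLE OF LEAST SQUARES
Fitting of Linear Trend: y = a + bx
To find a & b, solve the normal equations:

\[ \text{(i)}\ \sum y = na + b\sum x \qquad \text{(ii)}\ \sum xy = a\sum x + b\sum x^2 \]

Fitting of a Second Degree (Parabolic) Trend: y = a + bx + cx^2
To find a, b & c, solve the normal equations:

\[ \text{(i)}\ \sum y = na + b\sum x + c\sum x^2 \]

\[ \text{(ii)}\ \sum xy = a\sum x + b\sum x^2 + c\sum x^3 \]

\[ \text{(iii)}\ \sum x^2 y = a\sum x^2 + b\sum x^3 + c\sum x^4 \]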
Problem 3:
Fit a linear trend to the following data. Estimate the production for the year 1999. Verify that Σ(y - ye) = 0, where ye is the corresponding trend value of y.

Year         1990   1992   1994   1996   1998
Production    18     21     23     27     16

Answer:
Let us take the year 1994 as the origin (it is convenient to take the middle year, as there is an odd number of years), so that x = Year - 1994.

Year    Production (y)    x    x^2    xy    Trend Values (ye)   (y - ye)
1990         18          -4    16    -72         20.6             -2.6
1992         21          -2     4    -42         20.8              0.2
1994         23           0     0      0         21                2
1996         27           2     4     54         21.2              5.8
1998         16           4    16     64         21.4             -5.4
Total       105           0    40      4                           0

Fitting of Linear Trend: y = a + bx
To find a & b:
Σy = na + bΣx:        105 = a*5 + b*0,   so a = 21
Σxy = aΣx + bΣx^2:      4 = a*0 + b*40,  so b = 0.1
Therefore the equation is: y = 21 + 0.1x
Estimated production for 1999 (x = 5): y = 21 + 0.1*5 = 21.5 thousand units.
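As a check on the hand calculation, the two normal equations for a straight line can be solved directly. The following is a minimal sketch; `fit_linear_trend` is an illustrative name, and the data are the production figures of Problem 3 with x measured from 1994.

```python
def fit_linear_trend(xs, ys):
    """Solve the normal equations for y = a + b*x by least squares."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

years = [1990, 1992, 1994, 1996, 1998]
production = [18, 21, 23, 27, 16]
xs = [year - 1994 for year in years]  # origin at the middle year

a, b = fit_linear_trend(xs, production)
trend_values = [a + b * x for x in xs]
residual_sum = sum(y - ye for y, ye in zip(production, trend_values))
forecast_1999 = a + b * (1999 - 1994)
print(a, b, forecast_1999)
```

Choosing the middle year as origin makes Σx = 0, so the normal equations decouple and reduce to a = Σy/n and b = Σxy/Σx^2, exactly as in the hand working.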
Problem 4:
Calculate the quarterly trend values by the method of least squares for the following quarterly data for the last 5 years:

Year   I Quarter   II Quarter   III Quarter   IV Quarter
1994       60          80            72            68
1995       68         104           100            88
1996       80         116           108            96
1997      108         152           136           124
1998      160         184           172           164

Answer:

Year    Total   Average (y)    U    U^2    Uy    Trend Values
1994     280        70        -2     4    -140        64
1995     360        90        -1     1     -90        88
1996     400       100         0     0       0       112
1997     520       130         1     1     130       136
1998     680       170         2     4     340       160
Total              560         0    10     240

Fitting of Linear Trend: y = a + bU
To find a & b:
Σy = na + bΣU:         560 = a*5 + b*0,   so a = 112
ΣUy = aΣU + bΣU^2:     240 = a*0 + b*10,  so b = 24
Therefore the equation is: y = 112 + 24U
The yearly increment is 24, so the quarterly increment is 24/4 = 6.
Each yearly trend value is centred on the middle of its year, i.e. between the second and third quarters. Therefore the trend values for the second and third quarters of 1994 are 64 - (6/2) = 61 and 64 + (6/2) = 67 respectively, and the remaining quarters follow by adding or subtracting 6.

Year   I Quarter   II Quarter   III Quarter   IV Quarter
1994       55          61            67            73
1995       79          85            91            97
1996      103         109           115           121
1997      127         133           139           145
1998      151         157           163           169
Problem 1:
Fit an equation of the form y = a + bx + cx^2 to the data given below.

X    1    2    3    4    5
Y   25   28   33   39   46

Answer:
Fitting of a Second Degree (Parabolic) Trend, with x = X - 3:

X     Y     x    x^2   x^3   x^4    xY    x^2 Y   Trend Values
1    25    -2     4    -8    16    -50     100       24.89
2    28    -1     1    -1     1    -28      28       28.26
3    33     0     0     0     0      0       0       32.91
4    39     1     1     1     1     39      39       38.86
5    46     2     4     8    16     92     184       46.09
    171     0    10     0    34     53     351

ΣY = na + bΣx + cΣx^2:           171 = 5a + 0b + 10c    ..(i)
ΣxY = aΣx + bΣx^2 + cΣx^3:        53 = 0a + 10b + 0c    ..(ii)
Σx^2 Y = aΣx^2 + bΣx^3 + cΣx^4:  351 = 10a + 0b + 34c   ..(iii)
By (ii), b = 5.3. Solving (i) and (iii) [multiply (i) by 2 and subtract it from (iii)] we get 14c = 9, so c = 0.643, and then 5a = 171 - 10c, so a = 32.91.
Therefore the equation is: y = 32.91 + 5.3x + 0.643x^2, where x = X - 3.
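The three normal equations of the parabolic trend can also be solved mechanically. The sketch below builds them from the power sums and solves the 3x3 system by Gaussian elimination; the function names `solve3` and `fit_parabola` are illustrative assumptions, and the data are those of the parabolic problem above with x = X - 3.

```python
def solve3(m, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        s = a[r][3] - sum(a[r][k] * sol[k] for k in range(r + 1, 3))
        sol[r] = s / a[r][r]
    return sol

def fit_parabola(xs, ys):
    """Coefficients (a, b, c) of y = a + b*x + c*x^2 from the normal equations."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)                       # sum of x^p
    t = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))     # sum of x^p * y
    m = [[n, s(1), s(2)],
         [s(1), s(2), s(3)],
         [s(2), s(3), s(4)]]
    return solve3(m, [t(0), t(1), t(2)])

xs = [-2, -1, 0, 1, 2]          # x = X - 3
ys = [25, 28, 33, 39, 46]
a, b, c = fit_parabola(xs, ys)
trend = [a + b * x + c * x * x for x in xs]
print(round(a, 2), round(b, 2), round(c, 3))
```

Carrying full precision gives c = 9/14, about 0.643, and a of about 32.91; rounding c to 0.64 before solving for a is what produces 32.92 in a hand calculation.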
Problem 2:
Fit an equation of the form y = A.B^x to the data given below.

x    1     2      3      4     5
y   1.6   4.5   13.8   40.2   125

Answer:
Fitting of an Exponential Curve: y = A.B^x ..(i)
Taking logarithms we get: log y = log A + x log B
Y = a + bx ..(ii), where Y = log y; a = log A; b = log B ..(iii)
The normal equations for (ii) are:
ΣY = na + bΣx:          5.6983 = 5a + 15b     ..(iv)
ΣxY = aΣx + bΣx^2:     21.8315 = 15a + 55b    ..(v)

x      y      Y = log y      xY      x^2    Trend Values
1     1.6      0.2041      0.2041     1         1.6
2     4.5      0.6532      1.3064     4         4.6
3    13.8      1.1399      3.4197     9        13.8
4    40.2      1.6042      6.4168    16        41.1
5   125        2.0969     10.4845    25       122.3
15             5.6983     21.8315    55

By solving (iv) & (v) we get b = 0.4737 and a = -0.2814.
Taking antilogs we get A = 0.5231 and B = 2.977. Therefore the trend equation is: y = 0.5231*(2.977)^x
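The log transformation turns the exponential fit into an ordinary straight-line fit, which the following sketch makes explicit. The function name is an illustrative assumption, base-10 logarithms are used to match the hand working, and the data are those of the exponential problem above.

```python
from math import log10

def fit_exponential(xs, ys):
    """Fit y = A * B**x by least squares on log10(y) = log10(A) + x*log10(B)."""
    n = len(xs)
    logs = [log10(y) for y in ys]
    sx, sY = sum(xs), sum(logs)
    sxY = sum(x * Y for x, Y in zip(xs, logs))
    sxx = sum(x * x for x in xs)
    b = (n * sxY - sx * sY) / (n * sxx - sx * sx)  # b = log10(B)
    a = (sY - b * sx) / n                          # a = log10(A)
    return 10 ** a, 10 ** b

xs = [1, 2, 3, 4, 5]
ys = [1.6, 4.5, 13.8, 40.2, 125]
A, B = fit_exponential(xs, ys)
trend = [A * B ** x for x in xs]
print(round(A, 3), round(B, 3))
```

The results agree with the hand calculation to about three significant figures; small differences come from the four-figure logarithms used in the table.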
METHOD OF MOVING AVERAGES
This is a simple and flexible method of measuring trend. A moving average is an averaging process that smooths out the fluctuations and ups and downs in the given data. The moving average of period m is a series of successive averages of m overlapping values at a time, starting with the 1st, 2nd, 3rd value and so on.
Problem 3:
Calculate the 5-yearly moving average from the data given below: 10; 14; 18; 22; 26; 30; 34; 38; 42; 46; 50
Answer:

Year   Values   5-yearly Moving Total   5-yearly Moving Average
  1      10
  2      14
  3      18             90                      18
  4      22            110                      22
  5      26            130                      26
  6      30            150                      30
  7      34            170                      34
  8      38            190                      38
  9      42            210                      42
 10      46
 11      50
Problem 4:
Calculate the 4-yearly moving average from the following data: 37.4; 31.1; 38.7; 39.5; 47.9; 42.6
Answer:

Year   Production   4-yearly        4-yearly         2-period        Centered
                    Moving Total    Moving Average   Moving Total    Moving Average
1991      37.4
1992      31.1
                       146.7           36.675
1993      38.7                                          75.975          37.99
                       157.2           39.300
1994      39.5                                          81.475          40.74
                       168.7           42.175
1995      47.9
1996      42.6
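Both moving-average problems can be reproduced with a short sketch. The function names are illustrative assumptions; for an even period the averages fall between two years, so they are centred by averaging successive pairs, as in the 4-yearly table.

```python
def moving_average(values, m):
    """Successive averages of m overlapping values."""
    return [sum(values[i:i + m]) / m for i in range(len(values) - m + 1)]

def centered_moving_average(values, m):
    """Moving average aligned with the observations.

    For odd m the plain moving average is already centred; for even m
    successive pairs of averages are averaged again.
    """
    ma = moving_average(values, m)
    if m % 2 == 1:
        return ma
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]

five_yearly = moving_average([10, 14, 18, 22, 26, 30, 34, 38, 42, 46, 50], 5)
print(five_yearly)  # [18.0, 22.0, 26.0, 30.0, 34.0, 38.0, 42.0]

production = [37.4, 31.1, 38.7, 39.5, 47.9, 42.6]
four_yearly = centered_moving_average(production, 4)
print([round(v, 2) for v in four_yearly])
```

The first list lines up with years 3 to 9 of the first problem, and the two centred values line up with 1993 and 1994 of the second.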
SEASONAL VARIATIONS
These are the variations due to forces which operate in a regular, periodic manner with a period of less than one year. The objectives of studying them are as follows:
o To isolate seasonal variations: to determine the effect of seasonal swings on the values of a given phenomenon.
o To eliminate them: to determine the value of the phenomenon if there were no seasonal ups and downs.
Methods:
o Method of Simple Averages
o Ratio to Trend Method
o Ratio to Moving Averages Method
o Link Relative Method
SIMPLE AVERAGES
This is the simplest method of measuring the seasonal variations in a time series and
involves the following steps:
o Arrange the data by years & months
o Compute the average for the months
o Compute the overall average
o Obtain seasonal Indices for different months
Problem 5:
Compute the seasonal index from the data given:

Quarter   1990   1991   1992   1993   1994   1995
I          3.5    3.5    3.5    4.0    4.1    4.2
II         3.9    4.1    3.9    4.6    4.4    4.6
III        3.4    3.7    3.7    3.8    4.2    4.3
IV         3.6    4.8    4.0    4.5    4.5    4.7

Answer:

Year             I Qtr.   II Qtr.   III Qtr.   IV Qtr.
1990               3.5      3.9       3.4        3.6
1991               3.5      4.1       3.7        4.8
1992               3.5      3.9       3.7        4.0
1993               4.0      4.6       3.8        4.5
1994               4.1      4.4       4.2        4.5
1995               4.2      4.6       4.3        4.7
TOTAL             22.8     25.5      23.1       26.1
A.M.               3.8      4.25      3.85       4.35
Seasonal Index    93.6    104.7      94.8      107.1

Overall average = (3.8 + 4.25 + 3.85 + 4.35)/4 = 4.06
o (3.8/4.06)*100 = 93.6
o (4.25/4.06)*100 = 104.7
o (3.85/4.06)*100 = 94.8
o (4.35/4.06)*100 = 107.1
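The simple-averages steps translate directly into code. This is a minimal sketch with an illustrative function name, using the quarterly data of Problem 5. It keeps the overall average unrounded (4.0625 rather than 4.06), so the indices differ from the hand-computed ones in the first decimal place.

```python
def simple_average_indices(table):
    """table[s] holds the values of season s over the years.

    Returns one seasonal index per season (average index = 100).
    """
    season_means = [sum(row) / len(row) for row in table]
    overall = sum(season_means) / len(season_means)
    return [100 * m / overall for m in season_means]

quarters = [
    [3.5, 3.5, 3.5, 4.0, 4.1, 4.2],  # I
    [3.9, 4.1, 3.9, 4.6, 4.4, 4.6],  # II
    [3.4, 3.7, 3.7, 3.8, 4.2, 4.3],  # III
    [3.6, 4.8, 4.0, 4.5, 4.5, 4.7],  # IV
]
indices = simple_average_indices(quarters)
print([round(i, 2) for i in indices])
```

With the unrounded overall average the four indices total exactly 400, which is the property the later methods enforce by an explicit adjustment.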
RATIO TO TREND
This method is an improvement over the previous method. It is based on the assumption that the seasonal fluctuation for any season is a constant factor of the trend. It involves the following steps:
o Compute the trend values by the appropriate method.
o Assuming a multiplicative model, eliminate the trend by expressing each original value as a percentage of its trend value.
o Arrange these values according to the years, months or quarters and average them for each season.
o Adjust these seasonal indices to a total of 1200 for monthly data or 400 for quarterly data.
Problem 6:
Using the Ratio to Trend method, determine the seasonal index.

Year   I Quarter   II Quarter   III Quarter   IV Quarter
1          68          60            61            63
2          70          58            56            60
3          68          63            68            67
4          65          59            56            62
5          60          55            51            58

Answer:

Year    Total   Average (y)    x    x^2     xy     Trend values
1        252       63.0       -2     4    -126        64.3
2        244       61.0       -1     1     -61        62.85
3        266       66.5        0     0       0        61.4
4        242       60.5        1     1      60.5      59.95
5        224       56.0        2     4     112        58.5
Total             307          0    10     -14.5

Fitting of Linear Trend: y = a + bx
To find a & b:
Σy = na + bΣx:         307 = a*5 + b*0,    so a = 61.4
Σxy = aΣx + bΣx^2:   -14.5 = a*0 + b*10,   so b = -1.45
Therefore the equation is: y = 61.4 - 1.45x
The yearly increment is -1.45, so the quarterly increment is -1.45/4 = -0.36. Each yearly trend value is centred between the II and III quarters, so these two quarters differ from it by half the quarterly increment: 0.36/2 = 0.18.

Trend Values
Year   I Quarter   II Quarter   III Quarter   IV Quarter
1        64.84        64.48        64.12         63.76
2        63.39        63.03        62.67         62.31
3        61.94        61.58        61.22         60.86
4        60.50        60.14        59.78         59.42
5        59.06        58.70        58.34         57.98

Trend Eliminated Values are: (Given Value for that Quarter / Trend Value for that Quarter) * 100

Trend Eliminated Values
Year                         I Quarter   II Quarter   III Quarter   IV Quarter
1                              104.9        93.05        95.13         98.81
2                              110.4        92.02        89.36         96.29
3                              109.8       102.3        111.1         110.1
4                              107.4        98.10        93.68        104.3
5                              101.6        93.70        87.42        100.03
Total                          534.1       479.2        476.7         509.6
Average                        106.8        95.84        95.33        101.9
Adjusted Seasonal Indices      106.9        95.9         95.4         101.9

Sum of the averages: 106.8 + 95.84 + 95.33 + 101.9 = 399.90
Therefore the Correction Factor is: 400/399.90
RATIO TO MOVING AVERAGES
This is a method which is an improvement over the previous method. This is a widely used
measure which involves the following steps:
o Obtain 12-month (4-quarter) moving average values.
o Express the original values as a percentage of centered moving average.
o Arrange these according to the years/months/quarter
o These indices should be 1200 or 400.
Problem 7:
Calculate the seasonal indices.

Year   Quarter   Value   Centered 4-quarter moving average
1991   I         68      -
1991   II        62      -
1991   III       61      63.125
1991   IV        63      62.250
1992   I         65      62.375
1992   II        58      62.750
1992   III       66      62.875
1992   IV        61      63.875
1993   I         68      64.125
1993   II        63      64.500
1993   III       63      -
1993   IV        67      -

Answer:
Ratio to Moving Averages:
(61/63.125)*100 = 96.63; (63/62.250)*100 = 101.20; and so on.

Trend Eliminated Values
Year                        I Quarter   II Quarter   III Quarter   IV Quarter
1991                        -           -            96.63         101.20
1992                        104.21      92.43        104.97        95.50
1993                        106.04      97.67        -             -
Total                       210.25      190.1        201.6         196.7
Averages                    105.13      95.05        100.80        98.35      (sum 399.33)
Adjusted Seasonal Indices   105.31      95.21        100.97        98.52
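The computation in Problem 7 can be sketched in Python as below. The function names `centered_moving_average` and `seasonal_indices` are illustrative assumptions, and the sketch assumes an even period (4 quarters or 12 months) with the series starting at the first season of a year.

```python
def centered_moving_average(values, period=4):
    """Centered moving averages for an even period (4 quarters or 12 months)."""
    moving = [sum(values[i:i + period]) / period
              for i in range(len(values) - period + 1)]
    # Centre by averaging each pair of adjacent moving averages.
    return [(moving[i] + moving[i + 1]) / 2 for i in range(len(moving) - 1)]


def seasonal_indices(values, period=4):
    """Ratio to moving average seasonal indices, adjusted to total 100 * period."""
    cma = centered_moving_average(values, period)
    offset = period // 2  # the first centered average belongs to this position
    buckets = [[] for _ in range(period)]
    for k, average in enumerate(cma):
        t = k + offset
        buckets[t % period].append(100 * values[t] / average)
    averages = [sum(b) / len(b) for b in buckets]
    factor = 100 * period / sum(averages)
    return [a * factor for a in averages]


values = [68, 62, 61, 63, 65, 58, 66, 61, 68, 63, 63, 67]  # 1991 to 1993, quarterly
cma = centered_moving_average(values)
indices = seasonal_indices(values)
print(cma[0])
print([round(i, 2) for i in indices])
```

The small difference from 105.31 for the first quarter arises because the hand calculation rounds the average to 105.13 before adjusting.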
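LINK RELATIVES
A link relative is the value of the given phenomenon in any season expressed as a percentage of its
value in the preceding season. The method involves the following steps:
o Convert the original data into link relatives (LR).
o Average these link relatives for each month (or quarter).
o Convert the average link relatives into chain relatives (CR), taking the CR of the first
month (or quarter) as 100.
o Obtain a new CR for the first month (or quarter) from the CR of the last one.
o Obtain the corrected chain relatives and express them as seasonal indices.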
Problem 8:
Wheat Prices (10 Kgs.)

Quarter                1990   1991   1992   1993
I Qtr. (Jan-Mar)       75     86     90     100
II Qtr. (Apr-June)     60     65     72     78
III Qtr. (Jul-Sept)    54     63     66     72
IV Qtr. (Oct-Dec)      59     80     85     93

Answer:
Note:
Link Relative for any season = (Current season's value / Previous season's value) * 100
Chain Relative for any season = (Link Relative of that season * Chain Relative of the
preceding season) / 100

Year               I Quarter   II Quarter   III Quarter   IV Quarter
1990               -           80           90            109.26
1991               145.76      75.58        96.92         126.98
1992               112.5       80           91.67         128.79
1993               117.65      78           92.31         129.17
Total              375.91      313.58       370.90        494.20
Average            125.3       78.395       92.725        123.55
Chain Relative     100         78.395       72.69         89.81
Adjusted CR        100         75.26        66.42         80.41      (total 322.09)
Seasonal Indices   124.2       93.47        82.49         99.87      (total 400)

New CR for the First Quarter: (average LR of I Qtr. * CR of IV Qtr.) / 100
= (125.303 * 89.81) / 100 = 112.54
d = (New CR of first Qtr. - 100) / 4 = (112.54 - 100) / 4 = 3.135
Adjusted CR: 78.395 - 3.135 = 75.26; 72.69 - 6.27 = 66.42; 89.81 - 9.405 = 80.41
Seasonal Index = (Adjusted CR / 322.09) * 400
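The link relative calculation of Problem 8 can be sketched in Python as follows. The function name `link_relative_indices` is an illustrative assumption; the sketch takes one row of seasonal values per year and carries full precision, so the last digit may differ slightly from the hand-worked table.

```python
def link_relative_indices(rows):
    """Seasonal indices by the method of link relatives.

    rows: one list of seasonal values (e.g. four quarters) per year.
    """
    k = len(rows[0])
    flat = [v for row in rows for v in row]
    # Link relative: each value as a percentage of the preceding one.
    links = [[] for _ in range(k)]
    for t in range(1, len(flat)):
        links[t % k].append(100 * flat[t] / flat[t - 1])
    averages = [sum(group) / len(group) for group in links]
    # Chain relatives, with the first season fixed at 100.
    chain = [100.0]
    for q in range(1, k):
        chain.append(averages[q] * chain[-1] / 100)
    # New chain relative of the first season, obtained from the last season.
    new_first = averages[0] * chain[-1] / 100
    d = (new_first - 100) / k
    adjusted = [c - q * d for q, c in enumerate(chain)]
    factor = 100 * k / sum(adjusted)
    return [a * factor for a in adjusted]


wheat = [
    [75, 60, 54, 59],    # 1990
    [86, 65, 63, 80],    # 1991
    [90, 72, 66, 85],    # 1992
    [100, 78, 72, 93],   # 1993
]
indices = link_relative_indices(wheat)
print([round(i, 2) for i in indices])
```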
CYCLICAL VARIATIONS
This is an approximate or crude method of measuring cyclical variations, which consists of
estimating trend, seasonal components and then eliminating their effect from the given
Time Series.
RANDOM VARIATIONS
These cannot be estimated accurately; at best we can obtain an estimate of the variance of
the random component.
INDEX NUMBERS
Index number is an indicator of the level of a phenomenon at a specific point of time in
comparison with its level at some other specific point of time. Index numbers may be of
varying price, production, growth rate, imports, exports, cost of living, etc. Generally,
index numbers of various economic activities are found useful. For economists, index
numbers are of use at every stage of planning, policy making, decision making, etc., and so
index numbers may very well be called Economic Barometers. Just as barometers measure
atmospheric pressure, index numbers measure changes occurring in the economic field.
An index number is a statistical device designed to measure the relative level of a group of
related variables over time or space. In other words, it is a number which
expresses the overall level of a group of related variables at a given time, called the Current
Period, as compared to the level at some other time, called the Base Period. Generally, index
numbers are expressed as percentages. Thus, if the index number of wholesale prices of food
articles in 1995 as compared to 1990 is 150, the implication is that the overall level of
wholesale prices of food articles in 1995 is 150% of the level in 1990. Here, 1995 is the
current year and 1990 is the base year.
Index number can very well be calculated for individual variables. For instance, if price of
a commodity is Rs. 5 in 1992 and Rs. 8 in 1995, the index number of price for the year
1995 with respect to the base 1992 is P = (8/5)* 100 = 160. That is, the price of the
commodity in 1995 is 160% of its price in 1992. Here, since only a single variable is
considered, the index number is called Relative. In this particular case, it is the Price
Relative. Price Relative is the price in the current year expressed as a percentage of the
price in the base year. If $p_0$ and $p_1$ are the prices of a commodity in the base year and the
current year respectively, the price relative is $P = (p_1/p_0) \times 100$.
An index number is thus an indicator which reflects the relative change in the level of a phenomenon in
a given period (or over a specified period of time), called the current period, with respect
to its value in some fixed period, called the base period, selected for comparison.
DEFINITION
Index Numbers are statistical devices designed to measure the relative change in the level
of a phenomenon (variable or group of variables) with respect to time, geographical
location or other characteristics such as income, profession etc.
Generally index numbers are of three types.
1. Price index number
2. Quantity index number
3. Value index number
Various price index numbers which are in use are the wholesale price index number, the consumer
price index number, etc. Price index numbers may be constructed for different groups of
commodities, such as food articles, laboratory equipment, etc. Price Index Numbers indicate the
general level of prices of articles in the current period as compared to that of the base
period.
Quantity Index Numbers are index numbers of quantity of goods imported or exported,
quantity of agricultural produce etc.
Value Index Numbers are the index numbers of the total money value of transaction
taking place.
Note 1: A price index of 125 means that the price level in the current year is 125% of the
price level in the base year.
Note 2: If the average price level in 1990 is double the average price level in 1980, the index
number of price for 1990 with base 1980 is 200.
Note 3: An index number of 325 for 1995 with base 1970 means that the average price level has
increased by 225% from 1970 to 1995.
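The price relative and the reading of an index as a percentage change can be sketched in Python. The function names `price_relative` and `percent_change` are illustrative assumptions; the figures are those of the example above (Rs. 5 in 1992 and Rs. 8 in 1995) and of Note 3.

```python
def price_relative(base_price, current_price):
    """Current price expressed as a percentage of the base price."""
    return 100 * current_price / base_price


def percent_change(index):
    """Percentage change in level implied by an index whose base is 100."""
    return index - 100


print(price_relative(5, 8))   # price relative for 1995 with base 1992
print(percent_change(325))    # rise in the price level from 1970 to 1995
```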
PROBLEMS IN CONSTRUCTION
o The Purpose of Index Numbers
o Selection of Commodities or Items
o Data for Index Numbers
o Selection of Base Period
o Type of Average to be used
o System of Weighting
o Choice of formula
IMPORTANT NOTATIONS
o $p_0$: Price of the Commodity in the Base Period
o $p_1$: Price of the Commodity in the Current Period
o $q_0$: Quantity of a Commodity consumed or purchased during the Base Period
o $q_1$: Quantity of a Commodity consumed or purchased in the Current Period
o $w$: Weight assigned to a commodity according to its relative importance in the
group.
o $I$: Simple Index Number or Price Relative obtained on expressing the current year price
as a percentage of the base year price, given by: $I = (p_1/p_0) \times 100$
o $P_{01}$: Price Index Number for the Current Year w.r.t. the Base Year
o $P_{10}$: Price Index Number for the Base Year w.r.t. the Current Year
o $Q_{01}$: Quantity Index Number for the Current Year w.r.t. the Base Year
o $Q_{10}$: Quantity Index Number for the Base Year w.r.t. the Current Year
o $V_{01}$: Value Index Number for the Current Year w.r.t. the Base Year
o $p_{0j}$: Price of the jth commodity in the Base Year, $j = 1, 2, 3, \ldots, n$
o $p_{1j}$: Price of the jth commodity in the Current Year
USES OF INDEX NUMBER
1. Index numbers are useful to governments in formulating policies regarding
economic activities such as taxation, imports and exports, grant of license to new
firms, bank rate.
2. Index numbers are useful in comparing variations in production, price, etc.
3. Index numbers help industrialists and businessmen in planning their activities such
as production of goods, their stock etc.
4. Consumer price index number is used for the fixation of salary and grant of
allowance to employees.
5. Consumer price index numbers are used for the evaluation of purchasing power of
money.
LIMITATIONS OF INDEX NUMBERS
1. While constructing index numbers, some representative items alone are made use
of. The index number so obtained may not indicate the changes in the concerned
fields accurately.
2. As customs and habits change from time to time, the use of commodities also varies.
And so, it is not possible to assign proper weights to various items.
3. Many formulae are used for the construction of index numbers. These formulae
give different values for the index.
4. There is ample scope for bias in the construction of index numbers. By altering the
price quotation or by improper selection of items, index numbers can be
manipulated.
STEPS IN THE CONSTRUCTION OF INDEX NUMBERS
The various steps in the construction of index numbers are
o Defining (Stating) the purpose of the index number.
o Selecting the base period
o Selecting the items
o Obtaining price quotations
o Selecting the appropriate systems of weights.
o Selecting the appropriate formula.
1. Defining (Stating) the purpose of the index number.
At the very outset, the purpose of the index number should be decided. As different index
numbers are useful for different purposes, the purpose on hand may need a particular index
number. A clear definition of purpose will help in the selection of the right index number.
While constructing the index number, the selection of items, base periods, weights, etc,
depend mainly on the purpose. Absence of clear definition of purpose often leads to
construction of an unsuitable index number.
2. Selecting the base period.
While constructing an index number, an appropriate base period should be selected. The base
period should be economically stable. There should not
be abnormal variations. The period should be free of wars, floods, famines, etc. It should
not be too distant from the current period. Again, the consumption pattern during the two
periods should not differ much. Depending on the situation, a fixed base index number or
a chain base index number may be preferred.
3. Selecting the items.
Selection of items is mainly based on the purpose of the index number. Items differ with
the purpose. For example, a wholesale price index number requires items which are
transacted at the wholesale market. A consumer price index number requires items which
are consumed by the particular group of people. However, in a consumer price index
number, items differ with the habits, customs and standard of living. Generally, there are
many items that could be included in the index number. But the list can be reduced by
selecting representative items only.
4. Obtaining price quotations.
After selecting the items for constructing an index number, price quotations for these items
should be obtained. Since price is likely to vary from place to place, it is better to obtain
price quotations from different places. Also, it is advisable to obtain price quotations from
different agencies. Then, the prices should be averaged. Again, prices are likely to vary
during the span of the base period and also during the span of the current period. Hence, it
is better to collect price quotations at regular intervals. These quotations should be
averaged and the average should be used in the construction.
5. Selecting the appropriate systems of weights.
Since the items considered in constructing index numbers often have varied importance, weights
are attached to the items. Mostly, these weights are the quantities in the base period, those in
the current period or those in any other period. Sometimes, a combination of quantities in
different periods may be considered as weights.
6. Selecting the appropriate formula
The selection of the formula is based mainly on the availability of data. Depending on the
data available regarding quantities, Laspeyres', Paasche's, Fisher's or any other index number
is calculated. While selecting the
formula, care should be taken to see that maximum use of the available data is made.
PRICE INDEX NUMBER
The various price index numbers in common use are
o Laspeyres' index number
o Paasche's index number
o Marshall-Edgeworth index number
o Fisher's ideal index number.
QUANTITY INDEX NUMBERS
Generally, quantity index numbers are calculated by adopting prices as weights. Some of the
quantity index numbers are -
o Laspeyres' Quantity index number
o Paasche's Quantity index number
o Marshall-Edgeworth Quantity index number
o Fisher's ideal Quantity index number.
Tests for an Index Number
A good index number should satisfy the following tests.
1. Time reversal test
2. Factor reversal test.
Time reversal test.
This test was proposed by Irving Fisher. According to him, an index number (formula) should
be such that when the base year and current year are interchanged (reversed) the resulting
index number should be the reciprocal of the earlier.
The time reversal test requires that the index number computed backwards should be the
reciprocal of the index number computed forwards, except for the constant of
proportionality.
Let $P_{01}$ be the index number (based on a certain formula) for period 1 with respect to the
base period 0. Let $P_{10}$ be the index number (based on the same formula) for period 0
with respect to the base period 1. Then, the particular index number (formula) satisfies
the time reversal test if $P_{01} \times P_{10} = 1$.
Here, $P_{01}$ and $P_{10}$ are mere ratios; they should not be expressed as percentages.
The time reversal test is not satisfied by Laspeyres' and Paasche's index numbers. But it is
satisfied by the Marshall-Edgeworth and Fisher's ideal index numbers.
Factor Reversal Test
This test was also proposed by Irving Fisher. Here, the argument is that the index number
(formula) should be such that the price index and the quantity index computed according to the
formula should both be equally effective in indicating changes.
The factor reversal test requires that the product of the index number of price (with quantities
as weights) and the index number of quantity (with prices as weights) should indicate the net
change in value taking place between the two periods.
Thus, the test is satisfied if $P_{01} \times Q_{01} = \sum p_1 q_1 / \sum p_0 q_0$.
Here, $P_{01}$ and $Q_{01}$ are mere ratios; they should not be expressed as percentages. Fisher's
index number satisfies the factor reversal test. But Laspeyres', Paasche's and Marshall-
Edgeworth index numbers do not satisfy this test.
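A short derivation may help to see why Fisher's index passes both tests. It is a sketch that takes Fisher's index as the geometric mean of Laspeyres' and Paasche's indices, written as ratios rather than percentages; $Q_{01}^{F}$ denotes the corresponding quantity index, obtained by interchanging the roles of p and q.

```latex
% Time reversal: interchange the suffixes 0 and 1 in P_{01}^{F} to obtain P_{10}^{F}.
P_{01}^{F} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0}\cdot\frac{\sum p_1 q_1}{\sum p_0 q_1}},
\qquad
P_{10}^{F} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1}\cdot\frac{\sum p_0 q_0}{\sum p_1 q_0}}

% Every sum cancels with its reciprocal in the product.
P_{01}^{F}\times P_{10}^{F} = 1

% Factor reversal: interchange p and q in P_{01}^{F} to obtain Q_{01}^{F}.
Q_{01}^{F} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0}\cdot\frac{\sum q_1 p_1}{\sum q_0 p_1}}

P_{01}^{F}\times Q_{01}^{F}
  = \sqrt{\frac{\left(\sum p_1 q_1\right)^2}{\left(\sum p_0 q_0\right)^2}}
  = \frac{\sum p_1 q_1}{\sum p_0 q_0}
```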
BIAS IN AN INDEX NUMBER
Generally, if the price of a commodity shows a significantly high increase, its use will decrease.
The consumers lessen the use of such commodities. Thus, if base year quantities are used
as weights, the greater variation of price will get greater weightage than needed. Therefore
such an index number will be an overestimate of the actual situation. Thus, Laspeyres'
index number, which uses the base year quantities as weights, is generally an overestimate.
It shows upward bias.
On the other hand, if current year quantities are used as weights, the greater variations will
be given lesser importance than needed. This leads to a downward bias. Thus, Paasche's
index number, which uses current year quantities as weights, is generally an underestimate.
It shows downward bias. However, Fisher's and Marshall-Edgeworth index
numbers make use of base as well as current year quantities and so, they are free of bias.
FISHER'S INDEX NUMBER IS IDEAL.
Fisher's index number is called the Ideal Index Number because of the following reasons.
o It is a geometric mean, which is considered the appropriate average for averaging
ratios.
o It takes into account the base year quantities as well as the current year quantities.
o It is free of bias.
o It satisfies both the time reversal and factor reversal tests.
CONSUMER PRICE INDEX NUMBER
Consumer Price Index Number is an index number of the cost met by a specified class of
consumers in buying a Basket of goods and services. Here, Basket of goods and services
means goods and services needed in day to day life of the specified class of consumers.
The pattern of consumption of goods is different in different classes. And so, the general
index numbers fail to indicate the changes in costs with regard to various classes of
consumers. Here, Class of consumers means group of consumers having almost identical
pattern of consumption. Generally, the classes are those of workers of a factory, people
belonging to a particular community, government employees, etc.
USES OF CONSUMER PRICE INDEX NUMBERS
1. Consumer Price Index Numbers indicate the changes in the consumer prices. And
so, they help governments in formulating policies regarding control of price,
taxation, imports and exports of commodities, etc.
2. They are used in granting allowances and other facilities to employees.
3. They are used for the evaluation of purchasing power of money. They are used for
deflating money.
4. They are used for comparing changes in the cost of living of different classes of
people.
STEPS IN THE CONSTRUCTION OF CONSUMER PRICE INDEX NUMBER
The steps in the construction of a consumer price index number are
1. Defining Scope and Coverage
At the very outset, it is necessary to decide the class of consumers for which the index
number is required. The class may be that of bank employees, government employees,
merchants, farmers etc. In any case, the geographical coverage should also be decided. That
is, the locality, city or town where the class dwells should be mentioned. In any event, the
consumers in the class should have almost the same pattern of consumption.
2. Conducting family budget enquiry and selecting the weights.
Having decided about the scope and coverage, the next step is to conduct a sample survey
of consumer families regarding their budget on various items. The survey should cover a
reasonably good number of representative families. It should be conducted during a
period of economic stability. In the survey, information regarding the commodities consumed
by the families, their quality, and the respective budget is collected. The items included in
the index number are classified generally under the heads (1) Food, (2) Clothing, (3) Fuel
and lighting, (4) Miscellaneous. A sufficiently large number of representative items is
included under each head.
3. Obtaining price quotations
The quotations of retail prices of different commodities are collected from local market.
The quotation is collected from different agencies and from different places. Then, they are
averaged and the averages are made use of. The price quotations of the current period and
that of the base periods should be collected.
4. Computing the index number.
There are two methods of computation of consumer price index number. They are
a. Aggregative expenditure method.
b. Family budget method.
Aggregative expenditure method
Here the quantities used in the base year are taken as weights. Thus, the consumer price
index number by this method is:
$P_{01} = \frac{\text{Total expenditure in the current year}}{\text{Total expenditure in the base year}} \times 100 = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$
Family budget method:
The consumer price index number by this method is the weighted arithmetic mean of the price
relatives. The weights assigned are the expenditures in a normal period. Thus, the consumer
price index number is: $P_{01} = \sum WI / \sum W$, where $W = p_0 q_0$ and $I = (p_1/p_0) \times 100$.
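The family budget method can be sketched in Python as a weighted mean of price relatives. The function name `family_budget_index` is an illustrative assumption. The figures are those of Problem 4 further on in these notes, where the weights are given directly rather than computed as base year expenditure.

```python
def family_budget_index(base_prices, current_prices, weights):
    """Weighted arithmetic mean of price relatives (family budget method)."""
    relatives = [100 * p1 / p0 for p0, p1 in zip(base_prices, current_prices)]
    weighted_sum = sum(w * i for w, i in zip(weights, relatives))
    return weighted_sum / sum(weights)


base_prices = [30, 8, 14, 22, 25]       # commodities A to E, base year
current_prices = [47, 12, 18, 15, 30]   # commodities A to E, current year
weights = [4, 1, 3, 2, 1]

index = family_budget_index(base_prices, current_prices, weights)
print(round(index, 2))
```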
METHODS (along with formulas)
o Simple (Unweighted) Aggregate Method:
$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$, $Q_{01} = \frac{\sum q_1}{\sum q_0} \times 100$
o Weighted Aggregate Method:
$P_{01} = \frac{\sum w p_1}{\sum w p_0} \times 100$
o Laspeyres' Price Index or Base Year Method:
$P_{01}^{La} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$
o Paasche's Price Index:
$P_{01}^{Pa} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$
o Fisher's Price Index:
$P_{01}^{F} = [P_{01}^{La} \times P_{01}^{Pa}]^{1/2}$
o Marshall-Edgeworth Price Index Number:
$P_{01}^{Ma} = \frac{\sum p_1 q_1 + \sum p_1 q_0}{\sum p_0 q_1 + \sum p_0 q_0} \times 100$

Problem 1:
From the following, compute Price Index Numbers using all four methods.

Commodities   1970 Price   1970 Quantity   1980 Price   1980 Quantity
A             20           8               40           6
B             50           10              60           5
C             40           15              50           15
D             20           20              20           25

Answer:

Commodities   p0   q0   p1   q1   p0q0   p0q1   p1q0   p1q1
A             20   8    40   6    160    120    320    240
B             50   10   60   5    500    250    600    300
C             40   15   50   15   600    600    750    750
D             20   20   20   25   400    500    400    500
Total                             1660   1470   2070   1790

Laspeyres' Index Number:
$P_{01}^{La} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{2070}{1660} \times 100 = 124.699$
Paasche's Index Number:
$P_{01}^{Pa} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{1790}{1470} \times 100 = 121.77$
Fisher's Ideal Index Number:
$P_{01}^{F} = [P_{01}^{La} \times P_{01}^{Pa}]^{1/2} = [124.699 \times 121.77]^{1/2} = 123.23$
Marshall-Edgeworth Index Number:
$P_{01}^{Ma} = \frac{\sum p_1 q_1 + \sum p_1 q_0}{\sum p_0 q_1 + \sum p_0 q_0} \times 100 = \frac{1790 + 2070}{1470 + 1660} \times 100 = 123.32$

Problem 2:
From the following, construct the index number of the group of four commodities by using
Fisher's Ideal method.

Commodities   Base Year Price   Base Year Expenditure   Current Year Price   Current Year Expenditure
A             2                 40                      5                    75
B             4                 16                      8                    40
C             1                 10                      2                    24
D             5                 25                      10                   60

Answer:
The quantities are obtained by dividing each expenditure by the corresponding price.

Commodities   p0   p0q0   p1   p1q1   q0   q1   p1q0   p0q1
A             2    40     5    75     20   15   100    30
B             4    16     8    40     4    5    32     20
C             1    10     2    24     10   12   20     12
D             5    25     10   60     5    6    50     30
Total              91          199              202    92

Fisher's Ideal Price Index:
$P_{01}^{F} = \left[\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}\right]^{1/2} \times 100 = \left[\frac{202}{91} \times \frac{199}{92}\right]^{1/2} \times 100 = 219.12$
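The four weighted formulas can be sketched in Python using the figures of Problem 1; the last lines also check the time reversal test by interchanging the two periods. The names `sum_products` and `price_indices` are illustrative assumptions, not part of the notes.

```python
from math import sqrt


def sum_products(prices, quantities):
    """Sum of price times quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


def price_indices(p0, q0, p1, q1):
    """Laspeyres', Paasche's, Fisher's and Marshall-Edgeworth price indices."""
    p0q0 = sum_products(p0, q0)
    p0q1 = sum_products(p0, q1)
    p1q0 = sum_products(p1, q0)
    p1q1 = sum_products(p1, q1)
    laspeyres = 100 * p1q0 / p0q0
    paasche = 100 * p1q1 / p0q1
    return {
        "laspeyres": laspeyres,
        "paasche": paasche,
        "fisher": sqrt(laspeyres * paasche),
        "marshall_edgeworth": 100 * (p1q1 + p1q0) / (p0q1 + p0q0),
    }


# Problem 1: commodities A to D, 1970 (base) and 1980 (current).
p0, q0 = [20, 50, 40, 20], [8, 10, 15, 20]
p1, q1 = [40, 60, 50, 20], [6, 5, 15, 25]

forward = price_indices(p0, q0, p1, q1)
backward = price_indices(p1, q1, p0, q0)  # base and current periods interchanged

for name, value in forward.items():
    print(name, round(value, 2))

# Time reversal test on ratios (not percentages): the product should be 1.
fisher_product = (forward["fisher"] / 100) * (backward["fisher"] / 100)
laspeyres_product = (forward["laspeyres"] / 100) * (backward["laspeyres"] / 100)
print(round(fisher_product, 6), round(laspeyres_product, 4))
```

For these figures the Fisher product equals 1 while the Laspeyres product does not, which agrees with the statements on the time reversal test.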
TEST OF CONSISTENCY
o Unit Test: This test requires that the Index Number formula should be independent
of the units in which the prices or the quantities of various commodities are quoted.
All the formulas discussed earlier, other than the Simple Aggregate of
Prices (Quantities), satisfy this test.
o Time Reversal Test: $P_{01} \times P_{10} = 1$
Other than Laspeyres' and Paasche's Index Numbers, all others satisfy this test.
o Factor Reversal Test: $P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$
Problem 3:
From the following, check whether (i) Laspeyres' (ii) Paasche's (iii) Fisher's Index Numbers
satisfy the Time and Factor Reversal Tests.

Commodities   Base Year Price   Base Year Quantity   Current Year Price   Current Year Quantity
A             6.5               500                  10.8                 560
B             2.8               124                  2.9                  148
C             4.7               69                   8.2                  78
D             10.9              38                   13.4                 24
E             8.6               49                   10.8                 27

Answer:

Commodities   p0q0     p1q0     p0q1     p1q1
A             3250     5400     3640     6048
B             347.2    359.6    414.4    429.2
C             324.3    565.8    366.6    639.6
D             414.2    509.2    261.6    321.6
E             421.4    529.2    232.2    291.6
Total         4757.1   7363.8   4914.8   7730

Laspeyres' Price Index Number: 154.80          Laspeyres' Quantity Index Number: 103.32
Paasche's Price Index Number: 157.28           Paasche's Quantity Index Number: 104.97
Fisher's Ideal Price Index Number: 156.03      Fisher's Ideal Quantity Index Number: 104.14
By trial we can find that Fisher's Index Number satisfies both the tests. For example,
taking the indices as ratios, 1.5603 * 1.0414 = 1.6249, which equals 7730/4757.1, so the
factor reversal test holds.

Problem 4:
From the following, calculate the Cost of Living Index Number.

Commodities   Base Year Price   Current Year Price   Weights
A             30                47                   4
B             8                 12                   1
C             14                18                   3
D             22                15                   2
E             25                30                   1

Answer:
Commodities   P (price relative)   W    WP
A             156.67               4    626.67
B             150                  1    150
C             128.57               3    385.71
D             68.18                2    136.36
E             120                  1    120
Total                              11   1418.74

Cost of Living Index Number = $\sum WP / \sum W$ = 1418.74/11 = 128.98

Notes by: Prof.Sudheer Pai, RNSIT, Bangalore

PROBABILITY
INTRODUCTION
Suppose a coin is tossed. The toss may result in the occurrence of 'Head' or in the
occurrence of 'Tail'. Here, the chances of head and tail are equal. In other words, the
probability of occurrence of head is 1/2 and the probability of occurrence of tail is 1/2. Thus,
probability is a numerical measure which indicates the chance of occurrence.
There are three systematic approaches to the study of probability. They are
1. The classical approach
2. The empirical approach.
3. The axiomatic approach.
Each of these approaches has its own merits and demerits.
Chance has a part to play in almost all activities. In every such activity, there is
indefiniteness. For example,
1. A new-born child may be male or female.
2. A stone aimed at a mango on a tree may hit it or it may not.
3. A student who takes P.U.E. examination may score any mark
between 0 and 100.
In the midst of such indefiniteness, predictions are made. This necessitates a systematic
study of probabilistic happenings.
RANDOM EXPERIMENT (Stochastic experiment, Trial)
There are two types of experiments. They are
(i) Deterministic experiment and (ii) Random experiment.
A deterministic experiment, when repeated under the same conditions, results in the same
outcome. It has a unique outcome.
Random experiment is an experiment which may not result in the same outcome
when repeated under the same conditions. It is an experiment which does not have a
unique outcome.
For example,
1. The experiment of 'Toss of a coin' is a random experiment. It is so because when a
coin is tossed the result may be 'Head' or it may be 'Tail'.
2. The experiment of 'Drawing a card randomly from a pack of playing cards' is a
random experiment. Here, the result of the draw may be any one of the 52 cards.
SAMPLE SPACE
The set of all possible outcomes of a random experiment is the Sample space. The
sample space is denoted by S. The outcomes of the random experiment (elements of the
sample space) are called sample points or outcomes or cases.
A sample space with a finite number of outcomes is a finite sample space. A sample space
with an infinite number of outcomes is an infinite sample space.
Ex.1. While throwing a die, the sample space is
S = {1, 2, 3, 4, 5, 6}. This is a finite sample space.
Ex.2. While tossing two coins simultaneously, the sample space is
S = {HH, HT, TH, TT}. This is a finite sample space.
Ex.3. Consider the toss of a coin successively until a head is obtained. Let the number of
tosses be noted. Here, the sample space is
S = {1, 2, 3, 4, ...}. This is an infinite sample space.
EVENT
An event is a subset of the sample space. Events are denoted by A, B, C etc.
An event which does not contain any outcome is a null event (impossible event). It is
denoted by ∅. An event which has only one outcome is an elementary event or
simple event. An event which has more than one outcome is a compound event. An event
which contains all the outcomes is equal to the sample space and is called the sure event or
certain event.
Ex.1. While throwing a die, A = {2, 4, 6} is an event. It is the event that the throw results in
an even number. Here, A is a compound event.
Ex.2. While tossing two coins, A = {TT} is an event. It is the event that the toss results in
two tails. Here, A is a simple event.
The outcomes which belong to an event are said to be favourable to that event. The event
happens whenever the experiment results in a favourable outcome. Otherwise, the event
does not happen.
While throwing a die, the event A = {2, 4, 6} has three favourable outcomes, namely, 2, 4
and 6. When the throw results in 2, 4 or 6, event A occurs.
COMPLEMENT OF AN EVENT
Let A be an event. Then, the complement of A is the event of non-occurrence of A. It is the
event constituted by the outcomes which are not favourable to A. The complement of A is
denoted by A' or Ā or A^c.
While throwing a die, if A = {2, 4, 6}, its complement is A' = {1, 3, 5}. Here, A is the event
that the throw results in an even number. A' is the event that the throw does not result in an even
number. That is, A' is the event that the throw results in an odd number.
SUB-EVENT
Let A and B be two events such that event A occurs whenever event B occurs. Then, event
B is a sub-event of event A.
While throwing a die, let A = {2, 4, 6} and B = {2}. Here, B is a sub-event of event A. That
is, B ⊂ A.
UNION OF EVENTS
Definition:
The union of two or more events is the event of occurrence of at least one of
these events. Thus, the union of two events A and B is the event of occurrence of at least one
of them. The union of A and B is denoted by A∪B or A+B or 'A or B'.
Ex.1. While tossing two coins simultaneously, let A = {HH} and B = {TT} be two events.
Then, their union is A∪B = {HH, TT}.
Here, A is the event of occurrence of two heads and B is the event of occurrence of two
tails. Their union A∪B is the event of occurrence of two heads or two tails.
Ex.2. While throwing a die, let A = {2, 4, 6}, B = {3, 6} and C = {4, 5, 6} be three events.
Then, their union is A∪B∪C = {2, 3, 4, 5, 6}.
INTERSECTION OF EVENTS
The intersection of two or more events is the event of simultaneous occurrence of all these
events. Thus, the intersection of two events A and B is the event of occurrence of both of
them. The intersection of A and B is denoted by A∩B or AB or 'A and B'.
Ex.1. While tossing two coins, let A = {HH, TT} and B = {HH, HT, TH} be two events. Then,
their intersection is A∩B = {HH}.
Ex.2. While throwing a die, let A = {2, 4, 6}, B = {3, 6} and C = {4, 5, 6} be three events.
Then, their intersection is A∩B∩C = {6}.
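The operations on events described so far correspond directly to operations on Python sets. The following sketch reproduces the die examples above; the variable names are illustrative assumptions.

```python
S = {1, 2, 3, 4, 5, 6}   # sample space for the throw of a die
A = {2, 4, 6}            # throw results in an even number
B = {3, 6}
C = {4, 5, 6}

complement_A = S - A         # outcomes not favourable to A
union = A | B | C            # at least one of A, B, C occurs
intersection = A & B & C     # A, B and C all occur
is_sub_event = {2} <= A      # {2} is a sub-event of A

print(complement_A)
print(union)
print(intersection)
print(is_sub_event)
```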
EQUALLY LIKELY EVENTS (Equiprobable events)
Two or more events are equally likely if they have equal chance of occurrence. That is,
equally likely events are such that none of them has greater chance of occurrence than the
others.
Ex. 1. While tossing a fair coin, the outcomes 'Head' and 'Tail' are equally likely.
Ex.2. While throwing a fair die, the events A = {2, 4, 6}, B = {1, 3, 5} and C = {1, 2, 3} are
equally likely.
A sample space is called an equiprobable space if the outcomes are equally likely. For
instance, the sample space S = {1, 2, 3, 4, 5, 6} of throw of a fair die is equiprobable space
because the six outcomes are equally likely.
MUTUALLY EXCLUSIVE EVENTS (Disjoint events)
Two or more events are mutually exclusive if only one of them can occur at a time.
That is, the occurrence of any of these events totally excludes the occurrence of the other
events. Mutually exclusive events cannot occur together.
Ex.1. While tossing a coin, the outcomes 'Head' and 'Tail' are mutually exclusive because
when the coin is tossed once, the result cannot be Head as well as Tail.
Ex.2. While throwing a die, the events A = {2, 4, 6}, B = {3, 5} and C = {1} are mutually
exclusive.
If A is an event, A and A' are mutually exclusive. It should be noted that the intersection of
mutually exclusive events is a null event.
EXHAUSTIVE EVENTS (Exhaustive set of events)
A set of events is exhaustive if one or the other of the events in the set occurs
whenever the experiment is conducted.
That is, the set of events exhausts all the outcomes of the experiment.
The union of exhaustive events is equal to the sample space.
Ex.1. While throwing a die, the six outcomes together are exhaustive. But here, if any one
of these outcomes is left out, the remaining five outcomes are not exhaustive.
Ex.2. While throwing a die, the events A = {2, 4, 6}, B = {3, 6} and C = {1, 5, 6} together are
exhaustive.
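Whether a set of events is mutually exclusive or exhaustive can be checked mechanically. The sketch below uses the die examples above; the function names `mutually_exclusive` and `exhaustive` are illustrative assumptions.

```python
from itertools import combinations


def mutually_exclusive(events):
    """True if no two of the events share an outcome."""
    return all(not (a & b) for a, b in combinations(events, 2))


def exhaustive(events, sample_space):
    """True if the union of the events equals the sample space."""
    return set().union(*events) == sample_space


S = {1, 2, 3, 4, 5, 6}
first = [{2, 4, 6}, {3, 5}, {1}]        # example for mutually exclusive events
second = [{2, 4, 6}, {3, 6}, {1, 5, 6}]  # example for exhaustive events

print(mutually_exclusive(first), exhaustive(first, S))
print(mutually_exclusive(second), exhaustive(second, S))
```

The second set is exhaustive but not mutually exclusive, because the outcome 6 belongs to all three events.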
THE CLASSICAL APPROACH
CLASSICAL (MATHEMATICAL, A PRIORI) DEFINITION
Let a random experiment have n equally likely, mutually exclusive and exhaustive
outcomes. Let m of these outcomes be favourable to an event A. Then, the probability of
A is
P(A) = m/n = (Number of favourable outcomes) / (Total number of outcomes)
Limitations of classical definition:
This definition is applicable only when
(i) The outcomes are equally likely, mutually exclusive and exhaustive.
(ii) The number of outcomes n is finite.
RESULT 1
P(A) is a value between 0 and 1. That is, 0 ≤ P(A) ≤ 1.
Proof:
Let a random experiment have n equally likely, mutually exclusive and exhaustive
outcomes. Let m of these outcomes be favourable to event A.
Then, P(A) = m/n.
Here, the least possible value of m is 0. Also, the highest possible value of m is n.
And so, 0 ≤ m ≤ n.
Dividing throughout by n, 0/n ≤ m/n ≤ n/n, that is, 0 ≤ P(A) ≤ 1.
Thus, P(A) is a value between 0 and 1.
RESULT 2
P(A') = 1 - P(A). That is, P(A) = 1 - P(A').
Proof:
In a random experiment with n equally likely, mutually exclusive and exhaustive
outcomes, if m outcomes are favourable to event A, the remaining (n-m) outcomes are
favourable to the complementary event A'. Therefore,
P(A') = (n - m)/n = 1 - m/n = 1 - P(A)
Thus, P(A') = 1 - P(A). That is, P(A) = 1 - P(A').
Exercise 1:
a. Find the probability of head in the toss of a fair coin.
Solution:
The sample space is S = {H, T}. There are n = 2 equally likely, mutually exclusive and
exhaustive outcomes. One outcome, namely H, is favourable to the event 'A: toss results in
head'.
Thus, m = 1.
P[head] = P(A) = m/n = 1/2
b. Find the probability that a throw of an unbiased die results in
(i) an ace (number 1) (ii) an even number (iii) a multiple of 3.
Solution:
The sample space is S = {1, 2, 3, 4, 5, 6}. There are n = 6 equally likely, mutually exclusive
and exhaustive outcomes. Let events A, B and C be
A: throw results in an ace (number 1)
B: throw results in an even number
C: throw results in a multiple of 3
(i) Event A has one favourable outcome.
P[ace] = P(A) = m/n = 1/6
(ii) Event B has 3 favourable outcomes, namely, 2, 4 and 6.
P[even number] = P(B) = m/n = 3/6 = 1/2
(iii) Event C has 2 favourable outcomes, namely, 3 and 6
P [multiple of 3] = P(C)= m/n = 2/6 = 1/3
c. A bag contains 3 white, 4 red and 2 green balls. One ball is selected at random from the
bag. Find the probability that the selected ball is
(i) white (ii) non-white (iii) white or green.
Solution:
The bag totally has 9 balls. Since the ball drawn can be any one of them, there are 9 equally
likely, mutually exclusive and exhaustive outcomes. Let events A, B and C be
A: selected ball is white
B: selected ball is non-white
C: selected ball is white or green
(i) There are 3 white balls in the bag. Therefore, out of the 9 outcomes, 3 are favourable to
event A.
P [white ball] = P(A) = 3/9 = 1/3
(ii) Event B is the complement of event A. Therefore,
P[non-white ball] = P(B) = 1 - P(A) = 1 - 1/3 = 2/3
(iii) There are 3 white and 2 green balls in the bag. Therefore, out of 9 outcomes, 5 are
either white or green.
P[white or green ball] = P(C) = 5/9
d. One card is drawn from a well-shuffled pack of playing cards. Find the probability that
the card drawn (i) is a Heart (ii) is a King (iii) belongs to red suit (iv) is a King or a Queen
(v) is a King or a Heart.
Solution:
A pack of playing cards has 52 cards. There are four suits, namely, Spade, Club, Heart and
Diamond (Dice). In each suit, there are thirteen denominations - Ace (1), 2, 3, ..., 10,
Jack (Knave), Queen and King.
A card selected at random may be any one of the 52 cards. Therefore, there are 52 equally
likely, mutually exclusive and exhaustive outcomes. Let events A, B, C, D and E be
A: selected card is a Heart
B: selected card is a King
C: selected card belongs to a red suit.
D: selected card is a King or a Queen
E: selected card is a King or a Heart
(i) There are 13 Hearts in a pack. Therefore, 13 outcomes are favourable to event A.
P[Heart] = P(A) = 13/52 = 1/4
(ii) There are 4 Kings in a pack. Therefore, 4 outcomes are favourable to event B.
P[King] = P(B)=4/52 =1/13
(iii)There are 13 Hearts and 13 Diamonds in a pack. Therefore, 26 outcomes are
favourable to event C.
P[Red card] = P(C) = 26/52 = 1/2
(iv) There are 4 Kings and 4 Queens in a pack. Therefore, 8 outcomes are favourable to
event D.
P[King or Queen] = P(D) = 8/52 = 2/13
(v) There are 4 Kings and 13 Hearts in a pack. Among these, one card is Heart-King.
Therefore, (4+13-1) = 16 outcomes are favourable to event E.
P[King or Heart] = P(E) =16/52 = 4/13
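The counting in the card exercise can be checked by listing the 52 outcomes and applying P(A) = m/n directly. The following is a minimal sketch in Python; the names `DECK` and `probability` are illustrative choices, not part of the text.

```python
from fractions import Fraction

# Build the 52 equally likely outcomes: 13 denominations in each of 4 suits.
RANKS = ["Ace"] + [str(k) for k in range(2, 11)] + ["Jack", "Queen", "King"]
SUITS = ["Spade", "Club", "Heart", "Diamond"]
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def probability(event):
    """Classical probability m/n, where `event` is a true/false test on a card."""
    favourable = sum(1 for card in DECK if event(card))
    return Fraction(favourable, len(DECK))


p_heart = probability(lambda c: c[1] == "Heart")
p_king = probability(lambda c: c[0] == "King")
p_red = probability(lambda c: c[1] in ("Heart", "Diamond"))
p_king_or_queen = probability(lambda c: c[0] in ("King", "Queen"))
p_king_or_heart = probability(lambda c: c[0] == "King" or c[1] == "Heart")

print(p_heart, p_king, p_red, p_king_or_queen, p_king_or_heart)
```

Because the King of Hearts satisfies both conditions but is one card, the enumeration counts it once, which is the same correction as (4 + 13 - 1) in part (v).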
e. A bag contains 8 tickets which are marked with the numbers 1, 2, 3, ..., 8. Find the probability
that a ticket drawn at random from the bag is marked with (i) an even number (ii) a
multiple of 3.
Solution:
The selection can be any one of the eight numbers. Therefore, there are 8 equally likely,
mutually exclusive and exhaustive outcomes. Let events A and B be
A: selected number is even.
B: selected number is a multiple of 3.
(i) Four of the selections, namely, 2, 4, 6 and 8 are favourable to event A.
P[even number] = P(A) = 4/8 = 1/2
(ii) Two of the selections, namely, 3 and 6 are favourable to event B.
P[multiple of 3] = P(B) = 2/8 = 1/4
Exercise 2:
a. A fair coin is tossed twice. Find the probability that the tosses result in (i) two heads (ii)
at least one head.
b. Two fair dice are rolled. Find the probability that (i) both the dice show
number 6 (ii) the sum of numbers obtained is 7 or 10 (iii) the sum of
the numbers obtained is less than 11 (iv) the sum is divisible by 3.
c. A box has 5 white, 4 red and 3 green balls. Two balls are drawn at random from the
box. Find the probability that they are (i) of the same colour (ii) of different colours.
d. Two cards are drawn at random from a pack of cards. Find the probability that (i)
both are Spades (ii) both are Kings (iii) one is a Spade and the other is a Heart (iv) the
cards belong to the same suit (v) the cards belong to different suits.
e. A bag has 9 tickets marked with numbers 1, 2, 3, ..., 9. Two tickets are drawn at
random from the bag. Find the probability that both the numbers drawn are (i) even
(ii) odd.
Solution:
a. The sample space is S = {HH, HT, TH, TT}. There are four equally likely, mutually
exclusive and exhaustive outcomes. Let events A and B be
A : the tosses result in 2 heads
B : the tosses result in at least one head.
(i) One outcome, HH, is favourable to event A.
P[two heads] = P(A) = 1/4
(ii) 3 outcomes, HH, HT and TH, are favourable to event B.
P[at least one head] = P(B) = 3/4
b. The sample space is S = {(1,1), (1,2), (1,3), ..., (1,6),
(2,1), (2,2), (2,3), ..., (2,6),
...
(6,1), (6,2), (6,3), ..., (6,6)}
There are 6x6 = 36 equally likely, mutually exclusive and exhaustive outcomes. Let events
A, B, C and D be
A : both the dice show number 6
B : sum of the numbers obtained is 7 or 10
C: sum of the numbers obtained is less than 11.
D : sum of the numbers obtained is divisible by 3.
(i) One outcome, namely, (6, 6) is favourable to event A.
P[6 on both the dice] = P(A) = 1/36
(ii) Nine outcomes, namely, (6,1), (5,2), (4, 3), (3,4), (2,5), (1,6), (6, 4), (5, 5) and (4,6) are
favourable to event B.
P[sum is 7 or 10] = P(B) = 9/36 = 1/4
(iii) The complement of event C is
C': sum is 11 or 12.
Event C' has three favourable outcomes, namely, (6,5), (5, 6) and (6, 6).
P[sum is less than 11] = P(C) = 1 - P[sum is 11 or 12]
= 1 - 3/36
= 1 - 1/12 = 11/12
(iv) The sum is divisible by 3 if it is 3, 6, 9 or 12. Therefore, the outcomes favourable to
event D are (2, 1), (1, 2), (5, 1), (4, 2), (3, 3), (2, 4), (1, 5), (6, 3), (5, 4), (4, 5), (3, 6) and (6,
6). Thus, 12 outcomes are favourable.
P[sum is divisible by 3] = 12/36 = 1/3.
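The four answers for the two-dice exercise can be verified by enumerating the 36 outcomes. This is a small sketch under the stated assumption that all 36 ordered pairs are equally likely; the helper name `probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first die, second die) of rolling two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))


def probability(event):
    """Classical probability m/n for an event given as a test on an outcome."""
    favourable = [o for o in OUTCOMES if event(o)]
    return Fraction(len(favourable), len(OUTCOMES))


p_double_six = probability(lambda o: o == (6, 6))
p_sum_7_or_10 = probability(lambda o: sum(o) in (7, 10))
# Event C is easier through its complement "sum is 11 or 12".
p_sum_below_11 = 1 - probability(lambda o: sum(o) >= 11)
p_sum_div_by_3 = probability(lambda o: sum(o) % 3 == 0)

print(p_double_six, p_sum_7_or_10, p_sum_below_11, p_sum_div_by_3)
```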
c. The box has 12 balls in total. A random draw of two balls has ${}^{12}C_2 = 66$ equally likely,
mutually exclusive and exhaustive outcomes. Let events A and B be
A : the balls drawn are of the same colour
B : the balls drawn are of different colours.
(i) Event A happens when the drawn balls are both white or both red or both green. Out of
${}^{12}C_2$ selections, ${}^{5}C_2$ selections are both white, ${}^{4}C_2$ selections are both red and ${}^{3}C_2$ selections
are both green. Thus, ${}^{5}C_2 + {}^{4}C_2 + {}^{3}C_2$ outcomes are favourable to event A.
P[balls of same colour] = P(A) = $\frac{{}^{5}C_2 + {}^{4}C_2 + {}^{3}C_2}{{}^{12}C_2} = \frac{10 + 6 + 3}{66} = \frac{19}{66} \approx 0.2879$
(ii) Event B is the complement of event A. Therefore,


P [balls of different colours] = 1 - P[same colour]
= 1- P(A)
= 1 - 19/66 = 47/66

d. A random draw of 2 cards from a pack of 52 cards has ${}^{52}C_2$ equally likely, mutually
exclusive and exhaustive outcomes. Let events A, B, C, D and E be
A: both the cards drawn are Spades
B: both the cards drawn are Kings.
C: the cards drawn are one Spade and one Heart.
D: the cards belong to the same suit.
E: the cards belong to different suits.
(i) Since there are 13 Spades in a pack, event A has ${}^{13}C_2$ favourable outcomes. Therefore,
P[both Spades] = P(A) = $\frac{{}^{13}C_2}{{}^{52}C_2} = \frac{13 \times 6}{26 \times 51} = \frac{1}{17}$
(ii) Since there are 4 Kings in a pack, event B has ${}^{4}C_2$ favourable outcomes. Therefore,
P[both Kings] = P(B) = $\frac{{}^{4}C_2}{{}^{52}C_2} = \frac{6}{26 \times 51} = \frac{1}{221}$
(iii) Here, one card should be a Spade and the other should be a Heart.
From 13 Spades, one Spade can be had in ${}^{13}C_1$ ways. From 13 Hearts, one Heart can be had
in ${}^{13}C_1$ ways. Thus, ${}^{13}C_1 \times {}^{13}C_1$ outcomes are favourable to event C. Therefore,
P[a Spade and a Heart] = P(C) = $\frac{{}^{13}C_1 \times {}^{13}C_1}{{}^{52}C_2} = \frac{13 \times 13}{26 \times 51} = \frac{13}{102}$
(iv) Here, the cards should be 2 Spades or 2 Clubs or 2 Hearts or 2 Diamonds. There are 13
cards of each suit. In each case, a selection of two cards can be made in ${}^{13}C_2$ ways. Thus,
the total number of favourable cases is ${}^{13}C_2 + {}^{13}C_2 + {}^{13}C_2 + {}^{13}C_2$.
P[cards of same suit] = P(D) = $\frac{{}^{13}C_2 + {}^{13}C_2 + {}^{13}C_2 + {}^{13}C_2}{{}^{52}C_2} = \frac{4 \times 78}{26 \times 51} = \frac{4}{17}$
(v) Event E is the complement of event D. Therefore,
P[cards of different suits] = P(E) = 1 - P[cards of same suit]
= 1 - 4/17 = 13/17
e. There are ${}^{9}C_2 = 36$ equally likely, mutually exclusive and exhaustive outcomes. Let events A
and B be
A : both the selected numbers are even.
B : both the selected numbers are odd.
(i) Out of 9 numbers, 4 numbers, namely, 2, 4, 6 and 8 are even. Therefore, ${}^{4}C_2$ selections
will have two even numbers. Therefore,
P[both even] = P(A) = $\frac{{}^{4}C_2}{{}^{9}C_2} = \frac{6}{36} = \frac{1}{6}$
(ii) Out of 9 numbers, 5 numbers, namely, 1, 3, 5, 7 and 9 are odd. Therefore, ${}^{5}C_2$ selections
will have two odd numbers. Therefore,
P[both odd] = P(B) = $\frac{{}^{5}C_2}{{}^{9}C_2} = \frac{10}{36} = \frac{5}{18}$
Exercise 3: A bag contains 3 red, 4 green and 3 yellow marbles. Three marbles are
randomly drawn from the bag. What is the probability that they are of (i) the same colour
(ii) different colours (one of each colour)?

Solution :
There are ${}^{10}C_3 = 120$ equally likely, mutually exclusive and exhaustive outcomes. Let events A
and B be
A: Selected marbles are of the same colour.
B: Selected marbles are of different colours (one of each colour).
(i) The marbles drawn should be 3 red or 3 green or 3 yellow.
Therefore, ${}^{3}C_3 + {}^{4}C_3 + {}^{3}C_3$ outcomes are favourable to event A. Therefore,
P[marbles of the same colour] = P(A) = $\frac{{}^{3}C_3 + {}^{4}C_3 + {}^{3}C_3}{{}^{10}C_3} = \frac{1 + 4 + 1}{120} = \frac{1}{20}$
(ii) The marbles should be one of each colour. Therefore, ${}^{3}C_1 \times {}^{4}C_1 \times {}^{3}C_1$ outcomes are
favourable. Therefore,
P[marbles of different colours] = P(B) = $\frac{{}^{3}C_1 \times {}^{4}C_1 \times {}^{3}C_1}{{}^{10}C_3} = \frac{36}{120} = \frac{3}{10}$
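The marble exercise can be cross-checked in two ways: by brute-force listing of all selections of three marbles, and by the combination counts used in the solution. The sketch below assumes the bag of 3 red, 4 green and 3 yellow marbles from Exercise 3; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

MARBLES = ["red"] * 3 + ["green"] * 4 + ["yellow"] * 3

# Choose index triples so that marbles of the same colour stay distinguishable.
DRAWS = list(combinations(range(len(MARBLES)), 3))


def colours(draw):
    """Set of distinct colours appearing in one selection of three marbles."""
    return {MARBLES[i] for i in draw}


same_colour = sum(1 for d in DRAWS if len(colours(d)) == 1)
one_of_each = sum(1 for d in DRAWS if len(colours(d)) == 3)

p_same = Fraction(same_colour, len(DRAWS))
p_one_of_each = Fraction(one_of_each, len(DRAWS))

# The same counts obtained from combination formulas.
formula_same = comb(3, 3) + comb(4, 3) + comb(3, 3)
formula_one_of_each = comb(3, 1) * comb(4, 1) * comb(3, 1)

print(p_same, p_one_of_each)
```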
THE AXIOMATIC APPROACH
Consider a random experiment with sample space S. Associated with this random
experiment, many events can be defined. Let for every event A, a real number P(A) be
assigned. Then, P(A) is the probability of event A, if the following axioms are satisfied.
Axiom 1 : $P(A) \ge 0$
Axiom 2 : P(S) = 1, S being the sure event.
Axiom 3 : For two mutually exclusive events A and B,
$P(A \cup B) = P(A) + P(B)$
Note that the third axiom can be generalised for any number of mutually exclusive events.
Exercise 12:
(i) Show that P(A) = 1 - P(A').
(ii) Show that probability is a value between 0 and 1.
(iii) Show that $P(\phi) = 0$, where $\phi$ is the null event.
Solution:
(i) If A and A' are complementary events, $A \cup A' = S$. By axiom 2, P(S) = 1.
And so, $P(A \cup A') = 1$ .... Result 1
But A and A' are mutually exclusive events. Therefore, by axiom 3,
$P(A \cup A') = P(A) + P(A')$ .... Result 2
By results 1 and 2, P(A) + P(A') = 1.
That is, P(A) = 1 - P(A').
(ii) Let A be an event. Then, by axiom 1,
$P(A) \ge 0$ .... Result 1
If A' is the complementary event of A,
P(A') = 1 - P(A)
But, by axiom 1, $P(A') \ge 0$.
Therefore, $1 - P(A) \ge 0$, and so $P(A) \le 1$ .... Result 2
By results 1 and 2, $0 \le P(A) \le 1$. That is, probability is a value between 0 and 1.
(iii) If A is an event and $\phi$ is the null event, $A \cup \phi = A$.
$P(A \cup \phi) = P(A)$ .... Result 1
But A and $\phi$ are mutually exclusive. Therefore,
$P(A \cup \phi) = P(A) + P(\phi)$ .... Result 2
By results 1 and 2,
$P(A) + P(\phi) = P(A)$
That is, $P(\phi) = P(A) - P(A) = 0$.
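ADDITION THEOREM OF PROBABILITY
For two events A and B, show that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Solution :
For events A and B, $A \cup B = A \cup (A' \cap B)$.
Here, A and $A' \cap B$ are mutually exclusive. Therefore, by axiom 3,
$P(A \cup B) = P(A) + P(A' \cap B)$ .... Result 1
Also, $B = (A \cap B) \cup (A' \cap B)$.
Here, $A \cap B$ and $A' \cap B$ are mutually exclusive. Therefore,
$P(B) = P(A \cap B) + P(A' \cap B)$, that is, $P(A' \cap B) = P(B) - P(A \cap B)$ .... Result 2
By result 1 and result 2, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.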
Exercise: Show that (i) $P(A \cup B) \le P(A) + P(B)$
(ii) $P(A \cap B) = P(A) + P(B) - P(A \cup B)$
Solution :
(i) The addition theorem is
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Here, $P(A \cap B) \ge 0$. Therefore,
$P(A \cup B) \le P(A) + P(B)$.
(ii) The addition theorem is
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$\therefore P(A \cap B) = P(A) + P(B) - P(A \cup B)$
Also, note that $P(A \cup B) + P(A \cap B) = P(A) + P(B)$.
SOLVED PROBLEMS
Exercise: Write down the sample space for each of the following random experiments.
(i) A coin is tossed three times and the result of each throw is noted,
(ii) A coin is tossed three times and the number of heads obtained is noted,
(iii) A couple goes on producing children until a male child is born. The number of female
children born is noted,
(iv) In case (iii) above, instead of noting the 'Number of female children', the 'Number of
children born' is noted,
(v) A tetrahedron (a solid with four triangular surfaces) whose sides are painted red, red, blue
and green is thrown. The colour of the side which touches the ground is noted.
(vi) Blood of husband and wife are tested and the blood group (whether O, A, B or AB) in
each case is identified.
(vii) A person is randomly selected and his religion is noted.
Solution:
(i) S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
(ii) S = {0, 1, 2, 3}
(iii) S = {0, 1, 2, 3, ...}
(iv) S = {1, 2, 3, 4, ...}
(v) S = {red, blue, green}
(vi) S = {(O, O), (O, A), (O, B), (O, AB), (A, O), (A, A), (A, B), (A, AB), (B, O),
(B, A), (B, B), (B, AB), (AB, O), (AB, A), (AB, B), (AB, AB)}
(vii) S = { Hindu, Christian, Muslim, Jain, Jew, ....}
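Finite sample spaces such as (i), (ii) and (vi) above can be generated mechanically as Cartesian products. The following sketch uses Python's `itertools.product`; the variable names are illustrative and not from the text.

```python
from itertools import product

# (i) Three tosses of a coin: every sequence of H and T of length 3.
three_tosses = ["".join(t) for t in product("HT", repeat=3)]

# (ii) Number of heads in three tosses: the distinct head counts that can occur.
heads_counts = sorted({t.count("H") for t in three_tosses})

# (vi) Blood groups of husband and wife: all ordered pairs of the four groups.
blood_groups = ["O", "A", "B", "AB"]
couples = list(product(blood_groups, repeat=2))

print(three_tosses)
print(heads_counts)
print(len(couples))
```

Note that the sample spaces in (iii) and (iv) are countably infinite, so they cannot be listed in full this way.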
Exercise
(i) Given the equiprobable sample space S = {1, 2, 3, 4, 5, 6} and the event A = {1, 3, 5},
find P(A).
(ii) Given the sample space S = {1, 2, 3, 4, 5, 6} and the events A = {1, 3, 5} and
B = {2, 4, 6}. If P(A) = 1/3, find P(B).
(iii) If S = {$E_1$, $E_2$} is the sample space and if $P(E_1) = 0.3$, find $P(E_2)$.
Solution:
(i) Since the sample space is equiprobable, the mathematical definition can be used
for finding probability.
P(A) = (Number of favourable outcomes) / (Total number of outcomes) = 3/6 = 1/2
(ii) Here, events A and B are complementary.
P(B) = 1 - P(A) = 1 - 1/3 = 2/3
(iii) Here, $E_1$ and $E_2$ are complementary events.
$P(E_2) = 1 - P(E_1) = 1 - 0.3 = 0.7$
Exercise:
(i) If P(A) = 1/3, find P(A').
(ii) If P(A) = 1/2, P(B) = 3/4 and $P(A \cap B)$ = 1/4, find $P(A \cup B)$.
(iii) If P(A) = 1/8, P(B) = 1/6 and $P(A \cup B)$ = 1/4, find $P(A \cap B)$.
(iv) If P(A) = 1/2 and $P(A \cap B)$ = 1/4, find P(B|A).
Solution:
(i) P(A') = 1 - P(A) = 1 - 1/3 = 2/3
(ii) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
= 1/2 + 3/4 - 1/4 = 1
(iii) By the addition theorem,
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$\therefore P(A \cap B) = P(A) + P(B) - P(A \cup B)$
= 1/8 + 1/6 - 1/4 = 1/24
(iv) $P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{1/4}{1/2} = \frac{1}{2}$
Exercise : If P(A) = 0.8, P(B) = 0.5 and $P(A \cup B)$ = 0.9, find P(A|B). Are A and B
independent events?
Solution:
By the addition theorem,
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$\therefore P(A \cap B) = P(A) + P(B) - P(A \cup B)$
= 0.8 + 0.5 - 0.9 = 0.4
And so, $P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{0.4}{0.5} = 0.8$
Thus, P(A|B) = 0.8
Here, P(A|B) = P(A). Therefore, events A and B are independent.
Exercise : Three unbiased dice are thrown once. Find the probability that all the three
dice show the number 6.
Solution :
When 3 dice are thrown, there are 6 x 6 x 6 = 216 equally likely, mutually exclusive and
exhaustive outcomes. Of these 216 outcomes, 1 outcome, namely, (6, 6, 6), is
favourable. Therefore, the probability of all the three dice showing the number 6 is
P[all the three result in the number 6] = 1/216
Exercise : A fair coin is tossed five times. Find the probability of obtaining
(i) head in all the tosses, (ii) head in at least one of the tosses.
Solution:
There are $2^5 = 32$ equally likely, mutually exclusive and exhaustive outcomes. Out of
them, one outcome is HHHHH and another outcome is TTTTT. Therefore,
(i) P[head in all tosses] = 1/32
(ii) P[at least one head] = 1 - P[tail in all tosses]
= 1 - 1/32 = 31/32
Note : Whenever the probability of the event 'at least one' has to be found, it is easier to find
it by using the probability of the complementary event as follows.
P[at least one] = 1 - P[none]
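The complement rule in the note can be compared with a direct count. The sketch below, for the five-toss exercise, computes P[at least one head] both ways; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 2**5 = 32 equally likely sequences of five tosses.
tosses = list(product("HT", repeat=5))

# Direct count: sequences that contain at least one head.
direct = Fraction(sum(1 for t in tosses if "H" in t), len(tosses))

# Complement rule: 1 - P[none], where "none" is the single sequence TTTTT.
p_no_head = Fraction(sum(1 for t in tosses if "H" not in t), len(tosses))
by_complement = 1 - p_no_head

print(direct, by_complement)
```

The direct count has to inspect 31 favourable sequences, while the complement involves only one, which is why the complement route is usually shorter by hand.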
Exercise : There are 20 persons. 5 of them are graduates. 3 persons are randomly selected
from these 20 persons. Find the probability that at least one of the selected persons is
graduate.
Solution:
From 20 persons, 3 persons can be selected in ${}^{20}C_3$ ways. Thus, there are ${}^{20}C_3$ equally likely,
mutually exclusive and exhaustive outcomes. Since there are 15 persons who are not
graduates,
P[at least one is graduate] = 1 - P[none is graduate]
$= 1 - \frac{{}^{15}C_3}{{}^{20}C_3} = 1 - \frac{91}{228} = \frac{137}{228} \approx 0.6$
Exercise :
In a college, there are five lecturers. Among them, three are doctorates. If a committee
consisting of three lecturers is formed, what is the probability that at least two of them are
doctorates?
Solution:
From the five lecturers, three lecturers can be selected in ${}^{5}C_3$ ways. Thus, there are ${}^{5}C_3$ equally
likely, mutually exclusive and exhaustive outcomes. Let events A and B be
A : Two of the selected lecturers are doctorates.
B : All the three selected lecturers are doctorates.
Then, event A has ${}^{3}C_2 \times {}^{2}C_1$ favourable outcomes. And, event B has ${}^{3}C_3$ favourable outcomes.
Here, events A and B are mutually exclusive.
$\therefore$ P[at least two doctorates] = P[two or three doctorates]
$= P(A \cup B) = P(A) + P(B) = \frac{{}^{3}C_2 \times {}^{2}C_1}{{}^{5}C_3} + \frac{{}^{3}C_3}{{}^{5}C_3} = \frac{3 \times 2}{10} + \frac{1}{10} = \frac{7}{10} = 0.7$
PROBLEMS:3
What is the probability that there will be 53 Sundays in a randomly selected
(i) leap year
(ii) non-leap year?
Solution:
(i) A leap year has 366 days. Out of them, 7 x 52 = 364 days make 52 complete weeks. The
remaining two days may occur in any of the following patterns:
(Sunday, Monday), (Monday, Tuesday), (Tuesday, Wednesday), (Wednesday, Thursday),
(Thursday, Friday), (Friday, Saturday) and (Saturday, Sunday).
Out of these 7 cases, which are equally likely, mutually exclusive and exhaustive, 2 cases,
namely (Sunday, Monday) and (Saturday, Sunday), have a Sunday. Therefore,
P[leap year has 53 Sundays] = 2/7
(ii) A non-leap year has 365 days. Out of them, 364 days make 52 complete weeks. The
remaining one day may be Sunday, Monday, ..., Saturday. Out of these 7 possibilities, only
one is Sunday. Therefore,
P[non-leap year has 53 Sundays] = 1/7
CONDITIONAL PROBABILITY
Let A and B be two events. Then, the conditional probability of B given A is the probability
of happening of B when it is known that A has already happened. On the other hand, the
probability of happening of B when nothing is known about happening of A is called the
unconditional probability of B.
The conditional probability of B given A is denoted by P(B|A). The unconditional
probability is P(B).
Let P(A) > 0. Then, the conditional probability of event B given A is defined as
$P(B|A) = \frac{P(A \cap B)}{P(A)}$
If P(A) = 0, the conditional probability P(B|A) is not defined.
If A and B are independent events, occurrence of B will be independent of occurrence of A.
Therefore, the conditional and unconditional probabilities are equal. That is, P(B|A) =
P(B), and so
$P(B) = \frac{P(A \cap B)}{P(A)}$
That is, $P(A \cap B) = P(A).P(B)$
INDEPENDENT EVENTS
Two events A and B are independent if and only if $P(A \cap B) = P(A).P(B)$.
If two events are independent, the occurrence or non-occurrence of one does not
depend on the occurrence or non-occurrence of the other.
MULTIPLICATION THEOREM
Let A and B be two events with respective probabilities P(A) and P(B). Let P(B|A)
be the conditional probability of event B given that event A has happened. Then, the
probability of simultaneous occurrence of A and B is
$P(A \cap B) = P(A).P(B|A)$
If the events are independent, the statement reduces to
$P(A \cap B) = P(A).P(B)$
Proof:
By the definition of conditional probability, for P(A) > 0,
$P(B|A) = \frac{P(A \cap B)}{P(A)}$, and so $P(A \cap B) = P(A).P(B|A)$.
If A and B are independent, by the definition of independence,
$P(B|A) = P(B)$, and so $P(A \cap B) = P(A).P(B)$.
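The multiplication theorem and the independence condition can be checked numerically on a pack of cards. The sketch below is illustrative only: the choice of events (King, Heart, red card) and the helper names `prob` and `cond_prob` are assumptions made for the example.

```python
from fractions import Fraction

RANKS = ["Ace"] + [str(k) for k in range(2, 11)] + ["Jack", "Queen", "King"]
SUITS = ["Spade", "Club", "Heart", "Diamond"]
DECK = [(rank, suit) for suit in SUITS for rank in RANKS]


def prob(event):
    """Classical probability of an event given as a test on a card."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))


def cond_prob(event_b, event_a):
    """P(B|A) = P(A and B) / P(A); not defined when P(A) = 0."""
    p_a = prob(event_a)
    if p_a == 0:
        raise ValueError("P(B|A) is not defined when P(A) = 0")
    return prob(lambda c: event_a(c) and event_b(c)) / p_a


def is_king(card):
    return card[0] == "King"


def is_heart(card):
    return card[1] == "Heart"


def is_red(card):
    return card[1] in ("Heart", "Diamond")


# Multiplication theorem: P(A and B) = P(A).P(B|A), here with A = red, B = heart.
p_red_and_heart = prob(lambda c: is_red(c) and is_heart(c))
theorem_rhs = prob(is_red) * cond_prob(is_heart, is_red)

# Independence: King and Heart satisfy P(A and B) = P(A).P(B); red and Heart do not.
king_heart_independent = (
    prob(lambda c: is_king(c) and is_heart(c)) == prob(is_king) * prob(is_heart)
)
red_heart_independent = p_red_and_heart == prob(is_red) * prob(is_heart)

print(p_red_and_heart, theorem_rhs, king_heart_independent, red_heart_independent)
```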
Exercise.
a. A card is drawn at random from a pack of cards.
(i) What is the probability that it is a heart ?
(ii) If it is known that the card drawn is red, what is the probability that it is a heart?
b. A fair coin is tossed thrice. What is the probability that all the three tosses result in heads
?
Solution:
a. There are 52 equally likely, mutually exclusive and exhaustive outcomes. Let events A
and B be
A : card drawn is red.
B : card drawn is a heart.
There are 26 red cards and 13 hearts in a pack of cards. Therefore, event A has 26
favourable outcomes and event B has 13 favourable outcomes. Event $A \cap B$ has 13
favourable outcomes because when any of the 13 hearts is drawn, $A \cap B$ happens.
Therefore, P(A) = 26/52, P(B) = 13/52 and $P(A \cap B)$ = 13/52
(i) The unconditional probability of drawing a heart is
P(B) = 13/52 = 1/4
(ii) The conditional probability of drawing a heart given that it is a red card is
$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{13/52}{26/52} = \frac{1}{2}$
b. Let events A, B and C be
A: the first toss results in head
B: the second toss results in head.
C: the third toss results in head.
Then, P(A) = P(B) = P(C) = 1/2
Since A, B and C are results of three different tosses, they are independent. Therefore,
the probability that all the three tosses result in head is
P[3 heads] = $P(A \cap B \cap C) = P(A).P(B).P(C) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$
Exercise: Two fair dice are rolled. If the sum of the numbers obtained is 4, find the
probability that the numbers obtained on both the dice are even.
Solution:
Let events A and B be
A: the sum of the numbers is 4
B: the numbers on both the dice are even
Here, we have to find $P(B|A) = \frac{P(A \cap B)}{P(A)}$.
Event A has 3 favourable outcomes, namely, (1,3), (2,2) and (3,1).
P[Sum 4] = P(A) = 3/36
Event $A \cap B$ has 1 favourable outcome, namely, (2,2).
P[Sum 4 and numbers even] = $P(A \cap B)$ = 1/36
Thus, P[Numbers even given Sum 4] = $P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{1/36}{3/36} = \frac{1}{3}$
Exercise: A box has 1 red and 3 white balls. Balls are drawn one after another from the box.
Find the probability that the two balls drawn would be red if
a. the ball drawn first is returned to the box before the second draw is made. (Draw
with replacement).
b. the ball drawn first is not returned before the second draw is made. (Draw without
replacement).
Solution:
Let A : the first ball drawn is red
B : the second ball drawn is red.
Draw with replacement:
Here, P(A) = 1/4. Also, since the first ball is returned before the second draw is made,
P(B|A) = 1/4
P[Two balls are red] = $P(A \cap B)$
= P(A).P(B|A)
= 1/4 x 1/4 = 1/16
Draw without replacement:
Here, P(A) = 1/4. Since the first ball drawn is not returned before the second draw is made,
only the 3 white balls remain once the red ball has been drawn, and so
P(B|A) = 0/3 = 0
$\therefore$ P[Two balls are red] = $P(A \cap B)$
= P(A).P(B|A)
= 1/4 x 0 = 0
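The two sampling schemes differ only in which ordered pairs of draws are possible. The sketch below lists those pairs for the box with 1 red and 3 white balls; `product` gives draws with replacement and `permutations` gives draws without replacement. The function name `p_both_red` is an illustrative choice.

```python
from fractions import Fraction
from itertools import permutations, product

BOX = ["red", "white", "white", "white"]
positions = range(len(BOX))

# Ordered pairs (first draw, second draw), identified by ball position.
with_replacement = list(product(positions, repeat=2))     # same ball may repeat
without_replacement = list(permutations(positions, 2))    # two different balls


def p_both_red(draws):
    """Probability that both balls in an equally likely ordered draw are red."""
    favourable = sum(1 for i, j in draws if BOX[i] == "red" and BOX[j] == "red")
    return Fraction(favourable, len(draws))


print(p_both_red(with_replacement), p_both_red(without_replacement))
```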
PROBLEMS:1
The probability that a contractor will get a plumbing contract is 2/3 and probability that he
will not get an electrical contract is 5/9. If the probability of getting at least one of these
contracts is 4/5, what is the probability that he will get both?
Solutions:
Let A: contractor gets plumbing contract
B: contractor gets electrical contract
Then, P(A) = 2/3, P(B') = 5/9 and $P(A \cup B)$ = 4/5
Therefore, P(B) = 1 - P(B') = 4/9
By the addition theorem, we have
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
That is, $P(A \cap B) = P(A) + P(B) - P(A \cup B)$
Therefore,
P[he gets both plumbing and electrical contracts] = $P(A \cap B)$
= 2/3 + 4/9 - 4/5 = 14/45
PROBLEMS:3
A can solve 90 percent of the problems given in a book and B can solve 70
percent. What is the probability that at least one of them will solve a problem selected at
random?
Solutions: event A : student A solves the problem
event B : student B solves the problem.
Assuming that A and B solve problems independently,
P(at least one solves the problem) = 1 - P(none solves the problem)
$= 1 - P(A' \cap B') = 1 - P(A').P(B') = 1 - (0.10)(0.30) = 0.97$
PROBLEMS:4
The probability that a trainee will remain with a company is 0.6. The probability that
an employee earns more than Rs. 10,000 per year is 0.5. The probability that an employee is a trainee
who remained with the company or who earns more than Rs. 10,000 per year is 0.7. What is
the probability that an employee earns more than Rs. 10,000 per year given that he is a trainee who stayed with
the company?
Solutions: event A: a trainee will remain with the company
event B: a trainee earns more than Rs. 10,000.
Given P(A) = 0.6, P(B) = 0.5, $P(A \cup B)$ = 0.7
We need to find
$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A) + P(B) - P(A \cup B)}{P(A)} = \frac{0.4}{0.6} \approx 0.67$
PROBLEMS:5
Suppose that one of three men, a politician, a bureaucrat and an educationist, will
be appointed as VC of the university. The probabilities of their appointment are
respectively 0.3, 0.25 and 0.45. The probabilities that these people will promote research
activities if they are appointed are 0.4, 0.7 and 0.8 respectively. What is the probability that
research will be promoted by the new VC?
Solutions:
event A: politician appointed as VC
event B: bureaucrat appointed as VC
event C: educationist appointed as VC
event D: promotion of research activities
$P(D) = P(A \cap D) + P(B \cap D) + P(C \cap D)$
$= P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C)$
$= (0.4)(0.3) + (0.7)(0.25) + (0.8)(0.45) = 0.655$
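The calculation in the VC problem is a weighted sum over mutually exclusive and exhaustive cases. A minimal sketch follows; the dictionary names `prior` and `promotes_given` are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# P(appointed) for each candidate; the three cases are exclusive and exhaustive.
prior = {
    "politician": Fraction(30, 100),
    "bureaucrat": Fraction(25, 100),
    "educationist": Fraction(45, 100),
}

# P(research promoted | that candidate is appointed).
promotes_given = {
    "politician": Fraction(40, 100),
    "bureaucrat": Fraction(70, 100),
    "educationist": Fraction(80, 100),
}

assert sum(prior.values()) == 1  # the cases must cover the whole sample space

p_research = sum(promotes_given[k] * prior[k] for k in prior)
print(p_research, float(p_research))
```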

PROBLEMS:6
A box contains 4 green and 6 white balls. Another box contains 7 green and
8 white balls. Two balls are transferred from box 1 to box 2 and then a ball is drawn from
box 2. What is the probability that it is white?
event A: both transferred balls are green
event B: both transferred balls are white
event C: among the transferred balls, one is green and one is white
event D: selection of a white ball from box 2.
After the transfer, box 2 has 17 balls, of which 8, 10 or 9 are white according as A, B or C happens.
$P(D) = P(A \cap D) + P(B \cap D) + P(C \cap D)$
$= P(D|A).P(A) + P(D|B).P(B) + P(D|C).P(C)$
$= \frac{8}{17} \cdot \frac{{}^{4}C_2}{{}^{10}C_2} + \frac{10}{17} \cdot \frac{{}^{6}C_2}{{}^{10}C_2} + \frac{9}{17} \cdot \frac{{}^{4}C_1 \times {}^{6}C_1}{{}^{10}C_2} \approx 0.5412$
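The two-box problem can be recomputed with exact fractions, conditioning on how many white balls are transferred. The sketch below is one way to organise the calculation; the loop variable and names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

GREEN_1, WHITE_1 = 4, 6   # box 1
GREEN_2, WHITE_2 = 7, 8   # box 2
TRANSFERRED = 2

total_transfers = comb(GREEN_1 + WHITE_1, TRANSFERRED)
balls_in_box_2 = GREEN_2 + WHITE_2 + TRANSFERRED

p_white = Fraction(0)
for whites_moved in range(TRANSFERRED + 1):
    greens_moved = TRANSFERRED - whites_moved
    # Probability of this composition of the transferred pair.
    p_case = Fraction(
        comb(WHITE_1, whites_moved) * comb(GREEN_1, greens_moved), total_transfers
    )
    # Probability of drawing white from box 2 given this composition.
    p_white_given_case = Fraction(WHITE_2 + whites_moved, balls_in_box_2)
    p_white += p_white_given_case * p_case

print(p_white, round(float(p_white), 4))
```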
PROBLEMS:7
The probabilities of a husband's and a wife's selection to a post are 1/5 and 1/7
respectively. What is the probability that
(i) both of them will be selected,
(ii) exactly one of them will be selected,
(iii) none of them will be selected?
Solutions:
event A: selection of husband, P(A) = 1/5
event B: selection of wife, P(B) = 1/7
Assuming that the two selections are independent,
(i) P(both of them will be selected) = $P(A \cap B) = P(A).P(B) = \frac{1}{5} \times \frac{1}{7} = \frac{1}{35}$
(ii) P(exactly one of them will be selected)
$= P(A \cap B') + P(A' \cap B) = P(A).P(B') + P(A').P(B) = \frac{1}{5} \times \frac{6}{7} + \frac{4}{5} \times \frac{1}{7} = \frac{10}{35} = \frac{2}{7}$
(iii) P(none of them will be selected) = $P(A' \cap B') = P(A').P(B') = \frac{4}{5} \times \frac{6}{7} = \frac{24}{35}$
Random variable
INTRODUCTION
Suppose two fair coins are tossed. Here, the sample space is S = {TT, TH, HT, HH}. Suppose
to each of the four sample points in this sample space, a number is assigned as follows.
Sample point   TT   TH   HT   HH
Number          0    1    1    2
Here, the assigned numbers indicate the number of heads obtained in each case. Let 'the
number of heads' be denoted by X. Then, X is a function on the sample space. It takes the
values 0, 1 and 2 with probabilities
P[X=0] = P[no head] = 1/4
P[X=1] = P[one head] = 1/2
P[X=2] = P[two heads] = 1/4
Here, X is called a random variable or variate.
RANDOM VARIABLE
Random variable is a function which assigns a real number to every sample point in
the sample space. The set of such real values is the range of the random variable.
There are two types of random variable, namely, Discrete random variable and Continuous
random variable.
A variable X which takes values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ is a
discrete random variable. Here, the values $x_1, x_2, \ldots, x_n$ form the range of the random
variable.
A random variable whose range is uncountably infinite is a continuous random
variable.
Ex. 1. Let X denote the number of heads obtained while tossing two fair coins. Then, X is
a random variable which takes the values 0, 1 and 2 with respective probabilities 1/4, 1/2 and
1/4. Here, X is a discrete random variable.
Ex. 2. Let X denote the number obtained while throwing a fair die. Then, X is a discrete
random variable taking values 1, 2, 3, 4, 5 and 6 with probability 1/6 each.
Ex. 3. Let X denote the weight of apples. Then, X is a continuous random variable.
Generally, random variables are denoted by X, Y, Z, etc. If X is a random variable, the
values taken by X are denoted by x (small letter).
PROBABILITY MASS FUNCTION
Let X be a discrete random variable. And let p(x) be a function such that p(x) = P[X=x].
Then, p(x) is the probability mass function of X.
Here, (i) $p(x) \ge 0$ for all x
(ii) $\sum p(x) = 1$
A similar function is defined for a continuous random variable X. It is called the
probability density function (p.d.f.). It is denoted by f(x).
PROBABILITY DISTRIBUTION
A systematic presentation of the values taken by a random variable and the corresponding
probabilities is called probability distribution of the random variable.
Session 4
MATHEMATICAL EXPECTATION
Mathematical expectation of a random variable
Let X be a discrete random variable with probability mass function p(x). Then,
mathematical expectation of X is $E(X) = \sum x.p(x)$
Mathematical expectation of a function h(X) of X
Let X be a discrete random variable with probability mass function p(x). Then,
mathematical expectation of any function h(X) of X is $E[h(X)] = \sum h(x).p(x)$
Exercise 1 : Two fair coins are tossed once. Find the mathematical expectation of the
number of heads obtained.
Solution :
Let X denote the number of heads obtained. Then, X is a random variable which takes
the values 0, 1 and 2 with respective probabilities 1/4, 1/2 and 1/4. That is,
x      0     1     2
p(x)  1/4   1/2   1/4
The mathematical expectation of the number of heads is
$E(X) = \sum x.p(x) = 0 \times \frac{1}{4} + 1 \times \frac{1}{2} + 2 \times \frac{1}{4} = 1$
RESULTS
1. For a random variable X, the Arithmetic Mean is E(X).
2. For a random variable X, the Variance is
$Var(X) = E[X - E(X)]^2 = E(X^2) - [E(X)]^2$
The Standard Deviation is the square root of the variance.
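The definitions of expectation and variance translate directly into sums over a probability mass function. The following sketch represents a p.m.f. as a Python dictionary from values to probabilities, which is an assumption made for the example; the function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt


def expectation(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x).p(x) over the values x of a discrete variable."""
    return sum(h(x) * p for x, p in pmf.items())


def variance(pmf):
    """Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: x * x) - mean ** 2


# Number of heads in two tosses of a fair coin.
heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

mean_heads = expectation(heads)
var_heads = variance(heads)
sd_heads = sqrt(var_heads)

print(mean_heads, var_heads, sd_heads)
```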
Exercise: A bag has 3 white and 4 red balls. Two balls are randomly drawn from the bag.
Find the expected number of white balls in the draw.
Solution:
Let X denote the number of white balls obtained in the draw. Then, X is a random variable
which takes the values 0, 1 and 2 with respective probabilities
p(0) = P[both red] = $\frac{{}^{4}C_2}{{}^{7}C_2} = \frac{2}{7}$
p(1) = P[one white and one red] = $\frac{{}^{3}C_1 \times {}^{4}C_1}{{}^{7}C_2} = \frac{4}{7}$
p(2) = P[both white] = $\frac{{}^{3}C_2}{{}^{7}C_2} = \frac{1}{7}$
The probability distribution of X is
x 0 1 2
p(x) 2/7 4/7 1/7
$E(X) = \sum x.p(x) = 0 \times \frac{2}{7} + 1 \times \frac{4}{7} + 2 \times \frac{1}{7} = \frac{6}{7} \approx 1$ (approximately)
Thus, one white ball is expected in the draw.
THEORETICAL PROBABILITY DISTRIBUTIONS
In day to day life, we come across many random variables such as
1. Number of male children in a family having three children.
2. Number of passengers getting into a bus at the bus stand.
3. I.Q. of children.
4. Number of stones thrown successively at a mango on the tree until the mango is hit.
5. Marks scored by a candidate in the P.U.E. examination.
For a quick analysis of distributions of such random variables, we consider their theoretical
equivalents. These equivalent distributions are originated according to certain theoretical
assumptions and restrictions. Such theoretically designed distributions are called theoretical
distributions.
There are many types (families) of theoretical distributions. Some of them are
(i) Bernoulli distribution
(ii) Binomial distribution
(iii) Poisson distribution
(iv) Hypergeometric distribution
(v) Normal distribution.
The Bernoulli distribution and the Binomial distribution were discovered by James
Bernoulli during the first decade of eighteenth century. These works were published
posthumously in 1713.
The Poisson distribution was introduced by S.D. Poisson in 1837.
The Normal distribution was introduced by De Moivre in 1733. This distribution is also
called Gaussian distribution.
BERNOULLI EXPERIMENT
A random variable X which assumes values 1 and 0 with respective probabilities p and q =
1-p is called a Bernoulli variable.
The Bernoulli distribution is
x 1 0
p(x) p q
Note1: Bernoulli distribution has one constant, namely, p. This constant p is called the
parameter of the Bernoulli distribution. Different values of p (where 0<p<1) give different
Bernoulli distributions.
Note2: Bernoulli distribution can also be written down as $p(x) = p^x q^{1-x}$, x = 0, 1
Note3: Here, occurrence of the value 1 may be termed 'Success' and the occurrence of
the value 0 may be termed 'Failure'.
Therefore, P[Success] = p and P[Failure] = q = 1-p
Examples:
1. A fair coin is tossed. Let the variable X take values 1 and 0 according as
the toss results in Head or Tail. Then, X is a Bernoulli variable with
parameter p = 1/2. Here, X denotes the number of heads obtained in the toss.
REPETITION OF BERNOULLI EXPERIMENT
Bernoulli experiment: Suppose a random experiment has two outcomes, namely,
Success and Failure. Let probability of Success be p and let probability of Failure be
q=(1-p). Such an experiment is called Bernoulli experiment or Bernoulli trial.
Let a Bernoulli experiment be conducted (repeated) n times. Let the variable $X_i$ (i = 1, 2, 3, ..., n)
take values 1 and 0 according as the i-th experiment is a Success or a Failure. Then, $X_i$ is a
Bernoulli variate with parameter p. It denotes the number of successes in the i-th experiment.
Let $X = X_1 + X_2 + \ldots + X_n$. Then, X denotes the number of successes in these n repetitions.
Example:
1. Let a coin be tossed 3 times. Let $X_i$ (i = 1, 2, 3) be a variate which takes values 1 and 0
according as the i-th toss results in Head or Tail. Then, $X = X_1 + X_2 + X_3$ denotes
the number of heads obtained in the 3 tosses.
Result:
If $X_1, X_2, \ldots, X_n$ are independently and identically distributed (i.i.d.) Bernoulli variates
with common parameter p, their sum $X = X_1 + X_2 + \ldots + X_n$ is a Binomial variate with
parameters n and p.
MEAN & VARIANCE OF BERNOULLI DISTRIBUTION
Let X be a Bernoulli variate with parameter p. Then, probability distribution of X is
x 1 0
p(x) p q
$E(X) = \sum x.p(x) = 1 \times p + 0 \times q = p$
$E(X^2) = \sum x^2.p(x) = 1^2 \times p + 0^2 \times q = p$
$Var(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1-p) = pq$
Thus, Mean is E(X) = p
Variance is Var(X) = p(1-p) = pq
Standard deviation is S.D.(X) = $\sqrt{pq}$
BINOMIAL DISTRIBUTION
A probability distribution which has the following probability mass function (p.m.f.) is called the Binomial distribution.
$p(x) = {}^{n}C_x \, p^x q^{n-x}$, x = 0, 1, 2, ..., n;
0 < p < 1; q = 1 - p
Here n and p are the parameters. The variable X is discrete and is called a Binomial variate.
Note 1: The binomial p.m.f. has two independent constants, namely, n and p. These two constants are the parameters of the binomial distribution.
Note 2: A binomial distribution with parameters n and p is denoted by b(x; n, p) or B(n, p).
EXAMPLES FOR BINOMIAL VARIATE
1.Number of heads obtained in 3 tosses of a coin.
2.Number of male children in a family of 5 children
3.Number of bombs hitting a bridge among 8 bombs which are dropped on it.
4.Number of defective articles in a random sample of 5 articles drawn from a manufactured
lot
5.Number of seeds germinating among 10 seeds which were sown
Session 05
RECURRENCE RELATION BETWEEN SUCCESSIVE TERMS
We have
$p(x) = {}^{n}C_x \, p^x q^{n-x}$
and
$p(x-1) = {}^{n}C_{x-1} \, p^{x-1} q^{n-x+1}$
Therefore,
$\frac{p(x)}{p(x-1)} = \frac{{}^{n}C_x \, p^x q^{n-x}}{{}^{n}C_{x-1} \, p^{x-1} q^{n-x+1}} = \frac{n-x+1}{x} \cdot \frac{p}{q}$
Thus,
$p(x) = \frac{n-x+1}{x} \cdot \frac{p}{q} \cdot p(x-1)$
This is the recurrence relation between successive probabilities of the binomial distribution. When p(x-1) is known, this relation can be used to obtain p(x).
If N is the total frequency, the frequency of x is
$T_x = N \cdot p(x) = \frac{n-x+1}{x} \cdot \frac{p}{q} \cdot [N \cdot p(x-1)]$
That is,
$T_x = \frac{n-x+1}{x} \cdot \frac{p}{q} \cdot T_{x-1}$
This is the recurrence relation between successive frequencies of the binomial distribution. When $T_{x-1}$ is known, this relation can be used to obtain $T_x$.
Exercise: The incidence of an occupational disease in an industry is such that a worker has a 25% chance of suffering from it. What is the probability that out of 5 workers, at the most two contract the disease?
Solution:
X: number of workers contracting the disease among 5 workers
Then, X is a binomial variate with parameters n = 5 and
p = P[a worker contracts the disease] = 25/100 = 0.25
The probability mass function (p.m.f.) is
$p(x) = {}^{5}C_x \, (0.25)^x (0.75)^{5-x}$, x = 0, 1, 2, ..., 5
The probability that at the most two workers contract the disease is
$P[X \le 2] = p(0) + p(1) + p(2)$
$= {}^{5}C_0 (0.25)^0 (0.75)^5 + {}^{5}C_1 (0.25)^1 (0.75)^4 + {}^{5}C_2 (0.25)^2 (0.75)^3$
= 0.2373 + 0.3955 + 0.2637
= 0.8965
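The recurrence relation between successive binomial probabilities lends itself to a short computation. The sketch below is illustrative only (the function name `binomial_pmf_by_recurrence` is an assumption, not something from the notes) and requires 0 < p < 1. It starts from p(0) = q^n, applies the recurrence p(x) = ((n-x+1)/x)(p/q) p(x-1), and is used here on the occupational-disease exercise with n = 5, p = 0.25.

```python
def binomial_pmf_by_recurrence(n, p):
    """Return [p(0), ..., p(n)] for B(n, p), built term by term from p(0) = q^n."""
    q = 1 - p
    probs = [q ** n]                                   # p(0) = q^n
    for x in range(1, n + 1):
        # p(x) = ((n - x + 1) / x) * (p / q) * p(x - 1)
        probs.append(probs[-1] * (n - x + 1) / x * p / q)
    return probs


probs = binomial_pmf_by_recurrence(5, 0.25)
at_most_two = sum(probs[:3])       # p(0) + p(1) + p(2)
print(round(at_most_two, 4))       # 0.8965
```

Multiplying each probability by the total frequency N gives the corresponding theoretical frequencies $T_x$.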
Exercise: In a large consignment of electric lamps, 5% are defective. A random sample of 8 lamps is taken for inspection. What is the probability that it has one or more defectives?
Solution:
X: number of defective lamps
Then, X is B(n = 8, p = 5/100 = 0.05)
The p.m.f. is
$p(x) = {}^{8}C_x \, (0.05)^x (0.95)^{8-x}$, x = 0, 1, 2, ..., 8
P[sample has one or more defectives] = 1 - P[no defectives]
= 1 - p(0)
= 1 - ${}^{8}C_0 (0.05)^0 (0.95)^8$
= 1 - 0.6634 = 0.3366
Exercise: In a Binomial distribution the mean is 6 and the variance is 1.5. Find (i) P[X = 2] and (ii) P[X ≤ 2].
Solution:
Let n and p be the parameters. Then,
Mean = np = 6
Variance = npq = 1.5
$\frac{Variance}{Mean} = \frac{npq}{np} = q = \frac{1.5}{6} = \frac{1}{4}$
Therefore, q = 1/4 and p = 3/4
Therefore, Mean = n × 3/4 = 6
That is, n = 24/3 = 8
The p.m.f. is
$p(x) = {}^{8}C_x \, (3/4)^x (1/4)^{8-x}$, x = 0, 1, 2, ..., 8
(i) P[X = 2] = ${}^{8}C_2 (3/4)^2 (1/4)^6$ = 252/65536 = 0.003845
(ii) P[X ≤ 2] = p(0) + p(1) + p(2)
= ${}^{8}C_0 (3/4)^0 (1/4)^8 + {}^{8}C_1 (3/4)^1 (1/4)^7 + {}^{8}C_2 (3/4)^2 (1/4)^6$
= 277/65536 = 0.004227
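The steps of this solution (q = variance/mean, p = 1 - q, n = mean/p) can be reproduced with exact fractions. This is a small sketch under the assumption that the given mean and variance really come from a binomial distribution; the helper names `binomial_parameters` and `binomial_pmf` are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_parameters(mean, variance):
    """Recover (n, p) of a binomial distribution from mean = np and variance = npq."""
    q = Fraction(variance) / Fraction(mean)   # q = npq / np
    p = 1 - q
    n = Fraction(mean) / p                    # n = mean / p
    return int(n), p


def binomial_pmf(x, n, p):
    """p(x) = nCx * p^x * q^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = binomial_parameters(6, Fraction(3, 2))
p_two = binomial_pmf(2, n, p)
p_at_most_two = sum(binomial_pmf(x, n, p) for x in range(3))
print(n, p)                                   # 8 3/4
print(float(p_two), float(p_at_most_two))
```

Working with `Fraction` keeps the results exact, so they can be compared directly with 252/65536 and 277/65536.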
POISSON DISTRIBUTION
A probability distribution which has the following probability mass function (p.m.f.) is called the Poisson distribution.
$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, x = 0, 1, 2, 3, ...; λ > 0
Here, the variable X is discrete and is called a Poisson variate.
Note 1: λ is the parameter of the Poisson distribution. The Poisson distribution has only one parameter.
Note 2: The Poisson distribution may be treated as the limiting form of the binomial distribution under the following conditions. (The binomial distribution tends to the Poisson distribution under these conditions.)
(i) p is very small (p → 0)
(ii) n is very large (n → ∞) and
(iii) np = λ is fixed
EXAMPLES FOR POISSON VARIATE
Many variables which occur in nature vary according to the Poisson law. Some of them
are
1. Number of deaths occurring in a city in a day
2. Number of road accidents occurring in a city in a day
3. Number of incoming telephone calls at an exchange in one minute
4. Number of vehicles crossing a junction in one minute.
Session 06
Exercise: The number of accidents occurring in a city in a day is a Poisson variate with mean 0.8. Find the probability that on a randomly selected day
(i) there are no accidents
(ii) there are accidents
Solution:
Let X: number of accidents per day.
Then, X is P(λ = 0.8).
The p.m.f. is
$p(x) = \frac{e^{-0.8} (0.8)^x}{x!}$, x = 0, 1, 2, 3, ...
(i) The probability that on a particular day there are no accidents is
P[no accidents] = P[X = 0] = p(0)
= $\frac{e^{-0.8} (0.8)^0}{0!} = e^{-0.8}$ = 0.449
(ii) P[accidents occur] = 1 - P[no accidents]
= 1 - p(0) = 1 - 0.449 = 0.551
Exercise: The number of persons joining a cinema queue in a minute has a Poisson distribution with parameter 5.8. Find the probability that
(i) no one joins the queue in a particular minute
(ii) 2 or more persons join the queue in the minute.
Solution:
Let X: number of persons joining the queue in a minute. Then, X is P(λ = 5.8).
The p.m.f. is
$p(x) = \frac{e^{-5.8} (5.8)^x}{x!}$, x = 0, 1, 2, 3, ...
(i) P[no one joins the queue] = P[X = 0] = p(0)
= $\frac{e^{-5.8} (5.8)^0}{0!} = e^{-5.8}$ = 0.003
(ii) P[two or more join] = P[X ≥ 2]
= 1 - P[X < 2]
= 1 - {p(0) + p(1)}
= $1 - \left\{ \frac{e^{-5.8} (5.8)^0}{0!} + \frac{e^{-5.8} (5.8)^1}{1!} \right\}$
= $1 - e^{-5.8} \{1 + 5.8\}$
= 1 - 0.003 × 6.8
= 1 - 0.0204 = 0.9796
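The two Poisson exercises above can be re-computed with a few lines of Python. This is a sketch for checking the arithmetic, with `poisson_pmf` as a hypothetical helper name; it uses the p.m.f. $p(x) = e^{-\lambda}\lambda^x / x!$ stated for the Poisson distribution.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """p(x) = e^(-lam) * lam^x / x! for a Poisson variate with parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)


# Accidents per day, lam = 0.8
p_no_accident = poisson_pmf(0, 0.8)
p_some_accident = 1 - p_no_accident

# Persons joining the queue per minute, lam = 5.8
p_nobody = poisson_pmf(0, 5.8)
p_two_or_more = 1 - (poisson_pmf(0, 5.8) + poisson_pmf(1, 5.8))

print(round(p_no_accident, 3), round(p_some_accident, 3))
print(round(p_nobody, 3), round(p_two_or_more, 4))
```

Without intermediate rounding, the last probability comes out near 0.9794; the value 0.9796 in the worked solution arises from rounding $e^{-5.8}$ to 0.003 before multiplying by 6.8.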

Exercise: The average number of telephone calls booked at an exchange between 10-00 A.M. and 10-10 A.M. is 4. Find the probability that on a randomly selected day 2 or more calls are booked between 10-00 A.M. and 10-10 A.M. On how many days of a year would you expect booking of 2 or more calls during that time gap?
Solution:
Let X: number of telephone calls booked at the exchange during 10-00 A.M. to 10-10 A.M. Then, X is P(λ = 4).
The p.m.f. is
$p(x) = \frac{e^{-4} 4^x}{x!}$, x = 0, 1, 2, 3, ...
P[2 or more calls] = 1 - P[less than 2 calls]
= 1 - [p(0) + p(1)]
= $1 - e^{-4} \left[ \frac{4^0}{0!} + \frac{4^1}{1!} \right]$
= 1 - 0.0183 [1 + 4]
= 1 - 0.0915 = 0.9085
A year has 365 days. Out of these N = 365 days, the number of days on which there will be 2 or more calls is
N × P[2 or more calls] = 365 × 0.9085 = 332 (approximately)
Exercise: 2 percent of the fuses manufactured by a firm are expected to be defective. Find the probability that a box containing 200 fuses contains
(i) defective fuses
(ii) 3 or more defective fuses
Solution:
2 percent of the fuses are defective. Therefore, the probability that a fuse is defective is p = 2/100 = 0.02
Let X denote the number of defective fuses in the box of 200 fuses. Then, X is B(n = 200, p = 0.02)
Here, p is very small and n is very large. Therefore, X can be treated as a Poisson variate with parameter λ = np = 200 × 0.02 = 4.
The p.m.f. is
$p(x) = \frac{e^{-4} 4^x}{x!}$, x = 0, 1, 2, 3, ...
(i) P[box has defective fuses] = 1 - P[no defective fuses]
= 1 - p(0)
= $1 - e^{-4} \frac{4^0}{0!}$
= 1 - 0.0183 = 0.9817
(ii) P[3 or more defective fuses] = 1 - P[less than 3 defective fuses]
= 1 - [p(0) + p(1) + p(2)]
= $1 - e^{-4} [1 + 4 + 8]$
= 1 - 0.0183 × 13
= 1 - 0.2379 = 0.7621
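To see how close the Poisson approximation is in the fuse exercise, the exact binomial probability can be compared with the Poisson value. The snippet below is an illustrative sketch; `binomial_pmf` and `poisson_pmf` are hypothetical helper names, and the comparison itself is an addition, not part of the original solution.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Exact binomial probability nCx * p^x * q^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Poisson probability e^(-lam) * lam^x / x!."""
    return exp(-lam) * lam ** x / factorial(x)


n, p = 200, 0.02
lam = n * p                        # Poisson parameter lam = np = 4

exact_tail = 1 - sum(binomial_pmf(x, n, p) for x in range(3))
approx_tail = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(f"exact binomial P[X >= 3] = {exact_tail:.4f}")
print(f"Poisson approximation    = {approx_tail:.4f}")
```

With n = 200 and p = 0.02 the two values differ only in the third decimal place, which is why the Poisson form is convenient here.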
Exercise: The probability that a razor blade manufactured by a firm is defective is 1/500. Blades are supplied in packets of 5 each. In a lot of 10,000 packets, how many packets would
(i) be free of defective blades?
(ii) contain exactly one defective blade? ($e^{-0.01}$ = 0.99)
Solution:
Let X be the number of defective blades in a packet of 5 blades. Then, X is B(n = 5, p = 1/500)
Since p is very small, X may be treated approximately as a Poisson variate with parameter λ = np = 5 × (1/500) = 0.01
The p.m.f. is
$p(x) = \frac{e^{-0.01} (0.01)^x}{x!}$, x = 0, 1, 2, 3, ...
(i) P[no defective blades] = p(0)
= $\frac{e^{-0.01} (0.01)^0}{0!}$ = 0.99
The number of packets which will be free of defective blades is
N × P[no defective blades] = 10000 × 0.99 = 9900
(ii) P[one defective blade] = p(1)
= $\frac{e^{-0.01} (0.01)^1}{1!}$ = 0.99 × 0.01 = 0.0099
The number of packets which will have one defective blade is
N × P[one defective blade] = 10000 × 0.0099 = 99
Exercise: On an average, a typist makes 3 mistakes while typing one page. What is the probability that a randomly observed page is free of mistakes? Among 200 pages, in how many pages would you expect mistakes?
Solution:
Let X: number of mistakes in a page.
Then, X is P(λ = 3).
The p.m.f. is
$p(x) = \frac{e^{-3} 3^x}{x!}$, x = 0, 1, 2, 3, ...
P[page is free of mistakes] = p(0)
= $\frac{e^{-3} 3^0}{0!} = e^{-3}$ = 0.0498
P[page has mistakes] = 1 - P[page has no mistakes]
= 1 - 0.0498 = 0.9502
Among 200 pages, the expected number of pages containing mistakes is
N × P[page has mistakes] = 200 × 0.9502 = 190
Exercise: In a Poisson distribution P[X = 2] = P[X = 3]. Find P[X = 4].
Solution:
Let λ be the parameter.
Here, P[X = 2] = P[X = 3]
$\frac{e^{-\lambda} \lambda^2}{2!} = \frac{e^{-\lambda} \lambda^3}{3!}$
That is, $1 = \frac{\lambda}{3}$
And so, λ = 3
The p.m.f. is
$p(x) = \frac{e^{-3} 3^x}{x!}$, x = 0, 1, 2, ...
P[X = 4] = p(4) = $\frac{e^{-3} 3^4}{4!} = \frac{0.0498 \times 81}{24}$ = 0.1681
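Exercise: For a Poisson variate, 3 × P[X = 2] = P[X = 4]. Find the standard deviation.
Solution:
Let λ be the parameter.
Here, 3 × P[X = 2] = P[X = 4]
$3 \cdot \frac{e^{-\lambda} \lambda^2}{2!} = \frac{e^{-\lambda} \lambda^4}{4!}$
That is, $\frac{3}{2} = \frac{\lambda^2}{24}$
And so, $\lambda^2$ = 36
That is, λ = 6 (since λ > 0)
Thus, the parameter is λ = 6
For a Poisson distribution the variance equals λ. The standard deviation is therefore S.D.(X) = $\sqrt{6}$ = 2.449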
Exercise: The following data relate to the number of mistakes in each page of a book containing 180 pages.
No. of mistakes per page:   0     1    2    3    4    5 or more   Total
No. of pages:               156   16   5    2    1    0           180
Fit a Poisson distribution to the data. Obtain the theoretical frequencies.
Solution:
Let X denote the number of mistakes per page. Then, X is a Poisson variate. The parameter is estimated by the mean of the data:
$\lambda = \bar{x} = \frac{\sum fx}{N} = \frac{156 \times 0 + 16 \times 1 + 5 \times 2 + 2 \times 3 + 1 \times 4 + 0 \times 5}{180} = \frac{36}{180} = 0.2$
The p.m.f. is
$p(x) = \frac{e^{-0.2} (0.2)^x}{x!}$, x = 0, 1, 2, 3, ...
The frequency function is
$T_x = 180 \times \frac{e^{-0.2} (0.2)^x}{x!}$, x = 0, 1, 2, 3, ...
$T_0 = 180 \times \frac{e^{-0.2} (0.2)^0}{0!} = 180 \times 0.8187 = 147.37$
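The remaining theoretical frequencies follow from $T_0$ in the same way. The sketch below is one possible way to carry out the fit: it estimates λ by the sample mean and then uses the Poisson recurrence $T_x = (\lambda / x)\, T_{x-1}$, which follows directly from the frequency function but is not written out in the notes. The function name `fit_poisson` is hypothetical.

```python
from math import exp


def fit_poisson(values, freqs, max_x):
    """Estimate lam by the sample mean and return (lam, [T_0, ..., T_max_x])."""
    total = sum(freqs)                                        # N
    lam = sum(x * f for x, f in zip(values, freqs)) / total   # mean = sum(f*x) / N
    t = total * exp(-lam)                                     # T_0 = N * e^(-lam)
    expected = [t]
    for x in range(1, max_x + 1):
        t = t * lam / x                                       # T_x = (lam / x) * T_(x-1)
        expected.append(t)
    return lam, expected


mistakes = [0, 1, 2, 3, 4, 5]
pages = [156, 16, 5, 2, 1, 0]
lam, expected = fit_poisson(mistakes, pages, max_x=4)
for x, t in enumerate(expected):
    print(x, round(t, 2))
```

The frequency for "5 or more" can then be taken as 180 minus the sum of the listed frequencies, so that the theoretical frequencies also total 180.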

NORMAL DISTRIBUTION
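A probability distribution which has the following probability density function (p.d.f.) is called the Normal distribution.
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, -∞ < x < ∞; -∞ < μ < ∞; σ > 0
Here, the variable X is continuous and is called a Normal variate.
Note 1: The distribution has two parameters, namely, μ and σ. (Here, π = 3.14 and e = 2.718, approximately.)
Note 2: This normal distribution has Mean E(X) = μ and Variance V(X) = σ². Hence S.D.(X) = σ.
Note 3: A normal variate with parameters μ and σ is denoted by N(μ, σ²).
Note 4: The normal p.d.f. can also be written as
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]$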
EXAMPLES FOR NORMAL VARIATE
Many of the variables which occur in nature have a normal distribution. Some examples are
1. Height of students of a college
2. Weight of apples grown in an orchard
3. I.Q. (Intelligence Quotient) of a large group of children
4. Marks scored by students in an examination
Session 7
PROPERTIES OF NORMAL DISTRIBUTION
(Properties of normal Curve)
A normal distribution with parameters μ and σ has the following properties.
1. The curve is bell shaped.
a. It is symmetrical (non-skew). That is, β1 = 0.
b. The mean, median and mode are equal.
2. The curve is asymptotic to the X-axis. That is, the curve touches the X-axis only at -∞ and +∞.
3. The curve has points of inflexion at μ - σ and μ + σ.
4. For the distribution,
a. Standard deviation = σ
b. Quartile deviation = (2/3)σ (approximately)
c. Mean deviation = (4/5)σ (approximately)
5. For the distribution,
a. The odd order central moments are equal to zero.
b. The even order central moments are given by
$\mu_{2n} = 1 \cdot 3 \cdot 5 \cdots (2n-1) \, \sigma^{2n}$
Thus, $\mu_2 = \sigma^2$ and $\mu_4 = 3\sigma^4$
6. The distribution is mesokurtic. That is, β2 = 3.
7. Total area under the curve is unity.
P[a < X ≤ b] = area bounded by the curve and the ordinates at a and b
a. P[μ - σ < X ≤ μ + σ] = 0.6826 = 68.26%
b. P[μ - 2σ < X ≤ μ + 2σ] = 0.9544 = 95.44%
c. P[μ - 3σ < X ≤ μ + 3σ] = 0.9974 = 99.74%
STANDARD NORMAL VARIATE (SNV)
A normal variate with mean μ = 0 and standard deviation σ = 1 is called a Standard Normal Variate. It is denoted by Z. Its probability density function is
$\phi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}$, -∞ < z < ∞
The graph of the standard normal distribution is shown in the figure.
The shaded area in the figure represents the probability that the variate takes a value between 0 and z. This area can be read from the table of areas under the Standard Normal Curve. Corresponding to any positive z, the area from 0 to z can be read from this table.
Let X be a normal variate with mean μ and standard deviation σ. Then
$Z = \frac{X - \mu}{\sigma}$
is a Standard Normal Variate.
Therefore, to find any probability regarding X, the Standard Normal Variate can be made use of.
Note: The Standard Normal Variate (SNV) is denoted by N(0, 1).
PROBLEMS
X is a normal variate with mean 42 and standard deviation 4. Find the probability that a value taken by X is
(i) less than 50 (ii) greater than 50
(iii) less than 40 (iv) greater than 40
(v) between 43 and 46 (vi) between 40 and 44
(vii) between 37 and 41.
Solution:
X is a normal variate with parameters μ = 42 and σ = 4
Therefore,
$Z = \frac{X - 42}{4}$
is a Standard Normal Variate.
(i) P[X < 50] = P[Z < (50 - 42)/4] = P[Z < 2]
= area from (-∞) to 2
= [area from (-∞) to 0] + [area from 0 to 2]
= 0.5 + 0.4772 (from the table)
= 0.9772
(ii) P[X > 50] = P[Z > 2]
= area from 2 to ∞
= [area from 0 to ∞] - [area from 0 to 2]
= 0.5 - 0.4772 (from the table)
= 0.0228
(iii) P[X < 40] = P[Z < (40 - 42)/4] = P[Z < -0.5]
= area from (-∞) to (-0.5)
= area from 0.5 to ∞ (by symmetry)
= [area from 0 to ∞] - [area from 0 to 0.5]
= 0.5 - 0.1915
= 0.3085
(iv) P[X > 40] = P[Z > -0.5]
= area from (-0.5) to ∞
= [area from (-0.5) to 0] + [area from 0 to ∞]
= [area from 0 to 0.5] + [area from 0 to ∞]
= 0.1915 + 0.5
= 0.6915
(v) P[43 < X < 46] = P[0.25 < Z < 1]
= area from 0.25 to 1
= [area from 0 to 1] - [area from 0 to 0.25]
= 0.3413 - 0.0987
= 0.2426
(vi) P[40 < X < 44] = P[-0.5 < Z < 0.5]
= area from -0.5 to 0.5
= [area from -0.5 to 0] + [area from 0 to 0.5]
= [area from 0 to 0.5] + [area from 0 to 0.5]
= 0.1915 + 0.1915
= 0.3830
(vii) P[37 < X < 41] = P[-1.25 < Z < -0.25]
= area from -1.25 to -0.25
= area from 0.25 to 1.25 (by symmetry)
= [area from 0 to 1.25] - [area from 0 to 0.25]
= 0.3944 - 0.0987
= 0.2957
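The table look-ups in this problem can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper `prob_between` is a hypothetical name introduced here. It evaluates the cumulative distribution function of N(42, 4²) instead of reading areas from the printed table.

```python
from statistics import NormalDist

X = NormalDist(mu=42, sigma=4)        # X is N(42, 4^2)


def prob_between(dist, a, b):
    """P[a < X < b] as the difference of two cumulative probabilities."""
    return dist.cdf(b) - dist.cdf(a)


answers = {
    "(i)   P[X < 50]": X.cdf(50),
    "(ii)  P[X > 50]": 1 - X.cdf(50),
    "(iii) P[X < 40]": X.cdf(40),
    "(iv)  P[X > 40]": 1 - X.cdf(40),
    "(v)   P[43 < X < 46]": prob_between(X, 43, 46),
    "(vi)  P[40 < X < 44]": prob_between(X, 40, 44),
    "(vii) P[37 < X < 41]": prob_between(X, 37, 41),
}
for label, value in answers.items():
    print(f"{label} = {value:.4f}")
```

Small differences in the fourth decimal place relative to the worked answers are due to the four-figure rounding of the printed table.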
PROBLEMS
The height of students is normally distributed with mean 165 cm and standard deviation 5 cm. Find the probability that the height of a student is
(i) more than 177 cm
(ii) less than 162 cm
Solution:
Let X denote height. Then, X is a normal variate with parameters μ = 165 cm and σ = 5 cm.
$Z = \frac{X - 165}{5}$ is N(0, 1).
(i) The probability that the student is more than 177 cm tall is
P[X > 177] = P[Z > (177 - 165)/5] = P[Z > 2.4]
= area from 2.4 to ∞
= [area from 0 to ∞] - [area from 0 to 2.4]
= 0.5 - 0.4918
= 0.0082
(ii) The probability that the student is less than 162 cm tall is
P[X < 162] = P[Z < (162 - 165)/5] = P[Z < -0.6]
= area from (-∞) to (-0.6)
= area from 0.6 to ∞
= [area from 0 to ∞] - [area from 0 to 0.6]
= 0.5 - 0.2258
= 0.2742
PROBLEMS
The mean life of electric bulbs manufactured by a firm is 1200 hrs. The standard deviation is 200 hrs.
(i) In a lot of 10,000 bulbs, how many bulbs are expected to have a life of 1050 hrs. or more?
(ii) What is the percentage of bulbs which are expected to fail before 1500 hrs. of service?
Solution:
Let X denote the life of the bulbs. Then, X is a normal variate with parameters μ = 1200 hrs. and σ = 200 hrs.
$Z = \frac{X - 1200}{200}$ is N(0, 1).
(i) The probability that the life of a bulb is 1050 hrs. or more is
P[X ≥ 1050] = P[Z ≥ (1050 - 1200)/200] = P[Z ≥ -0.75]
= [area from -0.75 to 0] + [area from 0 to ∞]
= 0.2734 + 0.5
= 0.7734
In a lot of N = 10,000 bulbs, the expected number of bulbs with life 1050 hrs. or more is
N × P[X ≥ 1050] = 10000 × 0.7734 = 7734
(ii) The probability that the life of a bulb is less than 1500 hrs. is
P[X < 1500] = P[Z < (1500 - 1200)/200] = P[Z < 1.5]
= 0.5 + 0.4332
= 0.9332
The percentage of bulbs with life less than 1500 hrs. is
100 × P[X < 1500] = 100 × 0.9332 = 93.32
PROBLEMS
The mean and standard deviation of marks scored by a group of students in an examination are 47 and 10 respectively. If only 20% of the students have to be promoted, what should be the marks limit for promotion?
Solution:
Let X denote marks. Then, X is a Normal variate with parameters μ = 47 and σ = 10.
$Z = \frac{X - 47}{10}$ is N(0, 1).
Let a be the marks above which a student would be promoted. Then, since only 20% of the students have to be promoted, the probability of a student getting promotion should be 20/100 = 0.2
Therefore, P[X > a] = 0.2
And so, P[Z > z] = 0.2 where $z = \frac{a - 47}{10}$
That is, [area from z to ∞] = 0.2
That is, [area from 0 to z] = 0.5 - 0.2 = 0.3
From the table of areas, the value of z for which [area from 0 to z] = 0.3 is z = 0.84
Therefore, z = 0.84.
And so, $\frac{a - 47}{10} = 0.84$, that is, a = 47 + 8.4 = 55.4
Thus, the marks limit for promotion is a = 55.4
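Reading the table "backwards" to find z from a given area corresponds to evaluating the inverse cumulative distribution function. As a hedged illustration (the variable names are hypothetical and the approach is an addition, not the method used in the notes), Python's standard `statistics.NormalDist.inv_cdf` gives the cut-off directly: the promotion limit a satisfies P[X ≤ a] = 0.8.

```python
from statistics import NormalDist

marks = NormalDist(mu=47, sigma=10)      # marks are N(47, 10^2)
promoted_fraction = 0.20

# P[X > a] = 0.2 is equivalent to P[X <= a] = 0.8
cutoff = marks.inv_cdf(1 - promoted_fraction)
z = (cutoff - marks.mean) / marks.stdev  # corresponding standard normal value

print(round(z, 2), round(cutoff, 1))
```

The value of z agrees with the table value 0.84 to two decimals, and the cut-off rounds to 55.4 marks.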
Session 08
Advantages of Probability Sampling
The following are the basic advantages of probability sampling methods:
* Probability sampling does not depend upon the existence of detailed information about the universe for its effectiveness.
* Probability sampling provides estimates which are essentially unbiased and have measurable precision.
* It is possible to evaluate the relative efficiency of various sample designs only when probability sampling is used.
NON-PROBABILITY SAMPLING METHODS
Judgment Sampling
In this method of sampling the choice of sample items depends exclusively on the
judgment of the investigator. In other words, the investigator exercises his judgment in the
choice and includes those items in the sample, which he thinks are most typical of the
universe with regard to the characteristics under investigation. For example, if sample of
ten students is to be selected from a class of sixty for analysing the spending habits of
students, the investigator would select 10 students who, in his opinion, are representative of
the class.
Merits: Though the principles of sampling theory are not applicable to judgment sampling,
the method is sometimes used in solving many types of economic and business problems.
The use of judgment sampling is justified under a variety of circumstances:
(i) When only a small number of sampling units are in the universe, simple random
selection may miss the more important elements, whereas judgment selection would
certainly include them in the sample.
(ii) When we want to study some unknown traits of a population, some of whose
characteristics are known, we may then stratify the population according to these known
properties and select sampling units from each stratum on the basis of judgment. This
method is used to obtain a more representative sample.
(iii) In solving everyday business problems and making public policy decisions, executives
and public officials are often pressed for time and cannot wait for probability sample
designs. Judgment sampling is then the only practical method to arrive at solutions to their
urgent problems.
Limitations: The judgment sampling method is, however, associated with the following
limitations:
(i) This method is not scientific because the population units to be sampled may be affected
by the personal prejudice or bias of the investigator. Thus, judgment sampling involves the
risk that the investigator may establish foregone conclusions by including those items in
the sample which conform to his preconceived notions. For example, if an investigator
holds the view that the wages of workers in a certain establishment are very low, and if he
adopts the judgment sampling method, he may include only those workers in the sample
whose wages are low and thereby establish his point of view which may be far from the
truth. Since an element of subjectiveness is possible, this method cannot be recommended
for general use.
(ii) There is no objective way of evaluating the reliability of sample results. The success of
this method depends upon the excellence in judgment. If the individual making decisions is
knowledgeable about the population and has good judgment, then the resulting sample may
be representative, otherwise the inferences based on the sample may be erroneous. It may
be noted that even if a judgment sample is reasonably representative, there is no objective
method for determining the size or likelihood of sampling error. This is a big defect of the
method.
Quota Sampling
Quota sampling is a type of judgment sampling and is perhaps the most commonly used
sampling technique in the non-probability category. In a quota sample, quotas are set up
according to some specified characteristics, such as so many in each of several income
groups, so many in each age group, so many with certain political or religious affiliations, and so
on. Each interviewer is then told to interview a certain number of persons which constitute
his quota. Within the quota, the selection of sample items depends on personal judgment.
For example, in a radio listening survey, the interviewers may be told to interview 500
people living in a certain area and that out of every 100 persons interviewed 60 are to be
housewives, 25 farmers and 15 children under the age of 15. Within these quotas the
interviewer is free to select the people to be interviewed. The cost per person interviewed
may be relatively small for a quota sample but there are numerous opportunities for
bias which may invalidate the results. For example, interviewers may miss farmers
working in the fields or talk with those housewives who are at home. If a person refuses to
respond, the interviewer simply selects someone else. Because of the risk of personal
prejudice and bias entering the process of selection, the quota sampling is not widely used
in practical work.
Quota sampling and stratified random sampling are similar inasmuch as in both methods
the universe is divided into parts and the total sample is allocated among the parts.
However, the two procedures diverge radically. In stratified random sampling the sample
within each stratum is chosen at random. In quota sampling, the sampling within each cell is
not done at random; the field representatives are given wide latitude in the selection of
respondents to meet their quotas.
Quota sampling is often used in public opinion studies. It occasionally provides satisfactory
results if the interviewers are carefully trained and if they follow their instructions closely.
It is often found that since the choice of respondents within a cell is left to the field
representatives, the more accessible and articulate people within a cell will usually be the
ones who are interviewed. Slight negligence on the part of interviewers may lead to
interviewing ineligible respondents. Even with alert and conscientious field representatives
it is often difficult to determine such control category as age, income, educational
qualifications, etc.

Convenience Sampling
A convenience sample is obtained by selecting 'convenient' population units. The method of convenience sampling is also called the chunk. A chunk refers to that fraction of the population being investigated which is selected neither by probability nor by judgment but by convenience. A sample obtained from readily available lists such as automobile registrations, telephone directories, etc., is a convenience sample and not a random sample even if the sample is drawn at random from the lists. If a person is to submit a project report on labour-management relations in the textile industry and he takes a textile mill close to his office and interviews some people there, he is following the convenience sampling method. Convenience samples are prone to bias by their very nature: selecting population elements which are convenient to choose almost always makes them special or different from the rest of the elements in the population in some way.
Hence the results obtained by following the convenience sampling method can hardly be representative of the population; they are generally biased and unsatisfactory. However, convenience sampling is often used for making pilot studies. Questions may be tested and preliminary information may be obtained by the chunk before the final sampling design is decided upon.
PROBABILITY SAMPLING METHODS
Simple or Unrestricted Random Sampling
Simple random sampling refers to that sampling technique in which each and every unit of the population has an equal opportunity of being selected in the sample. In simple random sampling, which items get selected in the sample is just a matter of chance; personal bias of the investigator does not influence the selection. It should be noted that the word 'random' does not mean 'haphazard' or 'hit-or-miss'; it rather means that the selection process is such that chance alone determines which items shall be included in the sample. As pointed out by Chou, when a sample of size n is drawn from a population with N elements, the sample is a 'simple random sample' if any of the following is true. And, if any of the following is true, so are the other two:
* All n items of the sample are selected independently of one another and all N items in the population have the same chance of being included in the sample. By independence of selection we mean that the selection of a particular item in one draw has no influence on the probabilities of selection in any other draw.
* At each selection, all remaining items in the population have the same chance of being drawn. If sampling is made with replacement, i.e., when each unit drawn from the population is returned prior to drawing the next unit, each item has a probability of 1/N of being drawn at each selection. If sampling is without replacement, i.e., when each unit drawn from the population is not returned prior to drawing the next unit, the probability of selection of each item remaining in the population at the first draw is 1/N, at the second draw is 1/(N-1), at the third draw is 1/(N-2), and so on. It should be noted that sampling with replacement has very limited and special use in statistics; we are mostly concerned with sampling without replacement.
* All the possible samples of a given size n are equally likely to be selected.
To ensure randomness of selection one may adopt either the Lottery Method or consult a table of random numbers.
Lottery Method: This is a very popular method of taking a random sample. Under this
method, all items of the universe are numbered or named on separate slips of paper of
identical size and shape. These slips are then folded and mixed up in a container or drum.
A blindfold selection is then made of the number of slips required to constitute the desired
sample size. The selection of items thus depends entirely on chance. The method would be
quite clear with the help of an example. If we want to take a sample of 10 persons out of a
population of 100, the procedure is to write the names of the 100 persons on separate slips
of paper, fold these slips, mix them thoroughly and then make a blindfold selection of 10
slips.
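The lottery method can be imitated in software. The following is a minimal sketch, assuming the "slips" are simply labels held in a list; the function name `lottery_sample` and the labels "Person 1" to "Person 100" are hypothetical. Shuffling plays the role of mixing the slips in the drum, and taking the first few slips corresponds to the blindfold draw without replacement.

```python
import random


def lottery_sample(names, sample_size, seed=None):
    """Draw sample_size names without replacement, as in the lottery method."""
    rng = random.Random(seed)     # a seed is optional; it only makes the draw repeatable
    slips = list(names)           # one slip per unit of the universe
    rng.shuffle(slips)            # mix the slips thoroughly
    return slips[:sample_size]    # blindfold selection of the required number of slips


population = [f"Person {i}" for i in range(1, 101)]   # universe of 100 persons
sample = lottery_sample(population, 10, seed=2024)
print(sample)
```

Every subset of 10 persons is equally likely under this procedure, which is the defining property of a simple random sample.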
The above method is very popular in lottery draws where a decision about prizes is to be
made. However, while adopting the lottery method it is absolutely essential to see that the slips
are of identical size, shape and colour, otherwise there is a lot of possibility of personal
prejudice and bias affecting the results.
Table of Random Numbers: The lottery method discussed above becomes quite cumbersome as the size of the population increases. An alternative method of random selection is that of using a table of random numbers.
The random numbers are generally obtained by some mechanism which, when repeated a large number of times, ensures approximately equal frequencies for the numbers from 0 to 9 and also proper frequencies for various combinations of numbers (such as 00, 01, ..., 99 or 000, 001, ..., 999) that could be expected in a random sequence of the digits 0 to 9.
Several standard tables of random numbers are available, among which the following may be specially mentioned, as they have been tested extensively for randomness:
* Tippett's (1927) random number tables consisting of 41,600 random digits grouped into 10,400 sets of four-digit random numbers;
* Fisher and Yates (1938) table of random numbers with 15,000 random digits arranged into 1,500 sets of ten-digit random numbers;
* Kendall and Babington Smith (1939) table of random numbers consisting of 100,000 random digits grouped into 25,000 sets of four-digit random numbers;
* Rand Corporation (1955) table of random numbers consisting of 100,000 random digits grouped into 20,000 sets of five-digit random numbers; and
* C.R. Rao, Mitra and Mathai (1966) table of random numbers.
Tippett's table of random numbers is most popularly used in practice. We give below the first forty sets from Tippett's table as an illustration of the general appearance of random numbers:
2952 6641 3992 9792 7969 5911 3170 5624
4167 9524 1545 1396 7203 5356 1300 2693
2670 7483 3408 2762 3563 1089 6913 7991
0560 5246 1112 6107 6008 8125 4233 8776
2754 9143 1405 9025 7002 6111 8816 6446
It is important that the starting point in the table of random numbers be selected in some random fashion so that every unit has an equal chance of being selected.
One may question, and quite rightly, how it is ensured that these digits are random. It may be pointed out that the digits in the table were chosen haphazardly, but the real guarantee of their randomness lies in practical tests. Tippett's numbers have been subjected to numerous tests and used in many investigations, and their randomness has been well established for all practical purposes. An example to illustrate how Tippett's table of random numbers may be used is given below.
Suppose we have to select 20 items out of 6,000. The procedure is to number all the items from 1 to 6,000. A page from Tippett's table may then be consulted and the first twenty numbers up to 6,000 noted down. Items bearing those numbers will be included in the sample. Making use of the portion of the table given above, the required numbers are:
2952 3992 5911 3170 5624 4167
1545 1396 5356 1300 2693 2670
3408 2762 3563 1089 0560 5246
1112 4233
The items which bear the above numbers constitute the sample.
Universe size less than 1,000. If the size of the universe is less than 1,000 the procedure will be different, as Tippett's numbers are available only in four figures. Thus, for example, if it is desired to take a sample of 10 items out of 400, all items from 1 to 400 should be numbered as 0001 to 0400. We may now select 10 numbers from the table which are up to 0400.
Universe size less than 100. If the size of the universe is less than 100, the table is used as follows: Suppose ten numbers from out of 0 to 80 are required. We start anywhere in the table and write down the numbers in pairs. The table can be read horizontally, vertically, diagonally or in any methodical way. Starting with the first set and reading horizontally (see table given above) we obtain 29, 52, 66, 41, 39, 92, 97, 92, 79, 69, 59, 11, 31, 70, 56, 24, 41, 67, and so on. Ignoring the numbers greater than 80, we obtain for our purpose ten random numbers, namely, 29, 52, 66, 41, 39, 79, 69, 59, 11 and 31.
Fisher and Yates' tables consist of 15,000 numbers. These have been arranged in two digits in 300 blocks, each block consisting of 5 rows and 5 columns. Kendall and Smith also constructed random numbers (100,000 in all) by using a randomising machine.
However, this method of random selection cannot be followed in the case of articles like ghee, oil, petrol, wheat, etc.
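The table look-up procedure for selecting 20 items out of 6,000 can be expressed as a short routine. The sketch below hard-codes the forty four-digit sets quoted from Tippett's table and reads them row by row from the first set; the function name `select_units` is hypothetical, and skipping repeated numbers is an assumption made so that no unit is selected twice.

```python
# The first forty sets from Tippett's table, as quoted in the text
TIPPETT_SETS = """
2952 6641 3992 9792 7969 5911 3170 5624
4167 9524 1545 1396 7203 5356 1300 2693
2670 7483 3408 2762 3563 1089 6913 7991
0560 5246 1112 6107 6008 8125 4233 8776
2754 9143 1405 9025 7002 6111 8816 6446
""".split()


def select_units(table_entries, sample_size, population_size):
    """Read the table in order and keep the first sample_size usable unit numbers."""
    chosen = []
    for entry in table_entries:
        number = int(entry)                    # "0560" refers to unit number 560
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
            if len(chosen) == sample_size:
                break
    return chosen


sample = select_units(TIPPETT_SETS, 20, 6000)
print(sample)
```

Numbers above 6,000 (such as 6641 and 9792) are skipped, and the twentieth usable number, 4233, is reached in the fourth row of the table.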
Merits: The simple random sampling method has the following advantages:
* Since the selection of items in the sample depends entirely on chance, there is no possibility of personal bias affecting the results.
* As compared to judgment sampling, a random sample represents the universe in a better way. As the size of the sample increases, it becomes increasingly representative of the population.
* The analyst can easily assess the accuracy of the estimate because sampling errors follow the principles of chance. The theory of random sampling is further developed than that of any other type of sampling, which enables the analyst to provide the most reliable information at the least cost.
Limitations: This method is, however, associated with the following limitations:
* The use of simple random sampling necessitates a completely catalogued universe from which to draw the sample. But it is often difficult for the investigator to have up-to-date lists of all the items of the population to be sampled. This restricts the use of this method in economic and business data where very often we have to employ restricted random sampling designs.
* The size of the sample required to ensure statistical reliability is usually larger under random sampling than under stratified sampling.
* From the point of view of field surveys it has been claimed that cases selected by random sampling tend to be too widely dispersed geographically and that the time and cost of collecting data become too large.
* Random sampling may produce the most non-random looking results. For example, thirteen cards from a well-shuffled pack of playing cards may consist of one suit. But the probability of this type of occurrence is very, very low.
Restricted Random Sampling
1. Stratified Sampling Stratified random sampling or simply stratified sampling is one of
the random methods which, by using the available information concerning the population,
attempts to design a more efficient sample than obtained by the simple random procedure.
While applying stratified random sampling technique, the procedure followed is given
below:
(a) The universe to be sampled is sub-divided (or stratified) into groups which are
mutually exclusive and include all items in the universe.
(b) A simple random sample is then chosen independently from each group.
This sampling procedure differs from simple random sampling in that in the latter the
sample items are chosen at random from the entire universe. In stratified random sampling
the sampling is designed so that a designated number of items is chosen from each stratum.
In simple random sampling the distribution of the sample among strata is left entirely to
chance.
How to Select Stratified Random Sample?
Some of the issues involved in setting up a stratified random sample are :
(i) Base of Stratification What characteristic should be used to sub divide the universe into
different strata? As a general rule, strata are created on the basis of a variable known to be
correlated with the variable of interest and for which information on each universe element
is known. Strata should be constructed in a way which will minimize differences among
sampling units within strata, and maximize difference among strata.
For example, if we are interested in studying the consumption pattern of the people of
Delhi, the city of Delhi may be divided into various parts (such as zones or wards) and
from each part a sample may be taken at random. Before deciding on stratification we must
have knowledge of the traits of the population. Such knowledge may be based upon expert
judgment, past data, preliminary observations from pilot studies, etc.
The purpose of stratification is to increase the efficiency of sampling by dividing a
heterogeneous universe in such a way that (i) there is as great a homogeneity as possible
within each stratum, and (ii) a marked difference is possible between the strata.
(ii) Number of Strata. How many strata should be constructed? Practical
considerations limit the number of strata that is feasible; the costs of adding more strata may
soon outrun the benefits. As a generalization, more than six strata may be undesirable.
(iii) Sample size within Strata How many observations should be taken from each stratum?
When deciding this question we can use either a proportional or a disproportional
allocation. In proportional allocation, one samples each stratum in proportion to its relative
weight. In disproportional allocation this is not the case. It may be pointed out that
proportional allocation approach is simple and if all one knows about each stratum is the
number of items in that stratum, it is generally also the preferred procedure. In
disproportional sampling, the different strata are sampled at different rates. As a general
rule when variability among observations within a stratum is high, one samples that stratum
at a higher rate than for strata with less internal variation.
Proportional and Disproportional Stratified Sample
In a proportional stratified sampling plan, the number of items drawn from each stratum is
proportional to the size of the stratum. For example, if the population is divided into five
groups, their respective sizes being 10, 15, 20, 30 and 25 per cent of the population and a
sample of 5,000 is drawn, the desired proportional sample may be obtained in the following
manner:
From stratum one     5,000 (0.10) =   500 items
From stratum two     5,000 (0.15) =   750 items
From stratum three   5,000 (0.20) = 1,000 items
From stratum four    5,000 (0.30) = 1,500 items
From stratum five    5,000 (0.25) = 1,250 items
Total                             = 5,000 items
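The allocation above is simple arithmetic, and a short Python sketch makes it reusable. The function name `proportional_allocation` is hypothetical; the stratum sizes are passed as whole percentages so that the arithmetic stays exact.

```python
def proportional_allocation(sample_size, stratum_percentages):
    """Split `sample_size` among strata in proportion to their percentage sizes."""
    if sum(stratum_percentages) != 100:
        raise ValueError("stratum percentages must add up to 100")
    # Integer arithmetic; exact whenever sample_size * pct is divisible by 100.
    return [sample_size * pct // 100 for pct in stratum_percentages]


allocation = proportional_allocation(5000, [10, 15, 20, 30, 25])
print(allocation)       # [500, 750, 1000, 1500, 1250]
print(sum(allocation))  # 5000
```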
Proportional stratification yields a sample that represents the universe with respect to the
proportion in each stratum in the population. This procedure is satisfactory if there is no
great difference in dispersion from stratum to stratum. But it is certainly not the most
efficient procedure, especially when there is considerable variation in different strata. This
indicates that in order to obtain maximum efficiency in stratification, we should assign
greater representation to a stratum with a large dispersion and smaller representation to one
with small variation.
In disproportional stratified sampling an equal number of cases is taken from each stratum
regardless of how the stratum is represented in the universe. Thus, in the above example, an
equal number of items (1,000) from each stratum may be drawn. In practice disproportional
sampling is common when sampling from a highly variable universe, wherein the
variation of the measurements differs greatly from stratum to stratum.
Merits Stratified sampling methods have the following advantages:
More representative. Since the population is first divided into various strata and
then a sample is drawn from each stratum, there is little possibility of any essential group
of the population being completely excluded. A more representative sample is
thus secured. C.J. Grohmann has rightly pointed out that this type of
sampling balances the uncertainty of random sampling against the bias of
deliberate selection.
Greater accuracy. Stratified sampling ensures greater accuracy. The accuracy is
maximum if each stratum is so formed that it consists of uniform or homogeneous items.
Greater geographical concentration. As compared with random sample, stratified samples
can be more concentrated geographically, i.e., the units from the different strata may be
selected in such a way that all of them are localised in one geographical area. This would
greatly reduce the time and expenses of interviewing.
Limitations The limitations of this method are:
Utmost care must be exercised in dividing the population into various strata. Each
stratum must contain, as far as possible, homogeneous items as otherwise the results
may not be reliable. If proper stratification of the population is not done, the sample may
have the effect of bias.
The items from each stratum should be selected at random. But this may be difficult
to achieve in the absence of skilled sampling supervisors and a random selection within
each stratum may not be ensured.
Because of the likelihood that a stratified sample will be more widely distributed
geographically than a simple random sample cost per observation may be quite
high.
2. Systematic Sampling: A systematic sample is formed by selecting one unit at random
and then selecting additional units at evenly spaced intervals until the sample has been
formed. This method is popularly used in those cases where a complete list of the
population from which sample is to be drawn is available. The list may be prepared in
alphabetical, geographical, numerical or some other order. The items are serially numbered.
The first item is selected at random, generally by following the lottery method. Subsequent
items are selected by taking every kth item from the list, where 'k' refers to the sampling
interval or sampling ratio, i.e., the ratio of population size to the size of the sample.
Symbolically:
$$k = \frac{N}{n}$$
where k = Sampling interval, N = Universe size, and n = Sample size.
While calculating k, it is possible that we get a fractional value. In such a case we should
use an approximation procedure, i.e., if the fraction is less than 0.5 it should be omitted and if
it is more than 0.5 it should be taken as 1. If it is exactly 0.5 it should be omitted if the
number is even and should be taken as 1 if the number is odd. This is based on the
principle that the number after approximation should preferably be even. For example, if
the number of students is respectively 1,020, 1,150 and 1,100 and we want to take a sample
of 200, k shall be:
(i) $k = \frac{1020}{200} = 5.1$ or 5
(ii) $k = \frac{1150}{200} = 5.75$ or 6
(iii) $k = \frac{1100}{200} = 5.5$ or 6
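The rounding rule above (an exact half goes to the even neighbour) happens to be what Python's `round` does for exact fractions, so the interval and the selection can be sketched as follows. The function names are hypothetical, and drawing the random start from 1 to k is an assumption about how the lottery step is carried out.

```python
import random
from fractions import Fraction


def sampling_interval(universe_size, sample_size):
    """k = N/n, with an exact half rounded to the nearest even integer."""
    # round() on a Fraction rounds exact halves to the even neighbour.
    return round(Fraction(universe_size, sample_size))


def systematic_sample(universe_size, sample_size, rng=random):
    """Pick a random start among the first k serial numbers, then every kth unit."""
    k = sampling_interval(universe_size, sample_size)
    start = rng.randint(1, k)  # assumption: start drawn uniformly from 1..k
    # When N/n is not a whole number the result may hold slightly more or
    # fewer than `sample_size` units.
    return list(range(start, universe_size + 1, k))


print([sampling_interval(N, 200) for N in (1020, 1150, 1100)])  # [5, 6, 6]
print(len(systematic_sample(1000, 200)))                        # 200
```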
Merits: The systematic sampling design is simple and convenient to adopt. The
time and work involved in sampling by this method are relatively less. The
results obtained are also found to be generally satisfactory provided care is
taken to see that there are no periodic features associated with the
sampling interval. If populations are sufficiently large, systematic sampling
can often be expected to yield results similar to those obtained by proportional
stratified sampling.
Limitations: The main limitation of the method is that it becomes less representative if we
are dealing with populations having hidden periodicities. Also, if the population is ordered
in a systematic way with respect to the characteristics the investigator is interested in, then
it is possible that only certain types of items will be included in the sample, or at least
more of certain types than others. For instance, in a study of workers' wages the list may be
such that every tenth worker on the list gets wages above Rs. 750 per month.
3. Multi-stage Sampling or Cluster Sampling: Under this method the random selection is
made of primary, intermediate and final (or ultimate) units from a given population or
stratum. There are several stages in which the sampling process is carried out. At first, the
first-stage units are sampled by some suitable method, such as simple random sampling.
Then, a sample of second-stage units is selected from each of the selected first-stage units,
again by some suitable method which may be the same as or different from the method
employed for the first-stage units. Further stages may be added as required.
Merits: Multi-stage sampling introduces flexibility in the sampling method which is
lacking in the other methods. It enables existing divisions and sub-divisions of the
population to be used as units at various stages, and permits the field work to be
concentrated and yet a large area to be covered. Another advantage of the method is that
subdivision into second-stage units (i.e., the construction of the second-stage frame) need be
carried out for only those first-stage units which are included in the sample. It is, therefore,
particularly valuable in surveys of underdeveloped areas where no frame is generally
sufficiently detailed and accurate for subdivision of the material into reasonably small
sampling units.
Limitations: However, a multi-stage sample is in general less accurate than a sample containing the
same number of final-stage units which have been selected by some single-stage process.
We have discussed above the various random procedures in independent
designs. In practice we often combine two or more of these methods into a single design.
SAMPLING DISTRIBUTION AND STANDARD ERROR
Suppose a sample of size n is drawn from a population and the sample mean $\bar{x}$ is
calculated. From the population, many such samples of the same size can be drawn. For
each of these samples, $\bar{x}$ can be calculated. And so, there can be many values of $\bar{x}$.
Suppose these different values of $\bar{x}$ are tabulated in the form of a frequency distribution;
the resulting distribution is called the sampling distribution of $\bar{x}$. The standard deviation of
this sampling distribution is called the Standard Error (S.E.).
The distribution of values of a statistic for different samples of the same size is called the
sampling distribution of the statistic.
Standard Error (S.E.) of a statistic is the standard deviation of the sampling
distribution of the statistic.
Sampling distributions of other statistics such as sample variance, sample median, etc., can
also be written down. In each of these cases, the corresponding standard deviation would be
the standard error (S.E.).
Consider a population whose mean is $\mu$ and standard deviation is $\sigma$. Let a random sample of
size n be drawn from this population. Then, the sampling distribution of $\bar{x}$ has mean $\mu$
and standard error $\sigma/\sqrt{n}$. That is,
$$E(\bar{x}) = \mu \quad \text{and} \quad S.E.(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$
Let a random sample of size $n_1$ be drawn from a population whose mean is $\mu_1$ and
standard deviation is $\sigma_1$. And also, let a random sample of size $n_2$ be drawn from another
population whose mean is $\mu_2$ and standard deviation is $\sigma_2$. Let $\bar{x}_1$ be the mean of the first
sample and $\bar{x}_2$ be the mean of the second sample. Then,
$$E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2 \quad \text{and} \quad S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
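The result S.E.($\bar{x}$) = $\sigma/\sqrt{n}$ can be checked by brute force on a tiny population. The sketch below lists every possible sample of size 2 and compares the standard deviation of the sample means with the formula. The population values are hypothetical, and the samples are drawn with replacement, which is the setting in which the formula holds exactly.

```python
from itertools import product
from math import isclose, sqrt
from statistics import mean, pstdev

population = [2, 4, 6, 8]  # hypothetical population
n = 2                      # sample size

# Every ordered sample of size n, drawn with replacement (an assumption).
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)
sigma = pstdev(population)             # population standard deviation
standard_error = pstdev(sample_means)  # s.d. of the sampling distribution

print(mean(sample_means) == mu)                    # True
print(isclose(standard_error, sigma / sqrt(n)))    # True
```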
STATISTICAL HYPOTHESIS
A statistical hypothesis is an assertion regarding the statistical distribution of the
population. It is a statement regarding the parameters of the population.
A statistical hypothesis is denoted by H.
Examples:
1. H: The population has mean $\mu$ = 25.
2. H: The population is normally distributed with mean $\mu$ = 25 and a specified standard
deviation $\sigma$.
In a test procedure, to start with, a hypothesis is made. The validity of this hypothesis is
tested. If the hypothesis is found to be true, it is accepted. On the other hand, if it is found
to be untrue, it is rejected.
The hypothesis which is being tested for possible rejection is called the null hypothesis.
The null hypothesis is denoted by $H_0$. The hypothesis which is accepted when the null
hypothesis is rejected is called the alternative hypothesis. The alternative hypothesis is denoted
by $H_1$.
CRITICAL REGION
From a population many samples of the same size n can be drawn. Let S be the set of all
such samples of size n that can be drawn from the population. Then, S is called the sample
space. While testing a null hypothesis, among the samples which belong to S, some
samples lead to the acceptance of the null hypothesis, whereas some others lead to the
rejection of the null hypothesis. The set of all those samples belonging to the sample
space which lead to the rejection of the null hypothesis is called the critical region. The
critical region is denoted by $\omega$. The critical region is also called the rejection region. The set of
samples which lead to the acceptance of the null hypothesis is the acceptance region. It is
$(S - \omega)$.
In fact, to decide whether the sample in hand belongs to $\omega$, the criterion |Z| > k is adopted.
And so, in effect, the critical region is defined by |Z| > k.
ERRORS OF THE FIRST AND THE SECOND KIND
(Type I and Type II errors)
While testing a null hypothesis against an alternative hypothesis, one of the following four
situations arises:
     Actual fact          Decision based on the sample          Error
1    $H_0$ is true        accept $H_0$ (correct decision)       -
2    $H_0$ is true        reject $H_0$ (wrong decision)         Type I
3    $H_0$ is not true    accept $H_0$ (wrong decision)         Type II
4    $H_0$ is not true    reject $H_0$ (correct decision)       -
Here, in situations (2) and (3), wrong decisions are arrived at. These wrong decisions are
Error of the first kind (Type I error) and Error of the second kind (Type II error)
respectively. Thus,
(i) Error of the first kind (Type I error) is taking a wrong decision to reject the null
hypothesis when it is actually true.
(ii) Error of the second kind (Type II error) is taking a wrong decision to accept the
null hypothesis when it is actually not true.
The probability of occurrence of the first kind of error is denoted by $\alpha$. It is called the level of
significance. Thus, the level of significance is the probability of Type I error. It is the
probability of rejection of the null hypothesis when it is actually true. Usually the level
of significance is fixed at 0.05 or 0.01. In other words, the level is fixed at 5% or 1%.
The probability of occurrence of the second kind of error is denoted by $\beta$.
The value $(1 - \beta)$ is called the power of the test. Power of a test is the probability of rejecting
$H_0$ when it is not true.
While testing, the level of significance $\alpha$ is decided in advance. Then, the critical value k is
determined in such a way that the power $(1 - \beta)$ is maximum.
Thus, the critical value k is based on the level of significance. For tests which are based on
the normal distribution, if $\alpha$ = 0.05, the critical value is k = 1.96. If $\alpha$ = 0.01, the critical value is
k = 2.58.
Note: In fact, a decision to accept $H_0$ is based only on the given data. And so, rather
than making the assertive statement "$H_0$ is accepted", we would make the statement "$H_0$ is
not rejected". However, at the level of this book, we will not bother about the subtle
difference between these statements.
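The critical values 1.96 and 2.58 need not be memorised; they are the points of the standard normal distribution that leave $\alpha/2$ in each tail. A minimal sketch using Python's standard library, assuming a two-tailed test (the function name is hypothetical):

```python
from statistics import NormalDist


def two_tailed_critical_value(alpha):
    """Value k such that P(|Z| > k) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.05, 0.01):
    print(alpha, round(two_tailed_critical_value(alpha), 2))
# 0.05 1.96
# 0.01 2.58
```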
TWO-TAILED AND ONE-TAILED TESTS (Two-sided and one-sided tests)
While testing $H_0$, if the critical region is considered at one tail of the sampling
distribution of the test statistic, the test is a one-tailed test.
On the other hand, if the critical region is considered at both
the tails of the sampling distribution of the test statistic,
the test is a two-tailed test.
Session 09
TEST FOR MEAN
Suppose the mean $\mu$ of a population is not known. We want to test whether the mean is a given
value $\mu_0$. The null hypothesis is $H_0: \mu = \mu_0$. The alternative hypothesis is $H_1: \mu \neq \mu_0$.
For a large random sample of size n from the population, under $H_0$, the distribution of
$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
is N(0,1).
The test statistic is
$$|Z| = \frac{|\bar{x} - \mu_0|}{\sigma/\sqrt{n}}$$
For the sample, if the calculated value $|Z|_{cal} > k$, $H_0$ is rejected. On the other hand, if
$|Z|_{cal} \le k$, $H_0$ is accepted.
For the level of significance $\alpha$ = 0.05, the critical value is k = 1.96. However, for $\alpha$ = 0.01,
the critical value is k = 2.58.
Note: Here, if $\sigma$ is not known, the test statistic is
$$|Z| = \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}}$$
where s is the sample standard deviation.
TEST FOR EQUALITY OF MEANS
The null hypothesis is $H_0: \mu_1 = \mu_2$ (the means of the two populations are equal). The
alternative hypothesis is $H_1: \mu_1 \neq \mu_2$. Under $H_0$, let $\mu$ be the common mean and let
$\sigma_1$ and $\sigma_2$ be the standard deviations of the two populations.
Let a random sample of size $n_1$ be drawn from the first population. Let the mean of this
sample be $\bar{x}_1$. Also, let a random sample of size $n_2$ be drawn from the second population.
Let the mean of this sample be $\bar{x}_2$. Then,
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
is N(0,1).
And so the test statistic is
$$|Z| = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
For the samples, if $|Z|_{cal} > k$, $H_0$ is rejected.
On the other hand, if $|Z|_{cal} \le k$, $H_0$ is accepted. For the level of significance $\alpha$ = 0.05, the
critical value is k = 1.96. However, for $\alpha$ = 0.01, the critical value is k = 2.58. Here, if
$\sigma_1$ and $\sigma_2$ are not known, the test statistic is
$$|Z| = \frac{|\bar{x}_1 - \bar{x}_2|}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where $s_1$ and $s_2$ are the sample standard deviations.
TEST FOR PROPORTION
Suppose the proportion of an attribute in a population is not known. We want to test
whether the proportion is a given value $P_0$. The null hypothesis is $H_0: P = P_0$. The
alternative hypothesis is $H_1: P \neq P_0$.
In a large random sample of size n from the population, let x units possess the attribute.
Then, the sample proportion is p = x/n.
And so, writing $Q_0 = 1 - P_0$,
$$Z = \frac{p - P_0}{\sqrt{\dfrac{P_0 Q_0}{n}}}$$
is N(0,1).
Therefore, the test statistic is
$$|Z| = \frac{|p - P_0|}{\sqrt{\dfrac{P_0 Q_0}{n}}}$$
For the level of significance $\alpha$ = 0.05, the critical value is k = 1.96. However, for $\alpha$ = 0.01,
the critical value is k = 2.58.
TEST FOR EQUALITY OF PROPORTIONS
Suppose there are two populations with unknown proportions, and we wish to test whether
the proportions (of a certain attribute) in the two populations are equal. The null hypothesis
is $H_0: P_1 = P_2$ (the proportions are equal). The alternative hypothesis is $H_1: P_1 \neq P_2$.
Under $H_0$, let P be the common proportion. Let a large random sample of size $n_1$ be drawn
from the first population. Among these $n_1$ units, let $x_1$ units possess the attribute, so that
the sample proportion is $p_1 = x_1/n_1$. Also let a large random sample of size $n_2$ be drawn from
the second population. Among these units, let $x_2$ units possess the attribute, so that the sample
proportion is $p_2 = x_2/n_2$.
The test statistic is
$$|Z| = \frac{|p_1 - p_2|}{\sqrt{PQ\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
Generally, the common proportion P will not be known. And so, it is estimated from the
samples. This estimate is
$$P = \frac{x_1 + x_2}{n_1 + n_2}$$
And also, $Q = 1 - P$.
For the samples, if $|Z|_{cal} > k$, $H_0$ is rejected.
On the other hand, if $|Z|_{cal} \le k$, $H_0$ is accepted.
PROBLEMS
1. A random sample of 200 tins of vanaspathi has mean weight 4.97 kg and standard
deviation 0.2 kg. Test, at 1% level of significance, that the tins contain 5 kg of vanaspathi.
Solution:
$H_0: \mu = 5$ kg
$H_1: \mu \neq 5$ kg
Under $H_0$,
$$|Z|_{cal} = \frac{|4.97 - 5|}{0.2/\sqrt{200}} = 2.12$$
$Z_{tab}$ = 2.58 at 1% l.o.s.
Since $Z_{cal} < Z_{tab}$, we accept $H_0$ at 1% l.o.s., i.e., the tins contain 5 kg of vanaspathi.
2. A random sample of 100 rods drawn from a lot of rods has mean length 32.7 cm and
standard deviation 1.3 cm. Can it be concluded that the lot has mean 32 cm?
Solution:
$H_0: \mu = 32$
$H_1: \mu \neq 32$
Under $H_0$,
$$|Z|_{cal} = \frac{|32.7 - 32|}{1.3/\sqrt{100}} = 5.38$$
$Z_{tab}$ = 1.96 at 5% l.o.s.
Since $Z_{cal} > Z_{tab}$, we reject $H_0$ at 5% l.o.s., i.e., the lot does not have mean 32 cm.

3. The mean and standard deviation of heights of 100 randomly selected boys are 163 cm
and 3 cm, respectively. The mean and standard deviation of heights of 80 randomly
selected girls are 161 cm and 2 cm, respectively. Can it be concluded at 1% level of
significance that boys and girls are equally tall?
Solution:
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
Under $H_0$,
$$|Z|_{cal} = \frac{|163 - 161|}{\sqrt{\dfrac{3^2}{100} + \dfrac{2^2}{80}}} = 5.35$$
$Z_{tab}$ = 2.58 at 1% l.o.s.
Since $Z_{cal} > Z_{tab}$, we reject $H_0$ at 1% l.o.s., i.e., boys and girls do not have equal height at 1%
l.o.s.
4. The standard deviation of length of fibre manufactured by process A is 0.5 cm, and the
standard deviation of length of fibre manufactured by process B is 0.6 cm. A sample of 40
randomly selected fibres from process A has mean length 16.7 cm. A sample of 60
randomly selected fibres from process B has mean length 16.4 cm. Test whether process A
and process B differ with regard to length of fibre manufactured by them.
Solution:
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
Under $H_0$,
$$|Z|_{cal} = \frac{|16.7 - 16.4|}{\sqrt{\dfrac{(0.5)^2}{40} + \dfrac{(0.6)^2}{60}}} = 2.71$$
$Z_{tab}$ = 1.96 at 5% l.o.s.
Since $Z_{cal} > Z_{tab}$, we reject $H_0$ at 5% l.o.s., i.e., there is a significant difference in the mean
length of fibre manufactured by process A and process B.
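Problems 3 and 4 both use the two-sample statistic, so one small function covers them. The sketch below is illustrative; `z_for_two_means` is a hypothetical name.

```python
from math import sqrt


def z_for_two_means(mean1, mean2, sd1, sd2, n1, n2):
    """|Z| for the large-sample test of equality of two means."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return abs(mean1 - mean2) / standard_error


z3 = z_for_two_means(163, 161, 3, 2, 100, 80)        # Problem 3 (boys and girls)
z4 = z_for_two_means(16.7, 16.4, 0.5, 0.6, 40, 60)   # Problem 4 (processes A and B)
print(round(z3, 2))  # 5.35
print(round(z4, 2))  # 2.71
```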
5. Past experience shows that among borewells dug by a firm, 78% are successful. The
firm digs 65 borewells in a district. Among them 58 were successful. Can we conclude that
these figures agree with the past experience? [Test both at 5% and 1% levels]
Solution:
$H_0: P = 0.78$
$H_1: P \neq 0.78$
Under $H_0$,
$$|Z|_{cal} = \frac{\left|\dfrac{58}{65} - 0.78\right|}{\sqrt{\dfrac{0.78 \times 0.22}{65}}} = \frac{0.1123}{0.0514} = 2.19$$
$Z_{tab}$ = 1.96 at 5% l.o.s.
$Z_{tab}$ = 2.58 at 1% l.o.s.
Since $Z_{cal} > 1.96$, we reject $H_0$ at 5% l.o.s. Since $Z_{cal} < 2.58$, we accept $H_0$ at 1% l.o.s.
That is, the figures differ significantly from past experience at the 5% level, but not at the
1% level.
6. In a random selection of 85 workers of a factory, 18 were unmarried. Can we conclude
that 20% of the workers of the factory are unmarried?
Solution:
$H_0: P = 0.20$
$H_1: P \neq 0.20$
Under $H_0$,
$$|Z|_{cal} = \frac{\left|\dfrac{18}{85} - 0.20\right|}{\sqrt{\dfrac{0.2 \times 0.8}{85}}} = 0.271$$
$Z_{tab}$ = 1.96 at 5% l.o.s.
Since $Z_{cal} < Z_{tab}$ at 5% l.o.s., we accept $H_0$, i.e., 20% of the workers in the factory are
unmarried.
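A sketch of the single-proportion test, checked against Problem 6 (18 unmarried workers out of 85, hypothesised proportion 0.20). The function name `z_for_proportion` is hypothetical.

```python
from math import sqrt


def z_for_proportion(successes, n, p0):
    """|Z| for the large-sample test of a single proportion."""
    p = successes / n   # sample proportion
    q0 = 1 - p0
    return abs(p - p0) / sqrt(p0 * q0 / n)


z6 = z_for_proportion(18, 85, 0.20)
print(round(z6, 3))                               # 0.271
print("reject H0" if z6 > 1.96 else "accept H0")  # accept H0
```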

7. Among 80 electric bulbs manufactured by process A, seven were defective. Among 130
electric bulbs manufactured by process B, two were defective. Test whether the proportions
of defectives in the two processes differ.
Solution:
$H_0: P_1 = P_2$
$H_1: P_1 \neq P_2$
Under $H_0$,
$$|Z|_{cal} = \frac{\left|\dfrac{7}{80} - \dfrac{2}{130}\right|}{\sqrt{\dfrac{9}{210} \times \dfrac{201}{210}\left(\dfrac{1}{80} + \dfrac{1}{130}\right)}} = 2.506$$
$Z_{tab}$ = 1.96 at 5% l.o.s.
Since $Z_{cal} > Z_{tab}$, we reject $H_0$ at 5% l.o.s., i.e., the proportions of defective bulbs
differ in process A and process B.

8. The proportion of substandard crackers among 400 crackers manufactured by firm A is
0.12. The proportion among 500 crackers manufactured by firm B is 0.08. Test at 1% level
of significance that the proportion is the same among the products of the two firms.
Solution:
$H_0: P_1 = P_2$
$H_1: P_1 \neq P_2$
Under $H_0$, taking the common proportion as P = 0.1 (the pooled estimate is 88/900 = 0.098),
$$|Z|_{cal} = \frac{|0.12 - 0.08|}{\sqrt{(0.1)(0.9)\left(\dfrac{1}{400} + \dfrac{1}{500}\right)}} = 1.988$$
$Z_{tab}$ = 2.58 at 1% l.o.s.
Since $Z_{cal} < Z_{tab}$, we accept $H_0$ at 1% l.o.s., i.e., the proportion is the same among the products of the
two firms.
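The two-proportion statistic with a pooled estimate of P can be sketched as below. The function name is hypothetical. The check uses the counts that appear in the worked calculation of Problem 7 (7 defectives out of 80 and 2 out of 130).

```python
from math import sqrt


def z_for_two_proportions(x1, n1, x2, n2):
    """|Z| for equality of two proportions, pooling the samples to estimate P."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # estimate of the common proportion P
    q = 1 - pooled
    return abs(p1 - p2) / sqrt(pooled * q * (1 / n1 + 1 / n2))


z7 = z_for_two_proportions(7, 80, 2, 130)
print(round(z7, 3))                               # 2.506
print("reject H0" if z7 > 1.96 else "accept H0")  # reject H0
```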
CHI- SQUARE TESTS
Let $Z_1, Z_2, \ldots, Z_n$ be n independently distributed standard normal variates. Then, the
distribution of
$$\chi^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2$$
is called the Chi-square distribution with n degrees of freedom (d.f.).
Here, $\chi^2$ has n independent variable components. Therefore, its degree of freedom is n. The
degree of freedom of $\chi^2$ is the number of independent variable components that it has. The
degree of freedom will be less if there are constraints on these variable components.

APPLICATIONS OF CHI-SQUARE DISTRIBUTION
The Chi-square distribution has many uses in the field of testing of hypotheses. Some of
them are
1. To test whether a population has given variance
2. To test goodness of fit of a theoretical distribution to an observed distribution.
3. To test independence of attributes in a contingency table
TEST FOR VARIANCE
Suppose the variance of a normal population is not known. We want to test whether the
population has a given variance. The null hypothesis is $H_0: \sigma^2 = \sigma_0^2$ (the population
variance is $\sigma_0^2$). The alternative hypothesis is $H_1: \sigma^2 \neq \sigma_0^2$.
To conduct the test, n random observations $x_1, x_2, \ldots, x_n$ are drawn from the population. Then,
under $H_0$,
$$\chi^2 = \sum \left(\frac{x - \bar{x}}{\sigma_0}\right)^2 = \frac{\sum (x - \bar{x})^2}{\sigma_0^2} = \frac{(n-1)s^2}{\sigma_0^2}$$
is a Chi-square variate with (n-1) degrees of freedom. (Here, the degree of freedom is one less
than n because there is a constraint $\sum x = n\bar{x}$.)
Here,
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
is the sample variance.
If, for the sample, the calculated value $\chi^2_{cal}$ lies between $k_1$ and $k_2$, the null hypothesis
is accepted. On the other hand, if $\chi^2_{cal} < k_1$ or if $\chi^2_{cal} > k_2$, the null hypothesis is
rejected.
For the different levels of significance ($\alpha$ = 0.05 and $\alpha$ = 0.01) the critical values $k_1$ and $k_2$ are
obtained from the table of Chi-square critical values.
Here, $k_1$ and $k_2$ are different for different degrees of freedom. Also, the test is two-tailed
(two-sided).
Session 10
Problems: The standard deviation of heights of plants is known to be 2 cm. Eight
randomly selected plants have heights 172, 156, 154, 163, 170, 169, 172 and 164 cm. Test
whether the sample standard deviation differs significantly from the population standard
deviation.
Solution:
Here, n = 8 and $\sigma_0$ = 2 cm.
$H_0$: The sample standard deviation does not differ significantly from the population standard
deviation.
$H_1$: The sample standard deviation differs significantly from the population standard deviation.
The test statistic is
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
x        (x - x̄) = (x - 165)     (x - x̄)²
172             7                   49
156            -9                   81
154           -11                  121
163            -2                    4
170             5                   25
169             4                   16
172             7                   49
164            -1                    1
Total 1320      0                  346
$$\bar{x} = \frac{\sum x}{n} = \frac{1320}{8} = 165$$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{346}{8 - 1} = 49.43$$
$$(n-1)s^2 = \sum (x - \bar{x})^2 = 346$$
$$\chi^2_{cal} = \frac{(n-1)s^2}{\sigma_0^2} = \frac{346}{2^2} = 86.5$$
The degree of freedom is (n-1) = (8-1) = 7.
The level of significance is $\alpha$ = 5%.
The critical values are $k_1$ = 1.69 and $k_2$ = 16.01.
Since $\chi^2_{cal}$ = 86.5 > 16.01, $H_0$ is rejected.
Conclusion: The sample standard deviation differs significantly from the population standard deviation.
Problems: A milk filling machine fills sachets with milk. The contention is that the standard
deviation of the quantity of milk filled is 3 ml. To test this, 24 sachets are randomly selected
and their contents noted. If the standard deviation of these observations is 3.9 ml, what is
your conclusion?
Solution:
Here, n = 24, $\sigma_0$ = 3 ml and s = 3.9 ml.
$H_0$: Standard deviation is 3 ml.
$H_1$: Standard deviation differs from 3 ml.
The test statistic is
$$\chi^2_{cal} = \frac{(n-1)s^2}{\sigma_0^2}$$
Here,
$$\chi^2_{cal} = \frac{(n-1)s^2}{\sigma_0^2} = \frac{23 \times (3.9)^2}{3^2} = 38.87$$
The degree of freedom is (n-1) = (24-1) = 23.
The level of significance is $\alpha$ = 5%.
The critical values are $k_1$ = 11.69 and $k_2$ = 38.08.
Since $\chi^2_{cal}$ = 38.87 > 38.08, $H_0$ is rejected.
Conclusion: The standard deviation differs from 3 ml.
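Both variance problems reduce to the same statistic, computed either from raw observations or from a reported sample standard deviation. The sketch below reproduces the two calculated values; the function names are hypothetical, and the critical values still have to be read from a Chi-square table.

```python
def chi_square_from_observations(observations, sigma0):
    """Sum of squared deviations from the sample mean, divided by sigma0 squared."""
    n = len(observations)
    x_bar = sum(observations) / n
    sum_of_squares = sum((x - x_bar) ** 2 for x in observations)
    return sum_of_squares / sigma0 ** 2


def chi_square_from_summary(n, s, sigma0):
    """(n - 1) s^2 / sigma0^2 when only the sample s.d. is reported."""
    return (n - 1) * s ** 2 / sigma0 ** 2


heights = [172, 156, 154, 163, 170, 169, 172, 164]
print(chi_square_from_observations(heights, 2))          # 86.5
print(round(chi_square_from_summary(24, 3.9, 3), 2))     # 38.87
```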
CHI-SQUARE TEST OF GOODNESS OF FIT
Suppose there is an observed (empirical) frequency distribution with frequencies $O_1, O_2, \ldots,
O_n$. According to certain theoretical assumptions, let a theoretical frequency distribution be
fitted to the observed distribution. Let the theoretical frequencies be $E_1, E_2, \ldots, E_n$. Suppose
we intend to test the null hypothesis
$H_0$: The theoretical frequency distribution is a good fit to the observed frequency
distribution.
$H_1$: The theoretical frequency distribution is not a good fit to the
observed frequency distribution.
To test $H_0$ against $H_1$, Karl Pearson's Chi-square test of goodness of fit is applied.
Here, the test statistic is
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Under $H_0$ this is a Chi-square variate with (n-c) d.f.
Here, n is the number of terms in $\chi^2$ and c is the number of constraints.
This test is one-tailed.
If $\chi^2_{cal} > k$, $H_0$ is rejected. And if $\chi^2_{cal} \le k$, $H_0$ is accepted.
For different degrees of freedom and $\alpha$ = 0.05 and $\alpha$ = 0.01, the critical values are
obtained from the Chi-square table.
The Chi-square test of goodness of fit is applicable subject to the following
conditions.
1. The observations should be independent (random).
2. The total frequency N should be large.
3. The theoretical frequencies $E_i$ should be 5 or more. If any $E_i$ is less than 5, it should be
pooled with the adjacent frequency.
4. If any parameter is estimated from the observed distribution, corresponding to every such
estimation one degree of freedom should be lessened.
Problems
To an observed frequency distribution, a binomial distribution is fitted after estimating p
from the observed data. The observed and theoretical frequencies are given below.
Test whether the binomial distribution is a good fit.
x_i    0    1    2    3    4    5    6    7    Total
O_i    3    3   17   31   28   11    1    2     96
E_i    1    7   19   27   24   13    4    1     96
Solution:
$H_0$: Binomial distribution is a good fit.
$H_1$: Binomial distribution is not a good fit.
x_i      O_i    E_i    (O_i - E_i)²    (O_i - E_i)²/E_i
0, 1       6      8         4              0.5000
2         17     19         4              0.2105
3         31     27        16              0.5926
4         28     24        16              0.6667
5         11     13         4              0.3077
6, 7       3      5         4              0.8000
Total     96     96                        3.0775
The frequencies are pooled in such a way that none of the theoretical frequencies is less than 5.
However, observed frequencies may be less than 5.
The test statistic is
$$\chi^2_{cal} = \sum \frac{(O_i - E_i)^2}{E_i} = 3.0775$$
Ultimately, the number of terms in $\chi^2$ is n = 6. Since p is estimated, the degree of
freedom is (n-c) = (6-2) = 4.
The level of significance is $\alpha$ = 5%.
The critical value is k = 9.49.
Since $\chi^2_{cal}$ = 3.0775 < 9.49, $H_0$ is accepted.
Conclusion: Binomial distribution is a good fit.
Problems:
The following table gives the observed and theoretical distributions concerning a
survey. If the mean has been estimated in order to find the theoretical frequencies, test
whether the theoretical distribution is a good fit.
Class    Observed    Theoretical
0-2         13           16
2-4         27           25
4-6         58           42
6-8         34           38
8-10        16           23
10-12       12           16
Solutions:
$H_0$: The theoretical distribution is a good fit to the observed distribution.
$H_1$: The theoretical distribution is not a good fit to the observed distribution.
C.I      O_i    E_i    (O_i - E_i)²/E_i
0-2       13     16        0.5625
2-4       27     25        0.1600
4-6       58     42        6.0952
6-8       34     38        0.4211
8-10      16     23        2.1304
10-12     12     16        1.0000
Total    160    160       10.3692
$$\chi^2_{cal} = \sum \frac{(O_i - E_i)^2}{E_i} = 10.3692$$
The number of terms in $\chi^2$ is n = 6. Since the mean is estimated, c = 2 and the degree of
freedom is (n-c) = (6-2) = 4.
$\chi^2_{tab}$ = 9.49 at 5% l.o.s. for 4 d.f.
Since $\chi^2_{cal} > \chi^2_{tab}$, we reject $H_0$ at 5% l.o.s., i.e., the theoretical distribution is not a good
fit to the observed distribution.
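The goodness-of-fit statistic is a single sum over the classes, which a short Python sketch makes explicit. The function name `chi_square_gof` is hypothetical; the data are the six classes of the survey problem above. Only the calculated value is computed here, since the critical value comes from the Chi-square table.

```python
def chi_square_gof(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [13, 27, 58, 34, 16, 12]
expected = [16, 25, 42, 38, 23, 16]
print(round(chi_square_gof(observed, expected), 4))  # 10.3692
```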
CHI-SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES
N random observations are drawn from the population. These observations are classified
with respect to the two attributes A and B, and they are written down in the form of a 2×2
contingency table as follows:
A/B       B1      B2      Total
A1        a       b       a+b
A2        c       d       c+d
Total     a+c     b+d     N = a+b+c+d
$H_0$: Attributes A and B are independent.
$H_1$: Attributes A and B are not independent.
The Chi-square test statistic is
$$\chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$
Under $H_0$ this is a Chi-square variate with 1 d.f.
This test is one-tailed.
If $\chi^2_{cal} > k$, $H_0$ is rejected. And if $\chi^2_{cal} \le k$, $H_0$ is accepted.
For $\alpha$ = 0.05 the critical value is k = 3.84.
For $\alpha$ = 0.01 the critical value is k = 6.63.
The Chi-square test for independence of attributes is applicable subject to the
following conditions.
1. The observations should be independent (random).
2. The total frequency N should be large.
3. Each of the frequencies a, b, c and d should be 5 or more.
Problems:
46 rabbits are divided into two groups: an experimental group of 23 rabbits and a control
group of 23 rabbits. The experimental group is inoculated against a disease and the control
group is not inoculated. Afterwards, all the rabbits of both groups are exposed to the
disease. In the control group, 13 contracted the disease. In the experimental group, 8
contracted the disease. Test whether inoculation and contraction of the disease are
independent.
Solutions:
The Chi-square test
H0: Inoculation and contraction of the disease are independent.
H1: Inoculation and contraction of the disease are not independent.
The test statistic is
\[ \chi^2 = \frac{N(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} \]
The given data is tabulated as follows.
Group             Control   Experimental   Total
Contracted        13        8              21
Not Contracted    10        15             25
Total             23        23             46
Here
\[ \chi^2_{cal} = \frac{46\,(13 \times 15 - 8 \times 10)^2}{21 \times 25 \times 23 \times 23} = 2.19 \]
The degree of freedom is 1.
The level of significance is α = 5%.
The critical value is k = 3.84.
Since $\chi^2_{cal}$ = 2.19 < 3.84, H0 is accepted.
Conclusion: Inoculation and contraction of the disease are independent.
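As a quick check of the arithmetic, the 2×2 statistic can be computed directly from the four cell frequencies. The sketch below assumes the cells are passed in the order a, b (first row) and c, d (second row); the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) for a 2x2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rabbit data: rows are Contracted / Not contracted, columns are Control / Experimental.
chi2_cal = chi_square_2x2(13, 8, 10, 15)
critical_value = 3.84  # k for 1 d.f. at the 5% level, as quoted in the text
h0_accepted = chi2_cal <= critical_value
print(round(chi2_cal, 2), h0_accepted)
```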
Test for equality of proportions:
\[ Z = \frac{|p_1 - p_2|}{\sqrt{PQ\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]
Here
\[ p_1 = \frac{8}{23} = 0.3478, \quad p_2 = \frac{13}{23} = 0.5652, \quad P = \frac{8 + 13}{23 + 23} = 0.4565 \]
and Q = 1 - P = 0.5435, so that
\[ Z_{cal} = \frac{|0.3478 - 0.5652|}{\sqrt{0.4565 \times 0.5435 \left(\frac{1}{23} + \frac{1}{23}\right)}} = 1.48 \]
Since this value is less than k = 1.96, H0 is accepted.
Conclusion: Inoculation and contraction of the disease are independent.
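The same test can be sketched in Python. The function name two_proportion_z is an assumption made for illustration. For a 2×2 table the square of this Z equals the chi-square statistic of the independence test, which the last line verifies for the rabbit data.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for equality of two proportions using the pooled proportion P."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)      # P in the text
    q = 1 - pooled                      # Q in the text
    return abs(p1 - p2) / math.sqrt(pooled * q * (1 / n1 + 1 / n2))


# 8 of 23 inoculated rabbits and 13 of 23 control rabbits contracted the disease.
z_cal = two_proportion_z(8, 23, 13, 23)
h0_accepted = z_cal < 1.96  # 5% two-sided critical value quoted in the text
print(round(z_cal, 2), round(z_cal ** 2, 2))
```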
Problems:
A Milk producers union wishes to test whether the preference pattern of consumers for its
products is dependent on income levels. A random sample of 500 individuals gives the
following data
                    Product preferred
Income     Product A   Product B   Product C   Total
Low        170         30          80          280
Medium     50          25          60          135
High       20          10          55          85
Can you conclude that the preference patterns are independent of income levels?
Solutions:
H0: Preference pattern of consumers and income level are independent.
H1: Preference pattern of consumers and income level are not independent.
The observed frequencies, with the expected frequencies in parentheses, are:
                    Product preferred
Income     Product A    Product B    Product C    Total
Low        170 (134)    30 (36)      80 (110)     280
Medium     50 (65)      25 (18)      60 (52)      135
High       20 (41)      10 (11)      55 (33)      85
Total      240          65           195          500
\[ \chi^2_{cal} = \sum_i \frac{(O_i - E_i)^2}{E_i} = 51.78 \]
$\chi^2_{tab} = 9.488$ for 4 d.f. at 5% l.o.s.
Since $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 at 5% l.o.s., i.e. preference pattern of consumers and
income level are not independent.
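For a general r×c table, each expected frequency is (row total × column total) / N. The sketch below is an illustration with a hypothetical function name. It keeps the expected frequencies unrounded, so its statistic (about 51.0) differs slightly from the 51.78 obtained above with rounded expected frequencies; the decision is the same.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Rows: Low, Medium, High income; columns: Product A, B, C.
milk = [[170, 30, 80], [50, 25, 60], [20, 10, 55]]
chi2_cal, df = chi_square_independence(milk)
reject_h0 = chi2_cal > 9.488  # table value for 4 d.f. at the 5% level, from the text
print(round(chi2_cal, 2), df, reject_h0)
```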
Problems:
Among 79 students, 58 were hard working. Among these hard working students, 52 passed
in the examination. Whereas, among the non-hard working students, only 5 passed. Apply
chi-square test at 1% level of significance to test whether hard work and pass are
independent.
Solutions:
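H0: Hard work and result are independent.
H1: Hard work and result are not independent.
The observed frequencies, with the expected frequencies in parentheses, are:
         Hard working   Not hard working   Total
Pass     52 (42)        5 (15)             57
Fail     6 (16)         16 (6)             22
Total    58             21                 79
\[ \chi^2_{cal} = \sum_i \frac{(O_i - E_i)^2}{E_i} = 31.96 \]
$\chi^2_{tab} = 6.635$ for 1 d.f. at 1% l.o.s.
Since $\chi^2_{cal} > \chi^2_{tab}$, we reject H0 at 1% l.o.s., i.e. hard work and results are not independent.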
THE F-TEST OR THE VARIANCE RATIO TEST
The F-test is named in honour of the great statistician R.A. Fisher. The object of the F-test
is to find out whether the two independent estimates of population variance differ
significantly, or whether the two samples may be regarded as drawn from normal
populations having the same variance. For carrying out the test of significance, we
calculate the ratio F.
F is defined as:
\[ F = \frac{S_1^2}{S_2^2}, \quad \text{where} \quad S_1^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{n_1 - 1} \quad \text{and} \quad S_2^2 = \frac{\sum (X_2 - \bar{X}_2)^2}{n_2 - 1} \]
It should be noted that $S_1^2$ is always the larger estimate of variance, i.e. $S_1^2 > S_2^2$.
$\nu_1$ = degrees of freedom for the sample having the larger variance
$\nu_2$ = degrees of freedom for the sample having the smaller variance.
Assumptions in F-test. The F test is based on the following assumptions:
1. Normality, i.e., the values in each group are normally distributed.
2. Homogeneity, i.e., the variance within each group should be equal for all groups
($\sigma_1^2 = \sigma_2^2 = \dots = \sigma_c^2$). This assumption is needed in order to combine or pool the
variances within the groups into a single within-groups source of variation.
3. Independence of error. It states that the error (variation of each value around its own
group mean) should be independent for each value.
The following few examples would illustrate the application of F-test:
1. Two random samples were drawn from two normal populations and their values are:
A: 66 67 75 76 82 84 88 90 92
B: 64 66 74 78 82 85 87 92 93 95 97
Test whether the two populations have the same variance at the 5% level of significance
(F = 3.36 at the 5% level for $\nu_1$ = 10 and $\nu_2$ = 8).
Solutions: Let us take the hypothesis that the two populations have the same variance.
Applying the F-test:
\[ F = \frac{\text{larger estimate of variance}}{\text{smaller estimate of variance}} \]
with $\nu_1$ and $\nu_2$ the degrees of freedom (sample size minus 1) of the samples having the
larger and the smaller variance respectively.
In the table below, $x_1 = X_1 - \bar{X}_1$ and $x_2 = X_2 - \bar{X}_2$.
A: X1    x1     x1^2      B: X2    x2     x2^2
66       -14    196       64       -19    361
67       -13    169       66       -17    289
75       -5     25        74       -9     81
76       -4     16        78       -5     25
82       +2     4         82       -1     1
84       +4     16        85       +2     4
88       +8     64        87       +4     16
90       +10    100       92       +9     81
92       +12    144       93       +10    100
                          95       +12    144
                          97       +14    196
ΣX1=720  Σx1=0  Σx1^2=734 ΣX2=913  Σx2=0  Σx2^2=1298
\[ \bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{720}{9} = 80; \quad \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{913}{11} = 83 \]
\[ S_1^2 = \frac{\sum x_1^2}{n_1 - 1} = \frac{734}{9 - 1} = 91.75; \quad S_2^2 = \frac{\sum x_2^2}{n_2 - 1} = \frac{1298}{11 - 1} = 129.8 \]
Sample B has the larger variance, so it goes in the numerator:
\[ F = \frac{129.8}{91.75} = 1.415 \]
For $\nu_1$ = 10 and $\nu_2$ = 8, $F_{0.05}$ = 3.36.
The calculated value of F is less than the table value. The hypothesis is accepted. Hence it
may be concluded that the two populations have the same variance.
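The variance ratio for this example can be reproduced with a short Python sketch. The helper names are illustrative assumptions; the function places the larger sample variance in the numerator, as the text prescribes, and reports the matching degrees of freedom.

```python
def sample_variance(values):
    """Unbiased sample variance: sum of squared deviations divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


def variance_ratio(sample_1, sample_2):
    """Return (F, nu1, nu2) with the larger variance in the numerator."""
    v1, v2 = sample_variance(sample_1), sample_variance(sample_2)
    if v1 >= v2:
        return v1 / v2, len(sample_1) - 1, len(sample_2) - 1
    return v2 / v1, len(sample_2) - 1, len(sample_1) - 1


a = [66, 67, 75, 76, 82, 84, 88, 90, 92]
b = [64, 66, 74, 78, 82, 85, 87, 92, 93, 95, 97]
f_cal, nu1, nu2 = variance_ratio(a, b)
h0_accepted = f_cal < 3.36  # table value quoted in the problem
print(round(f_cal, 3), nu1, nu2)
```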
Session 11
2. In a sample of 8 observations, the sum of squared deviations of items from the mean was
84.4. In another sample of 10 observations, the value was found to be 102.6. Test whether
the difference is significant at the 5% level. You are given that at the 5% level, the critical
value of F for $\nu_1$ = 7 and $\nu_2$ = 9 degrees of freedom is 3.29, and for $\nu_1$ = 8 and $\nu_2$ = 10
degrees of freedom its value is 3.07.
Solutions: Let us take the hypothesis that the difference in the variances of the two samples
is not significant. We are given
$n_1 = 8$, $\sum (X_1 - \bar{X}_1)^2 = 84.4$
$n_2 = 10$, $\sum (X_2 - \bar{X}_2)^2 = 102.6$
\[ S_1^2 = \frac{\sum (X_1 - \bar{X}_1)^2}{n_1 - 1} = \frac{84.4}{7} = 12.06; \quad S_2^2 = \frac{\sum (X_2 - \bar{X}_2)^2}{n_2 - 1} = \frac{102.6}{9} = 11.4 \]
\[ F = \frac{S_1^2}{S_2^2} = \frac{12.06}{11.4} = 1.06 \]
For $\nu_1$ = 7 and $\nu_2$ = 9, $F_{0.05}$ = 3.29.
The calculated value of F is less than the table value. Hence we accept the hypothesis and
conclude that the difference in the variances of the two samples is not significant at the 5% level.
ANALYSIS OF VARIANCE
The analysis of variance, frequently referred to by the contraction ANOVA, is a statistical
technique specially designed to test whether the means of more than two quantitative
populations are equal.
Problems:
The three samples below have been obtained from normal populations with equal variance.
Test the hypothesis that the population means are equal:
X1    X2    X3
8     7     12
10    5     9
7     10    13
14    9     12
11    9     14
The table value of F at the 5% level of significance for $\nu_1$ = 2 and $\nu_2$ = 12 is 3.88.
Solutions:
         X1    X2    X3
         8     7     12
         10    5     9
         7     10    13
         14    9     12
         11    9     14
Total    50    40    60
Mean     10    8     12
Grand mean:
\[ \bar{\bar{X}} = \frac{10 + 8 + 12}{3} = 10 \]
VARIANCE BETWEEN SAMPLES
$(\bar{X}_1 - \bar{\bar{X}})^2$    $(\bar{X}_2 - \bar{\bar{X}})^2$    $(\bar{X}_3 - \bar{\bar{X}})^2$
0     4     4
0     4     4
0     4     4
0     4     4
0     4     4
Total: 0     20    20
Sum of squares between samples = 0+20+20 = 40
VARIANCE WITHIN SAMPLES
X1    $(X_1 - \bar{X}_1)^2$    X2    $(X_2 - \bar{X}_2)^2$    X3    $(X_3 - \bar{X}_3)^2$
8     4      7     1      12    0
10    0      5     9      9     9
7     9      10    4      13    1
14    16     9     1      12    0
11    1      9     1      14    4
Total: 30          16           14
Sum of squares within samples = 30+16+14 = 60
ANOVA TABLE
Source of variation   Sum of squares   ν     Mean square
Between               40               2     20
Within                60               12    5
Total                 100              14
\[ F = \frac{20}{5} = 4 \]
The calculated value of F (4) is greater than the table value (3.88). The hypothesis is rejected.
Hence there is a significant difference in the sample means.
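The one-way computation above can be sketched as follows. The function name one_way_anova is an assumption for illustration. The between-samples sum of squares weights each squared deviation of a sample mean from the grand mean by the sample size, which is what the five repeated rows in the table above amount to.

```python
def one_way_anova(samples):
    """Return (SS between, SS within, F) for a list of samples."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    ms_between = ss_between / (k - 1)   # nu1 = k - 1
    ms_within = ss_within / (n - k)     # nu2 = n - k
    return ss_between, ss_within, ms_between / ms_within


samples = [[8, 10, 7, 14, 11], [7, 5, 10, 9, 9], [12, 9, 13, 12, 14]]
ss_between, ss_within, f_cal = one_way_anova(samples)
reject_h0 = f_cal > 3.88  # table value for nu1 = 2, nu2 = 12 quoted in the problem
print(ss_between, ss_within, f_cal)
```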
ANALYSIS OF VARIANCE IN TWO-WAY CLASSIFICATION
MODEL
In a one-factor analysis of variance explained above, the treatments constitute different
levels of a single factor which is controlled in the experiment. There are, however, many
situations in which the response variable of interest may be affected by more than one
factor. For example, sales of Maxfactor Cosmetics, in addition to being affected by the
point-of-sale display, might also be affected by the price charged, the size and/or location
of the store or the number of competitive products sold by the store. Similarly,
petrol mileage may be affected by the type of car driven, the way it is driven, road
conditions and other factors in addition to the brand of petrol used.
When it is believed that two independent factors might have an effect on the response
variable of interest, it is possible to design the test so that an analysis of variance can be
used to test for the effects of the two factors simultaneously. Such a test is called a
two-factor analysis of variance. With two-factor analysis of variance, we can test two sets of
hypotheses with the same data at the same time.
In a two-way classification the data are classified according to two different criteria or
factors. The procedure for analysis of variance is somewhat different from the one followed
while dealing with problems of one-way classification. In a two-way classification the
analysis of variance table takes the following form.
Source of variation   Sum of squares   Degrees of freedom   Mean sum of squares       Ratio of F
Between columns       SSC              (c-1)                MSC = SSC/(c-1)           MSC/MSE
Between rows          SSR              (r-1)                MSR = SSR/(r-1)           MSR/MSE
Residual or error     SSE              (c-1)(r-1)           MSE = SSE/((r-1)(c-1))
Total                 SST              n-1
SSC = Sum of squares between columns
SSR = Sum of squares between rows
SSE = Sum of squares due to error
SST = Total sum of squares
The sum of squares for the source 'Residual' is obtained by subtracting from the total sum of
squares the sum of squares between columns and rows, i.e.,
SSE = SST - [SSC + SSR]
The total number of degrees of freedom = n - 1 or cr - 1,
where c refers to the number of columns, and
r refers to the number of rows.
Number of degrees of freedom between columns = (c-1)
Number of degrees of freedom between rows = (r-1)
Number of degrees of freedom for residual = (c-1)(r-1)
The total sum of squares, the sum of squares between columns and the sum of squares
between rows are obtained in the same way as before.
Residual or error sum of squares = Total sum of squares - Sum of squares between columns
- Sum of squares between rows.
The F values are calculated as follows:
$F(\nu_1, \nu_2)$ = MSC/MSE, where $\nu_1$ = (c-1) and $\nu_2$ = (c-1)(r-1)
$F(\nu_1, \nu_2)$ = MSR/MSE, where $\nu_1$ = (r-1) and $\nu_2$ = (c-1)(r-1)
It should be carefully noted that $\nu_1$ may not be the same in both cases: in one case
$\nu_1$ = (c-1) and in the other case $\nu_1$ = (r-1).
The calculated values of F are compared with the table values. If the calculated value of F is
greater than the table value at the pre-assigned level of significance, the null hypothesis is
rejected, otherwise it is accepted.
It would be clear from the above that in problems involving two-way classification, the residual
is the measuring rod for testing significance. It represents the magnitude of variation due to
forces called chance. The following examples illustrate the procedure.
Problems:
A tea company appoints four salesmen A, B, C and D and observes their sales in three
seasons: summer, winter and monsoon. The figures (in lakhs) are given in the following
table:
                     Salesmen
Seasons              A     B     C     D     Season totals
Summer               36    36    21    35    128
Winter               28    29    31    32    120
Monsoon              26    28    29    29    112
Salesmen's totals    90    93    81    96    360
(i) Do the salesmen significantly differ in performance?
(ii) Is there a significant difference between the seasons?
Solutions:
The above data are classified according to two criteria, (i) salesmen and (ii) seasons. In order
to simplify calculations we code the data by subtracting 30 from each figure. The data in the
coded form are given below:
                     Salesmen
Seasons              A     B     C     D     Season totals
Summer               +6    +6    -9    +5    +8
Winter               -2    -1    +1    +2    0
Monsoon              -4    -2    -1    -1    -8
Salesmen's totals    0     +3    -9    +6    Grand total T = 0
\[ \text{Correction Factor} = \frac{T^2}{N} = \frac{0^2}{12} = 0 \]
(number of items or N is 12)
Sum of squares between salesmen
This will be obtained by squaring the salesmen's totals, dividing each total by the number
of items included in it, adding these figures and then subtracting the correction factor from
them.
Thus, sum of squares between salesmen:
\[ \frac{(0)^2}{3} + \frac{(3)^2}{3} + \frac{(-9)^2}{3} + \frac{(6)^2}{3} - \frac{T^2}{N} = 0 + 3 + 27 + 12 - 0 = 42, \quad \nu = (4 - 1) = 3 \]
Sum of squares between seasons
This will be obtained by dividing the squares of the season totals by the number of items
that make up each total, adding all such figures and subtracting therefrom the correction
factor. Thus, sum of squares between seasons:
\[ \frac{(8)^2}{4} + \frac{(0)^2}{4} + \frac{(-8)^2}{4} - \frac{T^2}{N} = 16 + 0 + 16 - 0 = 32, \quad \nu = (3 - 1) = 2 \]
Total sum of squares
This will be obtained by adding the squares of all items in the table and subtracting the
correction factor therefrom, thus:
Total sum of squares =
\[ (6)^2 + (-2)^2 + (-4)^2 + (6)^2 + (-1)^2 + (-2)^2 + (-9)^2 + (1)^2 + (-1)^2 + (5)^2 + (2)^2 + (-1)^2 - \frac{T^2}{N} = 210 - 0 = 210, \quad \nu = (12 - 1) = 11 \]
The above information will be presented in the following table of Analysis of Variance:
Source of variation           Sum of squares   Degrees of freedom   Mean sum of squares
Between columns (salesmen)    42               3                    14
Between rows (seasons)        32               2                    16
Residual                      136              6                    22.67
Total                         210              11
Let us take the hypothesis that there is no difference between the sales of the salesmen or of
the seasons. In other words, the three independent estimates of variance are estimates of the
variance of a common population.
Now first compare the salesmen variance estimate with the residual variance estimate. The
residual estimate is the larger of the two, so placing it in the numerator gives
\[ F = \frac{22.67}{14} = 1.619 \]
The table value of F for $\nu_1$ = 3 and $\nu_2$ = 6 at the 5% level of significance is 4.76.
The calculated value is less than the table value and we conclude that the sales of different
salesmen do not differ significantly.
Now let us compare the season variance estimate with the residual variance estimate; thus
\[ F = \frac{22.67}{16} = 1.417 \]
The critical value of F for $\nu_1$ = 2 and $\nu_2$ = 6 at the 5% level of significance is 5.14.
The calculated value is less than this and hence there is no significant difference in the
seasons as far as the sales are concerned.
(Equivalently, MSC/MSE = 14/22.67 = 0.618 and MSR/MSE = 16/22.67 = 0.706 are both below 1
and so cannot exceed the table values.)
Thus, the test shows that the salesmen and the seasons are alike so far as the sales are
concerned.
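The two-way computation (correction factor, SSC, SSR, SST and residual) can be sketched in Python as below. The function name and the dictionary keys are illustrative assumptions. Note that the sketch reports the ratios MSC/MSE and MSR/MSE as defined in the ANOVA table, whereas the worked solution puts the larger estimate in the numerator; both lead to the same decision here.

```python
def two_way_anova(table):
    """Two-way ANOVA without replication; rows and columns are the two factors."""
    r, c = len(table), len(table[0])
    n = r * c
    grand_total = sum(sum(row) for row in table)
    cf = grand_total ** 2 / n                                   # correction factor T^2/N
    sst = sum(x * x for row in table for x in row) - cf
    ssr = sum(sum(row) ** 2 for row in table) / c - cf          # between rows
    ssc = sum(sum(col) ** 2 for col in zip(*table)) / r - cf    # between columns
    sse = sst - ssc - ssr                                       # residual
    mse = sse / ((r - 1) * (c - 1))
    return {
        "SSC": ssc, "SSR": ssr, "SSE": sse, "SST": sst,
        "F_columns": (ssc / (c - 1)) / mse,
        "F_rows": (ssr / (r - 1)) / mse,
    }


# Coded tea sales (each figure minus 30): rows are seasons, columns are salesmen A-D.
coded = [[6, 6, -9, 5], [-2, -1, 1, 2], [-4, -2, -1, -1]]
result = two_way_anova(coded)
print(result)
```

Because subtracting a constant does not change any sum of squares, passing the original (uncoded) figures gives the same result.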
Problems:
The following data represent the number of units of production per day turned out by 5
different workers using 4 different types of machines:
            Machine type
Workers     A     B     C     D
1           44    38    47    36
2           46    40    52    43
3           34    36    44    32
4           43    38    46    33
5           38    42    49    39
A. Test whether the mean productivity is the same for the different machine types.
B. Test whether the 5 men differ with respect to mean productivity.

Solutions:
Let us take the hypothesis that (a) the mean productivity is the same for the four different
machines, and (b) the 5 men do not differ with respect to mean productivity. To simplify
calculations let us subtract 40 from each value. The coded data is given below:
            Machine type
Worker      A     B     C      D      Total
1           +4    -2    +7     -4     +5
2           +6    0     +12    +3     +21
3           -6    -4    +4     -8     -14
4           +3    -2    +6     -7     0
5           -2    +2    +9     -1     +8
Total       +5    -6    +38    -17    T = 20
\[ \text{Correction Factor} = \frac{T^2}{N} = \frac{400}{20} = 20 \]
Sum of squares between machines
\[ = \frac{(5)^2}{5} + \frac{(-6)^2}{5} + \frac{(38)^2}{5} + \frac{(-17)^2}{5} - \text{Correction Factor} \]
\[ = (5 + 7.2 + 288.8 + 57.8) - 20 = 358.8 - 20 = 338.8, \quad \nu = (c - 1) = (4 - 1) = 3 \]
Sum of squares between workers
\[ = \frac{(5)^2}{4} + \frac{(21)^2}{4} + \frac{(-14)^2}{4} + \frac{(0)^2}{4} + \frac{(8)^2}{4} - \frac{T^2}{N} = \frac{25}{4} + \frac{441}{4} + \frac{196}{4} + \frac{0}{4} + \frac{64}{4} - 20 \]
\[ = (6.25 + 110.25 + 49 + 0 + 16) - 20 = 181.5 - 20 = 161.5, \quad \nu = (r - 1) = (5 - 1) = 4 \]
Total sum of squares
\[ = [(4)^2 + (-2)^2 + (7)^2 + (-4)^2 + (6)^2 + (0)^2 + (12)^2 + (3)^2 + (-6)^2 + (-4)^2 + (4)^2 + (-8)^2 + (3)^2 + (-2)^2 + (6)^2 + (-7)^2 + (-2)^2 + (2)^2 + (9)^2 + (-1)^2] - \frac{T^2}{N} \]
\[ = [16 + 4 + 49 + 16 + 36 + 0 + 144 + 9 + 36 + 16 + 16 + 64 + 9 + 4 + 36 + 49 + 4 + 4 + 81 + 1] - 20 = 594 - 20 = 574 \]
Residual or Remainder = Total sum of squares - (Sum of squares between machines + Sum
of squares between workers)
= 574 - 338.8 - 161.5 = 73.7
Degrees of freedom for remainder = 19 - 3 - 4 = 12, or
(c-1)(r-1) = (3 × 4) = 12
Source of variation      S.S      d.f   M.S       Variance ratio or F
Between machine types    338.8    3     112.933   112.933/6.142 = 18.387
Between workers          161.5    4     40.375    40.375/6.142 = 6.574
Remainder or residual    73.7     12    6.142
Total                    574      19
(a) For $\nu_1$ = 3, $\nu_2$ = 12, $F_{0.05}$ = 3.49
Since the calculated value (18.4) is greater than the table value, we conclude that the
mean productivity is not the same for the four different types of machines.
(b) For $\nu_1$ = 4, $\nu_2$ = 12, $F_{0.05}$ = 3.26
The calculated value (6.57) is greater than the table value, hence the workers differ with
respect to mean productivity.
Application of the t-distribution
The following examples illustrate the way in which Student's t-distribution is generally
used to test the significance of the various results obtained from small samples.
1. To test the Significance of the Mean of a Random Sample. In determining whether
the mean of a sample drawn from a normal population deviates significantly from a stated
value (the hypothetical value of the population mean), when the variance of the population is
unknown, we calculate the statistic:
\[ t = \frac{(\bar{X} - \mu)\sqrt{n}}{S} \]
where
$\bar{X}$ = the mean of the sample
$\mu$ = the actual or hypothetical mean of the population
n = the sample size
S = the standard deviation of the sample, $S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Problems
The manufacturer of a certain make of electric bulbs claims that his bulbs have a mean life
of 25 months with a standard deviation of 5 months. A random sample of 6 such bulbs gave
the following values.
Life in months: 24, 26, 30, 20, 20, 18
Can you regard the producer's claim to be valid at the 1% level of significance? (Given that the
table values of the appropriate test statistic at the said level are 4.032, 3.707 and 3.499 for
5, 6 and 7 degrees of freedom respectively.)
Solutions: Let us take the hypothesis that there is no significant difference in the mean life
of bulbs in the sample and that of the population. Applying the t-test:
\[ t = \frac{|\bar{X} - \mu|\sqrt{n}}{S}, \quad S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]
CALCULATION OF $\bar{X}$ AND S
X        $x = (X - \bar{X})$    $x^2$
24       +1       1
26       +3       9
30       +7       49
20       -3       9
20       -3       9
18       -5       25
ΣX=138            Σx^2=102
\[ \bar{X} = \frac{\sum X}{n} = \frac{138}{6} = 23; \quad S = \sqrt{\frac{\sum x^2}{n - 1}} = \sqrt{\frac{102}{5}} = \sqrt{20.4} = 4.517 \]
\[ t = \frac{|23 - 25|\sqrt{6}}{4.517} = \frac{2 \times 2.449}{4.517} = 1.084 \]
$\nu$ = n-1 = 6-1 = 5. For $\nu$ = 5, $t_{0.01}$ = 4.032.
The calculated value of t is less than the table value. The hypothesis is accepted. Hence,
the producer's claim is valid at the 1% level of significance.
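The bulb example can be verified with the short sketch below. The function name one_sample_t is a hypothetical helper; the critical value is the one given in the problem.

```python
import math


def one_sample_t(values, mu):
    """Return (t, degrees of freedom) for testing the sample mean against mu."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return abs(mean - mu) * math.sqrt(n) / s, n - 1


life_in_months = [24, 26, 30, 20, 20, 18]
t_cal, df = one_sample_t(life_in_months, 25)
h0_accepted = t_cal < 4.032  # t at the 1% level for 5 d.f., as given in the problem
print(round(t_cal, 3), df)
```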
Problems
A random sample of size 16 has 53 as mean. The sum of the squares of the deviations taken
from the mean is 135. Can this sample be regarded as taken from the population having 56 as
mean? Obtain 95% and 99% confidence limits of the mean of the population. (For $\nu$ = 15,
$t_{0.05}$ = 2.13; for $\nu$ = 15, $t_{0.01}$ = 2.95.)
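Solutions: Let us take the hypothesis that there is no significant difference between the
sample mean and the hypothetical population mean. Applying the t-test:
\[ t = \frac{|\bar{X} - \mu|\sqrt{n}}{S} \]
\[ \bar{X} = 53, \quad \mu = 56, \quad n = 16, \quad \sum (X - \bar{X})^2 = 135 \]
\[ S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{135}{15}} = 3 \]
\[ t = \frac{|53 - 56|\sqrt{16}}{3} = \frac{3 \times 4}{3} = 4 \]
$\nu$ = 16-1 = 15. For $\nu$ = 15, $t_{0.05}$ = 2.13.
The calculated value of t is more than the table value. The hypothesis is rejected. Hence,
the sample has not come from a population having 56 as mean.
95% confidence limits of the population mean
\[ \bar{X} \pm \frac{S}{\sqrt{n}}\, t_{0.05} = 53 \pm \frac{3}{\sqrt{16}} \times 2.13 = 53 \pm 1.6 = 51.4 \text{ to } 54.6 \]
99% confidence limits of the population mean
\[ \bar{X} \pm \frac{S}{\sqrt{n}}\, t_{0.01} = 53 \pm \frac{3}{\sqrt{16}} \times 2.95 = 53 \pm \frac{3}{4} \times 2.95 = 53 \pm 2.212 = 50.788 \text{ to } 55.212 \]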
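The confidence limits are the sample mean plus or minus (S/√n) times the table value of t. A minimal sketch, assuming the table value is supplied by the caller (the function name is illustrative):

```python
import math


def confidence_limits(mean, s, n, t_table):
    """Return (lower, upper) limits: mean -/+ t_table * s / sqrt(n)."""
    half_width = t_table * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Sample of size 16 with mean 53 and S = 3; t values for 15 d.f. come from the problem.
limits_95 = confidence_limits(53, 3, 16, 2.13)
limits_99 = confidence_limits(53, 3, 16, 2.95)
print(limits_95, limits_99)
```

The hypothetical mean 56 lies outside both intervals, which agrees with the rejection of the hypothesis above.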
2. Testing Difference Between Means of Two Samples (Independent Samples). Given two
independent random samples of size $n_1$ and $n_2$ with means $\bar{X}_1$ and $\bar{X}_2$ and standard
deviations $S_1$ and $S_2$, we may be interested in testing the hypothesis that the samples come
from the same normal population. To carry out the test, we calculate the statistic as
follows:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S} \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \]
where
$\bar{X}_1$ = mean of the first sample
$\bar{X}_2$ = mean of the second sample
$n_1$ = number of observations in the first sample
$n_2$ = number of observations in the second sample
S = combined standard deviation
The value of S is calculated by the following formula:
\[ S = \sqrt{\frac{\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2}{n_1 + n_2 - 2}} \]
If the bias correction due to small samples is ignored, a pooled estimate of the standard
deviation can be obtained by:
\[ S = \sqrt{\frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2}} \]
Problems:
Two types of drugs were used on 5 and 7 patients for reducing their weight.
Drug A was imported and drug B indigenous. The decrease in weight after using the
drugs for six months was as follows:
Drug A : 10 12 13 11 14
Drug B : 8 9 12 14 15 10 9
Is there a significant difference in the efficacy of the two drugs? If not, which drug should
you buy? (For $\nu$ = 10, $t_{0.05}$ = 2.228)
Solution: Let us take the hypothesis that there is no significant difference in the efficacy of
the two drugs. Applying the t-test:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S} \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \]
X1      $(X_1 - \bar{X}_1)$   $(X_1 - \bar{X}_1)^2$     X2      $(X_2 - \bar{X}_2)$   $(X_2 - \bar{X}_2)^2$
10      -2      4          8       -3      9
12      0       0          9       -2      4
13      +1      1          12      +1      1
11      -1      1          14      +3      9
14      +2      4          15      +4      16
                           10      -1      1
                           9       -2      4
ΣX1=60          Σ=10       ΣX2=77          Σ=44
\[ \bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{60}{5} = 12; \quad \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{77}{7} = 11 \]
It is advisable to take account of the bias correction, so S is computed with $n_1 + n_2 - 2$ in
the denominator:
\[ S = \sqrt{\frac{\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2}{n_1 + n_2 - 2}} = \sqrt{\frac{10 + 44}{5 + 7 - 2}} = \sqrt{\frac{54}{10}} = 2.324 \]
\[ t = \frac{12 - 11}{2.324} \sqrt{\frac{5 \times 7}{5 + 7}} = \frac{1.708}{2.324} = 0.735 \]
$\nu = n_1 + n_2 - 2$ = 5+7-2 = 10
For $\nu$ = 10, $t_{0.05}$ = 2.228.
The calculated value of t is less than the table value, so the hypothesis is accepted. Hence,
there is no significant difference in the efficacy of the two drugs. Since drug B is indigenous
and there is no difference in the efficacy of the imported and indigenous drugs, we should buy
the indigenous drug, i.e., B.
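The pooled-variance t statistic for two independent samples can be sketched as follows. The function name is an assumption; the pooled S uses the bias-corrected denominator n1 + n2 - 2, as in the solution.

```python
import math


def pooled_t(sample_1, sample_2):
    """Return (t, degrees of freedom) for two independent samples with pooled S."""
    n1, n2 = len(sample_1), len(sample_2)
    m1, m2 = sum(sample_1) / n1, sum(sample_2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample_1)
    ss2 = sum((x - m2) ** 2 for x in sample_2)
    s = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    t = abs(m1 - m2) / s * math.sqrt(n1 * n2 / (n1 + n2))
    return t, n1 + n2 - 2


drug_a = [10, 12, 13, 11, 14]
drug_b = [8, 9, 12, 14, 15, 10, 9]
t_cal, df = pooled_t(drug_a, drug_b)
h0_accepted = t_cal < 2.228  # t at the 5% level for 10 d.f.
print(round(t_cal, 3), df)
```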
Problems:
For a random sample of 10 persons, fed on diet A, the increases in weight in pounds in a
certain period were:
10, 6, 16, 17, 13, 12, 8, 14, 15, 9
For another random sample of 12 persons, fed on diet B, the increases in the same period were:
7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17
Test whether the diets A and B differ significantly as regards their effect on increase in
weight. Given the following:
Degrees of freedom       19     20     21     22     23
Value of t at 5% level   2.09   2.09   2.08   2.07   2.07
Solutions: Let us take the null hypothesis that A and B do not differ significantly with
regard to their effect on increase in weight. Applying the t-test:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S} \sqrt{\frac{n_1 n_2}{n_1 + n_2}}, \quad S = \sqrt{\frac{\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2}{n_1 + n_2 - 2}} \]
Calculating the required values:
Persons fed on diet A                      Persons fed on diet B
Increase    Deviation from                 Increase    Deviation from
in weight   mean 12                        in weight   mean 15
X1          $(X_1 - \bar{X}_1)$   $(X_1 - \bar{X}_1)^2$     X2          $(X_2 - \bar{X}_2)$   $(X_2 - \bar{X}_2)^2$
10          -2          4                  7           -8          64
6           -6          36                 13          -2          4
16          +4          16                 22          +7          49
17          +5          25                 15          0           0
13          +1          1                  12          -3          9
12          0           0                  14          -1          1
8           -4          16                 18          +3          9
14          +2          4                  8           -7          49
15          +3          9                  21          +6          36
9           -3          9                  23          +8          64
                                           10          -5          25
                                           17          +2          4
ΣX1=120     Σ=0         Σ=120              ΣX2=180     Σ=0         Σ=314
Mean increase in weight of 10 persons fed on diet A:
\[ \bar{X}_1 = \frac{\sum X_1}{n_1} = \frac{120}{10} = 12 \text{ pounds} \]
Mean increase in weight of 12 persons fed on diet B:
\[ \bar{X}_2 = \frac{\sum X_2}{n_2} = \frac{180}{12} = 15 \text{ pounds} \]
\[ S = \sqrt{\frac{\sum (X_1 - \bar{X}_1)^2 + \sum (X_2 - \bar{X}_2)^2}{n_1 + n_2 - 2}} = \sqrt{\frac{120 + 314}{10 + 12 - 2}} = \sqrt{\frac{434}{20}} = 4.66 \]
$\bar{X}_1$ = 12, $\bar{X}_2$ = 15, $n_1$ = 10, $n_2$ = 12, S = 4.66. Substituting the values in the above formula:
\[ |t| = \frac{|12 - 15|}{4.66} \sqrt{\frac{10 \times 12}{10 + 12}} = \frac{3}{4.66} \times 2.34 = 1.51 \]
$\nu = n_1 + n_2 - 2$ = 10+12-2 = 20.
For $\nu$ = 20, the table value of t at the 5 percent level is 2.09. The calculated value is less than the
table value and hence the experiment provides no evidence against the hypothesis. We,
therefore, conclude that diets A and B do not differ significantly as regards their effect on
increase in weight.
3. Testing Difference between Means of Two samples( Dependent Samples or
Matched Paired Observations)
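\[ t = \frac{\bar{d} - 0}{S / \sqrt{n}} \quad \text{or} \quad t = \frac{\bar{d}\sqrt{n}}{S} \]
where
$\bar{d}$ = the mean of the differences
S = the standard deviation of the differences
The value of S is calculated as follows:
\[ S = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}} \quad \text{or} \quad S = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n - 1}} \]
It should be noted that t is based on n-1 degrees of freedom.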
Problems
To verify whether a course in accounting improved performance, a similar test was given to
12 participants both before and after the course. The original marks recorded in
alphabetical order of the participants were 44, 40, 61, 52, 32, 44, 70, 41, 67, 72, 53 and 72.
After the course, the marks were, in the same order, 53, 38, 69, 57, 46, 39, 73, 48, 73, 74, 60
and 78. Was the course useful?
Solutions: Let us take the hypothesis that there is no difference in the marks obtained
before and after the course, i.e. the course has not been useful.
Applying the t-test (difference formula):
\[ t = \frac{\bar{d}\sqrt{n}}{S} \]
Participant   Before       After        d = (2nd - 1st Test)   d^2
              (1st Test)   (2nd Test)
A             44           53           +9                     81
B             40           38           -2                     4
C             61           69           +8                     64
D             52           57           +5                     25
E             32           46           +14                    196
F             44           39           -5                     25
G             70           73           +3                     9
H             41           48           +7                     49
I             67           73           +6                     36
J             72           74           +2                     4
K             53           60           +7                     49
L             72           78           +6                     36
                                        Σd=60                  Σd^2=578
\[ \bar{d} = \frac{\sum d}{n} = \frac{60}{12} = 5 \]
\[ S = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n - 1}} = \sqrt{\frac{578 - 12(5)^2}{12 - 1}} = \sqrt{\frac{278}{11}} = 5.03 \]
\[ t = \frac{5\sqrt{12}}{5.03} = \frac{5 \times 3.464}{5.03} = 3.443 \]
$\nu$ = n-1 = 12-1 = 11;
For $\nu$ = 11, $t_{0.05}$ = 2.201
The calculated value of t is greater than the table value. The hypothesis is rejected. Hence
the course has been useful.
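The paired (difference) test above can be sketched in Python. The function name paired_t is hypothetical. Because S is not rounded to 5.03 in the code, the statistic comes out at about 3.445 rather than 3.443.

```python
import math


def paired_t(before, after):
    """Return (t, degrees of freedom) for matched pairs using d = after - before."""
    d = [a - b for a, b in zip(after, before)]
    n = len(d)
    d_bar = sum(d) / n
    s = math.sqrt((sum(x * x for x in d) - n * d_bar ** 2) / (n - 1))
    return d_bar * math.sqrt(n) / s, n - 1


before = [44, 40, 61, 52, 32, 44, 70, 41, 67, 72, 53, 72]
after = [53, 38, 69, 57, 46, 39, 73, 48, 73, 74, 60, 78]
t_cal, df = paired_t(before, after)
reject_h0 = t_cal > 2.201  # t at the 5% level for 11 d.f.
print(round(t_cal, 3), df)
```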
Problems:
A drug is given to 10 patients and the increments in their blood pressure were recorded to
be 3, 6, -2, 4, -3, 4, 6, 0, 0, 2. Is it reasonable to believe that the drug has no effect on
change of blood pressure? (5% value of t for 9 d.f.=2.26)
Solution. Let us take the hypothesis that the drug has no effect on change of blood pressure.
Applying the difference test:
\[ t = \frac{\bar{d}\sqrt{n}}{S} \]
d       d^2
3       9
6       36
-2      4
4       16
-3      9
4       16
6       36
0       0
0       0
2       4
Σd=20   Σd^2=130
\[ \bar{d} = \frac{\sum d}{n} = \frac{20}{10} = 2 \]
\[ S = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n - 1}} = \sqrt{\frac{130 - 10(2)^2}{10 - 1}} = \sqrt{10} = 3.162 \]
\[ t = \frac{2\sqrt{10}}{3.162} = \frac{2 \times 3.162}{3.162} = 2 \]
$\nu$ = n-1 = 10-1 = 9; for $\nu$ = 9, $t_{0.05}$ = 2.26.
The calculated value of t is less than the table value. The hypothesis is accepted. Hence it is
reasonable to believe that the drug has no effect on change of blood pressure.
Summary of Hypothesis Testing
Carl Lee
A. Procedure for Testing a Hypothesis on the Mean: Large Sample Cases
1. The following table summarizes the procedure.
Step 1: Set up the alternative (research) hypothesis, Ha, and set H0: μ = μ0
Hypothesis             Left-sided               Two-sided                Right-sided
Null vs. Alternative   H0: μ = μ0, Ha: μ < μ0   H0: μ = μ0, Ha: μ ≠ μ0   H0: μ = μ0, Ha: μ > μ0
Step 2: Set up the appropriate rule and test statistic
Rule: use the z-statistic
Step 3: Compute the observed z-value using the observed sample mean $\bar{x}$ and s.d. s from the
collected sample data:
\[ z_{obs} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
Reject H0 (in favor of Ha) if   zobs < -zα   zobs < -zα/2 or zobs > zα/2   zobs > zα
Step 4: Make a concluding statement based on the question asked.

B. Note: in all hypothesis testing situations, the direction of the alternative hypothesis determines the
direction of the rejection region. For example, in the large-sample test of hypothesis about a population
mean, for Ha: μ > μ0, the rejection region is z > zα; for Ha: μ < μ0, the rejection region is z < -zα; for
Ha: μ ≠ μ0, the rejection region is z < -zα/2 or z > zα/2. Note also that, whenever Ha is two-tailed, α is
divided into 2 equal parts to determine the rejection region.
C. p-value and α-value:
The α-value is the predetermined level of significance. Typical α-values are 1%, 2%, 5% or 10%. This value
sets up the rejection region. The α-value is usually small because, in real applications, we
should not reject the null hypothesis unless we have strong evidence that the sample mean is far away from
the value stated in the null hypothesis. Setting the α-value as small as 5% indicates that if the sample mean
really falls into the 5% region, then it must be very rare under the null, and hence we have strong evidence
to reject the null hypothesis.
In contrast, the p-value is the observed level of significance. One can compare the p-value with the
α-value to make the decision in hypothesis test problems. This is a common approach when computer
software is available. The p-value is not easy to compute by hand. However, all statistical software gives
the p-value for the intended test.
D. The difference between p-value and α-value. Consider a right-sided test situation.
α-value = P(Z > zα). This is the probability of Z being higher than the critical value zα.
p-value = P(Z > zobs). This is the probability of Z being higher than the observed zobs. For example,
consider α = .05. Then zα = 1.645, and suppose we obtain zobs = 1.02.
Based on the rule using zα and zobs, we see that, since zobs = 1.02 < zα = 1.645, we do not reject H0.
If we compute the p-value, P(Z > zobs) = P(Z > 1.02) = 0.5 - .3461 = .1539. Then, instead of comparing the
two z-values (zobs vs. zα), we can compare the corresponding "rejection region" probabilities (p and α).
E. The rule based on the p-value will be:
If the p-value < α, then reject H0; otherwise do not reject H0.
Based on this rule, p-value = .1539 > α = .05. We do not reject H0.
Note: the final decision using either (zobs, zα) or (p-value, α) is the same.
F. Computation of the p-value and the corresponding rule: large sample case
Left-sided test: p-value = P(Z < zobs)
Two-sided test: p-value = 2P(Z > |zobs|)
Right-sided test: p-value = P(Z > zobs)
G. The rule for using the p-value, regardless of the type of test, is
If p-value < α, then reject H0
If p-value ≥ α, then do not reject H0
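The three p-value formulas in F can be sketched with the standard normal distribution function, written here in terms of math.erf from the Python standard library. The function names and the "left" / "two" / "right" labels are assumptions made for this illustration.

```python
import math


def normal_cdf(z):
    """P(Z < z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def p_value(z_obs, tail):
    """Large-sample p-value for a left-, two- or right-sided test."""
    if tail == "left":
        return normal_cdf(z_obs)
    if tail == "right":
        return 1 - normal_cdf(z_obs)
    if tail == "two":
        return 2 * (1 - normal_cdf(abs(z_obs)))
    raise ValueError("tail must be 'left', 'two' or 'right'")


alpha = 0.05
p = p_value(1.02, "right")   # the right-sided example with zobs = 1.02
reject_h0 = p < alpha
print(round(p, 4), reject_h0)
```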
H. Procedure for Testing a Hypothesis on the Mean: Small Sample Cases
1. The following table summarizes the procedure.
Step 1: Set up the alternative (research) hypothesis, Ha, and set H0: μ = μ0
Hypothesis             Left-sided               Two-sided                Right-sided
Null vs. Alternative   H0: μ = μ0, Ha: μ < μ0   H0: μ = μ0, Ha: μ ≠ μ0   H0: μ = μ0, Ha: μ > μ0
Step 2: Set up the appropriate rule and test statistic
Rule: use the t-statistic
Step 3: Compute the observed t-value using the observed sample mean $\bar{x}$ and s.d. s from the
collected sample data:
\[ t_{obs} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \quad \nu = n - 1 \]
Reject H0 (in favor of Ha) if   tobs < -t(α, ν)   tobs < -t(α/2, ν) or tobs > t(α/2, ν)   tobs > t(α, ν)
Or use the p-value to make a decision:
p-value = P(t < tobs)   p-value = 2P(t > |tobs|)   p-value = P(t > tobs)
Decision based on the p-value:
If p-value < α then reject H0. If p-value ≥ α then do not reject H0.
Step 4: Make a concluding statement based on the question asked.
2. The difference between small sample and large sample hypothesis testing problems is the choice of test
statistic. A test statistic is a standardization of the sample statistic. For large sample cases, the
standardized statistic is $z = (\bar{x} - \mu_0)/(s/\sqrt{n})$. For small sample cases, the standardized statistic is
$t = (\bar{x} - \mu_0)/(s/\sqrt{n})$, even though the computation is the same.
I. The concept that a test statistic is the standardized sample statistic is true for most statistical
hypothesis testing problems, including all test statistics that are or will be covered in this course and many
others that are not covered in this course. The reason behind using the standardized sample statistics as test
statistics is that standardized measures can be compared without worrying about the units, and probability
distributions for the standardized sample statistics are either known or easier to derive.
J. Procedure for Testing a Hypothesis on Proportion.
1. The procedure for testing a hypothesis on a population proportion is similar to the large sample test for
the mean, except that we are interested in p (population proportion), not in μ (population mean).
Step 1: Set up the alternative (research) hypothesis, Ha, and set H0: p = p0
Hypothesis             Left-sided               Two-sided                Right-sided
Null vs. Alternative   H0: p = p0, Ha: p < p0   H0: p = p0, Ha: p ≠ p0   H0: p = p0, Ha: p > p0
Step 2: Set up the appropriate rule and test statistic
Rule: use the z-statistic
Step 3: Compute the observed z-value using the sample proportion $\hat{p}$:
Reject H0 (in favor of Ha) if   zobs < -zα   zobs < -zα/2 or zobs > zα/2   zobs > zα
Step 4: Make a concluding statement based on the question asked.
2. In step 3 of the procedure, Z is the standardized variable for $\hat{p}$ through
\[ z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}}, \quad \text{where} \quad \sigma_{\hat{p}} = \sqrt{\frac{p_0 (1 - p_0)}{n}} \]