
Statistical Methods
Amity Directorate of Distance & Online Education

Business statistics is the science of good decision making in the face of uncertainty
and is used in many disciplines such as financial analysis, econometrics, auditing,
production and operations including services improvement, and marketing research.

Bachelor of Arts in Economics
Semester II

Preface
The importance of Business Statistics, as a field of study and practice, is being
increasingly realized in schools, colleges, universities, and commercial and industrial
organizations, both in India and abroad.
It is a technical and practical subject, and learning it means familiarizing oneself with
many new terms and concepts. As this Study Material is intended to serve beginners
in the field, I have given it the quality of simplicity. It is intended to serve as a Study
Material for students of the BBA course of Amity University, is student-oriented and
is written in a teach-yourself style.
The primary objective of this study material is to facilitate a clear understanding of the
subject of Business Statistics. The Material contains a wide range of theoretical and
practical questions varying in content, length and complexity. Most of the illustrations
and exercise problems have been taken from various university examinations. The
material contains a sufficiently large number of illustrations to assist a better grasp and
understanding of the subject. The reader will find accuracy with regard to the
formulae and the answers to the exercise questions. For the convenience of the students, I
have also included multiple-choice questions and a case study in this Study Material for
better understanding of the subject.
I hope that this Material will prove useful to both students and teachers. The contents of
this Study Material are divided into eight chapters covering various aspects of the
syllabus of the BBA and other related courses. At the end of this Material, three
assignments related to the subject matter have been provided.
I have taken a considerable amount of help from various books, journals and other
media. I express my gratitude to all those who have devoted their lives to knowledge,
especially Statistics, from whom I could learn, and on the basis of that learning I am
now trying to pass on my knowledge to others through this material.
It is by God's loving grace that He brought me into this world and blessed me with loving
and caring parents, my respected father Mr. Manohar Lal Arora and my loving mother
Mrs. Kamla Arora, who have supported me in this Study Material.
I am thankful to my beloved wife Mrs. Deepti Arora, without whose constant
encouragement, advice and material sacrifice, this achievement would have been a far-off
dream.

BUSINESS STATISTICS
Course Contents:
Module I: Introduction to Statistics
Definitions, Functions of Statistics, Statistics and Computers, Limitations of Statistics,
Applications of Statistics.
Module II: Data Collection and Analysis
Methods of Data Collection, Primary and Secondary Data, Measures of Dispersion -
Range, Quartile Deviation, Mean Deviation, Standard Deviation, Coefficient of
Variation (Absolute & Relative Measures of Dispersion), Skewness - Karl Pearson's
Coefficient of Skewness, Bowley's Coefficient of Skewness, Kurtosis.
Module III: Correlation Analysis And Regression Analysis
Introduction - Importance of Correlation, Types of Correlation, Scatter Diagram Method,
Karl Pearson's Coefficient of Correlation (Grouped and Ungrouped), Spearman's
Coefficient of Rank Correlation, Rank Correlation for Tied Ranks, Regression Analysis -
Concepts of Regression, Difference between Correlation and Regression, Regression Lines.
Module IV: Time Series Analysis
Meaning and Significance, Components of Time Series, Trend Measurement, Moving
Average Method, Least Square Method (Fitting of Straight Line Only).
Module V: Probability And Probability Distribution
Introduction, Terminology used in Probability, Definitions of Probability, Mathematical,
Statistical and Axiomatic Approaches to Probability, Probability Rules - Addition Rule,
Multiplication Rule of Probability, Conditional Probability - Bayes' Theorem, Problems on
Bayes' Theorem. Discrete Probability Distributions - Binomial Probability Distribution,
Poisson Probability Distribution, Properties, Applications; Continuous Probability
Distributions - Normal Probability Distribution, Properties of the Normal Curve,
Applications, Relation between Distributions.
Module VI: Sampling Design
Introduction: Some Fundamental Definitions, Census and Sample Survey, Steps in
Sampling Design, Criteria for Selecting a Sampling Procedure, Characteristics of a Good
Sample Design, Different Types of Sample Design.
Module VII: Testing Of Hypothesis
What is a Hypothesis? Basic Concepts concerning a Hypothesis. Procedure for
Hypothesis Testing. Tests of Hypothesis. Parametric Tests: Z-test, t-test.

Module VIII: Linear Programming


Introduction to Linear Programming, Formulation of a Linear Programming Problem,
Graphical Solution Method.
Text & References:
Text:
Business Statistics, S. B. Gupta and M. P. Gupta
References:
Statistical Methods, S. P. Gupta
Business Statistics, Sancheti & Kapoor
Business Statistics (An Applied Orientation), P. K. Viswanathan
Business Statistics, Dr. J. S. Chandan, Prof. Jagjit Singh & K. K. Khanna
Statistics for Business and Economics, Anderson, Sweeney & Williams
Quantitative Techniques, C. R. Kothari
Business Statistics, B. M. Aggarwal
Programmed Statistics, B. L. Aggarwal

Index

S. No.  Chapter No.  Subject
1.      1            Introduction to Statistics
2.      2            Primary and Secondary Data
3.      3            Measures of Dispersion
4.      4            Measures of Skewness
5.      5            Correlation Analysis
6.      6            Regression Analysis
7.      7            Time Series Analysis
8.      8            Probability

CHAPTER ONE
INTRODUCTION TO STATISTICS
1.1 Introduction
In the modern world of computers and information technology, the
importance of statistics is very well recognised by all disciplines.
Statistics originated as a science of statehood and slowly and steadily found
applications in Agriculture, Economics, Commerce, Biology, Medicine,
Industry, Planning, Education and so on. Today there is hardly any walk of
human life where statistics cannot be applied.
Statistics is a discipline which is concerned with:
- designing experiments and other data collection,
- summarizing information to aid understanding,
- drawing conclusions from data, and
- estimating the present or predicting the future.


Today, statistics has become an important tool in the work of many
academic disciplines such as medicine, psychology, education, sociology,
engineering and physics, just to name a few. Statistics is also important in
many aspects of society such as business, industry and government. Because
of the increasing use of statistics in so many areas of our lives, it has become
very desirable to understand and practise statistical thinking. This is
important even if you do not use statistical methods directly.

Examples of statistics: the unemployment rate, the consumer price index, the
rate of violent crimes, infant mortality rates, the poverty rate of a country, the
batting average of a baseball player, the on-base percentage of a baseball player,
salary rates, standardized test results.

1.2 Meaning of Statistics


The word 'Statistics' is derived from the Latin word 'Status', which
means a "political state." Clearly, statistics is closely linked with the
administrative affairs of a state, such as facts and figures regarding the defence
forces, population, housing, food, financial resources etc. What is true about a
government is also true about industrial administration units, and even one's
personal life.
The word statistics has several meanings. In the first place, it is a
plural noun which describes a collection of numerical data such as
employment statistics, accident statistics, population statistics, statistics of
births and deaths, of income and expenditure, of exports and imports, etc. It is
in this sense that the word 'statistics' is used by a layman or a newspaper.
Secondly, the word statistics, as a singular noun, is used to describe a
branch of applied mathematics whose purpose is to provide methods of
dealing with collections of data and extracting information from them in
compact form by tabulating, summarizing and analyzing the numerical data
or a set of observations.
The various methods used are termed as statistical methods and the
person using them is known as a statistician. A statistician is concerned with

the analysis and interpretation of the data and drawing valid worthwhile
conclusions from the same.
It is in the second sense that we are writing this guide on statistics.
Lastly, the word statistics is used in a specialized sense. It describes various
numerical items which are produced by applying statistics (in the second sense)
to statistics (in the first sense). Averages, standard deviation etc. are all
statistics in this specialized third sense.

1.3 Origin and Growth of Statistics:


The words 'Statistics' and 'Statistical' are derived from the Latin
word 'Status', which means a political state. The theory of statistics as a
distinct branch of scientific method is of comparatively recent growth.
Research, particularly into the mathematical theory of statistics, is proceeding
rapidly, and fresh discoveries are being made all over the world.

1.4 Definitions :
Statistics has been defined differently by different authors over a period of
time. In the olden days statistics was confined only to state affairs, but in
modern days it embraces almost every sphere of human activity. Therefore a
number of old definitions, which were confined to a narrow field of enquiry,
have been replaced by newer definitions, which are much more comprehensive
and exhaustive. Secondly, statistics has been defined in two different ways:
as statistical data and as statistical methods. The following are some of the
definitions of statistics as numerical data.

1. Statistics are the classified facts representing the conditions of people in a


state. In particular they are the facts, which can be stated in numbers or in
tables of numbers or in any tabular or classified arrangement.
2. Statistics are measurements, enumerations or estimates of natural
phenomenon usually systematically arranged, analysed and presented as to
exhibit important interrelationships among them.

1.4.1 Definition by Florence Nightingale


Statistics is "the most important science in the whole world:
for upon it depends the practical application of every other
science and every art: the one science essential to all political
and social administration, all education, all organization based on
experience, for it only gives results of our experience."

1.4.2 Definitions by A.L. Bowley:


"Statistics are numerical statements of facts in any department of
enquiry placed in relation to each other." - A. L. Bowley
Another definition due to Bowley is that "statistics may be called the
science of counting". Obviously this is an incomplete definition, as it takes
into account only the aspect of collection and ignores other aspects such as
analysis, presentation and interpretation.
Bowley gives yet another definition, which states that statistics "may
rightly be called the science of averages". This definition is also incomplete:
averages play an important role in understanding and comparing data, but
statistics provides many more measures.

1.4.3 Definition by Croxton and Cowden:


"Statistics may be defined as the science of collection, presentation,
analysis and interpretation of numerical data." From a logical analysis, it is
clear that this definition of statistics by Croxton and Cowden is the most
scientific and realistic one.
According to this definition there are four stages:
1. Collection of Data: This is the first step and the foundation upon
which the entire analysis rests. Careful planning is essential before collecting
the data. There are different methods of collection of data such as census,
sampling, primary, secondary, etc., and the investigator should make use of the
correct method.
2. Presentation of data: The mass of data collected should be presented in a
suitable, concise form for further analysis. The collected data may be
presented in tabular, diagrammatic or graphic form.
3. Analysis of data: The data presented should be carefully analysed for
making inferences from the presented data, using measures of central
tendency, dispersion, correlation, regression, etc.
4. Interpretation of data: The final step is drawing conclusions from the
data collected. A valid conclusion must be drawn on the basis of the analysis.
A high degree of skill and experience is necessary for the interpretation.

1.4.4 Definition by Horace Secrist:


"Statistics may be defined as the aggregate of facts affected to a
marked extent by multiplicity of causes, numerically expressed, enumerated
or estimated according to a reasonable standard of accuracy, collected in a
systematic manner, for a predetermined purpose and placed in relation to
each other." This definition seems to be the most comprehensive and
exhaustive.

1.4.5 Definition by Professor Secrist : The word statistics in the first
sense is defined by Professor Secrist as follows:
"By statistics we mean aggregate of facts affected to a marked extent
by multiplicity of causes, numerically expressed, enumerated or estimated
according to reasonable standard of accuracy, collected in a systematic
manner for a predetermined purpose and placed in relation to each other."
This definition gives all the characteristics of statistics which are :
Aggregate of facts, Affected by multiplicity of causes, Numerically
expressed, Estimated according to reasonable standards of accuracy,
Collected in a systematic manner, Collected for a predetermined purpose,
Placed in relation to each other.

1.4.6 Definition by Croxton and Cowden : The word 'statistics' in
the second sense is defined by Croxton and Cowden as follows:
"The collection, presentation, analysis and interpretation of the numerical
data."
This definition clearly points out four stages in a statistical investigation,
namely:
1) Collection of data
2) Presentation of data
3) Analysis of data
4) Interpretation of data
In addition to these, one more stage, i.e. organization of data, is suggested.

1.5 Characteristics of Statistics:


1.5.1 Statistics are aggregates of facts : A single fact is not called
statistics. To become statistics, there must be more than one fact. The
data may relate to production, sales, employment, births, deaths etc.

1.5.2 Statistics are numerically expressed : Only those statements


which can be expressed numerically are statistics. It does not deal with
qualitative statements like "students of MBA are intelligent". On the other
hand, if we say that the sales of Escorts Ltd. are Rs. 354 crores, that is a
statistical fact stated numerically.

1.5.3 Statistics are affected to a marked extent by multiplicity


of causes : Statistical data are affected to a great extent by various causes.
For instance, the production of wheat depends upon the quality of seed,
rainfall, quality of soil, fertilizer used, method of cultivation etc.

1.5.4 Statistics are collected in a systematic order : Statistical data
are collected in a systematic manner. This means the investigator has to chalk
out a plan keeping in view the objective of data collection, and determine the
statistical unit, the technique of data collection and so on.

1.5.5 Statistics must be collected for a predetermined purpose :


The objective of data collection must be predetermined and well established.
A mere statement of purpose is insufficient.

1.5.6 Statistics should be placed in relation to each other : The
statistical data must be comparable. This is possible only when the data are
homogeneous.

1.6 Functions of Statistics:


There are many functions of statistics. Let us consider the following
five important functions.

1.6.1 Condensation:
Generally speaking, by the word 'to condense' we mean to reduce or
to lessen. Condensation is mainly applied at embracing the understanding of
a huge mass of data by providing only a few observations. If, in a particular
class in a Chennai school, only the marks obtained in an examination are
given, no purpose will be served. Instead, if we are given the average mark in
that particular examination, it definitely serves the purpose better. Similarly,
the range of marks is another such measure of the data. Thus, statistical
measures help to reduce the complexity of the data and consequently to
understand any huge mass of data.
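As an illustration (not part of the original example), the following short Python sketch condenses one hundred hypothetical marks, generated at random here, into just two figures, the average and the range:

import random
import statistics

# One hundred hypothetical examination marks (out of 100).
random.seed(1)
marks = [random.randint(20, 95) for _ in range(100)]

# Two condensed measures stand in for the whole mass of data.
average = statistics.mean(marks)
mark_range = max(marks) - min(marks)
print("Average mark:", round(average, 1))
print("Range of marks:", mark_range)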

1.6.2 Comparison:
Classification and tabulation are the two methods that are used to
condense the data. They help us to compare data collected from different
sources. Grand totals, measures of central tendency, measures of dispersion,
graphs and diagrams, coefficients of correlation, etc. provide ample scope for
comparison.
If we have one group of data, we can compare within the group itself. If the
rice production (in tonnes) in Tanjore district is known, then we can compare
one region with another region within the district. Or if the rice production
(in tonnes) of two different districts within Tamilnadu is known, then also a
comparative study can be made. As statistics is an aggregate of facts and
figures, comparison is always possible, and in fact comparison helps us to
understand the data in a better way.

1.6.3 Forecasting:
By the word forecasting, we mean to predict or to estimate
beforehand. Given the data of the last ten years connected to the rainfall of a
particular district in Tamilnadu, it is possible to predict or forecast the rainfall
for the near future. In business also, forecasting plays a dominant role in
connection with production, sales, profits etc. The analysis of time series and
regression analysis play an important role in forecasting.

1.6.4 Estimation:
One of the main objectives of statistics is to draw inferences about a
population from the analysis of a sample drawn from that population. The
four major branches of statistical inference are:
1. Estimation theory
2. Tests of hypothesis
3. Non-parametric tests
4. Sequential analysis
In estimation theory, we estimate the unknown value of a population
parameter on the basis of sample observations. Suppose we are given a sample
of the heights of one hundred students in a school; based upon the heights of
these 100 students, it is possible to estimate the average height of all students
in that school.
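The following minimal Python sketch illustrates this idea; the heights are simulated, so the figures are purely illustrative:

import random
import statistics

# Hypothetical population: heights (in cm) of 1,000 students in a school.
random.seed(42)
population = [random.gauss(160, 8) for _ in range(1000)]

# Draw a sample of 100 students and use its mean as a point estimate
# of the unknown population mean.
sample = random.sample(population, 100)
print("Estimated average height:", round(statistics.mean(sample), 1), "cm")
print("Actual population mean:  ", round(statistics.mean(population), 1), "cm")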

1.6.5 Tests of Hypothesis:


A statistical hypothesis is some statement about the probability
distribution characterising a population, made on the basis of the information
available from sample observations. In the formulation and testing of
hypotheses, statistical methods are extremely useful. Whether a crop yield has
increased because of the use of a new fertilizer, or whether a new medicine is
effective in eliminating a particular disease, are some examples of statements
of hypotheses, and these are tested by proper statistical tools.

1.7 Scope of Statistics:


Statistics is not a mere device for collecting numerical data, but a
means of developing sound techniques for their handling and analysis, and of
drawing valid inferences from them. Statistics is applied in every sphere of
human activity, social as well as physical, like Biology, Commerce,
Education, Planning, Business Management, Information Technology, etc. It
is almost impossible to find a single department of human activity where
statistics cannot be applied. We now discuss briefly the applications of
statistics in other disciplines.

1.7.1 Statistics and Industry:


Statistics is widely used in many industries. In industry, control
charts are widely used to maintain a certain quality level. In production
engineering, to find whether the product conforms to specifications or
not, statistical tools, namely inspection plans, control charts, etc., are of
extreme importance. In inspection plans we have to resort to some kind of
sampling - a very important aspect of statistics.

1.7.2 Statistics and Commerce:


Statistics is the lifeblood of successful commerce. No businessman
can afford either to understock or to overstock his goods. In the beginning
he estimates the demand for his goods and then takes steps to adjust his
output or purchases accordingly. Thus statistics is indispensable in business
and commerce.
As so many multinational companies have entered the Indian economy,
the size and volume of business is increasing. On the one side, stiff
competition is increasing, whereas on the other side tastes are changing and
new fashions are emerging. In this connection, market surveys play an
important role in exhibiting the present conditions and forecasting the likely
changes in the future.

1.7.3 Statistics and Agriculture:


Analysis of variance (ANOVA), one of the statistical tools
developed by Professor R. A. Fisher, plays a prominent role in agricultural
experiments. Tests of significance based on small samples allow us to test
the significance of the difference between two sample means. In analysis of
variance, we are concerned with testing the equality of several
population means.
For example, suppose five fertilizers are applied to five plots each of wheat
and the yield of wheat on each of the plots is given. In such a situation, we
are interested in finding out whether the effect of these fertilisers on the
yield is significantly different or not. In other words, whether the samples
are drawn from the same normal population or not. The answer to this
problem is provided by the technique of ANOVA, which is used to test the
homogeneity of several population means.
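As a rough sketch of how such a test can be carried out in practice, the following Python fragment applies one-way ANOVA to the fertilizer example using the scipy library; the yield figures and the 5% significance level are assumptions made purely for illustration:

from scipy import stats

# Hypothetical wheat yields (quintals per plot) for five fertilizers,
# five plots each.
yields = [
    [20.1, 21.3, 19.8, 20.6, 21.0],  # fertilizer A
    [22.4, 23.0, 21.8, 22.9, 23.3],  # fertilizer B
    [19.5, 20.2, 19.9, 20.4, 19.7],  # fertilizer C
    [21.1, 20.8, 21.5, 21.9, 21.2],  # fertilizer D
    [23.5, 24.1, 23.8, 24.4, 23.9],  # fertilizer E
]

# One-way ANOVA: H0 says all five population mean yields are equal.
f_stat, p_value = stats.f_oneway(*yields)
print("F =", round(f_stat, 2), ", p =", round(p_value, 4))
if p_value < 0.05:
    print("Reject H0: the fertilizers differ significantly in mean yield.")
else:
    print("Fail to reject H0: no significant difference detected.")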

1.7.4 Statistics and Economics:


Statistical methods are useful in measuring numerical changes in
complex groups and interpreting collective phenomena. Nowadays the use
of statistics is made abundantly in any economic study. Both in economic
theory and practice, statistical methods play an important role.
Alfred Marshall said, "Statistics are the straw out of which I, like every
other economist, have to make the bricks." It may also be noted that
statistical data and the techniques of statistical tools are immensely useful in
solving many economic problems such as wages, prices, production,
distribution of income and wealth and so on. Statistical tools like index
numbers, time series analysis, estimation theory and the testing of statistical
hypotheses are extensively used in economics.

1.7.5 Statistics and Education:


Statistics is widely used in education. Research has become a
common feature in all branches of activity. Statistics is necessary for the
formulation of policies to start new courses, for consideration of the facilities
available for new courses, etc. There are many people engaged in research
work to test past knowledge and evolve new knowledge. This is possible
only through statistics.

1.7.6 Statistics and Planning:


Statistics is indispensable in planning. In the modern world, which can
be termed the world of planning, almost all organisations in the
government are seeking the help of planning for efficient working and for the
formulation of policy decisions and their execution.
In order to achieve the above goals, the statistical data relating to
production, consumption, demand, supply, prices, investments, income,
expenditure, etc., and the various advanced statistical techniques for
processing, analysing and interpreting such complex data, are of importance.
In India statistics plays an important role in planning, both at the central and
the state government levels.

1.7.7 Statistics and Medicine:


In the medical sciences, statistical tools are widely used. In order to test
the efficiency of a new drug or medicine, the t-test is used; to compare the
efficiency of two drugs or two medicines, the t-test for two samples is used.
More and more applications of statistics are at present found in clinical
investigation.
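For illustration only, the following Python sketch compares two medicines with a two-sample t-test; the recovery times are invented and the scipy library is assumed to be available:

from scipy import stats

# Hypothetical recovery times (in days) under two medicines.
drug_a = [8.1, 7.4, 9.0, 8.6, 7.9, 8.3, 8.8, 7.6]
drug_b = [6.9, 7.2, 6.5, 7.8, 7.0, 6.7, 7.4, 7.1]

# Two-sample t-test: H0 says the mean recovery times are equal.
t_stat, p_value = stats.ttest_ind(drug_a, drug_b)
print("t =", round(t_stat, 2), ", p =", round(p_value, 4))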

1.7.8 Statistics and Modern applications:


Recent developments in the fields of computer technology and
information technology have enabled statistics to integrate its models into the
decision-making procedures of many organisations. There are many
software packages available for solving problems in the design of
experiments, forecasting, simulation, etc.
SYSTAT, a software package, offers more scientific and technical graphing
options than any other desktop statistics package. SYSTAT supports all
types of scientific and technical research in various diversified fields, as
follows:
1. Archeology: Evolution of skull dimensions
2. Epidemiology: Tuberculosis
3. Statistics: Theoretical distributions
4. Manufacturing: Quality improvement
5. Medical research: Clinical investigations.
6. Geology: Estimation of Uranium reserves from ground water.

1.8 Limitations of statistics:


Statistics, with all its wide applications in every sphere of human
activity, has its own limitations. Some of them are given below.

1.8.1 Statistics is not suitable to the study of qualitative


phenomenon: Since statistics is basically a science and deals with sets of
numerical data, it is applicable to the study of only those subjects of enquiry
which can be expressed in terms of quantitative measurement. As a matter
of fact, qualitative phenomena like honesty, poverty, beauty, intelligence
etc. cannot be expressed numerically, and statistical analysis cannot be
directly applied to such qualitative phenomena. Nevertheless, statistical
techniques may be applied indirectly by first reducing the qualitative
expressions to accurate quantitative terms. For example, the intelligence of a
group of students can be studied on the basis of their marks in a particular
examination.

1.8.2 Statistics does not study individuals: Statistics does not give
any specific importance to individual items; in fact, it deals with
aggregates of objects. Individual items, taken individually, do
not constitute statistical data and do not serve any purpose for any
statistical enquiry.

1.8.3 Statistical laws are not exact: It is well known that


mathematical and physical sciences are exact. But statistical laws are not
exact; they are only approximations. Statistical conclusions are
not universally true. They are true only on an average.

1.8.4 Statistics may be misused: Statistics must be used only by
experts; otherwise, statistical methods are the most dangerous tools in the
hands of the inexpert. The use of statistical tools by inexperienced and
untrained persons might lead to wrong conclusions. Statistics can be easily
misused by quoting wrong figures. As King aptly says, "statistics are
like clay of which one can make a God or Devil as one pleases".

1.8.5 Statistics is only one of the methods of studying a
problem:
Statistical methods do not provide a complete solution to a problem,
because problems have to be studied against the background of the country's
culture, philosophy or religion. Thus a statistical study should be
supplemented by other evidence.

1.9 Distrust Of Statistics


It is often said by people that "statistics can prove anything." There
are three kinds of lies - lies, damned lies and statistics - wicked in the order
of their naming. A Paris banker said, "Statistics is like a miniskirt: it covers
up the essentials but gives you the ideas."
Thus by "distrust of statistics" we mean lack of confidence in statistical
statements and methods. The following reasons account for such views
about statistics.
Figures are convincing, and therefore people easily believe them.
They can be manipulated in such a manner as to establish foregone
conclusions.
The wrong representation of even correct figures can mislead a reader. For
example, John earned $4000 in 1990-1991 and Jem earned $5000.
Reading this, one would form the opinion that Jem is decidedly a better
worker than John. However, if we carefully examine the statement, we might
reach a different conclusion, as Jem's earning period is unknown to us. Thus,
while working with statistics, one should not only avoid outright falsehoods
but also be alert to detect possible distortions of the truth.

1.10 Uses of Statistics :


1.10.1 To present the data in a concise and definite form :
Statistics helps in classifying and tabulating raw data for processing and
further tabulation for end users.

1.10.2 To make it easy to understand complex and large data :


This is done by presenting the data in the form of tables, graphs, diagrams
etc., or by condensing the data with the help of means, dispersion etc.

1.10.3 For comparison : Tables and measures of means and dispersion can
help in comparing different sets of data.

1.10.4 In forming policies : It helps in forming policies like a


production schedule, based on the relevant sales figures. It is used in
forecasting future demands.

1.10.5 Enlarging individual experience : Complex problems can be
well understood through statistics, as the conclusions drawn with its help are
more definite and precise than mere statements of facts.

1.10.6 In measuring the magnitude of a phenomenon: Statistics


has made it possible to count the population of a country, the industrial
growth, the agricultural growth, the educational level (of course in numbers).

1.11 Types of Statistics


As mentioned earlier, for a layman or people in general, statistics
means numbers - numerical facts, figures or information. The branch of
statistics wherein we record and analyze observations for all the individuals
of a group or population and draw inferences about the same is called
"Descriptive statistics" or "Deductive statistics". On the other hand, if we
choose a sample and, by statistical treatment of it, draw inferences about
the population, then this branch of statistics is known as "Statistical Inference"
or "Inductive Statistics".
In our discussion, we are mainly concerned with two ways of representing
descriptive statistics: numerical and pictorial.
1. Numerical statistics are numbers. But some numbers are more
meaningful, such as the mean, standard deviation etc.
2. When numerical data is presented in the form of pictures
(diagrams) and graphs, it is called pictorial statistics. Such
statistics makes confusing and complex data or information easy,
simple and straightforward, so that even a layman can understand it
without much difficulty.
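As a simple illustration of pictorial statistics, the sketch below (using the matplotlib library, with invented sales figures) turns a small numerical table into a bar diagram that even a layman can read at a glance:

import matplotlib.pyplot as plt

# Hypothetical yearly sales figures (Rs. crores), for illustration only.
years = ["2018", "2019", "2020", "2021"]
sales = [120, 150, 135, 180]

# A bar diagram presents the same numbers pictorially.
plt.bar(years, sales)
plt.xlabel("Year")
plt.ylabel("Sales (Rs. crores)")
plt.title("Sales of a hypothetical firm")
plt.show()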

1.12 Common Mistakes Committed In Interpretation of


Statistics
1.12.1 Bias:- Bias means prejudice or a preference of the investigator, which
creeps in, consciously or unconsciously, in proving a particular point.

1.12.2 Generalization:- Sometimes, on the basis of the little data available,
one may jump to a conclusion, which leads to erroneous results.

1.12.3 Wrong conclusion:- The characteristics of a group, if attached to
an individual member of that group, may lead us to draw absurd
conclusions.

1.12.4 Incomplete classification:- If we fail to give a complete


classification, the influence of various factors may not be properly
understood.
1.12.5 There may be a wrong use of percentages.
1.12.6 Technical mistakes may also occur.
1.12.7 An inconsistency in definition can even exist.
1.12.8 Wrong causal inferences may sometimes be drawn.
1.12.9 There may also be a misuse of correlation.

Chapter One
Introduction to Statistics
End Chapter Quizzes
1) The statement, "Statistics is both a science and an art", was given by
a- R. A. Fisher
b- Tippet
c- L. R. Connor
d- A. L. Bowley

2) The word statistics is used as
a- Singular
b- Plural
c- Singular and plural both
d- none of the above

3) "Statistics provides tools and techniques for research workers" was stated by
a- John I. Griffin
b- W. I. King
c- A. M. Mood
d- A. L. Boddington

4) Out of the various definitions given by the following workers, which definition is
considered to be the most exact?
a- R. A. Fisher
b- A. L. Bowley
c- M. G. Kendall
d- Cecil H. Meyers

5) Who stated that there are three kinds of lies: lies, damned lies and statistics?
a- Mark Twain
b- Disraeli
c- Darrell Huff
d- G. W. Snedecor

6) Which of the following represents data?
a- a single value
b- only two values in a set
c- a group of values in a set
d- none of the above

7) Statistics deals with
a- qualitative information
b- quantitative information
c- both (a) and (b)
d- none of (a) and (b)

8) Relative error is always
a- positive
b- negative
c- positive and negative both
d- zero

9) The statement, "Designing of an appropriate questionnaire itself wins half the battle",
was given by
a- A. R. Ilersic
b- W. I. King
c- H. Huge
d- H. Secrist

10) Who originally gave the formula for the estimation of errors of the type
a- L. R. Connor
b- W. I. King
c- A. L. Bowley
d- A. L. Boddington

CHAPTER TWO
PRIMARY AND SECONDARY DATA
2.1 Primary Data
The foundation of a statistical investigation rests on data, so utmost care must
be taken while collecting data. If the collected data are inaccurate and
inadequate, the whole analysis and interpretation will become misleading and
unreliable. The method of collection of data depends upon the nature, object
and scope of the statistical enquiry on the one hand, and the availability of
time and money on the other.
Data, or facts, may be derived from several sources. Data can be
classified as primary data and secondary data. Primary data is data gathered
for the first time by the researcher. So if the investigator himself collects
the data for the purpose of the enquiry at hand and uses it, this is called
collection of primary data. These data are original in nature.
According to Horace Secrist, by primary data are meant "that data
which are original, that is, those in which little or no grouping has been
made, for instance being recorded or itemized as encountered. They are
essentially raw material."

2.2 Sources of Primary Data


Primary data may be collected by using the following methods,
namely :

2.2.1 Direct personal investigations : Under this method the
investigator personally contacts the informants and collects the data. This
method of data collection is suitable where the field of enquiry is limited or
the nature of the inquiry is confidential.

2.2.2 Indirect oral investigations : This method is generally used in
those cases where informants are reluctant to give information, so
information is gathered from those who possess information on the problem
under investigation. These informants are called witnesses. This method of
investigation is normally used by enquiry committees and commissions.

2.2.3 Information through correspondence : Under this method, the
investigator appoints local agents or correspondents in different parts of the
field of enquiry. They send information on specific issues to the investigator
on a regular basis. This method is generally adopted by television news
channels, newspapers and periodicals.

2.2.4 Mailed questionnaire method : Under this method, a
questionnaire is prepared by the investigator containing questions on the
problem under investigation. These questionnaires are mailed to various
informants, who are requested to return them by mail after answering the
questions. A covering letter is also enclosed requesting the informants to
reply before a specific date.

2.2.5 Schedule to be filled in by the enumerator : Under this
method, enumerators are appointed area-wise. They contact the informants,
and the information is filled in by them in the schedules. The enumerators
should be honest, painstaking and tactful, as they have to deal with people of
different natures.

2.3 Secondary Data


Secondary data is data taken by the researcher from secondary sources,
internal or external. The researcher must thoroughly search secondary data
sources before commissioning any efforts for collecting primary data. Once
the primary data are collected and published, it becomes secondary data for
other investigators. Hence, the data obtained from published or unpublished
sources are known as secondary data. There are many advantages in
searching for and analyzing data before attempting the collection of primary
data. In some cases, the secondary data itself may be sufficient to solve the
problem. Usually the cost of gathering secondary data is much lower than
the cost of organizing primary data. Moreover, secondary data has several
supplementary uses. It also helps to plan the collection of primary data, in
case, it becomes necessary.
Blair has rightly defined secondary data as "those already in
existence and which have been collected for some other purpose than the
answering of the question at hand."
Secondary data is of two kinds, internal and external. Secondary data,
whether internal or external, is data already collected by others for
purposes other than the solution of the problem at hand. Business firms
always have a great deal of internal secondary data with them. Sales
statistics constitute the most important component of secondary data in
marketing, and the researcher uses them extensively. All the output of the MIS
of the firm generally constitutes internal secondary data. This data is readily
available; the market researcher gets it without much effort, time or money.

2.4 The nature of secondary sources of information


Secondary data is data which has been collected by individuals or agencies
for purposes other than those of our particular research study. For example,
if a government department has conducted a survey of, say, family food
expenditures, then a food manufacturer might use this data in the
organisation's evaluations of the total potential market for a new product.
Similarly, statistics prepared by a ministry on agricultural production will
prove useful to a whole host of people and organisations, including those
marketing agricultural supplies.

No marketing research study should be undertaken without a prior search of
secondary sources (also termed desk research). There are several grounds for
making such a bold statement. Secondary data may be available which is
entirely appropriate and wholly adequate to draw conclusions and answer
the question or solve the problem. Sometimes primary data collection simply
is not necessary.
It is far cheaper to collect secondary data than to obtain primary
data. For the same level of research budget a thorough examination of
secondary sources can yield a great deal more information than can be had
through a primary data collection exercise.
The time involved in searching secondary sources is much less than
that needed to complete primary data collection.
Secondary sources of information can yield more accurate data than
that obtained through primary research. This is not always true, but where a
government or international agency has undertaken a large-scale survey, or
even a census, this is likely to yield far more accurate results than
custom-designed and executed surveys based on relatively small
sample sizes.
It should not be forgotten that secondary data can play a substantial
role in the exploratory phase of the research when the task at hand is to
define the research problem and to generate hypotheses. The assembly and
analysis of secondary data almost invariably improves the researcher's
understanding of the marketing problem, the various lines of inquiry that
could or should be followed and the alternative courses of action which
might be pursued.
Secondary sources help define the population. Secondary data can
be extremely useful both in defining the population and in structuring the
sample to be taken. For instance, government statistics on a country's
agriculture will help decide how to stratify a sample and, once sample
estimates have been calculated, these can be used to project those estimates
to the population.

2.5 Sources of Secondary data


Secondary sources of data may be divided into two categories:
internal sources and external sources.

2.5.1 Internal sources of secondary data

Sales data : All organisations collect information in the course
of their everyday operations. Orders are received and delivered, costs are
recorded, sales personnel submit visit reports, invoices are sent out, returned
goods are recorded and so on. Much of this information is of potential use in
marketing research, but surprisingly little of it is actually used.
Organisations frequently overlook this valuable resource by not beginning
their search of secondary sources with an internal audit of sales invoices,
orders, inquiries about products not stocked, returns from customers and
sales force customer calling sheets. For example, consider how much
information can be obtained from sales orders and invoices:
- Sales by territory
- Sales by customer type
- Prices and discounts
- Average size of order by customer, customer type, geographical area
- Average sales by salesperson
- Sales by pack size and pack type, etc.
This type of data is useful for identifying an organisation's most
profitable products and customers. It can also serve to track trends within the
enterprise's existing customer group.
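As a rough sketch of such an internal audit, the following Python fragment (using the pandas library; the records and column names are invented for illustration) summarises a handful of invoices by territory and customer type:

import pandas as pd

# Hypothetical invoice records; the column names are invented.
invoices = pd.DataFrame({
    "territory": ["North", "South", "North", "East", "South", "East"],
    "customer_type": ["retail", "wholesale", "retail",
                      "retail", "wholesale", "wholesale"],
    "amount": [1200, 5400, 800, 1500, 6100, 4300],
})

# Sales by territory, and average order size by customer type.
print(invoices.groupby("territory")["amount"].sum())
print(invoices.groupby("customer_type")["amount"].mean())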

Financial data: An organisation has a great deal of data within
its files on the cost of producing, storing, transporting and marketing each of
its products and product lines. Such data has many uses in marketing
research, including allowing measurement of the efficiency of marketing
operations. It can also be used to estimate the costs attached to new products
under consideration, and the levels of utilisation (in production, storage and
transportation) at which an organisation's unit costs begin to fall.

Transport data: Companies that keep good records relating to

their transport operations are well placed to establish which are the most
profitable routes, and loads, as well as the most cost effective routing
patterns. Good data on transport operations enables the enterprise to perform
trade-off analysis and thereby establish whether it makes economic sense to
own or hire vehicles, or the point at which a balance of the two gives the
best financial outcome.

Storage data: The rate of stockturn and stock-handling costs are
useful in assessing the efficiency of certain marketing operations and of
the marketing system as a whole. More sophisticated accounting systems
assign costs to the cubic space occupied by individual products and the time
period over which the product occupies the space. These systems can be
further refined so that the profitability per unit, and rate of sale, are added. In
this way, the direct product profitability can be calculated.

2.5.2 External sources of secondary information


The marketing researcher who seriously seeks out useful secondary
data is more often surprised by its abundance than by its scarcity. Too often,
the researcher has secretly (sometimes subconsciously) concluded from the
outset that his/her topic of study is so unique or specialised that a search of
secondary sources is futile. Consequently, only a cursory search is made,
with no real expectation of success. Cursory searches become a self-fulfilling
prophecy. Dillon et al. give the following advice:
"You should never begin a half-hearted search with the assumption
that what is being sought is so unique that no one else has ever bothered to
collect it and publish it. On the contrary, assume there are secondary
data that should help provide definition and scope for the primary
research effort."
The same authors support their advice by citing the large number of
organisations that provide marketing information, including national and
local government agencies, quasi-government agencies, trade associations,
universities, research institutes, financial institutions, specialist suppliers of
secondary marketing data and professional marketing research enterprises.
Dillon et al. further advise that searches of printed sources of secondary data
begin with referral texts such as directories, indexes, handbooks and guides.

These sorts of publications rarely provide the data in which the researcher is
interested but serve in helping him/her locate potentially useful data sources.
The main sources of external secondary data are :
(1) Government (federal, state and local)
(2) Trade associations
(3) Commercial services
(4) National and international institutions.


Government statistics : These may include all or some of the following:
population censuses, social surveys, family expenditure surveys,
import/export statistics, production statistics and agricultural statistics.

Trade associations : Trade associations differ widely in the extent of
their data collection and information dissemination activities. However, it is
worth checking with them to determine what they do publish. At the very
least one would normally expect that they would produce a trade directory
and, perhaps, a yearbook.

Commercial services : Published market research reports and other
publications are available from a wide range of organisations which charge
for their information. Typically, marketing people are interested in media
statistics and consumer information which has been obtained from large-scale
consumer or farmer panels. The commercial organisation funds the collection
of the data, which is wide-ranging in its content, and hopes to make its money
from selling this data to interested parties.

National and international institutions : Bank economic reviews,
university research reports, journals and articles are all useful sources to
contact. International agencies such as the World Bank, IMF, IFAD, UNDP,
ITC, FAO and ILO produce a plethora of secondary data which can prove
extremely useful to the marketing researcher.

2.5.3 Examples of Sources of External Secondary Data


Following are some of the examples of sources of external secondary
data :

The Internet is a great source of external secondary data. Many
published statistics and figures are available on the internet, either free or for
a fee.

The yellow pages of telephone directories / stand-alone yellow
pages have become an established source of elementary business
information. Tata Press, which first launched a stand-alone yellow pages
directory for Mumbai City, and GETIT yellow pages have been leading in
this field. Today, yellow pages publications are available for all cities and
major towns in the country. New Horizons, a joint venture between the
Living Media group of publications and Singapore Telecom, has been
publishing stand-alone directories for specific businesses. The Business India
database of the Business India publications has been publishing the Delhi
Pages directory.

The Thomas Register is the world's most powerful industrial
buying guide. It ensures a fast, frictionless flow of information between
buyers and sellers of industrial goods and services. This purchasing tool is
now available in India. The Thomas Register of Indian Manufacturers, or
TRIM, is India's first dedicated manufacturer-to-manufacturer register. It
features 120,000 listings of 40,000 industrial manufacturers and industrial
service categories. It is available in print, in CD form and on the internet.

The Source Directory, brought out by the Mumbai-based Source
Publishers, is another example. It covers contact information on advertising
agencies and related services and products, music companies, market
research agencies, marketing and sales promotion consultants, publications,
radio stations, cable and satellite stations and telemarketing services, among
others. It currently has editions for the Metro cities.

The Industrial Product Finder (IPF): IPF details the many
applications of new products and tells what is available and from whom.
Most manufacturers of industrial products ensure that a description of their
product is published in IPF before it hits the market.

Phone data services: Agencies providing phone data services
have also come up in major cities in recent times. Melior Communication, for
example, offers a tele-data service: basic data on a number of
subjects/products can be had through a call to the agency. The service is
termed the "Tell me Business" through-phone service. Its main aim, like that
of the yellow pages, is to bring buyers and sellers of products together. It also
provides some elementary databank support to researchers.

2.6 The problems of secondary sources


Whilst the benefits of secondary sources are considerable, their
shortcomings have to be acknowledged. There is a need to evaluate the
quality of both the source of the data and the data itself. The main problems
may be categorized as follows:
Definitions : The researcher has to be careful, when making use of
secondary data, of the definitions used by those responsible for its
preparation. Suppose, for example, researchers are interested in rural
communities and their average family size. If published statistics are
consulted, then a check must be done on how terms such as 'family size'
have been defined. They may refer only to the nucleus family or include
the extended family. Even apparently simple terms such as 'farm size'
need careful handling. Such figures may refer to any one of the following:
the land an individual owns, the land an individual owns plus any
additional land he/she rents, the land an individual owns minus any land
he/she rents out, all of his land, or only that part of it which he actually
cultivates. It should be noted that definitions may change over time and,
where this is not recognised, erroneous conclusions may be drawn.
Geographical areas may have their boundaries redefined, units of
measurement and grades may change, and imported goods can be
reclassified from time to time for purposes of levying customs and excise
duties.

Measurement error : When a researcher conducts fieldwork she/he is
possibly able to estimate inaccuracies in measurement through the
standard deviation and standard error, but these are sometimes not
published in secondary sources. The only solution is to try to speak to the
individuals involved in the collection of the data to obtain some guidance
on the level of accuracy of the data. The problem is sometimes not so
much error as differences in the levels of accuracy required by decision
makers. When the research has to do with large investments in, say, food
manufacturing, management will want to set very tight margins of error in
making market demand estimates. In other cases, having a high level of
accuracy is not so critical. For instance, if a food manufacturer is merely
assessing the prospects for one more flavour for a snack food already
produced by the company, then there is no need for highly accurate
estimates in order to make the investment decision.

Source bias : Researchers have to be aware of vested interests when
they consult secondary sources. Those responsible for their compilation
may have reasons for wishing to present a more optimistic or pessimistic
set of results for their organisation. It is not unknown, for example, for
officials responsible for estimating food shortages to exaggerate figures
before sending aid requests to potential donors. Similarly, and with equal
frequency, commercial organisations have been known to inflate estimates
of their market shares.
Reliability : The reliability of published statistics may vary over time.
It is not uncommon, for example, for the systems of collecting data to
have changed over time, but without any indication of this to the reader of
published statistics. Geographical or administrative boundaries may be
changed by government, or the basis for stratifying a sample may have
altered. Other aspects of research methodology that affect the reliability
of secondary data are the sample size, response rate, questionnaire design
and modes of analysis.

Time scale : Most censuses take place at 10-year intervals, so data
from this and other published sources may be out of date at the time the
researcher wants to make use of the statistics.
The time period during which secondary data was first compiled may have
a substantial effect upon the nature of the data. For instance, the
significant increase in the price obtained for Ugandan coffee in the mid-90s
could be interpreted as evidence of the effectiveness of the rehabilitation
programme that set out to restore coffee estates which had fallen into a
state of disrepair. However, more knowledgeable coffee market experts
would interpret the rise in Ugandan coffee prices in the context of the
large-scale destruction of the Brazilian coffee crop, due to heavy frosts, in
1994, Brazil being the largest coffee producer in the world.
Whenever possible, marketing researchers ought to use multiple
sources of secondary data. In this way, the different sources can be
cross-checked as confirmation of one another. Where differences occur, an
explanation for them must be found, or the data should be set aside.

2.7 Difference between Primary & Secondary Data


The difference between primary data and secondary data can be
studied through the following points:
Primary research entails the use of immediate data in determining
the survival of the market. The popular ways to collect primary data consist
of surveys, interviews and focus groups, which show the direct relationship
between potential customers and the companies. Secondary research, on the
other hand, is a means to reprocess and reuse collected information as an
indication for the betterment of the service or product. Both primary and
secondary data are useful for businesses, but they may differ from each other
in various aspects.
In secondary data, the information relates to a past period. Hence, it
lacks freshness and therefore has less satisfactory value. Primary data is more
accommodating, as it shows the latest information.
Secondary data is obtained from some organization other than the
one immediately interested in the current research project. Such data was
collected and analyzed by that organization to meet the requirements of its
own research objectives. Primary data is accumulated by the researcher
particularly to meet the research objective of the existing project.
Secondary data, though old, may be the only possible source of the
desired data on subjects for which primary data cannot be had at all. For
example, survey reports or secret records already collected by a business
group can offer information that cannot be obtained from original sources.
The form in which secondary data are accumulated and delivered may
not accommodate the exact needs and particular requirements of the current
research study. Many a time, alteration or modification to the exact needs
of the investigator may not be possible. To that extent the usefulness of
secondary data is lost. Primary data is completely tailor-made, and there
is no problem of adjustments.
Secondary data is available effortlessly, rapidly and inexpensively.
Primary data takes a lot of time, and the unit cost of such data is relatively
high.

Chapter Two
Primary and Secondary Data
End Chapter Quizzes
1. Statistical results are
a- cent per cent correct
b- not absolutely correct
c- always incorrect
d- misleading

2. Data taken from the publication "Agricultural Situation in India"
will be considered as
a- primary data
b- secondary data
c- primary and secondary data
d- neither primary nor secondary

3. The mailed questionnaire method of enquiry can be adopted if
respondents
a- live in cities
b- have high income
c- are educated
d- are known

4. Statistical data are collected for
a- collecting data without any purpose
b- a given purpose
c- any purpose
d- none of the above

5. The method of complete enumeration is applicable for
a- knowing the production
b- knowing the population
c- knowing the quantum of exports and imports
d- all the above

6. A statistical population may consist of
a- an infinite number of items
b- a finite number of items
c- either of (a) and (b)
d- none of (a) and (b)

7. Which of the following examples does not constitute an infinite
population?
a- Population consisting of odd numbers
b- Population of weights of newly born babies
c- Population of heights of 15-year-old children
d- Population of heads and tails in tossing a coin successively

8. Which of the following can be classified as a hypothetical
population?
a- All labourers of a factory
b- Female population of a factory
c- Population of real numbers between 0 and 100
d- Students of the world

9. A study based on complete enumeration is known as
a- sample survey
b- pilot survey
c- census survey
d- none of the above

10. Statistical results are
a- absolutely correct
b- not true
c- true on average
d- universally true

CHAPTER THREE
MEASURES OF DISPERSION
3.1 Meaning
There may be variations in the items of different distributions from the
average despite the fact that they have the same value of the mean. Hence, the
measures of central tendency alone are incapable of giving a complete picture
of a distribution. They have to be supplemented by some other measures.

3.2 Definitions :

Dispersion is the measure of the variation of the items.
---- A.L. Bowley
Dispersion is the measure of the extent to which the individual items vary.
---- L.R. Connor
Dispersion may also be described as the arithmetic mean of the deviations
of the values of the individual items from a particular measure of central
tendency. For this reason dispersion is also known as the "average of the
second degree"; Prof. Griffin and Dr. Bowley described it in the same way.

3.3 Types of Dispersion :


Dispersion can be divided into the following types:

3.3.1 Absolute Dispersion: It is measured in the same statistical unit in
which the original data exist, e.g., kg, rupees, years etc.

3.3.2 Relative Dispersion: Absolute dispersion fails to allow comparison
between two series, especially when the statistical unit is not the same.
Hence, absolute dispersion has to be converted into a relative measure of
dispersion. Relative dispersion is measured in ratio form; it is also
called the coefficient of dispersion.
The measures of central tendencies (i.e. means) indicate the general
magnitude of the data and locate only the center of a distribution of
measures. They do not establish the degree of variability or the spread out or
scatter of the individual items and their deviation from (or the difference
with) the means.
i) According to Neiswanger, "Two distributions of statistical data may be
symmetrical and have common means, medians and modes and identical
frequencies in the modal class. Yet with these points in common they may
differ widely in the scatter of their values about the measures of central
tendencies."
ii) Simpson and Kafka said, "An average alone does not tell the full story.
It is hardly fully representative of a mass, unless we know the manner in
which the individual items scatter around it. A further description of a
series is necessary, if we are to gauge how representative the average is."
From this discussion we now focus our attention on the scatter or
variability, which is known as dispersion. Let us take the marks obtained
by the students of three groups X, Y and Z, where the mean of each of the
three groups works out to 50.

Thus, the three groups have the same mean, i.e. 50; in fact the medians of
groups X and Y are also equal. Now if one were to say that the students of
the three groups are of equal capability, that would be a wrong conclusion.
Close examination reveals that in group X the students have marks equal to
the mean, in group Y the marks are very close to the mean, but in group Z
the marks are widely scattered. It is thus clear that a measure of central
tendency alone is not sufficient to describe the data.

3.4 Features of an ideal measure of dispersion

An ideal measure of dispersion must possess the following features:
- Simple to understand
- Easy to compute
- Well defined
- Based on all the items of the data
- Capable of algebraic treatment
- Not unduly affected by extreme items

3.5 Methods of measuring Dispersion
Dispersion can be calculated by using any of the following methods:
3.5.1 Range
3.5.2 Quartile Deviation
3.5.3 Mean Deviation
3.5.4 Standard Deviation
3.5.5 Co-efficient of Variation

3.5.1 Range
In any statistical series, the difference between the largest and the
smallest values is called as the range.

Thus Range (R) = L − S, where L is the largest and S the smallest value.

Coefficient of Range: It is the relative measure of the range, used in
comparative studies of dispersion.
Coefficient of Range = (L − S) / (L + S)

Example (Individual Series): Find the range and the coefficient of range of
the following items:
110, 117, 129, 197, 190, 100, 100, 178, 255, 790.
Solution: R = L − S = 790 − 100 = 690
Coefficient of Range = (790 − 100) / (790 + 100) = 690/890 = 0.775
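The calculation above can be verified with a few lines of Python (an
illustrative sketch added here, not part of the original examination
material; the variable name "items" is ours):

    items = [110, 117, 129, 197, 190, 100, 100, 178, 255, 790]
    L, S = max(items), min(items)
    print("Range =", L - S)                                       # 790 - 100 = 690
    print("Coefficient of Range =", round((L - S) / (L + S), 3))  # 690/890 = 0.775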
Example (Discrete Series): Find the range and the coefficient of range of
the following items:
x: 8, 10, 12, 12, 13, 14, 17
Solution:
Range = L − S = 17 − 8 = 9
Coefficient of Range = (L − S) / (L + S)
= (17 − 8) / (17 + 8)
= 9/25
= 0.36
Continuous Series
Example (Continuous Series): Find the range and the coefficient of range
for marks distributed over the classes 0-10, 10-20, 20-30, 30-40 and 40-50.
(The class frequencies do not affect the range, which depends only on the
two extreme class limits.)
Solution:
Range = L − S = 50 − 0 = 50
Coefficient of Range = (L − S) / (L + S)
= (50 − 0) / (50 + 0)
= 50/50
= 1

3.5.2 Quartile Deviations


If we concentrate on the two extreme values (as in the case of the range),
we do not get any idea about the scatter of the data within the range (i.e.
between the two extreme values). If we discard these two values, the
limited range thus available might be more informative. For this reason the
concept of the interquartile range was developed. It is the range which
includes the middle 50% of the distribution; here one quarter of the
observations at the lower end and one quarter at the upper end are excluded.

Now the lower quartile (Q1) is the 25th percentile and the upper quartile
(Q3) is the 75th percentile. It is interesting to note that the 50th
percentile is the middle quartile (Q2), which is in fact what you have
studied under the title "Median". Thus, interquartile range = Q3 − Q1.
If we divide (Q3 − Q1) by 2 we get what is known as the semi-interquartile
range or quartile deviation:
Q.D. = (Q3 − Q1)/2, where Q1 = first quartile and Q3 = third quartile.
Relative or Coefficient of Q.D.: To find the coefficient of Q.D., we divide
the semi-interquartile range by half the sum of the two quartiles.
Symbolically:
Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)
Example (Individual Series): Find the quartile deviation and its
coefficient for a series of ten marks.
Solution: Arrange the marks in ascending order. In the ordered series the
2nd, 3rd, 8th and 9th items work out to 8, 9, 15 and 15 respectively.
Q1 = (N + 1)/4 th item, where N = number of items in the data
= (10 + 1)/4 = 11/4 = 2.75th item
and 2.75th item = 2nd item + (3rd item − 2nd item) × 0.75
= 8 + (9 − 8) × 0.75
= 8 + 0.75
= 8.75
Q3 = 3(N + 1)/4 th item
= 3(10 + 1)/4 = 33/4 = 8.25th item
and 8.25th item = 8th item + (9th item − 8th item) × 0.25
= 15 + (15 − 15) × 0.25
= 15 + 0
= 15

Q.D. = (Q3 − Q1)/2
= (15 − 8.75)/2
= 3.125
and Coefficient of Q.D. = (Q3 − Q1)/(Q3 + Q1)
= (15 − 8.75)/(15 + 8.75)
= 6.25/23.75
= 0.26
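The interpolation used above for the 2.75th and 8.25th items can be
expressed as a small Python sketch (illustrative only; the data list is
hypothetical, chosen so that its ordered 2nd, 3rd, 8th and 9th items match
the worked example):

    def quartile(values, q):
        # q = 1 gives Q1, q = 3 gives Q3, using the (N + 1)/4 positional method
        s = sorted(values)
        pos = q * (len(s) + 1) / 4
        lower = int(pos)                  # e.g. 2.75 -> the 2nd item
        frac = pos - lower
        return s[lower - 1] + frac * (s[lower] - s[lower - 1])

    marks = [7, 8, 9, 10, 11, 12, 14, 15, 15, 20]    # hypothetical ten marks
    q1, q3 = quartile(marks, 1), quartile(marks, 3)  # 8.75 and 15
    print("Q.D. =", (q3 - q1) / 2)                   # 3.125
    print("Coefficient of Q.D. =", round((q3 - q1) / (q3 + q1), 2))  # 0.26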
Example (Discrete Series): Find the quartile deviation and its coefficient
from the following data:
Solution:

Central size of items (x)   Frequency (f)   c.f.
2                           10              10
4                           6               16
6                           8               24
7                           12              36
8                           16              52
9                           7               59
10                          5               64
11                          4               68
N = 68

Q1 = (N + 1)/4 th item
= (68 + 1)/4 th item
= 69/4
= 17.25th item
The 17.25th item lies in c.f. 24, against which the value of x = 6.
Q1 = 6
Q3 = 3(N + 1)/4 th item
= 3(68 + 1)/4 th item
= 207/4
= 51.75th item
The 51.75th item lies in c.f. 52, against which the value of x = 8.
Q3 = 8
Q.D. = (Q3 − Q1)/2
= (8 − 6)/2
= 1
Coefficient of Q.D. = (Q3 − Q1)/(Q3 + Q1)
= (8 − 6)/(8 + 6)
= 2/14
= 0.143

3.5.3 Mean Deviation


Average deviation (mean deviation) is "the average amount of variation
(scatter) of the items in a distribution from either the mean, the median
or the mode, ignoring the signs of these deviations" (Clark and Schkade).
Individual Series
Steps: (1) Find the mean, median or mode of the given series.
(2) Using any one of these three, find the deviations (differences) of the
items of the series from it, i.e. xi − x̄, xi − Me or xi − Mo, where
Me = median and Mo = mode.
(3) Take the absolute values of these deviations, i.e. ignore their
positive (+) and negative (−) signs: |xi − x̄|, |xi − Me| or |xi − Mo|.
(4) Find the sum of these absolute deviations, i.e. Σ|xi − x̄|, Σ|xi − Me|
or Σ|xi − Mo|.
(5) Find the mean deviation using the following formula:
M.D. = Σ|xi − Me| / n (and similarly for the mean or the mode).
Note that:
(i) Generally the M.D. obtained from the median is the best for practical
purposes.
(ii) Coefficient of M.D. = M.D. / (the average from which the deviations
were taken).

Merits and Demerits of Mean Deviations


Merits
1. It is a better measure of dispersion than the range and the quartile
deviation.
2. This method is based on all the items of the data.
3. The mean deviation is less affected by extreme items than the standard
deviation.

Demerits
1. This method lacks algebraic treatment, as signs are ignored while taking
deviations from an average.
2. Mean deviation cannot be considered a scientific method as it ignores
signs.

Example: Calculate the mean deviation and its coefficient for the following
salaries:
$1030, $500, $680, $1100, $1080, $1740, $1050, $1000, $2000, $2250, $3500
and $1030.

Calculations:
i) Arranged in ascending order the salaries are 500, 680, 1000, 1030, 1030,
1050, 1080, 1100, 1740, 2000, 2250, 3500. With n = 12, the median is the
mean of the 6th and 7th items:
Me = (1050 + 1080)/2 = $1065
ii) M.D. = Σ|xi − Me| / n = 6380/12 = $531.67
iii) Coefficient of M.D. = M.D./Me = 531.67/1065 = 0.50
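The median and mean deviation just obtained can be checked with a short
Python sketch (illustrative, using the salary data of this example):

    salaries = [1030, 500, 680, 1100, 1080, 1740, 1050, 1000, 2000, 2250, 3500, 1030]
    s = sorted(salaries)
    n = len(s)
    me = (s[n // 2 - 1] + s[n // 2]) / 2        # mean of 6th and 7th items = 1065
    md = sum(abs(x - me) for x in s) / n        # 6380/12 = 531.67
    print(me, round(md, 2), round(md / me, 2))  # 1065.0 531.67 0.5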

Example (Continuous Series): Calculate the mean deviation and the
coefficient of mean deviation from the following data, using the mean.
(Difference in ages between boys and girls of a class.)

Diff. in years   No. of students
0-5              449
5-10             705
10-15            507
15-20            281
20-25            109
25-30            52
30-35            16
35-40            (not given)

Calculation:
1) x̄ = Σfx / Σf, taking the mid-value of each class as x
2) M.D. = Σf|x − x̄| / Σf
3) Coefficient of M.D. = M.D. / x̄

3.5.4 Standard Deviation (S. D.)


It is the square root of the arithmetic mean of the squares of the
deviations of the various values from their arithmetic mean. It is denoted
by σ (s.d.).

Thus, s.d. (σ) = √[ Σ fi (xi − x̄)² / n ], where n = Σ fi.

Merits: (1) It is rigidly defined and based on all observations.
(2) It is amenable to further algebraic treatment.
(3) It is not affected much by sampling fluctuations.
(4) It is less erratic.
Demerits: (1) It is difficult to understand and calculate.
(2) It gives greater weight to extreme values.

Note that the variance V(x) = σ², so that s.d. (σ) = +√V(x).
3.5.5 Co-efficient Of Variation ( C. V. )


To compare the variations (dispersion) of two different series, a relative
measure of the standard deviation must be calculated. This is known as the
coefficient of variation or the coefficient of s.d. Its formula is:

C.V. = (σ / x̄) × 100

Thus it is defined as the ratio of the s.d. to the mean, expressed as a
percentage.
Remark: It is used to compare the consistency or variability of two or more
series. The higher the C.V., the higher the variability; the lower the
C.V., the higher the consistency of the data.
Example Calculate the standard deviation and its co-efficient from
the following data.
A: 10, 12, 16, 8, 25, 30, 14, 11, 13, 11

Solution:

No.   xi    (xi − x̄)   (xi − x̄)²
1     10    −5          25
2     12    −3          9
3     16    +1          1
4     8     −7          49
5     25    +10         100
6     30    +15         225
7     14    −1          1
8     11    −4          16
9     13    −2          4
10    11    −4          16
n = 10      Σxi = 150   Σ(xi − x̄)² = 446

Calculations:
i) x̄ = Σxi / n = 150/10 = 15
ii) σ = √[Σ(xi − x̄)² / n] = √(446/10) = √44.6 = 6.68
iii) Coefficient of s.d. = σ/x̄ = 6.68/15 = 0.445
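A Python sketch of the same computation (illustrative; it uses the data of
the example just worked):

    x = [10, 12, 16, 8, 25, 30, 14, 11, 13, 11]
    n = len(x)
    mean = sum(x) / n                             # 150/10 = 15
    var = sum((xi - mean) ** 2 for xi in x) / n   # 446/10 = 44.6
    sd = var ** 0.5                               # about 6.68
    print(round(sd, 2), round(sd / mean, 3))      # s.d. and its coefficient
    print(round(sd / mean * 100, 1), "% (C.V.)")  # coefficient of variation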
Example Calculate s.d. of the marks of 100 students.
Marks   No. of students (fi)   Mid-values (xi)   fi xi   fi xi²
0-2     10                     1                 10      10
2-4     20                     3                 60      180
4-6     35                     5                 175     875
6-8     30                     7                 210     1470
8-10    5                      9                 45      405
        n = Σfi = 100                            Σfi xi = 500   Σfi xi² = 2940

Solution:
1) x̄ = Σfi xi / n = 500/100 = 5
2) σ = √[Σfi xi² / n − x̄²] = √(2940/100 − 25) = √4.4 = 2.1 marks
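For a grouped series the same formula works on the mid-values weighted by
the frequencies; a minimal Python sketch using the table above:

    f = [10, 20, 35, 30, 5]          # numbers of students
    x = [1, 3, 5, 7, 9]              # mid-values of the classes 0-2, 2-4, ...
    n = sum(f)                       # 100
    mean = sum(fi * xi for fi, xi in zip(f, x)) / n                  # 500/100 = 5
    var = sum(fi * xi ** 2 for fi, xi in zip(f, x)) / n - mean ** 2  # 29.4 - 25 = 4.4
    print(round(var ** 0.5, 1), "marks")                             # about 2.1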

Chapter Three
Measures of Dispersion
End Chapter Quizzes
1. Which of the following is not a measure of dispersion?
a- mean deviation
b- quartile deviation
c- standard deviation
d- average deviation from mean

2. Which of the following is a unitless measure of dispersion?
a- standard deviation
b- mean deviation
c- coefficient of variation
d- range

3. Which one of the given measures of dispersion is considered best?
a- standard deviation
b- range
c- variance
d- coefficient of variation

4. For comparison of two different series, the best measure of dispersion is
a- range
b- mean deviation
c- standard deviation
d- none of the above

5. Out of all the measures of dispersion, the easiest one to calculate is
a- standard deviation
b- range
c- variance
d- quartile deviation

6. Mean deviation is minimum when deviations are taken from
a- mean
b- median
c- mode
d- zero

7. Sum of squares of the deviations is minimum when deviations are taken from
a- mean
b- median
c- mode
d- zero

8. Which measure of dispersion is least affected by extreme values?
a- range
b- mean deviation
c- standard deviation
d- quartile deviation

9. The average of the sum of squares of the deviations about the mean is called
a- variance
b- absolute deviation
c- standard deviation
d- mean deviation

10. Quartile deviation is equal to
a- the interquartile range
b- double the interquartile range
c- half of the interquartile range
d- none of the above

CHAPTER FOUR
MEASURES OF SKEWNESS
4.1 Skewness
The voluminous raw data cannot be easily understood; hence we calculate
measures of central tendency and obtain a representative figure. From the
measures of variability we can know whether most of the items of the data
are close to or away from these central tendencies. But these statistical
means and measures of variation are not enough to draw sufficient
inferences about the data. Another aspect of the data is its symmetry. In
the chapter "Graphic display" we have seen that a frequency distribution
may or may not be symmetrical about the mode. This symmetry is studied
through the "skewness". Still one more aspect of the curve that we need to
know is the flatness or otherwise of its top; this is understood by what is
known as "kurtosis".

4.2 Definitions : Different authorities have defined skewness in different


manners. Some of the definitions are as under :
According to Croxton and Cowden, "When a series is not symmetrical, it is
said to be asymmetrical or skewed."

It may happen that two distributions have the same mean and standard
deviation, as in the following diagram.

Although the two distributions have the same means and standard deviations,
they are not identical. Where do they differ?
They differ in symmetry. The left-hand distribution is symmetrical, whereas
the distribution on the right hand is asymmetrical or skewed. For a
symmetrical distribution, the values at equal distances on either side of
the mode have equal frequencies. Thus, the mode, median and mean all
coincide. Its curve rises slowly, reaches a maximum (peak) and falls
equally slowly (Fig. 1). But for a skewed distribution, the mean, mode and
median do not coincide. Skewness is positive or negative according as the
mean and median lie on the right or the left of the mode.
A positively skewed distribution curve (Fig. 2) rises rapidly, reaches its
maximum and falls slowly; in other words, the tail as well as the median
lie on the right-hand side of the mode. A negatively skewed distribution
curve (Fig. 3) rises slowly, reaches its maximum and falls rapidly; in
other words, the tail as well as the median lie on the left-hand side of
the mode.


4.3 Difference between Skewness and Dispersion


Dispersion refers to the spread or variation of the items in a series,
while skewness refers to the direction of variation in a series; thus, with
skewness we measure the lack of symmetry in the distribution. Skewness may
be positive as well as negative, depending upon whether the value of the
mode lies on the right or on the left side of the distribution.

4.4 Tests of Skewness


1. The values of mean, median and mode do not coincide. The greater the
difference between them, the greater is the skewness.
2. Quartiles are not equidistant from the median, i.e. (Q3 − Me) ≠ (Me − Q1).
3. The sum of the positive deviations from the median is not equal to the
sum of the negative deviations.
4. Frequencies are not equally distributed at points of equal deviation
from the mode.
5. When the data are plotted on a graph they do not give the normal
bell-shaped form.

4.5 Methods of measurement of Skewness


1. First measure of skewness: It was given by Karl Pearson.
Absolute measure of skewness: Skp = Mean − Mode, i.e. Skp = x̄ − Mo.
Coefficient of skewness: J = (Mean − Mode)/σ = (x̄ − Mo)/σ.
Pearson suggested that, if it is not possible to determine the mode (Mo) of
a distribution, the relation (Mean − Mode) = 3(Mean − Median) may be used,
so that J = 3(x̄ − Me)/σ.
Note: i) Although the coefficient of skewness is always within ±1, Karl
Pearson's coefficient may lie within ±3.
ii) If J = 0, there is no skewness.
iii) If J is positive, the skewness is also positive.
iv) If J is negative, the skewness is also negative.
Unless an indication is given to the contrary, use only Karl Pearson's
formula.
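Karl Pearson's coefficient can be computed mechanically once the mean, mode
(or median) and standard deviation are known; a minimal Python sketch with
hypothetical marks (the data and names here are ours, not taken from the
worked examples):

    from statistics import mean, median, pstdev

    data = [2, 3, 5, 5, 5, 6, 7, 9, 12]   # hypothetical marks; the mode is clearly 5
    m, sd = mean(data), pstdev(data)      # pstdev = population standard deviation
    J = (m - 5) / sd                      # first formula: (mean - mode) / s.d.
    J_alt = 3 * (m - median(data)) / sd   # used when the mode is ill-defined
    print(round(J, 2), round(J_alt, 2))   # both positive: positive skewness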

Example: Find Karl Pearson's coefficient of skewness from the following
data:

Marks above   No. of students
10            150
20            140
30            100
40            80
50            80
60            70
70            30
80            14

Note: You will generally find different values of J when calculated by Karl
Pearson's and by Bowley's formula, but the value of J by Bowley's formula
always lies within ±1.

Example: The following table gives the frequency distribution of 291
workers of a factory according to their average monthly income in 1945-55.

Income group ($)   No. of workers
Below 50           1
50-70              16
70-90              39
90-110             58
110-130            60
130-150            46
150-170            22
170-190            15
190-210            15
210-230            9
230 & above        10

Solution:
Income group    f     c.f.
Below 50        1     1
50-70           16    17
70-90           39    56
90-110          58    114
110-130         60    174
130-150         46    220
150-170         22    242
170-190         15    257
190-210         15    272
210-230         9     281
230 & above     10    291
n = Σf = 291

Calculations:
1) Median = size of (n + 1)/2 th item
= size of (291 + 1)/2 th item
= size of the 146th item, which lies in the 110-130 class interval.

Me = l + ((146 − c.f.)/f) × i
= 110 + ((146 − 114)/60) × 20
= 110 + 10.67 = $120.67

Chapter Four
Measures of Skewness
End Chapter Quizzes
1. For a positively skewed distribution, which of the following
inequalities holds?
a- median > mode
b- mode > mean
c- mean > median
d- mean > mode

2. For a negatively skewed distribution, the correct inequality is
a- mode < median
b- mean < median
c- mean < mode
d- none of the above

3. In case of a positively skewed distribution, the relation between mean,
median and mode that holds is
a- median > mean > mode
b- mean > median > mode
c- mean = median = mode
d- none of the above

4. For a positively skewed frequency curve, the inequality that holds is
a- Q1 + Q3 > 2Q2
b- Q1 + Q2 > 2Q3
c- Q1 + Q3 > Q2
d- Q3 − Q1 > Q2

5. If a moderately skewed distribution has mean 30 and mode 36, the median
of the distribution is
a- 10
b- 35
c- 20
d- zero

6. The first and third quartiles of a frequency distribution are 30 and 75,
and its coefficient of skewness is 0.6. The median of the frequency
distribution is
a- 40
b- 39
c- 41
d- 38

7. For a negatively skewed distribution, the correct relation between mean,
median and mode is
a- mean = median = mode
b- median < mean < mode
c- mean < median < mode
d- mode < mean < median

8. In the case of a positively skewed distribution, the extreme values lie
in the
a- left tail
b- right tail
c- middle
d- anywhere

9. The extreme values in a negatively skewed distribution lie in the
a- middle
b- right tail
c- left tail
d- whole curve

10. Which of the following statements about measures of dispersion is true?
a- mean deviation does not follow the algebraic rule
b- range is the crudest measure
c- coefficient of variation is a relative measure
d- all the above statements

CHAPTER FIVE
CORRELATION
5.1 Introduction
So far we have considered only univariate distributions. By the averages,
dispersion and skewness of a distribution, we get a complete idea about its
structure. Many a time, however, we come across problems which involve two
or more variables. If we carefully study the figures of rainfall and
production of paddy, of accidents and motor cars in a city, of demand and
supply of a commodity, or of sales and profit, we may find that there is
some relationship between the two variables. On the other hand, if we
compare the figures of rainfall in America and the production of cars in
Japan, we may find that there is no relationship between the two variables.
If there is any relation between two variables, i.e. when one variable
changes the other also changes in the same or in the opposite direction, we
say that the two variables are correlated.
According to W. J. King, "If it is proved that in a large number of
instances two variables tend always to fluctuate in the same or in the
opposite direction, then it is established that a relationship exists
between the variables. This is called a correlation."
The correlation is one of the most common and most useful statistics.
A correlation is a single number that describes the degree of relationship
between two variables. Let's work through an example to show you how this
statistic is computed.
Correlation is a statistical technique that can show whether and how
strongly pairs of variables are related. For example, height and weight are
related; taller people tend to be heavier than shorter people. The relationship

isn't perfect. People of the same height vary in weight, and you can easily
think of two people you know where the shorter one is heavier than the taller
one. Nonetheless, the average weight of people 5'5'' is less than the average
weight of people 5'6'', and their average weight is less than that of people
5'7'', etc. Correlation can tell you just how much of the variation in peoples'
weights is related to their heights.
Although this correlation is fairly obvious your data may contain
unsuspected correlations. You may also suspect there are correlations, but
don't know which are the strongest. An intelligent correlation analysis can
lead to a greater understanding of your data.
Correlation means the study of the existence, magnitude and direction of
the relation between two or more variables, and it is very important both
in technology and in statistics. The famous astronomer Bravais, Prof. Sir
Francis Galton, Karl Pearson (who used this concept in biology and in
genetics), Prof. Neiswanger and many others have contributed to this great
subject.

5.2 Definitions :
An analysis of the covariation of two or more variables is usually
called correlation.
A. M. Tuttle
Correlation analysis attempts to determine the degree of relationship
between variables.
Ya Lun Chou
The effect of correlation is to reduce the range of uncertainty of
ones prediction.
Tippett

5.3 Coefficient of Correlation


The main result of a correlation is called the correlation coefficient
(or "r"). It ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more
closely the two variables are related.
If r is close to 0, it means there is no relationship between the
variables. If r is positive, it means that as one variable gets larger the other
gets larger. If r is negative it means that as one gets larger, the other gets
smaller (often called an "inverse" correlation).
While correlation coefficients are normally reported as r = (a value
between -1 and +1), squaring them makes them easier to understand. The
square of the coefficient (or r squared) is equal to the proportion of the
variation in one variable that is related to the variation in the other.
After squaring r, ignore the decimal point: an r of .5 means 25% of the
variation is related (.5 squared = .25), and an r value of .7 means 49% of
the variance is related (.7 squared = .49).
A correlation report can also show a second result of each test statistical significance. In this case, the significance level will tell you how
likely it is that the correlations reported may be due to chance in the form of
random sampling error. If you are working with small sample sizes, choose a
report format that includes the significance level. This format also reports
the sample size.
A key thing to remember when working with correlations is never to
assume a correlation means that a change in one variable causes a change in
another. Sales of personal computers and athletic shoes have both risen
strongly in the last several years and there is a high correlation between

them, but you cannot assume that buying computers causes people to buy
athletic shoes (or vice versa).
The second caveat is that the Pearson correlation technique works best
with linear relationships: as one variable gets larger, the other gets larger (or
smaller) in direct proportion. It does not work well with curvilinear
relationships (in which the relationship does not follow a straight line). An
example of a curvilinear relationship is age and health care. They are
related, but the relationship doesn't follow a straight line. Young children
and older people both tend to use much more health care than teenagers or
young adults. Multiple regression (also included in the Statistics Module)
can be used to examine curvilinear relationships, but it is beyond the scope
of this article.
Correlation Example
Let's assume that we want to look at the relationship between two
variables, height (in inches) and self esteem. Perhaps we have a hypothesis
that how tall you are affects your self esteem (incidentally, I don't think we
have to worry about the direction of causality here -- it's not likely that self
esteem causes your height!). Let's say we collect some information on
twenty individuals (all male -- we know that the average height differs for
males and females so, to keep this example simple we'll just use males).
Height is measured in inches. Self esteem is measured based on the average
of 10 1-to-5 rating items (where higher scores mean higher self esteem).
Here's the data for the 20 cases (don't take this too seriously -- I made this
data up to illustrate what a correlation is):

Person
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Height
68
71
62
75
58
60
67
68
71
69
68
67
63
62
60
63
65
67
63
61

Self Esteem
4.1
4.6
3.8
4.4
3.2
3.1
3.8
4.1
4.3
3.7
3.5
3.2
3.7
3.3
3.4
4.0
4.1
3.8
3.4
3.6

Now, let's take a quick look at the histogram for each variable:

And here are the descriptive statistics:

Variable      Mean    StDev     Variance   Sum    Minimum   Maximum   Range
Height        65.4    4.40574   19.4105    1308   58        75        17
Self Esteem   3.755   0.426090  0.181553   75.1   3.1       4.6       1.5

Finally, we'll look at the simple bivariate (i.e., two-variable) plot:

You should immediately see in the bivariate plot that the relationship
between the variables is a positive one (if you can't see that, review the
section on types of relationships) because if you were to fit a single straight
line through the dots it would have a positive slope or move up from left to
right. Since the correlation is nothing more than a quantitative estimate of
the relationship, we would expect a positive correlation.
What does a "positive relationship" mean in this context? It means
that, in general, higher scores on one variable tend to be paired with higher
scores on the other and that lower scores on one variable tend to be paired
with lower scores on the other. You should confirm visually that this is
generally true in the plot above.

5.4 Types of Correlation


5.4.1 Positive and negative correlation
A) If two variables change in the same direction (i.e. if one increases the
other also increases, or if one decreases the other also decreases), then
this is called a positive correlation. For example: advertising and sales.
B) If two variables change in opposite directions (i.e. if one increases,
the other decreases and vice versa), then the correlation is called a
negative correlation. For example: T.V. registrations and cinema
attendance.

5.4.2 Linear and non-linear correlation
The nature of the graph gives us the idea of the linear type of correlation
between two variables. If the graph is a straight line, the correlation is
called a "linear correlation"; if the graph is not a straight line, the
correlation is non-linear or curvilinear.
For example, if variable x changes by a constant quantity, say 20, and y
also changes by a constant quantity, say 4, the ratio between the two
always remains the same (1/5 in this case). In case of a curvilinear
correlation this ratio does not remain constant.

5.5 Degrees of Correlation


Through the coefficient of correlation, we can measure the degree or
extent of the correlation between two variables. On the basis of the
coefficient of correlation we can also determine whether the correlation is
positive or negative and also its degree or extent.

5.5.1 Perfect correlation: If two variables change in the same direction
and in the same proportion, the correlation between the two is perfect
positive; according to Karl Pearson the coefficient of correlation in this
case is +1. On the other hand, if the variables change in the opposite
direction and in the same proportion, the correlation is perfect negative
and its coefficient of correlation is −1. In practice we rarely come across
these types of correlation.

5.5.2 Absence of correlation: If two series of two variables exhibit no
relation between them, or a change in one variable does not lead to a
change in the other variable, then we can firmly say that there is no
correlation, or absurd correlation, between the two variables. In such a
case the coefficient of correlation is 0.

5.5.3 Limited degrees of correlation: If two variables are neither
perfectly correlated nor completely uncorrelated, we term the correlation a
limited correlation. It may be positive, negative or zero, but it lies
within the limits ±1.

High degree, moderate degree and low degree are the three categories of
this kind of correlation. The following table shows the degrees of the
coefficient of correlation:

Degrees                  Positive          Negative
Absence of correlation   zero              zero
Perfect correlation      +1                −1
High degree              +0.75 to +1       −0.75 to −1
Moderate degree          +0.25 to +0.75    −0.25 to −0.75
Low degree               0 to +0.25        0 to −0.25

5.6 Techniques in Determining Correlation


There are several different correlation techniques. The Survey
System's optional Statistics Module includes the most common type, called
the Pearson or product-moment correlation. The module also includes a
variation on this type called partial correlation. The latter is useful when you
want to look at the relationship between two variables while removing the
effect of one or two other variables.
Like all statistical techniques, correlation is only appropriate for
certain kinds of data. Correlation works for quantifiable data in which
numbers are meaningful, usually quantities of some sort. It cannot be used
for purely categorical data, such as gender, brands purchased, or favorite
color.
Following are the techniques for determining the correlation :-

5.6.1 Rating Scales


Rating scales are a controversial middle case. The numbers in rating
scales have meaning, but that meaning isn't very precise. They are not like
quantities. With a quantity (such as dollars), the difference between 1 and 2
is exactly the same as between 2 and 3. With a rating scale, that isn't really
the case. You can be sure that your respondents think a rating of 2 is
between a rating of 1 and a rating of 3, but you cannot be sure they think it is
exactly halfway between. This is especially true if you labeled the midpoints of your scale (you cannot assume "good" is exactly half way between
"excellent" and "fair").
Most statisticians say you cannot use correlations with rating scales,
because the mathematics of the technique assume the differences between
numbers are exactly equal. Nevertheless, many survey researchers do use
correlations with rating scales, because the results usually reflect the real
world. Our own position is that you can use correlations with rating scales,
but you should do so with care. When working with quantities, correlations
provide precise measurements. When working with rating scales,
correlations provide general indications.

Calculating the Correlation


Now we're ready to compute the correlation value. The formula for the
correlation is:

r = [N Σxy − (Σx)(Σy)] / √{[N Σx² − (Σx)²][N Σy² − (Σy)²]}
We use the symbol r to stand for the correlation. Through the magic
of mathematics it turns out that r will always be between -1.0 and +1.0. If the
correlation is negative, we have a negative relationship; if it's positive, the
relationship is positive. You don't need to know how we came up with this
formula unless you want to be a statistician. But you probably will need to
know how the formula relates to real data -- how you can use the formula to
compute the correlation. Let's look at the data we need for the formula.
Here's the original data with the other necessary columns:
Person   Height (x)   Self Esteem (y)   x*y      x*x     y*y
1        68           4.1               278.8    4624    16.81
2        71           4.6               326.6    5041    21.16
3        62           3.8               235.6    3844    14.44
4        75           4.4               330.0    5625    19.36
5        58           3.2               185.6    3364    10.24
6        60           3.1               186.0    3600    9.61
7        67           3.8               254.6    4489    14.44
8        68           4.1               278.8    4624    16.81
9        71           4.3               305.3    5041    18.49
10       69           3.7               255.3    4761    13.69
11       68           3.5               238.0    4624    12.25
12       67           3.2               214.4    4489    10.24
13       63           3.7               233.1    3969    13.69
14       62           3.3               204.6    3844    10.89
15       60           3.4               204.0    3600    11.56
16       63           4.0               252.0    3969    16.00
17       65           4.1               266.5    4225    16.81
18       67           3.8               254.6    4489    14.44
19       63           3.4               214.2    3969    11.56
20       61           3.6               219.6    3721    12.96
Sum =    1308         75.1              4937.6   85912   285.45

The first three columns are the same as in the table above. The next three
columns are simple computations based on the height and self esteem data.
The bottom row consists of the sums of each column. This is all the
information we need to compute the correlation. Here are the values from
the bottom row of the table (where N is 20 people) as they are related to
the symbols in the formula:
N = 20, Σxy = 4937.6, Σx = 1308, Σy = 75.1, Σx² = 85912, Σy² = 285.45

Now, when we plug these values into the formula given above, we get the
following (shown here tediously, one step at a time):
r = [20(4937.6) − (1308)(75.1)] / √{[20(85912) − 1308²][20(285.45) − 75.1²]}
= (98752 − 98230.8) / √[(1718240 − 1710864)(5709 − 5640.01)]
= 521.2 / √(7376 × 68.99)
= 521.2 / √508870.2
= 521.2 / 713.4
= 0.73
So, the correlation for our twenty cases is .73, which is a fairly strong
positive relationship. I guess there is a relationship between height and self
esteem, at least in this made up data!
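The whole table can be dispensed with if the computation is handed to a few
lines of Python (an illustrative sketch using the twenty height/self-esteem
pairs given above):

    height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
              68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
    esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
              3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]
    n = len(height)
    sx, sy = sum(height), sum(esteem)                  # 1308 and 75.1
    sxy = sum(x * y for x, y in zip(height, esteem))   # 4937.6
    sxx = sum(x * x for x in height)                   # 85912
    syy = sum(y * y for y in esteem)                   # 285.45
    r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    print(round(r, 2))                                 # 0.73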

5.7 Methods of Determining Correlation


We shall consider the following most commonly used methods: (1) Scatter
Plot, (2) Karl Pearson's coefficient of correlation, (3) Spearman's
rank-correlation coefficient.

5.7.1 Scatter Plot (Scatter diagram or dot diagram): In this


method the values of the two variables are plotted on a graph paper. One is
taken along the horizontal ( (x-axis) and the other along the vertical (y-axis).
By plotting the data, we get points (dots) on the graph which are generally
scattered and hence the name Scatter Plot.
The manner in which these points are scattered, suggest the
degree and the direction of correlation. The degree of correlation is
denoted by r and its direction is given by the signs positive and
negative.

i) If all points lie on a rising straight line the correlation is perfectly


positive and r = +1 (see fig.1 )
ii) If all points lie on a falling straight line the correlation is
perfectly negative and r = -1 (see fig.2)
iii) If the points lie in narrow strip, rising upwards, the
correlation is high degree of positive (see fig.3)
iv) If the points lie in a narrow strip, falling downwards, the
correlation is high degree of negative (see fig.4)
v) If the points are spread widely over a broad strip, rising
upwards, the correlation is low degree positive (see fig.5)

vi) If the points are spread widely over a broad strip, falling
downward, the correlation is low degree negative (see fig.6)
vii) If the points are spread (scattered) without any specific pattern,
the correlation is absent. i.e. r = 0. (see fig.7)
Though this method is simple and gives a rough idea about the existence and
the degree of correlation, it is not reliable. As it is not a mathematical
method, it cannot measure the degree of correlation precisely.

5.7.2 Karl Pearson's coefficient of correlation: It gives a numerical
expression for the measure of correlation, denoted by r. The value of r
gives the magnitude of correlation and its sign denotes the direction. It
is defined as

r = Σxy / (N σx σy)

where x = X − X̄, y = Y − Ȳ, σx and σy are the standard deviations of X and
Y, and N = number of pairs of observations.
Note: r is also known as the product-moment coefficient of correlation.

OR r = Σxy / √(Σx² · Σy²)

OR r = [N ΣXY − (ΣX)(ΣY)] / √{[N ΣX² − (ΣX)²][N ΣY² − (ΣY)²]}

The covariance of x and y is defined as cov(X, Y) = Σ(X − X̄)(Y − Ȳ)/N, so
that r = cov(X, Y)/(σx σy).
Example: Calculate the coefficient of correlation between the heights of
fathers and sons from the following data.

Height of father (cm): 165, 166, 167, 167, 168, 169, 170, 172
Height of son (cm):    167, 168, 165, 168, 172, 172, 169, 171

Solution: n = 8 (pairs of observations); X̄ = 1344/8 = 168, Ȳ = 1352/8 = 169.

Father xi   Son yi   x = xi − x̄   y = yi − ȳ   xy    x²    y²
165         167      −3           −2           6     9     4
166         168      −2           −1           2     4     1
167         165      −1           −4           4     1     16
167         168      −1           −1           1     1     1
168         172      0            +3           0     0     9
169         172      +1           +3           3     1     9
170         169      +2           0            0     4     0
172         171      +4           +2           8     16    4
Σxi = 1344  Σyi = 1352                         Σxy = 24   Σx² = 36   Σy² = 44

Calculation:
r = Σxy / √(Σx² · Σy²) = 24 / √(36 × 44) = 24/39.8 = 0.6

Since r is positive and about 0.6, the correlation is positive and moderate
(i.e. direct and reasonably good).
Example From the following data compute the coefficient of
correlation between x and y.

Example: If the covariance between x and y is 12.3, and the variances of x
and y are 16.4 and 13.8 respectively, find the coefficient of correlation
between them.
Solution: Given - covariance cov(x, y) = 12.3
Variance of x (σx²) = 16.4
Variance of y (σy²) = 13.8
Now, r = cov(x, y)/(σx σy) = 12.3/(√16.4 × √13.8) = 12.3/(4.05 × 3.71)
= 12.3/15.04 = 0.818

5.7.3 Spearmans Rank Correlation Coefficient


This method is based on the ranks of the items rather than on their actual
values. The advantage of this method over the others is that it can be used
even when the actual values of the items are unknown. For example, if you
want to know the correlation between honesty and wisdom of the boys of your
class, you can use this method by giving ranks to the boys. It can also be
used to find the degree of agreement between the judgements of two
examiners or two judges. The formula is:

R = 1 − 6ΣD² / [N(N² − 1)]

where R = rank correlation coefficient
D = difference between the ranks of the two items
N = the number of observations.
Note: −1 ≤ R ≤ 1.
i) When R = +1: perfect positive correlation, or complete agreement in the
same direction.
ii) When R = −1: perfect negative correlation, or complete agreement in the
opposite direction.
iii) When R = 0: no correlation.

Computation:
i. Give ranks to the values of the items. Generally the item with the
highest value is ranked 1 and the others are given ranks 2, 3, 4, ...
according to their values in decreasing order.
ii. Find the difference D = R1 − R2, where R1 = rank of x and R2 = rank of
y. Note that ΣD = 0 (always).
iii. Calculate D² and then find ΣD².
iv. Apply the formula.
Note:
In some cases there is a tie between two or more items; in such a case each
item is given the mean of the tied ranks. If two items would occupy the 4th
and 5th ranks, each is given the rank 4.5; if three items are tied at the
4th rank, each is given (4 + 5 + 6)/3 = 5. If m is the number of items
having equal ranks, the factor (m³ − m)/12 is added to ΣD². If there is
more than one such case, this factor is added as many times as the number
of such cases. Then

R = 1 − 6[ΣD² + (m³ − m)/12 + ...] / [N(N² − 1)]

Example: Calculate the rank correlation coefficient from data in which ten
students were ranked both in Maths and in Statistics.

Solution: Computing D = R1 − R2 for each of the N = 10 students gives
ΣD = 0 and ΣD² = 96.

Calculation of R:
R = 1 − 6ΣD² / [N(N² − 1)]
= 1 − (6 × 96)/(10 × 99)
= 1 − 576/990
= 1 − 0.582 = 0.418
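The same figure follows from a one-line Python function (a minimal sketch;
the function name is ours):

    def spearman_R(sum_d_squared, n):
        # R = 1 - 6 * sum(D^2) / (N * (N^2 - 1))
        return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

    print(round(spearman_R(96, 10), 3))   # 0.418, as computed above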

Example: Calculate R for 6 students from the following data.

Marks in Stats:   40, 42, 45, 35, 36, 39
Marks in English: 46, 43, 44, 39, 40, 43

Solution:
Marks in Stats   R1   Marks in English   R2    R1 − R2 = D   D²
40               3    46                 1     2             4
42               2    43                 3.5   −1.5          2.25
45               1    44                 2     −1            1
35               6    39                 6     0             0
36               5    40                 5     0             0
39               4    43                 3.5   0.5           0.25
                                               ΣD = 0        ΣD² = 7.50
N = 6

Here m = 2, since in the series of marks in English the value 43 is
repeated twice. Hence:
R = 1 − 6[ΣD² + (m³ − m)/12] / [N(N² − 1)]
= 1 − 6[7.50 + (8 − 2)/12] / [6(36 − 1)]
= 1 − 6(8)/210
= 1 − 0.229 = 0.77

Example: The value of Spearman's rank correlation coefficient for a certain
number of pairs of observations was found to be 2/3. The sum of the squares
of the differences between the corresponding ranks was 55. Find the number
of pairs.
Solution: We have R = 1 − 6ΣD²/[N(N² − 1)], so
2/3 = 1 − (6 × 55)/[N(N² − 1)]
⇒ 330/[N(N² − 1)] = 1/3
⇒ N(N² − 1) = 990 = 10 × 99
⇒ N = 10
Example: A panel of two judges, A and B, graded dramatic performances by
independently awarding marks. Judge A awarded 38 marks to a performance
which Judge B did not attend; Judge B's mark is estimated from the line of
regression of y (Judge B's marks) on x (Judge A's marks).

Solution:
The equation of the line of regression of y on x works out to
y − 33 = 0.74(x − 33)
Inserting x = 38, we get
y − 33 = 0.74 × (38 − 33)
y − 33 = 0.74 × 5
y − 33 = 3.7
y = 3.7 + 33
y = 36.7 = 37 (approximately)

Therefore, Judge B would have given 37 marks to the eighth performance.

Chapter Five
Correlation Analysis
End Chapter Quizzes
1. The idea of product moment correlation was given by
a- R. A. Fisher
b- Sir Francis Galton
c- Karl Pearson
d- Spearman

2. The correlation coefficient was invented in the year
a- 1910
b- 1890
c- 1908
d- none of the above

3. The unit of the correlation coefficient is
a- kg/cc
b- per cent
c- non-existing
d- none of the above

4. The correlation between two variables is of order
a- 2
b- 1
c- 0
d- none of the above

5. The coefficient of concurrent deviations depends on
a- the signs of the deviations
b- the magnitude of the deviations
c- both (a) and (b)
d- none of (a) and (b)

6. If each group consists of one observation only, the value of the
correlation ratio is
a- 1
b- 0
c- between 0 and 1
d- between −1 and 1

7. For a given (2 × c) contingency table, the appropriate measure of
association is
a- correlation ratio
b- biserial correlation
c- intraclass correlation
d- tetrachoric correlation

8. Another name for autocorrelation is
a- biserial correlation
b- serial correlation
c- Spearman's correlation
d- none of the above

9. If the correlation coefficient between two variables is positive, it
means that the lines of regression are
a- far apart
b- coincident
c- near to each other
d- none of the above

10. If the correlation between the two variables is unity, there is
a- perfect correlation
b- perfect positive correlation
c- perfect negative correlation
d- no correlation

CHAPTER SIX
REGRESSION ANALYSIS
6.1 Meaning
In statistics, regression analysis is a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also called
response variable or measurement) and of one or more independent variables (also known
as explanatory variables or predictors). The dependent variable in the regression
equation is modeled as a function of the independent variables, corresponding
parameters ("constants"), and an error term.
So Regression analysis is any statistical method where the mean of one or more
random variables is predicted based on other measured random variables. There are two
types of regression analysis, chosen according to whether the data approximate a straight
line, when linear regression is used, or not, when non-linear regression is used.
Regression can be used for prediction (including forecasting of time-series data),
inference, hypothesis testing, and modeling of causal relationships. These uses of
regression rely heavily on the underlying assumptions being satisfied. Regression
analysis has been criticized as being misused for these purposes in many cases where
the appropriate assumptions cannot be verified to hold; one factor contributing to
the misuse of regression is that it can take considerably more skill to critique a
model than to fit one.

6.2 Definitions :
Regression is the measure of the average relationship between two or more variables
in terms of the original units of the data.
Morris M. Blair

One of the most frequently used techniques in economics and business research,
to find a relation between two or more variables that are related causally, is
regression analysis.
Taro Yamane
It is often more important to find out what the relation actually is, in order to
estimate or predict one variable, and the statistical technique appropriate to such
a case is called regression analysis.
Wallis and Roberts

6.3 Regression Line


A regression line is a line drawn through a scatterplot of two variables. The line is chosen
so that it comes as close to the points as possible. Regression analysis, on the other hand,
is more than curve fitting. It involves fitting a model with both deterministic and
stochastic components. The deterministic component is called the predictor and the
stochastic component is called the error term.
The simplest form of a regression model contains a dependent variable, also called the
"Y-variable" and a single independent variable, also called the "X-variable".

6.4 Regression Equations and Regression Coefficient


Regression equations or estimating equations are algebraic expression of regression lines.
As there are two regression lines, so there are two regression equation, i.e. regression
equation of X on Y and regression equation of Y on X.
The regression equation of X on Y is:
X = a + bY
Here X is the dependent variable and Y the independent variable; a is the X
intercept and b is the slope of the line, representing the change in variable X
for a unit change in variable Y. The constants a and b are obtained by solving
the two normal equations:
ΣX = Na + bΣY          ... (i)
ΣXY = aΣY + bΣY²       ... (ii)
Similarly, the regression equation of Y on X is:
Y = a + bX
and the constants a and b are found by solving the following two equations:
ΣY = Na + bΣX          ... (i)
ΣXY = aΣX + bΣX²       ... (ii)

Illustration: Students of a class have obtained the following marks in
Paper I and Paper II of Statistics:

Paper I:  45, 55, 56, 58, 60, 65, 68, 70, 75, 80, 85
Paper II: 56, 50, 48, 60, 62, 64, 65, 70, 74, 82, 90

Find the means, the coefficient of correlation and the regression
coefficients.
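A Python sketch of the required computations (illustrative; the variable
names are ours, and the regression coefficients byx and bxy are obtained
from the deviation sums rather than by solving the normal equations, which
gives the same result):

    paper1 = [45, 55, 56, 58, 60, 65, 68, 70, 75, 80, 85]
    paper2 = [56, 50, 48, 60, 62, 64, 65, 70, 74, 82, 90]
    n = len(paper1)
    mx, my = sum(paper1) / n, sum(paper2) / n   # means: about 65.18 and 65.55
    sxy = sum((x - mx) * (y - my) for x, y in zip(paper1, paper2))
    sxx = sum((x - mx) ** 2 for x in paper1)
    syy = sum((y - my) ** 2 for y in paper2)
    r = sxy / (sxx * syy) ** 0.5    # coefficient of correlation, about 0.92
    byx = sxy / sxx                 # regression coefficient of Y on X, about 0.99
    bxy = sxy / syy                 # regression coefficient of X on Y, about 0.85
    print(round(r, 2), round(byx, 2), round(bxy, 2))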

6.5 Difference between Correlation and Regression Analysis


Both correlation and regression analysis are important statistical tools
for studying the relationship between variables. The difference between the
two can be analysed as under:

Correlation
1. Correlation measures the relationship between two variables which vary
in the same or in opposite directions.
2. Here both X and Y are random variables.
3. There can be nonsense or spurious correlation between two variables.
4. The coefficient of correlation is a relative measure and ranges within ±1.

Regression Analysis
1. Regression means "going back" or "act of return". It is a mathematical
measure which shows the average relationship between the two variables.
2. Here X is a random variable and Y is a fixed variable; however, both
variables may be random variables.
3. There is no such thing as a nonsense regression equation.
4. The regression coefficient is an absolute measure. If we know the value
of the independent variable, we can estimate the value of the dependent
variable.

Chapter Six
Regression Analysis
End Chapter Quizzes
1. The term regression was introduced by
a- R. A. Fisher
b- Sir Francis Galton
c- Karl Pearson
d- none of the above

2. If X and Y are two variates, there can be at most
a- one regression line
b- two regression lines
c- three regression lines
d- an infinite number of regression lines

3. In the regression line of Y on X, the variable X is known as
a- independent variable
b- regressor
c- explanatory variable
d- all the above

4. The regression equation is also named as
a- prediction equation
b- estimating equation
c- line of average relationship
d- all the above

5. A scatter diagram of the variate values (X, Y) gives the idea about
a- functional relationship
b- regression model
c- distribution of errors
d- none of the above

6. If ρ = 0, the lines of regression are
a- coincident
b- parallel
c- perpendicular to each other
d- none of the above

7. The regression coefficient is independent of
a- origin
b- scale
c- both origin and scale
d- neither origin nor scale

8. Regression analysis can be used for
a- reducing the length of the confidence interval
b- prediction of the dependent variate's value
c- knowing the true effect of certain treatments
d- all the above

9. Probable error is used for
a- measuring the error in r
b- testing the significance of r
c- both (a) and (b)
d- neither (a) nor (b)

10. If ρ = 0, the angle between the two lines of regression is
a- 0 degrees
b- 90 degrees
c- 60 degrees
d- 30 degrees

CHAPTER SEVEN
TIME SERIES ANALYSIS
7.1 Meaning
In statistics, signal processing, and many other fields, a time series is a sequence of data
points, measured typically at successive times, spaced at (often uniform) time intervals.
Time series analysis comprises methods that attempt to understand such time series,
often either to understand the underlying context of the data points (where did they come
from? what generated them?), or to make forecasts (predictions). Time series forecasting
is the use of a model to forecast future events based on known past events: to forecast
future data points before they are measured. A standard example in econometrics is the
opening price of a share of stock based on its past performance.
The term time series analysis is used to distinguish a problem, firstly from more
ordinary data analysis problems (where there is no natural ordering of the context of
individual observations), and secondly from spatial data analysis where there is a context
that observations (often) relate to geographical locations. There are additional
possibilities in the form of space-time models (often called spatial-temporal analysis). A
time series model will generally reflect the fact that observations close together in time
will be more closely related than observations further apart. In addition, time series
models will often make use of the natural one-way ordering of time so that values in a
series for a given time will be expressed as deriving in some way from past values, rather
than from future values (see time reversibility.)
So a time series is a sequence of observations which are ordered in time (or
space). If observations are made on some phenomenon throughout time, it is most
sensible to display the data in the order in which they arose, particularly since successive
observations will probably be dependent. Time series are best displayed in a scatter plot.
The series value X is plotted on the vertical axis and time t on the horizontal axis. Time is
called the independent variable (in this case however, something over which you have
little control). There are two kinds of time series data:
1. Continuous, where we have an observation at every instant of time, e.g.
lie detectors, electrocardiograms. We denote this using observation X at
time t, X(t).
2. Discrete, where we have an observation at (usually regularly) spaced
intervals. We denote this as Xt.

7.2 Definitions
A set of data depending on the time is called a time series.
------- Kenny and Keeping
A time series consists of data arranged chronologically.
------- Croxton and Cowden
A time series may be defined as a sequence or repeated measurements of a variable
made periodically through time.
------- C.H.Mayers

7.3 Applications of time series: The application of time series models is
twofold:
- To obtain an understanding of the underlying forces and structure that
produced the observed data.
- To fit a model and proceed to forecasting, monitoring or even feedback
and feed-forward control.
Time Series Analysis is used for many applications. Few of them are as follows:

Economic Forecasting

Sales Forecasting

Budgetary Analysis

Stock Market Analysis

Yield Projections

Process and Quality Control

Inventory Studies

Workload Projections

Utility Studies

Census Analysis

7.4 Uses or importance of Time-series


Analysis of time series is useful in every walk of life like business, economics, science,
state, sociology, research work etc. However, following are its main objectives :
7.4.1 Study of past behaviour: Analysis of time series studies the past behaviour of data
and indicates the changes that have taken place in the past.
7.4.2 Prediction for future: On the basis of analysis of time series, future predictions
can be made easily. For instance, we can predict future sales and necessary alterations
can be done in the production policy.
7.4.3 Facilitate comparisons: We can compare various time series to study
changes in the death rate, birth rate, yield per acre etc.

7.4.4 Evaluation of actual data: On the basis of deviation analysis of actual data and
estimated data obtained from analysis of time series, we can come to know about the
causes of this change.
7.4.5 Prediction of trade cycle: We can know about the factors of cyclical variations
like boom, depression, recession and recovery which are very important to business
community.
7.4.6 Universal utility: The analysis of time series is not only useful to
the business community and economists but is equally useful to
agriculturists, governments, researchers, political and social
institutions, scientists etc.

7.5 Difference between seasonal and cyclical variations


Following are the main differences between the two:
7.5.1 Time period: The duration of seasonal variations is always one year,
while the duration of cyclical variations is more than one year, varying
from three to eight years.
7.5.2 Regularity: We find regularity in the components of seasonal
variation, while there is no regularity in the components of cyclical
variations; even the lengths of the phases of cyclical variations, viz.
boom, recession, depression and recovery, are not equal.
7.5.3 Causes of variations: Seasonal variation takes place due to change in seasons,
customs, habits, fashion etc. While cyclical variation takes place due to change in the
economic activity.
7.5.4 Measurement: Both variations can be measured; however, the techniques
differ. Seasonal variation can be measured more precisely as it is regular
in nature.
7.5.5 Effect of variation: Seasonal variation affect different people in a different manner
while the effect of cyclical variation is the same on the whole economy.

7.6 Components of time series


Following are the components of time series :

7.6.1 Trend Component


We want to increase our understanding of a time series by picking out its main features.
One of these main features is the trend component. Descriptive techniques may be
extended to forecast (predict) future values.
Trend is a long term movement in a time series. It is the underlying direction (an upward
or downward tendency) and rate of change in a time series, when allowance has been
made for the other components.
A simple way of detecting trend in seasonal data is to take averages over a certain period.
If these averages change with time we can say that there is evidence of a trend in the
series. There are also more formal tests to enable detection of trend in time series.
It can be helpful to model trend using straight lines, polynomials etc.

7.6.2 Cyclical Component


We want to increase our understanding of a time series by picking out its main
features. One of these main features is the cyclical component. Descriptive techniques
may be extended to forecast (predict) future values.
In weekly or monthly data, the cyclical component describes any regular
fluctuations. It is a non-seasonal component which varies in a recognisable cycle.

7.6.3 Seasonal Component


We want to increase our understanding of a time series by picking out its main
features. One of these main features is the seasonal component. Descriptive techniques
may be extended to forecast (predict) future values.
In weekly or monthly data, the seasonal component, often referred to as
seasonality, is the component of variation in a time series which is dependent on the time
of year. It describes any regular fluctuations with a period of less than one year. For
example, the costs of various types of fruits and vegetables, unemployment figures and
average daily rainfall, all show marked seasonal variation. We are interested in
comparing the seasonal effects within the years, from year to year; removing seasonal
effects so that the time series is easier to cope with; and, also interested in adjusting a
series for seasonal effects using various models.

7.6.4 Irregular Component


We want to increase our understanding of a time series by picking out its main
features. One of these main features is the irregular component (or 'noise'). Descriptive
techniques may be extended to forecast (predict) future values.
The irregular component is that left over when the other components of the series
(trend, seasonal and cyclical) have been accounted for.

7.7 Methods of measuring secular trend or trend


Broadly speaking there are four methods of measuring trend, they are as follows :
7.7.1 Free hand curve method: This is the easiest and simplest method of
computing secular trend. In this method, time is plotted on the X-axis and
the other variable is plotted on the Y-axis. A free hand curve is then
drawn so as to pass through the centre of the original fluctuations.
Merits:
-It is the easiest and simplest method of obtaining trend values.
-The trend line is drawn without using a scale, so it may be a straight
line or a smooth curve.
-The method is free from any mathematical formulas.
Demerits:
-The straight line trend (Yt) drawn on the graph will differ from person to
person in the absence of any mathematical formula.
-If the statistician is biased, the free hand curve will also be biased.

7.7.2 Semi average method: It is a better technique in comparison to the
free hand curve method. Under this method the variable (Y) is divided into
two equal parts and the average of each part is computed separately.
Merits:
-This method is simple and easy to understand in relation to moving average and
least square method.
-The trend line (Yt) in this method is a fixed straight line unlike the free hand
curve method where trend line depend upon the personal judgement of the
statistician.
Demerits:
-The method is based on the assumption of linear trend whether it exists or not.
-The method is affected by the limitation of the arithmetic means.
-This method is not suitable for removing trend from the original data.

7.7.3 Moving Average method: This method is a better technique of finding trend than
the semi average method. The trend values are obtained with a fair degree of
accuracy by eliminating cyclical fluctuations. In this method we calculate averages on a
moving basis. The period of the moving average is determined from the
length of the cyclical fluctuations, which varies from 3 to 11 years.
Merits:
-This technique is easier than the method of least squares.
-This technique is effective if the trend of the series is irregular.
Demerits:
-In this method we cannot obtain the trend values for all the years, as we lose the
first and last values of the data while computing a three-year moving average, and
so on.
-The basic purpose of trend values is to predict the future trend. In this method
we cannot extend the trend line in either direction, so this method cannot be used
for prediction purposes.

7.7.4 Method of least square: This is the best method of measuring secular trend. It
is a mathematical as well as an analytical tool. This method can be fitted to economic and
business time series to make future predictions.
The trend line may be linear or non-linear.
Merits:
-The method of least squares does not suffer from subjectivity or personal
judgement, as it is a mathematical method.
-We can compute the trend values of all the given years by this method.
Demerits:
-The method is based on a mathematical technique, so it is not easily
understandable to a non-mathematical person.
-If we add or delete some observations in the data, the values of the constants 'a' and
'b' will change and a new trend line will be obtained.
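As a computational aside (not part of the original exercises), a minimal Python sketch of fitting the least squares trend line Yt = a + bX is given below; it borrows the 1990-1994 sales figures from the illustration in Section 7.9 and assumes numpy is available.

import numpy as np

# Yearly sales (in lakh Rs.) for 1990-1994, taken from the illustration in 7.9.
years = np.array([1990, 1991, 1992, 1993, 1994])
sales = np.array([3.0, 8.0, 10.0, 9.0, 12.0])

# Fit the straight-line trend Yt = a + b*X by least squares (degree-1 polynomial).
b, a = np.polyfit(years, sales, 1)

# Trend values for the given years, plus an extrapolated forecast.
trend = a + b * years
print("a =", round(a, 2), "b =", round(b, 2))
print("trend values:", np.round(trend, 2))
print("forecast for 1995:", round(a + b * 1995, 2))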

7.8 Measurement of seasonal variations


The short term variations within a year in a time series are referred to as seasonal
variations. These variations are periodic in nature, viz., weekly, monthly or quarterly
changes. They may take place due to the change of seasons, such as summer, winter,
rainy and autumn. Thus, seasonal variations refer to an annual repetitive pattern in economic
and business activity.
Following measures are used to measure the seasonal variations:

7.8.1 Method of simple averages:


This method involves the following steps (a code sketch follows the formula below):
-The given time series is arranged by years and months (or quarters).
-The total of each month over the given years is obtained.
-The average for each month is then obtained by dividing the monthly total by the
number of years.
-The total of the monthly averages is obtained and divided by the number of months in a
year to get the average of the monthly averages.
-Taking the average of the monthly averages as the base, the seasonal index is computed
for each month by applying the following formula:
Seasonal index = (Monthly average for the month / Average of the monthly averages) × 100
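A minimal Python sketch of these steps for quarterly data is given below; the figures are the quarterly values from the link relative illustration in Section 7.9, reused here purely for illustration.

# Method of simple averages: rows are years, columns are quarters I-IV.
data = [
    [45, 54, 72, 60],
    [48, 56, 63, 56],
    [49, 63, 70, 65],
    [52, 65, 75, 72],
]
n_years, n_quarters = len(data), 4

# Average of each quarter over the years.
quarter_avgs = [sum(year[q] for year in data) / n_years for q in range(n_quarters)]

# Average of the quarterly averages (the base).
base = sum(quarter_avgs) / n_quarters

# Seasonal index for each quarter; the four indices total 400.
seasonal_index = [100 * qa / base for qa in quarter_avgs]
print([round(s, 2) for s in seasonal_index])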
7.8.2 Ratio to trend method:
This method is based on the multiplicative model of a time series. It assumes that the
seasonal variation for a given period is a constant fraction of the trend value. The steps
for computation under this method are:
-First of all, trend values are calculated by applying the method of least squares to
the yearly averages.
-Trend values for each quarter are then derived from the yearly trend values so obtained.
-Now divide the original quarterly data by the trend value of the corresponding
quarter and multiply the quotient by hundred. These values are free from trend.
-To free the data from cyclical and irregular variations, the quarterly figures are
averaged.
7.8.3 Link relative method: This is one of the most difficult methods of obtaining
seasonal variations. The steps involved in this method are listed here, with a code sketch
after the list:
1. Link relatives are calculated from the given quarterly data by applying the formula:
Link relative = (Current quarter / Previous quarter) × 100
2. The average of the link relatives is obtained for each quarter.
3. Chain relatives are then calculated by using the formula:
Chain relative = (Current quarter's average L.R. × Previous quarter's chain relative) / 100
4. A new chain relative for the I quarter is calculated on the basis of the IV quarter;
one-fourth of its excess over 100 gives the quarterly effect.
5. The chain relatives are adjusted by subtracting (quarterly effect × 1), (quarterly
effect × 2) and (quarterly effect × 3) from the II, III and IV quarters respectively.
6. The seasonal index is finally computed. Since the total of the quarterly indices
should be 400, while the actual total will generally differ, the seasonal index is computed as:
Seasonal index = (Adjusted chain relative of the quarter × 400) / Actual total of the
adjusted chain relatives of the four quarters
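Because the arithmetic is tedious by hand, a rough Python sketch of these six steps is given below. It follows the procedure as described above; the variable names are our own, and the data are those of the illustration in Section 7.9.

# Link relative method, following steps 1-6 above.
data = [          # rows = years, columns = quarters I-IV
    [45, 54, 72, 60],
    [48, 56, 63, 56],
    [49, 63, 70, 65],
    [52, 65, 75, 72],
]

# Step 1: link relatives = current quarter / previous quarter * 100,
# taking the quarters in time order.
flat = [q for year in data for q in year]
link = [100 * flat[i] / flat[i - 1] for i in range(1, len(flat))]

# Step 2: average link relative for each quarter (quarter I has one value
# fewer, since the very first quarter has no predecessor).
avg_lr = []
for q in range(4):
    vals = [link[j] for j in range(len(link)) if (j + 1) % 4 == q]
    avg_lr.append(sum(vals) / len(vals))

# Step 3: chain relatives, taking quarter I as 100.
chain = [100.0]
for q in range(1, 4):
    chain.append(avg_lr[q] * chain[q - 1] / 100)

# Step 4: new chain relative for quarter I based on quarter IV; a quarter
# of its excess over 100 is the quarterly effect.
new_first = avg_lr[0] * chain[3] / 100
effect = (new_first - 100) / 4

# Step 5: adjust quarters II, III and IV by effect*1, effect*2, effect*3.
adjusted = [chain[0]] + [chain[q] - effect * q for q in range(1, 4)]

# Step 6: scale the adjusted chain relatives so the four indices total 400.
seasonal = [a * 400 / sum(adjusted) for a in adjusted]
print([round(s, 2) for s in seasonal])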
7.9 Practical Problems:
Illustration: Find the 3-year moving averages from the following data:

Year    Sales (in lakh Rs.)        Year    Sales (in lakh Rs.)
1990    3                          1995    15
1991    8                          1996    13
1992    10                         1997    18
1993    9                          1998    17
1994    12                         1999    20
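A short Python sketch of the computation (the centred three-year average leaves no trend value for the first and last years):

# 3-year moving averages for the data above.
years = list(range(1990, 2000))
sales = [3, 8, 10, 9, 12, 15, 13, 18, 17, 20]

for i in range(1, len(sales) - 1):
    moving_avg = (sales[i - 1] + sales[i] + sales[i + 1]) / 3
    print(years[i], round(moving_avg, 2))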

Illustration: Compute seasonal variations by using the Link Relative Method from the
following data:

Year    I Quarter    II Quarter    III Quarter    IV Quarter
I       45           54            72             60
II      48           56            63             56
III     49           63            70             65
IV      52           65            75             72

(iv) Total of the corrected chain relatives = 100 + 120.08 + 140.86 + 124.74 = 485.68
(v) Seasonal index for each quarter = (Corrected chain relative × 400) / 485.68

Chapter Seven
Time Series Analysis
End Chapter Quizzes
1. A time series is a set of data recorded
a- periodically
b- at time or space intervals
c- at successive points of time
d- all the above

2. The time series analysis helps
a- to compare the two or more series
b- to know the behaviour of business
c- to make predictions
d- all the above

3. A time series is unable to adjust the influences like
a- customs and policy changes
b- seasonal changes
c- long-term influences
d- none of the above

4. A time series consists of
a- two components
b- three components
c- four components
d- five components

5. The forecasts on the basis of a time series are
a- cent per cent true
b- true to a great extent
c- never true
d- none of the above

6. The component of the time series attached to long-term variations is termed as
a- cyclic variation
b- secular trend
c- irregular variation
d- all the above

7. Secular trend is indicative of long-term variation towards
a- increase only
b- decrease only
c- either increase or decrease
d- none of the above

8. Linear trend of a time series indicates towards
a- constant rate of change
b- constant rate of growth
c- change in geometric progression
d- all the above

9. Seasonal variation means the variations occurring within
a- a number of years
b- parts of a year
c- parts of a month
d- none of the above

10. Cyclic variations in a time series are caused by
a- lockouts in a factory
b- war in a country
c- floods in the states
d- none of the above

CHAPTER EIGHT
PROBABILITY
8.1 Introduction
The theory of probability was developed towards the end of the seventeenth century, and its
history suggests that it developed with the study of games of chance, such as rolling a
die, drawing a card or flipping a coin. Apart from these, uncertainty prevails in every
sphere of life. For instance, one often predicts: "It will probably rain tonight", or "It is quite
likely that there will be a good yield of cereals this year", and so on. This indicates that, in
layman's terminology, the word probability connotes that there is uncertainty about the
happening of events. To put probability on a better footing we define it. But
before doing so, we have to explain a few terms.

8.2 Concepts of probability calculation


Following are the fundamental concepts of probability calculation:

8.2.1 Trial
A procedure or an experiment to collect any statistical data, such as rolling a die or
flipping a coin, is called a trial.

8.2.2 Random Trial or Random Experiment


When the outcome of an experiment cannot be predicted precisely, the experiment
is called a random trial or random experiment. In other words, if a random experiment is
repeated under identical conditions, the outcome will vary at random, as it is impossible to
predict the result of the experiment. For example, if we toss an honest coin or
roll an unbiased die, we may not get the same results as we expect.

8.2.3 Sample space


The totality of all the outcomes or results of a random experiment is denoted by the Greek
letter Ω or by an English letter such as S, and is called the sample space. Each outcome or
element of this sample space is known as a sample point.

8.2.4 Event
Any subset of a sample space is called an event. A sample space S serves as the universal
set for all questions related to an experiment, and an event A with respect to it is the set of
all possible outcomes favourable to the event A.
For example,
A random experiment: flipping a coin twice
Sample space: Ω or S = {(HH), (HT), (TH), (TT)}
The question: "both the flips show the same face"
Therefore, the event A: {(HH), (TT)}

8.2.5 Equally Likely Events


All possible results of a random experiment are called equally likely outcomes when we
have no reason to expect any one rather than any other. For example, as the result of
drawing a card from a well shuffled pack, any card may appear in the draw, so that the 52
cards become 52 different events which are equally likely.

8.2.6 Mutually Exclusive Events


Events are called mutually exclusive or disjoint or incompatible if the occurrence of one
of them precludes the occurrence of all the others. For example, in tossing a coin there
are two mutually exclusive events, viz. turning up of a head and turning up of a tail, since
both these events cannot happen simultaneously. Note that events are compatible if it
is possible for them to happen simultaneously. For instance, in rolling two dice, the
cases of the face marked 5 appearing on one die and the face 5 appearing on the other are
compatible.

8.2.7 Exhaustive Events


Events are exhaustive when they include all the possibilities associated with the same
trial. In tossing a coin, the turning up of a head and of a tail are exhaustive events,
assuming of course that the coin cannot rest on its edge.

8.2.8 Independent Events


Two events are said to be independent if the occurrence of any event does not affect the
occurrence of the other event. For example in tossing of a coin, the events corresponding
to the two successive tosses of it are independent. The flip of one penny does not affect in
any way the flip of a nickel.

8.2.9 Dependent Events


If the occurrence or non-occurrence of one event affects the happening of the other, then
the events are said to be dependent events. For example, in drawing cards from a pack of
cards, let the event A be the occurrence of a king in the first draw and B be the occurrence
of a king in the second draw. If the card drawn at the first trial is not replaced, then the
events A and B are dependent events.
Note
(1) If an event contains a single sample point, i.e. it is a singleton set, then this event is
called an elementary or a simple event.
(2) An event corresponding to the empty set is an "impossible event."
(3) An event corresponding to the entire sample space is called a certain event.

8.2.10 Complementary Events


Let S be the sample space for an experiment and A be an event in S. Then A is a subset of
S. Hence the complement of A in S, written A', is also an event in S; it contains the
outcomes which are not favourable to the occurrence of A. If A occurs, then the outcome
of the experiment belongs to A; if A does not occur, then the outcome of the experiment
belongs to A'. Thus A ∪ A' = S.
It is obvious that A and A' are mutually exclusive: A ∩ A' = ∅.
If S contains n equally likely, mutually exclusive and exhaustive sample points, and A
contains m out of these n points, then A' contains (n − m) sample points.

8.3 Definitions
We shall now consider two definitions of probability :

8.3.1 Mathematical or a priori or classical.


8.3.2 Statistical or empirical.
8.3.1 Mathematical (or A Priori or Classical) Definition
If there are n exhaustive, mutually exclusive and equally likely cases, and m of them are
favourable to an event A, the probability of A happening is defined as the ratio m/n.
Expressed as a formula:
P(A) = m/n
This definition is due to Laplace. Thus probability is a concept which measures
numerically the degree of certainty or uncertainty of the occurrence of an event.
For example, the probability of randomly drawing a king from a well-shuffled deck of
cards is 4/52, since 4 is the number of favourable outcomes (the kings of diamonds,
spades, clubs and hearts) and 52 is the number of total outcomes (the number of cards in a
deck).
If A is any event of the sample space having probability p, then clearly p is a positive
number (expressed as a fraction or usually as a decimal) not greater than unity: 0 ≤ p ≤ 1,
i.e. from 0 (for an impossible event) to a high of 1 (certainty). Since the number of
cases not favourable to A is (n − m), the probability q that event A will not happen is
q = (n − m)/n, or q = 1 − m/n, or q = 1 − p.
Now note that the probability q is nothing but the probability of the
complementary event A'.
Thus p(A') = 1 − p, or p(A') = 1 − p(A),
so that p(A) + p(A') = 1, i.e. p + q = 1.

Relative Frequency Definition
The classical definition of probability has a disadvantage: the words 'equally likely' are
vague. In fact, since these words seem to be synonymous with 'equally probable', the
definition is circular, as it defines probability in terms of itself. Therefore, the estimated or
empirical probability of an event is taken as the relative frequency of the occurrence of
the event when the number of observations is very large.

8.3.2 Von Mises' Statistical (or Empirical) Definition
If trials are repeated a great number of times under essentially the same conditions, then
the limit of the ratio of the number of times that an event happens to the total
number of trials, as the number of trials increases indefinitely, is called the probability of
the happening of the event.
It is assumed that the limit exists, and is finite and unique. Symbolically,
p(A) = p = lim (m/n) as n → ∞,
where m is the number of times the event happens in n trials, provided the limit is finite
and unique.
The two definitions are apparently different, but both of them can be reconciled in
the same sense.
Example: Find the probability of getting a head in tossing a coin.
Solution: Experiment: tossing a coin
Sample space: S = {H, T}, n(S) = 2
Event A: getting a head
A = {H}, n(A) = 1
Therefore, p(A) = n(A)/n(S) = 1/2 or 0.5
Example: Find the probability of getting 3 or 5 in throwing a die.
Solution: Experiment: throwing a die
Sample space: S = {1, 2, 3, 4, 5, 6}, n(S) = 6
Event A: getting 3 or 5
A = {3, 5}, n(A) = 2
Therefore, p(A) = n(A)/n(S) = 2/6 = 1/3
Example: Two dice are rolled. Find the probability that the score on the second
die is greater than the score on the first die.
Solution: Experiment: two dice are rolled
Sample space: S = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3),
(2, 4), (2, 5), (2, 6), ..., (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
n(S) = 6 × 6 = 36
Event A: the score on the second die > the score on the first die,
i.e. A = {(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5),
(3, 6), (4, 5), (4, 6), (5, 6)}
n(A) = 15
Therefore, p(A) = 15/36 = 5/12
Example: A coin is tossed three times. Find the probability of getting at least one
head.
Solution: Experiment: a coin is tossed three times.
Sample space: S = {(HHH), (HHT), (HTH), (HTT), (THT), (TTH), (THH), (TTT)}
n(S) = 8
Event A: getting at least one head,
so that A': getting no head at all
A' = {(TTT)}, n(A') = 1
P(A') = 1/8
Therefore, P(A) = 1 − P(A') = 1 − 1/8 = 7/8

Example: A ball is drawn at random from a box containing 6 red balls, 4 white
balls and 5 blue balls. Determine the probability that the ball drawn is (i) red (ii) white
(iii) blue (iv) not red (v) red or white.
Solution: Let R, W and B denote the events of drawing a red ball, a white ball
and a blue ball respectively. The total number of balls is 6 + 4 + 5 = 15.
(i) P(R) = 6/15 = 2/5
(ii) P(W) = 4/15
(iii) P(B) = 5/15 = 1/3
(iv) P(not R) = 1 − 6/15 = 9/15 = 3/5
(v) P(R or W) = (6 + 4)/15 = 10/15 = 2/3
Note: The two events R and W are disjoint events.


Example: What is the chance that a leap year selected at random will contain 53 Sundays?
Solution: A leap year has 52 weeks and 2 more days.
The two days can be:
Monday-Tuesday, Tuesday-Wednesday, Wednesday-Thursday, Thursday-Friday,
Friday-Saturday, Saturday-Sunday, or Sunday-Monday.
There are 7 equally likely outcomes, and 2 of them (Saturday-Sunday and Sunday-Monday)
are favourable to a 53rd Sunday.
Thus, for 53 Sundays in a leap year, P(A) = 2/7 = 0.29 (approximately).
Example: If four ladies and six gentlemen sit for a photograph in a row at random,
what is the probability that no two ladies will sit together?
Solution: The ten persons can sit in a row in 10! ways. If no two ladies are to be together,
first arrange the six gentlemen (in 6! ways); the ladies then have 7 positions, 2 at the ends
and 5 between the gentlemen:
Arrangement: L, G1, L, G2, L, G3, L, G4, L, G5, L, G6, L
The four ladies can occupy these 7 positions in 7P4 = 7 × 6 × 5 × 4 = 840 ways.
Required probability = (6! × 840)/10! = (720 × 840)/3628800 = 1/6.

Example: In a class there are 13 students; 5 of them are boys and the rest are girls.
Find the probability that two students selected at random will both be girls.
Solution: Two students out of 13 can be selected in 13C2 = 78 ways, and two girls out
of 8 can be selected in 8C2 = 28 ways.
Therefore, required probability = 8C2 / 13C2 = 28/78 = 14/39.


Example: A box contains 5 white balls, 4 black balls and 3 red balls. Three balls
are drawn randomly. What is the probability that they will be (i) all white (ii) all black
(iii) all red?
Solution: Let W, B and R denote the events of drawing three white, three black and
three red balls respectively. Three balls out of 12 can be drawn in 12C3 = 220 ways.
(i) P(W) = 5C3 / 12C3 = 10/220 = 1/22
(ii) P(B) = 4C3 / 12C3 = 4/220 = 1/55
(iii) P(R) = 3C3 / 12C3 = 1/220

8.4 The Law of Probability


So far we have discussed probabilities of single events. In many situations we come
across two or more events occurring together. If A and B are two events, the event that
either A or B or both occur is denoted by A ∪ B (or A + B), and the event that both A and
B occur is denoted by A ∩ B (or AB). We term these situations compound events, or the
joint occurrence of events.
We may need the probability that A or B will happen, denoted by P(A ∪ B) or P(A + B).
Also, we may need the probability that A and B will both happen simultaneously, denoted
by P(A ∩ B) or P(AB).
Consider a situation: you are asked to choose any 3 or any diamond or both from a
well-shuffled pack of 52 cards, and you are interested in the probability of this event.
There are 4 threes and 13 diamonds, but the 3 of diamonds belongs to both groups, so the
number of cards which fulfil the condition 'any 3 or any diamond or both' is
4 + 13 − 1 = 16.
Thus the required probability = 16/52 = 4/13.
In the language of set theory, the set 'any 3 or any diamond or both' is the union of the set
'any 3', which contains 4 cards, and the set 'any diamond', which contains 13 cards. The
number of cards in their union is equal to the sum of these numbers minus the number of
cards in the space where they overlap. Any point in this space, called the intersection of
the two sets, would otherwise be counted twice (double counting), once in each set.
Dividing by 52 we get the required probability.
In general, if the letters A and B stand for any two events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Clearly, this formula also covers events A and B which are not mutually exclusive; when
they are mutually exclusive, P(A ∩ B) = 0.

Example: Two dice are rolled. Find the probability that the score is an even number or
a multiple of 3.
Solution: Two dice are rolled.
Sample space S = {(1, 1), (1, 2), ..., (6, 6)}
n(S) = 6 × 6 = 36
Event E: the score is an even number or a multiple of 3.
Note that here 'score' means the sum of the numbers on both dice when they land; for
example, (1, 1) has score 1 + 1 = 2.
Clearly the least score is 2 and the highest score is 6 + 6 = 12,
i.e. the possible scores are 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
Let event A: the score is an even number.
A = {(1, 1), (1, 3), (1, 5), (2, 2), (2, 4), (2, 6), (3, 1), (3, 3), (3, 5), (4, 2), (4, 4), (4, 6),
(5, 1), (5, 3), (5, 5), (6, 2), (6, 4), (6, 6)}
Therefore n(A) = 18.
Let event B: the score is a multiple of 3, i.e. 3, 6, 9 or 12.
B = {(1, 2), (2, 1), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1), (3, 6), (4, 5), (5, 4), (6, 3), (6, 6)}
n(B) = 12.
Event A ∩ B: the score is an even number and a multiple of 3 (i.e. common to both A and B).
A ∩ B = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1), (6, 6)}
n(A ∩ B) = 6.
Therefore P(E) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 18/36 + 12/36 − 6/36 = 24/36 = 2/3.

Multiplication Law of Probability
If there are two independent events, the respective probabilities of which are known, then
the probability that both will happen is the product of the probabilities of their happening
respectively: P(AB) = P(A) × P(B).
To compute the probability of two or even more independent events all occurring (joint
occurrence), extend the above law to the required number of events.
For example, first flip a penny, then a nickel and finally a dime. On landing, the
probability of heads is 1/2 for the penny, 1/2 for the nickel and 1/2 for the dime.
Thus the probability of landing three heads will be 1/2 × 1/2 × 1/2 = 1/8 or 0.125
(note that all three events are independent).

Example: Three machines I, II and III manufacture respectively 0.4, 0.5 and 0.1 of the
total production. The percentages of defective items produced by I, II and III are 2, 4 and
1 per cent respectively. For an item chosen at random, what is the probability that it is
defective?
Solution: P(defective) = 0.4 × 0.02 + 0.5 × 0.04 + 0.1 × 0.01
= 0.008 + 0.020 + 0.001 = 0.029
Example: In shuffling a pack of cards, 4 are accidentally dropped one after another. Find
the chance that the missing cards are one from each suit.
Solution: Let H, D, C and S denote heart, diamond, club and spade cards respectively.
Four cards can be chosen from 52 in 52C4 = 270725 ways, and one card from each of the
four suits can be chosen in 13 × 13 × 13 × 13 = 28561 ways.
Required probability = 28561/270725 ≈ 0.105.

Example: A problem in statistics is given to three students A, B and C, whose chances of
solving it are 1/2, 1/3 and 1/4 respectively. What is the probability that the problem will
be solved?
Solution: The probability that A can solve the problem = 1/2.
The probability that A cannot solve the problem = 1 − 1/2 = 1/2.
Similarly, the probabilities that B and C cannot solve the problem are 2/3 and 3/4
respectively.
The probability that none of them solves the problem = 1/2 × 2/3 × 3/4 = 1/4.
Therefore, the probability that the problem will be solved = 1 − 1/4 = 3/4.

Conditional Probability
In many situations you get more information than simply the total outcomes and
favourable outcomes you already have, and hence you are in a position to make more
informed judgements regarding the probabilities of such situations. For example, suppose
a card is drawn at random from a deck of 52 cards. Let B denote the event 'the card is a
diamond' and A denote the event 'the card is red'. We may then consider the following
probabilities.
Since there are 26 red cards, of which 13 are diamonds, the probability that the card is a
diamond, given that it is red, is 13/26 = 1/2. In other words, the probability of event B
knowing that A has occurred is 1/2.
The probability of B under the condition that A has occurred is known as conditional
probability, and it is denoted by P(B/A); thus P(B/A) = 13/26 = 1/2. It should be observed
that the probability of the event B is increased (from 13/52 = 1/4) due to the additional
information that the event A has occurred.
Conditional probability is found using the formula
P(B/A) = P(AB)/P(A)
Similarly, P(A/B) = P(AB)/P(B).
In both cases, if A and B are independent events, then P(A/B) = P(A) and P(B/A) = P(B).
Therefore P(A) = P(AB)/P(B), or P(B) = P(AB)/P(A), so that in either case
P(AB) = P(A) · P(B)

8.5 Importance of Probability


The theory of probability has its origin in the seventeenth century, in attempts to develop
a quantitative measure of probability for problems related to games of dice in gambling.
Later, the theory was applied by mathematicians to other problems pertaining to chance,
such as the tossing of a coin, the possibility of getting a card of a specific suit, or the
possibility of getting balls of a specific colour from a bag of balls. Nowadays the laws of
probability are used to solve economic and business problems, and even problems of our
day-to-day life.
The utility of probability can be judged from its various uses.
Following are the areas where probability theory has been used:
1. The fundamental laws of statistics, like the Law of Statistical Regularity and the Law of
Inertia of Large Numbers, are based on the theory of probability.
2. The various tests of significance, like the Z test, F test and Chi-square test, are derived
from the theory of probability.
3. The theory gives solutions to problems relating to games of chance.
4. The decision theories are based on the fundamental laws of probability.
5. The theory is generally used in economic and business decision making; it is very
useful in situations where risk and uncertainty prevail.
6. Subjective probability is widely used in those situations where actual measurement of
probability is not feasible. It has thus added a new dimension to the theory of probability.
These probabilities can be revised at a later stage on the basis of experience.

8.6 Practical Problems:


Illustration: A single letter is selected at random from the word PROBABILITY. What
is the probability that it is a vowel?
Solution:
Total number of letters in the word PROBABILITY = n = 11
Number of favourable cases = m = 4 (the vowels are o, a, i, i)
We know that
P(A) = m/n = 4/11

Illustration:
Find the probability of having at least one son in a family if there are two children in a
family on an average.
Solution:
Two children in a family may be either:
(1) both sons,
or (2) son and daughter,
or (3) daughter and son,
or (4) both daughters.
Thus, total number of equally likely cases = n = 4.
'At least one son' implies that a family may have one son or two sons.
Thus, favourable number of cases = m = 3 (i.e., options 1, 2 and 3).
P(A) = m/n = 3/4

Illustration: Find the chance of getting an ace in a draw from a pack of 52 cards.
Solution:
Total number of cards = n = 52
Number of favourable cases = m = 4 (number of aces)
P(A) = m/n = 4/52 = 1/13
Illustration: Suppose an ideal die is tossed twice. What is the probability of getting a sum
of 10 in the two tosses?
Solution:
A die can land the first time in 6 ways.
A die can land the second time in 6 ways.
A die can be tossed twice in 6 × 6 = 36 ways (as per the rule of counting).
The number of ways in which the two tosses can give a sum of 10 is m = 3
(i.e., 4 + 6, 5 + 5 and 6 + 4).
P(A) = m/n = 3/36 = 1/12

Classical Probability
Classical Definition of Probability
The classical definition of probability is the proportion of times that an event will occur,
assuming that all outcomes in a sample space are equally likely to occur. The probability
of an event is determined by counting the number of outcomes in the sample space that
satisfy the event and dividing by the total number of outcomes in the sample space. The
probability of an event A is
P(A) = NA/N
Where NA is the number of outcomes that satisfy the condition of event A and N is the
total number of outcomes in the sample space. The important idea here is that one can
develop a probability from fundamental reasoning about the process.
Example:
In a pack of cards, we have N = 52 equally likely outcomes. We have to determine the
probability that a card drawn is a King, that it is a Queen, and that it is not a King.
Solution:
Probability of being a King = 4/52
= 1/13

Probability of being Queen = 4/52


=1/13
Probability that card is not a King = (52-4)/52
= 48/52
= 12/13
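The same counting argument can be checked in a few lines of Python; the fractions module keeps the ratios exact. This sketch is an illustration added here, not part of the original exercise set.

from fractions import Fraction

N = 52                            # total equally likely outcomes (cards)
kings, queens = 4, 4

p_king = Fraction(kings, N)       # 4/52 = 1/13
p_queen = Fraction(queens, N)     # 4/52 = 1/13
p_not_king = 1 - p_king           # complement rule: 48/52 = 12/13

print(p_king, p_queen, p_not_king)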

Probability Rules
Complement Rule
Let A be an event and A' its complement. Then the complement rule is:
P(A') = 1 − P(A)
The Addition Rule of Probabilities
Let A and B be two events. The probability of their union is
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Conditional Probability
Let A and B be two events. The conditional probability of event A, given that event B has
occurred, is denoted by the symbol P(A|B) and is found to be:
P(A|B) = P(A ∩ B)/P(B)
The Multiplication Rule of Probabilities
Let A and B be two events. The probability of their intersection can be derived from
conditional probability as
P(A ∩ B) = P(A|B) P(B)
Statistical Independence
Let A and B be two events. These events are said to be statistically independent if and
only if
P(A ∩ B) = P(A) P(B)
From the multiplication rule it also follows that
P(A|B) = P(A) (if P(B) > 0)
More generally, the events E1, E2, ..., EK are mutually statistically independent if and
only if
P(E1 ∩ E2 ∩ ... ∩ EK) = P(E1) P(E2) ... P(EK)
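These rules can be verified by brute-force counting over a small sample space. The sketch below (an added illustration; the event choices are our own) checks the addition rule on the two-dice experiment used earlier, with A = 'even score' and B = 'score a multiple of 3'.

from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    return Fraction(len(event), len(space))

A = {o for o in space if sum(o) % 2 == 0}      # even score
B = {o for o in space if sum(o) % 3 == 0}      # score a multiple of 3

# Addition rule: P(A u B) = P(A) + P(B) - P(A n B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
print(prob(A), prob(B), prob(A & B), prob(A | B))   # 1/2 1/3 1/6 2/3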

Probability Distribution
A probability distribution is related to a frequency distribution: it is like a theoretical
frequency distribution, i.e. a distribution that describes how outcomes are expected to
vary. Because these distributions deal with expectations, they are useful models in making
inferences and decisions under conditions of uncertainty.

To understand probability distributions, it is important to understand variables, random
variables, and some notation. A variable is a symbol (A, B, x, y, etc.) that can take on any
of a specified set of values.

When the value of a variable is the outcome of a statistical experiment, that


variable is a random variable.

Generally, statisticians use a capital letter to represent a random variable and a lower-case
letter, to represent one of its values. For example,

X represents the random variable X.


P(X) represents the probability of X.
P(X = x) refers to the probability that the random variable X is equal to a
particular value, denoted by x. As an example, P(X = 1) refers to the probability
that the random variable X is equal to 1.

The relationship between random variables and probability distributions can be easily
understood by example. Suppose you flip a coin two times. This simple statistical
experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let the variable
X represent the number of Heads that result from this experiment. The variable X can
take on the values 0, 1, or 2. In this example, X is a random variable; because its value is
determined by the outcome of a statistical experiment.
A probability distribution is a table or an equation that links each outcome of a
statistical experiment with its probability of occurrence. Consider the coin flip
experiment described above. The table below, which associates each outcome
with its probability, is an example of a probability distribution.
Number of Heads    Probability
0                  0.25
1                  0.50
2                  0.25

The above table represents the probability distribution of the random variable X.
Cumulative Probability Distributions
A cumulative probability refers to the probability that the value of a random variable
falls within a specified range.
Let us return to the coin flip experiment. If we flip a coin two times, we might ask: what
is the probability that the coin flips result in one or fewer heads? The answer would be a
cumulative probability: the probability that the coin flip experiment results in zero heads
plus the probability that the experiment results in one head.
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.50 = 0.75
Like a probability distribution, a cumulative probability distribution can be represented
by a table or an equation. In the table below, the cumulative probability refers to the
probability that the random variable X is less than or equal to x.

Number of heads: x    Probability: P(X = x)    Cumulative Probability: P(X ≤ x)
0                     0.25                     0.25
1                     0.50                     0.75
2                     0.25                     1.00

Example:
Suppose a die is tossed. What is the probability that the die will land on 6?
Solution: When a die is tossed, there are 6 possible outcomes, represented by S = {1, 2,
3, 4, 5, 6}. Each possible outcome is a value of the random variable (X), and each
outcome is equally likely to occur. Thus, we have a uniform distribution. Therefore,
P(X = 6) = 1/6.
Example:
Suppose we repeat the die-tossing experiment described in the previous example. This
time, we ask: what is the probability that the die will land on a number that is smaller
than 5?
Solution: When a die is tossed, there are 6 possible outcomes represented by: S = { 1, 2,
3, 4, 5, 6 }. Each possible outcome is equally likely to occur. Thus, we have a uniform
distribution.
This problem involves a cumulative probability. The probability that the die will land on
a number smaller than 5 is equal to:
P( X < 5 ) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
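Both the probability distribution and the cumulative distribution of the coin-flip experiment above can be built by simple enumeration; a minimal Python sketch:

from itertools import product

# Two coin flips; X = number of heads.
outcomes = list(product("HT", repeat=2))        # HH, HT, TH, TT
pmf = {}
for o in outcomes:
    x = o.count("H")
    pmf[x] = pmf.get(x, 0) + 1 / len(outcomes)

# Cumulative probabilities P(X <= x).
cdf, running = {}, 0.0
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

print(pmf)   # {0: 0.25, 1: 0.5, 2: 0.25}
print(cdf)   # {0: 0.25, 1: 0.75, 2: 1.0}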
Discrete and Continuous Probability Distributions
If a variable can take on any value between two specified values, it is called a continuous
variable; otherwise, it is called a discrete variable.
Some examples will clarify the difference between discrete and continuous variables.

Suppose the fire department mandates that all fire fighters must weigh between
150 and 250 pounds. The weight of a fire fighter would be an example of a
continuous variable; since a fire fighter's weight could take on any value between
150 and 250 pounds.

Suppose we flip a coin repeatedly and count the number of heads. The number of heads
could be any integer value between 0 and plus infinity. However, it could not be just any
number between 0 and plus infinity: we could not, for example, get 2.5 heads. Therefore,
the number of heads must be a discrete variable.

Just like variables, probability distributions can be classified as discrete or continuous.


Discrete Probability Distributions
If a random variable is a discrete variable, its probability distribution is called a discrete
probability distribution.
An example will make this clear. Suppose you flip a coin two times. This simple
statistical experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let
the random variable X represent the number of Heads that result from this experiment.
The random variable X can only take on the values 0, 1, or 2, so it is a discrete random
variable.
The probability distribution for this statistical experiment appears below.
Number of heads    Probability
0                  0.25
1                  0.50
2                  0.25

The above table represents a discrete probability distribution because it relates each value
of a discrete random variable with its probability of occurrence. In subsequent lessons,
we will cover the following discrete probability distributions.

Binomial probability distribution


Hypergeometric probability distribution
Multinomial probability distribution
Poisson probability distribution

Note: With a discrete probability distribution, each possible value of the discrete random
variable can be associated with a non-zero probability. Thus, a discrete probability
distribution can always be presented in tabular form.
Continuous Probability Distributions
If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
A continuous probability distribution differs from a discrete probability distribution in
several ways.

The probability that a continuous random variable will assume a particular value
is zero.
As a result, a continuous probability distribution cannot be expressed in tabular
form.
Instead, an equation or formula is used to describe a continuous probability
distribution.

Most often, the equation used to describe a continuous probability distribution is called a
probability density function. Sometimes, it is referred to as a density function, a PDF,
or a pdf. For a continuous probability distribution, the density function has the following
properties:

Since the continuous random variable is defined over a continuous range of values
(called the domain of the variable), the graph of the density function will also be
continuous over that range.
The area bounded by the curve of the density function and the x-axis is equal to 1,
when computed over the domain of the variable.
The probability that a random variable assumes a value between a and b is equal
to the area under the density function bounded by a and b.

For example, consider the probability density function shown in the graph below.
Suppose we wanted to know the probability that the random variable X was less than or
equal to a. The probability that X is less than or equal to a is equal to the area under the
curve bounded by a and minus infinity as indicated by the shaded area.

Note: The shaded area in the graph represents the probability that the random variable X
is less than or equal to a. This is a cumulative probability. However, the probability that
X is exactly equal to a would be zero. A continuous random variable can take on an
infinite number of values. The probability that it will equal a specific value (such as a) is
always zero.
Later we will discuss the following continuous distributions:

Normal probability distribution


t distribution
Chi-square distribution
F distribution

Binomial Distribution
To understand binomial distributions and binomial probability, it helps to understand
binomial experiments and some associated notation; so we cover those topics first.
Binomial Experiment
A binomial experiment (a sequence of Bernoulli trials) is a statistical experiment that
has the following properties:

The experiment consists of n repeated trials.


Each trial can result in just two possible outcomes. We call one of these outcomes
a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the
outcome on other trials.

Consider the following statistical experiment. You flip a coin 2 times and count the
number of times the coin lands on heads. This is a binomial experiment because:

The experiment consists of repeated trials. We flip a coin 2 times.


Each trial can result in just two possible outcomes - heads or tails.
The probability of success is constant - 0.5 on every trial.
The trials are independent; that is, getting heads on one trial does not affect
whether we get heads on other trials.

Notation
The following notation is helpful, when we talk about binomial probability.

x: The number of successes that result from the binomial experiment.


n: The number of trials in the binomial experiment.
P: The probability of success on an individual trial.
Q: The probability of failure on an individual trial. (This is equal to 1 - P.)
b(x; n, P): Binomial probability - the probability that an n-trial binomial
experiment results in exactly x successes, when the probability of success on an
individual trial is P.
nCr: The number of combinations of n things, taken r at a time.

Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a
binomial experiment. The probability distribution of a binomial random variable is called
a binomial distribution (its single-trial case, with n = 1, is the Bernoulli distribution).

Suppose we flip a coin two times and count the number of heads (successes). The
binomial random variable is the number of heads, which can take on values of 0, 1, or 2.
The binomial distribution is presented below.
Number of heads    Probability
0                  0.25
1                  0.50
2                  0.25

The binomial distribution has the following properties:

The mean of the distribution is μ = n · P.
The variance is σ² = n · P · (1 − P).
The standard deviation is σ = sqrt[ n · P · (1 − P) ].

Binomial Probability
The binomial probability refers to the probability that a binomial experiment results in
exactly x successes. For example, in the above table, we see that the binomial probability
of getting exactly one head in two coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the following
formula:
Binomial Formula: Suppose a binomial experiment consists of n trials and results in x
successes. If the probability of success on an individual trial is P, then the binomial
probability is:
b(x; n, P) = nCx · P^x · (1 − P)^(n − x)
Example

Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the
number of successes is equal to 2, and the probability of success on a single trial is 1/6,
or about 0.167. Therefore, the binomial probability is:
b(2; 5, 0.167) = 5C2 · (0.167)^2 · (0.833)^3
b(2; 5, 0.167) = 0.161
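The binomial formula maps directly onto code. A minimal sketch using only Python's standard library (math.comb supplies nCx):

from math import comb

def binom_pmf(x, n, p):
    # b(x; n, P) = nCx * P^x * (1 - P)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of exactly 2 fours in 5 tosses of a die.
print(round(binom_pmf(2, 5, 1/6), 3))   # about 0.161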
Cumulative Binomial Probability
A cumulative binomial probability refers to the probability that the binomial random
variable falls within a specified range (e.g., is greater than or equal to a stated lower limit
and less than or equal to a stated upper limit).

For example, we might be interested in the cumulative binomial probability of obtaining
45 or fewer heads in 100 tosses of a coin. This would be the sum of all the individual
binomial probabilities:
b(x ≤ 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + ... + b(x = 44; 100, 0.5) +
b(x = 45; 100, 0.5)
Example
The probability that a student is accepted to a prestigious college is 0.3. If 5 students
from the same school apply, what is the probability that at most 2 are accepted?
Solution: To solve this problem, we compute 3 individual probabilities, using the
binomial formula. The sum of all these probabilities is the answer we seek. Thus,
b(x ≤ 2; 5, 0.3) = b(x = 0; 5, 0.3) + b(x = 1; 5, 0.3) + b(x = 2; 5, 0.3)
b(x ≤ 2; 5, 0.3) = 0.1681 + 0.3601 + 0.3087
b(x ≤ 2; 5, 0.3) = 0.8369
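A cumulative binomial probability is just a sum of individual terms; a short sketch reproducing the college-admission example (reusing the binom_pmf helper defined earlier):

from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# P(at most 2 of 5 students accepted), with p = 0.3.
p_at_most_2 = sum(binom_pmf(x, 5, 0.3) for x in range(3))
print(round(p_at_most_2, 4))   # about 0.8369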
Example
What is the probability that the World Series will last 4 games? 5 games? 6 games? 7
games? Assume that the teams are evenly matched.
Solution:
This is a very tricky application of the binomial distribution. If you can follow the logic
of this solution, you have a good understanding of the material covered in the tutorial, to
this point.
In the world series, there are two baseball teams. The series ends when the winning team
wins 4 games. Therefore, we define a success as a win by the team that ultimately
becomes the world series champion.
For the purpose of this analysis, we assume that the teams are evenly matched. Therefore,
the probability that a particular team wins a particular game is 0.5.
Let's look first at the simplest case. What is the probability that the series lasts only 4
games? This can occur if one team wins the first 4 games. The probability of the National
League team winning 4 games in a row is:
b(4; 4, 0.5) = 4C4 · (0.5)^4 · (0.5)^0 = 0.0625
Similarly, when we compute the probability of the American League team winning 4
games in a row, we find that it is also 0.0625. Therefore, probability that the series ends
in four games would be 0.0625 + 0.0625 = 0.125; since the series would end if either the
American or National League team won 4 games in a row.

Now let's tackle the question of finding probability that the world series ends in 5 games.
The trick in finding this solution is to recognize that the series can only end in 5 games, if
one team has won 3 out of the first 4 games. So let's first find the probability that the
American League team wins exactly 3 of the first 4 games.
b(3; 4, 0.5) = 4C3 · (0.5)^3 · (0.5)^1 = 0.25
Okay, here comes some more tricky stuff, so listen up. Given that the American League
team has won 3 of the first 4 games, the American League team has a 50/50 chance of
winning the fifth game to end the series. Therefore, the probability of the American
League team winning the series in 5 games is 0.25 * 0.50 = 0.125. Since the National
League team could also win the series in 5 games, the probability that the series ends in 5
games would be 0.125 + 0.125 = 0.25.
The rest of the problem would be solved in the same way. You should find that the
probability of the series ending in 6 games is 0.3125; and the probability of the series
ending in 7 games is also 0.3125.
Normal Distribution
The normal distribution refers to a family of continuous probability distributions
described by the normal equation.
The Normal Equation
The normal distribution is defined by the following equation:
Y = [1/(σ · sqrt(2π))] · e^(−(x − μ)² / (2σ²))
where X is a normal random variable, μ is the mean, σ is the standard deviation, π is
approximately 3.14159, and e is approximately 2.71828.
The random variable X in the normal equation is called the normal random variable.
The normal equation is the probability density function for the normal distribution.
The Normal Curve
The graph of the normal distribution depends on two factors - the mean and the standard
deviation. The mean of the distribution determines the location of the center of the graph,
and the standard deviation determines the height and width of the graph. When the
standard deviation is large, the curve is short and wide; when the standard deviation is

small, the curve is tall and narrow. All normal distributions look like a symmetric, bell-shaped curve, as shown below.

The curve on the left is shorter and wider than the curve on the right, because the curve
on the left has a bigger standard deviation.
Probability and the Normal Curve
The normal distribution is a continuous probability distribution. This has several
implications for probability.

The total area under the normal curve is equal to 1.


The probability that a normal random variable X equals any particular value is 0.
The probability that X is greater than a equals the area under the normal curve
bounded by a and plus infinity (as indicated by the non-shaded area in the figure
below).
The probability that X is less than a equals the area under the normal curve
bounded by a and minus infinity (as indicated by the shaded area in the figure
below).

Additionally, every normal curve (regardless of its mean or standard deviation) conforms
to the following "rule".

About 68% of the area under the curve falls within 1 standard deviation of the
mean.
About 95% of the area under the curve falls within 2 standard deviations of the
mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the
mean.

Collectively, these points are known as the empirical rule or the 68-95-99.7 rule.
Clearly, given a normal distribution, most outcomes will be within 3 standard deviations
of the mean.
Example:
An average light bulb manufactured by the Acme Corporation lasts 300 days with a
standard deviation of 50 days. Assuming that bulb life is normally distributed, what is the
probability that an Acme light bulb will last at most 365 days?
Solution: Given a mean score of 300 days and a standard deviation of 50 days, we want
to find the cumulative probability that bulb life is less than or equal to 365 days. Thus, we
know the following:

The value of the normal random variable is 365 days.


The mean is equal to 300 days.
The standard deviation is equal to 50 days.

We enter these values into the Normal Distribution Calculator and compute the
cumulative probability. The answer is: P( X < 365) = 0.90. Hence, there is a 90% chance
that a light bulb will burn out within 365 days.
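If a normal distribution calculator or table is not at hand, the cumulative normal probability can be computed from the error function in Python's standard library; a minimal sketch reproducing this example:

from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    # P(X <= x) for X ~ Normal(mu, sigma)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# P(bulb life <= 365 days) with mean 300 days and standard deviation 50 days.
print(round(normal_cdf(365, 300, 50), 2))   # about 0.90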
Example:
Suppose scores on an IQ test are normally distributed. If the test has a mean of 100 and a
standard deviation of 10, what is the probability that a person who takes the test will
score between 90 and 110?
Solution: Here, we want to know the probability that the test score falls between 90 and
110. The "trick" to solving this problem is to realize the following:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )
We use the Normal Distribution Calculator to compute both probabilities on the right side
of the above equation.

To compute P( X < 110 ), we enter the following inputs into the calculator: The
value of the normal random variable is 110, the mean is 100, and the standard
deviation is 10. We find that P( X < 110 ) is 0.84.
To compute P( X < 90 ), we enter the following inputs into the calculator: The
value of the normal random variable is 90, the mean is 100, and the standard
deviation is 10. We find that P( X < 90 ) is 0.16.

We use these findings to compute our final answer as follows:


P( 90 < X < 110 ) = P( X < 110 ) − P( X < 90 )
P( 90 < X < 110 ) = 0.84 − 0.16
P( 90 < X < 110 ) = 0.68
Thus, about 68% of the test scores will fall between 90 and 110.
Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution. It is the
distribution that occurs when a normal random variable has a mean of zero and a standard
deviation of one.
The normal random variable of a standard normal distribution is called a standard score
or a z-score. Every normal random variable X can be transformed into a z score via the
following equation:
z = (X − μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard
deviation of X.
Standard Normal Distribution Table
A standard normal distribution table shows a cumulative probability associated with a
particular z-score. Table rows show the whole number and tenths place of the z-score.
Table columns show the hundredths place. The cumulative probability (often from minus
infinity to the z-score) appears in the cell of the table.
For example, a section of the standard normal table is reproduced below. To find the
cumulative probability of a z-score equal to -1.31, cross-reference the row of the table
containing -1.3 with the column containing 0.01. The table shows that the probability that
a standard normal random variable will be less than -1.31 is 0.0951; that is, P(Z < -1.31)
= 0.0951.
z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
...
-1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0722  0.0708  0.0694  0.0681
-1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
...
3.0    0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
Of course, you may not be interested in the probability that a standard normal random
variable falls between minus infinity and a given value. You may want to know the
probability that it lies between a given value and plus infinity. Or you may want to know
the probability that a standard normal random variable lies between two given values.
These probabilities are easy to compute from a normal distribution table. Here's how.

Find P(Z > a). The probability that a standard normal random variable (Z) is
greater than a given value (a) is easy to find. The table shows P(Z < a), and
P(Z > a) = 1 − P(Z < a).
Suppose, for example, that we want to know the probability that a z-score will be
greater than 3.00. From the table (see above), we find that P(Z < 3.00) = 0.9987.
Therefore, P(Z > 3.00) = 1 − P(Z < 3.00) = 1 − 0.9987 = 0.0013.

Find P(a < Z < b). The probability that a standard normal random variable lies
between two values is also easy to find. The P(a < Z < b) = P(Z < b) - P(Z < a).
For example, suppose we want to know the probability that a z-score will be
greater than -1.40 and less than -1.20. From the table (see above), we find that
P(Z < -1.20) = 0.1151; and P(Z < -1.40) = 0.0808. Therefore, P(-1.40 < Z < -1.20)
= P(Z < -1.20) - P(Z < -1.40) = 0.1151 - 0.0808 = 0.0343.

In school or on the Advanced Placement Statistics Exam, you may be called upon to use
or interpret standard normal distribution tables. Standard normal tables are commonly
found in appendices of most statistics texts.
The Normal Distribution as a Model for Measurements
Often, phenomena in the real world follow a normal (or near-normal) distribution. This
allows researchers to use the normal distribution as a model for assessing probabilities
associated with real-world phenomena. Typically, the analysis involves two steps.

Transform raw data. Usually, the raw data are not in the form of z-scores. They
need to be transformed into z-scores, using the transformation equation presented
earlier: z = (X − μ) / σ.

Find probability. Once the data have been transformed into z-scores, you can use
standard normal distribution tables, online calculators (e.g., Stat Trek's free
normal distribution calculator), or handheld graphing calculators to find
probabilities associated with the z-scores.

Example: Mr. X earned a score of 940 on a national achievement test. The mean test
score was 850 with a standard deviation of 100. What proportion of students had a higher
score than Mr. X? (Assume that test scores are normally distributed.)
(A) 0.10
(B) 0.18
(C) 0.50
(D) 0.82
(E) 0.90
Solution:
The correct answer is B. As part of the solution to this problem, we assume that test
scores are normally distributed. In this way, we use the normal distribution as a model for
measurement. Given an assumption of normality, the solution involves three steps.

First, we transform Mr. X's test score into a z-score, using the z-score
transformation equation.
z = (X − μ) / σ = (940 − 850) / 100 = 0.90

Then from the standard normal distribution table, we find the cumulative
probability associated with the z-score. In this case, we find P(Z < 0.90) = 0.8159.

Therefore, P(Z > 0.90) = 1 − P(Z < 0.90) = 1 − 0.8159 = 0.1841.

Thus, we estimate that 18.41 percent of the students tested had a higher score than Mr. X.
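The same three steps can be carried out in code, again using the error-function form of the normal CDF as a stand-in for the printed table:

from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma, score = 850, 100, 940
z = (score - mu) / sigma                       # 0.90
p_higher = 1 - normal_cdf(score, mu, sigma)    # right-tail probability
print(z, round(p_higher, 4))                   # 0.9, about 0.1841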

Chapter Eight
Probability
End Chapter Quizzes
1. The outcome of tossing a coin is a
a- simple event
b- mutually exclusive event
c- complementary event
d- compound event

2. Classical probability is measured in terms of
a- an absolute value
b- a ratio
c- absolute value and ratio both
d- none of the above

3. Probability is expressed as
a- ratio
b- proportion
c- percentage
d- all the above

4. Classical probability is also known as
a- Laplace's probability
b- mathematical probability
c- a priori probability
d- all the above

5. Each outcome of a random experiment is called
a- primary event
b- compound event
c- derived event
d- all the above

6. The definition of statistical probability was originally given by
a- De Moivre
b- Laplace
c- Von-Mises
d- Pascal

7. The definition of a priori probability was originally given by
a- De Moivre
b- Laplace
c- Von-Mises
d- Feller

8. Probability by the classical approach has
a- no lacuna
b- only one lacuna
c- only two lacunae
d- many lacunae

9. An event consisting of those elements which are not in A is called
a- primary event
b- derived event
c- simple event
d- complementary event

10. The probability of the intersection of two mutually exclusive events is always
a- infinity
b- zero
c- one
d- none of the above
Chapter 9
Sampling Design
Introduction
In this lesson, we shall describe the basic thing, how to collect data. We shall also discuss
a variety of methods of selecting the sample called Sampling Designs, which can be used
to generate our sample data sets.
A population is commonly understood to be a natural, geographical, or political collection
of people, animals, plants, or objects. Some statisticians use the word in the more
restricted sense of the set of measurements of some attribute of such a collection; thus
they might speak of the population of heights of male college students. Or they might
use the word to designate a set of categories of some attribute of a collection, for
example, the population of religious affiliations of U.S. government employees.
In statistical discussions, we often refer to the physical collection of interest as well as to
the collection of measurements or categories derived from the physical collection. In
order to clarify which type of collection is being discussed, in this book we use the term
population as it is used by the research scientist: The population is the physical
collection. The derived set of measurements or categories is called the set of values of the
variable of interest. Thus, in the first example above, we speak of the set of all values of
the variable height for the population of male college students.
After we have defined the population and the appropriate variable, we usually find it
impractical, if not impossible, to observe all the values of the variable. For example, all
the values of the variable miles per gallon in city driving for this year's model of a certain
type of car could not be obtained since some of the cars probably are yet to be produced.
Even if they did exist, the task of obtaining a measurement from each car is not feasible.
In another example, the values of the variable condition of all packaged bandages (sterile
or contaminated) produced on a particular day by a certain firm could be obtained, but
this is not desirable since the bandages would be made useless in the process of testing.
Instead, we consider a sample (a portion of the population), obtain measurements or
observations from this sample (the sample data), and then use statistics to make an
inference about the entire set of values. To carry out this inference, the sample must be
random.

Need for Sampling


Sampling is used in practice for a variety of reasons such as:
1. Sampling can save time and money. A sample study is usually less expensive than
a census study and produces results at a relatively faster speed.
2. Sampling may enable more accurate measurements, as a sample study is
generally conducted by trained and experienced investigators.
3. Sampling remains the only way when population contains infinitely many
members.
4. Sampling remains the only choice when a test involves the destruction of the
items under study.
5. Sampling usually enables us to estimate the sampling errors and, thus, assists in
obtaining information concerning some characteristic of the population.

Concept of Population and Sample


Statisticians commonly separate statistical techniques into two broad categories: Descriptive and Inferential.
The Descriptive Statistics deals with collecting, summarizing and simplifying the
complicated data. It also helps in understanding the data and report making.
The Inferential Statistics deals with methods used for drawing inferences about the
totality of observations on the basis of knowledge gained from a sample.
Population is roughly defined as the collection of all elements under consideration and
about which conclusions have to be drawn. For example: If a study is being conducted to
determine the average salary of the workers of a factory, then the population will consist
of all workers in the factory. Similarly, if we investigate the fertility of land in a region,
then the population will consist of all lands under cultivation. Thus, population refers to
all items under investigation.
Sample can be defined as a collection of some elements of the population. In other words,
it is a part of the totality on which information is collected and analyzed for the purpose
of understanding any aspect of the population. The part of the population taken into
consideration is called a Sampling Unit. For example: A doctor examines a few drops of
blood to draw conclusions about the nature of disease or blood constitution of the whole
body.
If a sampling unit consists of a single element of the population, it may be viewed as an
Elementary Sampling Unit.

For example: in the textile industry, the workers of one department may constitute a
sample, while all the workers of the company are considered the population.
The total number of units in the population is known as population size.
The total number of units in the sample is known as sample size.
Any characteristic of population is called parameter and that of sample is called statistic.

Sampling Frame
To select a random sample of sampling units, we need a list of all sampling units
contained in the population. Such a list is called a Sampling Frame.

Census and Sample Survey


If we want to calculate the average wage of a person working in a factory, it is possible
to examine every person in the population; each element of the population is then a
primary sampling unit, and such a complete enumeration is called a CENSUS.
The census method is not very popular in practice, since the effort, money and time
required for carrying out a complete enumeration are generally extremely large and, in
many cases, involve huge cost.
The standard deviation of a sampling distribution is called the standard error; the larger
the sample size, the lower the standard error will be. Sampling distributions and the
standard error are discussed in detail later in this chapter.
For the process of statistical inference to be valid we must ensure that we take a
representative sample of our population. Whatever method of sample selection we use, it
is vital that the method is described. How do we know if the characteristics of a sample
we take match the characteristics of the population we are sampling? The short answer is
we dont. We can, however, take steps that make it as likely as possible that the sample
will be representative of the population. Two simple and effective methods of doing this
are making sure that the sample size is large and making sure it is randomly selected. A
large sample size is more likely to be representative of a population than a small one.
Think of extreme cases. If we want to know the average height of the population and we
select just one person and measure their height, it is unlikely to be close to the population
average. If we took 1,000,000 people, measured their heights and took the average, this
figure would be likely to be close to the population average.

Types of Sampling
The type of enquiry you want to conduct and the nature of the data that you want to collect
fundamentally determine the technique or method of selecting a sample.
The procedure of selecting a sample may be broadly classified under the following three
heads:
Non-Probability Sampling Methods
Probability Sampling
Mixed Sampling
Now let us discuss these in detail. We will start with the non-probability sampling then
we will move on to probability sampling.
Non-Probability Sampling Methods: The common feature in non-probability sampling
methods is that subjective judgments are used to determine which elements of the
population are contained in the sample. We classify non-probability sampling into four groups:
1. Convenience Sampling
2. Judgement Sampling
3. Quota Sampling
4. Snowball sampling
Convenience Sampling
This type of sampling is used primarily for reasons of convenience.
It is used for exploratory research and speedy situations.
It is often used for new product formulations or to provide gross-sensory
evaluations by using employees, students, peers, etc.
Convenience sampling is extensively used in marketing studies
This would be clear from the following examples:
1. Suppose a marketing research study aims at estimating the proportion of Pan (Beetle
leaf) shops in Delhi, which store a particular drink Maaza. It is decided to take a sample
of size 150. What the investigator does is to visit 150 Pan shops near his place of office
as it is very convenient to him and observe whether a Pan shop stores Maaza or not. This
is definitely not a representative sample, as most Pan shops in Delhi have no chance of
being selected; only those Pan shops near the office of the investigator have a chance of
being selected.

2. A ball pen manufacturing company is interested in knowing the opinions about the ball
pen (like smooth flow of ink, resistance to breakage of the cover etc.) it is presently
manufacturing, with a view to modifying it to suit customers' needs. The job is given to a
marketing researcher who visits a college near his place of
residence and asks a few students (a convenient sample) their opinion about the ball
pen in question.
Judgement Sampling
It is that sample in which the selection criteria are based upon the researcher's
personal judgment that the members of the sample are representative of the
population under study.
It is used for most test markets and many product tests conducted in shopping
malls. If personal biases are avoided, then the relevant experience and the
acquaintance of the investigator with the population may help to choose a
relatively representative sample from the population. It is not possible to make an
estimate of sampling error as we cannot determine how precise our sample
estimates are.
Judgement sampling is used in a number of cases, some of which are:
1. Suppose we have a panel of experts to decide about the launching of a new product in
the next year. If for some reason or the other, a member drops out, from the panel, the
chairman of the panel may suggest the name of another person whom he thinks has the
same expertise and experience to be a member of the said panel. This new member was
chosen deliberately - a case of Judgment sampling.
2. The method could be used in a study involving the performance of salesmen. The
salesmen could be grouped into top-grade and low-grade performer according to certain
specified qualities. Having done so, the sales manager may indicate who in his opinion,
would fall into which category. Needless to mention this is a biased method. However in
the absence of any objective data, one might have to resort to this type of sampling.
Quota Sampling
This is a very commonly used sampling method in marketing research studies. Here the
sample is selected on the basis of certain basic parameters such as age, sex, income and
occupation that describe the nature of a population, so as to make it representative of the
population. The Investigators or field workers are instructed to choose a sample that
conforms to these parameters. The field workers are assigned quotas of the number of
units satisfying the required characteristics on which data should be collected. However,
before collecting data on these units, the investigators are supposed to verify that the
units qualify these characteristics. Suppose we are conducting a survey to study the

buying behavior of a product and it is believed that the buying behavior is greatly
influenced by the income level of the consumers. We assume that it is possible to divide
our population into three income strata such as high-income group, middle-income group
and low-income group. Further it is known that 20% of the population is in high income
group, 35% in the middle-income group and 45% in the low-income group. Suppose it is
decided to select a sample of size 200 from the population. Therefore, samples of size 40,
70 and 90 should come from the high income, middle income and low income groups
respectively. Now the various field workers are assigned quotas to select the sample from
each group in such a way that a total sample of 200 is selected in the same proportion as
mentioned above.
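The allocation arithmetic above can be checked with a short computation. The following
Python sketch simply reproduces the quotas of 40, 70 and 90 from the assumed group
proportions in the example:

    total_sample = 200
    proportions = {"high income": 0.20, "middle income": 0.35, "low income": 0.45}
    # quota for each group = total sample size x proportion of the group
    quotas = {group: round(total_sample * p) for group, p in proportions.items()}
    print(quotas)  # {'high income': 40, 'middle income': 70, 'low income': 90}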
Snowball Sampling
The sampling in which the selection of additional respondents (after the first small
group of respondents is selected) is based upon referrals from the initial set of
respondents.
It is used to sample low incidence or rare populations
It is done for the efficiency of finding the additional, hard-to-find members of the
sample.
Advantages of Non-probability Sampling
It is much cheaper than probability sampling.
It is acceptable when the level of accuracy of the research results is not of utmost
importance.
Less research time is required than probability samples.
It often produces samples quite similar to the population of interest when conducted
properly.
Disadvantages of Non-probability Sampling
You cannot calculate Sampling error. Thus, the minimum required sample size cannot
be calculated which suggests that you (researcher) may sample too few or too many
members of the population of interest.
You do not know the degree to which the sample is representative of the population
from which it was drawn.
The research results cannot be projected (generalized) to the total population of interest
with any degree of confidence.
Probability Sampling Methods
Probability sampling is the scientific method of selecting samples according to some laws
of chance in which each unit in the population has some definite pre-assigned probability
of being selected in the sample. The different types of probability sampling are:

1. Where each unit has an equal chance of being selected.
2. Where sampling units have different probabilities of being selected.
3. Where the probability of selection of a unit is proportional to its size (probability
proportional to size sampling).
Simple Random Sampling
It is the technique of drawing a sample in such a way that each unit of the population has
an equal and independent chance of being included in the sample.
In this method an equal probability of selection is assigned to each unit of population at
the first draw. It also implies an equal probability of selecting in the subsequent draws.
Thus, in a simple random sample from a population of size N, the probability of drawing
any unit at the first draw is 1/N, and the probability of drawing any particular remaining
unit at the second draw is 1/(N − 1).
The probability of selecting a specified unit of population at any given draw is equal to
the probability of its being selected at the first draw.
Selection of a Simple Random Sample:
As we all know Simple Random Sample refers to that method of selecting a sample in
which each and every unit of population is given independent and equal chance to be
included in the sample. But, Random Sample does not depend only upon selection of
units, but also on the size and nature of the population. One procedure may be good and
simple for a small sample but it may not be good for a large population.
Generally, the method of selecting a sample must be independent of the properties of
sampled population. Proper precautions should be taken to ensure that the selected
sample is random, although some human bias is inherent in any sampling scheme
administered by human beings. Random selection is best for two reasons: it eliminates
bias, and statistical theory is based on the idea of random sampling. We can select a
simple random sample through the use of tables of random numbers, a computerized
random number generator, or the lottery method (drawing numbered slips or sealed
envelopes).
Lottery Method
This is the simplest method of selecting a random sample. We will illustrate it by means
of example for better understanding. Suppose, we want to select r candidates out of
n. We assign the numbers from 1 to n i.e. to each and every candidate we assign only
one exclusive number. These numbers are then written on n slips which are made as

homogeneous as possible in shape, size, colour, etc. These slips are then put in a bag and
thoroughly shuffled and then r slips are drawn one by one. The r candidates
corresponding to numbers on the slips drawn will constitute a random sample.
This method of selecting a simple random sample is independent of the properties of
population. Generally in place of slips you can use cards also. We make one card
corresponding to one unit of population by writing on it the number assigned to that
particular unit of population. The pack of cards is a miniature of population for sampling
purposes. The cards are shuffled a number of times and then a card is drawn at random
from them. This is one of the most reliable methods of selecting a random sample.
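A minimal Python sketch of the lottery method; the population of 50 numbered units and
the sample size of 5 are assumed purely for illustration. random.sample gives every unit
an equal chance of selection, like drawing slips from a well-shuffled bag:

    import random

    population = list(range(1, 51))        # units numbered 1 to 50, like numbered slips
    sample = random.sample(population, 5)  # draws 5 units without replacement
    print(sample)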
Merits and Limitations of Simple Random Sampling
Merits
1. Since sample units are selected at random providing equal chance to each and every
unit of population to be selected, the element of subjectivity or personal bias is
completely eliminated. Therefore, we can say that simple random sample is more
representative of population than purposive or judgement sampling.
2. You can ascertain the efficiency of the estimates of the parameters by considering the
sampling distribution of the statistic (estimates)
For example: the sample mean is an unbiased estimate of the population mean, and it
becomes a more efficient estimate of the population mean as the sample size increases.
Limitations
1. The selection of a simple random sample requires an up-to-date frame of the population
from which samples are to be drawn. It is often impossible to have such a list of each and
every unit if the population happens to be very large. This restricts the use of simple
random sampling.
2. A simple random sample may result in the selection of the sampling units, which are
widely spread geographically and in such a case the administrative cost of collecting the
data may be high in terms of time and money.
3. For a given precision, simple random sample usually requires larger sample size as
compared to stratified random sampling which we will be studying next.
A further limitation is that, purely by chance, a randomly selected sample may turn out to
be quite unrepresentative of the population. This type of problem can be reduced by the
use of Stratified Random Sampling, in which the
population is divided into different strata. Now, we will move into details of stratified
random sampling.

Stratified Random Sampling


We have understood that in simple random sampling, the variance of the sample estimate
of the population mean is a. inversely proportional to the sample size, and
b. directly proportional to the variability of the sampling units in the population.
We also know that precision is defined as the reciprocal of the sampling variance.
Therefore as sample size increases precision increases. Apart from increasing the sample
size or sampling fraction n/N, the only way of increasing the precision of sample mean is
to devise a sampling technique which will effectively reduce variance, the population
heterogeneity. One such technique is Stratified Sampling.
Stratification Means Division into Layers
Past data or some other information related to the character under study may be used to
divide the population into various groups such that
i. units within each group are as homogeneous as possible and
ii. the group means are as widely different as possible.
Thus, if we have a population consisting of N sampling units, it is divided into k
relatively homogeneous mutually disjoint (non overlapping) sub-groups, termed as strata,
of sizes N1, N2, ..., Nk, such that N = N1 + N2 + ... + Nk.
Now you draw a simple random sample of size ni (i=1, 2, 3,... k) from each stratum. This
type of technique of drawing a sample is called stratified random sampling and the
sample is called stratified random sample.
There are two points which you have to keep in mind while drawing a stratified random
sample.
Proper classification of the population into various strata, and
A suitable sample size from each stratum.
Both these points are important because, if your stratification is faulty, it cannot be
compensated for by taking large samples.
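A rough Python sketch of the procedure; the two strata and the proportional allocation
rule ni = n * Ni / N are assumed purely for illustration:

    import random

    strata = {
        "stratum 1": list(range(1, 41)),    # N1 = 40 units
        "stratum 2": list(range(41, 101)),  # N2 = 60 units
    }
    n = 10                                  # total sample size
    N = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        ni = round(n * len(units) / N)      # proportional allocation: ni = n * Ni / N
        sample.extend(random.sample(units, ni))  # simple random sample within each stratum
    print(sample)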
Advantages of Stratified Random Sampling
1. More Representative
In non-stratified random sample some strata may be over represented, others may be
under-represented while some may be excluded altogether. Stratified sampling ensures
any desired representation in the sample of the various strata in the population. It overrules the possibility of any essential group of the population being completely excluded
in the sample. Stratified sampling thus provides a more representative cross section of the
population and is frequently regarded as the most efficient system of sampling.

2. Greater Accuracy
Stratified sampling provides estimates with increased precision. Moreover, stratified
sampling enables us to obtain the results of known precision for each stratum.
3. Administrative Convenience
As compared with simple random sample, the stratified random samples are more
concentrated geographically. Accordingly, the time and money involved in collecting the
data and interviewing the individuals may be considerably reduced and the supervision of
the field work could be allocated with greater ease and convenience.
Systematic Random Sampling
If a complete and up-to-date list of sampling units is available, you can also employ a
common technique of sample selection known as systematic sampling.
In systematic sampling you select the first unit at random, the rest being automatically
selected according to some predetermined pattern involving regular spacing of units.
Now let us assume that the population size is N. We number all the sampling units from 1
to N in some order and a sample of size n is drawn in such a way that
N = nk i.e. k = N/n , where k, usually called the sampling interval, is an integer. In
systematic random sampling we draw a number at random; suppose the number drawn is
i. We then select the unit corresponding to this number and every kth unit subsequently.
Thus the systematic sample of size n will consist of the units
i, i + k, i + 2k, ..., i + (n − 1)k.
The random number i is called the random start and its value determines the whole
sample.
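A minimal sketch of this procedure in Python, assuming N = 100 and n = 10, so that the
sampling interval k = 10:

    import random

    N, n = 100, 10
    k = N // n                              # the sampling interval k = N / n
    i = random.randint(1, k)                # the random start
    sample = [i + j * k for j in range(n)]  # i, i+k, i+2k, ..., i+(n-1)k
    print(sample)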
Merits and Demerits of Systematic Random Sampling
Merits
I. Systematic sampling is operationally more convenient than simple random sampling or
stratified random sampling. It saves time and work.
II. This sampling is more efficient than simple random sampling, provided the frame (the
list from which you have drawn the sample units) is arranged wholly at random.
Demerits
I. The main disadvantage of systematic sampling is that systematic samples are not, in
general, random samples, since the requirement in merit II above is rarely fulfilled.
II. If N is not a multiple of n, then the actual sample size is different from that required,
and sample mean is not an unbiased estimate of the population mean.

Cluster Sampling
In this type of sampling you divide the total population, depending upon the problem
under study, into some recognizable sub-divisions which are termed clusters, and a
simple random sample of n clusters is drawn. The individuals whom you have selected
from these clusters constitute the sample.
Notes
Clusters should be as small as possible consistent with the cost and limitations of the
survey.
The number of sampling units in each cluster should be approximately same.
Thus cluster sampling is not to be recommended if we have sampling areas in the cities
where there are private residential houses, business and industrial complexes, apartment
buildings, etc., with widely varying number of persons or households.
Multistage Sampling
One better way of selecting a sample is to resort to sub-sampling within the clusters,
instead of enumerating all the sampling units in the selected cluster. This technique is
called two-stage sampling, clusters being termed as primary units and the units within the
clusters being termed as secondary units. This technique can be generalized to multistage
sampling. We regard population as a number of primary units each of which is further
composed of secondary stage units and so on, till we ultimately reach a stage where
desired sampling units are obtained. In multi-stage sampling each stage reduces the
sample size.
Merits and Limitations
Merits:
i. Multistage sampling is more flexible as compared to other methods. It is simple to
carry out and results in administrative convenience by permitting the field work to be
concentrated while still covering a large area.
ii. It saves a lot of operational cost, as we need the second-stage frame only for those units
which are selected in the first-stage sample.
Limitation:
It is generally less efficient than a suitable single-stage sampling of the same size.
Thus, in a nutshell, we can say that non-probability sampling methods such as convenience
sampling, judgement sampling and quota sampling are sometimes used, although the
representativeness of such a sample cannot be ensured. A probability sampling method,
by contrast, assigns a known probability of selection to each unit of the population, and in
this sense it yields a representative sample of the population.

Points to Ponder
Sampling is based on two premises. One is that there is enough similarity among the
elements in a population that a few of these elements will adequately represent the
characteristics of the total population.
The second premise is that while some elements in a sample underestimate the
population value, others overestimate it.
The result of these tendencies is that a sample mean is generally a good estimate of the
population mean.
A good sample has both accuracy & precision. An accurate sample is one in which there is
little or no bias or systematic variance. A sample with adequate precision is one that has a
sampling error that is within acceptable limits.
A variety of sampling techniques is available, of which probability sampling is based on
random selection: a controlled procedure that ensures that each population element is
given a known nonzero chance of selection.
In contrast non-probability selection is not random. When each sample element is drawn
individually from the population at large, it is unrestricted sampling.

Sample size and its determination


In sample analysis the most ticklish question is: What should be the size of the sample, or
how large or small should n be? If the sample size (n) is too small, it may not serve to
achieve the objectives, and if it is too large, we may incur huge cost and waste resources.
As a general rule, one can say that the sample must be of an optimum size i.e., it should
neither be excessively large nor too small. Technically the sample size should be large
enough to give a confidence interval of desired width and as such the size of the sample
must be chosen by some logical process before sample is taken from the universe. Size of
the sample should be determined by researcher keeping in view the following points:
1. Nature of Universe: Universe may be either homogenous or heterogeneous in
nature. If the items of the universe are homogeneous, a small sample can serve the
purpose. But if the universe is heterogeneous, a large sample would be required.
Technically, this can be termed as the dispersion factor.
2. Number of classes proposed: If many class groups (groups and sub groups)
are to be formed, a large sample would be required because a small sample might
not be able to give a reasonable number of items in each class groups.
3. Nature of Study: If items are to be intensively and continuously studied, the
sample should be small. For general survey the size of the sample should be large,
but a small sample is considered appropriate in technical surveys.
4. Type of Sampling: Sampling technique plays an important part in determining the
size of the sample. A small random sample is apt to be much superior to a larger
but badly selected sample.

5. Standard of accuracy and acceptable confidence level: If the standard of accuracy


or the level of precision is to be kept high, we shall require a relatively larger
sample. For doubling the accuracy at a fixed significance level, the sample size
has to be increased fourfold (a sketch of this arithmetic appears after this list).
6. Availability of finance: In practice, size of the sample depends upon the amount
of money available for the study purposes. This factor should be kept in view
while determining the size of the sample for large samples result in increasing the
cost of sampling estimates.
Other considerations: Nature of units, size of the population, size of questionnaire,
availability of trained investigators, the conditions under which the sample is being
conducted, the time available for completion of the study are a few other considerations
to which a researcher must pay attention while selecting the size of the sample.
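The fourfold rule in point 5 follows from the formula for the margin of error of a mean,
e = zσ/√n, which gives n = (zσ/e)². A minimal Python sketch, with σ = 10 and 95%
confidence (z = 1.96) assumed purely for illustration:

    import math

    z, sigma = 1.96, 10.0               # assumed illustrative values
    for e in (2.0, 1.0):                # halving e doubles the accuracy
        n = (z * sigma / e) ** 2        # required sample size n = (z*sigma/e)**2
        print(f"margin of error {e}: n = {math.ceil(n)}")
    # n rises from 97 to 385, i.e. roughly fourfold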

Sampling Distribution
The process of generalizing sample results to the population is referred to as
statistical inference. Here, we shall use certain sample statistics (such as the sample
mean, the sample proportion, etc.) in order to estimate and draw inferences about the true
population parameters. For example, in order to be able to use the sample mean to
estimate the population mean, we should examine every possible sample (and its mean)
that could have occurred in the process of selecting one sample of a certain size. If this
selection of all possible samples actually were to be done, the distribution of the results
would be referred to as a sampling distribution. Although, in practice, only one such
sample is actually selected, the concept of sampling distributions must be examined so
that probability theory and its distribution can be used in making inferences about the
population parameter values.
Sampling theory has made it possible to deal effectively with these problems. However,
before we discuss in detail about them from the standpoint of sampling theory, it is
necessary to understand the central limit theorem and the following three probability
distributions, their characteristics and relations:
(1) The population (universe) distribution,
(2) The sample distribution, and
(3) The sampling distribution.
Central Limit Theorem: The Central Limit Theorem, first introduced by De Moivre
during the early eighteenth century, happens to be the most important theorem in
statistics. According to this theorem, if we select a large number of simple random
samples, say, from any population distribution and determine the mean of each sample,

the distribution of these sample means will tend to be described by the normal probability
distribution with a mean μ and variance σ²/n. This is true even if the population
distribution itself is not normal. In other words, the sampling distribution of sample
means approaches a normal distribution, irrespective of the distribution of the population
from which the sample is taken, and the approximation to the normal distribution
becomes increasingly close with increase in sample size. Symbolically, the theorem can
be explained as follows:
Given n independent random variables X1, X2, X3, ..., Xn, which have the same
distribution (no matter what the distribution), then for large n:
X = X1 + X2 + X3 + ... + Xn
is approximately a normal variate. The mean μ and variance σ² of X are
μ = μ1 + μ2 + μ3 + ... + μn = nμi
σ² = σ1² + σ2² + σ3² + ... + σn² = nσi²
where μi and σi² are the common mean and variance of each Xi.
The utility of this theorem is that it requires virtually no conditions on distribution
patterns of the individual random variable being summed. As a result, it furnishes a
practical method of computing approximate probability values associated with sums of
arbitrarily distributed independent random variables. This theorem helps to explain why a
vast number of phenomena show approximately a normal distribution. Let us consider a
case when the population is skewed: the skewness of the sampling distribution of means
is inversely proportional to the square root of the sample size. When n = 16, the sampling
distribution of means will exhibit only one-fourth as much skewness as the population
has; when n = 100, the skewness becomes one-tenth as much. That is, as the sample size
increases, the skewness will decrease.
As a practical consequence, the normal curve will serve as a satisfactory model when
samples are small and population is close to a normal distribution, or when samples are
large and population is markedly skewed. Because of its theoretical and practical
significance, this theorem is considered as most remarkable theoretical formulation of all
probability laws.
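The theorem can be seen at work in a small simulation. In the sketch below the sample
size, the number of samples and the exponential population are all assumed purely for
illustration; sample means drawn from this markedly skewed population have a mean
close to the population mean 1 and a variance close to σ²/n:

    import random
    import statistics

    random.seed(1)
    n = 30                                   # size of each sample
    means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
             for _ in range(5000)]           # 5000 sample means from a skewed population
    print(round(statistics.mean(means), 3))      # close to the population mean 1.0
    print(round(statistics.variance(means), 3))  # close to sigma**2 / n = 1/30 = 0.033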
The Population (Universe) Distribution
When we talk of population distribution, we assume that we have investigated the
population and have full knowledge of its mean and standard deviation. For example, a

company might have manufactured 1,00,000 tyres of cars in the year 2004. Suppose it
contacts all those who had bought these tyres and gathers information about the life of
these tyres. On the basis of the information obtained, the mean of the population which is
also called the true mean, symbolized by μ, and its standard deviation, symbolized by σ,
can be worked out. These Greek letters μ and σ are used for these measures to emphasise
their difference from the corresponding measures taken from a sample. It may be noted
that such measures characterizing a population are called population parameters.
The shape of the distribution of the life of tyres may be as follows:

(Figure: Distribution of the Life of Tyres)


It is clear from the above that, though the distribution shows slight skewness, it does not
depart radically from a normal distribution. However, this should not lead one to the
conclusion that for sampling theory to apply, it is necessary that the distribution must be
normally distributed.
The Sample Distribution
When we talk of a sample distribution, we take a sample from the population. A sample
distribution may take any shape. The mean and standard deviation of the sample
distribution are symbolized by x̄ and s respectively. A measure characterizing a sample,
such as x̄ or s, is called a sample statistic. It may be noted that several sample distributions
are possible from a given population.
Suppose, in the above illustration, the manufacturer takes a sample of 500 tyres. He
contacts the buyers and enquires about the life of the tyres. The shape of the distribution
for this sample may be as follows:

(Figure: Sample distribution of 500 tyres)


The mean values of these tyres can be expected to differ somewhat from one sample to
another. The sample means constitute the raw material out of which a sampling
distribution is constructed.
The Sampling Distribution
Sampling distributions constitute the theoretical basis of statistical inference and are of
considerable importance in business decision making. If we take numerous different
samples of equal size from the same population, the probability distribution of all the
possible values of a given statistic from all the distinct possible samples of equal size is
called a sampling distribution. It is interesting to note that sampling distributions closely
approximate a normal distribution. It can be seen that the mean of a sampling distribution
of sample means is the same as the mean of the population distribution from which the
samples are taken.
The mean of the sampling distribution is designated by the same symbol as the mean of
the population, namely μ. However, the standard deviation of the sampling distribution of
means is given a special name, the standard error of the mean, and is symbolized by σx̄.
The subscript indicates that in this case we are dealing with a sampling distribution of means.
The greatest importance of sampling distributions is the assistance that they give us
in revealing the patterns of sampling errors and their magnitude in terms of standard
error. In sampling with replacement, we can observe a good deal of fluctuations in the
sample mean as compared to fluctuations in the actual population. The fact that the
sample means are less variable than the population data follows logically from an
understanding of the averaging process. A particular sample mean averages together all
the values in the sample. A population (universe) may consist of individual outcomes that
can take on a wide range of values from extremely small to extremely large. However, if
an extreme value falls into the sample, although it will have an effect on the mean, the
effect will be reduced since it is being averaged in with all the other values in the sample.
Moreover, as the sample size increases, the effect of a single extreme value gets even
smaller, since it is being averaged with more observations. This phenomenon is
expressed statistically in the value of the standard deviation of the sample mean. This is
the measure of variability of the mean from sample to sample and is referred to as the
standard deviation of the sampling distribution of sample mean or the standard error of
the mean, denoted as σx̄, and is calculated by
σx̄ = σ / √n
This formula holds only when the population is infinite or samples are drawn from a
finite population with replacement.
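For example, with an assumed population standard deviation of σ = 50, the standard
error shrinks as the sample grows:

    import math

    sigma = 50.0                         # assumed population standard deviation
    for n in (25, 100, 400):
        print(n, sigma / math.sqrt(n))   # standard error sigma/sqrt(n): 10.0, 5.0, 2.5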

It may be noted that in deducing a sampling distribution, we must first make an


assumption about the appropriate parameter. Inasmuch as any value can be assumed for
a parameter, depending upon our knowledge or a guess about the population, there is no
theoretical limit to the number of sampling distributions of the same sample size that can
be taken from the population.
There is a sampling distribution for each assumed value of a parameter. Also, given the
assumed value of a parameter, there is a different sampling distribution of statistics for
each specific sample size. Further, under the same assumptions about a population and
the same sample size, the distribution of one statistic differs from that of another statistic.
For example, the pattern of the distribution of x̄ will differ from that of s², even
though both measures are computed from the same sample.
Relationship between Population, Sample and Sampling Distribution
It will be interesting to note that the mean of the sampling distribution is the same as the
mean of the population. It is possible that many sample means may differ from the
population mean. However, the sample information can be used as an estimate of
population values. It has also been established that the observed standard deviation of a
sample is close to the standard deviation of the population values.
In fact, the standard deviation of the sample is usually so good an approximation that it
can safely be used as an estimate of the corresponding population measure. In order to
use s of the sample to estimate σ of the population, we make a slight adjustment which
has been found to contribute to greater accuracy of the estimate. The adjustment consists
of using (n − 1) instead of n in the formula for the standard deviation of a sample, i.e., we
use
s = √[ Σ(x − x̄)² / (n − 1) ]
The adjustment decreases the denominator and, therefore, gives a larger result. Thus, the
estimated standard deviation of the population is slightly larger than the observed
standard deviation of the sample.
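Python's statistics module implements exactly this distinction, as the sketch below shows
with a small made-up sample; stdev uses the (n − 1) divisor, while pstdev uses n:

    import statistics

    sample = [4, 8, 6, 5, 7]
    print(round(statistics.pstdev(sample), 4))  # divisor n:     1.4142
    print(round(statistics.stdev(sample), 4))   # divisor n - 1: 1.5811 (slightly larger)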
A probability distribution of all the possible means of the samples is a distribution of
sample means or sampling distribution of the mean.
Population mean and sample average
The population mean describes the population's center or location. If the population
were totally accessible, its mean would be computed by the formula
μ = Σy / N
in which μ (the lower-case Greek letter mu) is the symbol for the population mean, Σy is
the sum of all of the values of the variable of interest for the whole population, and N is

the number of elements in the population. We rarely have an opportunity to use this
formula since most of the populations we study are not totally accessible; they either are
too large, perhaps even infinite, or would be destroyed in the process of measurement.
Population variance and sample variance
The population variance is a measure of the spread of the population. Suppose we want to
choose between two investment plans and are told that both have mean earnings of 10%
per annum; we might conclude that they were equally good. However, suppose we learn
that plan A has a variance twice as large as plan B. This gives us additional information
on which to base a choice. A population variance can be computed from ungrouped data
or from data that are grouped into a frequency or relative frequency distribution if the
population is of the accessible variety. For ungrouped data, a population variance is
defined to be
σ² = Σ(y − μ)² / N
Chapter-10
Hypothesis Testing
Introduction
A hypothesis is an assumption about the population parameter to be tested based on
sample information. The statistical testing of hypothesis is the most important technique
in statistical inference. Hypothesis tests are widely used in business and industry for
making decisions. It is here that probability and sampling theory plays an everincreasing role in constructing the criteria on which business decisions are made. Very
often in practice we are called upon to make decisions about population on the basis of
sample information. For example, we may wish to decide on the basis of sample data
whether a new medicine is really effective in curing a disease, whether one training
procedure is better than another, etc. Such decisions are called statistical decisions. In
other words, a hypothesis is the assumption that we make about the population
parameter. This can be any assumption about a population parameter not necessarily
based on statistical data. For example it can also be based on the gut feel of a manager.
Managerial hypotheses are based on intuition; the market place decides whether the
manager's intuitions were in fact correct.
In fact managers propose and test hypotheses all the time. For example:
1. If a manager says "if we drop the price of this car model by Rs 15,000, we'll increase
sales by 25,000 units", that is a hypothesis. To test it in reality, we have to wait until the
end of the year and count sales.
2. A manager's estimate that sales per territory will grow on average by 30% in the next
quarter is also an assumption or hypothesis. How would the manager go about testing
this assumption? Suppose he has 70 territories under him.
One option for him is to audit the results of all 70 territories and determine whether the
average growth is greater than or less than 30%. This is a time consuming and expensive
procedure.
Another way is to take a sample of territories and audit sales results for them.
Once we have our sales growth figure, it is likely that it will differ somewhat from our
assumed rate. For example we may get a sample rate of 27%. The manager is then faced
with the problem of determining whether his assumption or hypothesized rate of growth
of sales is correct or the sample rate of growth is more representative.
To test the validity of our assumption about the population we collect sample data and
determine the sample value of the statistic. We then determine whether the sample data
supports our hypothesized assumption regarding the average sales growth.

What is Hypothesis?
In attempting to reach decisions, it is useful to make assumptions or guesses about the
populations involved. Such assumptions, which may or may not be true, are called
statistical hypotheses and, in general, are statements about the probability distributions of
the population. The hypothesis is made about the value of some parameter, but the only
facts available to estimate the true parameter are those provided by a sample. If the
sample statistic differs from the hypothesis made about the population parameter, a
decision must be made as to whether or not this difference is significant. If it is, the
hypothesis is rejected. If not, it must be accepted. Hence, the term "tests of hypothesis".
Now, if θ is the parameter of the population and θ̂ is its estimate from the random
sample drawn from the population, then the difference between θ̂ and θ should be
small. In fact, there will be some difference between θ̂ and θ, because θ̂ is based on
sample observations and is different for different samples. Such a difference is known as
a difference due to sampling fluctuations. If the difference between θ̂ and θ is large, then
the probability that it is exclusively due to sampling fluctuations is small. A difference
which is caused by sampling fluctuations is called an insignificant difference, and a
difference due to some other reasons is known as a significant difference. A
significant difference arises due to the fact that either the sampling procedure is not
purely random or the sample is not from the given population.

Procedure for Hypotheses Testing


The general procedure followed in testing hypothesis comprises the following
steps:
Set up a hypothesis. The first step in hypothesis testing is to establish the hypothesis
to be tested. Since statistical hypotheses are usually assumptions about the value of some
unknown parameter, the hypothesis specifies a numerical value or range of values for
the parameter. The conventional approach to hypothesis testing is not to construct a single
hypothesis about the population parameter, but rather to set up two different hypotheses.
These hypotheses are normally referred to as (i) the null hypothesis, denoted by Ho, and (ii)
alternative hypothesis denoted by H1.
The null hypothesis asserts that there is no true difference between the sample statistic
and the population parameter under consideration (hence the word "null", which means
invalid, void or amounting to nothing), and that the difference found is accidental, arising
out of fluctuations of sampling. A hypothesis which states that there is no difference between
assumed and actual value of the parameter is the null hypothesis and the hypothesis that
is different from the null hypothesis is the alternative hypothesis. If the sample
information leads us to reject Ho, then we will accept the alternative hypothesis H1. Thus,

the two hypotheses are constructed so that if one is true, the other is false and vice versa.
The rejection of the null hypothesis indicates that the differences have statistical
significance and the acceptance of the null hypothesis indicates that the differences are
due to chance. As against the null hypothesis, the alternative hypothesis specifies those
values that the researcher believes to hold true. The alternative hypothesis may embrace
the whole range of values rather than single point.
Set up a suitable significance level. Having set up a hypothesis, the next step is to select
a suitable level of significance. The confidence with which an experimenter rejects or
retains the null hypothesis depends on the significance level adopted. The level of
significance, usually denoted by α, is generally specified before any samples are
drawn, so that results obtained will not influence our choice. Though any level of
significance can be adopted, in practice, we either take 5 per cent or 1 per cent level of
significance. When we take 5 per cent level of significance then there are about 5
chances out of 100 that we would reject the null hypothesis when it should be accepted,
i.e., we are about 95% confident that we have made the right decision. When we test a
hypothesis at a 1 per cent level of significance, there is only one chance out of 100 that
we would reject the null hypothesis when it should be accepted, i.e., we, are about 99%
confident that we have made the right decision. When the null hypothesis is rejected at
α = 0.05, the test result is said to be "significant". When the null hypothesis is rejected at
α = 0.01, the test result is said to be "highly significant".
Determination of a suitable test statistic. The third step is to determine a suitable test
statistic and its distribution. Many of the test statistics that we shall encounter will be of
the following form:
test statistic = (sample statistic − hypothesized value of the parameter) / (standard error
of the sample statistic)
Determine the critical region. It is important to specify, before the sample is taken,
which values of the test statistic will lead to a rejection of Ho and which lead to
acceptance of Ho. The former is called the critical region. The value of α, the level of
significance, indicates the importance that one attaches to the consequences associated
with incorrectly rejecting Ho. It can be shown that when the level of significance is α,
the optimal critical region for a two-sided test consists of that α/2 per cent of the area in
the right-hand tail of the distribution plus that α/2 per cent in the left-hand tail. Thus,
establishing a critical region is similar to determining a 100(1 − α)% confidence interval.
In general, one uses a level of significance of α = 0.05, indicating that one is willing to
accept a 5 per cent chance of being wrong when rejecting Ho.


Doing computations. The fifth step in testing hypothesis is the performance of various
computations from a random sample of size n, necessary for the test statistic obtained in
the preceding step. Then, we need to see whether the sample result falls in the critical
region or in the acceptance region.
Making decisions. Finally, we may draw statistical conclusions and the management
may take decisions. A statistical decision or conclusion comprises either accepting the
null hypothesis or rejecting it. The decision will depend on whether the computed value
of the test criterion falls in the region of rejection or the region of acceptance. If the
hypothesis is being tested at 5 per cent level of significance and the observed set of
results has a probability less than 5 per cent, we reject the null hypothesis and the
difference between the sample statistic and the hypothetical population parameter is
considered to be significant. On the other hand, if the testing statistic falls in the region
of non-rejection, the null hypothesis is accepted and the difference between the sample
statistic and the hypothetical population parameter is not regarded as significant, i. e., it
can be explained by chance variations.

Type I and Type II Errors


When a statistical hypothesis is tested, there are four possible results:
(1) The hypothesis is true but our test rejects it.
(2) The hypothesis is false but our test accepts it.
(3) The hypothesis is true and our test accepts it.
(4) The hypothesis is false and our test rejects it.
Obviously, the first two possibilities lead to errors. If we reject a hypothesis when it
should be accepted (possibility No.1), we say that a Type I error has been made. On the
other hand, if we accept a hypothesis when it should be rejected (possibility No.2), we
say that a Type II error has been made. In either case a wrong decision or error in
judgment has occurred.

The probability of committing a Type I error is designated as α and is called the level of
significance. Therefore,
α = Pr [Type I error] = Pr [Rejecting Ho | Ho is true]
Its complement is
(1 − α) = Pr [Accepting Ho | Ho is true].
This probability (1 − α) corresponds to the concept of a 100(1 − α)% confidence interval.
Our efforts would obviously be to have a small probability of making a Type I error.
Hence the objective is to construct the test so as to minimise α.
Similarly, the probability of committing a Type II error is designated by β. Thus
β = Pr [Type II error] = Pr [Accepting Ho | Ho is false]
and
(1 − β) = Pr [Rejecting Ho | Ho is false].
This probability (1 − β) is known as the power of a statistical test.


The following table gives the probabilities associated with each of the four cells
shown in the previous table:

                            The null hypothesis is:
The decision is:            True                        False
Accept Ho                   (1 − α)                     β
                            Confidence level
Reject Ho                   α                           (1 − β)
                                                        Power of the test
Sum                         1.00                        1.00

Note that the probability of each decision outcome is a conditional probability and
the elements in the same column sum to 1.0, since the events with which they are
associated are complementary. However, α and β are not independent of each other, nor
are they independent of the sample size n. When n is fixed, if α is lowered then β
normally rises, and vice versa. If n is increased, it is possible for both α and β to decrease.
Since increasing the sample size involves money and time, one should decide how
much additional money and time one is willing to spare on increasing the sample size in
order to reduce the sizes of α and β.
In order for any tests of hypothesis or rules of decisions to be good, they must be
designed so as to minimise errors of decision. However, this is not a simple matter, since
for a given sample size, an attempt to decrease one type of error is accompanied in
general by an increase in the other type of error. The probability of making a Type I error is
fixed in advance by the choice of level of significance employed in the test. We can
make the type I error as small as we please, by lowering the level of significance. But by
doing so, we increase the chance of accepting a false hypothesis, i. e., of making a type
II error. It follows that it is impossible to minimise both errors simultaneously. In the
long run, errors of type I are perhaps more likely to prove serious in research
programmes in social sciences than are errors of type II. In practice, one type of error
may be more serious than the other and so a compromise should be reached in favour of
limiting the more serious error. The only way to reduce both types of error is to
increase the sample size, which may or may not be possible.
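The trade-off between α and β can be made concrete with a small computation. The
sketch below assumes a one-sided test of Ho: μ = 100 against the specific alternative
μ = 104, with σ = 10 and n = 25 chosen purely for illustration; lowering α visibly raises β:

    from statistics import NormalDist

    sigma, n = 10.0, 25
    se = sigma / n ** 0.5                                    # standard error = 2.0
    for alpha in (0.05, 0.01):
        cutoff = 100 + NormalDist().inv_cdf(1 - alpha) * se  # rejection cutoff for x-bar
        beta = NormalDist(104, se).cdf(cutoff)               # Pr[accept Ho | mu = 104]
        print(f"alpha = {alpha}: beta = {round(beta, 3)}")
    # alpha = 0.05 gives beta of about 0.36; alpha = 0.01 raises beta to about 0.63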
One-Tailed and Two-Tailed Tests
Basically, there are three kinds of problems of tests of hypothesis. They include:
(i) two-tailed tests, (ii) right-tailed test, and (iii) left-tailed test.
A two-tailed test is one where the hypothesis about the population mean is rejected for a
value of the test statistic falling into either tail of the sampling distribution. When the
hypothesis about the population mean is rejected only for a value of the test statistic
falling into one of the tails of the sampling distribution, it is known as a one-tailed test. If
it is the right tail, it is called a right-tailed test (one-sided alternative to the right), and if it
is the left tail, it is a left-tailed test (one-sided alternative to the left).
For example, Ho: μ = 100 tested against H1: μ > 100 (or μ < 100) is a one-tailed test, since
H1 specifies that μ lies on a particular side of 100. The same null hypothesis tested against
H1: μ ≠ 100 is a two-tailed test, since μ can be on either side of 100. The following
diagrams would make it clearer:
(Diagrammatic representation of the rejection regions appears after the table.)
The following table gives critical values of z for both one-tailed and two-tailed tests at
various levels of significance. Critical values of z for other levels of significance are
found by use of the table of normal curve areas:

Level of significance        0.10        0.05        0.01        0.005       0.002
Critical value of z          -1.28       -1.645      -2.33       -2.58       -2.88
(one-tailed tests)           or 1.28     or 1.645    or 2.33     or 2.58     or 2.88
Critical value of z          -1.645      -1.96       -2.58       -2.81       -3.08
(two-tailed tests)           and 1.645   and 1.96    and 2.58    and 2.81    and 3.08
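These critical values can be verified from the standard normal distribution, as in the brief
sketch below (small differences, e.g. 3.09 versus 3.08 at the 0.002 level, are due to the
table's rounding):

    from statistics import NormalDist

    for alpha in (0.10, 0.05, 0.01, 0.005, 0.002):
        one = NormalDist().inv_cdf(1 - alpha)      # one-tailed critical value
        two = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
        print(f"alpha = {alpha}: one-tailed z = {one:.2f}, two-tailed z = {two:.2f}")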

Z TEST: Tests of Hypothesis Concerning Large Samples


Though, it is difficult to draw a clear-cut line of demarcation between large and small
samples, it is generally agreed that if the size of sample exceeds 30, it should be
regarded as a large sample. The tests of significance used for large samples are different
from the ones used for small samples for the reason that the assumptions we make in
case of large samples do not hold for small samples. Tests of hypothesis involving large
samples are based on the following assumptions:
(1) The sampling distribution of a sample statistic is approximately normal.
(2) Values given by the samples are sufficiently close to the population value and
can be used in its place for the standard error of the estimate.
Thus, we have seen that the normal distribution plays a vital role in tests of
hypothesis based on large samples (central limit theorem).
Suppose θ̂ is an unbiased estimate of θ, the population parameter. On the basis of
θ̂, taken from sample observations, we are to test the hypothesis whether the sample is
drawn from a population whose parameter value is θo, i.e., we have to test the
hypothesis
Ho: θ = θo
If the sampling distribution of θ̂ is normal, then
Z = (θ̂ − θo) / S.E.(θ̂) follows the standard normal distribution.
Let us test the hypothesis at the 100α % level of significance. From tables of area under
the standard normal curve corresponding to the given α, we can find an ordinate zα such that
Pr [ |Z| > zα ] = α
Pr [ −zα ≤ Z ≤ zα ] = 1 − α
If α = 0.01, then zα = 2.58, and if α = 0.05, then zα = 1.96, and so on.
If the difference between θ̂ and θo is more than zα times the standard error of
θ̂, the difference is regarded as significant and Ho is rejected at the 100α % level of
significance; if the difference between θ̂ and θo is less than or equal to zα times the
standard error of θ̂, the difference is insignificant and Ho is accepted at the 100α % level
of significance.

Testing Hypothesis about a Population Mean:


Consider hypothesis testing concerning the population mean μ with a two-tailed test:
Ho: μ = μo
Since the best unbiased estimator of μ is the sample mean x̄, we shall
focus our attention on the sampling distribution of x̄. From the Central Limit Theorem,
we know that
x̄ ~ N(μ, σx̄)
z = (x̄ − μo) / σx̄
where
σx̄ = σ / √n = s / √n (s may be used in place of σ if σ is unknown, for large samples)

If the calculated value of z < −zα/2 or z > zα/2, the null hypothesis is rejected.


If the hypothesis involves a right-tailed test, for example
Ho: μ ≤ μo and H1: μ > μo
then for calculated values z > zα, the null hypothesis is rejected.
If the hypothesis involves a left-tailed test, for example
Ho: μ ≥ μo and H1: μ < μo
then for calculated values z < −zα, the null hypothesis is rejected.
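The whole procedure for a single mean can be sketched in a few lines of Python; the
figures (Ho: μ = 500, x̄ = 508, s = 40, n = 100) are assumed purely for illustration:

    from statistics import NormalDist

    mu0, xbar, s, n = 500.0, 508.0, 40.0, 100    # assumed illustrative figures
    z = (xbar - mu0) / (s / n ** 0.5)            # z = (x-bar - mu0)/(s/sqrt(n)) = 2.0
    z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed 5% critical value, 1.96
    if abs(z) > z_crit:
        print("Reject Ho: the difference is significant at the 5% level.")
    else:
        print("Accept Ho: the difference is due to sampling fluctuations.")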

Testing Hypothesis about the Difference between two Means


The test statistic for testing the difference between two population means, when the populations
are normally distributed, is based on the general form of the standard normal statistic as
given below:
Z = (δ̂ − δ) / σδ̂
where δ = μ1 − μ2 and its estimator is δ̂ = x̄1 − x̄2.
Therefore the z statistic is given by
z = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( σ1²/n1 + σ2²/n2 )

The null hypothesis is Ho: μ1 − μ2 = 0

Then the z statistic reduces to
z = (x̄1 − x̄2) / √( σ1²/n1 + σ2²/n2 )
At the 5% level of significance, the critical value of z for a two-tailed test is 1.96. If the
computed value of z is greater than +1.96 or less than −1.96, then reject Ho; otherwise
accept Ho.
s1² and s2² can be used if the values of σ1² and σ2² are unknown.
Illustration 2: You are working as a purchase manager for a company. Two
manufacturers of electric bulbs have supplied the following information to you:

                                    Company A       Company B
Mean life (in hours)                1300            1288
Standard deviation (in hours)       82              93
Sample size                         100             100

Which brand of bulb are you going to purchase if you desire to take a risk of 5%?
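Working through the illustration with the formula above: z = (1300 − 1288) /
√(82²/100 + 93²/100) ≈ 0.97, which is well below the critical value of 1.96 at the 5%
level, so the difference in mean life is not significant and, on this evidence alone, either
brand may be purchased. The same arithmetic in Python:

    import math

    xbar_a, s_a, n_a = 1300.0, 82.0, 100
    xbar_b, s_b, n_b = 1288.0, 93.0, 100
    z = (xbar_a - xbar_b) / math.sqrt(s_a**2 / n_a + s_b**2 / n_b)
    print(round(z, 2))   # 0.97 < 1.96, so Ho (no difference in mean life) is accepted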

Theory for Small Samples


If the original population is normally distributed, all sampling distributions of the mean
shall be normally distributed regardless of the sample size (central limit theorem). If
the original population is normally distributed and the standard deviation of the

population is unknown (and therefore has to be estimated from a sample), the sampling
distribution of the mean derived from large samples will also be normally distributed,
but if the sample size is small (say 30 or less) then the sample statistic will follow a
t-distribution.
The Student's t-distribution obtained by W.S. Gosset was published under the pen
name of "Student" in the year 1908. It is reported that Gosset was a statistician for a
brewery, and that the management did not want him to publish his scholarly theoretical
work under his real name and bring shame to his employer. Consequently, he selected
the pen name of Student.
The study of statistical inference with the small samples is called small sampling
theory or exact sampling theory. We shall discuss in detail the "t" and "F' distributions.
These two distributions are defined in terms of number of degrees of freedom. It is
appropriate at this stage to clarify this concept.
Degrees of freedom: The number of degrees of freedom can be interpreted as the
number of useful items of information generated by a sample of given size with respect
to the estimation of a given population parameter. Thus, a sample of size 1 generates
one piece of useful information if one is estimating the population mean, but none, if
one is estimating the population variance. In order to know about the variance, one needs
at least a sample of size n ≥ 2. The number of degrees of freedom, in general, is the total
number of observations minus the number of independent constraints imposed on the
observations.
Suppose the sum ΣX = X1 + X2 + X3 + X4 = 15 has four terms. We can arbitrarily
assign values to any three of these four terms (for example, 15 = X1 + 2 + 8 + 0) but the
value of the fourth is automatically determined (here, X1 = 5).
In this example, there are 3 degrees of freedom. If n is the number of observations
and k is the number of independent constants (the number of constants that have to be
estimated from the original data) then n - k is the number of degrees of freedom.
If we consider samples of size n drawn from a normal (or approximately normal)
population with mean μ, and if for each sample we compute t, using the sample mean x̄
and sample standard deviation s,
t = (x̄ − μ)/(s/√n), with v = n − 1 degrees of freedom,
the distribution for t can be obtained. The probability density function of the t-distribution
is proportional to
f(t) ∝ (1 + t²/v)^−(v+1)/2, for −∞ < t < ∞

Properties of t-Distribution
(1) The t-distribution ranges from −∞ to ∞, just as does a normal distribution.


(2) The t-distribution like the standard normal distribution is bell-shaped and symmetrical
around mean zero.
(3) The shape of the t-distribution changes as the number of degrees of freedom changes.
Therefore, there is a family of t-distributions, one for each value of the degrees of freedom.
Hence, the degrees of freedom v is a parameter of the t-distribution.
(4) The variance of the t-distribution is always greater than one; it is defined only when
v ≥ 3 and is given as:
Var(t) = v/(v − 2)
(5) The t-distribution is more platykurtic (less peaked at the centre and higher in the tails)
than the normal distribution.
(6) The t-distribution has a greater dispersion than the standard normal distribution. As n
gets larger, the t-distribution approaches the standard normal form. When n is as large as 30,
the difference between the two distributions is very small.

Chi-Square Distributions
The chi-square is a continuous probability distribution. Although this theoretical
probability distribution is usually not a direct model of a population distribution, it has
many uses when we are trying to answer questions about populations. For example, the
chi-square distribution can be used to decide whether or not a set of data fits a specified
theoretical probability model: a goodness-of-fit test.

Goodness-of-fit tests
Goodness-of-Fit Test with a Specified Parameter
Example: Each day a salesperson calls on 5 prospective customers and she records
whether or not the visit results in a sale. For a period of 100 days her record is as follows:
Number of sales:   0    1    2    3    4    5
Frequency:        15   21   40   14    6    4
A marketing researcher feels that a call results in a sale about 35% of the time, so he
wants to see if this sampling of the salesperson's efforts fits a theoretical binomial
distribution for 5 trials with 0.35 probability of success, b(y; 5, 0.35). This binomial
distribution has the following probabilities and leads to the following expected values for
100 days of records:

Number of sales:    0      1      2      3      4      5
Probability:      0.116  0.312  0.336  0.181  0.049  0.005
Expected value:   11.6   31.2   33.6   18.1    4.9    0.5

Since the last category has an expected value of less than 1, he combines the last two
categories to perform the goodness-of-fit test.

In this goodness-of-fit test the hypotheses are:


H0: This sample is from b(y; 5, 0.35)
Ha: This sample is not from b(y; 5, 0.35)
The degrees of freedom are v = k − 1 = 5 − 1 = 4. The critical value (at α = 0.05) is
χ²(0.05, 4) = 9.488, and the null hypothesis is rejected if the computed value of
χ² = Σ(O − E)²/E exceeds it. Here the computed χ² is about 10.4, so the marketing
researcher rejects the null hypothesis. The sales do not follow the pattern of this binomial
distribution.
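A minimal sketch of the same goodness-of-fit test in Python (assuming SciPy is available; the observed frequencies are those recorded above, with the last two categories combined):

```python
from scipy.stats import binom, chisquare

observed = [15, 21, 40, 14, 6 + 4]                  # last two categories combined
probs = [binom.pmf(k, 5, 0.35) for k in range(6)]   # b(y; 5, 0.35)
expected = [100 * p for p in probs]                 # expected counts for 100 days
expected = expected[:4] + [expected[4] + expected[5]]

stat, p_value = chisquare(observed, f_exp=expected) # df = 5 - 1 = 4
print(round(stat, 2), round(p_value, 4))            # about 10.4, p < 0.05
```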

t-Tests
If random samples of size less than 30 are taken from a normal distribution and the
samples used to estimate the
Variance, then the statistic
is not normally distributed. The probabilities in the tails of this distribution are greater
than for the standard normal distribution

[Figure: comparison of the standard normal distribution and a t distribution]


t distributions are
1. unimodal;
2. asymptotic to the horizontal axis;
3. symmetrical about zero, E(t) = 0;
4. dependent on v = n − 1, the degrees of freedom (for the statistic under discussion);
5. more variable than the standard normal distribution, V(t) = v/(v − 2) for v > 2;
6. approximately standard normal if v is large.
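Properties 5 and 6 are easy to check numerically. The sketch below (assuming SciPy) compares two-tailed 5% critical values of t with the corresponding normal value of 1.96 as v grows:

```python
from scipy.stats import norm, t

print(round(norm.ppf(0.975), 3))          # 1.96, the normal critical value
for v in (5, 10, 30, 100):
    # t critical values shrink towards 1.96 as the degrees of freedom grow
    print(v, round(t.ppf(0.975, v), 3))   # 2.571, 2.228, 2.042, 1.984
```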

Example:
Using a t Distribution to Test a Hypothesis about μ
The sports physiologist would like to test H0: μ = 17 against Ha: μ ≠ 17 for female
marathon runners, where μ is the mean distance (in miles) run until stress. In a random
sample of 8 female runners, he finds the sample mean x̄ and standard deviation s.
Since n = 8, the degrees of freedom are v = 7, and at α = 0.05 the null hypothesis will be
rejected if |t| ≥ t(0.025, 7) = 2.365. The test statistic t = (x̄ − 17)/(s/√8), computed
from the sample, exceeds this critical value.
Thus he rejects the null hypothesis and concludes that for women the distance until stress
is more than 17 miles.
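For reference, the same kind of test can be run with scipy.stats.ttest_1samp. The distances below are illustrative stand-ins, since the original sample values are not reproduced in the text:

```python
from scipy.stats import ttest_1samp

# Hypothetical distances (miles) until stress for 8 runners
distances = [18.2, 19.1, 17.5, 18.8, 20.0, 17.9, 19.4, 18.6]

t_stat, p_value = ttest_1samp(distances, popmean=17)
print(round(t_stat, 3), round(p_value, 4))
# reject Ho: mu = 17 at alpha = 0.05 whenever |t| >= t(0.025, 7) = 2.365
```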
It is possible to make inference about another type of mean, the mean of the difference
between two matched groups. For example, the mean difference between pretest scores
and post-test scores for a certain course or the mean difference in reaction time when the
same subjects have received a certain drug or have not received the drug might be
desired. In such situations, the experimenter will have two sets of sample data (in the
examples just given, pretest/post-test or received/did not receive); however, both sets are
obtained from the same subjects. Sometimes the matching is done in other ways, but the
object is always to remove extraneous variability from the experiment. For example,
identical twins might be used to control for genetically caused variability or two types of
seeds are planted in identical plots of soil under identical conditions to control for the
effect of environment on plant growth. If the experimenter is dealing with two matched
groups, the two sets of sample data contain corresponding members; thus he has,
essentially, one set consisting of pairs of data. Inference about the mean difference
between these two dependent groups can be made by working with the differences within
the pairs and using a t distribution with n - 1 degrees of freedom in which n is the number
of pairs.
Example: Matched-Pair t Test
Two types of calculators are compared to determine if there is a difference in the time
required to perform a certain common statistical calculation. Twelve students chosen at
random are given drills with both calculators so that they are familiar with the operation
of each type. Then the time they take to complete the calculation on each device is
measured in seconds (which calculator they are to use first is determined by some random
procedure to control for any additional learning during the first calculation). The data are
as follows:

The null hypothesis is H0: μd = 0 and Ha: μd ≠ 0, in which μd is the population mean of
the difference in time on the two devices. Thus the test statistic is
t = d̄ / (sd/√n), with v = n − 1 = 11 degrees of freedom,
and since the computed t > t(0.025, 11) = 2.201, the test is
significant and the two calculators differ in the time necessary to perform the calculation.
Looking at the data, since d̄ is positive, the experimenter concludes that the calculation
is faster on machine B.
In the above example, the experimenter was interested in whether there is a difference in
time required on the two calculators; thus H0: μd = 0 was tested. The population mean
specified in the null hypothesis need not be zero; it could be some other specified
amount. For example, in an experiment about reaction time the experimenter might
hypothesize that after taking a certain drug reaction times are slower by 2 seconds; then
H0: μd = 2 would be tested. The alternative hypothesis may be
one-tailed or two-tailed, as appropriate for the experimental question.
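A matched-pair test reduces to a one-sample t test on the within-pair differences, which is what scipy.stats.ttest_rel computes. A small sketch with illustrative times, since the calculator data from the example are not reproduced in the text:

```python
from scipy.stats import ttest_rel

# Hypothetical times (seconds) for the same 12 students on calculators A and B
times_a = [23, 18, 29, 22, 33, 20, 17, 25, 27, 30, 25, 27]
times_b = [19, 18, 24, 23, 31, 22, 16, 23, 24, 26, 24, 28]

t_stat, p_value = ttest_rel(times_a, times_b)   # t test on paired differences
print(round(t_stat, 3), round(p_value, 4))
# significant at alpha = 0.05 whenever |t| > t(0.025, 11) = 2.201
```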
Using a matched-pair design is a way to control extraneous variability. If the study of the
two calculators involved a random sample of 12 students who used calculator A and
another random sample of 12 students who used calculator B, additional variability
would be introduced because the two groups are made up of different people. Even if
they were to use the same calculator, the means of the two groups would probably be
different. If the differences among people are large, they interfere with our ability to

detect any difference due to the calculators. If possible, a design involving two dependent
samples that can be analyzed by a matched-pair t test is preferable to two independent
samples.

F-Tests
Inference about two variances
There are situations, of course, in which the variances of the two populations under
consideration are different. The variability in the weights of elephants is certainly
different from the variability in the weights of mice, and in many experiments, even
though we do not have these extremes, the treatments may affect the variances as well as
the means.
The null hypothesis H0: σ1² = σ2² is tested by using a statistic that is in the form of a ratio
rather than a difference; the statistic is s1²/s2². Intuitively, if the variances are equal, this
ratio should be approximately equal to 1, so values that differ greatly from 1 indicate
inequality.
It has been found that the statistic s1²/s2² from two normal populations with equal
variances follows a theoretical distribution known as an F distribution. The density
functions for F distributions are known, and we can get some understanding of their
nature by listing some of their properties. Let us call a random variable that follows an F
distribution F; then the following properties exist:
1. F > 0.
2. The density function of F is not symmetrical.
3. F depends on an ordered pair of degrees of freedom v1 and v2; that is, there is a
different F distribution for each ordered pair (v1, v2). (v1 corresponds to the degrees of
freedom of the numerator of s1²/s2² and v2 corresponds to the denominator.)
4. If a is the area under the density curve to the right of the value F(a; v1, v2), then
F(a; v1, v2) = 1/F(1 − a; v2, v1)
5. The F distribution is related to the t distribution:
F(a; 1, v2) = (t(a/2; v2))²
Table A.12 in the Appendix gives upper critical values for F if a = 0.050, 0.025, 0.010,
0.005, 0.001. Lower-tail values can be found using property 4 above.
Example Testing for the Equality of Two Variances
Both rats and mice carry ectoparasites that can transmit disease organisms to humans. To
determine which of the two rodents presents the greater health hazard in a certain area, a
public health officer traps (presumably at random) both and counts the number of
ectoparasites each carries. The data are presented first in side-by-side stem-and-leaf plots
and then as side-by-side box-and-whisker plots:

He wants to test for the equality of means with a group comparison t test. He assumes
that these discrete counts are approximately normally distributed, but because he is
studying animals of different species, sizes, and body surface areas, he has some doubts
about the equality of the variances in the two populations, and the box plots seem to
support that concern. Thus he first must test
H0: σ1² = σ2² against Ha: σ1² ≠ σ2²
with the test statistic F = s1²/s2² = 43.4/13.0 = 3.34. Since n1 = 31 and n2 = 9, the degrees
of freedom for the numerator are v1 = n1 − 1 = 30 and for the denominator v2 = n2 − 1 = 8.
From the table,
F(0.05; 30, 8) = 3.079 and F(0.05; 8, 30) = 2.266
thus the region of rejection at a = 0.10 is F ≥ F(0.05; 30, 8) = 3.079 or
F ≤ F(0.95; 30, 8) = 1/F(0.05; 8, 30) = 1/2.266 = 0.441
Since the computed F equals 3.34, the null hypothesis is rejected, and the public health
officer concludes that the variances are unequal. Since one of the sample sizes is small,
he may not perform the usual t test for two independent samples.
One-tailed tests of hypotheses involving the F distribution can also be performed, if
desired, by putting the entire probability of a Type I error in the appropriate tail. Central
confidence intervals on σ1²/σ2² are found as follows:
(s1²/s2²) · (1/F(a/2; v1, v2)) ≤ σ1²/σ2² ≤ (s1²/s2²) · F(a/2; v2, v1)
Although the public health officer cannot perform the usual t test for two independent
samples because of the unequal variances and the small sample size, there are
approximation methods available. One such test is called the Behrens-Fisher, or t′,
test for two independent samples; it uses adjusted degrees of freedom.
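The variance-ratio test of the rodent example can be reproduced in a few lines (assuming SciPy; the summary statistics are those quoted above):

```python
from scipy.stats import f

s1_sq, n1 = 43.4, 31    # rats: sample variance and size
s2_sq, n2 = 13.0, 9     # mice: sample variance and size

F = s1_sq / s2_sq                 # test statistic, about 3.34
v1, v2 = n1 - 1, n2 - 1           # degrees of freedom 30 and 8
upper = f.ppf(0.95, v1, v2)       # F(0.05; 30, 8), about 3.08
lower = 1 / f.ppf(0.95, v2, v1)   # 1 / F(0.05; 8, 30), about 0.44
print(round(F, 2), round(upper, 3), round(lower, 3))
# reject Ho: sigma1^2 = sigma2^2 at alpha = 0.10, since F > upper
```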

Chapter -10
Linear Programming
The organizations today are working in a highly competitive and dynamic external
environment. Not only the number of decisions required to be made has increased
tremendously but the time period within which these have to be made has also shortened
considerably. Decisions can no longer be taken on the basis of personal experience or
gut feeling. This has resulted in a need for the application of appropriate scientific
methods in decision making.
The name operations research (O.R.) is taken directly from the context in which it was
developed and applied. Subsequently, it came to be known by several other names, such as:
1. Management science,
2. Decision science,
3. Quantitative methods and
4. Operational analysis.
After the war, the success of these techniques in the military provided a much-needed boost
to the discipline. The industry during that period was struggling to cope with the increase in
complexity. There were complex decision problems, solutions to which were neither
apparent nor forthcoming.
The successful implementation of operations research techniques during the war was
probably the most important event the industry was waiting for. This paved the way for
the application of OR to business and industry. As business requirements changed,
newer and better operations research techniques evolved.
Another factor which has significantly contributed to the development of OR during the
last few decades is the development of high-speed computers capable of performing a
large number of operations in a very short time period. Since the 1960s, there has been a
rapid increase in the areas in which operations research has found acceptability. Apart
from industry and business, OR also finds applicability in areas such as:
1. Regional planning,
2. Telecommunications,
3. Crime investigation,
4. Public transportation and
5. Medical sciences.
Operations research has now become one of the most important tools in decision-making
and is currently being taught under various management and business programs.
Due to the fast pace at which it has developed and gained widespread acceptance,
professional societies devoted to the cause of operations research and its allied activities
have been founded world-wide; e.g., the Institute of Management Sciences, founded in 1953,
seeks to integrate scientific knowledge with the management of an industrial house by
applying quantitative methodology to the functional aspects of management.

Critical Path Method (CPM) and Project Evaluation and Review Technique (PERT) were
developed in 1958. These are extensively used in scheduling and monitoring complex
and lengthy projects that are prone to time and cost over-runs. PERT is now considered
a standard management technique and finds applicability in such diverse areas as:
1. Construction projects,
2. Ship-building projects,
3. Transportation projects and
4. Military projects.
A large number of business and industrial houses had adopted the methodology of
operations research techniques by the early 1970s.
The first use of OR techniques in India was in the year 1949 at Hyderabad, where an
independent operations research unit was set up at the Regional Research Institute to
identify, evaluate and solve problems related to:
1. Planning,
2. Purchases and
3. Proper maintenance of stores.

Definition of Operations Research and Its Main Features or Characteristics


The phrase operations research is taken from military operations, where it was first used.
It lays emphasis on the various activities within an organization with a view to
controlling and co-ordinating them.
Operational Research seeks the optimal solution to a problem, not merely one which
gives a better solution than the one presently in use. A decision taken may not be
acceptable to each and every department within an organization, but it should be the
optimal decision for the greater part of the organization. To arrive at such decisions the
decision maker must follow up the effects and interactions of a particular decision.
Although a large number of definitions of operations research have been given from time
to time, it is almost impossible to provide a single definition of OR which has
uniform acceptability.
However, we provide some of the important definitions of OR:
(1) Operational Research is the application of the methods of science to complex
problems arising in the direction and management of large systems of
a. men,
b. machines,
c. materials and
d. money
in industry, business, government and defence. The distinctive approach is to develop a
scientific model of the system, incorporating measurements of factors such as chance
and risk, with which to predict and compare the outcomes of alternative decisions,
strategies and controls. The purpose is to help the management determine its policy and
actions scientifically.

Besides being too lengthy, this definition has also been criticized since it focuses on
complex problems and large systems, giving the impression that OR is a highly sophisticated
and technical approach which is suitable only for very large and complex organizations.
(2) OR is an experimental and applied science devoted to observing, understanding and
predicting the behavior of purposeful man-machine systems; and operations
research workers are actively engaged in applying this knowledge to practical problems
in business, government and society.

Models in Operations Research


Contrary to popular belief, operations research is not a highly advanced and
complicated discipline, totally independent of every other branch of knowledge and geared
towards helping managers make effective decisions. On the contrary, it borrows
quite heavily from other fields, especially mathematics and its allied disciplines. OR
should be viewed as a mix of all the problem-solving techniques available to the
management, out of which an organisation has to settle for the one which best suits its
long-term requirements. Following are some of the models extensively used in OR in
solving various types of problems:
1. Allocation Models
The resources, whether natural, man-made or financial, are scarce, while the pressure
on these resources is many-fold due to the diverse competing needs of the organization.
Allocation models are concerned solely with the problem of optimal allocation of these
precious and scarce resources: optimizing the given objective function subject to the
limiting factors prevailing at that point of time, or the constraints within which a firm has
to operate most effectively. All such problems related to the allocation aspects are
collectively known as mathematical programming problems.
Linear programming, very widely used in OR, is just one example of a mathematical
programming problem.

Linear vs Non-linear Programming Model


The nature of the objective function along with its constraints separates a linear
programming problem from a non-linear one. Organisational problems, where both the
objective and the constraint functions are capable of being represented as linear
functions, are solved with the help of linear programming. Even the assignment and
transportation problems can be viewed essentially as linear programming problems,
though of a special type requiring procedures devised specially for them. On the other
hand, if the decision parameters in a linear programming problem are restricted to integer
values or to zero-one values, the problems are known as integer programming and
zero-one programming respectively. Linear goal programming models are concerned with
those special problems which have conflicting, multiple objective functions in
relation to their linear constraints.

2. Simulation Models
It is very much similar to the management's trial-and-error approach to decision-making.
To simulate is to duplicate the features of the problem in a working model, which is then
solved using well-known OR techniques. The results thus obtained are tested for
sensitivity, after which they are applied to the original problem. By simulating
the characteristics and features of the organisational problem in a model, the various
decisions can be evaluated, and the risks inherent in actually implementing them are
drastically cut down or eliminated. Simulation models are normally used for those kinds
of problems or situations which cannot be studied or understood by any other technique.
3. Inventory Models
The inventory models are primarily concerned with the optimal stock or inventory
policies of the organisation. Inventory problems deal with the determination of optimum
levels of different inventory items and ordering policies, optimizing a pre-specified
standard of effectiveness. They are concerned with factors such as:
1. demand per unit time,
2. cost of placing orders,
3. costs incurred while keeping the goods in inventory,
4. stock-out costs and
5. costs of lost sales etc.
If a customer demands a certain quantity of a product which is not available, then it
results in a lost sale. On the other hand, excess inventories mean blocked working capital,
which is the life blood of modern business. Similarly, in the case of raw materials,
shortage of even a very small item may cause bottlenecks in production, and the entire
assembly line may come to a halt. Inventory models are also useful in dealing with
quantity discounts and multiple products. These models can be of two types,
1. deterministic and
2. probabilistic,
and are used in calculating various important decision variables such as:
1. re-order quantity,
2. lead time,
3. economic order quantity and
4. the pessimistic, optimistic and most likely levels of stock keeping.
4. Network Models
Networking models are extensively used in planning, scheduling and controlling complex
projects which can be represented in the form of a network of various activities and
sub-activities.
Two of the most important and commonly used networking models are
1. Critical Path Method (CPM) and
2. Programme Evaluation & Review Technique (PERT).
PERT is the better known and more extensively applied of the two, and it involves
finding the time requirements of a given project and the allocation of scarce resources to
complete the project as scheduled, i.e., within the planned stipulated time and with
minimum cost.

5. Sequencing Models
Sequencing models deal with the selection of the most appropriate or optimal
sequence in which a series of jobs can be performed on different machines so as to
maximize the operational efficiency of the system. For example, consider a job shop where
a number of jobs are required to be processed on several machines. Different jobs require
different amounts of time on different machines, and each job must be processed on all the
machines. In what order should the jobs be processed so as to minimize the total processing
time of all the jobs? There are several variations of the same problem, with different kinds
of optimization criteria, which can be evaluated by sequencing models. Hence, sequencing
is primarily concerned with those problems in which the efficiency of operations depends
solely upon the sequence in which a series of jobs is performed.
6. Competitive Problems Models
The competitive problems deal with making decisions under conflict caused by opposing
interests or under competition.
Many problems related to business, such as bidding for the same contract, competition
for market share, and negotiating with labour unions and other associations, involve
intense competition. Game theory is the OR technique which is used in such situations,
where only one of the two or more players can win.
However, the competitive model has yet to find widespread industrial and business
acceptability. Its biggest drawback is that it is too idealistic in outlook and fails to take
into consideration the actual reality and other related factors within which an
organisation has to operate.
7. Queuing or Waiting Line Models
Any problem that involves waiting before the required service can be provided is
termed a queuing or waiting-line problem. These models seek to ascertain the various
important characteristics of queuing systems, such as:
1. average time spent in line by a customer,
2. average length of the queue, etc.
The waiting-line models find very wide applicability across virtually every organisation
and in our daily life. Examples of queuing or waiting-line situations are:
1. waiting for service in a bank,
2. waiting lists in schools,
3. waiting for purchases, etc.
These models aim at minimizing the cost of providing service. Most of the realistic
waiting line problems are extremely complex and often simulation is used to analyze
such situations.
8. Replacement Models
These models are concerned with determining the optimal time to replace
equipment or machinery that deteriorates or fails. Hence they seek to formulate the optimal
replacement policy of an organization.
For example, when should the old machine in a factory be replaced with a newer one, or at
what interval should an old car be replaced with a newer one? In all such cases there exists
an economic trade-off between the increasing and the decreasing cost functions.

9. Routing and Trans-Shipment Models


This category of problems involves finding the optimal route from the starting or
initiation point (i.e., the origin) to the final or termination point (i.e., the destination),
where a finite number of possible routes are available.
For example:
a. travelling salesman problems,
b. finding the shortest path and
c. transport dispatching problems
can be solved by the routing and trans-shipment models.
10. Search Models
The main objective of search models is:
a. to search,
b. ascertain and
c. retrieve the relevant information required by a decision maker.
For example:
a. auditing text-books for errors,
b. storage and retrieval of data in computers and
c. exploration for natural resources
are problems falling within the domain of the search models.
11. Markovian Models
These are advanced operational research models and are applicable in highly
specialized cases or situations, e.g.:
a. where the state of a system is capable of being accurately defined by a precise
numerical value, or
b. where the system is in a state of flux, i.e., it moves from one state to another in a way
which can be computed using the theorems of probability. An example of the application
of a Markovian model is the brand-switching problem considered in marketing research.
The various steps involved in the operations research methodology can be depicted with
the help of a flow diagram as under:
Step I
Minute examination of the organisational environment and the problem area.
Step II
Acknowledge, analyse and rationally formulate the problem; construct a model based on
the above.
Step III
Solve the model on the basis of the relevant data.
Step IV
Test the accuracy, authenticity and reasonableness of the solution.
Step V
Implementation stage: establish control mechanisms.

Introduction to linear programming


Linear Programming is that branch of mathematical programming which is designed to
solve optimization problems where all the constraints as well as the objective are
expressed as linear functions. Linear Programming is a technique for making decisions
under certainty, i.e., when all the courses of action available to an organization are
known and the objective of the firm along with its constraints are quantified. That course of
action is chosen out of all possible alternatives which yields the optimal result. Linear
Programming can also be used as a verification and checking mechanism to ascertain the
accuracy and the reliability of decisions which are taken solely on the basis of a
manager's experience, without the aid of a mathematical model. Some of the definitions
of Linear Programming are as follows:
Linear Programming is a method of planning and operation involved in the construction
of a model of a real-life situation having the following elements:
(a) variables which denote the available choices, and
(b) the related mathematical expressions which relate the variables to the
controlling conditions, reflect clearly the criteria to be employed for measuring the
benefits flowing out of each course of action, and provide an accurate measurement of
the organization's objective. The method may be so devised as to ensure the selection of
the best alternative out of the large number of alternatives available to the organization.
- Kohler
Linear Programming is the analysis of problems in which a linear function of a number
of variables is to be optimized (maximized or minimized) when those variables are
subject to a number of constraints expressed mathematically as linear inequalities.
From the above definitions, it is clear that:
I. Linear Programming is an optimization technique, where the underlying
objective is either to maximize the profits or to minimize the cost function.
II. It deals with the problem of allocation of finite, limited resources amongst
different competing activities in the most optimal manner.
III. It generates solutions based on the features and characteristics of the actual
problem or situation. Hence the scope of linear programming is very wide, as it
finds application in such diverse fields as marketing, production, finance and
personnel, etc.
IV. Linear Programming has been highly successful in solving the following
types of problems:
Product-mix problems,
Investment planning problems,
Blending strategy formulations and
Marketing & distribution management.
V. Even though Linear Programming has wide and diverse applications, all LP
problems have the following properties in common:
The objective is always the same (i.e., profit maximisation or cost minimization).
Presence of constraints, which limit the extent to which the objective can be
pursued/achieved.
Availability of alternatives, i.e., different courses of action to choose from.
The objectives and constraints can be expressed in the form of linear relations.
Regardless of the size or complexity, all LP problems take the same form, i.e.,
allocating scarce resources among various competing alternatives. Irrespective
of the manner in which one defines Linear Programming, a problem must have
certain basic characteristics before this technique can be utilized to find the
optimal values.
The characteristics or the basic assumptions of linear programming are as follows:
1. Decision or Activity Variables & Their Inter Relationship.
The decision or activity variables refer to those activities which are in competition with
other variables for limited resources. Examples of such activity variables are: services,
projects, products, etc. These variables are most often inter-related in terms of utilization
of the scarce resources and need simultaneous solutions. It is important to ensure that the
relationship between these variables is linear.
2. Finite Objective Functions.
A Linear Programming problem requires a clearly defined, unambiguous objective
function which is to be optimized. It should be capable of being expressed as a linear
function of the decision variables. Single-objective optimization is one of the most
important pre-requisites of linear programming. Examples of such objectives can be:
cost minimization; sales, profit or revenue maximization; and idle-time minimization.
3. Limited Factors/Constraints.
These are the different kinds of limitations on the available resources, e.g., important
resources like availability of machines, number of man-hours available, production
capacity and number of available markets or consumers for finished goods are often
limited even for a big organization. Hence, it is rightly said that each and every
organization functions within overall constraints, both internal and external. These limiting
factors must be capable of being expressed as linear equations or inequations in terms of
decision variables.
4. Presence of Different Alternatives.
Different courses of action or alternatives should be available to the decision maker, who is
required to choose the one which is most effective or optimal. For example,
many grades of raw material may be available, the same raw material can be purchased
from different suppliers, the finished goods can be sold in various markets, and production
can be done with the help of different machines.
5. Non-Negative Restrictions.
Since the negative value of (any) physical quantity has no meaning, therefore all the
variables must assume non-negative values. If some of the variables are unrestricted in
sign, the help of certain mathematical tools can enforce the non- negativity restriction
without altering the original information contained in the problem.
6. Linearity Criterion.
The relationship among the various decision variables must be directly proportional i.e.;
both the objective and the constraint must be expressed in terms of linear equations or
inequalities.
For example, if one of the factor inputs (resources like material, labour, plant capacity,
etc.) increases, then it should result in a proportionate change in the final output. These
linear equations and inequations can be presented graphically as straight lines.
7. Additivity.
It is assumed that the total profitability and the total amount of each resource utilized
would be exactly equal to the sum of the respective individual amounts. Thus the function
or the activities must be additive - and the interaction among the activities of the
resources does not exist.
8. Mutually Exclusive Criterion.
All decision parameters and variables are assumed to be mutually exclusive. In other
words, the occurrence of any one variable rules out the simultaneous occurrence of other
such variables.
9. Divisibility.
Variables may be assigned fractional values. i.e.; they need not necessarily always be in
whole numbers. If a fraction of a product cannot be produced, an integer-programming
problem exists. Thus, the continuous values of the decision variables and resources must
be permissible in obtaining an optimal solution.
10. Certainty.
It is assumed that conditions of certainty exist, i.e., all the relevant parameters or
coefficients in the Linear Programming model are fully and completely known, and that
they don't change during the period under study. However, such an assumption may not
hold good at all times.
11. Finiteness.
Linear Programming assumes the presence of a finite number of activities and constraints,
without which it is not possible to obtain the best or optimal solution. Now it is time to
examine the advantages as well as the limitations of Linear Programming.

Advantages of Linear Programming approach


1. Scientific Approach to Problem Solving.
Linear Programming is the application of the scientific approach to problem solving. Hence
it results in a better and truer picture of the problem, which can then be minutely analysed
and solutions ascertained.
2. Evaluation of All Possible Alternatives.
Most of the problems faced by present-day organisations are highly complicated and
cannot be solved by the traditional approach to decision making. The technique of Linear
Programming ensures that all possible solutions are generated, out of which the optimal
solution can be selected.
3. Helps in Re-Evaluation.
Linear Programming can also be used in the re-evaluation of a basic plan for changing
conditions. Should the conditions change while the plan is carried out only partially,
these conditions can be accurately determined with the help of Linear Programming so as
to adjust the remainder of the plan for best results.
4. Quality of Decision.
Linear Programming provides practical and better quality of decisions that reflect very
precisely the limitations of the system i.e.; the various restrictions under which the system
must operate for the solution to be optimal. If it becomes necessary to deviate from the
optimal path, Linear Programming can quite easily evaluate the associated costs or
penalty.
5. Focus on Grey Areas.
The highlighting of grey areas or bottlenecks in the production process is one of the most
significant merits of Linear Programming.
During periods of bottlenecks, imbalances occur in the production department: some
of the machines remain idle for long periods of time, while other machines are unable
to meet the demand even at peak performance levels.
6. Flexibility.
Linear Programming is an adaptive & flexible mathematical technique and hence can be
utilized in analyzing a variety of multi-dimensional problems quite successfully.
7. Creation of Information Base.
By evaluating the various possible alternatives in the light of the prevailing constraints,
Linear Programming models provide an important database from which the allocation of
precious resources can be done rationally and judiciously.
8. Maximum optimal Utilization of Factors of Production.
Linear Programming helps in optimal utilization of various existing factors of production
such as installed capacity, labour and raw materials etc.

Limitations of Linear Programming


Although Linear Programming is a highly successful tool having wide applications in
business and trade for solving optimization problems, it has certain demerits or
limitations. Some of the important limitations in the application of Linear Programming are
as follows:
1. Linear Relationship.
Linear Programming models can be successfully applied only in those situations where a
given problem can clearly be represented in the form of linear relationship between
different decision variables. Hence it is based on the implicit assumption that the
objective as well as all the constraints or the limiting factors can be stated in term of
linear expressions - which may not always hold well in real life situations. In practical
business problems, many objective function & constraints cannot be expressed linearly.
Most of the business problems can be expressed quite easily in the form of a quadratic
equation (having a power 2) rather than in the terms of linear equation. Linear
Programming fails to operate and provide optimal solutions in all such cases.
e.g., a problem capable of being expressed in the form ax² + bx + c = 0, where a ≠ 0,
cannot be solved with the help of Linear Programming techniques.
2. Constant Values of Objective & Constraint Equations.
Before a Linear Programming technique could be applied to a given situation, the values
or the coefficients of the objective function as well as the constraint equations must be
completely known. Further, Linear Programming assumes these values to be constant
over a period of time. In other words, if the values were to change during the period of
study, the technique of LP would lose its effectiveness and may fail to provide optimal
solutions to the problem.
However, in real-life practical situations it is often not possible to determine the
coefficients of the objective function and the constraint equations with absolute certainty.
These variables may in fact lie on a probability distribution curve, and hence, at best, only
the likelihood of their occurrence can be predicted.
3. No Scope for Fractional Value Solutions.
There is absolutely no certainty that the solution to an LP problem can always be
quantified as an integer; quite often, Linear Programming may give fractional-valued
answers, which are then rounded off to the nearest integer. Hence, the solution may not be
the optimal one. For example, in finding out the number of men and machines required to
perform a particular job, a fractional solution would be meaningless.
4. High Degree of Complexity.
Many large-scale real-life practical problems cannot be solved by employing Linear
Programming techniques, even with the help of a computer, due to the highly complex and
lengthy calculations involved. Assumptions and approximations are required to be made so
that the given problem can be broken down into several smaller problems which are then
solved separately. Hence, the validity of the final result in all such cases may be doubtful.

5. Multiplicity of Goals.
The long-term objectives of an organisation are not confined to a single goal. An
organisation, at any point of time in its operations, has a multiplicity of goals, or a goals
hierarchy, all of which must be attained on a priority basis for its long-term growth.
Some of the common goals can be profit maximization or cost minimization, retaining
market share, maintaining a leadership position and providing quality service to
consumers. In cases where the management has conflicting, multiple goals, the Linear
Programming model fails to provide an optimal solution, the reason being that under
Linear Programming techniques there is only one goal which can be expressed in the
objective function. Hence, in such circumstances, the given problem has to
be solved with the help of a different mathematical programming technique called
Goal Programming.

6. Lack of Flexibility.
Once a problem has been properly quantified in terms of the objective function and the
constraint equations and the tools of Linear Programming are applied to it, it becomes
very difficult to incorporate any changes in the system arising on account of any change
in the decision parameters. Hence, it lacks the desired operational flexibility.
The basic model of Linear Programming:
Linear Programming is a mathematical technique for generating and selecting the optimal
or best solution for a given objective function. Technically, Linear Programming may
be formally defined as a method of optimizing (i.e., maximizing or minimizing) a linear
function subject to a number of constraints stated in the form of linear inequations.
Mathematically, the problem of Linear Programming may be stated as the
optimization of a linear objective function of the following form:
Z = c1x1 + c2x2 + ... + cixi + ... + cnxn
subject to linear constraints of the form:
a11x1 + a12x2 + a13x3 + ... + a1ixi + ... + a1nxn (<= or >=) b1
a21x1 + a22x2 + a23x3 + ... + a2ixi + ... + a2nxn (<= or >=) b2
...
am1x1 + am2x2 + am3x3 + ... + amixi + ... + amnxn (<= or >=) bm
and x1, x2, ..., xn >= 0.
These last conditions are called the non-negativity constraints. From the above, it is clear
that an LP problem has:
(i) a linear objective function which is to be maximized or minimized;
(ii) various linear constraints, which are simply algebraic statements of the limits of the
resources or inputs at one's disposal; and
(iii) non-negativity constraints.
Linear Programming is one of the few mathematical tools that can be used to provide
solutions to a wide variety of large, complex managerial problems.

Graphical method of solution in linear programming


Once the Linear Programming model has been formulated on the basis of the given
objective and the associated constraint functions, the next step is to solve the problem and
obtain the best possible or optimal solution. Various mathematical and analytical
techniques can be employed for solving the Linear Programming model.
The graphic solution procedure is one of the methods of solving two-variable Linear
Programming problems. It consists of the following steps:
Step I: Defining the problem. Formulate the problem mathematically. Express it in terms
of several mathematical constraints and an objective function. The objective function
relates to the optimization aspect, i.e., the maximisation or minimisation criterion.
Step II: Plot the constraints graphically. Each inequality in the constraint set has to
be treated as an equation. An arbitrary value is assigned to one variable and the value of the
other variable is obtained by solving the equation. In a similar manner, a different
arbitrary value is assigned to the variable and the corresponding value of the other
variable is obtained.
These two sets of values are now plotted on a graph and connected by a straight line. The
same procedure is repeated for all the constraints. Hence, the total number of straight lines
will equal the total number of constraint equations, each straight line representing one
constraint.
Step III: Locate the solution space. Solution space or the feasible region is the graphical
area which satisfies all the constraints at the same time. The optimal solution point (x, y)
always occurs at a corner point of the feasible region. The feasible region is determined as
follows:
(a) For 'greater than' and 'greater than or equal to' constraints, the feasible region
or the solution space is the area that lies above the constraint lines.
(b) For 'less than' and 'less than or equal to' constraints, the feasible region or the
solution space is the area that lies below the constraint lines.
Step IV: Selecting the graphic technique. Select the appropriate graphic technique to be
used for generating the solution. Two techniques, viz., the Corner Point Method and the
Iso-profit (or Iso-cost) Method, may be used. We give below both these techniques;
however, it is easier to generate a solution by using the corner point method.
(a) Corner Point Method
(i) Since the solution point (x. y) always occurs at the corner point of the feasible
or solution space, identify each of the extreme points or corner points of the
feasible region by the method of simultaneous equations.

(ii) By putting the value of each corner point's co-ordinates into the objective
function, calculate the profit (or the cost) at each of the corner points.
(iii) In a maximisation problem, the optimal solution occurs at that corner point
which gives the highest profit.
(iv) In a minimisation problem, the optimal solution occurs at that corner point
which gives the lowest cost.
(b) Iso-Profit (or Iso-Cost) Method. The term iso-profit signifies that any combination
of points on an iso-profit line produces the same profit as any other combination on the
same line. The various
steps involved in this method are given below.
(i) Selecting a specific figure of profit or cost, an iso-profit or iso-cost line is
drawn up so that it lies within the shaded area.
(ii) This line is moved parallel to itself and farther or closer with respect to the
origin till that point after which any further movement would lead to this line
falling totally out of the feasible region.
(iii) The optimal solution lies at the point in the feasible region which is touched
by the highest possible iso-profit or the lowest possible iso-cost line.
(iv) The co-ordinates of the optimal point (x, y) are calculated with the help of
simultaneous equations and the optimal profit or cost is ascertained.

Example: A retired person wants to invest up to an amount of Rs. 30,000 in fixed-income
securities. His broker recommends investing in two bonds: bond A, yielding 7%
per annum, and bond B, yielding 10% per annum. After some consideration, he decides to
invest at most Rs. 12,000 in bond B and at least Rs. 6,000 in bond A. He also wants
the amount invested in bond A to be at least equal to the amount invested in bond
B. What should the broker recommend if the investor wants to maximize his return on
investment? Solve graphically.

Solution. Designate the decision variables x1 and x2 as the amounts invested in bond
A and bond B respectively. Then the appropriate mathematical formulation of the given
problem as an LP model is:
Maximize (total return) Z = 0.07x1 + 0.10x2
subject to the constraints
x1 + x2 <= 30,000
x1 >= 6,000
x2 <= 12,000
x1 - x2 >= 0
x1 >= 0, x2 >= 0
Plotting the constraints graphically gives the figure shown: the shaded portion represents
the feasible region, and the corner points of the feasible region are A, B, C, D and E.

The values of the objective function at the corner points are summarized in the following
table:

Corner    Coordinates        Value of the objective function
point     (x1, x2)           Z = 0.07x1 + 0.10x2
A         (6,000; 0)         0.07(6,000) + 0.10(0)       = 420
B         (6,000; 6,000)     0.07(6,000) + 0.10(6,000)   = 1,020
C         (12,000; 12,000)   0.07(12,000) + 0.10(12,000) = 2,040
D         (18,000; 12,000)   0.07(18,000) + 0.10(12,000) = 2,460
E         (30,000; 0)        0.07(30,000) + 0.10(0)      = 2,100

Thus the maximum value of Z is Rs. 2,460 and it occurs when x1 = 18,000 and x2 =
12,000. Hence, the person should invest Rs. 18,000 in bond A and Rs. 12,000 in bond B.
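The same answer can be cross-checked with a solver. The sketch below uses scipy.optimize.linprog, which is assumed to be available; since linprog minimises, the objective coefficients are negated:

```python
from scipy.optimize import linprog

c = [-0.07, -0.10]     # maximise Z = 0.07*x1 + 0.10*x2
A_ub = [[1, 1],        # x1 + x2 <= 30000
        [0, 1],        # x2 <= 12000
        [-1, 1]]       # x1 - x2 >= 0, rewritten as -x1 + x2 <= 0
b_ub = [30000, 12000, 0]
bounds = [(6000, None), (0, None)]   # x1 >= 6000, x2 >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)               # about [18000, 12000] and 2460
```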

Example: X Ltd wishes to purchase a maximum of 3600 units of a product; two types
of the product, A and B, are available in the market. Product A occupies a space of 3 cubic
feet and costs Rs. 9 per unit, whereas product B occupies a space of 1 cubic foot and costs
Rs. 13 per unit. The budgetary constraints of the company do not allow spending more than
Rs. 39,000. The total availability of space in the company's godown is 6000 cubic feet.
The profit margins of products A and B are Rs. 3 and Rs. 4 per unit respectively. Formulate
this as a linear programming model and solve it using the graphical method. You are
required to ascertain the best possible combination of purchases of A and B so that the
total profits are maximized.

Solution. Let x1 = number of units of product A and x2 = number of units of product B.
Then the problem can be formulated as an LP model as follows:
Objective function: Maximise Z = 3x1 + 4x2
Constraint equations:
x1 + x2 <= 3600 (maximum units constraint)
3x1 + x2 <= 6000 (storage area constraint)
9x1 + 13x2 <= 39000 (budgetary constraint)
x1, x2 >= 0 (non-negativity constraints)
Step I. Treating all the constraints as equalities, the first constraint is x1 + x2 = 3600.
Put x1 = 0, then x2 = 3600: the point is (0, 3600).
Put x2 = 0, then x1 = 3600: the point is (3600, 0).
Step II. Determine the set of points which satisfy the constraint x1 + x2 <= 3600. This can
easily be done by verifying whether the origin (0, 0) satisfies the constraint. Here,
0 + 0 < 3600, so all the points on and below the line satisfy the constraint.
Step III. The second constraint is 3x1 + x2 <= 6000.
Put x1 = 0, then x2 = 6000: the point is (0, 6000).
Put x2 = 0, then x1 = 2000: the point is (2000, 0).
Now draw its graph.
Step IV. As in Step II above, determine the set of points which satisfy the constraint. Since
0 + 0 < 6000, all the points on and below the line satisfy the constraint.
Step V. The third constraint is 9x1 + 13x2 <= 39000.
Put x1 = 0, then x2 = 3000: the point is (0, 3000).
Put x2 = 0, then x1 = 39000/9, i.e., approximately 4333: the point is (about 4333, 0).
Now draw its graph.
Step VI. Again the point (0, 0), i.e., the origin, satisfies the constraint; hence all the points
on and below the line satisfy the constraint.
Step VII. The intersection of the above graphs denotes the feasible region for the given
problem.

Step VIII. Finding the optimal solution. Always keep in mind two things:
For >= constraints the feasible region will be the area which lies above the constraint lines,
and for <= constraints it will lie below the constraint lines. This is useful in identifying the
feasible region.
According to a theorem on linear programming, an optimal solution to a problem (if it
exists) is found at a corner point of the solution space.
Step IX. At the corner points (O, A, B, C), find the profit value from the objective
function. The point which maximizes the profit is the optimal point.

Corner point    Co-ordinates    Objective function Z = 3x1 + 4x2    Value
O               (0, 0)          Z = 0 + 0                               0
A               (0, 3000)       Z = 0 + 4 x 3000                   12,000
C               (2000, 0)       Z = 3 x 2000 + 0                    6,000

For point B, solve the equations 3x1 + x2 = 6000 and 9x1 + 13x2 = 39000 simultaneously
(B is the point where these two constraint lines intersect):
3x1 + x2 = 6000 ... (1)
9x1 + 13x2 = 39000 ... (2)
On solving, we get x1 = 1300 and x2 = 2100.
At point B (1300, 2100):
Z = 3x1 + 4x2 = 3 x 1300 + 4 x 2100
= 12,300, which is the maximum value.

Result. The optimal solution is:
Number of units of product A = 1300
Number of units of product B = 2100
Total profit = Rs. 12,300, which is the maximum.
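As with the previous example, this result can be verified with scipy.optimize.linprog (an assumed tool, not part of the worked solution; non-negativity is linprog's default bound):

```python
from scipy.optimize import linprog

c = [-3, -4]             # maximise Z = 3*x1 + 4*x2
A_ub = [[1, 1],          # x1 + x2 <= 3600       (maximum units)
        [3, 1],          # 3*x1 + x2 <= 6000     (storage space)
        [9, 13]]         # 9*x1 + 13*x2 <= 39000 (budget)
b_ub = [3600, 6000, 39000]

res = linprog(c, A_ub=A_ub, b_ub=b_ub)   # x1, x2 >= 0 by default
print(res.x, -res.fun)                   # about [1300, 2100] and 12300
```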

Simplex Method (Problems involving only up to 3 constraints & inequalities of the <= type)
The Simplex Algorithm is a systematic and efficient algebraic procedure for finding
corner point solutions and testing them for optimality. The evaluation of corner points
always starts at the point of origin (the initial basic feasible solution), which is one of the
corners of the feasible solution space. This solution is then tested for optimality, i.e., it is
tested whether an improvement in the objective function is possible by moving to an
adjacent corner point of the feasible solution space. If an improvement is possible, then at
the new corner point the solution is again tested for optimality. This iterative search for a
better corner point is repeated until an optimal solution, if it exists, is determined.
Since the number of extreme points (i.e., corners or vertices) of the convex set of all
feasible solutions is finite, the method leads to the optimum extreme point (i.e., the optimum
or optimal solution) in a finite number of steps or indicates that there exists an unbounded
solution.

SIMPLEX METHOD - STANDARD MAXIMISATION PROBLEM


A standard maximisation problem is a linear programming problem for which the objective
function is to be maximised and all the constraints are less-than-or-equal-to inequalities.
The Cannon Hill Furniture Company produces chairs and tables. Each table takes four
hours of labour from the carpentry department and two hours of labour from the finishing
department. Each chair requires three hours of carpentry and one hour of finishing. During
the current week, 240 hours of carpentry time are available and 100 hours of finishing time.
Each table produced gives a profit of $70 and each chair a profit of $50. How many chairs
and tables should be made?
We first choose the variables - suppose x tables and y chairs are produced in the week.
The information can be summarized:

                           Tables          Chairs          Constraints
Number produced per week   x               y               x >= 0, y >= 0 (cannot be negative)
Carpentry                  4 hr/table      3 hr/chair      4x + 3y <= 240 (maximum 240 hrs for the week)
Finishing                  2 hr/table      1 hr/chair      2x + y <= 100 (maximum 100 hrs for the week)
Profit                     $70 per table   $50 per chair
The total profit ($) for the week is given by the objective function
P = 70x + 50y
When the simplex method is used in the furniture problem, the objective function
is written in terms of four variables. If the problem has a solution, then the
solution occurs at one of the vertices of a region in four-dimensional space. We
start at one of the vertices and check the neighbouring vertices to see which ones
provide a better solution. We then move to one of the vertices that give a better
solution. The process is repeated until the target vertex is reached.
The first step of the simplex method requires that each inequality be converted into an
equation. Less-than-or-equal-to inequalities are converted to equations by including slack
variables. Suppose s1 carpentry hours and s2 finishing hours remain unused in a
week. The constraints become:
4x + 3y + s1 = 240 or 4x + 3y + 1s1 + 0s2 = 240
2x + y + s2 = 100 or 2x + y + 0s1 + 1s2 = 100
As unused hours result in zero profit, the slack variables can be included in the objective
function with zero coefficients:
P = 70x + 50y + 0s1 + 0s2
The problem can now be considered as solving a system of 3 linear equations involving
the 5 variables x, y, s1, s2, P in such a way that P has the maximum value:
4x + 3y + 1s1 + 0s2 + 0P = 240
2x + y + 0s1 + 1s2 + 0P = 100
-70x - 50y + 0s1 + 0s2 + 1P = 0

The system of linear equations can be written in matrix form or as a 3x6 augmented
matrix.

In the simplex method, the augmented matrix is referred to as the tableau.


The initial tableau is:

Basic variables     x      y     s1    s2    P    RHS
s1                  4      3      1     0    0    240
s2                  2      1      0     1    0    100
P                 -70    -50      0     0    1      0

The slack variables s1 and s2 form the initial solution mix. The initial solution assumes
that all available hours are unused, i.e., the slack variables take their largest possible values.
Variables in the solution mix are called basic variables. Each basic variable has a column
consisting of all 0s except for a single 1. All variables not in the solution mix take the
value 0.
The simplex method uses a four-step process (based on the Gauss-Jordan method for
solving a system of linear equations) to go from one tableau or vertex to the next. In this
process, a basic variable in the solution mix is replaced by another variable previously
not in the solution mix. The value of the replaced variable is set to 0.
Step 1
Select the pivot column (determine which variable to enter into the solution mix). Choose
the column with the most negative element in the objective function row.

Basic Variables      x      y     s1     s2      P     RHS
s1                   4      3      1      0      0     240
s2                   2      1      0      1      0     100
P                  -70    -50      0      0      1       0
                     ^
               pivot column (most negative entry in the objective row)
x should enter into the solution mix because each unit of x (a table) contributes a profit
of $70 compared with only $50 for each unit of y (a chair).
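Continuing the NumPy sketch above, Step 1 amounts to a single operation: take the
position of the most negative entry in the objective row.

    # Step 1: the entering variable is the column with the most negative
    # entry in the objective row (look at the x, y, s1, s2 columns only).
    pivot_col = int(np.argmin(tableau[-1, :-2]))
    print(pivot_col)   # 0, i.e. x enters the solution mix (-70 is the most negative)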
Step 2
Select the pivot row (determine which variable to replace in the solution mix). Divide the
last element in each row by the corresponding element in the pivot column. The pivot
row is the row with the smallest non-negative result.

Basic Variables      x      y     s1     s2      P     RHS    Ratio
s1                   4      3      1      0      0     240    240/4 = 60
s2                   2      1      0      1      0     100    100/2 = 50   <- pivot row
P                  -70    -50      0      0      1       0

s2 should be replaced by x in the solution mix. 60 tables can be made with the 240 unused
carpentry hours but only 50 tables can be made with the 100 unused finishing hours.
Therefore we decide to make 50 tables.
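Continuing the same sketch, the ratio test of Step 2 can be written as:

    # Step 2: divide the RHS of each constraint row by its entry in the
    # pivot column; rows with a non-positive entry are excluded.
    col = tableau[:-1, pivot_col]                        # [4, 2]
    ratios = np.where(col > 0, tableau[:-1, -1] / col, np.inf)
    pivot_row = int(np.argmin(ratios))
    print(ratios, pivot_row)   # [60. 50.] 1, i.e. s2 leaves (at most 50 tables)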
Step 3
Calculate new values for the pivot row. Divide every number in the row by the pivot
number.
R2 / 2:
Basic Variables      x      y     s1     s2      P     RHS
s1                   4      3      1      0      0     240
s2                   1    1/2      0    1/2      0      50
P                  -70    -50      0      0      1       0

Step 4
Use row operations to make all numbers in the pivot column equal to 0 except for the
pivot number which remains as 1.
R1 - 4 × R2 and R3 + 70 × R2:
Basic Variables      x      y     s1     s2      P     RHS
s1                   0      1      1     -2      0      40
x                    1    1/2      0    1/2      0      50
P                    0    -15      0     35      1    3500

If 50 tables are made, then the unused carpentry hours are reduced by 200 hours (4
h/table multiplied by 50 tables); the value changes from 240 hours to 40 hours. Making

50 tables results in the profit being increased by $3500 ($70 per table multiplied by 50
tables); the value changes from $0 to $3500.
The new tableau represents the solution x = 50, y = 0 (the vertex (50, 0)), where P = $3500.

The existence of 40 unused carpentry hours suggests that a more profitable solution can
be found. For each table removed from the solution, 4 carpentry hours and 2 finishing
hours are made available. If 2 unused carpentry hours are also taken from the 40
available, then 2 chairs can be made with the 6 carpentry hours and 2 finishing hours.
Therefore, if 1 table is replaced by 2 chairs, the marginal increase in profit is $30 (2 x $50
less $70).
Now repeat the steps until there are no negative numbers in the last row.
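Before walking through the second iteration by hand, note that the four steps, repeated
until the objective row has no negative entries, fit in a short loop. The sketch below
continues the NumPy example; the function name simplex_max is ours, and it assumes a
standard maximisation tableau of the form built earlier:

    def simplex_max(tableau):
        # Repeat Steps 1-4 until the objective row has no negative entries.
        t = tableau.astype(float).copy()
        while t[-1, :-2].min() < 0:
            pc = int(np.argmin(t[-1, :-2]))                      # Step 1: pivot column
            col = t[:-1, pc]
            ratios = np.where(col > 0, t[:-1, -1] / col, np.inf)
            pr = int(np.argmin(ratios))                          # Step 2: pivot row
            t[pr] = t[pr] / t[pr, pc]                            # Step 3: scale pivot row
            for r in range(t.shape[0]):                          # Step 4: clear the column
                if r != pr:
                    t[r] = t[r] - t[r, pc] * t[pr]
        return t

    final = simplex_max(tableau)
    print(final[-1, -1])   # 4100.0, the maximum profit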
Step 1
Select the pivot column. y should enter into the solution mix.
Basic Variables      x      y     s1     s2      P     RHS
s1                   0      1      1     -2      0      40
x                    1    1/2      0    1/2      0      50
P                    0    -15      0     35      1    3500
                            ^
                      pivot column (most negative entry, -15)

Each unit of y (a chair) added to the solution contributes a marginal increase in profit of
$15.
Step 2
Select the pivot row. s1 should be replaced by y in the solution mix.
Basic Variables      x      y     s1     s2      P     RHS    Ratio
s1                   0      1      1     -2      0      40    40/1 = 40   <- pivot row
x                    1    1/2      0    1/2      0      50    50/(1/2) = 100
P                    0    -15      0     35      1    3500

40 chairs is the maximum number that can be made with the 40 unused carpentry hours.

Step 3
Calculate new values for the pivot row. As the pivot number is already 1, there is no need
to calculate new values for the pivot row.
Step 4
Use row operations to make all numbers in the pivot column equal to 0 except for the
pivot number.
R2 - (1/2) × R1 and R3 + 15 × R1:
Basic Variables      x      y     s1     s2      P     RHS
y                    0      1      1     -2      0      40
x                    1      0   -1/2    3/2      0      30
P                    0      0     15      5      1    4100

If 40 chairs are made, then the number of tables is reduced by 20 tables (1/2 table/chair
multiplied by 40 chairs); the value changes from 50 tables to 30 tables. The replacement
of 20 tables by 40 chairs results in the profit being increased by $600 ($15 per chair
multiplied by 40 chairs); the value changes from $3500 to $4100.
The new tableau represents the solution x = 30, y = 40 (the vertex (30, 40)).

As the last row contains no negative numbers, this solution gives the maximum value of
P. The maximum profit of $4100 occurs when 30 tables and 40 chairs are made. There
are no unused hours.
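The result can be cross-checked with an off-the-shelf solver. Here is a minimal sketch
using SciPy's linprog routine; linprog minimises, so the profit coefficients are entered
with their signs reversed:

    from scipy.optimize import linprog

    # Maximise 70x + 50y subject to 4x + 3y <= 240, 2x + y <= 100, x, y >= 0.
    res = linprog(c=[-70, -50],
                  A_ub=[[4, 3], [2, 1]],
                  b_ub=[240, 100],
                  bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)   # approximately [30. 40.] and 4100.0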

Example: X Ltd wishes to purchase a maximum of 3,600 units of a product. Two types of
the product, a and b, are available in the market. Product a occupies a space of 3 cubic
feet and costs Rs. 9 per unit, whereas product b occupies a space of 1 cubic foot and
costs Rs. 13 per unit. The budgetary constraints of the company do not allow it to spend
more than Rs. 39,000. The total availability of space in the company's godown is 6,000
cubic feet. The profit margins of products a and b are Rs. 3 and Rs. 4 per unit respectively.
Formulate a linear programming model and solve it using the graphical method. You are
required to ascertain the best possible combination of purchases of a and b so that the
total profit is maximized.
Solution: Let x1 = number of units of product a and
x2 = number of units of product b.
Then the problem can be formulated as an LP model as follows:
Objective function,
Maximise Z = 3x1 + 4x2
Constraint equations,
x1 + x2 ≤ 3600 (maximum units constraint)
3x1 + x2 ≤ 6000 (storage area constraint)
9x1 + 13x2 ≤ 39000 (budgetary constraint)
x1 ≥ 0, x2 ≥ 0 (non-negativity)
Step I. Treating all the constraints as equality, the first constraint is
x1+ x2=3600

Step II. Determine the set of the points which satisfy the constraint:
x1 + x2 = 3600
This can easily be done by verifying whether the origin (0, 0) satisfies the constraint.
Here, 0 + 0 = 0 ≤ 3600, so the origin satisfies it. Hence all the points on or below the line
will satisfy the constraint.
Step III. The 2nd constraint is: 3x1 + x2 ≤ 6000

Now draw its graph.


Step IV. As in Step II above, determine the set of points which satisfy the constraint.
Since the origin satisfies it, all the points on or below the line will satisfy the constraint.

Step V. The 3rd constraint is:
9x1 + 13x2 ≤ 39000
Treating it as an equality, put x1 = 0; then x2 = 3000, giving the point (0, 3000).
Put x2 = 0; then x1 = 39000/9 ≈ 4333.3, giving the point (4333.3, 0).
Now draw its graph.
Step VI. Again the point (0, 0), i.e. the origin, satisfies the constraint. Hence, all the points
on or below the line will satisfy the constraint.
Step VII. The region common to all the above graphs is the feasible region for the given
problem.
Step VIII. Finding Optimal Solution. Always keep in mind two things:
(i) For ≥ constraints the feasible region will be the area which lies above the constraint
lines, and for ≤ constraints it will lie below the constraint lines.
This would be useful in identifying the feasible region.
(ii) According to a theorem on linear programming, an optimal solution to a problem (if it
exists) is found at a corner point of the solution space.
Step IX. At corner points (O, A, B, C), find the profit value from the objective function.
The point which maximizes the profit is the optimal point.

For point B, solve the equations of the two constraint lines that intersect at B:
3x1 + x2 = 6000 ... (1)
9x1 + 13x2 = 39000 ... (2)
From (1), x2 = 6000 - 3x1. Substituting in (2): 9x1 + 78000 - 39x1 = 39000, so
x1 = 1300 and x2 = 2100.
At point B (1300, 2100):
Z = 3x1 + 4x2
= 3 × 1300 + 4 × 2100
= 3900 + 8400 = 12,300, which is the maximum value.
Result. The optimal solution is:
No. of units of product a = 1300
No. of units of product b = 2100
Total profit = Rs. 12,300, which is the maximum.
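Point B and the profit there can be verified numerically. Here is a minimal sketch,
assuming NumPy, that solves the two binding constraints as simultaneous equations:

    import numpy as np

    # Binding constraints at B: 3x1 + x2 = 6000 and 9x1 + 13x2 = 39000
    A = np.array([[3.0, 1.0],
                  [9.0, 13.0]])
    b = np.array([6000.0, 39000.0])
    x1, x2 = np.linalg.solve(A, b)
    print(x1, x2)            # 1300.0 2100.0
    print(3 * x1 + 4 * x2)   # 12300.0, the maximum profit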

Answer key to End Chapter Quizzes:


Chapter One
(1) b , (2) c , (3) c , (4) a , (5) b , (6) c , (7) b , (8) c , (9) c , (10) d
Chapter Two
(1) b , (2) b , (3) c , (4) b , (5) d , (6) c , (7) c , (8) c , (9) c , (10) c
Chapter Three
(1) d , (2) c , (3) a , (4) d , (5) b , (6) b , (7) a , (8) d , (9) a , (10) c
Chapter Four
(1) d , (2) c , (3) b , (4) a , (5) a , (6) b , (7) c , (8) b , (9) c , (10) d
Chapter Five
(1) c , (2) b , (3) c , (4) c , (5) a , (6) b , (7) b , (8) b , (9) c , (10) b
Chapter Six
(1) b , (2) b , (3) d , (4) d , (5) a , (6) c , (7) a , (8) d , (9) b , (10) b
Chapter Seven
(1) d , (2) d , (3) a , (4) c , (5) b , (6) b , (7) c , (8) a , (9) b , (10) d
Chapter Eight
(1) a , (2) b , (3) d , (4) d , (5) a , (6) c , (7) b , (8) d , (9) d , (10) b

BIBLIOGRAPHY
(I) Books:
1. Gupta, S. P. : Business Statistics
2. Sharma, N. L. : Statistics
3. Gupta, K. L. : Business Statistics
4. Gupta, S. P. : Statistical Methods
5. Kapoor & Sancheti : Business Statistics
6. Kothari, C. R. : Quantitative Techniques
7. Agarwal, B. M. : Business Statistics
8. Hooda, R. P. : Introduction to Statistics
9. Sharma, J. K. : Business Statistics

(II) Journals, Periodicals, Newspapers and Other Useful Publications

1. Economic and Political Weekly.


2. India Today.
3. Business India.
4. Journal of Development Economics.

(III) Reports and Other Materials


1. Journals in Statistics

2. Various reports on Economic and Statistical investigations.

Suggested Books (Business Statistics)
1. Kapoor & Sancheti : Business Statistics
2. B. M. Agarwal : Business Statistics
3. S. P. Gupta : Business Statistics
4. J. K. Sharma : Business Statistics
