
MEASURES OF CENTRAL TENDENCY

Formula
1. Mean for grouped data, using assumed mean with step deviation method

$\text{Mean} = A + \dfrac{\sum fd}{n} \times c$
Where –
A is the assumed mean
d is the deviation of each class mid-point from the assumed mean, divided by the class interval

∑ fd is the summation of frequency * deviations

c is the class interval


n is total frequency
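
As a rough illustration of the step deviation formula above, the following Python sketch computes the mean from class mid-points and frequencies. The function name and the sample distribution are hypothetical and are not taken from the problem sets.

```python
def grouped_mean_step_deviation(midpoints, freqs, assumed_mean, class_width):
    """Mean = A + (sum(f*d) / n) * c, where d = (x - A) / c."""
    n = sum(freqs)  # total frequency
    sum_fd = sum(f * (x - assumed_mean) / class_width
                 for x, f in zip(midpoints, freqs))
    return assumed_mean + (sum_fd / n) * class_width


# Hypothetical distribution: classes 0-10, 10-20, ..., 40-50
midpoints = [5, 15, 25, 35, 45]
freqs = [1, 3, 5, 4, 2]
print(grouped_mean_step_deviation(midpoints, freqs, assumed_mean=25, class_width=10))
```

Any class mid-point can serve as the assumed mean; the result is the same, only the arithmetic gets easier when A is near the centre.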

2. Median for grouped data,


$\text{Median} = l + \dfrac{n/2 - cf}{f} \times c$
Where –
l is lower limit of the median class
n is total frequency
cf is the cumulative frequency before median class
c is class interval
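
A minimal Python sketch of the grouped median formula, assuming exclusive classes of equal width. The function name and the sample data are hypothetical.

```python
def grouped_median(lower_limits, freqs, class_width):
    """Median = l + ((n/2 - cf) / f) * c for the class containing the n/2-th item."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("distribution has no observations")
    cf = 0  # cumulative frequency before the current class
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= n / 2:  # this is the median class
            return l + ((n / 2 - cf) / f) * class_width
        cf += f


# Hypothetical distribution: classes 0-10, 10-20, ..., 40-50
print(grouped_median([0, 10, 20, 30, 40], [1, 3, 5, 4, 2], class_width=10))
```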

3. Mode for Grouped Data


$\text{Mode} = l + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times c$
Where –
l is lower limit of the modal class
f is frequency of the modal class
$\Delta_1$ is the frequency of the modal class – frequency of the pre-modal class
$\Delta_2$ is the frequency of the modal class – frequency of the post-modal class
c is the class interval
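
The sketch below applies the standard grouped mode formula, Mode = l + Δ1 / (Δ1 + Δ2) × c, with Δ1 and Δ2 taken as the modal frequency minus the neighbouring frequencies. The function name and data are hypothetical, and the code assumes a single modal class whose frequency exceeds at least one neighbour.

```python
def grouped_mode(lower_limits, freqs, class_width):
    """Mode = l + d1 / (d1 + d2) * c, using the class with the highest frequency."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])  # index of the modal class
    f_modal = freqs[i]
    f_pre = freqs[i - 1] if i > 0 else 0                 # pre-modal frequency
    f_post = freqs[i + 1] if i + 1 < len(freqs) else 0   # post-modal frequency
    d1 = f_modal - f_pre
    d2 = f_modal - f_post
    return lower_limits[i] + d1 / (d1 + d2) * class_width


# Hypothetical distribution: classes 0-10, 10-20, ..., 40-50
print(grouped_mode([0, 10, 20, 30, 40], [1, 3, 5, 4, 2], class_width=10))
```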

4. Five number summary comprises-


a. Smallest observation
b. First quartile (lower quartile)
c. Median
d. Third quartile (upper quartile)
e. Largest observation
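
One possible way to compute a five number summary for individual data is sketched below in Python. It assumes the quartile convention used elsewhere in this workbook, where Q1 is the (N + 1)/4 th item and Q3 is the 3(N + 1)/4 th item, with linear interpolation for fractional positions. Function names and data are hypothetical.

```python
def item_at(sorted_values, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    whole = int(position)
    frac = position - whole
    if whole < 1:
        return sorted_values[0]
    if whole >= len(sorted_values):
        return sorted_values[-1]
    value = sorted_values[whole - 1]
    if frac:
        value += frac * (sorted_values[whole] - value)
    return value


def five_number_summary(values):
    data = sorted(values)
    n = len(data)
    return (data[0],
            item_at(data, (n + 1) / 4),      # first quartile
            item_at(data, (n + 1) / 2),      # median
            item_at(data, 3 * (n + 1) / 4),  # third quartile
            data[-1])


print(five_number_summary([22, 12, 35, 15, 28, 17, 30, 20, 25]))
```

Other textbooks and software use slightly different quartile conventions, so results may differ from a spreadsheet's built-in functions.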

Statistics Page 1
OBJECTIVE QUESTIONS
Choose the best answer / Fill in the blanks / True or False –
1. If the classes are of the form 0 – 10, 10 – 20, etc., they are called _______________ classes
2. If the classes are of the form 1 – 10, 11 – 20, etc., they are called _________________ classes
3. If the classes are of the form 0 - 10, 10 – 20, etc an item of value 10 will be entered in –
a. Class 0 – 10
b. Class 10 – 20
c. Either of the above
d. None of the above
4. If the classes are of the form 0 - 10, 10 – 20, etc the class interval is ____________
5. If the classes are of the form 0 - 10, etc the mid point of class is ____________
6. Number of observations falling within a class is called - Class _____________
7. Ogive means –
a. Cumulative frequency curve
b. Frequency Curve
c. Mathematical Average
d. Arithmetic Mean
8. Data can be in ________________________ or _____________________form.
9. The measures of central tendency are ______________, ______________ & _________________
10. Mean, Median and Mode are known as –
a. Measures of Central Tendency
b. Measures of Dispersion
c. Measures of Middle Values
d. Measures of Mathematical Averages
11. If all the items in a distribution are of the same value, then-
a. Mean = Median = Mode
b. Mean > Median > Mode
c. Mean < Median < Mode
d. Mean + Median = Mode
12. The sum of deviations of all observations from the Arithmetic Mean is ____________
13. In a symmetrical distribution-
a. Mean = Median = Mode
b. Mean > Median > Mode
c. Mean < Median < Mode
d. Mean + Median = Mode

14. Empirical formula about measures of central tendency given by Karl Pearson for an asymmetrical distribution is –
a. Mean – Mode = 3 (Mean – Median)
b. 2 Mode = (Mean + Median)
c. 2 Mean = (Mode + Median)
d. 2 Median = (Mode + Mean)
15. Quartiles are _____________________
16. Percentiles are ____________________
17. Deciles are _____________________
18. True or False: for each case a to e, state whether each measure in the table below is affected
a. The highest value in a set of observations is altered
b. The lowest value in a set of observations is altered
c. Both the highest and the lowest values in a set of observations are altered
d. Each value in a set of observations is increased or decreased by a constant value
e. Each value in a set of observations is multiplied or divided by a constant value
Measure a b c d e
Mean
Median
Mode

PROBLEMS
CALCULATE THE MEASURES OF CENTRAL TENDENCY AND THE FIVE NUMBER SUMMARY FOR THE
FOLLOWING DATA
1. Data pertaining to marks of students and ages of people is given below
a. Marks of students in a test is 48, 60, 59, 67, 66, 78
b. Ages of people in a group is 70, 72, 63, 56, 37, 82, 55, 85, 63

2. Cycle test marks of students are given below –


Class A 55 58 64 70 75 72
Class B 45 35 64 60 58

3. Data pertaining to workers and their wages is given below -


Wages (Rs) 35 45 55 65 75
No. of Workers 19 12 15 10 14

4. Monthly income of 100 families is given below-


Monthly Income (Rs)    No. of Families
Less than 10           5
Less than 20           12
Less than 30           26
Less than 40           44
Less than 50           64
Less than 60           78
Less than 70           87
Less than 80           94
Less than 90           100

5. Data pertaining to students and their marks is given below -


Marks             0 – 9   10 – 19   20 – 29   30 – 39   40 – 49   50 – 59
No. of Students   1       3         19        10        15        2

MEASURES OF DISPERSION

The measures below apply to Individual Data, Discrete Data and Grouped Data.

Range
All three types of data: $L - S$

Coefficient of Range
All three types of data: $\dfrac{L - S}{L + S}$

Explanation
Individual Data and Discrete Data: L is Maximum Value, S is Minimum Value
Grouped Data: L is mid-value of largest class, S is mid-value of smallest class. No class should be open-ended. If it is an inclusive class, it should be converted to exclusive classes

Quartile Deviation
All three types of data: $\dfrac{Q_3 - Q_1}{2}$

Coefficient of Quartile Deviation
All three types of data: $\dfrac{Q_3 - Q_1}{Q_3 + Q_1}$

Explanation
Individual Data:
$Q_1 = \dfrac{(N + 1)}{4}$ th item
$Q_3 = \dfrac{3(N + 1)}{4}$ th item
Discrete Data:
$Q_1$ is the x value of the $\dfrac{(N + 1)}{4}$ th item
$Q_3$ is the x value of the $\dfrac{3(N + 1)}{4}$ th item
Grouped Data:
$Q_1 = l + \dfrac{n/4 - cf}{f} \times c$
Where
l = lower limit of Q1 class
n = total no. of observations
cf = cumulative frequency till class preceding Q1 class
c = class size
f = frequency of Q1 class
$Q_3 = l + \dfrac{3n/4 - cf}{f} \times c$
Where
l = lower limit of Q3 class
n = total no. of observations
cf = cumulative frequency till class preceding Q3 class
c = class size
f = frequency of Q3 class
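
The grouped-data quartile formulas can be sketched in Python as follows. This is only an illustration with hypothetical names and data, assuming exclusive classes of equal size; it then derives the quartile deviation and its coefficient from Q1 and Q3.

```python
def grouped_quartile(lower_limits, freqs, class_size, k):
    """k-th quartile (k = 1, 2, 3): l + ((k*n/4 - cf) / f) * c."""
    n = sum(freqs)
    target = k * n / 4
    cf = 0  # cumulative frequency till the class preceding the quartile class
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cf + f >= target:
            return l + ((target - cf) / f) * class_size
        cf += f
    raise ValueError("distribution has no observations")


# Hypothetical distribution: classes 0-10, 10-20, ..., 40-50
lower_limits = [0, 10, 20, 30, 40]
freqs = [4, 6, 10, 12, 8]
q1 = grouped_quartile(lower_limits, freqs, 10, 1)
q3 = grouped_quartile(lower_limits, freqs, 10, 3)
quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, coefficient_of_qd)
```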

OBJECTIVE QUESTIONS
Choose the best answer / Fill in the blanks / True or False -
1. The measure of degree of scatter of the data from the central value is
a. Dispersion
b. Skewness
c. Average
d. Mean
2. ______________is the difference between the largest and the smallest value of the variable
3. Quartile deviation is also called –
a. Quartile Range
b. Inter quartile range
c. Intra quartile range
d. Semi inter quartile range
4. Mean deviation is also called –
a. Average deviation
b. Dispersion
c. Difference
d. Zero sum
5. The relative measure of standard deviation is called ___________________________________
6. Square of standard deviation is called _______________________________
7. Sum of squares of deviation is minimum when taken from ___________________
8. Sum of absolute deviation is minimum when taken from _____________
9. Inter quartile range is
a. Q3 – Q1
b. Q1 – Q2
c. Q2 – Q1
d. Q3 – Q2

10. True or False: for each case a to e, state whether each measure in the table below is affected
a. The highest value in a set of observations is altered
b. The lowest value in a set of observations is altered
c. Both the highest and the lowest values in a set of observations are altered
d. Each value in a set of observations is increased or decreased by a constant value
e. Each value in a set of observations is multiplied or divided by a constant value
Measure a b c d e
Range
Mean Deviation
Quartile Deviation
Standard Deviation
Variance

PROBLEMS
CALCULATE THE MEASURES OF DISPERSION FOR THE FOLLOWING DATA
1. The following are the runs scored by two cricketers in 10 innings.
a. Find which batsman is a better player
b. Find out which batsman is more consistent (more reliable)
Batsman I 16 8 24 56 90 104 48 32 8 14
Batsman II 42 56 43 37 31 45 50 29 30 27

2. Heights of 60 students in a class are as below.


Height (in cms) 152.5 153 153.5 154 155 155.5 157.5 158 159.5
No. of Students 3 9 7 13 8 6 7 5 2

3. A factory produced two types of electric bulbs A and B. In a study about the life of bulbs, the following results were
obtained
a. Find which type of bulb is longer lasting
b. Find out which type of bulb is more variable
Length of Life (in hours)   A (no. of bulbs)   B (no. of bulbs)
60 – 80                     10                 8
80 – 100                    22                 60
100 – 120                   52                 24
120 – 140                   20                 16
140 – 160                   16                 12

CORRELATION AND REGRESSION
1. Correlation measures the degree of relationship between two or more variables
a. The symbol for measuring correlation is ‘r’
b. ‘r’ lies between -1 and +1
c. Correlation is independent of origin and scale
d. Correlation is symmetric with respect to the variables
e. It is independent of units
f. Correlation means relationship and not causation
2. Understanding why association exists -
a. Dependency
b. Nature and strength of association
c. Causation
d. Coincidental relationship
e. Influence of other variables
3. Important types of correlation are –
a. Positive and negative correlation
b. Linear and non-linear correlation
c. Simple, partial and multiple correlation
• Lag and lead in correlation
a. Difference in periods for cause and effect relationship to be established is known as lag and lead
b. Advertisement and marketing expenses may lead to sales with a lag
c. Additional supply of materials today may lead to reduction in prices after some time
d. Effect of increase in income may lead to increase in expenditure and savings after a period
e. Boom in agricultural produce may lead to increase in industrial output after a gap of time
• Regression
a. Regression is a functional relationship between the value of 2 variables
b. With the help of regression lines we can predict most likely value of one variable given the other
c. If x and y are two variables, then y can be written as y = ax + b, or x as x = cy + d, where a, b, c and d are constants. These are known as linear regression equations
d. The rate of change of one variable per unit change in the other variable is called the regression coefficient
e. The regression lines intersect at $(\bar{x}, \bar{y})$ where $\bar{x}$ and $\bar{y}$ are the means of x and y respectively
f. If r = 0, then the regression lines will be perpendicular to each other
g. If r = ± 1, then the regression lines will coincide
h. r is the geometric mean of the regression coefficients
i. Both the regression coefficients are either positive or negative
j. At least 1 regression coefficient must be numerically less than unity
k. Regression coefficients are independent of origin but not scale

Formula-
1. Methods of Correlation
a. Karl Pearson’s Coefficient of Correlation
Assumed mean method

$r = \dfrac{N \sum dx\,dy - \sum dx \sum dy}{\sqrt{N \sum dx^2 - (\sum dx)^2}\,\sqrt{N \sum dy^2 - (\sum dy)^2}}$

Where dx is (each value of x – assumed mean of x), dy is (each value of y – assumed mean of y) and N is the number of observations
Direct method

$r = \dfrac{N \sum xy - \sum x \sum y}{\sqrt{N \sum x^2 - (\sum x)^2}\,\sqrt{N \sum y^2 - (\sum y)^2}}$

Where x is each value of x, y is each value of y and N is the number of observations
(Note: Karl Pearson’s coefficient of correlation is also called product moment correlation)
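
A short Python sketch of the direct method is given below. The function name and data are hypothetical. Because correlation is independent of origin, subtracting an assumed mean from every x and y (the assumed mean method) gives the same r, which the last two lines check.

```python
from math import sqrt


def pearson_r(xs, ys):
    """r = (N*Sxy - Sx*Sy) / (sqrt(N*Sxx - Sx^2) * sqrt(N*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))


xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
r_direct = pearson_r(xs, ys)
# Assumed mean method: deviations from assumed means 3 and 4
r_assumed = pearson_r([x - 3 for x in xs], [y - 4 for y in ys])
print(r_direct, r_assumed)
```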
b. Spearman’s Rank Correlation
WHEN RANKS ARE NOT GIVEN OR UNEQUAL RANKS GIVEN

$R = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}$
Where d is the difference of ranks of the x and y variables and n is the number of observations

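WHEN RANKS ARE EQUAL (TIED RANKS)

$R = 1 - \dfrac{6\left(\sum d^2 + \dfrac{1}{12}\sum_i (m_i^3 - m_i)\right)}{n(n^2 - 1)}$
Where d is the difference of ranks of the x and y variables, n is the number of observations and $m_i$ is the number of times a rank is repeated in the first or second variable. The correction term is added once for every repeated rank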
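
The tied-rank formula can be sketched in Python as below. The helper names and data are hypothetical, and the sketch assumes rank 1 goes to the largest value, with tied values sharing the average of the ranks they occupy.

```python
from collections import Counter


def average_ranks(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def spearman_rank(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # m = number of times each value is repeated, in either variable;
    # values that occur once contribute (1 - 1) / 12 = 0
    repeats = list(Counter(xs).values()) + list(Counter(ys).values())
    correction = sum((m ** 3 - m) / 12 for m in repeats)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))


print(spearman_rank([50, 60, 60, 70], [1, 2, 3, 4]))  # hypothetical marks
```

When no rank is repeated the correction is zero and the function reduces to the simpler formula.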
c. Two-way Frequency Table

$r = \dfrac{N \sum f\,dx\,dy - \sum f\,dx \sum f\,dy}{\sqrt{N \sum f\,dx^2 - (\sum f\,dx)^2}\,\sqrt{N \sum f\,dy^2 - (\sum f\,dy)^2}}$

Steps-
Take step-deviations of x and y from assumed mean and denote them dx and dy
Multiply dx and dy and the frequency of each cell and note the figure in upper right hand corner of each cell
Add all values of fdxdy and obtain ∑fdxdy

Multiply frequencies of variable x by deviations of variable x and obtain ∑fdx
Take square of deviations of variable x and multiply by frequencies to obtain ∑fdx²
Multiply frequencies of variable y by deviations of variable y and obtain ∑fdy
Take square of deviations of variable y and multiply by frequencies to obtain ∑fdy²
Substitute the values in the formula to obtain r
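d. Concurrent Deviation Method

$R = \pm\sqrt{\pm\dfrac{2C - n}{n}}$
Where C is the number of concurrent deviations (pairs in which x and y change in the same direction from the previous pair) and n is the number of pairs of deviations compared, which is one fewer than the number of pairs observed. The sign of (2C − n) is used both inside and outside the square root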
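
A rough Python sketch of the concurrent deviation method follows. Names and data are hypothetical. It assumes that two deviations are concurrent when their signs match, and it counts a pair where neither variable changes as concurrent; textbooks differ on that edge case.

```python
from math import sqrt


def sign(v):
    return (v > 0) - (v < 0)


def concurrent_deviation(xs, ys):
    """R = +/- sqrt(+/- (2C - n) / n), n = number of pairs of deviations."""
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]  # direction of change in x
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]  # direction of change in y
    n = len(dx)
    c = sum(1 for a, b in zip(dx, dy) if a == b)    # concurrent deviations
    inner = (2 * c - n) / n
    # the sign of (2C - n) is applied inside and outside the root
    return sqrt(inner) if inner >= 0 else -sqrt(-inner)


print(concurrent_deviation([1, 2, 3, 4, 5], [2, 3, 5, 4, 6]))  # hypothetical data
```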
4. Probable Error

$PE = 0.6745 \times \dfrac{1 - r^2}{\sqrt{n}}$
Where r is correlation and n is number of pairs observed

$SE = \dfrac{1 - r^2}{\sqrt{n}}$
Where r is correlation and n is number of pairs observed
ρ (rho), the correlation in the population, lies within r ± PE
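
For illustration, the standard error and probable error can be computed as in the Python sketch below, which assumes the usual forms SE = (1 − r²)/√n and PE = 0.6745 × SE. The function names and the values r = 0.8, n = 16 are hypothetical.

```python
from math import sqrt


def standard_error(r, n):
    """SE = (1 - r^2) / sqrt(n)."""
    return (1 - r * r) / sqrt(n)


def probable_error(r, n):
    """PE = 0.6745 * SE."""
    return 0.6745 * standard_error(r, n)


r, n = 0.8, 16  # hypothetical correlation and number of pairs
pe = probable_error(r, n)
print("limits for the population correlation:", r - pe, r + pe)
```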
5. Calculation of Regression Equation
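a. $(x - \bar{x}) = r\dfrac{\sigma_x}{\sigma_y}(y - \bar{y})$
Where $\bar{x}$ and $\bar{y}$ are means of x and y respectively and $r\dfrac{\sigma_x}{\sigma_y}$ is called the regression coefficient of x on y
b. $(y - \bar{y}) = r\dfrac{\sigma_y}{\sigma_x}(x - \bar{x})$
Where $\bar{x}$ and $\bar{y}$ are means of x and y respectively and $r\dfrac{\sigma_y}{\sigma_x}$ is called the regression coefficient of y on x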
c. Fitting a straight line y on x –
Equation is Y = a + bX
$\sum y = na + b \sum x$
$\sum xy = a \sum x + b \sum x^2$
Solving these two normal equations simultaneously gives a and b, where b is the regression coefficient of y on x
$b_{yx} = r\dfrac{\sigma_y}{\sigma_x} = \dfrac{N \sum dx\,dy - \sum dx \sum dy}{N \sum dx^2 - (\sum dx)^2}$
Where dx is (each value of x – assumed mean of x), dy is (each value of y – assumed mean of y) and N is the number of observations
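
The two normal equations for Y = a + bX can be solved directly, as in this Python sketch. The function name and data are hypothetical.

```python
def fit_line(xs, ys):
    """Solve sum(y) = n*a + b*sum(x) and sum(xy) = a*sum(x) + b*sum(x^2)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # regression coefficient of y on x
    a = (sy - b * sx) / n                          # from the first normal equation
    return a, b


xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")
```

Swapping the roles of xs and ys gives the line of x on y.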
d. Fitting a straight line x on y -

$b_{xy} = r\dfrac{\sigma_x}{\sigma_y} = \dfrac{N \sum dx\,dy - \sum dx \sum dy}{N \sum dy^2 - (\sum dy)^2}$
Where dx is (each value of x – assumed mean of x), dy is (each value of y – assumed mean of y) and N is the number of observations
e. Fitting a parabolic curve or a second degree equation-
Equation is $Y = a + bX + cX^2$

$\sum y = na + b \sum x + c \sum x^2$
$\sum xy = a \sum x + b \sum x^2 + c \sum x^3$
$\sum x^2 y = a \sum x^2 + b \sum x^3 + c \sum x^4$
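
As a sketch only, the three normal equations for the parabola can be solved with Cramer's rule. The helper names are hypothetical, and the sample data were generated from Y = 1 + 2X + 3X² so that the fit should recover those constants.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Cramer's rule: replace each column in turn by the right-hand side."""
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [row[:col] + [rhs[r]] + row[col + 1:] for r, row in enumerate(m)]
        solution.append(det3(replaced) / d)
    return solution


def fit_parabola(xs, ys):
    n = len(xs)

    def s(p):  # sum of x to the power p
        return sum(x ** p for x in xs)

    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    matrix = [[n, s(1), s(2)], [s(1), s(2), s(3)], [s(2), s(3), s(4)]]
    return solve3(matrix, [sy, sxy, sx2y])  # [a, b, c]


xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x * x for x in xs]
print(fit_parabola(xs, ys))
```

Choosing X values centred on zero, as is common with time series, makes the sums of odd powers vanish and simplifies the hand calculation.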
f. Multiple Regression Equations
For 3 variables, equation is $X = a + bY + cZ$

$\sum x = na + b \sum y + c \sum z$
$\sum xy = a \sum y + b \sum y^2 + c \sum yz$
$\sum xz = a \sum z + b \sum yz + c \sum z^2$
Similarly, it can be done for N variables.

OBJECTIVE QUESTIONS

CHOOSE THE BEST ANSWER / FILL IN THE BLANKS / TRUE OR FALSE


1. An analysis of the relationship among two or more variables is called
a. Correlation
b. Skewness
c. Dispersion
d. Kurtosis
2. If x and y are independent, then correlation between them is _________
3. If the decrease in one variable influences the decrease in the other, it is called _______________ correlation
4. If the decrease in one variable influences the increase in the other, it is called _______________ correlation
5. If the ratio between two sets of variables is same, then it is called _____________________ correlation
6. Curvilinear correlation is
a. Linear correlation
b. Non-linear correlation
c. Simple correlation
d. Special correlation
7. Perfect negative correlation is when r = _________
8. Perfect positive correlation is when r = _________
9. Completely no correlation is when r = ________
10. Change of scale in value of x or y series will-
a. Affect the value of ‘r’ very much
b. Not affect the value of ‘r’
c. Affect the value of ‘r’ slightly
d. Increase or decrease the value of r proportional to the change of scale
11. State the nature of correlation that exists between the following variables-
a. The amount of rainfall and the yield of crops
b. The color of an employee’s dress and the employee’s salary
c. Age of applicants for life insurance and the annual premium payable
d. Sale of raincoats and the sale of umbrellas
12. Correlation value lies between ____________and ________
13. Coefficient of determination is _________ and coefficient of non-determination is ____________
14. State true or false
a. Correlation coefficient is unaffected by shift in origin
b. Covariance between 2 variables is always positive
c. Rank correlation lies between 0 and 1
d. If one set of values is removed, then coefficient of correlation for the remaining pairs remains unchanged

e. If the correlation between 2 variables is 0, then the variables are independent

15. Do the following items have positive, negative or zero correlation
a. Price and demand
b. Age and life expectancy
c. Age of husband and wife
d. Income and savings of a person

PROBLEMS
CALCULATE CORRELATION FOR THE FOLLOWING DATA
1. Find the correlation and also regression equations between advertisement expenses and sales of a particular brand of ice-cream Dippy-Dip
Month Jan Feb Mar Apr May Jun
Advt. Exp (Rs 000s) 20 25 28 32 36 34
Sales (Rs lakhs) 30 36 40 42 45 40

2. Find correlation and also regression equations between marks in statistics and accounting of a particular group of students
Roll No of student 101 102 103 104 105
Statistics marks 45 66 58 74 81
Accounting marks 79 56 61 48 40

3. Find correlation and regression equations between age of cars and annual maintenance cost
Age of cars 2 4 6 8 10
Annual maintenance cost 1600 1500 1800 1700 2100

4. Find rank correlation between marks in test and marks in interview of a group of candidates in a job selection procedure
Marks in Test 24 33 33 42 53 60 60 60 71 75
Marks in Interview 38 40 44 50 49 45 52 50 55 68

5. Find correlation between percentage score given by 2 judges


Y\X         60 – 70   70 – 80   80 – 90   90 – 100
50 – 60     4         2         2         -
60 – 70     3         5         3         -
70 – 80     -         3         3         3
80 – 90     -         3         5         6
90 – 100    -         -         5         3

X – Percentage score by judge A


Y - Percentage score by Judge B

6. Excel Pharma has launched a new preventive medicine for the treatment of Swine Flu. The data below is the effect on 100
patients who have taken the medicine against 100 patients who have not taken the medicine and being admitted to the
hospital with viral infection. 98% are free from Swine Flu in the first case vs. 21% who are infected with Swine Flu in the
second case. Excel Pharma is claiming a very high success rate on use of their medicine. Comment

7. Following is the data pertaining to the sensex value and the gold price as on 1st of month from Jan to Sep 2010. What will be
the sensex value in Oct 2010, if the gold price will increase by 10% for diwali purchase season?

MONTH JAN 10 FEB 10 MAR 10 APR 10 MAY 10 JUN 10 JUL 10 AUG 10 SEP 10
24 Ct Gold Price/gm 1500 1550 1600 1620 1700 1750 1800 1850 1900
Sensex 14000 15000 1550 15500 16000 17000 17500 18000 18500

8. Find the multiple linear regression equation of X on Y and Z from the data given below-

X 2 4 6 8
Y 3 5 7 9
Z 4 6 8 10

9. (Please find below an article printed in the front page of Chennai Times)

Chennai

During our recent investigations, it was found that five Chennai cricket players, Sairam, Sandeep, Sankar, Sundar, and Suresh
are deeply involved with the betting syndicate. It has been confirmed by our sources that these players willfully
underperformed in the recently concluded ODI series against the Bangalore team. In the table below are the batting scores of
these five players along with the team score and the result of the matches in the recently concluded Friendship series.

Player         Career Batting Average   1st ODI   2nd ODI   3rd ODI   4th ODI   5th ODI
Sairam         28                       41        19        12        33        30
Sandeep        26                       17        19        17        71        10
Sankar         41                       33        42        39        36        45
Sundar         85                       89        112       58        90        67
Suresh         34                       0         3         2         1         1
Team Chennai   224                      272       212       171       265       178
Result         60%                      WON       WON       LOST      LOST      WON       LOST

Further, it was predicted by the paper in a letter to the board that the players will under perform in their matches against
Mumbai also and the prediction factor was given to the Chennai Police much in advance before the actual matches were
played. The table contains scores calculated by the prediction factor vs. actual scores for the five Chennai players in the one
off ODI match against Mumbai

Player Predicted score Actual score


Sairam 36 35
Sandeep 74 73
Sankar 41 40
Sundar 87 90
Suresh 4 3

Please give your comments about these investigations and the truth in the allegations against the players.

TIME SERIES
Time Series - An arrangement of data in chronological order, according to time of occurrence. Any series of measurements of a variable taken over time is called a time series.

Utility of Time Series


• Analysis
Past behavior
Effect of Factors
Help predict future behavior
• Forecasting
Help make future plan of action
• Evaluation
Evaluation of current achievements
• Comparison
Scientific basis for making comparisons
Isolating effects of various components

Components of Time Series


• Long term
Secular Trend (T) - General Trend to increase or decrease over a period of time
Cyclic Variations (C) - Oscillatory movements with periods greater than 1 year. They usually last 7-9 years
• Short Term
Seasonal Variation (S) - Movements due to forces which are usually rhythmic in nature and within a year
Irregular Variations ( I ) - No regular period of occurrence and accidental changes, purely random, unforeseen and
unpredictable

Mathematical Models
• Additive Model
Y=T+S+C+I
Components are independent of each other
Different components are expressed in original units and are residuals
S, C & I are expressed as deviations from T
• Multiplicative Model
Y=T*S*C*I
S, C & I are expressed as ratios or in percentages
Components may be dependent on each other
Mostly used in real life practice

• Preliminary adjustments before Analyzing Time Series
o Time Variation - Adjusting for no. of days in a month
o Population Variation - Adjust for variables affected by population like per capita income
o Price Changes - Use real values rather than nominal values
o Comparability - Make data homogeneous and comparable
o Miscellaneous Changes

Measurement of Trend
Freehand or Graphic Method
• Simplest and Most Flexible Method
• First step to plot points on a paper
• Then, draw a freehand smooth curve through points
• Number of points above curve and below curve should be equal
• Total deviations should be zero
• Sum of square of deviations should be the minimum possible
Merits and Demerits of Graphic Method
Merits
• Simple and time saving
• No mathematical calculation required
• Very flexible
Demerits
• Highly subjective
• Hence, not suitable for forecasting and decision making

Method of Semi Averages


• Semi averages are the averages of two halves of a series
• Whole data is classified into two equal parts with respect to time
Merits and Demerits of method of Semi Averages
Merits
• Simple method
• Trend figures are objective
• Line can be extended to obtain future estimates
Demerits
• Assumption of linear trend
• Affected by extreme values and use of arithmetic mean
• Obtained and predicted values are not precise and reliable
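
The method of semi averages can be sketched in Python as below. The function name and data are hypothetical. The sketch assumes that, for an odd number of periods, the middle period is left out so that the two halves are equal, and that each semi average is plotted against the middle of its half.

```python
def semi_average_trend(years, values):
    """Trend values on the straight line joining the two semi averages."""
    n = len(values)
    half = n // 2

    def mean(seq):
        return sum(seq) / len(seq)

    # first half and second half; the middle item is skipped when n is odd
    x1, y1 = mean(years[:half]), mean(values[:half])
    x2, y2 = mean(years[n - half:]), mean(values[n - half:])
    slope = (y2 - y1) / (x2 - x1)
    return [y1 + slope * (t - x1) for t in years]


years = [2001, 2002, 2003, 2004, 2005, 2006]   # hypothetical series
values = [10, 12, 14, 16, 20, 24]
print(semi_average_trend(years, values))
```

Passing a later year through the same line equation extends the trend to a future estimate.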

Method of Moving Averages
• Method helps to reduce fluctuations and obtain trend values with fair degree of accuracy
• Method consists of taking arithmetic mean of the values for a certain time span and placing at the centre of time
span
• When the period of the moving average is an even number of years, the centered moving average has to be found
• In some cases, weights may be given to the moving averages called weighted moving average
Merits and Demerits of Method of Moving Averages
Merits
• Simple and Objective method
• Flexible to add additional data without affecting calculations
• If period of moving average coincides with period of cyclical fluctuations, then they are automatically eliminated
Demerits
• No trend values for some initial and end periods
• No functional relationship between value and time
• Difficulty in selecting period of moving average
• Bias in case the trend is non-linear
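
A minimal Python sketch of simple and centered moving averages follows. Function names and data are hypothetical. For an odd period each average sits at the middle of its span; for an even period the centered version averages adjacent pairs of moving averages so that the result lines up with an actual time point.

```python
def moving_average(values, period):
    """Arithmetic mean of each run of `period` consecutive values."""
    return [sum(values[i:i + period]) / period
            for i in range(len(values) - period + 1)]


def centered_moving_average(values, period):
    """For an even period: average adjacent pairs of the simple moving averages."""
    ma = moving_average(values, period)
    return [(a + b) / 2 for a, b in zip(ma, ma[1:])]


values = [2, 4, 6, 8, 10, 12]                 # hypothetical series
print(moving_average(values, 3))              # aligned with the 2nd to 5th values
print(centered_moving_average(values, 4))     # aligned with the 3rd and 4th values
```

The output is shorter than the input, which is the "no trend values for some initial and end periods" demerit listed above.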

Method of Least squares


• The sum of deviations of the actual values from the line of best fit is zero, and the sum of the squares of these deviations is the least possible
• Hence, it is called the method of least squares, and the line is called the line of best fit
• Y = a + bX where ‘a’ and ‘b’ are constants
Merits and demerits of Method of least squares
Merits
• Trend line for entire period
• Functional relationship between time and value
• Objective method
Demerits
• Requires many calculations and is complicated
• Seasonal, cyclical or irregular variations are ignored
• If even a single data pair is added, a new equation has to be formed

Other Methods of obtaining trends


• Fitting a Second Degree Trend or a parabolic trend
$Y = a + bX + cX^2$ where a, b, and c are constants
• Fitting an exponential trend
$Y = ab^X$ where a and b are constants
• Exponential smoothing average

Selection of type of trend
• If first differences are constant, use linear method
• If second differences are constant, use quadratic method
• If first differences of logarithm are constant, use exponential curve
• If first differences tend to decrease by a constant percentage, use modified exponential curve
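
The difference checks above can be tried out with a small Python helper. The function name and the three sample series are hypothetical.

```python
import math


def differences(values, order=1):
    """Successive differences, applied `order` times."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


linear = [3, 5, 7, 9]
quadratic = [1, 4, 9, 16, 25]
exponential = [2, 4, 8, 16]

print(differences(linear))                                  # constant: linear trend
print(differences(quadratic, order=2))                      # constant: quadratic trend
print(differences([math.log(v) for v in exponential]))      # constant: exponential trend
```

With real data the differences are never exactly constant, so the choice rests on which set is closest to constant.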

Methods of measuring Seasonal Variations

Method of Simple Averages


• Arrange seasonal data across given periods
• Find average of data for same season
• Find average of averages
• Get percentage weights for various seasons
• The method is simple, but it assumes that cyclical and irregular variations are absent or of negligible value
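
The steps of the method of simple averages can be sketched in Python as follows. The function name and the two-year quarterly table are hypothetical.

```python
def seasonal_indices_simple_average(table):
    """table: one row per year, one column per season (quarter or month)."""
    seasons = len(table[0])
    # average of the data for the same season across years
    season_means = [sum(row[s] for row in table) / len(table)
                    for s in range(seasons)]
    grand_mean = sum(season_means) / seasons          # average of averages
    return [100 * m / grand_mean for m in season_means]  # percentage weights


table = [[10, 20, 30, 20],   # hypothetical year 1, quarters I to IV
         [14, 24, 34, 24]]   # hypothetical year 2
print(seasonal_indices_simple_average(table))
```

By construction the indices total 400 for quarterly data and 1200 for monthly data.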

Ratio to Trend Method


• Arrange seasonal data across given periods
• Using a suitable method, find seasonal trend values for annual data and then seasonal data
• Get percentage for actual seasonal data by dividing actual data/ trend values
• Find Seasonal Index which is average of percentages
• If the total of the seasonal indices differs from 1200 (monthly data) or 400 (quarterly data), multiply each index by the correction factor 1200/(Total SI) or 400/(Total SI)

Ratio to Moving Average method


• First take a centered moving average
• Get percentage for actual seasonal data by dividing actual data/ centered moving average
• Arrange percentage data seasonally and take average
• If the total of the seasonal indices differs from 1200 (monthly data) or 400 (quarterly data), multiply each index by the correction factor 1200/(Total SI) or 400/(Total SI)
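
A sketch of the ratio to moving average method in Python is given below. The function name and data are hypothetical. It assumes an even number of seasons per year (4 or 12), a series that starts in the first season, and at least two full years of data so that every season has a ratio.

```python
def ratio_to_moving_average(values, period=4):
    """Seasonal indices (totalling 100 * period) from a flat list of seasonal data."""
    # simple moving averages, then centre them by averaging adjacent pairs
    ma = [sum(values[i:i + period]) / period
          for i in range(len(values) - period + 1)]
    centered = [(a + b) / 2 for a, b in zip(ma, ma[1:])]
    offset = period // 2  # the first centred average lines up with this index
    ratios = {}
    for k, cma in enumerate(centered):
        i = k + offset
        ratios.setdefault(i % period, []).append(100 * values[i] / cma)
    raw = [sum(ratios[s]) / len(ratios[s]) for s in range(period)]
    factor = (100 * period) / sum(raw)  # correction factor, e.g. 400 / (Total SI)
    return [r * factor for r in raw]


quarterly = [80, 100, 120, 100] * 3   # hypothetical, purely seasonal series
print(ratio_to_moving_average(quarterly))
```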

De-Seasonalisation of Data
• Elimination of seasonal variation is called as de-seasonalisation of data
• Either additive or multiplicative models are used
• Measurement of cyclical variations
Residual Method
• Eliminate Trends and Seasonal Variations from the original data using additive or multiplicative models
• Irregular variations are removed from this data by using the method of moving averages of appropriate period
• Cyclical variations are the only variations left and can be measured now
• Measurement of Irregular variations

• Using additive or multiplicative models by removing trend, seasonal or cyclical variations
• They are found to be of small magnitude

Forecasting of Data

Qualitative Forecasting
• When historical data are not available

Quantitative Forecasting
• When historical data available
• Causal forecasting methods
• Time Series forecasting methods

Forecasting methods using time series


• Mean forecast
• Naive forecast
• Linear Trend Forecast
• Non-Linear Trend Forecast
• Forecasting with Exponential Smoothing

Objective Questions
CHOOSE THE BEST ANSWER / FILL IN THE BLANKS / TRUE OR FALSE
1. With which form of time series would you associate the following-
a. A fire in the factory delaying production for three weeks
b. Need for increased wheat production due to rise in the population
c. Change in day temperature from winter to summer
d. Increase in employment during harvest time
e. Price hike in petroleum products due to Gulf war
2. Fill in the blanks
a. An overall rise or fall in a time series is called____________
b. A time series consists of data arranged in _________________ order
c. The additive model is expressed as Y = ________________________
d. The multiplicative model is expressed as Y = ________________________
e. The trend line obtained by the method of least squares is known as line of __________
f. The component of time series useful for long-term forecasting is _____________
g. For the annual data _______________________component of time series is missing
h. If growth rate is constant, the trend line is _____________
i. A polynomial of the form $Y = a + bX + cX^2$ is called _______________________
j. Trend is the overall tendency of the time series data to _____________ or _______________ over a long period of time
k. Seasonal variations are variations with periods of _________________ and are mostly caused by _________________
3. Choose the correct answer
a. Trend refers to a long term tendency to
i. Increase only
ii. Decrease only
iii. Increase or Decrease
iv. None of the above
b. If trend is absent in a time series, seasonal indices are obtained by using
i. Method of simple averages
ii. Ratio to trend method
iii. Ratio to moving average method
iv. Method of least squares
c. The most widely used method of measuring seasonal variations is
i. Method of simple averages
ii. Ratio to trend method
iii. Ratio to moving average method
iv. Link relative method

d. The method used in the study of cyclical variations is
i. Ratio to trend method
ii. Ratio to moving average method
iii. Link relative method
iv. Residual method

PROBLEMS

Find trend lines for the following data by -


a. Semi Averages method
b. Moving Averages method
c. Weighted Moving Averages method
d. Least Squares method

1. Assume a 4 yearly cycle with equal weights


Year 1970 71 72 73 74 75 76 77 78 79 80 81 82 83
Value 53 79 76 66 69 94 105 87 79 104 97 92 101 105

2. Following is the data pertaining to the sensex value and the gold price as on 1st of month from Jan to Sep 2010. What will be
the sensex value in Oct 2010, if the gold price will increase by 10% for diwali purchase season?
Month Jan 10 Feb 10 Mar 10 Apr 10 May 10 Jun 10 Jul 10 Aug 10 Sep 10
24 Ct Gold Price/gm 1500 1550 1600 1620 1700 1750 1800 1850 1900
Sensex 14000 15000 1550 15500 16000 17000 17500 18000 18500

Find seasonal indices for the following data by -


a. Method of simple averages
b. Ratio to trend method
c. Ratio to moving average method
d. Link Relative method
3. Output of Coal in Million Tonnes

Year Q1 Q2 Q3 Q4
2005 73 67 66 68
2006 70 63 61 66
2007 73 68 68 72
2008 75 64 61 67
2009 65 60 56 63

4. Monthly data pertaining to rice production in lakhs of tonnes for the period of Jan 2007 to Dec 2009

Month 2007 2008 2009


Jan 16 25 21
Feb 15 23 20
Mar 14 25 21
Apr 18 27 19
May 17 24 18
Jun 19 25 17
Jul 20 26 19
Aug 17 22 20
Sep 16 22 21
Oct 14 22 20
Nov 16 22 18
Dec 19 23 16

5. Calculate the seasonal variations by ratio to trend method for the following data from 2005 to 2009

Year I Q II Q III Q IV Q
2005 30 40 36 34
2006 34 52 50 44
2007 40 58 54 48
2008 54 76 68 62
2009 80 92 86 82

6. Calculate the seasonal variations by ratio to moving average method for the following data from 2007 to 2009

Year I Q II Q III Q IV Q
2007 68 62 61 63
2008 65 58 66 61
2009 68 63 63 67

PROBABILITY
Concepts

Probability is the mathematics of chance. A probability experiment is a chance process that leads to well-defined outcomes or results. An outcome is the result of a single trial of a probability experiment, and each outcome occurs at random. In the classical approach, each outcome of the experiment is assumed to be equally likely. A trial means tossing a coin once, rolling a die or drawing a single card from the deck. The set of all outcomes of a probability experiment is called a sample space. An event is one or more outcomes of a sample space. An event with a single outcome is called a simple event and one with two or more outcomes is called a compound event.

Rules –
1. The probability of any event will always be from 0 to 1
2. When an event cannot occur (impossible event), the probability will be 0
3. When an event is certain to occur, the probability is 1
4. The sum of the probabilities of all the outcomes in the sample space is 1
5. The probability that an event will not occur = (1 – probability that event will occur)

Sample space can be represented in two ways: tree diagrams and tables.
A tree diagram can be used to determine the outcome of a probability experiment. A tree diagram consists of branches
corresponding to the outcomes of two or more probability experiments that are done in sequence.
Sample spaces can also be represented using tables. For example, the outcomes when selecting a card from an ordinary deck can
be represented by a table. When two dice are rolled, 36 outcomes can be represented by using a table. Once a sample space is
found, probabilities can be computed for specific events.
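The table representation of a sample space can be sketched in a few lines of Python. This is only an illustration: the helper name `probability` and the choice of event (a sum of 7) are assumptions made for the example, and the dice are assumed to be fair.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two fair dice: 36 equally likely ordered pairs,
# i.e. the cells of a 6 x 6 table.
sample_space = list(product(range(1, 7), repeat=2))


def probability(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


p_sum_7 = probability(lambda dice: dice[0] + dice[1] == 7)
p_not_sum_7 = 1 - p_sum_7  # rule 5: probability that the event does not occur

print(len(sample_space), p_sum_7, p_not_sum_7)
```

Exact fractions are used so that the rules (values between 0 and 1, complements summing to 1) can be checked without rounding.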

Addition Rules-
Many times in probability, it is necessary to find probability of two or more events occurring. In these cases, the addition rules are
used.
When the events are mutually exclusive, they have no outcome in common.
P (A or B) = P (A) + P (B)
When the two events are not mutually exclusive, they have some common outcomes.
P (A or B) = P (A) + P (B) – P (A and B)
The key word in these problems is “Or”, and it means add or union.
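Both addition rules can be checked on a small sample space. The sketch below assumes one roll of a fair die; the event names are illustrative only.

```python
from fractions import Fraction

# One roll of a fair die; events are sets of outcomes.
space = set(range(1, 7))


def p(event):
    """Probability of an event as (outcomes in event) / (outcomes in space)."""
    return Fraction(len(event & space), len(space))


even = {2, 4, 6}
greater_than_3 = {4, 5, 6}
one = {1}

# Not mutually exclusive: 4 and 6 belong to both events, so subtract the overlap.
p_even_or_gt3 = p(even) + p(greater_than_3) - p(even & greater_than_3)

# Mutually exclusive: no common outcome, so the probabilities simply add.
p_even_or_one = p(even) + p(one)

print(p_even_or_gt3, p_even_or_one)
```

In both cases the result equals the probability of the union of the two sets, which is what "or" means here.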

Multiplication Rules-
When two events occur in sequence, the probability that both events occur can be found by using multiplication rules.
When two events are independent, the probability that the first event occurs does not affect or change the probability of the
second event occurring.

P (A and B) = P (A). P (B)
If the events are dependent, the probability of the second event occurring is changed after the first event occurs.

P (A and B) = P (A). P (B|A) where P (B|A) = P (A and B) / P (A).

P (B|A) is also known as conditional probability.

Conditional Probability –

The key word for multiplication rule is “and” and it means intersection. Conditional probability is used when additional
information is known about the probability of an event.
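The difference between independent and dependent events can be seen by drawing two marbles with and without replacement. The bag contents below (3 red, 2 blue) are hypothetical numbers chosen for the illustration.

```python
from fractions import Fraction

red, blue = 3, 2          # hypothetical bag contents
total = red + blue

p_a = Fraction(red, total)                     # A: first marble is red

# With replacement the draws are independent: P(A and B) = P(A) . P(B)
p_both_independent = p_a * Fraction(red, total)

# Without replacement the second draw depends on the first:
# P(A and B) = P(A) . P(B|A)
p_b_given_a = Fraction(red - 1, total - 1)     # B: second marble is red
p_both_dependent = p_a * p_b_given_a

# The conditional probability is recovered as P(A and B) / P(A).
recovered = p_both_dependent / p_a

print(p_both_independent, p_both_dependent, recovered)
```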

Odds and Expectations –


Odds are used to determine the payoffs in gambling games. Odds are computed from probabilities; however, probabilities can be
computed from odds if the true odds are known.

Odds in favor = P (E) / (1 – P (E))

Odds against = (1 – P (E)) / P (E)
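The conversion between probabilities and odds can be sketched as follows, using the usual definitions (odds in favor as the ratio of the probability of the event to the probability of its complement). The function names are hypothetical, and the event (rolling a six with a fair die) is only an example.

```python
from fractions import Fraction


def odds_in_favor(p):
    """Ratio P(E) : P(not E), returned as a single fraction."""
    return p / (1 - p)


def odds_against(p):
    """Ratio P(not E) : P(E), returned as a single fraction."""
    return (1 - p) / p


def probability_from_odds_in_favor(a, b):
    """Odds in favor of a : b correspond to a probability of a / (a + b)."""
    return Fraction(a, a + b)


p_six = Fraction(1, 6)
print(odds_in_favor(p_six), odds_against(p_six))   # 1/5 and 5, i.e. 1:5 and 5:1
print(probability_from_odds_in_favor(1, 5))        # back to 1/6
```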

Expected Value-
Mathematical expectations can be thought of as a long term average. If the game is played many times, the average of the
outcomes or the payouts can be computed using mathematical expectation.
E(x) = ∑ x . P (x), where x is the value of each outcome and P (x) is its probability

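As an illustration of expectation as a long-term average, the sketch below weights each outcome by its probability and adds the results. The distribution used (one roll of a fair die) and the function name `expected_value` are assumptions made for the example.

```python
from fractions import Fraction


def expected_value(distribution):
    """Sum of value * probability over a {value: probability} mapping."""
    return sum(value * prob for value, prob in distribution.items())


# One roll of a fair die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

ev = expected_value(die)
print(ev)   # 7/2, although no single roll can ever show 3.5
```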
In order to determine the number of outcomes or events, the fundamental counting rule, the permutation rules, and the
combination rule can be used. The difference between a permutation and a combination is that for a permutation, the order or
arrangement of the objects is important. For example, order is important in phone numbers, identification tags, social security
numbers, license plates, dictionary etc. Order is not important when selecting objects from a group.
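The three counting tools can be compared side by side. The numbers below (choosing 3 objects out of 5, and a coin toss followed by a die roll) are illustrative; `math.perm` and `math.comb` require Python 3.8 or later.

```python
from math import comb, perm

# Fundamental counting rule: a coin toss (2 ways) followed by a die roll (6 ways).
coin_then_die = 2 * 6

# Permutations: ordered arrangements of 3 objects chosen from 5.
arrangements = perm(5, 3)

# Combinations: unordered selections of 3 objects chosen from 5.
selections = comb(5, 3)

print(coin_then_die, arrangements, selections)
```

Each selection of 3 objects can be arranged in 3! = 6 orders, which is why the permutation count is six times the combination count.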

There are three types of probability:


Classical probability uses sample spaces. A sample space is the set of outcomes of a probability experiment. Classical
probability is defined as the number of ways (outcomes) the event can occur divided by the total number of outcomes in the
sample space.
Empirical probability uses frequency distributions, and it is defined as the frequency of an event divided by the total number of
frequencies.
Subjective probability is made by a person’s knowledge of the situation and is basically an educated guess as to the chance of
the event occurring.

Bayes’ theorem –

P (A|B) = P (A). P (B|A) / P (B) where P (B) = P (A). P (B|A) + P (not A). P (B|not A)
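Bayes' theorem reverses a conditional probability: from P (B|A) it gives P (A|B). A minimal sketch for two complementary causes follows; the machine shares and defect rates are hypothetical numbers, and the function name `bayes` is an assumption for this example.

```python
from fractions import Fraction


def bayes(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) for an event A and its complement 'not A'."""
    # Total probability of B over both causes.
    p_b = prior_a * p_b_given_a + (1 - prior_a) * p_b_given_not_a
    return prior_a * p_b_given_a / p_b


# Hypothetical: machine A makes 60% of items with 2% defective,
# the other machine makes 40% with 5% defective.
posterior = bayes(Fraction(3, 5), Fraction(1, 50), Fraction(1, 20))
print(posterior)   # probability that a defective item came from machine A
```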

Probability Distributions –

1. Uniform Distribution- A distribution is said to be uniform if the probability of the variable is equal for all values in the
given interval.
For example – Suppose people arrive at a railway station uniformly over time and a train leaves every 5 minutes. What is the
probability that a person arriving at the station will have to wait for less than a minute?
Arrival times are uniform over the 5 minute interval between trains, so one in five arrivals falls in the last minute before
departure and hence probability = 1/5 = 0.2

2. Binomial Distribution –
• Each trial can only have two outcomes
• There are a fixed number of trials
• The outcome of each trial is independent of each other
• The probability for an outcome must be same for each trial
• P (r) = nCr . p^r . (1 – p)^(n – r) where n is number of trials, r is number of successes, p is probability of success
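The binomial probability of exactly r successes in n trials can be sketched with the standard formula nCr . p^r . (1 – p)^(n – r). The example values (4 tosses of a fair coin) are illustrative, and `binomial_pmf` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


p_heads = Fraction(1, 2)
p_two_heads = binomial_pmf(2, 4, p_heads)

# The probabilities of 0, 1, ..., n successes must add up to 1.
total = sum(binomial_pmf(r, 4, p_heads) for r in range(5))

print(p_two_heads, total)
```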

3. Poisson Distribution –
• It is used when the variable occurs over a period of time, or over an area or volume

• P (x) = e^(–λ) . λ^x / x! where e is the mathematical constant (approximately 2.718), λ is the mean or expected value and x
is the number of successes, where mean λ = np

and variance = np
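A short sketch of the Poisson probability, using the standard formula e^(–λ) . λ^x / x! with λ = np. The values of n and p are hypothetical, and `poisson_pmf` is an illustrative helper name.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the mean is lam."""
    return exp(-lam) * lam**x / factorial(x)


n, p = 100, 0.02          # hypothetical: many trials, small probability
lam = n * p               # mean (and variance) of the Poisson distribution

p_zero = poisson_pmf(0, lam)
p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))

# Summing over enough values of x should give (almost exactly) 1.
total = sum(poisson_pmf(x, lam) for x in range(50))

print(round(p_zero, 4), round(p_at_most_2, 4), round(total, 6))
```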

4. Normal Distribution –
• It is bell shaped and symmetric about the mean and continuous and asymptotic to the axis
• Area under the curve is 1
• The mean, median and mode are at the centre of the distribution

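• In a standard normal distribution, mean is 0 and variance is 1. If x is normal with mean µ and standard deviation σ, then
z = (x – µ) / σ follows the standard normal distribution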

• The standard normal values are called z scores
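Instead of a printed z table, the area under the standard normal curve can be computed from the error function in Python's `math` module. The sketch below uses the standard conversion z = (x – mean) / standard deviation; the mean, standard deviation and cut-off are hypothetical, and the helper names are assumptions for this example.

```python
from math import erf, sqrt


def z_score(x, mean, sd):
    """Convert a raw value to a standard normal z score."""
    return (x - mean) / sd


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Hypothetical scores with mean 100 and standard deviation 15.
z = z_score(130, 100, 15)
p_above = 1 - standard_normal_cdf(z)   # area in the right tail beyond z

print(z, round(p_above, 4))
```

Note that the function takes the standard deviation, so a variance given in a problem has to be square-rooted first.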

Problems
1. When a die is rolled, what is the probability of getting a number greater than 4?
2. Two dice are rolled. The probability that the sum of spots on the faces will be ‘8’ is?
3. When two coins are tossed, the probability of getting two tails is?
4. When a card is selected from a standard pack, the probability that it is a ‘9’ is?
5. When a card is selected from a standard pack, the probability that it is a diamond or a number card is?
6. In a survey of 180 people, 7s are over 60. If a person is selected at random, what is the probability that the person is over
60?
7. If a letter is selected at random from the word “PROBABILITY”, the probability that it is a vowel is?
8. In a box, there are 6 white marbles, 3 blue marbles and 1 red marble. If a marble is selected at random what is the
probability that it is not white?
9. In a sample of 10 pieces, 4 are defective. If 3 are selected at random and tested, what is the probability that they are not
defective?
10. How many different 3 digit codes can be made?
11. If 30% of commuters ride to work on a bus, find the probability that if 8 workers are selected at random, 3 will ride the
bus.
12. A survey found that 10% of older people have given up driving. If a sample of 1000 persons is taken, the standard
deviation of the sample will be?
13. A board of directors consists of 7 women and 5 men. If 4 directors are selected at random, the probability that exactly 2
directors are men is?
14. The probability that there will be a car accident in a particular road is 0.01. The number of accidents follows Poisson
distribution. If there are 500 cars on the road on a particular day, find the probability that there will be exactly 4
accidents?
15. About 5% of rabbits are brown in color. If the distribution is Poisson, find the probability that in 100 randomly selected
rabbits, 7 rabbits are brown in color?
16. In an exam (which is approximately normally distributed), the average marks were 200 and variance was 400. If a person
who took the exam was selected at random, find the probability that the person scores above 230.
17. The average height for adult kangaroos is 64 inches with a variance of 4 inches. Assume normal distribution. If a
kangaroo is selected at random, find the probability that its height is between 62 and 66.8 inches
18. Box 1 contains 2 red balls and 1 blue ball. Box 2 contains 1 red ball and 3 blue balls. One of the two boxes is selected
at random and a ball is selected from that box at random. If the ball is red, find the probability it came from box 1?
19. Two manufacturers supply paper cups to a certain catering service. ‘A’ supplied 100 cups and 5 were damaged. ‘B’
supplied 50 cups and 3 were damaged. If a cup is damaged, find the probability that it came from ‘A’?
20. A street vendor, if the vendor is caught by city inspector, must pay a fine of Rs 50. Otherwise, the vendor can make Rs
100 at Main Road or Rs 75 at Cross Road. Construct a payoff table, determine the optimal strategy for both locations,
and find the value of the game.

HYPOTHESIS TESTING

Procedure in Hypothesis Testing-

1. Formulate a Hypothesis
2. Set up a suitable significance level
3. Select test criterion
4. Compute the statistic
5. Make the decision

              H0 Accepted          H0 Rejected
H0 is True    Correct decision     Type I error (α)
H0 is False   Type II error (β)    Correct decision

Explanations-

• Parameter – Statistical measure based on all units of a population


• Statistic – Statistical measure based on all units of a sample
• Sampling distribution – Distribution of a statistic
• Standard error – Standard deviation of the sampling distribution of the statistic
• Confidence interval – An interval that is expected to include the true values of the parameter with the desired levels of
confidence
• Significance level (α) – The probability that the test statistic falls in the rejection region when H0 is true. It is also the
probability of committing a type I error
• Acceptance region – The complement of the critical region; H0 is not rejected when the statistic falls in it
• Critical Region – Rejection region
• One tail test – A test with one rejection region, in one tail of the sampling distribution (µ0 is the hypothesised value).
o Right tail test - H0: µ = µ0 and H1: µ > µ0, or H0: µ ≤ µ0 and H1: µ > µ0
o Left tail test - H0: µ = µ0 and H1: µ < µ0, or H0: µ ≥ µ0 and H1: µ < µ0
• Two tail test – A test with two rejection regions, one in each tail. H0: µ = µ0 and H1: µ ≠ µ0
• Null hypothesis (H0) – The hypothesis which is tested for possible rejection under the assumption that it is true. It is also
known as the hypothesis of no difference.
• Alternate hypothesis (H1) – A hypothesis which contradicts the null hypothesis. It decides whether the test has to be a
one tailed test or two tailed test
• Type I error – Rejecting a hypothesis when it is true. It is also known as rejecting a good lot or producer’s risk
• Type II error – Accepting a hypothesis when it is false. It is also known as accepting a bad lot or consumer’s risk

Non-Parametric Tests

• K-S test for goodness of fit of one sample (Kolmogorov-Smirnov)

o Sum cumulative frequency of observed values


o Convert to percentage
o Find the expected values and convert to percentage
o Find the difference of observed and expected values
o The maximum difference value is called D value
o Degree of freedom is the number of observations
o Compare with table value of D at degrees of freedom
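The K-S steps above can be sketched as follows. The observed counts are hypothetical, and the expected distribution is assumed to be uniform (equal counts in every class); the resulting D value would still have to be compared with a K-S table value.

```python
from fractions import Fraction
from itertools import accumulate

observed = [30, 30, 20, 20]            # hypothetical counts per class
n = sum(observed)
k = len(observed)
expected = [Fraction(n, k)] * k        # uniform expectation (assumption)

# Cumulative frequencies expressed as proportions of the total.
observed_cum = [Fraction(c, n) for c in accumulate(observed)]
expected_cum = [c / n for c in accumulate(expected)]

# D is the largest absolute gap between the two cumulative distributions.
differences = [abs(o - e) for o, e in zip(observed_cum, expected_cum)]
d_value = max(differences)

print(d_value)
```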

• U Test (Mann-Whitney Test for Equality of two means)


$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$ or $U = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2$, whichever is lesser

$\mu = \frac{n_1 n_2}{2}$

$\sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$

$Z = \frac{U - \mu}{\sigma}$

Where Ri is sum of ranks of each group and ni = number of observations in each group
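A minimal sketch of the U statistic and its Z value, following the formulas above. The two samples are hypothetical and deliberately tiny so the ranks can be checked by hand; with samples this small the normal approximation is shown only to illustrate the arithmetic. The helper `average_ranks` is an assumption of this example and gives tied values the average of their ranks.

```python
from math import sqrt


def average_ranks(values):
    """Map each value to its rank in the pooled data (ties share the mean rank)."""
    ordered = sorted(values)
    return {v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
            for v in set(ordered)}


group_1 = [12, 15, 18]          # hypothetical sample 1
group_2 = [14, 20, 22, 25]      # hypothetical sample 2
n1, n2 = len(group_1), len(group_2)

ranks = average_ranks(group_1 + group_2)
r1 = sum(ranks[v] for v in group_1)     # sum of ranks of group 1
r2 = sum(ranks[v] for v in group_2)     # sum of ranks of group 2

u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
u = min(u1, u2)                         # whichever is lesser

mu = n1 * n2 / 2
sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u - mu) / sigma

print(u, mu, round(z, 4))
```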

• H Test (Kruskal Wallis Rank Sum Test for Equality of several means)

$H = \frac{12}{n(n + 1)} \sum \frac{R_i^2}{n_i} - 3(n + 1)$ Where n = total number of observations, Ri = group sum of ranks,
ni = number of observations in each group
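The H statistic can be computed in the same way as U: rank the pooled data, sum the ranks per group, and apply the formula. The three groups below are hypothetical and have no ties; the helper `average_ranks` is an assumption of this sketch. The result is usually compared with a chi-square table value at (number of groups – 1) degrees of freedom.

```python
def average_ranks(values):
    """Map each value to its rank in the pooled data (ties share the mean rank)."""
    ordered = sorted(values)
    return {v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2
            for v in set(ordered)}


groups = [                      # hypothetical samples
    [27, 31, 35],
    [40, 42, 47],
    [50, 55, 61],
]

pooled = [v for group in groups for v in group]
n = len(pooled)
ranks = average_ranks(pooled)

# Sum of (R_i squared / n_i) over the groups.
rank_term = sum(sum(ranks[v] for v in group) ** 2 / len(group)
                for group in groups)

h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
degrees_of_freedom = len(groups) - 1

print(round(h, 4), degrees_of_freedom)
```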

PROBLEMS-

1. A company surveyed 100 respondents to know about the importance of computers in their life. The respondents
indicated as follows. Use Kolmogorov-Smirnov test (K-S test) to test the hypothesis that there is no difference in ratings
amongst the respondents

Total Respondents 100


Very Important 25
Somewhat Important 30
Neither Important nor Unimportant 10
Somewhat Unimportant 20
Very Unimportant 15

2. The following data indicates the lifetime (in hours) of samples of two kinds of light bulbs in continuous use. Use Mann-
Whitney U test to compare the life time of brands A and B light bulbs.

Brand A 603 625 641 622 585 593 660 600 633 580 615 648
Brand B 620 640 646 620 652 639 590 646 631 669 610 619

3. A company used three different methods of advertising its product in three cities. It found out the increased sales in
identical retail outlets in three cities as follows. Use Kruskal-Wallis method (H test) to test the hypothesis that the
increase in sales using different methods in different cities is the same at 5% level of significance.

Chennai 70 58 60 45 55 62 89 72
Mumbai 65 57 48 55 75 68 45 52 63
Kolkata 53 59 71 70 63 60 58 75

Chi-Square Test

• Chi square distribution for goodness of fit-


$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$
Where Fo = Observed Frequency, Fe = Expected Frequency, DF (degrees of freedom) = (k-1)
where k is number of classes
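A goodness-of-fit calculation can be sketched as below. The observed counts are hypothetical, and the expected frequencies assume a uniform distribution over the classes; the computed value would then be compared with the chi-square table value at k – 1 degrees of freedom.

```python
observed = [18, 22, 20, 20]                 # hypothetical counts per class
k = len(observed)
expected = [sum(observed) / k] * k          # uniform expectation (assumption)

# Sum of (Fo - Fe)^2 / Fe over all classes.
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
degrees_of_freedom = k - 1

print(round(chi_square, 4), degrees_of_freedom)
```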

• Chi square distribution for independence of attributes-


$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$

$F_e = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$

Where Fo = Observed Frequency, Fe = Expected Frequency, DF (degrees of freedom) = (r-1)(c-1) where r is number of
rows and c is number of columns
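For a contingency table, the expected frequency of each cell comes from its row and column totals. The 2 x 2 table below is hypothetical; exact fractions are used so the arithmetic can be verified by hand.

```python
from fractions import Fraction

observed = [                     # hypothetical 2 x 2 contingency table
    [10, 20],
    [30, 40],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Fe = row total * column total / grand total, for every cell.
expected = [[Fraction(r * c, grand_total) for c in column_totals]
            for r in row_totals]

chi_square = sum((fo - fe) ** 2 / fe
                 for obs_row, exp_row in zip(observed, expected)
                 for fo, fe in zip(obs_row, exp_row))

degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)

print(float(chi_square), degrees_of_freedom)
```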

PROBLEMS-

Test for goodness of fit-

1. The following table gives the average number of calls received by an operator on various days of the week in a call
centre. Find out whether the calls are uniformly distributed over the week.

Days Monday Tuesday Wednesday Thursday Friday


Number of calls 124 120 126 134 146

Test for independence of attributes-

2. The following information is obtained concerning 50 randomly selected students. Can it be inferred that availing of loans
is more common among boys?

Educational Loan Boys Girls Total


Taken 14 8 22
Not taken 16 12 28
Total 30 20 50

• Z test for one sample mean-
$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ Where $\sigma / \sqrt{n}$ is the standard error. If ‘σ’ is not given, we can use ‘s’

• Z test for difference between means-


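$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ Where $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ is standard error and H0: µ1-µ2=0.
If σ is not known, we can estimate σ by the formula $\sigma = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2}}$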
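Both Z statistics reduce to "difference divided by standard error". The sketch below uses the usual standard errors (σ / √n for one mean, and the square root of σ1²/n1 + σ2²/n2 for two means) with hypothetical summary figures; the function names are assumptions made for this example.

```python
from math import sqrt


def z_one_sample(sample_mean, mu, sigma, n):
    """Z for one sample mean; sigma / sqrt(n) is the standard error."""
    return (sample_mean - mu) / (sigma / sqrt(n))


def z_two_samples(mean_1, mean_2, sigma_1, sigma_2, n1, n2):
    """Z for the difference between two means under H0: mu1 - mu2 = 0."""
    standard_error = sqrt(sigma_1**2 / n1 + sigma_2**2 / n2)
    return (mean_1 - mean_2) / standard_error


# Hypothetical summary figures.
z1 = z_one_sample(sample_mean=52, mu=50, sigma=10, n=100)
z2 = z_two_samples(mean_1=52, mean_2=50, sigma_1=6, sigma_2=8, n1=36, n2=64)

print(z1, round(z2, 4))
```

The computed value is then compared with the critical z for the chosen significance level and for a one tail or two tail test, as appropriate.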

• T Test for One sample mean-


$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$. Where standard deviation is given directly, use formula $t = \frac{\bar{x} - \mu}{SD / \sqrt{n - 1}}$
Degrees of freedom = n-1

• T test for difference between means-


$t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ Where $s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}}$ and n1+n2-2 = degrees of freedom
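The t statistics can be sketched in the same way as the Z statistics. The pooled form used here, with n1·s1² + n2·s2² divided by n1 + n2 – 2, assumes that each s² was computed with divisor n; all summary figures are hypothetical and the function names are assumptions of this example.

```python
from math import sqrt


def t_one_sample(sample_mean, mu, s, n):
    """t for one sample mean, with n - 1 degrees of freedom."""
    return (sample_mean - mu) / (s / sqrt(n)), n - 1


def t_two_samples(mean_1, mean_2, var_1, var_2, n1, n2):
    """t for the difference between two means using a pooled s.

    var_1 and var_2 are sample variances computed with divisor n (assumption).
    """
    s = sqrt((n1 * var_1 + n2 * var_2) / (n1 + n2 - 2))
    t = (mean_1 - mean_2) / (s * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical summary figures.
t1, df1 = t_one_sample(sample_mean=52, mu=50, s=4, n=16)
t2, df2 = t_two_samples(mean_1=52, mean_2=50, var_1=3.2, var_2=3.2, n1=5, n2=5)

print(t1, df1, round(t2, 4), df2)
```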

ANOVA
1. The following table gives the retail prices of a certain commodity in some selected shops in four cities as below. Can we say
the prices of the commodities differ in the four cities?

City Prices

Chennai 11 7 10 8

Mumbai 7 9 11

Delhi 9 4 7 3 2

Kolkata 8 12 12 8

2. The sales of 4 salesmen - A, B, C & D of the Company Sellers in three seasons are given below. Can we conclude that
overall sales are dependent on seasons? Are the four salesmen equally effective?

Season/Salesman A B C D

Summer 6 4 8 6

Winter 7 6 6 9

Monsoon 8 5 10 9

DECISION THEORY
DECISION UNDER UNCERTAINTY

1. A retailer has space for up to 4 Kgs of tomato in his store. The cost per Kg is Rs 30 and the selling price per Kg is Rs
50. Any units not sold at the end of the day are wasted. He sells in Kgs only. Construct a payoff and opportunity loss table.

2. A newspaper vendor can stock up to 10 newspapers in his store. There is a guaranteed demand for 5 newspapers. Each
newspaper costs Rs 2 per unit and is sold for Rs 4. Unsold newspapers are disposed of for Rs 1 per unit. Construct a payoff
and opportunity loss table.

3. A food product company is contemplating the introduction of a new product to replace an existing product at a higher price
(S1), modifying the existing product at a moderately increased price (S2), and continuing the same product with new
packaging at a nominally increased price (S3). Sales may increase (E1), not change at all (E2) or decrease (E3) with respect
to these strategies. The profits estimated by the marketing department for each of these strategies are given below-

E1 E2 E3

S1 700,000 300,000 150,000

S2 500,000 450,000 0

S3 300,000 300,000 300,000

What strategy should the company choose on the basis of - Maximin criterion, Maximax criterion, Minimax Regret
criterion, Laplace criterion and Hurwicz criterion (α=0.8)?
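The five criteria can be applied mechanically to the payoff table of problem 3. The sketch below is one possible working, not a model answer; it assumes that α in the Hurwicz criterion is the weight given to the best payoff (the coefficient of optimism), and the helper `best` is a hypothetical name.

```python
# Payoff table from problem 3: profit of each strategy under E1, E2, E3.
payoffs = {
    "S1": [700_000, 300_000, 150_000],
    "S2": [500_000, 450_000, 0],
    "S3": [300_000, 300_000, 300_000],
}
alpha = 0.8  # assumed to weight the best payoff of each strategy


def best(scores, pick=max):
    """Return the strategy whose score is largest (or smallest for pick=min)."""
    return pick(scores, key=scores.get)


maximin = best({s: min(row) for s, row in payoffs.items()})
maximax = best({s: max(row) for s, row in payoffs.items()})
laplace = best({s: sum(row) / len(row) for s, row in payoffs.items()})
hurwicz = best({s: alpha * max(row) + (1 - alpha) * min(row)
                for s, row in payoffs.items()})

# Regret = best payoff in the column (state of nature) minus the actual payoff.
column_best = [max(col) for col in zip(*payoffs.values())]
max_regret = {s: max(b - x for b, x in zip(column_best, row))
              for s, row in payoffs.items()}
minimax_regret = best(max_regret, pick=min)

print(maximin, maximax, laplace, hurwicz, minimax_regret)
```

Only the pessimistic Maximin criterion picks the constant-payoff strategy here; the other criteria reward the higher upside of S1.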

DECISION UNDER RISK

4. A milk producer needs to determine how many litres of milk are to be produced on a daily basis to meet demand. Milk is
sold in multiples of 5 litres only and there is an assured demand for 15 litres every day. Milk costs Rs 14 per litre and is sold
at Rs 20 per litre. Unsold milk is disposed of. Past records of 200 days show the following demand pattern

Milk (Litres) 15 20 25 30 35 40 45

No. of days 4 16 20 80 40 30 10

Construct a conditional profit table, identify the best course of action for maximum expected profit and calculate EVPI
(expected value of perfect information)
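The steps for a decision under risk (conditional profit table, expected profit of each action, EVPI) can be sketched on a deliberately small example. The cost, price, stock levels and demand probabilities below are hypothetical and are not the data of problem 4; unsold units are assumed to be worth nothing.

```python
from fractions import Fraction

cost, price = 3, 5                                    # hypothetical, per unit
demand_prob = {1: Fraction(1, 2), 2: Fraction(1, 2)}  # hypothetical demand pattern
stock_levels = [1, 2]


def profit(stock, demand):
    """Conditional profit: revenue on units sold minus cost of units stocked."""
    return price * min(stock, demand) - cost * stock


# Expected profit of each stocking decision.
expected_profit = {s: sum(p * profit(s, d) for d, p in demand_prob.items())
                   for s in stock_levels}
best_stock = max(expected_profit, key=expected_profit.get)

# With perfect information the best stock is chosen for every demand level.
expected_profit_perfect_info = sum(
    p * max(profit(s, d) for s in stock_levels)
    for d, p in demand_prob.items()
)
evpi = expected_profit_perfect_info - expected_profit[best_stock]

print(expected_profit, best_stock, evpi)
```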
