
Course: Statistics

Unit 9
Regression Analysis
Table of Contents

9.1. Learning Objectives
9.2. Introduction
9.3. Regression Analysis
9.4. Regression Lines
9.5. Regression Coefficient
9.6. Differences between Correlation Coefficient and Regression Coefficient
9.7. Examples
9.8. Standard Error of Estimate
9.9. Application in Finance
9.9.1. Correlation between Two Variables
9.9.2. Beta (β) of a Stock/Share
9.10. Non-Linear Regression
9.11. Logistic Regression
9.12. Summary

A Case

Mr. Ajit is the General Manager (G.M.) of a tyre manufacturing company. He is very happy
that tyre sales are increasing, and he believes the increase is due to the sales force.
However, his secretary, Ms. Anitha, points out that the performance record sent by the
Marketing Manager does not show any change. Mr. Ajit is curious. While talking to his
friend's son, Mr. Suresh, who holds a position in the Motor Vehicle Registration office,
he learns that vehicle registrations are increasing. Mr. Ajit immediately thinks of his
statistician, Mr. Satish, and consults him. Mr. Satish promises to come back with a
solution to the problem.

(Cont. in topic ‘Differences between Correlation Coefficient and Regression Coefficient’)

9.1. Learning Objectives


By the end of this unit, you should be able to:

• Recognise the need for regression analysis
• Apply the regression equations to calculate the correlation coefficient
• Calculate the regression equations for a correlation study
• Calculate the standard error of the estimate

9.2. Introduction
The word "regress" refers to the tendency of data to move back towards the normal (average) value.

Regression is defined as, “the measure of the average relationship between two or more
variables in terms of the original units of the data.”

Correlation analysis studies the strength of the relationship between two variables x and y. Regression
analysis goes further: it quantifies the dependence of one variable on the other and predicts the average
value of y for a given value of x.

In regression, one of the variables is dependent and the others are independent. With two variables
x and y, where y depends on x, the dependence is expressed by the equation

Y = a + bX

where a is the intercept and b is the slope of the line.

9.3. Regression Analysis
Regression Analysis is used to:

• Estimate the values of the dependent variable from the values of the independent variables
• Get a measure of the error involved in using the regression line as a basis for estimation

Regression provides a mathematical relationship between two or more variables and is based on a
cause-and-effect relationship. The regression coefficients can also be used to calculate the correlation
coefficient between two variables, because their product equals the square of the correlation coefficient.

9.4. Regression Lines


For a set of paired observations there exist two straight lines.

The line drawn such that the sum of the vertical deviations is zero and the sum of their squares is a
minimum is called the regression line of y on x. It is used to estimate y-values for given x-values.

The line drawn such that the sum of the horizontal deviations is zero and the sum of their squares is a
minimum is called the regression line of x on y. It is used to estimate x-values for given y-values.

The smaller the angle between these two lines, the higher the correlation between the variables. If we fit
a straight line to scatter diagram data, some of the points will lie above the line and some below it. The
deviation of each point from the line is called the error.

The regression lines always intersect at the point of means $(\bar{x}, \bar{y})$. Their equations are as follows.

The regression equation of y on x (the simple linear regression model) is

$$Y - \bar{y} = b_{yx}\,(X - \bar{x})$$

The regression equation of x on y is

$$X - \bar{x} = b_{xy}\,(Y - \bar{y})$$

where

$$b_{yx} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_x^2 - \left(\sum d_x\right)^2} \quad\text{and}\quad b_{xy} = \frac{n\sum d_x d_y - \sum d_x \sum d_y}{n\sum d_y^2 - \left(\sum d_y\right)^2}$$

Here $d_x$ and $d_y$ are the deviations of x and y from their (assumed) means, and n is the number of pairs of observations.

Regression equations found under the above conditions are said to be fitted by the method of least squares.
$b_{yx}$ and $b_{xy}$ are called the regression coefficients.

The regression model captures the systematic behaviour of the data. The non-systematic behaviour cannot be
captured and is known as error. The errors are due to random components that cannot be predicted. Assuming
that the random errors are normally distributed, we can construct confidence intervals at a chosen
confidence level.
9.5. Regression Coefficient
The regression coefficients $b_{yx}$ and $b_{xy}$ are the slopes of the two regression lines. They provide a
mathematical relationship between the variables, are based on a cause-and-effect relationship, and can be used
to calculate the correlation coefficient. Their main properties are:

• $b_{yx} \cdot b_{xy} = r^2$, and therefore $b_{yx} \cdot b_{xy} \le 1$
• If $b_{yx}$ is negative, then $b_{xy}$ is also negative and r is negative.
• They can also be expressed as $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$ and $b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y}$
• A regression coefficient is an absolute measure.
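The product property can be checked directly from the expressions in terms of the standard deviations. The short derivation below uses only the symbols already defined in this section:

```latex
\begin{aligned}
b_{yx}\cdot b_{xy}
  &= \left(r\,\frac{\sigma_y}{\sigma_x}\right)\left(r\,\frac{\sigma_x}{\sigma_y}\right) = r^{2}, \\
-1 \le r \le 1 \;&\Rightarrow\; b_{yx}\cdot b_{xy} = r^{2} \le 1, \\
r &= \pm\sqrt{b_{yx}\, b_{xy}} .
\end{aligned}
```

The sign of r is taken to be the common sign of the two regression coefficients.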

9.6. Differences between Correlation Coefficient and Regression Coefficient

Table 9.1

| Correlation Coefficient | Regression Coefficient |
|---|---|
| r_xy = r_yx | b_yx ≠ b_xy (in general) |
| -1 ≤ r ≤ 1 | b_yx can be greater than one, but then b_xy must be less than one, so that b_yx · b_xy ≤ 1 |
| It has no units attached to it | It has units attached to it |
| Nonsense correlation can exist | There is no such thing as nonsense regression |
| It is not based on a cause-and-effect relationship | It is based on a cause-and-effect relationship |
| It indirectly helps in estimation | It is meant for estimation |

(Cont. from topic ‘A Case’)

Mr. Satish collects data on the number of vehicles registered and the number of tyres sold, as
follows:

Table 9.2

| Number of vehicles registered in week (X) | 23 | 29 | 29 | 35 | 42 | 46 | 50 | 54 | 64 | 66 | 76 | 78 |
| Number of tyres sold per week (Y) | 69 | 96 | 102 | 118 | 125 | 126 | 138 | 178 | 156 | 184 | 176 | 225 |

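He worked out the regression equation of sales on the number of vehicles registered as follows:

Table 9.3

| X | Y | X² | XY | Ŷ | (Y - Ŷ)² |
|---|---|---|---|---|---|
| 23 | 69 | 529 | 1587 | 82.432 | 180.4305 |
| 29 | 96 | 841 | 2784 | 95.7959 | 0.0416 |
| 29 | 102 | 841 | 2958 | 95.7959 | 38.4904 |
| 35 | 118 | 1225 | 4130 | 109.1594 | 78.1557 |
| 42 | 125 | 1764 | 5250 | 124.7502 | 0.0624 |
| 46 | 126 | 2116 | 5796 | 133.6592 | 58.6629 |
| 50 | 138 | 2500 | 6900 | 142.5682 | 20.8681 |
| 54 | 178 | 2916 | 9612 | 151.4772 | 703.4609 |
| 64 | 156 | 4096 | 9984 | 173.7497 | 315.0502 |
| 66 | 184 | 4356 | 12144 | 178.2042 | 33.5918 |
| 76 | 176 | 5776 | 13376 | 200.4766 | 599.1060 |
| 78 | 225 | 6084 | 17550 | 204.9311 | 402.7592 |
| Total: 592 | 1693 | 33044 | 92071 | | 2430.68 |

$$\bar{X} = \frac{592}{12} = 49.33, \qquad \bar{Y} = \frac{1693}{12} = 141.083$$

$$b_{yx} = \frac{\frac{1}{n}\sum XY - \bar{X}\,\bar{Y}}{\frac{1}{n}\sum X^2 - \bar{X}^2} = \frac{712.472}{319.889} = 2.2272$$

The regression equation is

$$Y - 141.083 = 2.2272\,(X - 49.33)$$

$$Y = 2.2272\,X + 31.2128$$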

He concludes that there is a good relationship between the variables: the increase in the
number of registrations has increased tyre sales. He further supports this by calculating
the correlation coefficient. A regression calculation using MS Excel is shown later in this
unit. This information will help Mr. Ajit to plan his future production.
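Mr. Satish's calculation can be reproduced in a few lines of code. The following is a minimal Python sketch, not part of the original case; the function name `fit_line` and the variable names are illustrative assumptions. It uses the covariance-over-variance form of the slope, which is what the ratio 712.472 / 319.889 in the working represents.

```python
# Tyre case: least-squares line of weekly tyre sales (Y) on vehicle registrations (X).
vehicles = [23, 29, 29, 35, 42, 46, 50, 54, 64, 66, 76, 78]
tyres = [69, 96, 102, 118, 125, 126, 138, 178, 156, 184, 176, 225]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares regression line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variance computed with divisor n, as in the text.
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    var_x = sum(xi * xi for xi in x) / n - mean_x ** 2
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(vehicles, tyres)

# Fitted values and the sum of squared residuals (last column of the working table).
fitted = [intercept + slope * x for x in vehicles]
sse = sum((y - f) ** 2 for y, f in zip(tyres, fitted))

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, SSE = {sse:.2f}")
```

Because the code keeps full precision for the means, the intercept it prints differs from the hand-rounded figure in the third decimal place.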

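For the husband and wife age data of Example 9.1, the standard error of estimate of Y on X is worked
out from the fitted values Ŷ = 0.521X + 7.2775 as follows:

Table 9.4

| Y | Ŷ | (Y - Ŷ)² |
|---|---|---|
| 17 | 16.6555 | 0.1187 |
| 17 | 17.1765 | 0.0312 |
| 18 | 17.6975 | 0.0915 |
| 18 | 18.2185 | 0.0477 |
| 19 | 18.7395 | 0.0679 |
| 19 | 19.2605 | 0.0679 |
| 19 | 19.7815 | 0.6107 |
| 20 | 20.3025 | 0.0915 |
| 21 | 20.8235 | 0.0312 |
| 22 | 21.3445 | 0.4297 |
| Total | | 1.5880 |

$$S_{YX} = \sqrt{\frac{1.5880}{10}} = \sqrt{0.15880} = 0.398$$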

9.7. Examples

Example 9.1:

Find the regression equations from the following data, and hence calculate the correlation coefficient.

Table 9.5

| Age of Husband | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| Age of Wife | 17 | 17 | 18 | 18 | 19 | 19 | 19 | 20 | 21 | 22 |


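Solution:

Table 9.6

| Age of husband (x) | dx = x - 22 | dx² | Age of wife (y) | dy = y - 19 | dy² | dx·dy |
|---|---|---|---|---|---|---|
| 18 | -4 | 16 | 17 | -2 | 4 | 8 |
| 19 | -3 | 9 | 17 | -2 | 4 | 6 |
| 20 | -2 | 4 | 18 | -1 | 1 | 2 |
| 21 | -1 | 1 | 18 | -1 | 1 | 1 |
| 22 | 0 | 0 | 19 | 0 | 0 | 0 |
| 23 | 1 | 1 | 19 | 0 | 0 | 0 |
| 24 | 2 | 4 | 19 | 0 | 0 | 0 |
| 25 | 3 | 9 | 20 | 1 | 1 | 3 |
| 26 | 4 | 16 | 21 | 2 | 4 | 8 |
| 27 | 5 | 25 | 22 | 3 | 9 | 15 |
| Total: 225 | 5 | 85 | 190 | 0 | 24 | 43 |

$$\bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19$$

Regression equation of Y on X:

$$b_{yx} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^2} = \frac{430}{825} = 0.521$$

$$Y - 19 = 0.521\,(X - 22.5)$$

$$Y = 0.521X + 7.2775$$

Regression equation of X on Y:

$$b_{xy} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \frac{430}{240} = 1.792$$

$$X - 22.5 = 1.792\,(Y - 19)$$

$$X = 1.792Y - 11.548$$

Correlation coefficient:

$$r = \sqrt{0.521 \times 1.792} = 0.966$$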


Using MS Excel - Procedure


Regression Analysis

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.966353136 |
| R Square | 0.933838384 |
| Adjusted R Square | 0.925568182 |
| Standard Error | 0.445516384 |
| Observations | 10 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 22.41212121 | 22.41212121 | 112.9160305 | 5.38409E-06 |
| Residual | 8 | 1.587878788 | 0.198484848 | | |
| Total | 9 | 24 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 7.272727273 | 1.112575252 | 6.536840775 | 0.000180955 | 4.707124143 | 9.838330403 |
| Age of Husband | 0.521212121 | 0.04904974 | 10.62619549 | 5.38409E-06 | 0.408103219 | 0.634321023 |

Residual Output

| Observation | Predicted Age of Wife | Residuals |
|---|---|---|
| 1 | 16.65454545 | 0.345454545 |
| 2 | 17.17575758 | -0.175757576 |
| 3 | 17.6969697 | 0.303030303 |
| 4 | 18.21818182 | -0.218181818 |
| 5 | 18.73939394 | 0.260606061 |
| 6 | 19.26060606 | -0.260606061 |
| 7 | 19.78181818 | -0.781818182 |
| 8 | 20.3030303 | -0.303030303 |
| 9 | 20.82424242 | 0.175757576 |
| 10 | 21.34545455 | 0.654545455 |
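The deviation method used in the hand solution of Example 9.1 can also be sketched in code. The Python below is an illustrative sketch rather than part of the original example; the function name `regression_coefficients` and its parameters are assumptions. It takes deviations about the assumed means 22 and 19, as in Table 9.6, and should agree with the coefficient, intercept and Multiple R reported in the Excel output above.

```python
import math

# Example 9.1 data: husband's age is X, wife's age is Y.
husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]


def regression_coefficients(x, y, assumed_x, assumed_y):
    """Return (b_yx, b_xy) using deviations taken about assumed means."""
    n = len(x)
    dx = [xi - assumed_x for xi in x]
    dy = [yi - assumed_y for yi in y]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    numerator = n * sum_dxdy - sum_dx * sum_dy
    b_yx = numerator / (n * sum_dx2 - sum_dx ** 2)
    b_xy = numerator / (n * sum_dy2 - sum_dy ** 2)
    return b_yx, b_xy


b_yx, b_xy = regression_coefficients(husband, wife, assumed_x=22, assumed_y=19)

# Both lines pass through the point of means.
mean_x = sum(husband) / len(husband)
mean_y = sum(wife) / len(wife)
intercept_yx = mean_y - b_yx * mean_x
intercept_xy = mean_x - b_xy * mean_y

# r is the square root of the product, with the sign of the coefficients.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(f"Y = {b_yx:.4f} X + {intercept_yx:.4f}")
print(f"X = {b_xy:.4f} Y + {intercept_xy:.4f}")
print(f"r = {r:.4f}")
```

The choice of assumed means does not affect the result: calling the function with any other pair, for example the true means 22.5 and 19, gives the same two coefficients.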

Example 9.2:

In a correlation study we have the following data.

Table 9.7

| | Series X | Series Y |
|---|---|---|
| Mean | 65 | 67 |
| S.D. | 2.5 | 3.5 |

Correlation coefficient r = 0.8

Find the two regression equations.


Solution:

Regression equation of y on x:

$$Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X})$$

$$Y - 67 = (0.8)\left(\frac{3.5}{2.5}\right)(X - 65)$$

$$Y - 67 = 1.12\,(X - 65)$$

$$Y = 1.12X - 5.8$$

Regression equation of x on y:

$$X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y})$$

$$X - 65 = (0.8)\left(\frac{2.5}{3.5}\right)(Y - 67)$$

$$X - 65 = 0.5714\,(Y - 67)$$

$$X = 0.5714Y + 26.72$$
Example

A study of wheat prices at Mumbai and Kanpur yields the following data:

| | Mumbai | Kanpur |
|---|---|---|
| Mean | 7.50 | 8.10 |
| Standard Deviation | 0.326 | 0.207 |

The correlation coefficient between the prices of Mumbai and Kanpur is 0.774. Estimate the price at Kanpur,
if the price at Mumbai is Rs.8.

Solution:

Given:

$\bar{X} = 7.50$, $\bar{Y} = 8.10$, $\sigma_x = 0.326$, $\sigma_y = 0.207$, $r = 0.774$

The regression equation we need is that of Y on X (where X is the price at Mumbai and Y is the price at Kanpur):

$$Y - \bar{Y} = b_{yx}\,(X - \bar{X}) \qquad \text{eq. (1)}$$

where $b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$.

Substituting the values in eq. (1), we get

$$Y - 8.10 = 0.774 \times \frac{0.207}{0.326}\,(X - 7.50)$$

$$Y - 8.10 = 0.4914\,(X - 7.50)$$

$$Y = 0.4914X + 4.4145$$

Estimate of the price at Kanpur when the price at Mumbai is Rs. 8:

$$Y = 0.4914 \times 8 + 4.4145 = 8.3457$$

The price at Kanpur is Rs. 8.35 when the price at Mumbai is Rs. 8.

Example

The following table shows the amount spent on advertising and the corresponding sales of the product from
10 companies:

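| Company | Sales (Rs. in lakh) | Advertising cost (Rs. in lakh) |
|---|---|---|
| A | 25 | 8 |
| B | 35 | 12 |
| C | 29 | 11 |
| D | 24 | 5 |
| E | 38 | 14 |
| F | 12 | 3 |
| G | 18 | 6 |
| H | 27 | 8 |
| I | 17 | 4 |
| J | 30 | 9 |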

a. Plot a scatter gram showing the relationship between advertising cost and sales of the
product.

b. Estimate the equation of the regression line of sales on advertising costs.

c. Use the regression line to forecast sales if advertising costs were Rs. 10 lakh.

Solution:

a. A scattergram showing the relationship between advertising cost and sales of the product.

[Figure: scatter diagram of Sales (Rs. in lakh), vertical axis from 0 to 40, against Advertising cost (Rs. in lakh), horizontal axis from 0 to 15.]

b. The equation of the regression line of sales on advertising costs.

| Y | X | X² | XY |
|---|---|---|---|
| 25 | 8 | 64 | 200 |
| 35 | 12 | 144 | 420 |
| 29 | 11 | 121 | 319 |
| 24 | 5 | 25 | 120 |
| 38 | 14 | 196 | 532 |
| 12 | 3 | 9 | 36 |
| 18 | 6 | 36 | 108 |
| 27 | 8 | 64 | 216 |
| 17 | 4 | 16 | 68 |
| 30 | 9 | 81 | 270 |
| ΣY = 255 | ΣX = 80 | ΣX² = 756 | ΣXY = 2289 |

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}$$

$$b = \frac{10 \times 2289 - 80 \times 255}{10 \times 756 - 80^2} = \frac{2490}{1160} = 2.14655$$

$$a = \frac{255}{10} - 2.14655 \times \frac{80}{10} = 25.5 - 17.1724 = 8.3276$$

Therefore, Y = 8.33 + 2.15X

c. To forecast sales if advertising costs were Rs. 10 lakh, we put X = 10 in the equation:

Y = 8.33 + 2.15 × 10
= 29.83

As the original data were given to the nearest integer (whole number), the forecast of sales
= 30 (or Rs. 30 lakh)
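The column totals and the two formulas for b and a can be checked with a short script. This is a minimal Python sketch under the same formulas as the worked solution; the helper name `forecast_sales` is an illustrative assumption and not part of the original example.

```python
# Advertising example: sales (Y) and advertising cost (X), both in Rs. lakh.
sales = [25, 35, 29, 24, 38, 12, 18, 27, 17, 30]
advertising = [8, 12, 11, 5, 14, 3, 6, 8, 4, 9]

n = len(sales)
sum_x = sum(advertising)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in advertising)
sum_xy = sum(x * y for x, y in zip(advertising, sales))

# Slope and intercept from the normal-equation formulas in the solution.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n


def forecast_sales(cost):
    """Forecast sales (Rs. lakh) for a given advertising cost (Rs. lakh)."""
    return a + b * cost


print("totals:", sum_y, sum_x, sum_x2, sum_xy)
print(f"b = {b:.5f}, a = {a:.4f}")
print("forecast for X = 10, to the nearest whole number:", round(forecast_sales(10)))
```

Using the unrounded coefficients gives a forecast slightly below the figure obtained from the rounded equation, but it rounds to the same whole number.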

9.8. Standard Error of Estimate


The standard error of estimate measures the accuracy of the figures estimated by regression analysis.
A small standard error of estimate shows that the estimates given by the regression equation lie close to
the observed values. If the standard error of estimate is zero, there is no variation about the regression
line and the correlation is perfect.

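“The standard error of estimate is used to ascertain how good and representative the regression
line is as a description of the average relationship between two series.”

The standard error of estimate of the X values about the computed values $X_c$ (regression of X on Y) is:

$$S_{x \cdot y} = \sqrt{\frac{\sum (X - X_c)^2}{N}}$$

The standard error of estimate of the Y values about the computed values $Y_c$ (regression of Y on X) is:

$$S_{y \cdot x} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}}$$

which can also be computed as

$$S_{y \cdot x} = \sigma_y \sqrt{1 - r^2} \quad\text{or}\quad S_{y \cdot x} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N}}$$

where a and b are the intercept and slope of the regression line of Y on X.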
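The formulas for the standard error of estimate of Y on X can be sketched in code. The Python below is an illustration, not the unit's own procedure: the function names and the `ddof` parameter are assumptions introduced here. It uses the age data of Example 9.1 (Table 9.5). With `ddof=0` it divides by N as in this section; with `ddof=2` it divides by N - 2, which is the convention behind the "Standard Error" line in the MS Excel output for that example.

```python
import math

# Ages from Example 9.1 (Table 9.5): husband = X, wife = Y.
husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]


def sums_of_squares(x, y):
    """Return (mean_x, mean_y, S_xx, S_yy, S_xy) about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return mean_x, mean_y, s_xx, s_yy, s_xy


def standard_error_of_estimate(x, y, ddof=0):
    """sqrt(sum((Y - Yc)^2) / (N - ddof)) for the regression of y on x."""
    mean_x, mean_y, s_xx, _, s_xy = sums_of_squares(x, y)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - ddof))


def shortcut_standard_error(x, y):
    """sigma_y * sqrt(1 - r^2), with sigma_y computed using divisor N."""
    _, _, s_xx, s_yy, s_xy = sums_of_squares(x, y)
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return math.sqrt(s_yy / len(x)) * math.sqrt(1 - r_squared)


s_n = standard_error_of_estimate(husband, wife)             # divisor N
s_n_minus_2 = standard_error_of_estimate(husband, wife, 2)  # divisor N - 2
s_shortcut = shortcut_standard_error(husband, wife)

print(f"divisor N: {s_n:.4f}, shortcut: {s_shortcut:.4f}, divisor N-2: {s_n_minus_2:.4f}")
```

The residual formula and the shortcut formula agree when both use the divisor N, which is a useful check on a hand calculation.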

Example 9.3:

The following results were worked out from scores in Statistics and Mathematics in a
certain examination.

Table 9.8

| | Scores in Statistics (X) | Scores in Mathematics (Y) |
|---|---|---|
| Mean | 40 | 48 |
| Standard Deviation | 10 | 15 |

Karl Pearson’s correlation coefficient between x and y is = + 0.42. Find the regression lines
x on y and y on x. Use the regression lines to find the value of y when x = 50 and value of x
when y = 30.

Solution:
Given the following data:

$\bar{X} = 40$; $\bar{Y} = 48$; $\sigma_x = 10$; $\sigma_y = 15$; $r = 0.42$

The regression line of x on y is:

$$(X - \bar{X}) = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y}) \qquad (1)$$

The regression line of y on x is:

$$(Y - \bar{Y}) = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X}) \qquad (2)$$

Substituting the values, we get the respective equations:

$$X = 0.28Y + 26.56 \qquad (3)$$

$$Y = 0.63X + 22.80 \qquad (4)$$

Therefore:
When y = 30, x = 34.96, using equation (3)
When x = 50, y = 54.3, using equation (4)
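When only the means, standard deviations and correlation coefficient are known, both regression lines follow from equations (1) and (2). The Python sketch below illustrates this for the data of Example 9.3; the function name `regression_lines` and its return layout are assumptions made for this illustration.

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Return ((a_yx, b_yx), (a_xy, b_xy)) for the lines
    Y = a_yx + b_yx * X  and  X = a_xy + b_xy * Y."""
    b_yx = r * sd_y / sd_x
    b_xy = r * sd_x / sd_y
    # Both lines pass through the point of means.
    a_yx = mean_y - b_yx * mean_x
    a_xy = mean_x - b_xy * mean_y
    return (a_yx, b_yx), (a_xy, b_xy)


# Example 9.3: Statistics scores (X) and Mathematics scores (Y).
(a_yx, b_yx), (a_xy, b_xy) = regression_lines(
    mean_x=40, mean_y=48, sd_x=10, sd_y=15, r=0.42
)

y_at_50 = a_yx + b_yx * 50  # estimate of y when x = 50
x_at_30 = a_xy + b_xy * 30  # estimate of x when y = 30

print(f"Y = {b_yx:.2f} X + {a_yx:.2f};  y(50) = {y_at_50:.2f}")
print(f"X = {b_xy:.2f} Y + {a_xy:.2f};  x(30) = {x_at_30:.2f}")
```

The same function can be reused for Example 9.2 and for the wheat price example by passing their means, standard deviations and r.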

Example 9.4:

From the following data obtain the two regression equations


Table 9.9

| X | 12 | 4 | 20 | 8 | 16 |
| Y | 18 | 22 | 10 | 16 | 14 |

Estimate Y for X = 15 and estimate X for Y = 20.


Solution:
$\bar{X}$ = (12 + 4 + 20 + 8 + 16) / 5 = 12 = mean of X
$\bar{Y}$ = (18 + 22 + 10 + 16 + 14) / 5 = 16 = mean of Y

Table 9.10

| X | Y | X - X̄ (= X - 12) | Y - Ȳ (= Y - 16) | (X - X̄)² | (Y - Ȳ)² | (X - X̄)(Y - Ȳ) |
|---|---|---|---|---|---|---|
| 12 | 18 | 0 | 2 | 0 | 4 | 0 |
| 4 | 22 | -8 | 6 | 64 | 36 | -48 |
| 20 | 10 | 8 | -6 | 64 | 36 | -48 |
| 8 | 16 | -4 | 0 | 16 | 0 | 0 |
| 16 | 14 | 4 | -2 | 16 | 4 | -8 |
| Total | | | | 160 | 80 | -104 |

$$b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{-104}{160} = -0.65 \quad\text{and}\quad b_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (Y - \bar{Y})^2} = \frac{-104}{80} = -1.3$$

Regression equation of X on Y is given by:

$$X - \bar{X} = b_{xy}\,(Y - \bar{Y})$$

X - 12 = -1.3 (Y - 16)
Therefore, X = 32.8 - 1.3Y
When Y = 20: X = 32.8 - 1.3 × 20 = 6.8

Regression equation of Y on X is given by:

$$Y - \bar{Y} = b_{yx}\,(X - \bar{X})$$

Y - 16 = -0.65 (X - 12)
Therefore, Y = 23.8 - 0.65X
When X = 15: Y = 23.8 - 0.65 × 15 = 14.05

9.9. Application in Finance
9.9.1. Correlation between Two Variables

The correlation between two variables can be studied for:

• Time series data
• Cross-sectional data, that is, data about, for example, sales revenue and advertisement expenses
during a year for a number of companies

The results and conclusions from time series data are valid for one company only, whereas those from
cross-sectional data are valid for a group of companies at industry level.

One can determine regression equation between advertisement expenses and sales revenue
for different sectors of industries say, manufacturing, IT, chemical, pharmaceutical etc.

We may take a particular company and study the correlation between prices of its stock in BSE and NSE.

9.9.2. Beta (β) of a Stock/Share

Beta is a measure of the sensitivity of a stock to movements in the stock market as a whole,
as represented by an index such as the NSE Nifty or the BSE Sensex. The beta of the market
itself is always taken as one.

A stock with a beta of more than one, say 1.10, would be expected to rise 10% more than the market index
when the market rises, and to fall 10% more than the index when the market falls.

The volatility of stock is measured by its beta value. Beta represents the risk associated with the stock.

 An aggressive investor would opt for a stock with beta value more than one.
 A conservative investor would opt for the stock with beta value less than one.

Beta is measured through regression analysis. The percentage daily/weekly/monthly change in the stock price
is taken as the dependent variable, and the corresponding change in a market index such as the BSE Sensex or
NSE Nifty is taken as the independent variable. The regression equation fitted is of the form Y = α + βX.

Thus a stock's β measures the relationship between the stock's rate of return (Y) and the average rate of
return for the market as a whole (X).

The coefficient of determination r² obtained in the study measures the proportion of the variation in the
stock's returns that is explained by the market.

Example 9.5:

The following data relate to the closing BSE Sensex and the stock price of RIL for 10 trading
days during a period. Find β and interpret it.

Table 9.11
Days BSE Stock price of RIL
1 12342 1150
2 12378 1163
3 12360 1148
4 12461 1150
5 12479 1147
6 12538 1169
7 12730 1192
8 12928 1213
9 12848 1216
10 12885 1208

Solution:
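First we calculate the percentage changes in both BSE (X) and RIL (Y) as follows:

$$\text{Percentage change in BSE / RIL} = \frac{\text{value for 2nd day} - \text{value for 1st day}}{\text{value for 1st day}} \times 100$$

and similarly for each subsequent pair of consecutive days.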

Table 9.12
X Y
+0.2917 1.1304
-0.1454 -1.2898
0.8172 0.1742
0.1445 -0.2609
0.4728 1.9180
1.5313 1.9675
1.5554 1.7617
-0.6188 0.2473
0.2880 -0.6579


Using MS Excel - Procedure


Regression Analysis

Regression Statistics

| Statistic | Value |
|---|---|
| Multiple R | 0.657986268 |
| R Square | 0.432945929 |
| Adjusted R Square | 0.351938204 |
| Standard Error | 0.961822395 |
| Observations | 9 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 4.9442110 | 4.9442110 | 5.34450178 | 0.05404187 |
| Residual | 7 | 6.4757162 | 0.9251023 | | |
| Total | 8 | 11.419927 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 0.0291111 | 0.392985 | 0.0740769 | 0.9430215 | -0.9001508 | 0.958373159 |
| X | 1.0903451 | 0.4716397 | 2.3118178 | 0.0540418 | -0.0249055 | 2.205595895 |
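The coefficient of X in the output above is the stock's β, and the intercept is α. As a cross-check, the following Python sketch recomputes both from the closing values in Table 9.11. It is an illustration only; the function name `pct_changes` and the variable names are assumptions, not part of the original example.

```python
# Closing values from Table 9.11.
bse = [12342, 12378, 12360, 12461, 12479, 12538, 12730, 12928, 12848, 12885]
ril = [1150, 1163, 1148, 1150, 1147, 1169, 1192, 1213, 1216, 1208]


def pct_changes(series):
    """Percentage change from each day to the next."""
    return [(curr - prev) / prev * 100 for prev, curr in zip(series, series[1:])]


x = pct_changes(bse)  # market returns (independent variable)
y = pct_changes(ril)  # stock returns (dependent variable)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

beta = s_xy / s_xx                       # slope of Y = alpha + beta * X
alpha = mean_y - beta * mean_x           # intercept
r_squared = s_xy ** 2 / (s_xx * s_yy)    # coefficient of determination

print(f"beta = {beta:.4f}, alpha = {alpha:.4f}, r^2 = {r_squared:.4f}")
```

A β of about 1.09 suggests that, over these trading days, RIL moved slightly more than proportionally with the Sensex. With only nine observations, an r² of about 0.43 and a P-value of about 0.054 for the slope, this estimate should be treated as tentative.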

9.10. Non-Linear Regression


A test of hypothesis on the regression coefficient tells us whether a linear relationship exists or not.
If the relationship is not linear, it can often be converted to a linear relationship by taking
logarithms.

Consider the relation $y = ab^x$. Taking logarithms, this can be written as:

$$\log y = \log a + x \log b$$

that is,

$$Y = A + BX$$

where Y = log y, A = log a, B = log b and X = x.

Example 9.6:

Consider the following incentive scheme and the turnover expected

Table 9.13
Incentive increase in % of Base Year Turnover (Rs. in crores)

1 110
2 120
3 132
5 160
8 215
10 260

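Fit a curve of the type $y = ax^b$.

Solution:

Taking logarithms,

log y = log a + b log x

that is, Y = A + BX, where Y = log y, X = log x, A = log a and B = b.

Table 9.14

| X = log x | Y = log y | X² | XY |
|---|---|---|---|
| 0 | 2.04 | 0 | 0 |
| 0.30 | 2.08 | 0.09 | 0.63 |
| 0.48 | 2.12 | 0.23 | 1.01 |
| 0.70 | 2.20 | 0.49 | 1.54 |
| 0.90 | 2.33 | 0.81 | 2.11 |
| 1.00 | 2.41 | 1.00 | 2.41 |
| Total: 3.38 | 13.19 | 2.62 | 7.70 |

$$B = \frac{6 \times 7.7 - 3.38 \times 13.19}{6 \times 2.62 - (3.38)^2} = \frac{46.2 - 44.5822}{15.72 - 11.4244} = 0.3766$$

$$A = \frac{\sum Y - B\sum X}{n} = \frac{13.19 - 0.3766 \times 3.38}{6} = 1.99$$

Taking antilogs, a = antilog(1.99) = 97.72, and the fitted curve is

$$y = 97.72\,x^{0.3766}$$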
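The log-transform fit of Example 9.6 can be sketched in code. The Python below is a minimal illustration, assuming the power-curve model $y = ax^b$ stated in the example and common (base-10) logarithms; the function name `fit_power_curve` is an assumption. Because it uses full-precision logarithms rather than two-decimal table values, its coefficients differ slightly from the hand calculation.

```python
import math

# Example 9.6 data (Table 9.13).
incentive = [1, 2, 3, 5, 8, 10]               # x: incentive increase, % of base year
turnover = [110, 120, 132, 160, 215, 260]     # y: turnover, Rs. in crores


def fit_power_curve(x, y):
    """Fit y = a * x**b by least squares on (log10 x, log10 y)."""
    log_x = [math.log10(v) for v in x]
    log_y = [math.log10(v) for v in y]
    n = len(log_x)
    sum_x, sum_y = sum(log_x), sum(log_y)
    sum_x2 = sum(v * v for v in log_x)
    sum_xy = sum(u * v for u, v in zip(log_x, log_y))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    log_a = (sum_y - b * sum_x) / n
    return 10 ** log_a, b


a, b = fit_power_curve(incentive, turnover)
print(f"y = {a:.2f} * x^{b:.4f}")

# Compare fitted and observed turnover at each incentive level.
for x_value, y_value in zip(incentive, turnover):
    print(x_value, y_value, round(a * x_value ** b, 1))
```

Printing the fitted values next to the observed ones is a quick way to judge whether the chosen curve type suits the data.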

Example
Find the second-degree regression polynomial $y = a + bx + cx^2$ by the least squares method for the data given
below.

| X | 0 | 1 | 2 | 3 | 4 |
| Y | 1 | 0 | 3 | 10 | 21 |

Solution:
We need to fit a second-degree regression polynomial of the form $y = a + bx + cx^2$. To obtain the values of the
constants a, b and c, the normal equations are:

∑y = Na + b∑x + c∑x²
∑xy = a∑x + b∑x² + c∑x³
∑x²y = a∑x² + b∑x³ + c∑x⁴

Calculation

| X | Y | X² | XY | X²Y | X³ | X⁴ |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2 | 3 | 4 | 6 | 12 | 8 | 16 |
| 3 | 10 | 9 | 30 | 90 | 27 | 81 |
| 4 | 21 | 16 | 84 | 336 | 64 | 256 |
| Total: 10 | 35 | 30 | 120 | 438 | 100 | 354 |

Substituting the values in the normal equations gives the simultaneous equations:
35 = 5a + 10b + 30c
120 = 10a + 30b + 100c
438 = 30a + 100b + 354c

Solving these, we get:
a = 1
b = -3
c = 2
Therefore, the second-degree parabola is Y = 1 - 3x + 2x².
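The normal equations above form a 3 × 3 linear system, which can be solved exactly. The Python sketch below builds the system from the data and solves it by Cramer's rule using exact fractions; the function names `det3` and `fit_quadratic` are illustrative assumptions, and Cramer's rule is chosen here only because it is compact for three unknowns.

```python
from fractions import Fraction

xs = [0, 1, 2, 3, 4]
ys = [1, 0, 3, 10, 21]


def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(x, y):
    """Solve the normal equations for y = a + b*x + c*x**2; returns (a, b, c)."""
    # s[k] = sum of x^k (s[0] = N); t[k] = sum of x^k * y.
    s = [sum(Fraction(v) ** k for v in x) for k in range(5)]
    t = [sum(Fraction(v) ** k * w for v, w in zip(x, y)) for k in range(3)]
    matrix = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    d = det3(matrix)
    coefficients = []
    for col in range(3):
        # Cramer's rule: replace one column by the right-hand side.
        replaced = [row[:] for row in matrix]
        for i in range(3):
            replaced[i][col] = t[i]
        coefficients.append(det3(replaced) / d)
    return tuple(coefficients)


a, b, c = fit_quadratic(xs, ys)
print(f"y = {a} + ({b})x + ({c})x^2")
```

For this data set the parabola passes through every point exactly, so the residuals are all zero.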

9.11. Logistic Regression
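In the linear regression model the dependent variable is assumed to take continuous values in an interval.
However, there are situations in which the dependent variable is binary (success or failure), so that the
number of successes follows a binomial distribution. In such cases logistic regression is used.

The relationship between the dependent and independent variables is of the form

$$P = \frac{1}{1 + e^{-y}}$$

where P is the probability of success. Rearranging,

$$\frac{1}{P} = 1 + e^{-y} \quad\text{or}\quad e^{-y} = \frac{1}{P} - 1 = \frac{1 - P}{P}$$

$$e^{y} = \frac{P}{1 - P} \quad\text{or}\quad y \log_e e = \log P - \log(1 - P)$$

so that $y = \log\dfrac{P}{1 - P}$, which is then modelled as a linear function of X:

$$Y = A + BX$$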

Example 9.7:

Suppose an event is either a success or a failure. The dependent variable Y takes the value 1 for a
success and 0 for a failure. The corresponding revenue (X) is given for twenty events as follows:

Y X
0 3.45
1 3.36
0 3.12
0 3.15
0 3.14
1 3.48
1 3.42
1 3.32
0 3.31
1 3.29
1 3.46
1 3.34
0 3.25
1 3.41
1 3.48
1 3.21
1 3.25
1 3.16
1 3.28
0 3.22

Then the linear regression equation is Y = 1.881X - 5.566.

Note:

It is left as an exercise for the reader to find this regression equation.

This regression equation does not yield Y = 0 or Y = 1. For example, when we put X = 2,
Y = 3.762 - 5.566 = -1.804, which is less than 0.
Therefore we require a different technique to predict the Y-value.
Let us construct class intervals:

| Class interval of X | Mid X | Probability of success P | Y = log(P / (1 - P)) |
|---|---|---|---|
| 3.1-3.2 | 3.15 | 1/4 = 0.25 | -0.477 |
| 3.2-3.3 | 3.25 | 4/6 = 0.67 | 0.308 |
| 3.3-3.4 | 3.35 | 3/4 = 0.75 | 0.477 |
| 3.4-3.5 | 3.45 | 5/6 = 0.83 | 0.689 |


Note:

There are 4 readings in the interval 3.1-3.2 and only one of them corresponds to Y = 1, so
P = 1/4.

The regression equation of Y on X is Y = 3.667X - 11.852

(or) $\log\dfrac{P}{1 - P} = 3.667X - 11.852$

The P values are given by:

$$P = \frac{e^{3.667X - 11.852}}{1 + e^{3.667X - 11.852}}$$

For example, when X = 2.7, Y = -1.9511 and P = 12%.
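The grouping step and the final probability formula of Example 9.7 can be sketched as follows. This Python is an illustration, not the unit's own procedure: the function names `success_rates` and `probability_of_success` are assumptions, and the coefficients 3.667 and -11.852 are simply taken from the fitted equation in the text.

```python
import math

# Example 9.7 data: outcome Y (1 = success, 0 = failure) and revenue X.
outcomes = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0]
revenue = [3.45, 3.36, 3.12, 3.15, 3.14, 3.48, 3.42, 3.32, 3.31, 3.29,
           3.46, 3.34, 3.25, 3.41, 3.48, 3.21, 3.25, 3.16, 3.28, 3.22]


def success_rates(x, y, width_hundredths=10):
    """Group x into class intervals of width 0.1 and return
    {lower class limit: (number of successes, number of readings)}."""
    groups = {}
    for xi, yi in zip(x, y):
        # Work in whole hundredths to avoid floating-point boundary problems.
        lower = (round(xi * 100) // width_hundredths) * width_hundredths / 100
        successes, count = groups.get(lower, (0, 0))
        groups[lower] = (successes + yi, count + 1)
    return dict(sorted(groups.items()))


def probability_of_success(x, a=-11.852, b=3.667):
    """P = e^Y / (1 + e^Y) with Y = a + b * x (coefficients from the text)."""
    y = a + b * x
    return math.exp(y) / (1 + math.exp(y))


rates = success_rates(revenue, outcomes)
for lower, (successes, count) in rates.items():
    print(f"{lower:.1f}-{lower + 0.1:.1f}: P = {successes}/{count}")

p = probability_of_success(2.7)
print(f"P at X = 2.7: {p:.2%}")
```

The grouped proportions match the class-interval table, and the probability at X = 2.7 comes out at roughly 12%, as stated in the example.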

9.12. Summary
In this unit we learnt what regression is, how to measure it and how to interpret regression output from
MS Excel. The application of regression in the financial field was explained with an example. We also
learnt how to calculate the standard error of the estimate.
