
Linear Regression Analysis

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R2 and Adjusted R2
Overall Validity of the Model (F test)
Testing for individual regressor (t test)
Problem of Multicollinearity

Smoking and Lung Capacity


Suppose, for example, we want to investigate the
relationship between cigarette smoking and lung
capacity
We might ask a group of people about their smoking
habits, and measure their lung capacities
Cigarettes (X):      0    5    10   15   20
Lung Capacity (Y):   45   42   33   31   29

Scatter plot of the data
[Scatter plot: Lung Capacity (0 to 60) versus Cigarettes (0 to 30).]

We can see that as smoking goes up, lung capacity tends to go down.
The two variables change in opposite directions.

Height and Weight

Consider the following data of heights and weights of 5


women swimmers:
Height (inch):      62    64    65    66    68
Weight (pounds):    102   108   115   128   132
We can observe that weight is also increasing with height.
[Scatter plot: Weight versus Height.]

Sometimes two variables are related to each other.
The values of both of the variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
Height and Weight
Advertising Expenditure and Sales Volume
Unemployment and Crime Rate
Rainfall and Food Production
Expenditure and Savings

We have already studied one measure of relationship between two variables: Covariance.
Covariance between two random variables X and Y is given by
Cov(X, Y) = E(XY) - E(X) E(Y)
For paired observations on variables X and Y,
Cov(X, Y) = (1/n) Σ (xi - x̄)(yi - ȳ),  summed over i = 1, ..., n


Correlation

Properties of Covariance:

Cov(X+a, Y+b) = Cov(X, Y)


[not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y)
[affected by change in scale]
Covariance can take any value from -∞ to +∞.
Cov(X,Y) > 0 means X and Y change in the same direction
Cov(X,Y) < 0 means X and Y change in the opposite direction
If X and Y are independent, Cov(X,Y) = 0 [other way may not be true]

It is not unit free.


So it is not a good measure of relationship between two
variables.
A better measure is correlation coefficient.
It is unit free and takes values in [-1,+1].

Karl Pearson's correlation coefficient is given by
rXY = Corr(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) )
When the joint distribution of X and Y is known,
Cov(X, Y) = E(XY) - E(X) E(Y),
Var(X) = E(X²) - [E(X)]²,   Var(Y) = E(Y²) - [E(Y)]²

Properties of Correlation Coefficient

Corr(aX + b, cY + d) = Corr(X, Y), when a and c have the same sign.


It is unit free.
It measures the strength of relationship on a
scale of -1 to +1.
So, it can be used to compare the relationships of
various pairs of variables.
Values close to 0 indicate little or no correlation
Values close to +1 indicate very strong positive
correlation.
Values close to -1 indicate very strong negative
correlation.

Scatter Diagram
[Scatter plots illustrating positively correlated, negatively correlated, weakly correlated, strongly correlated, and uncorrelated data.]

The correlation coefficient measures the strength of the linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but not a linear one.

x       y      x - x̄    y - ȳ    (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
1.25    125    -0.90     45      0.8100     2025       -40.50
1.75    105    -0.40     25      0.1600     625        -10.00
2.25     65     0.10    -15      0.0100     225         -1.50
2.00     85    -0.15      5      0.0225     25          -0.75
2.50     75     0.35     -5      0.1225     25          -1.75
2.25     80     0.10      0      0.0100     0            0.00
2.70     50     0.55    -30      0.3025     900        -16.50
2.50     55     0.35    -25      0.1225     625         -8.75
---------------------------------------------------------------
17.20   640                      1.560      4450       -79.75
                                 (SSX)      (SSY)       (SSXY)

For paired observations,
Cov(X, Y) = (1/n) Σ (xi - x̄)(yi - ȳ),
Var(X) = (1/n) Σ (xi - x̄)²,   Var(Y) = (1/n) Σ (yi - ȳ)²
so that
rXY = Corr(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) ) = Σ (x - x̄)(y - ȳ) / √( Σ (x - x̄)² Σ (y - ȳ)² )

For the data in the table,
r = SSXY / √( SSX · SSY ) = -79.75 / √( 1.56 × 4450 ) = -0.957
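The same computation can be scripted. Below is a minimal sketch in Python with NumPy (an illustration added here, not part of the original slides) that computes SSX, SSY, SSXY and r for the tabulated data, with np.corrcoef as a cross-check.

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

ssx = np.sum((x - x.mean()) ** 2)                  # SSX, about 1.56
ssy = np.sum((y - y.mean()) ** 2)                  # SSY, 4450
ssxy = np.sum((x - x.mean()) * (y - y.mean()))     # SSXY, -79.75

r = ssxy / np.sqrt(ssx * ssy)
print(ssx, ssy, ssxy, r)            # r should be about -0.957
print(np.corrcoef(x, y)[0, 1])      # same value computed directly by NumPy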


Alternative Formulas for Sum of Squares

SSX = Σx² - (Σx)²/n,   SSY = Σy² - (Σy)²/n,   SSXY = Σxy - (Σx)(Σy)/n

x       y       x²        y²       x·y
1.25    125     1.5625    15625    156.25
1.75    105     3.0625    11025    183.75
2.25     65     5.0625     4225    146.25
2.00     85     4.0000     7225    170.00
2.50     75     6.2500     5625    187.50
2.25     80     5.0625     6400    180.00
2.70     50     7.2900     2500    135.00
2.50     55     6.2500     3025    137.50
---------------------------------------------
17.20   640     38.54     55650   1296.25

SSX  = 38.54 - (17.20)²/8 = 1.56
SSY  = 55650 - (640)²/8 = 4450
SSXY = 1296.25 - (17.20)(640)/8 = -79.75

Smoking and Lung Capacity Example

Cigarettes (X)        0      5      10     15     20     Total:   50
Lung Capacity (Y)     45     42     33     31     29     Total:  180
X²                    0      25     100    225    400    Total:  750
Y²                    2025   1764   1089   961    841    Total: 6680
XY                    0      210    330    465    580    Total: 1585

Regression Analysis

Having determined the correlation between X and Y, we


wish to determine a mathematical relationship between
them.
Dependent variable: the variable you wish to explain
Independent variables: the variables used to explain the
dependent variable
Regression analysis is used to:
Predict the value of dependent variable based on the
value of independent variable(s)
Explain the impact of changes in an independent
variable on the dependent variable

For the smoking and lung capacity data,

r = [5(1585) - (50)(180)] / √{ [5(750) - 50²] [5(6680) - 180²] }
  = (7925 - 9000) / √{ (3750 - 2500)(33400 - 32400) }
  = -1075 / √(1250 × 1000)
  = -0.9615

Types of Relationships

[Scatter plots illustrating linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship.]

Simple Linear Regression Analysis

The simplest mathematical relationship is
Y = a + bX + error   (linear)
Changes in Y are related to changes in X.
What are the most suitable values of a (intercept) and b (slope)?

Method of Least Squares

We want to fit a line for which all the errors are minimum.
That is, we want to obtain the values of a and b in Y = a + bX + error for which all the errors are minimum.
[Diagram: for an observation (xi, yi), the error is the vertical distance between yi and the fitted value a + b·xi.]
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
SSE = Σ (Yi - a - bXi)²,  summed over i = 1, ..., n
The best fitted line is the one for which all the ERRORS are minimum.

To get the values of a and b which minimize SSE, we proceed as follows:

∂SSE/∂a = 0  ⟹  -2 Σ (Yi - a - bXi) = 0   ⟹  Σ Yi = n a + b Σ Xi              ... (1)
∂SSE/∂b = 0  ⟹  -2 Σ (Yi - a - bXi) Xi = 0  ⟹  Σ Yi Xi = a Σ Xi + b Σ Xi²     ... (2)

Eq (1) and (2) are called normal equations.
Solving the normal equations, we get

b = [ n Σ Xi Yi - (Σ Xi)(Σ Yi) ] / [ n Σ Xi² - (Σ Xi)² ] = SSXY / SSX
a = Ȳ - b X̄

The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by
b = SSXY / SSX,   a = Ȳ - b X̄
Also, the correlation coefficient between X and Y is
rXY = Cov(X, Y) / √( Var(X) Var(Y) ) = SSXY / √( SSX · SSY ) = b √( SSX / SSY )

Using the deviation table computed earlier:
SSX = 1.560,  SSY = 4450,  SSXY = -79.75,  X̄ = 2.15,  Ȳ = 80

r = SSXY / √( SSX · SSY ) = -0.957
b = SSXY / SSX = -51.12,   a = Ȳ - b X̄ = 189.91
Fitted line is Ŷ = 189.91 - 51.12 X

[Scatter plot of the data with the fitted line Ŷ = 189.91 - 51.12 X.]

189.91 is the estimated mean value of Y when the value of X is zero.
-51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for some given value of X.
For example, at X = 2.15 the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.
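A minimal sketch in Python with NumPy (added for illustration, not from the slides) of the least squares estimates b = SSXY/SSX and a = Ȳ - bX̄ for the same data:

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

ssx = np.sum((x - x.mean()) ** 2)
ssxy = np.sum((x - x.mean()) * (y - y.mean()))

b = ssxy / ssx                  # slope, about -51.12
a = y.mean() - b * x.mean()     # intercept, about 189.91
print(a, b)
print(a + b * 2.15)             # prediction at X = 2.15, about 80.0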

Residuals: ei = Yi - Ŷi
Residual is the unexplained part of Y
The smaller the residuals, the better the utility of
Regression.
Sum of Residuals is always zero. Least Square
procedure ensures that.
Residuals play an important role in investigating
the adequacy of the fitted model.
We obtain coefficient of determination (R2)
using the residuals.
R2 is used to examine the adequacy of the fitted
linear model to the given data.


Coefficient of Determination
[Diagram: the deviation (Yi - Ȳ) splits into the explained part (Ŷi - Ȳ) and the residual (Yi - Ŷi).]

Total Sum of Squares:        SST = Σ (Yi - Ȳ)²
Regression Sum of Squares:   SSR = Σ (Ŷi - Ȳ)²
Error Sum of Squares:        SSE = Σ (Yi - Ŷi)²
Also, SST = SSR + SSE

The fraction of SST explained by Regression is given by R²:
R² = SSR / SST = 1 - (SSE / SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² will be close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (proof omitted)
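A short sketch in Python with NumPy (illustrative, not from the slides) computing R² both as SSR/SST and as 1 - SSE/SST for the fitted line above:

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x                       # fitted values

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)          # error sum of squares

print(ssr / sst, 1 - sse / sst)         # both about 0.916 = (-0.957)**2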

R² = 1 (r = -1 or r = +1): perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: weak linear relationship; some but not all of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.

x      y     Ŷ       Y - Ȳ   Y - Ŷ   Ŷ - Ȳ   (Y - Ȳ)²   (Y - Ŷ)²   (Ŷ - Ȳ)²
1.25   125   126.0    45      -1      46      2025       1          2116
1.75   105   100.5    25      4.5     20.5    625        20.25      420.25
2.25    65    74.9   -15      -9.9    -5.1    225        98.00      26.01
2.00    85    87.7     5      -2.2     7.7    25         4.84       59.29
2.50    75    62.1    -5      12.9   -17.7    25         166.41     313.29
2.25    80    74.9     0       5.1    -5.1    0          26.01      26.01
2.70    50    51.9   -30      -1.9   -28.1    900        3.61       789.61
2.50    55    62.1   -25      -7.1   -17.9    625        50.41      320.41
-----------------------------------------------------------------------------
17.20  640                                    4450       370.54     4079.46

Coefficient of Determination: R² = (4450 - 370.5)/4450 = 0.916
Correlation Coefficient: r = -0.957
Coefficient of Determination = (Correlation Coefficient)²

Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.
TV:          42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight:  18  6  0 -1 13 14  7  7 -9  8  8  5  3 14 -7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.

Fitted line: Y = -24.709 + 0.967 X,  with R² = 0.768
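A minimal sketch in Python with NumPy (illustrative, not from the slides) that fits the TV-viewing data with np.polyfit and checks R²; it should reproduce approximately the coefficients quoted above:

import numpy as np

tv = np.array([42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18], dtype=float)
overweight = np.array([18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7], dtype=float)

slope, intercept = np.polyfit(tv, overweight, 1)   # degree-1 polynomial = straight line
y_hat = intercept + slope * tv
r2 = 1 - np.sum((overweight - y_hat) ** 2) / np.sum((overweight - overweight.mean()) ** 2)

print(intercept, slope, r2)   # about -24.709, 0.967 and R² about 0.768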

[Plot of the observed Y and predicted Y values for the fifteen children.]

Standard Error

Consider a dataset.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √( SSE / (n - 2) ) = √( Σ (Yi - Ŷi)² / (n - 2) )

Assumptions

The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (Homoscedasticity): Var(ei) = σ², where ei = Yi - Ŷi.
E(ei) = 0.
No distributional assumption about the errors is required for the least squares method.

[Residual plots illustrating the assumptions: linear vs. not linear relationships; independent vs. not independent errors; equal variance (homoscedastic) vs. unequal variance (heteroscedastic) errors.]
TV Watching Weight Gain Example

[For the TV watching and weight gain data: scatter plot of X and Y, and scatter plot of X and the residuals.]


The Multiple Linear Regression Model


In simple linear regression analysis, we fit a linear relation between
one independent variable (X) and
one dependent variable (Y).

We assume that Y is regressed on only one regressor


variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
For example:
Cost    -> Labor cost, Electricity cost, Raw material cost
Salary  -> Education, Experience
Sales   -> Cost, Advertising Expenditure

Example:
A distributor of frozen dessert pies wants to
evaluate factors which influence the demand
Dependent variable:
Y: Pie sales (units per week)

Independent variables:
X1: Price (in $)
X2: Advertising Expenditure ($100s)

Data are collected for 15 weeks


Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Using the given data, we wish to fit a linear function of the form
Yi = β0 + β1 X1i + β2 X2i + εi,   i = 1, 2, ..., 15,
where
Y: Pie sales (units per week)
X1: Price (in $)
X2: Advertising Expenditure ($100s)
Fitting means we want to get the values of the regression coefficients denoted by β.
The original values of the βs are not known.
We estimate them using the given data.

The Multiple Linear Regression Model


Examine the linear relationship between
one dependent (Y) and
two or more independent variables (X1, X2, ..., Xk).

Multiple Linear Regression Model with k independent variables:
Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi,   i = 1, 2, ..., n
(β0: intercept; β1, ..., βk: slopes; εi: random error)

Multiple Linear Regression Equation

The intercept and slopes are estimated using the observed data.
The multiple linear regression equation with k independent variables is
Ŷi = b0 + b1 X1i + b2 X2i + ... + bk Xki,   i = 1, 2, ..., n
(Ŷi: estimated value; b0: estimate of the intercept; b1, ..., bk: estimates of the slopes)

Multiple Regression Equation

Example with two independent variables:
Ŷ = b0 + b1 X1 + b2 X2
[The fitted equation is a plane over the (X1, X2) space.]

Estimating Regression Coefficients

The multiple linear regression model
Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi,   i = 1, 2, ..., n
can be written in matrix notation as
Y = X β + ε,
where Y = (Y1, Y2, ..., Yn)′ is the n×1 response vector, X is the n×(k+1) matrix whose i-th row is (1, X1i, X2i, ..., Xki), β = (β0, β1, ..., βk)′, and ε = (ε1, ε2, ..., εn)′.

Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (Homoscedasticity): Var(εi) = σ².
In the long run, the mean effect of the random errors is zero: E(εi) = 0.
No assumption on the distribution of the random errors is required for the least squares method.

In order to find the estimate of β, we minimize
S(β) = Σ εi² = (Y - Xβ)′(Y - Xβ) = Y′Y - 2 β′X′Y + β′X′Xβ
We differentiate S(β) with respect to β and equate it to zero, i.e., ∂S/∂β = 0.
This gives
b = (X′X)⁻¹ X′Y
b is called the least squares estimator of β.
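A minimal sketch in Python with NumPy (illustrative, not from the slides) of the matrix formula b = (X′X)⁻¹X′Y applied to the pie sales data:

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])   # first column of 1s for the intercept
b = np.linalg.solve(X.T @ X, X.T @ sales)                # least squares estimator b = (X'X)^-1 X'Y

print(b)               # should be close to the slide's estimates 306.53, -24.98, 74.13
print(sales - X @ b)   # residuals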

Example: Consider the pie example.
We want to fit the model Yi = β0 + β1 X1i + β2 X2i + εi.
The variables are
Y: Pie sales (units per week)
X1: Price (in $)
X2: Advertising Expenditure ($100s)
Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
LSE of intercept β0:           b0 = 306.53
LSE of slope β1 (Price):       b1 = -24.98
LSE of slope β2 (Advertising): b2 = 74.13

Sales = 306.53 - 24.98 (X1) + 74.13 (X2)
Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.

b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while the selling price is kept fixed.

Ŷ = 306.52619 - 24.97509 X1 + 74.13096 X2

Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:
Sales = 306.53 - 24.98 X1 + 74.13 X2
      = 306.53 - 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X2 = 3.5.

Y      X1    X2    Predicted Y   Residual
350    5.5   3.3   413.77        -63.80
460    7.5   3.3   363.81        96.15
350    8.0   3.0   329.08        20.88
430    8.0   4.5   440.28        -10.31
350    6.8   3.0   359.06        -9.09
380    7.5   4.0   415.70        -35.74
430    4.5   3.0   416.51        13.47
470    6.4   3.7   420.94        49.03
450    7.0   3.5   391.13        58.84
490    5.0   4.0   478.15        11.83
340    7.2   3.5   386.13        -46.16
300    7.9   3.2   346.40        -46.44
440    5.9   4.0   455.67        -15.70
450    5.0   3.5   441.09        8.89
300    7.0   2.7   331.82        -31.85

Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression.

Total Sum of Squares:        SST = Σ (Yi - Ȳ)²
Regression Sum of Squares:   SSR = Σ (Ŷi - Ȳ)²
Error Sum of Squares:        SSE = Σ (Yi - Ŷi)²
Also, SST = SSR + SSE
R² = SSR / SST = 1 - (SSE / SST)

[Plot of the observed Y and predicted Y values for the 15 weeks.]

Since SST = SSR + SSE and all three quantities are non-negative,
0 ≤ SSR ≤ SST.
So 0 ≤ SSR/SST ≤ 1, or 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good and the X variables do not contribute in explaining the variability in Y.
When R² is close to 1, the linear fit is good.
R² is the proportion of variation in Y explained by the regression.
In the previously discussed example, R² = 0.5215.
If we consider Y and X1 only, R² = 0.1965.
If we consider Y and X2 only, R² = 0.3095.

Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase is regardless of the contribution of the newly added regressor.
So an adjusted value of R² is defined, called the adjusted R²:
R²_Adj = 1 - [ SSE / (n - k - 1) ] / [ SST / (n - 1) ]
This adjusted R² will only increase if the additional variable contributes in explaining the variation in Y.
For our example, Adjusted R² = 0.4417.
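A short sketch in Python with NumPy (illustrative, not from the slides) computing R² and adjusted R² for the two-regressor pie sales model (n = 15, k = 2):

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

n, k = len(sales), 2
X = np.column_stack([np.ones(n), price, adv])
b = np.linalg.solve(X.T @ X, X.T @ sales)

sse = np.sum((sales - X @ b) ** 2)
sst = np.sum((sales - sales.mean()) ** 2)
r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(r2, r2_adj)   # about 0.5215 and 0.4417, as quoted on the slides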


F-Test for Overall Significance

We check if there is a linear relationship between all the regressors (X1, X2, ..., Xk) and the response (Y).
We use the F test statistic to test:
H0: β1 = β2 = ... = βk = 0 (no regressor is significant)
H1: at least one βi ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE),
where
SST = Σ (Yi - Ȳ)²,   SSE = Σ ei² = Σ (Yi - Ŷi)²,   SSR = SST - SSE
The ei's are called the residuals.
Assumptions:
n > k, Var(εi) = σ², E(εi) = 0.
The εi's are independent. This implies that Corr(εi, εj) = 0 for i ≠ j.
The εi's have a Normal Distribution: εi ~ N(0, σ²). [NEW ASSUMPTION]

Analysis of Variance Table

Source              df        SS    MS                  Fc
Regression          k         SSR   MSR = SSR/k         MSR/MSE
Residual or Error   n-k-1     SSE   MSE = SSE/(n-k-1)
Total               n-1       SST

Test Statistic: Fc = MSR / MSE ~ F(k, n-k-1)

For the previous example, we wish to test
H0: β1 = β2 = 0 against H1: at least one βi ≠ 0

ANOVA Table
Source              df    SS          MS          Fc       F(2,12)(0.05)
Regression          2     29460.03    14730.01    6.5386   3.89
Residual or Error   12    27033.31    2252.78
Total               14    56493.33

Since 6.5386 > 3.89, H0 is rejected at the 5% level of significance.
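A short sketch in Python with NumPy and SciPy (illustrative, not from the slides) reproducing the overall F test for the pie sales model:

import numpy as np
from scipy import stats

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

n, k = len(sales), 2
X = np.column_stack([np.ones(n), price, adv])
b = np.linalg.solve(X.T @ X, X.T @ sales)

sse = np.sum((sales - X @ b) ** 2)          # about 27033
sst = np.sum((sales - sales.mean()) ** 2)   # about 56493
ssr = sst - sse                             # about 29460

f_c = (ssr / k) / (sse / (n - k - 1))       # about 6.54
f_crit = stats.f.ppf(0.95, k, n - k - 1)    # about 3.89
p_value = stats.f.sf(f_c, k, n - k - 1)

print(f_c, f_crit, p_value)   # F_c exceeds the critical value, so H0 is rejected at the 5% level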

Individual Variables Tests of Hypothesis

We test if there is a linear relationship between a particular regressor Xj and Y.
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (a linear relationship exists between Xj and Y)
We use a two-tailed t-test.
If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.

Test Statistic:
Tc = bj / √( σ̂² Cjj )
Tc ~ Student's t with (n - k - 1) degrees of freedom
bj is the least squares estimate of βj
Cjj is the (j, j)-th element of the matrix (X′X)⁻¹
σ̂² = MSE (MSE is obtained in the ANOVA table)

In our example, σ̂² = 2252.7755 and

             [  5.7946   -0.3312   -1.0165 ]
(X′X)⁻¹  =   [ -0.3312    0.0521   -0.0038 ]
             [ -1.0165   -0.0038    0.2993 ]

To test H0: β1 = 0 against H1: β1 ≠ 0:  Tc = -2.3057
To test H0: β2 = 0 against H1: β2 ≠ 0:  Tc = 2.8548
The two-tailed critical values of t at 12 d.f. are
3.0545 for the 1% level of significance
2.6810 for the 2% level of significance
2.1788 for the 5% level of significance
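A short sketch in Python with NumPy and SciPy (illustrative, not from the slides) of the individual t statistics Tc = bj/√(σ̂² Cjj) for the pie sales model:

import numpy as np
from scipy import stats

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

n, k = len(sales), 2
X = np.column_stack([np.ones(n), price, adv])
xtx_inv = np.linalg.inv(X.T @ X)        # the C_jj values sit on the diagonal of this matrix
b = xtx_inv @ X.T @ sales

mse = np.sum((sales - X @ b) ** 2) / (n - k - 1)   # estimate of sigma^2, about 2252.78
t_c = b / np.sqrt(mse * np.diag(xtx_inv))          # t statistics for b0, b1, b2
t_crit = stats.t.ppf(0.975, n - k - 1)             # two-tailed 5% critical value, about 2.1788

print(t_c)      # the b1 and b2 entries should be about -2.31 and 2.85
print(t_crit)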


Standard Error

Consider a dataset.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √( SSE / (n - k - 1) ) = √( Σ (Yi - Ŷi)² / (n - k - 1) )

Assumption of Linearity
[Residual plots against X: a patternless plot indicates a linear relationship; a systematic pattern indicates the relationship is not linear.]

Residual Analysis for Equal Variance
We assume that Var(εi) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷi against the residuals ei = Yi - Ŷi.
[Residual plots illustrating unequal variance (heteroscedastic) vs. equal variance (homoscedastic) errors.]

Assumption of Uncorrelated Residuals
(Residual Analysis for Independence / Uncorrelated Errors)

The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
d = Σ (ei - e(i-1))², summed over i = 2, ..., n, divided by Σ ei², summed over i = 1, ..., n.
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values of d < 2 indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d greater than 3 or less than 1 are alarming.
[Residual plots in observation order illustrating independent vs. not independent errors.]
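A minimal sketch in Python with NumPy (illustrative, not from the slides) of the Durbin-Watson statistic, applied here to the pie sales residuals as an example:

import numpy as np

def durbin_watson(residuals):
    # d = sum of squared successive differences of residuals / sum of squared residuals
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])
resid = sales - X @ np.linalg.solve(X.T @ X, X.T @ sales)

print(durbin_watson(resid))   # values near 2 suggest no autocorrelation; below 1 or above 3 are alarming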


Assumption of Normality
When we use the F test or the t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined by a histogram of the residuals.
Normality can also be examined using a Q-Q plot or a normal probability plot.
[Histograms and Q-Q plots illustrating normal vs. not normal residuals.]

Standardized Regression Coefficient


In a multiple linear regression, we may like to know
which regressor contributes more.
We obtain standardized estimates of regression
coefficients.
For that, first we standardize the observations.
Ȳ  = (1/n) Σ Yi,      s_Y  = √( (1/(n-1)) Σ (Yi - Ȳ)² )
X̄1 = (1/n) Σ X1i,     s_X1 = √( (1/(n-1)) Σ (X1i - X̄1)² )
X̄2 = (1/n) Σ X2i,     s_X2 = √( (1/(n-1)) Σ (X2i - X̄2)² )

Standardize all Y, X1 and X2 values as follows:
Standardized Yi = (Yi - Ȳ)/s_Y,   Standardized X1i = (X1i - X̄1)/s_X1,   Standardized X2i = (X2i - X̄2)/s_X2
Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit-free) and can be compared.
Look for the regression coefficient having the highest magnitude.
The corresponding regressor contributes the most.
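A minimal sketch in Python with NumPy (illustrative, not from the slides) of the standardization-and-refit procedure for the pie sales data; the resulting coefficients should be close to the standardized values quoted on the next slide:

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)   # z-scores using the sample standard deviation (n - 1)

ys, x1s, x2s = standardize(sales), standardize(price), standardize(adv)
X = np.column_stack([np.ones(len(ys)), x1s, x2s])
beta = np.linalg.solve(X.T @ X, X.T @ ys)

print(beta)   # intercept near 0; slopes should be close to -0.461 and 0.570
              # the coefficient with the larger magnitude (X2, advertising) contributes the most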

Standardized Data

Week   Pie Sales   Price ($)   Advertising ($100s)
1      -0.78       -0.95       -0.37
2       0.96        0.76       -0.37
3      -0.78        1.18       -0.98
4       0.48        1.18        2.09
5      -0.78        0.16       -0.98
6      -0.30        0.76        1.06
7       0.48       -1.80       -0.98
8       1.11       -0.18        0.45
9       0.80        0.33        0.04
10      1.43       -1.38        1.06
11     -0.93        0.50        0.04
12     -1.56        1.10       -0.57
13      0.64       -0.61        1.06
14      0.80       -1.38        0.04
15     -1.56        0.33       -1.60

Note that the fitted regression on the standardized data is
Ŷ = 0 - 0.461 X1 + 0.570 X2
Since |-0.461| < 0.570, X2 contributes the most.

Equivalently,
R²_Adj = 1 - (1 - R²)(n - 1) / (n - k - 1)
Fc = (n - k - 1) R² / [ k (1 - R²) ]

Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of the intercept term is not necessary; it depends on the problem, and the analyst may decide on this.


Example: The following data were collected on sales, the number of advertisements published, and the advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales (0,000 Rs)   Ads (Nos.)   Adv Ex (000 Rs)
43.6               12           13.9
38.0               11           12
30.1                            9.3
35.3                            9.7
46.4               12           12.3
34.2                            11.4
30.2                            9.3
40.7               13           14.3
38.5                            10.2
22.6                            8.4
37.6                            11.2
35.2               10           11.1

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   309.986          2    154.993       9.741   .006 (a)
Residual     143.201          9    15.911
Total        453.187          11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.
(Constant)    6.584   8.542               .771    .461
No_Adv        .625    1.120        .234   .558    .591
Ex_Adv        2.139   1.470        .611   1.455   .180
a. Dependent Variable: Sales

CONTRADICTION:
The ANOVA p-value < 0.05, so H0 is rejected: all βs are not zero.
But all individual p-values > 0.05, so no H0 is rejected: β0 = 0, β1 = 0, β2 = 0.

Multicollinearity

When we regress Y on regressors X1, X2, ..., Xk, we assume that all regressors X1, X2, ..., Xk are statistically independent of each other.
All the regressors affect the values of Y, but one regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
Including two highly correlated independent variables can adversely affect the regression results:
It can lead to unstable coefficients.
The correlated variables contribute redundant information to the model.

EXAMPLES IN WHICH THIS MIGHT HAPPEN:


Miles per gallon Vs. horsepower and engine size
Income Vs. age and experience
Sales Vs. No. of Advertisement and Advert. Expenditure

Variance Inflationary Factor:
VIFj is used to measure the multicollinearity generated by variable Xj.
It is given by
VIFj = 1 / (1 - R²j)
where R²j is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.
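A minimal sketch in Python with NumPy (illustrative, not from the slides) of VIFj = 1/(1 - R²j); the data used here are hypothetical, generated only to exercise the function:

import numpy as np

def vif(xj, other_X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing xj on the other regressors
    X = np.column_stack([np.ones(len(xj)), other_X])
    b = np.linalg.lstsq(X, xj, rcond=None)[0]
    resid = xj - X @ b
    r2_j = 1 - np.sum(resid ** 2) / np.sum((xj - xj.mean()) ** 2)
    return 1.0 / (1.0 - r2_j)

# hypothetical regressors, generated only to exercise the function
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)   # strongly related to x1 by construction

print(vif(x1, x2.reshape(-1, 1)), vif(x2, x1.reshape(-1, 1)))   # both well above 5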

Some Indications of Strong Multicollinearity:
Coefficient signs may not match prior expectations.
A large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
The standard error is large, yet the corresponding regressor is still significant.
MSE is very high and/or R² is very small.

If VIFj > 5, Xj is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X′X is singular.
The matrix X′X becomes singular when the columns of X have exact linear dependence, i.e., when any eigenvalue of X′X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
The methods of dealing with multicollinearity:
Collecting additional data
Variable elimination


Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.   Tolerance   VIF
(Constant)    6.584   8.542               .771    .461
No_Adv        .625    1.120        .234   .558    .591   .199        5.022
Ex_Adv        2.139   1.470        .611   1.455   .180   .199        5.022
a. Dependent Variable: Sales
Tolerance = 1/VIF; here VIF = 5.022 is greater than 5.

Collinearity Diagnostics (a)
                                             Variance Proportions
Dimension   Eigenvalue   Condition Index   (Constant)   No_Adv   Ex_Adv
1           2.966        1.000             .00          .00      .00
2           .030         9.882             .33          .17      .00
3           .003         30.417            .67          .83      1.00
a. Dependent Variable: Sales
The third eigenvalue (.003) is negligible and its condition index (30.417) is large.

We may use the method of variable elimination.
In practice, if Corr(X1, X2) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on Correlation)
Backward Elimination (based on Correlation)

Stepwise Regression
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε

Step 1: Run 5 simple linear regressions:
Y = β0 + β1 X1
Y = β0 + β2 X2
Y = β0 + β3 X3
Y = β0 + β4 X4   <==== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β5 X5

Step 2: Run 4 two-variable linear regressions:
Y = β0 + β4 X4 + β1 X1
Y = β0 + β4 X4 + β2 X2
Y = β0 + β4 X4 + β3 X3   <==== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β4 X4 + β5 X5

Step 3: Run 3 three-variable linear regressions:
Y = β0 + β3 X3 + β4 X4 + β1 X1
Y = β0 + β3 X3 + β4 X4 + β2 X2
Y = β0 + β3 X3 + β4 X4 + β5 X5

Suppose none of these models have p-values < 0.05.
STOP.
The best model is the one with X3 and X4 only.
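A minimal sketch in Python with NumPy and SciPy (illustrative, not from the slides) of the forward/stepwise idea described above, adding at each step the candidate whose model has the smallest ANOVA p-value and stopping when none reaches p < 0.05; the variable names and data are hypothetical:

import numpy as np
from scipy import stats

def overall_f_pvalue(y, columns):
    # ANOVA p-value of the overall F test for y regressed on the given columns
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(columns))
    k = X.shape[1] - 1
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    f_c = ((sst - sse) / k) / (sse / (n - k - 1))
    return stats.f.sf(f_c, k, n - k - 1)

def forward_select(y, candidates, alpha=0.05):
    # candidates: dict of name -> 1-D array; returns the names selected, in order
    selected, cols = [], []
    remaining = dict(candidates)
    while remaining:
        pvals = {name: overall_f_pvalue(y, cols + [col]) for name, col in remaining.items()}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                      # no candidate reaches p < alpha: stop
        selected.append(best)
        cols.append(remaining.pop(best))
    return selected

# hypothetical data: X1 is irrelevant by construction, X2 and X3 drive Y
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 40))
y = 3 + 2 * x2 - 1.5 * x3 + rng.normal(scale=0.5, size=40)
print(forward_select(y, {"X1": x1, "X2": x2, "X3": x3}))   # typically ['X2', 'X3']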

Example: Consider again the data on sales, the number of advertisements published, and the advertising expenditure for the 12 weeks given earlier. Fit a regression model to predict the sales.

Summary Output 1: Sales Vs. No_Adv

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .781a   .610       .571                4.20570
a. Predictors: (Constant), No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   276.308          1    276.308       15.621   .003 (a)
Residual     176.879          10   17.688
Total        453.187          11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B        Std. Error   Beta   t       Sig.
(Constant)    16.937   4.982               3.400   .007
No_Adv        2.083    .527         .781   3.952   .003
a. Dependent Variable: Sales


Summary Output 2: Sales Vs. Ex_Adv

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .820a   .673       .640                3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   305.039          1    305.039       20.590   .001 (a)
Residual     148.148          10   14.815
Total        453.187          11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.
(Constant)    4.173   7.109               .587    .570
Ex_Adv        2.872   .633         .820   4.538   .001
a. Dependent Variable: Sales

Summary Output 3: Sales Vs. No_Adv & Ex_Adv

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .827a   .684       .614                3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   309.986          2    154.993       9.741   .006 (a)
Residual     143.201          9    15.911
Total        453.187          11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.
(Constant)    6.584   8.542               .771    .461
No_Adv        .625    1.120        .234   .558    .591
Ex_Adv        2.139   1.470        .611   1.455   .180
a. Dependent Variable: Sales

Is the new model improved?

Qualitative Independent Variables


Johnson Filtration, Inc., provides maintenance
service for water filtration systems throughout
southern Florida.
To estimate the service time and the service cost,
the managers want to predict the repair time
necessary for each maintenance request.
Repair time is believed to be related to two factors:
the number of months since the last maintenance service, and
the type of repair problem (mechanical or electrical).


For these data (given below), using the least squares method, we fitted the model
Ŷ = 2.1473 + 0.3041 X1,   with R² = 0.534.
At the 5% level of significance, we reject
H0: β0 = 0 (using the t test)
H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable
X2 = 0 if the type of repair is mechanical; X2 = 1 if the type of repair is electrical.
The regression model that uses X1 and X2 to regress Y is
Y = β0 + β1 X1 + β2 X2 + ε


Data for a sample of 10 service calls are given:

Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
1              2                           electrical       2.9
2              6                           mechanical       3.0
3              8                           electrical       4.8
4              3                           mechanical       1.8
5              2                           electrical       2.9
6              7                           electrical       4.9
7              9                           mechanical       4.2
8              8                           mechanical       4.8
9              4                           electrical       4.4
10             6                           electrical       4.5

Let Y denote the repair time and X1 the number of months since the last maintenance service.
The regression model that uses X1 only to regress Y is
Y = β0 + β1 X1 + ε
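A minimal sketch in Python with NumPy (illustrative, not from the slides) that encodes the repair-type dummy and fits both models; the X1-only fit should be close to Ŷ = 2.1473 + 0.3041 X1, while the coefficients of the two-variable model are not quoted in these slides:

import numpy as np

months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6], dtype=float)
repair_type = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
               "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

x2 = np.array([1.0 if t == "electrical" else 0.0 for t in repair_type])   # the dummy variable

# Model with X1 only: should give approximately Y = 2.1473 + 0.3041 X1
X1 = np.column_stack([np.ones(len(hours)), months])
print(np.linalg.lstsq(X1, hours, rcond=None)[0])

# Model with X1 and the dummy X2: Y = b0 + b1 X1 + b2 X2
X12 = np.column_stack([np.ones(len(hours)), months, x2])
print(np.linalg.lstsq(X12, hours, rcond=None)[0])   # b2 estimates the electrical-vs-mechanical shift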



Summary
Multiple linear regression model: Y = Xβ + ε
The least squares estimate of β is given by b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (the F test), we examine whether all βs are zero or not.
A t test is conducted for each regressor separately.
Using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality
