
Linear Regression Analysis

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R2 and Adjusted R2
Overall Validity of the Model (F test)
Testing for individual regressor (t test)
Problem of Multicollinearity

Smoking and Lung Capacity

Suppose, for example, we want to investigate the relationship between cigarette smoking and lung capacity.
We might ask a group of people about their smoking habits and measure their lung capacities.

  Cigarettes (X):      0   5  10  15  20
  Lung Capacity (Y):  45  42  33  31  29

Scatter plot of the data
[Scatter plot: Cigarettes (X) on the horizontal axis, Lung Capacity (Y) on the vertical axis]

We can see that as smoking goes up, lung capacity tends to go down.
The two variables change in opposite directions.

Height and Weight

Consider the following data on the heights and weights of 5 women swimmers:

  Height (inch):     62   64   65   66   68
  Weight (pounds):  102  108  115  128  132

We can observe that weight increases with height.
[Scatter plot: Height on the horizontal axis, Weight on the vertical axis]

Sometimes two variables are related to each other.
The values of the two variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
  Height and Weight
  Advertising Expenditure and Sales Volume
  Unemployment and Crime Rate
  Rainfall and Food Production
  Expenditure and Savings

We have already studied one measure of relationship between two variables: Covariance.
The covariance between two random variables X and Y is given by
  Cov(X, Y) = σ_XY = E(XY) − E(X) E(Y)

For paired observations on variables X and Y,
  Cov(X, Y) = s_XY = (1/n) Σ (xi − x̄)(yi − ȳ)

Properties of Covariance:
  Cov(X+a, Y+b) = Cov(X, Y)   [not affected by change in location]
  Cov(aX, bY) = ab Cov(X, Y)   [affected by change in scale]
  Covariance can take any value from −∞ to +∞.
  Cov(X,Y) > 0 means X and Y change in the same direction.
  Cov(X,Y) < 0 means X and Y change in opposite directions.
  If X and Y are independent, Cov(X,Y) = 0 [the converse need not be true].
  It is not unit free, so it is not a good measure of the strength of the relationship between two variables.
  A better measure is the correlation coefficient.
  It is unit free and takes values in [−1, +1].

Correlation
Karl Pearson's correlation coefficient is given by
  r_XY = Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

When the joint distribution of X and Y is known:
  Cov(X, Y) = E(XY) − E(X) E(Y)
  Var(X) = E(X²) − [E(X)]²,   Var(Y) = E(Y²) − [E(Y)]²

When observations on X and Y are available:
  Cov(X, Y) = (1/n) Σ (xi − x̄)(yi − ȳ)
  Var(X) = (1/n) Σ (xi − x̄)²,   Var(Y) = (1/n) Σ (yi − ȳ)²

Properties of the Correlation Coefficient
  Corr(aX+b, cY+d) = Corr(X, Y) when a and c have the same sign (the sign of r flips if ac < 0).
  It is unit free.
  It measures the strength of linear relationship on a scale of −1 to +1.
  So, it can be used to compare the relationships of various pairs of variables.
  Values close to 0 indicate little or no correlation.
  Values close to +1 indicate very strong positive correlation.
  Values close to −1 indicate very strong negative correlation.

Scatter Diagram
[Scatter plots of Y against X illustrating: positively correlated, weakly correlated, negatively correlated, strongly correlated, and not correlated data]

The correlation coefficient measures the strength of the LINEAR relationship.
r = 0 does not necessarily imply that there is no relationship; a relationship may be there, but it is not a linear one.

Worked example: computing the correlation coefficient

    x      y    (x − x̄)  (y − ȳ)  (x − x̄)²  (y − ȳ)²  (x − x̄)(y − ȳ)
   1.25   125    −0.90      45      0.8100     2025        −40.50
   1.75   105    −0.40      25      0.1600      625        −10.00
   2.25    65     0.10     −15      0.0100      225         −1.50
   2.00    85    −0.15       5      0.0225       25         −0.75
   2.50    75     0.35      −5      0.1225       25         −1.75
   2.25    80     0.10       0      0.0100        0          0.00
   2.70    50     0.55     −30      0.3025      900        −16.50
   2.50    55     0.35     −25      0.1225      625         −8.75
  17.20   640                       1.560      4450        −79.75
                                     SSX        SSY          SSXY

SSX = Σ(x − x̄)²,  SSY = Σ(y − ȳ)²,  SSXY = Σ(x − x̄)(y − ȳ)

r = Cov(X, Y) / √(Var(X) Var(Y)) = SSXY / √(SSX · SSY) = −79.75 / √(1.56 × 4450) = −0.957

Alternative Formulas for Sums of Squares

  SSX = Σx² − (Σx)²/n,   SSY = Σy² − (Σy)²/n,   SSXY = Σxy − (Σx)(Σy)/n

    x      y      x²        y²       x·y
   1.25   125    1.5625    15625    156.25
   1.75   105    3.0625    11025    183.75
   2.25    65    5.0625     4225    146.25
   2.00    85    4.0000     7225    170.00
   2.50    75    6.2500     5625    187.50
   2.25    80    5.0625     6400    180.00
   2.70    50    7.2900     2500    135.00
   2.50    55    6.2500     3025    137.50
  17.20   640   38.54      55650   1296.25

  SSX = 38.54 − 17.20²/8 = 1.56
  SSY = 55650 − 640²/8 = 4450
  SSXY = 1296.25 − (17.20 × 640)/8 = −79.75

r = Cov(X, Y) / √(Var(X) Var(Y)) = SSXY / √(SSX · SSY) = −79.75 / √(1.56 × 4450) = −0.957

Smoking and Lung Capacity Example

  Cigarettes (X)   Lung Capacity (Y)    X²     Y²     XY
        0                 45             0    2025      0
        5                 42            25    1764    210
       10                 33           100    1089    330
       15                 31           225     961    465
       20                 29           400     841    580
       50                180           750    6680   1585

r_XY = [n ΣXY − (ΣX)(ΣY)] / √{[n ΣX² − (ΣX)²][n ΣY² − (ΣY)²]}
     = [(5)(1585) − (50)(180)] / √{[(5)(750) − 50²][(5)(6680) − 180²]}
     = (7925 − 9000) / √[(3750 − 2500)(33400 − 32400)]
     = −1075 / √(1250 × 1000)
     = −0.9615
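The arithmetic above can be checked with a short Python sketch (numpy assumed; this is a verification aid, not part of the original slides):

import numpy as np

x = np.array([0, 5, 10, 15, 20], dtype=float)      # cigarettes
y = np.array([45, 42, 33, 31, 29], dtype=float)    # lung capacity
n = len(x)

# Sums of squares using the "alternative" formulas from the slides
ssx = np.sum(x**2) - np.sum(x)**2 / n
ssy = np.sum(y**2) - np.sum(y)**2 / n
ssxy = np.sum(x*y) - np.sum(x) * np.sum(y) / n

r = ssxy / np.sqrt(ssx * ssy)
print(round(r, 4))   # -0.9615, matching the slide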

Regression Analysis
Having determined the correlation between X and Y, we
wish to determine a mathematical relationship between
them.
Dependent variable: the variable you wish to explain
Independent variables: the variables used to explain the
dependent variable
Regression analysis is used to:
Predict the value of dependent variable based on the
value of independent variable(s)
Explain the impact of changes in an independent
variable on the dependent variable

Types of Relationships
[Scatter plots of Y against X illustrating: linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship]

Simple Linear Regression Analysis

The simplest mathematical relationship is
  Y = a + bX + error   (linear)
Changes in Y are related to changes in X.
What are the most suitable values of a (intercept) and b (slope)?
[Plot of the line y = a + b·x: a is the intercept on the Y axis; b is the slope, the change in y for a one-unit change in x]

Method of Least Squares

[Plot: the line a + bX and a data point (xi, yi); the vertical distance yi − (a + b·xi) is the error]

The best fitted line is the one for which all the ERRORS are minimum.
We want to obtain values of a and b in Y = a + bX + error for which all the errors are minimum.
To minimize all the errors together, we minimize the sum of squares of errors (SSE):

  SSE = Σ_{i=1}^{n} (Yi − a − b·Xi)²

To get the values of a and b which minimize SSE, we proceed as follows:

  ∂SSE/∂a = 0  ⇒  −2 Σ (Yi − a − b·Xi) = 0
              ⇒  Σ Yi = n·a + b Σ Xi                 ... (1)

  ∂SSE/∂b = 0  ⇒  −2 Σ (Yi − a − b·Xi) Xi = 0
              ⇒  Σ Xi·Yi = a Σ Xi + b Σ Xi²          ... (2)

Equations (1) and (2) are called the normal equations.
Solve the normal equations to get a and b.

Solving the above normal equations, we get

  b = [n Σ Xi·Yi − (Σ Xi)(Σ Yi)] / [n Σ Xi² − (Σ Xi)²] = SSXY / SSX

  a = Ȳ − b·X̄

The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by

  b = SSXY / SSX,   a = Ȳ − b·X̄

Also, the correlation coefficient between X and Y is

  r_XY = Cov(X, Y) / √(Var(X) Var(Y)) = SSXY / √(SSX · SSY) = b · √(SSX / SSY)

Fitting the line to the worked data

Recall from the worked table: X̄ = 2.15, Ȳ = 80, SSX = 1.56, SSY = 4450, SSXY = −79.75,
and r = SSXY / √(SSX · SSY) = −0.957.

  b = SSXY / SSX = −79.75 / 1.56 = −51.12
  a = Ȳ − b·X̄ = 80 − (−51.12)(2.15) = 189.91

Fitted line:  Ŷ = 189.91 − 51.12 X

[Scatter plot of the data with the fitted line Ŷ = 189.91 − 51.12 X]
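The same estimates can be reproduced numerically; a minimal sketch (numpy assumed), using the SSXY/SSX formulas from the slides:

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

ssx  = np.sum((x - x.mean())**2)                  # 1.56
ssxy = np.sum((x - x.mean()) * (y - y.mean()))    # -79.75

b = ssxy / ssx               # slope, about -51.12
a = y.mean() - b * x.mean()  # intercept, about 189.91
print(round(b, 2), round(a, 2))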

Fitted line:  Ŷ = 189.91 − 51.12 X

189.91 is the estimated mean value of Y when the value of X is zero.
−51.12 is the change in the average value of Y as a result of a one-unit increase in X.
We can predict the value of Y for a given value of X.
For example, at X = 2.15 the predicted value of Y is 189.91 − 51.12 × 2.15 = 80.002.

Residuals:  ei = Yi − Ŷi
The residual is the unexplained part of Y.
The smaller the residuals, the better the utility of the regression.
The sum of the residuals is always zero; the least squares procedure ensures that.
Residuals play an important role in investigating the adequacy of the fitted model.
We obtain the coefficient of determination (R²) using the residuals.
R² is used to examine the adequacy of the fitted linear model to the given data.

Coefficient of Determination

[Diagram: for each observation, the deviation Yi − Ȳ splits into the explained part Ŷi − Ȳ and the residual Yi − Ŷi]

  Total Sum of Squares:       SST = Σ (Yi − Ȳ)²
  Regression Sum of Squares:  SSR = Σ (Ŷi − Ȳ)²
  Error Sum of Squares:       SSE = Σ (Yi − Ŷi)²

Also, SST = SSR + SSE

The fraction of SST explained by the regression is given by R²:
  R² = SSR / SST = 1 − (SSE / SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1.
This means that the regression explains most of the variability in Y (the fit is good).
When SSE is close to SST, R² will be close to 0.
This means that the regression does not explain much of the variability in Y (the fit is not good).
In simple linear regression, R² is the square of the correlation coefficient between X and Y (proof omitted).

  R² = 1 (r = +1 or r = −1): perfect linear relationship; 100% of the variation in Y is explained by X.
  0 < R² < 1: weaker linear relationship; some but not all of the variation in Y is explained by X.
  R² = 0: no linear relationship; none of the variation in Y is explained by X.

Worked example: computing R² for the fitted line Ŷ = 189.91 − 51.12 X

    x      y      Ŷ     (Y − Ȳ)  (Y − Ŷ)  (Ŷ − Ȳ)  (Y − Ȳ)²  (Y − Ŷ)²  (Ŷ − Ȳ)²
   1.25   125   126.0      45      −1.0     46.0      2025      1.00    2116.00
   1.75   105   100.5      25       4.5     20.5       625     20.25     420.25
   2.25    65    74.9     −15      −9.9     −5.1       225     98.01      26.01
   2.00    85    87.7       5      −2.2      7.7        25      4.84      59.29
   2.50    75    62.1      −5      12.9    −17.7        25    166.41     313.29
   2.25    80    74.9       0       5.1     −5.1         0     26.01      26.01
   2.70    50    51.9     −30      −1.9    −28.1       900      3.61     789.61
   2.50    55    62.1     −25      −7.1    −17.9       625     50.41     320.41
  17.20   640                                         4450    370.54    4079.46

Coefficient of Determination: R² = SSR/SST = (4450 − 370.54)/4450 = 0.916
Correlation Coefficient: r = −0.957
Coefficient of Determination = (Correlation Coefficient)²
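Continuing the numerical sketch (numpy assumed, not part of the slides), R² can be obtained from the residuals of the fitted line:

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

y_hat = 189.91 - 51.12 * x            # fitted values from the estimated line
sst = np.sum((y - y.mean())**2)       # total sum of squares = 4450
sse = np.sum((y - y_hat)**2)          # error sum of squares
r_squared = 1 - sse / sst             # about 0.916, close to (-0.957)**2
print(round(r_squared, 3))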

Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.

  TV (hours/week): 42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
  Overweight (lb): 18  6  0 −1 13 14  7  7 −9  8  8  5  3 14 −7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.

  Predicted Y = −24.709 + 0.967 X   and   R² = 0.768

[Plot of observed Y and predicted Y for the 15 children]

Standard Error
Consider a dataset.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all the Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by

  S_YX = √[ SSE / (n − 2) ] = √[ Σ (Yi − Ŷi)² / (n − 2) ]
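For instance, under this definition the standard error of the estimate for the fitted line above could be computed as follows (a sketch, numpy assumed):

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)
n = len(y)

y_hat = 189.91 - 51.12 * x          # predictions from the fitted line
sse = np.sum((y - y_hat)**2)        # error sum of squares
s_yx = np.sqrt(sse / (n - 2))       # standard error of the estimate, roughly 7.9 here
print(round(s_yx, 2))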

Assumptions
  The relationship between X and Y is linear.
  Error values are statistically independent.
  All the errors have a common variance (homoscedasticity): Var(ei) = σ², where ei = Yi − Ŷi.
  E(ei) = 0.
No distributional assumption about the errors is required for the least squares method.

Linearity
[Residual plots: patternless residuals around zero indicate a linear relationship; a systematic curve in the residuals indicates a non-linear relationship]

Independence
[Residual plots: independent errors show no pattern over observation order; dependent errors show a systematic pattern]

Equal Variance
[Residual plots: equal variance (homoscedastic) residuals have constant spread; unequal variance (heteroscedastic) residuals fan out]

TV Watching and Weight Gain Example

[Scatter plot of X (TV hours) against Y (pounds overweight), and scatter plot of X against the residuals]

The Multiple Linear Regression Model

In simple linear regression analysis, we fit a linear relation between one independent variable (X) and one dependent variable (Y).
We assume that Y is regressed on only one regressor variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
For example:
  Cost    -> Labor cost, Electricity cost, Raw material cost
  Salary  -> Education, Experience
  Sales   -> Cost, Advertising Expenditure

Example:
A distributor of frozen dessert pies wants to
evaluate factors which influence the demand
Dependent variable:
Y: Pie sales (units per week)

Independent variables:
X1: Price (in $)
X2: Advertising Expenditure ($100s)

Data are collected for 15 weeks



  Week   Pie Sales   Price ($)   Advertising ($100s)
    1       350        5.50             3.3
    2       460        7.50             3.3
    3       350        8.00             3.0
    4       430        8.00             4.5
    5       350        6.80             3.0
    6       380        7.50             4.0
    7       430        4.50             3.0
    8       470        6.40             3.7
    9       450        7.00             3.5
   10       490        5.00             4.0
   11       340        7.20             3.5
   12       300        7.90             3.2
   13       440        5.90             4.0
   14       450        5.00             3.5
   15       300        7.00             2.7

Using the given data, we wish to fit a linear function of the form:
  Yi = β0 + β1 X1i + β2 X2i + εi,   i = 1, 2, ..., 15.
where
  Y: Pie sales (units per week)
  X1: Price (in $)
  X2: Advertising Expenditure ($100s)
Fitting means we want to get the values of the regression coefficients, denoted by β.
The original values of the βs are not known.
We estimate them using the given data.

The Multiple Linear Regression Model

Examine the linear relationship between one dependent variable (Y) and two or more independent variables (X1, X2, ..., Xk).
Multiple linear regression model with k independent variables:
  Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi,   i = 1, 2, ..., n.
  (β0 is the intercept; β1, ..., βk are the slopes; εi is the random error)

Multiple Linear Regression Equation

The intercept and slopes are estimated using the observed data.
Multiple linear regression equation with k independent variables:
  Ŷi = b0 + b1 X1i + b2 X2i + ... + bk Xki,   i = 1, 2, ..., n.
  (Ŷi is the estimated value; b0 is the estimate of the intercept; b1, ..., bk are the estimates of the slopes)

Multiple Regression Equation

Example with two independent variables:
  Ŷ = b0 + b1 X1 + b2 X2
[The fitted equation defines a plane in the (X1, X2, Y) space]

Estimating Regression Coefficients

The multiple linear regression model is
  Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi,   i = 1, 2, ..., n
In matrix notation:
  Y = Xβ + ε
where Y is the n×1 vector of responses, X is the n×(k+1) matrix whose first column is all ones and whose remaining columns hold the regressor values, β = (β0, β1, ..., βk)' and ε = (ε1, ..., εn)'.

Assumptions
  The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
  Random errors are independent.
  Random errors have the same variance (homoscedasticity): Var(εi) = σ².
  In the long run, the mean effect of the random errors is zero: E(εi) = 0.
No assumption on the distribution of the random errors is required for the least squares method.

In order to find the estimate of β, we minimize

  S(β) = Σ εi² = (Y − Xβ)'(Y − Xβ) = Y'Y − 2β'X'Y + β'X'Xβ

We differentiate S(β) with respect to β and equate to zero, i.e., ∂S/∂β = 0.
This gives

  b = (X'X)⁻¹ X'Y

b is called the least squares estimator of β.
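A small numerical sketch of b = (X'X)⁻¹X'Y for the pie-sales data (numpy assumed; column order: intercept, price, advertising):

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones_like(sales), price, adv])   # design matrix with a column of ones
b = np.linalg.solve(X.T @ X, X.T @ sales)                # least squares estimate (X'X)^-1 X'Y
print(np.round(b, 2))   # approximately [306.53  -24.98   74.13]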

Example: Consider the pie example.
We want to fit the model Yi = β0 + β1 X1i + β2 X2i + εi.
The variables are
  Y: Pie sales (units per week)
  X1: Price (in $)
  X2: Advertising Expenditure ($100s)
Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:

  LSE of intercept β0:   Intercept    (b0)   306.53
  LSE of slope β1:       Price        (b1)   −24.98
  LSE of slope β2:       Advertising  (b2)    74.13

Pie Sales = 306.53 − 24.98 × Price + 74.13 × Adv. Expend.

Sales = 306.53 − 24.98 (X1) + 74.13 (X2)

b1 = −24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.

b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while the selling price is kept fixed.

Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:

  Sales = 306.53 − 24.98 X1 + 74.13 X2
        = 306.53 − 24.98 (5.50) + 74.13 (3.5)
        = 428.62

Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X2 = 3.5.

Ŷ = 306.52619 − 24.97509 X1 + 74.13096 X2

    Y     X1    X2   Predicted Y   Residual
   350   5.5   3.3     413.77       −63.80
   460   7.5   3.3     363.81        96.15
   350   8.0   3.0     329.08        20.88
   430   8.0   4.5     440.28       −10.31
   350   6.8   3.0     359.06        −9.09
   380   7.5   4.0     415.70       −35.74
   430   4.5   3.0     416.51        13.47
   470   6.4   3.7     420.94        49.03
   450   7.0   3.5     391.13        58.84
   490   5.0   4.0     478.15        11.83
   340   7.2   3.5     386.13       −46.16
   300   7.9   3.2     346.40       −46.44
   440   5.9   4.0     455.67       −15.70
   450   5.0   3.5     441.09         8.89
   300   7.0   2.7     331.82       −31.85

[Plot of observed Y and predicted Y across the 15 weeks]

Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression.

  Total Sum of Squares:       SST = Σ (Yi − Ȳ)²
  Regression Sum of Squares:  SSR = Σ (Ŷi − Ȳ)²
  Error Sum of Squares:       SSE = Σ (Yi − Ŷi)²
  Also, SST = SSR + SSE
  R² = SSR/SST = 1 − (SSE/SST)

R² is the proportion of variation in Y explained by the regression.

Since SST = SSR + SSE and all three quantities are non-negative,
  0 ≤ SSR ≤ SST,  so  0 ≤ SSR/SST ≤ 1,  or  0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good, and the X variables do not contribute much to explaining the variability in Y.
When R² is close to 1, the linear fit is good.
In the previously discussed example, R² = 0.5215.
If we consider Y and X1 only, R² = 0.1965.
If we consider Y and X2 only, R² = 0.3095.

Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase occurs regardless of the contribution of the newly added regressor.
So, an adjusted value of R² is defined, called the adjusted R²:

  R²_Adj = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)]

This adjusted R² will only increase if the additional variable contributes to explaining the variation in Y.
For our example, adjusted R² = 0.4417.
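Under this definition, the adjusted R² for the pie example can be recovered from R², n and k (a sketch; R² = 0.5215 and the counts are the values reported on the earlier slides):

n, k = 15, 2            # observations and regressors in the pie example
r2 = 0.5215             # R-squared reported earlier

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))   # about 0.4417, matching the slide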

F-Test for Overall Significance

We check whether there is a linear relationship between the set of regressors (X1, X2, ..., Xk) and the response (Y).
Use the F test statistic.
To test:
  H0: β1 = β2 = ... = βk = 0 (no regressor is significant)
  H1: at least one βi ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
Assumptions:
  n > k, Var(εi) = σ², E(εi) = 0.
  The εi's are independent. This implies that Corr(εi, εj) = 0 for i ≠ j.
  The εi's have a Normal distribution: εi ~ N(0, σ²). [NEW ASSUMPTION]

The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE), where

  SST = Σ (Yi − Ȳ)²
  SSE = Σ ei² = Σ (Yi − Ŷi)²
  SSR = SST − SSE

The ei's are called the residuals.

Analysis of Variance Table

  Source               df        SS     MS     Fc
  Regression           k         SSR    MSR    MSR/MSE
  Residual or Error    n−k−1     SSE    MSE
  Total                n−1       SST

Test Statistic:  Fc = MSR / MSE ~ F(k, n−k−1)

For the previous example, we wish to test
  H0: β1 = β2 = 0   against   H1: at least one βi ≠ 0

ANOVA Table
  Source               df     SS          MS          Fc        F(2,12)(0.05)
  Regression            2     29460.03    14730.01    6.5386    3.89
  Residual or Error    12     27033.31     2252.78
  Total                14     56493.33

Since Fc = 6.5386 exceeds the critical value 3.89, H0 is rejected at the 5% level of significance.
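The F statistic in this table follows directly from the sums of squares; a short verification sketch (scipy assumed only for the critical value):

from scipy import stats

n, k = 15, 2
ssr, sse = 29460.03, 27033.31

msr = ssr / k                              # 14730.01
mse = sse / (n - k - 1)                    # about 2252.78
fc = msr / mse                             # about 6.54
f_crit = stats.f.ppf(0.95, k, n - k - 1)   # about 3.89
print(round(fc, 4), round(f_crit, 2), fc > f_crit)   # True: reject H0 at the 5% level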

Individual Variables: Tests of Hypothesis

We test whether there is a linear relationship between a particular regressor Xj and Y.
Hypotheses:
  H0: βj = 0 (no linear relationship)
  H1: βj ≠ 0 (a linear relationship exists between Xj and Y)
We use a two-tailed t-test.
If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.

Test Statistic:
  Tc = bj / √(σ̂² Cjj)
Tc ~ Student's t with (n − k − 1) degrees of freedom.
  bj is the least squares estimate of βj.
  Cjj is the (j, j)th element of the matrix (X'X)⁻¹.
  σ̂² = MSE (MSE is obtained in the ANOVA table).

In our example, σ̂² = MSE = 2252.7755 and the elements of (X'X)⁻¹ have magnitudes

  (X'X)⁻¹ :  5.7946   0.3312   1.0165
             0.3312   0.0521   0.0038
             1.0165   0.0038   0.2993

(only the diagonal elements C11 = 0.0521 and C22 = 0.2993 enter the test statistics).

To test H0: β1 = 0 against H1: β1 ≠ 0:  Tc = −2.3057
To test H0: β2 = 0 against H1: β2 ≠ 0:  Tc = 2.8548
Two-tailed critical values of t at 12 d.f. are
  3.0545 for the 1% level of significance
  2.6810 for the 2% level of significance
  2.1788 for the 5% level of significance
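These t values can be reproduced from the estimates, MSE and the diagonal elements of (X'X)⁻¹ (a sketch; scipy assumed only for the critical value):

import numpy as np
from scipy import stats

mse = 2252.7755
b = {"price": -24.98, "advertising": 74.13}    # least squares estimates
c = {"price": 0.0521, "advertising": 0.2993}   # diagonal elements of (X'X)^-1

for name in b:
    tc = b[name] / np.sqrt(mse * c[name])
    print(name, round(tc, 3))      # about -2.31 and 2.85

t_crit = stats.t.ppf(0.975, df=12)
print(round(t_crit, 4))            # 2.1788: both |Tc| exceed this, so both slopes are significant at 5%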

Standard Error
As in simple linear regression, the variability of the Y values around the prediction equation is measured by the STANDARD ERROR OF THE ESTIMATE.
With k regressors it is given by

  S_YX = √[ SSE / (n − k − 1) ] = √[ Σ (Yi − Ŷi)² / (n − k − 1) ]

Assumption of Linearity
[Residual plots: patternless residuals indicate a linear relationship; a systematic curve indicates non-linearity]

Assumption of Equal Variance
We assume that Var(εi) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷi against the residuals ei = Yi − Ŷi.
[Residual plots: equal variance (homoscedastic) vs. unequal variance (heteroscedastic)]

Assumption of Uncorrelated Residuals

The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by

  d = Σ_{i=2}^{n} (ei − e_{i−1})² / Σ_{i=1}^{n} ei²

The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values (d < 2) indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d greater than 3 or less than 1 are alarming.
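A direct implementation of this statistic (a sketch, numpy assumed; the residual values plugged in below are the rounded pie-example residuals from the earlier table):

import numpy as np

def durbin_watson(residuals):
    # Durbin-Watson: sum of squared successive differences over the sum of squared residuals
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e)**2) / np.sum(e**2)

e = [-63.80, 96.15, 20.88, -10.31, -9.09, -35.74, 13.47, 49.03,
     58.84, 11.83, -46.16, -46.44, -15.70, 8.89, -31.85]
print(round(durbin_watson(e), 2))   # a value between 0 and 4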

Residual Analysis for Independence (Uncorrelated Errors)
[Residual plots: independent residuals show no pattern; dependent residuals show a systematic pattern]

Assumption of Normality
When we use the F test or t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined by a histogram of the residuals.
[Histograms: bell-shaped (normal) vs. skewed (not normal)]
Normality can also be examined using a Q-Q plot or normal probability plot.
[Q-Q plots: points along the reference line (normal) vs. systematic departures (not normal)]

Standardized Regression Coefficients

In a multiple linear regression, we may want to know which regressor contributes the most.
We obtain standardized estimates of the regression coefficients.
For that, we first standardize the observations using

  Ȳ  = (1/n) Σ Yi,     sY  = √[ Σ (Yi − Ȳ)² / (n − 1) ]
  X̄1 = (1/n) Σ X1i,    sX1 = √[ Σ (X1i − X̄1)² / (n − 1) ]
  X̄2 = (1/n) Σ X2i,    sX2 = √[ Σ (X2i − X̄2)² / (n − 1) ]

Standardize all Y, X1 and X2 values as follows:

  Standardized Yi  = (Yi − Ȳ) / sY
  Standardized X1i = (X1i − X̄1) / sX1
  Standardized X2i = (X2i − X̄2) / sX2

Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit-free) and can be compared.
Look for the regression coefficient having the largest magnitude; the corresponding regressor contributes the most.

Standardized Data

  Week   Pie Sales   Price   Advertising
    1      −0.78     −0.95      −0.37
    2       0.96      0.76      −0.37
    3      −0.78      1.18      −0.98
    4       0.48      1.18       2.09
    5      −0.78      0.16      −0.98
    6      −0.30      0.76       1.06
    7       0.48     −1.80      −0.98
    8       1.11     −0.18       0.45
    9       0.80      0.33       0.04
   10       1.43     −1.38       1.06
   11      −0.93      0.50       0.04
   12      −1.56      1.10      −0.57
   13       0.64     −0.61       1.06
   14       0.80     −1.38       0.04
   15      −1.56      0.33      −1.60

Fitted equation on the standardized data:  Ŷ = 0 − 0.461 X1 + 0.570 X2
Since |−0.461| < 0.570, X2 (Advertising) contributes the most.
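A sketch of this procedure (numpy assumed): standardize each column with the (n − 1) denominator and refit by least squares.

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv   = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)   # sample standard deviation, (n - 1) in the denominator

ys, x1s, x2s = standardize(sales), standardize(price), standardize(adv)
X = np.column_stack([np.ones_like(ys), x1s, x2s])
beta = np.linalg.solve(X.T @ X, X.T @ ys)
print(np.round(beta, 3))   # roughly [0, -0.461, 0.570]: advertising contributes the most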

Note that:

  R²_Adj = 1 − (1 − R²)(n − 1) / (n − k − 1)
  Fc = [(n − k − 1) R²] / [k (1 − R²)]

Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of an intercept term is not necessary; it depends on the problem, and the analyst may decide on this.

Example: The following data were collected on the sales, number of advertisements published, and advertising expenditure for 12 weeks. Fit a regression model to predict the sales. (Some of the Ads counts were lost in extraction and are shown as –.)

  Sales (0,000 Rs)   Ads (Nos.)   Adv Ex (000 Rs)
       43.6              12            13.9
       38.0              11            12.0
       30.1               –             9.3
       35.3               –             9.7
       46.4              12            12.3
       34.2               –            11.4
       30.2               –             9.3
       40.7              13            14.3
       38.5               –            10.2
       22.6               –             8.4
       37.6               –            11.2
       35.2              10            11.1

ANOVA (b)
  Model 1       Sum of Squares   df   Mean Square     F       Sig.
  Regression        309.986       2     154.993      9.741    .006 (a)
  Residual          143.201       9      15.911
  Total             453.187      11
  a. Predictors: (Constant), Ex_Adv, No_Adv
  b. Dependent Variable: Sales

p-value < 0.05, so H0 is rejected: not all βs are zero.

Coefficients (a)
  Model 1       Unstandardized B   Std. Error   Standardized Beta      t      Sig.
  (Constant)         6.584            8.542                           .771    .461
  No_Adv              .625            1.120           .234            .558    .591
  Ex_Adv             2.139            1.470           .611           1.455    .180
  a. Dependent Variable: Sales

All p-values > 0.05, so no H0 is rejected: β0 = 0, β1 = 0, β2 = 0.

CONTRADICTION: the F test says at least one regressor is significant, but none of the individual t tests indicates a useful variable.

Multicollinearity
When we regress Y on regressors X1, X2, ..., Xk, we assume that the regressors are independent variables:
  all regressors X1, X2, ..., Xk are statistically independent of each other;
  all the regressors affect the values of Y;
  one regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
The correlated variables contribute redundant information to the model.

Including two highly correlated independent variables can adversely affect the regression results and can lead to unstable coefficients.
Some indications of strong multicollinearity:
  Coefficient signs may not match prior expectations.
  A large change in the value of a previous coefficient when a new variable is added to the model.
  A previously significant variable becomes insignificant when a new independent variable is added.
  The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
  The standard error is large, yet the corresponding regressor is still significant.
  MSE is very high and/or R² is very small.

Examples in which this might happen:
  Miles per gallon vs. horsepower and engine size
  Income vs. age and experience
  Sales vs. number of advertisements and advertising expenditure

Variance Inflationary Factor:
VIFj is used to measure the multicollinearity generated by variable Xj.
It is given by

  VIFj = 1 / (1 − Rj²)

where Rj² is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.

If VIFj > 5, Xj is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X'X is singular.
The matrix X'X becomes singular when the columns of X have exact linear dependence, i.e., when an eigenvalue of X'X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
Methods of dealing with multicollinearity:
  Collecting additional data
  Variable elimination
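A sketch of computing VIF for each regressor (numpy assumed): regress Xj on the remaining X columns and apply 1/(1 − Rj²).

import numpy as np

def vif(X):
    # Variance inflation factor of each column of X (regressor columns only, no intercept column)
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(len(y)), others])    # regress Xj on the other regressors
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        r2_j = 1 - resid @ resid / np.sum((y - y.mean())**2)
        out.append(1.0 / (1.0 - r2_j))
    return out

With only two regressors this reduces to 1/(1 − r²), where r is the correlation between them, so both columns get the same VIF (5.022 in the SPSS output that follows).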

Coefficients (a)
  Model 1       Unstandardized B   Std. Error   Standardized Beta      t      Sig.    Tolerance    VIF
  (Constant)         6.584            8.542                           .771    .461
  No_Adv              .625            1.120           .234            .558    .591      .199      5.022
  Ex_Adv             2.139            1.470           .611           1.455    .180      .199      5.022
  a. Dependent Variable: Sales

Tolerance = 1/VIF; both VIF values are greater than 5.

Collinearity Diagnostics (a)
  Model 1   Dimension   Eigenvalue   Condition Index   Variance Proportions (Constant, No_Adv, Ex_Adv)
                1          2.966          1.000              .00    .00    .00
                2           .030          9.882              .33    .17    .00
                3           .003         30.417              .67    .83   1.00
  a. Dependent Variable: Sales

The smallest eigenvalue (.003) is negligible and the corresponding condition index (30.417) is large, indicating multicollinearity.

We may use the method of variable elimination.
In practice, if Corr(X1, X2) is more than 0.7 or less than −0.7, we eliminate one of them.
Techniques:
  Stepwise             (based on ANOVA)
  Forward Inclusion    (based on Correlation)
  Backward Elimination (based on Correlation)

Stepwise Regression
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε

Step 1: Run 5 simple linear regressions:
  Y = β0 + β1 X1
  Y = β0 + β2 X2
  Y = β0 + β3 X3
  Y = β0 + β4 X4   <== has the lowest p-value (ANOVA) < 0.05
  Y = β0 + β5 X5

Step 2: Run 4 two-variable linear regressions:
  Y = β0 + β4 X4 + β1 X1
  Y = β0 + β4 X4 + β2 X2
  Y = β0 + β4 X4 + β3 X3   <== has the lowest p-value (ANOVA) < 0.05
  Y = β0 + β4 X4 + β5 X5

Step 3: Run 3 three-variable linear regressions:
  Y = β0 + β3 X3 + β4 X4 + β1 X1
  Y = β0 + β3 X3 + β4 X4 + β2 X2
  Y = β0 + β3 X3 + β4 X4 + β5 X5

Suppose none of these models has a p-value < 0.05.
STOP: the best model is the one with X3 and X4 only.
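A minimal sketch of this forward selection idea (numpy and scipy assumed; the entry criterion here is the overall-F p-value of each candidate model, as in the steps above):

import numpy as np
from scipy import stats

def model_p_value(y, cols):
    # Overall-F p-value of an OLS fit of y on the given columns (plus an intercept)
    n, k = len(y), len(cols)
    X = np.column_stack([np.ones(n)] + list(cols))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    f = ((sst - sse) / k) / (sse / (n - k - 1))
    return 1 - stats.f.cdf(f, k, n - k - 1)

def forward_stepwise(y, variables, alpha=0.05):
    # variables: dict mapping name -> 1-D array; greedily add the best significant candidate
    selected = []
    while True:
        remaining = [name for name in variables if name not in selected]
        if not remaining:
            break
        pvals = {name: model_p_value(y, [variables[s] for s in selected] + [variables[name]])
                 for name in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                      # no candidate gives a significant model: stop
        selected.append(best)
    return selected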

Example (continued): Using the data on sales, number of advertisements published, and advertising expenditure for the 12 weeks given earlier, we fit regression models to predict the sales, first with one regressor at a time and then with both together.

Summary Output 1: Sales vs. No_Adv

Model Summary
  R = .781,  R Square = .610,  Adjusted R Square = .571,  Std. Error of the Estimate = 4.20570
  a. Predictors: (Constant), No_Adv

ANOVA (b)
  Source        Sum of Squares   df   Mean Square      F       Sig.
  Regression        276.308       1     276.308      15.621    .003 (a)
  Residual          176.879      10      17.688
  Total             453.187      11
  a. Predictors: (Constant), No_Adv
  b. Dependent Variable: Sales

Coefficients (a)
  Model 1       Unstandardized B   Std. Error   Standardized Beta      t      Sig.
  (Constant)        16.937            4.982                           3.400   .007
  No_Adv             2.083             .527            .781           3.952   .003
  a. Dependent Variable: Sales

Summary Output 2: Sales vs. Ex_Adv

Model Summary
  R = .820,  R Square = .673,  Adjusted R Square = .640,  Std. Error of the Estimate = 3.84900
  a. Predictors: (Constant), Ex_Adv

ANOVA (b)
  Source        Sum of Squares   df   Mean Square      F       Sig.
  Regression        305.039       1     305.039      20.590    .001 (a)
  Residual          148.148      10      14.815
  Total             453.187      11
  a. Predictors: (Constant), Ex_Adv
  b. Dependent Variable: Sales

Coefficients (a)
  Model 1       Unstandardized B   Std. Error   Standardized Beta      t      Sig.
  (Constant)         4.173            7.109                            .587   .570
  Ex_Adv             2.872             .633            .820           4.538   .001
  a. Dependent Variable: Sales

Summary Output 3: Sales vs. No_Adv & Ex_Adv

Model Summary
  R = .827,  R Square = .684,  Adjusted R Square = .614,  Std. Error of the Estimate = 3.98888
  a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
  Source        Sum of Squares   df   Mean Square      F       Sig.
  Regression        309.986       2     154.993       9.741    .006 (a)
  Residual          143.201       9      15.911
  Total             453.187      11
  a. Predictors: (Constant), Ex_Adv, No_Adv
  b. Dependent Variable: Sales

Coefficients (a)
  Model 1       Unstandardized B   Std. Error   Standardized Beta      t      Sig.
  (Constant)         6.584            8.542                            .771   .461
  No_Adv              .625            1.120           .234             .558   .591
  Ex_Adv             2.139            1.470           .611            1.455   .180
  a. Dependent Variable: Sales

Qualitative Independent Variables

Johnson Filtration, Inc., provides maintenance service for water filtration systems throughout southern Florida.
To estimate the service time and the service cost, the managers want to predict the repair time necessary for each maintenance request.
Repair time is believed to be related to two factors:
  Number of months since the last maintenance service
  Type of repair problem (mechanical or electrical)

Data for a sample of 10 service calls are given:

  Service Call   Months Since Last Service   Type of Repair   Repair Time (Hours)
       1                    2                  electrical            2.9
       2                    6                  mechanical            3.0
       3                    8                  electrical            4.8
       4                    3                  mechanical            1.8
       5                    2                  electrical            2.9
       6                    7                  electrical            4.9
       7                    9                  mechanical            4.2
       8                    8                  mechanical            4.8
       9                    4                  electrical            4.4
      10                    6                  electrical            4.5

Let Y denote the repair time and X1 the number of months since the last maintenance service.
The regression model that uses X1 only to regress Y is
  Y = β0 + β1 X1 + ε

Using the least squares method, we fitted the model as
  Ŷ = 2.1473 + 0.3041 X1,   R² = 0.534
At the 5% level of significance, we reject
  H0: β0 = 0 (using the t test)
  H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable
  X2 = 0, if the type of repair is mechanical
  X2 = 1, if the type of repair is electrical
The regression model that uses X1 and X2 to regress Y is
  Y = β0 + β1 X1 + β2 X2 + ε
Is the new model improved?
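A sketch of fitting both models in Python (numpy assumed); the data are the 10 service calls above, with the dummy coded 1 for electrical and 0 for mechanical:

import numpy as np

months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6], dtype=float)
repair_type = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
               "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

x2 = np.array([1.0 if t == "electrical" else 0.0 for t in repair_type])   # dummy variable

# Model with X1 only, then with X1 and the dummy X2
for X in (np.column_stack([np.ones(10), months]),
          np.column_stack([np.ones(10), months, x2])):
    b, *_ = np.linalg.lstsq(X, hours, rcond=None)
    sse = np.sum((hours - X @ b) ** 2)
    r2 = 1 - sse / np.sum((hours - hours.mean()) ** 2)
    print(np.round(b, 4), round(r2, 3))
# The first fit reproduces roughly (2.1473, 0.3041) with R^2 of about 0.534;
# comparing R^2 across the two fits shows whether adding the repair-type dummy improves the model.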

Summary
  Multiple linear regression model: Y = Xβ + ε
  Least squares estimate of β: b = (X'X)⁻¹ X'Y
  R² and adjusted R²
  Using ANOVA (F test), we examine whether all βs are zero or not.
  A t test is conducted for each regressor separately; using it, we examine whether the β corresponding to that regressor is zero or not.
  Problem of multicollinearity: VIF, eigenvalues
  Dummy variables
  Examining the assumptions: common variance, independence, normality
