
Linear Regression Analysis

Correlation
Simple Linear Regression
The Multiple Linear Regression Model
Least Squares Estimates
R2 and Adjusted R2
Overall Validity of the Model (F test)
Testing for individual regressor (t test)
Problem of Multicollinearity

Smoking and Lung Capacity


Suppose, for example, we want to investigate the
relationship between cigarette smoking and lung
capacity
We might ask a group of people about their smoking
habits, and measure their lung capacities
Cigarettes (X):      0    5    10   15   20
Lung Capacity (Y):   45   42   33   31   29

Scatter plot of the data
[Scatter plot: Lung Capacity (0 to 60) versus Cigarettes (0 to 30).]

We can see that as smoking goes up, lung capacity tends to go down.
The two variables change in opposite directions.

Height and Weight

Consider the following data of heights and weights of 5


women swimmers:
Height (inch):      62    64    65    66    68
Weight (pounds):    102   108   115   128   132
We can observe that weight is also increasing with height.
[Scatter plot: Weight versus Height.]

Sometimes two variables are related to each other.
The values of both of the variables are paired.
A change in the value of one affects the value of the other.
Usually these two variables are two attributes of each member of the population.
For example:
Height and Weight
Advertising Expenditure and Sales Volume
Unemployment and Crime Rate
Rainfall and Food Production
Expenditure and Savings

We have already studied one measure of relationship between two variables: Covariance.
Covariance between two random variables X and Y is given by
Cov(X, Y) = E(XY) - E(X) E(Y)
For paired observations on variables X and Y,
Cov(X, Y) = (1/n) Σ (xi - x̄)(yi - ȳ),  summed over i = 1, ..., n


Correlation

Properties of Covariance:

Cov(X+a, Y+b) = Cov(X, Y)


[not affected by change in location]
Cov(aX, bY) = ab Cov(X, Y)
[affected by change in scale]
Covariance can take any value from -∞ to +∞.
Cov(X,Y) > 0 means X and Y change in the same direction
Cov(X,Y) < 0 means X and Y change in the opposite direction
If X and Y are independent, Cov(X,Y) = 0 [other way may not be true]

It is not unit free.


So it is not a good measure of relationship between two
variables.
A better measure is correlation coefficient.
It is unit free and takes values in [-1,+1].

Karl Pearson's correlation coefficient is given by
rXY = Corr(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) )
When the joint distribution of X and Y is known,
Cov(X, Y) = E(XY) - E(X) E(Y),
Var(X) = E(X²) - [E(X)]²,   Var(Y) = E(Y²) - [E(Y)]²

Properties of Correlation Coefficient

Corr(aX + b, cY + d) = Corr(X, Y), when a and c have the same sign.


It is unit free.
It measures the strength of relationship on a
scale of -1 to +1.
So, it can be used to compare the relationships of
various pairs of variables.
Values close to 0 indicate little or no correlation
Values close to +1 indicate very strong positive
correlation.
Values close to -1 indicate very strong negative
correlation.

Scatter Diagram
[Scatter plots illustrating positively correlated, negatively correlated, weakly correlated, strongly correlated, and uncorrelated data.]

The correlation coefficient measures the strength of the linear relationship.
r = 0 does not necessarily imply that there is no relationship.
A relationship may be there, but not a linear one.

x       y      x - x̄    y - ȳ    (x - x̄)²   (y - ȳ)²   (x - x̄)(y - ȳ)
1.25    125    -0.90     45      0.8100     2025       -40.50
1.75    105    -0.40     25      0.1600     625        -10.00
2.25     65     0.10    -15      0.0100     225         -1.50
2.00     85    -0.15      5      0.0225     25          -0.75
2.50     75     0.35     -5      0.1225     25          -1.75
2.25     80     0.10      0      0.0100     0            0.00
2.70     50     0.55    -30      0.3025     900        -16.50
2.50     55     0.35    -25      0.1225     625         -8.75
---------------------------------------------------------------
17.20   640                      1.560      4450       -79.75
                                 (SSX)      (SSY)       (SSXY)

For paired observations,
Cov(X, Y) = (1/n) Σ (xi - x̄)(yi - ȳ),
Var(X) = (1/n) Σ (xi - x̄)²,   Var(Y) = (1/n) Σ (yi - ȳ)²
so that
rXY = Corr(X, Y) = Cov(X, Y) / √( Var(X) Var(Y) ) = Σ (x - x̄)(y - ȳ) / √( Σ (x - x̄)² Σ (y - ȳ)² )

For the data in the table,
r = SSXY / √( SSX · SSY ) = -79.75 / √( 1.56 × 4450 ) = -0.957
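The same computation can be scripted. Below is a minimal sketch in Python with NumPy (an illustration added here, not part of the original slides) that computes SSX, SSY, SSXY and r for the tabulated data, with np.corrcoef as a cross-check.

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

ssx = np.sum((x - x.mean()) ** 2)                  # SSX, about 1.56
ssy = np.sum((y - y.mean()) ** 2)                  # SSY, 4450
ssxy = np.sum((x - x.mean()) * (y - y.mean()))     # SSXY, -79.75

r = ssxy / np.sqrt(ssx * ssy)
print(ssx, ssy, ssxy, r)            # r should be about -0.957
print(np.corrcoef(x, y)[0, 1])      # same value computed directly by NumPy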


Alternative Formulas for Sum of Squares

SSX = Σx² - (Σx)²/n,   SSY = Σy² - (Σy)²/n,   SSXY = Σxy - (Σx)(Σy)/n

x       y       x²        y²       x·y
1.25    125     1.5625    15625    156.25
1.75    105     3.0625    11025    183.75
2.25     65     5.0625     4225    146.25
2.00     85     4.0000     7225    170.00
2.50     75     6.2500     5625    187.50
2.25     80     5.0625     6400    180.00
2.70     50     7.2900     2500    135.00
2.50     55     6.2500     3025    137.50
---------------------------------------------
17.20   640     38.54     55650   1296.25

SSX  = 38.54 - (17.20)²/8 = 1.56
SSY  = 55650 - (640)²/8 = 4450
SSXY = 1296.25 - (17.20)(640)/8 = -79.75

Smoking and Lung Capacity Example

Cigarettes (X)        0      5      10     15     20     Total:   50
Lung Capacity (Y)     45     42     33     31     29     Total:  180
X²                    0      25     100    225    400    Total:  750
Y²                    2025   1764   1089   961    841    Total: 6680
XY                    0      210    330    465    580    Total: 1585

Regression Analysis

Having determined the correlation between X and Y, we


wish to determine a mathematical relationship between
them.
Dependent variable: the variable you wish to explain
Independent variables: the variables used to explain the
dependent variable
Regression analysis is used to:
Predict the value of dependent variable based on the
value of independent variable(s)
Explain the impact of changes in an independent
variable on the dependent variable

For the smoking and lung capacity data,

r = [5(1585) - (50)(180)] / √{ [5(750) - 50²] [5(6680) - 180²] }
  = (7925 - 9000) / √{ (3750 - 2500)(33400 - 32400) }
  = -1075 / √(1250 × 1000)
  = -0.9615

Types of Relationships

[Scatter plots illustrating linear vs. curvilinear relationships, strong vs. weak relationships, and no relationship.]

Simple Linear Regression Analysis

The simplest mathematical relationship is
Y = a + bX + error   (linear)
Changes in Y are related to changes in X.
What are the most suitable values of a (intercept) and b (slope)?

Method of Least Squares

We want to fit a line for which all the errors are minimum.
That is, we want to obtain the values of a and b in Y = a + bX + error for which all the errors are minimum.
[Diagram: for an observation (xi, yi), the error is the vertical distance between yi and the fitted value a + b·xi.]
To minimize all the errors together, we minimize the sum of squares of errors (SSE):
SSE = Σ (Yi - a - bXi)²,  summed over i = 1, ..., n
The best fitted line is the one for which all the ERRORS are minimum.

To get the values of a and b which minimize SSE, we proceed as follows:

∂SSE/∂a = 0  ⟹  -2 Σ (Yi - a - bXi) = 0   ⟹  Σ Yi = n a + b Σ Xi              ... (1)
∂SSE/∂b = 0  ⟹  -2 Σ (Yi - a - bXi) Xi = 0  ⟹  Σ Yi Xi = a Σ Xi + b Σ Xi²     ... (2)

Eq (1) and (2) are called normal equations.
Solving the normal equations, we get

b = [ n Σ Xi Yi - (Σ Xi)(Σ Yi) ] / [ n Σ Xi² - (Σ Xi)² ] = SSXY / SSX
a = Ȳ - b X̄

The values of a and b obtained using the least squares method are called the least squares estimates (LSE) of a and b.
Thus, the LSE of a and b are given by
b = SSXY / SSX,   a = Ȳ - b X̄
Also, the correlation coefficient between X and Y is
rXY = Cov(X, Y) / √( Var(X) Var(Y) ) = SSXY / √( SSX · SSY ) = b √( SSX / SSY )

Using the deviation table computed earlier:
SSX = 1.560,  SSY = 4450,  SSXY = -79.75,  X̄ = 2.15,  Ȳ = 80

r = SSXY / √( SSX · SSY ) = -0.957
b = SSXY / SSX = -51.12,   a = Ȳ - b X̄ = 189.91
Fitted line is Ŷ = 189.91 - 51.12 X

[Scatter plot of the data with the fitted line Ŷ = 189.91 - 51.12 X.]

189.91 is the estimated mean value of Y when the value of X is zero.
-51.12 is the change in the average value of Y as a result of a one-unit change in X.
We can predict the value of Y for some given value of X.
For example, at X = 2.15 the predicted value of Y is 189.91 - 51.12 × 2.15 = 80.002.
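A minimal sketch in Python with NumPy (added for illustration, not from the slides) of the least squares estimates b = SSXY/SSX and a = Ȳ - bX̄ for the same data:

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

ssx = np.sum((x - x.mean()) ** 2)
ssxy = np.sum((x - x.mean()) * (y - y.mean()))

b = ssxy / ssx                  # slope, about -51.12
a = y.mean() - b * x.mean()     # intercept, about 189.91
print(a, b)
print(a + b * 2.15)             # prediction at X = 2.15, about 80.0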

Residuals: ei = Yi - Ŷi
Residual is the unexplained part of Y
The smaller the residuals, the better the utility of
Regression.
Sum of Residuals is always zero. Least Square
procedure ensures that.
Residuals play an important role in investigating
the adequacy of the fitted model.
We obtain coefficient of determination (R2)
using the residuals.
R2 is used to examine the adequacy of the fitted
linear model to the given data.


Coefficient of Determination
[Diagram: the deviation (Yi - Ȳ) splits into the explained part (Ŷi - Ȳ) and the residual (Yi - Ŷi).]

Total Sum of Squares:        SST = Σ (Yi - Ȳ)²
Regression Sum of Squares:   SSR = Σ (Ŷi - Ȳ)²
Error Sum of Squares:        SSE = Σ (Yi - Ŷi)²
Also, SST = SSR + SSE

The fraction of SST explained by Regression is given by R²:
R² = SSR / SST = 1 - (SSE / SST)
Clearly, 0 ≤ R² ≤ 1.
When SSR is close to SST, R² will be close to 1.
This means that the regression explains most of the variability in Y. (Fit is good.)
When SSE is close to SST, R² will be close to 0.
This means that the regression does not explain much variability in Y. (Fit is not good.)
R² is the square of the correlation coefficient between X and Y. (proof omitted)
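A short sketch in Python with NumPy (illustrative, not from the slides) computing R² both as SSR/SST and as 1 - SSE/SST for the fitted line above:

import numpy as np

x = np.array([1.25, 1.75, 2.25, 2.00, 2.50, 2.25, 2.70, 2.50])
y = np.array([125, 105, 65, 85, 75, 80, 50, 55], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x                       # fitted values

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - y_hat) ** 2)          # error sum of squares

print(ssr / sst, 1 - sse / sst)         # both about 0.916 = (-0.957)**2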

R² = 1 (r = -1 or r = +1): perfect linear relationship; 100% of the variation in Y is explained by X.
0 < R² < 1: weak linear relationship; some but not all of the variation in Y is explained by X.
R² = 0: no linear relationship; none of the variation in Y is explained by X.

x      y     Ŷ       Y - Ȳ   Y - Ŷ   Ŷ - Ȳ   (Y - Ȳ)²   (Y - Ŷ)²   (Ŷ - Ȳ)²
1.25   125   126.0    45      -1      46      2025       1          2116
1.75   105   100.5    25      4.5     20.5    625        20.25      420.25
2.25    65    74.9   -15      -9.9    -5.1    225        98.00      26.01
2.00    85    87.7     5      -2.2     7.7    25         4.84       59.29
2.50    75    62.1    -5      12.9   -17.7    25         166.41     313.29
2.25    80    74.9     0       5.1    -5.1    0          26.01      26.01
2.70    50    51.9   -30      -1.9   -28.1    900        3.61       789.61
2.50    55    62.1   -25      -7.1   -17.9    625        50.41      320.41
-----------------------------------------------------------------------------
17.20  640                                    4450       370.54     4079.46

Coefficient of Determination: R² = (4450 - 370.5)/4450 = 0.916
Correlation Coefficient: r = -0.957
Coefficient of Determination = (Correlation Coefficient)²

Example:
Watching television also reduces the amount of physical exercise, causing weight gains.
A sample of fifteen 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative number indicates the child is underweight).
Additionally, the number of hours of television viewing per week was also recorded. These data are listed here.
TV:          42 34 25 35 37 38 31 33 19 29 38 28 29 36 18
Overweight:  18  6  0 -1 13 14  7  7 -9  8  8  5  3 14 -7

Calculate the sample regression line and describe what the coefficients tell you about the relationship between the two variables.

Fitted line: Y = -24.709 + 0.967 X,  with R² = 0.768
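A minimal sketch in Python with NumPy (illustrative, not from the slides) that fits the TV-viewing data with np.polyfit and checks R²; it should reproduce approximately the coefficients quoted above:

import numpy as np

tv = np.array([42, 34, 25, 35, 37, 38, 31, 33, 19, 29, 38, 28, 29, 36, 18], dtype=float)
overweight = np.array([18, 6, 0, -1, 13, 14, 7, 7, -9, 8, 8, 5, 3, 14, -7], dtype=float)

slope, intercept = np.polyfit(tv, overweight, 1)   # degree-1 polynomial = straight line
y_hat = intercept + slope * tv
r2 = 1 - np.sum((overweight - y_hat) ** 2) / np.sum((overweight - overweight.mean()) ** 2)

print(intercept, slope, r2)   # about -24.709, 0.967 and R² about 0.768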

[Plot of the observed Y and predicted Y values for the fifteen children.]

Standard Error

Consider a dataset.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √( SSE / (n - 2) ) = √( Σ (Yi - Ŷi)² / (n - 2) )

Assumptions

The relationship between X and Y is linear.
Error values are statistically independent.
All the errors have a common variance (Homoscedasticity): Var(ei) = σ², where ei = Yi - Ŷi.
E(ei) = 0.
No distributional assumption about the errors is required for the least squares method.

[Residual plots illustrating the assumptions: linear vs. not linear relationships; independent vs. not independent errors; equal variance (homoscedastic) vs. unequal variance (heteroscedastic) errors.]
TV Watching Weight Gain Example

[For the TV watching and weight gain data: scatter plot of X and Y, and scatter plot of X and the residuals.]


The Multiple Linear Regression Model


In simple linear regression analysis, we fit a linear relation between
one independent variable (X) and
one dependent variable (Y).

We assume that Y is regressed on only one regressor


variable X.
In some situations, the variable Y is regressed on more than one regressor variable (X1, X2, X3, ...).
For example:
Cost    -> Labor cost, Electricity cost, Raw material cost
Salary  -> Education, Experience
Sales   -> Cost, Advertising Expenditure

Example:
A distributor of frozen dessert pies wants to
evaluate factors which influence the demand
Dependent variable:
Y: Pie sales (units per week)

Independent variables:
X1: Price (in $)
X2: Advertising Expenditure ($100s)

Data are collected for 15 weeks


Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Using the given data, we wish to fit a linear function of the form
Yi = β0 + β1 X1i + β2 X2i + εi,   i = 1, 2, ..., 15,
where
Y: Pie sales (units per week)
X1: Price (in $)
X2: Advertising Expenditure ($100s)
Fitting means we want to get the values of the regression coefficients denoted by β.
The original values of the βs are not known.
We estimate them using the given data.

The Multiple Linear Regression Model


Examine the linear relationship between
one dependent (Y) and
two or more independent variables (X1, X2, ..., Xk).

Multiple Linear Regression Model with k independent variables:
Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi,   i = 1, 2, ..., n
(β0: intercept; β1, ..., βk: slopes; εi: random error)

Multiple Linear Regression Equation

The intercept and slopes are estimated using the observed data.
The multiple linear regression equation with k independent variables is
Ŷi = b0 + b1 X1i + b2 X2i + ... + bk Xki,   i = 1, 2, ..., n
(Ŷi: estimated value; b0: estimate of the intercept; b1, ..., bk: estimates of the slopes)

Multiple Regression Equation

Example with two independent variables:
Ŷ = b0 + b1 X1 + b2 X2
[The fitted equation is a plane over the (X1, X2) space.]

Estimating Regression Coefficients

The multiple linear regression model
Yi = β0 + β1 X1i + β2 X2i + ... + βk Xki + εi,   i = 1, 2, ..., n
can be written in matrix notation as
Y = X β + ε,
where Y = (Y1, Y2, ..., Yn)′ is the n×1 response vector, X is the n×(k+1) matrix whose i-th row is (1, X1i, X2i, ..., Xki), β = (β0, β1, ..., βk)′, and ε = (ε1, ε2, ..., εn)′.

Assumptions
The number of observations (n) is greater than the number of regressors (k), i.e., n > k.
Random errors are independent.
Random errors have the same variance (Homoscedasticity): Var(εi) = σ².
In the long run, the mean effect of the random errors is zero: E(εi) = 0.
No assumption on the distribution of the random errors is required for the least squares method.

In order to find the estimate of β, we minimize
S(β) = Σ εi² = (Y - Xβ)′(Y - Xβ) = Y′Y - 2 β′X′Y + β′X′Xβ
We differentiate S(β) with respect to β and equate it to zero, i.e., ∂S/∂β = 0.
This gives
b = (X′X)⁻¹ X′Y
b is called the least squares estimator of β.
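A minimal sketch in Python with NumPy (illustrative, not from the slides) of the matrix formula b = (X′X)⁻¹X′Y applied to the pie sales data:

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])   # first column of 1s for the intercept
b = np.linalg.solve(X.T @ X, X.T @ sales)                # least squares estimator b = (X'X)^-1 X'Y

print(b)               # should be close to the slide's estimates 306.53, -24.98, 74.13
print(sales - X @ b)   # residuals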

Example: Consider the pie example.
We want to fit the model Yi = β0 + β1 X1i + β2 X2i + εi.
The variables are
Y: Pie sales (units per week)
X1: Price (in $)
X2: Advertising Expenditure ($100s)
Using the matrix formula, the least squares estimates (LSE) of the βs are obtained as below:
LSE of intercept β0:           b0 = 306.53
LSE of slope β1 (Price):       b1 = -24.98
LSE of slope β2 (Advertising): b2 = 74.13

Sales = 306.53 - 24.98 (X1) + 74.13 (X2)
Pie Sales = 306.53 - 24.98 Price + 74.13 Adv. Expend.

b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, while advertising expenses are kept fixed.
b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, while the selling price is kept fixed.

Ŷ = 306.52619 - 24.97509 X1 + 74.13096 X2

Prediction:
Predict sales for a week in which the selling price is $5.50 and the advertising expenditure is $350:
Sales = 306.53 - 24.98 X1 + 74.13 X2
      = 306.53 - 24.98 (5.50) + 74.13 (3.5)
      = 428.62
Predicted sales is 428.62 pies.
Note that advertising is in $100s, so X2 = 3.5.

Y      X1    X2    Predicted Y   Residual
350    5.5   3.3   413.77        -63.80
460    7.5   3.3   363.81        96.15
350    8.0   3.0   329.08        20.88
430    8.0   4.5   440.28        -10.31
350    6.8   3.0   359.06        -9.09
380    7.5   4.0   415.70        -35.74
430    4.5   3.0   416.51        13.47
470    6.4   3.7   420.94        49.03
450    7.0   3.5   391.13        58.84
490    5.0   4.0   478.15        11.83
340    7.2   3.5   386.13        -46.16
300    7.9   3.2   346.40        -46.44
440    5.9   4.0   455.67        -15.70
450    5.0   3.5   441.09        8.89
300    7.0   2.7   331.82        -31.85

Coefficient of Determination
The coefficient of determination (R²) is obtained using the same formula as in simple linear regression.

Total Sum of Squares:        SST = Σ (Yi - Ȳ)²
Regression Sum of Squares:   SSR = Σ (Ŷi - Ȳ)²
Error Sum of Squares:        SSE = Σ (Yi - Ŷi)²
Also, SST = SSR + SSE
R² = SSR / SST = 1 - (SSE / SST)

[Plot of the observed Y and predicted Y values for the 15 weeks.]

Since SST = SSR + SSE and all three quantities are non-negative,
0 ≤ SSR ≤ SST.
So 0 ≤ SSR/SST ≤ 1, or 0 ≤ R² ≤ 1.
When R² is close to 0, the linear fit is not good and the X variables do not contribute in explaining the variability in Y.
When R² is close to 1, the linear fit is good.
R² is the proportion of variation in Y explained by the regression.
In the previously discussed example, R² = 0.5215.
If we consider Y and X1 only, R² = 0.1965.
If we consider Y and X2 only, R² = 0.3095.

Adjusted R²
If one more regressor is added to the model, the value of R² will increase.
This increase is regardless of the contribution of the newly added regressor.
So an adjusted value of R² is defined, called the adjusted R²:
R²_Adj = 1 - [ SSE / (n - k - 1) ] / [ SST / (n - 1) ]
This adjusted R² will only increase if the additional variable contributes in explaining the variation in Y.
For our example, Adjusted R² = 0.4417.
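A short sketch in Python with NumPy (illustrative, not from the slides) computing R² and adjusted R² for the two-regressor pie sales model (n = 15, k = 2):

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

n, k = len(sales), 2
X = np.column_stack([np.ones(n), price, adv])
b = np.linalg.solve(X.T @ X, X.T @ sales)

sse = np.sum((sales - X @ b) ** 2)
sst = np.sum((sales - sales.mean()) ** 2)
r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))

print(r2, r2_adj)   # about 0.5215 and 0.4417, as quoted on the slides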


F-Test for Overall Significance

We check if there is a linear relationship between all the regressors (X1, X2, ..., Xk) and the response (Y).
We use the F test statistic to test:
H0: β1 = β2 = ... = βk = 0 (no regressor is significant)
H1: at least one βi ≠ 0 (at least one regressor affects Y)
The technique of Analysis of Variance is used.
The Total Sum of Squares (SST) is partitioned into the Sum of Squares due to Regression (SSR) and the Sum of Squares due to Residuals (SSE),
where
SST = Σ (Yi - Ȳ)²,   SSE = Σ ei² = Σ (Yi - Ŷi)²,   SSR = SST - SSE
The ei's are called the residuals.
Assumptions:
n > k, Var(εi) = σ², E(εi) = 0.
The εi's are independent. This implies that Corr(εi, εj) = 0 for i ≠ j.
The εi's have a Normal Distribution: εi ~ N(0, σ²). [NEW ASSUMPTION]

Analysis of Variance Table

Source              df        SS    MS                  Fc
Regression          k         SSR   MSR = SSR/k         MSR/MSE
Residual or Error   n-k-1     SSE   MSE = SSE/(n-k-1)
Total               n-1       SST

Test Statistic: Fc = MSR / MSE ~ F(k, n-k-1)

For the previous example, we wish to test
H0: β1 = β2 = 0 against H1: at least one βi ≠ 0

ANOVA Table
Source              df    SS          MS          Fc       F(2,12)(0.05)
Regression          2     29460.03    14730.01    6.5386   3.89
Residual or Error   12    27033.31    2252.78
Total               14    56493.33

Since 6.5386 > 3.89, H0 is rejected at the 5% level of significance.
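A short sketch in Python with NumPy and SciPy (illustrative, not from the slides) reproducing the overall F test for the pie sales model:

import numpy as np
from scipy import stats

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

n, k = len(sales), 2
X = np.column_stack([np.ones(n), price, adv])
b = np.linalg.solve(X.T @ X, X.T @ sales)

sse = np.sum((sales - X @ b) ** 2)          # about 27033
sst = np.sum((sales - sales.mean()) ** 2)   # about 56493
ssr = sst - sse                             # about 29460

f_c = (ssr / k) / (sse / (n - k - 1))       # about 6.54
f_crit = stats.f.ppf(0.95, k, n - k - 1)    # about 3.89
p_value = stats.f.sf(f_c, k, n - k - 1)

print(f_c, f_crit, p_value)   # F_c exceeds the critical value, so H0 is rejected at the 5% level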

Individual Variables Tests of Hypothesis

We test if there is a linear relationship between a particular regressor Xj and Y.
Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (a linear relationship exists between Xj and Y)
We use a two-tailed t-test.
If H0: βj = 0 is accepted, this indicates that the variable Xj can be deleted from the model.

Test Statistic:
Tc = bj / √( σ̂² Cjj )
Tc ~ Student's t with (n - k - 1) degrees of freedom
bj is the least squares estimate of βj
Cjj is the (j, j)-th element of the matrix (X′X)⁻¹
σ̂² = MSE (MSE is obtained in the ANOVA table)

In our example, σ̂² = 2252.7755 and

             [  5.7946   -0.3312   -1.0165 ]
(X′X)⁻¹  =   [ -0.3312    0.0521   -0.0038 ]
             [ -1.0165   -0.0038    0.2993 ]

To test H0: β1 = 0 against H1: β1 ≠ 0:  Tc = -2.3057
To test H0: β2 = 0 against H1: β2 ≠ 0:  Tc = 2.8548
The two-tailed critical values of t at 12 d.f. are
3.0545 for the 1% level of significance
2.6810 for the 2% level of significance
2.1788 for the 5% level of significance
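A short sketch in Python with NumPy and SciPy (illustrative, not from the slides) of the individual t statistics Tc = bj/√(σ̂² Cjj) for the pie sales model:

import numpy as np
from scipy import stats

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

n, k = len(sales), 2
X = np.column_stack([np.ones(n), price, adv])
xtx_inv = np.linalg.inv(X.T @ X)        # the C_jj values sit on the diagonal of this matrix
b = xtx_inv @ X.T @ sales

mse = np.sum((sales - X @ b) ** 2) / (n - k - 1)   # estimate of sigma^2, about 2252.78
t_c = b / np.sqrt(mse * np.diag(xtx_inv))          # t statistics for b0, b1, b2
t_crit = stats.t.ppf(0.975, n - k - 1)             # two-tailed 5% critical value, about 2.1788

print(t_c)      # the b1 and b2 entries should be about -2.31 and 2.85
print(t_crit)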


Standard Error

Consider a dataset.
All the observations cannot be exactly the same as the arithmetic mean (AM).
Variability of the observations around the AM is measured by the standard deviation.
Similarly, in regression, all Y values cannot be the same as the predicted Y values.
Variability of the Y values around the prediction line is measured by the STANDARD ERROR OF THE ESTIMATE.
It is given by
S_YX = √( SSE / (n - k - 1) ) = √( Σ (Yi - Ŷi)² / (n - k - 1) )

Assumption of Linearity
[Residual plots against X: a patternless plot indicates a linear relationship; a systematic pattern indicates the relationship is not linear.]

Residual Analysis for Equal Variance
We assume that Var(εi) = σ², i.e., the variance is constant for all observations.
This assumption is examined by looking at the plot of the predicted values Ŷi against the residuals ei = Yi - Ŷi.
[Residual plots illustrating unequal variance (heteroscedastic) vs. equal variance (homoscedastic) errors.]

Assumption of Uncorrelated Residuals
(Residual Analysis for Independence / Uncorrelated Errors)

The Durbin-Watson statistic is a test statistic used to detect the presence of autocorrelation.
It is given by
d = Σ (ei - e(i-1))², summed over i = 2, ..., n, divided by Σ ei², summed over i = 1, ..., n.
The value of d always lies between 0 and 4.
d = 2 indicates no autocorrelation.
Small values of d < 2 indicate that successive error terms are positively correlated.
If d > 2, successive error terms are negatively correlated.
Values of d greater than 3 or less than 1 are alarming.
[Residual plots in observation order illustrating independent vs. not independent errors.]
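A minimal sketch in Python with NumPy (illustrative, not from the slides) of the Durbin-Watson statistic, applied here to the pie sales residuals as an example:

import numpy as np

def durbin_watson(residuals):
    # d = sum of squared successive differences of residuals / sum of squared residuals
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones(len(sales)), price, adv])
resid = sales - X @ np.linalg.solve(X.T @ X, X.T @ sales)

print(durbin_watson(resid))   # values near 2 suggest no autocorrelation; below 1 or above 3 are alarming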


Assumption of Normality
When we use the F test or the t test, we assume that ε1, ε2, ..., εn are normally distributed.
This assumption can be examined by a histogram of the residuals.
Normality can also be examined using a Q-Q plot or a normal probability plot.
[Histograms and Q-Q plots illustrating normal vs. not normal residuals.]

Standardized Regression Coefficient


In a multiple linear regression, we may like to know
which regressor contributes more.
We obtain standardized estimates of regression
coefficients.
For that, first we standardize the observations.
Ȳ  = (1/n) Σ Yi,      s_Y  = √( (1/(n-1)) Σ (Yi - Ȳ)² )
X̄1 = (1/n) Σ X1i,     s_X1 = √( (1/(n-1)) Σ (X1i - X̄1)² )
X̄2 = (1/n) Σ X2i,     s_X2 = √( (1/(n-1)) Σ (X2i - X̄2)² )

Standardize all Y, X1 and X2 values as follows:
Standardized Yi = (Yi - Ȳ)/s_Y,   Standardized X1i = (X1i - X̄1)/s_X1,   Standardized X2i = (X2i - X̄2)/s_X2
Fit the regression on the standardized data and obtain the least squares estimates of the regression coefficients.
These coefficients are dimensionless (unit-free) and can be compared.
Look for the regression coefficient having the highest magnitude.
The corresponding regressor contributes the most.
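A minimal sketch in Python with NumPy (illustrative, not from the slides) of the standardization-and-refit procedure for the pie sales data; the resulting coefficients should be close to the standardized values quoted on the next slide:

import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)   # z-scores using the sample standard deviation (n - 1)

ys, x1s, x2s = standardize(sales), standardize(price), standardize(adv)
X = np.column_stack([np.ones(len(ys)), x1s, x2s])
beta = np.linalg.solve(X.T @ X, X.T @ ys)

print(beta)   # intercept near 0; slopes should be close to -0.461 and 0.570
              # the coefficient with the larger magnitude (X2, advertising) contributes the most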

Standardized Data

Week   Pie Sales   Price ($)   Advertising ($100s)
1      -0.78       -0.95       -0.37
2       0.96        0.76       -0.37
3      -0.78        1.18       -0.98
4       0.48        1.18        2.09
5      -0.78        0.16       -0.98
6      -0.30        0.76        1.06
7       0.48       -1.80       -0.98
8       1.11       -0.18        0.45
9       0.80        0.33        0.04
10      1.43       -1.38        1.06
11     -0.93        0.50        0.04
12     -1.56        1.10       -0.57
13      0.64       -0.61        1.06
14      0.80       -1.38        0.04
15     -1.56        0.33       -1.60

Note that the fitted regression on the standardized data is
Ŷ = 0 - 0.461 X1 + 0.570 X2
Since |-0.461| < 0.570, X2 contributes the most.

Equivalently,
R²_Adj = 1 - (1 - R²)(n - 1) / (n - k - 1)
Fc = (n - k - 1) R² / [ k (1 - R²) ]

Adjusted R² can be negative.
Adjusted R² is always less than or equal to R².
Inclusion of the intercept term is not necessary; it depends on the problem, and the analyst may decide on this.


Example: The following data were collected on sales, the number of advertisements published, and the advertising expenditure for 12 weeks. Fit a regression model to predict the sales.

Sales (0,000 Rs)   Ads (Nos.)   Adv Ex (000 Rs)
43.6               12           13.9
38.0               11           12
30.1                            9.3
35.3                            9.7
46.4               12           12.3
34.2                            11.4
30.2                            9.3
40.7               13           14.3
38.5                            10.2
22.6                            8.4
37.6                            11.2
35.2               10           11.1

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   309.986          2    154.993       9.741   .006 (a)
Residual     143.201          9    15.911
Total        453.187          11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.
(Constant)    6.584   8.542               .771    .461
No_Adv        .625    1.120        .234   .558    .591
Ex_Adv        2.139   1.470        .611   1.455   .180
a. Dependent Variable: Sales

CONTRADICTION:
The ANOVA p-value < 0.05, so H0 is rejected: all βs are not zero.
But all individual p-values > 0.05, so no H0 is rejected: β0 = 0, β1 = 0, β2 = 0.

Multicollinearity

When we regress Y on regressors X1, X2, ..., Xk, we assume that all regressors X1, X2, ..., Xk are statistically independent of each other.
All the regressors affect the values of Y, but one regressor does not affect the values of another regressor.
Sometimes, in practice, this assumption is not met, and we face the problem of multicollinearity.
Including two highly correlated independent variables can adversely affect the regression results:
It can lead to unstable coefficients.
The correlated variables contribute redundant information to the model.

EXAMPLES IN WHICH THIS MIGHT HAPPEN:


Miles per gallon Vs. horsepower and engine size
Income Vs. age and experience
Sales Vs. No. of Advertisement and Advert. Expenditure

Variance Inflationary Factor:
VIFj is used to measure the multicollinearity generated by variable Xj.
It is given by
VIFj = 1 / (1 - R²j)
where R²j is the coefficient of determination of a regression model that uses Xj as the dependent variable and all other X variables as the independent variables.
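A minimal sketch in Python with NumPy (illustrative, not from the slides) of VIFj = 1/(1 - R²j); the data used here are hypothetical, generated only to exercise the function:

import numpy as np

def vif(xj, other_X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing xj on the other regressors
    X = np.column_stack([np.ones(len(xj)), other_X])
    b = np.linalg.lstsq(X, xj, rcond=None)[0]
    resid = xj - X @ b
    r2_j = 1 - np.sum(resid ** 2) / np.sum((xj - xj.mean()) ** 2)
    return 1.0 / (1.0 - r2_j)

# hypothetical regressors, generated only to exercise the function
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)   # strongly related to x1 by construction

print(vif(x1, x2.reshape(-1, 1)), vif(x2, x1.reshape(-1, 1)))   # both well above 5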

Some Indications of Strong Multicollinearity:
Coefficient signs may not match prior expectations.
A large change in the value of a previous coefficient when a new variable is added to the model.
A previously significant variable becomes insignificant when a new independent variable is added.
The F test says at least one variable is significant, but none of the t tests indicates a useful variable.
The standard error is large, yet the corresponding regressor is still significant.
MSE is very high and/or R² is very small.

If VIFj > 5, Xj is highly correlated with the other independent variables.
Mathematically, the problem of multicollinearity occurs when the columns of the matrix X have near linear dependence.
The LSE b cannot be obtained when the matrix X′X is singular.
The matrix X′X becomes singular when the columns of X have exact linear dependence, i.e., when any eigenvalue of X′X is zero.
Thus, a near-zero eigenvalue is also an indication of multicollinearity.
The methods of dealing with multicollinearity:
Collecting additional data
Variable elimination


Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.   Tolerance   VIF
(Constant)    6.584   8.542               .771    .461
No_Adv        .625    1.120        .234   .558    .591   .199        5.022
Ex_Adv        2.139   1.470        .611   1.455   .180   .199        5.022
a. Dependent Variable: Sales
Tolerance = 1/VIF; here VIF = 5.022 is greater than 5.

Collinearity Diagnostics (a)
                                             Variance Proportions
Dimension   Eigenvalue   Condition Index   (Constant)   No_Adv   Ex_Adv
1           2.966        1.000             .00          .00      .00
2           .030         9.882             .33          .17      .00
3           .003         30.417            .67          .83      1.00
a. Dependent Variable: Sales
The third eigenvalue (.003) is negligible and its condition index (30.417) is large.

We may use the method of variable elimination.
In practice, if Corr(X1, X2) is more than 0.7 or less than -0.7, we eliminate one of them.
Techniques:
Stepwise (based on ANOVA)
Forward Inclusion (based on Correlation)
Backward Elimination (based on Correlation)

Stepwise Regression
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + ε

Step 1: Run 5 simple linear regressions:
Y = β0 + β1 X1
Y = β0 + β2 X2
Y = β0 + β3 X3
Y = β0 + β4 X4   <==== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β5 X5

Step 2: Run 4 two-variable linear regressions:
Y = β0 + β4 X4 + β1 X1
Y = β0 + β4 X4 + β2 X2
Y = β0 + β4 X4 + β3 X3   <==== has the lowest p-value (ANOVA) < 0.05
Y = β0 + β4 X4 + β5 X5

Step 3: Run 3 three-variable linear regressions:
Y = β0 + β3 X3 + β4 X4 + β1 X1
Y = β0 + β3 X3 + β4 X4 + β2 X2
Y = β0 + β3 X3 + β4 X4 + β5 X5

Suppose none of these models have p-values < 0.05.
STOP.
The best model is the one with X3 and X4 only.
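A minimal sketch in Python with NumPy and SciPy (illustrative, not from the slides) of the forward/stepwise idea described above, adding at each step the candidate whose model has the smallest ANOVA p-value and stopping when none reaches p < 0.05; the variable names and data are hypothetical:

import numpy as np
from scipy import stats

def overall_f_pvalue(y, columns):
    # ANOVA p-value of the overall F test for y regressed on the given columns
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(columns))
    k = X.shape[1] - 1
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    f_c = ((sst - sse) / k) / (sse / (n - k - 1))
    return stats.f.sf(f_c, k, n - k - 1)

def forward_select(y, candidates, alpha=0.05):
    # candidates: dict of name -> 1-D array; returns the names selected, in order
    selected, cols = [], []
    remaining = dict(candidates)
    while remaining:
        pvals = {name: overall_f_pvalue(y, cols + [col]) for name, col in remaining.items()}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                      # no candidate reaches p < alpha: stop
        selected.append(best)
        cols.append(remaining.pop(best))
    return selected

# hypothetical data: X1 is irrelevant by construction, X2 and X3 drive Y
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 40))
y = 3 + 2 * x2 - 1.5 * x3 + rng.normal(scale=0.5, size=40)
print(forward_select(y, {"X1": x1, "X2": x2, "X3": x3}))   # typically ['X2', 'X3']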

Example: Consider again the data on sales, the number of advertisements published, and the advertising expenditure for the 12 weeks given earlier. Fit a regression model to predict the sales.

Summary Output 1: Sales Vs. No_Adv

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .781a   .610       .571                4.20570
a. Predictors: (Constant), No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   276.308          1    276.308       15.621   .003 (a)
Residual     176.879          10   17.688
Total        453.187          11
a. Predictors: (Constant), No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B        Std. Error   Beta   t       Sig.
(Constant)    16.937   4.982               3.400   .007
No_Adv        2.083    .527         .781   3.952   .003
a. Dependent Variable: Sales


Summary Output 2: Sales Vs. Ex_Adv

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .820a   .673       .640                3.84900
a. Predictors: (Constant), Ex_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   305.039          1    305.039       20.590   .001 (a)
Residual     148.148          10   14.815
Total        453.187          11
a. Predictors: (Constant), Ex_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.
(Constant)    4.173   7.109               .587    .570
Ex_Adv        2.872   .633         .820   4.538   .001
a. Dependent Variable: Sales

Summary Output 3: Sales Vs. No_Adv & Ex_Adv

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .827a   .684       .614                3.98888
a. Predictors: (Constant), Ex_Adv, No_Adv

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F       Sig.
Regression   309.986          2    154.993       9.741   .006 (a)
Residual     143.201          9    15.911
Total        453.187          11
a. Predictors: (Constant), Ex_Adv, No_Adv
b. Dependent Variable: Sales

Coefficients (a)
Model 1       B       Std. Error   Beta   t       Sig.
(Constant)    6.584   8.542               .771    .461
No_Adv        .625    1.120        .234   .558    .591
Ex_Adv        2.139   1.470        .611   1.455   .180
a. Dependent Variable: Sales

Is the new model improved?

Qualitative Independent Variables


Johnson Filtration, Inc., provides maintenance
service for water filtration systems throughout
southern Florida.
To estimate the service time and the service cost,
the managers want to predict the repair time
necessary for each maintenance request.
Repair time is believed to be related to two factors:
the number of months since the last maintenance service, and
the type of repair problem (mechanical or electrical).


For these data (given below), using the least squares method, we fitted the model
Ŷ = 2.1473 + 0.3041 X1,   with R² = 0.534.
At the 5% level of significance, we reject
H0: β0 = 0 (using the t test)
H0: β1 = 0 (using the t and F tests)
X1 alone explains 53.4% of the variability in repair time.
To introduce the type of repair into the model, we define a dummy variable
X2 = 0 if the type of repair is mechanical; X2 = 1 if the type of repair is electrical.
The regression model that uses X1 and X2 to regress Y is
Y = β0 + β1 X1 + β2 X2 + ε


Data for a sample of 10 service calls are given:

Service Call   Months Since Last Service   Type of Repair   Repair Time in Hours
1              2                           electrical       2.9
2              6                           mechanical       3.0
3              8                           electrical       4.8
4              3                           mechanical       1.8
5              2                           electrical       2.9
6              7                           electrical       4.9
7              9                           mechanical       4.2
8              8                           mechanical       4.8
9              4                           electrical       4.4
10             6                           electrical       4.5

Let Y denote the repair time and X1 the number of months since the last maintenance service.
The regression model that uses X1 only to regress Y is
Y = β0 + β1 X1 + ε
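A minimal sketch in Python with NumPy (illustrative, not from the slides) that encodes the repair-type dummy and fits both models; the X1-only fit should be close to Ŷ = 2.1473 + 0.3041 X1, while the coefficients of the two-variable model are not quoted in these slides:

import numpy as np

months = np.array([2, 6, 8, 3, 2, 7, 9, 8, 4, 6], dtype=float)
repair_type = ["electrical", "mechanical", "electrical", "mechanical", "electrical",
               "electrical", "mechanical", "mechanical", "electrical", "electrical"]
hours = np.array([2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5])

x2 = np.array([1.0 if t == "electrical" else 0.0 for t in repair_type])   # the dummy variable

# Model with X1 only: should give approximately Y = 2.1473 + 0.3041 X1
X1 = np.column_stack([np.ones(len(hours)), months])
print(np.linalg.lstsq(X1, hours, rcond=None)[0])

# Model with X1 and the dummy X2: Y = b0 + b1 X1 + b2 X2
X12 = np.column_stack([np.ones(len(hours)), months, x2])
print(np.linalg.lstsq(X12, hours, rcond=None)[0])   # b2 estimates the electrical-vs-mechanical shift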



Summary
Multiple linear regression model: Y = Xβ + ε
The least squares estimate of β is given by b = (X′X)⁻¹X′Y
R² and adjusted R²
Using ANOVA (the F test), we examine whether all βs are zero or not.
A t test is conducted for each regressor separately.
Using the t test, we examine whether the β corresponding to that regressor is zero or not.
Problem of multicollinearity: VIF, eigenvalues
Dummy variables
Examining the assumptions: common variance, independence, normality
