
# MULTICOLLINEARITY & AUTOCORRELATION

Multicollinearity
• The theory of causation and multiple causation
• Interdependence between the Independent Variables and
variability of Dependent Variables
• Parsimony and Linear Regression
• Theoretical consistency and Parsimony

[Venn diagram: the overlap of Y with the explanatory variables X1–X5 indicates the degree of multicollinearity]
• One of the assumptions of the CLRM is
that there is no Multicollinearity
amongst the explanatory variables.
• Multicollinearity refers to a perfect or exact linear relationship among some or all of the explanatory variables.
Expl.:

| X1 | X2  | X2* |
|----|-----|-----|
| 10 | 50  | 52  |
| 15 | 75  | 75  |
| 18 | 90  | 97  |
| 24 | 120 | 129 |
| 30 | 150 | 152 |

X2i = 5X1i, and X2* was created by adding 2, 0, 7, 9 & 2 (from a random number table) to X2.
Here r(X1, X2) = 1 and r(X2, X2*) = 0.99.
X1 & X2 show perfect multicollinearity; X2 & X2* show near-perfect multicollinearity.
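The correlations quoted above can be checked with a short pure-Python routine (no libraries assumed) applied to the table's data:

```python
# Pearson correlation for the small dataset above (pure Python).
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

X1      = [10, 15, 18, 24, 30]
X2      = [50, 75, 90, 120, 150]   # X2 = 5 * X1 -> perfect collinearity
X2_star = [52, 75, 97, 129, 152]   # X2 plus small random numbers

print(round(pearson(X1, X2), 4))        # 1.0    (perfect)
print(round(pearson(X2, X2_star), 4))   # 0.9959 (near-perfect)
```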
• The problem of multicollinearity, and its degree, varies with the type of data.
• The overlap between the variables indicates its extent, as shown in the Venn diagram.
Example:
Y = a + b1X1 + b2X2 + u
where
Y = Consumption Expenditure
X1 = Income, X2 = Wealth
Consumption expenditure depends on income (X1) and wealth (X2).
• The estimated equation from a set of data is as follows:
Ŷ = 24.77 + 0.94X1 – 0.04X2
't': (3.66) (1.14) (0.52)
R² = 0.96, adjusted R² = 0.95, F = 92.40
The individual β coefficients are not significant although the 'F' value suggests a high degree of association.
The coefficient of X2 has the wrong sign.
The fact that the ‘F’ test is significant but the ‘t’
values of X1 and X2 are individually
insignificant means that the two variables are
so highly correlated that it is impossible to
isolate the individual impact of either income
or wealth on consumption.
Let us regress X2 on X1:
X2 = 7.54 + 10.19X1
't' = (0.25) (62.04), R² = 0.99
This shows near-perfect multicollinearity between X2 and X1.
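An auxiliary regression like this is easy to reproduce with a small least-squares routine. The income/wealth figures below are invented for illustration (they are not the data behind the slide's estimates); wealth is constructed to track roughly 10 × income, so the auxiliary R² comes out near 1:

```python
# Auxiliary regression of one regressor on the other (simple OLS, pure Python).
# The income/wealth values are illustrative, not the slide's dataset.

def simple_ols(x, y):
    """Return (intercept, slope, R^2) for y = a + b*x by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    a0 = my - b * mx
    ss_res = sum((yi - (a0 + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a0, b, 1 - ss_res / ss_tot

income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
# wealth tracks income almost exactly (wealth ≈ 10 * income):
wealth = [810, 1009, 1213, 1398, 1605, 1792, 2011, 2195, 2404, 2598]

a, b, r2 = simple_ols(income, wealth)
print(round(b, 2), round(r2, 4))  # 9.94 0.9999
```

An auxiliary R² this close to 1 is the signal that one regressor is nearly a linear function of the other.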
Y on X1:                          Y on X2:
Ŷ = 24.24 + 0.51X1                Ŷ = 24.41 + 0.05X2
't' = (3.81) (14.24)              't' = (3.55) (13.29)
R² = 0.96                         R² = 0.96

Dropping the highly collinear variable has made the other variable significant.
Sources of Multicollinearity
• Data collection method employed:
Sampling over a limited range of the values taken by
the regressors in the population
• Constraints on the model or in the population being sampled:
Regression of electricity consumption on income and house size. There is a constraint: families with higher income tend to have larger homes.
• Model specification:
Adding polynomial terms to a model when the range of the X variable is small.
• An overdetermined model:
This happens when the model has more explanatory variables than the number of observations.
• Use of time series data:
The regressors may share a common trend.
Practical Consequences of Multicollinearity:
In cases of near-perfect or high multicollinearity one is likely to encounter the following consequences:
1. The OLS estimators have large variances and covariances, making precise estimation difficult.
2. (a) Because of '1', the confidence intervals tend to be much wider, leading to the acceptance of the "zero" null hypothesis (i.e., that the true population coefficient is zero).
   (b) Because of '1', the 't' ratios of one or more coefficients tend to be statistically insignificant.
3. Although the 't' ratio(s) of one or more coefficients is/are statistically insignificant, R², the overall measure of goodness of fit, can be very high.
4. The OLS estimators and their S.E.s can be sensitive to small changes in the data.
Detection of Multicollinearity
1. High R² but few significant 't' values
2. High pairwise correlation amongst regressors (seen from the correlation matrix)
3. Examination of partial correlations
4. Auxiliary regressions and F-test (regress each Xi on the remaining Xs, find the 'F' values and decide)
5. Eigenvalues and condition index
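Method 4 is often summarised by the variance inflation factor, VIF = 1/(1 − R²aux), where R²aux comes from regressing one regressor on the others. As a minimal sketch for the two-regressor case (where R²aux is just the squared pairwise correlation), applied to the X2/X2* data from the earlier table:

```python
# Variance inflation factor from the auxiliary-regression R^2 (pure Python).
# With only two regressors, R^2_aux equals the squared pairwise correlation.
# A VIF above 10 is a common rule-of-thumb warning level.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) *
                  sqrt(sum((b - my) ** 2 for b in y)))

def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - R^2_aux); here R^2_aux = r^2."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

x2      = [50, 75, 90, 120, 150]
x2_star = [52, 75, 97, 129, 152]
print(round(vif_two_regressors(x2, x2_star)))  # 122: severe multicollinearity
```

With more than two regressors the auxiliary R² requires a multiple regression for each Xi; the condition-index method (item 5) needs eigenvalue routines and is not sketched here.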
Remedial Measures
1. A priori information and articulation
2. Dropping a highly collinear variable
3. Transformation of data
4. Additional information or new data
5. Identifying the purpose and reducing the degree of multicollinearity; or simply identifying it, if the purpose is prediction.
AUTOCORRELATION

The assumption E(UU′) = σ²I implies:
• Each u distribution has the same variance (homoscedasticity)
• All disturbances are pairwise uncorrelated

This assumption gives

| Var(u1)     Cov(u1,u2)  ...  Cov(u1,un) |     | σ²  0   ...  0  |
| Cov(u2,u1)  Var(u2)     ...  Cov(u2,un) |  =  | 0   σ²  ...  0  |
| ...         ...         ...  ...        |     | ... ... ...  ...|
| Cov(un,u1)  Cov(un,u2)  ...  Var(un)    |     | 0   0   ...  σ² |

i.e., E(ui uj) = 0 for i ≠ j.
This assumption when violated leads to:
1. Heteroscedasticity
2. Autocorrelation
Covariance is the measure of how much two
random variables vary together (as distinct from
variance, which measures how much a single
variable varies.)
The covariance between two random variables, say X and Y, is defined as
Cov(X, Y) = E[(X − μX)(Y − μY)]
where μX and μY are the expected values of X and Y respectively.
If X and Y are independent, their covariance is zero.
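The definition translates directly into code. A small sketch with made-up samples, using the population form (dividing by n):

```python
# Covariance of two small samples, Cov(X, Y) = E[(X - mu_x)(Y - mu_y)].

def cov(x, y):
    """Population covariance of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

X      = [1, 2, 3, 4, 5]
Y_up   = [2, 4, 6, 8, 10]   # moves with X -> positive covariance
Y_flat = [7, 7, 7, 7, 7]    # does not vary with X -> zero covariance

print(cov(X, Y_up), cov(X, Y_flat))  # 4.0 0.0
```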
The assumption implies that the disturbance
term relating to any observation is not
influenced by the disturbance term relating
to any other observation.
For example:

1. Suppose we are dealing with quarterly time series data and a regression of the following specification (time series data):
Output (Q) = f(Labour and Capital Inputs)

Q      L    K    U
Q1.1   L1   K1   U1
Q1.2   L2   K2   U2
Q1.3   L3   K3   U3
Q1.4   L4   K4   U4
Q2.1   ...  ...  ...
...    ...  ...  ...
Qn.4   L4n  K4n  U4n

If output in one quarter (say Q1.3) is affected due to a labour strike, there is no reason to believe that this disturbance will be carried over to U4.
2. Let
Family Consumption Expenditure = f(Income)
(a regression involving cross-section data)

Family Expenditure   Income   Disturbance
F1                   I1       U1
F2                   I2       U2
...                  ...      ...
Fn                   In       Un

The effect of an increase in one family's income on its consumption expenditure is not expected to affect the consumption expenditure of another family.
The reality:
1. Disruption caused by a strike may carry over and affect production in the following quarter.
2. The consumption expenditure of one family may influence that of another family, i.e., keeping up with the Joneses – the demonstration effect.

• Autocorrelation is a feature of most time-series data.
• In cross-section data it is referred to as spatial autocorrelation.
Why it may occur
1. Inertia:
A salient feature of most economic series is inertia. Time series data such as PCI, price indices, production, profit, employment, etc. exhibit cycles. Starting at the bottom of a recession, when economic recovery begins, most of these series move upward. In this upswing the value of the series at one point of time is greater than its previous value. Thus there is a "momentum" built into them, and it continues until something [an intervention] happens to slow them down. Therefore, in regressions involving time series data, successive observations are likely to be interdependent, which shows up as a systematic pattern in the uis.
2. Specification bias:
Excluded variable(s) or incorrect functional form.
a) When some relevant variables have been excluded from the model, their influence will show up as a systematic pattern in the uis.
b) In the case of an incorrect functional form, i.e., fitting a linear function when the true relationship is non-linear (or vice versa), the dependent variable will be either overestimated or underestimated, which will have a systematic impact on the uis.
Example:
(Correct)   MC = β1 + β2 output + β3 (output)² + ui
(Incorrect) MC = b1 + b2 output + vi
where vi = β3(output)² + ui, and hence vi catches the systematic effect of (output)² on MC, leading to serial correlation in the vis.
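This effect is easy to demonstrate. In the sketch below, marginal cost is generated from a quadratic (the coefficients 10, 2 and 0.5 are made up), a straight line is fitted by OLS, and the residual signs form long runs (+ + − − ... − + +) instead of alternating randomly:

```python
# Omitted (output)^2 term: fit a straight line to quadratic data and
# observe the systematic sign pattern in the residuals.

def simple_ols_residuals(x, y):
    """Residuals from the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    a0 = my - b * mx
    return [yi - (a0 + b * xi) for xi, yi in zip(x, y)]

output = list(range(1, 11))
mc = [10 + 2 * q + 0.5 * q * q for q in output]  # true quadratic MC

res = simple_ols_residuals(output, mc)
signs = ''.join('+' if r > 0 else '-' for r in res)
print(signs)  # ++------++ : long runs, a systematic pattern
```

Only two sign changes in ten residuals: the straight line underestimates MC at both ends and overestimates it in the middle, exactly the systematic pattern described above.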
3. Cobweb Phenomenon:
The supply of many agricultural commodities reflects the so-called "Cobweb Phenomenon", where supply reacts to price with a lag of one time period because supply decisions take time to implement (the gestation period).
Expl.: At the beginning of this year's planting of crops, farmers are influenced by the price prevailing last year. Suppose at the end of period 't' the price Pt turns out to be lower than Pt−1. In period t+1 farmers may therefore decide to produce less than they did in period 't'. Such phenomena give a systematic pattern to the uis.
A similar problem arises with household expenditure, share prices, etc. In general, when a lagged variable is (as in many cases) not included in the model, the uis are correlated.
4. Manipulation of time series data:
(i) Extrapolation of the values of variables like population gives rise to serial dependence among successive uis.
(ii) Very often we use projected population figures to arrive at per capita figures for a macro-variable; using such figures in a forecasting regression makes the successive uis serially correlated.
Consequences (proofs are not given):
In the presence of autocorrelation in a model:
a) The residual variance is likely to underestimate the true σ².
b) R² is likely to be overestimated.
c) The OLS estimators, although linear and unbiased, do not have minimum variance; the usual 't' and 'F' tests are therefore not valid and, if applied, are likely to give misleading conclusions.
Detection of autocorrelation:
The assumption of the CLRM relates to the population disturbance terms, which are not directly observable. Therefore their proxies, the ûis obtained from OLS, are examined for the presence/absence of autocorrelation.
There are various methods. Some of them are:
1. Graphical method
2. Runs test (a non-parametric test) – examines the signs of the residuals
3. DW statistic, to which a decision rule is applied
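The DW statistic in method 3 can be computed directly from the residuals as d = Σ(êt − êt−1)² / Σêt². The two residual series below are made up to show the extremes:

```python
# Durbin-Watson statistic on a residual series (pure Python).
# d near 2 -> no first-order autocorrelation;
# d near 0 -> positive autocorrelation; d near 4 -> negative.

def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(r ** 2 for r in e)

smooth = [1, 2, 3, 3, 2, 1, -1, -2, -3, -3, -2, -1]  # slowly drifting residuals
jumpy  = [1, -1, 1, -1, 1, -1, 1, -1]                # sign-flipping residuals

print(round(durbin_watson(smooth), 2))  # 0.21: positive autocorrelation
print(round(durbin_watson(jumpy), 2))   # 3.5: negative autocorrelation
```

In practice d is compared against the DW lower and upper critical bounds (dL, dU) for the given n and k to reach a decision.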

Remedial Measures:
Data transformation by
a) First-difference method (Xt − Xt−1); one degree of freedom is lost.
b) 'ρ'-transformation: ρ is estimated, and the transformed model becomes
(Yt − ρYt−1) = β1(1 − ρ) + β2(Xt − ρXt−1) + (ut − ρut−1)
This is known as the generalised or quasi-difference equation.
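The quasi-difference transformation can be sketched as follows; the value ρ = 0.5 and the short Y/X series are made-up numbers for illustration only (in practice ρ is estimated from the residuals):

```python
# Quasi-difference (rho) transformation of a series, z_t = s_t - rho * s_{t-1}.
# rho = 0.5 and the Y/X values are illustrative, not estimated.

def quasi_difference(series, rho):
    """Transformed series; the first observation is lost."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]

rho = 0.5
Y = [10.0, 12.0, 15.0, 14.0, 16.0]
X = [5.0, 6.0, 8.0, 7.0, 9.0]

Y_star = quasi_difference(Y, rho)
X_star = quasi_difference(X, rho)
print(Y_star)  # [7.0, 9.0, 6.5, 9.0]
```

Running OLS of Y* on X* then estimates β2 directly, with β1(1 − ρ) as the intercept.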
Exercise 4 (refer to Ch. 10 & 12 of DNG)
• Use time series data in MR
• Find the correlation table
• See the extent of multicollinearity
• Test for autocorrelation
• If autocorrelation is present, use the 'ρ'-transformation
• After addressing both problems, calculate the forecast error and select the equation which gives the minimum forecast error.