
Ordinary Least Squares at an advanced level

1. Review of the two-variate case with algebra


OLS is the fundamental estimation technique for linear regression. You should by now be familiar with the two-variate
case and the usual derivations. In this text we review OLS using matrix algebra, which is
the right tool for a more general (multivariate) view of the OLS methodology.
In the standard two-variate case we had the following model for the population:

$$Y_i = \beta_0 + \beta_1 X_i + e_i \quad (1.1)$$
where $e$ is the error, $X$ is the explanatory variable, $Y$ is the dependent variable and the betas denote the population coefficients. This is called the Population Regression Equation.
What we have is only a sample (sample observations are denoted by lowercase letters):

$$y_i = \hat\beta_0 + \hat\beta_1 x_i + u_i \quad (1.2)$$
where $u$ is the residual, which is our estimate of the error. This is the Sample Regression Equation. Here $y$ and $x$ are a sample drawn from the population $Y$ and $X$, and the beta hats are our estimates of the betas (the population parameters) from the sample.
We arrived at the OLS estimates of the beta coefficients by the least squares principle, minimizing the sum of squared residuals (SSR):

$$SSR = \sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2 \quad (1.3)$$
The first order conditions for a minimum require that:

$$\frac{\partial SSR}{\partial \hat\beta_0} = -2\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = -2\sum_{i=1}^{n} u_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} u_i = 0 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} u_i = 0 \;\Rightarrow\; E(u) = 0 \quad (1.4)$$
In other words, the residual of the OLS will always have zero mean, provided we include an intercept or constant term.
$$\frac{\partial SSR}{\partial \hat\beta_1} = -2\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)x_i = -2\sum_{i=1}^{n} u_i x_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} u_i x_i = 0 \;\Rightarrow\; \frac{1}{n}\sum_{i=1}^{n} u_i x_i = 0 \;\Rightarrow\; E(ux) = 0 \quad (1.5)$$
This is the sample version of the orthogonality or exogeneity condition. By construction, the OLS residuals are uncorrelated with the explanatory variable in the sample; if the errors and the explanatory variable are correlated in the population, the OLS estimator is biased.
From the above conditions:

$$\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right) = \sum_{i=1}^{n} y_i - n\hat\beta_0 - \hat\beta_1\sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \hat\beta_0 = \frac{1}{n}\sum_{i=1}^{n} y_i - \hat\beta_1\frac{1}{n}\sum_{i=1}^{n} x_i = \bar{y} - \hat\beta_1\bar{x} \quad (1.6)$$
$$\sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)x_i = \sum_{i=1}^{n} y_i x_i - \hat\beta_0\sum_{i=1}^{n} x_i - \hat\beta_1\sum_{i=1}^{n} x_i^2 = 0 \;\Rightarrow\; \hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \hat\beta_0\sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2} = \frac{E(xy) - \hat\beta_0\,\bar{x}}{E(x^2)} \quad (1.7)$$

We can now substitute our estimate for the intercept into the above expression to arrive at the OLS
estimator for the slope coefficient.
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{y}\,\bar{x}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{E(xy) - \bar{y}\,\bar{x}}{E(x^2) - \bar{x}\,\bar{x}} \quad (1.8)$$
These are all equivalent and we can use any of these as we please.
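To make this concrete, here is a minimal Python/NumPy sketch (not part of the original derivation) that computes the slope with each of the equivalent forms in (1.8) on simulated data and checks that they coincide; the population values 1.5 and 0.8 are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10, 2, n)
e = rng.normal(0, 1, n)
y = 1.5 + 0.8 * x + e          # simulated two-variate model

xbar, ybar = x.mean(), y.mean()

# the equivalent forms of the slope estimator in (1.8)
b1_a = (np.sum(x * y) - n * ybar * xbar) / (np.sum(x**2) - n * xbar**2)
b1_b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b1_c = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)

b0 = ybar - b1_b * xbar        # intercept from (1.6)
print(b1_a, b1_b, b1_c, b0)    # the three slope values coincide
```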
You should be able to reproduce the above derivation without any difficulty before advancing further. If you
do not understand a step, you will find plenty of help on the internet. The idea is not that you memorize
the derivations but that you can arrive at valid results starting from the same assumptions; in other words,
that you can reproduce the derivations. Let me share my experience with you: you completely understand
something only if you can derive it. Knowing only the big picture will help you advance faster initially,
but it will hold you back from some point on.
Let us look at the properties of the OLS now using the two-variate case!
1. Linearity: First of all we will show why the regression as in (1.2) is linear. If the parameters can be
expressed as a linear combination or weighted average of the observations of the dependent
variable, we call the regression linear. We take one version of the estimator:
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sum_{i=1}^{n} k_i y_i \quad (1.9), \qquad \text{where } k_i = \frac{x_i - \bar{x}}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad (1.10).$$
So every single observation of the dependent variable is going to affect our estimate of the slope parameter $\hat\beta_1$ by a unique weight. The weight depends on the variance of the explanatory variable and on the deviation of the explanatory variable at observation i from its mean.
Properties of $k_i$:
$$\sum_{i=1}^{n} k_i = \frac{\sum_{i=1}^{n}(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 0, \qquad \sum_{i=1}^{n} k_i^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^2} = \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2},$$
$$\sum_{i=1}^{n} k_i x_i = \frac{\sum_{i=1}^{n}(x_i - \bar{x})x_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 1, \qquad \sum_{i=1}^{n} k_i (x_i - \bar{x}) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = 1.$$
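A small sketch, assuming the same kind of simulated data as before, that verifies these properties of the weights $k_i$ numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(10, 2, n)
y = 1.5 + 0.8 * x + rng.normal(0, 1, n)

k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)    # the weights in (1.10)

print(np.isclose(k.sum(), 0))                        # sum k_i = 0
print(np.isclose(np.sum(k**2), 1 / np.sum((x - x.mean())**2)))
print(np.isclose(np.sum(k * x), 1))                  # sum k_i x_i = 1
print(np.isclose(np.sum(k * y),                      # beta1_hat = sum k_i y_i, eq. (1.9)
                 np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)))
```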

2. Unbiasedness: An estimator is unbiased if its expected value equals the population parameter, that is, $E(\hat\beta) = \beta$. If there is a difference, then that difference is called the bias.

We show the unbiasedness first. The trick is that since the population regression function (or the Data Generating Process) is $Y_i = \beta_0 + \beta_1 X_i + e_i$, we also have $y_i = \beta_0 + \beta_1 x_i + e_i$ for the sample observations. This can be substituted into the estimator to derive the relationship between the population parameter and our estimate.
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\left(\beta_0 + \beta_1 x_i + e_i - \beta_0 - \beta_1\bar{x} - \bar{e}\right)}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})e_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \beta_1 + \sum_{i=1}^{n} k_i e_i \quad (1.11)$$
Hence $E(\hat\beta_1) = \beta_1 + \dfrac{\sigma_{xe}}{\sigma_x^2}$, so if the error and the explanatory variable are uncorrelated (the orthogonality condition again), the OLS estimator is unbiased: $E(\hat\beta_1) = \beta_1$.
Similarly, $\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} = \beta_0 + \left(\beta_1 - \hat\beta_1\right)\bar{x} + \bar{e}$, hence $E(\hat\beta_0) = \beta_0 + E\left(\beta_1 - \hat\beta_1\right)\bar{x}$, so if $\hat\beta_1$ is unbiased, $\hat\beta_0$ is unbiased too.
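A hedged Monte Carlo sketch of this result: drawing many samples from a known population model, the average of the slope estimates is close to the true $\beta_1$ (all population values below are arbitrary illustration choices).

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, n, reps = 1.5, 0.8, 100, 5000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.normal(10, 2, n)
    e = rng.normal(0, 1, n)            # error uncorrelated with x (exogeneity)
    y = beta0 + beta1 * x + e
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

print(slopes.mean())   # close to beta1 = 0.8, illustrating E(beta1_hat) = beta1
```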
3. Efficiency: This concept can only be understood in a relative sense. An estimator has a standard deviation (called its standard error), since with any new sample drawn from the same population you will obtain different estimates of the population parameter. If you have two alternative estimators, the one with the lower standard error is called more efficient. There is a lower limit on standard errors, given by the Cramér-Rao lower bound (in other words, no estimator can have less variance than this limit). If the estimator's variance is at the lower bound, then we call it efficient.
We can express the variance of the OLS estimator for $\hat\beta_1$ as follows:
$$\sigma_{\hat\beta_1}^2 = E\left[\left(\hat\beta_1 - E(\hat\beta_1)\right)^2\right] = E\left[\left(\sum_{i=1}^{n} k_i e_i\right)^2\right] \quad (1.12)$$
If the orthogonality condition holds, $E\left[\left(\sum_{i=1}^{n} k_i e_i\right)^2\right] = E\left[\sum_{i=1}^{n} k_i^2 e_i^2\right]$, since the cross products are all zero. If the error is homoscedastic, then $E(e_i^2) = \sigma_e^2$, and we can treat it as a constant and bring it in front of the summation sign:
$$\sigma_{\hat\beta_1}^2 = E\left[\sum_{i=1}^{n} k_i^2 e_i^2\right] = \sigma_e^2\,E\left[\sum_{i=1}^{n} k_i^2\right] = \frac{\sigma_e^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \quad (1.13)$$

The problem is that we do not know the standard deviation of the error. But we can use the residual variance (or mean squared residual) as an estimator of the error variance: $\hat\sigma_e^2 = \dfrac{\sum u_i^2}{n-2}$.
But why does our OLS estimator of the population error variance equal $\dfrac{\sum u_i^2}{n-2}$ in a two-variate regression? Everyone seems to accept this, yet it is only rarely derived algebraically (it is quite simple to do with matrix algebra, though). Let us see the derivation!
First, you will need some uncomfortable algebra to express the residual as a function of the error.
$$u_i = y_i - \hat\beta_0 - \hat\beta_1 x_i = \beta_0 + \beta_1 x_i + e_i - \left(\bar{y} - \hat\beta_1\bar{x}\right) - \hat\beta_1 x_i = \beta_0 + \beta_1 x_i + e_i - \left(\beta_0 + \beta_1\bar{x} + \bar{e}\right) + \hat\beta_1\bar{x} - \hat\beta_1 x_i = \left(e_i - \bar{e}\right) - \left(\hat\beta_1 - \beta_1\right)\left(x_i - \bar{x}\right)$$
Now we take the squared residual:
$$u_i^2 = \left(e_i - \bar{e}\right)^2 + \left(\hat\beta_1 - \beta_1\right)^2\left(x_i - \bar{x}\right)^2 - 2\left(\hat\beta_1 - \beta_1\right)\left(x_i - \bar{x}\right)\left(e_i - \bar{e}\right)$$
and the sum of squared residuals is then:
$$\sum_{i=1}^{n} u_i^2 = \sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2 + \left(\hat\beta_1 - \beta_1\right)^2\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 - 2\left(\hat\beta_1 - \beta_1\right)\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(e_i - \bar{e}\right)$$
Now we need the expectation of the sum of squared residuals:
$$E\left[\sum_{i=1}^{n} u_i^2\right] = E\left[\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2\right] + E\left[\left(\hat\beta_1 - \beta_1\right)^2\right]\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 - 2\,E\left[\left(\hat\beta_1 - \beta_1\right)\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(e_i - \bar{e}\right)\right]$$
$$= (n-1)\sigma_e^2 + \sigma_e^2 - 2\,E\left[\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)e_i}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(e_i - \bar{e}\right)\right] = (n-1)\sigma_e^2 + \sigma_e^2 - 2\,\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 E(e_i^2)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
$$= n\sigma_e^2 - 2\sigma_e^2 = (n-2)\sigma_e^2 \quad (1.14)$$
where we made use of the following:
$$E\left[\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2\right] = E\left[\sum_{i=1}^{n} e_i^2 - n\bar{e}^2\right] = \sum_{i=1}^{n} E(e_i^2) - n\,E\!\left[\left(\frac{1}{n}\sum_{i=1}^{n} e_i\right)^2\right] = n\sigma_e^2 - \frac{1}{n}\sum_{i=1}^{n} E(e_i^2) = (n-1)\sigma_e^2$$
where the conversion $E\left[\left(\sum_{i=1}^{n} e_i\right)^2\right] = \sum_{i=1}^{n} E(e_i^2)$ is true under the assumption of no autocorrelation.
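A quick simulation sketch of (1.14): across repeated samples the average of $\sum u_i^2/(n-2)$ is close to the true error variance, while dividing by $n$ understates it (all numbers are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma2 = 30, 20000, 4.0

ssr = np.empty(reps)
for r in range(reps):
    x = rng.normal(10, 2, n)
    y = 1.0 + 0.5 * x + rng.normal(0, np.sqrt(sigma2), n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    u = y - b0 - b1 * x
    ssr[r] = np.sum(u**2)

print((ssr / (n - 2)).mean())   # approximately sigma2 = 4.0 (unbiased)
print((ssr / n).mean())         # systematically below 4.0
```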

The standard error of the constant term can be derived as follows:
$$\hat\beta_0 - \beta_0 = -\left(\hat\beta_1 - \beta_1\right)\bar{x} + \bar{e} = \bar{e} - \bar{x}\sum_{i=1}^{n} k_i e_i, \quad \text{hence}$$
$$\left(\hat\beta_0 - \beta_0\right)^2 = \bar{e}^2 + \bar{x}^2\left(\sum_{i=1}^{n} k_i e_i\right)^2 - 2\,\bar{e}\,\bar{x}\sum_{i=1}^{n} k_i e_i \quad \text{and}$$
$$E\left[\left(\hat\beta_0 - \beta_0\right)^2\right] = \frac{\sigma_e^2}{n} + \frac{\bar{x}^2\sigma_e^2}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
(the cross term vanishes in expectation because $\sum_{i=1}^{n} k_i = 0$). Let us remember that $\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$, i.e. $\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$. Substituting this into the previous equation gives us:
$$\sigma_{\hat\beta_0}^2 = E\left[\left(\hat\beta_0 - \beta_0\right)^2\right] = \frac{\sigma_e^2}{n} + \frac{\bar{x}^2\sigma_e^2}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} = \frac{\sigma_e^2\sum_{i=1}^{n} x_i^2}{n\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \quad (1.15)$$

The Cramér-Rao lower bound


The Cramér-Rao lower bound for the variance of an estimator $\hat\theta$ is expressed as: $\sigma_{\hat\theta}^2 \geq \dfrac{1}{I(\theta)}$, where $\theta$ denotes the population parameter to be estimated. $I(\theta)$ is the Fisher information, which is defined as:
$$I(\theta) = E\left[\left(\frac{\partial \ell(x,\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \ell(x,\theta)}{\partial \theta^2}\right],$$
where $\ell(x,\theta)$ is the log-likelihood function, and we have a single parameter to estimate.


For example, suppose the dependent variable Y follows a normal distribution and we estimate its population mean only ($\mu_Y$).¹ Then the log-likelihood function is:
$$\ell(Y, \mu_Y) = -\frac{n}{2}\ln\left(2\pi\sigma_e^2\right) - \frac{1}{2\sigma_e^2}\sum_{i=1}^{n}\left(Y_i - \mu_Y\right)^2$$
$$\frac{\partial \ell(Y, \mu_Y)}{\partial \mu_Y} = \frac{1}{\sigma_e^2}\sum_{i=1}^{n}\left(Y_i - \mu_Y\right), \quad \text{also known as the score function,}$$
$$\frac{\partial^2 \ell(Y, \mu_Y)}{\partial \mu_Y^2} = -\frac{n}{\sigma_e^2}, \quad \text{hence} \quad \operatorname{var}(\hat\mu_Y) \geq \frac{\sigma_e^2}{n}.$$
You should remember that when the sample mean is used as an estimator of the population mean, its standard error was $\sigma_{\bar{y}}^2 = \dfrac{\sigma_e^2}{n}$, hence the sample mean is at the Cramér-Rao lower bound and is an efficient estimator of the population mean.

¹ Actually, the standard deviation is also a parameter to estimate, but it is independent of the mean, so I disregard it now.

What if, as is usually the case, we have a vector of parameters to estimate (i.e. multiple parameters)? Then we have the Fisher information matrix, the (i,j)-th element of which is:
$$I(\boldsymbol\theta)_{i,j} = E\left[\frac{\partial \ell(\mathbf{x},\boldsymbol\theta)}{\partial \theta_i}\frac{\partial \ell(\mathbf{x},\boldsymbol\theta)}{\partial \theta_j}\right] = -E\left[\frac{\partial^2 \ell(\mathbf{x},\boldsymbol\theta)}{\partial \theta_i\,\partial \theta_j}\right].$$

For example, if we have the PRF (1.1) and e is assumed to be normally distributed, then
$$\ell(X, \beta_0, \beta_1) = -\frac{n}{2}\ln\left(2\pi\sigma_e^2\right) - \frac{1}{2\sigma_e^2}\sum_{i=1}^{n} e_i^2 = -\frac{n}{2}\ln\left(2\pi\sigma_e^2\right) - \frac{1}{2\sigma_e^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)^2$$
$$\frac{\partial \ell}{\partial \beta_1} = \frac{1}{\sigma_e^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right)X_i, \qquad \frac{\partial^2 \ell}{\partial \beta_1^2} = -\frac{\sum_i X_i^2}{\sigma_e^2}, \qquad \frac{\partial^2 \ell}{\partial \beta_1\,\partial \beta_0} = -\frac{\sum_i X_i}{\sigma_e^2}$$
$$\frac{\partial \ell}{\partial \beta_0} = \frac{1}{\sigma_e^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_i\right), \qquad \frac{\partial^2 \ell}{\partial \beta_0^2} = -\frac{n}{\sigma_e^2}, \qquad \frac{\partial^2 \ell}{\partial \beta_0\,\partial \beta_1} = -\frac{\sum_i X_i}{\sigma_e^2}$$
$$I(\boldsymbol\theta) = \frac{1}{\sigma_e^2}\begin{pmatrix} n & \sum_i X_i \\ \sum_i X_i & \sum_i X_i^2 \end{pmatrix}, \qquad I(\boldsymbol\theta)^{-1} = \frac{\sigma_e^2}{n\sum_i\left(X_i - \bar{X}\right)^2}\begin{pmatrix} \sum_i X_i^2 & -\sum_i X_i \\ -\sum_i X_i & n \end{pmatrix}$$
Hence:
$$\sigma_{\hat\beta_0}^2 = \frac{\sigma_e^2\sum_i X_i^2}{n\sum_i\left(X_i - \bar{X}\right)^2} \qquad \text{and} \qquad \sigma_{\hat\beta_1}^2 = \frac{\sigma_e^2}{\sum_i\left(X_i - \bar{X}\right)^2},$$
which equal the variances (1.15) and (1.13) of the OLS estimates under the exogeneity, homoscedasticity and no-autocorrelation assumptions.

The Gauss-Markov theorem


If the following conditions are met:
1. $Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{j,i} + e_i$ (the model is linear)
2. $E(e) = 0$
3. $Var(e) = \sigma_e^2 < \infty$ (homoscedasticity)
4. $Cov(e_i, e_j) = 0$ for $i \neq j$ (no autocorrelation)
5. $Cov(X_j, e) = 0$ for any $X_j$ (exogeneity)
then the OLS is the best linear unbiased estimator, or BLUE. "Best" refers to the fact that it has the lowest variance among linear unbiased estimators; with normally distributed errors its standard errors reach the Cramér-Rao lower bound, so no unbiased estimator can have a lower standard error. This is a core result in statistics. Observe that the normality of the error term is not required, even though it is customary to list it among the assumptions of the Classical Linear Model; we only used it to derive the Cramér-Rao lower bound. Yet, since the coefficients are calculated as a weighted sum of observations drawn from the same probability distribution (y), their distribution should converge to the normal distribution according to the Central Limit Theorem (CLT).

2. OLS with matrix algebra


Let us define the following linear model in the population:

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{e} \quad \text{or} \quad \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} X_{11} & \cdots & X_{1k} \\ X_{21} & \cdots & X_{2k} \\ \vdots & & \vdots \\ X_{n1} & \cdots & X_{nk} \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} \quad (2.1)$$
We can estimate the vector of coefficients $\boldsymbol\beta$ using the least squares principle. The vector $\mathbf{e}$ denotes the vector of errors: $\mathbf{e} = \mathbf{y} - \mathbf{X}\boldsymbol\beta$.
Hence the sum of squared residuals (SSE) is
$$SSE = \mathbf{u}^T\mathbf{u} = \left(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta}\right)^T\left(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta}\right) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\hat\beta} - \boldsymbol{\hat\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\hat\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\hat\beta} = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\hat\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\hat\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\hat\beta} \quad (2.2)$$
which is a scalar. Here we made use of the fact that $\boldsymbol{\hat\beta}^T\mathbf{X}^T\mathbf{y} = \mathbf{y}^T\mathbf{X}\boldsymbol{\hat\beta}$, since both are scalars (their dimension is 1×1).
The first order condition for an extremum requires that:
$$\frac{\partial \mathbf{u}^T\mathbf{u}}{\partial \boldsymbol{\hat\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\hat\beta} = \mathbf{0} \quad (2.3)$$
or $\mathbf{X}^T\mathbf{X}\boldsymbol{\hat\beta} = \mathbf{X}^T\mathbf{y}$, which is called the normal equation.
Here I made use of the following rules of matrix differentiation:
$$\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}, \qquad \frac{\partial \mathbf{x}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{x}^T\left(\mathbf{A} + \mathbf{A}^T\right), \qquad \text{and if } \mathbf{A} \text{ is symmetric: } \frac{\partial \mathbf{x}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{x}^T\mathbf{A}$$
The vector of betas is hence:
$$\boldsymbol{\hat\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} \quad (2.4)$$
We can further differentiate (2.3) with respect to $\boldsymbol{\hat\beta}$ in order to check the second order condition and obtain:
$$\frac{\partial^2 \mathbf{u}^T\mathbf{u}}{\partial \boldsymbol{\hat\beta}\,\partial \boldsymbol{\hat\beta}^T} = 2\mathbf{X}^T\mathbf{X} > 0 \quad (2.5)$$
that is, we indeed have a minimum. It also follows that $\mathbf{X}^T\mathbf{X}$ must be invertible, hence it must be of full rank. This is only possible if the matrix $\mathbf{X}$ has full column rank, i.e., our explanatory variables are linearly independent (this is the condition of no multicollinearity).
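A minimal NumPy sketch of (2.4), assuming a simulated design matrix: the coefficients from the normal equations agree with those from a generic least-squares solver (solving the normal equations directly is fine for exposition; a QR-based solver is numerically safer in practice).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept + 2 regressors
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y, computed via the normal equations (2.4)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check with a generic least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)   # identical up to numerical precision
```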
It will make our life much easier if we introduce two important matrices. The first is the projection matrix (sometimes referred to as the hat matrix) $\mathbf{P}$: $\mathbf{P} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T$ (2.6); the second is the annihilator matrix $\mathbf{M}$: $\mathbf{M} = \mathbf{I}_n - \mathbf{P} = \mathbf{I}_n - \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T$ (2.7). These matrices are square ($n \times n$), symmetric, that is $\mathbf{P} = \mathbf{P}^T$ and $\mathbf{M} = \mathbf{M}^T$, and idempotent, i.e., $\mathbf{P}\mathbf{P} = \mathbf{P}$ and $\mathbf{M}\mathbf{M} = \mathbf{M}$.
Proof: $\mathbf{P}\mathbf{P} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T = \mathbf{P}$
$\mathbf{M}\mathbf{M} = \left(\mathbf{I}_n - \mathbf{P}\right)\left(\mathbf{I}_n - \mathbf{P}\right) = \mathbf{I}_n - 2\mathbf{P} + \mathbf{P}\mathbf{P} = \mathbf{I}_n - 2\mathbf{P} + \mathbf{P} = \mathbf{I}_n - \mathbf{P} = \mathbf{M}$
The projection matrix projects y onto the column vector space defined by the explanatory variables X. That is:
$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\hat\beta} = \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{P}\mathbf{y} \quad (2.8)$$
Hence the projection matrix contains the weights and plays the same role as the weights in (1.10).
$$\mathbf{M}\mathbf{y} = \left(\mathbf{I}_n - \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\right)\mathbf{y} = \mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta} = \mathbf{u} \quad (2.9)$$
$$\mathbf{M}\mathbf{X} = \left(\mathbf{I}_n - \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\right)\mathbf{X} = \mathbf{X} - \mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{X} = \mathbf{0} \quad (2.10)$$
and $\mathbf{M}\mathbf{y} = \mathbf{M}\left(\mathbf{X}\boldsymbol\beta + \mathbf{e}\right) = \mathbf{M}\mathbf{e} = \mathbf{u}$ (2.11).
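A sketch verifying these properties of P and M numerically on a small simulated design matrix (P is formed explicitly with `np.linalg.inv` purely for illustration; building the full n×n matrix is wasteful for large n):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection (hat) matrix (2.6)
M = np.eye(n) - P                        # annihilator matrix (2.7)

print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotent
print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetric
print(np.allclose(M @ X, 0))                          # MX = 0, eq. (2.10)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(P @ y, X @ beta_hat))               # Py = fitted values, eq. (2.8)
print(np.allclose(M @ y, y - X @ beta_hat))           # My = residuals, eq. (2.9)
```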
Unbiasedness: We can prove the unbiasedness of the OLS estimator as follows:
$$\boldsymbol{\hat\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\left(\mathbf{X}\boldsymbol\beta + \mathbf{e}\right) = \boldsymbol\beta + \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{e} \quad (2.12)$$
$$E\left(\boldsymbol{\hat\beta}\right) = \boldsymbol\beta + \left(\mathbf{X}^T\mathbf{X}\right)^{-1}E\left(\mathbf{X}^T\mathbf{e}\right) \quad (2.13)$$
That is, if $E\left(\mathbf{X}^T\mathbf{e}\right) = \mathbf{0}$ (exogeneity), the OLS estimates are unbiased.
Efficiency:
First we need the variance of the OLS estimator, with unbiasedness assumed. The variance of the estimator is then:
$$\boldsymbol\Sigma_{\hat\beta} = E\left[\left(\boldsymbol{\hat\beta} - \boldsymbol\beta\right)\left(\boldsymbol{\hat\beta} - \boldsymbol\beta\right)^T\right] = E\left[\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{e}\mathbf{e}^T\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right] = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T E\left(\mathbf{e}\mathbf{e}^T\right)\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1} \quad (2.14)$$
If the error is homoscedastic and not autocorrelated (this is a weak version of the condition of identically and independently distributed errors), then
$$E\left(\mathbf{e}\mathbf{e}^T\right) = \begin{pmatrix} \sigma_e^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_e^2 \end{pmatrix} = \sigma_e^2\mathbf{I}_n.$$
Hence (2.14) can be written in a much simpler form: $\boldsymbol\Sigma_{\hat\beta} = \sigma_e^2\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$.

Yet, we do not know the error variance, only the residual variance. We can, however, establish the relationship easily. Using (2.11):
$$\mathbf{u}^T\mathbf{u} = \mathbf{e}^T\mathbf{M}^T\mathbf{M}\mathbf{e} = \mathbf{e}^T\mathbf{M}\mathbf{e} \quad (2.15)$$
Since this is a scalar, its value equals its trace: $tr\left(\mathbf{u}^T\mathbf{u}\right) = tr\left(\mathbf{e}^T\mathbf{M}\mathbf{e}\right)$. For the trace there exists a rule regarding cyclic permutations, namely that
$$tr(\mathbf{ABCD}) = tr(\mathbf{DABC}) = tr(\mathbf{CDAB}) = \ldots \quad (2.16)$$
Using this rule we obtain that:
$$E\left[\mathbf{u}^T\mathbf{u}\right] = E\left[tr\left(\mathbf{e}^T\mathbf{M}\mathbf{e}\right)\right] = E\left[tr\left(\mathbf{e}\mathbf{e}^T\mathbf{M}\right)\right] = tr\left(E\left(\mathbf{e}\mathbf{e}^T\right)\mathbf{M}\right) = \sigma_e^2\,tr(\mathbf{M}) \quad (2.17)$$
But what is the trace of the annihilator matrix? The trace of an $n \times n$ identity matrix is n, and the trace of the projection matrix equals the rank of the matrix X, which is k:
$$tr(\mathbf{P}) = tr\left(\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\right) = tr\left(\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{X}\right) = tr(\mathbf{I}_k) = k. \quad \text{Hence: } tr(\mathbf{M}) = tr(\mathbf{I}_n) - tr(\mathbf{P}) = n - k \quad (2.18)$$
$$\hat\sigma_e^2 = \frac{\mathbf{u}^T\mathbf{u}}{n-k} \quad (2.19)$$
Here we obtain the same result for k < n parameters to be estimated as in (1.14) for k = 2.
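Combining (2.19) with $\boldsymbol\Sigma_{\hat\beta} = \sigma_e^2\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$ gives the usual coefficient standard errors. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u = y - X @ beta_hat

sigma2_hat = (u @ u) / (n - k)                  # residual variance, eq. (2.19)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # estimated covariance of beta_hat
std_err = np.sqrt(np.diag(cov_beta))            # coefficient standard errors

print(sigma2_hat, std_err)
```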

The effect of additional explanatory variables on the coefficient


Let us assume that we have two sets of regressors, $\mathbf{X}_1$ and $\mathbf{X}_2$. If we regress y on both sets of variables:
$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\hat\beta}_1 + \mathbf{X}_2\boldsymbol{\hat\beta}_2 + \mathbf{u}$$
The residual will be $\mathbf{u} = \mathbf{y} - \mathbf{X}_1\boldsymbol{\hat\beta}_1 - \mathbf{X}_2\boldsymbol{\hat\beta}_2$, and the sum of squared residuals is:
$$\mathbf{u}^T\mathbf{u} = \left(\mathbf{y}^T - \boldsymbol{\hat\beta}_1^T\mathbf{X}_1^T - \boldsymbol{\hat\beta}_2^T\mathbf{X}_2^T\right)\left(\mathbf{y} - \mathbf{X}_1\boldsymbol{\hat\beta}_1 - \mathbf{X}_2\boldsymbol{\hat\beta}_2\right) =$$
$$= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}_1\boldsymbol{\hat\beta}_1 - \mathbf{y}^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 - \boldsymbol{\hat\beta}_1^T\mathbf{X}_1^T\mathbf{y} + \boldsymbol{\hat\beta}_1^T\mathbf{X}_1^T\mathbf{X}_1\boldsymbol{\hat\beta}_1 + \boldsymbol{\hat\beta}_1^T\mathbf{X}_1^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 - \boldsymbol{\hat\beta}_2^T\mathbf{X}_2^T\mathbf{y} + \boldsymbol{\hat\beta}_2^T\mathbf{X}_2^T\mathbf{X}_1\boldsymbol{\hat\beta}_1 + \boldsymbol{\hat\beta}_2^T\mathbf{X}_2^T\mathbf{X}_2\boldsymbol{\hat\beta}_2$$
which we seek to minimize by choosing the coefficient vectors:
$$\frac{\partial \mathbf{u}^T\mathbf{u}}{\partial \boldsymbol{\hat\beta}_1} = -2\mathbf{X}_1^T\mathbf{y} + 2\mathbf{X}_1^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 + 2\mathbf{X}_1^T\mathbf{X}_1\boldsymbol{\hat\beta}_1 = \mathbf{0} \;\Rightarrow\; \boldsymbol{\hat\beta}_1 = \left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T\mathbf{y} - \left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 \quad (2.20)$$
and
$$\frac{\partial \mathbf{u}^T\mathbf{u}}{\partial \boldsymbol{\hat\beta}_2} = -2\mathbf{X}_2^T\mathbf{y} + 2\mathbf{X}_2^T\mathbf{X}_1\boldsymbol{\hat\beta}_1 + 2\mathbf{X}_2^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 = \mathbf{0} \;\Rightarrow\; \boldsymbol{\hat\beta}_2 = \left(\mathbf{X}_2^T\mathbf{X}_2\right)^{-1}\mathbf{X}_2^T\mathbf{y} - \left(\mathbf{X}_2^T\mathbf{X}_2\right)^{-1}\mathbf{X}_2^T\mathbf{X}_1\boldsymbol{\hat\beta}_1 \quad (2.21)$$
Or
$$\begin{pmatrix} \mathbf{X}_1^T\mathbf{X}_1 & \mathbf{X}_1^T\mathbf{X}_2 \\ \mathbf{X}_2^T\mathbf{X}_1 & \mathbf{X}_2^T\mathbf{X}_2 \end{pmatrix}\begin{pmatrix} \boldsymbol{\hat\beta}_1 \\ \boldsymbol{\hat\beta}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{X}_1^T\mathbf{y} \\ \mathbf{X}_2^T\mathbf{y} \end{pmatrix} \quad (2.22)$$
which is the set of normal equations.
Hence we can see that the coefficients in a multivariate regression will reflect the effect of the correlation among the different regressors. If, and only if, the two sets of regressors were uncorrelated, that is, $\left(\mathbf{X}_2^T\mathbf{X}_2\right)^{-1}\mathbf{X}_2^T\mathbf{X}_1 = \left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T\mathbf{X}_2 = \mathbf{0}$, could we expect a regression of y on $\mathbf{X}_1$ alone to yield the same beta coefficients as the joint regression above.


The Frisch-Waugh Theorem (also known as the Frisch-Waugh-Lovell Theorem)

Let us substitute (2.20) into the second block row of (2.22)!
$$\mathbf{X}_2^T\mathbf{X}_1\left[\left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T\mathbf{y} - \left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T\mathbf{X}_2\boldsymbol{\hat\beta}_2\right] + \mathbf{X}_2^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 = \mathbf{X}_2^T\mathbf{y}$$
Let us define the projection matrix for the column vector space spanned by $\mathbf{X}_1$: $\mathbf{P}_1 = \mathbf{X}_1\left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T$, and the corresponding annihilator matrix: $\mathbf{M}_1 = \mathbf{I}_n - \mathbf{P}_1$. Then:
$$\mathbf{X}_2^T\mathbf{P}_1\mathbf{y} - \mathbf{X}_2^T\mathbf{P}_1\mathbf{X}_2\boldsymbol{\hat\beta}_2 + \mathbf{X}_2^T\mathbf{X}_2\boldsymbol{\hat\beta}_2 = \mathbf{X}_2^T\mathbf{y}$$
$$\mathbf{X}_2^T\mathbf{M}_1\mathbf{X}_2\boldsymbol{\hat\beta}_2 = \mathbf{X}_2^T\mathbf{M}_1\mathbf{y} \;\Rightarrow\; \boldsymbol{\hat\beta}_2 = \left(\mathbf{X}_2^T\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2^T\mathbf{M}_1\mathbf{y}$$
or, due to idempotence and symmetry,
$$\boldsymbol{\hat\beta}_2 = \left(\mathbf{X}_2^T\mathbf{M}_1^T\mathbf{M}_1\mathbf{X}_2\right)^{-1}\mathbf{X}_2^T\mathbf{M}_1^T\mathbf{M}_1\mathbf{y}$$
What is $\mathbf{M}_1\mathbf{y}$? It is the residual from a regression of y on $\mathbf{X}_1$ only. Similarly, $\mathbf{M}_1\mathbf{X}_2$ is the set of residuals from the regressions of all columns of $\mathbf{X}_2$ on $\mathbf{X}_1$. The effect of $\mathbf{X}_1$ on the coefficient vector is netted out, or partialed out.
The Frisch-Waugh theorem states that the coefficients from a multivariate regression are identical to those from a two-variate regression in which the effect of all other variables has been netted out. The coefficients from a multivariate regression can hence be interpreted as the partial effect of the variable in question on the dependent variable, that is, with all other effects removed.
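A NumPy sketch of the theorem, assuming simulated data and an arbitrary split of the regressors into X1 and X2: the coefficient on X2 from the joint regression equals the one obtained after partialling X1 out of both y and X2.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])        # intercept + one regressor
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)     # deliberately correlated with X1
y = X1 @ np.array([1.0, 2.0]) + X2[:, 0] * (-0.7) + rng.normal(size=n)

X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)                 # joint regression

M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T         # annihilator of X1
y_res, X2_res = M1 @ y, M1 @ X2                               # partialled-out y and X2
beta2_fwl = np.linalg.solve(X2_res.T @ X2_res, X2_res.T @ y_res)

print(beta_full[-1], beta2_fwl[0])   # the two X2 coefficients coincide
```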
But what is the practical importance of this, another core result in statistics?
1. The idea of ceteris paribus is central in the methodology of economics, for example in
comparative statics. In comparative statics we analyze the effect of a single variable or parameter
on the outcome variable with all other factors fixed. Hence multivariate regressions are an obvious
way to directly measure such relationships.
2. Have you ever considered what the right way is to regress y on x when you know that seasonal
effects are present? Should you regress y on x with seasonal dummies included, or should you
rather first deseasonalize y and x individually, and regress the deseasonalized y on the
deseasonalized x? Frisch and Waugh have good news for you: it is the same.

Practical example
We have data on the salary of employees, their education (years of education) and their experience (years)
(Ramanathan data6-4.gdt in Gretl). First we estimate the effect of both education and experience on the logarithm
of salary in a three-variate regression.
Model 2: OLS, using observations 1-49, dependent variable: l_WAGE (coefficient table not reproduced here).

Now we are going to partial out the effect of experience on education.

First we regress the log wage on experience and save the residual (res1).

Then we regress education on experience and save the residual (res2).

Finally, we regress res1 on res2, which indeed yields the same coefficient as education has in the multivariate regression.
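A hedged replication of the same three steps in Python (the original exercise was done in Gretl). It assumes the Ramanathan data6-4 dataset has been exported to a CSV file named data6-4.csv with columns WAGE, EDUC and EXPER; the file name and column names are assumptions for illustration, not part of the original text.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data6-4.csv")            # assumed export of the Gretl dataset
y = np.log(df["WAGE"].to_numpy())          # l_WAGE
educ = df["EDUC"].to_numpy()
exper = df["EXPER"].to_numpy()
n = len(y)

def ols(y, X):
    """Return OLS coefficients and residuals for a design matrix X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return b, y - X @ b

const = np.ones(n)

# full regression: l_WAGE on const, EDUC, EXPER
b_full, _ = ols(y, np.column_stack([const, educ, exper]))

# partialling out: residuals of l_WAGE and of EDUC after removing EXPER
_, res1 = ols(y, np.column_stack([const, exper]))
_, res2 = ols(educ, np.column_stack([const, exper]))

# regression of res1 on res2 reproduces the EDUC coefficient
b_fwl, _ = ols(res1, res2.reshape(-1, 1))
print(b_full[1], b_fwl[0])                 # should coincide
```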
