AUEB Fall 2008 Ekaterini Kyriazidou

LECTURE 2
1. The Method of (Ordinary) Least Squares
Once we write down the statistical model, such as the Classical Linear Regression (CLR) model, which can be summarized as:¹

$$E(Y|X) = X\beta, \qquad V(Y|X) = \sigma^2 I_n$$

along with the identification condition

$$X'X \text{ is of full rank } K,$$

the question is how to estimate the unknown parameters in the model, the $\beta_k$'s and $\sigma^2$, from our sample $(Y_i, X_i)_{i=1}^n$. We will first focus on estimating the $\beta_k$'s. If we believe that the assumptions of the CLR model hold, we can estimate the $\beta_k$'s by running an (Ordinary) Least Squares (OLS or simply LS) regression of $Y$ on $X$, or, for shorter, by regressing $Y$ on $X$.² This means solving the following minimization problem:
$$\min_{\{b_1, b_2, \ldots, b_K\} \in \mathbb{R}^K} \; \sum_{i=1}^n \Big(Y_i - \big(b_1 X_{i1} + b_2 X_{i2} + \cdots + b_K X_{iK}\big)\Big)^2 \qquad (1.1)$$

or equivalently

$$\min_{b \in \mathbb{R}^K} \; \sum_{i=1}^n \big(Y_i - X_i b\big)^2 \qquad (1.2)$$

which may also be written in matrix notation as:

$$\min_{b \in \mathbb{R}^K} \; (Y - Xb)'(Y - Xb) \qquad (1.3)$$

¹ The CLR model can be equivalently expressed in terms of the error term as $Y = X\beta + \varepsilon$, $E(\varepsilon|X) = 0$, $V(\varepsilon|X) = \sigma^2 I_n$, along with the identification condition that $X'X$ is of full rank $K$.

² We will see why OLS is appropriate under the assumptions of the CLR model in the next lecture. Note that here I changed the notation from $\beta$ to $b$. This is simply done to convey the idea that $\beta$ is a fixed but unknown quantity; $b$ denotes the argument with respect to which we are minimizing.
To solve this problem we take First Order Necessary Conditions (FONC), or simply FOC, i.e. we take the first derivatives of (1.1), or equivalently of (1.2) or (1.3), with respect to $b$, and set them all simultaneously equal to zero.
The derivatives of (1.1) are:

$$\text{w.r.t. } b_1: \quad -2\sum_{i=1}^n X_{i1}\Big(Y_i - \big(b_1 X_{i1} + b_2 X_{i2} + \cdots + b_K X_{iK}\big)\Big) \qquad (1.4)$$

$$\text{w.r.t. } b_2: \quad -2\sum_{i=1}^n X_{i2}\Big(Y_i - \big(b_1 X_{i1} + b_2 X_{i2} + \cdots + b_K X_{iK}\big)\Big)$$

$$\vdots$$

$$\text{w.r.t. } b_K: \quad -2\sum_{i=1}^n X_{iK}\Big(Y_i - \big(b_1 X_{i1} + b_2 X_{i2} + \cdots + b_K X_{iK}\big)\Big)$$

or equivalently, for (1.2),

$$-2\sum_{i=1}^n X_i'\big(Y_i - X_i b\big) \qquad (1.5)$$

or, in yet another way of writing them, using matrix notation,

$$-2X'(Y - Xb) \qquad (1.6)$$
Let
$$\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_K \end{pmatrix}$$
be the vector that satisfies the FOC, i.e. that sets the derivatives above equal to zero. Whichever way we write the FOC, (1.4), (1.5), or (1.6), they involve solving (the same) system of $K$ equations (called the normal equations) in $K$ unknowns. It is convenient to work with matrix notation to express the normal equations. Since $\hat\beta$ sets the FOC (1.6) equal to a $K \times 1$ vector of zeros, we have:
$$-2X'\big(Y - X\hat\beta\big) = 0 \;\Longrightarrow\; X'\big(Y - X\hat\beta\big) = 0 \;\Longrightarrow\; X'Y = \big(X'X\big)\hat\beta$$
The solution to this system of equations is the (Ordinary) Least Squares estimator of $\beta$.
The OLS estimator has the form:
$$\hat\beta = \big(X'X\big)^{-1} X'Y \qquad (1.7)$$
and it exists and is unique provided that the $K \times K$ matrix $X'X$ is invertible, i.e. nonsingular. This is guaranteed by the identification assumption that $X$ has full column rank $K$. If this assumption fails to hold, i.e. if $(X'X)^{-1}$ does not exist, then we cannot solve the normal equations uniquely and the OLS estimator $\hat\beta$ does not exist. This is the problem of perfect collinearity (to be distinguished from multicollinearity, a problem that sometimes occurs when running an OLS regression and which we will discuss later).
The OLS estimator may be equivalently written in terms of sample averages as
$$\hat\beta = S_{XX}^{-1} S_{XY} \qquad (1.8)$$
where
$$S_{XX} = \frac{1}{n}\sum_{i=1}^n X_i' X_i = \frac{1}{n} X'X
\qquad \text{and} \qquad
S_{XY} = \frac{1}{n}\sum_{i=1}^n X_i' Y_i = \frac{1}{n} X'Y$$
The form in (1.7) is more useful for deriving the finite sample properties of the OLS estimator, while the form in (1.8) is more useful for deriving the asymptotic properties of the estimator.
How do we know that $\hat\beta$ is a minimum? By checking the Second Order Sufficient Conditions (SSOC) for minimization, that is, that the Hessian, i.e. the matrix of second derivatives of (1.1) (or equivalently of (1.2) or (1.3)) with respect to $b$, is a Positive Definite (PD) matrix. In our case, the Hessian is $2X'X$, which is PD, since 2 is positive and $X'X$ may be thought of as the "square" of $X$ (it is positive definite when $X$ has full column rank $K$).
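As a concrete illustration (added here, not part of the original notes), the following minimal sketch computes the estimator in the two equivalent forms (1.7) and (1.8). It assumes Python with NumPy; the simulated data and variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # first column of ones
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(size=n)                               # simulated CLR data

# OLS via the normal equations, as in (1.7): beta_hat = (X'X)^{-1} X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent sample-average form, as in (1.8): S_XX^{-1} S_XY
S_XX = (X.T @ X) / n
S_XY = (X.T @ Y) / n
beta_hat_avg = np.linalg.solve(S_XX, S_XY)

print(beta_hat)
print(np.allclose(beta_hat, beta_hat_avg))  # True: the two forms coincide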
1.1. Fitted Values, Residuals
It is useful to define for the $i$th observation the following quantities:

Fitted value: $\hat Y_i = X_i \hat\beta$

Residual: $\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - X_i\hat\beta$

Stacking the fitted values $\hat Y_i$'s in an $n \times 1$ vector, we obtain:
$$\hat Y = \begin{pmatrix} \hat Y_1 \\ \hat Y_2 \\ \vdots \\ \hat Y_n \end{pmatrix}
= \begin{pmatrix} X_1\hat\beta \\ X_2\hat\beta \\ \vdots \\ X_n\hat\beta \end{pmatrix}
= \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}\hat\beta
= \begin{pmatrix} X_{11} & X_{12} & \ldots & X_{1K} \\ X_{21} & X_{22} & \ldots & X_{2K} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \ldots & X_{nK} \end{pmatrix}
\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_K \end{pmatrix}$$
i.e.
$$\hat Y = X\hat\beta = X\big(X'X\big)^{-1}X'Y = P_X Y \qquad \text{where } P_X = X\big(X'X\big)^{-1}X'$$
The $n \times n$ matrix $P_X$ is called a projection matrix. It is a square, symmetric (i.e. $P_X' = P_X$), idempotent matrix (i.e. $P_X P_X = P_X$) that has the property that $P_X X = X$.
Stacking the residuals in an $n \times 1$ vector, we obtain:
$$\hat\varepsilon = \begin{pmatrix} \hat\varepsilon_1 \\ \hat\varepsilon_2 \\ \vdots \\ \hat\varepsilon_n \end{pmatrix}
= \begin{pmatrix} Y_1 - \hat Y_1 \\ Y_2 - \hat Y_2 \\ \vdots \\ Y_n - \hat Y_n \end{pmatrix}$$
i.e.
$$\hat\varepsilon = Y - \hat Y = Y - X\big(X'X\big)^{-1}X'Y = \Big(I_n - X\big(X'X\big)^{-1}X'\Big)Y = (I_n - P_X)\,Y = M_X Y$$
where $M_X = I_n - P_X$ is a square $n \times n$ matrix that is symmetric (i.e. $M_X' = M_X$) and idempotent (i.e. $M_X M_X = M_X$), and that has the property that $M_X X = (I_n - P_X)X = X - X = 0$. We will call $M_X$ the annihilator matrix.
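The stated properties of $P_X$ and $M_X$ are easy to verify numerically. Below is a minimal sketch, added here for illustration and not part of the original notes; it assumes NumPy, and the simulated data and names are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

P_X = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix X(X'X)^{-1}X'
M_X = np.eye(n) - P_X                     # annihilator matrix I_n - P_X

print(np.allclose(P_X, P_X.T), np.allclose(P_X @ P_X, P_X))   # P_X symmetric, idempotent
print(np.allclose(M_X, M_X.T), np.allclose(M_X @ M_X, M_X))   # M_X symmetric, idempotent
print(np.allclose(P_X @ X, X), np.allclose(M_X @ X, 0))       # P_X X = X and M_X X = 0
print(np.allclose(P_X @ Y, X @ np.linalg.solve(X.T @ X, X.T @ Y)))  # P_X Y = X beta_hat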
A special annihilator matrix is the one that is made from the $n$-dimensional unit vector,
$$\iota = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$$
Note that $\iota\iota'$ is an $n \times n$ matrix of ones and $\iota'\iota = n$. We will define
$$M_0 = I_n - \iota\big(\iota'\iota\big)^{-1}\iota' = I_n - \frac{\iota\iota'}{n}$$
The matrix $M_0$, when multiplied by a vector, say $Y$, has the property of producing the demeaned values of the vector:
$$M_0 Y = \begin{pmatrix} Y_1 - \bar Y \\ \vdots \\ Y_n - \bar Y \end{pmatrix}$$
Note that, by their definition, $\hat\varepsilon_i = Y_i - X_i\hat\beta$, the residuals satisfy:
$$\sum_{i=1}^n X_{ij}\hat\varepsilon_i = 0 \;\Longleftrightarrow\; X_j'\hat\varepsilon = 0 \quad \text{for all } j = 1, \ldots, K$$
or, equivalently,
$$X'\hat\varepsilon = 0$$
That is, the residual vector $\hat\varepsilon$ is orthogonal to each column of $X$. That this is true may be seen by observing that the equations above are just the normal equations of the LS problem. If $X$ contains a column of ones, i.e. if $X_{i1} = 1$ for all $i$, then this means that
$$\sum_{i=1}^n \hat\varepsilon_i = 0$$
i.e. the positive and negative residuals cancel each other out.
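A quick numerical check of these two facts (an added illustration, assuming NumPy; the data are simulated and the names are arbitrary): the residual vector is orthogonal to every column of $X$, and, because $X$ contains a column of ones, the residuals sum to zero.

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])   # includes a column of ones
Y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=80)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

print(np.allclose(X.T @ resid, 0))   # X' eps_hat = 0 (the normal equations)
print(np.isclose(resid.sum(), 0))    # residuals sum to zero (intercept included)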
1.2. OLS in the Two-Variable (Simple) Regression Model
The method of Least Squares is a quite general principle of curve fitting. Suppose we have $n$ pairs of observations $(Y_i, X_{i2})_{i=1}^n$. Here, each $X_{i2}$ is a scalar variable. We may represent these $n$ pairs as $n$ points in a two-dimensional graph. This is the scatterplot of the data. Suppose we want to fit a straight line through these points that lies close to them. Remember that a line is characterized by an intercept and a slope. Thus we want to find a function $y = b_1 + b_2 x_2$. Most likely these points will not lie exactly on such a line, that is, in general, $Y_i \neq b_1 + b_2 X_{i2}$ for specific $b_1$ and $b_2$. We may imagine many lines going through the points, and typically some or all of the points will deviate from any such line, the deviation being $Y_i - (b_1 + b_2 X_{i2})$; it will be positive if a point lies above the line and negative if it lies below the line. Suppose, for example, that we connect the leftmost point to the rightmost point. This is one principle for fitting a line, but it is not very reasonable if we want the line to be close to all points simultaneously. To account for the fact that we want all points to lie close to a line simultaneously, we may think of trying to make the sum of all deviations small. That is, we may think of finding an intercept $b_1$ and a slope $b_2$ so as to minimize $\sum_{i=1}^n \big(Y_i - (b_1 + b_2 X_{i2})\big)$. This principle is also not very good, since a very large positive deviation may cancel out a very large negative deviation, so that, when we try to minimize $\sum_{i=1}^n \big(Y_i - (b_1 + b_2 X_{i2})\big)$ with respect to $b_1$ and $b_2$, these two points will not affect our result, as they cancel each other out. So the line that we fit using this principle will be far from both these points. This problem of positive and negative deviations cancelling out may be solved if we square each deviation and then sum them all up, that is, if we adopt the Least Squares principle, which says: find a $b_1$ and a $b_2$ so as to minimize $\sum_{i=1}^n \big(Y_i - (b_1 + b_2 X_{i2})\big)^2$. Thus, our problem is:
$$\min_{b_1, b_2} \; \sum_{i=1}^n \big(Y_i - b_1 - X_{i2} b_2\big)^2$$
The FOC are:
$$-2\sum_{i=1}^n \big(Y_i - \hat\beta_1 - X_{i2}\hat\beta_2\big) = 0$$
$$-2\sum_{i=1}^n X_{i2}\big(Y_i - \hat\beta_1 - X_{i2}\hat\beta_2\big) = 0$$
which is a system of 2 equations in 2 unknowns, $\hat\beta_1$ and $\hat\beta_2$, which may be equivalently written as:
$$\sum_{i=1}^n Y_i = \hat\beta_1\, n + \hat\beta_2 \sum_{i=1}^n X_{i2} \qquad (1.9)$$
$$\sum_{i=1}^n X_{i2} Y_i = \hat\beta_1 \sum_{i=1}^n X_{i2} + \hat\beta_2 \sum_{i=1}^n X_{i2}^2 \qquad (1.10)$$
Multiplying (1.9) by $\sum_{i=1}^n X_{i2}$ and (1.10) by $n$, we obtain:
$$\sum_{i=1}^n X_{i2} \sum_{i=1}^n Y_i = \hat\beta_1\, n \sum_{i=1}^n X_{i2} + \hat\beta_2 \Big(\sum_{i=1}^n X_{i2}\Big)^2 \qquad (1.11)$$
$$n\sum_{i=1}^n X_{i2} Y_i = \hat\beta_1\, n \sum_{i=1}^n X_{i2} + \hat\beta_2\, n \sum_{i=1}^n X_{i2}^2 \qquad (1.12)$$
Subtracting (1.11) from (1.12), we obtain:
$$n\sum_{i=1}^n X_{i2} Y_i - \sum_{i=1}^n X_{i2}\sum_{i=1}^n Y_i = \hat\beta_2\Bigg(n\sum_{i=1}^n X_{i2}^2 - \Big(\sum_{i=1}^n X_{i2}\Big)^2\Bigg)$$
and solving for $\hat\beta_2$ we get:
$$\hat\beta_2 = \frac{n\sum_{i=1}^n X_{i2} Y_i - \sum_{i=1}^n X_{i2}\sum_{i=1}^n Y_i}{n\sum_{i=1}^n X_{i2}^2 - \big(\sum_{i=1}^n X_{i2}\big)^2}$$
Multiplying both numerator and denominator of the last equation by $\frac{1}{n^2}$, we obtain:
$$\hat\beta_2 = \frac{\frac{1}{n}\sum_{i=1}^n X_{i2} Y_i - \frac{1}{n}\sum_{i=1}^n X_{i2}\cdot\frac{1}{n}\sum_{i=1}^n Y_i}{\frac{1}{n}\sum_{i=1}^n X_{i2}^2 - \big(\frac{1}{n}\sum_{i=1}^n X_{i2}\big)^2}
= \frac{\frac{1}{n}\sum_{i=1}^n X_{i2} Y_i - \bar X_2\,\bar Y}{\frac{1}{n}\sum_{i=1}^n X_{i2}^2 - \bar X_2^2}$$
where $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$ and $\bar X_2 = \frac{1}{n}\sum_{i=1}^n X_{i2}$ are the sample averages of the $Y_i$'s and the $X_{i2}$'s, respectively. Manipulating the last equation, we obtain:
$$\hat\beta_2 = \frac{\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)\big(Y_i - \bar Y\big)}{\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)^2}$$
(Can you show this?) This is the LS estimator of the slope. The LS estimator of the intercept is obtained by solving (1.9) for $\hat\beta_1$:
$$\hat\beta_1 = \bar Y - \bar X_2\,\hat\beta_2$$
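A small numerical sketch of these closed-form expressions (added for illustration, assuming NumPy; the data are simulated): the slope is the ratio of the sample covariance of $X_2$ and $Y$ to the sample variance of $X_2$, and the same numbers come out of the general matrix formula (1.7).

import numpy as np

rng = np.random.default_rng(2)
x2 = rng.normal(size=100)
y = 1.5 + 0.8 * x2 + rng.normal(size=100)

b2 = np.sum((x2 - x2.mean()) * (y - y.mean())) / np.sum((x2 - x2.mean()) ** 2)  # slope
b1 = y.mean() - x2.mean() * b2                                                  # intercept

X = np.column_stack([np.ones(100), x2])
b_matrix = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y
print(np.allclose([b1, b2], b_matrix))         # True: both routes agree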
For the particular values of the pairs $(Y_i, X_{i2})_{i=1}^n$, $\hat\beta_1$ and $\hat\beta_2$ take on specific values and define a straight line in the two-dimensional plane, $\hat y = \hat\beta_1 + \hat\beta_2 x_2$. This is the sample (or fitted) regression function, and it is the line on which all the fitted values lie, i.e. the fitted values $\hat Y_i$ satisfy $\hat Y_i = \hat\beta_1 + \hat\beta_2 X_{i2}$. As we mentioned above, the actual $Y_i$ values will not in general lie on this line, that is, $Y_i \neq \hat\beta_1 + \hat\beta_2 X_{i2}$. The deviation of each $Y_i$ from this fitted line (i.e. the vertical distance of the point $(X_{i2}, Y_i)$ from the line) is $Y_i - \big(\hat\beta_1 + \hat\beta_2 X_{i2}\big)$, and is precisely what we called the residual for the $i$th observation, $\hat\varepsilon_i$.
It is interesting to note at this point that, in the two-variable regression model, the slope estimator $\hat\beta_2$ is related to two well-known statistics, the sample covariance of $Y$ and $X_2$, defined as $\frac{1}{n}\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)\big(Y_i - \bar Y\big)$, and the sample variance of $X_2$, defined as $\frac{1}{n}\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)^2$, and hence to the sample correlation (coefficient) of $Y$ and $X_2$, defined as
$$r = \frac{\frac{1}{n}\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)\big(Y_i - \bar Y\big)}{\sqrt{\frac{1}{n}\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^n \big(Y_i - \bar Y\big)^2}}$$
(Can you find the relationship?) And, as we know, the sample correlation of two variables measures the direction and strength of the linear relationship between the variables. However, you should remember that in the multivariate regression model, when $K > 2$, there is no such simple relationship of $\hat\beta_j$ with the sample covariance of $Y$ and $X_j$ and the sample variance of $X_j$.
1.3. OLS in the Three-Variable Regression Model
The OLS regression of $Y$ on a constant, $X_2$ and $X_3$ produces the following:
$$\hat\beta_2 = \frac{\sum x_{i2} y_i \sum x_{i3}^2 - \sum x_{i3} y_i \sum x_{i2} x_{i3}}{\sum x_{i2}^2 \sum x_{i3}^2 - \big(\sum x_{i2} x_{i3}\big)^2}$$
where $x_{i2} = X_{i2} - \bar X_2$, $x_{i3} = X_{i3} - \bar X_3$, and $y_i = Y_i - \bar Y$. Dividing top and bottom by $\sum x_{i2}^2 \sum x_{i3}^2$ gives
$$\hat\beta_2 = \frac{\hat\beta_{12} - \hat\beta_{13}\hat\beta_{32}}{1 - \hat\beta_{23}\hat\beta_{32}} = \frac{r_{12} - r_{13}\, r_{32}}{1 - r_{23}^2}\cdot\frac{s_1}{s_2}$$
where the $\hat\beta_{jl}$ are the OLS coefficients from the simple OLS regressions of $j$ on $l$ including a constant, i.e.
$$\hat\beta_{12} = \frac{\sum x_{i2} y_i}{\sum x_{i2}^2} = r_{12}\frac{s_1}{s_2}, \qquad
\hat\beta_{13} = \frac{\sum x_{i3} y_i}{\sum x_{i3}^2} = r_{13}\frac{s_1}{s_3},$$
$$\hat\beta_{32} = \frac{\sum x_{i2} x_{i3}}{\sum x_{i2}^2} = r_{32}\frac{s_3}{s_2}, \qquad
\hat\beta_{23} = \frac{\sum x_{i2} x_{i3}}{\sum x_{i3}^2} = r_{23}\frac{s_2}{s_3},$$
and the $r_{jl}$ are the simple sample correlation coefficients between $j$ and $l$, and $s_j$ is the sample standard deviation of $j$, for example, $s_1 = \sqrt{\frac{1}{n}\sum y_i^2}$.
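The decomposition of $\hat\beta_2$ into the simple-regression coefficients $\hat\beta_{jl}$ can be verified numerically; the sketch below is an added illustration (assuming NumPy; variable names and data are arbitrary) comparing the formula above with the coefficient on $X_2$ from the full regression of $Y$ on a constant, $X_2$ and $X_3$.

import numpy as np

rng = np.random.default_rng(3)
n = 200
X2 = rng.normal(size=n)
X3 = 0.6 * X2 + rng.normal(size=n)                 # correlated regressors
Y = 1.0 + 2.0 * X2 - 1.0 * X3 + rng.normal(size=n)

# demeaned variables, as in the text
y, x2, x3 = Y - Y.mean(), X2 - X2.mean(), X3 - X3.mean()

b12 = np.sum(x2 * y) / np.sum(x2 ** 2)     # simple regression of Y on X2
b13 = np.sum(x3 * y) / np.sum(x3 ** 2)     # simple regression of Y on X3
b32 = np.sum(x2 * x3) / np.sum(x2 ** 2)    # simple regression of X3 on X2
b23 = np.sum(x2 * x3) / np.sum(x3 ** 2)    # simple regression of X2 on X3

beta2_formula = (b12 - b13 * b32) / (1 - b23 * b32)

X = np.column_stack([np.ones(n), X2, X3])
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(beta2_formula, beta_full[1]))    # True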
1.4. The Principle of Least Squares
Of course, the Least Squares principle is not the only reasonable principle for fitting a line that we may think of. For example, we may instead take the absolute value of each deviation and then sum them all up. This is the principle of Least Absolute Deviations (LAD), and it says: find a $b_1$ and a $b_2$ so as to minimize $\sum_{i=1}^n \big|Y_i - (b_1 + b_2 X_{i2})\big|$. The difference between LS and LAD is that LS penalizes large (positive or negative) deviations more than LAD, or, in other words, LS gives more weight to large deviations than LAD. This is because the LS objective function which we are trying to minimize increases with the square of each individual deviation, while in the LAD objective function the importance of each deviation is proportional to its magnitude. Both methods treat positive and negative deviations symmetrically. Choosing between the two principles involves taking a stand on how we want large deviations, called outliers, to be treated. One advantage of using the LS principle is that it gives us analytical solutions (i.e. closed-form expressions) for the optimal $b_1$ and $b_2$, the $\hat\beta_1$ and $\hat\beta_2$ above.
How does the principle of Least Squares relate to the classical regression model?

As we mentioned in our discussion of the CR model, the most important assumption is the linearity of the (mean) regression function, which for $K = 2$, i.e. for the two-variable regression model, takes the form:
$$E(Y_i|X) = \beta_1 X_{i1} + \beta_2 X_{i2} = \beta_1 + \beta_2 X_{i2}$$
where the first regressor $X_{i1} = 1$ for all $i$. That is, we believe that the means of each one of the $Y_i$'s are linear functions of the respective $X_i$'s, i.e. they lie on a straight line with an intercept $\beta_1$ and a slope $\beta_2$. These two parameters are unknown and are the objects of our interest. If we knew the actual values of the $E(Y_i|X)$'s we would also know $\beta_1$ and $\beta_2$. But we don't observe the true $E(Y_i|X)$'s. We do observe, however, a realization of each $Y_i$, most likely not having the same value as its mean, i.e. in general $Y_i \neq E(Y_i|X)$, so that in general $Y_i \neq \beta_1 + \beta_2 X_{i2}$. In fact, we may think of the unobservable error term $\varepsilon_i$ as being exactly the deviation of $Y_i$ from its conditional mean, that is, $\varepsilon_i = Y_i - E(Y_i|X) = Y_i - (\beta_1 + \beta_2 X_{i2})$, and it is clear that it will satisfy $E(\varepsilon_i|X) = 0$ for all $i$. Thus, each $Y_i$ that we observe deviates from this unknown, unobservable straight line, the (true or population) mean regression function. To estimate it, i.e. to estimate the intercept $\beta_1$ and the slope $\beta_2$, we may apply the Least Squares principle using our data. This produces the (sample or fitted) regression function, $\hat Y_i = \hat\beta_1 + \hat\beta_2 X_{i2}$. In general, the estimated intercept and slope, $\hat\beta_1$ and $\hat\beta_2$, will not be equal to the true parameters, $\beta_1$ and $\beta_2$, i.e. $\hat\beta_1 \neq \beta_1$ and $\hat\beta_2 \neq \beta_2$, so that the true unknown regression function will not coincide with the fitted regression function. Similarly, the unobservable, unknown error $\varepsilon_i$, which is the deviation of the actual, observed $Y_i$ from the unknown population regression function, will not coincide with the observed deviation of $Y_i$ from the fitted regression function, which is what we called the residual, $\hat\varepsilon_i$. That is, in general, $\hat\varepsilon_i \neq \varepsilon_i$. What we hope by doing LS is that the fitted regression line is close to the true regression line, i.e. that the estimated $\hat\beta_j$'s lie close to their unknown counterparts, the true $\beta_j$'s. But remember, the true $\beta$'s remain always elusive!
1.5. Measuring the Goodness of Fit
As we discussed above, when running OLS the objective is to fit a straight line (in the 2-variable case) so as to make the sum of squared deviations from that line as small as possible. Thus, it makes sense after running an OLS regression to ask ourselves how good the fit of the line is. We will next describe several such measures of the goodness of fit.

By the definitions of $\hat Y_i = X_i\hat\beta$ and $\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - X_i\hat\beta$, we have:
$$Y_i = \hat Y_i + \hat\varepsilon_i = X_i\hat\beta + \hat\varepsilon_i$$
i.e. each $Y_i$ is decomposed into an explained (fitted or predicted) part, $\hat Y_i = X_i\hat\beta$, which is explained by the $X_{ij}$'s, and an unexplained part, $\hat\varepsilon_i$. The smaller each $\hat\varepsilon_i$ is, the closer its corresponding $Y_i$ is to the fitted value $\hat Y_i$. In the case where $K = 2$, where the model is $E(Y_i|X) = \beta_1 + \beta_2 X_{i2}$, the residual $\hat\varepsilon_i$ is the vertical distance (deviation) of $Y_i$ from the fitted line, given by the equation $y = \hat\beta_1 + \hat\beta_2 x_2$, where $\hat\beta_1$ and $\hat\beta_2$ are the OLS estimates of $\beta_1$ and $\beta_2$. By running the regression, we hope that the actual $Y_i$'s lie close to the fitted line, simultaneously for all $i$. Large residuals imply a poor fit, while small residuals imply a good fit. Thus, we could think of measuring the goodness of fit by the sum of the residuals, $\sum_{i=1}^n \hat\varepsilon_i$. This measure has the problem that a large positive residual ($Y_i$ above the fitted line) may cancel out a large negative residual ($Y_i$ below the fitted line), so that even though $\sum_{i=1}^n \hat\varepsilon_i$ may be small, which would indicate a good fit, for some $i$'s the residuals may be large (positive and negative). Besides, as we saw before, if the regression includes an intercept, $\sum_{i=1}^n \hat\varepsilon_i = 0$, i.e. the positive and negative residuals cancel each other out completely. This of course doesn't mean that all $\hat\varepsilon_i$'s are small. Thus, $\sum_{i=1}^n \hat\varepsilon_i$ is not a good measure of the goodness of fit. We may think instead of using the sum of squared residuals, i.e. $\sum_{i=1}^n \hat\varepsilon_i^2$, which is always non-negative. But this measure has a problem as well, namely that its magnitude depends on the units in which the $Y_i$'s are measured. Imagine what happens to a residual when we increase the scale of the $Y$ axis, i.e. we stretch it. Then all the residuals look larger, and $\sum_{i=1}^n \hat\varepsilon_i^2$ is larger as well. For this reason we may use:
$$\frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n Y_i^2} = \frac{\hat\varepsilon'\hat\varepsilon}{Y'Y}$$
which is a unit-free measure of goodness of fit. A small value of this measure then means a good fit. Notice that since $X'\hat\varepsilon = 0$, it follows that $\hat Y'\hat\varepsilon = \hat\beta' X'\hat\varepsilon = 0$, so that
$$Y'Y = \hat Y'\hat Y + \hat\varepsilon'\hat\varepsilon$$
and
$$\frac{\hat\varepsilon'\hat\varepsilon}{Y'Y} = 1 - \frac{\hat Y'\hat Y}{Y'Y}$$
so that a small $\frac{\hat\varepsilon'\hat\varepsilon}{Y'Y}$ is equivalent to a large value for $\frac{\hat Y'\hat Y}{Y'Y}$, where both quantities are between 0 and 1.
The quantity $\frac{\hat Y'\hat Y}{Y'Y}$ is sometimes called the uncentered $R$ squared:
$$R^2_{UC} = 1 - \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n Y_i^2} = \frac{\sum_{i=1}^n \hat Y_i^2}{\sum_{i=1}^n Y_i^2} = 1 - \frac{\hat\varepsilon'\hat\varepsilon}{Y'Y} = \frac{\hat Y'\hat Y}{Y'Y}$$
The larger this value is, the better the fit of the regression, where
$$0 \leq R^2_{UC} \leq 1$$
One problem with this measure is that it is sensitive with respect to changing the origin of $Y$, i.e. to adding a constant to $Y$. For this reason it is customary to report the (centered) $R$ squared, also known as the coefficient of determination or multiple correlation coefficient, or simply $R^2$:
$$R^2 = R^2_C = 1 - \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2} = 1 - \frac{\hat\varepsilon'\hat\varepsilon}{Y'M_0 Y}$$
However, this quantity is guaranteed to be between 0 and 1 only if the regression includes an intercept (see proof in the appendix).
The quantity $\sum_{i=1}^n \hat\varepsilon_i^2$ is also known as the Sum of Squared Residuals ($SSR$), while the quantity $\sum_{i=1}^n \big(Y_i - \bar Y\big)^2$ is known as the Total Sum of Squares ($SST$). The latter measures the total variation of the $Y$'s around their sample mean. The quantity $\sum_{i=1}^n \big(\hat Y_i - \bar Y\big)^2$ is known as the Explained Sum of Squares ($SSE$) and measures the variation of the fitted $\hat Y$'s around the sample mean of the actual $Y$'s (which, as we show in the appendix, coincides with the sample average of the $\hat Y$'s when the regression includes an intercept,³ in which case $\bar Y = \bar{\hat Y}$). Thus, $R^2$ is just:
$$R^2 = R^2_C = 1 - \frac{SSR}{SST}$$

³ Or, equivalently, when some linear combination of the columns of $X$ equals a constant.
Sometimes (especially when the regression does not include an intercept) another measure of goodness of fit is reported:
$$\frac{\sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2} = \frac{\hat Y' M_0 \hat Y}{Y' M_0 Y}$$
which measures the proportion of the total variation of the $Y$'s explained by the $X$'s. This latter measure coincides with the standard (centered) $R^2$ when the regression includes an intercept.
Remarks:

Clearly, $R^2 = 1$ (perfect fit) if $\sum_{i=1}^n \hat\varepsilon_i^2 = 0$, or equivalently if $\hat\varepsilon_i = 0$ for all $i$. In the bivariate regression model with an intercept, $E(Y_i|X) = \beta_1 + \beta_2 X_{i2}$, this last situation occurs if the observed $Y_i$'s lie exactly on a straight line, the slope and intercept of which we estimate by running a LS regression.
$R^2 = 0$ (no fit) if $\sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2 = 0$, or equivalently if $\hat Y_i = \bar{\hat Y}$ for all $i$. In the bivariate regression model with an intercept, this last situation occurs if the $\hat Y_i$'s lie on a horizontal straight line that goes through $\bar{\hat Y}$, which as we saw above is just $\bar Y$ if the regression includes an intercept.
It may be shown that in the two-variable regression model with an intercept, $E(Y_i|X) = \beta_1 + \beta_2 X_{i2}$, the $R^2$ is equal to the square of the sample correlation coefficient between $X_2$ and $Y$, i.e.
$$R^2 = \left(\frac{\sum_{i=1}^n \big(Y_i - \bar Y\big)\big(X_{i2} - \bar X_2\big)}{\sqrt{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2}\,\sqrt{\sum_{i=1}^n \big(X_{i2} - \bar X_2\big)^2}}\right)^2 = r^2$$
(Can you show this?)
We should be careful how we use $R^2$. First, if there is no intercept in the regression model, i.e. if $X$ doesn't include a column of ones, $R^2$ may take a negative value. Second, we shouldn't try to make $R^2$ as big as possible. The objective of running a LS regression is not to maximize $R^2$. We should think of $R^2$ as just a descriptive statistic of the model. Indeed, if the objective were to maximize $R^2$, we could simply add more columns to $X$ (i.e. more variables that have perhaps nothing to do with the economic question at hand) that are linearly independent, so that $X$ retains its full column rank. By doing that we could increase $R^2$; we could certainly not decrease it (this is a mathematical fact). However, a bigger $R^2$ doesn't mean a better model, i.e. the additional variables in $X$ may be completely irrelevant. The choice of which variables to include in $X$ should be guided by economic intuition and by what we are trying to explain, and not by the magnitude of $R^2$.
To deal with the sensitivity of $R^2$ to the number $K$ of variables in $X$ that we discussed in the remark above, we usually report along with $R^2$ another measure, called adjusted $R^2$, or $R$ bar squared, defined by:
$$\bar R^2 = 1 - \frac{SSR/(n-K)}{SST/(n-1)}$$
which is related to $R^2$ via the relationship
$$\bar R^2 = 1 - \big(1 - R^2\big)\frac{n-1}{n-K}$$
It only makes sense to compare $R^2$'s (or $\bar R^2$'s) for models that have the dependent variable expressed in the same units.
$R^2_{UC}$ has a simple geometrical interpretation. The cosine of the angle between the vectors $Y$ and $\hat Y$, say $\phi$, is:
$$\cos\phi = \frac{\big\|\hat Y\big\|}{\|Y\|}$$
This is the uncentered correlation coefficient between $Y$ and $\hat Y$. Hence
$$R^2_{UC} = \cos^2\phi$$
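To fix ideas, here is a brief numerical sketch (an added illustration, assuming NumPy; the data are simulated) that computes the fit measures discussed above, $R^2$, $R^2_{UC}$ and $\bar R^2$, and checks the geometric relation $R^2_{UC} = \cos^2\phi$ for a regression that includes an intercept.

import numpy as np

rng = np.random.default_rng(4)
n, K = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
Y = X @ np.array([1.0, 1.5, -0.7]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
resid = Y - Y_hat

SSR = np.sum(resid ** 2)
SST = np.sum((Y - Y.mean()) ** 2)
SSE = np.sum((Y_hat - Y_hat.mean()) ** 2)

R2 = 1 - SSR / SST                                   # centered R^2
R2_uc = 1 - SSR / np.sum(Y ** 2)                     # uncentered R^2
R2_bar = 1 - (SSR / (n - K)) / (SST / (n - 1))       # adjusted R^2

cos_phi = (Y_hat @ Y_hat) ** 0.5 / (Y @ Y) ** 0.5    # ||Y_hat|| / ||Y||
print(np.isclose(SST, SSE + SSR))                    # analysis-of-variance identity
print(np.isclose(R2_uc, cos_phi ** 2))               # R^2_UC = cos^2(phi)
print(R2, R2_bar)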
2. Partitioned Regression
It is of interest to derive a formula for the OLS estimator of a single coefficient, or of a subset of the coefficients, in the linear regression model
$$Y = X\beta + \varepsilon$$
For this, we will write the model in partitioned form:
$$Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$$
where, as usual, $Y$ and $\varepsilon$ are $n \times 1$ vectors, $X_1$ and $X_2$ are $n \times K_1$ and $n \times K_2$ matrices of full rank $K_1$ and $K_2$, respectively, and $\beta_1$ and $\beta_2$ are $K_1 \times 1$ and $K_2 \times 1$ vectors, respectively. Obviously, $X = \big(X_1 \;\; X_2\big)$ is an $n \times K$ matrix of full rank $K = K_1 + K_2$, and $\beta = \begin{pmatrix}\beta_1 \\ \beta_2\end{pmatrix}$ is $K \times 1$.
Suppose we regress $Y$ on $X = \big(X_1 \;\; X_2\big)$. Then $\hat\beta = \begin{pmatrix}\hat\beta_1 \\ \hat\beta_2\end{pmatrix}$ satisfies the normal equations $X'X\hat\beta = X'Y$, i.e.
$$\big(X_1 \;\; X_2\big)'\big(X_1 \;\; X_2\big)\begin{pmatrix}\hat\beta_1 \\ \hat\beta_2\end{pmatrix} = \big(X_1 \;\; X_2\big)'Y
\;\Longrightarrow\;
\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}\begin{pmatrix}\hat\beta_1 \\ \hat\beta_2\end{pmatrix} = \begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}$$
that is,
$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'Y \qquad (2.1)$$
$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'Y \qquad (2.2)$$
Solving (2.2) for $\hat\beta_2$ we obtain:
$$\hat\beta_2 = \big(X_2'X_2\big)^{-1}\big(X_2'Y - X_2'X_1\hat\beta_1\big) \qquad (2.3)$$
Plugging (2.3) into (2.1) and solving for $\hat\beta_1$ we obtain:⁴
$$\hat\beta_1 = \big(X_1'M_2 X_1\big)^{-1} X_1'M_2 Y$$
where $M_2 = I_n - X_2\big(X_2'X_2\big)^{-1}X_2'$. From the definition of $M_2$ we see that $M_2$ is an annihilator (residual-maker) matrix of the same form as $M_X$ above, and hence it satisfies $M_2' = M_2$ and $M_2 M_2 = M_2$.

⁴ See the Appendix for an alternative derivation.
Define $\tilde X_1 = M_2 X_1$. Note that $\tilde X_1$ contains the residuals from the regression of each variable (column) of $X_1$ on $X_2$. Since
$$\tilde X_1'\tilde X_1 = X_1'M_2'M_2 X_1 = X_1'M_2 X_1$$
we see that $\hat\beta_1$, the OLS estimator of $\beta_1$ in the regression of $Y$ on $X_1$ and $X_2$, is just:
$$\hat\beta_1 = \big(\tilde X_1'\tilde X_1\big)^{-1}\tilde X_1'Y \qquad (2.4)$$
But this is just the OLS estimator that we would obtain if we regressed $Y$ on $\tilde X_1$, i.e. if we regressed $Y$ on the residuals from the regression of $X_1$ on $X_2$. We call this regression the residual regression. It expresses the fact that the OLS estimator of a coefficient $\beta_k$ measures the pure effect of $X_k$ on $Y$, purged of the effect that $X_k$ has on $Y$ through its relationship with the other variables in the regression, that is, controlling for the other variables in $X_i$.
It is also of interest to note that since $X_1'M_2 Y = X_1'M_2'M_2 Y$, $\hat\beta_1$ may also be written as
$$\hat\beta_1 = \big(\tilde X_1'\tilde X_1\big)^{-1}\tilde X_1'\tilde Y \qquad (2.5)$$
where $\tilde Y = M_2 Y$, i.e. $\tilde Y$ is the vector of residuals from the regression of $Y$ on $X_2$. Therefore, $\hat\beta_1$ is also obtained from the regression of $\tilde Y$ (the residuals from the regression of $Y$ on $X_2$) on $\tilde X_1$ (the residuals from the regression of $X_1$ on $X_2$). We call this the double residual regression.
It is possible to show that the residuals from the original regression, $Y - X_1\hat\beta_1 - X_2\hat\beta_2$, are equal to the residuals of the double residual regression, $\tilde Y - \tilde X_1\hat\beta_1$. This fact, along with the result that the OLS estimators of coefficient sub-vectors can be obtained by the double residual regression above, is known as the Frisch-Waugh-Lovell theorem.
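The Frisch-Waugh-Lovell result is easy to check numerically. The sketch below is an added illustration (assuming NumPy; the partition of $X$ and the data are arbitrary); it compares the $X_1$-block of the full OLS estimator with the double residual regression, and checks that the two sets of residuals coincide.

import numpy as np

rng = np.random.default_rng(5)
n = 150
X1 = rng.normal(size=(n, 2))                              # K1 = 2 regressors of interest
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])    # K2 = 2 controls (incl. constant)
X = np.column_stack([X1, X2])
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

beta_full = np.linalg.solve(X.T @ X, X.T @ Y)             # long regression
beta1_full = beta_full[:2]

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)    # annihilator of X2
X1_t, Y_t = M2 @ X1, M2 @ Y                               # residuals from regressions on X2
beta1_fwl = np.linalg.solve(X1_t.T @ X1_t, X1_t.T @ Y_t)  # double residual regression

print(np.allclose(beta1_full, beta1_fwl))                       # same beta_1
print(np.allclose(Y - X @ beta_full, Y_t - X1_t @ beta1_fwl))   # same residuals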
Obviously, in any application we would obtain $\hat\beta_1$ as the corresponding subset of the OLS estimator $\hat\beta$ from the regression of $Y$ on both $X_1$ and $X_2$; we would not compute it by running the residual regression (2.4) or the double residual regression (2.5). The two expressions (2.4) and (2.5) are nevertheless useful and interesting for many reasons.

First, it is clear that the estimated OLS coefficient of $\beta_1$ when both $X_1$ and $X_2$ are included in the regression (the long regression),
$$\hat\beta_1^L = \big(X_1'M_2 X_1\big)^{-1} X_1'M_2 Y,$$
is in general different from the one obtained by just regressing $Y$ on $X_1$ (the short regression):
$$\hat\beta_1^S = \big(X_1'X_1\big)^{-1} X_1'Y,$$
unless $X_1'X_2 = 0$, i.e. the regressors in $X_1$ are orthogonal to those in $X_2$, or in the trivial case where $\beta_2 = 0$.
Second, from the formula for $\hat\beta_1$ above, $\hat\beta_1 = \big(X_1'M_2 X_1\big)^{-1}X_1'M_2 Y$, we can see that OLS estimates the pure effect of $X_1$ on $Y$, after purging (or netting out, or partialling out) the effect of $X_2$ from both $X_1$ and $Y$. And this is precisely what we are after: the effect of $X_1$ on $Y$ holding the other variables fixed.

A third reason why the expressions (2.4) and (2.5) are interesting is that they allow us to understand the effect of common data procedures, such as demeaning, detrending and de-seasonalizing the dependent and/or the independent variables, on the estimated regression coefficients.
We first explain what demeaning is. Common sense says that to demean an $n \times 1$ vector variable, say $Y$, we subtract the sample average of the vector's elements, $\frac{1}{n}\sum_{i=1}^n Y_i$, from each element $Y_i$ of $Y$. Formally, demeaning amounts to running the regression of $Y$ on an $n \times 1$ vector of ones, i.e. estimating by OLS the model $Y_i = 1\cdot c + u_i$, and then computing the residuals from this regression. To find the OLS estimator of $c$ we solve:
$$\min_c \; \sum_{i=1}^n (Y_i - c)^2$$
The FOC is $-2\sum_{i=1}^n (Y_i - \hat c) = 0 \Longrightarrow \hat c = \frac{1}{n}\sum_{i=1}^n Y_i$. Thus the residual $\tilde Y_i$ for each observation $i$ is $\tilde Y_i = Y_i - 1\cdot\hat c = Y_i - \frac{1}{n}\sum_{i=1}^n Y_i$.
Now suppose we are interested in estimating $\beta_1$ in the model:
$$Y_i = \beta_1 X_{i1} + \beta_2 + \varepsilon_i = \beta_1 X_{i1} + \beta_2 X_{i2} + \varepsilon_i$$
where we have defined $X_{i2} = 1$ for all $i$. Obviously we can estimate $\beta_1$ by running the regression above. By the residual regression expression for $\hat\beta_1$, $\hat\beta_1$ can also be obtained as:
$$\hat\beta_1 = \big(\tilde X_1'\tilde X_1\big)^{-1}\tilde X_1'Y = \frac{\tilde X_1'Y}{\tilde X_1'\tilde X_1} = \frac{\sum_{i=1}^n \tilde X_{i1} Y_i}{\sum_{i=1}^n \tilde X_{i1}^2}$$
where $\tilde X_{i1}$ is the residual for the $i$th observation from the regression of $X_1$ on $X_2$. But here $X_2$ is just a vector of ones. Hence this residual is just the demeaned $X_{i1}$, i.e. $\tilde X_{i1} = X_{i1} - \frac{1}{n}\sum_{i=1}^n X_{i1} = X_{i1} - \bar X_1$. Thus,
$$\hat\beta_1 = \frac{\sum_{i=1}^n \big(X_{i1} - \bar X_1\big)Y_i}{\sum_{i=1}^n \big(X_{i1} - \bar X_1\big)^2}$$
But note that from the double residual regression, we have:
$$\hat\beta_1 = \big(\tilde X_1'\tilde X_1\big)^{-1}\tilde X_1'\tilde Y = \frac{\tilde X_1'\tilde Y}{\tilde X_1'\tilde X_1} = \frac{\sum_{i=1}^n \tilde X_{i1}\tilde Y_i}{\sum_{i=1}^n \tilde X_{i1}^2} = \frac{\sum_{i=1}^n \big(X_{i1} - \bar X_1\big)\big(Y_i - \bar Y\big)}{\sum_{i=1}^n \big(X_{i1} - \bar X_1\big)^2}$$
since $\tilde Y_i$ is the residual from the regression of $Y$ on $X_2$, which for this particular example is just the demeaned value, i.e. $\tilde Y_i = Y_i - \bar Y$.

In other words, to estimate $\beta_1$ we can either run the regression of $Y$ on $X_1$ and a vector of ones, or demean $X_1$ and then regress $Y$ on the demeaned $X_1$, or, still, demean both $Y$ and $X_1$ and then regress the demeaned $Y$ on the demeaned $X_1$.
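A short numerical check of the three equivalent routes just listed (an added illustration, assuming NumPy; the data are simulated):

import numpy as np

rng = np.random.default_rng(6)
n = 100
X1 = rng.normal(size=n)
Y = 2.0 + 1.5 * X1 + rng.normal(size=n)

# (a) regress Y on X1 and a column of ones
X = np.column_stack([X1, np.ones(n)])
b_a = np.linalg.solve(X.T @ X, X.T @ Y)[0]

# (b) regress Y on the demeaned X1
x1 = X1 - X1.mean()
b_b = np.sum(x1 * Y) / np.sum(x1 ** 2)

# (c) regress the demeaned Y on the demeaned X1
b_c = np.sum(x1 * (Y - Y.mean())) / np.sum(x1 ** 2)

print(np.allclose(b_a, b_b), np.allclose(b_a, b_c))   # all three give the same slope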
Another application is when we worry about linear, say, trends in both $Y$ and $X_1$. To measure the pure effect of $X_1$ on $Y$, purged of the spurious strong linear relationship that $Y$ and $X_1$ may exhibit just due to the fact that they are both trending linearly, we may either do OLS on the model
$$Y_t = \beta_1 X_{t1} + \beta_2 + \beta_3 t + \varepsilon_t$$
i.e. include a trend $t$ as an additional variable in the model, or detrend $X_1$ and then run the regression of $Y$ on the detrended $X_1$ (residual regression), or, still, detrend both $Y$ and $X_1$ and then run the regression of the detrended $Y$ on the detrended $X_1$ (double residual regression). Similarly to demeaning, detrending a variable, say $Y_t$, means calculating the residual from the regression:
$$Y_t = c_1 + c_2 t + u_t$$
Note that in this example $X_2 = \big(\iota \;\; t\big)$, i.e. it has two columns (a column of ones and the trend).
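The same logic can be checked for detrending; a minimal sketch follows (added here for illustration, assuming NumPy; the trending series are simulated and the names arbitrary).

import numpy as np

rng = np.random.default_rng(7)
T = 120
t = np.arange(1, T + 1, dtype=float)
X1 = 0.05 * t + rng.normal(size=T)                  # trending regressor
Y = 1.0 + 0.5 * X1 + 0.02 * t + rng.normal(size=T)  # trending dependent variable

# (a) long regression: Y on X1, a constant and the trend
X = np.column_stack([X1, np.ones(T), t])
b_a = np.linalg.solve(X.T @ X, X.T @ Y)[0]

# (b) double residual regression: detrend both Y and X1, then regress
X2 = np.column_stack([np.ones(T), t])
M2 = np.eye(T) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
x1_t, y_t = M2 @ X1, M2 @ Y
b_b = np.sum(x1_t * y_t) / np.sum(x1_t ** 2)

print(np.allclose(b_a, b_b))   # True: identical estimates of beta_1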
Finally, seasonal effects on $Y$ and $X_1$ may be controlled for by including in the regression 3 seasonal dummies and a constant (or, alternatively, all four seasonal dummies but no constant term). Let $F_t$, $W_t$ and $S_t$ be 1 if period $t$ is in the fall season, winter season, or spring season, respectively, and 0 otherwise. We may write the model:
$$Y_t = \beta_1 X_{t1} + \beta_2 + \beta_3 F_t + \beta_4 W_t + \beta_5 S_t + \varepsilon_t$$
and regress $Y$ on $X = \big(X_1 \;\; \iota \;\; F \;\; W \;\; S\big)$ in order to estimate $\beta_1$, or, alternatively, regress both $Y$ and $X_1$ on $X_2 = \big(\iota \;\; F \;\; W \;\; S\big)$, obtain the residuals $\tilde Y$ and $\tilde X_1$, and then regress $\tilde Y$ on $\tilde X_1$. Note that $\hat\beta_1$ should be the same no matter which seasonal dummies we include in the regression, although the estimated intercept, $\hat\beta_2$, and the coefficients on the seasonal dummies, $\hat\beta_3$, $\hat\beta_4$, and $\hat\beta_5$, will be numerically different (and their interpretation will be different as well) if we choose a different set of 3 seasonal dummies, e.g. for summer, fall, and winter.
APPENDIX
On the Coefficient of Determination
Proposition 2.1. If the regression of $Y$ on the $X_j$'s includes an intercept, i.e. if the $X$ matrix has a column of ones, then:
$$0 \leq R^2 \leq 1$$
Proof: From the identity $Y_i = \hat Y_i + \hat\varepsilon_i$, we obtain by squaring both sides
$$Y_i^2 = \big(\hat Y_i + \hat\varepsilon_i\big)^2 = \hat Y_i^2 + \hat\varepsilon_i^2 + 2\hat Y_i\hat\varepsilon_i$$
and by summing over all observations:
$$\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n \hat Y_i^2 + \sum_{i=1}^n \hat\varepsilon_i^2 + 2\sum_{i=1}^n \hat Y_i\hat\varepsilon_i \qquad (2.6)$$
Note, however, that $\hat Y_i = X_i\hat\beta = \big(X_i\hat\beta\big)' = \hat\beta' X_i'$, since $\hat Y_i$ is a scalar. The last term of the equation above is therefore:
$$2\sum_{i=1}^n \hat Y_i\hat\varepsilon_i = 2\sum_{i=1}^n \hat\beta' X_i'\hat\varepsilon_i = 2\hat\beta'\sum_{i=1}^n X_i'\hat\varepsilon_i$$
since $\hat\beta'$ is constant (it doesn't vary with $i$). But from the normal equations (the FOC of the LS problem), we have that $\sum_{i=1}^n X_i'\hat\varepsilon_i = \sum_{i=1}^n X_i'\big(Y_i - X_i\hat\beta\big) = 0_{K\times 1}$. Thus, from (2.6) we have:
$$\sum_{i=1}^n Y_i^2 = \sum_{i=1}^n \hat Y_i^2 + \sum_{i=1}^n \hat\varepsilon_i^2 \qquad (2.7)$$
Also, summing the identity $Y_i = \hat Y_i + \hat\varepsilon_i$ over all observations and dividing by $n$, we obtain:
$$\frac{1}{n}\sum_{i=1}^n Y_i = \frac{1}{n}\sum_{i=1}^n \big(\hat Y_i + \hat\varepsilon_i\big) = \frac{1}{n}\sum_{i=1}^n \hat Y_i + \frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i \qquad (2.8)$$
But if the regressor matrix $X$ has as a first, say, column a column of ones, so that $X_{i1} = 1$ for all $i$, then from the first normal equation, the one that corresponds to the intercept term, we have:
$$\sum_{i=1}^n X_{i1}\hat\varepsilon_i = \sum_{i=1}^n X_{i1}\big(Y_i - X_i\hat\beta\big) = \sum_{i=1}^n 1\cdot\big(Y_i - X_i\hat\beta\big) = 0$$
which means that $\sum_{i=1}^n \hat\varepsilon_i = 0$ and hence their sample average is $\frac{1}{n}\sum_{i=1}^n \hat\varepsilon_i = 0$. Therefore equation (2.8) becomes:
$$\frac{1}{n}\sum_{i=1}^n Y_i = \frac{1}{n}\sum_{i=1}^n \hat Y_i$$
i.e. the sample average of the actual $Y_i$'s, $\bar Y$, is equal to the sample average $\bar{\hat Y}$ of the predicted values, the $\hat Y_i$'s. That is, $\bar Y = \bar{\hat Y}$, so that:
$$n\bar Y^2 = n\bar{\hat Y}^2 \qquad (2.9)$$
Finally, subtracting (2.9) from (2.7), we obtain:
$$\sum_{i=1}^n Y_i^2 - n\bar Y^2 = \sum_{i=1}^n \hat Y_i^2 - n\bar{\hat Y}^2 + \sum_{i=1}^n \hat\varepsilon_i^2$$
or, equivalently,
$$\sum_{i=1}^n \big(Y_i - \bar Y\big)^2 = \sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2 + \sum_{i=1}^n \hat\varepsilon_i^2 \qquad (2.10)$$
This last equation is called the analysis (or decomposition) of variance and says that the total variation of the dependent variable, or Total Sum of Squares,
$$SST = \sum_{i=1}^n \big(Y_i - \bar Y\big)^2$$
can be decomposed into the variation that is explained by the regression, the Explained Sum of Squares,
$$SSE = \sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2$$
and the unexplained variation, the Sum of Squared Residuals,
$$SSR = \sum_{i=1}^n \hat\varepsilon_i^2$$
In other words,
$$SST = SSE + SSR$$
Dividing equation (2.10) by $\sum_{i=1}^n \big(Y_i - \bar Y\big)^2$, we get:
$$1 = \frac{\sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2} + \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2}$$
where both terms in the sum above are non-negative and in fact must both be less than or equal to 1, i.e.
$$0 \leq \frac{\sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2} \leq 1
\qquad\text{and}\qquad
0 \leq \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2} \leq 1,$$
which implies that
$$R^2 = 1 - \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2} = \frac{\sum_{i=1}^n \big(\hat Y_i - \bar{\hat Y}\big)^2}{\sum_{i=1}^n \big(Y_i - \bar Y\big)^2}$$
is between 0 and 1, i.e. $0 \leq R^2 \leq 1$. QED.
Alternative derivation of partitioned regression formulae
$\hat\beta_1$ may also be obtained as follows. Recall that the inverse of a partitioned matrix is given by
$$\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1}
= \begin{pmatrix} E^{-1} & -E^{-1}BD^{-1} \\ -D^{-1}CE^{-1} & F^{-1} \end{pmatrix}$$
(see e.g. Amemiya (1985), p. 460), where
$$E = A - BD^{-1}C, \qquad F = D - CA^{-1}B$$
Now, from
$$\begin{pmatrix}\hat\beta_1 \\ \hat\beta_2\end{pmatrix}
= \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}
= \begin{pmatrix} E^{-1} & -E^{-1}BD^{-1} \\ -D^{-1}CE^{-1} & F^{-1} \end{pmatrix}\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}
= \begin{pmatrix} E^{-1}X_1'Y - E^{-1}BD^{-1}X_2'Y \\ -D^{-1}CE^{-1}X_1'Y + F^{-1}X_2'Y \end{pmatrix}$$
with $A = X_1'X_1$, $B = X_1'X_2$, $C = X_2'X_1$ and $D = X_2'X_2$, we have that
$$\hat\beta_1 = E^{-1}X_1'Y - E^{-1}BD^{-1}X_2'Y
= E^{-1}\big(X_1' - BD^{-1}X_2'\big)Y
= E^{-1}\Big(X_1' - X_1'X_2\big(X_2'X_2\big)^{-1}X_2'\Big)Y
= E^{-1}X_1'M_2 Y$$
where
$$E = A - BD^{-1}C = X_1'X_1 - X_1'X_2\big(X_2'X_2\big)^{-1}X_2'X_1 = X_1'M_2 X_1$$
Thus we obtain
$$\hat\beta_1 = \big(X_1'M_2 X_1\big)^{-1}X_1'M_2 Y$$