Lecture Notes
Joshua M. Tebbs
Department of Statistics
The University of South Carolina
Contents

1 Examples of the General Linear Model
2 The Linear Least Squares Problem
   2.1 Least squares estimation
   2.2 Geometric considerations
   2.3 Reparameterization
3
   3.1 Introduction
   3.2 Estimability
      3.2.1 One-way ANOVA
      3.2.2 Two-way crossed ANOVA without interaction
      3.2.3 Two-way crossed ANOVA with interaction
   3.3 Reparameterization
   3.4
4
   4.1 Introduction
   4.2
   4.3
   4.4
      4.4.1 Underfitting (Misspecification)
      4.4.2 Overfitting
   4.5
5 Distributional Theory
   5.1 Introduction
   5.2
      5.2.1
      5.2.2
      5.2.3 Properties
      5.2.4
      5.2.5 Independence results
      5.2.6 Conditional distributions
   5.3 Noncentral χ² distribution
   5.4 Noncentral F distribution
   5.5
   5.6
   5.7 Cochran's Theorem
6 Statistical Inference
   6.1 Estimation
   6.2 Testing models
   6.3
   6.4
   6.5
   6.6
7 Appendix
CHAPTER 1

Y_{n×1} = (Y₁, Y₂, ..., Yₙ)′,  X_{n×1} = 1ₙ = (1, 1, ..., 1)′,  β_{1×1} = μ,  ε_{n×1} = (ε₁, ε₂, ..., εₙ)′.

Y_{n×1} = (Y₁, Y₂, ..., Yₙ)′,  X_{n×2} = [ 1  x₁
                                           1  x₂
                                           ⋮  ⋮
                                           1  xₙ ],  β_{2×1} = (β₀, β₁)′,  ε_{n×1} = (ε₁, ε₂, ..., εₙ)′.
for i = 1, 2, ..., n, where the εᵢ are uncorrelated random variables with mean 0 and common
variance σ² > 0. If the independent variables are fixed constants, measured without
error, then this model is a special GM model Y = Xβ + ε, where

Y_{n×1} = (Y₁, Y₂, ..., Yₙ)′,  X_{n×p} = [ 1  x₁₁  x₁₂  ⋯  x₁ₖ
                                           1  x₂₁  x₂₂  ⋯  x₂ₖ
                                           ⋮   ⋮    ⋮        ⋮
                                           1  xₙ₁  xₙ₂  ⋯  xₙₖ ],
β_{p×1} = (β₀, β₁, β₂, ..., βₖ)′,  ε_{n×1} = (ε₁, ε₂, ..., εₙ)′,

with p = k + 1.

Example 1.4. One-way ANOVA. Consider the model Y_ij = μ + αᵢ + ε_ij, for i = 1, 2, ..., a and
j = 1, 2, ..., nᵢ. In matrix form,

Y_{n×1} = (Y₁₁, Y₁₂, ..., Y_{a nₐ})′,  X_{n×p} = [ 1_{n₁}  1_{n₁}  0_{n₁}  ⋯  0_{n₁}
                                                   1_{n₂}  0_{n₂}  1_{n₂}  ⋯  0_{n₂}
                                                    ⋮       ⋮       ⋮          ⋮
                                                   1_{nₐ}  0_{nₐ}  0_{nₐ}  ⋯  1_{nₐ} ],
β_{p×1} = (μ, α₁, α₂, ..., αₐ)′,

where p = a + 1 and ε_{n×1} = (ε₁₁, ε₁₂, ..., ε_{a nₐ})′, and where 1_{nᵢ} is an nᵢ × 1 vector of ones
and 0_{nᵢ} is an nᵢ × 1 vector of zeros. Note that E(ε) = 0 and cov(ε) = σ²I.

NOTE: In Example 1.4, note that the first column of X is the sum of the last a columns;
i.e., there is a linear dependence in the columns of X. From results in linear algebra,
we know that X is not of full column rank. In fact, the rank of X is r = a, one less
than the number of columns p = a + 1.
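This rank deficiency is easy to check numerically. The following sketch (using numpy; the sizes a = 3 and nᵢ = 2 are made-up illustrative values) builds the one-way ANOVA design matrix and verifies that r(X) = a, one less than p = a + 1:

```python
import numpy as np

a, ni = 3, 2                                   # a treatments, n_i obs each (illustrative)
n = a * ni
groups = np.repeat(np.arange(a), ni)           # treatment label for each observation

# Columns: an intercept column 1_n, then one indicator column per treatment.
X = np.column_stack([np.ones(n)] + [(groups == i).astype(float) for i in range(a)])

# The first column is the sum of the last a columns (a linear dependence) ...
dependent = np.allclose(X[:, 0], X[:, 1:].sum(axis=1))

# ... so X has p = a + 1 columns but rank r = a.
r = np.linalg.matrix_rank(X)
print(dependent, r, X.shape[1])                # True 3 4
```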
With a = 3 levels of Factor A, b = 2 levels of Factor B nested within each level of A, and
n_ij = 2 observations per cell, the model Y = Xβ + ε has

Y = (Y₁₁₁, Y₁₁₂, Y₁₂₁, Y₁₂₂, Y₂₁₁, Y₂₁₂, Y₂₂₁, Y₂₂₂, Y₃₁₁, Y₃₁₂, Y₃₂₁, Y₃₂₂)′,

X = [ 1 1 0 0 1 0 0 0 0 0   (rows for Y₁₁₁, Y₁₁₂)
      1 1 0 0 1 0 0 0 0 0
      1 1 0 0 0 1 0 0 0 0   (rows for Y₁₂₁, Y₁₂₂)
      1 1 0 0 0 1 0 0 0 0
      1 0 1 0 0 0 1 0 0 0   (rows for Y₂₁₁, Y₂₁₂)
      1 0 1 0 0 0 1 0 0 0
      1 0 1 0 0 0 0 1 0 0   (rows for Y₂₂₁, Y₂₂₂)
      1 0 1 0 0 0 0 1 0 0
      1 0 0 1 0 0 0 0 1 0   (rows for Y₃₁₁, Y₃₁₂)
      1 0 0 1 0 0 0 0 1 0
      1 0 0 1 0 0 0 0 0 1   (rows for Y₃₂₁, Y₃₂₂)
      1 0 0 1 0 0 0 0 0 1 ],

β = (μ, α₁, α₂, α₃, β₁₁, β₁₂, β₂₁, β₂₂, β₃₁, β₃₂)′,

and ε = (ε₁₁₁, ε₁₁₂, ..., ε₃₂₂)′. Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not
of full column rank. The rank of X is r = 6 and there are p = 10 columns.
Example 1.6. Two-way crossed ANOVA with interaction. Consider an experiment with
two factors (A and B), where Factor A has a levels and Factor B has b levels. In general,
we say that factors A and B are crossed if every level of A occurs in combination with
every level of B. Consider the two-factor (crossed) ANOVA model given by

Y_ijk = μ + αᵢ + βⱼ + γ_ij + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij, where the random errors ε_ijk are
uncorrelated random variables with zero mean and constant unknown variance σ² > 0.
If all the parameters are fixed, this is a special GM model Y = Xβ + ε. For example,
with a = 3, b = 2, and n_ij = 3,
Y = (Y₁₁₁, Y₁₁₂, Y₁₁₃, Y₁₂₁, ..., Y₃₂₃)′,

X = [ 1 1 0 0 1 0 1 0 0 0 0 0   (three rows, for Y₁₁₁, Y₁₁₂, Y₁₁₃)
      1 1 0 0 0 1 0 1 0 0 0 0   (three rows, for Y₁₂₁, Y₁₂₂, Y₁₂₃)
      1 0 1 0 1 0 0 0 1 0 0 0   (three rows, for Y₂₁₁, Y₂₁₂, Y₂₁₃)
      1 0 1 0 0 1 0 0 0 1 0 0   (three rows, for Y₂₂₁, Y₂₂₂, Y₂₂₃)
      1 0 0 1 1 0 0 0 0 0 1 0   (three rows, for Y₃₁₁, Y₃₁₂, Y₃₁₃)
      1 0 0 1 0 1 0 0 0 0 0 1 ] (three rows, for Y₃₂₁, Y₃₂₂, Y₃₂₃),

each distinct row appearing once per replicate, and

β = (μ, α₁, α₂, α₃, β₁, β₂, γ₁₁, γ₁₂, γ₂₁, γ₂₂, γ₃₁, γ₃₂)′,
and ε = (ε₁₁₁, ε₁₁₂, ..., ε₃₂₃)′. Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not
of full column rank. The rank of X is r = 6 and there are p = 12 columns.
Example 1.7. Two-way crossed ANOVA without interaction. Consider the model
Y_ijk = μ + αᵢ + βⱼ + ε_ijk. With a = 3, b = 2, and n_ij = 3, as in Example 1.6,

Y = (Y₁₁₁, Y₁₁₂, Y₁₁₃, Y₁₂₁, ..., Y₃₂₃)′,

X = [ 1 1 0 0 1 0   (three rows, for Y₁₁₁, Y₁₁₂, Y₁₁₃)
      1 1 0 0 0 1   (three rows, for Y₁₂₁, Y₁₂₂, Y₁₂₃)
      1 0 1 0 1 0   (three rows, for Y₂₁₁, Y₂₁₂, Y₂₁₃)
      1 0 1 0 0 1   (three rows, for Y₂₂₁, Y₂₂₂, Y₂₂₃)
      1 0 0 1 1 0   (three rows, for Y₃₁₁, Y₃₁₂, Y₃₁₃)
      1 0 0 1 0 1 ] (three rows, for Y₃₂₁, Y₃₂₂, Y₃₂₃),

and β = (μ, α₁, α₂, α₃, β₁, β₂)′,
and ε = (ε₁₁₁, ε₁₁₂, ..., ε₃₂₃)′. Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not
of full column rank. The rank of X is r = 4 and there are p = 6 columns. Also note that
the design matrix for the no-interaction model is the same as the design matrix for the
interaction model, except that the last 6 columns are removed.
Example 1.8. Analysis of covariance. Consider an experiment to compare a ≥ 2
treatments after adjusting for the effects of a covariate x. A model for the analysis of
covariance is given by

Y_ij = μ + αᵢ + βᵢx_ij + ε_ij,

for i = 1, 2, ..., a, j = 1, 2, ..., nᵢ, where the random errors ε_ij are uncorrelated random
variables with zero mean and common variance σ² > 0. In this model, μ represents the
overall mean, αᵢ represents the (fixed) effect of receiving the ith treatment (disregarding
the covariates), and βᵢ denotes the slope of the line that relates Y to x for the ith
treatment. Note that this model allows the treatment slopes to be different. The x_ij's
are assumed to be fixed values measured without error.
NOTE: The analysis of covariance (ANCOVA) model is a special GM model Y = Xβ + ε.
For example, with a = 3 and n₁ = n₂ = n₃ = 3, we have

Y = (Y₁₁, Y₁₂, Y₁₃, Y₂₁, Y₂₂, Y₂₃, Y₃₁, Y₃₂, Y₃₃)′,

X = [ 1 1 0 0 x₁₁ 0   0
      1 1 0 0 x₁₂ 0   0
      1 1 0 0 x₁₃ 0   0
      1 0 1 0 0   x₂₁ 0
      1 0 1 0 0   x₂₂ 0
      1 0 1 0 0   x₂₃ 0
      1 0 0 1 0   0   x₃₁
      1 0 0 1 0   0   x₃₂
      1 0 0 1 0   0   x₃₃ ],

β = (μ, α₁, α₂, α₃, β₁, β₂, β₃)′,  and  ε = (ε₁₁, ε₁₂, ..., ε₃₃)′.
Note that E(ε) = 0 and cov(ε) = σ²I. The X matrix is not of full column rank. If there
are no linear dependencies among the last 3 columns, the rank of X is r = 6 and there
are p = 7 columns.
REDUCED MODEL: Consider the ANCOVA model in Example 1.8 which allows for
unequal slopes. If β₁ = β₂ = ⋯ = βₐ = β; that is, all slopes are equal, then the ANCOVA
model reduces to

Y_ij = μ + αᵢ + βx_ij + ε_ij.

That is, the common-slopes ANCOVA model is a reduced version of the model that
allows for different slopes. Assuming the same error structure, this reduced ANCOVA
model is also a special GM model Y = Xβ + ε. With a = 3 and n₁ = n₂ = n₃ = 3, as
before, we have
Y = (Y₁₁, Y₁₂, Y₁₃, Y₂₁, Y₂₂, Y₂₃, Y₃₁, Y₃₂, Y₃₃)′,

X = [ 1 1 0 0 x₁₁
      1 1 0 0 x₁₂
      1 1 0 0 x₁₃
      1 0 1 0 x₂₁
      1 0 1 0 x₂₂
      1 0 1 0 x₂₃
      1 0 0 1 x₃₁
      1 0 0 1 x₃₂
      1 0 0 1 x₃₃ ],

β = (μ, α₁, α₂, α₃, β)′,  and  ε = (ε₁₁, ε₁₂, ..., ε₃₃)′.
As long as the x_ij's are not constant within each treatment group, the rank of X is r = 4
and there are p = 5 columns.
GOAL: We now provide examples of linear models of the form Y = Xβ + ε that are not
GM models.

TERMINOLOGY: A factor of classification is said to be random if it has an infinitely
large number of levels and the levels included in the experiment can be viewed as a
random sample from the population of possible levels.

Example 1.9. One-way random effects ANOVA. Consider the model

Y_ij = μ + αᵢ + ε_ij,

for i = 1, 2, ..., a and j = 1, 2, ..., nᵢ, where the treatment effects α₁, α₂, ..., αₐ are best
regarded as random; e.g., the a levels of the factor of interest are drawn from a large
population of possible levels, and the random errors ε_ij are uncorrelated random variables
with zero mean and common variance σ² > 0. For concreteness, let a = 4 and nᵢ = 3.
The model Y = Xβ + ε looks like

Y = (Y₁₁, Y₁₂, Y₁₃, Y₂₁, ..., Y₄₃)′
  = 1₁₂ μ + [ 1₃ 0₃ 0₃ 0₃
              0₃ 1₃ 0₃ 0₃
              0₃ 0₃ 1₃ 0₃
              0₃ 0₃ 0₃ 1₃ ] (α₁, α₂, α₃, α₄)′ + (ε₁₁, ε₁₂, ..., ε₄₃)′
  = Xβ + Z₁ε₁ + ε₂,

where we identify X = 1₁₂, β = μ, Z₁ as the 12 × 4 block matrix above, ε₁ = (α₁, α₂, α₃, α₄)′,
ε₂ = (ε₁₁, ε₁₂, ..., ε₄₃)′, and ε = Z₁ε₁ + ε₂. This is not a GM model because

cov(ε) = cov(Z₁ε₁ + ε₂) = Z₁cov(ε₁)Z₁′ + cov(ε₂) = Z₁cov(ε₁)Z₁′ + σ²I,

provided that the αᵢ's and the errors ε_ij are uncorrelated. Note that cov(ε) ≠ σ²I.
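One can check this covariance structure numerically. The sketch below (numpy) makes the standard additional assumption that cov(ε₁) = σ²_α I, with made-up variance components σ²_α = 2 and σ² = 1:

```python
import numpy as np

a, ni = 4, 3
n = a * ni
Z1 = np.kron(np.eye(a), np.ones((ni, 1)))          # 12 x 4: column i indicates group i

sig2_alpha, sig2 = 2.0, 1.0                         # made-up variance components
V = sig2_alpha * (Z1 @ Z1.T) + sig2 * np.eye(n)     # cov(eps) with cov(eps1) = sig2_alpha*I

within  = V[0, 1]   # same group: covariance sig2_alpha
between = V[0, 3]   # different groups: covariance 0
not_gm  = not np.allclose(V, sig2 * np.eye(n))      # cov(eps) != sigma^2 I
print(within, between, not_gm)                      # 2.0 0.0 True
```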
Example 1.10. Two-factor mixed model. Consider an experiment with two factors (A
and B), where Factor A is fixed and has a levels and Factor B is random with b levels.
A statistical model for this situation is given by

Y_ijk = μ + αᵢ + βⱼ + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij. The αᵢ's are best regarded as fixed
and the βⱼ's are best regarded as random. This model assumes no interaction.

APPLICATION: In a randomized block experiment, b blocks may have been selected
randomly from a large collection of available blocks. If the goal is to make a statement
about the large population of blocks (and not those b blocks in the experiment), then
blocks are considered as random. The treatment effects α₁, α₂, ..., αₐ are regarded as
fixed constants if the a treatments are the only ones of interest.
NOTE: For concreteness, suppose that a = 2, b = 4, and n_ij = 1. We can write the
model above as

Y = (Y₁₁, Y₁₂, Y₁₃, Y₁₄, Y₂₁, Y₂₂, Y₂₃, Y₂₄)′
  = [ 1₄ 1₄ 0₄
      1₄ 0₄ 1₄ ] (μ, α₁, α₂)′ + [ I₄
                                  I₄ ] (β₁, β₂, β₃, β₄)′ + (ε₁₁, ..., ε₂₄)′
  = Xβ + Z₁ε₁ + ε₂,

where X = [1₄ 1₄ 0₄ ; 1₄ 0₄ 1₄], β = (μ, α₁, α₂)′, Z₁ = [I₄ ; I₄], ε₁ = (β₁, β₂, β₃, β₄)′, and
ε₂ = (ε₁₁, ..., ε₂₄)′. Equivalently, separating the intercept term,

Y = 1₈ μ + [ 1₄ 0₄
             0₄ 1₄ ] (α₁, α₂)′ + [ I₄
                                   I₄ ] (β₁, β₂, β₃, β₄)′ + (ε₁₁, ..., ε₂₄)′.
Example 1.11. Time series models. When measurements are taken over time, the GM
model may not be appropriate because observations are likely correlated. A linear model
of the form Y = Xβ + ε, where E(ε) = 0 and cov(ε) = σ²V, V known, may be more
appropriate. The general form of V is chosen to model the correlation of the observed
responses. For example, consider the statistical model

Yₜ = β₀ + β₁t + εₜ,

for t = 1, 2, ..., n, where εₜ = φεₜ₋₁ + aₜ, the aₜ are iid N(0, σ²ₐ), and |φ| < 1 (this is a
stationarity condition). This is called a simple linear trend model where the error
process {εₜ : t = 1, 2, ..., n} follows an autoregressive model of order 1, AR(1). It is easy
to show that E(εₜ) = 0, for all t, and that cov(εₜ, εₛ) = σ²φ^|t−s|, for all t and s, where
σ² = σ²ₐ/(1 − φ²). Therefore, if n = 5,

V = [ 1   φ   φ²  φ³  φ⁴
      φ   1   φ   φ²  φ³
      φ²  φ   1   φ   φ²
      φ³  φ²  φ   1   φ
      φ⁴  φ³  φ²  φ   1 ].
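The AR(1) correlation matrix is easy to generate for any n. A quick numpy sketch (φ = 0.5 is an arbitrary illustrative value):

```python
import numpy as np

phi, n = 0.5, 5                                # illustrative values; need |phi| < 1
t = np.arange(n)
V = phi ** np.abs(t[:, None] - t[None, :])     # (t, s) entry is phi^{|t-s|}

first_row = V[0]                               # (1, phi, phi^2, phi^3, phi^4)
symmetric = np.allclose(V, V.T)
print(first_row, symmetric)
```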
Example 1.12. Random coefficient models. Suppose that t measurements are taken
(over time) on each of n individuals and consider the model

Y_ij = x_ij′βᵢ + ε_ij,

for i = 1, 2, ..., n and j = 1, 2, ..., t; that is, the different p × 1 regression parameters βᵢ
are subject-specific. If the individuals are considered to be a random sample, then we
can treat β₁, β₂, ..., βₙ as iid random vectors with mean β and p × p covariance matrix
Σ, say. We can write this model as

Y_ij = x_ij′βᵢ + ε_ij = x_ij′β + x_ij′(βᵢ − β) + ε_ij,

where the first term on the right is fixed and the remaining two terms are random.
Example 1.13. Measurement error. Consider the simple linear regression model
Yᵢ = β₀ + β₁Xᵢ + εᵢ, where Xᵢ is not observed directly; instead, we observe Wᵢ = Xᵢ + Uᵢ,
where Uᵢ is measurement error. That is, we observe (Yᵢ, Wᵢ) but not (Xᵢ, εᵢ, Uᵢ); the
unknown parameters are (β₀, β₁, σ², σ²_U). Substituting gives Yᵢ = β₀ + β₁Wᵢ + (εᵢ − β₁Uᵢ).
Because the Wᵢ's are not fixed in advance, we would at least need E(εᵢ − β₁Uᵢ | Wᵢ) = 0 for this
to be a GM linear model. However, note that

E(εᵢ − β₁Uᵢ | Wᵢ) = E(εᵢ − β₁Uᵢ | Xᵢ + Uᵢ) = E(εᵢ | Xᵢ + Uᵢ) − β₁E(Uᵢ | Xᵢ + Uᵢ).

The first term is zero if εᵢ is independent of both Xᵢ and Uᵢ. The second term generally
is not zero (unless β₁ = 0, of course) because Uᵢ and Xᵢ + Uᵢ are correlated. Therefore,
this cannot be a GM model.
CHAPTER 2
2.1 Least squares estimation

LEAST SQUARES: Let β = (β₁, β₂, ..., βₚ)′ ∈ Rᵖ and define the error sum of squares

Q(β) = (Y − Xβ)′(Y − Xβ),

the squared distance from Y to Xβ. The point where Q(β) is minimized satisfies

∂Q(β)/∂β = 0;  or, in other words,  (∂Q(β)/∂β₁, ∂Q(β)/∂β₂, ..., ∂Q(β)/∂βₚ)′ = (0, 0, ..., 0)′.
This minimization problem can be tackled either algebraically or geometrically.
and

∂(b′Ab)/∂b = (A + A′)b.

When X′X is singular, which can happen in ANOVA models (see Chapter 1), there can
be multiple solutions to the normal equations. Having already proved algebraically that
the normal equations are consistent, we know that the general form of the least squares
solution is

β̂ = (X′X)⁻X′Y + [I − (X′X)⁻X′X]z,

where z ∈ Rᵖ is arbitrary.
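One can check numerically that every vector of this form solves the normal equations and yields the same fitted values. In the sketch below (numpy; the rank-deficient one-way ANOVA design and the response are made up), (X′X)⁻ is taken to be the Moore-Penrose inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(3), 2)
X = np.column_stack([np.ones(6)] + [(groups == i).astype(float) for i in range(3)])
Y = rng.normal(size=6)

G = np.linalg.pinv(X.T @ X)       # one generalized inverse of X'X
p = X.shape[1]
z = np.ones(p)                    # any z gives a solution

b1 = G @ X.T @ Y                                  # particular solution
b2 = b1 + (np.eye(p) - G @ X.T @ X) @ z           # another solution

solves = np.allclose(X.T @ X @ b1, X.T @ Y) and np.allclose(X.T @ X @ b2, X.T @ Y)
distinct = not np.allclose(b1, b2)                # different solutions ...
same_fit = np.allclose(X @ b1, X @ b2)            # ... identical fitted values
print(solves, distinct, same_fit)                 # True True True
```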
2.2
Geometric considerations
since the cross product term 2(Xβ̂ − Xβ)′(Y − Xβ̂) = 0; verify this using the fact that
β̂ solves the normal equations. Thus, we have shown that Q(β) = Q(β̂) + z′z, where
z = Xβ̂ − Xβ. Now suppose that β̃ is any other minimizer, so that Q(β̃) = Q(β̂). But
because Q(β̃) = Q(β̂) + z′z, where z = Xβ̂ − Xβ̃, it must be true that z = Xβ̂ − Xβ̃ = 0;
that is, Xβ̂ = Xβ̃. Thus,

X′Xβ̃ = X′Xβ̂ = X′Y,

since β̂ is a solution to the normal equations. This shows that β̃ is also a solution to the
normal equations.

INVARIANCE: In proving the last result, we have discovered a very important fact;
namely, if β̂ and β̃ both solve the normal equations, then Xβ̂ = Xβ̃. In other words,
Xβ̂ is invariant to the choice of β̂.
NOTE: The following result ties least squares estimation to the notion of a perpendicular
projection matrix. It also produces a general formula for the matrix.

Result 2.5. An estimate β̂ is a least squares estimate if and only if Xβ̂ = MY, where
M is the perpendicular projection matrix onto C(X).

Proof. We will show that

(Y − Xβ)′(Y − Xβ) = (Y − MY)′(Y − MY) + (MY − Xβ)′(MY − Xβ).

Both terms on the right hand side are nonnegative, and the first term does not involve
β. Thus, (Y − Xβ)′(Y − Xβ) is minimized by minimizing (MY − Xβ)′(MY − Xβ), the
squared distance between MY and Xβ. This distance is zero if and only if MY = Xβ,
which proves the result. Now to show the above equation:

(Y − Xβ)′(Y − Xβ) = (Y − MY + MY − Xβ)′(Y − MY + MY − Xβ)
                  = (Y − MY)′(Y − MY) + (Y − MY)′(MY − Xβ)   (∗)
                    + (MY − Xβ)′(Y − MY)                     (∗∗)
                    + (MY − Xβ)′(MY − Xβ).

It suffices to show that (∗) and (∗∗) are zero. To show that (∗) is zero, note that

(Y − MY)′(MY − Xβ) = Y′(I − M)(MY − Xβ) = [(I − M)Y]′(MY − Xβ) = 0,

because (I − M)Y ∈ N(X′) and MY − Xβ ∈ C(X). Similarly, (∗∗) = 0 as well. □
Result 2.6. The perpendicular projection matrix onto C(X) is given by

M = X(X′X)⁻X′.

Proof. We know that β̂ = (X′X)⁻X′Y is a solution to the normal equations, so it is a
least squares estimate. But, by Result 2.5, we know Xβ̂ = MY. Because perpendicular
projection matrices are unique, M = X(X′X)⁻X′ as claimed. □

NOTATION: Monahan uses PX to denote the perpendicular projection matrix onto
C(X). We will henceforth do the same; that is,

PX = X(X′X)⁻X′.
PROPERTIES : Let PX denote the perpendicular projection matrix onto C(X). Then
(a) PX is idempotent
(b) PX projects onto C(X)
(c) PX is invariant to the choice of (X′X)⁻
(d) PX is symmetric
(e) PX is unique.
We have already proven (a), (b), (d), and (e); see Matrix Algebra Review 5. Part (c) must
be true; otherwise, part (e) would not hold. However, we can prove (c) more rigorously.
Result 2.7. If (X′X)₁⁻ and (X′X)₂⁻ are generalized inverses of X′X, then

1. X(X′X)₁⁻X′X = X(X′X)₂⁻X′X = X
2. X(X′X)₁⁻X′ = X(X′X)₂⁻X′.
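Result 2.7 can be verified numerically. The sketch below (numpy) uses a rank-deficient one-way ANOVA design; the first generalized inverse inverts only the lower-right diagonal block (mirroring the diagonal g-inverse used in the one-way ANOVA examples of Chapter 3), and the second is the Moore-Penrose inverse:

```python
import numpy as np

groups = np.repeat(np.arange(3), 2)
X = np.column_stack([np.ones(6)] + [(groups == i).astype(float) for i in range(3)])
A = X.T @ X

G1 = np.diag([0.0, 0.5, 0.5, 0.5])   # invert the 3x3 diagonal block; zero elsewhere
G2 = np.linalg.pinv(A)               # Moore-Penrose inverse

both_ginv = np.allclose(A @ G1 @ A, A) and np.allclose(A @ G2 @ A, A)
PX1, PX2 = X @ G1 @ X.T, X @ G2 @ X.T
same_P = np.allclose(PX1, PX2)                                  # invariance
ppm = np.allclose(PX1 @ PX1, PX1) and np.allclose(PX1, PX1.T)   # idempotent, symmetric
print(both_ginv, same_P, ppm)        # True True True
```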
We call ê = Y − Ŷ the vector of residuals. Note that ê ∈ N(X′). Because C(X) and N(X′) are
orthogonal complements, we know that Y can be uniquely decomposed as

Y = Ŷ + ê.

We also know that Ŷ and ê are orthogonal vectors. Finally, note that

Y′Y = Y′IY = Y′(PX + I − PX)Y
    = Y′PXY + Y′(I − PX)Y
    = Y′PXPXY + Y′(I − PX)(I − PX)Y
    = Ŷ′Ŷ + ê′ê,

since PX and I − PX are both symmetric and idempotent; i.e., they are both perpendicular
projection matrices (but onto orthogonal spaces). This orthogonal decomposition of Y′Y
is often given in a tabular display called an analysis of variance (ANOVA) table.
ANOVA TABLE: Suppose that Y is n × 1, X is n × p with rank r ≤ p, β is p × 1, and
ε is n × 1. An ANOVA table looks like

Source     df      SS
Model      r       Ŷ′Ŷ = Y′PXY
Residual   n − r   ê′ê = Y′(I − PX)Y
Total      n       Y′Y = Y′IY

It is interesting to note that the sum of squares column, abbreviated SS, catalogues
3 quadratic forms, Y′PXY, Y′(I − PX)Y, and Y′IY. The degrees of freedom column,
abbreviated df, catalogues the ranks of the associated quadratic form matrices; i.e.,

r(PX) = r,  r(I − PX) = n − r,  r(I) = n.

The quantity Y′PXY is called the (uncorrected) model sum of squares, Y′(I − PX)Y
is called the residual sum of squares, and Y′Y is called the (uncorrected) total sum of
squares.
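A numerical check of the decomposition and the df column (numpy, with made-up regression data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 6
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # full-rank design, r = 2
Y = rng.normal(size=n)

PX = X @ np.linalg.pinv(X.T @ X) @ X.T
model_ss = Y @ PX @ Y
resid_ss = Y @ (np.eye(n) - PX) @ Y
total_ss = Y @ Y

adds_up = np.allclose(total_ss, model_ss + resid_ss)    # Y'Y = Y'PX Y + Y'(I-PX)Y
df = (np.linalg.matrix_rank(PX), np.linalg.matrix_rank(np.eye(n) - PX))
print(adds_up, df)                                      # True (2, 4)
```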
NOTE: The following visualization analogy is taken liberally from Christensen (2002).

VISUALIZATION: One can think about the geometry of least squares estimation in
three dimensions (i.e., when n = 3). Consider your kitchen table and take one corner of
the table to be the origin. Take C(X) as the two-dimensional subspace determined by the
surface of the table, and let Y be any vector originating at the origin; i.e., any point in
R³. The linear model says that E(Y) = Xβ, which just says that E(Y) is somewhere on
the table. The least squares estimate Ŷ = Xβ̂ = PXY is the perpendicular projection
of Y onto the surface of the table. The residual vector ê = (I − PX)Y is the vector
starting at the origin, perpendicular to the surface of the table, that reaches the same
height as Y. Another way to think of the residual vector is to first connect Y and
PXY with a line segment (that is perpendicular to the surface of the table). Then,
shift the line segment along the surface (keeping it perpendicular) until the line segment
has one end at the origin. The residual vector ê is the perpendicular projection of Y
onto C(I − PX) = N(X′); that is, the projection onto the orthogonal complement of the
table surface. The orthogonal complement C(I − PX) is the one-dimensional space in
the vertical direction that goes through the origin. Once you have these vectors in place,
sums of squares arise from the Pythagorean Theorem.
A SIMPLE PPM: Suppose Y₁, Y₂, ..., Yₙ are iid with mean E(Yᵢ) = μ. In terms of the
general linear model, we can write Y = Xβ + ε, where

Y = (Y₁, Y₂, ..., Yₙ)′,  X = 1ₙ,  β = μ,  ε = (ε₁, ε₂, ..., εₙ)′.

The perpendicular projection matrix onto C(X) is given by

P₁ = 1(1′1)⁻¹1′ = n⁻¹11′ = n⁻¹J,

where J is the n × n matrix of ones. Note that

P₁Y = n⁻¹JY = Ȳ1,
where Ȳ = n⁻¹ Σᵢ Yᵢ. The matrix P₁ projects onto the space

C(P₁) = {z ∈ Rⁿ : z = (a, a, ..., a)′; a ∈ R}.

Note that r(P₁) = 1. Note also that

(I − P₁)Y = Y − P₁Y = Y − Ȳ1 = (Y₁ − Ȳ, Y₂ − Ȳ, ..., Yₙ − Ȳ)′,

the vector which contains the deviations from the mean. The perpendicular projection
matrix I − P₁ projects Y onto

C(I − P₁) = {z ∈ Rⁿ : z = (a₁, a₂, ..., aₙ)′; aᵢ ∈ R, Σᵢ aᵢ = 0}.
This is the model sum of squares for the model Yᵢ = μ + εᵢ; that is, Y′P₁Y is the sum of
squares that arises from fitting the overall mean μ. Now, consider a general linear model
of the form Y = Xβ + ε, where E(ε) = 0, and suppose that the first column of X is 1.
In general, we know that

Y′Y = Y′IY = Y′PXY + Y′(I − PX)Y.

Subtracting Y′P₁Y from both sides, we get

Y′(I − P₁)Y = Y′(PX − P₁)Y + Y′(I − PX)Y.

The quantity Y′(I − P₁)Y is called the corrected total sum of squares and the quantity
Y′(PX − P₁)Y is called the corrected model sum of squares. The term "corrected"
is understood to mean that we have removed the effects of fitting the mean. This is
important because this is the sum of squares breakdown that is commonly used; i.e.,
Source              df      SS
Model (Corrected)   r − 1   Y′(PX − P₁)Y
Residual            n − r   Y′(I − PX)Y
Total (Corrected)   n − 1   Y′(I − P₁)Y
In ANOVA models, the corrected model sum of squares Y′(PX − P₁)Y is often broken
down further into smaller components which correspond to different parts; e.g., orthogonal
contrasts, main effects, interaction terms, etc. Finally, the degrees of freedom are
simply the corresponding ranks of PX − P₁, I − PX, and I − P₁.
NOTE: In the general linear model Y = Xβ + ε, the residual vector from the least
squares fit ê = (I − PX)Y ∈ N(X′), so ê′X = 0; that is, the residuals in a least squares
fit are orthogonal to the columns of X, since the columns of X are in C(X). Note that if
1 ∈ C(X), which is true of all linear models with an intercept term, then

ê′1 = Σᵢ êᵢ = 0;

that is, the sum of the residuals from a least squares fit is zero. This is not necessarily
true of models for which 1 ∉ C(X).
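A quick sketch contrasting a model with an intercept against one fit through the origin (numpy, made-up data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
x = rng.normal(size=n)
Y = 1.0 + 2.0 * x + rng.normal(size=n)

# Intercept model: 1 is a column of X, so the residuals must sum to zero.
X = np.column_stack([np.ones(n), x])
e = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]

# No-intercept model: 1 is not in C(X); the guarantee disappears.
X0 = x[:, None]
e0 = Y - X0 @ np.linalg.lstsq(X0, Y, rcond=None)[0]

print(abs(e.sum()) < 1e-10, abs(e0.sum()) > 1e-10)   # True True
```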
Result 2.10. If C(W) ⊂ C(X), then C(PX − PW) = C[(I − PW)X] is the orthogonal
complement of C(PW) with respect to C(PX); that is,

C(PX − PW) = C(PW)⊥ ∩ C(PX).

Proof. C(PX − PW) ⊥ C(PW) because (PX − PW)PW = PXPW − P²W = PW − PW = 0.
Because C(PX − PW) ⊂ C(PX), C(PX − PW) is contained in the orthogonal complement
of C(PW) with respect to C(PX). Now suppose that v ∈ C(PX) and v ⊥ C(PW). Then,

v = PXv = (PX − PW)v + PWv = (PX − PW)v ∈ C(PX − PW),

showing that the orthogonal complement of C(PW) with respect to C(PX) is contained
in C(PX − PW). □
REMARK: The preceding two results are important for hypothesis testing in linear
models. Consider the linear models

Y = Xβ + ε  and  Y = Wγ + ε,

where C(W) ⊂ C(X). As we will learn later, the condition C(W) ⊂ C(X) implies that
Y = Wγ + ε is a reduced model when compared to Y = Xβ + ε, sometimes called
the full model. If E(ε) = 0, then, if the full model is correct,

E(PXY) = PXE(Y) = PXXβ = Xβ ∈ C(X).

Similarly, if the reduced model is correct, E(PWY) = Wγ ∈ C(W). Note that if
the reduced model Y = Wγ + ε is correct, then the full model Y = Xβ + ε is also
correct since C(W) ⊂ C(X). Thus, if the reduced model is correct, PXY and PWY
are attempting to estimate the same thing and their difference (PX − PW)Y should be
small. On the other hand, if the reduced model is not correct, but the full model is, then
PXY and PWY are estimating different things and one would expect (PX − PW)Y to
be large. The question about whether or not to accept the reduced model as plausible
thus hinges on deciding whether or not (PX − PW)Y, the (perpendicular) projection of
Y onto C(PX − PW) = C(PW)⊥ ∩ C(PX), is large or small.
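The structure of PX − PW is easy to verify numerically. The sketch below (numpy, made-up x) nests an intercept-only reduced model inside a simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10
x = rng.normal(size=n)
W = np.ones((n, 1))                       # reduced model design
X = np.column_stack([np.ones(n), x])      # full model design; C(W) is inside C(X)

PW = W @ np.linalg.pinv(W.T @ W) @ W.T
PX = X @ np.linalg.pinv(X.T @ X) @ X.T
D = PX - PW

is_ppm = np.allclose(D @ D, D) and np.allclose(D, D.T)   # itself a projection matrix
kills_reduced = np.allclose(D @ W, 0)                    # annihilates C(W)
print(is_ppm, kills_reduced)                             # True True
```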
2.3
Reparameterization
Consider the two models

Y = Xβ + ε  and  Y = Wγ + ε.

If C(X) = C(W), then PX does not depend on which of X or W is used; it depends only
on C(X) = C(W). As we will find out, the least-squares estimate of E(Y) is

Ŷ = PXY = Xβ̂ = Wγ̂.
IMPLICATION: The parameters β in the model Y = Xβ + ε, where E(ε) = 0, are
not really all that crucial. Because of this, it is standard to reparameterize linear models
(i.e., change the parameters) to exploit computational advantages, as we will soon see.
The essence of the model is that E(Y) ∈ C(X). As long as we do not change C(X), the
design matrix X and the corresponding model parameters can be altered in a manner
suitable to our liking.
EXAMPLE: Recall the simple linear regression model from Chapter 1 given by

Yᵢ = β₀ + β₁xᵢ + εᵢ,

for i = 1, 2, ..., n. Although not critical for this discussion, we will assume that ε₁, ε₂, ..., εₙ
are uncorrelated random variables with mean 0 and common variance σ² > 0. Recall

Y_{n×1} = (Y₁, Y₂, ..., Yₙ)′,  X_{n×2} = [ 1  x₁
                                           1  x₂
                                           ⋮  ⋮
                                           1  xₙ ],  β_{2×1} = (β₀, β₁)′,  ε_{n×1} = (ε₁, ε₂, ..., εₙ)′.
As long as (x₁, x₂, ..., xₙ)′ is not a multiple of 1ₙ, then r(X) = 2 and (X′X)⁻¹ exists.
Straightforward calculations show that

X′X = [ n      Σᵢ xᵢ
        Σᵢ xᵢ  Σᵢ xᵢ² ],
(X′X)⁻¹ = [ 1/n + x̄²/Σᵢ(xᵢ − x̄)²   −x̄/Σᵢ(xᵢ − x̄)²
            −x̄/Σᵢ(xᵢ − x̄)²          1/Σᵢ(xᵢ − x̄)² ],

and

X′Y = ( Σᵢ Yᵢ , Σᵢ xᵢYᵢ )′.

Thus, the (unique) least squares estimator is

β̂ = (X′X)⁻¹X′Y = (β̂₀, β̂₁)′,  where  β̂₀ = Ȳ − β̂₁x̄  and  β̂₁ = Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ) / Σᵢ(xᵢ − x̄)².
For the simple linear regression model, it can be shown (verify!) that the perpendicular
projection matrix PX = X(X′X)⁻¹X′ has (i, j)th entry

(PX)ᵢⱼ = 1/n + (xᵢ − x̄)(xⱼ − x̄) / Σₖ(xₖ − x̄)².
Now consider the reparameterized model Y = Wγ + ε, where

Y_{n×1} = (Y₁, Y₂, ..., Yₙ)′,  W_{n×2} = [ 1  x₁ − x̄
                                           1  x₂ − x̄
                                           ⋮  ⋮
                                           1  xₙ − x̄ ],  γ_{2×1} = (γ₀, γ₁)′,  ε_{n×1} = (ε₁, ε₂, ..., εₙ)′.
If

U = [ 1  −x̄
      0   1 ],

then W = XU and X = WU⁻¹ (verify!) so that C(X) = C(W). Moreover, E(Y) =
Xβ = Wγ = XUγ. Taking P′ = (X′X)⁻¹X′ leads to β = P′Xβ = P′XUγ = Uγ; i.e.,

β = (β₀, β₁)′ = (γ₀ − γ₁x̄, γ₁)′ = Uγ.

To find the least-squares estimator for γ in the reparameterized model, observe that

W′W = [ n  0
        0  Σᵢ(xᵢ − x̄)² ]   and   (W′W)⁻¹ = [ n⁻¹  0
                                             0    1/Σᵢ(xᵢ − x̄)² ].

Note that (W′W)⁻¹ is diagonal; this is one of the benefits to working with this parameterization.
The least squares estimator of γ is given by

γ̂ = (W′W)⁻¹W′Y = ( Ȳ , Σᵢ(xᵢ − x̄)(Yᵢ − Ȳ)/Σᵢ(xᵢ − x̄)² )′ = (γ̂₀, γ̂₁)′,
which is different than β̂. However, it can be shown directly (verify!) that the perpendicular
projection matrix onto C(W) is

PW = W(W′W)⁻¹W′,  with (i, j)th entry  (PW)ᵢⱼ = 1/n + (xᵢ − x̄)(xⱼ − x̄) / Σₖ(xₖ − x̄)²,

which is the same as PX. Thus, the fitted values will be the same; i.e., Ŷ = PXY =
Xβ̂ = Wγ̂ = PWY, and the analysis will be the same under both parameterizations.
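A numerical sketch of this reparameterization (numpy, made-up data): centering makes W′W diagonal, the coefficient estimates differ, but the fitted values agree.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 6
x = rng.normal(size=n)
Y = rng.normal(size=n)

X = np.column_stack([np.ones(n), x])             # original parameterization
W = np.column_stack([np.ones(n), x - x.mean()])  # centered parameterization

WtW = W.T @ W
diagonal = abs(WtW[0, 1]) < 1e-10                # off-diagonal entry vanishes

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
gamma_hat = np.linalg.solve(WtW, W.T @ Y)

same_fit = np.allclose(X @ beta_hat, W @ gamma_hat)   # identical fitted values
print(diagonal, same_fit, np.isclose(gamma_hat[0], Y.mean()))   # True True True
```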
Exercise: Show that the one-way fixed effects ANOVA model Y_ij = μ + αᵢ + ε_ij, for
i = 1, 2, ..., a and j = 1, 2, ..., nᵢ, and the cell means model Y_ij = μᵢ + ε_ij are reparameterizations
of each other. Does one parameterization confer advantages over the other?
CHAPTER 3
3.1
Introduction
REMARK: Estimability is one of the most important concepts in linear models. Consider
the general linear model

Y = Xβ + ε,

where E(ε) = 0. In the discussion that follows, the assumption cov(ε) = σ²I is not
needed. Suppose that X is n × p with rank r ≤ p. If r = p (as in regression models), then
estimability concerns vanish as β is estimated uniquely by β̂ = (X′X)⁻¹X′Y. If r < p
(a common characteristic of ANOVA models), then β cannot be estimated uniquely.
However, even if β is not estimable, certain functions of β may be estimable.
3.2
Estimability
DEFINITIONS:

1. An estimator t(Y) is said to be unbiased for λ′β iff E{t(Y)} = λ′β, for all β.
2. An estimator t(Y) is said to be a linear estimator in Y iff t(Y) = c + a′Y, for
   c ∈ R and a = (a₁, a₂, ..., aₙ)′, aᵢ ∈ R.
3. A function λ′β is said to be (linearly) estimable iff there exists a linear unbiased
   estimator for it. Otherwise, λ′β is nonestimable.

Result 3.1. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, a linear
function λ′β is estimable iff there exists a vector a such that λ′ = a′X; that is, λ′ ∈ R(X).

Proof. (⇐) Suppose that there exists a vector a such that λ′ = a′X. Then, E(a′Y) =
a′Xβ = λ′β, for all β. Therefore, a′Y is a linear unbiased estimator of λ′β, and hence
λ′β is estimable. (⇒) Suppose λ′β is estimable, so that E(c + a′Y) = c + a′Xβ = λ′β,
for all β. Taking β = 0 shows c = 0, and then a′Xβ = λ′β for all β forces λ′ = a′X. □
EXAMPLE: Consider the one-way fixed effects ANOVA model Y_ij = μ + αᵢ + ε_ij with
a = 3 treatments and nᵢ = 2 observations per treatment, so that

Y = (Y₁₁, Y₁₂, Y₂₁, Y₂₂, Y₃₁, Y₃₂)′,  X = [ 1 1 0 0
                                            1 1 0 0
                                            1 0 1 0
                                            1 0 1 0
                                            1 0 0 1
                                            1 0 0 1 ],  and  β = (μ, α₁, α₂, α₃)′.

Note that r(X) = 3, so X is not of full rank; i.e., β is not uniquely estimable. Consider
the following parametric functions λ′β:

Parameter                      λ′                        λ′ ∈ R(X)?   Estimable?
λ₁′β = μ                       λ₁′ = (1, 0, 0, 0)        no           no
λ₂′β = α₁                      λ₂′ = (0, 1, 0, 0)        no           no
λ₃′β = μ + α₁                  λ₃′ = (1, 1, 0, 0)        yes          yes
λ₄′β = α₁ − α₂                 λ₄′ = (0, 1, −1, 0)       yes          yes
λ₅′β = α₁ − (α₂ + α₃)/2        λ₅′ = (0, 1, −1/2, −1/2)  yes          yes
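The row-space condition is easy to check numerically: λ′β is estimable iff appending λ′ as an extra row of X does not increase the rank. A sketch for the table above (numpy):

```python
import numpy as np

groups = np.repeat(np.arange(3), 2)
X = np.column_stack([np.ones(6)] + [(groups == i).astype(float) for i in range(3)])
r = np.linalg.matrix_rank(X)

def estimable(lam):
    # lam' is in R(X) iff adding it as a row leaves the rank unchanged
    return np.linalg.matrix_rank(np.vstack([X, lam])) == r

results = [estimable(np.array(l)) for l in
           [[1., 0, 0, 0],         # mu
            [0., 1, 0, 0],         # alpha1
            [1., 1, 0, 0],         # mu + alpha1
            [0., 1, -1, 0],        # alpha1 - alpha2
            [0., 1, -0.5, -0.5]]]  # alpha1 - (alpha2 + alpha3)/2
print(results)   # [False, False, True, True, True]
```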
For example, the estimator Ȳ₁₊ − (Ȳ₂₊ + Ȳ₃₊)/2 = c + a′Y, with c = 0 and
a = (1/2, 1/2, −1/4, −1/4, −1/4, −1/4)′, is a linear unbiased estimator of λ₅′β.
Straightforward calculations show that

X′X = [ 6 2 2 2
        2 2 0 0
        2 0 2 0
        2 0 0 2 ]

and r(X′X) = 3. Here are two generalized inverses of X′X:

(X′X)₁⁻ = [ 0  0    0    0
            0  1/2  0    0
            0  0    1/2  0
            0  0    0    1/2 ],   (X′X)₂⁻ = [ 1/2   −1/2  −1/2  0
                                              −1/2  1     1/2   0
                                              −1/2  1/2   1     0
                                              0     0     0     0 ].
Note that

X′Y = [ 1 1 1 1 1 1
        1 1 0 0 0 0
        0 0 1 1 0 0
        0 0 0 0 1 1 ] (Y₁₁, Y₁₂, Y₂₁, Y₂₂, Y₃₁, Y₃₂)′ = ( Y₊₊ , Y₁₊ , Y₂₊ , Y₃₊ )′,

so that

β̂ = (X′X)₁⁻X′Y = ( 0 , Ȳ₁₊ , Ȳ₂₊ , Ȳ₃₊ )′   and   β̃ = (X′X)₂⁻X′Y = ( Ȳ₃₊ , Ȳ₁₊ − Ȳ₃₊ , Ȳ₂₊ − Ȳ₃₊ , 0 )′.
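A quick sketch verifying, with made-up data, that the two generalized inverses give different solutions but identical estimates of an estimable function:

```python
import numpy as np

rng = np.random.default_rng(2)
groups = np.repeat(np.arange(3), 2)
X = np.column_stack([np.ones(6)] + [(groups == i).astype(float) for i in range(3)])
Y = rng.normal(size=6)
A, XtY = X.T @ X, X.T @ Y

G1 = np.diag([0.0, 0.5, 0.5, 0.5])          # first generalized inverse above
G2 = np.zeros((4, 4))
G2[:3, :3] = np.linalg.inv(A[:3, :3])       # second generalized inverse above

b_hat, b_tilde = G1 @ XtY, G2 @ XtY
distinct = not np.allclose(b_hat, b_tilde)  # the solutions differ ...

lam = np.array([1.0, 1.0, 0.0, 0.0])        # estimable function: mu + alpha1
invariant = np.isclose(lam @ b_hat, lam @ b_tilde)   # ... but the estimate does not
print(distinct, invariant)                  # True True
```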
Both solutions yield the same least squares estimates of the estimable functions:

Parameter                      Least squares estimate
λ₃′β = μ + α₁                  λ₃′β̂ = λ₃′β̃ = Ȳ₁₊
λ₄′β = α₁ − α₂                 λ₄′β̂ = λ₄′β̃ = Ȳ₁₊ − Ȳ₂₊
λ₅′β = α₁ − (α₂ + α₃)/2        λ₅′β̂ = λ₅′β̃ = Ȳ₁₊ − (Ȳ₂₊ + Ȳ₃₊)/2
Finally, note that these three estimable functions are linearly independent, since

Λ = [λ₃ λ₄ λ₅] = [ 1  0   0
                   1  1   1
                   0  −1  −1/2
                   0  0   −1/2 ]

has rank r(Λ) = 3. Of course, more estimable functions λᵢ′β can be found, but we can
find no more linearly independent estimable functions because r(X) = 3.
Result 3.4. Under the model assumptions Y = Xβ + ε, where E(ε) = 0, the least
squares estimator λ′β̂ of an estimable function λ′β is a linear unbiased estimator of λ′β.

Proof. Suppose that β̂ solves the normal equations. We know (by definition) that λ′β̂ is
the least squares estimator of λ′β. Note that

λ′β̂ = λ′{(X′X)⁻X′Y + [I − (X′X)⁻X′X]z} = λ′(X′X)⁻X′Y + λ′[I − (X′X)⁻X′X]z.

Also, λ′β is estimable by assumption, so λ′ ∈ R(X) ⟹ λ ∈ C(X′) ⟹ λ ⊥ N(X). Result
MAR5.2 says that [I − (X′X)⁻X′X]z ∈ N(X′X) = N(X), so λ′[I − (X′X)⁻X′X]z =
0. Thus, λ′β̂ = λ′(X′X)⁻X′Y, which is a linear estimator in Y. We now show that λ′β̂
is unbiased. Because λ′β is estimable, λ′ ∈ R(X) ⟹ λ′ = a′X, for some a. Thus,

E(λ′β̂) = E{λ′(X′X)⁻X′Y} = λ′(X′X)⁻X′E(Y)
       = λ′(X′X)⁻X′Xβ
       = a′X(X′X)⁻X′Xβ
       = a′PXXβ = a′Xβ = λ′β.  □
SUMMARY: Consider the linear model Y = Xβ + ε, where E(ε) = 0. From the
definition, we know that λ′β is estimable iff there exists a linear unbiased estimator for
it, so if we can find a linear estimator c + a′Y whose expectation equals λ′β, for all β, then
λ′β is estimable. From Result 3.1, we know that λ′β is estimable iff λ′ ∈ R(X). Thus,
if λ′ can be expressed as a linear combination of the rows of X, then λ′β is estimable.
3.2.1 One-way ANOVA

GENERAL CASE: Consider the one-way fixed effects ANOVA model Y_ij = μ + αᵢ + ε_ij,
for i = 1, 2, ..., a and j = 1, 2, ..., nᵢ, where E(ε_ij) = 0. In matrix form, X and β are

X_{n×p} = [ 1_{n₁}  1_{n₁}  0_{n₁}  ⋯  0_{n₁}
            1_{n₂}  0_{n₂}  1_{n₂}  ⋯  0_{n₂}
             ⋮       ⋮       ⋮          ⋮
            1_{nₐ}  0_{nₐ}  0_{nₐ}  ⋯  1_{nₐ} ]   and   β_{p×1} = (μ, α₁, α₂, ..., αₐ)′,

where p = a + 1, and the first column is the sum of the last a columns. Hence, r(X) = r = a and
s = p − r = 1. With c₁ = (1, −1ₐ′)′, note that Xc₁ = 0, so {c₁} forms a basis for N(X).
Thus, the necessary and sufficient condition for λ′β = λ₀μ + Σᵢ λᵢαᵢ to be estimable is

λ′c₁ = 0  ⟺  λ₀ = Σᵢ λᵢ.

In particular, contrasts Σᵢ λᵢαᵢ, where Σᵢ λᵢ = 0, are estimable. Examples of nonestimable
functions include μ, the individual αᵢ, and Σᵢ nᵢαᵢ.
There is only s = 1 jointly nonestimable function. Later we will learn that jointly
nonestimable functions can be used to force particular solutions to the normal equations.
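A numerical version of this estimability check (numpy, with the illustrative sizes a = 3 and nᵢ = 2):

```python
import numpy as np

a, ni = 3, 2
groups = np.repeat(np.arange(a), ni)
X = np.column_stack([np.ones(a * ni)] + [(groups == i).astype(float) for i in range(a)])

c1 = np.array([1.0, -1.0, -1.0, -1.0])             # proposed basis vector for N(X)
in_null = np.allclose(X @ c1, 0)
s = X.shape[1] - np.linalg.matrix_rank(X)          # dim N(X) = p - r = 1

contrast = np.array([0.0, 1.0, -1.0, 0.0])         # alpha1 - alpha2: lam'c1 = 0
mu_only  = np.array([1.0, 0.0, 0.0, 0.0])          # mu: lam'c1 != 0
print(in_null, s, contrast @ c1, mu_only @ c1)     # True 1 0.0 1.0
```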
The following are examples of sets of linearly independent estimable functions (verify!):

1. {μ + α₁, μ + α₂, ..., μ + αₐ}
2. {μ + α₁, α₁ − α₂, ..., α₁ − αₐ}.

LEAST SQUARES ESTIMATES: We now wish to calculate the least squares estimates
of estimable functions. Note that
X′X = [ n   n₁  n₂  ⋯  nₐ
        n₁  n₁  0   ⋯  0
        n₂  0   n₂  ⋯  0
        ⋮   ⋮   ⋮       ⋮
        nₐ  0   0   ⋯  nₐ ]   and   (X′X)⁻ = [ 0  0     0     ⋯  0
                                               0  1/n₁  0     ⋯  0
                                               0  0     1/n₂  ⋯  0
                                               ⋮  ⋮     ⋮         ⋮
                                               0  0     0     ⋯  1/nₐ ],

so that

β̂ = (X′X)⁻X′Y = (X′X)⁻ ( Σᵢ Σⱼ Y_ij , Σⱼ Y₁ⱼ , Σⱼ Y₂ⱼ , ..., Σⱼ Yₐⱼ )′ = ( 0 , Ȳ₁₊ , Ȳ₂₊ , ..., Ȳₐ₊ )′.
REMARK: We know that this solution is not unique; had we used a different generalized
inverse above, we would have gotten a different least squares estimate of β. However, least
squares estimates of estimable functions λ′β are invariant to the choice of generalized
inverse, so our choice of (X′X)⁻ above is as good as any other. From this solution, we
have the unique least squares estimates:

Estimable function, λ′β               Least squares estimate, λ′β̂
μ + αᵢ                                Ȳᵢ₊
αᵢ − αₖ                               Ȳᵢ₊ − Ȳₖ₊
Σᵢ λᵢαᵢ, where Σᵢ λᵢ = 0              Σᵢ λᵢȲᵢ₊
3.2.2 Two-way crossed ANOVA without interaction
GENERAL CASE: Consider the two-way fixed effects (crossed) ANOVA model

Y_ijk = μ + αᵢ + βⱼ + ε_ijk,

for i = 1, 2, ..., a, j = 1, 2, ..., b, and k = 1, 2, ..., n_ij, where E(ε_ijk) = 0. For ease of
presentation, we take n_ij = 1 so there is no need for a k subscript; that is, we can rewrite
the model as Y_ij = μ + αᵢ + βⱼ + ε_ij. In matrix form, X and β are
X_{n×p} = [ 1_b  1_b  0_b  ⋯  0_b  I_b
            1_b  0_b  1_b  ⋯  0_b  I_b
             ⋮    ⋮    ⋮        ⋮    ⋮
            1_b  0_b  0_b  ⋯  1_b  I_b ]   and   β_{p×1} = (μ, α₁, ..., αₐ, β₁, ..., β_b)′,

where p = a + b + 1 and n = ab. Note that the first column is the sum of the last b
columns. The 2nd column is the sum of the last b columns minus the sum of columns 3
through a + 1. The remaining columns are linearly independent. Thus, we have s = 2
linear dependencies so that r(X) = a + b − 1, and the dimension of N(X) is s = 2. Taking

c₁ = (1, −1ₐ′, 0_b′)′   and   c₂ = (1, 0ₐ′, −1_b′)′

produces Xc₁ = Xc₂ = 0. Since c₁ and c₂ are linearly independent; i.e., neither is
a multiple of the other, {c₁, c₂} is a basis for N(X). Thus, necessary and sufficient
conditions for λ′β to be estimable are

λ′c₁ = 0  ⟺  λ₀ = Σᵢ λᵢ   and   λ′c₂ = 0  ⟺  λ₀ = Σⱼ λ_{a+j}.
CHAPTER 3
Pa
i i , where
Pb
a+j j , where
i=1
j=1
Pa
i=1
i = 0
Pb
j=1
a+j = 0.
Pa
5.
Pb
j .
i=1
j=1
We can find $s = 2$ jointly nonestimable functions. Examples of sets of jointly nonestimable functions are
1. $\{\alpha_a, \beta_b\}$
2. $\{\sum_i \alpha_i, \sum_j \beta_j\}$.
A set of linearly independent estimable functions (verify!) is
1. $\{\mu + \alpha_1 + \beta_1,\ \alpha_1 - \alpha_2, \ldots, \alpha_1 - \alpha_a,\ \beta_1 - \beta_2, \ldots, \beta_1 - \beta_b\}$.
NOTE: When replication occurs, i.e., when $n_{ij} > 1$ for all $i$ and $j$, our estimability findings are unchanged; replication does not change $R(X)$. We obtain the following least squares estimates:
Estimable function, $\lambda'\beta$                                Least squares estimate, $\lambda'b$
$\mu + \alpha_i + \beta_j$                                         $\overline{Y}_{ij+}$
$\alpha_i - \alpha_l$                                              $\overline{Y}_{i++} - \overline{Y}_{l++}$
$\beta_j - \beta_l$                                                $\overline{Y}_{+j+} - \overline{Y}_{+l+}$
$\sum_{i=1}^a c_i\alpha_i$, with $\sum_{i=1}^a c_i = 0$            $\sum_{i=1}^a c_i\overline{Y}_{i++}$
$\sum_{j=1}^b d_j\beta_j$, with $\sum_{j=1}^b d_j = 0$             $\sum_{j=1}^b d_j\overline{Y}_{+j+}$
These formulae are still technically correct when $n_{ij} = 1$. When some $n_{ij} = 0$, i.e., there are missing cells, estimability may be affected; see Monahan, pp. 46-48.
3.2.3 Two-way crossed ANOVA with interaction
GENERAL CASE: Consider the two-way fixed effects (crossed) ANOVA model
$$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk},$$
for $i = 1, 2, ..., a$, $j = 1, 2, ..., b$, and $k = 1, 2, ..., n_{ij}$, where $E(\epsilon_{ijk}) = 0$.
SPECIAL CASE: Take $a = 3$, $b = 2$, and $n_{ij} = 2$, so that $n = 12$. In matrix form,
$$X = \begin{pmatrix}
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix} \quad\text{and}\quad \beta = (\mu, \alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \gamma_{31}, \gamma_{32})'.$$
There are $p = 12$ parameters. The last six columns of $X$ are linearly independent, and the other columns can be written as linear combinations of the last six columns, so $r(X) = 6$ and $s = p - r = 6$. To determine which functions $\lambda'\beta$ are estimable, we need to find a basis for $N(X)$. One basis $\{c_1, c_2, \ldots, c_6\}$ is
$$\begin{aligned}
c_1 &= (1, 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, -1)', \\
c_2 &= (0, 1, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0)', \\
c_3 &= (0, 0, 1, 0, 0, 0, 0, 0, -1, -1, 0, 0)', \\
c_4 &= (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1, -1)', \\
c_5 &= (0, 0, 0, 0, 1, 0, -1, 0, -1, 0, -1, 0)', \\
c_6 &= (0, 0, 0, 0, 0, 1, 0, -1, 0, -1, 0, -1)'.
\end{aligned}$$
Each $c_i$ records how the $\mu$, $\alpha_i$, or $\beta_j$ column of $X$ is reproduced by the interaction columns; direct multiplication verifies that $Xc_i = 0$ for each $i$, and these six vectors are linearly independent.
3.3 Reparameterization

Consider the general linear model
[Model GL] $Y = X\beta + \epsilon$, where $E(\epsilon) = 0$,
and a reparameterization of it,
[Model GL-R] $Y = W\gamma + \epsilon$, where $E(\epsilon) = 0$,
in which $W = XT$ and $X = WS'$, for some matrices $T$ and $S$, so that $C(W) = C(X)$ and $\gamma = S'\beta$ satisfies $W\gamma = X\beta$.
3. This follows from (2), since the least squares estimate is invariant to the choice of the solution to the normal equations.

4. If $q' \in R(W)$, then $q' = a'W$, for some $a$. Then, $q'S' = a'WS' = a'X \in R(X)$, so that $q'S'\beta$ is estimable under Model GL. From (3), we know the least squares estimate of $q'S'\beta$ is $q'S'\hat{\beta}$. But,
$$q'S'\hat{\beta} = a'WS'\hat{\beta} = a'X\hat{\beta} = a'W\hat{\gamma} = q'\hat{\gamma},$$
since $X\hat{\beta} = P_XY = W\hat{\gamma}$.
WARNING: The converse to (4) is not true; i.e., $q'S'\beta$ being estimable under Model GL does not necessarily imply that $q'\gamma$ is estimable under Model GL-R. See Monahan, p. 52.
TERMINOLOGY: Because $C(W) = C(X)$ and $r(X) = r$, $W_{n\times t}$ must have at least $r$ columns. If $W$ has exactly $r$ columns, i.e., if $t = r$, then the reparameterization of Model GL is called a full rank reparameterization. If, in addition, $W'W$ is diagonal, the reparameterization of Model GL is called an orthogonal reparameterization; see, e.g., the centered linear regression model in Section 2 (notes).

NOTE: A full rank reparameterization always exists; just delete the columns of $X$ that are linearly dependent on the others. In a full rank reparameterization, $(W'W)^{-1}$ exists, so the normal equations $W'W\gamma = W'Y$ have a unique solution; i.e., $\hat{\gamma} = (W'W)^{-1}W'Y$.
DISCUSSION: There are two (opposing) points of view concerning the utility of full rank reparameterizations. Some argue that, since making inferences about $q'\gamma$ under the full rank reparameterized model (Model GL-R) is equivalent to making inferences about $q'S'\beta$ in the possibly-less-than-full-rank original model (Model GL), allowing the design matrix to have less than full column rank causes a needless complication in linear model theory. The opposing argument is that, since the computations required to deal with the reparameterized model are essentially the same as those required to handle the original model, we might as well allow for less-than-full-rank models in the first place. I tend to favor the latter point of view; to me, there is no reason not to include less-than-full-rank models as long as you know what you can and cannot estimate.
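A full rank reparameterization obtained by deleting a dependent column is easy to illustrate numerically. The sketch below is an illustration only (it assumes numpy and uses made-up data): since $C(W) = C(X)$, the two parameterizations produce identical fitted values, the projections onto the common column space.

```python
import numpy as np

# One-way ANOVA: a = 3 groups, n_i = 2
W = np.kron(np.eye(3), np.ones((2, 1)))        # 6 x 3 group indicators (rank 3)
X_full = np.hstack([np.ones((6, 1)), W])       # [1 | indicators]: 4 columns, rank 3
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])   # fabricated data; group means 2, 3, 6

# Projections onto C(X_full) and C(W); the column spaces coincide
P_X = X_full @ np.linalg.pinv(X_full)          # uses a generalized inverse (rank deficient)
P_W = W @ np.linalg.solve(W.T @ W, W.T)        # W'W is invertible here

yhat_X = P_X @ y
yhat_W = P_W @ y                               # identical fitted values: the group means
```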
Reparameterization 1: Consider the one-way fixed effects ANOVA model $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$, for $i = 1, 2, ..., a$ and $j = 1, 2, ..., n_i$. One reparameterization uses
$$W_{n\times t} = \begin{pmatrix} 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{n_a} & 0_{n_a} & \cdots & 1_{n_a} \end{pmatrix} \quad\text{and}\quad \gamma_{t\times 1} = \begin{pmatrix} \mu + \alpha_1 \\ \mu + \alpha_2 \\ \vdots \\ \mu + \alpha_a \end{pmatrix},$$
where $t = a$. This is called the cell means model. This is a full rank reparameterization with $C(W) = C(X)$. The least squares estimate of $\gamma$ is
$$\hat{\gamma} = (W'W)^{-1}W'Y = \begin{pmatrix} \overline{Y}_{1+} \\ \overline{Y}_{2+} \\ \vdots \\ \overline{Y}_{a+} \end{pmatrix}.$$
Exercise: What are the matrices T and S associated with this reparameterization?
Reparameterization 2: Deleting the last column of $X$ produces
$$W_{n\times t} = \begin{pmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1_{n_{a-1}} & 0_{n_{a-1}} & 0_{n_{a-1}} & \cdots & 1_{n_{a-1}} \\ 1_{n_a} & 0_{n_a} & 0_{n_a} & \cdots & 0_{n_a} \end{pmatrix} \quad\text{and}\quad \gamma_{t\times 1} = \begin{pmatrix} \mu + \alpha_a \\ \alpha_1 - \alpha_a \\ \alpha_2 - \alpha_a \\ \vdots \\ \alpha_{a-1} - \alpha_a \end{pmatrix},$$
where $t = a$. This is called the cell-reference model (what SAS uses by default). This is a full rank reparameterization with $C(W) = C(X)$. The least squares estimate of $\gamma$ is
$$\hat{\gamma} = (W'W)^{-1}W'Y = \begin{pmatrix} \overline{Y}_{a+} \\ \overline{Y}_{1+} - \overline{Y}_{a+} \\ \overline{Y}_{2+} - \overline{Y}_{a+} \\ \vdots \\ \overline{Y}_{(a-1)+} - \overline{Y}_{a+} \end{pmatrix}.$$
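The cell-reference estimates can be confirmed directly; a minimal sketch assuming numpy and reusing the fabricated one-way data with group means 2, 3, and 6:

```python
import numpy as np

# Cell-reference design for a = 3 groups, n_i = 2: [1 | d_1 | d_2], last group is reference
d = np.kron(np.eye(3), np.ones((2, 1)))        # group indicators d_1, d_2, d_3
W = np.hstack([np.ones((6, 1)), d[:, :2]])     # drop the last group's indicator
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])   # made-up data; group means 2, 3, 6

gamma_hat = np.linalg.solve(W.T @ W, W.T @ y)
# gamma_hat = (Ybar_3+, Ybar_1+ - Ybar_3+, Ybar_2+ - Ybar_3+) = (6, -4, -3)
```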
Reparameterization 3: Another reparameterization of the effects model uses
$$W_{n\times t} = \begin{pmatrix} 1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\ 1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1_{n_{a-1}} & 0_{n_{a-1}} & 0_{n_{a-1}} & \cdots & 1_{n_{a-1}} \\ 1_{n_a} & -1_{n_a} & -1_{n_a} & \cdots & -1_{n_a} \end{pmatrix} \quad\text{and}\quad \gamma_{t\times 1} = \begin{pmatrix} \mu + \overline{\alpha} \\ \alpha_1 - \overline{\alpha} \\ \alpha_2 - \overline{\alpha} \\ \vdots \\ \alpha_{a-1} - \overline{\alpha} \end{pmatrix},$$
where $t = a$ and $\overline{\alpha} = a^{-1}\sum_i \alpha_i$. This is called the deviations from the mean model.
TWO-PART MODELS: Consider the full-rank model $Y = X_1\beta_1 + X_2\beta_2 + \epsilon$, where $E(\epsilon) = 0$, and take $W = (X_1, W_2)$, where $W_2 = (I - P_{X_1})X_2$, so that $C(W) = C(X)$ for $X = (X_1, X_2)$. The corresponding parameter is
$$\gamma = \begin{pmatrix} \gamma_1 \\ \gamma_2 \end{pmatrix} = \begin{pmatrix} \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 \\ \beta_2 \end{pmatrix}.$$
With this reparameterization, note that
$$W'W = \begin{pmatrix} X_1'X_1 & 0 \\ 0 & X_2'(I - P_{X_1})X_2 \end{pmatrix} \quad\text{and}\quad W'Y = \begin{pmatrix} X_1'Y \\ X_2'(I - P_{X_1})Y \end{pmatrix},$$
so that
$$\hat{\gamma} = (W'W)^{-1}W'Y = \begin{pmatrix} \hat{\gamma}_1 \\ \hat{\gamma}_2 \end{pmatrix} = \begin{pmatrix} (X_1'X_1)^{-1}X_1'Y \\ \{X_2'(I - P_{X_1})X_2\}^{-1}X_2'(I - P_{X_1})Y \end{pmatrix}.$$
In the original parameterization,
$$\hat{\beta} = (X'X)^{-1}X'Y = \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} \hat{\gamma}_1 - (X_1'X_1)^{-1}X_1'X_2\hat{\beta}_2 \\ \hat{\gamma}_2 \end{pmatrix},$$
where note that $(X_1'X_1)^{-1}X_1'X_2$ is the estimate obtained from regressing $X_2$ on $X_1$. Furthermore, the estimate $\hat{\beta}_2$ can be thought of as the estimate obtained from regressing $Y$ on $W_2 = (I - P_{X_1})X_2$.
APPLICATION: Consider the two-part full-rank regression model $Y = X_1\beta_1 + X_2\beta_2 + \epsilon$, where $E(\epsilon) = 0$. Suppose that $X_2 = x_2$ is $n \times 1$ and that $\beta_2$ is a scalar. Consider two different models:
Reduced model: $Y = X_1\beta_1 + \epsilon$
Full model: $Y = X_1\beta_1 + x_2\beta_2 + \epsilon$.
We use the term reduced model since $C(X_1) \subset C(X_1, x_2)$. Consider the full model $Y = X_1\beta_1 + x_2\beta_2 + \epsilon$ and premultiply by $I - P_{X_1}$ to obtain
$$(I - P_{X_1})Y = (I - P_{X_1})X_1\beta_1 + \beta_2(I - P_{X_1})x_2 + (I - P_{X_1})\epsilon = \beta_2(I - P_{X_1})x_2 + \epsilon^*,$$
where $\epsilon^* = (I - P_{X_1})\epsilon$.
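The claim that the full-model estimate of $\beta_2$ equals the coefficient from regressing $Y$ on $(I - P_{X_1})x_2$ can be verified numerically. This sketch assumes numpy and uses invented data; it is an illustration, not part of the notes:

```python
import numpy as np

x = np.arange(1.0, 6.0)                        # made-up covariate values 1..5
X1 = np.column_stack([np.ones(5), x])          # intercept + x
x2 = x ** 2                                    # the added regressor
y = np.array([1.0, 2.0, 5.0, 10.0, 16.0])      # made-up responses

# Full-model OLS estimate of beta_2
Xf = np.column_stack([X1, x2])
beta_full = np.linalg.solve(Xf.T @ Xf, Xf.T @ y)

# Same coefficient from regressing Y on (I - P_X1) x2
M = np.eye(5) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # M = I - P_X1
b2 = (x2 @ M @ y) / (x2 @ M @ x2)              # M is symmetric and idempotent
```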
3.4 Forcing least squares solutions using linear constraints
Example 3.5. Consider the one-way fixed effects ANOVA model $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$, for $i = 1, 2, ..., a$ and $j = 1, 2, ..., n_i$. The normal equations $X'X\beta = X'Y$ are
$$\begin{pmatrix} n & n_1 & n_2 & \cdots & n_a \\ n_1 & n_1 & 0 & \cdots & 0 \\ n_2 & 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_a & 0 & 0 & \cdots & n_a \end{pmatrix}\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_a \end{pmatrix} = \begin{pmatrix} \sum_i\sum_j Y_{ij} \\ \sum_j Y_{1j} \\ \sum_j Y_{2j} \\ \vdots \\ \sum_j Y_{aj} \end{pmatrix};$$
that is,
$$n\mu + \sum_{i=1}^a n_i\alpha_i = Y_{++} \quad\text{and}\quad n_i\mu + n_i\alpha_i = Y_{i+}, \quad i = 1, 2, ..., a,$$
where $Y_{i+} = \sum_j Y_{ij}$ and $Y_{++} = \sum_i\sum_j Y_{ij}$. Because $X'X$ is singular, the normal equations do not have a unique solution. However, from our discussion on generalized inverses (and consideration of this model), we know that
if we set $\mu = 0$, then we get the solution $\hat{\mu} = 0$ and $\hat{\alpha}_i = \overline{Y}_{i+}$, for $i = 1, 2, ..., a$.
if we set $\sum_{i=1}^a n_i\alpha_i = 0$, then we get the solution $\hat{\mu} = \overline{Y}_{++}$ and $\hat{\alpha}_i = \overline{Y}_{i+} - \overline{Y}_{++}$, for $i = 1, 2, ..., a$.
if we set another nonestimable function equal to 0, we'll get a different solution to the normal equations.
REMARK: Equations like $\mu = 0$ and $\sum_{i=1}^a n_i\alpha_i = 0$ force a particular solution to the normal equations and are called side conditions. Different side conditions produce different least squares solutions. We know that in the one-way ANOVA model, the parameters $\mu$ and $\alpha_i$, for $i = 1, 2, ..., a$, are not estimable (individually). Imposing side conditions does not change this. My feeling is that when we attach side conditions to force a unique solution, we are doing nothing more than solving a mathematical problem that isn't relevant. After all, estimable functions $\lambda'\beta$ have least squares estimates that do not depend on which side condition was used, and these are the only functions we should ever be concerned with.
REMARK: We have seen similar results for the two-way crossed ANOVA model. In general, what and how many conditions should we use to force a particular solution to the normal equations? Mathematically, we are interested in imposing additional linear restrictions of the form $C\beta = 0$, where the matrix $C$ does not depend on $Y$.
TERMINOLOGY: We say that the system of equations $Ax = q$ is compatible if $c'A = 0 \implies c'q = 0$; i.e., $c \in N(A') \implies c'q = 0$.

Result 3.6. The system $Ax = q$ is consistent if and only if it is compatible.
Proof. If $Ax = q$ is consistent, then $Ax^* = q$, for some $x^*$. Hence, for any $c$ such that $c'A = 0$, we have $c'q = c'Ax^* = 0$, so $Ax = q$ is compatible. Conversely, if $Ax = q$ is compatible, then for any $c \in N(A') = C(I - P_A)$, we have $c'q = 0$; taking $c = (I - P_A)z$ gives $0 = c'q = q'(I - P_A)z$, for all $z$. Successively taking $z$ to be the standard unit vectors, we have $q'(I - P_A) = 0 \implies (I - P_A)q = 0 \implies q = A(A'A)^-A'q \implies q = Ax^*$, where $x^* = (A'A)^-A'q$. Thus, $Ax = q$ is consistent.
AUGMENTED NORMAL EQUATIONS: We now consider adjoining the set of equations $C\beta = 0$ to the normal equations; that is, we consider the new set of equations
$$\begin{pmatrix} X'X \\ C \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}.$$
These are called the augmented normal equations. When we add the constraint $C\beta = 0$, we want these equations to be consistent for all $Y$. We now would like to find a sufficient condition for consistency. Suppose that $w \in R(X'X) \cap R(C)$. Note that
$$w \in R(X'X) \implies w = X'Xv_1, \text{ for some } v_1, \quad\text{and}\quad w \in R(C) \implies w = C'v_2, \text{ for some } v_2.$$
Thus,
$$0 = w - w = X'Xv_1 - C'v_2 \implies 0 = v_1'X'X - v_2'C = (v_1'\ \ {-v_2'})\begin{pmatrix} X'X \\ C \end{pmatrix} \equiv v'\begin{pmatrix} X'X \\ C \end{pmatrix},$$
say, where $v' = (v_1'\ \ {-v_2'})$.
Now, suppose that
$$\begin{pmatrix} X'X \\ C \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}$$
is consistent or, equivalently from Result 3.6, compatible. Compatibility requires
$$v'\begin{pmatrix} X'X \\ C \end{pmatrix} = 0 \implies v'\begin{pmatrix} X'Y \\ 0 \end{pmatrix} = 0.$$
Thus, we need $v_1'X'Y = 0$, for all $Y$. Successively taking $Y$ to be the standard unit vectors, for $i = 1, 2, ..., n$, convinces us that $v_1'X' = 0 \iff Xv_1 = 0 \iff X'Xv_1 = 0 \implies w = 0$. Thus, the augmented normal equations are consistent when $R(X'X) \cap R(C) = \{0\}$. Since $R(X'X) = R(X)$, a sufficient condition for consistency is $R(X) \cap R(C) = \{0\}$. Now, consider the parametric function $\lambda'C\beta$, for some $\lambda$. We know that $\lambda'C\beta$ is estimable if and only if $\lambda'C \in R(X)$. However, clearly $\lambda'C \in R(C)$. Thus, $\lambda'C\beta$ is estimable if and only if $\lambda'C = 0$. In other words, writing
$$C = \begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_s' \end{pmatrix},$$
the set of functions $\{c_1'\beta, c_2'\beta, \ldots, c_s'\beta\}$ is jointly nonestimable. Therefore, we can set a collection of jointly nonestimable functions equal to zero and augment the normal equations so that they remain consistent. We get a unique solution if
$$r\begin{pmatrix} X'X \\ C \end{pmatrix} = p.$$
Because $R(X'X) \cap R(C) = \{0\}$, the ranks add, so requiring a unique solution means
$$p = r\begin{pmatrix} X'X \\ C \end{pmatrix} = r(X'X) + r(C) = r + r(C),$$
showing that we need $r(C) = s = p - r$.
SUMMARY: To augment the normal equations, we can find a set of $s$ jointly nonestimable functions $\{c_1'\beta, c_2'\beta, \ldots, c_s'\beta\}$ with
$$r(C) = r\begin{pmatrix} c_1' \\ c_2' \\ \vdots \\ c_s' \end{pmatrix} = s.$$
Then, we solve
$$\begin{pmatrix} X'X \\ C \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}.$$
Example 3.5 (continued). For the one-way ANOVA model, the normal equations $X'X\beta = X'Y$ are
$$\begin{pmatrix} n & n_1 & n_2 & \cdots & n_a \\ n_1 & n_1 & 0 & \cdots & 0 \\ n_2 & 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_a & 0 & 0 & \cdots & n_a \end{pmatrix}\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_a \end{pmatrix} = \begin{pmatrix} \sum_i\sum_j Y_{ij} \\ \sum_j Y_{1j} \\ \sum_j Y_{2j} \\ \vdots \\ \sum_j Y_{aj} \end{pmatrix}.$$
We know that $r(X) = r = a < p = a + 1$ (this system cannot be solved uniquely) and that $s = p - r = (a + 1) - a = 1$. Thus, to augment the normal equations, we need to find $s = 1$ (jointly) nonestimable function. Take $c_1' = (1, 0, 0, ..., 0)$, which produces
$$c_1'\beta = (1\ 0\ 0\ \cdots\ 0)\begin{pmatrix} \mu \\ \alpha_1 \\ \vdots \\ \alpha_a \end{pmatrix} = \mu.$$
The augmented normal equations
$$\begin{pmatrix} X'X \\ c_1' \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}$$
become
$$\begin{pmatrix} n & n_1 & n_2 & \cdots & n_a \\ n_1 & n_1 & 0 & \cdots & 0 \\ n_2 & 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_a & 0 & 0 & \cdots & n_a \\ 1 & 0 & 0 & \cdots & 0 \end{pmatrix}\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_a \end{pmatrix} = \begin{pmatrix} \sum_i\sum_j Y_{ij} \\ \sum_j Y_{1j} \\ \sum_j Y_{2j} \\ \vdots \\ \sum_j Y_{aj} \\ 0 \end{pmatrix}.$$
Solving this (now full rank) system produces the unique solution
$$\hat{\mu} = 0 \quad\text{and}\quad \hat{\alpha}_i = \overline{Y}_{i+}, \quad i = 1, 2, ..., a.$$
You'll note that this choice of $c_1$ used to augment the normal equations corresponds to specifying the side condition $\mu = 0$.
Exercise. Redo this example using (a) the side condition $\sum_{i=1}^a n_i\alpha_i = 0$, (b) the side condition $\alpha_a = 0$ (what SAS does), and (c) using another side condition.
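The augmented system can be solved numerically. The following sketch (assuming numpy; the data are fabricated, with $a = 3$, $n_i = 2$, and group means 2, 3, 6) stacks $c_1' = (1, 0, 0, 0)$ under $X'X$ and recovers the solution corresponding to $\mu = 0$:

```python
import numpy as np

# One-way ANOVA design and made-up data
X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])

# Augmented normal equations: [X'X ; c1'] beta = [X'Y ; 0]
A_aug = np.vstack([X.T @ X, [[1.0, 0.0, 0.0, 0.0]]])
b_aug = np.concatenate([X.T @ y, [0.0]])

# The stacked system is consistent with full column rank, so lstsq
# returns its unique solution: (0, Ybar_1+, Ybar_2+, Ybar_3+)
beta_hat = np.linalg.lstsq(A_aug, b_aug, rcond=None)[0]
```

Swapping in a different last row of `A_aug` (a different side condition) yields a different, equally valid solution to the normal equations.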
Example 3.6. Consider the two-way fixed effects (crossed) ANOVA model
$$Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij},$$
for $i = 1, 2, ..., a$ and $j = 1, 2, ..., b$, where $E(\epsilon_{ij}) = 0$. For purposes of illustration, let's take $a = b = 3$, so that $n = ab = 9$ and
$$X_{9\times 7} = \begin{pmatrix}
1 & 1 & 0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 & 1
\end{pmatrix} \quad\text{and}\quad \beta_{7\times 1} = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}.$$
The normal equations $X'X\beta = X'Y$ are
$$\begin{pmatrix}
9 & 3 & 3 & 3 & 3 & 3 & 3 \\
3 & 3 & 0 & 0 & 1 & 1 & 1 \\
3 & 0 & 3 & 0 & 1 & 1 & 1 \\
3 & 0 & 0 & 3 & 1 & 1 & 1 \\
3 & 1 & 1 & 1 & 3 & 0 & 0 \\
3 & 1 & 1 & 1 & 0 & 3 & 0 \\
3 & 1 & 1 & 1 & 0 & 0 & 3
\end{pmatrix}\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} \sum_i\sum_j Y_{ij} \\ \sum_j Y_{1j} \\ \sum_j Y_{2j} \\ \sum_j Y_{3j} \\ \sum_i Y_{i1} \\ \sum_i Y_{i2} \\ \sum_i Y_{i3} \end{pmatrix} = X'Y.$$
This system does not have a unique solution. To augment the normal equations, we will need a set of $s = 2$ linearly independent jointly nonestimable functions. From Section 3.2.2, one example of such a set is $\{\sum_i \alpha_i, \sum_j \beta_j\}$. For this choice, our matrix $C$ is
$$C = \begin{pmatrix} c_1' \\ c_2' \end{pmatrix} = \begin{pmatrix} 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}.$$
Thus, the augmented normal equations
$$\begin{pmatrix} X'X \\ C \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}$$
become
$$\begin{pmatrix}
9 & 3 & 3 & 3 & 3 & 3 & 3 \\
3 & 3 & 0 & 0 & 1 & 1 & 1 \\
3 & 0 & 3 & 0 & 1 & 1 & 1 \\
3 & 0 & 0 & 3 & 1 & 1 & 1 \\
3 & 1 & 1 & 1 & 3 & 0 & 0 \\
3 & 1 & 1 & 1 & 0 & 3 & 0 \\
3 & 1 & 1 & 1 & 0 & 0 & 3 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1
\end{pmatrix}\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} \sum_i\sum_j Y_{ij} \\ \sum_j Y_{1j} \\ \sum_j Y_{2j} \\ \sum_j Y_{3j} \\ \sum_i Y_{i1} \\ \sum_i Y_{i2} \\ \sum_i Y_{i3} \\ 0 \\ 0 \end{pmatrix}.$$
Solving this system produces the estimates of $\mu$, $\alpha_i$, and $\beta_j$ under the side conditions $\sum_i \alpha_i = \sum_j \beta_j = 0$. These estimates are
$$\hat{\mu} = \overline{Y}_{++}, \quad \hat{\alpha}_i = \overline{Y}_{i+} - \overline{Y}_{++}, \ i = 1, 2, 3, \quad\text{and}\quad \hat{\beta}_j = \overline{Y}_{+j} - \overline{Y}_{++}, \ j = 1, 2, 3.$$
Exercise. Redo this example using (a) the side conditions $\alpha_a = 0$ and $\beta_b = 0$ (what SAS does) and (b) using another set of side conditions.
QUESTION: In general, can we give a mathematical form for the particular solution? Note that we are now solving
$$\begin{pmatrix} X'X \\ C \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix}, \quad\text{which is equivalent to}\quad \begin{pmatrix} X'X \\ C'C \end{pmatrix}\beta = \begin{pmatrix} X'Y \\ 0 \end{pmatrix},$$
since $C\beta = 0$ iff $C'C\beta = 0$. Thus, any solution to this system must also satisfy
$$(X'X + C'C)\beta = X'Y.$$
But,
$$r(X'X + C'C) = r\left[(X'\ \ C')\begin{pmatrix} X \\ C \end{pmatrix}\right] = r\begin{pmatrix} X \\ C \end{pmatrix} = p;$$
that is, $X'X + C'C$ is nonsingular. Hence, the unique solution to the augmented normal equations must be
$$\hat{\beta} = (X'X + C'C)^{-1}X'Y.$$
So, imposing $s = p - r$ conditions $C\beta = 0$, where the elements of $C\beta$ are jointly nonestimable, yields a particular solution to the normal equations. Finally, note that by Result 2.5 (notes),
$$X\hat{\beta} = X(X'X + C'C)^{-1}X'Y = P_XY,$$
which shows that
$$P_X = X(X'X + C'C)^{-1}X'$$
is the perpendicular projection matrix onto $C(X)$. This shows that $(X'X + C'C)^{-1}$ is a (nonsingular) generalized inverse of $X'X$.
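Both facts are easy to confirm numerically. A minimal sketch (numpy assumed; the one-way design is the same fabricated example as above, with $C = c_1' = (1, 0, 0, 0)$):

```python
import numpy as np

X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])
C = np.array([[1.0, 0.0, 0.0, 0.0]])           # s = 1 jointly nonestimable function: mu

XtX = X.T @ X
G = np.linalg.inv(XtX + C.T @ C)               # nonsingular since r([X ; C]) = p

check_ginv = XtX @ G @ XtX                     # should reproduce X'X (g-inverse property)
P1 = X @ G @ X.T                               # X (X'X + C'C)^{-1} X'
P2 = X @ np.linalg.pinv(X)                     # the perpendicular projection onto C(X)
```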
CHAPTER 4
4.1
Introduction
4.2 The Gauss-Markov theorem
We now show that $cov(\lambda'\hat{\beta},\ a'Y - \lambda'\hat{\beta}) = 0$. Recalling that $\lambda'\hat{\beta} = a'P_XY$, we have
$$cov(\lambda'\hat{\beta},\ a'Y - \lambda'\hat{\beta}) = cov(a'P_XY,\ a'Y - a'P_XY) = cov[a'P_XY,\ a'(I - P_X)Y] = a'P_X\,cov(Y, Y)\,[a'(I - P_X)]' = \sigma^2a'P_X(I - P_X)a = 0,$$
since $P_X(I - P_X) = 0$. Thus,
$$var(a'Y) = var(\lambda'\hat{\beta}) + var(a'Y - \lambda'\hat{\beta}) \geq var(\lambda'\hat{\beta}),$$
showing that $\lambda'\hat{\beta}$ has variance no larger than that of $a'Y$. Equality results when $var(a'Y - \lambda'\hat{\beta}) = 0$. However, if $var(a'Y - \lambda'\hat{\beta}) = 0$, then because $E(a'Y - \lambda'\hat{\beta}) = 0$, $a'Y - \lambda'\hat{\beta}$ is a degenerate random variable at 0; i.e., $pr(a'Y = \lambda'\hat{\beta}) = 1$. This establishes uniqueness.
MULTIVARIATE CASE: Suppose that we wish to estimate simultaneously $k$ estimable linear functions
$$\Lambda'\beta = \begin{pmatrix} \lambda_1'\beta \\ \lambda_2'\beta \\ \vdots \\ \lambda_k'\beta \end{pmatrix},$$
where $\lambda_i' = a_i'X$, for some $a_i$; i.e., $\lambda_i' \in R(X)$, for $i = 1, 2, ..., k$. We say that $\Lambda'\beta$ is estimable if and only if $\lambda_i'\beta$, $i = 1, 2, ..., k$, are each estimable. Put another way, $\Lambda'\beta$ is estimable if and only if $\Lambda' = A'X$, for some matrix $A$.
Result 4.2. Consider the Gauss-Markov model $Y = X\beta + \epsilon$, where $E(\epsilon) = 0$ and $cov(\epsilon) = \sigma^2I$. Suppose that $\Lambda'\beta$ is any $k$-dimensional estimable vector and that $c + A'Y$ is any vector of linear unbiased estimators of the elements of $\Lambda'\beta$. Let $\hat{\beta}$ denote any solution to the normal equations. Then, the matrix $cov(c + A'Y) - cov(\Lambda'\hat{\beta})$ is nonnegative definite.
Proof. It suffices to show that $x'[cov(c + A'Y) - cov(\Lambda'\hat{\beta})]x \geq 0$, for all $x$. Note that
$$x'[cov(c + A'Y) - cov(\Lambda'\hat{\beta})]x = x'cov(c + A'Y)x - x'cov(\Lambda'\hat{\beta})x = var(x'c + x'A'Y) - var(x'\Lambda'\hat{\beta}).$$
But $x'\Lambda'\beta = x'A'X\beta$ (a scalar) is estimable since $x'A'X \in R(X)$. Also, $x'c + x'A'Y = x'(c + A'Y)$ is a linear unbiased estimator of $x'\Lambda'\beta$. The least squares estimator of $x'\Lambda'\beta$ is $x'\Lambda'\hat{\beta}$. Thus, by Result 4.1, $var(x'c + x'A'Y) - var(x'\Lambda'\hat{\beta}) \geq 0$.
OBSERVATION: Consider the Gauss-Markov linear model $Y = X\beta + \epsilon$, where $E(\epsilon) = 0$ and $cov(\epsilon) = \sigma^2I$. If $X$ is full rank, then $X'X$ is nonsingular and every linear combination $\lambda'\beta$ is estimable. The (ordinary) least squares estimator of $\beta$ is $\hat{\beta} = (X'X)^{-1}X'Y$. It is unbiased and
$$cov(\hat{\beta}) = cov[(X'X)^{-1}X'Y] = (X'X)^{-1}X'cov(Y)[(X'X)^{-1}X']' = (X'X)^{-1}X'\sigma^2IX(X'X)^{-1} = \sigma^2(X'X)^{-1}.$$
Note that this expression is not correct if $X$ is less than full rank.
Example 4.1. Recall the simple linear regression model
$$Y_i = \beta_0 + \beta_1x_i + \epsilon_i,$$
for $i = 1, 2, ..., n$, where $\epsilon_1, \epsilon_2, ..., \epsilon_n$ are uncorrelated random variables with mean 0 and common variance $\sigma^2 > 0$ (these are the Gauss-Markov assumptions). Recall that, in matrix notation,
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$
The least squares estimator is
$$\hat{\beta} = (X'X)^{-1}X'Y = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \begin{pmatrix} \overline{Y} - \hat{\beta}_1\overline{x} \\ \dfrac{\sum_i(x_i - \overline{x})(Y_i - \overline{Y})}{\sum_i(x_i - \overline{x})^2} \end{pmatrix},$$
and the covariance matrix of $\hat{\beta}$ is
$$cov(\hat{\beta}) = \sigma^2(X'X)^{-1} = \sigma^2\begin{pmatrix} \dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_i(x_i - \overline{x})^2} & \dfrac{-\overline{x}}{\sum_i(x_i - \overline{x})^2} \\ \dfrac{-\overline{x}}{\sum_i(x_i - \overline{x})^2} & \dfrac{1}{\sum_i(x_i - \overline{x})^2} \end{pmatrix}.$$

4.3 Estimation of $\sigma^2$
REVIEW: Consider the Gauss-Markov model $Y = X\beta + \epsilon$, where $E(\epsilon) = 0$ and $cov(\epsilon) = \sigma^2I$. The best linear unbiased estimator (BLUE) of any estimable function $\lambda'\beta$ is $\lambda'\hat{\beta}$, where $\hat{\beta}$ is any solution to the normal equations. Clearly, $E(Y) = X\beta$ is estimable and the BLUE of $E(Y)$ is
$$X\hat{\beta} = X(X'X)^-X'Y = P_XY = \hat{Y},$$
the perpendicular projection of $Y$ onto $C(X)$; that is, the fitted values from the least squares fit. The residuals are given by
$$\hat{e} = Y - \hat{Y} = Y - P_XY = (I - P_X)Y,$$
the perpendicular projection of $Y$ onto $N(X')$. Recall that the residual sum of squares is
$$Q(\hat{\beta}) = (Y - X\hat{\beta})'(Y - X\hat{\beta}) = \hat{e}'\hat{e} = Y'(I - P_X)Y.$$
We now turn our attention to estimating $\sigma^2$.
Result 4.3. Suppose that $Z$ is a random vector with mean $E(Z) = \mu$ and covariance matrix $cov(Z) = \Sigma$. Let $A$ be nonrandom. Then
$$E(Z'AZ) = \mu'A\mu + tr(A\Sigma).$$
Proof. Note that $Z'AZ$ is a scalar random variable; hence, $Z'AZ = tr(Z'AZ)$. Also, recall that expectation $E(\cdot)$ and $tr(\cdot)$ are linear operators, and that $tr(AB) = tr(BA)$ for conformable $A$ and $B$. Now,
$$E(Z'AZ) = E[tr(Z'AZ)] = E[tr(AZZ')] = tr[AE(ZZ')] = tr[A(\Sigma + \mu\mu')] = tr(A\Sigma) + tr(A\mu\mu') = tr(A\Sigma) + tr(\mu'A\mu) = tr(A\Sigma) + \mu'A\mu.$$
REMARK: Finding $var(Z'AZ)$ is more difficult; see Section 4.9 in Monahan. Considerable simplification results when $Z$ follows a multivariate normal distribution.
APPLICATION: We now find an unbiased estimator of $\sigma^2$ under the GM model. Suppose that $Y$ is $n\times 1$ and $X$ is $n\times p$ with rank $r \leq p$. Note that $E(Y) = X\beta$. Applying Result 4.3 directly with $A = I - P_X$, we have
$$E[Y'(I - P_X)Y] = \underbrace{(X\beta)'(I - P_X)X\beta}_{=\ 0} + tr[(I - P_X)\sigma^2I] = \sigma^2[tr(I) - tr(P_X)] = \sigma^2[n - r(P_X)] = \sigma^2(n - r).$$
Thus,
$$\hat{\sigma}^2 = (n - r)^{-1}Y'(I - P_X)Y$$
is an unbiased estimator of $\sigma^2$ in the GM model. In non-matrix notation,
$$\hat{\sigma}^2 = (n - r)^{-1}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2.$$
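The two forms of $\hat{\sigma}^2$ agree, as a quick numerical sketch shows (numpy assumed; the one-way ANOVA data are fabricated, so $\hat{\sigma}^2 = 2$ here is specific to this toy example):

```python
import numpy as np

X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])   # made-up data

n = X.shape[0]
r = np.linalg.matrix_rank(X)                   # r = 3 (4 columns, one dependency)
P = X @ np.linalg.pinv(X)                      # perpendicular projection onto C(X)

sigma2_matrix = y @ (np.eye(n) - P) @ y / (n - r)     # Y'(I - P_X)Y / (n - r)
sigma2_sums = np.sum((y - P @ y) ** 2) / (n - r)      # sum of squared residuals / (n - r)
```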
The corrected ANOVA table is as follows:

Source              df       SS                          MS
Model (Corrected)   $r-1$    SSR $= Y'(P_X - P_1)Y$      MSR $=$ SSR$/(r-1)$
Residual            $n-r$    SSE $= Y'(I - P_X)Y$        MSE $=$ SSE$/(n-r)$
Total (Corrected)   $n-1$    SST $= Y'(I - P_1)Y$

The overall $F$ statistic is $F = $ MSR$/$MSE.
NOTES:
The degrees of freedom associated with each SS is the rank of its appropriate perpendicular projection matrix; that is, $r(P_X - P_1) = r - 1$ and $r(I - P_X) = n - r$.
Note that
$$cov(\hat{Y}, \hat{e}) = cov[P_XY, (I - P_X)Y] = P_X\sigma^2I(I - P_X)' = 0.$$
That is, the least squares fitted values are uncorrelated with the residuals.
We have just shown that $E(MSE) = \sigma^2$. If $X\beta \notin C(1)$, that is, the independent variables in $X$ add to the model (beyond an intercept, for example), then
$$E(SSR) = E[Y'(P_X - P_1)Y] = (X\beta)'(P_X - P_1)X\beta + tr[(P_X - P_1)\sigma^2I] = (X\beta)'(P_X - P_1)X\beta + (r - 1)\sigma^2.$$
Thus,
$$E(MSR) = (r - 1)^{-1}E(SSR) = \sigma^2 + (r - 1)^{-1}(X\beta)'(P_X - P_1)X\beta.$$
If $X\beta \in C(1)$, that is, the independent variables in $X$ add nothing to the model, then $(X\beta)'(P_X - P_1)X\beta = 0$ and MSR and MSE are both unbiased estimators of $\sigma^2$. If this is true, $F$ should be close to 1. Large values of $F$ occur when $(X\beta)'(P_X - P_1)X\beta$ is large, that is, when $X\beta$ is far away from $C(1)$, that is, when the independent variables in $X$ are more relevant in explaining $E(Y)$.
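The projection-matrix decomposition of the ANOVA table can be computed directly. A sketch assuming numpy, with the same fabricated one-way data (the specific values SSE $= 6$ and $F = 13/3$ are particular to this toy example):

```python
import numpy as np

X = np.hstack([np.ones((6, 1)), np.kron(np.eye(3), np.ones((2, 1)))])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0, 7.0])   # made-up data
n = X.shape[0]
r = np.linalg.matrix_rank(X)

PX = X @ np.linalg.pinv(X)                     # projection onto C(X)
one = np.ones((n, 1))
P1 = one @ one.T / n                           # projection onto C(1)

SSR = y @ (PX - P1) @ y                        # model (corrected) sum of squares
SSE = y @ (np.eye(n) - PX) @ y                 # residual sum of squares
SST = y @ (np.eye(n) - P1) @ y                 # total (corrected) sum of squares
F = (SSR / (r - 1)) / (SSE / (n - r))          # the ANOVA F statistic
```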
4.4 Underfitting and overfitting
4.4.1
Underfitting (Misspecification)
CONSEQUENCES: Now, let's turn to the estimation of $\sigma^2$ when the true model is $Y = X\beta + W\delta + \epsilon$ but we fit $Y = X\beta + \epsilon$. Under the correct model,
$$E[Y'(I - P_X)Y] = (X\beta + W\delta)'(I - P_X)(X\beta + W\delta) + tr[(I - P_X)\sigma^2I] = (W\delta)'(I - P_X)W\delta + \sigma^2(n - r),$$
where $r = r(X)$. Thus,
$$E(MSE) = \sigma^2 + (n - r)^{-1}(W\delta)'(I - P_X)W\delta;$$
that is, $\hat{\sigma}^2 = $ MSE is unbiased if and only if $W\delta \in C(X)$.
4.4.2
Overfitting
Suppose the true model is $Y = X_1\beta_1 + \epsilon$, but we fit the larger model $Y = X_1\beta_1 + X_2\beta_2 + \epsilon$, where $X = (X_1\ X_2)$ has full column rank. The normal equations $X'X\beta = X'Y$ become
$$\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix},$$
so that
$$\hat{\beta} = \begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}\begin{pmatrix} X_1'Y \\ X_2'Y \end{pmatrix}.$$
If the columns of $X_2$ are orthogonal to $C(X_1)$, i.e., $X_1'X_2 = 0$, then
$$X'X = \begin{pmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{pmatrix};$$
i.e., $X'X$ is block diagonal, and $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'Y$. This would mean that using the unnecessarily large model has no effect on our estimate of $\beta_1$. However, the precision with which we can estimate $\sigma^2$ is affected, since $r(I - P_X) < r(I - P_{X_1})$; that is, we have fewer residual degrees of freedom.
If the columns of $X_2$ are not all orthogonal to $C(X_1)$, then
$$cov(\hat{\beta}_1) \neq \sigma^2(X_1'X_1)^{-1}.$$
Furthermore, as $X_2$ gets closer to $C(X_1)$, $X_2'(I - P_{X_1})X_2$ gets smaller. This makes $[X_2'(I - P_{X_1})X_2]^{-1}$ larger, which in turn makes $cov(\hat{\beta}_1)$ larger. Multicollinearity occurs when $X_2$ is close to $C(X_1)$. Severe multicollinearity can greatly inflate the variances of the least squares estimates. In turn, this can have a deleterious effect on inference (e.g., confidence intervals too wide, hypothesis tests with no power, predicted values with little precision, etc.). Various diagnostic measures exist to assess multicollinearity (e.g., VIFs, condition numbers, etc.); see the discussion in Monahan, pp. 80-82.
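The variance inflation is easy to see numerically. The sketch below (numpy assumed; all data values are invented) compares the diagonal of $(X'X)^{-1}$, which is proportional to the variances of the estimates, for a second regressor far from $C(X_1)$ versus one nearly inside it:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def var_factor(x2):
    """Diagonal of (X'X)^{-1} for X = [1, x1, x2]: proportional to the
    variances of the least squares estimates when cov(eps) = sigma^2 I."""
    X = np.column_stack([np.ones(5), x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

# x2 well separated from C([1, x1]) versus x2 nearly collinear with x1
v_far = var_factor(np.array([2.0, -1.0, 4.0, 0.0, 3.0]))
v_near = var_factor(x1 + np.array([1e-3, -1e-3, 0.0, 1e-3, -1e-3]))
# The variance factors for both slope estimates explode as x2 -> C(X1).
```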
4.5 Generalized least squares
Consider the Aitken model $Y = X\beta + \epsilon$, where $E(\epsilon) = 0$ and $cov(\epsilon) = \sigma^2V$, with $V$ known and positive definite. Write the spectral decomposition $V = QDQ'$ and define $V^{1/2} = QD^{1/2}Q'$, where $D^{1/2} = diag(\sqrt{\lambda_1}, \sqrt{\lambda_2}, ..., \sqrt{\lambda_n})$. Note that $V^{1/2}V^{1/2} = V$ and that $V^{-1} = V^{-1/2}V^{-1/2}$, where $V^{-1/2} = QD^{-1/2}Q'$. Transforming by $V^{-1/2}$ produces the model
$$Y^* = U\beta + \epsilon^*,$$
where $Y^* = V^{-1/2}Y$, $U = V^{-1/2}X$, and $\epsilon^* = V^{-1/2}\epsilon$.
The generalized least squares (GLS) estimator is
$$\hat{\beta}_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}Y,$$
which minimizes
$$Q^*(\beta) = (Y - X\beta)'V^{-1}(Y - X\beta).$$
When $V = diag(v_1, v_2, ..., v_n)$, this criterion reduces to
$$Q^*(\beta) = \sum_{i=1}^n w_i(Y_i - x_i'\beta)^2,$$
where $w_i = 1/v_i$ and $x_i'$ is the $i$th row of $X$. In this situation, $\hat{\beta}_{GLS}$ is called the weighted least squares estimator.
Result 4.4. Consider the Aitken model $Y = X\beta + \epsilon$, where $E(\epsilon) = 0$ and $cov(\epsilon) = \sigma^2V$, where $V$ is known. If $\lambda'\beta$ is estimable, then $\lambda'\hat{\beta}_{GLS}$ is the BLUE for $\lambda'\beta$. In the full rank case, it follows that $E(\hat{\beta}_{GLS}) = \beta$ and
$$cov(\hat{\beta}_{GLS}) = \sigma^2(X'V^{-1}X)^{-1}.$$
Example 4.2. Heteroscedastic regression through the origin. Consider the regression model $Y_i = \beta x_i + \epsilon_i$, for $i = 1, 2, ..., n$, where $E(\epsilon_i) = 0$ and $var(\epsilon_i) = \sigma^2g^2(x_i)$, for some real function $g(\cdot)$. Here,
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad\text{and}\quad V = \begin{pmatrix} g^2(x_1) & 0 & \cdots & 0 \\ 0 & g^2(x_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & g^2(x_n) \end{pmatrix}.$$
The OLS estimator is $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y = \sum_{i=1}^n x_iY_i/\sum_{i=1}^n x_i^2$, and the GLS estimator is
$$\hat{\beta}_{GLS} = (X'V^{-1}X)^{-1}X'V^{-1}Y = \frac{\sum_{i=1}^n x_iY_i/g^2(x_i)}{\sum_{i=1}^n x_i^2/g^2(x_i)}.$$
Which one is better? Both of these estimators are unbiased, so we turn to the variances. Straightforward calculations show that
$$var(\hat{\beta}_{OLS}) = \sigma^2\,\frac{\sum_{i=1}^n x_i^2g^2(x_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} \quad\text{and}\quad var(\hat{\beta}_{GLS}) = \frac{\sigma^2}{\sum_{i=1}^n x_i^2/g^2(x_i)}.$$
We are thus left to compare
$$\frac{\sum_{i=1}^n x_i^2g^2(x_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} \quad\text{with}\quad \frac{1}{\sum_{i=1}^n x_i^2/g^2(x_i)}.$$
By the Cauchy-Schwarz inequality,
$$\left(\sum_{i=1}^n u_iv_i\right)^2 \leq \sum_{i=1}^n u_i^2\sum_{i=1}^n v_i^2,$$
so taking $u_i = x_ig(x_i)$ and $v_i = x_i/g(x_i)$ gives
$$\left(\sum_{i=1}^n x_i^2\right)^2 \leq \sum_{i=1}^n x_i^2g^2(x_i)\sum_{i=1}^n x_i^2/g^2(x_i).$$
Thus,
$$\frac{1}{\sum_{i=1}^n x_i^2/g^2(x_i)} \leq \frac{\sum_{i=1}^n x_i^2g^2(x_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} \implies var(\hat{\beta}_{GLS}) \leq var(\hat{\beta}_{OLS}).$$
This result should not be surprising; after all, we know that $\hat{\beta}_{GLS}$ is BLUE.
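The two variance formulas can be evaluated directly for a hypothetical choice such as $g^2(x_i) = x_i^2$; the sketch below (numpy assumed; $\sigma^2 = 1$ and the $x_i$ are made up) confirms the ordering:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # made-up covariate values
g2 = x ** 2                                    # hypothetical g^2(x_i) = x_i^2

# Variance formulas from Example 4.2, with sigma^2 = 1
var_ols = np.sum(x**2 * g2) / np.sum(x**2)**2  # sum x^2 g^2 / (sum x^2)^2
var_gls = 1.0 / np.sum(x**2 / g2)              # 1 / sum (x^2 / g^2) = 1/n here
```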
Result 4.5. An estimate $\hat{\beta}$ is a generalized least squares estimate if and only if $X\hat{\beta} = AY$, where $A = X(X'V^{-1}X)^-X'V^{-1}$.
Proof. The GLS estimate, i.e., the OLS estimate in the transformed model $Y^* = U\beta + \epsilon^*$, where $Y^* = V^{-1/2}Y$, $U = V^{-1/2}X$, and $\epsilon^* = V^{-1/2}\epsilon$, satisfies
$$V^{-1/2}X[(V^{-1/2}X)'V^{-1/2}X]^-(V^{-1/2}X)'V^{-1/2}Y = V^{-1/2}X\hat{\beta},$$
by Result 2.5. Multiplying through by $V^{1/2}$ and simplifying gives the result.
CHAPTER 5
Distributional Theory
5.1
Introduction
is a location-scale family generated by $f_Z(z)$; see Casella and Berger, Chapter 3. That is, if $Z \sim f_Z(z)$, then
$$X = \sigma Z + \mu \iff f_X(x|\mu, \sigma) = \frac{1}{\sigma}f_Z\left(\frac{x - \mu}{\sigma}\right).$$
APPLICATION: With the standard normal density $f_Z(z)$, it is easy to see that
$$f_X(x|\mu, \sigma) = \frac{1}{\sigma}f_Z\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2\sigma^2}(x - \mu)^2}I(x \in \mathbb{R}).$$
That is, any normal random variable $X \sim N(\mu, \sigma^2)$ may be obtained by transforming $Z \sim N(0, 1)$ via $X = \sigma Z + \mu$.
5.2 The multivariate normal distribution

5.2.1 Probability density function
STARTING POINT: Suppose that $Z_1, Z_2, ..., Z_p$ are iid standard normal random variables. The joint pdf of $Z = (Z_1, Z_2, ..., Z_p)'$ is given by
$$f_Z(z) = \prod_{i=1}^p f_Z(z_i) = (2\pi)^{-p/2}e^{-\sum_{i=1}^p z_i^2/2}\prod_{i=1}^p I(z_i \in \mathbb{R}).$$
Now consider the transformation $Y = g(Z) = V^{1/2}Z + \mu$, where $V$ is positive definite (pd), so that $g^{-1}(y) = V^{-1/2}(y - \mu)$ and the Jacobian of the transformation is $|V^{-1/2}|$,
where $|A|$ denotes the determinant of $A$. The matrix $V^{1/2}$ is pd; thus, its determinant is always positive. Thus, for $y \in \mathbb{R}^p$,
$$f_Y(y) = f_Z\{g^{-1}(y)\}|V^{-1/2}| = |V|^{-1/2}f_Z\{V^{-1/2}(y - \mu)\} = (2\pi)^{-p/2}|V|^{-1/2}\exp[-\{V^{-1/2}(y - \mu)\}'V^{-1/2}(y - \mu)/2] = (2\pi)^{-p/2}|V|^{-1/2}\exp\{-(y - \mu)'V^{-1}(y - \mu)/2\}.$$
If $Y \sim f_Y(y)$, we say that $Y$ has a multivariate normal distribution with mean $\mu$ and covariance matrix $V$. We write $Y \sim N_p(\mu, V)$.
IMPORTANT: In the preceding derivation, we assumed $V$ to be pd (hence, nonsingular). If $V$ is singular, then the distribution of $Y$ is concentrated in a subspace of $\mathbb{R}^p$, with dimension $r(V)$. In this situation, the density function of $Y$ does not exist.
5.2.2 Moment generating functions

TERMINOLOGY: Suppose that the expectation $M_X(t) = E(e^{tX})$ is defined for all $t$ in an open neighborhood about zero. The function $M_X(t)$ is called the moment generating function (mgf) of $X$.
Result 5.1.
1. If $M_X(t)$ exists, then $E(|X|^j) < \infty$, for all $j \geq 1$; that is, the moment generating function characterizes an infinite set of moments.
2. $M_X(0) = 1$.
3. The $j$th moment of $X$ is given by
$$E(X^j) = \left.\frac{\partial^j M_X(t)}{\partial t^j}\right|_{t=0}.$$
4. Uniqueness. If $X_1 \sim M_{X_1}(t)$, $X_2 \sim M_{X_2}(t)$, and $M_{X_1}(t) = M_{X_2}(t)$ for all $t$ in an open neighborhood about zero, then $F_{X_1}(x) = F_{X_2}(x)$ for all $x$.
5. If $X_1, X_2, ..., X_n$ are independent random variables with mgfs $M_{X_i}(t)$, $i = 1, 2, ..., n$, and $Y = a_0 + \sum_{i=1}^n a_iX_i$, then
$$M_Y(t) = e^{a_0t}\prod_{i=1}^n M_{X_i}(a_it).$$
Result 5.2.
1. If $X \sim N(\mu, \sigma^2)$, then $M_X(t) = \exp(\mu t + t^2\sigma^2/2)$, for all $t \in \mathbb{R}$.
2. If $X \sim N(\mu, \sigma^2)$, then $Y = a + bX \sim N(a + b\mu, b^2\sigma^2)$.
3. If $X \sim N(\mu, \sigma^2)$, then $Z = \sigma^{-1}(X - \mu) \sim N(0, 1)$.
TERMINOLOGY: Define the random vector $X = (X_1, X_2, ..., X_p)'$ and let $t = (t_1, t_2, ..., t_p)'$. The moment generating function for $X$ is given by
$$M_X(t) = E\{\exp(t'X)\} = \int_{\mathbb{R}^p}\exp(t'x)\,dF_X(x),$$
provided that $E\{\exp(t'X)\} < \infty$, for all $||t|| < \delta$, $\delta > 0$.
Result 5.3.
1. If $M_X(t)$ exists, then $M_{X_i}(t_i) = M_X(\tilde{t}_i)$, where $\tilde{t}_i = (0, ..., 0, t_i, 0, ..., 0)'$. This implies that $E(|X_i|^j) < \infty$, for all $j \geq 1$.
2. The expected value of $X$ is
$$E(X) = \left.\frac{\partial M_X(t)}{\partial t}\right|_{t=0}.$$
3. The second moment matrix is
$$E(XX') = \left.\frac{\partial^2 M_X(t)}{\partial t\,\partial t'}\right|_{t=0}.$$
Thus,
$$E(X_rX_s) = \left.\frac{\partial^2 M_X(t)}{\partial t_r\,\partial t_s}\right|_{t_r=t_s=0}.$$
4. Uniqueness. If $X_1$ and $X_2$ are random vectors with $M_{X_1}(t) = M_{X_2}(t)$ for all $t$ in an open neighborhood about zero, then $F_{X_1}(x) = F_{X_2}(x)$ for all $x$.
5. If $X_1, X_2, ..., X_n$ are independent random vectors, and
$$Y = a_0 + \sum_{i=1}^n A_iX_i,$$
then
$$M_Y(t) = \exp(a_0't)\prod_{i=1}^n M_{X_i}(A_i't).$$
6. Let $X = (X_1', X_2', ..., X_m')'$ and suppose that $M_X(t)$ exists. Let $M_{X_i}(t_i)$ denote the mgf of $X_i$. Then, $X_1, X_2, ..., X_m$ are independent if and only if
$$M_X(t) = \prod_{i=1}^m M_{X_i}(t_i)$$
for all $t = (t_1', t_2', ..., t_m')'$ in an open neighborhood about zero.
Result 5.4. If $Y \sim N_p(\mu, V)$, then $M_Y(t) = \exp(t'\mu + t'Vt/2)$.
Proof. Exercise.
5.2.3 Properties

Partition $Y \sim N_p(\mu, V)$ as $Y = (Y_1', Y_2')'$, with $\mu = (\mu_1', \mu_2')'$ and
$$V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}.$$
5.2.4 Less-than-full-rank normal distributions
EXAMPLE: Suppose that $Z_1 \sim N(0, 1)$ and define
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \mu + \begin{pmatrix} \gamma_1 \\ \gamma_2 \end{pmatrix}Z_1 = \mu + \gamma Z_1,$$
where $\gamma = (\gamma_1, \gamma_2)'$ and $r(\gamma) = 1$. Since $r(\gamma) = 1$, at least one of $\gamma_1$ and $\gamma_2$ is not equal to zero. Without loss, take $\gamma_1 \neq 0$, in which case
$$Y_2 - \mu_2 = \frac{\gamma_2}{\gamma_1}(Y_1 - \mu_1).$$
Also,
$$cov(Y) = E\{(Y - \mu)(Y - \mu)'\} = E\begin{pmatrix} \gamma_1^2Z_1^2 & \gamma_1\gamma_2Z_1^2 \\ \gamma_1\gamma_2Z_1^2 & \gamma_2^2Z_1^2 \end{pmatrix} = \begin{pmatrix} \gamma_1^2 & \gamma_1\gamma_2 \\ \gamma_1\gamma_2 & \gamma_2^2 \end{pmatrix} = \gamma\gamma' = V.$$
Note that $|V| = 0$. Thus, $Y_{2\times 1}$ is a random vector with all of its probability mass located on the line $\{(y_1, y_2): y_2 - \mu_2 = \gamma_2(y_1 - \mu_1)/\gamma_1\}$. Since $r(V) = 1 < 2$, $Y$ does not have a density function.
5.2.5
Independence results
Result 5.7. Suppose $Y \sim N(\mu, V)$, partitioned as
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{pmatrix}, \quad\text{and}\quad V = \begin{pmatrix} V_{11} & V_{12} & \cdots & V_{1m} \\ V_{21} & V_{22} & \cdots & V_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ V_{m1} & V_{m2} & \cdots & V_{mm} \end{pmatrix}.$$
Then, $Y_1, Y_2, ..., Y_m$ are jointly independent if and only if $V_{ij} = 0$, for all $i \neq j$.
Proof. ($\Longrightarrow$) Suppose $Y_1, Y_2, ..., Y_m$ are jointly independent. For all $i \neq j$,
$$V_{ij} = E\{(Y_i - \mu_i)(Y_j - \mu_j)'\} = E(Y_i - \mu_i)E\{(Y_j - \mu_j)'\} = 0.$$
($\Longleftarrow$) Suppose that $V_{ij} = 0$ for all $i \neq j$, and let $t = (t_1', t_2', ..., t_m')'$. Note that
$$t'Vt = \sum_{i=1}^m t_i'V_{ii}t_i \quad\text{and}\quad t'\mu = \sum_{i=1}^m t_i'\mu_i.$$
Thus,
$$M_Y(t) = \exp(t'\mu + t'Vt/2) = \exp\left(\sum_{i=1}^m t_i'\mu_i + \sum_{i=1}^m t_i'V_{ii}t_i/2\right) = \prod_{i=1}^m \exp(t_i'\mu_i + t_i'V_{ii}t_i/2) = \prod_{i=1}^m M_{Y_i}(t_i).$$
Now suppose $X$ is multivariate normal with covariance matrix $\Sigma$ and
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} + \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}X = a + BX.$$
Thus, $Y$ is a linear combination of $X$; hence, $Y$ follows a multivariate normal distribution (i.e., $Y_1$ and $Y_2$ are jointly normal). Also, $cov(Y_1, Y_2) = cov(B_1X, B_2X) = B_1\Sigma B_2'$. Now simply apply Result 5.7.
REMARK: If $X_1 \sim N(\mu_1, \Sigma_1)$, $X_2 \sim N(\mu_2, \Sigma_2)$, and $cov(X_1, X_2) = 0$, this does not necessarily mean that $X_1$ and $X_2$ are independent! We need $X = (X_1', X_2')'$ to be jointly normal.
APPLICATION: Consider the general linear model
$$Y = X\beta + \epsilon,$$
where $\epsilon \sim N_n(0, \sigma^2I)$. We have already seen that $Y \sim N_n(X\beta, \sigma^2I)$. Also, note that with $P_X = X(X'X)^-X'$,
$$\begin{pmatrix} \hat{Y} \\ \hat{e} \end{pmatrix} = \begin{pmatrix} P_X \\ I - P_X \end{pmatrix}Y,$$
a linear combination of $Y$. Thus, $\hat{Y}$ and $\hat{e}$ are jointly normal. By the last result, we know that $\hat{Y}$ and $\hat{e}$ are independent since
$$cov(\hat{Y}, \hat{e}) = P_X\sigma^2I(I - P_X)' = 0.$$
That is, the fitted values and residuals from the least squares fit are independent. This explains why residual plots that display nonrandom patterns are consistent with a violation of our model assumptions.
5.2.6
Conditional distributions
Suppose that $(X, Y)$ is bivariate normal with
$$\mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{pmatrix},$$
where $\rho = corr(X, Y)$. The conditional distribution of $Y$, given $X$, is also normally distributed; more precisely,
$$Y|\{X = x\} \sim N\left(\mu_Y + \rho(\sigma_Y/\sigma_X)(x - \mu_X),\ \sigma_Y^2(1 - \rho^2)\right).$$
It is important to see that the conditional mean $E(Y|X = x)$ is a linear function of $x$. Note also that the conditional variance $var(Y|X = x)$ is free of $x$.
EXTENSION: We wish to extend the previous result to random vectors. In particular, suppose that $X$ and $Y$ are jointly multivariate normal with $\Sigma_{XY} \neq 0$. That is, suppose
$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{pmatrix}\right),$$
and assume that $\Sigma_X$ is nonsingular. The conditional distribution of $Y$ given $X$ is
$$Y|\{X = x\} \sim N(\mu_{Y|X}, \Sigma_{Y|X}),$$
where
$$\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_X^{-1}(x - \mu_X) \quad\text{and}\quad \Sigma_{Y|X} = \Sigma_Y - \Sigma_{YX}\Sigma_X^{-1}\Sigma_{XY}.$$
Again, the conditional mean $\mu_{Y|X}$ is a linear function of $x$ and the conditional covariance matrix $\Sigma_{Y|X}$ is free of $x$.
5.3 Noncentral $\chi^2$ distribution

RECALL: Suppose that $U \sim \chi_n^2$; that is, $U$ has a (central) $\chi^2$ distribution with $n > 0$ degrees of freedom. The pdf of $U$ is given by
$$f_U(u|n) = \frac{u^{n/2-1}e^{-u/2}}{\Gamma(n/2)\,2^{n/2}}\,I(u > 0).$$
Result. If $Z \sim N_n(0, I)$, then
$$Z'Z = \sum_{i=1}^n Z_i^2 \sim \chi_n^2.$$
Proof. Exercise.
TERMINOLOGY: A univariate random variable $V$ is said to have a noncentral $\chi^2$ distribution with degrees of freedom $n > 0$ and noncentrality parameter $\lambda > 0$ if it has the pdf
$$f_V(v|n, \lambda) = \sum_{j=0}^\infty \frac{e^{-\lambda}\lambda^j}{j!}\,\underbrace{\frac{v^{(n+2j)/2-1}e^{-v/2}}{\Gamma\left(\frac{n+2j}{2}\right)2^{(n+2j)/2}}}_{f_U(v|n+2j)}\,I(v > 0).$$
We write $V \sim \chi_n^2(\lambda)$. The pdf of $V$ is a Poisson($\lambda$)-weighted mixture of central $\chi^2$ pdfs: with $W \sim$ Poisson($\lambda$) and $V|\{W = j\} \sim \chi^2_{n+2j}$,
$$f_V(v|n, \lambda) = \sum_{j=0}^\infty f_{V,W}(v, j) = \sum_{j=0}^\infty f_W(j)f_{V|W}(v|j) = \sum_{j=0}^\infty \frac{e^{-\lambda}\lambda^j}{j!}\,\frac{v^{(n+2j)/2-1}e^{-v/2}}{\Gamma\left(\frac{n+2j}{2}\right)2^{(n+2j)/2}}.$$
The mgf of $V$ is
$$M_V(t) = (1 - 2t)^{-n/2}\exp\left(\frac{2\lambda t}{1 - 2t}\right),$$
for $t < 1/2$. To see this, write $M_V(t) = E\{E(e^{tV}|W)\}$ and use the fact that, given $W = j$, $V \sim \chi^2_{n+2j}$:
$$E\{E(e^{tV}|W)\} = \sum_{j=0}^\infty \frac{e^{-\lambda}\lambda^j}{j!}(1 - 2t)^{-(n+2j)/2} = e^{-\lambda}(1 - 2t)^{-n/2}\sum_{j=0}^\infty \frac{\left(\frac{\lambda}{1-2t}\right)^j}{j!} = e^{-\lambda}(1 - 2t)^{-n/2}\exp\left(\frac{\lambda}{1 - 2t}\right) = (1 - 2t)^{-n/2}\exp\left(\frac{2\lambda t}{1 - 2t}\right),$$
since $-\lambda + \lambda/(1 - 2t) = 2\lambda t/(1 - 2t)$.

Result 5.11. If $Y \sim N(\mu, 1)$, then $U = Y^2 \sim \chi_1^2(\mu^2/2)$.
Proof. The mgf of $U$ is
$$M_U(t) = E(e^{tY^2}) = \int_{\mathbb{R}} e^{ty^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(y-\mu)^2}\,dy.$$
Now, combine the exponents in the integrand, square out the $(y - \mu)^2$ term, combine like terms, complete the square, and collapse the expression to $(1 - 2t)^{-1/2}\exp\{\mu^2t/(1 - 2t)\}$ times some normal density that is integrated over $\mathbb{R}$. Thus,
$$M_U(t) = (1 - 2t)^{-1/2}\exp\left(\frac{\mu^2t}{1 - 2t}\right),$$
which matches the noncentral $\chi^2$ mgf with $n = 1$ and $\lambda = \mu^2/2$.
Result 5.12. If $U_1, U_2, ..., U_m$ are independent random variables, where $U_i \sim \chi^2_{n_i}(\lambda_i)$, $i = 1, 2, ..., m$, then $U = \sum_i U_i \sim \chi_n^2(\lambda)$, where $n = \sum_i n_i$ and $\lambda = \sum_i \lambda_i$.
Result 5.13. Suppose that $V \sim \chi_n^2(\lambda)$. For fixed $n$ and $c > 0$, the quantity $P(V > c)$ is a strictly increasing function of $\lambda$.
Proof. See Monahan, pp. 106-108.
IMPLICATION: If $V_1 \sim \chi_n^2(\lambda_1)$ and $V_2 \sim \chi_n^2(\lambda_2)$, where $\lambda_2 > \lambda_1$, then $pr(V_2 > c) > pr(V_1 > c)$. That is, $V_2$ is (strictly) stochastically greater than $V_1$, written $V_2 >_{st} V_1$. Note that
$$V_2 >_{st} V_1 \iff F_{V_2}(v) < F_{V_1}(v) \iff S_{V_2}(v) > S_{V_1}(v),$$
for all $v$, where $F_{V_i}(\cdot)$ denotes the cdf of $V_i$ and $S_{V_i}(\cdot) = 1 - F_{V_i}(\cdot)$ denotes the survivor function of $V_i$.
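Result 5.13 can be checked with scipy's noncentral $\chi^2$ implementation. The sketch below (assuming scipy is available; the degrees of freedom, cutoff, and noncentrality values are arbitrary illustrations) evaluates the survivor function at a fixed cutoff for increasing $\lambda$:

```python
from scipy.stats import ncx2

n, c = 5, 11.07                                # df and an arbitrary fixed cutoff
lams = [0.5, 1.0, 2.0, 4.0]                    # increasing noncentrality parameters

# P(V > c) for V ~ chi^2_n(lambda): should be strictly increasing in lambda
tail = [ncx2.sf(c, n, lam) for lam in lams]
```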
5.4 Noncentral F distribution

RECALL: If $U_1 \sim \chi^2_{n_1}$ and $U_2 \sim \chi^2_{n_2}$ are independent, then
$$\frac{U_1/n_1}{U_2/n_2} \sim F_{n_1,n_2}.$$
TERMINOLOGY: A random variable $W$ has a noncentral $F$ distribution with degrees of freedom $n_1 > 0$ and $n_2 > 0$ and noncentrality parameter $\lambda > 0$ if it has the pdf
$$f_W(w|n_1, n_2, \lambda) = \sum_{j=0}^\infty \frac{e^{-\lambda}\lambda^j}{j!}\,\frac{\Gamma\left(\frac{n_1+2j+n_2}{2}\right)}{\Gamma\left(\frac{n_1+2j}{2}\right)\Gamma\left(\frac{n_2}{2}\right)}\left(\frac{n_1}{n_2}\right)^{(n_1+2j)/2}\frac{w^{(n_1+2j-2)/2}}{\left(1 + \frac{n_1w}{n_2}\right)^{(n_1+2j+n_2)/2}}\,I(w > 0).$$
We write $W \sim F_{n_1,n_2}(\lambda)$. When $\lambda = 0$, the noncentral $F$ distribution reduces to the central $F$ distribution.
MEAN AND VARIANCE: If $W \sim F_{n_1,n_2}(\lambda)$, then
$$E(W) = \frac{n_2}{n_2 - 2}\left(1 + \frac{2\lambda}{n_1}\right)$$
and
$$var(W) = \frac{2n_2^2}{n_1^2(n_2 - 2)}\left[\frac{(n_1 + 2\lambda)^2}{(n_2 - 2)(n_2 - 4)} + \frac{n_1 + 4\lambda}{n_2 - 4}\right].$$
$E(W)$ exists only when $n_2 > 2$ and $var(W)$ exists only when $n_2 > 4$. The moment generating function for the noncentral $F$ distribution does not exist in closed form.
Result 5.14. If $U_1$ and $U_2$ are independent random variables with $U_1 \sim \chi^2_{n_1}(\lambda)$ and $U_2 \sim \chi^2_{n_2}$, then
$$W = \frac{U_1/n_1}{U_2/n_2} \sim F_{n_1,n_2}(\lambda).$$
the alternative hypothesis. Thus, the form of an appropriate rejection region is to reject $H_0$ for large values of the test statistic. The power is simply the probability of the rejection region (defined under $H_0$) when the test statistic follows its noncentral $F$ distribution. Noncentral $F$ distributions are available in most software packages.
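Such a power calculation can be sketched with scipy (assumed available; the degrees of freedom and noncentrality values below are illustrative, not tied to any example in the notes): the rejection cutoff comes from the central $F$, and power is the noncentral $F$ survivor function at that cutoff.

```python
from scipy.stats import f, ncf

alpha = 0.05
dfn, dfd = 3, 20                               # illustrative degrees of freedom
f_crit = f.ppf(1 - alpha, dfn, dfd)            # central-F rejection cutoff

# Power at a few noncentrality values: P(W > f_crit) for W ~ F_{dfn,dfd}(lam).
# By Result 5.13 (via Result 5.14), power increases strictly in lam.
powers = [ncf.sf(f_crit, dfn, dfd, lam) for lam in (0.1, 2.0, 5.0, 10.0)]
```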
5.5 Distributions of quadratic forms
GOAL: We would like to find the distribution of $Y'AY$, where $Y \sim N_p(\mu, V)$. We will obtain this distribution in steps. Result 5.16 is a very small step. Result 5.17 is a large step, and Result 5.18 is the finish line. There is no harm in assuming that $A$ is symmetric.
Result 5.16. Suppose that $Y \sim N_p(\mu, I)$ and define
$$W = Y'Y = Y'IY = \sum_{i=1}^p Y_i^2.$$
Result 5.11 says $Y_i^2 \sim \chi_1^2(\mu_i^2/2)$, for $i = 1, 2, ..., p$. Thus, from Result 5.12,
$$Y'Y = \sum_{i=1}^p Y_i^2 \sim \chi_p^2(\mu'\mu/2).$$
Now suppose $A$ is symmetric and idempotent with rank $s$; then its spectral decomposition is
$$A = QDQ' = (P_1\ P_2)\begin{pmatrix} I_s & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} P_1' \\ P_2' \end{pmatrix} = P_1P_1'.$$
Thus, we have shown that (a) holds. To show that (b) holds, note that because $Q$ is orthogonal,
$$I_p = Q'Q = \begin{pmatrix} P_1' \\ P_2' \end{pmatrix}(P_1\ P_2) = \begin{pmatrix} P_1'P_1 & P_1'P_2 \\ P_2'P_1 & P_2'P_2 \end{pmatrix}.$$
since AV has rank s (by assumption) and V and V^{1/2} are both nonsingular. Also, AV is idempotent by assumption, so that

AV = AVAV ⟹ A = AVA ⟹ V^{1/2}AV^{1/2} = V^{1/2}AV^{1/2}V^{1/2}AV^{1/2} ⟹ B = BB.

Thus, B is idempotent of rank s. This implies that Y'AY = X'BX ~ χ²_s(λ). Noting that

λ = (1/2)(V^{−1/2}μ)'B(V^{−1/2}μ) = (1/2)μ'V^{−1/2}V^{1/2}AV^{1/2}V^{−1/2}μ = (1/2)μ'Aμ

completes the argument.
Example 5.2. Suppose that Y = (Y1, Y2, ..., Yn)' ~ N_n(μ1, σ²I), so that μ = μ1 and V = σ²I, where 1 is n × 1 and I is n × n. The statistic

(n − 1)S² = Σ_{i=1}^n (Yi − Ȳ)² = Y'(I − n⁻¹J)Y = Y'AY,

where A = I − n⁻¹J.
In the last calculation, note that λ = (Xβ)'Xβ/2σ² = 0 iff Xβ = 0. In this case, both quadratic forms Y'(I − P_X)Y/σ² and Y'P_XY/σ² have central χ² distributions.
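The algebraic identity in Example 5.2 — that the centering quadratic form Y'(I − n⁻¹J)Y equals Σ(Yi − Ȳ)² — is easy to confirm numerically. This is just an illustrative check with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
y = rng.normal(size=n)                  # arbitrary data vector

A = np.eye(n) - np.ones((n, n)) / n     # A = I - (1/n)J, the centering matrix
qf = y @ A @ y                          # quadratic form y'Ay
ss = ((y - y.mean()) ** 2).sum()        # (n-1)S^2 computed directly
print(np.isclose(qf, ss), np.allclose(A @ A, A))  # A is also symmetric idempotent
```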
5.6 Independence of quadratic forms

Write the spectral decomposition of the symmetric matrix A as

A = QDQ' = [P1 P2] diag(D1, 0) [P1 P2]' = P1D1P1',

where D1 = diag(λ1, λ2, ..., λs) contains the s = r(A) nonzero eigenvalues of A.
Because B and P1' are fixed matrices,

[BY; P1'Y] = [B; P1']Y ~ N( [Bμ; P1'μ], [BVB'  BVP1; P1'VB'  P1'VP1] ).
Suppose that BVA = 0. Then,

0 = BVA = BVP1D1P1' ⟹ 0 = (BVP1D1P1')P1.

But, because Q is orthogonal,

I_p = Q'Q = [P1'; P2'] [P1 P2] = [P1'P1  P1'P2; P2'P1  P2'P2],

so that P1'P1 = I_s. Thus 0 = BVP1D1, and post-multiplying by D1⁻¹ gives BVP1 = 0.
Proof. Write the spectral decomposition of A as

A = PDP' = [P1 P2] diag(D1, 0) [P1 P2]' = P1D1P1',

where D1 = diag(λ1, λ2, ..., λs) and s = r(A). Similarly, write

B = QRQ' = [Q1 Q2] diag(R1, 0) [Q1 Q2]' = Q1R1Q1',

where R1 = diag(γ1, γ2, ..., γt) and t = r(B). Since P and Q are orthogonal, this implies that P1'P1 = I_s and Q1'Q1 = I_t. Suppose that BVA = 0. Then,

0 = BVA = Q1R1Q1'VP1D1P1'
⟹ 0 = Q1'(Q1R1Q1'VP1D1P1')P1 = R1Q1'VP1D1
⟹ 0 = R1⁻¹(R1Q1'VP1D1)D1⁻¹ = Q1'VP1 = cov(Q1'Y, P1'Y).

Now,

[P1'; Q1']Y ~ N( [P1'μ; Q1'μ], [P1'VP1  0; 0  Q1'VQ1] ).

That is, P1'Y and Q1'Y are jointly normal and uncorrelated; thus, they are independent. So are Y'P1D1P1'Y and Y'Q1R1Q1'Y. But A = P1D1P1' and B = Q1R1Q1', so we are done.
Example 5.5. Consider the general linear model

Y = Xβ + ε,

where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Let P_X = X(X'X)⁻X' denote the perpendicular projection matrix onto C(X). We know that Y ~ N_n(Xβ, σ²I). In particular,

F = [Y'P_XY/r] / [Y'(I − P_X)Y/(n − r)] ~ F_{r,n−r}{(Xβ)'Xβ/2σ²},

a noncentral F distribution with degrees of freedom r (numerator) and n − r (denominator) and noncentrality parameter λ = (Xβ)'Xβ/2σ².
OBSERVATIONS: Note that if Xβ = 0, then F ~ F_{r,n−r}, since the noncentrality parameter λ = 0. On the other hand, as the length of Xβ gets larger, so does λ. This shifts the noncentral F_{r,n−r}{(Xβ)'Xβ/2σ²} distribution to the right, because the noncentral F distribution is stochastically increasing in its noncentrality parameter. Therefore, large values of F are consistent with large values of ||Xβ||.
5.7 Cochran's Theorem

REMARK: An important general notion in linear models is that sums of squares like Y'P_XY and Y'Y can be broken down into sums of squares of smaller pieces. We now discuss Cochran's Theorem (Result 5.21), which serves to explain why this is possible.

Result 5.21. Suppose that Y ~ N_n(μ, σ²I). Suppose that A1, A2, ..., Ak are n × n symmetric and idempotent matrices, where r(Ai) = si, for i = 1, 2, ..., k. If A1 + A2 + ··· + Ak = I_n, then Y'A1Y/σ², Y'A2Y/σ², ..., Y'AkY/σ² follow independent χ²_{si}(λi) distributions, where λi = μ'Aiμ/2σ², for i = 1, 2, ..., k, and Σ_{i=1}^k si = n.

Outline of proof. If A1 + A2 + ··· + Ak = I_n, then each Ai being idempotent implies that both (a) AiAj = 0, for i ≠ j, and (b) Σ_{i=1}^k si = n hold. That Y'AiY/σ² ~ χ²_{si}(λi) follows from Result 5.18 with V = σ²I. Because AiAj = 0, Result 5.20 with V = σ²I guarantees that Y'AiY/σ² and Y'AjY/σ² are independent.
IMPORTANCE: We now show how Cochran's Theorem can be used to deduce the joint distribution of the sums of squares in an analysis of variance. Suppose that we partition the design matrix X and the parameter vector β in Y = Xβ + ε into k + 1 parts, so that

Y = (X0 X1 ··· Xk)(β0', β1', ..., βk')' + ε,

where the dimensions of Xi and βi are n × pi and pi × 1, respectively, and Σ_{i=0}^k pi = p. We can now write this as a (k+1)-part model (the full model):

Y = X0β0 + X1β1 + ··· + Xkβk + ε.

Now consider fitting each of the k submodels

Y = X0β0 + ε
Y = X0β0 + X1β1 + ε
⋮
Y = X0β0 + X1β1 + ··· + X_{k−1}β_{k−1} + ε,
and let R(β0, β1, ..., βi) denote the regression (model) sum of squares from fitting the ith submodel, for i = 0, 1, ..., k; that is, R(β0, β1, ..., βi) = Y'P_{Xi}Y, where P_{Xi} denotes the perpendicular projection matrix onto C([X0 X1 ··· Xi]). Define A0 = P_{X0} and

Ai = P_{Xi} − P_{Xi−1},  i = 1, 2, ..., k − 1,
Ak = P_X − P_{Xk−1},
Ak+1 = I − P_X.

Note that A0 + A1 + A2 + ··· + Ak+1 = I. Note also that the Ai matrices are symmetric, for i = 0, 1, ..., k + 1, and that

Ai² = (P_{Xi} − P_{Xi−1})(P_{Xi} − P_{Xi−1})
    = P_{Xi}² − P_{Xi}P_{Xi−1} − P_{Xi−1}P_{Xi} + P_{Xi−1}²
    = P_{Xi} − P_{Xi−1} = Ai,

for i = 1, 2, ..., k, since P_{Xi}P_{Xi−1} = P_{Xi−1} and P_{Xi−1}P_{Xi} = P_{Xi−1}. Thus, Ai is idempotent for i = 1, 2, ..., k. Clearly, A0 = P_{X0} = X0(X0'X0)⁻X0' and Ak+1 = I − P_X are idempotent as well.
By Cochran's Theorem, the quadratic forms Y'AiY/σ², for i = 0, 1, ..., k + 1, follow χ²_{si}(λi) distributions, where si = r(Ai) and Σ_{i=0}^{k+1} si = n, with sk+1 = n − r(X). Note that the last quadratic form follows a central χ² distribution because λk+1 = (Xβ)'(I − P_X)Xβ/2σ² = 0. Cochran's Theorem also guarantees that the quadratic forms Y'A0Y/σ², Y'A1Y/σ², ..., Y'AkY/σ², Y'Ak+1Y/σ² are independent.
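The conditions behind this application of Cochran's Theorem can be verified numerically. The sketch below builds the sequential projection matrices for a hypothetical three-part design (an intercept and two arbitrary simulated covariates) and checks that the Ai are idempotent, mutually orthogonal, and sum to I.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
# Hypothetical three-part design: intercept (X0) and two arbitrary covariates (X1, X2)
X0 = np.ones((n, 1))
X1 = rng.normal(size=(n, 1))
X2 = rng.normal(size=(n, 1))

def ppm(M):
    """Perpendicular projection matrix onto C(M)."""
    return M @ np.linalg.pinv(M.T @ M) @ M.T

P0 = ppm(X0)
P01 = ppm(np.hstack([X0, X1]))
PX = ppm(np.hstack([X0, X1, X2]))

A = [P0, P01 - P0, PX - P01, np.eye(n) - PX]
print(all(np.allclose(Ai @ Ai, Ai) for Ai in A),   # each A_i is idempotent
      np.allclose(sum(A), np.eye(n)),              # A_0 + ... + A_{k+1} = I
      np.allclose(A[1] @ A[2], 0))                 # A_i A_j = 0 for i != j
```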
ANOVA TABLE: The quadratic forms Y'AiY, for i = 0, 1, ..., k + 1, and the degrees of freedom si = r(Ai) are often presented in the following ANOVA table:

Source                     df      SS                                    Noncentrality
β0                         s0      R(β0)                                 λ0 = (Xβ)'A0Xβ/2σ²
β1 (after β0)              s1      R(β0, β1) − R(β0)                     λ1 = (Xβ)'A1Xβ/2σ²
β2 (after β0, β1)          s2      R(β0, β1, β2) − R(β0, β1)             λ2 = (Xβ)'A2Xβ/2σ²
⋮                          ⋮       ⋮                                     ⋮
βk (after β0, ..., βk−1)   sk      R(β0, ..., βk) − R(β0, ..., βk−1)     λk = (Xβ)'AkXβ/2σ²
Residual                   sk+1    Y'Y − R(β0, ..., βk)                  λk+1 = 0
Total                      n       Y'Y                                   (Xβ)'Xβ/2σ²
To illustrate, consider the one-way ANOVA model Yij = μ + αi + εij, for i = 1, 2, ..., a and j = 1, 2, ..., ni, written as Y = X0μ + X1α + ε. Here,

Y_{n×1} = (Y11, Y12, ..., Y_{a,na})',   X_{n×p} = (X0 X1),   β_{p×1} = (μ, α1, α2, ..., αa)',

where p = a + 1, n = Σ_{i=1}^a ni, X0 = 1, and

X1 = [1_{n1} 0_{n1} ··· 0_{n1}; 0_{n2} 1_{n2} ··· 0_{n2}; ⋮ ⋮ ⋱ ⋮; 0_{na} 0_{na} ··· 1_{na}]

is the n × a matrix of group indicator columns.
The corresponding ANOVA table is

Source                  df      SS                            Noncentrality
μ                       1       R(μ)                          λ0 = (Xβ)'P1Xβ/2σ²
α1, ..., αa (after μ)   a − 1   R(μ, α1, ..., αa) − R(μ)      λ1 = (Xβ)'(P_X − P1)Xβ/2σ²
Residual                n − a   Y'Y − R(μ, α1, ..., αa)       0
Total                   n       Y'Y                           (Xβ)'Xβ/2σ²
F STATISTIC: Because

(1/σ²) Y'(P_X − P1)Y ~ χ²_{a−1}(λ1)

and

(1/σ²) Y'(I − P_X)Y ~ χ²_{n−a},

and because these two quadratic forms are independent, it follows that

F = [Y'(P_X − P1)Y/(a − 1)] / [Y'(I − P_X)Y/(n − a)] ~ F_{a−1,n−a}(λ1).
and

E(MSR) = E{(a − 1)⁻¹ Y'(P_X − P1)Y} = σ² + (Xβ)'(P_X − P1)Xβ/(a − 1)
       = σ² + Σ_{i=1}^a ni(αi − ᾱ+)²/(a − 1),

where ᾱ+ = a⁻¹ Σ_{i=1}^a αi. Again, note that if H0: α1 = α2 = ··· = αa = 0 is true, then MSR is also an unbiased estimator of σ². Therefore, values of

F = [Y'(P_X − P1)Y/(a − 1)] / [Y'(I − P_X)Y/(n − a)]

much larger than 1 are evidence against H0.
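The one-way F statistic above can be computed directly from the projection matrices, and it agrees with the classical between/within sums-of-squares formula. The group sizes and means below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
ni = np.array([4, 5, 3])                       # hypothetical group sizes
a, n = len(ni), ni.sum()
groups = np.repeat(np.arange(a), ni)
y = np.array([0.0, 1.0, 2.0])[groups] + rng.normal(size=n)

X = np.column_stack([np.ones(n), np.eye(a)[groups]])   # [1 | indicators], rank a
P1 = np.full((n, n), 1 / n)                            # ppm onto C(1)
PX = X @ np.linalg.pinv(X.T @ X) @ X.T                 # ppm onto C(X)

F = (y @ (PX - P1) @ y / (a - 1)) / (y @ (np.eye(n) - PX) @ y / (n - a))

# Classical between/within computation for comparison
ssb = sum(m * (y[groups == g].mean() - y.mean()) ** 2 for g, m in enumerate(ni))
ssw = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum() for g in range(a))
F2 = (ssb / (a - 1)) / (ssw / (n - a))
print(np.isclose(F, F2))
```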
6 Statistical Inference

Complementary reading from Monahan: Chapter 6 (and revisit Sections 3.9 and 4.7).
6.1 Estimation

Under the normal Gauss-Markov model, the pdf of Y can be written in exponential family form with statistics t1(y) = y'y and t2(y) = X'y and corresponding natural parameters w1(θ) = −1/2σ² and w2(θ) = β/σ² (see Casella and Berger, Chapter 3). The family is full rank (i.e., it is not curved), so we know that T(Y) = (Y'Y, X'Y) is a complete sufficient statistic for θ = (β', σ²)'. We also know that minimum variance unbiased estimators (MVUEs) of functions of θ are unbiased functions of T(Y).
Result 6.1. Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). The MVUE for an estimable function λ'β is given by λ'β̂, and the MVUE for σ² is

MSE = (n − r)⁻¹(Y'Y − β̂'X'Y) = (n − r)⁻¹Y'(I − P_X)Y,

where P_X is the perpendicular projection matrix onto C(X).

Proof. Both λ'β̂ and MSE are unbiased estimators of λ'β and σ², respectively. These estimators are also functions of T(Y) = (Y'Y, X'Y), the complete sufficient statistic. Thus, each estimator is the MVUE for its expected value.
MAXIMUM LIKELIHOOD: Consider the general linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). The likelihood function for θ = (β', σ²)' is

L(θ|y) = L(β, σ²|y) = (2π)^{−n/2}(σ²)^{−n/2} exp{−(y − Xβ)'(y − Xβ)/2σ²}.

Maximum likelihood estimators for β and σ² are found by maximizing

log L(β, σ²|y) = −(n/2) log(2π) − (n/2) log σ² − (y − Xβ)'(y − Xβ)/2σ²

with respect to β and σ². For every value of σ², maximizing the loglikelihood is the same as minimizing Q(β) = (y − Xβ)'(y − Xβ); that is, the least squares estimator

β̂ = (X'X)⁻X'Y

is also an MLE. Now substitute (y − Xβ̂)'(y − Xβ̂) = y'(I − P_X)y in for Q(β) and differentiate with respect to σ². The MLE of σ² is

σ̂²_MLE = n⁻¹Y'(I − P_X)Y.

Note that the MLE for σ² is biased. This MLE is rarely used in practice; MSE is the conventional estimator for σ².
INVARIANCE: Under the normal GM model, the MLE for an estimable function λ'β is λ'β̂, where β̂ is any solution to the normal equations. This is true because of the invariance property of maximum likelihood estimators (see, e.g., Casella and Berger, Chapter 7). If λ'β is estimable, recall that λ'β̂ is unique even if β̂ is not.
6.2 Testing models
PREVIEW : We now provide a general discussion on testing reduced versus full models
within a Gauss Markov linear model framework. Assuming normality will allow us to
derive the sampling distribution of the resulting test statistic.
PROBLEM: Consider the linear model

Y = Xβ + ε,

where r(X) = r ≤ p, E(ε) = 0, and cov(ε) = σ²I. Note that these are our usual GM model assumptions. For the purposes of this discussion, we assume that this model (the full model) is a correct model for the data. Consider also the linear model

Y = Wγ + ε,

where E(ε) = 0, cov(ε) = σ²I, and C(W) ⊂ C(X). We call this a reduced model because the estimation space is smaller than in the full model. Our goal is to test whether or not the reduced model is also correct.
If the reduced model is also correct, there is no reason not to use it. Smaller models are easier to interpret, and fewer degrees of freedom are spent in estimating σ². Thus, there are practical and statistical advantages to using the reduced model if it is also correct.

Hypothesis testing in linear models essentially reduces to putting a constraint on the estimation space C(X) in the full model. If C(W) = C(X), then the W model is a reparameterization of the X model and there is nothing to test.

RECALL: Let P_W and P_X denote the perpendicular projection matrices onto C(W) and C(X), respectively. Because C(W) ⊂ C(X), we know that P_X − P_W is the ppm onto C(P_X − P_W) = C(W)⊥ ∩ C(X), the orthogonal complement of C(W) with respect to C(X).
GEOMETRY: In a general reduced-versus-full model testing framework, we start by assuming the full model Y = Xβ + ε is essentially correct, so that E(Y) = Xβ ∈ C(X). If the reduced model is also correct, then E(Y) = Wγ ∈ C(W) ⊂ C(X). Geometrically, performing a reduced-versus-full model test therefore requires the analyst to decide whether E(Y) is more likely to lie in C(W) or in C(X) outside of C(W). Under the full model, our estimate of E(Y) = Xβ is P_XY. Under the reduced model, our estimate of E(Y) = Wγ is P_WY.

If the reduced model is correct, then P_XY and P_WY are estimates of the same thing, and P_XY − P_WY = (P_X − P_W)Y should be small.

If the reduced model is not correct, then P_XY and P_WY are estimating different things, and P_XY − P_WY = (P_X − P_W)Y should be large.

The decision about reduced model adequacy therefore hinges on assessing whether (P_X − P_W)Y is large or small. Note that (P_X − P_W)Y is the perpendicular projection of Y onto C(W)⊥ ∩ C(X).
MOTIVATION: An obvious measure of the size of (P_X − P_W)Y is its squared length, that is,

{(P_X − P_W)Y}'(P_X − P_W)Y = Y'(P_X − P_W)Y.

However, the length of (P_X − P_W)Y is also related to the sizes of C(X) and C(W). We therefore adjust for these sizes by using

Y'(P_X − P_W)Y/r(P_X − P_W).

We now compute the expectation of this quantity when the reduced model is/is not correct. For notational simplicity, set r* = r(P_X − P_W). When the reduced model is correct, then

E{Y'(P_X − P_W)Y/r*} = (1/r*)[(Wγ)'(P_X − P_W)Wγ + tr{(P_X − P_W)σ²I}]
                     = (1/r*){σ² tr(P_X − P_W)}
                     = (1/r*)(r*σ²) = σ²,

since (P_X − P_W)Wγ = 0.
When the reduced model is not correct,

E{Y'(P_X − P_W)Y/r*} = (1/r*){(Xβ)'(P_X − P_W)Xβ + r*σ²}
                     = σ² + (Xβ)'(P_X − P_W)Xβ/r*.

Thus, if the reduced model is not correct, Y'(P_X − P_W)Y/r* is estimating something larger than σ². Of course, σ² is unknown, so it must be estimated. Because the full model is assumed to be correct,

MSE = (n − r)⁻¹Y'(I − P_X)Y,

the MSE from the full model, is an unbiased estimator of σ².
TEST STATISTIC: To test the reduced model versus the full model, we use

F = [Y'(P_X − P_W)Y/r*] / MSE.
Using only our GM model assumptions (i.e., not necessarily assuming normality), we can
surmise the following:
When the reduced model is correct, the numerator and denominator of F are both
unbiased estimators of 2 , so F should be close to 1.
When the reduced model is not correct, the numerator in F is estimating something
larger than 2 , so F should be larger than 1. Thus, values of F much larger than
1 are not consistent with the reduced model being correct.
Values of F much smaller than 1 may mean something drastically different; see
Christensen (2003).
OBSERVATIONS: In the numerator of F, note that

Y'(P_X − P_W)Y = Y'P_XY − Y'P_WY = Y'(P_X − P1)Y − Y'(P_W − P1)Y,

which is the difference in the regression (model) sums of squares, corrected or uncorrected, from fitting the two models. Also, the term

r* = r(P_X − P_W) = tr(P_X − P_W) = tr(P_X) − tr(P_W) = r(P_X) − r(P_W) = r − r0,
say, where r0 = r(P_W) = r(W). Thus, r* = r − r0 is the difference in the ranks of the X and W matrices. This also equals the difference in the model degrees of freedom from the two ANOVA tables.
REMARK: You will note that we have formulated a perfectly sensible strategy for testing reduced versus full models while avoiding the question, "What is the distribution of F?" Our entire argument is based on first and second moment assumptions, that is, E(ε) = 0 and cov(ε) = σ²I, the GM assumptions. We now address the distributional question.
DISTRIBUTION OF F: To derive the sampling distribution of

F = [Y'(P_X − P_W)Y/r*] / MSE,

we require that ε ~ N_n(0, σ²I), from which it follows that Y ~ N_n(Xβ, σ²I). First, we handle the denominator MSE = Y'(I − P_X)Y/(n − r). In Example 5.3 (notes), we showed that

Y'(I − P_X)Y/σ² ~ χ²_{n−r}.

This distributional result holds regardless of whether or not the reduced model is correct. Now, we turn our attention to the numerator. Take A = σ⁻²(P_X − P_W) and consider the quadratic form

Y'AY = Y'(P_X − P_W)Y/σ².

With V = σ²I, the matrix

AV = σ⁻²(P_X − P_W)σ²I = P_X − P_W

is idempotent with rank r(P_X − P_W) = r*. Therefore, we know that Y'AY ~ χ²_{r*}(λ), where

λ = (1/2)μ'Aμ = (1/2σ²)(Xβ)'(P_X − P_W)Xβ.

Now, we make the following observations:

If the reduced model is correct and Xβ ∈ C(W), then (P_X − P_W)Xβ = 0, because P_X − P_W projects onto C(W)⊥ ∩ C(X). This means that the noncentrality parameter λ = 0 and Y'(P_X − P_W)Y/σ² ~ χ²_{r*}, a central χ² distribution.
Putting everything together,

F = [Y'(P_X − P_W)Y/r*] / MSE = [σ⁻²Y'(P_X − P_W)Y/r*] / [σ⁻²Y'(I − P_X)Y/(n − r)] ~ F_{r*,n−r}(λ),

where r* = r − r0 and

λ = (1/2σ²)(Xβ)'(P_X − P_W)Xβ.

If the reduced model is correct, that is, if Xβ ∈ C(W), then λ = 0 and F ~ F_{r*,n−r}, a central F distribution. Note also that if the reduced model is correct,

E(F) = (n − r)/(n − r − 2) ≈ 1.

This reaffirms our (model free) assertion that values of F close to 1 are consistent with the reduced model being correct. Because the noncentral F family is stochastically increasing in λ, larger values of F are consistent with the reduced model not being correct.
SUMMARY: Consider the linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). Suppose that we would like to test

H0: Y = Wγ + ε
versus
H1: Y = Xβ + ε,

where C(W) ⊂ C(X). An α-level rejection region is

RR = {F : F > F_{r*,n−r,α}},

where r* = r − r0, r0 = r(W), and F_{r*,n−r,α} is the upper α quantile of the F_{r*,n−r} distribution.
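A sketch of this reduced-versus-full test in NumPy, using assumed toy data (full model: intercept plus two covariates; reduced model: intercept only, so C(W) ⊂ C(X)):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), x])   # full model design matrix
W = np.ones((n, 1))                    # reduced model design matrix
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)   # simulated response

def ppm(M):
    """Perpendicular projection matrix onto C(M)."""
    return M @ np.linalg.pinv(M.T @ M) @ M.T

PX, PW = ppm(X), ppm(W)
r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(W)
rstar = r - r0
mse = y @ (np.eye(n) - PX) @ y / (n - r)
F = (y @ (PX - PW) @ y / rstar) / mse    # compare with F_{rstar, n-r, alpha}
print(rstar, F)
```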
Example 6.1. Consider the simple linear regression model Yi = β0 + β1(xi − x̄) + εi, for i = 1, 2, ..., n. In matrix form,

Y = (Y1, Y2, ..., Yn)',   X = [1  x1 − x̄; 1  x2 − x̄; ⋮ ⋮; 1  xn − x̄],   β = (β0, β1)',   ε = (ε1, ε2, ..., εn)',
where ε ~ N_n(0, σ²I). Suppose that we would like to test whether the reduced model

Yi = β0 + εi,

for i = 1, 2, ..., n, also holds. In matrix form, the reduced model is

Y = (Y1, Y2, ..., Yn)',   W = (1, 1, ..., 1)' = 1,   γ = β0,   ε = (ε1, ε2, ..., εn)',

where ε ~ N_n(0, σ²I) and 1 is an n × 1 vector of ones. Note that C(W) ⊂ C(X), with r0 = 1, r = 2, and r* = r − r0 = 1. When the reduced model is correct,
F = [Y'(P_X − P_W)Y/r*] / MSE ~ F_{1,n−2},

where MSE is the mean squared error from the full model. When the reduced model is not correct, F ~ F_{1,n−2}(λ), where

λ = (1/2σ²)(Xβ)'(P_X − P_W)Xβ = β1² Σ_{i=1}^n (xi − x̄)²/2σ².
Exercises: (a) Verify that this expression for the noncentrality parameter is correct. (b) Suppose that n is even and that the values of xi can be selected anywhere in the interval (d1, d2). How should we choose the xi values to maximize the power of an α-level test?
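For part (b), note that λ = β1²Σ(xi − x̄)²/2σ² is increasing in Σ(xi − x̄)², so the power is maximized by making that sum as large as possible. A quick check, with a hypothetical design of n = 10 points on (0, 1), confirms that placing half the points at each endpoint beats an equally spaced design:

```python
import numpy as np

n, d1, d2 = 10, 0.0, 1.0
even = np.linspace(d1, d2, n)                        # equally spaced design
ends = np.array([d1] * (n // 2) + [d2] * (n // 2))   # half the points at each endpoint

sxx_even = ((even - even.mean()) ** 2).sum()
sxx_ends = ((ends - ends.mean()) ** 2).sum()
print(sxx_even, sxx_ends)   # the endpoint design gives the larger noncentrality
```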
6.3 Testing linear parametric functions

PROBLEM: Consider our usual Gauss-Markov linear model with normal errors; i.e., Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). We now consider the problem of testing

H0: K'β = m
versus
H1: K'β ≠ m,

where K is a p × s matrix with r(K) = s and m is s × 1.
Example 6.2. Consider the regression model Yi = β0 + β1xi1 + β2xi2 + β3xi3 + β4xi4 + εi, for i = 1, 2, ..., n. Express each hypothesis in the form H0: K'β = m:

1. H0: β1 = 0
2. H0: β3 = β4 = 0
3. H0: β1 + β3 = 1, β2 − β4 = 1
4. H0: β2 = β3 = β4.
Example 6.3. Consider the analysis of variance model Yij = μ + αi + εij, for i = 1, 2, 3, 4 and j = 1, 2, ..., ni. Express each hypothesis in the form H0: K'β = m:

1. H0: μ + α1 = 5, α3 − α4 = 1
2. H0: α1 − α2 = α3 − α4
3. H0: α1 − α2 = (1/3)(α2 + α3 + α4).
TERMINOLOGY: The general linear hypothesis H0: K'β = m is said to be testable iff K has full column rank and each component of K'β is estimable. In other words, K'β contains s linearly independent estimable functions. Otherwise, H0: K'β = m is said to be nontestable.
For a testable H0: K'β = m, the resulting test statistic is

F = [(K'β̂ − m)'H⁻¹(K'β̂ − m)/s] / MSE ~ F_{s,n−r}(λ),

where H = K'(X'X)⁻K and the noncentrality parameter is

λ = (1/2σ²)(K'β − m)'H⁻¹(K'β − m).
We know that V = Y'(I − P_X)Y/σ² ~ χ²_{n−r} and that Z and V are independent (verify!). Thus,

T = [(k'β̂ − k'β)/√{σ²k'(X'X)⁻k}] / √{Y'(I − P_X)Y/σ²(n − r)} = (k'β̂ − k'β)/√{MSE k'(X'X)⁻k} ~ t_{n−r}.

When H0: k'β = m is true, the statistic

T = (k'β̂ − m)/√{MSE k'(X'X)⁻k} ~ t_{n−r}.
Therefore, an α-level rejection region, when H1 is two-sided, is

RR = {T : |T| ≥ t_{n−r,α/2}}.

One-sided tests use rejection regions that are suitably adjusted. When H0 is not true, T follows a noncentral t distribution, T ~ t_{n−r}(φ), where

φ = (k'β − m)/√{σ²k'(X'X)⁻k}.

This distribution is of interest for power and sample size calculations.
6.4 Testing models versus testing linear parametric functions
Because C(U) = C(K)⊥, we know that X'v ∈ C(K) = C(X'D), since K' = D'X. Thus,

v = P_Xv = X(X'X)⁻X'v ∈ C[X(X'X)⁻X'D] = C(P_XD).

Suppose that v ∈ C(P_XD). Clearly, v ∈ C(X). Also, v = P_XDd, for some d, and

v'XU = d'D'P_XXU = d'D'XU = d'K'U = 0,

because C(U) = C(K)⊥. Thus, v ∈ C(XU)⊥ ∩ C(X).
IMPLICATION: It follows immediately that the numerator sum of squares for testing the reduced model Y = Wγ + ε versus the full model Y = Xβ + ε is Y'M_{P_XD}Y, where

M_{P_XD} = P_XD[(P_XD)'(P_XD)]⁻(P_XD)' = P_XD(D'P_XD)⁻D'P_X

is the ppm onto C(P_XD). If ε ~ N_n(0, σ²I), the resulting test statistic

F = [Y'M_{P_XD}Y/r(M_{P_XD})] / [Y'(I − P_X)Y/r(I − P_X)] ~ F_{r(M_{P_XD}),r(I−P_X)}(λ),

where the noncentrality parameter

λ = (1/2σ²)(Xβ)'M_{P_XD}Xβ.
GOAL: Our goal now is to show that the F statistic above is the same F statistic we derived in Section 6.3 with m = 0, that is,

F = [(K'β̂)'H⁻¹K'β̂/s] / [Y'(I − P_X)Y/(n − r)].

Recall that this statistic was derived for the testable hypothesis H0: K'β = 0. First, we show that r(M_{P_XD}) = s, where, recall, s = r(K). To do this, it suffices to show that r(K) = r(P_XD). Because K'β is estimable, we know that K' = D'X, for some D. Writing K = X'D, we see that for any vector a,

X'Da = 0 ⟺ Da ⊥ C(X),
Combining these results shows that the two statistics

F = [Y'M_{P_XD}Y/r(M_{P_XD})] / [Y'(I − P_X)Y/r(I − P_X)]

and

F = [(K'β̂)'H⁻¹K'β̂/s] / [Y'(I − P_X)Y/(n − r)]

are identical.
6.5 Likelihood ratio tests

6.5.1 Constrained estimation
REMARK: In the linear model Y = Xβ + ε, where E(ε) = 0 (note the minimal assumptions), we have, up until now, allowed the p × 1 parameter vector β to take on any value in R^p; that is, we have made no restrictions on the parameters in β. We now consider the case where β is restricted to the subspace of R^p consisting of values of β that satisfy a linear restriction P'β = δ. Minimizing Q(β) = (Y − Xβ)'(Y − Xβ) subject to this restriction can be handled with Lagrange multipliers; writing a(β, θ) for the Lagrangian objective function with multiplier vector θ, one of the derivative conditions is

∂a(β, θ)/∂θ = 2(P'β − δ).
Setting these derivatives equal to zero leads to the restricted normal equations (RNEs), that is,

[X'X  P; P'  0] [β; θ] = [X'Y; δ].

Denote by β̂_H and θ̂_H the solutions to the RNEs, respectively. The solution β̂_H is called a restricted least squares estimator.
DISCUSSION: We now present some facts regarding this restricted linear model and its (restricted) least squares estimator. We have proven all of these facts for the unrestricted model; restricted versions of the proofs are all in Monahan.

1. The restricted normal equations are consistent; see Result 3.8, Monahan (pp 62-63).

2. A solution β̂_H minimizes Q(β) over the set T = {β ∈ R^p : P'β = δ}; see Result 3.9, Monahan (pp 63).
3. In the restricted model Y = Xβ + ε, with E(ε) = 0 and P'β = δ, the function λ'β is estimable if and only if

λ' = a'X + b'P', for some vectors a and b;

that is, if and only if λ' is in the row space of the matrix formed by stacking X on top of P'. See Result 3.7, Monahan (pp 60).

4. If λ'β is estimable in the unrestricted model, i.e., the model without the linear restriction, then λ'β is estimable in the restricted model. The converse is not true.

5. Under the GM model assumptions, if λ'β is estimable in the restricted model, then λ'β̂_H is the BLUE of λ'β in the restricted model. See Result 4.5, Monahan (pp 89-90).
6.5.2 Testing procedure
DERIVATION: Under our model assumptions, we know that Y ~ N_n(Xβ, σ²I). The likelihood function for θ = (β', σ²)' is

L(θ|y) = L(β, σ²|y) = (2πσ²)^{−n/2} exp{−Q(β)/2σ²},

where Q(β) = (y − Xβ)'(y − Xβ). The unrestricted parameter space is

Θ = {θ : β ∈ R^p, σ² ∈ R+}.

The restricted parameter space, that is, the parameter space under H0: K'β = m, is

Θ0 = {θ : β ∈ R^p, K'β = m, σ² ∈ R+}.

The likelihood ratio statistic is

λ(Y) = sup_{Θ0} L(θ|Y) / sup_{Θ} L(θ|Y).

We reject the null hypothesis H0 for small values of λ = λ(Y). Thus, to perform an α-level test, reject H0 when λ < c, where c ∈ (0, 1) is chosen to satisfy P_{H0}{λ(Y) ≤ c} = α. We have seen (Section 6.1) that the unrestricted MLEs of β and σ² are

β̂ = (X'X)⁻X'Y  and  σ̂² = Q(β̂)/n.
Similarly, maximizing L(θ|y) over Θ0 produces the solutions β̂_H and σ̃² = Q(β̂_H)/n, where β̂_H is any solution to the restricted normal equations

[X'X  K; K'  0] [β; θ] = [X'Y; m].

The likelihood ratio statistic therefore reduces to

λ(Y) = (σ̂²/σ̃²)^{n/2} = {Q(β̂)/Q(β̂_H)}^{n/2},

and rejecting H0 when λ(Y) < c is equivalent to rejecting H0 when

F = [{Q(β̂_H) − Q(β̂)}/s] / [Q(β̂)/(n − r)] > c*,

for a suitably chosen constant c*.
6.6 Confidence intervals

6.6.1 Single intervals
Consider an estimable function λ'β. Under the normal GM model,

Z = (λ'β̂ − λ'β)/√{σ²λ'(X'X)⁻λ} ~ N(0, 1).

If σ² were known, our work would be done, as Z is a pivot. More likely, σ² is unknown, so we must estimate it. An obvious point estimator for σ² is MSE, where

MSE = (n − r)⁻¹Y'(I − P_X)Y.
The resulting studentized quantity is

T = (λ'β̂ − λ'β)/√{MSE λ'(X'X)⁻λ} ~ t_{n−r}.

EXAMPLE: Consider the simple linear regression model Yi = β0 + β1xi + εi, for i = 1, 2, ..., n. Recall that the least squares estimator is

β̂ = (X'X)⁻¹X'Y = (β̂0, β̂1)',

where

β̂0 = Ȳ − β̂1x̄  and  β̂1 = Σ_i (xi − x̄)(Yi − Ȳ) / Σ_i (xi − x̄)²,
and that the covariance matrix of β̂ is

cov(β̂) = σ²(X'X)⁻¹ = σ² [ 1/n + x̄²/Σ_i(xi − x̄)²,  −x̄/Σ_i(xi − x̄)² ;  −x̄/Σ_i(xi − x̄)²,  1/Σ_i(xi − x̄)² ].
We now consider the problem of writing a 100(1 − α) percent confidence interval for

E(Y|x = x0) = β0 + β1x0,

the mean response of Y when x = x0. Note that E(Y|x = x0) = β0 + β1x0 = λ'β, where λ' = (1, x0). Also, λ'β is estimable because this is a regression model, so our previous work applies. The least squares estimator (and MLE) of E(Y|x = x0) is

λ'β̂ = β̂0 + β̂1x0.
Straightforward algebra (verify!) shows that

λ'(X'X)⁻¹λ = (1, x0)(X'X)⁻¹(1, x0)' = 1/n + (x0 − x̄)²/Σ_i(xi − x̄)².
Thus, a 100(1 − α) percent confidence interval for λ'β = E(Y|x = x0) is

(β̂0 + β̂1x0) ± t_{n−2,α/2} √[MSE {1/n + (x0 − x̄)²/Σ_i(xi − x̄)²}].
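A numerical sketch of this interval, with simulated data and the critical value t_{23,0.025} ≈ 2.069 hard-coded from a t table (an assumed value, since the sketch uses n = 25):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 25
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # simulated data

X = np.column_stack([np.ones(n), x])
b = np.linalg.solve(X.T @ X, X.T @ y)    # (b0, b1)
mse = ((y - X @ b) ** 2).sum() / (n - 2)

x0 = 4.0
se = np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()))
fit = b[0] + b[1] * x0
tcrit = 2.069                            # t_{23, 0.025}, taken from a t table
lo, hi = fit - tcrit * se, fit + tcrit * se
print(lo, fit, hi)
```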
6.6.2 Multiple intervals

PROBLEM: Consider the Gauss-Markov linear model Y = Xβ + ε, where X is n × p with rank r ≤ p and ε ~ N_n(0, σ²I). We now consider the problem of writing simultaneous confidence intervals for the k estimable functions λ1'β, λ2'β, ..., λk'β. Let the p × k matrix Λ = (λ1 λ2 ··· λk), so that

θ = Λ'β = (λ1'β, λ2'β, ..., λk'β)'.
Each interval

λj'β̂ ± t_{n−r,α/2} √(σ̂²hjj),

where σ̂² = MSE and hjj is the jth diagonal element of H = Λ'(X'X)⁻Λ, is a 100(1 − α) percent confidence interval for λj'β; that is,

pr{λj'β̂ − t_{n−r,α/2}√(σ̂²hjj) < λj'β < λj'β̂ + t_{n−r,α/2}√(σ̂²hjj)} = 1 − α.

This statement is true for a single interval.
SIMULTANEOUS COVERAGE: To investigate the simultaneous coverage probability of the set of intervals

{λj'β̂ ± t_{n−r,α/2}√(σ̂²hjj), j = 1, 2, ..., k},

let Ej denote the event that interval j contains λj'β, so that pr(Ej) = 1 − α, for j = 1, 2, ..., k. The probability that each of the k intervals includes its target λj'β is

pr(∩_{j=1}^k Ej) = 1 − pr(∪_{j=1}^k Ēj) ≥ 1 − Σ_{j=1}^k pr(Ēj) = 1 − kα,

where Ēj denotes the complement of Ej. Thus, the probability that each interval contains its intended target satisfies

pr(∩_{j=1}^k Ej) ≥ 1 − kα.
Obviously, this lower bound 1 − kα can be quite a bit lower than 1 − α; that is, the simultaneous coverage probability of the set of intervals

{λj'β̂ ± t_{n−r,α/2}√(σ̂²hjj), j = 1, 2, ..., k}

can be far less than the nominal 1 − α. BONFERRONI: One simple fix is to replace α with α/k in each interval, so that

pr(∩_{j=1}^k Ej) ≥ 1 − k(α/k) = 1 − α;

that is, the intervals

λj'β̂ ± t_{n−r,α/(2k)} √(σ̂²hjj),

for j = 1, 2, ..., k, have simultaneous coverage probability at least 1 − α.
SCHEFFÉ: The idea behind Scheffé's approach is to consider an arbitrary linear combination of θ = Λ'β, say, u'θ = u'Λ'β, and construct a confidence interval

C(u, d) = (u'θ̂ − d√(σ̂²u'Hu), u'θ̂ + d√(σ̂²u'Hu)),

where d is chosen so that

pr{u'θ ∈ C(u, d), for all u} = 1 − α.

Since d is chosen in this way, one guarantees the necessary simultaneous coverage probability for all possible linear combinations of θ = Λ'β (an infinite number of combinations). Clearly, the desired simultaneous coverage is then conferred for the k functions of interest θj = λj'β, j = 1, 2, ..., k; these functions result from taking u to be the standard unit vectors. The argument in Monahan (pp 144) shows that d = (kF_{k,n−r,α})^{1/2}.
7 Appendix

7.1 Matrix algebra: Basic ideas
TERMINOLOGY: A matrix A = (aij) is a rectangular array of elements; e.g.,

A = [3 5 4; 1 2 8].

The (i, j)th element of A is denoted by aij. The dimensions of A are m (the number of rows) by n (the number of columns). If m = n, A is square. If we want to emphasize the dimensions of A, we can write A_{m×n}.
TERMINOLOGY: A vector is a matrix consisting of one column or one row. A column vector is denoted by a_{n×1}; a row vector is denoted by a_{1×n}. By convention, we take a vector to be a column vector, unless otherwise noted; that is,

a = (a1, a2, ..., an)'  and  a' = (a1 a2 ··· an).
TERMINOLOGY: If A = (aij) is an m × n matrix, the transpose of A, denoted by A' or A^T, is the n × m matrix (aji). If A' = A, we say A is symmetric.

Result MAR1.1.
(a) (A')' = A
(b) For any matrix A, A'A and AA' are symmetric
(c) A = 0 iff A'A = 0
(d) (AB)' = B'A'
(e) (A + B)' = A' + B'
TERMINOLOGY: The n × n identity matrix is given by

I = I_n = [1 0 ··· 0; 0 1 ··· 0; ⋮ ⋮ ⋱ ⋮; 0 0 ··· 1]_{n×n};

that is, aij = 1 for i = j, and aij = 0 when i ≠ j. The n × n matrix of ones is given by

J = J_n = [1 1 ··· 1; 1 1 ··· 1; ⋮ ⋮ ⋱ ⋮; 1 1 ··· 1]_{n×n};

that is, aij = 1 for all i and j. Note that J = 11', where 1 = 1_n is an n × 1 (column) vector of ones. The n × n matrix with aij = 0, for all i and j, is called the null matrix, or the zero matrix, and is denoted by 0.
TERMINOLOGY: If A is an n × n matrix and there exists a matrix C such that

AC = CA = I,

then A is nonsingular and C is called the inverse of A, henceforth denoted by A⁻¹. If A is nonsingular, A⁻¹ is unique. If A is a square matrix that is not nonsingular, A is singular.
SPECIAL CASE: The inverse of the 2 × 2 matrix

A = [a b; c d]  is given by  A⁻¹ = {1/(ad − bc)} [d −b; −c a].
SPECIAL CASE: The inverse of the n × n diagonal matrix

A = diag(a11, a22, ..., ann)  is given by  A⁻¹ = diag(1/a11, 1/a22, ..., 1/ann).
Result MAR1.2.
(a) A is nonsingular iff |A| ≠ 0.
(b) If A and B are nonsingular matrices, (AB)⁻¹ = B⁻¹A⁻¹.
(c) If A is nonsingular, then (A')⁻¹ = (A⁻¹)'.
7.2 Linear independence and rank

TERMINOLOGY: Vectors a1, a2, ..., an are linearly dependent if there exist scalars c1, c2, ..., cn such that

Σ_{i=1}^n ci ai = 0

and at least one of the ci's is not zero; that is, it is possible to express at least one vector as a nontrivial linear combination of the others. If

Σ_{i=1}^n ci ai = 0 ⟹ c1 = c2 = ··· = cn = 0,

then a1, a2, ..., an are linearly independent.
TERMINOLOGY: The rank of a matrix A, written r(A) or rank(A), is the number of linearly independent columns of A. The number of linearly independent rows of any matrix is always equal to the number of linearly independent columns.

TERMINOLOGY: If A is n × p, then r(A) ≤ min{n, p}.

If r(A) = min{n, p}, then A is said to be of full rank.
If r(A) = n, we say that A is of full row rank.
If r(A) = p, we say that A is of full column rank.
If r(A) < min{n, p}, we say that A is less than full rank, or rank deficient.

Since the maximum possible rank of an n × p matrix is the minimum of n and p, for any rectangular (i.e., non-square) matrix, either the rows or columns (or both) must be linearly dependent.
Result MAR2.1.
(a) For any matrix A, r(A') = r(A).
(b) For any matrix A, r(A'A) = r(A).
(c) For conformable matrices, r(AB) ≤ r(A) and r(AB) ≤ r(B).
(d) If B is nonsingular, then r(AB) = r(A).
(e) For any n × n matrix A, r(A) = n ⟺ A⁻¹ exists ⟺ |A| ≠ 0.
(f) For any matrix A_{n×n} and vector b_{n×1}, r(A, b) ≥ r(A); i.e., the inclusion of a column vector cannot decrease the rank of a matrix.
This is the unique solution to the normal equations (since inverses are unique). Note that if r(X) = r < p, then a unique solution to the normal equations does not exist.
TERMINOLOGY: We say that two vectors a and b are orthogonal, written a ⊥ b, if their inner product is zero; i.e.,

a'b = 0.

Vectors a1, a2, ..., an are mutually orthogonal if and only if ai'aj = 0 for all i ≠ j. If a1, a2, ..., an are mutually orthogonal, then they are also linearly independent (verify!). The converse is not necessarily true.

TERMINOLOGY: Suppose that a1, a2, ..., an are orthogonal. If ai'ai = 1, for all i = 1, 2, ..., n, we say that a1, a2, ..., an are orthonormal.

TERMINOLOGY: Suppose that a1, a2, ..., an are orthogonal. Then the vectors

ci = ai/||ai||,

where ||ai|| = (ai'ai)^{1/2}, i = 1, 2, ..., n, are orthonormal. The quantity ||ai|| is the length of ai. If a1, a2, ..., an are the columns of A, then A'A is diagonal; similarly, if c1, c2, ..., cn are the columns of C, then C'C = I.
TERMINOLOGY: Let A be an n × n (square) matrix. We say that A is orthogonal if A'A = I = AA', or equivalently, if A' = A⁻¹. Note that if A is orthogonal, then ||Ax|| = ||x||. Geometrically, this means that multiplication of x by A only rotates (or reflects) the vector x, since its length remains unchanged.
7.3 Vector spaces

TERMINOLOGY: A nonempty set of vectors V is a vector space (over the reals) if

(i) x1 ∈ V, x2 ∈ V ⟹ x1 + x2 ∈ V, and
(ii) x ∈ V ⟹ cx ∈ V, for all c ∈ R.

EXAMPLE: In V = R³, consider the sets

S1 = {x = (0, 0, z)' : z ∈ R}  and  S2 = {x = (x, y, 0)' : x, y ∈ R}.
It is easy to see that S1 and S2 are orthogonal. That S1 is a subspace is argued as follows. Clearly, S1 ⊆ V. Now, suppose that x1 ∈ S1 and x2 ∈ S1; i.e.,

x1 = (0, 0, z1)'  and  x2 = (0, 0, z2)',

for z1, z2 ∈ R. Then,

x1 + x2 = (0, 0, z1 + z2)' ∈ S1  and  cx1 = (0, 0, cz1)' ∈ S1,

for all c ∈ R. Thus, S1 is a subspace. That S2 is a subspace follows similarly.
TERMINOLOGY: Suppose that V is a vector space and that x1, x2, ..., xn ∈ V. The set of all linear combinations of x1, x2, ..., xn; i.e.,

S = {x ∈ V : x = Σ_{i=1}^n ci xi},

is a subspace of V; we say that S is spanned by x1, x2, ..., xn. For example, in V = R³, the vectors

x1 = (1, 1, 1)'  and  x2 = (1, 0, 0)'

span the subspace of vectors of the form (a + b, a, a)', for a, b ∈ R.
TERMINOLOGY: Let A = (a1 a2 ··· an) be an m × n matrix with columns a1, a2, ..., an. The column space of A,

C(A) = {x ∈ R^m : x = Σ_{j=1}^n cj aj; cj ∈ R} = {x ∈ R^m : x = Ac; c ∈ R^n},

is the set of all m × 1 vectors spanned by the columns of A; that is, C(A) is the set of all vectors that can be written as a linear combination of the columns of A. The dimension of C(A) is the column rank of A.
TERMINOLOGY: Let

A_{m×n} = [b1'; b2'; ⋮; bm'],

where bi is n × 1. Denote

R(A) = {x ∈ R^n : x = Σ_{i=1}^m di bi; di ∈ R} = {x ∈ R^n : x' = d'A; d ∈ R^m}.

We call R(A) the row space of A. It is the set of all n × 1 vectors spanned by the rows of A; that is, the set of all vectors that can be written as a linear combination of the rows of A. The dimension of R(A) is the row rank of A.

TERMINOLOGY: The set N(A) = {x : Ax = 0} is called the null space of A. The dimension of N(A) is called the nullity of A.
Result MAR3.3.
(a) C(B) ⊆ C(A) iff B = AC for some matrix C.
(b) R(B) ⊆ R(A) iff B = DA for some matrix D.
(c) C(A), R(A), and N(A) are all vector spaces.
(d) R(A') = C(A) and C(A') = R(A).
EXAMPLE: Consider

A = [1 1 2; 1 0 3; 1 0 3]  and  c = (−3, 1, 1)'.

The column space of A is the set of all linear combinations of the columns of A; i.e., the set of vectors of the form

c1a1 + c2a2 + c3a3 = (c1 + c2 + 2c3, c1 + 3c3, c1 + 3c3)',

where c1, c2, c3 ∈ R. Thus, the column space C(A) is the set of all 3 × 1 vectors of the form (a, b, b)', where a, b ∈ R. Any two vectors of {a1, a2, a3} span this space. In addition, any two of {a1, a2, a3} are linearly independent, and hence form a basis for C(A). The set {a1, a2, a3} is not linearly independent, since Ac = 0. The dimension of C(A), i.e., the rank of A, is r = 2. The dimension of N(A) is 1, and c forms a basis for this space.
Result MAR3.5. For an m × n matrix A with rank r ≤ n, the dimension of N(A) is n − r. That is, dim{C(A)} + dim{N(A)} = n.
Proof. See pp 241-2 in Monahan.

Result MAR3.6. For an m × n matrix A, N(A') and C(A) are orthogonal complements in R^m.
Proof. Both N(A') and C(A) are vector spaces with vectors in R^m. From the last result, we know that dim{C(A)} = rank(A) = r, say, and dim{N(A')} = m − r, since r(A') = r.
7.4 Systems of equations
EXAMPLE: Consider

A = [4 1 2; 1 1 5; 3 1 3]  and  G = [1/3 −1/3 0; −1/3 4/3 0; 0 0 0].

Note that r(A) = 2, because −a1 + 6a2 − a3 = 0; thus, A⁻¹ does not exist. However, it is easy to show that AGA = A; thus, G is a generalized inverse of A.
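Checking r(A) = 2 and AGA = A numerically:

```python
import numpy as np

A = np.array([[4, 1, 2], [1, 1, 5], [3, 1, 3]], float)
G = np.array([[1 / 3, -1 / 3, 0], [-1 / 3, 4 / 3, 0], [0, 0, 0]])

print(np.linalg.matrix_rank(A), np.allclose(A @ G @ A, A))  # rank 2; AGA = A holds
```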
Result MAR4.2. Let A be an m n matrix with r(A) = r. If A can be partitioned
as follows
A=
C D
E F
PAGE 129
CHAPTER 7
1
C
0
G=
0 0
is a generalized inverse of A. This result essentially shows that every matrix has a
generalized inverse (see Results A.10 and A.11, Monahan). Also, it gives a method to
compute it.
COMPUTATION: This is an algorithm for finding a generalized inverse A⁻ of A, any m × n matrix of rank r.
1. Find any r × r nonsingular submatrix C. It is not necessary that the elements of C occupy adjacent rows and columns in A.
2. Find C⁻¹ and (C⁻¹)'.
3. Replace the elements of C by the elements of (C⁻¹)'.
4. Replace all other elements of A by zeros.
5. Transpose the resulting matrix.
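The five steps above can be sketched in NumPy. The brute-force search over submatrices is my own illustrative choice (any way of locating a nonsingular r × r submatrix would do), and the test matrix A is the rank-2 example given earlier:

```python
import numpy as np
from itertools import combinations

def ginverse(A):
    """Generalized inverse via the five-step algorithm (sketch:
    brute-force search for an r x r nonsingular submatrix C)."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    r = np.linalg.matrix_rank(A)
    for rows in combinations(range(m), r):
        for cols in combinations(range(n), r):
            C = A[np.ix_(rows, cols)]
            if np.linalg.matrix_rank(C) == r:       # step 1: nonsingular C
                Cinv_t = np.linalg.inv(C).T         # steps 2-3: (C^{-1})'
                G = np.zeros((m, n))                # step 4: zeros elsewhere
                G[np.ix_(rows, cols)] = Cinv_t
                return G.T                          # step 5: transpose
    raise ValueError("no nonsingular r x r submatrix found")

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 1.0, 5.0],
              [3.0, 1.0, 3.0]])    # the rank-2 example above
G = ginverse(A)
print(np.allclose(A @ G @ A, A))   # True: AGA = A
```

For this A, the algorithm finds C as the leading 2 × 2 block and returns exactly the G displayed in the example above.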
Result MAR4.3. Let A_{m×n}, x_{n×1}, c_{m×1}, and I_{n×n} be matrices, and suppose that Ax = c is consistent. Then, x is a solution to Ax = c if and only if

x = A⁻c + (I − A⁻A)z,

for some z ∈ R^n. Thus, we can generate all solutions by just knowing one of them; i.e., by knowing A⁻c.
Proof. (⇐) We know that x = A⁻c is a solution (Result MAR4.1). Suppose that x = A⁻c + (I − A⁻A)z, for some z ∈ R^n. Then

Ax = AA⁻c + (A − AA⁻A)z = AA⁻c = A(A⁻c) = c,

since AA⁻A = A and x = A⁻c is a solution.
where z ∈ R^p. Of course, if r(X) = p, then (X'X)⁻¹ exists, and the unique solution becomes

b = (X'X)⁻¹X'Y.
EXAMPLE: Consider the one-way ANOVA model Yij = μ + αi + εij with a = 2 treatments, n1 = 2, and n2 = 3, so that

X'X = [ 5  2  3
        2  2  0
        3  0  3 ]   and   X'Y = (Y11 + Y12 + Y21 + Y22 + Y23, Y11 + Y12, Y21 + Y22 + Y23)'.

One generalized inverse of X'X is

(X'X)⁻ = [ 0   0    0
           0  1/2   0
           0   0   1/3 ],

which gives the solution

b = (X'X)⁻X'Y = (0, (1/2)(Y11 + Y12), (1/3)(Y21 + Y22 + Y23))' = (0, Ȳ1+, Ȳ2+)'.

Another generalized inverse of X'X is

(X'X)⁻ = [  1/3  −1/3  0
           −1/3   5/6  0
             0     0   0 ],

which gives the solution

b = (X'X)⁻X'Y = ((1/3)(Y21 + Y22 + Y23), (1/2)(Y11 + Y12) − (1/3)(Y21 + Y22 + Y23), 0)'
  = (Ȳ2+, Ȳ1+ − Ȳ2+, 0)'.

Using the first generalized inverse, Result MAR4.3 generates all solutions to the normal equations:

b = (0, Ȳ1+, Ȳ2+)' + {I − (X'X)⁻X'X}z,   where   I − (X'X)⁻X'X = [  1  0  0
                                                                   −1  0  0
                                                                   −1  0  0 ],

so that

b = (z1, Ȳ1+ − z1, Ȳ2+ − z1)',

for z = (z1, z2, z3)' ∈ R³.
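A NumPy sketch of this example follows. The response values in Y are made up for illustration; the design matrix, the two generalized inverses, and the solution formula come from the development above:

```python
import numpy as np

# one-way ANOVA design: two treatments, n1 = 2 and n2 = 3
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
Y = np.array([3.0, 5.0, 2.0, 4.0, 6.0])       # made-up responses
XtX, XtY = X.T @ X, X.T @ Y

# the two generalized inverses of X'X used above
G1 = np.diag([0.0, 1/2, 1/3])
G2 = np.array([[ 1/3, -1/3, 0.0],
               [-1/3,  5/6, 0.0],
               [ 0.0,  0.0, 0.0]])

for G in (G1, G2):
    b = G @ XtY
    assert np.allclose(XtX @ b, XtY)          # each b solves X'X b = X'Y

# Result MAR4.3: every b = G1 X'Y + (I - G1 X'X) z is also a solution
z = np.array([7.0, -2.0, 1.0])
b_z = G1 @ XtY + (np.eye(3) - G1 @ XtX) @ z
assert np.allclose(XtX @ b_z, XtY)
print(b_z)    # equals (z1, Ybar1+ - z1, Ybar2+ - z1) up to rounding
```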
7.5 Projection matrices
EXAMPLE: Consider the subspace

S = {z : z = (2a, a)', for a ∈ R}

and take

P = [ 0.8  0.4
      0.4  0.2 ].
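One can verify numerically that P is the perpendicular projection matrix onto S; a minimal sketch (NumPy assumed), using P = vv'/(v'v) with v = (2, 1)':

```python
import numpy as np

v = np.array([2.0, 1.0])               # S = span{(2, 1)'}
P = np.outer(v, v) / (v @ v)           # P = vv'/(v'v)
print(P)                               # [[0.8 0.4]
                                       #  [0.4 0.2]]

assert np.allclose(P @ P, P)           # idempotent
assert np.allclose(P, P.T)             # symmetric
assert np.allclose(P @ v, v)           # P fixes every vector in S
x = np.array([3.0, -1.0])
assert np.isclose(v @ (x - P @ x), 0)  # residual x - Px is orthogonal to S
```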
7.6 Trace, determinant, and eigenvalues
TERMINOLOGY: The sum of the diagonal elements of a square matrix A is called the trace of A, written tr(A); that is, for A_{n×n} = (aij),

tr(A) = Σᵢ₌₁ⁿ aᵢᵢ.
Result MAR6.1.
1. tr(A ± B) = tr(A) ± tr(B)
2. tr(cA) = c tr(A)
3. tr(A') = tr(A)
4. tr(AB) = tr(BA)
5. tr(A'A) = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ aᵢⱼ².
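Properties 1-5 can be spot-checked on arbitrary matrices; a minimal sketch assuming NumPy, with randomly generated A and B:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
c = 2.5

assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))  # property 1
assert np.isclose(np.trace(c * A), c * np.trace(A))            # property 2
assert np.isclose(np.trace(A.T), np.trace(A))                  # property 3
assert np.isclose(np.trace(A @ B), np.trace(B @ A))            # property 4
assert np.isclose(np.trace(A.T @ A), np.sum(A**2))             # property 5
```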
REVIEW: The table below summarizes equivalent conditions for the existence of an inverse matrix A⁻¹ (where A has dimension n × n).

  A⁻¹ exists          A⁻¹ does not exist
  A is nonsingular    A is singular
  |A| ≠ 0             |A| = 0
  r(A) = n            r(A) < n
Result. If A is an n × n matrix with eigenvalues λ1, λ2, ..., λn, then
1. |A| = ∏ᵢ₌₁ⁿ λᵢ
2. tr(A) = Σᵢ₌₁ⁿ λᵢ.
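A quick numerical check of both identities (NumPy assumed; for a real matrix the complex eigenvalues occur in conjugate pairs, so the product and sum are real up to rounding):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
eig = np.linalg.eigvals(A)          # eigenvalues, possibly complex

# |A| equals the product of the eigenvalues
assert np.isclose(np.prod(eig).real, np.linalg.det(A))
# tr(A) equals the sum of the eigenvalues
assert np.isclose(np.sum(eig).real, np.trace(A))
```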
For A_{n×n} = (aij) and x = (x1, x2, ..., xn)', the quadratic form in x is

x'Ax = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ aᵢⱼxᵢxⱼ.
CHAPTER 7
positive. If we define A1/2 = QD1/2 Q0 , where D1/2 = diag( 1 , 2 , ..., n ), then A1/2
is symmetric and
A1/2 A1/2 = QD1/2 Q0 QD1/2 Q0 = QD1/2 ID1/2 Q0 = QDQ0 = A.
The matrix A1/2 is called the symmetric square root of A. See Monahan (pp 259-60)
for an example.
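A sketch of the construction in NumPy, using a small positive definite matrix chosen for illustration:

```python
import numpy as np

# a symmetric positive definite matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

lam, Q = np.linalg.eigh(A)                  # spectral decomposition A = Q diag(lam) Q'
A_half = Q @ np.diag(np.sqrt(lam)) @ Q.T    # A^{1/2} = Q D^{1/2} Q'

assert np.allclose(A_half, A_half.T)        # A^{1/2} is symmetric
assert np.allclose(A_half @ A_half, A)      # A^{1/2} A^{1/2} = A
```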
7.7 Random vectors

TERMINOLOGY: We call Y = (Y1, Y2, ..., Yn)' a random vector. The joint pdf of Y is denoted by fY(y).
DEFINITION: Suppose that E(Yi) = μi, var(Yi) = σi², for i = 1, 2, ..., n, and cov(Yi, Yj) = σij, for i ≠ j. The mean of Y is

μ = E(Y) = (E(Y1), E(Y2), ..., E(Yn))' = (μ1, μ2, ..., μn)'

and the variance-covariance matrix of Y is

Σ = cov(Y) = [ σ1²  σ12  ···  σ1n
               σ21  σ2²  ···  σ2n
                ⋮    ⋮    ⋱    ⋮
               σn1  σn2  ···  σn² ].

NOTE: Note that Σ contains the variances σ1², σ2², ..., σn² on the diagonal and the n(n−1)/2 covariance terms cov(Yi, Yj), for i < j, as the elements strictly above the diagonal. Since cov(Yi, Yj) = cov(Yj, Yi), it follows that Σ is symmetric.
EXAMPLE: Suppose that Y1, Y2, ..., Yn is an iid sample with mean E(Yi) = μ and variance var(Yi) = σ², and let Y = (Y1, Y2, ..., Yn)'. Then μ = E(Y) = μ1n and Σ = cov(Y) = σ²In.
EXAMPLE: Consider the GM linear model Y = Xβ + ε. In this model, the random errors ε1, ε2, ..., εn are uncorrelated random variables with zero mean and constant variance σ². We have E(ε) = 0_{n×1} and cov(ε) = σ²In.
TERMINOLOGY: A random matrix is a matrix whose elements are random variables; e.g.,

Z_{n×p} = [ Z11  Z12  ···  Z1p
            Z21  Z22  ···  Z2p
             ⋮    ⋮    ⋱    ⋮
            Zn1  Zn2  ···  Znp ].

The mean of Z is the n × p matrix E(Z) = (E(Zij)).
TERMINOLOGY: The correlation matrix associated with Y is

R = (ρij) = [ 1    ρ12  ···  ρ1n
              ρ21  1    ···  ρ2n
               ⋮    ⋮    ⋱    ⋮
              ρn1  ρn2  ···  1 ],

where, recall, the correlation ρij is given by

ρij = σij/(σiσj),

for i, j = 1, 2, ..., n.
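Equivalently, R = D⁻¹ΣD⁻¹, where D = diag(σ1, ..., σn); a minimal sketch with an illustrative covariance matrix:

```python
import numpy as np

# an illustrative (positive definite) covariance matrix
Sigma = np.array([[ 4.0,  2.0,  0.0],
                  [ 2.0,  9.0, -1.0],
                  [ 0.0, -1.0,  1.0]])
sd = np.sqrt(np.diag(Sigma))          # (sigma_1, sigma_2, sigma_3)
R = Sigma / np.outer(sd, sd)          # rho_ij = sigma_ij / (sigma_i sigma_j)

assert np.allclose(np.diag(R), 1.0)   # unit diagonal
assert np.allclose(R, R.T)            # symmetric, like Sigma
```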
TERMINOLOGY: Suppose that Y1, Y2, ..., Yn are random variables and that a1, a2, ..., an are constants. Define a = (a1, a2, ..., an)' and Y = (Y1, Y2, ..., Yn)'. The random variable

X = a'Y = Σᵢ₌₁ⁿ aᵢYᵢ

is called a linear combination of Y1, Y2, ..., Yn.
Result RV4. Suppose that Y = (Y1, Y2, ..., Yn)' is a random vector with mean μ = E(Y), let Z be a random matrix, and let A and B (a and b) be nonrandom conformable matrices (vectors). Then
1. E(AY) = Aμ
2. E(a'Zb) = a'E(Z)b
3. E(AZB) = AE(Z)B.
Result RV5. If a = (a1, a2, ..., an)' is a vector of constants and Y = (Y1, Y2, ..., Yn)' is a random vector with mean μ = E(Y) and covariance matrix Σ = cov(Y), then

var(a'Y) = a'Σa.

Proof. The quantity a'Y is a scalar random variable, and its variance is given by

var(a'Y) = E{(a'Y − a'μ)²} = E[{a'(Y − μ)}²] = E{a'(Y − μ)a'(Y − μ)}.

But, note that a'(Y − μ) is a scalar, and hence equals (Y − μ)'a. Using this fact, we can rewrite the last expectation to get

E{a'(Y − μ)(Y − μ)'a} = a'E{(Y − μ)(Y − μ)'}a = a'Σa.
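Result RV5 is an algebraic identity, so it holds exactly for the mean and covariance of any finite set of equally likely outcomes; a sketch in NumPy (the data values are made up):

```python
import numpy as np

# rows = equally likely realizations of Y = (Y1, Y2, Y3)'
data = np.array([[1.0, 2.0, 0.0],
                 [3.0, 1.0, 4.0],
                 [2.0, 5.0, 1.0],
                 [0.0, 2.0, 3.0]])
mu = data.mean(axis=0)
Sigma = np.cov(data, rowvar=False, bias=True)   # population (divide-by-N) covariance

a = np.array([1.0, -2.0, 0.5])
x = data @ a                                    # realizations of a'Y

# var(a'Y) = a' Sigma a  (Result RV5)
assert np.isclose(x.var(), a @ Sigma @ a)
```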
Result RV6. Suppose that Y = (Y1, Y2, ..., Yn)' is a random vector with covariance matrix Σ = cov(Y), and let a and b be conformable vectors of constants. Then

cov(a'Y, b'Y) = a'Σb.

Result RV7. Suppose that Y = (Y1, Y2, ..., Yn)' is a random vector with mean μ = E(Y) and covariance matrix Σ = cov(Y). Let b, A, and B denote nonrandom conformable vectors/matrices. Then
1. E(AY + b) = Aμ + b
2. cov(AY + b) = AΣA'
3. cov(AY, BY) = AΣB'.
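Parts 1 and 2 of Result RV7 are likewise exact identities for the mean and covariance of a finite set of equally likely outcomes; a sketch with made-up values and an arbitrary A and b:

```python
import numpy as np

# rows = equally likely realizations of Y = (Y1, Y2, Y3)'
data = np.array([[1.0, 2.0, 0.0],
                 [3.0, 1.0, 4.0],
                 [2.0, 5.0, 1.0],
                 [0.0, 2.0, 3.0]])
mu = data.mean(axis=0)
Sigma = np.cov(data, rowvar=False, bias=True)

A = np.array([[1.0,  0.0, 2.0],
              [0.0, -1.0, 1.0]])
b = np.array([5.0, -3.0])
W = data @ A.T + b                  # realizations of W = AY + b

# E(AY + b) = A mu + b  and  cov(AY + b) = A Sigma A'
assert np.allclose(W.mean(axis=0), A @ mu + b)
assert np.allclose(np.cov(W, rowvar=False, bias=True), A @ Sigma @ A.T)
```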
Result RV9. If Y = (Y1, Y2, ..., Yn)' is a random vector with mean μ = E(Y) and covariance matrix Σ, then P{(Y − μ) ∈ C(Σ)} = 1.
Proof. Without loss, take μ = 0, and let M be the perpendicular projection matrix onto C(Σ). We know that Y = MY + (I − M)Y and that

E{(I − M)Y} = (I − M)E(Y) = 0,

since μ = E(Y) = 0. Also,

cov{(I − M)Y} = (I − M)Σ(I − M)' = (Σ − MΣ)(I − M)' = 0,

since MΣ = Σ. Thus, we have shown that P{(I − M)Y = 0} = 1, which implies that P(Y = MY) = 1. Since MY ∈ C(Σ), we are done.
IMPLICATION: Result RV9 says that there exists a subset C(Σ) ⊆ R^n that contains Y − μ with probability one (i.e., almost surely). If Σ is positive semidefinite but not positive definite, then Σ is singular and C(Σ) is concentrated in a subspace of R^n, where the subspace has dimension r = r(Σ), r < n. In this situation, the pdf of Y may not exist.
Result RV10. Suppose that X, Y, and Z are n × 1 random vectors and that X = Y + Z. Then