y(x) = ax + b
If $n = 2$ with $x_1 \neq x_2$, we can find $a$ and $b$ by solving a linear system of two equations in two unknowns. If $n > 2$, the system is overdetermined, and we can apply the method of “least squares”.
Specifically, let $\vec y = (y_1, \dots, y_n)^T$, $\vec x = (x_1, \dots, x_n)^T$, and $\vec e = (1, 1, \dots, 1)^T$. A reasonable means of finding the “best” values of $a$ and $b$ is to select $a$ and $b$ so as to minimize some measure of distance between $\vec y$ and $a\vec x + b\vec e$. One notion of distance that leads to a particularly nice system of determining equations for $a$ and $b$ is to measure the distance between $\vec z_1, \vec z_2 \in \mathbb{R}^n$ via
$$\|\vec z_1 - \vec z_2\|,$$
where
$$\|\vec w\|^2 = \vec w^T \Lambda \vec w.$$
Here, Λ is a given n × n symmetric positive definite matrix. The minimizers a, b of the sum of squares
2. How can one assign “error bars” to our slope and intercept values a and b?
3. Is there any way to objectively test the linear model against the even simpler model in which a = 0
(the “constant” model)?
A statistical formulation of this linear modeling problem will permit us to address these issues.
$$Y_i = a^* x_i + b^* + \varepsilon_i,$$
where $a^*$ and $b^*$ are the “true” slope and intercept values, and $\varepsilon_i$ is a rv describing the “residual error” in
the linear model corresponding to observation Yi . The great majority of the literature on linear regression
presumes that ~ε = (ε1 , . . . , εn )T is a Gaussian random vector with mean ~0 and covariance matrix σ 2 C, where
σ 2 is unknown (and C is specified by the statistician and is therefore known). We follow the literature here,
and make the assumption that the residuals have this Gaussian structure.
This statistical model has three unknown parameters, namely a∗ , b∗ and σ 2 . The principle of maximum
likelihood asserts that a∗ , b∗ and σ 2 should be estimated as the maximizer (â, b̂, σ̂ 2 ) of the likelihood.
$$\frac{1}{(2\pi\sigma^2)^{n/2}\,|\det C|^{1/2}} \exp\!\left(-\frac{1}{2\sigma^2}(\vec Y - a\vec x - b\vec e)^T C^{-1}(\vec Y - a\vec x - b\vec e)\right).$$
Here, $\vec Y = (Y_1, \dots, Y_n)^T$. (This likelihood presumes that $C$ has been specified as a positive definite matrix.
Any reasonable model for $C$ will have this property.)
The estimators satisfy
$$\begin{pmatrix} \vec x^T C^{-1}\vec x & \vec e^T C^{-1}\vec x \\ \vec x^T C^{-1}\vec e & \vec e^T C^{-1}\vec e \end{pmatrix} \begin{pmatrix} \hat a \\ \hat b \end{pmatrix} = \begin{pmatrix} \vec Y^T C^{-1}\vec x \\ \vec Y^T C^{-1}\vec e \end{pmatrix}$$
and
$$\hat\sigma^2 = \frac{1}{n}(\vec Y - \hat a\vec x - \hat b\vec e)^T C^{-1}(\vec Y - \hat a\vec x - \hat b\vec e).$$
It is common to choose C = I in the linear regression model. This corresponds to an assumption of iid
residual errors. However, one need not choose C = I. For example, suppose that one assumes that the
variability of Yi scales with the magnitude of xi . In this case, one would set C = diag(x21 , x22 , . . . , x2n ). This
leads to a “weighted least squares” problem. Note that the statistical formulation helps suggest plausible
forms for C.
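To make this concrete, the $2 \times 2$ system above can be solved numerically. The following is a minimal Python/NumPy sketch (the function name and the small data set are our own illustrative choices, and explicitly inverting $C$ is done only for readability); taking $C = \mathrm{diag}(x_1^2, \dots, x_n^2)$ gives the weighted least squares fit just described.

```python
import numpy as np

def fit_line_gls(x, Y, C):
    """Solve the 2x2 normal equations for (a_hat, b_hat) and return sigma2_hat.

    x, Y : length-n data arrays; C : n x n symmetric positive definite matrix.
    """
    n = len(x)
    e = np.ones(n)
    Cinv = np.linalg.inv(C)            # fine for a sketch; prefer linear solves in practice
    A = np.array([[x @ Cinv @ x, e @ Cinv @ x],
                  [x @ Cinv @ e, e @ Cinv @ e]])
    rhs = np.array([Y @ Cinv @ x, Y @ Cinv @ e])
    a_hat, b_hat = np.linalg.solve(A, rhs)
    resid = Y - a_hat * x - b_hat * e
    sigma2_hat = (resid @ Cinv @ resid) / n
    return a_hat, b_hat, sigma2_hat

# Weighted least squares: variability of Y_i scaling with |x_i|
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a_hat, b_hat, s2 = fit_line_gls(x, Y, np.diag(x**2))
```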
This statistical framework also permits us to develop “error bars” for our estimates of a∗ and b∗ . Let
$$\bar x = \frac{1}{n}\sum_{i=1}^n x_i, \qquad s_{xx} = \sum_{i=1}^n (x_i - \bar x)^2,$$
$$SSE = \text{sum of squares of estimated residuals} = \sum_{i=1}^n (Y_i - \hat a x_i - \hat b)^2.$$
It can be shown that when C = I,
$$\frac{\hat a - a^*}{\sqrt{\dfrac{SSE/(n-2)}{s_{xx}}}} \stackrel{D}{=} t_{n-2}, \qquad
\frac{\hat b - b^*}{\sqrt{\dfrac{SSE}{n-2}\left(\dfrac{1}{n} + \dfrac{\bar x^2}{s_{xx}}\right)}} \stackrel{D}{=} t_{n-2},$$
where tn−2 is a so-called Student-t rv with n − 2 degrees of freedom (and is a “tabulated distribution”). It
follows that if one selects z so that P {−z ≤ tn−2 ≤ z} = 1 − δ, then
$$\left[\,\hat a - z\sqrt{\frac{SSE/(n-2)}{s_{xx}}}\,,\ \hat a + z\sqrt{\frac{SSE/(n-2)}{s_{xx}}}\,\right],$$
$$\left[\,\hat b - z\sqrt{\frac{SSE}{n-2}\left(\frac{1}{n} + \frac{\bar x^2}{s_{xx}}\right)}\,,\ \hat b + z\sqrt{\frac{SSE}{n-2}\left(\frac{1}{n} + \frac{\bar x^2}{s_{xx}}\right)}\,\right]$$
are $100(1-\delta)\%$ confidence intervals for $a^*$ and $b^*$, respectively. Furthermore, when $a^* = 0$,
$$\frac{\hat a^2 s_{xx}}{SSE/(n-2)} \stackrel{D}{=} F_{1,n-2},$$
where $F_{1,n-2}$ is a rv having the F distribution with 1 and $n-2$ degrees of freedom. If we choose $z$ so that $P\{F_{1,n-2} > z\} = \gamma$ (with $\gamma$ small), then it is rare that the statistic $\hat a^2 s_{xx}/(SSE/(n-2))$ exceeds $z$ when $a^* = 0$. (A common value of $\gamma$ is 0.05.) Hence, if $\hat a^2 s_{xx}/(SSE/(n-2)) \le z$, we view the data as being consistent with $a^* = 0$ (i.e. the “constant” model), whereas if the statistic is larger than $z$, we reject the hypothesis that $a^* = 0$.
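For concreteness, here is a short Python sketch (NumPy and SciPy assumed; the function name is ours) of how the t-based confidence intervals and the F-test for $a^* = 0$ might be computed when $C = I$.

```python
import numpy as np
from scipy import stats

def ols_inference(x, Y, delta=0.05, gamma=0.05):
    """Ordinary least squares (C = I) with t confidence intervals and F-test of a* = 0."""
    n = len(x)
    xbar, Ybar = x.mean(), Y.mean()
    sxx = np.sum((x - xbar) ** 2)
    a_hat = np.sum((x - xbar) * (Y - Ybar)) / sxx
    b_hat = Ybar - a_hat * xbar
    SSE = np.sum((Y - a_hat * x - b_hat) ** 2)

    z_t = stats.t.ppf(1 - delta / 2, df=n - 2)        # P{-z <= t_{n-2} <= z} = 1 - delta
    se_a = np.sqrt((SSE / (n - 2)) / sxx)
    se_b = np.sqrt((SSE / (n - 2)) * (1.0 / n + xbar**2 / sxx))
    ci_a = (a_hat - z_t * se_a, a_hat + z_t * se_a)
    ci_b = (b_hat - z_t * se_b, b_hat + z_t * se_b)

    F_stat = a_hat**2 * sxx / (SSE / (n - 2))
    z_F = stats.f.ppf(1 - gamma, dfn=1, dfd=n - 2)     # P{F_{1,n-2} > z} = gamma
    reject_constant_model = F_stat > z_F
    return ci_a, ci_b, F_stat, reject_constant_model
```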
$$\hat b = \bar Y - \hat a \bar x,$$
where $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$. Under modest assumptions on the $x_i$'s, it can be shown that
$$\hat a \stackrel{p}{\longrightarrow} a^*, \qquad \hat b \stackrel{p}{\longrightarrow} b^*$$
1 See Chapter 12 of Probability and Statistics for the Engineering, Computing, and Physical Sciences by E.R. Dougherty.
as $n \to \infty$. This guarantees that when $n$ is large, each of the estimated residuals
$$\hat\varepsilon_i = Y_i - \hat a x_i - \hat b$$
is close to the corresponding true residual $\varepsilon_i$. To construct confidence intervals for $a^*$ in this setting, we need the distribution of the studentized statistic $(\hat a - a^*)\big/\sqrt{SSE/((n-2)\,s_{xx})}$. (Of course, in the Gaussian setting with $C = I$, this is known to be a $t_{n-2}$ rv. In the non-Gaussian setting, this rv has a complicated and unknown distribution.) Specifically, we could sample the distribution of the
$\varepsilon_i$'s $n$ iid times, yielding $\varepsilon_{11}, \dots, \varepsilon_{1n}$. Set $Y_{1i} = a^* x_i + b^* + \varepsilon_{1i}$, for $1 \le i \le n$, and compute the ordinary least squares estimates $\hat a_1$ and $\hat b_1$ corresponding to the data set $(x_1, Y_{11}), \dots, (x_n, Y_{1n})$. If we repeat the process $m$ independent times (for a total of $mn$ samples from the distribution of the $\varepsilon_i$'s), thereby yielding $\hat a_1, \hat b_1, \dots, \hat a_m, \hat b_m$, we could estimate the required distribution via
$$\frac{1}{m}\sum_{i=1}^m I\!\left(\frac{\hat a_i - a^*}{\sqrt{\dfrac{\sum_{j=1}^n (Y_{ij} - \hat a_i x_j - \hat b_i)^2/(n-2)}{s_{xx}}}} \le \,\cdot\,\right).$$
Of course, we generally don’t have the ability to cheaply obtain mn such samples from the distribution of
the εi ’s.
The bootstrap philosophy replaces $a^*$ by $\hat a$, $b^*$ by $\hat b$, and the distribution of the $\varepsilon_i$'s by that of the $\hat\varepsilon_i$'s. Sample the $\hat\varepsilon_i$'s $n$ iid times (with replacement), thereby yielding $\varepsilon^*_{11}, \dots, \varepsilon^*_{1n}$, and compute the ordinary least squares estimators $\hat a^*_1$ and $\hat b^*_1$ corresponding to the data set $(x_1, Y^*_{11}), \dots, (x_n, Y^*_{1n})$, where $Y^*_{1j} = \hat a x_j + \hat b + \varepsilon^*_{1j}$. We now repeat this process $m$ independent times, thereby yielding $m$ bootstrap estimates $\hat a^*_1, \hat b^*_1, \dots, \hat a^*_m, \hat b^*_m$.
The required distribution can be estimated via
$$\frac{1}{m}\sum_{i=1}^m I\!\left(\frac{\hat a^*_i - \hat a}{\sqrt{\dfrac{\sum_{j=1}^n (Y^*_{ij} - \hat a^*_i x_j - \hat b^*_i)^2/(n-2)}{s_{xx}}}} \le \,\cdot\,\right).$$
The above estimated distribution is then used to construct a confidence interval for a∗ in the usual way.
A similar bootstrap method can be used to produce confidence intervals for b∗ (that are asymptotically valid
as m, n → ∞) or to produce asymptotically valid hypothesis testing regions.
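A minimal Python sketch of this residual bootstrap for the slope (NumPy assumed; the function and variable names are ours, and the number of replications $m$ is an arbitrary illustrative choice) is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(x, Y):
    """Ordinary least squares slope and intercept (C = I)."""
    xbar, Ybar = x.mean(), Y.mean()
    a = np.sum((x - xbar) * (Y - Ybar)) / np.sum((x - xbar) ** 2)
    return a, Ybar - a * xbar

def bootstrap_slope_ci(x, Y, m=2000, delta=0.05):
    """Residual-bootstrap confidence interval for the slope a*."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    a_hat, b_hat = ols(x, Y)
    eps_hat = Y - a_hat * x - b_hat                  # estimated residuals

    t_boot = np.empty(m)
    for i in range(m):
        eps_star = rng.choice(eps_hat, size=n, replace=True)
        Y_star = a_hat * x + b_hat + eps_star        # regenerate data from the fitted line
        a_star, b_star = ols(x, Y_star)
        SSE_star = np.sum((Y_star - a_star * x - b_star) ** 2)
        t_boot[i] = (a_star - a_hat) / np.sqrt(SSE_star / (n - 2) / sxx)

    # invert the bootstrap distribution of the studentized statistic
    lo, hi = np.quantile(t_boot, [delta / 2, 1 - delta / 2])
    SSE = np.sum((Y - a_hat * x - b_hat) ** 2)
    se = np.sqrt(SSE / (n - 2) / sxx)
    return a_hat - hi * se, a_hat - lo * se
```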
Remark 5.1: Suppose that we assume the εi ’s have a covariance matrix that is known up to an (unknown)
factor σ 2 , so that
Yi = a∗ xi + b∗ + εi ,
where ~ε = (ε1 , . . . , εn )T is assumed to have a positive definite covariance matrix σ 2 C, where C is known
and σ 2 is assumed unknown. In this case, we can use the fact that C is known to compute the Cholesky
factorization $C = LL^T$. Note that
$$L^{-1}\vec Y = a^* L^{-1}\vec x + b^* L^{-1}\vec e + L^{-1}\vec\varepsilon.$$
Hence, if we set $\vec Z = L^{-1}\vec Y$, $\vec w = L^{-1}\vec x$, $\vec v = L^{-1}\vec e$, and $\vec\nu = L^{-1}\vec\varepsilon$, we arrive at the model
$$\vec Z = a^*\vec w + b^*\vec v + \vec\nu,$$
where ~ν has mean zero and σ 2 I as its covariance matrix. If we now additionally assume that the νi ’s are iid,
the bootstrap can be applied to this transformed model (and hence to the original model with covariance
matrix σ 2 C).
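As an illustration of Remark 5.1, the whitening step can be carried out with a Cholesky factor and triangular solves. The following Python sketch (SciPy assumed; names ours) produces the transformed data $(\vec Z, \vec w, \vec v)$, to which the iid residual bootstrap above can then be applied, and fits $\vec Z$ on $(\vec w, \vec v)$ by ordinary least squares.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whiten_and_fit(x, Y, C):
    """Whiten by L^{-1} (C = L L^T), then fit Z = a w + b v by ordinary least squares."""
    L = cholesky(C, lower=True)                            # lower-triangular Cholesky factor
    Z = solve_triangular(L, Y, lower=True)                 # Z = L^{-1} Y
    w = solve_triangular(L, x, lower=True)                 # w = L^{-1} x
    v = solve_triangular(L, np.ones(len(x)), lower=True)   # v = L^{-1} e
    A = np.column_stack([w, v])
    (a_hat, b_hat), *_ = np.linalg.lstsq(A, Z, rcond=None)
    return a_hat, b_hat, Z, w, v
```

The least squares fit on the whitened data coincides with the generalized least squares estimator for the original model, which is the point of the transformation.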
5.4 Data Transformations
In many applied settings, one expects that a non-linear model might offer a better explanation of the data.
For example, one might postulate basic trends of the form:
$$Y_i = a^{*T} x_i + b^* + \varepsilon_i$$
for $1 \le i \le n$, where $a^* \in \mathbb{R}^d$ and $b^*$ are the “true” parameters, and $(\varepsilon_1, \dots, \varepsilon_n)^T$ is an $n$-dimensional Gaussian rv with mean $0$ and covariance matrix $\sigma^2 C$ with unknown $\sigma^2$ (but known $C$).
Here, the likelihood is given by
$$\frac{1}{(2\pi\sigma^2)^{n/2}\,|\det C|^{1/2}} \exp\!\left(-\frac{1}{2\sigma^2}(\vec Y - \mathbf{x}a - b\vec e)^T C^{-1}(\vec Y - \mathbf{x}a - b\vec e)\right),$$
where $\vec Y = (Y_1, \dots, Y_n)^T$ and $\mathbf{x}$ is the $n \times d$ matrix in which the $i$th row is $x_i$. The maximum likelihood estimators $\hat a$, $\hat b$, and $\hat\sigma^2$ satisfy
$$\begin{pmatrix} \mathbf{x}^T C^{-1}\mathbf{x} & \mathbf{x}^T C^{-1}\vec e \\ \vec e^T C^{-1}\mathbf{x} & \vec e^T C^{-1}\vec e \end{pmatrix} \begin{pmatrix} \hat a \\ \hat b \end{pmatrix} = \begin{pmatrix} \mathbf{x}^T C^{-1}\vec Y \\ \vec e^T C^{-1}\vec Y \end{pmatrix}$$
and
$$\hat\sigma^2 = \frac{1}{n}(\vec Y - \mathbf{x}\hat a - \hat b\vec e)^T C^{-1}(\vec Y - \mathbf{x}\hat a - \hat b\vec e).$$
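The block system above can be assembled and solved directly. A minimal Python/NumPy sketch (function name ours; explicit inversion of $C$ is only for clarity) follows.

```python
import numpy as np

def fit_multiple_gls(X, Y, C):
    """MLE (a_hat, b_hat, sigma2_hat) for Y = X a + b e + eps, eps ~ N(0, sigma^2 C).

    X : n x d design matrix (ith row is x_i); Y : length-n vector; C : n x n pos. def.
    """
    n, d = X.shape
    e = np.ones(n)
    Cinv = np.linalg.inv(C)
    top = np.hstack([X.T @ Cinv @ X, (X.T @ Cinv @ e).reshape(d, 1)])
    bot = np.hstack([(e @ Cinv @ X).reshape(1, d), np.array([[e @ Cinv @ e]])])
    A = np.vstack([top, bot])                       # (d+1) x (d+1) block system
    rhs = np.concatenate([X.T @ Cinv @ Y, [e @ Cinv @ Y]])
    sol = np.linalg.solve(A, rhs)
    a_hat, b_hat = sol[:d], sol[d]
    resid = Y - X @ a_hat - b_hat * e
    sigma2_hat = (resid @ Cinv @ resid) / n
    return a_hat, b_hat, sigma2_hat
```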
All the ideas described in the context of (simple) linear regression models with d = 1 generalize in a
suitable way to the multiple linear regression context: confidence region procedures for a∗ , hypothesis testing,
bootstrap procedures for non-Gaussian residuals, etc.
$$Y_i = a^{*T} X_i + b^* + \varepsilon_i$$
for some “true” $a^* \in \mathbb{R}^d$ and $b^* \in \mathbb{R}$, where $((X_i, \varepsilon_i) : 1 \le i \le n)$ is a set of $n$ iid pairs with $E[\varepsilon_i] = 0$
and var (εi ) < ∞. (We permit Xi ∈ IRd to be vector valued, so as to permit Yi to depend on multiple
characteristics of each specimen.) Put X̃i = Xi − E [Xi ], and Ỹi = Yi − E [Yi ], for 1 ≤ i ≤ n. Note that the
best affine predictor of Yi given Xi must be a∗ T Xi + b∗ , and hence
$$a^* = \left(E\!\left[\tilde X_1 \tilde X_1^T\right]\right)^{-1} E\!\left[\tilde X_1 \tilde Y_1\right], \qquad b^* = E[Y_1] - a^{*T} E[X_1].$$
(We assume here, and throughout, that the covariance matrix $E[\tilde X_1 \tilde X_1^T]$ is non-singular.)
We now describe the bootstrap procedure that would be used to deal with such a correlation model (in the
presence of non-Gaussian residual errors).
Exercise 5.1: Suppose that $E\|X_1\|^2 < \infty$ and $E\varepsilon_1^2 < \infty$. Put
$$\hat a_n = \left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)^T\right)^{-1} \cdot \left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)\right),$$
$$\hat b_n = \bar Y_n - \hat a_n^T \bar X_n,$$
where
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i.$$
Prove that
$$\hat a_n \to a^* \ \text{a.s.}, \qquad \hat b_n \to b^* \ \text{a.s.}$$
as $n \to \infty$.
According to Exercise 5.1, $\hat a_n$ and $\hat b_n$ are (for large sample sizes $n$) close to $a^*$ and $b^*$. Suppose that we sample $(X^*_{11}, Y^*_{11}), \dots, (X^*_{1n}, Y^*_{1n})$ from the collection of observations $(X_1, Y_1), \dots, (X_n, Y_n)$, independently (and with replacement). Put
$$\bar X^*_1 = \frac{1}{n}\sum_{i=1}^n X^*_{1i}, \qquad \bar Y^*_1 = \frac{1}{n}\sum_{i=1}^n Y^*_{1i},$$
$$\hat a^*_1 = \left(\frac{1}{n}\sum_{i=1}^n (X^*_{1i} - \bar X^*_1)(X^*_{1i} - \bar X^*_1)^T\right)^{-1} \cdot \left(\frac{1}{n}\sum_{i=1}^n (X^*_{1i} - \bar X^*_1)(Y^*_{1i} - \bar Y^*_1)\right),$$
$$\hat b^*_1 = \bar Y^*_1 - \hat a^{*T}_1 \bar X^*_1.$$
If we independently generate $m$ such bootstrap samples from $(X_1, Y_1), \dots, (X_n, Y_n)$, we obtain $m$ pairs $(\hat a^*_1, \hat b^*_1), \dots, (\hat a^*_m, \hat b^*_m)$. If $m, n$ are both large, the distribution of
$$\left(\frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)(X_i - \bar X_n)^T\right)^{-1/2} (\hat a_n - a^*)$$
can be approximated by
$$\frac{1}{m}\sum_{i=1}^m I\!\left(\left(\frac{1}{n}\sum_{j=1}^n (X^*_{ij} - \bar X^*_i)(X^*_{ij} - \bar X^*_i)^T\right)^{-1/2} (\hat a^*_i - \hat a_n) \le \,\cdot\,\right).$$
This bootstrap procedure can be used to construct confidence regions for a∗ and b∗ for the correlation model,
as well as hypothesis testing regions2 .
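A compact Python sketch of this pairs bootstrap (NumPy assumed; names ours) appears below; for brevity it records the raw differences $\hat a^*_i - \hat a_n$ rather than the matrix-normalized statistic displayed above, but the normalization can be added with a matrix square root.

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_model_fit(X, Y):
    """Sample analogue of a* = (E X~ X~^T)^{-1} E[X~ Y~], b* = E[Y] - a*^T E[X]."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    a = np.linalg.solve(Xc.T @ Xc / len(Y), Xc.T @ Yc / len(Y))
    return a, Y.mean() - a @ X.mean(axis=0)

def pairs_bootstrap(X, Y, m=2000):
    """Bootstrap distribution of the estimator for the correlation model."""
    n = len(Y)
    a_n, b_n = corr_model_fit(X, Y)
    draws = np.empty((m, X.shape[1]))
    for i in range(m):
        idx = rng.integers(0, n, size=n)          # resample whole (X_i, Y_i) pairs
        a_star, b_star = corr_model_fit(X[idx], Y[idx])
        draws[i] = a_star - a_n                   # bootstrap analogue of a_hat_n - a*
    return a_n, b_n, draws                        # quantiles of draws give confidence regions
```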
$$\Delta^1 y_n = y_{n+1} - y_n$$
2 See Chapter 4 of The Bootstrap and Edgeworth Expansion by Peter Hall, Springer-Verlag (1992) for details.
5.9 Stochastic Linear Difference Equations of pth Order
The stochastic analog to a constant sequence zn = c is an iid sequence (Vn : n ≥ 0). Hence, the natural
stochastic analog to (5.3) is a stochastic sequence (Yn : n ≥ 0) satisfying
$$\Delta^p Y_n = \sum_{j=1}^p \beta_j \Delta^{p-j} Y_n + V_n. \tag{5.4}$$
Definition 5.2:
A sequence Y = (Yn : n ≥ 0) satisfying (5.5) with (Vn : n ≥ 0) iid is called a pth order autoregressive
sequence.
The autoregressive sequence is said to be Gaussian if the $V_n$'s are Gaussian.
Any pth order (scalar) autoregression can be expressed as a first order (vector) autoregression, by following
the same idea as that leading to (5.2). Put
Xn = (Yn−p+1 , ..., Yn )T
and note that
Exercise 5.2:
Let $X = (X_n : n \ge 0)$ satisfy (5.6) for $n \ge 0$ with $E\|Z_n\| < \infty$ and $(Z_n : n \ge 0)$ iid.
1. Show that $X_n = F^n X_0 + \sum_{j=0}^{n-1} F^j Z_{n-j}$.
2. Prove that $X_n \stackrel{D}{=} F^n X_0 + \sum_{j=0}^{n-1} F^j Z_j$.
3. Prove that if the spectral radius of $F$ is less than one, then $X_n \Rightarrow X_\infty$ as $n \to \infty$, where
$$X_\infty \stackrel{D}{=} \sum_{j=0}^{\infty} F^j Z_j.$$
4. If the $Z_n$'s are Gaussian with covariance $C$, show that $X_\infty$ is Gaussian with mean $(I - F)^{-1} E Z_1$ and covariance matrix $\Lambda$ satisfying
$$\Lambda = F \Lambda F^T + C.$$
$$\Lambda_{n+1} = F \Lambda_n F^T + C$$
for $n \ge 0$, subject to $\Lambda_0 = 0$.
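As an illustration (a minimal Python/NumPy sketch; the example $F$ and $C$ correspond to an arbitrary AR(2) of our choosing), the recursion above can be iterated to approximate the stationary covariance $\Lambda$ solving $\Lambda = F\Lambda F^T + C$, provided the spectral radius of $F$ is less than one.

```python
import numpy as np

def stationary_covariance(F, C, tol=1e-12, max_iter=10_000):
    """Iterate Lambda_{n+1} = F Lambda_n F^T + C starting from Lambda_0 = 0."""
    Lam = np.zeros_like(C)
    for _ in range(max_iter):
        Lam_next = F @ Lam @ F.T + C
        if np.max(np.abs(Lam_next - Lam)) < tol:
            return Lam_next
        Lam = Lam_next
    return Lam

# Example: AR(2) Y_n = 0.5 Y_{n-1} + 0.3 Y_{n-2} + V_n with Var(V_n) = 1, in companion form
F = np.array([[0.0, 1.0],
              [0.3, 0.5]])
C = np.array([[0.0, 0.0],
              [0.0, 1.0]])
Lam = stationary_covariance(F, C)      # satisfies Lam ~= F Lam F^T + C
```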
Requiring the eigenvalues of $F$ to have moduli less than one is equivalent to requiring that the $p$ roots $z_1, \dots, z_p$ of the degree $p$ polynomial
$$z^p - \sum_{j=1}^p a_j z^{p-j} \tag{5.7}$$
all have modulus less than one.
where (Zn : n ≥ 0) is a sequence of iid copies of the random variable Z1 . To indicate the dependence of Xk
on r, we write it as Xk,−r . Observe that as r → ∞,
Definition 5.3:
The sequence X ∗ is said to be the stationary version of X.
We interpret a stationary process as representing a system that was initialized at time −∞ and is in stochastic
equilibrium at every finite t.
$$E[X_{n+m} \mid X_j : j \le n].$$
This conditional expectation works out to equal
$$F^m X_n + \sum_{j=0}^{m-1} F^j E Z_1. \tag{5.8}$$
Hence, we can use this formula to predict $Y_{n+m}$ based on $(Y_n, Y_{n-1}, \dots, Y_{n-p+1})^T$ (whose components are those of $X_n$).
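A short Python sketch of the $m$-step predictor (5.8) (function name ours; $F$, $EZ$, and $X_n$ are supplied by the user, e.g. the companion-form $F$ of the previous sketch):

```python
import numpy as np

def predict(F, EZ, X_n, m):
    """m-step-ahead conditional mean: F^m X_n + sum_{j=0}^{m-1} F^j E[Z_1]."""
    pred = np.linalg.matrix_power(F, m) @ X_n
    for j in range(m):
        pred += np.linalg.matrix_power(F, j) @ EZ
    return pred        # its last component is the prediction of Y_{n+m}
```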
$$(2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=0}^{n-1} (Y_{i+p} - a_1 Y_{i+p-1} - \dots - a_p Y_i - \mu)^2\right).$$
The maximum likelihood estimators $\hat a_1, \dots, \hat a_p$ and $\hat\mu$ solve a linear system (the normal equations obtained by differentiating the exponent), and
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=0}^{n-1} (Y_{i+p} - \hat a_1 Y_{i+p-1} - \dots - \hat a_p Y_i - \hat\mu)^2.$$
As in the settings of conventional regression models, exact confidence regions and hypothesis testing have
been developed in this Gaussian setting. Details can be found in the enormous literature on so-called “time
series” models.
5.14 Parameter Estimation for Autoregressive Sequences with non-Gaussian Residuals
We now turn to the issue of how to deal with an autoregressive sequence Y = (Yn : n ≥ 0) for which
see (5.8) above. If the Zn ’s are Gaussian, the conditional distribution of Xn+m is
$$N\!\left(F^m X_n + \sum_{j=0}^{m-1} F^j E Z_1,\ \Lambda_m\right) \tag{5.10}$$
where $\Lambda_m = F\Lambda_{m-1}F^T + E(Z_1 - EZ_1)(Z_1 - EZ_1)^T$ with $\Lambda_0 = 0$. This conditional distribution can be used to make predictions such as $P(X_{n+m} \in \cdot \mid X_j,\ j \le n)$.
We now turn to the question of parameter estimation in the setting of non-Gaussian residuals. Note that
$$\langle \varepsilon_n, Y_{n-i} \rangle = 0$$
for $i \ge 1$, so that
$$a^*_1 E Y_{n-1} Y_{n-i} + \dots + a^*_p E Y_{n-p} Y_{n-i} + \mu^* E Y_{n-i} = E Y_n Y_{n-i}$$
for i ≥ 1. A square linear system of p + 1 equations is obtained by taking the first p + 1 such equations (i.e.
1 ≤ i ≤ p + 1).
Given observations $(Y_j : 0 \le j \le n+p)$, we can estimate $E Y_{l-k} Y_{l-i}$ via the sample average $\frac{1}{n}\sum_{l=p+1}^{p+n} Y_{l-k} Y_{l-i}$, suggesting that we consider the linear system
$$\hat a_1 \frac{1}{n}\sum_{l=p+1}^{p+n} Y_{l-1} Y_{l-i} + \dots + \hat a_p \frac{1}{n}\sum_{l=p+1}^{p+n} Y_{l-p} Y_{l-i} + \hat\mu\, \frac{1}{n}\sum_{l=p+1}^{p+n} Y_{l-i} = \frac{1}{n}\sum_{l=p+1}^{p+n} Y_l Y_{l-i} \tag{5.12}$$
for $1 \le i \le p + 1$. (Note the similarity of (5.12) to (5.9). What explains the similarity?)
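To make (5.12) concrete, the following Python sketch (NumPy assumed; the function name is ours) builds the empirical moment matrix and solves for $\hat a_1, \dots, \hat a_p$ and $\hat\mu$.

```python
import numpy as np

def fit_ar_moments(Y, p):
    """Solve the empirical moment equations (5.12) for (a_1, ..., a_p) and mu.

    Y : array of observations Y_0, ..., Y_{n+p} (so n = len(Y) - p - 1).
    """
    n = len(Y) - p - 1
    ls = np.arange(p + 1, p + n + 1)               # summation index l = p+1, ..., p+n
    A = np.zeros((p + 1, p + 1))
    rhs = np.zeros(p + 1)
    for i in range(1, p + 2):                      # one equation per i = 1, ..., p+1
        for k in range(1, p + 1):
            A[i - 1, k - 1] = np.mean(Y[ls - k] * Y[ls - i])   # coefficient of a_k
        A[i - 1, p] = np.mean(Y[ls - i])                       # coefficient of mu
        rhs[i - 1] = np.mean(Y[ls] * Y[ls - i])
    sol = np.linalg.solve(A, rhs)
    return sol[:p], sol[p]                         # (a_1, ..., a_p), mu
```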
Exercise 5.3:
Suppose that the roots of (5.7) are all less than one in modulus, and assume that $E\varepsilon_1^4 < \infty$. Prove that $\hat a_i \stackrel{p}{\longrightarrow} a^*_i$, $1 \le i \le p$, and that $\hat\mu \stackrel{p}{\longrightarrow} \mu^*$ as $n \to \infty$.
To produce confidence regions for a∗1 , ... , a∗p and µ∗ , we can apply the bootstrap idea. For p ≤ i ≤ n + p, let
$$P(\hat a_j - a^*_j \in \cdot),$$
from which a large-sample confidence interval for a∗j can be obtained. In a similar way, we can obtain a
large-sample bootstrap confidence interval for µ∗ .
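The following hedged Python sketch implements one standard version of such a residual bootstrap for autoregressions (it reuses fit_ar_moments from the previous sketch; the names and the centring of the residuals are our own choices, not necessarily the exact recipe elided above).

```python
import numpy as np

rng = np.random.default_rng(0)

def ar_bootstrap(Y, p, m=1000):
    """Residual bootstrap for the AR(p) coefficients: returns draws of (a*_b - a_hat)."""
    Y = np.asarray(Y, dtype=float)
    a_hat, mu_hat = fit_ar_moments(Y, p)
    N = len(Y)
    # estimated residuals eps_i = Y_i - a_1 Y_{i-1} - ... - a_p Y_{i-p} - mu, for i >= p
    eps_hat = np.array([Y[i] - a_hat @ Y[i - p:i][::-1] - mu_hat for i in range(p, N)])
    eps_hat -= eps_hat.mean()                       # centre the estimated residuals
    draws = np.empty((m, p))
    for b in range(m):
        Y_star = list(Y[:p])                        # keep the first p values as starting data
        for i in range(p, N):
            lagged = np.array(Y_star[-p:])[::-1]    # (Y*_{i-1}, ..., Y*_{i-p})
            Y_star.append(a_hat @ lagged + mu_hat + rng.choice(eps_hat))
        a_star, _ = fit_ar_moments(np.array(Y_star), p)
        draws[b] = a_star - a_hat                   # bootstrap analogue of a_hat - a*
    return draws                                    # quantiles yield confidence intervals
```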
Exercise 5.4:
Extend the bootstrap procedure to produce prediction regions for Yn+m , based on observing Yj , 0 ≤ j ≤ n,
that take into account parameter uncertainty in estimating a∗1 , ... , a∗p and µ∗ from the observed data.