
A course in Time Series Analysis

Suhasini Subba Rao


Email: suhasini.subbarao@stat.tamu.edu
October 27, 2008
Contents
1 Introduction: Why do time series? 5
1.1 Stationary processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Linear time series 11
2.1 Difference equations and back-shift operators . . . . . . . . . . . . . . . . . . . . 12
2.2 The ARMA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 The autocovariance function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 The autocovariance of an autoregressive process . . . . . . . . . . . . . . 21
2.3.2 The autocovariance of a moving average process . . . . . . . . . . . . . . 23
2.3.3 The autocovariance of an autoregressive moving average process . . . . . 24
2.3.4 The partial covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 The autocovariance function, invertibility and causality . . . . . . . . . . . . . . 26
3 Prediction 29
3.1 Basis and linear vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1.1 Orthogonal basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Spaces spanned by an infinite number of elements . . . . . . . . . . . . . 31
3.2 Durbin-Levinson algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Prediction for ARMA processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 The Wold Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Estimation for Linear models 40
4.1 Estimation of the mean and autocovariance function . . . . . . . . . . . . . . . . 40
4.1.1 Estimating the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 Estimating the covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Some asymptotic results on the covariance estimator . . . . . . . . . . . . 41
4.2 Estimation for AR models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 The Yule-Walker estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 The Gaussian maximum likelihood (least squares estimator) . . . . . . . . 45
4.3 Estimation for ARMA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 The Hannan and Rissanen AR(∞) expansion method . . . . . . . . . . . . . 46
4.3.2 The Gaussian maximum likelihood estimator . . . . . . . . . . . . . . . . 48
5 Almost sure convergence, convergence in probability and asymptotic normality 50
5.1 Modes of convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Ergodicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Sampling properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Showing almost sure convergence of an estimator . . . . . . . . . . . . . . . . . . 54
5.4.1 Proof of Theorem 5.4.2 (The stochastic Ascoli theorem) . . . . . . . . . . 56
5.5 Almost sure convergence of the least squares estimator for an AR(p) process . . . 57
5.6 Convergence in probability of an estimator . . . . . . . . . . . . . . . . . . . . . . 59
5.7 Asymptotic normality of an estimator . . . . . . . . . . . . . . . . . . . . . . . . 60
5.8 Asymptotic normality of the least squares estimator . . . . . . . . . . . . . . . . 62
6 Sampling properties of ARMA parameter estimators 66
6.1 Asymptotic properties of the Hannan and Rissanen estimation method . . . . . . 66
6.1.1 Proof of Theorem 6.1.1 (A rate for ‖b̂_T − b_T‖_2) . . . . . . . . . . . . . . 70
6.2 Asymptotic properties of the GMLE . . . . . . . . . . . . . . . . . . . . . . . . . 72
7 Residual Bootstrap for estimation in autoregressive processes 80
7.1 The residual bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.2 The sampling properties of the residual bootstrap estimator . . . . . . . . . . . . 82
8 Spectral Analysis 88
8.1 Some Fourier background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.3 Spectral representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.3.1 The spectral distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.3.2 The spectral representation theorem . . . . . . . . . . . . . . . . . . . . . 94
8.3.3 The spectral densities of MA, AR and ARMA models . . . . . . . . . . . 97
8.3.4 Higher order spectrums . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.4 The Periodogram and the spectral density function . . . . . . . . . . . . . . . . . 99
8.4.1 The periodogram and its properties . . . . . . . . . . . . . . . . . . . . . 100
8.4.2 Estimating the spectral density . . . . . . . . . . . . . . . . . . . . . . . . 103
8.5 The Whittle Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9 Nonlinear Time Series 113
9.1 The ARCH model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.1.1 Some properties of the ARCH process . . . . . . . . . . . . . . . . . . . . 114
9.2 The quasi-maximum likelihood for ARCH processes . . . . . . . . . . . . . . . . 116
9.2.1 Consistency of the quasi-maximum likelihood estimator . . . . . . . . . . 116
9.2.2 Asymptotic normality of the quasi-maximum likelihood estimator . . . . . 118
9.3 Testing for linearity of a time series . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.3.1 Motivating the test statistic . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.3.2 Estimates of the higher order spectrum . . . . . . . . . . . . . . . . . . . 121
9.3.3 Hotelling's T²-statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.3.4 The test statistic for the test for linearity . . . . . . . . . . . . . . . . . . 125
10 Mixingales 126
10.1 Obtaining almost sure rates of convergence for some sums . . . . . . . . . . . . . 127
10.2 Proof of Theorem 6.1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A Appendix 131
A.1 Background: some definitions and inequalities . . . . . . . . . . . . . . . . . . . . 131
Preface
The material for these notes comes from all over the place: some of it from books and articles, and some of it from my own work. For those interested in reading around the material the following books may be useful (it is by no means an exhaustive list).

For linear time series: Priestley (1983), Brockwell and Davis (1998), Fuller (1995), Grimmett and Stirzaker (1994), Chapter 9, and Shumway and Stoffer (2006). This is not a comprehensive list, and I am sure more books will be included.

Bartlett (1981) and Grenander and Rosenblatt (1997) are very early books on time series; these books were shortly followed by Parzen (1999) (these are references to the latest editions, they were first published in the late 50s and early 60s). The book which brought time series to the masses is Box and Jenkins (1970), and it is very useful for any practitioner. At about the same time Hannan (1970) and Anderson (1994) were published, which also deal with time series analysis.
I have yet to start writing the nonlinear notes. I will update this list.
Chapter 1
Introduction: Why do time series?
A time series is a series of observations x_t, each observed at time t. Typically the observations can be over an entire interval, randomly sampled on an interval or at fixed time points. Different types of time sampling require different approaches to the data analysis. However, in this course we will focus on the case that observations are observed at fixed time points, hence we will suppose we observe {x_t : t = 1, …, n}. Below we give examples of typical time series. Figure 1.1 is of the daily exchange rate between the British pound and the US dollar (after taking log differences). Figure 1.2 is of the monthly minimum temperatures recorded in the Antarctic and Figure 1.3 is of the global temperature anomalies. Comparing the Antarctic, exchange rate and global temperature data with the simulation of white noise (iid random variables) in Figure 1.4, we see that, unlike the iid realisation, there appears to be more smoothness in the plots and dependence between observations which are located close in time. Figures 1.1, 1.2 and 1.3 are examples of time series and various time series models are fitted to this type of data.
Hence we observe the time series {x_t}; usually we assume that x_t is a realisation from a random process {X_t}. We formalise this notion below. The random process {X_t ; t ∈ Z} (where Z denotes the integers) is defined on the probability space (Ω, F, P). We explain what these mean below:

(i) Ω is the set of all possible outcomes. Suppose that ω ∈ Ω, then {X_t(ω)} is one realisation from the random process. For any given ω, {X_t(ω)} is not random. In time series we will usually assume that what we observe, x_t = X_t(ω) (for some ω), is a typical realisation. That is, for any other ω* ∈ Ω, {X_t(ω*)} will be different, but its general or overall characteristics will be similar.

(ii) F is known as a sigma-algebra. It is a set of subsets of Ω (though not necessarily the set of all subsets, as this can be too large), and it consists of all sets for which a probability can be assigned. That is, if A ∈ F, then P(A) is known.

(iii) P is the probability measure.
After all this formalisation, let us return to the plots in Figures 1.2 and 1.1. We see that Figure 1.1 can be considered as one realisation from the stochastic process {X_t}. Now, based on this one realisation, we want to make inference about parameters associated with the process {X_t}, such as the mean etc. Let us consider estimators of the mean, noting that the discussion below equally applies to any population parameter. We recall that in classical statistics we usually
[Figure 1.1: The GBP and USD exchange rate from 2000-2008 (after taking log differences); x-axis: day (0-2000), y-axis: daily log differences (roughly -0.02 to 0.02).]
[Figure 1.2: The monthly minimum temperatures at Faraday station in the Antarctic; x-axis: months (0-700), y-axis: degrees Celsius (roughly -40 to 0).]
[Figure 1.3: The global yearly temperature anomalies from 1850-present; x-axis: year (1850-2000), y-axis: temperature anomaly (roughly -0.6 to 0.4).]
[Figure 1.4: A simulation of 150 iid random variables; x-axis: time, y-axis: white noise values.]
assume we observe several independent realisations, {Z_k}, from a random variable Z, and use the multiple realisations to make inference about the mean: Z̄ = (1/n) ∑_{k=1}^n Z_k. Roughly speaking, by using several independent realisations we are sampling over the entire probability space and obtaining a good estimate of the mean. On the other hand, if the samples were not independent but highly dependent, then it is likely that {Z_k} would be concentrated about a small part of the probability space. In this case, the sample mean would be highly biased.
Now let us consider the time series. For most time series we need to estimate parameters based on only one realisation x_t = X_t(ω). Therefore, it would appear impossible to obtain a good estimator of the mean. However, good estimates of the mean can be made based on just one realisation so long as certain assumptions are satisfied: (i) the process is stationary (this is a type of invariance assumption, that is, the main characteristics of the process do not change over time; for example the mean does not change over time) and (ii) despite the fact that each time series is generated from one realisation, there is short memory in the observations. That is, what is observed today, x_t, has little influence on observations in the future, x_{t+k} (when k is relatively large). Hence, even though we observe one trajectory, that trajectory traverses much of the probability space. The amount of dependence in the time series determines the quality of the estimator. There are several ways to measure the dependence. The most common measure of linear dependence is the covariance. The covariance in the stochastic process {X_t} is defined as

cov(X_t, X_{t+k}) = E(X_t X_{t+k}) − E(X_t) E(X_{t+k}).
Hence if X_t has mean zero, the above reduces to cov(X_t, X_{t+k}) = E(X_t X_{t+k}). In a lot of statistical analysis the covariance is often sufficient as a measure. However, it is worth bearing in mind that the covariance only measures linear dependence; in general, given cov(X_t, X_{t+k}) we cannot say anything about cov(g(X_t), g(X_{t+k})), where g is a nonlinear function. There are occasions where we require a more general measure of dependence. Examples of more general measures include mixing (in its various flavours), first introduced by Rosenblatt in the 50s (Grenander and Rosenblatt (1997)). However, in this course we will not cover mixing.
Websites where the data can be obtained include:
http://www.cru.uea.ac.uk/
http://www.federalreserve.gov/releases/h10/Hist/
http://bossa.pl/notowania/daneatech/metastock/.
1.1 Stationary processes
Stationarity is a rather intuitive concept; it is an invariance property which means that the statistical characteristics of the time series do not change over time. For example, the yearly rainfall may vary year by year, but the average rainfall in two equal-length time intervals will be roughly the same, as would the number of times the rainfall exceeds a certain threshold. Of course, over long periods of time this assumption may not be so plausible. For example, the climate change that we are currently experiencing is causing changes in the overall weather patterns (we will consider nonstationary time series towards the end of this course). However, in many situations, and over shorter intervals, the assumption of stationarity is quite plausible. Indeed, often the statistical analysis of a time series is done under the assumption that the time series is stationary. There are two definitions of stationarity: weak stationarity, which only concerns the covariance of a process, and strict stationarity, which is a much stronger condition and supposes the distributions are invariant over time.
Definition 1.1.1 (Strict stationarity) The time series {X_t} is said to be strictly stationary if for any finite sequence of integers t_1, …, t_k and any shift h the distributions of (X_{t_1}, …, X_{t_k}) and (X_{t_1+h}, …, X_{t_k+h}) are the same.
Definition 1.1.2 (Second order stationarity/weak stationarity) The time series {X_t} is said to be second order stationary if for any t and k the covariance between X_t and X_{t+k} only depends on the lag difference k. In other words, there exists a function c : Z → R such that for all t and k we have

c(k) = cov(X_t, X_{t+k}).
Remark 1.1.1 It is easy to show that strict stationarity implies second order stationarity, but the converse is not necessarily true. To show that strict stationarity implies second order stationarity, suppose that {X_t} is a strictly stationary process with zero mean; then

cov(X_t, X_{t+k}) = ∫ xy P_{X_t, X_{t+k}}(dx, dy) = ∫ xy P_{X_{t_1}, X_{t_1+k}}(dx, dy) = cov(X_{t_1}, X_{t_1+k}),

where P_{X_t, X_{t+k}} is the joint distribution of (X_t, X_{t+k}). Clearly cov(X_t, X_{t+k}) does not depend on t and {X_t} is second order stationary.
The covariance of a stationary process has several very interesting properties. One of the main properties is that it is non-negative definite, which we define below.
Definition 1.1.3 (Non-negative definite) A sequence {c(k)} is said to be non-negative definite if for any n ∈ Z and any sequence x = (x_1, …, x_n) ∈ R^n the following is satisfied:

∑_{i,j=1}^n x_i c(i − j) x_j ≥ 0.
Remark 1.1.2 You have probably encountered this notion before when dealing with non-negative definite (positive definite) matrices. Recall that the n×n matrix Σ_n is non-negative definite if for all x ∈ R^n we have x′Σ_n x ≥ 0. To see how this is related to non-negative definite sequences, suppose that the matrix Σ_n has a special form, namely that the elements of Σ_n are (Σ_n)_{i,j} = c(i − j). Then x′Σ_n x = ∑_{i,j=1}^n x_i c(i − j) x_j. We observe that in the case that {X_t} is a stationary process with covariance c(k), the variance-covariance matrix of X_n = (X_1, …, X_n) is Σ_n, where (Σ_n)_{i,j} = c(i − j).
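To make the connection with matrices concrete, here is a minimal numerical sketch (Python, assuming numpy is available; the AR(1)-type covariance c(k) = ρ^|k| used as input is purely illustrative, not taken from the notes). It builds the Toeplitz matrix Σ_n from a covariance sequence and checks that its eigenvalues and quadratic forms are non-negative.

```python
import numpy as np

# Illustrative covariance sequence: c(k) = rho^|k| (the ACF shape of a standardised AR(1) process).
rho, n = 0.7, 20
c = rho ** np.abs(np.arange(n))

# Build the Toeplitz matrix (Sigma_n)_{i,j} = c(i - j).
Sigma = np.array([[c[abs(i - j)] for j in range(n)] for i in range(n)])

# A non-negative definite sequence gives a matrix with non-negative eigenvalues,
# equivalently x' Sigma x >= 0 for every x.
eigvals = np.linalg.eigvalsh(Sigma)
print("smallest eigenvalue:", eigvals.min())          # should be >= 0 (up to rounding error)

x = np.random.default_rng(0).normal(size=n)
print("quadratic form x' Sigma x:", x @ Sigma @ x)    # should be >= 0
```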
We now take the above remark further and show that the covariance of a stationary process is non-negative definite.

Theorem 1.1.1 Suppose that {X_t} is a stationary time series with covariance function c(k); then {c(k)} is a non-negative definite sequence. Conversely, for any non-negative definite sequence {c(k)} there exists a stationary time series with {c(k)} as its covariance function.
PROOF. To show that {c(k)} is non-negative definite, consider any sequence x = (x_1, …, x_n) ∈ R^n and the double sum ∑_{i,j=1}^n x_i c(i − j) x_j. Define the random variable Y = ∑_{i=1}^n x_i X_i. It is straightforward to see that var(Y) = x′ var(X_n) x = ∑_{i,j=1}^n x_i c(i − j) x_j, where X_n = (X_1, …, X_n). Since for any random variable Y we have var(Y) ≥ 0, this means that ∑_{i,j=1}^n x_i c(i − j) x_j ≥ 0, hence {c(k)} is a non-negative definite sequence.

To show the converse, that for any non-negative definite sequence {c(k)} we can find a corresponding stationary time series with covariance {c(k)}, is relatively straightforward, but depends on defining the characteristic function of a process and using Kolmogorov's extension theorem. We omit the details but refer the interested reader to Brockwell and Davis (1998), Section 1.5.
It is worth noting that a simple way to check for non-negative definiteness of a sequence is to consider its Fourier transform: if the Fourier transform is positive, then the sequence is non-negative definite. We will look at this in more depth when we consider the spectral density. The above theorem also applies to spatial processes, which is why in spatial statistics one often looks at the construction of positive definite covariance functions.
Chapter 2
Linear time series
Estimating the autocovariances of a time series gives us information about the linear dependence structure of the process. It is therefore desirable to find models which help to explain some of the characteristics which we see in the autocovariances. An important class of models are linear time series models and MA(∞) models. We recall that in the linear regression model the dependent variable is influenced by the current values of the independent variables. The linear time series model is a generalisation of this idea, where the dependent variable is influenced by past, present and future independent variables. The MA(∞) model is a subclass which has a more natural interpretation: here the dependent variable is influenced only by the current and past. There are two popular sub-groups of linear time series models, (a) the autoregressive and (b) the moving average models, which can be combined to make the autoregressive moving average models. A nice feature of autoregressive models is that the previous observations linearly influence the current observation. A nice feature of moving average processes is that there is only non-zero correlation for a finite number of lags; for a large enough lag the covariance will be zero.
Before defining a linear time series, we consider the MA(q) model, which is a small subclass of linear time series. Let us suppose that {ε_t} are iid random variables with mean zero and finite variance. Suppose the time series satisfies

X_t = ∑_{j=0}^q ψ_j ε_{t−j}.

It is clear that X_t is a rolling finite weighted sum of {ε_t}, therefore {X_t} must be well defined (which basically means it is almost surely finite; this you can see because it has a finite variance). Now we extend this idea and look not only at finite sums but at infinite sums of random variables. Things become more complicated. Care must always be taken whenever we deal with anything involving infinite sums! For example, ∑_{j=−∞}^∞ ψ_j X_{t−j} only makes sense if its partial sums S_n = ∑_{j=−n}^n ψ_j X_{t−j} are (almost surely) finite and the sequence converges (hence |S_{n_1} − S_{n_2}| → 0 as n_1, n_2 → ∞). Effectively everything must be finite. We give conditions under which this is true in the following lemma.
Lemma 2.0.1 Suppose {X_t} is a strictly stationary time series with E|X_t| < ∞. Then {Y_t} defined by

Y_t = ∑_{j=−∞}^∞ ψ_j X_{t−j},

where ∑_{j=−∞}^∞ |ψ_j| < ∞, is a strictly stationary time series (and the sum converges almost surely, that is Y_{n,t} = ∑_{j=−n}^n ψ_j X_{t−j} → Y_t almost surely). If in addition var(X_t) < ∞, then {Y_t} is second order stationary and the sum converges in mean square (that is E(Y_{n,t} − Y_t)² → 0).

PROOF. See Brockwell and Davis (1998), Proposition 3.1.1 or Fuller (1995), Theorem 2.1.1 (page 31) (also Shumway and Stoffer (2006), page 86).
Example 2.0.1 Suppose {X_t} is any stationary process with var(X_t) < ∞. Define {Y_t} as the following infinite sum

Y_t = ∑_{j=0}^∞ j^k λ^j X_{t−j},

where |λ| < 1. Then {Y_t} is also a stationary process with a finite variance.
We will use this example later in the course.
Having derived conditions under which infinite sums are well defined, we can now define the general class of linear and MA(∞) processes.
Definition 2.0.4 (The linear process and moving average MA(∞)) (i) A time series is said to be a linear time series if it can be represented as

X_t = ∑_{j=−∞}^∞ ψ_j ε_{t−j},

where {ε_t} are iid random variables with finite variance.

(ii) The time series {X_t} has an MA(∞) representation if it satisfies

X_t = ∑_{j=0}^∞ ψ_j ε_{t−j},

where {ε_t} are iid random variables, ∑_{j=0}^∞ |ψ_j| < ∞ and E(|ε_t|) < ∞. If E(|ε_t|²) < ∞, then it is second order stationary.
The difference between an MA(∞) process and a linear process is quite subtle. The difference is that the linear process involves past, present and future innovations {ε_t}, whereas the MA(∞) process uses only present and past innovations. From a modelling perspective, the MA(∞) process has the better interpretation.

A very interesting class of models which have MA(∞) representations are the autoregressive and ARMA models. But in order to define this class we need to take a brief look at difference equations.
2.1 Difference equations and back-shift operators

The autoregressive and ARMA models are defined in terms of inhomogeneous difference equations. Often difference equations are defined in terms of backshift operators, so we start by defining them and showing how they work below. This representation can be very useful as it can be used to obtain a solution to the equations.
The autoregressive process (AR(p)) is defined as

X_t − φ_1 X_{t−1} − … − φ_p X_{t−p} = ε_t,

where {ε_t} are zero mean, finite variance random variables. Often the above is written as

X_t − φ_1 B X_t − … − φ_p B^p X_t = ε_t,   i.e.   φ(B) X_t = ε_t,

where φ(B) = 1 − ∑_{j=1}^p φ_j B^j and B is the backshift operator, defined such that B^k X_t = X_{t−k}. Simply rearranging φ(B)X_t = ε_t gives the solution of the equation as X_t = φ(B)^{−1} ε_t; however this is a simple algebraic manipulation. We need to investigate whether it really has any meaning. To do this, we start with an example.
Example: the AR(1) process

(i) Consider the AR(1) process

X_t = 0.5 X_{t−1} + ε_t.   (2.1)

Notice this is an equation (rather like 3x² + 2x + 1 = 0, or an infinite number of simultaneous equations), which may or may not have a solution. To obtain the solution we note that X_t = 0.5X_{t−1} + ε_t and X_{t−1} = 0.5X_{t−2} + ε_{t−1}. Using this we get X_t = ε_t + 0.5(0.5X_{t−2} + ε_{t−1}) = ε_t + 0.5ε_{t−1} + 0.5²X_{t−2}. Continuing this backward iteration we obtain at the kth iteration X_t = ∑_{j=0}^k (0.5)^j ε_{t−j} + (0.5)^{k+1} X_{t−k−1}. Because (0.5)^{k+1} → 0 as k → ∞, by taking the limit we can show that X_t = ∑_{j=0}^∞ (0.5)^j ε_{t−j} is almost surely finite and a solution of (2.1). Of course, like any other equation, one may wonder whether it is the unique solution (recalling that 3x² + 2x + 1 = 0 has two solutions). We show in a later example that it is the unique (causal) solution.

Now let us see whether we can obtain a solution using the difference equation representation. We recall that crudely taking inverses, the solution would be X_t = (1 − 0.5B)^{−1} ε_t. The obvious question is whether this has any meaning. Note that (1 − 0.5B)^{−1} = ∑_{j=0}^∞ (0.5B)^j for |B| < 2, hence substituting this power series expansion into X_t = (1 − 0.5B)^{−1} ε_t gives

X_t = (∑_{j=0}^∞ (0.5B)^j) ε_t = (∑_{j=0}^∞ 0.5^j B^j) ε_t = ∑_{j=0}^∞ (0.5)^j ε_{t−j},

which corresponds to the solution above. Hence the backshift operator in this example helps us to obtain a solution.

(ii) Now let us consider the equation

X_t = 2 X_{t−1} + ε_t.   (2.2)

Doing what we did in (i), we find that after the kth backward iteration we have X_t = ∑_{j=0}^k 2^j ε_{t−j} + 2^{k+1} X_{t−k−1}. However, unlike example (i), 2^k does not converge as k → ∞. This suggests that if we continue the iteration, X_t = ∑_{j=0}^∞ 2^j ε_{t−j} is not a quantity that is well defined (almost surely finite). Since it does not make much sense as the solution of an equation, X_t = ∑_{j=0}^∞ 2^j ε_{t−j} cannot be considered as a solution of (2.2).

However, rewriting (2.2) we have X_{t−1} = 0.5X_t − 0.5ε_t. Forward iterating this we get X_{t−1} = −(0.5)∑_{j=0}^k (0.5)^j ε_{t+j} + (0.5)^{k+1} X_{t+k}. Since (0.5)^{k+1} → 0, we have X_{t−1} = −(0.5)∑_{j=0}^∞ (0.5)^j ε_{t+j} as a solution of (2.2).

Let us see whether the difference equation can also offer a solution. Since (1 − 2B)X_t = ε_t, using the crude manipulation we have X_t = (1 − 2B)^{−1} ε_t. Now we see that (1 − 2B)^{−1} = ∑_{j=0}^∞ (2B)^j for |B| < 1/2. Using this expansion gives X_t = ∑_{j=0}^∞ 2^j B^j ε_t, but as we pointed out above this sum is not well defined. What we find is that φ(B)^{−1} ε_t only makes sense (is well defined) if the series expansion of φ(B)^{−1} converges in a region that includes the unit circle |B| = 1.

What we need is another series expansion of (1 − 2B)^{−1} which converges in a region which includes |B| = 1. We note that a function does not necessarily have a unique series expansion; it can have different series expansions which may converge in different regions. We now show that the appropriate series expansion will be in negative powers of B, not positive powers. Since (1 − 2B) = −(2B)(1 − (2B)^{−1}), we have (1 − 2B)^{−1} = −(2B)^{−1} ∑_{j=0}^∞ (2B)^{−j}, which converges for |B| > 1/2. Using this expansion we have

X_t = −∑_{j=0}^∞ (0.5)^{j+1} B^{−j−1} ε_t = −∑_{j=0}^∞ (0.5)^{j+1} ε_{t+j+1},

which, as we have shown above, is a well defined solution of (2.2).
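The two cases above can be checked numerically. The following is a rough sketch (Python with numpy; the truncation point K = 50 and the random seed are arbitrary choices, not part of the notes) which simulates case (i) by direct recursion and compares the simulated path with the truncated sum ∑_{j=0}^{K} (0.5)^j ε_{t−j}; the close agreement illustrates that the backward expansion really is the solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, burn, K = 200, 200, 50
eps = rng.normal(size=n + burn)

# Simulate X_t = 0.5 X_{t-1} + eps_t by direct recursion (case (i)).
x = np.zeros(n + burn)
for t in range(1, n + burn):
    x[t] = 0.5 * x[t - 1] + eps[t]

# Truncated MA(infinity) solution: sum_{j=0}^{K} 0.5^j eps_{t-j}.
weights = 0.5 ** np.arange(K + 1)
x_ma = np.array([weights @ eps[t - K:t + 1][::-1] for t in range(burn, n + burn)])

# After the burn-in the two constructions agree up to the (tiny) truncation error.
print(np.max(np.abs(x[burn:] - x_ma)))   # of the order 0.5^{K+1}
```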
Let us now summarise our observations for the general AR(1) process X_t = φX_{t−1} + ε_t. If |φ| < 1, then the solution is in terms of past values of {ε_t}; if on the other hand |φ| > 1, the solution is in terms of future values of {ε_t}. In terms of the polynomial φ(B) = 1 − φB (often called the characteristic polynomial), we are looking for regions which include the unit circle |B| = 1 in which the inverse φ(B)^{−1} has a convergent power series expansion. We see that if the root of φ(B) is greater than one in absolute value, then the power series of φ(B)^{−1} is in terms of positive powers; if it is less than one, then φ(B)^{−1} is in terms of negative powers. Generalising this argument to a general polynomial: if the roots of φ(B) all lie outside the unit circle, the power series of φ(B)^{−1} is in terms of positive powers (hence the solution φ(B)^{−1}ε_t will be in terms of past values of {ε_t}). If on the other hand the roots lie both outside and inside the unit circle (but none lie on the unit circle), the power series of φ(B)^{−1} will be in both negative and positive powers and the solution X_t = φ(B)^{−1}ε_t will be in terms of both past and future values of {ε_t}.

We see that the roots of the characteristic polynomial φ(B) define the solution of the AR process. We will show in Section 2.3.1 that they not only define the solution but also determine some of the characteristics of the time series.
Example 2.1.1 Suppose {X_t} satisfies

X_t = 0.75 X_{t−1} − 0.125 X_{t−2} + ε_t,

where {ε_t} are iid random variables. We want to obtain a solution for the above equations.

It is not easy to use the backward (or forward) iterating technique for AR processes beyond order one. This is where using the backshift operator becomes useful. We start by writing X_t = 0.75X_{t−1} − 0.125X_{t−2} + ε_t as φ(B)X_t = ε_t, where φ(B) = 1 − 0.75B + 0.125B², which leads to what is commonly known as the characteristic polynomial φ(z) = 1 − 0.75z + 0.125z². The solution is X_t = φ(B)^{−1}ε_t, if we can find a power series expansion of φ(B)^{−1} which is valid for |B| = 1.

We first observe that φ(z) = 1 − 0.75z + 0.125z² = (1 − 0.5z)(1 − 0.25z). Therefore by using partial fractions we have

1/φ(z) = 1/((1 − 0.5z)(1 − 0.25z)) = 2/(1 − 0.5z) − 1/(1 − 0.25z).

We recall from geometric expansions that

2/(1 − 0.5z) = 2 ∑_{j=0}^∞ (0.5)^j z^j,   |z| < 2,
1/(1 − 0.25z) = ∑_{j=0}^∞ (0.25)^j z^j,   |z| < 4.

Putting the above together gives

1/((1 − 0.5z)(1 − 0.25z)) = ∑_{j=0}^∞ (2(0.5)^j − (0.25)^j) z^j,   |z| < 2.

Since the above expansion is valid for |z| = 1, we have ∑_{j=0}^∞ |2(0.5)^j − (0.25)^j| < ∞ (see Lemma 2.1.1; this is also clear to see directly). Hence

X_t = [(1 − 0.5B)(1 − 0.25B)]^{−1} ε_t = (∑_{j=0}^∞ (2(0.5)^j − (0.25)^j) B^j) ε_t = ∑_{j=0}^∞ (2(0.5)^j − (0.25)^j) ε_{t−j},

which gives a stationary solution to the AR(2) process (see Lemma 2.0.1).
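The partial fraction coefficients can be cross-checked numerically. Below is a small sketch (Python; nothing in it is part of the notes' formal development) which computes the MA(∞) coefficients of X_t = 0.75X_{t−1} − 0.125X_{t−2} + ε_t by the standard recursion ψ_0 = 1, ψ_j = φ_1ψ_{j−1} + φ_2ψ_{j−2}, and compares them with the closed form 2(0.5)^j − (0.25)^j obtained above.

```python
import numpy as np

phi = [0.75, -0.125]          # AR(2) coefficients of X_t = 0.75 X_{t-1} - 0.125 X_{t-2} + eps_t
J = 20

# Recursively expand 1/phi(z): psi_0 = 1, psi_j = phi_1 psi_{j-1} + phi_2 psi_{j-2}.
psi = np.zeros(J + 1)
psi[0] = 1.0
for j in range(1, J + 1):
    psi[j] = sum(phi[i] * psi[j - i - 1] for i in range(len(phi)) if j - i - 1 >= 0)

# Closed form from the partial fraction expansion: 2 (0.5)^j - (0.25)^j.
closed = 2 * 0.5 ** np.arange(J + 1) - 0.25 ** np.arange(J + 1)

print(np.max(np.abs(psi - closed)))   # numerically zero
```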
The discussion above motivates how the backshift operator can be applied and how it can be used to obtain solutions to difference equations. We formalise this below. It is worth noting that if you pretty much understand the above, you do not have to worry much about the formal setting.

Definition 2.1.1 (Analytic functions) Suppose that z ∈ C. φ(z) is an analytic complex function in the region Ω if it has a power series expansion which converges in Ω, that is φ(z) = ∑_{j=−∞}^∞ φ_j z^j.

If there exists a function φ̃(z) = ∑_{j=−∞}^∞ φ̃_j z^j such that φ̃(z)φ(z) = 1 for all z ∈ Ω, then φ̃(z) is the inverse of φ(z) in the region Ω.

Well known examples of analytic functions include polynomials such as φ(z) = 1 + φ_1 z + φ_2 z² (analytic for all z ∈ C) and (1 − 0.5z)^{−1} = ∑_{j=0}^∞ (0.5z)^j for |z| < 2.
We observe that for AR processes we can represent the equation as φ(B)X_t = ε_t, which formally gives the solution X_t = φ(B)^{−1}ε_t. This raises the question of under what conditions on φ(B)^{−1} the expression φ(B)^{−1}ε_t is valid. For φ(B)^{−1}ε_t to make sense, φ(B)^{−1} should be represented as a power series expansion; we show below what conditions on the power series expansion give the solution. It is worth noting this is closely related to Lemma 2.0.1.
Lemma 2.1.1 Suppose that φ(z) = ∑_{j=−∞}^∞ φ_j z^j is finite on a region that includes |z| = 1 (hence it is analytic there) and {X_t} is a strictly stationary process with E|X_t| < ∞. Then ∑_{j=−∞}^∞ |φ_j| < ∞ and Y_t = φ(B)X_t = ∑_{j=−∞}^∞ φ_j X_{t−j} is an almost surely finite and strictly stationary time series.
PROOF. It can be shown that if sup_{|z|=1} |φ(z)| < ∞, in other words ∑_{j=−∞}^∞ φ_j z^j is finite on the unit circle, then ∑_{j=−∞}^∞ |φ_j| < ∞. Since the coefficients are absolutely summable, by Lemma 2.0.1 we have that Y_t = φ(B)X_t = ∑_{j=−∞}^∞ φ_j X_{t−j} is an almost surely finite and strictly stationary time series.
Rules of the back shift operator:

(i) If a(z) is analytic in a region Ω which includes the unit circle |z| = 1, and the unit circle is not on the boundary of Ω, then a(B)X_t is a well defined random variable.

(ii) The operator is commutative and associative, that is [a(B)b(B)]X_t = a(B)[b(B)X_t] = [b(B)a(B)]X_t (the square brackets are used to indicate which parts to multiply first). This may seem obvious, but remember matrices are not commutative!

(iii) Suppose that a(z) and its inverse 1/a(z) are both finite in a region which includes the unit circle |z| = 1. If a(B)X_t = Z_t, then X_t = (1/a(B)) Z_t.
Example 2.1.2 (Useful analytic functions) (i) Clearly a(z) = 1 − 0.5z is analytic for all z ∈ C, and has no zeros for |z| < 2. The inverse is 1/a(z) = ∑_{j=0}^∞ (0.5z)^j, which is well defined in the region |z| < 2.

(ii) Clearly a(z) = 1 − 2z is analytic for all z ∈ C, and has no zeros for |z| > 1/2. The inverse is 1/a(z) = −(2z)^{−1}(1 − 1/(2z))^{−1} = −(2z)^{−1}(∑_{j=0}^∞ (1/(2z))^j), which is well defined in the region |z| > 1/2.

(iii) The function a(z) = 1/((1 − 0.5z)(1 − 2z)) is analytic in the region 0.5 < |z| < 2.

(iv) a(z) = 1 − z is analytic for all z ∈ C, but is zero at z = 1. Hence its inverse is not well defined in regions which include |z| = 1.
The above is quite technical, but it allows us to obtain solutions for ARMA processes and
to derive conditions under which they are causal and invertible.
2.2 The ARMA model
We start by defining the ARMA process and then show that it has (under certain conditions) an MA(∞) representation.
Definition 2.2.1 (The AR, ARMA and MA processes) (i) The autoregressive AR(p) model: {X_t} satisfies

X_t = ∑_{i=1}^p φ_i X_{t−i} + ε_t.   (2.3)

Observe that we can write this as φ(B)X_t = ε_t.

(ii) The moving average MA(q) model: {X_t} satisfies

X_t = ε_t + ∑_{j=1}^q θ_j ε_{t−j}.   (2.4)

Observe that we can write this as X_t = θ(B)ε_t.

(iii) The autoregressive moving average ARMA(p, q) model: {X_t} satisfies

X_t − ∑_{i=1}^p φ_i X_{t−i} = ε_t + ∑_{j=1}^q θ_j ε_{t−j}.   (2.5)

We observe that we can write this as φ(B)X_t = θ(B)ε_t.
Example 2.2.1 (The AR(1) model) Consider the AR(1) process X_t = φX_{t−1} + ε_t, where |φ| < 1. It has the almost surely well defined, unique, stationary, causal solution X_t = ∑_{j=0}^∞ φ^j ε_{t−j}.

By iterating the difference equation, it is clear that X_t = ∑_{j=0}^∞ φ^j ε_{t−j} is a solution of X_t = φX_{t−1} + ε_t. We first need to show that it is well defined (that it is almost surely finite). We note that |X_t| ≤ ∑_{j=0}^∞ |φ^j| |ε_{t−j}|, so showing that ∑_{j=0}^∞ |φ^j| |ε_{t−j}| is almost surely finite will imply that |X_t| is almost surely finite. By monotone convergence we can exchange sum and expectation, and we have E(|X_t|) ≤ E(lim_{n→∞} ∑_{j=0}^n |φ^j ε_{t−j}|) = lim_{n→∞} ∑_{j=0}^n |φ^j| E|ε_{t−j}| = E(|ε_0|) ∑_{j=0}^∞ |φ^j| < ∞. Therefore, since E|X_t| < ∞, ∑_{j=0}^∞ φ^j ε_{t−j} is a well defined solution of X_t = φX_{t−1} + ε_t. To show that it is the unique (causal) solution, let us suppose there is another (causal) solution, call it Y_t (note that this part of the proof is useful to know, as such methods are often used when obtaining solutions of time series models). Clearly, by recursively applying the difference equation to Y_t, for every s we have

Y_t = ∑_{j=0}^s φ^j ε_{t−j} + φ^{s+1} Y_{t−s−1}.

Evaluating the difference between the two solutions gives Y_t − X_t = A_s − B_s, where A_s = φ^{s+1} Y_{t−s−1} and B_s = ∑_{j=s+1}^∞ φ^j ε_{t−j}, for all s. Now to show that Y_t and X_t coincide almost surely, we show that for every ε > 0, ∑_{s=1}^∞ P(|A_s − B_s| > ε) < ∞. By the Borel-Cantelli lemma, this implies that the event {|A_s − B_s| > ε} happens almost surely only finitely often. Since this holds for every ε > 0, we conclude Y_t = X_t almost surely. We now show that ∑_{s=1}^∞ P(|A_s − B_s| > ε) < ∞. We note that if |A_s − B_s| > ε, then either |A_s| > ε/2 or |B_s| > ε/2. Therefore P(|A_s − B_s| > ε) ≤ P(|A_s| > ε/2) + P(|B_s| > ε/2), and by using Markov's inequality we have P(|A_s − B_s| > ε) ≤ C|φ|^s/ε (note that since {Y_t} is assumed stationary, E|Y_t| ≤ E|ε_t|/(1 − |φ|) < ∞). Hence ∑_{s=1}^∞ P(|A_s − B_s| > ε) ≤ ∑_{s=1}^∞ C|φ|^s/ε < ∞, thus X_t = Y_t almost surely. Hence X_t = ∑_{j=0}^∞ φ^j ε_{t−j} is (almost surely) the unique causal solution.

We now consider a generalisation of the above example to ARMA processes.
Lemma 2.2.1 Let us suppose {X_t} is an ARMA(p, q) process. Then if the roots of the polynomial φ(z) lie outside the unit circle and are greater than 1 + δ in absolute value, then {X_t} almost surely has the solution

X_t = ∑_{j=0}^∞ a_j ε_{t−j},   (2.6)

where for j > q, a_j = [A^j]_{1,1} + ∑_{i=1}^q θ_i [A^{j−i}]_{1,1}, with

A =
( φ_1  φ_2  …  φ_{p−1}  φ_p )
(  1    0   …    …       0  )
(  ⋮          ⋱           ⋮  )
(  0    …    …    1       0  ),

and where ∑_j |a_j| < ∞ (we note that really a_j = a_j(φ, θ), since it is a function of the φ_i and θ_i). Moreover for all j,

|a_j| ≤ K ρ^j   (2.7)

for some finite constant K and 1/(1 + δ) < ρ < 1.

If the roots of θ(z) also have absolute value greater than 1 + δ, then (2.5) can be written as

X_t = ∑_{j=1}^∞ b_j X_{t−j} + ε_t,   (2.8)

where

|b_j| ≤ K ρ^j   (2.9)

for some finite constant K and some 0 < ρ < 1.
PROOF. We first show that if {X_t} comes from an ARMA process where the roots of φ(z) lie outside the unit circle, then it has the representation (2.6). There are several ways to prove the result. The proof we consider here is similar to the proof given in Example 2.2.1. We write the ARMA process as a vector difference equation

X_t = A X_{t−1} + ε_t,   (2.10)

where X_t′ = (X_t, …, X_{t−p+1}) and ε_t′ = (ε_t + ∑_{j=1}^q θ_j ε_{t−j}, 0, …, 0). Now iterating (2.10), we have

X_t = ∑_{j=0}^∞ A^j ε_{t−j};   (2.11)

concentrating on the first element of the vector X_t we see that

X_t = ∑_{i=0}^∞ [A^i]_{1,1} (ε_{t−i} + ∑_{j=1}^q θ_j ε_{t−i−j}).

Comparing (2.6) and the above, it is clear that for j > q, a_j = [A^j]_{1,1} + ∑_{i=1}^q θ_i [A^{j−i}]_{1,1}. Observe that the above representation is very similar to the AR(1) case given in Example 2.2.1. Indeed, as we will show below, A^j behaves in much the same way as φ^j in Example 2.2.1. As with φ^j, we will show that A^j converges to zero as j → ∞ (because the eigenvalues of A are less than one in absolute value). We now show that |X_t| ≤ K ∑_{j=0}^∞ ρ^j |ε_{t−j}| for some 0 < ρ < 1, which will imply that |a_j| ≤ K ρ^j. To bound |X_t| we will bound ‖X_t‖_2 (since |X_t| ≤ ‖X_t‖_2). Now using (2.11) gives

‖X_t‖_2 ≤ ∑_{j=0}^∞ ‖A^j‖_spec ‖ε_{t−j}‖_2.

Hence a bound for ‖A^j‖_spec gives a bound for |a_j| (note that ‖A‖_spec is the spectral norm of A, the square root of the largest eigenvalue of the symmetric matrix AA′). To get this bound we use a few tricks. Below we will show that the largest absolute eigenvalue of A is less than 1; this means that the largest absolute eigenvalue of A^j gets smaller as j grows, hence A^j is contracting. We formalise this now. To show that the largest eigenvalue of A is less than one, we consider det(A − zI) (which gives the eigenvalues of A):

det(A − zI) = z^p − ∑_{i=1}^p φ_i z^{p−i} = z^p (1 − ∑_{i=1}^p φ_i z^{−i}) = z^p φ(z^{−1}),

where φ(z) = 1 − ∑_{i=1}^p φ_i z^i is the characteristic polynomial of the AR part of the ARMA process. Since the roots of φ(z) lie outside of the unit circle, the roots of φ(z^{−1}) lie inside the unit circle and the eigenvalues of A are less than one in absolute value. Clearly if the absolute value of the smallest root of φ(z) is greater than 1 + δ, then the largest absolute eigenvalue of A is less than 1/(1 + δ) and the largest absolute eigenvalue of A^j is less than 1/(1 + δ)^j. We now show that ‖A^j‖_spec also decays at a geometric rate. It can be shown that if the largest absolute eigenvalue of A, denoted λ_max(A), satisfies λ_max(A) ≤ 1/(1 + δ), then there exists a ρ with 1/(1 + δ) < ρ < 1 such that ‖A^j‖_spec ≤ K ρ^j for all j > 0 (c.f. Moulines et al. (2005), Lemma 12). Therefore we have ‖X_t‖_2 ≤ K ∑_{j=0}^∞ ρ^j ‖ε_{t−j}‖_2 and |a_j| ≤ K ρ^j. To show that the solution is unique we use the same method given in Example 2.2.1.

To show (2.8) we use a similar proof, and omit the details.
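As an illustration of the vector (companion matrix) representation used in the proof, the following sketch (Python; the ARMA(2,1) parameters are arbitrary and chosen only so that the roots of φ(z) lie outside the unit circle) computes a_j = [A^j]_{1,1} + ∑_{i=1}^q θ_i [A^{j−i}]_{1,1} directly from the matrix A and checks it against the usual recursive expansion of θ(z)/φ(z).

```python
import numpy as np
from numpy.linalg import matrix_power

phi = np.array([0.5, 0.3])     # AR part; roots of 1 - 0.5z - 0.3z^2 lie outside the unit circle
theta = np.array([0.4])        # MA part
p, q, J = len(phi), len(theta), 25

# Companion matrix A: first row (phi_1, ..., phi_p), ones on the subdiagonal.
A = np.zeros((p, p))
A[0, :] = phi
A[1:, :-1] = np.eye(p - 1)

A11 = np.array([matrix_power(A, j)[0, 0] for j in range(J + 1)])

# a_j from the companion matrix formula (valid for j > q).
a_mat = np.array([A11[j] + sum(theta[i - 1] * A11[j - i] for i in range(1, q + 1))
                  for j in range(q + 1, J + 1)])

# a_j from the recursion for the coefficients of theta(z)/phi(z).
a_rec = np.zeros(J + 1)
a_rec[0] = 1.0
for j in range(1, J + 1):
    a_rec[j] = (theta[j - 1] if j <= q else 0.0) \
               + sum(phi[i - 1] * a_rec[j - i] for i in range(1, min(p, j) + 1))

print(np.max(np.abs(a_mat - a_rec[q + 1:])))   # numerically zero
```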
Remark 2.2.1 As we mentioned in the proof of Lemma 2.2.1, there are several methods to prove the result. Another method uses the fact that the roots of φ(z) lie outside the unit circle, so that a power series expansion of 1/φ(z) can be made. Therefore we can obtain the coefficients a_j by considering the coefficients of the power series of θ(z)/φ(z). Using this method it may not be immediately obvious that the coefficients in the MA(∞) expansion decay exponentially. We will clarify this here.

Let us denote the power series expansion as 1/φ(z) = ∑_{j=0}^∞ ψ_j z^j. We note that in the case that the roots of the characteristic polynomial φ(z) are λ_1, …, λ_p and are distinct, then 1/φ(z) = ∑_{j=0}^∞ (∑_{k=1}^p C_k λ_k^{−j}) z^j, for some constants C_k. It is clear in this case that the coefficients of 1/φ(z) decay exponentially fast, that is for some constant C we have |ψ_j| ≤ C (min_k |λ_k|)^{−j}.

However, in the case that the roots of φ(z) are not necessarily distinct, say λ_1, …, λ_s with multiplicities m_1, …, m_s (∑_k m_k = p), then 1/φ(z) = ∑_{j=0}^∞ (∑_{k=1}^s λ_k^{−j} P_{m_k}(j)) z^j, where P_{m_k}(j) is a polynomial of order m_k. Despite the appearance of the polynomial term in the expansion, the coefficients ψ_j still decay exponentially fast. It can be shown that for any ρ > (min_k |λ_k|)^{−1} there exists a constant C such that |ψ_j| ≤ C ρ^j (we can see this if we make an expansion of (λ_k + δ)^{−j}, where δ is any small quantity). Hence the influence of the polynomial terms P_{m_k}(j) in the power series expansion is minimal.
Remark 2.2.2 In the case that the roots of φ(z) do not lie on the unit circle, with the smallest root outside the unit circle having absolute value greater than (1 + δ_1) and the largest root inside the unit circle having absolute value less than (1 − δ_2), then 1/φ(z) has a Laurent series expansion 1/φ(z) = ∑_{j=−∞}^∞ ψ_j z^j which converges for 1/(1 + δ_1) ≤ |z| ≤ 1/(1 − δ_2). Hence {X_t} has the solution X_t = φ(B)^{−1}θ(B)ε_t = ∑_{j=−∞}^∞ a_j ε_{t−j}, where the coefficients a_j are obtained from the expansion of θ(z)/φ(z).
We note that in Lemma 2.2.1 we assumed that the roots of the characteristic polynomial φ(z) lie outside the unit circle |z| = 1. This essentially imposes a causality condition on the solution. When the roots do not necessarily lie outside the unit circle the solution is no longer ∑_{j=0}^∞ a_j ε_{t−j} but ∑_{j=−∞}^∞ a_j ε_{t−j}, hence we go from the MA(∞) class to the more general linear process. We define below causality and a closely related concept called invertibility.

Definition 2.2.2 (Causality and Invertibility) (i) Causality: A process is called causal if it can be written as the MA(∞) process X_t = ∑_{j=0}^∞ ψ_j ε_{t−j}.

(ii) Invertibility: A process is called invertible if it can be written as an AR(∞) process, that is X_t = ∑_{j=1}^∞ φ_j X_{t−j} + ε_t, where ∑_{j=1}^∞ |φ_j| < ∞.

Typically we will consider processes which are causal. Invertibility is a closely related concept which says that X_t can be represented in terms of previous values of X_t and an innovation which is independent of the past. The following result states when an ARMA process is invertible.

Lemma 2.2.2 An ARMA process is invertible if the roots of θ(z) lie outside the unit circle and causal if the roots of φ(z) lie outside the unit circle.

One of the main advantages of the invertibility property is in prediction and estimation. We will consider this in detail below. It is worth noting that even if an ARMA process is not invertible, one can generate a time series which has an identical correlation structure but is invertible (see Section 2.4).
2.3 The autocovariance function
The autocovariance function (ACF) is defined as the sequence of covariances of a stationary process. That is, suppose that {X_t} is a stationary process with mean zero; then {c(k) : k ∈ Z} is the ACF of {X_t}, where c(k) = E(X_0 X_k). Clearly different time series give rise to different features in the ACF. We will explore some of these features below. First we consider a general result on the covariance of a causal ARMA process.

We evaluate the covariance of an ARMA process using its MA(∞) representation. Let us suppose that {X_t} is a causal ARMA process; then it has the representation in (2.6) (where the roots of φ(z) have absolute value greater than 1 + δ). Using (2.6) and the independence of the {ε_t} we have

cov(X_t, X_τ) = cov(∑_{j=0}^∞ a_j ε_{t−j}, ∑_{j=0}^∞ a_j ε_{τ−j})   (2.12)
             = ∑_{j=0}^∞ a_j a_{j+|t−τ|} var(ε_t).   (2.13)

Using (2.7) we have

|cov(X_t, X_τ)| ≤ var(ε_t) K² ∑_{j=0}^∞ ρ^j ρ^{j+|t−τ|} = var(ε_t) K² ρ^{|t−τ|} / (1 − ρ²),   (2.14)

for any 1/(1 + δ) < ρ < 1.
The above bound is useful and will be used in several proofs below. However, other than telling us that the ACF decays exponentially fast, it is not very enlightening about the features of the process. In the following we consider the ACF of an autoregressive process. So far we have used the characteristic polynomial associated with an AR process to determine whether it is causal. Now we show that the roots of the characteristic polynomial also give information about the ACF and what a typical realisation of an autoregressive process could look like.
2.3.1 The autocovariance of an autoregressive process
Let us consider the zero mean causal AR(p) process {X_t}, where

X_t = ∑_{j=1}^p φ_j X_{t−j} + ε_t.   (2.15)

Now, given that {X_t} is causal, we can derive a recursion for the covariances. It can be shown that multiplying both sides of the above equation by X_{t−k} (k > 0) and taking expectations gives the equation

E(X_t X_{t−k}) = ∑_{j=1}^p φ_j E(X_{t−j} X_{t−k}) + E(ε_t X_{t−k}) = ∑_{j=1}^p φ_j E(X_{t−j} X_{t−k}).

These are the Yule-Walker equations; we will discuss them in detail when we consider estimation. For now, letting c(k) = E(X_0 X_k) and using the above, we see that the autocovariance satisfies the homogeneous difference equation

c(k) − ∑_{j=1}^p φ_j c(k − j) = 0,   (2.16)

for k > 0. In other words, the autocovariance function of {X_t} is a solution of this difference equation. The study of difference equations is an entire field of research, however we will now scratch the surface to obtain a solution of (2.16). Solving (2.16) is very similar to solving homogeneous differential equations, which some of you may be familiar with (do not worry if you are not). Now consider the characteristic polynomial of the AR process, 1 − ∑_{j=1}^p φ_j z^j = 0, which has the roots λ_1, …, λ_p. The roots of the characteristic polynomial give the solution to (2.16). It can be shown that if the roots are distinct (not the same) the solution of (2.16) is

c(k) = ∑_{j=1}^p C_j λ_j^{−k},   (2.17)

where the constants C_j are chosen depending on the initial values {c(k) : 1 ≤ k ≤ p} and ensure that c(k) is real (recalling that the λ_j can be complex). In the case that the roots are not distinct, let the roots be λ_1, …, λ_s with multiplicities m_1, …, m_s (∑_k m_k = p). In this case the solution is

c(k) = ∑_{j=1}^s λ_j^{−k} P_{m_j}(k),   (2.18)
[Figure 2.1: The ACF of the time series X_t = 1.5X_{t−1} − 0.75X_{t−2} + ε_t; x-axis: lag (0-50), y-axis: acf (roughly −0.4 to 1).]
where P_{m_j}(k) is an m_j-th order polynomial and the coefficients C_j are now hidden inside P_{m_j}(k).

We now study the covariance in greater detail and see what it tells us about a realisation. As a motivation consider the following example.
Example 2.3.1 Consider the AR(2) process

X_t = 1.5 X_{t−1} − 0.75 X_{t−2} + ε_t,   (2.19)

where {ε_t} are iid random variables with mean zero and variance one. The corresponding characteristic polynomial is 1 − 1.5z + 0.75z², which has roots 1 ± i 3^{−1/2} = √(4/3) exp(±iπ/6). Using the discussion above we see that the autocovariance function of {X_t} is

c(k) = (√(4/3))^{−k} (C_1 exp(ikπ/6) + C̄_1 exp(−ikπ/6)),

for a particular value of C_1. Now write C_1 = a exp(ib); then the above can be written as

c(k) = 2a (√(4/3))^{−k} cos(kπ/6 + b).

We see that the covariance decays at an exponential rate, but there is a periodicity in this decay. This means that observations separated by a lag k = 12 are closely correlated (similar in value), which suggests a quasi-periodicity in the time series. The ACF of the process is given in Figure 2.1; notice that it decays to zero but also observe that it undulates. A plot of a realisation of the time series is given in Figure 2.2; notice the quasi-periodicity at frequency π/6 (that is, a period of roughly 12 observations).
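The damped, undulating ACF in this example can be reproduced numerically. The sketch below (Python; the seed, sample size and burn-in are arbitrary and purely illustrative) computes the sample autocorrelations of X_t = 1.5X_{t−1} − 0.75X_{t−2} + ε_t from a long simulation and shows the oscillation with period roughly 12 described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, burn = 20000, 500
eps = rng.normal(size=n + burn)

# Simulate the causal AR(2) process X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + eps_t.
x = np.zeros(n + burn)
for t in range(2, n + burn):
    x[t] = 1.5 * x[t - 1] - 0.75 * x[t - 2] + eps[t]
x = x[burn:]

# Sample autocorrelations up to lag 24.
x = x - x.mean()
acf = np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x) for k in range(25)])
print(np.round(acf, 2))
# The sequence decays geometrically (rate (3/4)^{k/2}) and changes sign roughly every
# six lags, so that lags around k = 12 are again positively correlated.
```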
We now generalise the above example. Let us consider the general AR(p) process defined in (2.15). Suppose the roots of the corresponding characteristic polynomial are distinct, and let us split them into real and complex roots. Because the characteristic polynomial is comprised of
[Figure 2.2: A simulation of the time series X_t = 1.5X_{t−1} − 0.75X_{t−2} + ε_t; x-axis: time (0-144), y-axis: simulated values.]
real coefficients, the complex roots come in complex conjugate pairs. Hence let us suppose the real roots are {λ_j}_{j=1}^r and the complex roots are the conjugate pairs {λ_j, λ̄_j}_{j=r+1}^{r+(p−r)/2}. The covariance (2.17) can then be written as

c(k) = ∑_{j=1}^r C_j λ_j^{−k} + ∑_{j=r+1}^{r+(p−r)/2} a_j |λ_j|^{−k} cos(kθ_j + b_j),   (2.20)

where for j > r we write λ_j = |λ_j| exp(iθ_j), and a_j and b_j are real constants. Notice that, as in the example above, the covariance decays exponentially with the lag, but there is undulation. A typical realisation from such a process will be quasi-periodic with frequencies θ_{r+1}, …, θ_{r+(p−r)/2}, though the magnitude of each period will vary.

An interesting discussion on the covariances of an AR process and realisations of an AR process is given in Shumway and Stoffer (2006), Chapter 3.3 (it uses the example above). A discussion of difference equations is also given in Brockwell and Davis (1998), Sections 3.3 and 3.6 and Fuller (1995), Section 2.4.
2.3.2 The autocovariance of a moving average process
Suppose that {X_t} satisfies

X_t = ε_t + ∑_{j=1}^q θ_j ε_{t−j}.

The covariance is

cov(X_t, X_{t−k}) = var(ε_t) ∑_{i=0}^q θ_i θ_{i−k}   for k = −q, …, q,   and   cov(X_t, X_{t−k}) = 0 otherwise,

where θ_0 = 1 and θ_i = 0 for i < 0 and i > q. Therefore we see that there is no correlation when the lag between X_t and X_{t−k} is greater than q.
2.3.3 The autocovariance of an autoregressive moving average process
We see from the above that an MA(q) model is only really suitable when we believe that there is no correlation between two random variables separated by more than a certain distance. Often autoregressive models are fitted. However, in several applications we find that autoregressive models of a very high order are needed to fit the data. If a very long autoregressive model is required, a more suitable model may be the autoregressive moving average process. It has several of the properties of an autoregressive process, but can be more parsimonious than a long autoregressive process. In this section we consider the ACF of an ARMA process.

Let us suppose that the causal time series {X_t} satisfies the equations

X_t − ∑_{i=1}^p φ_i X_{t−i} = ε_t + ∑_{j=1}^q θ_j ε_{t−j}.

We now define a recursion for the ACF which is similar to the ACF recursion for AR processes. Let us suppose that k > q; then it can be shown that the autocovariance function of the ARMA process satisfies

E(X_t X_{t−k}) − ∑_{i=1}^p φ_i E(X_{t−i} X_{t−k}) = 0.

Now when 1 ≤ k ≤ q we have

E(X_t X_{t−k}) − ∑_{i=1}^p φ_i E(X_{t−i} X_{t−k}) = ∑_{j=1}^q θ_j E(ε_{t−j} X_{t−k}) = ∑_{j=k}^q θ_j E(ε_{t−j} X_{t−k}).

We recall that X_t has the MA(∞) representation X_t = ∑_{j=0}^∞ a_j ε_{t−j} (see (2.6)), therefore for k ≤ j ≤ q we have E(ε_{t−j} X_{t−k}) = a_{j−k} var(ε_t) (where a(z) = θ(z)φ(z)^{−1}). Altogether the above gives the difference equations

c(k) − ∑_{i=1}^p φ_i c(k − i) = var(ε_t) ∑_{j=k}^q θ_j a_{j−k},   for 1 ≤ k ≤ q,   (2.21)
c(k) − ∑_{i=1}^p φ_i c(k − i) = 0,   for k > q,

where c(k) = E(X_0 X_k). Now since for k > q this is a homogeneous difference equation, the solution is (as in (2.18))

c(k) = ∑_{j=1}^s λ_j^{−k} P_{m_j}(k),

where λ_1, …, λ_s, with multiplicities m_1, …, m_s (∑_k m_k = p), are the roots of the characteristic polynomial 1 − ∑_{j=1}^p φ_j z^j. The coefficients in the polynomials P_{m_j} are determined by the initial conditions given in (2.21).

You can also look at Brockwell and Davis (1998), Chapter 3.3 and Shumway and Stoffer (2006), Chapter 3.4.
2.3.4 The partial covariance
We see that by using the autocovariance function we are able to identify the order of an MA(q) process: recall that when the covariance lag is greater than q the covariance is zero. However, the same is not true for AR(p) processes; the autocovariances do not enlighten us on the order p. However, a variant of the autocovariance, called the partial autocovariance, is quite informative about the order of an AR(p) process. We will consider the partial autocovariance in this section.

In order to define the partial correlation we need to introduce the idea of projection onto a subspace. We will investigate the idea of projections quite thoroughly in Section 3, but we briefly introduce the concept here. The projection of X_t onto the space spanned by X_s, X_{s+1}, …, X_{s+k} is the best linear predictor of X_t given X_s, …, X_{s+k}. We will denote the projection of X_t onto the space spanned by X_s, X_{s+1}, …, X_{s+k} by

P_{sp(X_s, …, X_{s+k})} X_t = ∑_{j=0}^k a_j X_{s+j},

where the a_j minimise the mean squared error E(X_t − ∑_{j=0}^k a_j X_{s+j})². Having defined the notion of projection we can now define the partial correlation.

The partial correlation between X_t and X_{t+k} (where k > 0) is the correlation between X_t and X_{t+k}, conditioning out all the random variables between X_t and X_{t+k}. More precisely, it is defined via

cov(X_{t+k} − P_{sp(X_{t+1}, …, X_{t+k−1})} X_{t+k}, X_t − P_{sp(X_{t+1}, …, X_{t+k−1})} X_t).

We first consider an example.
Example 2.3.2 Consider the causal AR(1) process X_t = 0.5X_{t−1} + ε_t, where E(ε_t) = 0 and var(ε_t) = 1. Using (2.12) it can be shown that cov(X_t, X_{t−2}) = 0.5²/(1 − 0.5²), which is non-zero (compare with the MA(1) process X_t = ε_t + 0.5ε_{t−1}, where the covariance cov(X_t, X_{t−2}) = 0). Now let us consider the partial covariance between X_t and X_{t−2}. Remember we have to condition out the random variables in between, which in this case is X_{t−1}. It is clear that the projection of X_t onto X_{t−1} is 0.5X_{t−1} (since X_t = 0.5X_{t−1} + ε_t). Therefore X_t − P_{sp(X_{t−1})} X_t = X_t − 0.5X_{t−1} = ε_t. The projection of X_{t−2} onto X_{t−1} is a little more complicated; it is P_{sp(X_{t−1})} X_{t−2} = [E(X_{t−1} X_{t−2}) / E(X²_{t−1})] X_{t−1}. Therefore the partial covariance between X_t and X_{t−2} is

cov(X_t − P_{sp(X_{t−1})} X_t, X_{t−2} − P_{sp(X_{t−1})} X_{t−2}) = cov(ε_t, X_{t−2} − [E(X_{t−1} X_{t−2}) / E(X²_{t−1})] X_{t−1}) = 0.

In fact the above is true for the partial covariance between X_t and X_{t−k}, for all k ≥ 2. Hence we see that despite the autocovariance of an AR(1) process not being zero at lags greater than or equal to two, the partial covariance is zero for all lags greater than or equal to two.
Using the same argument as above, it is easy to show that the partial covariance of an AR(p) process for lags greater than p is zero. Hence in many respects the partial covariance can be considered as an analogue of the autocovariance. It should be noted that though the covariance of an MA(q) process is zero for lags greater than q, the same is not true of its partial covariance. Whereas partialling out removes correlation for autoregressive processes, it seems to add correlation for moving average processes! In summary (a numerical illustration is sketched below):

If the autocovariances after a certain lag q are zero, it may be appropriate to fit an MA(q) model to the time series.

The autocovariances of any AR(p) process will decay but not be zero.

If the partial autocovariances after a certain lag p are zero, it may be appropriate to fit an AR(p) model to the time series.

The partial autocovariances of any MA(q) process will decay but not to zero.
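As a numerical illustration of these rules of thumb, the sketch below (Python; the estimated partial autocorrelations are obtained by solving the Yule-Walker system directly rather than via the Durbin-Levinson recursion of Section 3.2, and the AR(2) model used is arbitrary) estimates the partial autocorrelations of a simulated AR(2) process and shows that they are negligible beyond lag 2.

```python
import numpy as np

def sample_acf(x, max_lag):
    x = x - x.mean()
    return np.array([np.dot(x[:len(x) - k], x[k:]) / np.dot(x, x) for k in range(max_lag + 1)])

def pacf_from_acf(r, max_lag):
    # phi_{k,k}: last coefficient of the best linear predictor of order k,
    # obtained by solving the Yule-Walker system R_k a = r_k.
    out = []
    for k in range(1, max_lag + 1):
        R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])
        a = np.linalg.solve(R, r[1:k + 1])
        out.append(a[-1])
    return np.array(out)

rng = np.random.default_rng(4)
n, burn = 20000, 500
eps = rng.normal(size=n + burn)
x = np.zeros(n + burn)
for t in range(2, n + burn):                     # AR(2): X_t = 1.2 X_{t-1} - 0.35 X_{t-2} + eps_t
    x[t] = 1.2 * x[t - 1] - 0.35 * x[t - 2] + eps[t]
x = x[burn:]

r = sample_acf(x, 10)
print(np.round(pacf_from_acf(r, 10), 3))         # lags 1 and 2 large, lags >= 3 close to zero
```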
It is interesting to note that the partial covariance is closely related to the coefficients in linear prediction. Suppose that {X_t} is a stationary time series, and we consider the projection of X_{t+1} onto the space spanned by X_t, …, X_1 (the best linear predictor). The projection is

P_{sp(X_t, …, X_1)} X_{t+1} = ∑_{j=1}^t φ_{t,j} X_{t+1−j}.

Then, from the proof of the Durbin-Levinson algorithm in Section 3.2, it can be shown that

φ_{t,t} = cov(X_{t+1} − P_{sp(X_t, …, X_2)} X_{t+1}, X_1 − P_{sp(X_t, …, X_2)} X_1) / E(X_{t+1} − P_{sp(X_t, …, X_2)} X_{t+1})².

Hence the last coefficient in the prediction is the (normalised) partial covariance. For further reading see Shumway and Stoffer (2006), Section 3.4 and Brockwell and Davis (1998), Section 3.4.

It is worth noting that the partial covariance (correlation) is often used to decide whether there is direct (linear) dependence between random variables. It has several applications, for example in MRI data, where the partial coherence (a closely related concept) is often investigated.
2.4 The autocovariance function, invertibility and causality
Here we demonstrate that it is very difficult to identify whether a process is noninvertible/noncausal from its covariance structure. Hence for most purposes one can usually suppose a process is both causal and invertible (though in a series of papers Richard Davis and coauthors have discussed the advantages of fitting noncausal processes). To show this we will require the definition of the spectral density function. We briefly introduce it here but return to it and consider it in depth in later sections.
Definition 2.4.1 (The spectral density) Given the covariances c(k), the spectral density function is defined as

f(ω) = ∑_k c(k) exp(ikω).

The covariances can be obtained from the spectral density by using the inverse Fourier transform

c(k) = (1/2π) ∫_0^{2π} f(ω) exp(−ikω) dω.

Hence the covariance yields the spectral density and vice versa.
We will show later in the course that the spectral density of the ARMA process which satisfies

X_t − ∑_{i=1}^p φ_i X_{t−i} = ε_t + ∑_{j=1}^q θ_j ε_{t−j}

and does not have any roots on the unit circle is

f(ω) = |1 + ∑_{j=1}^q θ_j exp(ijω)|² / |1 − ∑_{j=1}^p φ_j exp(ijω)|².   (2.22)
Now let us suppose the roots of the characteristic polynomial 1 − ∑_{j=1}^p φ_j z^j are {λ_j}_{j=1}^p and the roots of 1 + ∑_{j=1}^q θ_j z^j are {μ_j}_{j=1}^q; hence 1 − ∑_{j=1}^p φ_j z^j = ∏_{j=1}^p (1 − λ_j^{−1} z) and 1 + ∑_{j=1}^q θ_j z^j = ∏_{j=1}^q (1 − μ_j^{−1} z). Then (2.22) can be written as

f(ω) = ∏_{j=1}^q |1 − μ_j^{−1} exp(iω)|² / ∏_{j=1}^p |1 − λ_j^{−1} exp(iω)|².   (2.23)
Suppose that the roots μ_1, …, μ_r lie outside the unit circle and the roots μ_{r+1}, …, μ_q lie inside the unit circle. Similarly, suppose λ_1, …, λ_s lie outside the unit circle and the roots λ_{s+1}, …, λ_p lie inside the unit circle. Clearly if r < q the process {X_t} is not invertible (roots of the MA part lie inside the unit circle) and if s < p the process is not causal (roots of the AR part lie inside the unit circle). We now construct a new process, based on {X_t}, which is both causal and invertible and has the spectral density (2.23) (up to a multiplicative constant). Define the polynomials

θ̃(z) = [∏_{j=1}^r (1 − μ_j^{−1} z)] [∏_{j=r+1}^q (1 − μ_j z)],
φ̃(z) = [∏_{j=1}^s (1 − λ_j^{−1} z)] [∏_{j=s+1}^p (1 − λ_j z)].

The roots of θ̃(z) are {μ_j}_{j=1}^r and {μ_j^{−1}}_{j=r+1}^q, and the roots of φ̃(z) are {λ_j}_{j=1}^s and {λ_j^{−1}}_{j=s+1}^p. Clearly the roots of both θ̃(z) and φ̃(z) lie outside the unit circle. Now define the process {X̃_t} by

φ̃(B) X̃_t = θ̃(B) ε_t.
Clearly {X̃_t} is an ARMA process which is both causal and invertible. Let us consider the spectral density of {X̃_t}; using (2.22) we have

f̃(ω) = [∏_{j=1}^r |1 − μ_j^{−1} exp(iω)|² ∏_{j=r+1}^q |1 − μ_j exp(iω)|²] / [∏_{j=1}^s |1 − λ_j^{−1} exp(iω)|² ∏_{j=s+1}^p |1 − λ_j exp(iω)|²].

We observe that |1 − μ_j exp(iω)|² = |μ_j exp(iω)|² |μ_j^{−1} exp(−iω) − 1|² = |μ_j|² |1 − μ_j^{−1} exp(−iω)|² (and similarly for the λ_j terms). Therefore we can rewrite f̃(ω) as

f̃(ω) = [∏_{i=r+1}^q |μ_i|² / ∏_{i=s+1}^p |λ_i|²] × [∏_{j=1}^q |1 − μ_j^{−1} exp(iω)|² / ∏_{j=1}^p |1 − λ_j^{−1} exp(iω)|²] = [∏_{i=r+1}^q |μ_i|² / ∏_{i=s+1}^p |λ_i|²] f(ω).

Hence {X̃_t} and {X_t} have the same spectral density up to a multiplicative constant. The multiplicative constant can be absorbed into the variance of the innovation, which we now normalise. Hence we define a new process {X*_t} where

φ̃(B) X*_t = θ̃(B) (∏_{i=s+1}^p |λ_i|² / ∏_{i=r+1}^q |μ_i|²)^{1/2} ε_t;

then the processes {X*_t} and {X_t} have identical spectral densities. As was mentioned above, since the spectral density determines the covariance, the covariances of {X_t} and {X*_t} are also the same. This means that based only on the covariances it is not possible to distinguish between a causal (invertible) process and a noncausal (noninvertible) process.

In this course we will always assume the ARMA process is invertible and causal; however it is worth bearing in mind that there can arise situations where noncausal (noninvertible) processes may be more appropriate.
Denition 2.4.2 An ARMA process is said to have minimum phase when the roots of (z) and
(z) both lie outside of the unit circle.
In the case that the roots of (z) lie on the unit circle, then X
t
is known as a unit root
process.
28
Chapter 3
Prediction
In this chapter we will consider prediction for stationary time series. The idea is to nd the
best linear predictor of X
t
given the previous observations X
t1
, . . . , X
1
. This is known as the
one-step ahead predictor, as we are prediction only one-step ahead of the known observations.
A very interesting application of the one-step ahead predictor is that it is has several useful
applications in estimation too, which we will consider later. A rather simple generalisation
of the one-step ahead predictor is the n-step ahead predictor. Once we have established the
one-step ahead predictor, it is easy to generalise to n-step. First some notation, we use
X
t+1|t
= BestLin(X
t+1
[X
t
, . . . , X
1
) = X
t+1|t,...,1
=
t

j=1

t,j
X
t+1j
, (3.1)
where
t,j
are chosen to minimise the mean squared error E(X
t+1

t
j=1
a
t,j
X
t+1j
)
2
. The
mean squared error E(X
t+1

t
j=1
a
t,j
X
t+1j
)
2
is known as the one-step ahead prediction error.
The predictors we will consider are for stationary time series. The rst is for any general
stationary time series (but it has interesting applications for AR processes) and the second is for
ARMA processes. A general prediction scheme for any type of time series (not necessarily sta-
tionary) called the Innovations Algorithm is considered in Brockwell and Davis (1998), Chapter
5.
3.1 Basis and linear vector spaces
Before we continue we rst discuss briey the idea of a vector spaces, spans and basis. A more
rigours approach is given in Brockwell and Davis (1998), Chapter 2, and any good linear algebra
book. However what is outlined here should be sucient for the course.
First a quick denition of a vector space. A is a vector space if for every x, y A and
a, b R, then ax +by A (and ideal example of a vector space is R
n
). A normed linear vector
space (usually called a Hilbert space), is a vector space dened with a norm (or inner product).
The norm satises a set of conditions, I wont give them here, but a good example of a normed
vector space is R
n
where the inner product between two vectors x, y R
n
is the inner product
< x, y >=

n
i=1
x
i
y
i
. In this course, the normed vector spaces we will be considering are vector
spaces comprising of random variables, and the inner product between two random variables in
29
the space is the covariance. From now on we will concentrate on spaces of random variables
which have a nite variance.
We say that the random variables X
t
, X
t1
, . . . , X
1
spans the space A
1
t
if for any Y A
t
,
there exists coecients a
j
such that
Y =
t

j=1
a
j
X
t+1j
. (3.2)
Conversely, the random variables X
t
, . . . , X
1
can be used to dene a vector space. That is we
dene the space A
1
t
, where Y A
1
t
if and only if there exists coecients a
j
with

j
a
2
j
<
such that Y =

t
j=1
a
j
X
t+1j
. We often write A
1
t
= sp(X
t
, . . . , X
1
) to denote the space spanned
by X
t
, X
t1
, . . . , X
1
. The basis of a vector space is closely related to a span. X
t
, . . . , X
1

is a basis of A
1
t
if (3.2) is true, however if X
t
, . . . , X
1
is a basis this representation is unique.
That is there does not exist another set of coecients b
j
such that Y =

t
j=1
b
j
X
t+1j
. For
this reason one can consider a basis as the minimal span, that is the smallest set of elements
which can span a space.
Denition 3.1.1 (Projections) We note that the projection of Y onto the space spanned by
(
X
t
, X
t1
, . . . , X
1
), is P(
X
t
,X
t1
,...,X
1
)
Y
t
=

t
j=1
c
j
X
t+1j
, where c
j
is chosen such that the
dierence Y P(
X
t
,X
t1
,...,X
1
)
Y
t
is uncorrelated (orthogonal) to any element in
(
X
t
, X
t1
, . . . , X
1
).
3.1.1 Orthogonal basis
In the context of what we will be doing, the most interesting example of an orthogonal basis is
related to the best linear predictor. We recall that X
t+1|t
is the best linear predictor of X
t+1
given
X
t
, . . . , X
1
(it is the projection of X
t+1
onto A
1
t
(the space spanned by X
t
, . . . , X
1
). Therefore
no (linear) information about X
t
, . . . , X
1
is contained in the dierence X
t+1
X
t+1|t
. In other
words X
t+1
X
t+1|t
and X
s
(1 s t) are orthogonal (cov(X
s
, (X
t+1
X
t+1|t
)). That is the
space spanned by sp(X
t+1
X
t+1|t
) and sp(X
t
, . . . , X
1
) are orthogonal (think perpendicular).
Continuing this argument we see that (X
t
X
t|t1
), . . . , (X
2
X
2|1
), X
1
are orthogonal random
variables (E((X
t
X
t|t1
)(X
s
X
s|s1
)) = 0 if s ,= t).
To see that (X
t
X
t|t1
), . . . , (X
2
X
2|1
), X
1
and X
t
, . . . , X
1
span the same space. We
now dene the sum of spaces. If U and V are two orthogonal vector spaces (which share the
same norm), then y U V , if there exists a u U and v V such that y = u + v. By
the denition of A
1
t
, it is clear that (X
t
X
t|t1
) A
1
t
, but (X
t
X
t|t1
) / A
1
t1
. Hence
A
1
t
= sp(X
t
X
t|t1
) A
1
t1
. Continuing this argument we see that A
1
t
= sp(X
t
X
t|t1
)
sp(X
t1
X
t1|t2
), . . . , sp(X
1
). Hence sp(X
t
, . . . , X
1
) = sp(X
t
X
t|t1
, . . . , X
2
X
2|1
, X
1
)
(the spaces spanned by (X
t
X
t|t1
), . . . , (X
2
X
2|1
), X
1
and X
t
, . . . , X
1
are the same). That
is there exist coecients b
j
such that
Y =
t

j=1
a
j
X
t+1j
=
t1

j=1
b
j
(X
t+1j
X
t+1j|tj
) +b
t
X
1
.
A useful application of orthogonal basis is the ease of obtaining the coecients b
j
. In other
words if Y =

t1
j=1
b
j
(X
t+1j
X
t+1j|tj
) +b
t
X
1
, then b
j
can be immediately obtained as
b
j
= E(Y (X
j
X
j|j1
))/E(X
j
X
j|j1
))
2
30
. Note that this is not necessarily the case for obtaining the coecients a
j
and that
_
_
_
a
1
.
.
.
a
t
_
_
_
=
1
t
r
t
(3.3)
where (
t
)
i,j
= E(X
i
X
j
) and (r
t
)
i
= E(X
i
Y ). The problem with using the orthogonal represen-
tation (X
t
X
t|t1
), . . . , (X
2
X
2|1
), X
1
, is that it is not easy to obtain E(Y (X
j
X
j|j1
))
and E(X
j
X
j|j1
))
2
.
3.1.2 Spaces spanned by innite number of elements
These ideas can be generalised to spaces which have an innite number of elements (random
variables) in their basis. Let now construct the space spanned by innite number random
variables X
t
, X
t1
, . . .. As always we need to dene precisely what we mean by an innite
basis. To do this we construct a sequence of subspaces all with an increasing number in the
basis and consider the limit of this space. Let A
n
t
= sp(X
t
, . . . , X
n
). Clearly if m > n, then
A
n
t
A
m
t
. Now we dene X
infty
t
=

n=1
A
n
t
. However we need to close this space, the
space needs to be complete, that is a the limit of a converging sequence must also belong to this
space too. To make this precise suppose the sequence of random variables is such that Y
s
A
s
t
,
and E(Y
s
1
Y
s
2
)
2
0 as s
1
, s
2
. It is clear that Y
s
A

t
. Since the sequence Y
s
is a
Cauchy sequence there exists a limit, that is a random variable Y such that E(Y
s
Y )
2
0 as
s . The closure of the space A
n
t
, denoted

A
n
t
contains the set A
n
t
and all the limits of
the cauchy sequences in this set. We often use sp(X
t
, X
t1
. . . , ) to denote

A

t
. You really do
not have to worry too much about the above, basically Y sp(X
t
, X
t1
. . .) if E(Y
2
) < and
we can represent Y (almost surely) as Y =

j=1
a
j
X
t+1j
, for some coecients a
j
.
The orthogonal basis of sp(X
t
, X
t1
, . . .)
An orthogonal basis of sp(X
t
, X
t1
, . . .) can be constructed in the same way that an orthogonal
basis of sp(X
t
, X
t1
, . . . , X
1
). The main dierence is how to deal with the initial value, which
in the case of sp(X
t
, X
t1
, . . . , X
1
) is X
1
and in the case of sp(X
t
, X
t1
, . . .) is in some sense
X

, but this it not really a well dened quantity (again we have to be careful with these
innities). Let X
t|t1,...
denote the best linear predictor of X
t
given X
t1
, X
t2
, . . .. As in
Section 3.1.1 it is clear that (X
t
X
t|t1,...
) and X
s
for s t 1 are uncorrelated and

X

t
=
sp(X
t
X
t|t1,...
)

X

t1
, where

X

t
= sp(X
t
, X
t1
, . . .). Now let us consider the space
sp((X
t
X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .), comparing with the construction in Section 3.1.1, we
see that sp((X
t
X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .) does not necessarily equal sp(X
t
, X
t1
, . . .),
because sp((X
t
X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .) lacks the inital value X

. Of course
the time in the past is not really a well dened quantity. Instead the way we dene the
initial starting random variable as the intersection of the subspaces

X

t
, hence let A

n=
A

t
. Now we note since X
n
X
n|n1,...
and X
s
(for any s n 1) are orthogonal,
that sp((X
t
X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .) and A

are orthogonal spaces and sp((X


t

X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .) A

= sp(X
t
, X
t1
, . . .).
We will use this discussion when we prove the Wold decomposition theorem.
31
3.2 Durbin-Levinson algorithm
The Durbin-Levinson algorithm is a simple method for obtaining the coecients of the best
linear predictor of X
t+1
given X
t
, . . . , X
1
. It was rst proposed in the 40s by Norman Levinson
and improved (and adapted to time series) in the early 60s by Jim Durbin.
We recall we want to obtain the coecients
t,j
, where
X
t+1|t
=
t

j=1

t,j
X
tj
(3.4)
minimises the mean squared error E(X
t

t
j=1
a
j
X
tj
)
2
. Of course the coecients
t,j
can be
obtained using (3.3), however, if t is large this can be computationally quite intensive. Instead
we consider an algorithm which obtains
t,j
using
t1,j
and without the need to invert the
matrix
t
. This algorithm can only be applied to stationary time series (as it derived under
the assumption of stationarity). We will show later that it is also useful for estimating the
parameters of an autoregressive time series.
Let us suppose X
t
is a zero mean stationary time series and c(k) = E(X
k
X
0
). Let X
1|t,...,2
denote the best linear predictor of X
1
given X
t
, . . . , X
2
. We rst note that by construction that
X
t
, . . . , X
2
and X
1
X
1|t,...,2
are orthogonal, and X
t
, . . . , X
2
, X
1
and X
t
, . . . , X
2
, X
1

X
1|t,...,2
span the same space. Hence we can rewrite X
t+1|t
as
X
t+1|t
=
t

j=1

t,j
X
t+1j
=
t1

j=1

t,j
X
t+1j
+a
t
(X
1
X
1|t,...,2
).
Now this is rst note that by the orthogonality of X
t
, . . . , X
2
and X
1
X
1|t,...,2
we can
aggregate the predictions, that is
X
t+1|t
= X
t+1|t,...,2
+a
t
(X
1
X
1|t,...,2
). (3.5)
We can rewrite the above using the orthogonality of X
t
, . . . , X
2
and X
1
X
1|t,...,2
to obtain
a
t
=
E(X
t+1|t
(X
1
X
1|t,...,2
))
E(X
1
X
1|t,...,2
)
2
.
Furthermore, since X
t+1
= X
t+1|t
+ (X
t+1
X
t+1|t
) and (X
t+1
X
t+1|t
) and X
t
, . . . , X
1
are
orthogonal we observe that
a
t
=
E(X
t+1
(X
1
X
1|t,...,2
))
E(X
1
X
1|t,...,2
)
2
.
So far we have not used the stationarity of the time series X
t
, but we do now. We observe that
X
1|t,...,2
is the best linear predictor of X
1
given X
2
, . . . , X
t
. By stationarity the coecients
of the best linear predictor of X
t
given X
t1
, . . . , X
1
are the same as those of the best linear
predictor of X
t+1
given X
t
, . . . , X
2
(due to shift invariance). But the same is also true if we
ip the time series around. That is the coecients which given the best linear predictor of
32
X
t+1
given X
t
, . . . , X
2
are the same (but in reverse) of the best linear predictor of X
1
given
X
2
, . . . , X
t
. In other words under the assumption of stationarity we have
X
t|t1
=
t1

j=1

t1,j
X
tj
X
t+1|t,...,2
=
t1

j=1

t1,j
X
t+1j
and X
1|t,...,2
=
t1

j=1

t1,j
X
j+1
.
Therefore substituting the above into (3.5) we have X
t+1|t
= X
t+1|t,...,2
+a
t
(X
1
X
1|t,...,2
), hence
X
t+1|t
=
t1

j=1

t1,j
X
t+1j
+a
t
(X
1
X
1|t,...,2
)
=
t1

j=1
(
t1,j
a
t

t1,tj
)X
t+1j
+a
t
X
1
,
where
a
t
=
E((X
1

t1
j=1

t1,j
X
j+1
)X
t+1
)
E(X
1
X
1|t,...,2
)
2
=
c(t)

t1
j=1

t1,j
c(t j)
r(t)
(3.6)
where r(t) = E(X
t
X
t|t1
)
2
. Now we note that X
t+1|t
satises (3.4), therefore by comparing
coecients (the linear predictor is unique) we have

t,t
= a
t

t,j
=
t1,j
a
t

t1,tj
for j < t.
Finally to recursively obtain the one-step ahead prediction error v(t + 1) we note that by or-
thogonality of X
t
, . . . , X
2
and X
1
X
1|t,...,2
that
r(t + 1) = E(X
t+1
X
t+1|t
)
2
= E(X
t+1
BestLin(X
t+1
[X
t
, . . . , X
2
) a
t
(X
1
X
1|t,...,2
))
2
= E(X
t+1
BestLin(X
t+1
[X
t
, . . . , X
2
))
2
+a
2
t
E(X
1
X
1|t,...,2
))
2
2a
t
E(X
t+1
(X
1
X
1|t,...,2
))
= r(t) +a
2
t
r(t) 2a
t
E(X
t+1
(X
1
X
1|t,...,2
)).
We note that by construction of a
t
in (3.6) that r(t) = E(X
1
X
1|t,...,2
)
2
, substituting this into
the above gives
r(t + 1) = r(t) +a
2
t
r(t) 2a
2
t
r(t) = r(t)(1 a
2
t
).
Hence we have the recursion. The note that the initial values are
1,1
= c(1)/c(0) and r(1) = c(0)
(which is straightforward to prove). Further references: Brockwell and Davis (1998), Chapter 5
and Fuller (1995), pages 82.
33
3.3 Prediction for ARMA processes
Given the autocovariance of any stationary process the Durbin-Levinson algorithm allows us
to systematically obtain one-step predictors without too much computational burden. This
includes ARMA processes, where as we have shown in Section 2.3.3 the covariances can be
obtained from the ARMA parameters. However, there for ARMA processes there are easier
methods for doing the prediction which we describe below.
Let us suppose the ARMA process is both causal and invertible, that is X
t
satises (2.5)
(X
t

p
i=1

i
X
ti
=
t
+

q
j=1

tj
). Then by using Lemma 2.2.1, X
t
can be written as
X
t
=

j=0
a
j
(, )
tj
and X
t
=

j=1
b
j
(, )X
tj
+
t
, (3.7)
where we use a
j
(, ) and b
j
(, ) to emphasis that the AR() and MA() parameters are
functions of
1
, . . . ,
p
and
1
, . . . ,
q
. The above means that given
k

t
k=
we can construct X
t
and given (X
k

t1
k=
,
t
) we can construct X
t
(in other words the sigma-algebras (
k

t
k=
) =
(X
k

t1
k=
,
t
)).
We recall that X
t+1|t
is the best linear predictor of X
t+1
given X
t
, . . . , X
1
and the one step
ahead prediction error is E(X
t+1
X
t+1|t
)
2
. We now dene the best linear predictor of X
t+1
given the innite past X
t
, X
t1
, . . . as X
t+1|t,...
. In practice X
t+1|t
can be evaluated but not
X
t+1|t,...
, since we do not observe the entire past X
0
, X
1
, . . .. However, if t is large, then
most of the information in X
t+1|t,...
will be contained in the rst t terms. Since X
t+1|t,...
is easy
to obtain (if X
0
, X
1
were known) we will often use it, we will discuss later its relationship to
X
t+1|t
.
It is clear from (3.7) that
X
t+1|t,...
=

j=1
b
j
(, )X
t+1j
, (3.8)
since
t+1
is orthogonal to X
t
, X
t1
, . . .. Of course X
0
, X
1
, . . . is unknown so we approximate
X
t+1|t,...
with a truncated version

X
t+1|t,...
=
t

j=1
b
j
(, )X
t+1j
. (3.9)
It is clear that

X
t+1|t,...
sp(X
t
, . . . , X
1
), but it is not necessarily the best linear predictor (in
other words it may not be X
t+1|t
- or equivalently, X
t+1


X
t+1|t,...
may not be orthogonal to
X
k
for 1 k t). However in Proposition 3.3.1, we will show that for large t (a long past
X
t
, . . . , X
1
is observed), then

X
t+1|t,...
X
t+1|t
.
Of course, since b
j
(, ) is not easy to evaluate from
j
and
i
, using (3.9) to obtain

X
t+1|t,...
is not straightforward. But we now use the ARMA structure (which we have not used
previously), to derive a simple way to calculate

X
t+1|t,...
.
34
To do this we consider again X
t+1|t,...
. We recall that
t
= X
t
X
t|t1,...
using this we have
X
t+1
=
p

j=1

j
X
t+1j
+
q

i=1

t+1j
+
t+1
=
p

j=1

j
X
t+1j
+
q

i=1

j
_
X
t+1j
X
tj|tj1,...
_
+
t+1
.
Therefore
X
t+1|t,...
=

j=1
b
j
(, )X
t+1j
=
p

j=1

j
X
t+1j
+
q

i=1

t+1j
=
p

j=1

j
X
t+1j
+
q

i=1

j
_
X
t+1j
X
tj
(1)
_
.
We now return to

X
t+1|t,...
. Set Z
t
= X
t
for 1 t max(p, q), and dene the recursion for
t > max(p, q) that
Z
t
=
p

j=1

j
X
t+1j
+
q

i=1

j
_
X
t+1j
Z
tj
_
.
It is straightforward to show that for t max(p, q), that Z
t
=

t
j=1
b
j
(, )X
t+1j
, hence
X
t+1|t
= Z
t
. Hence given the parameters
j
and
j
, it is easily to evaluate

X
t+1|t,...
recur-
sively.
We show in the following proposition that

X
t+1|t,...
and X
t+1|t
are close when t is large (giving
some justication for using

X
t+1|t,...
). To prove the result we need a result that we will prove in
a later Chapter.
Lemma 3.3.1 Suppose X
t
is a stationary time series with spectral density f(). Let X
t
=
(X
1
, . . . , X
t
) and
t
= var(X
t
). If the spectral density function is bounded away from zero
(there is some > 0 such that inf

f() > 0), for any t,


min
(
t
) (where
min
and
max
denote the smallest and largest absolute eigenvalues of the matrix
t
). Hence
max
(
1
t
)
1
.
Since for symmetric matrices the spectral norm and the largest eigenvalue are the same, then
|
1
t
|
spec

1
.
Furthermore if sup

f() M < , then


max

t
M (hence |
t
|
spec
< M).
PROOF. Later in the course.
Remark 3.3.1 Now for an ARMA process, where the roots of the AR part have absolute value
which is greater than one, then corresponding spectral density is bounded away from zero. More-
over, the spectral density of an ARMA process is always bounded from above. In other words
if f is the spectral density of an ARMA process, where the roots of (z) and and have ab-
solute value greater than 1 +
1
and less than
2
, then the spectral density f() is bounded by
var(
t
)
(1
1

2
)
2p
(1(
1
1+
1
)
2p
f() var(
t
)
(1(
1
1+
1
)
2p
(1
1

2
)
2p
. This can be proved by using the spectral density
of an ARMA process given in (2.22).
35
Proposition 3.3.1 Suppose X
t
is an ARMA process where the roots of (z) and (z) have
roots which are greater in absolute value than 1 +. Let

X
t+1|t,...
, X
t+1|t
and X
t+1|t,...
be dened
as in (3.9), (3.1) and (3.8) respectively. Then
E(

X
t+1|t,...
X
t+1|t
)
2
K
t
, (3.10)

E(X
t+1
X
t+1|t
)
2

K
2
(3.11)
E(

X
t+1|t,...
X
t+1|t,...
)
2
K
t
(3.12)
where
1
1+
< < 1 and var(
t
) =
2
.
PROOF. The proof of (3.10) becomes clear when we use the expansion X
t+1
=

j=1
b
j
(, )X
t+1j
+

t+1
. Evaluating the best linear predictor of X
t+1
given X
t
, . . . , X
1
gives
X
t+1|t
=

j=1
b
j
(, )X
t+1j|t,...,1
+BestLin(
t+1
[X
t
, . . . , X
1
)
=
t

j=1
b
j
(, )X
t+1j
. .

X
t+1|t,...
+

j=t+1
b
t+j
(, )X
j+1|t,...,1
.
(to see this consider the Gaussian case where E(X
t+1
[X
t
, . . . , X
1
) = E(

j=1
b
j
(, )X
t+1j
+

t+1
[X
t
, . . . , X
1
) =

j=1
b
j
(, )X
t+1j
E(X
t+1j
[X
t
, . . . , X
1
)). Therefore the dierence be-
tween the best linear predictor and

X
t+1|t,...
is
X
t+1|t


X
t+1|t,...
=

j=0
b
t+j
(, )X
j+1|t,...,1
.
Intuitively it is clear that when t is large the dierence [X
t+1|t


X
t+1|t,...
[ decays geometrically
because the coecients b
t+j
(, ) decay geometrically. We formalise these ideas now. To obtain
a bound for this dierence we need to obtain bounds for X
j+1|t,...,1
(the best linear predictor
of the unobserved past terms X
j
given the future terms X
t
, . . . , X
1
). For j 0 we have
X
j+1|t,...,t
=
t

i=1

i,j,t
X
i
, (3.13)
where

j,t
=
1
t
r
t,j
, (3.14)
with

j,t
= (
1,j,t
, . . . ,
t,j,t
), X

t
= (X
1
, . . . , X
t
),
t
= E(X
t
X

t
) and r
t,j
= E(X
t
X
j
). This
gives
X
t+1|t


X
t+1|t,...
=
_

j=t+1
b
t+j
(, )

j,t
_
X
t
=
_

j=t+1
b
t+j
(, )r

t,j

t
_
X
t
. (3.15)
36
Taking expectations, we have
E(X
t+1|t


X
t+1|t,...
)
2
=
_

j=t+1
b
t+j
(, )r

t,j
_

1
t
_

j=t+1
b
t+j
(, )r
t,j
_
By using the Cauchy schwarz inequality (|aBb|
1
|a|
2
|Bb|
2
), the spectral norm inequality
(|a|
2
|Bb|
2
|a|
2
|B|
spec
|b|
2
) and Minkowiskis inequality (|

n
j=1
a
j
|
2


n
j=1
|a
j
|
2
) we
have
E(X
t+1|t


X
t+1|t,...
)
2

_
_

j=t+1
b
t+j
(, )r

t,j
_
_
2
2
|
1
t
|
2
spec

_

j=t+1
[b
t+j
(, )[ |r
t,j
|
2
_
2
|
1
t
|
2
spec
. (3.16)
We start by bound each of the terms on the right hand side of the above. We note that
for all t, using Remark 3.3.1 that |
1
t
|
spec
K(1 (
1
1+
1
)
2p
. We now consider r

t,j
=
(E(X
1
X
j
), . . . , E(X
t
X
j
)). By using (2.14) we have E(X
1
X
j
) K
j1
etc. Therefore
|r
t,j
|
2
K(
t

r=1

2(j+r)
)
1/2


j
(1
2
)
2
.
Substituting the above bounds into (3.16) gives
E(X
t+1|t


X
t+1|t,...
)
2

_
K(1 (
1
1 +
1
))
2p
_
2
_

j=0
[b
t+j
(, )[

j
(1
2
)
2
_
2
.
Now we note that by using Lemma 2.2.1, that [b
j
(, )[ K
j
and this gives
E(X
t+1|t


X
t+1|t,...
)
2
K
_

j=0

t+j

j
(1
2
)
2
_
2
.
Thus proving (3.10).
To prove (3.11) we use X
t
=

j=1
b
j
(, )X
t+1j
+
t+1
, X
t
=

t
j=1
b
j
(, )X
t+1j
+

j=t+1
b
t+j
(, )X
j+1|t,...,1
and (3.13) to obtain
X
t+1
X
t+1|t
=
t+1
+

j=0
b
t+j
(, )(X
j
r

t,j

t
X
t
_
.
Hence
E(X
t+1
X
t+1|t
)
2
= var(
t+1
) +E
_

j=0
b
t+j
(, )(X
j
r

t,j

t
X
t
)
_
2
.
Therefore using Minkowskis inequality and [b
j
(, ) K
j
[ we have
E(X
t+1
X
t+1|t
)
2
var(
t+1
) + (

j=0
b
t+j
(, )E(X
j
r

t,j

t
X
t
)
2

1/2
_
2
. .
K
t
,
37
thus proving (3.11).
To prove (3.12) we note that
E(

X
t
(1) X
t
(1))
2
= E(

j=t+1
b
j
(, )X
t+1j
)
2
,
now by using (2.9), it is straightforward to prove the result.
Remark 3.3.2 We note that the one-step ahead predictor depends on the parameters , which
are used to do the prediction and the previous observations. On the other hand, the one-step
ahead prediction error E(X
t+1
X
t+1|t
)
2
depends only on the parameters , and
2
.
3.4 The Wold Decomposition
The above discussion on prediction and Section 3.1.2 leads very nicely to the Wold decomposi-
tion. It states that any stationary process, almost, has an MA() representation. We state the
theorem below and use some of the notation introduced in Section 3.1.2.
Theorem 3.4.1 Suppose that X
t
is a second order stationary time series with a nite vari-
ance (we shall assume that it has mean zero, though this is not necessary). Then X
t
can be
uniquely expressed as
X
t
=

j=0

j
Z
j
+V
t
, (3.17)
where Z
t
are uncorrelated random variables, with var(Z
t
) = E(X
t
X
t|t1,...
)
2
(X
t|t1,...
is the
best linear predictor of X
t
given X
t1
, X
t2
, . . .) and V
t
A

n=
A

n
.
PROOF. First let is consider the one-step ahead prediction error X
t|t1,...
. Since X
t
is a
second order stationary process it is clear that X
t|t1
=

j=1
b
j
X
tj
, where the coecients
b
j
do not vary with t. For this reason X
t|t1,...
and X
t
X
t|t1,...
are second order
stationary random variables. Furthermore, since X
t
X
t|t1,...
is uncorrelated with X
s
for
any s t 1, then X
t
X
t|t1,...
are also uncorrelated random variables, let Z
t
= X
t

X
t|t1,...
, hence Z
t
is the one-step ahead prediction error. We recall from Section 3.1.2 that
X
t
sp((X
t
X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .) sp(A

) = sp(Z
t
, Z
t1
, . . .) sp(A

).
Since the spaces sp(Z
t
, Z
t1
, . . .) and sp(A

) are orthogonal, we shall rst project X


t
onto
sp((X
t
X
t|t1,...
), (X
t1
X
t1|t2,...
), . . .), due to orthogonality the dierence between X
t
and
its projection will be in sp(A

). This will lead to the Wold decomposition.


First we consider the projection of X
t
onto the space sp(Z
t
, Z
t1
, . . .), which is
P
sp(Z
t
,Z
t1
,...)
X
t
=

j=0

j
Z
tj
,
where due to orthogonality
j
= cov(X
t
, (X
tj
X
tj|tj1,...
))/var(X
tj
X
tj|tj1,...
). Since
X
t
sp(Z
t
, Z
t1
, . . .) sp(A

), the dierence X
t
P
sp(Z
t
,Z
t1
,...)
X
t
is orthogonal to Z
t
and
38
belongs in sp(A

). Hence we have
X
t
=

j=0

j
Z
tj
+V
t
,
where V
t
= X
t

j=0

j
Z
tj
and is uncorrelated to Z
t
. Hence we have shown (3.17). To
show that the representation is unique we note that Z
t
, Z
t1
, . . . are an orthogonal basis of
sp(Z
t
, Z
t1
, . . .), which pretty much leads to uniqueness.
It is worth noting that variants on the proof can be found in Brockwell and Davis (1998),
Section 5.7 and Fuller (1995), page 94.
Remark 3.4.1 Notice that the representation in (3.17) looks like an MA() process. There
is, however, a signicant dierence. The random variables Z
t
of an MA() process are iid
random variables and not just uncorrelated. There are several example of time series which
uncorrelated but not independent, one such example is the ARCH process which we will consider
later in the course.
39
Chapter 4
Estimation for Linear models
We now consider various methods for estimating the parameters in a stationary time series.
We rst consider estimation of the mean and covariance and then look at estimation of the
parameters of an AR and ARMA process.
4.1 Estimation of the mean and autocovariance function
Let us suppose the stationary time series Y
t
satises
Y
t
= +X
t
,
where is the nite mean, X
t
is a zero mean stationary time series with absolutely summable
covariances (

k
[cov(X
0
, X
k
)[ < ). Below we consider methods to estimate the mean and
autocovariance function.
4.1.1 Estimating the mean
Suppose we observe Y
t

n
t=1
, and we want to estimate the mean . In an ideal world we would
observe independent replications of Y
t
. We would then use the average, that is

Y
n
= n
1

n
t=1
Y
t
as an estimator of . If the variance of Y
t
is nite, then

Y
n
is a good estimator of the mean
which convergences at the rate O(n
1
) (that is var(

Y
n
) = n
1
var(Y
1
)). However in the case
that Y
t
are not independent and we observe a time series, then we can still use

Y
n
as an
estimator of . The only drawback is that the dependency means that one observation will
inuence the next and the resulting estimator will not be so reliable. But it is easy to show that
var(

Y
n
)
2
n

k
[cov(X
0
, X
k
)[. Hence if

k
[cov(X
0
, X
k
)[ < , then E(

Y
n
)
2
K/n, where
K is a nite constant. This means, despite the estimator not being as good as an estimator an
estimator constructed from independent observations,

Y
n
is still

n-consistent.
4.1.2 Estimating the covariance
Suppose we observe Y
t

n
t=1
, to estimate the covariance we can estimate the covariance c(k) =
E(X
0
X
k
) from the the observations a plausible estimator is
c
n
(k) =
1
n
n|k|

t=1
(Y
t


Y
n
)(Y
t+|k|


Y
n
), (4.1)
40
since E((Y
t


Y
n
)(Y
t+|k|


Y
n
) c(k). Of course if the mean of Y
t
were zero (Y
t
= X
t
), then the
covariance estimator is
c
n
(k) =
1
n
n|k|

t=1
X
t
X
t+|k|
. (4.2)
The eagle-eyed amongst you may wonder why we dont use
1
T|k|

n|k|
t=1
X
t
X
t+|k|
, and that
c
n
(k) is more biased than
1
T|k|

n|k|
t=1
X
t
X
t+|k|
. However c
n
(k) has some very nice properties
which are discussed in the remark below.
Remark 4.1.1 Suppose we dene the empirical covariances
c
n
(k) =
_
1
n

nk
t=1
X
t
X
tk
[k[ n 1
0 otherwise
then hatc
n
(k) is positive denite sequence. Therefore, using Lemma 1.1.1 there exists a sta-
tionary time series Z
t
which has the covariance c
n
(k).
There are various ways to show that c
n
(k) is a positive denite sequence. One method uses
that corresponding spectral density is positive. We recall that the spectral density was dened in
Denition 2.4.1, but we have yet to discuss its properties. One the properties is that is positive.
In other words if c(k) is a positive denite sequence its fourier transform is positive, if f
is positive, then the fourier coecients are positive denite (we will look into detail at these
properties in a later chapter). Using this property we will show that c
n
(k) is positive denite.
But I briey describe the proof. The spectral density is the a positive denite sequence. The
fourier transform of c
n
(k) is
(n1)

k=(n1)
exp(ik) c
n
(k) =
(n1)

k=(n1)
exp(ik) c
n
(k) =
1
n
n|k|

t=1
X
t
X
t+|k|
=
1
n

t=1
X
t
exp(it)

0.
Since it is positive, this means that c
n
(k) is a positive denite sequence.
4.1.3 Some asymptotic results on the covariance estimator
The following theorem gives the asymptotic sampling properties of the covariance estimator
(4.1). The proof of the result can be found in Brockwell and Davis (1998), Chapter 8, Fuller
(1995), but it goes pretty much back to Bartlett (1981) (indeed its called Bartletts formula).
Theorem 4.1.1 Suppose X
t
is a stationary time series where
X
t
= +

j=

j
Z
tj
,
where

j
[
j
[ < , Z
t
are iid random variables with E(Z
4
t
) < . Suppose we observe
X
t
: t = 1, . . . , n and use (4.1) as an estimator of the covariance c(k) = cov(X
0
, X
k
). Then
for each h 1, . . . , n

n( c
n
(h) c(h))
D
^(0, W
h
) (4.3)
41
where c
n
(h) = ( c
n
(1), . . . , c
n
(h)), c(h) = (c(1), . . . , c(h)) and
(W
h
)
ij
=

k=
c(k +i) +c(k i) 2c(i)c(k)c(k +j) +c(k j) 2c(j)c(k).
Example 4.1.1 This example is quite an important application of the above theorem. It is used
to check by eye; whether a time series is uncorrelated (there are more sensitive tests, but this one
is often used to construct CI in for the sample autocovariances in several statistical packages).
Suppose X
t
are iid random variables, and we use (4.1) as an estimator of the autocovariances.
Recalling if X
t
are iid then c(k) = 0 for k ,=, using this and (4.3) we see that the asymptotic
distribution of c
n
(h) in this case is

n( c
n
(h) c(h))
D
^(0, W
h
)
where
(W
h
)
ij
=
_
var(X
t
) i = j
0 i ,= j
In other words

n( c
n
(h) c(h))
D
^(0, var(X
t
)I). Hence the sample autocovariances at dif-
ferent lags are uncorrelated. This allows us to easily construct condence intervals for the au-
tocovariances under the assumption of the observations. If the vast majority of the sample
autocovariance lie inside the condence there is not enough evidence to suggest that the data
is a realisation of a iid random variables (often called a white noise process). Axample of the
empirical ACF and the CI constructed under the assumption of independence is given in Figure
4.1. We see that the empirical autocorrelations of the realisation from iid random variables all
lie within the CI. The same cannot be said for the emprical correlations of a dependent time
series.
Remark 4.1.2 (Long range dependence versus changes in the mean) We rst note that
a process is said to have long range dependence if the covariances

k
[c(k)[ are not absolutely
summable. From a practical point of view data is said to exhibit long range dependence if the
autocovariances do not decay very fast to zero as the lag increases. We now demonstrate that
one must becareful in the diagnoses of long range dependence, because a slow decay of the auto-
covariance could also imply a change in mean if this has not been corrected for. This was shown
in Bhattacharya et al. (1983), and applied to econometric data in Mikosch and St aric a (2000)
and Mikosch and St aric a (2003). A test for distinguishing between long range dependence and
change points is proposed in Berkes et al. (2006).
Suppose that Y
t
satises
Y
t
=
t
+
t
,
where
t
are iid random variables and the mean
t
depends on t. We observe Y
t
but do not
know the mean is changing. We want to evaluate the autocovariance function, hence estimate
the autocovariance at lag k using
c
n
(k) =
1
n
n|k|

t=1
(Y
t


Y
n
)(Y
t+|k|


Y
n
).
42
0 5 10 15 20

0
.2
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
Lag
A
C
F
Series ACF1
0 5 10 15 20

0
.5
0
.0
0
.5
1
.0
Lag
A
C
F
Series ACF2
Figure 4.1: The top plot is the empirical ACF taken from a iid data and the lower lot is the
empirical ACF of a realisation from the AR(2) model dened in (2.19).
Observe that

Y
n
is not really estimating the mean but the average mean! If we plotted the
empirical ACF c
n
(k) we would see that the covariances do not decay with time. However the
true ACF would be zero and at all lags but zero. The reason the empirical ACF does not decay to
zero is because we have corrected for the correct mean. Indeed it can be shown that for large lags
c
n
(k)

s<t
(
s

t
)
2
. Hence because we are not correcting for the mean in the autocovariance,
it remains.
4.2 Estimation for AR models
Let us suppose that X
t
is a zero mean stationary time series which satises the AR(p) repre-
sentation
X
t
=
p

j=1

j
X
tj
+
t
,
where E(
t
) = 0 and var(
t
) =
2
and the roots of the characteristic polynomial 1

p
j=1

j
z
j
lie outside the unit circle. Our aim in this section is to construct estimator of the AR parameters

j
. We will show that in the case that X
t
has an AR(p) representation the estimation is rel-
atively straightforward, and the estimation methods all have properties which are asymptotically
equivalent to the Gaussian maximum estimator.
The following estimation scheme stem from the following observation. Suppose the AR(p)
time series X
t
is causal (that is the roots of the characteristic polynomial lie outside the
unit circle, hence it satises an MA() presentation). Then we can multiple X
t
by X
ti
for
1 i p, since the process is causal
t
and X
ti
. Therefore taking expectations we have for all
43
i > 0
E(X
t
X
ti
) =
p

j=1

j
E(X
tj
X
ti
), c(i) =
p

j=1

j
c(i j). (4.4)
Recall these are the Yule-Walker equations we considered in Section 2.3.3. Putting the cases
1 i p together we can write the above as

p
=
p

p
, (4.5)
where (
p
)
i,j
= c(i j), (
p
)
i
= c(i) and

p
= (
1
, . . . ,
p
).
4.2.1 The Yule-Walker estimator
The Yule-Walker equations inspire the method of moments estimator often called the Yule-
Walker estimator. We use (4.5) as the basis of the estimator. It is clear that
p
and

p
are
estimators of
p
and
p
where (

p
)
i,j
= c
n
(i j) and (
p
)
i
= c
n
(i). Therefore we can use

p
=

1
p

p
, (4.6)
as an estimator of the AR parameters

p
= (
1
, . . . ,
p
). We observe that if p is large this involves
inverting a large matrix. However, we can use the Durbin-Levinson algorithm to estimate

p
by tting lower order AR processes to the observations and increasing the order. This way an
explicit inversion can be avoided. We detail how the Durbin-Levinson algorithm can be used to
estimate the AR parameters below.
Using Remark 4.1.1 there exists a process Z
t
which has the autocovariance function c
n
(k).
This means the best linear predictor of Y
m+1
given Y
m
, . . . , Y
1
is
Y
m+1|m
=
m

j=1

m,j
Y
m+1j
,
where

m
= (
m,1
, . . . ,
m,m
)

1
m

m
with (

m
)
i,j
= c
n
(i j) and (
m
)
i
= c
n
(i). Hence

m
are in fact the estimators of the AR(m) parameters. Now recally from Section 3.2 that for
m 2, that

m
can be obtained from

m1
and the empirical covariances c
n
(k). Hence we
can use the Durbin-Levinson algorithm to estimate the parameters

p
, by rst tting an AR(1)
model to the time series, then iterating the Durbin-Levinson algorithm to t higher order AR
models, until we nally t the AR(p) model to the time series.
In the pevious sections we estimate the covariance E(X
0
X
k
) using
1
n

n|k|
t=1
X
t
X
t+|k|
, which
lead to the Yule-Walker estimators. In the following section we estimate the covariance in a
slighly dierent way. This will lead to the least squares estimator (or maximum likelihood
estimator). Both estimators are dierent but are asymptotically equivalent.
44
4.2.2 The Gaussian maximum likelihood (least squares estimator)
Our object here is to obtain the maximum likelihood estimator of the AR(p) parameters. It
turns out that this is the same as the least squares estimator. We recall that the maximum
likelihood estimator is the parameter which maximises the joint density of the observations.
Since the log-likelihood often has a simpler form, we often maximise the log density rather
than the density (since both the maximum likelihood estimator and maximum log likelihood
estimator yield the same estimator). We note that the Gaussian MLE is constructed as if the
observations X
t
were Gaussian, though it is not necessary that X
t
is Gaussian when doing
the estimation. They can have a dierent distribution, the only dierence is that estimate may
be less ecient (will not obtain the Cramer-Rao lower bound).
Suppose we observe X
t
; t = 1, . . . , n where X
t
are observations from an AR(1) process.
To construct the the MLE, we use that the joint distribution of X
t
is the product of the
conditional distributions. Hence we need an expression for the conditional distribution (in
terms of the densities). Let F

be the distribution function and the density function of


respectively. We rst note that the AR(p) process is p-Markovian, that is
P(X
t
x[X
t1
, X
t2
, . . .) = P(X
t
x[X
t1
, . . . , X
tp
) f
a
(X
t
[X
t1
, . . .) = f
a
(X
t1
[X
t1
, . . . X
tp
),(4.7)
where f
a
is the conditional density of X
t
given the past, where the distribution function is
derived as if a is the true AR(p) parameters.
Remark 4.2.1 To understand why (4.7) is true consider the simple case that p = 1 (AR(1)).
Studying the conditional probability gives
P(X
t
x
t
[X
t1
= x
t1
, . . .) = P(aX
t1
+
t
x
t
[X
t1
= x
t1
, . . .)
= P

(
t
x
t
ax
t1
) = P(X
t
x
t
[X
t1
= x
t1
).
By using the (4.7) we have P(X
t
x[X
t1
, . . .) = P

( x

p
j=1
a
j
X
tj
), hence
P(X
t
x[X
t1
, . . .) = F

(x
p

j=1
a
j
X
tj
), f
a
(X
t
[X
t1
, . . .) = f

(X
t

j=1
a
j
X
tj
). (4.8)
Therefore the joint density of X
t

n
t=1
is
f
a
(X
1
, X
2
, . . . , X
n
) = f
a
(X
1
, . . . , X
p
)
n

t=p+1
f
a
(X
t
[X
t1
, . . . , X
1
) (by Bayes theorem)
= f
a
(X
1
, . . . , X
p
)
n

t=p+1
f
a
(X
t
[X
t1
, . . . , X
tp
) (by the Markov property)
= f
a
(X
1
, . . . , X
p
)
n

t=p+1
f

(X
t

j=1
a
j
X
tj
) (by (4.8)).
Therefore the log likelihood is
log f
a
(X
1
, X
2
, . . . , X
n
) = log f
a
(X
1
, . . . , X
p
)
. .
often ignored
+
n

t=p+1
log f

(X
t

j=1
a
j
X
tj
)
. .
conditional likelihood
.
45
Usually we ignore the initial distribution log f
a
(X
1
, . . . , X
p
) and maximise the conditional likeli-
hood to obtain the estimator. In the case that the sample sizes are large n >> p, the contribution
of log f
a
(X
1
, . . . , X
p
) is minimal and the conditional likelihood and likelihood are asymptotically
equivalent.
We note in the case that f

is Gaussian, the conditional log-likelihood is nL


n
(a), where
L
n
(a) = log
2
+
1
n
2
n

t=p+1
(X
t

j=1
a
j
X
tj
)
2
.
Therefore the estimates of the AR(p) parameters is

p
= arg min L
n
(a). It is clear that

p
is
the least squares estimator and can be explicitly obtained using

p
=

1
p

p
,
where (

p
)
i,j
=
1
np

n
t=p+1
X
ti
X
tj
and (
n
)
i
=
1
np

n
t=p+1
X
t
X
ti
.
Remark 4.2.2 (A comparison of the Yule-Walker and least squares estimators) If we
compare the least squares (Gaussian conditional likelihood) estimator

p
with the Yule-Walker es-
timator

p
, then we see that they are very similar. The dierence lies in the way the covariances
are estimated. We see that for the Yule-Walker estimator
1
n

ni
t=1
X
t
X
t+i
is used exclusively to
estimate the covariance c(i). Whereas for the least squares estimator
1
n

nr
t=r
X
t
X
t+k
: k r
are all used as estimators of c(k). There is very little dierence between these two covariances
estimates, indeed the Yule-Walker estimates and the least squares estimates have asymptotically
the same properties. There are however subtle dierences, in the actual estimators. Because
the Yule-Walker is constructed from a positive denite sequence, using the parameter estimates

p
one can construct a stationary AR(p) process. The same is not necessarily true of the least
squares estimator, which does not necessarily construct a stationary AR(p) process. Moreover,
because

p
can be used to construct a stationary AR(p) process, it can be shown that |

p
|
2
2
p
,
the same does not necessarily hold for the least squares estimate

p
.
4.3 Estimation for ARMA models
Let us suppose that X
t
satises the ARMA representation
X
t

i=1

(0)
i
X
ti
=
t
+
q

j=1

(0)
j

tj
,
and
0
= (
(0)
1
, . . . ,
(0)
q
),
0
= (
(0)
1
, . . . ,
(0)
p
) and
2
0
= var(
t
). We will suppose for now that p
and q are known. In the following sections we consider dierent methods for estimating
0
and

0
.
4.3.1 The Hannan and Rissanen AR() expansion method
We rst describe an easy method to estimate the parameters of an ARMA process. These
estimates may not necessarily be ecient (we dene this term later) but they have an explicit
46
form and can be easily obtained. Therefore they are a good starting point, and can be used
as the initial value when using the Gaussian maximum likelihood to estimate the parameters
(as described below). The method was rst propose in Hannan and Rissanen (1982) and An
et al. (1982) and we describe it below. It is worth bearing in mind that currently the large
p small n problem is a hot topic. These are generally regression problems where the sample
size n is quite small but the number of regressors p is quite large (usually model selection is of
importance in this context). The methods proposed by Hannan involves expanding the ARMA
process (assuming invertibility) as an AR() process and estimating the parameters of the
AR() process. In some sense this can be considered as a regression problem with an innite
number of regressors. Hence there are some parallels between the estimation described below
and the large p small n problem.
As we mentioned in Lemma 2.2.2, if an ARMA process is invertible it is can be written as
X
t
=

j=1
b
j
X
tj
+
t
. (4.9)
The idea behind Hannans method is to estimate the parameters b
j
, then estimate the inno-
vations
t
, and use the estimated innovations to construct a multiple linear regression estimator
of the ARMA paramters
i
and
j
. Of course in practice we cannot estimate all parameters
b
j
as there are an innite number of them. So instead we do a type of sieve estimation where
we only estimate a nite number and let the number of parameters to be estimated grow as the
sample size increases. We describe the estimation steps below:
(i) Suppose we observe X
t

n
t=1
. Recalling (4.9), will estimate b
j

p
n
j=1
parameters. We will
suppose that p
n
as n and p
n
<< n (we will state the rate below).
We use least squares to estimate b
j

p
n
j=1
and dene

b
n
=

R
1
n
r
n
,
where

R
n
=
n

t=p
n
+1
X
t1
X

t1
r
n
=
T

t=p
n
+1
X
t
X
t1
and X

t1
= (X
t1
, . . . , X
tp
n
).
(ii) Having estimated the rst b
j

p
n
j=1
coecients we estimate the residuals with

t
= X
t

p
n

j=1

b
j,n
X
tj
.
(iii) Now use as estimates of
0
and
0

n
,

n
where

n
,

n
= arg min
n

t=p
n
+1
(X
t

j=1

j
X
tj

q

i=1

i

ti
)
2
.
47
We note that the above can easily be minimised. In fact
(

n
,

n
) =

1
n
s
n
where

n
=
1
n
n

t=max(p,q)

Y
t

Y
t
and s
n
=
1
T
n

t=max(p,q)

Y
t
X
t
,

t
= (X
t1
, . . . , X
tp
,
t1
, . . . ,
tq
).
4.3.2 The Gaussian maximum likelihood estimator
We now consider the Gaussian maximum likelihood estimator (GMLE) to estimate the parame-
ters
0
and
0
. Let X

T
= (X
1
, . . . , X
T
). We note that despite calling the estimate the GMLE, it
does not assume that the time series X
t
is Gaussian. The criterion (the GMLE) is constructed
as if X
t
were Gaussian, but this need not be the case.
It is clear that the negative Gaussian likelihood of X
t

n
t=1
, assuming that it is a realisation
from an ARMA process is
1
n
L
n
(, , ) =
1
n
log [(, )[ +X

n
(, )
1
X
n
, (4.10)
where (, , ) the variance covariance matrix of X
n
constructed as if X
n
came from an ARMA
process with parameters and . To directly evaluate the above for each (, ), never mind
about minimising over all (, ) can be a daunting task, its computationally extremely dicult
for even relatively large sample sizes. However, there exists a simple solution, which uses the
one-step predictions considered in the prediction section.
Let
X
(,)
t+1|t
= BestLin
(,)
(X
t+1
[X
t
, . . . , X
1
), (4.11)
be the best linear predictor of X
t+1
given X
t
, . . . , X
1
and the ARMA parameters and which
are used to calculate the covariances in the prediction. Let r
t+1
(, , ) be the one-step ahead
mean squared error E(X
t
X
(,)
t+1|t
)
2
. By using Cholskeys decomposition it can be shown that
1
n
L
n
(, , ) =
1
n
n1

t=1
log r
t+1
(, , ) +
1
n
n1

t=1
(X
t+1
X
(,)
t+1|t
)
2
r
t+1
(, , )
.
We see that we have avoided inverting the matrix (, , ). The GMLE is the parameter

n
,

n
which minimises L
n
(, , ). We note that the one-step ahead predictor X
(,)
t+1|t
can be obtained
using Durbin-Levinson Algorithm.
It is possible to obtain an approximation of L
n
(, , ) which is simple to evaluate. However
this approximation only really make sense when the sample size n is large. It is, however, useful
when obtaining the asymptotic sampling properties of the GMLE.
48
To motivate the approximation consider the one-step ahead prediction error considered in
Section 3.3. We have shown in Proposition 3.3.1 that for large t,

X
t+1|t,...
X
t+1|t
and
2

E(X
t+1
X
t+1|t
)
2
. Now dene

X
(,)
t+1|t,...
=
t

j=1
b
j
(, )X
t+1j
. (4.12)
We now replace in L
n
(, , ), X
(,)
t+1|t
with

X
(,)
t+1|t,...
and r
t+1
(, , ) with
2
to obtain
1
n

L
n
(, , ) = log
2
+
1
n
2
T1

t=1
(X
t+1


X
(,)
t+1|t,...
)
2
.
We show in Section 6 that
1
n
L
n
(, , ) and
1
n

L
n
(, , ) are asymptotically equivalent.
49
Chapter 5
Almost sure convergence,
convergence in probability and
asymptotic normality
In the previous chapter we considered estimator of several dierent parameters. The hope is
that as the sample size increases the estimator should get closer to the parameter of interest.
When we say closer we mean to converge. In the classical sense the sequence x
k
converges to
x (x
k
x), if [x
k
x[ 0 as k (or for every > 0, there exists an n where for all k > n,
[x
k
x[ < ). Of course the estimators we have considered are random, that is for every
(set of all out comes) we have an dierent estimate. The natural question to ask is what does
convergence mean for random sequences.
5.1 Modes of convergence
We start by dening dierent modes of convergence.
Denition 5.1.1 (Convergence) Almost sure convergence We say that the sequence
X
t
converges almost sure to , if there exists a set M , such that P(M) = 1 and for
every N we have
X
t
() .
In other words for every > 0, there exists an N() such that
[X
t
() [ < , (5.1)
for all t > N(). Note that the above denition is very close to classical convergence. We
denote X
t
almost surely, as X
t
a.s.
.
An equivalent denition, in terms of probabilities, is for every > 0 X
t
a.s.
if
P(;

m=1

t=m
[X
t
() [ > ) = 0.
It is worth considering briey what

m=1

t=m
[X
t
() [ > means. If

m=1

t=m
[X
t
() [ > ,= , then there exists an

m=1

t=m
[X
t
() [ > such that
50
for some innite sequence k
j
, we have [X
k
j
(

) [ > , this means X


t
(

) does not
converge to . Now let

m=1

t=m
[X
t
() [ > = A, if P(A) = 0, then for most
the sequence X
t
() converges.
Convergence in mean square
We say X
t
in mean square (or L
2
convergence), if E(X
t
)
2
0 as t .
Convergence in probability
Convergence in probability cannot be stated in terms of realisations X
t
() but only in terms
of probabilities. X
t
is said to converge to in probability (written X
t
P
) if
P([X
t
[ > ) 0, t .
Often we write this as [X
t
[ = o
p
(1).
If for any 1 we have
E(X
t
)

0 t ,
then it implies convergence in probability (to see this, use Markovs inequality).
Rates of convergence:
(i) We say the stochastic process X
t
is [X
t
[ = O
p
(a
t
), if the sequence a
1
t
[X
t

[ is bounded in probability (this is dened below). We see from the denition of
boundedness, that for all t, the distribution of a
1
t
[X
t
[ should mainly lie within a
certain interval. In general a
t
as t .
(ii) We say the stochastic process X
t
is [X
t
[ = o
p
(a
t
), if the sequence a
1
t
[X
t
[
converges in probability to zero.
Denition 5.1.2 (Boundedness) (i) Almost surely bounded If the random variable X
is almost surely bounded, then for a positive sequence e
k
, such that e
k
as k
(typically e
k
= 2
k
is used), we have
P(;

k=1
[X()[ e
k
) = 1.
Usually to prove the above we show that
P((;

k=1
[X[ e
k
)
c
) = 0.
Since (

k=1
[X[ e
k
)
c
=

k=1
[X[ > e
k

k=1

m=k
[X[ > e
k
, to show the above
we show
P( :

k=1

m=k
[X()[ > e
k
) = 0. (5.2)
We note that if ( :

k=1

m=k
[X()[ > e
k
) ,= , then there exists a

and an
innite subsequence k
j
, where [X(

)[ > e
k
j
, hence X(

) is not bounded (since e


k
).
To prove (5.2) we usually use the Borel Cantelli Lemma. This states that if

k=1
P(A
k
) <
, the events A
k
occur only nitely often with probability one. Applying this to our case,
51
if we can show that

m=1
P( : [X()[ > e
m
[) < , then [X()[ > e
m
[ happens only
nitely often with probability one. Hence if

m=1
P( : [X()[ > e
m
[) < , then
P( :

k=1

m=k
[X()[ > e
k
) = 0 and X is a bounded random variable.
It is worth noting that often we choose the sequence e
k
= 2
k
, in this case

m=1
P( :
[X()[ > e
m
[) =

m=1
P( : log [X()[ > log 2
k
[) CE(log [X[). Hence if we can
show that E(log [X[) < , then X is bounded almost surely.
(ii) Sequences which are bounded in probability A sequence is bounded in probability,
written X
t
= O
p
(1), if for every > 0, there exists a () < such that P([X
t
[ ()) <
. Roughly speaking this means that the sequence is only extremely large with a very small
probability. And as the largeness grows the probability declines.
5.2 Ergodicity
To motivate the notion of ergodicity we recall the strong law of large numbers (SLLN). Suppose
X
t

t
is an iid random sequence, and E([X
0
[) < then by the SLLN we have that
1
n
n

j=1
X
t
a.s.
E(X
0
),
for the proof see, for example, Grimmett and Stirzaker (1994). It would be useful to generalise
this result and nd weaker conditions on X
t
for this result to still hold true. A simple
application is when we want to estimate the mean , and we use
1
n

n
j=1
X
t
as an estimator of
the mean.
It can be shown that if X
t
is an ergodic process then the above result holds. That is if
X
t
is an ergodic process then for any function h such that E(h(X
0
)) < we have
1
n
n

j=1
h(X
t
)
a.s.
E(h(X
0
)).
Note that the result does not state anything about the rate of convergence. Ergodicity is
normally dened in terms of measure preserving transformations. However, we do not formally
dene ergodicity here, but needless to say all ergodic processes are stationary. For the denition
of ergodicity and a full treatment see, for example, Billingsley (1995).
However below we do state a result which characterises a general class of ergodic processes.
Theorem 5.2.1 Suppose Z
t
is an ergodic sequence (for example iid random variables) and
g : R

R is a measureable function (its really hard to think up nonmeasureable functions).


Then the sequence Y
t

t
, where
Y
t
= g(Z
t
, Z
t1
, . . . , ),
is an ergodic process.
PROOF. See Stout (1974), Theorem 3.5.8.
52
Example 5.2.1 (i) The process Z
t

t
, where Z
t
are iid random variables, is probably the
simplest example of an ergodic sequence.
(ii) A simple example of a time series X
t
which is not independent but is ergodic is the
AR(1) process. We recall that the AR(1) process satises the representation
X
t
= X
t1
+
t
, (5.3)
where
t

t
are iid random variables with E(
t
) = 0, E(
2
t
) = 1 and [[ < 1. It has the
unique causal solution
X
t
=

j=0

tj
.
The solution motivates us to dene the function
g(x
0
, x
1
, . . .) =

j=0

j
x
j
.
Since g() is bounded, it is suciently well behaved (thus measureable). Which implies,
by using Theorem 5.2.1, that X
t
is an ergodic process. We note if E(
2
) < , then
E(X
2
) < .
The ARCH(p) process X
t
dened by X
t
= Z
t

t
where
2
t
= a
0
+

p
j=1
a
j
X
2
tj
with

p
j=1
a
j
< 1 is ergodic stochastic process (we look at this model in a later Chapter).
Example 5.2.2 (Application) If X
t
is an AR(1) process with [a[ < 1 and E(
2
t
) < , then
by using the ergodic theorem we have
1
n
n

t=1
X
t
X
t+k
a.s.
E(X
0
X
k
).
5.3 Sampling properties
Often we will estimate the parameters by maximising (or minimising) a criterion. Suppose we
have the criterion L
n
(a) (eg. likelihood, quasi-likelihood, Kullback-Leibler etc) we use as an
estimator of a
0
, a
n
where
a
n
= arg max
a
L
n
(a)
and is the parameter space we do the maximisation (minimisation) over. Typically the true
parameter a should maximise (minimise) the limiting criterion L.
If this is to be a good estimator, as the sample size grows the estimator should converge (in
some sense) to the parameter we are interesting in estimating. As we discussed above, there
are various modes in which we can measure this convergence (i) almost surely (ii) in probability
and (iii) in mean squared error. Usually we show either (i) or (ii) (noting that (i) implies (ii)),
in time series its usually quite dicult to show (iii).
53
Denition 5.3.1 (i) An estimator a
n
is said to be almost surely consistent estimator of a
0
,
if there exists a set M , where P(M) = 1 and for all M we have
a
n
() a.
(ii) An estimator a
n
is said to converge in probability to a
0
, if for every > 0
P([ a
n
a[ > ) 0 T .
To prove either (i) or (ii) usually involves verifying two main things, pointwise convergence and
equicontinuity.
5.4 Showing almost sure convergence of an estimator
We now consider the general case where L
n
(a) is a criterion which we maximise. Let us suppose
we can write L
n
as
L
n
(a) =
1
n
n

t=1

t
(a), (5.4)
where for each a ,
t
(a)
t
is a ergodic sequence. Let
L(a) = E(
t
(a)), (5.5)
we assume that L(a) is continuous and has a unique maximum in . We dene the estimator

n
where
n
= arg min
a
L
n
(a).
Denition 5.4.1 (Uniform convergence) L
n
(a) is said to almost surely converge uniformly
to L(a), if
sup
a
[L
n
(a) L(a)[
a.s.
0.
In other words there exists a set M where P(M) = 1 and for every M,
sup
a
[L
n
(, a) L(a)[ 0.
Theorem 5.4.1 (Consistency) Suppose that a
n
= arg max
a
L
n
(a) and a
0
= arg max
a
L(a)
is the unique minimum. If sup
a
[L
n
(a) L(a)[
a.s.
0 as n and L(a) has a unique maxi-
mum. Then Then a
n
a.s.
a
0
as n .
PROOF. We note that by denition we have L
n
(a
0
) L
n
( a
n
) and L( a
n
) L(a
0
). Using this
inequality we have
L
n
(a
0
) L(a
0
) L
n
( a
n
) L(a
0
) L
n
( a
n
) L( a
n
).
Therefore from the above we have
[L
n
( a
T
) L(a
0
)[ max [L
n
(a
0
) L(a
0
)[, [L
n
( a
T
) L( a
n
)[ sup
a
[L
n
(a) L(a)[.
54
Hence since we have uniform converge we have [L
n
( a
n
) L(a
0
)[
a.s.
0 as n . Now since
L(a) has a unique maximum, we see that [L
n
( a
n
) L(a
0
)[
a.s.
0 implies a
n
a.s.
a
0
.
We note that directly establishing uniform convergence is not easy. Usually it is done by
assuming the parameter space is compact and showing point wise convergence and stochastic
equicontinuity, these three facts imply uniform convergence. Below we dene stochastic equicon-
tinuity and show consistency under these conditions.
Denition 5.4.2 The sequence of stochastic functions f
n
(a)
n
is said to be stochastically
equicontinuous if there exists a set M where P(M) = 1 and for every M and and
> 0, there exists a and such that for every M
sup
|a
1
a
2
|
[f
n
(, a
1
) f
n
(, a
2
)[ ,
for all n > N().
A sucient condition for stochastic equicontinuity of f
n
(a) (which is usually used to prove
equicontinuity), is that f
n
(a) is in some sense Lipschitz continuous. In other words,
sup
a
1
,a
2

[f
n
(a
1
) f
n
(a
2
)[ < K
n
|a
1
a
2
|,
where k
n
is a random variable which converges to a nite constant as n (K
n
a.s.
K
0
as
n ). To show that this implies equicontinuity we note that K
n
a.s.
K
0
means that for every
M (P(M) = 1) and > 0, we have [K
n
() K
0
[ < for all n > N(). Therefore if we
choose = /(K
0
+) we have
sup
|a
1
a
2
|/(K
0
+)
[f
n
(, a
1
) f
n
(, a
2
)[ < ,
for all n > N().
In the following theorem we state sucient conditions for almost sure uniform convergence.
It is worth noting this is the Arzela-Ascoli theorem for random variables.
Theorem 5.4.2 (The stochastic Ascoli Lemma) Suppose the parameter space is com-
pact, for every a we have L
n
(a)
a.s.
L(a) and L
n
(a) is stochastic equicontinuous. Then
sup
a
[L
n
(a) L(a)[
a.s.
0 as n .
We use the theorem below.
Corollary 5.4.1 Suppose that a
n
= arg max
a
L
n
(a) and a
0
= arg max
a
L(a), moreover
L(a) has a unique maximum. If
(i) we have point wise convergence, that is for every a we have L
n
(a)
a.s.
L(a).
(ii) The parameter space is compact.
(iii) L
n
(a) is stochastic equicontinuous.
Then a
n
a.s.
a
0
as n .
We prove Theorem 5.4.2 in the section below, but it can be omitted on rst reading.
55
5.4.1 Proof of Theorem 5.4.2 (The stochastic Ascoli theorem)
We now show that stochastic equicontinuity and almost pointwise convergence imply uniform
convergence. We note that on its own, pointwise convergence is a much weaker condition than
uniform convergence, since for pointwise convergence the rate of convergence can be dierent
for each parameter.
Before we continue a few technical points. We recall that we are assuming almost pointwise
convergence. This means for each parameter a there exists a set N
a
(with P(N
a
) = 1)
such that for all N
a
L
t
(, a) L(a). In the following lemma we unify this set. That is
show (using stochastic equicontinuity) that there exists a set N (with P(N) = 1) such that
for all N L
t
(, a) L(a).
Lemma 5.4.1 Suppose the sequence L
n
(a)
n
is stochastically equicontinuous and also point-
wise convergent (that is L
n
(a) converges almost surely to L(a)), then there exists a set M
where P(

M) = 1 and for every

M and a we have
[L
n
(, a) L(a)[ 0.
PROOF. Enumerate all the rationals in the set and call this sequence a
i

i
. Then for every
a
i
there exists a set M
a
i
where P(M
a
i
) = 1, such that for every M
a
i
we have [L
T
(, a
i
)
L(a
i
)[ 0. Dene M = M
a
i
, since the number of sets is countable P(M) = 1 and for every
M and a
i
we have L
n
(, a
i
) L(a
i
).
Since we have stochastic equicontinuity, there exists a set

M (with P(M) = 1), such that
for every

M, L
n
(, ) is equicontinuous. Let

M =

M M
a
i
, we will show that for all
a and

M we have L
n
(, a) L(a). By stochastic equicontinuity for every

M and
/3 > 0, there exists a > 0 such that
sup
|b
1
b
2
|
[L
n
(, b
1
) L
n
(, b
2
)[ /3, (5.6)
for all n > N(). Furthermore by denition of

M for every rational a
j
and

N we have
[L
T
(, a
i
) L(a
i
)[ /3, (5.7)
where n > N

(). Now for any given a , there exists a rational a


i
such that |a a
j
| .
Using this, (5.6) and (5.7) we have
[L
n
(, a) L(a)[ [L
n
(, a) L
n
(, a
i
)[ +[L
n
(, a
i
) L(a
i
)[ +[L(a) L(a
i
)[ ,
for n > max(N(), N

()). To summarise for every



M and a , we have [L
n
(, a)
L(a)[ 0. Hence we have pointwise covergence for every realisation in

M.
We now show that equicontinuity implies uniform convergence.
Proof of Theorem 5.4.2. Using Lemma 5.4.1 we see that there exists a set

M with
P(

M) = 1, where L
n
is equicontinuous and also pointwise convergent. We now show uniform
convergence on this set. Choose /3 > 0 and let be such that for every

M we have
sup
|a
1
a
2
|
[L
T
(, a
1
) L
T
(, a
2
)[ /3, (5.8)
56
for all n > n(). Since is compact it can be divided into a nite number of open sets.
Construct the sets O
i

p
i=1
, such that
p
i=1
O
i
and sup
x,y,i
|x y| . Let a
i

p
i=1
be such
that a
i
O
i
. We note that for every

M we have L
n
(, a
i
) L(a
i
), hence for every /3,
there exists an n
i
() such that for all n > n
i
() we have [L
T
(, a
i
) L(a
i
)[ /3. Therefore,
since p is nite (due to compactness), there exists a n() such that
max
1ip
[L
n
(, a
i
) L(a
i
)[ /3,
for all n > n() = max
1ip
(n
i
()). For any a , choose the i, such that open set O
i
such
that a O
i
. Using (5.8) we have
[L
T
(, a) L
T
(, a
i
)[ /3,
for all n > n(). Altogether this gives
[L
T
(, a) L(a)[ [L
T
(, a) L
T
(, a
i
)[ +[L
T
(, a
i
) L(a
i
)[ +[L(a) L(a
i
)[ ,
for all n max(n(), n()). We observe that max(n(), n()) and /3 does not depend on
a, therefore for all n max(n(), n()) and we have sup
a
[L
n
(, a) L(a)[ < . This gives
for every

M (P(

M) = 1), sup
a
[L
n
(, a) L(a)[ 0, thus we have almost sure uniform
convergence.
5.5 Almost sure convergence of the least squares estimator for
an AR(p) process
In Chapter 6 we will consider the sampling properties of many of the estimators dened in
Chapter 4. However to illustrate the consistency result above we apply it to the least squares
estimator of the autoregressive parameters.
To simply notation we only consider estimator for AR(1) models. Suppose that X
t
satises
X
t
= X
t1
+
t
(where [[ < 1). To estimate we use the least squares estimator dened
below. Let
L
n
(a) =
1
n 1
n

t=2
(X
t
aX
t1
)
2
, (5.9)
we use

n
as an estimator of , where

n
= arg max
a
L
T
(a), (5.10)
where = [1, 1].
How can we show that this is consistent?
In the case of least squares for AR processes, a
T
has the explicit form

n
=
1
n1

n
t=2
X
t
X
t1
1
n1

T1
t=1
X
2
t
.
Now by just applying the ergodic theorem to the numerator and denominator we get

n
a.s.
.
It is worth noting, that

1
n1
P
n
t=2
X
t
X
t1
1
n1
P
n1
t=1
X
2
t

< 1 is not necessarily true.


57
However we will tackle the problem in a rather artical way and assume that it does not
have an explicit form and instead assume that

n
is obtained by minimising L
n
(a) using
a numerical routine. In general this is the most common way of minimising a likelihood
function (usually explicit solutions do not exist).
In order to derive the sampling properties of

n
we need to directly study the likelihood
function L
n
(a). We will do this now in the least squares case.
We will rst show almost sure convergence, which will involve repeated use of the ergodic
theorem. We will then demonstrate how to show convergence in probability. We look at almost
sure convergence as its easier to follow. Note that almost sure convergence implies convergence
in probability (but the converse is not necessarily true).
The rst thing to do it let

t
(a) = (X
t
aX
t1
)
2
.
Since X
t
is an ergodic process (recall Example 5.2.1(ii)) by using Theorem 5.2.1 we have for
a, that
t
(a)
t
is an ergodic process. Therefore by using the ergodic theorem we have
L
n
(a) =
1
n 1
n

t=2

t
(a)
a.s.
E(
0
(a)).
In other words for every a [1, 1] we have that L
n
(a)
a.s.
E(
0
(a)) (almost sure pointwise
convergence).
Since the the parameter space [1, 1] is compact and a is the unique minimum of () in the
parameter space, then all that remains is to show show stochastic equicontinuity, from this we
deduce almost sure uniform convergence.
To show stochastic equicontinuity we expand L
T
(a) and use the mean value theorem to
obtain
L
n
(a
1
) L
n
(a
2
) = L
T
( a)(a
1
a
2
), (5.11)
where a [min[a
1
, a
2
], max[a
1
, a
2
]] and
L
n
( a) =
2
n 1
n

t=2
X
t1
(X
t
aX
t1
).
Because a [1, 1] we have
[L
n
( a)[ T
n
, where T
n
=
2
n 1
n

t=2
([X
t1
X
t
[ +X
2
t1
).
Since X
t

t
is an ergodic process, then [X
t1
X
t
[ + X
2
t1
is an ergodic process. Therefore, if
var(
0
) < , by using the ergodic theorem we have
T
n
a.s.
2E([X
t1
X
t
[ +X
2
t1
).
58
Let T := 2E([X
t1
X
t
[ + X
2
t1
). Therefore there exists a set M , where P(M) = 1 and for
every M and > 0 we have
[T
T
() T[

,
for all n > N(). Substituting the above into (5.11) we have
[L
n
(, a
1
) L
n
(, a
2
)[ T
n
()[a
1
a
2
[
(T +

)[a
1
a
2
[,
for all n N(). Therefore for every > 0, there exists a := /(T +

) such that
sup
|a
1
a
2
|/(D+

)
[L
n
(, a
1
) L
n
(, a
2
)[ ,
for all n N(). Since this is true for all M we see that L
n
(a) is stochastically
equicontinuous.
Theorem 5.5.1 Let

n
be dened as in (5.10). Then we have

n
a.s.
.
PROOF. Since L
n
(a) is almost sure equicontinuous, the parameter space [1, 1] is compact
and we have pointwise convergence of L
n
(a)
a.s.
L(a), by using Theorem 5.4.1 we have that

n
a.s.
a, where a = min
a
L(a). Finally we need to show that a = . Since
L(a) = E(
0
(a)) = E(X
1
aX
0
)
2
,
we see by dierentiating L(a) with respect to a, that it is minimised at a = E(X
0
X
1
)/E(X
2
0
),
hence a = E(X
0
X
1
)/E(X
2
0
). To show that this is , we note that by the Yule-Walker equations
X
t
= X
t1
+
t
E(X
t
X
t1
) = E(X
2
t1
) +E(
t
X
t1
)
. .
=0
.
Therefore = E(X
0
X
1
)/E(X
2
0
), hence

n
a.s.
.
We note that by using a very similar methods we can show strong consistency of the least
squares estimator of the parameters in an AR(p) model.
5.6 Convergence in probability of an estimator
We described above almost sure (strong) consistency where we showed a
T
a.s.
a
0
. Sometimes
its not possible to show strong consistency (when ergodicity etc. cannot be veried). Often, as
an alternative, weak consistency where a
T
P
a
0
(convergence in probability), is shown. This
requires a weaker set of conditions, which we now describe:
(i) The parameter space should be compact.
(ii) We pointwise convergence: for every a L
n
(a)
P
L(a).
59
(iii) The sequence L
n
(a) is equicontinuous in probability. That is for every > 0 and > 0
there exists a such that
lim
n
P
_
sup
|a
1
a
2
|
[L
n
(a
1
) L
n
(a
2
)[ >
_
< . (5.12)
If the above conditions are satisied we have a
T
P
a
0
.
Verifying conditions (ii) and (iii) may look a little daunting but actually with the use of
Chebyshevs (or Markovs) inequality it can be quite straightforward. For example if we can
show that for every a
E(L
n
(a) L(a))
2
0 T .
Therefore by applying Chebyshevs inequality we have for every > 0 that
P([L
n
(a) L(a)[ > )
E(L
n
(a) L(a))
2

2
0 T .
Thus for every a we have L
n
(a)
P
L(a).
To show (iii) we often use the mean value theorem L
n
(a). Using the mean value theorem we
have
[L
n
(a
1
) L
n
(a
2
)[ sup
a
|
a
L
n
(a)|
2
|a
1
a
2
|.
Now if we can show that sup
n
Esup
a
|
a
L
n
(a)|
2
< (in other words it is uniformly bounded
in probability over n) then we have the result. To see this observe that
P
_
sup
|a
1
a
2
|
[L
n
(a
1
) L
n
(a
2
)[ >
_
P
_
sup
a
|
a
L
n
(a)|
2
[a
1
a
2
[ >
_

sup
n
E([a
1
a
2
[ sup
a
|
a
L
n
(a)|
2
)

.
Therefore by a careful choice of > 0 we see that (5.12) is satised (and we have equicontinuity
in probability).
5.7 Asymptotic normality of an estimator
The rst central limit theorm goes back to the asymptotic distribution of sums of binary ran-
dom variables (these have a binomial distribution and Bernoulli showed that they could be
approximated to a normal distribution). This result was later generalised to sums of iid random
variables. However from mid 20th century to late 20th century several advances have been made
for generalisating the results to dependent random variables. These include generalisations to
random variables which have n-dependence, mixing properties, cumulant properties, near-epoch
dependence etc (see, for example, Billingsley (1995) and Davidson (1994)). In this section we
will concentrate on a central limit theore for martingales. Our reason for choosing this avour
of CLT is that it can be applied in various estimation settings - as it can often be shown that
the derivative of a criterion at the true parameter is a martingale.
60
Let us suppose that
a
n
= arg max
a
L
n
(a),
where
L
n
(a) =
1
n
n

t=1

t
(a),
and for each a ,
t
(a)
t
are identically distributed random variables.
In this section we shall show asymptotic normality of

n( a
n
a
0
). The reason for normalising
by

n, is that ( a
n
a
0
)
a.s.
0 as n , hence in terms of distributions it converges towards
the point mass at zero. Therefore we need to increase the magnitude of the dierence a
n
a
0
.
We can show that ( a
n
a
0
) = O(n
1/2
), therefore

n( a
n
a) = O(1).
We often use L
n
(a) to denote the partial derivative of L
n
(a) with respect to a (L
n
(a) =
L
n
(a)
a
1
, . . . ,
L
n
(a)
a
p
). Since a
T
= arg max L
n
(a), we observe that L
n
( a
n
) = 0. Now expanding
L
n
( a
n
) about a
0
(the true parameter) we have
L
n
( a
n
) = L
n
(a
0
) + ( a
n
a
0
)
2
L
n
( a)
( a
n
a
0
) =
2
L
n
( a)
1
L
n
(a
0
) (5.13)
To show asymptotically normality of

n( a
n
a
0
), rst asymptotic normality of L
n
(a
0
) is
shown, second it is shown that
2
L
n
( a)
P
E(
2

0
(a
0
)), together they yield asymptotically
normality of

n( a
n
a
0
). In many cases L
n
(a
0
) is a martingale, hence the martingale central
limit theorem is usually applied to show asymptotic normality of L
n
(a
0
). We start by dening
a martingale and stating the martingale central limit theorem.
Denition 5.7.1 The random variables Z
t
are called martingale dierences if
E(Z
t
[Z
t1
, Z
t2
, . . .) = 0.
The sequence o
T

T
, where
o
T
=
T

k=1
Z
t
are called martingales if Z
t
are martingale dierences.
Remark 5.7.1 (Martingales and covariances) We observe that if Z
t
are martingale dif-
ferences then if t > s and T
s
= (Z
s
, Z
s1
, . . .)
cov(Z
s
, Z
t
) = E(Z
s
Z
t
) = E
_
E(Z
s
Z
t
[T
s
)
_
= E
_
Z
s
E(Z
t
[T
s
)
_
= E(Z
s
0) = 0.
Hence martingale dierences are uncorrelated.
Example 5.7.1 Suppose that X
t
= X
t1
+
t
, where
t
are idd rv with E(
t
) = 0 and [[ < 1.
Then
t
X
t1

t
are martingale dierences.
61
Let us dene o
T
as
o
T
=
T

t=1
Z
t
, (5.14)
where T
t
= (Z
t
, Z
t1
, . . .), E(Z
t
[T
t1
) = 0 and E(Z
2
t
) < . In the following theorem adapted
from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that o
T
is asymptotically
normal.
Theorem 5.7.1 Let o
T

T
be dened as in (6.36). Further suppose
1
T
T

t=1
Z
2
t
P

2
, (5.15)
where
2
is a nite constant, for all > 0,
1
T
T

t=1
E(Z
2
t
I([Z
t
[ >

T)[T
t1
)
P
0, (5.16)
(this is known as the conditional Lindeberg condition) and
1
T
T

t=1
E(Z
2
t
[T
t1
)
P

2
. (5.17)
Then we have
T
1/2
o
T
D
^(0,
2
). (5.18)
5.8 Asymptotic normality of the least squares estimator
In this section we show asymptotic normality of the least squares estimator of the AR(1) (X
t
=
X
t1
+
t
, with var(
t
) =
2
) dened in (5.9).
We call that the least squares estimator is

n
= arg max
a[1,1]
L
n
(a). Recalling the criterion
L
n
(a) =
1
n 1
n

t=2
(X
t
aX
t1
)
2
,
the rst and the second derivative is
L
n
(a) =
2
n 1
n

t=2
X
t1
(X
t
aX
t1
) =
2
n 1
n

t=2
X
t1

t
and
2
L
n
(a) =
2
n 1
n

t=2
X
2
t1
.
Therefore by using (5.13) we have
(

n
) =
_

2
L
n
_
1
L
n
(). (5.19)
62
Since X
2
t
are ergodic random variables, by using the ergodic theorem we have
2
L
n
a.s.

2E(X
2
0
). This with (5.19) implies

n(

n
) =
_

2
L
n
_
1
. .
a.s.
(2E(X
2
0
))
1

nL
n
(). (5.20)
To show asymptotic normality of

n(

n
), will show asymptotic normality of

nL
n
().
We observe that
L
n
() =
2
n 1
n

t=2
X
t1

t
,
is the sum of martingale dierences, since E(X
t1

t
[X
t1
) = X
t1
E(
t
[X
t1
) = X
t1
E(
t
) = 0
(here we used Denition 5.7.1). In order to show asymptotic of L
n
() we will use the martingale
central limit theorem.
We now use Theorem 5.7.1 to show that

nL
n
() is asymptotically normal, which means
we have to verify conditions (5.15)-(5.17). We note in our example that Z
t
:= X
t1

t
, and that
the series X
t1

t
is an ergodic process. Furthermore, since for any function g, E(g(X
t1

t
)[T
t1
) =
E(g(X
t1

t
)[X
t1
), where T
t
= (X
t
, X
t1
, . . .) we need only to condition on X
t1
rather than
the entire sigma-algebra T
t1
.
C1 : By using the ergodicity of X
t1

t
we have
1
n
n

t=1
Z
2
t
=
1
n
n

t=1
X
2
t1

2
t
P
E(X
2
t1
) E(
2
t
)
. .
=1
=
2
c(0).
C2 : We now verify the conditional Lindeberg condition.
1
n
n

t=1
E(Z
2
t
I([Z
t
[ >

n)[T
t1
) =
1
n
n

t=1
E(X
2
t1

2
t
I([X
t1

t
[ >

n)[X
t1
)
We now use the Cauchy-Schwartz inequality for conditional expectations to split X
2
t1

2
t
and I([X
t1

t
[ > ). We recall that the Cauchy-Schwartz inequality for conditional expec-
tations is E(X
t
Z
t
[() [E(X
2
t
[()E(Z
2
t
[()]
1/2
almost surely. Therefore
1
n
n

t=1
E(Z
2
t
I([Z
t
[ >

n)[T
t1
)

1
n
n

t=1
_
E(X
4
t1

4
t
[X
t1
)E(I([X
t1

t
[ >

n)
2
[X
t1
)
_
1/2

1
n
n

t=1
X
2
t1
E(
4
t
)
1/2
_
E(I([X
t1

t
[ >

n)
2
[X
t1
)
_
1/2
. (5.21)
We note that rather than use the Cauchy-Schwartz inequality we can use a generalisation
of it called the H older inequality. The H older inequality states that if p
1
+q
1
= 1, then
63
E(XY ) E(X
p
)
1/p
E(Y
q
)
1/q
(the conditional version also exists). The advantage of
using this inequality is that one can reduce the moment assumptions on X
t
.
Returning to (5.21), and studying E(I([X
t1

t
[ > )
2
[X
t1
) we use that E(I(A)) = P(A)
and the Chebyshev inequality to show
E(I([X
t1

t
[ >

n)
2
[X
t1
) = E(I([X
t1

t
[ >

n)[X
t1
)
= E(I([
t
[ >

n/X
t1
)[X
t1
)
= P

([
t
[ >

n
X
t1
))
X
2
t1
var(
t
)

2
n
. (5.22)
Substituting (5.22) into (5.21) we have
1
n
n

t=1
E(Z
2
t
I([Z
t
[ >

n)[T
t1
)

1
n
n

t=1
X
2
t1
E(
4
t
)
1/2
_
X
2
t1
var(
t
)

2
n
_
1/2

E(
4
t
)
1/2
n
3/2
n

t=1
[X
t1
[
3
E(
2
t
)
1/2

E(
4
t
)
1/2
E(
2
t
)
1/2
n
1/2
1
n
n

t=1
[X
t1
[
3
.
If E(
4
t
) < , then E(X
4
t
) < , therefore by using the ergodic theorem we have
1
n

n
t=1
[X
t1
[
3
a.s.

E([X
0
[
3
). Since almost sure convergence implies convergence in probability we have
1
n
n

t=1
E(Z
2
t
I([Z
t
[ >

n)[T
t1
)
E(
4
t
)
1/2
E(
2
t
)
1/2
n
1/2
. .
0
1
n
n

t=1
[X
t1
[
3
. .
P
E(|X
0
|
3
)
P
0.
Hence condition (5.16) is satised.
C3 : We need to verify that
1
n
n

t=1
E(Z
2
t
[T
t1
)
P

2
.
Since X
t

t
is an ergodic sequence we have
1
n
n

t=1
E(Z
2
t
[T
t1
) =
1
n
n

t=1
E(X
2
t1

2
[X
t1
)
=
1
n
n

t=1
X
2
t1
E(
2
[X
t1
) = E(
2
)
1
n
n

t=1
X
2
t1
. .
a.s.
E(X
2
0
)
P
E(
2
)E(X
2
0
) =
2
c(0),
64
hence we have veried condition (5.17).
Altogether conditions C1-C3 imply that

nL
n
() =
1

n
n

t=1
X
t1

t
D
^(0,
2
c(0)). (5.23)
Recalling (5.20) and that

nL
n
()
D
^(0,
2
) we have

n(

n
) =
_

2
L
n
_
1
. .
a.s.
(2E(X
2
0
))
1

nL
n
()
. .
D
N(0,
2
c(0))
. (5.24)
Using that E(X
2
0
) = c(0), this implies that

n(

n
)
D
^(0,
1
4

2
c(0)
1
). (5.25)
Thus we have derived the limiting distribution of

n
.
Remark 5.8.1 We recall that
(

n
) =
_

2
L
n
_
1
L
n
() =
2
n1

n
t=2

t
X
t1
2
n1

n
t=2
X
2
t1
, (5.26)
and that var(
2
n1

n
t=2

t
X
t1
) =
2
n1

n
t=2
var(
t
X
t1
) = O(
1
n
). This implies
(

n
) = O
p
(n
1/2
).
Indeed the results also holds almost surely
(

n
) = O(n
1/2
). (5.27)
The same result is true for autoregressive processes of arbitrary nite order. That is

n(

n
)
D
^(0, E(
p
)
1

2
). (5.28)
65
Chapter 6
Sampling properties of ARMA
parameter estimators
In this section we obtain the sampling properties of estimates of the parameters in an ARMA
process
X
t

i=1

(0)
i
X
ti
=
t
+
q

j=1

(0)
j

tj
, (6.1)
where
t
are iid random variables with mean zero and var(
t
) =
2
. Let
0
= (
(0)
1
, . . . ,
(0)
p
)
and
0
= (
(0)
1
, . . . ,
(0)
q
) and
0
= (
0
,
0
).
6.1 Asymptotic properties of the Hannan and Rissanen estima-
tion method
In this section we will derive the sampling properties of the Hannan-Rissanen estimator. We
will obtain an almost sure rate of convergence (this will be the only estimator where we obtain
an almost sure rate). Typically obtaining only sure rates can be more dicult than obtaining
probabilistic rates, moreover the rates can be dierent (worse in the almost sure case). We now
illustrate why that is with a small example. Suppose X
t
are iid random variables with mean
zero and variance one. Let S
n
=

n
t=1
X
t
. It can easily be shown that
var(S
n
) =
1
n
therefore S
n
= O
p
(
1

n
). (6.2)
However, from the law of iterated logarithm we have for any > 0
P(S
n
(1 +)
_
2nlog log n innitely often) = 0P(S
n
(1 )
_
2nlog log n innitely often) = 1.(6.3)
Comparing (6.2) and (6.3) we see that for any given trajectory (realisation) most of the time
1
n
S
n
will be within the O(
1

n
) bound but there will be excursions above when it to the O(
log log n

n
bound. In other words we cannot say that
1
n
S
n
= (
1

n
) almost surely, but we can say that This
basically means that
1
n
S
n
= O(

2 log log n

n
) almost surely.
66
Hence the probabilistic and the almost sure rates are (slightly) dierent. Given this result is
true for the average of iid random variables, it is likely that similar results will hold true for
various estimators.
In this section we derive an almost sure rate for Hannan-Rissanen estimator, this rate will
be determined by a few factors (a) an almost sure bound similar to the one derived above (b)
the increasing number of parameters p
n
(c) the bias due to estimating only a nite number of
parameters when there are an innite number in the model.
We rst recall the algorithm:
(i) Use least squares to estimate b
j

p
n
j=1
and dene

b
n
=

R
1
n
r
n
, (6.4)
where

b

n
= (

b
1,n
, . . . ,

b
p
n
,n
),

R
n
=
n

t=p
n
+1
X
t1
X

t1
r
n
=
T

t=p
n
+1
X
t
X
t1
and X

t1
= (X
t1
, . . . , X
tp
n
).
(ii) Estimate the residuals with

t
= X
t

p
n

j=1

b
j,n
X
tj
.
(iii) Now use as estimates of
0
and
0

n
,

n
where

n
,

n
= arg min
n

t=p
n
+1
(X
t

j=1

j
X
tj

q

i=1

i

ti
)
2
. (6.5)
We note that the above can easily be minimised. In fact
(

n
,

n
) =

1
n
s
n
where

n
=
1
n
n

t=p
n
+1

Y
t

Y
t
s
n
=
1
T
n

t=p
n
+1

Y
t
X
t
,

t
= (X
t1
, . . . , X
tp
,
t1
, . . . ,
tq
). Let
n
= (

n
,

n
).
We observe that in the second stage of the scheme where the estimation of the ARMA parameters
are done, it is important to show that the empirical residuals are close to the true residuals.
That is
t
=
t
+ o(1). We observe that from the denition of
t
, this depends on the rate of
convergence of the AR estimators

b
j,n

t
= X
t

p
n

j=1

b
j,n
X
tj
=
t
+
p
n

j=1
(

b
j,n
b
j
)X
tj

j=p
n
+1
b
j
X
tj
. (6.6)
67
Hence

p
n

j=1
(

b
j,n
b
j
)X
tj

j=p
n
+1
b
j
X
tj

. (6.7)
Therefore to study the asymptotic properties of =

n
,

n
we need to
Obtain a rate of convergence for sup
j
[

b
j,n
b
j
[.
Obtain a rate for [
t

t
[.
Use the above to obtain a rate for
n
= (

n
,

n
).
We rst want to obtain the uniform rate of convergence for sup
j
[

b
j,n
b
j
[. Deriving this
is technically quite challanging. We state the rate in the following theorem, an outline of the
proof can be found in Section 6.1.1. The proofs uses results from mixingale theory which can
be found in Chapter 10.
Theorem 6.1.1 Suppose that X
t
is from an ARMA process where the roots of the true char-
acteristic polynomials (z) and (z) both have absolute value greater than 1 + . Let

b
n
be
dened as in (6.4), then we have almost surely
|

b
n
b
n
|
2
= O
_
p
2
n
_
(log log n)
1+
log n
n
+
p
3
n
n
+p
n

p
n
_
for any > 0.
PROOF. See Section 6.1.1.
Corollary 6.1.1 Suppose the conditions in Theorem 6.1.1 are satised. Then we have

p
n
max
1jp
n
[

b
j,n
b
j
[Z
t,p
n
+K
p
n
Y
tp
n
, (6.8)
where Z
t,p
n
=
1
p
n

p
n
t=1
[X
tj
[ and Y
t
=

p
n
t=1

j
[X
t
[,
1
n
n

t=p
n
+1


ti
X
tj

ti
X
tj

= O(p
n
Q(n) +
p
n
) (6.9)
1
n
n

t=p
n
+1


ti

tj

ti

tj

= O(p
n
Q(n) +
p
n
) (6.10)
where Q(n) = p
2
n
_
(log log n)
1+
log n
n
+
p
3
n
n
+p
n

p
n
.
68
PROOF. Using (6.7) we immediately obtain (6.8).
To obtain (6.9) we use (6.7) to obtain
1
n
n

t=p
n
+1


ti
X
tj

ti
X
tj

1
n
n

t=p
n
+1
[X
tj
[


ti

ti

O(p
n
Q(n))
1
n
n

t=p
n
+1
[X
t
[[Z
t,p
n
[ +O(
p
n
)
1
n
n

t=p
n
+1
[X
t
[[Y
tp
n
[
= O(p
n
Q(n) +
p
n
).
To prove (6.10) we use a similar method, hence we omit the details.
We apply the above result in the theorem below.
Theorem 6.1.2 Suppose the assumptions in Theorem 6.1.1 are satised. Then
_
_

n

0
_
_
2
= O
_
p
3
n
_
(log log n)
1+
log n
n
+
p
4
n
n
+p
2
n

p
n
_
.
for any > 0, where
n
= (

n
,

n
) and
0
= (
0
,
0
).
PROOF. We note from the denition of
n
that
_

n

0
_
=

1
n
_
s
n

n

0
_
.
Now in the

n
and s
n
we replace the estimated residuals
n
with the true unobserved residuals.
This gives us
_

n

0
_
=
1
n
_
s
n

0
_
+ (
1
n
s
n

1
n
s
n
) (6.11)

n
=
1
n
n

t=max(p,q)
Y
t
Y
t
s
n
=
1
n
n

t=max(p,q)
Y
t
X
t
,
Y

t
= (X
t1
, . . . , X
tp
,
t1
, . . . ,
tq
) (recalling that Y

t
= (X
t1
, . . . , X
tp
,
t1
, . . . ,
tq
). The
error term is
(
1
n
s
n

1
n
s
n
) =
1
n
(

n
)

1
n
s
n
+

1
n
(s
n
s
n
).
Now, almost surely
1
n
,

1
n
= O(1) (if E(
n
) is non-singular). Hence we only need to obtain
a bound for

n
and s
n
s
n
. We recall that

n
=
1
n

t=p
n
+1
(

Y
t

Y

t
Y
t
Y

t
),
hence the terms dier where we replace the estimated
t
with the true
t
, hence by using (6.9)
and (6.10) we have almost surely
[

n
[ = O(p
n
Q(n) +
p
n
) and [s
n
s
n
[ = O(p
n
Q(n) +
p
n
).
69
Therefore by substituting the above into (6.12) we obtain
_

n

0
_
=
1
n
_
s
n

0
_
+O(p
n
Q(n) +
p
n
). (6.12)
Finally using straightforward algebra it can be shown that
s
n

n
=
1
n
n

t=max(p,q)

t
Y
t
.
By using Theorem 6.1.3, below, we have s
n

n
= O((p+q)
_
(log log n)
1+
log n
n
). Substituting
the above bound into (??), and noting that O(Q(n)) dominates O(
_
(log log n)
1+
log n
n
) gives
_
_

n

n
_
_
2
= O
_
p
3
n
_
(log log n)
1+
log n
n
+
p
4
n
n
+p
2
n

p
n
_
and the required result.
6.1.1 Proof of Theorem 6.1.1 (A rate for

b
T
b
T

2
)
We observe that

b
n
b
n
= R
1
n
_
r
n


R
n
b
n
_
+
_

R
1
n
R
1
n
__
r
n


R
n
b
n
_
where b, R
n
and r
n
are deterministic, with b
n
= (b
1
. . . , b
p
n
), (R
n
)
i,j
= E(X
i
X
j
) and (r
n
)
i
=
E(X
0
X
i
). Evaluating the Euclidean distance we have
|

b
n
b
n
|
2
|R
1
n
|
spec
_
_
r
n


R
n
b
n
_
_
2
+|R
1
n
|
spec
|

R
1
n
|
spec
_
_
R
n
R
n
_
_
2
_
_
r
n


R
n
b
n
_
_
2
,(6.13)
where we used that

R
1
n


R
1
n
=

R
1
n
(R
n


R
n
)R
1
n
and the norm inequalities. Now by using
Lemma 3.3.1 we have
min
(R
1
n
) > /2 for all T. Thus our aim is to obtain almost sure bounds
for |r
n


R
n
b
n
|
2
and |

R
n
R
n
|
2
, which requires the lemma below.
Theorem 6.1.3 Let us suppose that X
t
has an ARMA representation where the roots of the
characteristic polynomials (z) and (z) lie are greater than 1 +. Then
(i)
1
n
n

t=r+1

t
X
tr
= O(
_
(log log n)
1+
log n
n
) (6.14)
(ii)
1
n
n

t=max(i,j)
X
ti
X
tj
= O(
_
(log log n)
1+
log n
n
). (6.15)
for any > 0.
70
PROOF. The result is proved in Chapter 10.2.
To obtain the bounds we rst note that if the there wasnt an MA component in the
ARMA process, in other words X
t
was an AR(p) process with p
n
p, then r
n


R
n
b
n
=
1
n

n
t=p
n
+1

t
X
tr
, which has a mean zero. However because an ARMA process has an AR()
representation and we are only estimating the rst p
n
parameters, there exists a bias in
r
n


R
n
b
n
. Therefore we obtain the decomposition
(r
n


R
n
b
n
)
r
=
1
n
n

t=p
n
+1
_
X
t

j=1
b
j
X
tj
_
X
tr
+
1
n
n

t=p
n
+1

j=p
n
+1
b
j
X
tj
X
tr
(6.16)
=
1
n
n

t=p
n
+1

t
X
tr
. .
stochastic term
+
1
n
n

t=p
n
+1

j=p
n
+1
b
j
X
tj
X
tr
. .
bias
(6.17)
Therefore we can bound the bias with

(r
n


R
n
b
n
)
r

1
n
n

t=p
n
+1

t
X
tr

K
p
n
1
n
n

t=1
[X
tr
[

j=1

j
[X
tp
n
j
[. (6.18)
Let Y
t
=

j=1

j
[X
tj
and S
n,k,r
=
1
n

n
t=1
[X
tr
[

j=1

j
[X
tkj
[. We note that Y
t
and
X
t
are ergodic sequences. By applying the ergodic theorm we can show that for a xed k and
r, S
n,k,r
a.s.
E(X
tr
Y
tk
). Hence S
n,k,r
are almost surely bounded sequences and

p
n
1
n
n

t=1
[X
tr
[

j=1

j
[X
tp
n
j
[ = O(
p
n
).
Therefore almost surely we have
|r
n


R
n
b
n
|
2
= |
1
n
n

t=p
n
+1

t
X
t1
|
2
+O(p
n

p
n
).
Now by using (6.14) we have
|r
n


R
n
b
n
|
2
= O
_
p
n
__
(log log n)
1+
log n
n
+
p
n
__
. (6.19)
This gives us a rate for r
n


R
n
b
n
. Next we consider

R
n
. It is clear from the denition of

R
n
that almost surely we have
(

R
n
)
i,j
E(X
i
X
j
) =
1
n
n

t=p
n
+1
X
ti
X
tj
E(X
i
X
j
)
=
1
n
n

t=min(i,j)
[X
ti
X
tj
E(X
i
X
j
)]
1
n
p
n

t=min(i,j)
X
ti
X
tj
+
min(i, j)
n
E(X
i
X
j
)
=
1
n
T

t=min(i,j)
[X
ti
X
tj
E(X
i
X
j
)] +O(
p
n
n
).
71
Now by using (6.15) we have almost surely
[(

R
n
)
i,j
E(X
i
X
j
)[ = O(
p
n
n
+
_
(log log n)
1+
log n
n
).
Therefore we have almost surely
|

R
n
R
n
|
2
= O
_
p
2
n
_
p
n
n
+
_
(log log n)
1+
log n
n
__
. (6.20)
We note that by using (6.13), (6.19) and (6.20) we have
|

b
n
b
n
|
2
|R
1
n
|
spec
|

R
1
n
|
spec
O
_
p
2
n
_
(log log n)
1+
log n
n
+
p
2
n
n
+p
n

p
n
_
.
As we mentioned previously, because the spectrum of X
t
is bounded away from zero,
min
(R
n
)
is bounded away from zero for all T. Moreover, since
min
(

R
n
)
min
(R
n
)
max
(

R
n
R
n
)

min
(R
n
) tr((

R
n
R
n
)
2
), which for a large enough n is bounded away from zero. Hence we
obtain almost surely
|

b
n
b
n
|
2
= O
_
p
2
n
_
(log log n)
1+
log n
n
+
p
3
n
n
+p
n

p
n
_
, (6.21)
thus proving Theorem 6.1.1 for any > 0.
6.2 Asymptotic properties of the GMLE
Let us suppose that X
t
satises the ARMA representation
X
t

i=1

(0)
i
X
ti
=
t
+
q

j=1

(0)
j

tj
, (6.22)
and
0
= (
(0)
1
, . . . ,
(0)
q
),
0
= (
(0)
1
, . . . ,
(0)
p
) and
2
0
= var(
t
). In this section we consider the
sampling properties of the GML estimator, dened in Section 4.3.2. We rst recall the estimator.
We use as an estimator of (
0
,
0
),

n
= (

n
,

n
,
n
) = arg min
(,)
L
n
(, , ), where
1
n
L
n
(, , ) =
1
n
n1

t=1
log r
t+1
(, , ) +
1
n
n1

t=1
(X
t+1
X
(,)
t+1|t
)
2
r
t+1
(, , )
. (6.23)
To show consistency and asymptotic normality we will use the following assumptions.
Assumption 6.2.1 (i) X
t
is both invertible and causal.
(ii) The parameter space should be such that all (z) and (z) in the parameter space have
roots whose absolute value is greater than 1 +.
0
(z) and
0
(z) belong to this space.
72
Assumption 6.2.1 means for for some nite constant K and
1
1+
< 1, we have [(z)
1
[
K

j=0
[
j
[[z
j
[ and [(z)
1
[ K

j=0
[
j
[[Z
j
[.
To prove the result, we require the following approximations of the GML. Let

X
(,)
t+1|t,...
=
t

j=1
b
j
(, )X
t+1j
. (6.24)
This is an approximation of the one-step ahead predictor. Since the likelihood is constructed
from the one-step ahead predictors, we can approximated the likelihood
1
n
L
n
(, , ) with the
above and dene
1
n

L
n
(, , ) = log
2
+
1
n
2
T1

t=1
(X
t+1


X
(,)
t+1|t,...
)
2
. (6.25)
We recall that

X
(,)
t+1|t,...
was derived from X
(,)
t+1|t,...
which is the one-step ahead predictor of X
t+1
given X
t
, X
t1
, . . ., this is
X
(,)
t+1|t,...
=

j=1
b
j
(, )X
t+1j
. (6.26)
Using the above we dene a approximation of
1
n
L
n
(, , ) which in practice cannot be obtained
(since the innite past of X
t
is not observed). Let us dene the criterion
1
n
L
n
(, , ) = log
2
+
1
n
2
T1

t=1
(X
t+1
X
(,)
t+1|t,...
)
2
. (6.27)
In practice
1
n
L
n
(, , ) can not be evaluated, but it proves to be a convenient tool in obtaining
the sampling properties of

n
. The main reason is because
1
n
L
n
(, , ) is a function of X
t
and
X
(,)
t+1|t,...
=

j=1
b
j
(, )X
t+1j
both of these are ergodic (since the ARMA process is ergodic
when its roots lie outside the unit circle and the roots of , are such that they lie outside
the unit circle). In contrast looking at L
n
(, , ), which is comprised of X
t+1|t
, which not
an ergodic random variable because X
t+1
is the best linear predictor of X
t+1
given X
t
, . . . , X
1
(see the number of elements in the prediction changes with t). Using this approximation really
simplies the proof, though it is possible to prove the result without using these approximations.
First we obtain the result for the estimators

n
= (

n
,

n
,
n
) = arg min
(,)
L
n
(, , )
and then show the same result can be applied to
n
.
Proposition 6.2.1 Suppose X
t
is an ARMA process which satises (6.22), and Assumption
6.2.1 is satised. Let X
(,)
t+1|t
,

X
(,)
t+1|t,...
and X
(,)
t+1|t,...
be the predictors dened in (4.11), (6.24) and
(6.26), obtained using the parameters =
j
and =
i
, where the roots the corresponding
characteristic polynomial (z) and (z) have absolute value greater than 1 +. Then

X
(,)
t+1|t


X
(,)
t+1|t,...


t
1
t

i=1

i
[X
i
[, (6.28)
73
E(X
(,)
t+1|t


X
(,)
t+1|t,...
)
2
K
t
, (6.29)


X
t+1|t,...
(1) X
t+1|t,...

j=t+1
b
j
(, )X
t+1j

K
t

j=0

j
[X
j
[, (6.30)
E(X
(,)
t+1|t,...


X
(,)
t+1|t,...
)
2
K
t
(6.31)
and
[r
t
(, , )
2
[ K
t
(6.32)
for any 1/(1 +) < < 1 and K is some nite constant.
PROOF. The proof follows closely the proof of Proposition 6.2.1. First we dene a separate
ARMA process Y
t
, which is driven by the parameters and (recall that X
t
is drive by the
parameters
0
and
0
). That is Y
t
satises Y
t

p
j=1

j
Y
tj
=
t
+

q
j=1

tj
. Recalling that
X
,
t+1|t
is the best linear predictor of X
t+1
given X
t
, . . . , X
1
and the variances of Y
t
(noting
that it is the process driven by and ), we have
X
,
t+1|t
=
t

j=1
b
j
(, )X
t+1j
+
_

j=t+1
b
j
(, )r

t,j
(, )
t
(, )
1
_
X
t
, (6.33)
where
t
(, )
s,t
= E(Y
s
Y
t
), (r
t,j
)
i
= E(Y
ti
Y
j
) and X

t
= (X
t
, . . . , X
1
). Therefore
X
,
t+1|t


X
t+1|t,...
=
_

j=t+1
b
j
r

t,j

t
(, )
1
_
X
t
.
Since the largest eigenvalue of
t
(, )
1
is bounded (see Lemma 3.3.1) and [(r
t,j
)
i
[ = [E(Y
ti
Y
j
)[
K
|ti+j|
we obtain the bound in (6.28). Taking expectations, we have
E(X
,
t+1|t


X
,
t+1|t,...
)
2
=
_

j=t+1
b
j
r

t,j
_

t
(, )
1

t
(
0
,
0
)
t
(, )
1
_

j=t+1
b
t+j
r
t,j
_
.
Now by using the same arguments given in the proof of (3.10) we obtain (6.29).
To prove (6.31) we note that
E(

X
t+1|t,...
(1) X
t+1|t,...
)
2
= E(

j=t+1
b
j
(, )X
t+1j
)
2
= E(

j=1
b
t+j
(, )X
j
)
2
,
now by using (2.9), we have [b
t+j
(, )[ K
t+j
, for
1
1+
< < 1, and the bound in (6.30).
Using this we have E(

X
t+1|t,...
(1) X
t+1|t,...
)
2
K
t
, which proves the result.
74
Using
t
= X
t

j=1
b
j
(
0
,
0
)X
tj
and substituting this into L
n
(, , ) gives
1
n
L
n
(, , ) = log
2
+
1
n
2
_
X
t

j=1
b
j
(, )X
t+1j
_
2
=
1
n
L
n
(, , ) log
2
+
1
n
2
T1

t=1
_
(B)
1
(B)X
t
__
(B)
1
(B)X
t
_
= log
2
+
1
n
2
n

t=1

2
t
+
2
n
n

t=1

t
_

j=1
b
j
(, )X
tj
_
+
1
n
n

t=1
_

j=1
(b
j
(, ) b
j
(
0
,
0
))X
tj
_
2
.
Remark 6.2.1 (Derivatives involving the Backshift operator) Consider the transforma-
tion
1
1 B
X
t
=

j=0

j
B
j
X
t
=

j=0

j
X
tj
.
Suppose we want to dierentiate the above with respect to , there are two ways this can be done.
Either dierentiate

j=0

j
X
tj
with respect to or dierentiate
1
1B
with respect to . In
other words
d
d
1
1 B
X
t
=
B
(1 B)
2
X
t
=

j=0
j
j1
X
tj
.
Often it is easier to dierentiate the operator. Suppose that (B) = 1 +

p
j=1

j
B
j
and (B) =
1

q
j=1

j
B
j
, then we have
d
d
j
(B)
(B)
X
t
=
B
j
(B)
(B)
2
X
t
=
(B)
(B)
2
X
tj
d
d
j
(B)
(B)
X
t
=
B
j
(B)
2
X
t
=
1
(B)
2
X
tj
.
Moreover in the case of squares we have
d
d
j
(
(B)
(B)
X
t
)
2
= 2(
(B)
(B)
X
t
)(
(B)
(B)
2
X
tj
),
d
d
j
(
(B)
(B)
X
t
)
2
= 2(
(B)
(B)
X
t
)(
1
(B)
2
X
tj
).
Using the above we can easily evaluate the gradient of
1
n
L
n
1
n

i
L
n
(, , ) =
2

2
n

t=1
((B)
1
(B)X
t
)
(B)
(B)
2
X
ti
1
n

j
L
n
(, , ) =
2
n
2
n

t=1
((B)
1
(B)X
t
)
1
(B)
X
tj
1
n

2L
n
(, , ) =
1

2

1
n
4
n

t=1
_
X
t

j=1
b
j
(, )X
tj
_
2
. (6.34)
75
Let = (

i
,

j
,

2). We note that the second derivative


2
L
n
can be dened similarly.
Lemma 6.2.1 Suppose Assumption 6.2.1 holds. Then
sup
,
|
1
n
L
n
|
2
KS
n
sup
,
|
1
n

3
L
n
|
2
KS
n
(6.35)
for some constant K,
S
n
=
1
n
max(p,q)

r
1
,r
2
=0
n

t=1
Y
tr
1
Y
tr
2
(6.36)
where
Y
t
= K

j=0

j
[X
tj
[.
for any
1
(1+)
< < 1.
PROOF. The proof follows from the the roots of (z) and (z) having absolute value greater
than 1 +.
Dene the expectation of the likelihood L(, , )) = E(
1
n
L
n
(, , )). We observe
L(, , )) = log
2
+

2
0

2
+
1

2
E(Z
t
(, )
2
)
where
Z
t
(, ) =

j=1
(b
j
(, ) b
j
(
0
,
0
))X
tj
Lemma 6.2.2 Suppose that Assumption 6.2.1 are satised. Then for all , , we have
(i)
1
n

i
L
n
(, , ))
a.s.

i
L(, , )) for i = 0, 1, 2, 3.
(ii) Let S
n
dened in (6.36), then S
n
a.s.
E(

max(p,q)
r
1
,r
2
=0

n
t=1
Y
tr
1
Y
tr
2
).
PROOF. Noting that the ARMA process X
t
are ergodic random variables, then Z
t
(, )
and Y
t
are ergodic random variables, the result follows immediately from the Ergodic theorem.
We use these results in the proofs below.
Theorem 6.2.1 Suppose that Assumption 6.2.1 is satised. Let (

n
,

n
,

n
) = arg min L
n
(, , )
(noting the practice that this cannot be evaluated). Then we have
(i) (

n
,

n
,

n
)
a.s.
(
0
,
0
,
0
).
76
(ii)

n(

0
,

0
)
D
^(0,
2
0

1
), where
=
_
E(U
t
U

t
) E(V
t
U

t
)
E(U
t
V

t
) E(V
t
V

t
)
_
and U
t
and V
t
are autoregressive processes which satisfy
0
(B)U
t
=
t
and
0
(B)V
t
=

t
.
PROOF. We prove the result in two stages below.
PROOF of Theorem 6.2.1(i) We will rst prove Theorem 6.2.1(i). Noting the results
in Section 5.4, to prove consistency we recall that we must show (a) the (
0
,
0
,
0
) is the
unique minimum of L() (b) pointwise convergence
1
T
L(, , ))
a.s.
L(, , )) and (b) stochastic
equicontinuity (as dened in Denition 5.4.2). To show that (
0
,
0
,
0
) is the minimum we note
that
L(, , )) L(
0
,
0
,
0
)) = log(

2
0
) +

2

2
0
1 +E(Z
t
(, )
2
).
Since for all positive x, log x+x1 is a positive function and E(Z
t
(, )
2
) = E(

j=1
(b
j
(, )
b
j
(
0
,
0
))X
tj
)
2
is positive and zero at (
0
,
0
,
0
) it is clear that
0
,
0
,
0
is the minimum of
L. We will assume for now it is the unique minimum. Pointwise convergence is an immediate
consequence of Lemma 6.2.2(i). To show stochastic equicontinuity we note that for any
1
=
(
1
,
1
,
1
) and
2
= (
2
,
2
,
2
) we have by the mean value theorem
L
n
(
1
,
1
,
1
) L
n
(
2
,
2
,
2
)) = (
1

2
)L
n
(

, ).
Now by using (6.35) we have
L
n
(
1
,
1
,
1
) L
n
(
2
,
2
,
2
)) S
T
|(
1

2
), (
1

2
), (
1

2
)|
2
.
By using Lemma 6.2.2(ii) we have S
n
a.s.
E(

max(p,q)
r
1
,r
2
=0

n
t=1
Y
tr
1
Y
tr
2
), hence S
n
is almost
surely bounded. This implies that L
n
is equicontinuous. Since we have shown pointwise con-
vergence and equicontinuity of L
n
, by using Corollary 5.4.1, we almost sure convergence of the
estimator. Thu proving (i).
PROOF of Theorem 6.2.1(ii) We now prove Theorem 6.2.1(i) using the Martingale
central limit theorem (see Billingsley (1995) and Hall and Heyde (1980)) in conjunction with
the Cramer-Wold device (see Theorem 5.7.1).
Using the mean value theorem we have
_

0
_
=
2
L

n
(
n
)
1
L

n
(
0
,
0
,
0
)
where

n
= (

n
,

n
,

n
),
0
= (
0
,
0
,
0
) and
n
=

,

, lies between

n
and
0
.
Using the same techniques given in Theorem 6.2.1(i) and Lemma 6.2.2 we have pointwise
convergence and equicontinuity of
2
L
n
. This means that
2
L
n
(
n
)
a.s.
E(
2
L
n
(
0
,
0
,
0
)) =
1

2
(since by denition of
n

n
a.s.

0
). Therefore by applying Slutskys theorem (since is
nonsingular) we have

2
L
n
(
n
)
1
a.s.

2

1
. (6.37)
77
Now we show that L
n
(
0
) is asymptotically normal. By using (6.34) and replacing X
ti
=

0
(B)
1

0
(B)
ti
we have
1
n

i
L
n
(
0
,
0
,
0
) =
2

2
n
n

t=1

t
(1)

0
(B)

ti
=
2

2
n
n

t=1

t
V
ti
i = 1, . . . , q
1
n

j
L
n
(
0
,
0
,
0
) =
2

2
n
n

t=1

t
1

0
(B)

tj
=
2

2
n
T

t=1

t
U
tj
j = 1, . . . , p
1
n

2L
n
(
0
,
0
,
0
) =
1

2

1

4
n
T

t=1

2
=
1

4
n
T

t=1
(
2

2
),
where U
t
=
1

0
(B)

t
and V
t
=
1

0
(B)

t
. We observe that
1
n
L
n
is the sum of vector martingale
dierences. If E(
4
t
) < , it is clear that E((
t
U
tj
)
4
) = E((
4
t
)E(U
tj
)
4
) < , E((
t
V
ti
)
4
) =
E((
4
t
)E(V
ti
)
4
) < and E((
2

2
t
)
2
) < . Hence Lindebergs condition is satised (see the
proof given in Section 5.8, for why this is true). Hence we have

nL
n
(
0
,
0
,
0
)
D
^(0, ).
Now by using the above and (6.37) we have

n
_

0
_
=

n
2
L
n
(
n
)
1
L
n
(
0
)

n
_

0
_
D
^(0,
4

1
).
Thus we obtain the required result.
The above result proves consistency and asymptotically normality of (

n
,

n
,

n
), which is
based on L
n
(, , ), which in practice is impossible to evaluate. However we will show below
that the gaussian likelihood, L
n
(, , ) and is derivatives are suciently close to L
n
(, , )
such that the estimators (

n
,

n
,

n
) and the GMLE, (

n
,

n
,
n
) = arg min L
n
(, , ) are
asymptotically equivalent. We use Lemma 6.2.1 to prove the below result.
Proposition 6.2.2 Suppose that Assumption 6.2.1 hold and L
n
(, , ),

L
n
(, , ) and L
n
(, , )
are dened as in (6.23), (6.25) and (6.27) respectively. Then we have for all (, ) Theta we
have almost surely
sup
(,,)
1
n
[
(k)

L(, , )
k
L
n
(, , )[ = O(
1
n
) sup
(,,)
1
n
[

L
n
(, , ) L(, , )[ = O(
1
n
),
for k = 0, 1, 2, 3.
PROOF. The proof of the result follows from (6.28) and (6.30). We show that result for
sup
(,,)
1
n
[

L(, , ) L
n
(, , )[, a similar proof can be used for the rest of the result.
Let us consider the dierence
L
n
(, ) L
n
(, ) =
1
n
(I
n
+II
n
+III
n
),
78
where
I
n
=
n1

t=1
_
r
t
(, , )
2
_
, II
n
=
n1

t=1
1
r
t
(, , )
(X
(,)
t+1
X
(,)
t+1|t
)
2
III
n
=
n1

t=1
1

2
_
2X
t+1
(X
(,)
t+1|t


X
(,)
t+1|t,...
) + ((X
(,)
t+1|t
)
2
(

X
(,)
t+1|t,...
)
2
)
_
.
Now we recall from Proposition 6.2.1 that

X
(,)
t+1|t


X
(,)
t+1|t,...

K V
t

t
(1 )
where V
t
=

t
i=1

i
[X
i
[. Hence since E(X
2
t
) < and E(V
2
t
) < we have that sup
n
E[I
n
[ < ,
sup
n
E[II
n
[ < and sup
n
E[III
n
[ < . Hence the sequence [I
n
+ II
n
+ III
n
[
n
is almost
surely bounded. This means that almost surely
sup
,,

L
n
(, ) L
n
(, )

= O(
1
n
).
Thus giving the required result.
Now by using the above proposition the result below immediately follows.
Theorem 6.2.2 Let (

,

) = arg min L
T
(, , ) and (

,

) = arg min

L
T
(, , )
(i) (

,

)
a.s.
(
0
,
0
) and (

,

)
a.s.
(
0
,
0
).
(ii)

T(

T

0
,

T

0
)
D
^(0,
4
0

1
)
and

T(

T

0
,

T

0
)
D
^(0,
4
0

1
).
PROOF. The proof follows immediately from Proposition 6.2.1.
79
Chapter 7
Residual Bootstrap for estimation in
autoregressive processes
In Chapter 6 we consider the asymptotic sampling properties of the several estimators includ-
ing the least squares estimator of the autoregressive parameters and the gaussian maximum
likelihood estimator used to estimate the parameters of an ARMA process. The asymptotic dis-
tributions are often used for statistical testing and constructing condence intervals. However
the results are asymptotic, and only hold (approximately), when the sample size is relatively
large. When the sample size is smaller, the normal approximation is not valid and better approx-
imations are sought. Even in the case where we are willing to use the asymptotic distribution,
often we need to obtain expressions for the variance or bias. Sometimes this may not be pos-
sible or only possible with a excessive eort. The Bootstrap is a power tool which allows one
to approximate certain characteristics. To quote from Wikipedia Bootstrap is the practice of
estimating properties of an estimator (such as its variance) by measuring those properties when
sampling from an approximating distribution. Bootstrap essentially samples from the sample.
Each subsample is treated like a new sample from a population. Using these new multiple
realisations one can obtain approximations for CIs and variance estimates for the parameter
estimates. Of course in reality we do not have multiple-realisations, we are sampling from the
sample. Thus we are not gaining more as we subsample more. But we do gain some insight into
the nite sample distribution. In this chapter we will details the residual bootstrap method,
and then show that the asymptotically the bootstrap distribution coincides with asymptotic
distribution.
The residual bootstrap method was rst proposed by J. P. Kreiss (Kreiss (1997) is a very nice
review paper on the subject), (see also Franke and Kreiss (1992), where an extension to AR()
processes is also given here). One of the rst theoretical papers on the bootstrap is Bickel and
Freedman (1981). There are several other boostrapping methods for time series, these include
bootstrapping the periodogram, block bootstrap, bootstrapping the Kalman lter (Stoer and
Wall (1991), Stoer and Wall (2004) and Shumway and Stoer (2006)). These methods have
not only been used for variance estimation but also determining orders etc. At this point it is
worth mentioning methods Frequency domain approaches are considered in Dahlhaus and Janas
(1996) and Franke and H ardle (1992) (a review of subsampling methods can be found in Politis
et al. (1999)).
80
7.1 The residual bootstrap
Suppose that the time series X
t
satises the stationary, causal AR process
X
t
=
p

j=1

j
X
tj
+
t
,
where
t
are iid random variables with mean zero and variance one and the roots of the
characteristic polynomial have absolute value greater than (1 + ). We will suppose that the
order p is known.
The residual bootstrap for autoregressive processes
(i) Let

p
=
1
n
n

t=p+1
X
t1
X

t1
and
p
=
1
n
n

t=p+1
X
t1
X
t
, (7.1)
where X

t
= (X
t
, . . . , X
tp+1
). We use

n
= (

1
, . . . ,

p
)

==

1
p

p
as an estimator of
= (
1
, . . . ,
p
).
(ii) We create the bootstrap sample by rst estimating the residuals
t
and sampling from
the residuals. Let

t
= X
t

j=1

j
X
tj
.
(iii) Now create the empirical distribution function based on
t
. Let

F
n
(x) =
1
n p
n

t=p+1
I
(,
t
]
(x).
we notice that sampling from the distribution

F
n
(x), means observing
t
with probability
(n p)
1
.
(iv) Sample independently from the distribution

F
n
(x) n times. Label this sample as
+
k
.
(v) Let X
k
=
+
k
for 1 k p and
X
+
k
=
p

j=1

j
X
+
kj
+
k
, p < k n.
(vi) We call X
+
k
. Repeating step (vi,v) N times gives us N bootstrap samples. To distinguish
each sample we can label each bootstrap sample as ((X
+
k
)
(i)
; i = p + 1, . . . , n).
(vii) For each bootstrap sample we can construct a bootstrap matrix, vector and estimator
(
+
p
)
(i)
, (
+
p
)
(i)
and (

+
n
)
(i)
= ((
+
p
)
(i)
)
1
(
+
p
)
(i)
.
(viii) Using (

+
n
)
(i)
we can estimate the variance of

n
with
1
n

n
j=1
((

+
n
)
(i)

n
) and the
distribution function of

n
.
81
7.2 The sampling properties of the residual bootstrap estimator
In this section we show that the distribution of

n(

+
n

n
) and

n(

n
) asymptotically
coincide. This means that using the bootstrap distribution is no worse than using the asymptotic
normal approximation. However it does not say the bootstrap distribution better approximates
the nite sample distribution of (

n
), to show this one would have to use Edgeworth expansion
methods.
In order to show that the distribution of the bootstrap sample

n(

+
n

n
) asymptotically
coincides with the asymptotic distribution of

n(

n
), we will show convergence of the
distributions under the following distance
d
p
(H, G) = inf
XH,Y G
E(X Y )
p

1/p
,
where p > 1. Roughly speaking, if d
p
(F
n
, G
n
) 0, then the limiting distributions of F
n
and G
n
are the same (see Bickel and Freedman (1981)). The case that p = 2 is the most commonly used
p, and for p = 2, this is called Mallows distance. The Mallows distance between the distribution
H and G is dened as
d
2
(H, G) = inf
XH,Y G
E(X Y )
2

1/2
,
we will use the Mallow distance to prove the results below. It is worth mentioning that the
distance is zero when H = G are the same (as a distance should be). To see this, set the joint
distribution between X and Y to be F(x, y) == G(x) when y = x and zero otherwise, then it
clear that d
2
(H, G) = 0. To reduce notation rather than specify the distributions, F and G, we
let d
p
(X, Y ) = d
p
(H, G), where the random variables X and Y have the marginal distributions
H and G, respectively. We mention that distance d
p
satises the triangle inequality.
The main application of showing that d
p
(F
n
, G
n
) 0 is stated in the following lemma, which
is a version of Lemma 8.3, Bickel and Freedman (1981).
Lemma 7.2.1 Let ,
n
be two probability measures then d
p
(
n
, ) 0 if and only if
E

n
([X[
p
) =
_
[x[
p

n
(dx) E

([X[
p
) =
_
[x[
p
(dx) n .
and the distribution
n
converges weakly to the distribution .
Our aim is to show that
d
2
_
n(

+
n

n
),

n(

_
0,
which implies that their distributions asymptotically coincide. To do this we use
(

n(

n
) =

1
p
(
p

p
)
(

n(

+
n

) =

n(
+
p
)
1
(
+
p

+
p

n
).
Studying how

p
,
p
,
+
p
and
+
p
are constructed, we see as a starting point we need to show
d
2
(X
+
t
, X
t
) 0 t, n , d
2
(Z
+
t
, Z
t
) 0 n .
We start by showing that d
2
(Z
+
t
, Z
t
) 0
82
Lemma 7.2.2 Suppose
+
t
is the bootstrap residuals and
t
are the true residuals. Dene the
discrete random variable J = p + 1, . . . , n and let P(J = k) =
1
np
. Then
E
_
(
J

J
)
2
[X
1
, . . . , X
n
_
= O
p
(
1
n
) (7.2)
and
d
2
(

F
n
, F) d
2
(

F
n
, F
n
) +d
2
(F
n
, F) 0 as , (7.3)
where F
n
=
1
n1

n
t=p+1
I
(,
t
)
(x),

F
n
(x) =
1
np

n
t=p+1
I
(,
t
]
(x) are the empirical distribu-
tion function based on the residuals
t

n
p
and estimated residuals
t

n
p
, and F is the distribution
function of the residual
t
.
PROOF. We rst show (7.2). From the denition of
+
J
and
J
we have
E([
J

J
[
2
[X
1
, . . . , X
n
) =
1
n p
n

t=p+1
(
t

t
)
2
=
1
n p
n

t=p+1
(
p

j=1
[

j

j
]X
tj
)
2
=
p

j
1
,j
2
=1
[

j
1

j
1
][

j
2

j
2
]
1
n p
n

t=p+1
X
tj
1
X
tj
2
.
Now by using (5.27) we have sup
1jp
[

j

j
[ = O
p
(n
1/2
), therefore we have E[
J

J
[
2
=
O
p
(n
1/2
).
We now prove (7.3). We rst note by the triangle inequality we have
d
2
(F, F
n
) d
2
(F, F
n
) +d
2
(

F
n
, F
n
).
By using Lemma 8.4, Bickel and Freedman (1981), we have that d
2
(F
n
, F) 0. Therefore we
need to show that d
2
(

F
n
, F
n
) 0. It is clear by denition that d
2
(

F
n
, F
n
) = d
2
(
+
t
,
t
), where
+
t
is sampled from

F
n
=
1
n1

n
t=p+1
I
(,
t
)
(x) and
t
is sampled fromF
n
=
1
n1

n
t=p+1
I
(,
t
)
(x).
Hence,
t

+
t
have the same distribution as
J
and
J
. We now evaluate d
2
(
+
t
,
t
). To evaluate
d
2
(
+
t
,
t
) = inf

+
t

F
n
,
t
F
n
E[
+
t

t
[ we need that the marginal distributions of (
+
t
,
t
) are

F
n
and F
n
, but the inmum is over all joint distributions. It is best to choose a joint distribu-
tion which is highly dependent (because this minimises the distance between the two random
variables). An ideal candidate is to suppose that
+
t
=
J
and
t
=
J
, since these have the
marginals

F
n
and F
n
respectively. Therefore
d
2
(

F
n
, F
n
)
2
= inf

+
t

F
n
,
t
F
n
E[
+
t

t
[
2
E
_
(
J

J
)
2
[X
1
, . . . , X
n
_
= O
p
(
1
n
),
where the above rate comes from (7.2). This means that d
2
(

F
n
, F
n
)
P
0, hence we obtain (7.3).

Corollary 7.2.1 Suppose


+
t
is the bootstrapped residual. Then we have
E

F
n
((
+
t
)
2
[X
1
, . . . , X
n
)
P
E
F
(
2
t
)
83
PROOF. The proof follows from Lemma 7.2.1 and Lemma 7.2.2.
We recall that since X
t
is a causal autoregressive process, there exists some coecients a
j

such that
X
t
=

j=0
a
j

tj
,
where a
j
= a
j
() = [A()
j
]
1,1
= [A
j
]
1,1
(see Lemma 2.2.1). Similarly using the estimated
parameters

n
we can write X
+
t
as
X
+
t
=
t

j=0
a
j
(

n
)
+
tj
,
where a
j
(

n
) = [A(

n
)
j
]
1,1
. We now show that d
2
(X
+
t
, X
t
) 0 as n and t .
Lemma 7.2.3 Let J
p+1
, . . . , J
n
be independent samples from n p + 1, . . . , n with P(J
i
=
k) =
1
np
. Dene
Y
+
t
=
t

j=p+1
a
j
(

n
)
+
J
tj
,

Y
+
t
=
t

j=p+1
a
j
(

n
)
+
J
tj
,

Y
t
=
t

j=p+1
a
j

J
tj
, Y
t
=

Y
t
+

j=t+p+1
a
j

tj
,
where
J
j
is a sample from
p+1
, . . . ,
n
and
J
is a sample from
p+1
, . . . ,
n
. Then we have
E
_
(Y
+
t


Y
+
t
)
2
[X
1
, . . . , X
n
_
= O
p
(
1
n
), d
2
(Y
+
t
,

Y
+
t
) 0 n , (7.4)
E
_
(

Y
+
t


Y
t
)
2
[X
1
, . . . , X
n
_
= O
p
(
1
n
), d
2
(

Y
+
t
,

Y
t
) 0 n , (7.5)
and
E
_
(

Y
t
Y
t
)
2
[X
1
, . . . , X
n
_
K
t
, d
2
(

Y
t
, Y
t
) 0 n . (7.6)
PROOF. We rst prove (7.4). It is clear from the denitions that
E
_
(Y
+
t


Y
+
t
)
2
[ X
1
, . . . , X
n
_

j=0
([A()
j
]
1,1
[A(

n
)
j
]
1,1
)
2
E((
+
j
)
2
[X
1
, . . . , X
n
). (7.7)
Using Lemma 7.2.1 we have that E((
+
j
)
2
[X
1
, . . . , X
n
) is the same for all j and E((
+
j
)
2
[X
1
, . . . , X
n
)
P

E(
2
t
), hence we will consider for now ([A()
j
]
1,1
[A(

n
)
j
]
1,1
)
2
. Using (5.27) we have (

n
) =
O
p
(n
1/2
), therefore by the mean value theorem we have [A() A(

n
)[ = (

n
)D
K
n
D
(for some random matrix D). Hence
A(

n
)
j
= (A() +
K
n
D)
j
= A()
j
_
1 +A()
1
K
n
_
j
84
(note these are heuristic bounds, and this argument needs to be made precise). Applying the
mean value theorem again we have
A()
j
_
1 +A()
1
K
n
D
_
j
= A()
j
+
K
n
DA()
j
(1 +A()
1
K
n
B)
j
,
where B is such that |B|
spec
|
K
n
D|. Altogether this gives
[[A()
j
A(

n
)
j
]
1,1
[
K
n
DA()
j
(1 +A()
1
K
n
B)
j
.
Notice that for large enough n, (1 + A()
1 K
n
B)
j
is increasing slower (as n ) than A()
j
is contracting. Therefore for a large enough n we have

[A()
j
A(

n
)
j
]
1,1

K
n
1/2

j
,
for any
1
1+
< < 1. Subsituting this into (7.7) gives
E
_
(Y
+
t


Y
+
t
)
2
[X
1
, . . . , X
n
_

K
n
1/2
E((
+
t
)
2
)
t

j=0

j
= O
p
(
1
n
) 0 n .
hence d
2
(

Y
+
t
, Y
+
t
) 0 as n .
We now prove (7.5). We see that
E
_
(

Y
+
t


Y
t
)
2
[X
1
, . . . , X
n
_
=
t

j=0
a
2
j
E(
J
tj

J
tj
)
2
= E(
J
tj

J
tj
)
2
t

j=0
a
2
j
. (7.8)
Now by substituting (7.2) into the above we have E(

Y
+
t


Y
t
)
2
= O(n
1
), as required. This
means that d
2
(

Y
+
t
,

Y
t
) 0.
Finally we prove (7.6). We see that
E
_
(

Y
t
Y
t
)
2
[X
1
, . . . , X
n
_
=

j=t+1
a
2
j
E(
2
t
). (7.9)
Using (2.7) we have E(

Y
t
Y
t
)
2
K
t
, thus giving us (7.6).
We can now almost prove the result. To do this we note that
(
p

p
) =
1
n p
n

t=p+1

t
X
t1
, (
+
p

+
p

n
) =
1
n p
n

t=p+1

+
t
X
+
t1
. (7.10)
Lemma 7.2.4 Let Y
t
, Y
+
t
,

Y
+
t
and

Y
t
, be dened as in Lemma 7.2.3. Dene

p
and

+
p
,
p
and
+
p
in the same way as

p
and
p
dened in (7.1), but using Y
t
and Y
+
t
dened in Lemma
7.2.3, rspectively, rather than X
t
. We have that
d
2
(Y
t
, Y
+
t
) E(Y
t
Y
+
t
)
2

1/2
= O
p
(K(n
1/2
+
t
), (7.11)
85
d
2
(Y
t
, X
t
) 0, n , (7.12)
and
d
2
_
n(
p

p
),

n(
+
p

+
p

n
)
_
nE
_
(
p

p
) (
+
p

+
p

n
)
_
2
0 n , (7.13)
where

p
,

+
p
,
p
and
+
p
are dened in the same was as

p
,
+
p
,
p
and
+
p
, but with Y
t

replacing X
t
in

p
and
p
and Y
+
t
replacing X
+
t
in

+
p
and
+
p
. Furthermore we have
E[

+
p

p
[ 0, (7.14)
d
2
_
(
p

p
), (
p

p
)
_
0, E[

p
[ 0 n . (7.15)
PROOF. We rst prove (7.11). Using the triangle inequality we have
E
_
(

Y
t
Y
+
t
)
2
[X
1
, . . . , X
n
_

1/2

_
E(Y
t


Y
t
)
2
[X
1
, . . . , X
n
_

1/2
+
_
E(

Y
t


Y
+
t
)
2
[X
1
, . . . , X
n
_

1/2
+E
_
(

Y
+
t
Y
+
t
)
2
[X
1
, . . . , X
n
_

1/2
= O(n
1/2
+
t
),
where we use Lemma 7.2.3 we get the second inequality above. Therefore by denition of
d
2
(X
t
, X
+
t
) we have (7.11). To prove (7.12) we note that the only dierence between Y
t
and
X
t
is that the
J
k
in Y
t
, is sampled from
p+1
, . . . ,
n
hence sampled from F
n
, where as the

n
t=p+1
in X
t
are iid random variables with distribution F. Since d
2
(F
n
, F) 0 (Bickel and
Freedman (1981), Lemma 8.4) it follows that d
2
(Y
t
, X
t
) 0, thus proving (7.12).
To prove (7.13) we consider the dierence (
p

p
) (
+
p

+
p

n
) and use (7.10) to get
1
n
n

t=p+1
_

t
Y
t1

+
t
Y
+
t1
_
=
1
n
n

t=p+1
_
(
t

+
t
)Y
t1
+
+
t
(Y
t1
Y
+
t1
)
_
,
where we note that Y
+
t1
= (Y
+
t1
, . . . , Y
+
tp
)

and Y
t1
= (Y
t1
, . . . , Y
tp
)

. Using the above,


and taking conditional expectations with respect to X
1
, . . . , X
n
and noting that conditioned
on X
1
, . . . , X
n
, (
t

+
t
) are independent of X
k
and X
+
k
for k < t we have
_
1
n
E
_
n

t=p+1
_

t
Y
t1

+
t
Y
+
t1
__
2
[X
1
, . . . , X
n
_
1/2
I +II
where
I =
1
n
n

t=p+1
E
_
(
t

+
t
)
2
[X
1
, . . . , X
n
_

1/2
E(Y
2
t1
[X
1
, . . . , X
n
)
1/2
= E
_
(
t

+
t
)
2
[X
1
, . . . , X
n
_

1/2
1
n
n

t=p+1
E((Y
2
t1
[X
1
, . . . , X
n
)
1/2
II =
1
n
n

t=p+1
E((
+
t
)
2
[X
1
, . . . , X
n
)
1/2
E((Y
t1
Y
+
t1
)
2
[X
1
, . . . , X
n

1/2
= E((
+
t
)
2
[X
1
, . . . , X
n
)
1/2
1
n
n

t=p+1
E((Y
t1
Y
+
t1
)
2
[X
1
, . . . , X
n
)
1/2
.
86
Now by using (7.2) we have I Kn
1/2
, and (7.13) and Corollary 7.2.1 we obtain II Kn
1/2
,
hence we have (7.13). Using a similar technique to that given above we can prove (7.14).
(7.15) follows from (7.13), (7.14) and (7.12).
Corollary 7.2.2 Let
+
p
,

p
,
p
and
+
p
be dened in (7.1). Then we have
d
2
_

n(
p

p
),

n(
+
p

+
p

n
)
_
0 (7.16)
d
1
(
+
p
,

p
) 0, (7.17)
as n .
PROOF. We rst prove (7.16). Using (7.13), (7.15) and the triangular inequality gives (7.16).
To prove (7.17) we use (7.14) and (7.15) and the triangular inequality and (7.16) immediately
follows.
Now by using (7.17) and Lemma 7.2.1 we have

+
p
P
E(
p
),
and by using (7.16), the distribution of

n(
+
p

+
p

n
converges weakly to the distribution of

n(
p

p
). Therefore

n(

+
n

n
)
D
^(0, 2
1
p
),
hence the distributions of

n(
p

p
) and

n(
+
p

+
p

n
) aymptotically coincide. From
(5.28) we have

n(

n
)
D
^(0,
2

1
p
). Thus we see that the distribution of

n(

n
)
and

n(

+
n

n
) asymptotically coincide.
87
Chapter 8
Spectral Analysis
8.1 Some Fourier background
The background given here is a extremely sketchy (to say the least), for a more thorough
background the reader is referred, for example, to Priestley (1983), Chapter 4 and Fuller (1995),
Chapter 3.
(i) Fourier transforms of nite sequences
It is straightforward to show (by using that

n
j=1
exp(i2k/n) = 0 for k ,= 0) that if
d
k
=
1

n
n

j=1
x
j
exp(i2jk/n),
then x
r
can be recovered by inverting this transformation
x
r
=
1

n
n

k=1
d
k
exp(i2rk/n),
(ii) Fourier sums and integrals
Of course the above only has meaning when x
k
is a nite sequence. However suppose
that x
k
is a sequence which belongs to
2
(that is

k
x
2
k
< ), then we can dene the
function
f() =
1

k=
x
k
exp(ik),
where
_
2
0
f()
2
d =

k
x
2
k
, and we we can recover x
k
from f(). That is
x
k
=
1

2
_
2
0
f() exp(ik).
88
(iii) Convolutions. Let us suppose that the Fourier transform of the sequence a
k
is A() =
1

k
a
k
exp(ik) and Fourier transform of the sequence b
k
is B() =
1

k
b
k
exp(ik).
Then

j=
a
j
b
kj
=
_
A()B() exp(ik)d

j=
a
j
b
j
exp(ij) =
_
A()B( )d. (8.1)
8.2 Motivation

To give a taster of the spectral representations below, let us consider the following example. Suppose that $\{X_t\}_{t=1}^{n}$ is a stationary time series. The Fourier transform of this sequence is

$J_n(\omega_j) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} X_t \exp(it\omega_j),$

where $\omega_j = 2\pi j/n$ (these are often called the fundamental frequencies). Using (i) above we see that

$X_t = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} J_n(\omega_k) \exp(-it\omega_k).$   (8.2)

This is just the inverse Fourier transform; however, $J_n(\omega_j)$ has some interesting properties. Under certain conditions it can be shown that $\mathrm{cov}(J_n(\omega_s), J_n(\omega_t)) \approx 0$ if $s \neq t$. So in some sense (8.2) can be considered as the decomposition of $X_t$ in terms of frequencies whose amplitudes are uncorrelated. Now if we let $f_n(\omega) = n^{-1} E(|J_n(\omega)|^2)$ and take the above argument further, we see that

$c(k) = \mathrm{cov}(X_t, X_{t+k}) = \frac{1}{n} \sum_{s=1}^{n} E(|J_n(\omega_s)|^2) \exp(it\omega_s - i(t+k)\omega_s) = \sum_{s=1}^{n} f_n(\omega_s) \exp(-ik\omega_s).$   (8.3)

For more details on this see Priestley (1983), Section 4.11 (pages 259-261). Note that the above can be considered as the eigen-decomposition of the stationary covariance function, since

$c(u, v) = c(u - v) = \sum_{s=1}^{n} f_n(\omega_s) \exp(iu\omega_s) \exp(-iv\omega_s),$

where $\{\exp(it\omega_s)\}$ are the eigenfunctions and $\{f_n(\omega_s)\}$ the eigenvalues.

Of course the entire time series $\{X_t\}$ will have infinite length (and in general it will not belong to $\ell_2$), so it is natural to ask whether the above results can be generalised to $\{X_t\}$. The answer is yes, by replacing the sum in (8.3) by an integral to obtain

$c(k) = \int_0^{2\pi} \exp(ik\omega) dF(\omega),$

where $F(\omega)$ is a positive nondecreasing function. Comparing with (8.3), we observe that $f_n(\omega_k)$ is a positive function, thus its integral (the equivalent of $F(\omega)$) is positive and nondecreasing. Therefore heuristically we can suppose that $F(\omega) \approx \int_0^{\omega} f_n(\lambda) d\lambda$.

Moreover, the analogue of (8.2) is

$X_t = \int \exp(it\omega) dZ(\omega),$

where $Z(\omega)$ is a right continuous orthogonal increment process (that is, $E\big((Z(\omega_1) - Z(\omega_2))\overline{(Z(\omega_3) - Z(\omega_4))}\big) = 0$ when the intervals $[\omega_1, \omega_2]$ and $[\omega_3, \omega_4]$ do not overlap) and $E(|Z(\omega)|^2) = F(\omega)$. We give the proofs of these results in the following section.

We mention that a more detailed discussion on spectral analysis in time series is given in Priestley (1983), Chapters 4 and 6, Brockwell and Davis (1998), Chapters 4 and 10, Fuller (1995), Chapter 3, and Shumway and Stoffer (2006), Chapter 4. In many of these references they also discuss tests for periodicity etc. (see also Quinn and Hannan (2001) for the estimation of frequencies).
8.3 Spectral representations

8.3.1 The spectral distribution

We first state a theorem which is very useful for checking positive definiteness of a sequence. See Brockwell and Davis (1998), Corollary 4.3.2 or Fuller (1995), Theorem 3.1.9. To prove part of the result we use the fact that if a sequence $\{a_k\} \in \ell_2$, then $g(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} a_k \exp(ik\omega) \in L_2$ (by Parseval's theorem) and $a_k = \int_0^{2\pi} g(\omega)\exp(ik\omega) d\omega$, together with the following result.

Lemma 8.3.1 Suppose $\sum_{k=-\infty}^{\infty} |c(k)| < \infty$. Then we have

$\frac{1}{n} \sum_{k=-(n-1)}^{n-1} |k\, c(k)| \to 0$

as $n \to \infty$.

PROOF. The proof is straightforward in the case that $\sum_{k=-\infty}^{\infty} |k c(k)| < \infty$; in this case $\sum_{k=-(n-1)}^{n-1} \frac{|k|}{n} |c(k)| = O(\frac{1}{n})$. The proof is slightly more tricky in the case that only $\sum_{k=-\infty}^{\infty} |c(k)| < \infty$ holds. First we note that since $\sum_{k} |c(k)| < \infty$, for every $\varepsilon > 0$ there exists an $N_\varepsilon$ such that for all $n \geq N_\varepsilon$, $\sum_{|k| \geq n} |c(k)| < \varepsilon$. Let us suppose that $n > N_\varepsilon$; then we have the bound

$\frac{1}{n} \sum_{k=-(n-1)}^{n-1} |k c(k)| \leq \frac{1}{n} \sum_{k=-(N_\varepsilon - 1)}^{N_\varepsilon - 1} |k c(k)| + \frac{1}{n} \sum_{|k| \geq N_\varepsilon} |k c(k)| \leq \frac{1}{n} \sum_{k=-(N_\varepsilon - 1)}^{N_\varepsilon - 1} |k c(k)| + \varepsilon.$

Hence if we keep $N_\varepsilon$ fixed we see that $\frac{1}{n} \sum_{k=-(N_\varepsilon - 1)}^{N_\varepsilon - 1} |k c(k)| \to 0$ as $n \to \infty$. Since this is true for all $\varepsilon$ (for different thresholds $N_\varepsilon$) we obtain the required result. $\Box$

Theorem 8.3.1 (The spectral density) Suppose the coefficients $\{c(k)\}$ are absolutely summable (that is $\sum_k |c(k)| < \infty$). Then the sequence $\{c(k)\}$ is nonnegative definite if and only if the function $f(\omega)$, where

$f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} c(k) \exp(ik\omega),$

is nonnegative. Moreover,

$c(k) = \int_0^{2\pi} \exp(ik\omega) f(\omega) d\omega.$   (8.4)

It is worth noting that $f$ is called the spectral density corresponding to the covariances $\{c(k)\}$.

PROOF. We first show that if $\{c(k)\}$ is a nonnegative definite sequence, then $f(\omega)$ is a nonnegative function. We recall that since $\{c(k)\}$ is nonnegative definite, then for any sequence $\underline{x} = (x_1, \ldots, x_n)$ (real or complex) we have $\sum_{s,t=1}^{n} x_s c(s-t) \bar{x}_t \geq 0$ (where $\bar{x}_s$ is the complex conjugate of $x_s$). Now we consider the above for the particular case $\underline{x} = (\exp(i\omega), \ldots, \exp(in\omega))$. Define the function

$f_n(\omega) = \frac{1}{2\pi n} \sum_{s,t=1}^{n} \exp(is\omega) c(s-t) \exp(-it\omega).$

Clearly $f_n(\omega) \geq 0$. We note that $f_n(\omega)$ can be rewritten as

$f_n(\omega) = \frac{1}{2\pi} \sum_{k=-(n-1)}^{n-1} \Big(\frac{n - |k|}{n}\Big) c(k) \exp(ik\omega).$

Comparing $f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} c(k)\exp(ik\omega)$ with $f_n(\omega)$ we see that

$\big|f(\omega) - f_n(\omega)\big| \leq \frac{1}{2\pi} \Big| \sum_{|k| \geq n} c(k)\exp(ik\omega) \Big| + \frac{1}{2\pi} \Big| \sum_{k=-(n-1)}^{n-1} \frac{|k|}{n} c(k)\exp(ik\omega) \Big| := I_n + II_n.$

Now since $\sum_k |c(k)| < \infty$ it is clear that $I_n \to 0$ as $n \to \infty$. Using Lemma 8.3.1 we have $II_n \to 0$ as $n \to \infty$. Altogether the above implies

$\big|f(\omega) - f_n(\omega)\big| \to 0 \quad \text{as } n \to \infty.$   (8.5)

Now, since for all $n$ the $f_n(\omega)$ are nonnegative functions, the limit $f$ must be nonnegative (if we suppose the contrary, then there must exist a sequence of functions $f_{n_k}(\omega)$ which are not nonnegative, which is not true). Therefore we have shown that if $\{c(k)\}$ is a nonnegative definite sequence, then $f(\omega)$ is a nonnegative function.

We now show that if $f(\omega)$, defined by $\frac{1}{2\pi}\sum_{k}c(k)\exp(ik\omega)$, is a nonnegative function, then $\{c(k)\}$ is a nonnegative definite sequence. We first note that because $\{c(k)\} \in \ell_1$ it is also in $\ell_2$, hence we have that $c(k) = \int_0^{2\pi} f(\omega)\exp(ik\omega) d\omega$. Now we have

$\sum_{s,t=1}^{n} x_s c(s-t)\bar{x}_t = \int_0^{2\pi} f(\omega) \Big( \sum_{s,t=1}^{n} x_s \exp(i(s-t)\omega)\bar{x}_t \Big) d\omega = \int_0^{2\pi} f(\omega) \Big| \sum_{s=1}^{n} x_s \exp(is\omega) \Big|^2 d\omega \geq 0.$

Hence we obtain the desired result. $\Box$
The above theorem is very useful. It gives a simple way to check whether a sequence $\{c(k)\}$ is nonnegative definite or not (hence whether it is a covariance function; recall Theorem 1.1.1).

Example 8.3.1 Suppose we define the empirical covariances

$\hat{c}_n(k) = \begin{cases} \frac{1}{n}\sum_{t=1}^{n-|k|} X_t X_{t+|k|} & |k| \leq n-1 \\ 0 & \text{otherwise}, \end{cases}$

then $\{\hat{c}_n(k)\}$ is a nonnegative definite sequence. Therefore, using Lemma 1.1.1, there exists a stationary time series $\{Z_t\}$ which has the covariance $\hat{c}_n(k)$.

To show that the sequence is nonnegative definite we consider the Fourier transform of the sequence (the spectral density) and show that it is nonnegative. The Fourier transform of $\{\hat{c}_n(k)\}$ is

$\sum_{k=-(n-1)}^{n-1} \exp(ik\omega) \hat{c}_n(k) = \sum_{k=-(n-1)}^{n-1} \exp(ik\omega) \frac{1}{n}\sum_{t=1}^{n-|k|} X_t X_{t+|k|} = \frac{1}{n}\Big| \sum_{t=1}^{n} X_t \exp(it\omega) \Big|^2 \geq 0.$

Since this is nonnegative, $\{\hat{c}_n(k)\}$ is a nonnegative definite sequence.
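The claim in Example 8.3.1 is easy to verify numerically: the Fourier transform of the empirical covariances equals $n^{-1}|\sum_t X_t e^{it\omega}|^2$, which is nonnegative. A small Python sketch (variable names are ours, any zero mean series can be substituted):

```python
import numpy as np

n = 200
X = np.random.randn(n)

# empirical covariances c_n(k), 0 <= k <= n-1
c = np.array([(X[:n - k] * X[k:]).sum() / n for k in range(n)])

def fourier_of_covariances(omega):
    k = np.arange(1, n)
    # sum_{k=-(n-1)}^{n-1} exp(i k omega) c_n(k), using c_n(-k) = c_n(k)
    return c[0] + 2 * (np.cos(k * omega) * c[1:]).sum()

omegas = 2 * np.pi * np.arange(n) / n
lhs = np.array([fourier_of_covariances(w) for w in omegas])
rhs = np.abs(np.fft.fft(X)) ** 2 / n          # n^{-1} |sum_t X_t exp(i t omega_j)|^2

print(lhs.min() >= -1e-10, np.allclose(lhs, rhs))   # nonnegative, and the two sides agree
```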
We now state a useful result which relates the largest and smallest eigenvalues of the variance matrix of a stationary process to the smallest and largest values of the spectral density.

Lemma 8.3.2 Suppose that $\{X_k\}$ is a stationary process with covariance function $\{c(k)\}$ and spectral density $f(\omega)$. Let $\Sigma_n = \mathrm{var}(\underline{X}_n)$, where $\underline{X}_n = (X_1, \ldots, X_n)$. Suppose $\inf_\omega f(\omega) \geq m > 0$ and $\sup_\omega f(\omega) \leq M < \infty$. Then for all $n$ we have

$\lambda_{\min}(\Sigma_n) \geq \inf_\omega f(\omega), \qquad \lambda_{\max}(\Sigma_n) \leq \sup_\omega f(\omega).$

PROOF. Let $e_1$ be the eigenvector with smallest eigenvalue $\lambda_{\min}(\Sigma_n)$ of $\Sigma_n$. Then using $c(s-t) = \int f(\omega)\exp(i(s-t)\omega)d\omega$ we have

$\lambda_{\min}(\Sigma_n) = e_1' \Sigma_n e_1 = \sum_{s,t=1}^{n} e_{s,1} c(s-t) e_{t,1} = \int f(\omega) \sum_{s,t=1}^{n} e_{s,1} \exp(i(s-t)\omega) e_{t,1} d\omega$
$= \int_0^{2\pi} f(\omega) \Big| \sum_{s=1}^{n} e_{s,1}\exp(is\omega) \Big|^2 d\omega \geq \inf_\omega f(\omega) \int_0^{2\pi} \Big| \sum_{s=1}^{n} e_{s,1}\exp(is\omega) \Big|^2 d\omega = \inf_\omega f(\omega),$

since $\int |\sum_{s=1}^{n} e_{s,1}\exp(is\omega)|^2 d\omega = 1$. Using a similar method we can show that $\lambda_{\max}(\Sigma_n) \leq \sup_\omega f(\omega)$. $\Box$

A consequence of the above result is that if a spectral density is bounded from above and bounded away from zero, then $\Sigma_n$ is non-singular and has a bounded spectral norm.

Lemma 8.3.3 Suppose the covariance $c(k)$ decays to zero as $k \to \infty$. Then for all $n$, $\Sigma_n = \mathrm{var}(\underline{X}_n)$ is a non-singular matrix (note that we do not require that the covariances are absolutely summable).

PROOF. See Brockwell and Davis (1998), Proposition 5.1.1. $\Box$
Theorem 8.3.1 only holds when the sequence $\{c(k)\}$ is absolutely summable. Of course this may not always be the case. An example of an extreme case is the time series $X_t = Z$. Clearly this is a stationary time series and its covariance is $c(k) = \mathrm{var}(Z)$ for all $k$. In this case the autocovariances $\{c(k)\}$ are not absolutely summable, hence the representation of the covariance in Theorem 8.3.1 cannot be applied. The reason is that the Fourier transform of the constant sequence is not well defined (since it belongs to neither $\ell_1$ nor $\ell_2$).

However, we now show that Theorem 8.3.1 can be generalised to include all nonnegative definite sequences and stationary processes, by considering the spectral distribution rather than the spectral density (we use the integral $\int g(x)dF(x)$; a definition is given in the Appendix).

Theorem 8.3.2 A sequence $\{c(k)\}$ is a nonnegative definite sequence if and only if

$c(k) = \int_0^{2\pi} \exp(ik\omega) dF(\omega),$   (8.6)

where $F(\omega)$ is a right-continuous (this means that $F(x+h) \to F(x)$ as $0 < h \to 0$), nondecreasing, nonnegative bounded function on $[0, 2\pi]$ (hence it has all the properties of a distribution function and can be treated as a distribution; it is usually called the spectral distribution). This representation is unique.

PROOF. We first show that if $\{c(k)\}$ is a nonnegative definite sequence, then we can write $c(k) = \int_0^{2\pi}\exp(ik\omega)dF(\omega)$, where $F(\omega)$ is a distribution function. Had $\{c(k)\}$ been absolutely summable, then we could use Theorem 8.3.1 to write $c(k) = \int_0^{2\pi}\exp(ik\omega)dF(\omega)$, where $F(\omega) = \int_0^{\omega}f(\lambda)d\lambda$ and $f(\omega) = \frac{1}{2\pi}\sum_k c(k)\exp(ik\omega)$. By using Theorem 8.3.1 we know that $f(\omega)$ is nonnegative, hence $F(\omega)$ is a distribution, and we have the result.

In the case that $\{c(k)\}$ is not absolutely summable we cannot use this approach, but we adapt some of the ideas used to prove Theorem 8.3.1. As in the proof of Theorem 8.3.1 define the nonnegative function

$f_n(\omega) = \frac{1}{2\pi n}\sum_{s,t=1}^{n}\exp(is\omega)c(s-t)\exp(-it\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\Big(\frac{n-|k|}{n}\Big)c(k)\exp(ik\omega).$

When $\{c(k)\}$ is not absolutely summable, the limit of $f_n(\omega)$ may no longer be well defined. To circumvent dealing with functions which may have awkward limits, we consider instead their integral, which we will show is always a distribution function. Let us define the function $F_n(\omega)$ whose derivative is $f_n(\omega)$, that is

$F_n(\omega) = \int_0^{\omega} f_n(\lambda)d\lambda, \qquad 0 \leq \omega \leq 2\pi.$

Since $f_n(\omega)$ is nonnegative, $F_n$ is a nondecreasing function, and it is bounded ($F_n(2\pi) = \int_0^{2\pi}f_n(\lambda)d\lambda \leq c(0)$). Hence $F_n$ satisfies all the properties of a distribution and can be treated as a distribution function. Now it is clear that for every $k$ we have

$\int_0^{2\pi}\exp(ik\omega)dF_n(\omega) = \begin{cases} \big(1 - \frac{|k|}{n}\big)c(k) & |k| \leq n \\ 0 & \text{otherwise}. \end{cases}$   (8.7)

If we let $d_{n,k} = \int_0^{2\pi}\exp(ik\omega)dF_n(\omega)$, we see that for every $k$, $d_{n,k} \to c(k)$ as $n \to \infty$. But we should ask what this tells us about the limit of the distributions $\{F_n\}$. Intuitively, the distributions $\{F_n\}$ should converge (weakly) to a function $F$ and this function should also be a distribution function (if the $F_n$ are all nondecreasing functions, then their limit must be nondecreasing). In fact this turns out to be the case by applying Helly's theorem (see Appendix). Roughly speaking, it states that given a sequence of distributions $\{G_k\}$ which are all bounded, there exists a distribution $G$ which is the limit of a subsequence $\{G_{k_i}\}$ (effectively this gives conditions for a sequence of functions to be compact); hence for every $h \in L_2$ we have

$\int h(\omega)dG_{k_i}(\omega) \to \int h(\omega)dG(\omega).$

We now apply this result to the sequence $\{F_n\}$. We observe that the sequence of distributions $\{F_n\}$ is uniformly bounded (by $c(0)$). Therefore, applying Helly's theorem, there must exist a distribution $F$ which is the limit of a subsequence $\{F_{k_i}\}$, that is for every $h \in L_2$ we have

$\int h(\omega)dF_{k_i}(\omega) \to \int h(\omega)dF(\omega), \qquad i \to \infty.$

We now show that the above is true not only for a subsequence but for the whole sequence $\{F_k\}$. We observe that $\{\exp(ik\omega)\}$ is a basis of $L_2$, and that the sequence $\{\int_0^{2\pi}\exp(ik\omega)dF_n(\omega)\}_n$ converges for every $k$ to $c(k)$. Therefore for all $h \in L_2$ we have

$\int h(\omega)dF_k(\omega) \to \int h(\omega)dF(\omega), \qquad k \to \infty,$

for some distribution function $F$. Looking at the case $h(\omega) = \exp(ik\omega)$ we have

$\int \exp(ik\omega)dF_k(\omega) \to \int \exp(ik\omega)dF(\omega), \qquad k \to \infty.$

Since $\int\exp(ik\omega)dF_k(\omega) \to c(k)$ and $\int\exp(ik\omega)dF_k(\omega) \to \int\exp(ik\omega)dF(\omega)$, we have $c(k) = \int\exp(ik\omega)dF(\omega)$, where $F$ is a distribution.

To show that $\{c(k)\}$ is a nonnegative definite sequence when $c(k)$ is defined as $c(k) = \int\exp(ik\omega)dF(\omega)$, we use the same method as given in the proof of Theorem 8.3.1. $\Box$

Example 8.3.2 We now construct the spectral distribution for the time series $X_t = Z$. Let $F(\omega) = 0$ for $\omega < 0$ and $F(\omega) = \mathrm{var}(Z)$ for $\omega \geq 0$ (hence $F$ is a step function). Then we have

$\mathrm{cov}(X_0, X_k) = \mathrm{var}(Z) = \int \exp(ik\omega)dF(\omega).$

8.3.2 The spectral representation theorem

We now state the spectral representation theorem and give a rough outline of the proof.
Theorem 8.3.3 If $\{X_t\}$ is a second order stationary time series with mean zero and spectral distribution $F(\omega)$, then there exists a right continuous, orthogonal increment process $\{Z(\omega)\}$ (that is, $E\big((Z(\omega_1) - Z(\omega_2))\overline{(Z(\omega_3) - Z(\omega_4))}\big) = 0$ when the intervals $[\omega_1, \omega_2]$ and $[\omega_3, \omega_4]$ do not overlap) such that

$X_t = \int_0^{2\pi} \exp(it\omega)dZ(\omega),$   (8.8)

where for $\omega_2 \leq \omega_1$, $E|Z(\omega_1) - Z(\omega_2)|^2 = F(\omega_1) - F(\omega_2)$ (noting that $F(0) = 0$). (One example of a right continuous, orthogonal increment process is Brownian motion, though this is just one example, and usually $Z(\omega)$ will be far more general than Brownian motion.)

Heuristically we see that (8.8) is the decomposition of $X_t$ in terms of frequencies whose amplitudes are orthogonal. In other words, $X_t$ is decomposed in terms of the frequencies $\exp(it\omega)$, which have the orthogonal amplitudes $dZ(\omega)$ ($\approx Z(\omega + \delta) - Z(\omega)$).

Remark 8.3.1 Note that so far we have not defined the integral on the right hand side of (8.8); this is known as a stochastic integral. Unlike many deterministic functions (functions whose derivative exists), one cannot really suppose $dZ(\omega) \approx Z'(\omega)d\omega$, because usually a typical realisation of $Z(\omega)$ will not be smooth enough to differentiate. For example, it is well known that Brownian motion is quite rough; that is, a typical realisation of Brownian motion satisfies $|B(t_1, \varpi) - B(t_2, \varpi)| \leq K(\varpi)|t_1 - t_2|^{\gamma}$, where $\varpi$ denotes the realisation and $\gamma \leq 1/2$, but in general $\gamma$ will not be larger. The integral $\int g(\omega)dZ(\omega)$ is well defined if it is defined as the limit (in the mean squared sense) of discrete sums. In other words, let $Z_n(\omega) = \sum_{k=1}^{n} Z(\omega_k)I_{[\omega_{k-1},\omega_k)}(\omega)$ and

$\int g(\omega)dZ_n(\omega) = \sum_{k=1}^{n} g(\omega_k)\big(Z(\omega_k) - Z(\omega_{k-1})\big);$

then $\int g(\omega)dZ(\omega)$ is the mean squared limit of $\{\int g(\omega)dZ_n(\omega)\}_n$, that is $E\big[\int g(\omega)dZ(\omega) - \int g(\omega)dZ_n(\omega)\big]^2 \to 0$. For a more precise explanation, see Priestley (1983), Sections 3.6.3 and 4.11, and Brockwell and Davis (1998), Section 4.7.

A very elegant explanation of the different proofs of the spectral representation theorem is given in Priestley (1983), Section 4.11. We now give a rough outline of the proof using the functional theory approach.

PROOF of the Spectral Representation Theorem. To prove the result we will define two Hilbert spaces $H_1$ and $H_2$, where $H_1$ contains deterministic functions and $H_2$ contains random variables. We will define what is known as an isomorphism (a one-to-one mapping which preserves the norm and is linear) between these two spaces.

Let $H_1$ consist of all functions $f$ with $\int_0^{2\pi} f^2(\omega)dF(\omega) < \infty$, and define the inner product on $H_1$ to be

$\langle f, g \rangle = \int_0^{2\pi} f(x)\overline{g(x)}dF(x).$   (8.9)

We first note that the functions $\exp(ik\omega)$ belong to $H_1$; moreover, they also span the space $H_1$. Hence if $f \in H_1$, then there exist coefficients $\{a_j\}$ such that $f(\omega) = \sum_j a_j\exp(ij\omega)$. Let $H_2$ be the space spanned by $\{X_t\}$, hence $H_2 = \overline{\mathrm{sp}}(X_t)$ (it is necessary to define the closure of this space, but we won't do so here), and the inner product is the covariance $\mathrm{cov}(Y, X)$.

Now let us define the mapping $T: H_1 \to H_2$,

$T\Big(\sum_{j=1}^{n} a_j \exp(ij\omega)\Big) = \sum_{j=1}^{n} a_j X_j,$   (8.10)

for any $n$ (it is necessary to show that this can be extended to infinite $n$, but we won't do so here). We need to show that $T$ defines an isomorphism. We first observe that this mapping preserves the inner product. That is, suppose $f, g \in H_1$; then there exist $\{f_j\}$ and $\{g_j\}$ such that $f(\omega) = \sum_j f_j\exp(ij\omega)$ and $g(\omega) = \sum_j g_j\exp(ij\omega)$. Hence by the definition of $T$ in (8.10) we have

$\langle Tf, Tg \rangle = \mathrm{cov}\Big(\sum_j f_j X_j, \sum_j g_j X_j\Big) = \sum_{j_1, j_2} f_{j_1}\bar{g}_{j_2}\mathrm{cov}(X_{j_1}, X_{j_2})$
$= \int_0^{2\pi}\Big(\sum_{j_1, j_2} f_{j_1}\bar{g}_{j_2}\exp(i(j_1 - j_2)\omega)\Big)dF(\omega) = \int_0^{2\pi} f(x)\overline{g(x)}dF(x) = \langle f, g \rangle.$

Hence $\langle Tf, Tg \rangle = \langle f, g \rangle$, so the inner product is preserved. To show that it is a one-to-one mapping, see Brockwell and Davis (1998), Section 4.7. Altogether this means that $T$ defines an isomorphism between $H_1$ and $H_2$. Therefore every function in $H_1$ has a corresponding random variable in $H_2$ which displays many similar properties.

Since for all $\omega \in [0, 2\pi]$ the indicator functions $I_{[0,\omega]}(x) \in H_1$, we can define the random function $\{Z(\omega); 0 \leq \omega \leq 2\pi\}$ by $T(I_{[0,\omega]}) = Z(\omega)$. Now, since the mapping $T$ is linear, we observe that $T(I_{[\omega_1,\omega_2]}) = Z(\omega_2) - Z(\omega_1)$. Moreover, since $T$ preserves the norm, we have for any non-intersecting intervals $[\omega_1, \omega_2]$ and $[\omega_3, \omega_4]$ that

$E\big((Z(\omega_2) - Z(\omega_1))\overline{(Z(\omega_4) - Z(\omega_3))}\big) = \langle T(I_{[\omega_1,\omega_2]}), T(I_{[\omega_3,\omega_4]}) \rangle = \langle I_{[\omega_1,\omega_2]}, I_{[\omega_3,\omega_4]} \rangle = \int I_{[\omega_1,\omega_2]}(\omega)I_{[\omega_3,\omega_4]}(\omega)dF(\omega) = 0.$

Therefore, by construction, $\{Z(\omega); 0 \leq \omega \leq 2\pi\}$ is an orthogonal increment process, with

$E\big(|Z(\omega_2) - Z(\omega_1)|^2\big) = \langle T(I_{[\omega_1,\omega_2]}), T(I_{[\omega_1,\omega_2]}) \rangle = \langle I_{[\omega_1,\omega_2]}, I_{[\omega_1,\omega_2]} \rangle = \int_{\omega_1}^{\omega_2} dF(\omega) = F(\omega_2) - F(\omega_1).$

Having defined the two isomorphic spaces, the random function $\{Z(\omega); 0 \leq \omega \leq 2\pi\}$ and the indicator functions $I_{[0,\omega]}(x)$, which have orthogonal increments, we can now prove the result. We note that for any function $g \in L_2$ we can write

$g(\lambda) = \int_0^{2\pi} g(s)dI(\lambda - s),$

where $I(s)$ is the identity function with $I(s) = 0$ for $s < 0$ and $I(s) = 1$ for $s \geq 0$ (hence $dI(\lambda - s) = \delta_\lambda(s)ds$, where $\delta_\lambda(s)$ is the Dirac delta function). We now consider the special case $g(\lambda) = \exp(it\lambda)$, and apply the isomorphism $T$ to this:

$T(\exp(it\lambda)) = \int_0^{2\pi}\exp(its)dT(I(\lambda - s)),$

where the mapping goes inside the integral due to the linearity of the isomorphism. Now we observe that $I(s - \omega) = I_{[0,s]}(\omega)$, and by definition of $\{Z(\omega); 0 \leq \omega \leq 2\pi\}$ we have $T(I_{[0,s]}(\cdot)) = Z(s)$. Substituting this into the above gives

$X_t = \int_0^{2\pi}\exp(its)dZ(s),$

which gives the required result. $\Box$
8.3.3 The spectral densities of MA, AR and ARMA models

We obtain the spectral density function for MA($\infty$) processes. Using this we can easily obtain the spectral density for ARMA processes. Let us suppose that $\{X_t\}$ satisfies the representation

$X_t = \sum_{j=-\infty}^{\infty} \psi_j \varepsilon_{t-j},$   (8.11)

where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$ and $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$. We recall that the covariance of the above is

$c(k) = E(X_t X_{t+k}) = \sigma^2 \sum_{j=-\infty}^{\infty} \psi_j \psi_{j+k}.$   (8.12)

Since $\sum_j |\psi_j| < \infty$, it can be seen that

$\sum_k |c(k)| \leq \sigma^2 \sum_k \sum_{j=-\infty}^{\infty} |\psi_j|\,|\psi_{j+k}| < \infty.$

Hence by using Theorem 8.3.1, the spectral density function of $\{X_t\}$ is well defined. There are several ways to derive the spectral density of $\{X_t\}$: we can either use (8.12) and $f(\omega) = \frac{1}{2\pi}\sum_k c(k)\exp(ik\omega)$, or obtain the spectral representation of $X_t$ and derive $f(\omega)$ from it. We prove the result using the latter method.

Since $\{\varepsilon_t\}$ are iid random variables, using Theorem 8.3.3 there exists an orthogonal increment random function $Z(\omega)$ such that

$\varepsilon_t = \frac{1}{\sqrt{2\pi}}\int_0^{2\pi}\exp(it\omega)dZ(\omega).$

Since $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$, multiplying the above by $\varepsilon_t$, taking expectations and noting that due to the orthogonality of $Z(\omega)$ we have $E(dZ(\omega_1)d\bar{Z}(\omega_2)) = 0$ unless $\omega_1 = \omega_2$, we have that $E(|dZ(\omega)|^2) = \sigma^2 d\omega$.

Using the above we can obtain the spectral representation of $\{X_t\}$:

$X_t = \frac{1}{\sqrt{2\pi}}\int_0^{2\pi}\Big(\sum_{j=-\infty}^{\infty}\psi_j\exp(-ij\omega)\Big)\exp(it\omega)dZ(\omega).$

Hence

$X_t = \int_0^{2\pi} A(\omega)\exp(it\omega)dZ(\omega),$

where $A(\omega) = \frac{1}{\sqrt{2\pi}}\sum_{j=-\infty}^{\infty}\psi_j\exp(-ij\omega)$, noting that this is the unique spectral representation of $X_t$.

Now, multiplying the above by $X_{t+k}$ and taking expectations gives

$E(X_t X_{t+k}) = c(k) = \int_0^{2\pi}\int_0^{2\pi} A(\omega_1)\overline{A(\omega_2)}\exp(it\omega_1 - i(t+k)\omega_2)E(dZ(\omega_1)d\bar{Z}(\omega_2)).$

Due to the orthogonality of $Z(\omega)$ we have $E(dZ(\omega_1)d\bar{Z}(\omega_2)) = 0$ unless $\omega_1 = \omega_2$; altogether this gives

$E(X_t X_{t+k}) = c(k) = \int_0^{2\pi}|A(\omega)|^2\exp(-ik\omega)E(|dZ(\omega)|^2) = \int_0^{2\pi}\sigma^2|A(\omega)|^2\exp(-ik\omega)d\omega.$

Comparing the above with (8.4), we see that the spectral density function corresponding to the MA($\infty$) process defined in (8.11) is

$f(\omega) = \sigma^2|A(\omega)|^2 = \frac{\sigma^2}{2\pi}\Big|\sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)\Big|^2.$

Example 8.3.3 Let us suppose that $\{X_t\}$ is a stationary ARMA($p, q$) time series (not necessarily invertible or causal), where

$X_t - \sum_{j=1}^{p}\phi_j X_{t-j} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j},$

and $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$. Then the spectral density of $\{X_t\}$ is

$f(\omega) = \frac{\sigma^2}{2\pi}\,\frac{|1 + \sum_{j=1}^{q}\theta_j\exp(ij\omega)|^2}{|1 - \sum_{j=1}^{p}\phi_j\exp(ij\omega)|^2}.$

We note that because the ARMA spectral density is a ratio of trigonometric polynomials, it is known as a rational spectral density.
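A short numerical sketch of the rational spectral density formula above, for an assumed ARMA(2,1) model; the parameter values are illustrative choices of ours, not taken from the notes.

```python
import numpy as np

phi = np.array([0.5, -0.3])        # AR coefficients phi_1, phi_2 (illustrative)
theta = np.array([0.4])            # MA coefficient theta_1 (illustrative)
sigma2 = 1.0

def arma_spectral_density(omega):
    """f(omega) = sigma^2/(2 pi) |1 + sum theta_j e^{ij omega}|^2 / |1 - sum phi_j e^{ij omega}|^2."""
    j_ma = np.arange(1, len(theta) + 1)
    j_ar = np.arange(1, len(phi) + 1)
    num = np.abs(1 + np.sum(theta * np.exp(1j * j_ma * omega))) ** 2
    den = np.abs(1 - np.sum(phi * np.exp(1j * j_ar * omega))) ** 2
    return sigma2 / (2 * np.pi) * num / den

omegas = np.linspace(0, np.pi, 5)
print([round(arma_spectral_density(w), 4) for w in omegas])
```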
8.3.4 Higher order spectra

We recall that the covariance is a measure of linear dependence between two random variables. Higher order cumulants are a measure of higher order dependence. For example, the third order cumulant of the zero mean random variables $X_1, X_2, X_3$ is

$\mathrm{cum}(X_1, X_2, X_3) = E(X_1 X_2 X_3)$

and the fourth order cumulant of the zero mean random variables $X_1, X_2, X_3, X_4$ is

$\mathrm{cum}(X_1, X_2, X_3, X_4) = E(X_1 X_2 X_3 X_4) - E(X_1 X_2)E(X_3 X_4) - E(X_1 X_3)E(X_2 X_4) - E(X_1 X_4)E(X_2 X_3).$

From the definition we see that if $X_1, X_2, X_3, X_4$ are independent, then $\mathrm{cum}(X_1, X_2, X_3) = 0$ and $\mathrm{cum}(X_1, X_2, X_3, X_4) = 0$.

Moreover, if $X_1, X_2, X_3, X_4$ are Gaussian random variables, then $\mathrm{cum}(X_1, X_2, X_3) = 0$ and $\mathrm{cum}(X_1, X_2, X_3, X_4) = 0$. Indeed all cumulants of order higher than two are zero. This comes from the fact that cumulants are the coefficients of the power series expansion of the logarithm of the moment generating function.

Since the spectral density is the Fourier transform of the covariance, it is natural to ask whether one can define the higher order spectra as the Fourier transforms of the higher order cumulants. This turns out to be the case, and the higher order spectra have several interesting properties.

Let us suppose that $\{X_t\}$ is a stationary time series (notice that we are assuming it is strictly stationary and not just second order). Let $\mathrm{cum}(t, s) = \mathrm{cum}(X_0, X_t, X_s)$ and $\mathrm{cum}(t, s, r) = \mathrm{cum}(X_0, X_t, X_s, X_r)$ (noting that, like the covariance, the higher order cumulants are invariant to shifts). The third and fourth order spectra are defined as

$f_3(\omega_1, \omega_2) = \sum_{s=-\infty}^{\infty}\sum_{t=-\infty}^{\infty}\mathrm{cum}(s, t)\exp(is\omega_1 + it\omega_2)$

$f_4(\omega_1, \omega_2, \omega_3) = \sum_{s=-\infty}^{\infty}\sum_{t=-\infty}^{\infty}\sum_{r=-\infty}^{\infty}\mathrm{cum}(s, t, r)\exp(is\omega_1 + it\omega_2 + ir\omega_3).$

Example 8.3.4 (Third and fourth order spectra of a linear process) Let us suppose that $\{X_t\}$ satisfies

$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},$

where $\sum_j|\psi_j| < \infty$, $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^4) < \infty$. Let $A(\omega) = \sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)$. Then it is straightforward to show that

$f_3(\omega_1, \omega_2) = \kappa_3 A(\omega_1)A(\omega_2)A(-\omega_1 - \omega_2)$

$f_4(\omega_1, \omega_2, \omega_3) = \kappa_4 A(\omega_1)A(\omega_2)A(\omega_3)A(-\omega_1 - \omega_2 - \omega_3),$

where $\kappa_3 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t)$ and $\kappa_4 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t, \varepsilon_t)$.

We see from the example that, unlike the spectral density, the higher order spectra are not necessarily positive or even real.

A review of higher order spectra can be found in Brillinger (2001). Higher order spectra have several applications, especially to nonlinear processes; see Subba Rao and Gabr (1984). We will consider one such application in a later chapter.
8.4 The periodogram and the spectral density function

Our aim is to construct an estimator of the spectral density function $f(\omega)$ associated with the second order stationary process $\{X_t\}$.

8.4.1 The periodogram and its properties

Let us suppose we observe $\{X_t\}_{t=1}^{n}$, which are observations from a zero mean, second order stationary time series. Let us suppose that the autocovariance function is $\{c(k)\}$, where $c(k) = E(X_t X_{t+k})$ and $\sum_k|c(k)| < \infty$. We recall that we can estimate $c(k)$ using

$\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_t X_{t+|k|}.$

Given that the spectral density is

$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}c(k)\exp(ik\omega),$

a natural estimator of $f(\omega)$ is

$I_X(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega).$   (8.13)

We usually call $I_X(\omega)$ the periodogram. We will show that the periodogram has several nice properties that make it a suitable candidate for a spectral density estimator. The only problem is that the raw periodogram turns out to be an inconsistent estimator. However, with some modifications of the periodogram we can construct a good estimator of the spectral density.
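As we note in (8.14) below, the periodogram equals $(2\pi n)^{-1}|\sum_t X_t e^{it\omega}|^2$, so it can be computed directly with the FFT rather than via the $\hat{c}_n(k)$. A minimal Python sketch (the function name is ours):

```python
import numpy as np

def periodogram(X):
    """Periodogram I_X(omega_j) = (2 pi n)^{-1} |sum_t X_t exp(i t omega_j)|^2
    at the fundamental frequencies omega_j = 2 pi j / n."""
    n = len(X)
    J = np.fft.fft(X)                          # J[j] = sum_t X_t exp(-2 pi i t j / n)
    I = np.abs(J) ** 2 / (2 * np.pi * n)
    omegas = 2 * np.pi * np.arange(n) / n
    return omegas, I

# example: white noise, where the periodogram fluctuates around sigma^2 / (2 pi)
omegas, I = periodogram(np.random.randn(1024))
print(I.mean(), 1 / (2 * np.pi))
```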
Lemma 8.4.1 Suppose that $\{X_t\}$ is a second order stationary time series with $\sum_k|c(k)| < \infty$ and $I_X(\omega)$ is defined in (8.13). Then we have

$I_X(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega) = \frac{1}{2\pi n}\Big|\sum_{t=1}^{n}X_t\exp(it\omega)\Big|^2$   (8.14)

and

$\big|E(I_X(\omega)) - f(\omega)\big| \leq \frac{1}{2\pi}\Big(\sum_{|k|\geq n}|c(k)| + \sum_{|k|\leq n}\frac{|k|}{n}|c(k)|\Big) \to 0$   (8.15)

as $n \to \infty$. Hence in the case that $\sum_{k=-\infty}^{\infty}|kc(k)| < \infty$ we have $|E(I_X(\omega)) - f(\omega)| = O(\frac{1}{n})$. Moreover, $\mathrm{var}\big(\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp(it\omega)\big) = E(I_X(\omega))$.

PROOF. (8.14) follows immediately from the definition of the periodogram. The first inequality in (8.15) is straightforward to obtain. To show that $\sum_{|k|\geq n}|c(k)| + \sum_{|k|\leq n}\frac{|k|}{n}|c(k)| \to 0$ as $n \to \infty$, we note that $\sum_k|c(k)| < \infty$, therefore $\sum_{|k|\geq n}|c(k)| \to 0$ as $n \to \infty$. To show that $\sum_{|k|\leq n}\frac{|k|}{n}|c(k)| \to 0$ as $n \to \infty$ we use $\sum_k|c(k)| < \infty$ and Lemma 8.3.1; thus we obtain the desired result. $\Box$

We see from the above that the periodogram is both a nonnegative function and an asymptotically unbiased estimator of the spectral density. Hence it has inherited several of the characteristics of the spectral density. However, a problem with the periodogram is that it is extremely erratic in its behaviour; in fact in the limit it does not converge to the spectral density. Hence as an estimator of the spectral density it is inappropriate. We will demonstrate this in the following two propositions and later discuss why this is so and how it can be modified into a consistent estimator.
We start by considering the periodogram of iid random variables.

Proposition 8.4.1 Suppose $\{\varepsilon_t\}_{t=1}^{n}$ are iid random variables with mean zero and variance $\sigma^2$. We define $J_\varepsilon(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\varepsilon_t\exp(it\omega)$ and $I_\varepsilon(\omega) = \frac{1}{2\pi n}\big|\sum_{t=1}^{n}\varepsilon_t\exp(it\omega)\big|^2$. Then we have

$\big(\Re(J_\varepsilon(\omega)), \Im(J_\varepsilon(\omega))\big)' \stackrel{D}{\to} \mathcal{N}\Big(0, \frac{\sigma^2}{4\pi}I_2\Big),$   (8.16)

and for any finite $m$,

$\big(J_\varepsilon(\omega_{k_1})', \ldots, J_\varepsilon(\omega_{k_m})'\big)' \stackrel{D}{\to} \mathcal{N}\Big(0, \frac{\sigma^2}{4\pi}I_{2m}\Big).$   (8.17)

Furthermore, $\frac{4\pi}{\sigma^2}I_\varepsilon(\omega) \stackrel{D}{\to} \chi^2(2)$ (which is equivalent to $I_\varepsilon(\omega)/f_\varepsilon(\omega)$, with $f_\varepsilon(\omega) = \sigma^2/2\pi$, converging to the exponential distribution with mean one), $\big(I_\varepsilon(\omega_{k_1}), \ldots, I_\varepsilon(\omega_{k_m})\big)$ converges in distribution to a vector of independent such random variables, and

$\mathrm{cov}\big(I_\varepsilon(\omega_j), I_\varepsilon(\omega_k)\big) = \begin{cases} \frac{\kappa_4}{4\pi^2 n} & j \neq k \\ \frac{\sigma^4}{4\pi^2} + \frac{\kappa_4}{4\pi^2 n} & j = k, \end{cases}$   (8.18)

where $\omega_j = 2\pi j/n$ and $\omega_k = 2\pi k/n$ (and $\omega_j, \omega_k \neq 0, \pi$).

PROOF. We first show (8.16). We note that $\Re(J_\varepsilon(\omega_k)) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}u_{t,n}$ and $\Im(J_\varepsilon(\omega_k)) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}v_{t,n}$, where $u_{t,n} = \varepsilon_t\cos(2\pi kt/n)$ and $v_{t,n} = \varepsilon_t\sin(2\pi kt/n)$. These are weighted sums of iid random variables, hence $\{u_{t,n}\}$ and $\{v_{t,n}\}$ are martingale differences. Therefore, to show asymptotic normality, we will use the martingale central limit theorem together with the Cramér-Wold device. To show the result we need to verify the three conditions of the martingale CLT. First we consider the variances and the conditional variances:

$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|u_{t,n}|^2\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots\big) = \frac{\sigma^2}{2\pi n}\sum_{t=1}^{n}\cos(2\pi kt/n)^2 \to \frac{\sigma^2}{4\pi}$

$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|v_{t,n}|^2\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots\big) = \frac{\sigma^2}{2\pi n}\sum_{t=1}^{n}\sin(2\pi kt/n)^2 \to \frac{\sigma^2}{4\pi}$

$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(u_{t,n}v_{t,n}\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots\big) = \frac{\sigma^2}{2\pi n}\sum_{t=1}^{n}\cos(2\pi kt/n)\sin(2\pi kt/n) = \frac{\sigma^2}{4\pi n}\sum_{t=1}^{n}\sin(2\cdot 2\pi kt/n) = 0.$

Finally we need to verify the Lindeberg condition; we only verify it for $\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}u_{t,n}$, the same argument holding for $\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}v_{t,n}$. We note that for every $\delta > 0$ we have

$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|u_{t,n}|^2 I(|u_{t,n}| \geq \delta\sqrt{2\pi n})\,\big|\,\varepsilon_{t-1},\ldots\big) = \frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|u_{t,n}|^2 I(|u_{t,n}| \geq \delta\sqrt{2\pi n})\big)$
$\leq \frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|\varepsilon_t|^2 I(|\varepsilon_t| \geq \delta\sqrt{2\pi n})\big) = \frac{1}{2\pi}E\big(|\varepsilon_t|^2 I(|\varepsilon_t| \geq \delta\sqrt{2\pi n})\big) \to 0$

as $n \to \infty$, noting that the second to last inequality is because $|u_{t,n}| = |\cos(2\pi kt/n)\varepsilon_t| \leq |\varepsilon_t|$. Hence we have verified the Lindeberg condition and we obtain (8.16). The proof of (8.17) is similar, hence we omit the details. The $\chi^2$ result follows because $I_\varepsilon(\omega) = \Re(J_\varepsilon(\omega))^2 + \Im(J_\varepsilon(\omega))^2$, hence from (8.16) we have $\frac{4\pi}{\sigma^2}I_\varepsilon(\omega) \stackrel{D}{\to} \chi^2(2)$.

To prove (8.18) we note that

$\mathrm{cov}\big(I_\varepsilon(\omega_j), I_\varepsilon(\omega_k)\big) = \frac{1}{(2\pi n)^2}\sum_{t_1, s_1, t_2, s_2}\mathrm{cov}\big(\varepsilon_{t_1}\varepsilon_{s_1}, \varepsilon_{t_2}\varepsilon_{s_2}\big)\exp\big(i(t_1 - s_1)\omega_j - i(t_2 - s_2)\omega_k\big).$

We recall that

$\mathrm{cov}\big(\varepsilon_{t_1}\varepsilon_{s_1}, \varepsilon_{t_2}\varepsilon_{s_2}\big) = \mathrm{cov}(\varepsilon_{t_1}, \varepsilon_{t_2})\mathrm{cov}(\varepsilon_{s_1}, \varepsilon_{s_2}) + \mathrm{cov}(\varepsilon_{t_1}, \varepsilon_{s_2})\mathrm{cov}(\varepsilon_{s_1}, \varepsilon_{t_2}) + \mathrm{cum}(\varepsilon_{t_1}, \varepsilon_{s_1}, \varepsilon_{t_2}, \varepsilon_{s_2}).$

We note that since $\{\varepsilon_t\}$ are iid random variables, for most $t_1, s_1, t_2, s_2$ the above covariance is zero; the exceptions are when the indices pair up or all coincide. Counting all these combinations (and noting that the pairing involving $\omega_j + \omega_k$ vanishes for $0 < \omega_j, \omega_k < \pi$) we have

$\mathrm{cov}\big(I_\varepsilon(\omega_j), I_\varepsilon(\omega_k)\big) = \frac{\sigma^4}{(2\pi n)^2}\Big|\sum_{t=1}^{n}\exp(it(\omega_j - \omega_k))\Big|^2 + \frac{n}{(2\pi n)^2}\kappa_4,$

where $\sigma^2 = \mathrm{var}(\varepsilon_t)$ and $\kappa_4 = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t, \varepsilon_t)$. We note that for $j \neq k$, $\sum_t\exp(it(\omega_j - \omega_k)) = 0$, and for $j = k$, $\sum_t\exp(it(\omega_j - \omega_k)) = n$; substituting this into $\mathrm{cov}(I_\varepsilon(\omega_j), I_\varepsilon(\omega_k))$ gives the desired result. $\Box$
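Proposition 8.4.1 is easy to see in a small simulation: for iid noise, the periodogram ordinates at the fundamental frequencies, divided by the constant spectral density $\sigma^2/2\pi$, behave like independent Exp(1) variables, so the raw periodogram does not settle down as $n$ grows. A minimal sketch (the sample size and threshold are illustrative):

```python
import numpy as np

n, sigma2 = 4096, 1.0
eps = np.sqrt(sigma2) * np.random.randn(n)
I = np.abs(np.fft.fft(eps)) ** 2 / (2 * np.pi * n)   # periodogram at omega_j = 2 pi j / n
ratio = I[1:n // 2] / (sigma2 / (2 * np.pi))         # I(omega_j) / f(omega_j), 0 < omega_j < pi

# for Exp(1): mean = 1, variance = 1, P(ratio > 3) = exp(-3) ~ 0.05
print(ratio.mean(), ratio.var(), (ratio > 3).mean(), np.exp(-3))
```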
We have seen that the periodogram of iid random variables does not converge to a constant; indeed its distribution is asymptotically exponential. This suggests that something similar holds true for linear processes, and this is the case. In the following lemma we show that the periodogram of a general linear process $X_t = \sum_j\psi_j\varepsilon_{t-j}$ is

$I_X(\omega) = \Big|\sum_j\psi_j\exp(ij\omega)\Big|^2 I_\varepsilon(\omega) + o_p(1) = \frac{2\pi}{\sigma^2}f(\omega)I_\varepsilon(\omega) + o_p(1),$

where $f(\omega) = \frac{\sigma^2}{2\pi}|\sum_j\psi_j\exp(ij\omega)|^2$ is the spectral density of $\{X_t\}$.

Lemma 8.4.2 Let us suppose that $\{X_t\}$ satisfies $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, and $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$. Then we have

$J_X(\omega) = \sum_j\psi_j\exp(ij\omega)J_\varepsilon(\omega) + Y_n(\omega),$   (8.19)

where $Y_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_j\psi_j\exp(ij\omega)U_{n,j}$, $U_{n,j} = \sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t$, and

$E|Y_n(\omega)|^2 \leq \Big(\frac{1}{n^{1/2}}\sum_{j=-\infty}^{\infty}|\psi_j|\min(|j|, n)^{1/2}\Big)^2.$

Furthermore,

$I_X(\omega) = \Big|\sum_j\psi_j\exp(ij\omega)\Big|^2|J_\varepsilon(\omega)|^2 + R_n(\omega),$   (8.20)

where $E(\sup_\omega|R_n(\omega)|) \to 0$ as $n \to \infty$. If in addition $E(\varepsilon_t^4) < \infty$ and $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, then $E(\sup_\omega|R_n(\omega)|^2) = O(n^{-1})$.

PROOF. See Priestley (1983), Theorem 6.2.1 or Brockwell and Davis (1998), Theorem 10.3.1. $\Box$

Using the above we see that $I_X(\omega) \approx |\sum_j\psi_j\exp(ij\omega)|^2 I_\varepsilon(\omega)$. This suggests that most of the properties which apply to $I_\varepsilon(\omega)$ also apply to $I_X(\omega)$. Indeed, in the following theorem we show that the asymptotic distribution of $I_X(\omega)$ is exponential with mean $f(\omega)$.

By using the above result we now generalise Proposition 8.4.1 to linear processes.
Theorem 8.4.1 Suppose $\{X_t\}$ satisfies $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$. Let $I_n(\omega)$ denote the periodogram associated with $\{X_1, \ldots, X_n\}$ and $f(\omega)$ be the spectral density. Then:

(i) If $f(\omega) > 0$ for all $\omega \in [0, 2\pi]$ and $0 < \omega_1, \ldots, \omega_m < \pi$, then

$\big(I_n(\omega_1)/f(\omega_1), \ldots, I_n(\omega_m)/f(\omega_m)\big)$

converges in distribution (as $n \to \infty$) to a vector of independent exponential distributions with mean one.

(ii) If in addition $E(\varepsilon_t^4) < \infty$ and $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, then for $\omega_j = \frac{2\pi j}{n}$ and $\omega_k = \frac{2\pi k}{n}$ we have

$\mathrm{cov}\big(I(\omega_k), I(\omega_j)\big) = \begin{cases} 2f(\omega_k)^2 + O(n^{-1/2}) & \omega_j = \omega_k = 0 \text{ or } \pi \\ f(\omega_k)^2 + O(n^{-1/2}) & 0 < \omega_j = \omega_k < \pi \\ O(n^{-1}) & \omega_j \neq \omega_k, \end{cases}$

where the bound is uniform in $\omega_j$ and $\omega_k$.

PROOF. See Brockwell and Davis (1998), Theorem 10.3.2. $\Box$

Remark 8.4.1 (Summary of properties of the periodogram for linear processes)

(i) The periodogram is nonnegative and is an asymptotically unbiased estimator of the spectral density (when $\sum_j|\psi_j| < \infty$).

(ii) Like the spectral density, it is symmetric: $I_n(\omega) = I_n(2\pi - \omega)$.

(iii) At the fundamental frequencies the $I(\omega_j)$ are asymptotically uncorrelated.

(iv) If $0 < \omega < \pi$, $I(\omega)$ is asymptotically exponentially distributed with mean $f(\omega)$.

We see that the periodogram is extremely erratic and does not converge (in any sense) to the spectral density as $n \to \infty$. In the following section we discuss this further and consider modifications of the periodogram which lead to a consistent estimator.
8.4.2 Estimating the spectral density

There are several (pretty much equivalent) explanations as to why the raw periodogram is not a good estimator of the spectrum. Intuitively, the simplest explanation is that we have included too many covariance estimators in the estimation of $f(\omega)$. We see from (8.13) that the periodogram is the Fourier transform of the estimated covariances at $n$ different lags. Typically the variance of each covariance estimator $\hat{c}_n(k)$ will be about $O(n^{-1})$, hence roughly speaking the variance of $I_n(\omega)$ will be the sum of these $n$ $O(n^{-1})$ variances, which leads to a variance of $O(1)$, which clearly does not converge to zero. This suggests that if we use $m$ ($m \ll n$) covariances in the estimation of $f(\omega)$ rather than all $n$ (where we let $m \to \infty$ as $n \to \infty$), we may reduce the variance of the estimator (at the cost of introducing some bias) and obtain a good estimator of the spectral density. This indeed turns out to be the case.

Another way to approach the problem is from a nonparametric angle. We note from Theorem 8.4.1 that at the fundamental frequencies the $I(\omega_k)$ can be treated as uncorrelated random variables with mean $f(\omega_k)$ and variance $f(\omega_k)^2$. Therefore we can rewrite $I(\omega_k)$ as

$I(\omega_k) = E(I(\omega_k)) + \big(I(\omega_k) - E(I(\omega_k))\big) \approx f(\omega_k) + f(\omega_k)U_k, \qquad k = 1, \ldots, n,$   (8.21)

where $\{U_k\}$ is approximately a mean zero, variance one sequence of uncorrelated random variables and $\omega_k = 2\pi k/n$. We note that (8.21) resembles the usual nonparametric "function plus additive noise" model often considered in nonparametric statistics. This suggests that another way to estimate the spectral density is to use a locally weighted average of the $I(\omega_k)$. Interestingly, both of the estimation methods mentioned above are practically the same method.

It is worth noting that Parzen (1957) first proposed a consistent method to estimate the spectral density. Furthermore, classical density estimation and spectral density estimation are very similar, and it was spectral density estimation which motivated methods to estimate the density function (one of the first papers on density estimation is Parzen (1962)).

Equation (8.21) motivates the following nonparametric estimator of $f(\omega)$:

$\hat{f}_n(\omega_j) = \sum_k\frac{1}{bn}K\Big(\frac{j-k}{bn}\Big)I(\omega_k),$   (8.22)

where $K(\cdot)$ is a kernel which satisfies $\int K(x)dx = 1$ and $\int K(x)^2dx < \infty$. An example of $\hat{f}_n(\omega_j)$ is the local average about the frequency $\omega_j$:

$\hat{f}_n(\omega_j) = \frac{1}{bn}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k).$
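A minimal sketch of the local-average estimator above (the rectangular-kernel case of (8.22)); the bandwidth choice is ours, and the averaging wraps around the frequency circle, which is one reasonable way to handle the endpoints.

```python
import numpy as np

def smoothed_periodogram(X, bn):
    """Average the periodogram over bn neighbouring fundamental frequencies
    (rectangular kernel version of (8.22))."""
    n = len(X)
    I = np.abs(np.fft.fft(X)) ** 2 / (2 * np.pi * n)
    half = bn // 2
    f_hat = np.array([I[np.arange(j - half, j + half + 1) % n].mean() for j in range(n)])
    omegas = 2 * np.pi * np.arange(n) / n
    return omegas, f_hat

# AR(1) example, X_t = 0.7 X_{t-1} + eps_t; true f(omega) = (2 pi)^{-1} / |1 - 0.7 e^{i omega}|^2
n = 2048
eps = np.random.randn(n)
X = np.zeros(n)
for t in range(1, n):
    X[t] = 0.7 * X[t - 1] + eps[t]
omegas, f_hat = smoothed_periodogram(X, bn=40)
f_true = 1 / (2 * np.pi * np.abs(1 - 0.7 * np.exp(1j * omegas)) ** 2)
print(np.max(np.abs(f_hat - f_true)))        # rough agreement with the true spectral density
```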
Theorem 8.4.2 Suppose $\{X_t\}$ satisfies $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$ and $E(\varepsilon_t^4) < \infty$. Let $\hat{f}_n(\omega)$ be the spectral estimator defined in (8.22). Then

$E(\hat{f}_n(\omega_j)) \to f(\omega_j)$   (8.23)

and

$\mathrm{var}(\hat{f}_n(\omega_j)) \approx \begin{cases} \frac{1}{bn}f(\omega_j)^2 & 0 < \omega_j < \pi \\ \frac{2}{bn}f(\omega_j)^2 & \omega_j = 0 \text{ or } \pi, \end{cases}$   (8.24)

as $bn \to \infty$, $b \to 0$ and $n \to \infty$.

PROOF. The proofs of both (8.23) and (8.24) are based on the kernel $K(x/b)$ becoming narrower as $b \to 0$, hence there is more localisation as the sample size grows (just like nonparametric regression). We note that since $f(\omega) = \frac{\sigma^2}{2\pi}|\sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)|^2$, the spectral density $f$ is continuous in $\omega$.

To prove (8.23) we take expectations:

$\big|E(\hat{f}_n(\omega_j)) - f(\omega_j)\big| \leq \Big|\sum_k\frac{1}{bn}K\Big(\frac{k}{bn}\Big)\big(E(I(\omega_{j-k})) - f(\omega_j)\big)\Big|$
$\leq \sum_k\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\,\big|E(I(\omega_{j-k})) - f(\omega_{j-k})\big| + \sum_k\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\,\big|f(\omega_j) - f(\omega_{j-k})\big| := I + II.$

Using Lemma 8.4.1 we have

$I \leq K_{\sup}\Big(\frac{1}{bn}\sum_k\Big|K\Big(\frac{k}{bn}\Big)\Big|\Big)\Big(\sum_{|k|\geq n}|c(k)| + \sum_{|k|\leq n}\frac{|k|}{n}|c(k)|\Big) \to 0.$

Now we consider

$II = \sum_k\frac{1}{bn}\Big|K\Big(\frac{k}{bn}\Big)\Big|\,\big|f(\omega_j) - f(\omega_{j-k})\big|.$

Since the spectral density $f(\omega)$ is continuous, we have $II \to 0$ as $bn \to \infty$, $b \to 0$ and $n \to \infty$. The above two bounds give (8.23).

We use Theorem 8.4.1 to prove (8.24). We first assume that $\omega_j \neq 0$ or $\pi$. Evaluating the variance using Theorem 8.4.1 we have

$\mathrm{var}(\hat{f}_n(\omega_j)) = \sum_{k_1,k_2}\frac{1}{(bn)^2}K\Big(\frac{j-k_1}{bn}\Big)K\Big(\frac{j-k_2}{bn}\Big)\mathrm{cov}\big(I(\omega_{k_1}), I(\omega_{k_2})\big)$
$= \sum_k\frac{1}{(bn)^2}K\Big(\frac{j-k}{bn}\Big)^2\mathrm{var}(I(\omega_k)) + O\Big(\frac{1}{n}\Big) = \sum_k\frac{1}{(bn)^2}K\Big(\frac{k}{bn}\Big)^2 f(\omega_{j-k})^2 + O\Big(\frac{1}{n^{1/2}}\Big) \approx \frac{1}{bn}f(\omega_j)^2.$

A similar proof can be used for the case $\omega_j = 0$ or $\pi$. $\Box$

The above result means that the mean squared error of the estimator satisfies

$E\big(\hat{f}_n(\omega_j) - f(\omega_j)\big)^2 \to 0$

as $bn \to \infty$, $b \to 0$ and $n \to \infty$. Moreover,

$E\big(\hat{f}_n(\omega_j) - f(\omega_j)\big)^2 = O\Big(\frac{1}{bn}\Big) + \big(E(\hat{f}_n(\omega_j)) - f(\omega_j)\big)^2 = O\Big(\frac{1}{bn}\Big) + O\Big(\Big(\sum_{|k|\geq n}|c(k)| + \sum_{|k|\leq n}\frac{|k|}{n}|c(k)|\Big)^2\Big).$

Hence the rate of convergence depends on the bias of the estimator, in particular on the rate of decay of the covariances. If the covariances decay exponentially (as in the case of ARMA processes) the bias is extremely small and $E(\hat{f}_n(\omega_j) - f(\omega_j))^2 = O(\frac{1}{bn})$.

There are several examples of kernels that one can use and each has its own optimality property. An interesting discussion on this is given in Priestley (1983), Chapter 6.
As mentioned briefly above, we can also estimate the spectrum by truncating the number of covariances estimated. We recall that

$I_X(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega).$

Hence a viable estimator of the spectrum is

$\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\hat{c}_n(k)\exp(ik\omega),$

where $\lambda(\cdot)$ is a weight function which puts very little weight (or no weight) on the covariances at large lags. A useful example, using the rectangular function $\lambda(x) = 1$ if $|x| \leq 1$ and zero otherwise, is

$\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-m}^{m}\hat{c}_n(k)\exp(ik\omega),$

where $m \ll n$. Now $\tilde{f}_n(\omega)$ has similar properties to $\hat{f}_n(\omega)$, with $m$ playing the same role as the window $bn$. Indeed there is a very close relationship between the two, which can be seen by using (8.1). Substituting $\hat{c}_n(k) = \int_0^{2\pi}I_n(\theta)\exp(-ik\theta)d\theta$ into $\tilde{f}_n$ gives

$\tilde{f}_n(\omega) = \frac{1}{2\pi}\int I_n(\theta)\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\exp(ik(\omega - \theta))d\theta = \int I_n(\theta)W_m(\omega - \theta)d\theta,$

where $W_m(u) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda(\frac{k}{m})\exp(iku)$. Now $W_m(\cdot)$ and $\frac{1}{b}K(\frac{\cdot}{b})$ (defined in (8.22)) are not necessarily the same function, but they share many of the same characteristics, and the two estimators are both asymptotically normal (we discuss this in the remark below).
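A sketch of the truncated (rectangular lag-window) estimator $\tilde{f}_n$ above, assuming $m \ll n$; the specific $m$ is an illustrative choice of ours.

```python
import numpy as np

def lag_window_estimator(X, m, omegas):
    """f_tilde_n(omega) = (2 pi)^{-1} sum_{|k| <= m} c_n(k) exp(i k omega), rectangular window."""
    n = len(X)
    c = np.array([(X[:n - k] * X[k:]).sum() / n for k in range(m + 1)])
    k = np.arange(1, m + 1)
    return np.array([(c[0] + 2 * (np.cos(k * w) * c[1:]).sum()) / (2 * np.pi) for w in omegas])

# same AR(1) example as in the previous sketch
n = 2048
eps = np.random.randn(n)
X = np.zeros(n)
for t in range(1, n):
    X[t] = 0.7 * X[t - 1] + eps[t]
omegas = np.linspace(0, np.pi, 200)
f_tilde = lag_window_estimator(X, m=40, omegas=omegas)
f_true = 1 / (2 * np.pi * np.abs(1 - 0.7 * np.exp(1j * omegas)) ** 2)
print(np.max(np.abs(f_tilde - f_true)))      # comparable accuracy to the smoothed periodogram
```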
Remark 8.4.2 (The distribution of the spectral density estimator) Using that the periodogram ratio $I_n(\omega)/f(\omega)$ is asymptotically $\chi^2(2)/2$ distributed and that the ordinates are uncorrelated at the fundamental frequencies, we can deduce an approximation to the distribution of $\hat{f}_n(\omega)$. To obtain the distribution consider the example

$\hat{f}_n(\omega_j) = \frac{1}{bn}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k).$

Since $2I(\omega_k)/f(\omega_k)$ is approximately $\chi^2(2)$, and the sum $\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k)$ is taken over a local neighbourhood of $\omega_j$, we have that $2f(\omega_j)^{-1}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k)$ is approximately $\chi^2(2bn)$. Extending this argument to arbitrary kernels, we have that $2bn\hat{f}_n(\omega_j)/f(\omega_j)$ is approximately $\chi^2(2bn)$.

We note that when $bn$ is large, $\chi^2(2bn)$ is close to normal with mean $2bn$ and variance $4bn$. Hence $2bn\hat{f}_n(\omega_j)/f(\omega_j)$ is approximately normal with mean $2bn$ and variance $4bn$, and therefore

$\sqrt{bn}\big(\hat{f}_n(\omega_j) - f(\omega_j)\big) \approx \mathcal{N}\big(0, f(\omega_j)^2\big).$

Using this approximation, we can construct confidence intervals for $f(\omega_j)$.
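A sketch of the $\chi^2$-based interval described in the remark: if $2bn\,\hat{f}_n(\omega_j)/f(\omega_j)$ is approximately $\chi^2(2bn)$, an approximate 95% interval for $f(\omega_j)$ is $(2bn\,\hat{f}_n(\omega_j)/q_{0.975},\, 2bn\,\hat{f}_n(\omega_j)/q_{0.025})$, where the $q$'s are $\chi^2(2bn)$ quantiles. We use scipy for the quantiles; the numerical values plugged in at the end are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def spectral_ci(f_hat, bn, level=0.95):
    """Pointwise chi-square interval for f(omega_j), based on 2*bn*f_hat/f ~ chi2(2*bn)."""
    dof = 2 * bn
    lower_q, upper_q = chi2.ppf([(1 - level) / 2, (1 + level) / 2], dof)
    return dof * f_hat / upper_q, dof * f_hat / lower_q

# e.g. an estimate f_hat = 0.4 at some frequency, computed by averaging bn = 40 ordinates
low, high = spectral_ci(0.4, bn=40)
print(low, high)
```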
8.5 The Whittle likelihood

In Chapter 4 we considered various methods for estimating the parameters of an ARMA process. The most efficient method (when the errors were Gaussian) was the Gaussian maximum likelihood estimator. This estimator was defined in the time domain, but it is interesting to note that a very similar estimator, which is asymptotically equivalent to the GMLE, can be defined in the frequency domain. We first define the estimator using heuristics to justify it. We then show how it is related to the GMLE (it is the frequency domain approximation of the time domain estimator).

First let us suppose that we observe $\{X_t\}_{t=1}^{n}$, where

$X_t = \sum_{j=1}^{p}\phi_j^{(0)}X_{t-j} + \sum_{j=1}^{q}\theta_j^{(0)}\varepsilon_{t-j} + \varepsilon_t,$

and $\{\varepsilon_t\}$ are iid random variables. As before we will assume that $\phi_0 = \{\phi_j^{(0)}\}$ and $\theta_0 = \{\theta_j^{(0)}\}$ are such that the roots of the characteristic polynomials are greater than $1 + \delta$. Let us define the discrete Fourier transform

$J_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp(it\omega).$

We will consider it at the fundamental frequencies $\omega_k = \frac{2\pi k}{n}$. As we mentioned in Section 8.2, we have

$\mathrm{cov}\big(J_n(\omega_{k_1}), J_n(\omega_{k_2})\big) \approx \begin{cases} \mathrm{var}(J_n(\omega_{k_1})) \approx f(\omega_{k_1}) & k_1 = k_2 \\ 0 & k_1 \neq k_2. \end{cases}$

Hence if the innovations are Gaussian, then $J_n(\omega)$ is complex Gaussian and we have approximately

$\underline{J}_n = \big(J_n(\omega_1), \ldots, J_n(\omega_n)\big)' \sim \mathcal{N}\big(0, \mathrm{diag}(f(\omega_1), \ldots, f(\omega_n))\big).$

Therefore, since $\underline{J}_n$ is a normally distributed (complex) random vector with mean zero and diagonal variance matrix $\mathrm{diag}(f(\omega_1), \ldots, f(\omega_n))$, the negative log-likelihood of $\underline{J}_n$ is approximately (up to constants)

$L_w(\phi, \theta) = \sum_{k=1}^{n}\Big(\log f_{\phi,\theta}(\omega_k) + \frac{|J_n(\omega_k)|^2}{f_{\phi,\theta}(\omega_k)}\Big).$

To estimate the parameters we choose the $\phi$ and $\theta$ which minimise the above criterion, that is

$(\hat{\phi}_n^w, \hat{\theta}_n^w) = \arg\min_{(\phi,\theta)\in\Theta}L_w(\phi, \theta),$   (8.25)

where $\Theta$ consists of all parameters for which the roots of the characteristic polynomials have absolute value greater than $1 + \delta$.
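To make the criterion concrete, here is a sketch of the Whittle criterion for an AR(1) model (a special case of the ARMA setting above, with the innovation variance fixed at one); the optimiser and parametrisation are our choices, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def whittle_ar1(X):
    """Minimise sum_k [ log f_phi(omega_k) + I_n(omega_k) / f_phi(omega_k) ] over phi,
    with f_phi(omega) = (2 pi)^{-1} |1 - phi e^{i omega}|^{-2} (unit innovation variance)."""
    n = len(X)
    I = np.abs(np.fft.fft(X)) ** 2 / (2 * np.pi * n)     # periodogram
    omegas = 2 * np.pi * np.arange(1, n) / n             # omit omega_0 = 0
    I = I[1:]

    def criterion(phi):
        f = 1.0 / (2 * np.pi * np.abs(1 - phi * np.exp(1j * omegas)) ** 2)
        return np.sum(np.log(f) + I / f)

    return minimize_scalar(criterion, bounds=(-0.99, 0.99), method="bounded").x

# simulate an AR(1) with phi = 0.7 and check the estimate
n = 4096
eps = np.random.randn(n)
X = np.zeros(n)
for t in range(1, n):
    X[t] = 0.7 * X[t - 1] + eps[t]
print(whittle_ar1(X))                                    # close to 0.7
```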
Whittle (1962) showed that the above criterion is an approximation of the GMLE. The correct proof is quite complicated and uses several matrix approximations due to Grenander and Szegő (1958). Instead we give a heuristic argument which is quite enlightening.

Remark 8.5.1 (Some properties of circulant matrices)

(i) Let us define the $n$-dimensional circulant matrix

$C = \begin{pmatrix} c(0) & c(1) & c(2) & \ldots & c(n-2) & c(n-1) \\ c(n-1) & c(0) & c(1) & \ldots & c(n-3) & c(n-2) \\ \vdots & & & \ddots & & \vdots \\ c(1) & c(2) & c(3) & \ldots & c(n-1) & c(0) \end{pmatrix}.$

We see that the elements in each row are the same, each row being a rotation of the previous one. The eigenvalues and eigenvectors of $C$ have interesting properties. Define $f_n(\omega) = \sum_{k=0}^{n-1}c(k)\exp(ik\omega)$; then the eigenvalues are $f_n(\omega_j)$ with corresponding eigenvectors $e_j = (1, \exp(2\pi ij/n), \ldots, \exp(2\pi ij(n-1)/n))'$. Hence letting $E = (e_1, \ldots, e_n)$, we can write

$C = E\Delta E^{-1},$   (8.26)

where $\Delta = \mathrm{diag}(f_n(\omega_1), \ldots, f_n(\omega_n))$.

(ii) You may wonder about the relevance of circulant matrices to the current setting. However, the variance matrix of a stationary process is

$\mathrm{var}(\underline{X}_n) = \Sigma_n = \begin{pmatrix} c(0) & c(1) & c(2) & \ldots & c(n-2) & c(n-1) \\ c(1) & c(0) & c(1) & \ldots & c(n-3) & c(n-2) \\ \vdots & & & \ddots & & \vdots \\ c(n-1) & c(n-2) & c(n-3) & \ldots & c(1) & c(0) \end{pmatrix}.$

This is a Toeplitz matrix, and we observe that for large $n$ it is very close to the circulant matrix; the differences are at the "edges" of the matrix. Hence it can be shown that for large $n$

$\Sigma_n \approx E\Delta E^{-1},$   (8.27)

and $E^{-1} = n^{-1}\bar{E}'$ (where $\bar{E}$ denotes the complex conjugate of $E$).

We will use the results in the remark above to prove the lemma below. We first observe that the Gaussian maximum likelihood for the ARMA process can be written in terms of its spectral density (see (4.10)):

$L_n(\phi, \theta) = \log\det\Sigma(\phi, \theta) + \underline{X}_n'\Sigma(\phi, \theta)^{-1}\underline{X}_n = \log\det\Sigma(f_{\phi,\theta}) + \underline{X}_n'\Sigma(f_{\phi,\theta})^{-1}\underline{X}_n,$   (8.28)

where $\Sigma(f_{\phi,\theta})_{s,t} = \int f_{\phi,\theta}(\omega)\exp(i(s-t)\omega)d\omega$ and $\underline{X}_n' = (X_1, \ldots, X_n)$. We now show that $L_n(\phi, \theta) \approx L_w(\phi, \theta)$.

Lemma 8.5.1 Suppose that $\{X_t\}$ is a stationary ARMA time series with absolutely summable covariances and $f_{\phi,\theta}(\omega)$ is the corresponding spectral density function. Then

$\log\det\Sigma(f_{\phi,\theta}) + \underline{X}_n'\Sigma(f_{\phi,\theta})^{-1}\underline{X}_n \approx \sum_{k=1}^{n}\Big(\log f_{\phi,\theta}(\omega_k) + \frac{|J_X(\omega_k)|^2}{f_{\phi,\theta}(\omega_k)}\Big)$

for large $n$.

PROOF. We give a heuristic proof (details on the precise proof can be found in the remark below). Using (8.27) we see that $\Sigma(f_{\phi,\theta})$ can be approximately written in terms of the eigenvalues and eigenvectors of the circulant matrix associated with $\Sigma(f_{\phi,\theta})$, that is

$\Sigma(f_{\phi,\theta}) \approx E\Delta(f_{\phi,\theta})E^{-1} \quad \text{and} \quad \Sigma(f_{\phi,\theta})^{-1} \approx E\Delta(f_{\phi,\theta})^{-1}E^{-1},$   (8.29)

where $\Delta(f_{\phi,\theta}) = \mathrm{diag}(f^{(n)}(\omega_1), \ldots, f^{(n)}(\omega_n))$, $f^{(n)}(\omega) = \sum_{j=-(n-1)}^{n-1}c_{\phi,\theta}(j)\exp(ij\omega)$ and $\omega_k = 2\pi k/n$. A basic calculation gives that, up to normalisation,

$\underline{X}_n'\bar{E} \propto \big(J_n(\omega_1), \ldots, J_n(\omega_n)\big).$   (8.30)

Substituting (8.30) and (8.29) into (8.28) yields

$\frac{1}{n}L_n(\phi, \theta) \approx \frac{1}{n}\sum_{k=1}^{n}\Big(\log f^{(n)}_{\phi,\theta}(\omega_k) + \frac{|J_n(\omega_k)|^2}{f^{(n)}_{\phi,\theta}(\omega_k)}\Big) = \frac{1}{n}L_w(\phi, \theta).$   (8.31)

Hence by using the approximation (8.29) we have derived the Whittle likelihood. This proof was first derived by Tata Subba Rao. $\Box$
Remark 8.5.2 (A rough flavour of the proof) There are various ways to precisely prove this result. All of them show that the Toeplitz matrix can in some sense be approximated by a circulant matrix; this uses Szegő's identity (Grenander and Szegő (1958)). Studying the Gaussian likelihood in (8.28), we note that the Whittle likelihood has a similar representation. That is,

$L_n^{(w)}(\phi, \theta) = \sum_{k=1}^{n}\Big(\log f_{\phi,\theta}(\omega_k) + \frac{|J_n(\omega_k)|^2}{f_{\phi,\theta}(\omega_k)}\Big) = \sum_{k=1}^{n}\log f_{\phi,\theta}(\omega_k) + \underline{X}_n'U(f_{\phi,\theta}^{-1})\underline{X}_n,$

where $U(f_{\phi,\theta}^{-1})_{s,t} = \int f_{\phi,\theta}(\omega)^{-1}\exp(i(s-t)\omega)d\omega$. Hence if we can show that $\|U(f_{\phi,\theta}^{-1}) - \Sigma(f_{\phi,\theta})^{-1}\| \to 0$ as $n \to \infty$ (and that its derivatives with respect to $\phi$ and $\theta$ also converge), then we can show that $L_n(\phi, \theta)$ and $L_n^{(w)}(\phi, \theta)$ and their derivatives are asymptotically equivalent. Hence the GMLE and the Whittle estimator are asymptotically equivalent. The difficult part of the proof is establishing that $\|U(f_{\phi,\theta}^{-1}) - \Sigma(f_{\phi,\theta})^{-1}\| \to 0$ as $n \to \infty$. It is worth noting that Rainer Dahlhaus has extensively developed this area and considered several interesting generalisations (see, for example, Dahlhaus (1996), Dahlhaus (1997) and Dahlhaus (2000)).
We now show consistency of the estimator (without showing that it is equivalent to the GMLE). To simplify calculations we slightly modify the estimator and exchange the sum in (8.25) for an integral, to obtain

$L_w(\phi, \theta) = \int_0^{2\pi}\Big(\log f_{\phi,\theta}(\omega) + \frac{I_n(\omega)}{f_{\phi,\theta}(\omega)}\Big)d\omega.$

To estimate the parameters we choose the $\phi$ and $\theta$ which minimise the above criterion, that is

$(\hat{\phi}_n^w, \hat{\theta}_n^w) = \arg\min_{(\phi,\theta)\in\Theta}L_w(\phi, \theta).$   (8.32)

Lemma 8.5.2 (Consistency) Let us suppose that $d_k(\phi, \theta) = \int f_{\phi,\theta}(\omega)^{-1}\exp(ik\omega)d\omega$ and $\sum_k|d_k(\phi, \theta)| < \infty$. Let $(\hat{\phi}_n^w, \hat{\theta}_n^w)$ be defined as in (8.32). Then we have

$(\hat{\phi}_n^w, \hat{\theta}_n^w) \stackrel{P}{\to} (\phi_0, \theta_0).$

PROOF. Recall that to show consistency we need to show pointwise convergence of $L_n^w$ and equicontinuity. First we show pointwise convergence by evaluating the variance of $L_n^w$ and showing that it converges to zero as $n \to \infty$. We first note that by using $d_k(\phi, \theta) = \int f_{\phi,\theta}(\omega)^{-1}\exp(ik\omega)d\omega$ and $I_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega)$, we can write $L_n^w$ as

$L_n^w(\phi, \theta) = \int_0^{2\pi}\log f_{\phi,\theta}(\omega)d\omega + \frac{1}{2\pi}\sum_{r=-(n-1)}^{n-1}d_r(\phi, \theta)\,\frac{1}{n}\sum_{k=1}^{n-|r|}X_kX_{k+|r|}.$

Therefore, taking the variance gives

$\mathrm{var}\big(L_n^w(\phi, \theta)\big) = \frac{1}{(2\pi)^2n^2}\sum_{r_1, r_2=-(n-1)}^{n-1}d_{r_1}(\phi, \theta)d_{r_2}(\phi, \theta)\sum_{k_1=1}^{n-|r_1|}\sum_{k_2=1}^{n-|r_2|}\mathrm{cov}\big(X_{k_1}X_{k_1+r_1}, X_{k_2}X_{k_2+r_2}\big).$   (8.33)

We note that

$\mathrm{cov}\big(X_{k_1}X_{k_1+r_1}, X_{k_2}X_{k_2+r_2}\big) = \mathrm{cov}(X_{k_1}, X_{k_2})\mathrm{cov}(X_{k_1+r_1}, X_{k_2+r_2}) + \mathrm{cov}(X_{k_1}, X_{k_2+r_2})\mathrm{cov}(X_{k_1+r_1}, X_{k_2}) + \mathrm{cum}\big(X_{k_1}, X_{k_1+r_1}, X_{k_2}, X_{k_2+r_2}\big).$

Now we note that $\sum_{k_2}|\mathrm{cum}(X_{k_1}, X_{k_1+r_1}, X_{k_2}, X_{k_2+r_2})| < \infty$ and $\sum_{k_2}|\mathrm{cov}(X_{k_1}, X_{k_2})| < \infty$; hence substituting this and $\sum_k|d_k(\phi, \theta)| < \infty$ into (8.33), we have

$\mathrm{var}\big(L_n^w(\phi, \theta)\big) = O\Big(\frac{1}{n}\Big),$

hence $\mathrm{var}(L_n^w(\phi, \theta)) \to 0$ as $n \to \infty$. Define

$L_w(\phi, \theta) = \lim_{n\to\infty}E\big(L_n^w(\phi, \theta)\big) = \int_0^{2\pi}\Big(\log f_{\phi,\theta}(\omega) + \frac{f_{\phi_0,\theta_0}(\omega)}{f_{\phi,\theta}(\omega)}\Big)d\omega.$

Hence, since $\mathrm{var}(L_n^w(\phi, \theta)) \to 0$, we have

$L_n^w(\phi, \theta) \stackrel{P}{\to} L_w(\phi, \theta).$

To show equicontinuity we apply the mean value theorem to $L_n^w$. We note that because the parameters $(\phi, \theta) \in \Theta$ have characteristic polynomials whose roots are greater than $(1 + \delta)$ in absolute value, $f_{\phi,\theta}(\omega)$ is bounded away from zero (indeed there exists a $\delta_f > 0$ such that $\inf_{\omega, (\phi,\theta)\in\Theta}f_{\phi,\theta}(\omega) \geq \delta_f$). Hence it can be shown that there exists a random sequence $\{\mathcal{K}_n\}$ such that

$\big|L_n^w(\phi_1, \theta_1) - L_n^w(\phi_2, \theta_2)\big| \leq \mathcal{K}_n\,\big\|(\phi_1 - \phi_2, \theta_1 - \theta_2)\big\|,$

and $\mathcal{K}_n$ converges almost surely to a finite constant as $n \to \infty$. Therefore $L_n^w$ is stochastically equicontinuous (and equicontinuous in probability). Since the parameter space is compact, all three conditions in Section 5.6 are satisfied and we have consistency of the Whittle estimator. $\Box$
We now show asymptotic normality of the Whittle estimator and, in the following remark, show its relationship to the GMLE estimator.

Theorem 8.5.1 Let us suppose that $d_k(\phi, \theta) = \int f_{\phi,\theta}(\omega)^{-1}\exp(ik\omega)d\omega$ and $\sum_k|d_k(\phi, \theta)| < \infty$. Let $(\hat{\phi}_n^w, \hat{\theta}_n^w)$ be defined as in (8.32). Then

$\sqrt{n}\big((\hat{\phi}_n^w - \phi_0), (\hat{\theta}_n^w - \theta_0)\big) \stackrel{D}{\to} \mathcal{N}\big(0, V^{-1} + V^{-1}WV^{-1}\big),$

where

$V = \frac{1}{2\pi}\int_0^{2\pi}\big(\nabla f_{\phi_0,\theta_0}(\omega)\big)\big(\nabla f_{\phi_0,\theta_0}(\omega)\big)'\,f_{\phi_0,\theta_0}(\omega)^{-2}\,d\omega$

$W = \frac{1}{2\pi}\int_0^{2\pi}\int_0^{2\pi}\nabla\big(f_{\phi_0,\theta_0}(\omega_1)^{-1}\big)\big(\nabla f_{\phi_0,\theta_0}(\omega_2)^{-1}\big)'\,f_{4,\phi_0,\theta_0}(\omega_1, -\omega_1, \omega_2)\,d\omega_1d\omega_2,$

and $f_{4,\phi_0,\theta_0}(\omega_1, \omega_2, \omega_3) = \kappa_4 A(\omega_1)A(\omega_2)A(\omega_3)A(-\omega_1 - \omega_2 - \omega_3)$ is the fourth order spectrum corresponding to the ARMA process, with $A(\omega) = \theta_0(\exp(i\omega))/\phi_0(\exp(i\omega))$.

PROOF. See, for example, Brockwell and Davis (1998), Chapter 10.8. $\Box$

Remark 8.5.3 (i) It is interesting to note that in the case that $\{X_t\}$ comes from a linear time series (such as an ARMA process), then using $f_{4,\phi_0,\theta_0}(\omega_1, -\omega_1, \omega_2) = \kappa_4|A(\omega_1)|^2|A(\omega_2)|^2 = \frac{(2\pi)^2\kappa_4}{\sigma^4}f_{\phi_0,\theta_0}(\omega_1)f_{\phi_0,\theta_0}(\omega_2)$ (for linear processes) we have

$W = \frac{1}{2\pi}\int_0^{2\pi}\int_0^{2\pi}\nabla\big(f_{\phi_0,\theta_0}(\omega_1)^{-1}\big)\big(\nabla f_{\phi_0,\theta_0}(\omega_2)^{-1}\big)'\,f_{4,\phi_0,\theta_0}(\omega_1, -\omega_1, \omega_2)\,d\omega_1d\omega_2$
$= \frac{2\pi\kappa_4}{\sigma^4}\Big[\int_0^{2\pi}\nabla\log f_{\phi_0,\theta_0}(\omega)d\omega\Big]\Big[\int_0^{2\pi}\nabla\log f_{\phi_0,\theta_0}(\omega)d\omega\Big]'$
$= \frac{2\pi\kappa_4}{\sigma^4}\Big[\nabla\Big(2\pi\log\frac{\sigma^2}{2\pi}\Big)\Big]\Big[\nabla\Big(2\pi\log\frac{\sigma^2}{2\pi}\Big)\Big]' = 0,$

where we note that $\int_0^{2\pi}\log f_{\phi_0,\theta_0}(\omega)d\omega = 2\pi\log\frac{\sigma^2}{2\pi}$ by Kolmogorov's formula, which does not depend on $(\phi, \theta)$. Hence for linear processes the higher order cumulant plays no role and the above theorem reduces to

$\sqrt{n}\big((\hat{\phi}_n^w - \phi_0), (\hat{\theta}_n^w - \theta_0)\big) \stackrel{D}{\to} \mathcal{N}\big(0, V^{-1}\big).$

(ii) Since the GMLE and the Whittle likelihood are asymptotically equivalent, they should lead to the same asymptotic distributions. We recall that the GMLE has the asymptotic distribution $\sqrt{n}(\hat{\phi}_n - \phi_0, \hat{\theta}_n - \theta_0) \stackrel{D}{\to} \mathcal{N}(0, \sigma_0^2\Gamma^{-1})$, where

$\Gamma = \begin{pmatrix} E(U_tU_t') & E(V_tU_t') \\ E(U_tV_t') & E(V_tV_t') \end{pmatrix}$

and $\{U_t\}$ and $\{V_t\}$ are autoregressive processes which satisfy $\phi_0(B)U_t = \varepsilon_t$ and $\theta_0(B)V_t = \varepsilon_t$. It can be shown that

$\begin{pmatrix} E(U_tU_t') & E(V_tU_t') \\ E(U_tV_t') & E(V_tV_t') \end{pmatrix} = \frac{1}{2\pi}\int_0^{2\pi}\big(\nabla f_{\phi_0,\theta_0}(\omega)\big)\big(\nabla f_{\phi_0,\theta_0}(\omega)\big)'\,f_{\phi_0,\theta_0}(\omega)^{-2}\,d\omega.$
Chapter 9

Nonlinear Time Series

So far we have focused on linear time series, that is time series which have the representation

$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},$

where $\{\varepsilon_t\}$ are iid random variables. Such models are extremely useful and are used widely in several applications. However, a typical realisation from a linear time series will be quite regular, with no sudden bursts or jumps. This is due to the linearity of the system. If one looks at financial data, for example, there are sudden bursts of volatility and extreme values, which calm down after a while. It is not possible to model such behaviour well with a linear time series. In order to capture this nonlinear behaviour, several nonlinear models have been proposed. These models typically consist of products of random variables which help to explain the sudden erratic bursts in the data. Over the past 30 years there has been a lot of research into nonlinear time series models. Popular nonlinear models include the bilinear model, (G)ARCH-type models, random autoregressive coefficient models and threshold models, to name but a few (see, for example, Subba Rao (1977), Granger and Andersen (1978), Nicholls and Quinn (1982), Engle (1982), Subba Rao and Gabr (1984), Bollerslev (1986), Terdik (1999), Fan and Yao (2003) and Straumann (2005)).

In this chapter we will focus on the ARCH model, which is extremely popular in financial time series (and its closely related cousin, the GARCH model). Before fitting a nonlinear model it is important to establish whether it is worth fitting a nonlinear time series model to the data; in the second part of the chapter we consider a test for linearity of the time series.

9.1 The ARCH model

ARCH-type processes are often used to model the volatilities in financial markets. They are an example of a nonlinear stochastic process. The ARCH model was first proposed by Engle (1982), and since its conception various different ARCH flavours have been proposed. These include the benchmark GARCH process (Bollerslev (1986)), and the EGARCH, IGARCH and AGARCH models, to name but a few.

In this section we will focus on the original ARCH model. $\{X_t\}$ is called an ARCH($p$) process if it satisfies

$X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2,$   (9.1)

where $E(Z_t) = 0$, $a_0 > 0$, $a_j > 0$ for $j = 1, \ldots, p$ and $\sum_{j=1}^{p}a_j = \rho < 1$. Excellent references for the properties of stationary ARCH processes and their estimation are Giraitis et al. (2000) and Straumann (2005).
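A minimal sketch simulating an ARCH(1) path (the parameter values and burn-in length are illustrative choices of ours); realisations show the bursts of volatility mentioned above, and the sample variance can be compared with the stationary variance $a_0/(1 - \sum_j a_j)$ derived later in this section.

```python
import numpy as np

def simulate_arch(n, a0=0.2, a=(0.6,), burn_in=500):
    """Simulate X_t = sigma_t Z_t, sigma_t^2 = a0 + sum_j a_j X_{t-j}^2, with Z_t ~ N(0,1)."""
    p = len(a)
    a = np.array(a)
    X = np.zeros(n + burn_in)
    Z = np.random.randn(n + burn_in)
    for t in range(p, n + burn_in):
        sigma2 = a0 + np.sum(a * X[t - p:t][::-1] ** 2)   # uses X_{t-1}, ..., X_{t-p}
        X[t] = np.sqrt(sigma2) * Z[t]
    return X[burn_in:]

X = simulate_arch(2000)
print(X.var(), 0.2 / (1 - 0.6))     # sample variance vs a0 / (1 - sum_j a_j)
```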
9.1.1 Some properties of the ARCH process

By expanding $X_t^2$ as a Volterra series (a nonlinear generalisation of the moving average process) we obtain the following theorem.

Theorem 9.1.1 Suppose $\{X_t\}$ is an ARCH($p$) process and $E(Z_t^2) = 1$. Then the series

$X_t^2 = a_0Z_t^2 + \sum_{k\geq 1}m_t(k),$   (9.2)

where $m_t(k) = \sum_{j_1,\ldots,j_k\geq 1}a_0\Big(\prod_{r=1}^{k}a_{j_r}\Big)\prod_{r=0}^{k}Z^2_{t-\sum_{s=0}^{r}j_s}$ (with $j_0 = 0$),

converges almost surely, has a finite mean and is the unique, stationary, ergodic solution of (9.1).

PROOF. A formal expansion of (9.1) gives (9.2). We first show that $X_t^2$ is well defined. Since (9.2) is the sum of positive random variables and the coefficients are also positive, we need only show that the expectation of (9.2) is finite. By using $E(Z_t^2) = 1$, $\sum_{j=1}^{p}a_j \leq \rho < 1$ and the monotone convergence theorem we can obtain a finite bound for the expectation of (9.2). Since $\{Z_t^2\}$ are iid random variables, we notice that $X_t^2 = g(Z_t^2, Z_{t-1}^2, \ldots)$ where

$g(x_t, x_{t-1}, \ldots) = a_0x_t + \sum_{k=1}^{\infty}\sum_{j_1,\ldots,j_k\geq 1}a_0\Big(\prod_{r=1}^{k}a_{j_r}\Big)\prod_{r=0}^{k}B^{\sum_{s=0}^{r}j_s}x_t \quad (j_0 = 0)$

and $B^kx_t = x_{t-k}$. Therefore $g(\cdot)$ is a time-invariant function. Thus, by using Theorem 5.2.1, $\{X_t^2\}$ is a stationary, ergodic process.

To show uniqueness of $X_t^2$ we must show that any other solution is equal to $X_t^2$ with probability one. Suppose $\{Y_t^2\}$ is another solution of (9.1). By recursively applying relation (9.1) $r$ times to $Y_t^2$ we have

$Y_t^2 = a_0Z_t^2 + \sum_{k=1}^{r-1}m_t(k) + B_r, \qquad \text{where } B_r = \sum_{j_r<\ldots<j_0=t}\Big(\prod_{i=1}^{r}a_{j_{i-1}-j_i}\Big)Y^2_{j_r}\prod_{i=0}^{r-1}Z^2_{j_i}.$

Thus the difference between $Y_t^2$ and $X_t^2$ is

$X_t^2 - Y_t^2 = A_r - B_r, \qquad \text{where } A_r = \sum_{k=r}^{\infty}m_t(k).$

We now show that for any $\varepsilon > 0$, $\sum_{r=1}^{\infty}P(|A_r - B_r| > \varepsilon) < \infty$ (this implies, by the Borel-Cantelli lemma, that the event $\{|A_r - B_r| > \varepsilon\}$ can only occur finitely often with probability one; if this is true for all $\varepsilon > 0$, we have $|A_r - B_r|\stackrel{a.s.}{\to}0$). By using $E(Z_t^2) = 1$ and $\sum_{j=1}^{p}a_j = \rho$ we have $E(A_r) \leq C\rho^r$. Furthermore, $Y^2_{j_r}$ and $\prod_{i=0}^{r-1}Z^2_{j_i}$ are independent (if $i < r$ then $j_i > j_r$). Therefore $E(Y^2_{j_r}\prod_{i=0}^{r-1}Z^2_{j_i}) = E(Y^2_{j_r})$ and we have

$E(B_r) = \sum_{j_r<\ldots<j_0=t}\Big(\prod_{i=1}^{r}a_{j_{i-1}-j_i}\Big)E(Y^2_{j_r}) \leq \Big(\sup_jE(Y^2_j)\Big)\rho^r.$

Now by using the Markov inequality we have $P(A_r > \varepsilon) \leq C_1\rho^r/\varepsilon$ and $P(B_r > \varepsilon) \leq C_1\rho^r/\varepsilon$ for some constant $C_1$. Therefore $P(|A_r - B_r| > \varepsilon) \leq C_2\rho^r/\varepsilon$, and thus $\sum_{r=1}^{\infty}P(|A_r - B_r| > \varepsilon) < \infty$. Since this is true for all $\varepsilon > 0$, we have $Y_t^2 \stackrel{a.s.}{=} X_t^2$, and therefore the required result. $\Box$

We first observe that since $X_t = \sigma_tZ_t$ we have $\mathrm{cov}(\sigma_tZ_t, \sigma_sZ_s) = 0$ for $s \neq t$. Hence the ARCH process is an uncorrelated but dependent process.

In some sense the ARCH model can be considered as a generalisation of the AR model; that is, the squares of the ARCH model satisfy

$X_t^2 = \sigma_t^2Z_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + (Z_t^2 - 1)\sigma_t^2.$   (9.3)

We observe that since $\sum_{j=1}^{p}|a_j| < 1$, the roots of the characteristic polynomial $a(z) = 1 - \sum_{j=1}^{p}a_jz^j$ lie outside the unit circle. Moreover, $\epsilon_t = (Z_t^2 - 1)\sigma_t^2$ are martingale differences (since $E((Z_t^2 - 1)\sigma_t^2\,|\,X_{t-1}, X_{t-2}, \ldots) = \sigma_t^2E(Z_t^2 - 1) = 0$), hence $\mathrm{cov}(\epsilon_t, \epsilon_s) = 0$ for $s \neq t$. In many respects (9.3) is similar to an AR representation, except that the $\{\epsilon_t\}$ are martingale differences and not iid random variables. We now obtain the best predictor of $X_t^2$ given $X_{t-1}^2, \ldots$. We first note that the best predictor of $X_t^2$ in the sigma-algebra $\mathcal{F}_{t-1} = \sigma(X_{t-1}, \ldots)$ is the random variable $E(X_t^2\,|\,\mathcal{F}_{t-1}) \in \mathcal{F}_{t-1}$, since it minimises the mean squared error $\min_{Y\in\mathcal{F}_{t-1}}E(X_t^2 - Y)^2$. Hence the conditional expectation is

$E\big(X_t^2\,\big|\,X_{t-1}^2, \ldots\big) = E\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + (Z_t^2 - 1)\sigma_t^2\,\Big|\,X_{t-1}^2, \ldots\Big) = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2.$

We mention that usually the best predictor gives a smaller mean squared error than the best linear predictor; however in this case, since the best predictor is a linear combination of $X_{t-1}^2, \ldots$, it is also the best linear predictor.

Using (9.3) and taking expectations we have

$E(X_t^2) = a_0 + \sum_{j=1}^{p}a_jE(X_{t-j}^2) \quad\Longrightarrow\quad E(X_t^2) = \frac{a_0}{1 - \sum_{j=1}^{p}a_j}.$

Moreover, by using (9.2) it can be shown that $E(X_t^{2n})$ is finite if and only if $E(Z_t^{2n})^{1/n}\sum_{j=1}^{p}a_j < 1$. We see that this places quite a severe restriction on the innovations $Z_t$, and in general only a few moments of the ARCH process exist. This, however, fits with empirical observations, where it is believed that financial data is often thick tailed.
9.2 The quasi-maximum likelihood for ARCH processes

In this section we consider an estimator of the parameters $a_0 = \{a_j : j = 0, \ldots, p\}$ given the observations $\{X_t : t = 1, \ldots, n\}$, where $\{X_t\}$ is an ARCH($p$) process. We use the conditional log-likelihood to construct the estimator. We will assume throughout that $E(Z_t^2) = 1$ and $\sum_{j=1}^{p}a_j = \rho < 1$.

We now construct an estimator of the ARCH parameters based on $Z_t \sim \mathcal{N}(0, 1)$. It is worth mentioning that, despite the criterion being constructed under this assumption, it is not necessary that the innovations $Z_t$ are normally distributed; in the case that the innovations are not normally distributed but have a finite fourth moment the estimator is still good. This is why it is called the quasi-maximum likelihood rather than the maximum likelihood (similar to how the GMLE estimates the parameters of an ARMA model regardless of whether the innovations are Gaussian or not).

Let us suppose that $Z_t$ is Gaussian. Since $Z_t = X_t/\sqrt{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}$, $E(X_t\,|\,X_{t-1}, \ldots, X_{t-p}) = 0$ and $\mathrm{var}(X_t\,|\,X_{t-1}, \ldots, X_{t-p}) = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2$, the log density of $X_t$ given $X_{t-1}, \ldots, X_{t-p}$ is, up to constants, proportional to

$-\Big(\log\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big) + \frac{X_t^2}{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}\Big).$

Therefore the conditional log density of $X_{p+1}, X_{p+2}, \ldots, X_n$ given $X_1, \ldots, X_p$ is proportional to

$-\sum_{t=p+1}^{n}\Big(\log\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big) + \frac{X_t^2}{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}\Big).$

This inspires the conditional log-likelihood criterion

$L_n(\alpha) = \frac{1}{n-p}\sum_{t=p+1}^{n}\Big(\log\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2\Big) + \frac{X_t^2}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2}\Big).$

To obtain the estimator we define the parameter space

$\Theta = \Big\{\alpha = (\alpha_0, \ldots, \alpha_p) : \sum_{j=1}^{p}\alpha_j \leq 1,\; 0 < c_1 \leq \alpha_0 \leq c_2 < \infty,\; c_1 \leq \alpha_j\Big\}$

and assume the true parameters lie in its interior, $a_0 = (a_0, \ldots, a_p) \in \mathrm{Int}(\Theta)$. We let

$\hat{a}_n = \arg\min_{\alpha\in\Theta}L_n(\alpha).$   (9.4)
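A sketch of the quasi-likelihood minimisation in (9.4) for an ARCH(1), using a generic numerical optimiser (scipy); the starting values and box constraints are our choices standing in for the parameter space $\Theta$.

```python
import numpy as np
from scipy.optimize import minimize

def arch1_qmle(X, c1=1e-4, c2=10.0):
    """Minimise L_n(alpha) = (n-p)^{-1} sum_t [ log(sigma_t^2(alpha)) + X_t^2 / sigma_t^2(alpha) ]
    for an ARCH(1), over c1 <= alpha_0 <= c2 and c1 <= alpha_1 <= 1."""
    def L_n(alpha):
        a0, a1 = alpha
        sigma2 = a0 + a1 * X[:-1] ** 2          # sigma_t^2 for t = 2, ..., n
        return np.mean(np.log(sigma2) + X[1:] ** 2 / sigma2)

    res = minimize(L_n, x0=[0.1, 0.3], bounds=[(c1, c2), (c1, 1.0)])
    return res.x

# fit to a simulated ARCH(1) path with a0 = 0.2, a1 = 0.6
n = 4000
Z = np.random.randn(n)
X = np.zeros(n)
for t in range(1, n):
    X[t] = np.sqrt(0.2 + 0.6 * X[t - 1] ** 2) * Z[t]
print(arch1_qmle(X))                            # approximately (0.2, 0.6)
```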
9.2.1 Consistency of the quasi-maximum likelihood estimator
In this section we will show consistency and asymptotic normality of the estimator a
n
. As
mentioned about, very few moments of the ARCH process exist (the more moments that exist
the more restricted the parameters a
0
are). Hence we want to prove the results under weak
moment conditions. We will see below that the choice of the parameter space (where the
parameters are bounded from below) helps reduce the number of moments. We mention that
116
the results proved here are not under the weakest conditions and that ARCH models are a subset
of GARCH models. Consistency and asymptotic normality of the GARCH QMLE parameter
estimator (which is close to the ARCH QMLE discussed in the previous sections, but uses many
of the ideas in ARMA estimation) was rst shown (under rather weak conditions) in Berkes
et al. (2003).
It is straightforward to prove the result by using the erdogic theorem repeatedly. Let
f

(X
t
) = log(
0
+

p
j=1

j
X
2
tj
) +
X
2
t
2(
0
+
P
p
j=1

j
X
2
tj
)
. By Theorem 5.2.1, for every ,
f

(X
t
) : t Z is a stationary, ergodic processes. This allows us to obtain the limit of L
n
().
We rst want to show that the limit of these quantities are bounded.
Lemma 9.2.1 Let us suppose that X
t
is a stationary process with

p
j=1
a
j
< 1. Then
sup

1
n p
n

t=p+1

log(
0
+
p

j=1

j
X
2
tj
)

1
n p
n

t=p+1
log(c
2
+
p

j=1
c
2
X
2
tj
)
a.s.
Elog
_
c
2
+
p

j=1
c
2
X
2
tj
_
sup

1
n p
n

t=p+1

X
2
t

0
+

p
j=1

j
X
2
tj

1
n p
n

t=p+1
X
2
t
c
1
[
a.s.

1
c
1
E(X
2
t
)
sup

1
n p
n

t=p+1

X
2
tj

0
+

p
j=1

j
X
2
tj

1
c
1
1 j p
sup

1
n p
n

t=p+1

X
2
t
X
2
tj
(
0
+

p
j=1

j
X
2
tj
)
2

1
n p
n

t=p+1
X
2
t
c
2
1
a.s.

1
c
2
1
E(X
2
t
) 1 j p
sup

1
n p
n

t=p+1

X
2
t
X
2
tj
1
X
tj
2
(
0
+

p
j=1

j
X
2
tj
)
3

1
n p
n

t=p+1
X
2
t
c
3
1
a.s.

1
c
3
1
E(X
2
t
) 1 j
1
, j
2
p
sup

1
n p
n

t=p+1

X
2
t
X
2
tj
1
X
2
tj
2
X
2
tj
3
(
0
+

p
j=1

j
X
2
tj
)
4

1
n p
n

t=p+1
X
2
t
c
4
1
a.s.

1
c
4
1
E(X
2
t
) 1 j
1
, j
2
, j
3
p.
PROOF. The proof is straightforward from the definition of $\Theta$. $\Box$

We note that if we did not bound the parameters $\alpha_j$ in $\Theta$ away from zero, then it would not be possible to show that the above expectations are finite without assuming that $E(X_t^4)<\infty$, which we recall is a highly restrictive assumption.
Let
\[
L(\alpha) = E\Big[\log\Big(\alpha_0+\sum_{j=1}^{p}\alpha_j X_{t-j}^2\Big)\Big] + E\Big[\frac{X_t^2}{\alpha_0+\sum_{j=1}^{p}\alpha_j X_{t-j}^2}\Big].
\]
Lemma 9.2.2 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j<1$. Then for every $\alpha\in\Theta$
\[
L_n(\alpha)\overset{a.s.}{\to}L(\alpha), \qquad \nabla^2 L_n(\alpha)\overset{a.s.}{\to}\nabla^2 L(\alpha),
\]
\[
\sup_{\alpha\in\Theta}|\nabla L_n(\alpha)| \le \frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{c_1}\Big(\sum_{j=1}^{p}X_{t-j}^2+c_1^{-1}X_t^2\Big) \overset{a.s.}{\to} \frac{1}{c_1}\Big(pE(X_{t-j}^2)+c_1^{-1}E(X_t^2)\Big),
\]
\[
\sup_{\alpha\in\Theta}|\nabla^3 L_n(\alpha)| \le \frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{c_1^3}\Big(\sum_{j=1}^{p}X_{t-j}^2+c_1^{-1}X_t^2\Big) \overset{a.s.}{\to} \frac{1}{c_1^3}\Big(pE(X_{t-j}^2)+c_1^{-1}E(X_t^2)\Big).
\]
Lemma 9.2.3 Suppose $L_n(\alpha)$ is defined as in (9.4). Then
\[
\sup_{\alpha\in\Theta}|L_n(\alpha)-L(\alpha)| \overset{a.s.}{\to} 0. \tag{9.5}
\]
PROOF. To prove uniform convergence it is sufficient to show pointwise convergence and stochastic equicontinuity of $L_n(\alpha)$ (since $\Theta$ is compact); see Theorem 5.4.1. By using the ergodic theorem and Lemma 9.2.1 we have, for each $\alpha\in\Theta$,
\[
|L_n(\alpha)-L(\alpha)| \overset{a.s.}{\to} 0. \tag{9.6}
\]
We now show that $L_n(\alpha)$ is stochastically equicontinuous. By the mean value theorem, for every $\alpha_1,\alpha_2\in\Theta$ there exists an $\bar\alpha$ (lying between $\alpha_1$ and $\alpha_2$) such that
\[
|L_n(\alpha_1)-L_n(\alpha_2)| \le \|\nabla L_n(\bar\alpha)\|_2\,\|\alpha_1-\alpha_2\|_2 \le \Big\{\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{2c_1}\Big(1+\frac{X_t^2}{c_1}\Big)\Big\}\,\|\alpha_1-\alpha_2\|_2.
\]
Since $\{X_t\}$ is an ergodic sequence, $\frac{1}{n-p}\sum_{t=p+1}^{n}\frac{1}{2c_1}(1+\frac{X_t^2}{c_1})\overset{a.s.}{\to}E\big(\frac{1}{2c_1}(1+\frac{X_t^2}{c_1})\big)$, which is finite. Hence $L_n(\alpha)$ is stochastically equicontinuous. Now by pointwise convergence of $L_n(\alpha)$, equicontinuity of $L_n(\alpha)$ and the compactness of $\Theta$ we have uniform convergence of $L_n(\alpha)$. $\Box$
Theorem 9.2.1 Suppose $\{X_t: t=1,\ldots,n\}$ comes from an ARCH($p$) process and the estimator $\hat a_n$ is as defined in (9.4). Then $\hat a_n\overset{a.s.}{\to}a_0$ as $n\to\infty$.

PROOF. The result follows immediately from Lemma 9.2.3 and Theorem 5.4.1. $\Box$
9.2.2 Asymptotic normality of the quasi-maximum likelihood estimator
To prove asymptotic normality we use Taylor expansion techniques. Since $\nabla L_n(\hat a_n)=0$ we have
\[
(\hat a_n - a_0) = -\big[\nabla^2 L_n(\bar a_n)\big]^{-1}\nabla L_n(a_0), \tag{9.7}
\]
where $\bar a_n$ lies between $\hat a_n$ and $a_0$. In the following lemma we show that $\nabla^2 L_n(\bar a_n)\overset{a.s.}{\to}\nabla^2 L(a_0)$.
Lemma 9.2.4 Let us suppose that $\{X_t\}$ is a stationary process and $\bar a_n$ lies between $\hat a_n$ and $a_0$. Then
\[
\nabla^2 L_n(\bar a_n) \overset{a.s.}{\to} \nabla^2 L(a_0), \qquad\text{where}\qquad \nabla^2 L(a_0)=E\Big(\frac{\underline{X}_{t-1}\underline{X}_{t-1}'}{\sigma_t^4}\Big),
\]
with $\underline{X}_{t-1}'=(1,X_{t-1}^2,\ldots,X_{t-p}^2)$ and $\sigma_t^2=a_0+\sum_{j=1}^{p}a_j X_{t-j}^2$.
PROOF. To prove the result we consider
\[
\big\|\nabla^2 L_n(\bar a_n)-\nabla^2 L(a_0)\big\| \le \big\|\nabla^2 L_n(\bar a_n)-\nabla^2 L_n(a_0)\big\| + \big\|\nabla^2 L_n(a_0)-\nabla^2 L(a_0)\big\| \le \sup_{\alpha\in\Theta}\big\|\nabla^3 L_n(\alpha)\big\|_2\,\|\bar a_n-a_0\|_2 + \big\|\nabla^2 L_n(a_0)-\nabla^2 L(a_0)\big\|.
\]
Using Lemma 9.2.1 we have that $\sup_{\alpha\in\Theta}\|\nabla^3 L_n(\alpha)\|_2$ is almost surely bounded in the limit. Hence, by using the above, Lemma 9.2.2, and that $\hat a_n\overset{a.s.}{\to}a_0$ (so that $\bar a_n\overset{a.s.}{\to}a_0$), we have
\[
\big\|\nabla^2 L_n(\bar a_n)-\nabla^2 L(a_0)\big\| \overset{a.s.}{\to} 0,
\]
as required. $\Box$
We now show asymptotic normality of $\nabla L_n(a_0)$, which requires the following assumption.

Assumption 9.2.1 For some $\delta>0$,
\[
E\big(|Z_t|^{4(1+\delta)}\big) < \infty. \tag{9.8}
\]
It can be shown that the first derivative of $L_n(\alpha)$ is
\[
\nabla L_n(\alpha) = \frac{1}{n-p}\sum_{t=p+1}^{n}\Big\{\frac{\underline{X}_{t-1}}{\alpha_0+\sum_{j=1}^{p}\alpha_j X_{t-j}^2} - \frac{X_t^2\,\underline{X}_{t-1}}{\big(\alpha_0+\sum_{j=1}^{p}\alpha_j X_{t-j}^2\big)^2}\Big\},
\]
where $\underline{X}_{t-1}'=(1,X_{t-1}^2,\ldots,X_{t-p}^2)$. Now evaluating the above at the true parameter (using $X_t^2=\sigma_t^2+(Z_t^2-1)\sigma_t^2$) we have
\[
\nabla L_n(a_0) = -\frac{1}{n-p}\sum_{t=p+1}^{n}(Z_t^2-1)\,\sigma_t^{-2}\,\underline{X}_{t-1}.
\]
Lemma 9.2.5 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j<1$ and Assumption 9.2.1 is satisfied. Then we have
\[
\sqrt{n}\,\nabla L_n(a_0) \overset{D}{\to} \mathcal{N}(0,\tau^2\Lambda),
\]
where $\tau^2=\mathrm{var}(Z_t^2)$ and $\Lambda = E\big(\underline{X}_{t-1}\underline{X}_{t-1}'\,\sigma_t^{-4}\big)$.

PROOF. To prove the result we note that $\{(Z_t^2-1)\sigma_t^{-2}\underline{X}_{t-1}\}_t$ is a sequence of (vector) martingale differences. Hence we use the martingale central limit theorem, together with the Cramér–Wold device, to prove the result. $\Box$
Theorem 9.2.2 Let us suppose that $\{X_t\}$ is a stationary process with $\sum_{j=1}^{p}a_j<1$ and Assumption 9.2.1 is satisfied. Then we have
\[
\sqrt{n}(\hat a_n - a_0) \overset{D}{\to} \mathcal{N}\big(0,\tau^2\Lambda^{-1}\big).
\]
PROOF. To prove the result we use (9.7) and Lemmas 9.2.4 and 9.2.5. It is straightforward, hence we omit the details. $\Box$
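As an illustration of how Theorem 9.2.2 can be used in practice, the sketch below computes plug-in standard errors from $\hat\tau^2\hat\Lambda^{-1}/(n-p)$, where $\hat\tau^2$ is the sample variance of the squared standardised residuals and $\hat\Lambda$ averages $\underline{X}_{t-1}\underline{X}_{t-1}'/\hat\sigma_t^4$. It assumes the objects x, p and fit produced by the earlier ARCH sketch and is illustrative rather than definitive.

def qmle_std_errors(alpha_hat, x, p):
    # Plug-in standard errors based on the limit N(0, tau^2 * Lambda^{-1}).
    a0, acoef = alpha_hat[0], alpha_hat[1:]
    n = len(x)
    X2 = np.column_stack([x[p - j:n - j] ** 2 for j in range(1, p + 1)])
    sigma2 = a0 + X2 @ acoef
    Z2 = x[p:] ** 2 / sigma2                          # squared standardised residuals
    tau2 = np.var(Z2)                                 # estimate of var(Z_t^2)
    Xlag = np.column_stack([np.ones(n - p), X2])      # X_{t-1} = (1, X_{t-1}^2, ..., X_{t-p}^2)'
    Lam = (Xlag[:, :, None] * Xlag[:, None, :] / sigma2[:, None, None] ** 2).mean(axis=0)
    avar = tau2 * np.linalg.inv(Lam) / (n - p)
    return np.sqrt(np.diag(avar))

print("approximate standard errors:", qmle_std_errors(fit.x, x, p))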
9.3 Testing for linearity of a time series
In this section we consider a test for linearity, first proposed by Subba Rao and Gabr (1980) (where the full details can be found). In the same paper a test for Gaussianity is also given; however, in this section we will focus on the test for linearity. We mention that the test is based on the third order spectrum, but without much modification one can also construct a similar test based on the fourth (or higher) order spectrum (which may be useful in the case that the innovation density is symmetric about zero, since then $\mathrm{cum}(\varepsilon_t,\varepsilon_t,\varepsilon_t)=0$ and we need to use the fourth order spectrum).
9.3.1 Motivating the test statistic
In order to motivate the test let us review some of the results in Section 8.3.4, where we defined the higher order spectra, which are based on the higher order cumulants of a process. The higher order spectra of a linear process have a very nice form. Let us suppose that $\{X_t\}$ satisfies
\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]
where $\sum_{j=-\infty}^{\infty}|\psi_j|<\infty$, $E(\varepsilon_t)=0$ and $E(|\varepsilon_t|^3)<\infty$. Let $A(\omega)=\sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)$. Then it is straightforward to show that the third order spectrum is
\[
f_3(\omega_1,\omega_2) = \kappa_3\,A(\omega_1)A(\omega_2)A(-\omega_1-\omega_2), \tag{9.9}
\]
where $\kappa_3=\mathrm{cum}(\varepsilon_t,\varepsilon_t,\varepsilon_t)$. We recall that the fourth order spectrum has a similar form and that the (second order) spectral density is $f(\omega)=\sigma^2 A(\omega)A(-\omega)$, where $\sigma^2=\mathrm{var}(\varepsilon_t)$ (any $2\pi$ normalising constants are omitted here, since they do not affect the argument below). In other words, the spectra (of all orders) of a linear process can be deduced from the transfer function $A(\omega)$ and the cumulants of the innovations. This is not the case for nonlinear processes, where the higher order spectra can have a very complicated form. It is worth noting that although the third and fourth order spectra of an ARCH(1) process are messy to evaluate, it can be shown that its third order spectrum does not have the form given in (9.9).
The representation of $f_3$ in terms of the transfer function $A(\omega)$ is not unique to linear processes; it is possible to construct a rather strange nonlinear process for which $f_3(\omega_1,\omega_2)=\kappa_3 A(\omega_1)A(\omega_2)A(-\omega_1-\omega_2)$. However, if the third order spectrum does not have this form, then we can conclude that the process is nonlinear.
Returning to (9.9) and taking its absolute square gives
\[
|f_3(\omega_1,\omega_2)|^2 = \kappa_3^2\,|A(\omega_1)|^2|A(\omega_2)|^2|A(\omega_1+\omega_2)|^2 = \frac{\kappa_3^2}{\sigma^6}\,f(\omega_1)f(\omega_2)f(\omega_1+\omega_2).
\]
Therefore
\[
\frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)} = \frac{\kappa_3^2}{\sigma^6}. \tag{9.10}
\]
In other words, in the case that $f_3$ satisfies (9.9), the ratio $|f_3(\omega_1,\omega_2)|^2/\{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)\}$ is constant over all frequencies $(\omega_1,\omega_2)$.
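The constancy of the ratio is easy to verify numerically for a simple linear filter. The sketch below (not part of the original notes) evaluates the population quantities in (9.9) and (9.10) for an MA(1) filter on a small grid of frequencies; the filter coefficients and the value of $\kappa_3$ are arbitrary illustrative choices.

import numpy as np

def transfer(psi, omega):
    # A(omega) = sum_j psi_j exp(i j omega) for a finite filter psi.
    j = np.arange(len(psi))
    return np.sum(psi * np.exp(1j * j * omega))

psi, kappa3, sigma2 = np.array([1.0, 0.7]), 1.5, 1.0   # MA(1): X_t = e_t + 0.7 e_{t-1}

def pop_ratio(w1, w2):
    f3 = kappa3 * transfer(psi, w1) * transfer(psi, w2) * transfer(psi, -w1 - w2)
    f = lambda w: sigma2 * abs(transfer(psi, w)) ** 2
    return abs(f3) ** 2 / (f(w1) * f(w2) * f(w1 + w2))

grid = np.linspace(0.1, 2.5, 4)
print([round(pop_ratio(w1, w2), 6) for w1 in grid for w2 in grid])
# every entry equals kappa3**2 / sigma2**3, in agreement with (9.10)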
This discussion motivates the following test for linearity. We estimate $f$ and $f_3$, then construct the test statistic based on (9.10) and the estimates of $f$ and $f_3$. Hence, in the test for linearity we will test
\[
H_0:\ \frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)} = \text{constant} \qquad\text{against}\qquad H_A:\ \frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)}\ \text{depends on }(\omega_1,\omega_2).
\]
9.3.2 Estimates of the higher order spectrum
Let us suppose $\{X_t\}$ is a stationary time series and we observe $\{X_t\}_{t=1}^{n}$. In this section we use $\{X_t\}_{t=1}^{n}$ to estimate $f$ and $f_3$.
We recall that in Section 8.4.2 we considered methods to estimate the spectral density: the periodogram (frequency domain) estimator and the time domain (lag window) estimator, both of which are asymptotically equivalent. We now define an estimator of $f_3$ based on the latter method. We recall that an estimate of the spectral density is
\[
\hat f_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\,\hat c_n(k)\exp(-ik\omega), \tag{9.11}
\]
where $\lambda$ is the lag window, $m\ll n$ and
\[
\hat c_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(X_t-\bar X)(X_{t+|k|}-\bar X).
\]
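A direct transcription of (9.11) into code might look as follows. This is a sketch, not the authors' implementation: the Bartlett lag window is an arbitrary illustrative choice for $\lambda$, and no attempt is made at computational efficiency.

import numpy as np

def acov(x, k):
    # Sample autocovariance c_n(k) = n^{-1} sum_t (X_t - Xbar)(X_{t+|k|} - Xbar).
    x = np.asarray(x) - np.mean(x)
    k = abs(k)
    return np.sum(x[:len(x) - k] * x[k:]) / len(x)

def lag_window_spec(x, omega, m, lam=lambda u: np.maximum(1 - np.abs(u), 0)):
    # Lag-window estimate (9.11) of the spectral density at frequency omega.
    n = len(x)
    total = sum(lam(k / m) * acov(x, k) * np.exp(-1j * k * omega)
                for k in range(-(n - 1), n))
    return total.real / (2 * np.pi)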
Using a similar method we can define an estimator of $f_3$. We first need to estimate the third order cumulant:
\[
\hat c_n(k_1,k_2) = \frac{1}{n}\sum_{t=1}^{n-\max(0,k_1,k_2)}(X_t-\bar X)(X_{t+k_1}-\bar X)(X_{t+k_2}-\bar X).
\]
Using the above, the third order spectrum estimator is
\[
\hat f_{3,n}(\omega_1,\omega_2) = \frac{1}{(2\pi)^2}\sum_{k_1=-(n-1)}^{n-1}\ \sum_{k_2=-(n-1)}^{n-1}\lambda\Big(\frac{k_1}{m}\Big)\lambda\Big(\frac{k_2}{m}\Big)\lambda\Big(\frac{k_1+k_2}{m}\Big)\,\hat c_n(k_1,k_2)\exp(-ik_1\omega_1-ik_2\omega_2) \tag{9.12}
\]
(for details on this estimator see Van Ness (1966)). Now using (9.10) it is clear that the test statistic should be based on the ratios $|\hat f_{3,n}(\omega_1,\omega_2)|^2/\{\hat f_n(\omega_1)\hat f_n(\omega_2)\hat f_n(\omega_1+\omega_2)\}$ over several different values of $(\omega_1,\omega_2)$. That is,
\[
\hat g(\omega_1,\omega_2) = \frac{|\hat f_{3,n}(\omega_1,\omega_2)|^2}{\hat f_n(\omega_1)\hat f_n(\omega_2)\hat f_n(\omega_1+\omega_2)}, \tag{9.13}
\]
where $\hat g(\omega_1,\omega_2)$ is an estimator of $g(\omega_1,\omega_2)=|f_3(\omega_1,\omega_2)|^2/\{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)\}$. If the sample size $n$ and the window length $m$ are sufficiently large, Van Ness (1966) and Brillinger (2001) have shown that $\hat g(\omega_1,\omega_2)$ is asymptotically normally distributed.
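The third order quantities and the ratio (9.13) can be estimated in the same spirit. The sketch below continues the previous snippet; the Bartlett window (whose compact support justifies truncating the sums at $|k_i|\le m$) is again an illustrative choice, and the loops are deliberately naive.

def acum3(x, k1, k2):
    # Sample third-order cumulant c_n(k1, k2); all indices are kept inside the sample.
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    t = np.arange(max(0, -k1, -k2), n - max(0, k1, k2))
    return np.sum(x[t] * x[t + k1] * x[t + k2]) / n

def lag_window_bispec(x, w1, w2, m, lam=lambda u: np.maximum(1 - np.abs(u), 0)):
    # Lag-window estimate (9.12) of the third order spectrum at (w1, w2).
    total = 0.0 + 0.0j
    for k1 in range(-m, m + 1):
        for k2 in range(-m, m + 1):
            w = lam(k1 / m) * lam(k2 / m) * lam((k1 + k2) / m)
            if w != 0:
                total += w * acum3(x, k1, k2) * np.exp(-1j * (k1 * w1 + k2 * w2))
    return total / (2 * np.pi) ** 2

def g_hat(x, w1, w2, m):
    # The ratio statistic (9.13), built from the estimators above.
    denom = (lag_window_spec(x, w1, m) * lag_window_spec(x, w2, m)
             * lag_window_spec(x, w1 + w2, m))
    return abs(lag_window_bispec(x, w1, w2, m)) ** 2 / denom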
The ratio $\hat g(\omega_1,\omega_2)$ can be evaluated at all the fundamental frequencies $(\frac{2\pi k_1}{n},\frac{2\pi k_2}{n})$, and potentially a test of the constancy of $g(\omega_1,\omega_2)$ over all $(\frac{2\pi k_1}{n},\frac{2\pi k_2}{n})$, $1\le k_1,k_2\le n$, could be performed (note that it can be shown that there are regions of $[0,2\pi]^2$ where there are repetitions, but for ease of presentation we shall not take this into account). A problem is that, due to the smoothing in (9.11) and (9.12), there will be quite a lot of dependence in $\hat g$ over adjacent frequencies. Therefore, rather than consider the grid $(\frac{2\pi k_1}{n},\frac{2\pi k_2}{n})$, we consider the coarser grid $\{(\frac{2\pi k_1 m}{n},\frac{2\pi k_2 m}{n}): 1\le k_1,k_2\le n/m\}$ (notice that $m$ is the window length). On the coarser grid the random variables $\hat g(\omega_1,\omega_2)$ tend to be less correlated. To see why frequencies which are not too close tend to be uncorrelated, consider the simpler example of the smoothed periodogram: the periodogram is asymptotically uncorrelated at adjacent frequencies, and although smoothing introduces dependence, by choosing frequencies which are not too close the smoothed periodogram is close to uncorrelated. We now define a neighbourhood. Each neighbourhood $N_{s_1,s_2}$ consists of all the frequencies
\[
N_{s_1,s_2} = \Big\{\Big(\frac{2\pi s_1 r m}{n}+\frac{2\pi k_1 m}{n},\ \frac{2\pi s_2 r m}{n}+\frac{2\pi k_2 m}{n}\Big):\ -(r-1)\le k_1,k_2\le r\Big\},
\]
hence $N_{s_1,s_2}$ is the neighbourhood of $(\frac{2\pi s_1 r m}{n},\frac{2\pi s_2 r m}{n})$. An illustration of this grouping is given in Figure 9.1.
in Figure 9.1.
The idea is that all the small squares inside the local neighbour should have a mean which is
about the same because we have assumed that the spectrums f and f
3
are suciently smooth
functions. In other words
E(

f
3
(
1
,
2
)) f
3
_
2s
1
rm
n
,
2s
2
rm
n
_
,
E(

f
n
(
1
)

f
n
(
2
)

f
n
(
1

2
)) f
_
2s
1
rm
n
_
f
_
2s
2
rm
n
_
f
_
2s
1
rm
n

2s
2
rm
n
_
, (9.14)
for all
1
,
2
N
s
1
,s
2
. Now we dene a vector containing the g evaluated for all the values in
N
s
1
,s
2

s
1
,s
2
=
_
g
_
2s
1
rm
n

2(r + 1)m
n
,
2s
2
rm
n

2(r + 1)m
n
_
, . . . ,
2s
1
rm
n
+
2rm
n
,
2s
2
rm
n

2rm
n
__
:=
_

s
1
,s
2
(1), . . . ,
s
1
,s
2
(r
2
)
_
We see that
s
1
,s
2
is an (2r)
2
-dimensional vector which contains g evaluated in a neighbourhood
of (
2s
1
rm
n
,
2s
2
rm
n
). Since all the frequencies are in the neighbourhood of (
2s
1
rm
n
,
2s
2
rm
n
) by us-
ing (9.14) we can expect the mean of every single element of
s
1
,s
2
to be about g(
2s
1
rm
n
,
2s
2
rm
n
).
There are (n/(2mr))
2
dierent neighbourhoods. We now construct r
2
dierent vectors each
of dimension (n/(2mr))
2
, where each vector contains one element from each neighbourhood
N
s
1
,s
2
. That is let
Y
k
= (
1,1
(k), . . . ,
n
2mr
,
n
2mr
(k)).
Hence we have $(2r)^2$ random vectors $Y_k$. Now, as mentioned above (see below (9.13)), each element of $Y_k$ is asymptotically normally distributed, and the mean of $Y_k$ is approximately
\[
E(Y_k) \approx \Big(g\Big(\frac{2\pi r m}{n},\frac{2\pi r m}{n}\Big),\ldots,g\big(2\pi,2\pi\big)\Big).
\]
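Continuing the sketches above (and reusing the hypothetical g_hat), the coarse grid, the neighbourhoods $N_{s_1,s_2}$ and the vectors $Y_k$ might be assembled as follows; the indexing convention is an assumption made purely for illustration.

def linearity_vectors(x, m, r):
    # Row k of the returned matrix is the vector Y_k: one g-value from each N_{s1,s2}.
    n = len(x)
    S = n // (2 * m * r)                      # number of neighbourhood centres per axis
    offsets = range(-(r - 1), r + 1)          # the 2r offsets within a neighbourhood
    Y = np.empty(((2 * r) ** 2, S * S))
    for s1 in range(1, S + 1):
        for s2 in range(1, S + 1):
            col = (s1 - 1) * S + (s2 - 1)
            for k, (k1, k2) in enumerate((a, b) for a in offsets for b in offsets):
                w1 = 2 * np.pi * (s1 * r + k1) * m / n
                w2 = 2 * np.pi * (s2 * r + k2) * m / n
                Y[k, col] = g_hat(x, w1, w2, m)
    return Y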
[Figure 9.1: The grid of all frequencies $(\frac{2\pi k_1}{n},\frac{2\pi k_2}{n})$ is made into a coarser grid. Each small square is a locally averaged frequency.]
Since we have chosen the elements from each neighbourhood to be approximately uncorrelated, we can assume that the $Y_k$ are iid normally distributed random vectors.
Returning to the null hypothesis, we recall that we are testing
\[
H_0:\ \frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)} = \text{constant} \qquad\text{against}\qquad H_A:\ \frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)}\ \text{depends on }(\omega_1,\omega_2).
\]
This means that under the null the elements of the vector $Y_k$ all have the same mean. Hence the above can be restated as
\[
H_0:\ \mu_{1,1}=\mu_{2,1}=\ldots=\mu_{\frac{n}{2mr},\frac{n}{2mr}} \qquad\text{against}\qquad H_A:\ \text{at least one of the means is different},
\]
where $\mu_{s_1,s_2}=E\big(\hat g(\frac{2\pi s_1 r m}{n},\frac{2\pi s_2 r m}{n})\big)$. We now review some results in multivariate analysis which will help us to construct the test statistic.
9.3.3 Hotelling's $T^2$-statistic
In this section we summarise some results from multivariate analysis. The interested reader is referred to the excellent book Anderson (2003) (Chapter 5) for a detailed introduction.
Let us suppose that $\{X_t\}$ are iid random vectors of dimension $p$ which are normally distributed with mean $\mu$ and variance $\Sigma$. Hotelling's $T^2$-statistic is generally used to test hypotheses on the mean $\mu$. In many respects it can be considered as a multivariate generalisation of the $t$-test.
We first construct the test statistic to test the hypothesis $H_0:\mu=0$ and then consider the generalisation to the case $H_0:\mu_1=\mu_2=\ldots=\mu_p$. Let us suppose we observe $\{X_t\}_{t=1}^{n}$; then under the null the distribution of $\sqrt{n}\,\bar X_n$ is $\mathcal{N}(0,\Sigma)$, where $\bar X_n=n^{-1}\sum_{t=1}^{n}X_t$. Therefore $n\bar X_n'\Sigma^{-1}\bar X_n\sim\chi^2(p)$. Hence, if the variance $\Sigma$ were known, we could use $n\bar X_n'\Sigma^{-1}\bar X_n$ as the test statistic. Of course, $\Sigma$ will often be unknown and needs to be estimated from the observations. Let
\[
S = \sum_{t=1}^{n}(X_t-\bar X_n)(X_t-\bar X_n)'.
\]
Then we can use $\frac{1}{n-p}S$ as an estimator of $\Sigma$ (noting that we normalise by $n-p$ because we are estimating $p$ means). Therefore under the null we can use as the test statistic
\[
T^2 = \bar X_n' S^{-1}\bar X_n
\]
(note we could normalise by $n/(n-p)$). We note that under the assumption of normality of $X_t$, $S$ is effectively the sum of $n-p$ random variables, hence roughly speaking $S\sim\chi^2(n-p)$ (of course this is not strictly true, since $S$ is a random matrix and not a scalar). We note that if $Y\sim\chi^2(p)$ and $X\sim\chi^2(n-p)$ are independent, then $(Y/p)/(X/(n-p))=\frac{n-p}{p}\frac{Y}{X}\sim F_{p,n-p}$. Now under the null we have $n\bar X_n'\Sigma^{-1}\bar X_n\sim\chi^2(p)$ and $S\sim\chi^2(n-p)$, therefore the distribution of $T^2$ is
\[
\frac{(n-p)n}{p}\,T^2 \sim F_{p,n-p}.
\]
We note that under the alternative the corresponding statistic has a non-central $F$ distribution, where the non-centrality parameter depends on the deviation of the means $\mu_k$ from zero.
We now apply the above results to test the hypothesis
\[
H_0:\ \mu_1=\mu_2=\ldots=\mu_p \qquad\text{against}\qquad H_A:\ \text{at least one of the means is different}.
\]
Define the $(p-1)\times p$ matrix
\[
B = \begin{pmatrix}
1 & -1 & 0 & 0 & \cdots & 0\\
0 & 1 & -1 & 0 & \cdots & 0\\
\vdots & & \ddots & \ddots & & \vdots\\
0 & \cdots & 0 & 0 & 1 & -1
\end{pmatrix}. \tag{9.15}
\]
We note that under the null, the $(p-1)$-dimensional vector $B\bar X_n$ satisfies
\[
\sqrt{n}\,B\bar X_n = \sqrt{n}\begin{pmatrix}\bar X_1-\bar X_2\\ \bar X_2-\bar X_3\\ \vdots\\ \bar X_{p-1}-\bar X_p\end{pmatrix} \sim \mathcal{N}(0,B\Sigma B'),
\]
where $\bar X_n=(\bar X_1,\ldots,\bar X_p)'$. Therefore the $T^2$-statistic is
\[
T^2 = \big(B\bar X_n\big)'\big(BSB'\big)^{-1}\big(B\bar X_n\big). \tag{9.16}
\]
Since $B\bar X_n$ is a $(p-1)$-dimensional vector, under the null the distribution of the test statistic is
\[
\frac{(n-(p-1))n}{p-1}\,T^2 \sim F_{p-1,\,n-p+1}.
\]
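A compact sketch of the Hotelling test for equal means, using the difference matrix $B$ of (9.15) and the normalisation used in these notes, is given below; it is illustrative only.

import numpy as np
from scipy import stats

def hotelling_equal_means(Y):
    # Rows of Y are iid observations; test H0: mu_1 = ... = mu_p via B of (9.15).
    n, p = Y.shape
    B = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)      # successive-difference contrasts
    Ybar = Y.mean(axis=0)
    S = (Y - Ybar).T @ (Y - Ybar)
    T2 = (B @ Ybar) @ np.linalg.solve(B @ S @ B.T, B @ Ybar)
    F = (n - (p - 1)) * n / (p - 1) * T2              # scaling used in these notes
    return T2, F, stats.f.sf(F, p - 1, n - p + 1)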
9.3.4 The test statistic for the test for linearity
We now apply the above results to test for linearity of the time series. We want to test
\[
H_0:\ \frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)} = \text{constant} \qquad\text{against}\qquad H_A:\ \frac{|f_3(\omega_1,\omega_2)|^2}{f(\omega_1)f(\omega_2)f(\omega_1+\omega_2)}\ \text{depends on }(\omega_1,\omega_2).
\]
We note that the above is equivalent to
\[
H_0:\ \mu_{1,1}=\mu_{2,1}=\ldots=\mu_{\frac{n}{2mr},\frac{n}{2mr}} \qquad\text{against}\qquad H_A:\ \text{at least one of the means is different},
\]
where $\mu_{s_1,s_2}=E\big(\hat g(\frac{2\pi s_1 r m}{n},\frac{2\pi s_2 r m}{n})\big)$. We recall that we have constructed the vectors $Y_k$, which are approximately iid normally distributed.
We use (9.16) to construct the test statistic. Using the definition of $B$ given in (9.15), we use as the test statistic
\[
T^2 = \big(B\bar Y\big)'\big(BS_Y B'\big)^{-1}\big(B\bar Y\big),
\]
where
\[
\bar Y = \frac{1}{(2r)^2}\sum_{k=1}^{(2r)^2}Y_k \qquad\text{and}\qquad S_Y = \sum_{k=1}^{(2r)^2}(Y_k-\bar Y)(Y_k-\bar Y)'.
\]
Under the null the test statistic has the distribution
\[
\frac{\big((2r)^2-(\tfrac{n}{2mr}-1)\big)(2r)^2}{\tfrac{n}{2mr}-1}\,T^2 \sim F_{\frac{n}{2mr}-1,\ (2r)^2-\frac{n}{2mr}+1}.
\]
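Putting the hypothetical pieces together (the helper functions sketched earlier in this section), the test might be run as follows. The MA(1) series and the values of $m$ and $r$ are illustrative, chosen only so that the number of vectors $(2r)^2$ exceeds their dimension; this sketch is slow and is meant to show the structure, not to be an efficient implementation.

rng = np.random.default_rng(1)
e = rng.standard_normal(513)
x_lin = e[1:] + 0.7 * e[:-1]                 # a linear (MA(1)) series; H0 should not be rejected
Y = linearity_vectors(x_lin, m=48, r=2)      # 16 vectors Y_k, each of dimension 4
T2, F, pval = hotelling_equal_means(Y)
print("test for linearity: F = %.2f, p-value = %.3f" % (F, pval))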
Chapter 10
Mixingales
In this chapter we prove some of the results stated in the previous sections using mixingales. We first define a mixingale, noting that the definition we give is not the most general one.

Definition 10.0.1 (Mixingale) Let $\mathcal{F}_t=\sigma(X_t,X_{t-1},\ldots)$. $\{X_t\}$ is called a mixingale if it satisfies
\[
\rho_{t,k} = \Big\{E\big[E(X_t|\mathcal{F}_{t-k})-E(X_t)\big]^2\Big\}^{1/2} \to 0 \qquad\text{as }k\to\infty.
\]
We note that if $\{X_t\}$ is a stationary process then $\rho_{t,k}=\rho_k$.
Lemma 10.0.1 Suppose $\{X_t\}$ is a mixingale. Then $X_t$ almost surely satisfies the decomposition
\[
X_t - E(X_t) = \sum_{j=0}^{\infty}\big[E(X_t|\mathcal{F}_{t-j}) - E(X_t|\mathcal{F}_{t-j-1})\big]. \tag{10.1}
\]
PROOF. We first note that, by a telescoping argument,
\[
X_t - E(X_t) = \sum_{k=0}^{m}\big[E(X_t|\mathcal{F}_{t-k}) - E(X_t|\mathcal{F}_{t-k-1})\big] + \big[E(X_t|\mathcal{F}_{t-m-1}) - E(X_t)\big].
\]
By the definition of a mixingale, $E\big[E(X_t|\mathcal{F}_{t-m-1})-E(X_t)\big]^2\to 0$ as $m\to\infty$, hence the remainder term in the above expansion becomes negligible as $m\to\infty$ and we have, almost surely,
\[
X_t - E(X_t) = \sum_{k=0}^{\infty}\big[E(X_t|\mathcal{F}_{t-k}) - E(X_t|\mathcal{F}_{t-k-1})\big],
\]
thus giving the required result. $\Box$

We observe that (10.1) resembles the Wold decomposition. The difference is that the Wold decomposition decomposes a stationary process into elements which are the errors of the best linear predictors, whereas the result above decomposes a process into a sum of martingale differences.
It can be shown that functions of several ARCH-type processes are mixingales (with $\rho_{t,k}\le K\rho^{k}$ for some $\rho<1$), and Subba Rao (2006) and Dahlhaus and Subba Rao (2007) used these properties to obtain the rate of convergence for various types of ARCH parameter estimators. In a series of papers, Wei Biao Wu considered properties of a general class of stationary processes which satisfy Definition 10.0.1 with $\sum_{k=1}^{\infty}\rho_k<\infty$.
In Section 10.2 we use the mixingale property to prove Theorem 6.1.3. This is a simple illustration of how useful mixingales can be. In the following section we give a result on the rate of convergence of certain random sums.
10.1 Obtaining almost sure rates of convergence for some sums
The following lemma is a simple variant of a result proved in Móricz (1976), Theorem 6.

Lemma 10.1.1 Let $\{S_t\}$ be a random sequence with $E(\sup_{1\le t\le T}|S_t|^2)\le\phi(T)$, where $\phi(t)$ is a monotonically increasing sequence such that $\phi(2^{j})/\phi(2^{j-1})\le K<\infty$ for all $j$. Then for any $\delta>0$ we have, almost surely,
\[
\frac{1}{T}S_T = O\bigg(\frac{\sqrt{\phi(T)(\log T)(\log\log T)^{1+\delta}}}{T}\bigg).
\]
PROOF. The idea behind the proof is to find a subsequence of the natural numbers and define a random variable on this subsequence which dominates (in some sense) $S_T$. We then obtain a rate of convergence along the subsequence (for the subsequence this is quite easy, using the Borel–Cantelli lemma), which, due to the dominance, can be transferred over to $S_T$. We make this argument precise below.
Define the sequence $V_j=\sup_{t\le 2^{j}}|S_t|$. Using Chebyshev's inequality we have
\[
P(V_j>\epsilon) \le \frac{\phi(2^{j})}{\epsilon^2}.
\]
Let $\psi(t)=\sqrt{\phi(t)(\log\log t)^{1+\delta}\log t}$. It is clear that
\[
\sum_{j=1}^{\infty}P\big(V_j>\psi(2^{j})\big) \le \sum_{j=1}^{\infty}\frac{C\,\phi(2^{j})}{\phi(2^{j})(\log j)^{1+\delta}\,j} < \infty,
\]
where $C$ is a finite constant. Now by Borel–Cantelli this means that, almost surely, $V_j\le\psi(2^{j})$ for all sufficiently large $j$.
Let us now return to the original sequence $S_T$. Suppose $2^{j-1}\le T\le 2^{j}$; then by the definition of $V_j$ we have
\[
\frac{|S_T|}{\psi(T)} \le \frac{V_j}{\psi(2^{j-1})} \overset{a.s.}{\le} \frac{\psi(2^{j})}{\psi(2^{j-1})} < \infty
\]
under the stated assumptions. Therefore, almost surely, $S_T=O(\psi(T))$, which gives the required result. $\Box$
We observe that the above result resembles the law of the iterated logarithm. It gives a very simple and convenient way of obtaining an almost sure rate of convergence. The main problem is obtaining a bound for $E(\sup_{1\le t\le T}|S_t|^2)$. There is one exception to this: when $S_t$ is a sum of martingale differences one can simply apply Doob's inequality, which gives $E(\sup_{1\le t\le T}|S_t|^2)\le 4E(|S_T|^2)$. In the case that $S_T$ is not a sum of martingale differences it is not so straightforward. However, if we can show that $S_T$ is a sum of mixingales, then with some modifications a bound for $E(\sup_{1\le t\le T}|S_t|^2)$ can be obtained. We will use this result in the section below.
10.2 Proof of Theorem 6.1.3
We summarise Theorem 6.1.3 below.

Theorem (Theorem 6.1.3 restated) Let us suppose that $\{X_t\}$ has an ARMA representation where the roots of the characteristic polynomials $\phi(z)$ and $\theta(z)$ are greater than one in absolute value. Then
\[
\text{(i)}\quad \frac{1}{n}\sum_{t=r+1}^{n}\varepsilon_t X_{t-r} = O\bigg(\sqrt{\frac{(\log\log n)^{1+\delta}\log n}{n}}\bigg), \tag{10.2}
\]
\[
\text{(ii)}\quad \frac{1}{n}\sum_{t=\max(i,j)}^{n}\big(X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big) = O\bigg(\sqrt{\frac{(\log\log n)^{1+\delta}\log n}{n}}\bigg), \tag{10.3}
\]
for any $\delta>0$.
By using Lemma 10.1.1, and the fact that $\sum_{t=r+1}^{n}\varepsilon_t X_{t-r}$ is a sum of martingale differences, we prove Theorem 6.1.3(i) below.

PROOF of Theorem 6.1.3(i). We first observe that $\{\varepsilon_t X_{t-r}\}$ is a sequence of martingale differences, hence we can use Doob's inequality to give $E\big(\sup_{r+1\le s\le T}(\sum_{t=r+1}^{s}\varepsilon_t X_{t-r})^2\big)\le 4(T-r)E(\varepsilon_t^2)E(X_t^2)$. Now we can apply Lemma 10.1.1 to obtain the result. $\Box$
We now show that
\[
\frac{1}{T}\sum_{t=\max(i,j)}^{T}\big(X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big) = O\bigg(\sqrt{\frac{(\log\log T)^{1+\delta}\log T}{T}}\bigg).
\]
However, the proof is more complicated, since the $X_{t-i}X_{t-j}$ are not martingale differences and we cannot directly use Doob's inequality. However, by showing that $X_{t-i}X_{t-j}$ is a mixingale we can still prove the result.
To prove the result let $\mathcal{F}_t=\sigma(X_t,X_{t-1},\ldots)$ and $\mathcal{G}_t=\sigma(X_{t-i}X_{t-j},X_{t-1-i}X_{t-1-j},\ldots)$. We observe that $\mathcal{G}_t\subseteq\mathcal{F}_{t-\min(i,j)}$.
Lemma 10.2.1 Let $\mathcal{F}_t=\sigma(X_t,X_{t-1},\ldots)$ and suppose $\{X_t\}$ comes from an ARMA process whose characteristic roots are greater than one in absolute value. Then if $E(\varepsilon_t^4)<\infty$ we have
\[
E\Big[E\big(X_{t-i}X_{t-j}\,\big|\,\mathcal{F}_{t-\min(i,j)-k}\big) - E\big(X_{t-i}X_{t-j}\big)\Big]^2 \le C\rho^{k},
\]
for some finite constant $C$ and $0<\rho<1$.
PROOF. By expanding $X_t$ as an MA($\infty$) process (with coefficients $a_j$, which decay geometrically) we have
\[
E\big(X_{t-i}X_{t-j}\,\big|\,\mathcal{F}_{t-\min(i,j)-k}\big) - E\big(X_{t-i}X_{t-j}\big) = \sum_{j_1,j_2=0}^{\infty}a_{j_1}a_{j_2}\Big[E\big(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}\,\big|\,\mathcal{F}_{t-k-\min(i,j)}\big) - E\big(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}\big)\Big].
\]
Now in the case that $t-i-j_1>t-k-\min(i,j)$ and $t-j-j_2>t-k-\min(i,j)$, we have $E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}|\mathcal{F}_{t-k-\min(i,j)})=E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2})$, so these terms vanish. Now by considering the remaining cases, $t-i-j_1\le t-k-\min(i,j)$ or $t-j-j_2\le t-k-\min(i,j)$ (which force $j_1$ or $j_2$ to be of order $k$), and using the geometric decay of the coefficients $a_j$, we have the result. $\Box$
Lemma 10.2.2 Suppose $\{X_t\}$ comes from an ARMA process as above. Then

(i) The sequence $\{X_{t-i}X_{t-j}\}_t$ satisfies the mixingale property
\[
E\Big[E\big(X_{t-i}X_{t-j}\,\big|\,\mathcal{F}_{t-\min(i,j)-k}\big) - E\big(X_{t-i}X_{t-j}\,\big|\,\mathcal{F}_{t-\min(i,j)-k-1}\big)\Big]^2 \le K\rho^{k}, \tag{10.4}
\]
and almost surely we can write $X_{t-i}X_{t-j}$ as
\[
X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j}) = \sum_{k=0}^{\infty}V_{t,k}, \tag{10.5}
\]
where the $V_{t,k}=E\big(X_{t-i}X_{t-j}\,\big|\,\mathcal{F}_{t-k-\min(i,j)}\big) - E\big(X_{t-i}X_{t-j}\,\big|\,\mathcal{F}_{t-k-\min(i,j)-1}\big)$ are, for each fixed $k$, martingale differences in $t$.

(ii) Furthermore $E(V_{t,k}^2)\le K\rho^{k}$ and
\[
E\bigg(\sup_{\min(i,j)\le s\le n}\Big(\sum_{t=\min(i,j)}^{s}\big(X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big)\Big)^2\bigg) \le Kn, \tag{10.6}
\]
where $K$ is some finite constant.
PROOF. To prove (i) we note that (10.4) follows from Lemma 10.2.1. To prove (10.5) we use the same telescoping argument used to prove Lemma 10.0.1.
To prove (ii) we use the above expansion together with Minkowski's inequality to give
\[
E\bigg(\sup_{\min(i,j)\le s\le n}\Big(\sum_{t=\min(i,j)}^{s}\big(X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big)\Big)^2\bigg)
= E\bigg(\sup_{\min(i,j)\le s\le n}\Big(\sum_{k=0}^{\infty}\sum_{t=\min(i,j)}^{s}V_{t,k}\Big)^2\bigg)
\le \bigg(\sum_{k=0}^{\infty}\Big[E\Big(\sup_{\min(i,j)\le s\le n}\Big(\sum_{t=\min(i,j)}^{s}V_{t,k}\Big)^2\Big)\Big]^{1/2}\bigg)^2. \tag{10.7}
\]
Now we see that, for each fixed $k$, $\{V_{t,k}\}_t$ is a sequence of martingale differences. Hence we can apply Doob's inequality to $E\big(\sup_{\min(i,j)\le s\le n}(\sum_{t=\min(i,j)}^{s}V_{t,k})^2\big)$ and, by using (10.4), we have
\[
E\bigg(\sup_{\min(i,j)\le s\le n}\Big(\sum_{t=\min(i,j)}^{s}V_{t,k}\Big)^2\bigg) \le 4\,E\Big(\sum_{t=\min(i,j)}^{n}V_{t,k}\Big)^2 = 4\sum_{t=\min(i,j)}^{n}E(V_{t,k}^2) \le Kn\rho^{k}.
\]
Therefore, substituting this into (10.7), we have
\[
E\bigg(\sup_{\min(i,j)\le s\le n}\Big(\sum_{t=\min(i,j)}^{s}\big(X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big)\Big)^2\bigg) \le \Big(\sum_{k=0}^{\infty}\big(Kn\rho^{k}\big)^{1/2}\Big)^2 \le K'n,
\]
for some finite constant $K'$, thus giving (10.6). $\Box$
We now use the above to prove Theorem 6.1.3(ii).

PROOF of Theorem 6.1.3(ii). To prove the result we use (10.6) and Lemma 10.1.1. $\Box$
Appendix A
Appendix
A.1 Background: some definitions and inequalities
Some norm definitions.
The norm of an object is a positive number which measures the magnitude of that object. Suppose $x=(x_1,\ldots,x_n)\in\mathbb{R}^n$; then we define $\|x\|_1=\sum_{j=1}^{n}|x_j|$ and $\|x\|_2=(\sum_{j=1}^{n}x_j^2)^{1/2}$ (the Euclidean norm). There are various norms for matrices; the most popular is the spectral norm $\|\cdot\|_{\mathrm{spec}}$: for a matrix $A$, $\|A\|_{\mathrm{spec}}=\lambda_{\max}(AA')^{1/2}$, where $\lambda_{\max}$ denotes the largest eigenvalue.
$\mathbb{Z}$ denotes the set of integers $\{\ldots,-1,0,1,2,\ldots\}$ and $\mathbb{R}$ denotes the real line $(-\infty,\infty)$.
Complex variables.
$i=\sqrt{-1}$ and a complex variable can be written $z=x+iy$, where $x$ and $y$ are real. Often the polar representation of a complex variable is useful: if $z=x+iy$, then it can also be written as $r\exp(i\theta)$, where $r=\sqrt{x^2+y^2}$ and $\theta=\tan^{-1}(y/x)$. If $z=x+iy$, its complex conjugate is $\bar z=x-iy$. The roots of an $r$th order polynomial $a(z)$ are those values $\lambda_1,\ldots,\lambda_r$ for which $a(\lambda_i)=0$, $i=1,\ldots,r$.
The mean value theorem.
This basically states that if the partial derivatives of the function $f(x_1,\ldots,x_n)$ are bounded in the domain $\Omega$, then for $x=(x_1,\ldots,x_n)$ and $y=(y_1,\ldots,y_n)$ in $\Omega$
\[
f(x_1,\ldots,x_n) - f(y_1,\ldots,y_n) = \sum_{i=1}^{n}(x_i-y_i)\frac{\partial f}{\partial x_i}\Big|_{x=x^*},
\]
where $x^*$ lies somewhere between $x$ and $y$.
The Taylor series expansion.
This is closely related to the mean value theorem; a second order expansion is
\[
f(x_1,\ldots,x_n) - f(y_1,\ldots,y_n) = \sum_{i=1}^{n}(x_i-y_i)\frac{\partial f}{\partial x_i}\Big|_{x=y} + \frac{1}{2}\sum_{i,j=1}^{n}(x_i-y_i)(x_j-y_j)\frac{\partial^2 f}{\partial x_i\partial x_j}\Big|_{x=x^*},
\]
where $x^*$ lies between $x$ and $y$.
Partial fractions.
We use the following result mainly for obtaining the MA($\infty$) expansion of an AR process. Suppose that $|g_i|>1$ for $1\le i\le n$. Then if $g(z)=\prod_{i=1}^{n}(1-z/g_i)^{r_i}$, the inverse of $g(z)$ satisfies
\[
\frac{1}{g(z)} = \sum_{i=1}^{n}\sum_{j=1}^{r_i}\frac{g_{i,j}}{(1-z/g_i)^{j}},
\]
where the $g_{i,j}$ are constants determined by the partial fraction expansion. Now we can make a power series expansion of $(1-z/g_i)^{-j}$, which is valid for all $|z|\le 1$.
Dominated and monotone convergence.
We use this repeatedly to exchange infinite sums and expectations. Basically, if $\sum_{j=1}^{\infty}|a_j|E(|Z_j|)<\infty$, then by dominated convergence we have
\[
E\Big(\sum_{j=1}^{\infty}a_j Z_j\Big) = \sum_{j=1}^{\infty}a_j E(Z_j).
\]
The Cauchy–Schwarz inequality.
In terms of sequences it is $|\sum_{j=1}^{\infty}a_j b_j|\le(\sum_{j=1}^{\infty}a_j^2)^{1/2}(\sum_{j=1}^{\infty}b_j^2)^{1/2}$. For integrals and expectations it is $E|XY|\le E(X^2)^{1/2}E(Y^2)^{1/2}$.
Hölder's inequality.
This is a generalisation of the Cauchy–Schwarz inequality. It states that if $1\le p,q\le\infty$ and $p^{-1}+q^{-1}=1$, then $E|XY|\le E(|X|^{p})^{1/p}E(|Y|^{q})^{1/q}$. A similar result is true for sequences too.
Martingale differences. Let $\mathcal{F}_t$ be a sigma-algebra with $X_t,X_{t-1},\ldots\in\mathcal{F}_t$ (that is, $X_s$ is $\mathcal{F}_t$-measurable for $s\le t$). Then $\{X_t\}$ is a sequence of martingale differences if $E(X_t|\mathcal{F}_{t-1})=0$.
Minkowski's inequality.
If $1<p<\infty$, then
\[
\Big(E\Big|\sum_{i=1}^{n}X_i\Big|^{p}\Big)^{1/p} \le \sum_{i=1}^{n}\big(E(|X_i|^{p})\big)^{1/p}.
\]
Doob's inequality.
This inequality concerns sums of martingale differences. Let $S_n=\sum_{t=1}^{n}X_t$, where $\{X_t\}$ are martingale differences; then
\[
E\big(\sup_{n\le N}|S_n|^2\big) \le 4\,E(S_N^2).
\]
Burkhölder's inequality.
Suppose that $\{X_t\}$ are martingale differences and define $S_n=\sum_{k=1}^{n}X_k$. For any $p\ge 2$ we have
\[
\big(E(S_n^{p})\big)^{1/p} \le \Big(2p\sum_{k=1}^{n}\big(E(X_k^{p})\big)^{2/p}\Big)^{1/2}.
\]
An application: in the case that the $X_t$ are identically distributed random variables, we have the bound $E(S_n^{p})\le E(X_0^{p})\,(2p)^{p/2}\,n^{p/2}$.
It is worth noting that the Burkhölder inequality can also be stated for $p<2$ (see Davidson (1994), page 242). It can also be generalised to random variables $X_t$ which are not necessarily martingale differences (see Dedecker and Doukhan (2003)).
Riemann–Stieltjes integrals.
In basic calculus we often use the basic definition of the Riemann integral $\int g(x)f(x)\,dx$, and if the function $F(x)$ is differentiable with $F'(x)=f(x)$, we can write $\int g(x)f(x)\,dx=\int g(x)\,dF(x)$. There are several instances where we need to broaden this definition to include functions $F$ which are not continuous everywhere. To do this we define the Riemann–Stieltjes integral, which coincides with the Riemann integral in the differentiable case.
$\int g(x)\,dF(x)$ is defined in a slightly different way to the Riemann integral $\int g(x)f(x)\,dx$. Let us first consider the case that $F(x)$ is the step function $F(x)=\sum_{i=1}^{n}a_i I_{[x_{i-1},x_i)}(x)$; then $\int g(x)\,dF(x)$ is defined as $\int g(x)\,dF(x)=\sum_{i=1}^{n}(a_i-a_{i-1})g(x_i)$ (with $a_0=0$). Already we see the advantage of this definition, since the derivative of the step function is not well defined at the jumps. As most functions of interest can be written as the limit of step functions ($F(x)=\lim_{k\to\infty}F_k(x)$, where $F_k(x)=\sum_{i=1}^{n_k}a_{i,n_k}I_{[x_{i-1,k},x_{i,k})}(x)$), we define $\int g(x)\,dF(x)=\lim_{k\to\infty}\sum_{i=1}^{n_k}(a_{i,n_k}-a_{i-1,n_k})g(x_{i,k})$.
In statistics the function $F$ will usually be non-decreasing and bounded. We call such functions distributions.
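As a quick worked example (not in the original notes): if $F$ is the distribution of a random variable taking the values $0$ and $1$ with probabilities $0.3$ and $0.7$, then
\[
F(x)=\begin{cases}0, & x<0,\\ 0.3, & 0\le x<1,\\ 1, & x\ge 1,\end{cases}
\qquad\text{and}\qquad \int g(x)\,dF(x) = 0.3\,g(0)+0.7\,g(1) = E\big(g(X)\big),
\]
which is exactly the step-function definition above with jumps $0.3$ at $0$ and $0.7$ at $1$.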
Theorem A.1.1 (Helly's Theorem) Suppose that $\{F_n\}$ is a sequence of distributions with $F_n(-\infty)=0$ and $\sup_n F_n(\infty)\le M<\infty$. Then there exists a distribution $F$ and a subsequence $\{F_{n_k}\}$ such that $F_{n_k}(x)\to F(x)$ for each $x\in\mathbb{R}$ at which $F$ is continuous, and $F$ is right continuous.
Remark A.1.1 Martingales arise all the time. It is useful to know that if the true distribution is used, the gradient of the conditional log-likelihood evaluated at the true parameter is a sum of martingale differences. We show why this is true now. Let $\mathcal{B}_T(\theta)=\sum_{t=2}^{T}\log f_{\theta}(X_t|X_{t-1},\ldots,X_1)$ be the conditional log-likelihood and $\nabla_T(\theta)$ its derivative, where
\[
\nabla_T(\theta) = \sum_{t=2}^{T}\frac{\partial\log f_{\theta}(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}.
\]
We want to show that $\nabla_T(\theta_0)$ is a sum of martingale differences. By definition, if $\nabla_T(\theta_0)$ is a sum of martingale differences then
\[
E\bigg(\frac{\partial\log f_{\theta}(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}\,\bigg|\,X_{t-1},X_{t-2},\ldots,X_1\bigg) = 0,
\]
which we will now show. Rewriting the above in terms of integrals and exchanging derivative with integral we have
\[
E\bigg(\frac{\partial\log f_{\theta}(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}\,\bigg|\,X_{t-1},\ldots,X_1\bigg)
= \int\frac{\partial\log f_{\theta}(x_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)\,dx_t
\]
\[
= \int\frac{1}{f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)}\,\frac{\partial f_{\theta}(x_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)\,dx_t
= \frac{\partial}{\partial\theta}\Big(\int f_{\theta}(x_t|X_{t-1},\ldots,X_1)\,dx_t\Big)\Big|_{\theta=\theta_0} = 0.
\]
Therefore $\big\{\frac{\partial\log f_{\theta}(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\big|_{\theta=\theta_0}\big\}_t$ is a sequence of martingale differences and $\nabla_T(\theta_0)$ is a sum of martingale differences (hence it is a martingale).
Bibliography
Hong-Zhi An, Zhao-Guo Chen, and E.J. Hannan. Autocorrelation, autoregression and autoregressive approximation. Ann. Statist., 10:926-936, 1982.
T. W. Anderson. Statistical Analysis of Time Series. Wiley, 1994.
T. W. Anderson. An Introduction to Multivariate Analysis. Wiley, New Jersey, 2003.
M. Bartlett. Introduction to Stochastic Processes: With Special Reference to Methods and Applications. Cambridge University Press, Cambridge, 1981.
I. Berkes, L. Horváth, and P. Kokoszka. GARCH processes: structure and estimation. Bernoulli, 9:201-227, 2003.
I. Berkes, L. Horváth, P. Kokoszka, and Q. Shao. On discriminating between long range dependence and changes in mean. Ann. Statist., 34:1140-1165, 2006.
R.N. Bhattacharya, V.K. Gupta, and E. Waymire. The Hurst effect under trend. J. Appl. Probab., 20:649-662, 1983.
P.J. Bickel and D.A. Freedman. Some asymptotic theory for the bootstrap. Ann. Statist., pages 1196-1217, 1981.
P. Billingsley. Probability and Measure. Wiley, New York, 1995.
T. Bollerslev. Generalized autoregressive conditional heteroscedasticity. J. Econometrics, 31:301-327, 1986.
G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Cambridge University Press, Oakland, 1970.
D.R. Brillinger. Time Series: Data Analysis and Theory. SIAM Classics, 2001.
P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, New York, 1998.
R. Dahlhaus. Maximum likelihood estimation and model selection for locally stationary processes. J. Nonparametric Statist., 6:171-191, 1996.
R. Dahlhaus. Fitting time series models to nonstationary processes. Ann. Statist., 16:1-37, 1997.
R. Dahlhaus. A likelihood approximation for locally stationary processes. Ann. Statist., 28:1762-1794, 2000.
R. Dahlhaus and D. Janas. A frequency domain bootstrap for ratio statistics in time series analysis. Ann. Statist., 24:1934-1963, 1996.
R. Dahlhaus and S. Subba Rao. A recursive online algorithm for the estimation of time-varying ARCH parameters. Bernoulli, 13:389-422, 2007.
J. Davidson. Stochastic Limit Theory. Oxford University Press, Oxford, 1994.
J. Dedecker and P. Doukhan. A new covariance inequality. Stochastic Processes and their Applications, 106:63-80, 2003.
R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of the United Kingdom inflation. Econometrica, 50:987-1006, 1982.
J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, Berlin, 2003.
J. Franke and W. Härdle. On bootstrapping kernel spectral estimates. Ann. Statist., 20:121-145, 1992.
J. Franke and J.-P. Kreiss. Bootstrapping stationary autoregressive moving average models. J. Time Ser. Anal., pages 297-317, 1992.
W. Fuller. Introduction to Statistical Time Series. Wiley, New York, 1995.
L. Giraitis, P. Kokoszka, and R. Leipus. Stationary ARCH models: dependence structure and central limit theorem. Econometric Theory, 16:3-22, 2000.
C. W. J. Granger and A. P. Andersen. An Introduction to Bilinear Time Series Models. Vandenhoeck and Ruprecht, Göttingen, 1978.
U. Grenander and G. Szegő. Toeplitz Forms and Their Applications. Univ. California Press, Berkeley, 1958.
G.R. Grimmett and D. R. Stirzaker. Probability and Random Processes. Oxford University Press, Oxford, 1994.
P. Hall and C.C. Heyde. Martingale Limit Theory and its Application. Academic Press, New York, 1980.
E. Hannan. Multiple Time Series. Wiley, New York, 1970.
E.J. Hannan and J. Rissanen. Recursive estimation of ARMA order. Biometrika, 69:81-94, 1982.
J.-P. Kreiss. Asymptotical properties of residual bootstrap for autoregression. Technical report, www.math.tu-bs.de/stochastik/kreiss.htm, 1997.
M. Rosenblatt and U. Grenander. Statistical Analysis of Stationary Time Series. Chelsea Publishing Co., 1997.
T. Mikosch and C. Stărică. Is it really long memory we see in financial returns? In P. Embrechts, editor, Extremes and Integrated Risk Management, pages 149-168. Risk Books, London, 2000.
T. Mikosch and C. Stărică. Long-range dependence effects and ARCH modelling. In P. Doukhan, G. Oppenheim, and M.S. Taqqu, editors, Theory and Applications of Long Range Dependence, pages 439-459. Birkhäuser, Boston, 2003.
F. Móricz. Moment inequalities and the strong law of large numbers. Z. Wahrsch. verw. Gebiete, 35:298-314, 1976.
E. Moulines, P. Priouret, and F. Roueff. On recursive estimation for locally stationary time varying autoregressive processes. Ann. Statist., 33:2610-2654, 2005.
D.F. Nicholls and B.G. Quinn. Random Coefficient Autoregressive Models: An Introduction. Springer-Verlag, New York, 1982.
E. Parzen. On consistent estimates of the spectrum of a stationary process. Ann. Math. Statist., 1957.
E. Parzen. On estimation of the probability density function and the mode. Ann. Math. Statist., 1962.
E. Parzen. Stochastic Processes (Classics in Applied Mathematics). Society for Industrial Mathematics, 1999.
D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer, New York, 1999.
M. B. Priestley. Spectral Analysis and Time Series: Volumes I and II. Academic Press, London, 1983.
B.G. Quinn and E.J. Hannan. The Estimation and Tracking of Frequency. Cambridge University Press, 2001.
R. Shumway and D. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, New York, 2006.
D. S. Stoffer and K. D. Wall. Bootstrapping state space models: Gaussian maximum likelihood estimation. Journal of the American Statistical Association, 86:1024-1033, 1991.
D. S. Stoffer and K. D. Wall. Resampling in State Space Models. Cambridge University Press, 2004.
W.F. Stout. Almost Sure Convergence. Academic Press, New York, 1974.
D. Straumann. Estimation in Conditionally Heteroscedastic Time Series Models. Springer, Berlin, 2005.
S. Subba Rao. A note on uniform convergence of an ARCH(infinity) estimator. Sankhya, pages 600-620, 2006.
T. Subba Rao. On the estimation of bilinear time series models. In Bull. Inst. Internat. Statist. (paper presented at the 41st session of the ISI, New Delhi, India), volume 41, 1977.
T. Subba Rao and M. M. Gabr. A test for linearity of a stationary time series. J. of Time Series Analysis, 1:145-158, 1980.
T. Subba Rao and M. M. Gabr. An Introduction to Bispectral Analysis and Bilinear Time Series Models. Lecture Notes in Statistics (24). Springer, New York, 1984.
Gy. Terdik. Bilinear Stochastic Models and Related Problems of Nonlinear Time Series Analysis: A Frequency Domain Approach, volume 142 of Lecture Notes in Statistics. Springer Verlag, New York, 1999.
J. W. Van Ness. Asymptotic properties of the bi-spectra. Ann. Math. Statist., 37:1257-1272, 1966.
P. Whittle. Gaussian estimation in stationary time series. Bulletin of the International Statistical Institute, 39:105-129, 1962.