Documente Academic
Documente Profesional
Documente Cultură
Paul Sderlind1
3 January 2012
1 University
Contents
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4
4
11
13
15
.
.
.
.
.
.
.
22
22
35
43
50
52
58
64
.
.
.
.
.
.
.
.
.
72
72
74
90
94
99
100
104
105
108
117
117
128
132
137
139
140
141
147
154
Factor Models
5.1 CAPM Tests: Overview . . . . . . . . . . . . . . . . . . .
5.2 Testing CAPM: Traditional LS Approach . . . . . . . . .
5.3 Testing CAPM: GMM . . . . . . . . . . . . . . . . . . .
5.4 Testing Multi-Factor Models (Factors are Excess Returns)
5.5 Testing Multi-Factor Models (General Factors) . . . . . .
5.6 Linear SDF Models . . . . . . . . . . . . . . . . . . . . .
5.7 Conditional Factor Models . . . . . . . . . . . . . . . . .
5.8 Conditional Models with Regimes . . . . . . . . . . . .
5.9 Fama-MacBeth . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
162
162
162
167
176
180
193
197
198
201
204
.
.
.
.
.
214
214
218
223
229
232
236
236
238
242
245
245
247
248
253
264
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
288
288
288
289
296
297
298
303
1.1
1.1.1
GMM
The Basic GMM
In general, the q
(1.1)
where g./
N
is short hand notation for the sample average and where the value of the
moment conditions clearly depend on the parameter vector. We let 0 denote the true
value of the k 1 parameter vector. The GMM estimator is
Ok
(1.2)
q weighting matrix.
Example 1.1 (Moment condition for a mean) To estimated the mean of x t , use the following moment condition
1 XT
xt
D 0:
t D1
T
Example 1.2 (Moments conditions for IV/2SLS/OLS) Consider the linear model y t D
x t0 0 C u t , where x t and are k 1 vectors. Let z t be a q 1 vector, with q k. The
sample moment conditions are
T
1X
gN ./ D
z t .y t
T t D1
x t0 / D 0q
Example 1.3 (Moments conditions for MLE) The maximum likelihood estimator maximizes the log likelihood function, tTD1 ln L .w t I / =T , with the K first order conditions
(one for each element in )
T
1 X @ ln L .w t I /
gN ./ D
D 0K
T t D1
@
hp
1
i
X
T g.
N 0/ D
Cov g t .0 /; g t s .0 / ;
(1.3)
sD 1
6 :
@g.
N 0/ 6
6 ::
D
6 :
6 ::
@ 0
4
@gN q . /
@1
@gN 1 . /
@k
::
:
::
:
@gN q . /
@k
7
7
7
7 (evaluated at 0 ).
7
5
T .O
0 / ! N.0; V / if W D S0 1 , where
V D D00 S0 1 D0
(1.5)
which assumes that we have used S0 1 as the weighting matrix. This gives the most efficient GMM estimatorfor a given set of moment conditions. The choice of the weighting
5
matrix is irrelevant if the model is exactly identified (as many moment conditions as parameters), so (1.5) can be applied to this case (even if we did not specify any weighting
matrix at all). In practice, the gradient D0 is approximated by using the point estimates
and the available sample of data. The Newey-West estimator is commonly used to estimate the covariance matrix S0 . To implement W D S0 1 , an iterative procedure is often
used: start with W D 1, estimate the parameters, estimate SO0 , then (in a second step) use
W D SO0 1 and reestimate. In most cases this iteration is stopped at this stage, but other
researchers choose to continue iteratingt until the point estimates converge.
Example 1.5 (Estimating a mean) For the moment condition in Example 1.1, assuming
iid data gives
S0 D Var.x t / D 2 :
In addition,
D0 D
@g.
N
@
0/
1;
which in this case is just a constant (and does not need to be evaluated at true parameter).
Combining gives
p
T.O
0/
! N.0;
/, so O
0;
=T /:
x t0
"p
D0 D plim
N.
T
T X
zt ut
T t D1
T
1X
z t x t0
T tD1
!
D
zx :
T
1X
x t x t0 D
T t D1
xx ;
so combining gives
h
V D xx
xx
xx
xx1 :
To test if the moment conditions are satisfied, we notice that under the hull hypothesis
(that the model is correctly specified)
p
(1.6)
T gN .0 / ! N 0q 1 ; S0 ;
where q is the number of moment conditions. Since O chosen is such a way that k (number
of parameters) linear combinations of the first order conditions always (in every sample)
are zero, we get that there are effectively only q k non-degenerate random variables.
We can therefore test the hypothesis that gN .0 / D 0 on the the J test
O 0 S 1 g.
O d
T g.
N /
0 N / !
2
q k;
if W D S0 1 :
(1.7)
The left hand side equals T times of value of the loss function in (1.2) evaluated at the
point estimates With no overidentifying restrictions (as many moment conditions as parameters) there are, of course, no restrictions to test. Indeed, the loss function value is
then always zero at the point estimates.
1.1.2
It can be shown that if we use another weighting matrix than W D S0 1 , then the variancecovariance matrix in (1.5) should be changed to
V2 D D00 WD0
(1.8)
(1.9)
2
q k;
D0 D00 WD0
i h
D00 W S0 Iq
D0 D00 WD0
D00 W
i0
(1.10)
Remark 1.7 (Quadratic form with degenerate covariance matrix) If the n 1 vector
2
C
X
N.0; /, where has rank r n then Y D X 0 C X
is the
r where
pseudo inverse of .
Suppose we sidestep the whole optimization issue and instead specify k linear combinations (as many as there are parameters) of the q moment conditions directly
0k
O ;
D
A g.
N /
k q
(1.11)
q 1
D0 .A0 D0 /
D0 .A0 D0 /
A0 S0 Iq
A0 0 :
(1.13)
Suppose x t has a zero mean. To estimate the mean we specify the moment condition
g t D x t2
(1.14)
To derive the asymptotic distribution, we take look at the simple case when x t is iid
N.0; 2 / This gives S0 D Var.g t /, because of the iid assumption. We can simplify this
further as
S0 D E.x t2
D E.x t4 C
D2
2 2
2x t2
/ D E x t4
(1.15)
where the second line is just algebra and the third line follows from the properties of
normally distributed variables (E x t4 D 3 4 ).
Note that the Jacobian is
D0 D 1;
(1.16)
so the GMM formula says
p
1.1.5
T .O 2
/ ! N.0; 2
/:
(1.17)
Let R t be a vector of net returns of N assets. We want to estimate the mean vector and
the covariance matrix. The moment conditions for the mean vector are
E Rt
D 0N
1;
(1.18)
and the moment conditions for the unique elements of the second moment matrix are
E vech.R t R0t /
(1.19)
Remark 1.9 (The vech operator) vech(A) where A is m m gives an m.m C 1/=2 1
vector with the elements on and below the principal diagonal
A3stacked on top of each
2
#
"
a11
a11 a12
6
7
other (column wise). For instance, vech
D 4 a21 5.
a21 a22
a22
Stack (1.18) and (1.19) and substitute the sample mean for the population expectation
to get the GMM estimator
"
# "
# "
#
T
Rt
O
0N 1
1X
D
:
(1.20)
T t D1 vech.R t R0t /
vech. O /
0N.N C1/=2 1
In this case, D0 D I , so the covariance matrix of the parameter vector ( O ; vech. O /) is
just S0 (defined in (1.3)), which is straightforward to estimate.
1.1.6
y t D F .x t I 0 / C " t ;
(1.22)
F .x t I /2 :
To express this as a GMM problem, use the first order conditions for (1.22) as moment
conditions
1 PT @F .x t I /
y t F .x t I / :
gN ./ D
(1.23)
T t D1
@
The model is then exactly identified so the point estimates are found by setting all moment
conditions to zero, gN ./ D 0k 1 . The distribution of the parameter estimates is thus as in
p
(1.5). As usual, S0 D Cov T gN .0 /, while the Jacobian is
@g.
N 0/
@ 0
1 PT @F .x t I / @F .x t I /
D plim
T t D1
@
@ 0
D0 D plim
plim
1 PT
y t
T t D1
F .x t I /
@2 F .x t I /
:
@@ 0
(1.24)
10
1.2
1.2.1
MLE
The Basic MLE
Let L be the likelihood function of a sample, defined as the joint density of the sample
L D pdf.x1 ; x2 ; : : : xT I /
(1.25)
(1.26)
D L1 L2 : : : LT ;
where are the parameters of the density function. In the second line, we define the likelihood function as the product of the likelihood contributions of the different observations.
For notational convenience, their dependce of the data and the parameters are suppressed.
The idea of MLE is to pick parameters to make the likelihood (or its log) value as
large as possible
O D arg max ln L:
(1.27)
MLE is typically asymptotically normally distributed
p
N .O
with
(1.28)
@2 ln L
=T or
@@ 0
@2 ln L t
;
E
@@ 0
E
where I./ is the information matrix. In the second line, the derivative is of the whole
loglikelihood function (1.25), while in the third line the derivative is of the likelihood
contribution of observation t.
Alternatively, we can use the outer product of the gradients to calculate the information matrix as
@ ln L t @ ln L t
J./ D E
:
(1.29)
@
@ 0
A key strength of MLE is that it is asymptotically efficient, that is, any linear combination of the parameters will have a smaller asymptotic variance than if we had used any
other estimation method.
11
1.2.2
QMLE
A MLE based on the wrong likelihood function (distribution) may still be useful. Suppose
we use the likelihood function L, so the estimator is defined by
@ ln L
D 0:
@
(1.30)
If this is the wrong likelihood function, but the expected value (under the true distribution)
of @ ln L=@ is indeed zero (at the true parameter values), then we can think of (1.30) as
a set of GMM moment conditionsand the usual GMM results apply. The result is that
this quasi-MLE (or pseudo-MLE) has the same sort of distribution as in (1.28), but with
the variance-covariance matrix
V D I./ 1 J./I./
(1.31)
Example 1.11 (LS and QMLE) In a linear regression, y t D x t0 C " t , the first order
T
O t D 0.
condition for MLE based on the assumption that " t N.0; 2 / is tD1
.y t x t0 /x
This has an expected value of zero (at the true parameters), even if the shocks have a, say,
t22 distribution.
1.2.3
/. The pdf of x t is
pdf .x t / D p
1
2
exp
1 x t2
:
2 2
(1.32)
1 PT x t2
2 T =2
/
exp
, so
2 t D1 2
T
ln.2
2
1 PT
2
2
t D1 x t :
(1.33)
(1.34)
12
T 1
1 PT
2
C
x 2 D 0 so
22 2
2. 2 /2 t D1 t
P
O 2 D TtD1 x t2 =T:
(1.35)
1 PT
x 2 , so
. 2 /3 t D1 t
T
T
2
D
. 2 /3
2 4
(1.36)
(1.37)
1.3
T .O 2
@2 ln L
1
=T D
;
2
2
@ @
2 4
2
/ !d N.0; 2
/:
(1.38)
(1.39)
Many estimators (including GMM) are based on some sort of sample average. Unless we
are sure that the series in the average is iid, we need an estimator of the variance (of the
sample average) that takes serial correlation into account. The Newey-West estimator is
probably the most popular.
Example 1.12 (Variance of sample average) The variance of .x1 C x2 /=2 is Var.x1 /=4 C
Var.x2 /=4 C Cov.x1 ; x2 /=2. If Var.xi / D 2 for all i, then this is 2 =2 C Cov.x1 ; x2 /=2.
If there is no autocorrelation, then we have the traditional result, Var.x/
N D 2 =T .
Example 1.13 (x t is a scalar iid process.) When x t is a scalar iid process, then
1 PT
1 PT
x
D
Var .x t / (since independently distributed)
Var
t
T t D1
T 2 t D1
1
D 2 T Var .x t / (since identically distributed)
T
1
D Var .x t / :
T
13
Std( T sample mean), AR(1)
100
10
50
5
0
1
0.5
0
0.5
AR(1) coefficient
0
1
0.5
0
0.5
AR(1) coefficient
Cov
n
X
T gN D
1
jsj
(1.40)
Cov .g t ; g t s /
nC1
sD n
X
s
D Cov .g t ; g t / C
1
Cov .g t ; g t s / C Cov .g t ; g t s /0
nC1
sD1
(1.41)
p
1
T gN D Cov .g t ; g t / C
Cov .g t ; g t 1 / C Cov .g t ; g t 1 /0 :
Cov
2
14
so
Var
X
T xN D
R.s/
sD 1
1
X
sD 1
2
jsj
1C
;
2 1
r rt
s r s
s rsrs
1.4
Consider an estimator Ok
which satisfies
p
T .O
0 / ! N .0; Vk
k/ ;
(1.42)
D f ./ ;
(1.43)
15
O
T f ./
f .0 / ! N 0;
q q
; where
2
@f .0 / @f .0 /0
@f ./ 6
V
, where
D6
0
4
@
@
@ 0
@f1 . /
@1
::
:
::
@f1 . /
@k
@fq . /
@1
::
:
(1.44)
7
7
5
@fq . /
@k
q k
Example 1.18 (Testing a Sharpe ratio) Stack the mean ( D E x t ) and second moment
( 2 D E x t2 ) as D ; 2 0 . The Sharpe ratio is calculated as a function of
E.x/
D f ./ D
.x/
.
@f ./ h
,
so
D
2 /1=2
@ 0
2
2 /3=2
i
2.
2 /3=2
6
6
4
::
:
@fq . /
@j
Q
7 f ./
7D
5
f ./
A MatLab code:
rsq
r
r s rtrs
t
t
16
MeanStd frontier
Mean, %
20
15
HD
A C
GJ
I F
10
A
B
C
D
E
F
G
H
I
J
BE
5
0
10
15
Std, %
20
25
12.40
12.52
12.22
14.07
13.23
10.22
12.00
13.09
10.74
11.42
Std
14.23
20.69
16.73
18.00
21.73
14.96
16.98
17.11
13.32
17.52
Mean, %
20
15
0.72
0.56
1.82
10
5
0
10
15
Std, %
20
25
p/
where D ; vech. /
17
that calculates the minimum standard deviation at a given expected return, p . For this,
you may find the duplication matrix (see remark) useful. Second, evaluate it, as well as
the Jacobian, at the point estimates. Third, combine with the variance-covariance matrix
of O ; vech. O / to calculate the variance of the output (the minimum standard deviation).
Repeat this for other values of the expected returns, p .
Remark 1.21 (Duplication matrix) The duplication matrix Dm is defined such that for
any symmetric m m matrix A we have vec.A/ D Dm vech.A/. For instance,
2
3
2
3
3
1 0 0 2
a11
6
7 a
6
7
6 0 1 0 7 6 11 7 6 a21 7
6
7 4 a21 5 D 6
7
6 0 1 0 7
6 a 7 or D2 vech.A/ D vec.A/:
4
5
4 21 5
a22
0 0 1
a22
The duplication matrix is therefore useful for inverting the vech operatorthe transformation from vec.A/ is trivial.
Remark 1.22 (MatLab coding) The command reshape(x,m,n) creates an m n matrix by
putting the first m elements of x in column 1, the next m elements in column 2, etc.
1.4.2
Delta Method Example 2: Testing the 1=N vs. the Tangency Portfolio
18
Bibliography
Cochrane, J. H., 2005, Asset pricing, Princeton University Press, Princeton, New Jersey,
revised edn.
DeMiguel, V., L. Garlappi, and R. Uppal, 2009, Optimal Versus Naive Diversification:
How Inefficient is the 1/N Portfolio Strategy?, Review of Financial Studies, 22, 1915
1953.
Singleton, K. J., 2006, Empirical dynamic asset pricing, Princeton University Press.
19
Critical values
10%
5%
1%
10
1:81 2:23 3:17
20
1:72 2:09 2:85
30
1:70 2:04 2:75
40
1:68 2:02 2:70
50
1:68 2:01 2:68
60
1:67 2:00 2:66
70
1:67 1:99 2:65
80
1:66 1:99 2:64
90
1:66 1:99 2:63
100
1:66 1:98 2:63
Normal 1:64 1:96 2:58
n
Table 1.1: Critical values (two-sided test) of t distribution (different degrees of freedom)
and normal distribution.
20
Critical values
10%
5%
1%
1
2:71
3:84
6:63
2
4:61
5:99
9:21
3
6:25
7:81 11:34
4
7:78
9:49 13:28
5
9:24 11:07 15:09
6 10:64 12:59 16:81
7 12:02 14:07 18:48
8 13:36 15:51 20:09
9 14:68 16:92 21:67
10 15:99 18:31 23:21
n
Table 1.2: Critical values of chisquare distribution (different degrees of freedom, n).
21
Return Distributions
2.1
Reference: Harvey (1989) 260, Davidson and MacKinnon (1993) 267, Silverman (1986);
Mittelhammer (1996), DeGroot (1986)
2.1.1
The cdf (cumulative distribution function) measures the probability that the random variable Xi is below or at some numerical value xi ,
ui D Fi .xi / D Pr.Xi xi /:
(2.1)
For instance, with an N.0; 1/ distribution, F . 1:64/ D 0:05. Clearly, the cdf values
are between (and including) 0 and 1. The distribution of Xi is often called the marginal
distribution of Xi to distinguish it from the joint distribution of Xi and Xj . (See below
for more information on joint distributions.)
The pdf (probability density function) fi .xi / is the height of the distribution in the
sense that the cdf F .xi / is the integral of the pdf from minus infinity to xi
Z xi
Fi .xi / D
fi .s/ds:
(2.2)
sD 1
(Conversely, the pdf is the derivative of the cdf, fi .xi / D @Fi .xi /=@xi .) The Gaussian
pdf (the normal distribution) is bell shaped.
Remark 2.1 (Quantile of a distribution) The quantile of a distribution ( ) is the value
of x such that there is a probability of of a lower value. We can solve for the quantile by
inverting the cdf, D F . / as D F 1 ./. For instance, the 5% quantile of a N.0; 1/
distribution is 1:64 D 1 .0:05/, where 1 ./ denotes the inverse of an N.0; 1/ cdf.
See Figure 2.1 for an illustration.
22
Density of N(8,162)
Density of N(0,1)
0.4
5% quantile is c = 1.64
2
pdf
0.3
5% quantile is + c* = 18
0.2
1
0.1
0
40
0
R
40
cdf of N(8,162)
0.5
cdf
40
0
40
0
40
0
R
40
0.2
0.4
0.6
0.8
cdf
) distribution
QQ Plots
Are returns normally distributed? Mostly not, but it depends on the asset type and on the
data frequency. Options returns typically have very non-normal distributions (in particular, since the return is 100% on many expiration days). Stock returns are typically
distinctly non-linear at short horizons, but can look somewhat normal at longer horizons.
To assess the normality of returns, the usual econometric techniques (BeraJarque
and Kolmogorov-Smirnov tests) are useful, but a visual inspection of the histogram and a
QQ-plot also give useful clues. See Figures 2.22.4 for illustrations.
Remark 2.2 (Reading a QQ plot) A QQ plot is a way to assess if the empirical distribution conforms reasonably well to a prespecified theoretical distribution, for instance,
a normal distribution where the mean and variance have been estimated from the data.
Each point in the QQ plot shows a specific percentile (quantile) according to the empiri23
cal as well as according to the theoretical distribution. For instance, if the 2th percentile
(0.02 percentile) is at -10 in the empirical distribution, but at only -3 in the theoretical
distribution, then this indicates that the two distributions have fairly different left tails.
There is one caveat to this way of studying data: it only provides evidence on the
unconditional distribution. For instance, nothing rules out the possibility that we could
estimate a model for time-varying volatility (for instance, a GARCH model) of the returns
and thus generate a description for how the VaR changes over time. However, data with
time varying volatility will typically not have an unconditional normal distribution.
Daily returns, full
Number of days
8000
6000
4000
2000
0
20
10
0
Daily excess return, %
10
20
15
10
5
0
20
10
0
Daily excess return, %
10
Number of days
8000
6000
4000
2000
0
3
1
0
1
2
Daily excess return, %
24
Empirical quantiles
4
Daily S&P 500 returns, 1957:12011:5
6
6
2
0
2
Quantiles from estimated N(,2), %
The skewness, kurtosis and Bera-Jarque test for normality are useful diagnostic tools.
They are
Test statistic
P
skewness
D T1 TtD1 xt
P
kurtosis
D T1 TtD1 xt
Bera-Jarque D T6 skewness2 C
Distribution
N .0; 6=T /
N .3; 24=T /
3
4
T
24
.kurtosis
3/2
(2.3)
2
2:
This is implemented by using the estimated mean and standard deviation. The distributions stated on the right hand side of (2.3) are under the null hypothesis that x t is iid
N ; 2 . The excess kurtosis is defined as the kurtosis minus 3.
The intuition for the 22 distribution of the Bera-Jarque test is that both the skewness
and kurtosis are, if properly scaled, N.0; 1/ variables. It can also be shown that they,
under the null hypothesis, are uncorrelated. The Bera-Jarque test statistic is therefore a
25
5
0
5
6
4 2
0
2
4
Quantiles from N(,2), %
Empirical quantiles
10
10
5
0
5
10
10
5
0
5
Quantiles from N(,2), %
10
0
10
20
20
10
0
10
2
Quantiles from N(, ), %
26
0.2
cdf of N(4,1)
EDF
0
3
3.5
4.5
x
5.5
F .x t /j :
(2.6)
27
KolmogorovSmirnov test
The KS test statistic is
the length of the longest arrow
1
0.8
0.6
0.4
data
0.2
cdf of N(4,1)
EDF
0
3
3.5
4.5
x
5.5
T !1
T DT c D 1
1
X
. 1/i
2i 2 c 2
(2.7)
i D1
It can be approximated by replacing 1 with a large number (for instance, 100). For
instance, c D 1:35 provides a 5% critical value. See Figure 2.7. There is a corresponding
test for comparing two empirical cdfs.
Pearsons 2 test does the same thing as the K-S test but for a discrete distribution.
Suppose you have K categories with Ni values in category i. The theoretical distribution
28
Cdf of KS test
Quantiles of KS distribution
2
1.5
0.5
cdf
1
0
0.5
1.5
0.5
0.5
cdf
1.8
1.6
1.4
1.2
0.85
0.9
0.95
Critical value
1.138
1.224
1.358
1.480
1.628
cdf
Tpi /2
Tpi
PK
i D1
T DT
pi D 1. Then
(2.8)
2
K 1:
i;
1;
2
i /
2;
2
1;
2
2;
/ D .1
/ .x t I
1;
2
1/
.x t I
i
2;
2
2 /;
and variance
(2.9)
2
i .
It
29
Distribution of of S&P500,1957:12011:5
0.6
normal pdf
0.5
0.4
0.3
0.2
0.1
0
5
1
0
1
Monthly excess return, %
where f ./ is the pdf in (2.9) A numerical optimization method could be used to maximize
this likelihood function. However, this is tricky so an alternative approach is often used.
This is an iterative approach in three steps:
(1) Guess values of 1 ; 2 ; 12 ; 22 and . For instance, pick 1 D x1 , 2 D x2 , 12 D
2
D 0:5.
2 D Var.x t / and
(2) Calculate
t
.1
.x t I 2 ;
/ .x t I 1 ; 12 / C
2
2/
.x t I
2;
2
2/
for t D 1; : : : ; T:
30
Distribution of S&P500,1957:12011:5
0.6
mean
std
weight
0.5
pdf 1
0.03
0.66
0.85
pdf 2
0.04
2.01
0.15
Mixture pdf 1
Mixture pdf 2
Total pdf
0.4
0.3
0.2
0.1
0
5
1
0
1
Monthly excess return, %
XT
t D1
tD1
2
1/
t =T .
Iterate over (2) and (3) until the parameter values converge. (This is an example of the
EM algorithm.) Notice that the calculation of i2 uses i from the same (not the previous)
iteration.
2.1.6
Empirical quantiles
4
Daily S&P 500 returns, 1957:12011:5
6
6
4
2
0
2
4
Quantiles from estimated mixture normal, %
32
xj h=2/:
(2.11)
In fact, that h1 .jx t xj h=2/ is the pdf value of a uniformly distributed variable
(over the interval x h=2 to x C h=2). This shows that our estimate of the pdf (here:
the histogram) can be thought of as a average of hypothetical pdf values of the data in
the neighbourhood of x. However, we can gain efficiency and get a smoother (across x
values) estimate by using another density function that the uniform. In particular, using a
density function that tapers off continuously instead of suddenly dropping to zero (as the
uniform density does) improves the properties. In fact, the N.0; h2 / is often used. The
kernel density estimator of the pdf at some point x is then
1 x t x 2
1 XT
1
O
f .x/ D
:
(2.12)
p exp
t D1 h 2
T
2
h
Notice that the function in the summation is the density function of a N.x; h2 / distribution.
The value h D 1:06 Std.x t /T 1=5 is sometimes recommended, since it can be shown
to be the optimal choice (in MSE sense) if data is normally distributed and the gaussian
kernel is used. The bandwidth h could also be chosen by a leave-one-out cross-validation
technique.
See Figure 2.12 for an example and Figure 2.13 for a QQ plot which is a good way to
visualize the difference between the empirical and a given theoretical distribution.
It can be shown that (with iid data and a Gaussian kernel) the asymptotic distribution
is
p
1
d
O
O
T hf .x/ E f .x/ ! N 0; p f .x/ ;
(2.13)
2
The easiest way to handle a bounded support of x is to transform the variable into one
with an unbounded support, estimate the pdf for this variable, and then use the change
of variable technique to transform to the pdf of the original variable.
We can also estimate multivariate pdfs. Let x t be a d 1 matrix and O be the estimated
covariance matrix of x t . We can then estimate the pdf at a point x by using a multivariate
33
Effective weights for histogram and kernel density estimation (at x=4)
2
Data
Histogram weights
Kernel density weights
1.5
0.5
0
3
3.5
4.5
x
5.5
XT
1
1
1
fO .x/ D
.x t
exp
t D1 .2 /d=2 jH 2 j
O 1=2
T
2
O 1 .x t
x/0 .H 2 /
x/ :
(2.14)
Notice that the function in the summation is the (multivariate) density function of a
O distribution. The value H D 1:06T 1=.d C4/ is sometimes recommended.
N.x; H 2 /
Remark 2.6 ((2.14) with d D 1) With just one variable, (2.14) becomes
"
2 #
XT
1
1
1
x
x
t
fO .x/ D
;
p exp
t D1 H Std.x / 2
T
2 H Std.x t /
t
which is the same as (2.12) if h D H Std.x t /.
2.1.7
0.2
0.2
0.15
0.15
0.1
0.1
0.05
0.05
10
yt
15
20
Small h
Large h
10
yt
15
20
2.2
Reference: Breeden and Litzenberger (1978); Cox and Ross (1976), Taylor (2005) 16,
Jackwerth (2000), Sderlind and Svensson (1997a) and Sderlind (2000)
2.2.1
A European call option price with strike price X has the price
C D E M max .0; S
X/ ;
(2.15)
where M is the nominal discount factor and S is the price of the underlying asset at the
expiration date of the option k periods from now.
We have seen that the price of a derivative is a discounted risk-neutral expectation of
35
Empirical quantiles
15
10
0
Daily federal funds rates, 1954:72011:5
5
5
5
10
15
2
Quantiles from estimated N(, ), %
20
(2.16)
X/ ;
C .X D 109/ D 0:5
89/ C 0:4.100
0 C 0:4.100
0 C 0:4
89/ C 0:1.110
99/ C 0:1.110
0 C 0:1.110
89/ D 7
99/ D 1: 5
109/ D 0:1:
Clearly, with information on the option prices, we could in this case back out what the
36
1300
1250
1200
Long MA ()
Long MA (+)
Short MA
1150
Jan
Feb
Mar
Apr
1999
.S
X/ h .S / dS;
(2.17)
where i is the per period (annualized) interest rate so exp. ik/ D Bk and h .S / is the
(univariate) risk-neutral probability density function of the underlying price (not its log).
Differentiating (2.17) with respect to the strike price and rearranging gives the risk-neutral
distribution function
@C .X /
Pr .S X/ D 1 C exp.ik/
:
(2.18)
@X
Proof. Differentiating the call price with respect to the strike price gives
Z 1
@C
D exp . ik/
h .S / dS D exp . ik/ Pr .S > X/ :
@X
X
Use Pr .S > X/ D 1 Pr .S X/.
Differentiating once more gives the risk-neutral probability density function of S at
37
0.6
Std
1.16
0.6
0.4
0.4
0.2
0.2
0
3
0
Return
0
3
Mean
0.06
Std
1.72
0
Return
Mean
0.04
Std
0.93
0.4
0.4
0.2
0.2
0
3
0
Return
0
3
Mean
0.01
Std
0.90
0
Return
@2 C.X/
:
(2.19)
@X 2
Figure 2.16 shows some data and results for German bond options on one trading date.
(A change of variable approach is used to show the distribution of the log assset price.)
A difference quotient approximation of the derivative in (2.18)
@C
1 C .Xi C1 / C .Xi / C .Xi / C .Xi 1 /
C
(2.20)
@X
2
Xi C1 Xi
Xi Xi 1
pdf .X/ D exp.ik/
gives the approximate distribution function. The approximate probability density function, obtained by a second-order difference quotient
Approximate cdf
Approximate pdf
20
15
10
0.5
5
0
0
4.54
4.56
4.58
4.6
4.62
Log strike price
4.64
5
4.54
4.56
4.58
4.6
4.62
Log strike price
4.64
Figure 2.16: Bund options 6 April 1994. Options expiring in June 1994.
is also shown. The approximate distribution function is decreasing in some intervals,
and the approximate density function has some negative values and is very jagged. This
could possibly be explained by some aberrations of the option prices, but more likely
by the approximation of the derivatives: changing approximation method (for instance,
from centred to forward difference quotient) can have a strong effect on the results, but
all methods seem to generate strange results in some interval. This suggests that it might
be important to estimate an explicit distribution. That is, to impose enough restrictions on
the results to guarantee that they are well behaved.
2.2.2
Mixture of Normals
39
N
mix N
0.09
10
0.08
0.07
0.06
5.5
6
6.5
7
7.5
Strike price (yield to maturity, %)
23Feb1994
03Mar1994
10
0
5.5
6
6.5
7
Yield to maturity, %
7.5
5
0
5.5
6
6.5
7
Yield to maturity, %
7.5
Figure 2.17: Bund options 23 February and 3 March 1994. Options expiring in June 1994.
P
with jnD1 .j / D 1 and .j / 0. One interpretation of mixing normal distributions is
that they represent different macro economic states, where the weight is interpreted as
the probability of state j .
/
Let .:/ be the standardized (univariate) normal distribution function. If .j
m D m
.j /
and mm
D mm in (2.22), then the marginal distribution of the log SDF is gaussian
(while that of the underlying asset price is not). In this case the European call option price
(2.15) has a closed form solution in terms of the spot interest rate, strike price, and the
40
n
X
.j / 6
.j /
C D exp. ik/
4exp
s C
0
.j /
ms
j D1
0
B
X @
1
2
.j /
ss
B
@
1
.j /
s
.j /
ms
C
q
.j /
ss
.j /
ss
ln X C
A
13
.j /
s
C
q
.j /
ms
.j /
ss
ln X C7
A5 :
(2.23)
(For a proof, see Sderlind and Svensson (1997b).) Notice that this is like using the
/
.j /
.j /
physical distribution, but with .j
s C ms instead of s .
Notice also that this is a weighted average of the option price that would hold in each
state
n
X
C D
.j / C .j / :
(2.24)
j D1
.j / in (2.23) is replaced by Q .j /
D .j / exp.m
N .j / C
.j /
.j /
In this case, Q , not , will be estimated from option
.j /
mm =2/.
41
ss =2/,
ln F=X C
C D exp. ik/F
p
ss =2
exp. ik/X
ss
ln F=X
p
ss =2
; (2.26)
ss
Remark 2.10 (Robust measures of the standard deviation and skewness) Let P be the
th quantile (for instance, quantile 0.1) of a distribution. A simple robust measure of the
standard deviation is just the difference between two symmetric quantile,
Std D P1
P ;
where it is assumed that < 0:5. Sometimes this measure is scaled so it would give the
right answer for a normal distribution. For instance, with D 0:1, the measure would be
divided by 2.56 and for D 0:25 by 1.35.
One of the classical robust skewness measures was suggested by Hinkley
Skew D
.P1
P0:5 /
P1
.P0:5
P
P /
This skewness measure can only take on values between 1 (when P1 D P0:5 ) and
1 (when P D P0:5 ). When the median is just between the two percentiles (P0:5 D
.P1 C P /=2), then it is zero.
42
16Mar2009
02Mar2009
2
0
1.3
1.4
1.5
CHF/EUR
1.6
0
1.3
1.7
1.4
16Nov2009
1.7
1.6
1.7
8
pdf
1.6
17May2010
15
10
5
0
1.3
1.5
CHF/EUR
6
4
2
1.4
1.5
CHF/EUR
1.6
0
1.3
1.7
1.4
1.5
CHF/EUR
2.3
R;
(2.27)
that is, the negative of the return. Second, the attention is on how the distribution looks
like beyond a threshold and also on the the probability of exceeding this threshold. In contrast, the exact shape of the distribution below that point is typically disregarded. Third,
modelling the tail of the distribution is best done by using a distribution that allows for a
much heavier tail that suggested by a normal distribution. The generalized Pareto (GP)
distribution is often used. See Figure 2.21 for an illustration.
43
200904
200907
200910
201001
201004
0 and z
g.z/ D
1
exp. z=/
D 0:
The mean is defined (finite) if < 1 and is then E.z/ D =.1 /. Similarly, the variance
is finite if < 1=2 and is then Var.z/ D 2 =.1
/2 .1 2 /. See Figure 2.22 for an
illustration.
Remark 2.12 (Random number from a generalized Pareto distribution ) By inverting
the Cdf, we can notice that if u is uniformly distributed on .0; 1, then we can construct
44
0
0.2
0.05
0.4
0
200901
201001
2009
2010
25 risk reversal/iv(atm)
iv atm
0.1
0
0.2
0.05
0.4
0
2009
2010
2009
2010
u/
ln.1
1
u/
if
D 0:
Consider the loss X (the negative of the return) and let u be a threshold. Assume
that the threshold exceedance (X u) has a generalized Pareto distribution. Let Pu be
probability of X u. Then, the cdf of the loss for values greater than the threshold
(Pr.X x/ for x > u) can be written
F .x/ D Pu C G.x
u/.1
Pu /, for x > u;
(2.28)
where G.z/ is the cdf of the generalized Pareto distribution. Noticed that, the cdf value is
Pu at at x D u (or just slightly above u), and that it becomes one as x goes to infinity.
45
unknown shape
generalized Pareto dist
Loss
u/.1
Pu /, for x > u;
(2.29)
where g.z/ is the pdf of the generalized Pareto distribution. Notice that integrating the
pdf from x D u to infinity shows that the probability mass of X above u is 1 Pu . Since
the probability mass below u is Pu , it adds up to unity (as it should). See Figure 2.24 for
an illustration.
It is often to calculate the tail probability Pr.X > x/, which in the case of the cdf in
(2.28) is
1 F .x/ D .1 Pu /1 G.x u/;
(2.30)
where G.z/ is the cdf of the generalized Pareto distribution.
The VaR (say, D 0:95) is the -th quantile of the loss distribution
VaR D cdfX 1 ./;
(2.31)
where cdfX 1 ./ is the inverse cumulative distribution function of the losses, so cdfX 1 ./
is the quantile of the loss distribution. For instance, VaR95% is the 0:95 quantile of the
loss distribution. This clearly means that the probability of the loss to be less than VaR
equals
Pr.X VaR / D :
(2.32)
(Equivalently, the Pr.X >VaR / D 1 :)
Assuming is higher than Pu (so VaR
6
5
4
3
2
1
0
0.1
0.2
0.3
Outcome of random variable
0.4
0.5
1
< uC
1
1 Pu
VaR D
:
u ln 11 Pu
if
, for
Pu :
(2.33)
D0
Proof. (of (2.33)) Set F .x/ D in (2.28) and use z D x u in the cdf from Remark
2.11 and solve for x.
If we assume < 1 (to make sure that the mean is finite), then straightforward integration using (2.29) shows that the expected shortfall is
ES D E.XjX
D
Let
VaR /
VaRa
C
1
1
< 1:
(2.34)
47
N(0.08,0.162)
generalized Pareto (=0.22,=0.16)
0.8
VaR
ES
Normal dist
18.2
25.3
GP dist
24.5
48.4
0.6
0.4
0.2
0
15
20
25
30
35
40
Loss, %
45
50
55
60
Figure 2.23: Comparison of a normal and a generalized Pareto distribution for the tail of
losses
expected exceedance of the loss over another threshold
e. / D E .X
D
>u
jX > /
, for
> u and
(2.35)
< 1.
jX > / D
. 0/
. 0 /
; with
D.
N. ;
/, then
/=
where ./ and are the pdf and cdf of a N.0; 1/ variable respectively.
The expected exceedance over
(2.36)
If it is found that e.
O / is increasing (more or less) linearly with the threshold level ( ),
then it is reasonable to model the tail of the distribution from that point as a generalized
Pareto distribution.
The estimation of the parameters of the distribution ( and ) is typically done by
maximum likelihood. Alternatively, A comparison of the empirical exceedance (2.36)
with the theoretical (2.35) can help. Suppose we calculate the empirical exceedance for
different values of the threshold level (denoted i all large enough so the relation looks
linear), then we can estimate (by LS)
e.
O i/ D a C b
(2.37)
C "i :
Then, the theoretical exceedance (2.35) for a given starting point of the GPD u is related
to this regression according to
aD
D
and b D
b
and D a.1
1Cb
, or
(2.38)
/ C u:
Remark 2.14 (Log likelihood function of the loss distribution) Since we have assumed
that the threshold exceedance (X u) has a generalized Pareto distribution, Remark 2.11
shows that the log likelihood for the observation of the loss above the threshold (X t > u)
is
LD
X
t st. X t >u
(
ln L t D
Lt
ln
.1= C 1/ ln 1 C .X t
ln .X t u/ =
u/ =
if
0
D 0:
15
10
5
0
15
20
25
30
Threshold v, %
35
40
2.4
Exceedance Correlations
(2.39)
50
Expected exceedance
1.2
0.1
1
0.05
0.8
0.6
0
0.5
1
1.5
2
Threshold v, %
2.5
1.5
2.5
3
Loss, %
3.5
Empirical quantiles
QQ plot
(94th to 99th percentiles)
2.5
2
1.5
1.5
2
2.5
Quantiles from estimated GPD, %
51
=0.75
=0.5
=0.25
0.6
0.5
0.4
0.3
0.2
0.1
0
0.1
0.2
0.3
0.4
Upper boundary (prob of lower value)
0.5
Figure 2.26: Correlation in lower tail when data is drawn from a normal distribution with
correlation
2.5
52
Corr
freq
All
0.64
1.00
Low
0.68
0.02
Mid
0.40
0.85
High
0.51
0.02
10
15
15
10
5
0
5
Returns of small value stocks, %
10
x t rank.x t / y t rank.y t /
2
2:5
7
2
10
4
8
3
3
1
2
1
2
2:5
10
4
(2.40)
In the second step, simply estimate the correlation of the ranks of two variables
Spearmans
D Corrrank.x t /; rank.y t /:
(2.41)
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.4
0
0.1
0.2
0.3
0.4
0.5
Upper boundary (prob of lower value)
0.4
0.5
0.6
0.7
0.8
0.9
Lower boundary (prob of lower value)
(2.42)
1 O !d N.0; 1/:
.T
2/=.1
O2 / O
Remark 2.16 (Spearmans for a distribution ) If we have specified the joint distribution of the random variables X and Y , then we can also calculate the implied Spearmans
(sometimes only numerically) as CorrFX .X/; FY .Y / where FX .X/ is the cdf of X and
FY .Y / of Y .
Kendalls rank correlation (called Kendalls ) is similar, but is based on comparing changes of x t (compared to x1 ; : : : x t 1 ) with the corresponding changes of y t . For
instance, with three data points (.x1 ; y1 /; .x2 ; y2 /; .x3 ; y3 /) we first calculate
Changes of x Changes of y
x2 x1
y2 y1
x3 x1
y3 y1
x3 x2
y3 y2 ;
which gives T .T
(2.43)
Corr = 0.03
Corr = 0.90
= 0.88, = 0.69
0
x
= 0.03, = 0.01
0
x
Corr = 0.49
Corr = 0.88
1
2
1
2
= 0.84, = 0.65
0
x
= 1.00, = 1.00
0
x
xi /.yj
yi / > 0
ij is discordant if .xj
xi /.yj
yi / < 0:
(2.44)
Finally, we count the number of concordant (Tc ) and discordant (Td ) pairs and calculate
Kendalls tau as
Tc Td
Kendalls D
:
(2.45)
T .T 1/=2
It can be shown that
Kendalls
so it is straightforward to test
4T C 10
! N 0;
;
9T .T 1/
d
(2.46)
by a t-test.
55
1 2
D
3.3 1/=2
1
:
3
If x and y actually has bivariate normal distribution with correlation , then it can be
shown that on average we have
Spearmans rho =
Kendalls tau D
arcsin. =2/
(2.47)
(2.48)
arcsin. /:
In this case, all three measures give similar messages (although the Kendalls tau tends to
be lower than the linear correlation and Spearmans rho). This is illustrated in Figure 2.30.
Clearly, when data is not normally distributed, then these measures can give distinctly
different answers.
A joint -quantile exceedance probability measures how often two random variables
(x and y, say) are both above their quantile. Similarly, we can also define the probability
that they are both below their quantile
G D Pr.x
x; ; y
y; /;
(2.49)
56
0.8
0.6
0.4
0.2
0
0.2
0.4
0.6
Correlation
0.8
Figure 2.30: Spearmans rho and Kendalls tau if data has a bivariate normal distribution
T of this sorted list (do this individually for x and y). Then, calculate the estimate
(
XT
1 if x t Ox; and y t Oy;
1
t ; where t D
(2.50)
GO D
t D1
T
0 otherwise.
See Figure 2.31 for an illustration based on a joint normal distribution.
Pr(x<quantile,y<quantile), Gauss
Pr(x<quantile,y<quantile), Gauss
100
10
corr = 0.25
corr = 0.5
50
20
40
60
80
Prob of lower value, %
100
2
4
6
8
Prob of lower value, %
10
57
2.6
Copulas
Reference: McNeil, Frey, and Embrechts (2005), Alexander (2008) 6, Jondeau, Poon, and
Rockinger (2007) 6
Portfolio choice and risk analysis depend crucially on the joint distribution of asset
returns. Emprical evidence suggest that many returns have non-normal distribution, especially when we focus on the tails. There are several ways of estimating complicated
(non-normal) distributions: using copulas is one. This approach has the advantage that
it proceeds in two steps: first we estimate the marginal distribution of each returns separately, then we model the comovements by a copula.
2.6.1
(2.51)
where c./ is a copula density function and ui D Fi .xi / is the cdf value as in (2.1). The
extension to three or more random variables is straightforward.
Equation (2.51) means that if we know the joint pdf f1;2 .x1 ; x2 /and thus also the
cdfs F1 .x1 / and F2 .x2 /then we can figure out what the copula density function must
be. Alternatively, if we know the pdfs f1 .x1 / and f2 .x2 /and thus also the cdfs F1 .x1 /
and F2 .x2 /and the copula function, then we can construct the joint distribution. (This
is called Sklars theorem.) This latter approach will turn out to be useful.
The correlation of x1 and x2 depends on both the copula and the marginal distributions. In contrast, both Spearmans rho and Kendalls tau are determined by the copula
only. They therefore provide a way of calibrating/estimating the copula without having to
involve the marginal distributions directly.
Example 2.18 (Independent X and Y ) If X and Y are independent, then we know that
f1;2 .x1 ; x2 / D f1 .x1 /f2 .x2 /, so the copula density function is just a constant equal to
one.
Remark 2.19 (Joint cdf) A joint cdf of two random variables (X1 and X2 ) is defined as
F1;2 .x1 ; x2 / D Pr.X1 x1 and X2 x2 /:
58
This cdf is obtained by integrating the joint pdf f1;2 .x1 ; x2 / over both variables
Z x1 Z x2
F1;2 .x1 ; x2 / D
f1;2 .s; t/dsdt:
sD 1
tD 1
(Conversely, the pdf is the mixed derivative of the cdf, f1;2 .x1 ; x2 / D @2 F1;2 .x1 ; x2 /=@x1 @x2 .)
See Figure 2.32 for an illustration.
Remark 2.20 (From joint to univariate pdf) The pdf of x1 (also called the marginal pdf
R1
of x1 ) can be calculate from the joint pdf as f1 .x1 / D x2 D 1 f1;2 .x1 ; x2 /dx2 .
pdf of bivariate normal distribution, corr=0.8 cdf of bivariate normal distribution, corr=0.8
0.2
0.1
0.5
2
0
2
0
y
0
x
0
2
0
y
0
x
@2 C.u1 ; u2 /
:
@u1 @u2
Notice the derivatives are with respect to ui D Fi .xi /, not xi . Conversely, integrating the
density over both u1 and u2 gives the copula function C./.
59
2.6.2
1 2C
2/
2 2
2
, with
.ui /;
(2.52)
where 1 ./ is the inverse of an N.0; 1/ distribution. Notice that when using this function
in (2.51) to construct the joint pdf, we have to first calculate the cdf values ui D Fi .xi /
from the univariate distribution of xi (which may be non-normal) and then calculate
the quantiles of those according to a standard normal distribution i D 1 .ui / D
1 Fi .xi /.
It can be shown that assuming that the marginal pdfs (f1 .x1 / and f2 .x2 /) are normal
and then combining with the Gaussian copula density recovers a bivariate normal distribution. However, the way we typically use copulas is to assume (and estimate) some
other type of univariate distribution, for instance, with fat tailsand then combine with a
(Gaussian) copula density to create the joint distribution. See Figure 2.33 for an illustration.
A zero correlation ( D 0) makes the copula density (2.52) equal to unityso the
joint density is just the product of the marginal densities. A positive correlation makes the
copula density high when both x1 and x2 deviate from their means in the same direction.
The easiest way to calibrate a Gaussian copula is therefore to set
D Spearmans rho,
(2.53)
as suggested by (2.47).
Alternatively, the parameter can calibrated to give a joint probability of both x1
and x2 being lower than some quantile as to match data: see (2.50). The values of this
probability (according to a copula) is easily calculated by finding the copula function
(essentially the cdf) corresponding to a copula density. Some results are given in remarks
below. See Figure 2.31 for results from a Gaussian copula. This figure shows that a
higher correlation implies a larger probability that both variables are very lowbut that
the probabilities quickly become very small as we move towards lower quantiles (lower
returns).
Remark 2.23 (The Gaussian copula function ) The distribution function corresponding
60
to the Gaussian copula density (2.52) is obtained by integrating over both u1 and u2 and
the value is C.u1 ; u"2 I #/ D
1 ; 2 / where i is defined in (2.52) and is the bivariate
" .#!
0
1
normal cdf for N
;
. Most statistical software contains numerical returns
0
1
for calculating this cdf.
Remark 2.24 (Multivariate Gaussian copula density ) The Gaussian copula density for
n variables is
1
1 0
c.u/ D p
exp
.R 1 In / ;
2
jRj
The Gaussian copula is useful, but it has the drawback that it is symmetricso the
downside and the upside look the same. This is at odds with evidence from many financial
markets that show higher correlations across assets in down markets. The Clayton copula
density is therefore an interesting alternative
c.u1 ; u2 / D . 1 C u1 C u2 /
2 1=
.u1 u2 /
.1 C /;
(2.54)
where 0. When > 0, then correlation on the downside is much higher than on the
upside (where it goes to zero as we move further out the tail).
See Figure 2.33 for an illustration.
For the Clayton copula we have
Kendalls
, so
C2
2
D
:
1
D
(2.55)
(2.56)
The easiest way to calibrate a Clayton copula is therefore to set the parameter according
to (2.56).
Figure 2.34 illustrates how the probability of both variables to be below their respective quantiles depend on the parameter. These parameters are comparable to the those
for the correlations in Figure 2.31 for the Gaussian copula, see (2.47)(2.48). The figure
are therefore comparableand the main point is that Claytons copula gives probabilities
of joint low values (both variables being low) that do not decay as quickly as according to
61
the Gaussian copulas. Intuitively, this means that the Clayton copula exhibits much higher
correlations in the lower tail than the Gaussian copula doesalthough they imply the
same overall correlation. That is, according to the Clayton copula more of the overall
correlation of data is driven by synchronized movements in the left tail. This could be
interpreted as if the correlation is higher in market crashes than during normal times.
Remark 2.25 (Multivariate Clayton copula density ) The Clayton copula density for n
variables is
c.u/ D 1
nC
Pn
i D1 ui
n 1=
Qn
i D1 ui
Qn
i D1 1
C .i
1/ :
Remark 2.26 (Clayton copula function ) The copula function (the cdf) corresponding to
(2.54) is
C.u1 ; u2 / D . 1 C u1 C u2 / 1= :
The following steps summarize how the copula is used to construct the multivariate
distribution.
1. Construct the marginal pdfs fi .xi / and thus also the marginal cdfs Fi .xi /. For instance, this could be done by fitting a distribution with a fat tail. With this, calculate
the cdf values for the data ui D Fi .xi / as in (2.1).
2. Calculate the copula density as follows (for the Gaussian or Clayton copulas, respectively):
(a) for the Gaussian copula (2.52)
i. assume (or estimate/calibrate) a correlation
ula
ii. calculate
tion
.ui /, where
0
2
0
2
0
x2 2
x1
0
x2 2
0
2
0
2
x2
x1
x1
0
x2 2
x1
63
Pr(x<quantile,y<quantile), Clayton
Pr(x<quantile,y<quantile), Clayton
100
10
= 0.16
= 0.33
50
20
40
60
80
Prob of lower value, %
100
2
4
6
8
Prob of lower value, %
10
It can be shown that a Gaussian copula gives zero or very weak tail dependence,
unless the correlation is 1. It can also be shown that the lower tail dependence of the
Clayton copula is
1=
if > 0
l D 2
and zero otherwise.
2.7
The methods for estimating the (marginal, that is, for one variable at a time) distribution
of the lower tail can be combined with a copula to model the joint tail distribution. In
particular, combining the generalized Pareto distribution (GPD) with the Clayton copula
provides a flexible way.
This can be done by first modelling the loss (X t D R t ) beyond some threshold (u),
that is, the variable X t u with the GDP. To get a distribution of the return, we simply use
the fact that pdfR . z/ D pdfX .z/ for any value z. Then, in a second step we calibrate the
copula by using Kendalls for the subsample when both returns are less than u. Figures
2.372.39 provide an illustration.
Remark 2.28 Figure 2.37 suggests that the joint occurrence (of these two assets) of re64
1
x2
2
2
0
x1
0
x1
1
x2
2
2
0
x1
0
x1
1
x2
2
2
x1
Notice: marginal distributions
are t5
1
x2
2
1
0
x1
0
x1
0
x1
4
%
Data
estimated N()
50
3
2
1
20
40
60
80
Prob of lower value, %
100
2
4
6
8
Prob of lower value, %
10
Bibliography
Alexander, C., 2008, Market Risk Analysis: Practical Financial Econometrics, Wiley.
Ang, A., and J. Chen, 2002, Asymmetric correlations of equity portfolios, Journal of
Financial Economics, 63, 443494.
Breeden, D., and R. Litzenberger, 1978, Prices of State-Contingent Claims Implicit in
Option Prices, Journal of Business, 51, 621651.
67
0.2
0.15
0.15
0.1
0.1
0.05
0.05
3
Loss, %
0.15
0.15
0.1
0.1
0.05
0.05
3
2
Return, %
3
Loss, %
0.2
0.2
0
5
0.2
0
5
3
2
Return, %
68
1
2
3
4
4
3
2
small value stocks
1
2
3
Spearmans = 0.49
4
4
3
2
small value stocks
2
3
Kendalls = 0.35, = 1.06
4
4
3
2
small value stocks
69
2
Asset 2
Asset 2
2
0
2
4
4
2
marginal pdf:
normal
0
Asset 1
4
4
0
Asset 1
= 0.51
2
Asset 2
2
Asset 2
= 0.35
0
2
4
4
0
2
0
Asset 1
4
4
0
Asset 1
Figure 2.40: Example of scatter plots of two asset returns drawn from different copulas
tions Prices: An Application to Crude Oil During the Gulf Crisis, Journal of Financial
and Quantitative Analysis, 32, 91115.
Mittelhammer, R. C., 1996, Mathematical statistics for economics and business, SpringerVerlag, New York.
Ritchey, R. J., 1990, Call option valuation for discrete normal mixtures, Journal of
Financial Research, 13, 285296.
Silverman, B. W., 1986, Density estimation for statistics and data analysis, Chapman and
Hall, London.
70
1
1.2
1.4
1.6
1.8
N
Clayton, = 0.56
Clayton, = 1.06
Clayton, = 2.06
2
2.2
0.01
0.02
0.03
0.04
0.05
0.06
Prob of lower outcome
0.07
0.08
0.09
0.1
Figure 2.41: Quantiles of an equally weighted portfolio of two asset returns drawn from
different copulas
Sderlind, P., 2000, Market expectations in the UK before and after the ERM crisis,
Economica, 67, 118.
Sderlind, P., and L. E. O. Svensson, 1997a, New techniques to extract market expectations from financial instruments, Journal of Monetary Economics, 40, 383420.
Sderlind, P., and L. E. O. Svensson, 1997b, New techniques to extract market expectations from financial instruments, Journal of Monetary Economics, 40, 383429.
Taylor, S. J., 2005, Asset price dynamics, volatility, and prediction, Princeton University
Press.
71
3.1
The traditional interpretation of autocorrelation in asset returns is that there are some
irrational traders. For instance, feedback trading would create positive short term autocorrelation in returns. If there are non-trivial market imperfections, then predictability
can be used to generate economic profits. If there are no important market imperfections,
then predictability of excess returns should be thought of as predictable movements in
risk premia.
To see illustrate the latter, let RetC1 be the excess return on an asset. The canonical
asset pricing equation then says
E t m t C1 RetC1 D 0;
(3.1)
E x E y) as
Cov t .m t C1 ; RetC1 /= E t m t C1 :
(3.2)
This says that the expected excess return will vary if risk (the covariance) does. If there is
some sort of reasonable relation between beliefs and the properties of actual returns (not
72
necessarily full rationality), then we should not be too surprised to find predictability.
Example 3.2 (Epstein-Zin utility function) Epstein and Zin (1991) define a certainty
equivalent of future utility as Z t D E t .U t1C1 /1=.1 / where is the risk aversionand
then use a CES aggregator function to govern the intertemporal trade-off between current
consumption and the certainty equivalent: U t D .1 /C t1 1= C Z t1 1= 1=.1 1= /
where is the elasticity of intertemporal substitution. If returns are iid (so the consumptionwealth ratio is constant), then it can be shown that this utility function has the same
pricing implications as the CRRA utility, that is,
E.C t =C t
1/
R t D constant.
Example 3.4 (Habit persistence) The habit persistence model of Campbell and Cochrane
(1999) has a CRRA utility function, but the argument is the difference between consumption and a habit level, C t X t , instead of just consumption. The habit is parameterized
in terms of the surplus ratio S t D .C t X t /=C t . The log surplus ratio.(s t )is assumed
to be a non-linear AR(1)
s t D s t 1 C .s t 1 / c t :
It can be shown (see Sderlind (2006)) that if .s t 1 / is a constant and the excess return
is unpredictable (by s t ) then the habit persistence model is virtually the same as the CRRA
model, but with .1 C / as the effective risk aversion.
Example 3.5 (Reaction to news and the autocorrelation of returns) Let the log asset
price, p t , be the sum of a random walk and a temporary component (with perfectly correlated innovations, to make things simple)
p t D u t C " t , where u t D u t
D ut
Let r t D p t
pt
C .1 C /" t :
C "t
.1 C / Var." t /;
so 0 < < 1 (initial overreaction of the price) gives a negative autocorrelation. See
Figure 3.1 for the impulse responses with respect to a piece of news, " t .
3.2
Autocorrelations
74
0
Price
Return
1
0
1
5
period
period
Figure 3.1: Impulse reponses when price is random walk plus temporary component
3.2.1
y/
N .y t
T
1X
yt :
with yN D
T t D1
y/
N 0;
(3.3)
(3.4)
s observations to estimate
(3.5)
The sampling properties of Os are complicated, but there are several useful large sample results for Gaussian processes (these results typically carry over to processes which
are similar to the Gaussiana homoskedastic process with finite 6th moment is typically
enough, see Priestley (1981) 5.3 or Brockwell and Davis (1991) 7.2-7.3). When the true
autocorrelations are all zero (not 0 , of course), then for any i and j different from zero
#!
"
#
" # "
p
Oi
0
1 0
d
T
! N
;
:
(3.6)
Oj
0
0 1
75
This result can be used to construct tests for both single autocorrelations (t-test or
and several autocorrelations at once ( 2 test).
test)
Example 3.6 (t-test) We want to test the hypothesis that 1 D 0. Since the N.0; 1/ distribution has 5% of the probability mass below -1.65 and another 5% above 1.65, we
p
can reject the null hypothesis at the 10% level if T j O1 j > 1:65. With T D 100, we
p
therefore need j O1 j > 1:65= 100 D 0:165 for rejection, and with T D 1000 we need
p
j O1 j > 1:65= 1000 0:053.
p
The Box-Pierce test follows directly from the result in (3.6), since it shows that T Oi
p
and T Oj are iid N(0,1) variables. Therefore, the sum of the square of them is distributed
as an 2 variable. The test statistic typically used is
QL D T
L
X
sD1
Os2 !d
2
L:
(3.7)
This section discusses how GMM can be used to test if a series is autocorrelated. The
analysis focuses on first-order autocorrelation, but it is straightforward to extend it to
higher-order autocorrelation.
76
SMI
10
SMI
bill portfolio
1990
1995
2000
Year
2005
2010
10
1990
1995
2000
Year
2005
2010
T .O
77
0.3
0.3
Autocorr with 90% conf band around 0
S&P 500, 1979:12011:5
0.2
0.2
0.1
0.1
0.1
3
lags (days)
0.1
0.2
0.2
0.1
0.1
0
1
3
lags (days)
3
4
lags (weeks)
0.3
0.1
0.1
3
4
lags (weeks)
Assume that there is no autocorrelation in g t .0 / (which means, among other things, that
volatility, x t2 ; is not autocorrelated). We can then simplify as
S0 D E g t .0 /g t .0 /0 :
78
0.05
0.05
0.05
0.1
3
lags (days)
0.1
3
lags (days)
Figure 3.4: Predictability of US stock returns, results from a regression with interactive
dummies
This assumption is stronger than assuming that D 0, but we make it here in order to
illustrate the asymptotic distribution. Moreover, assume that x t is iid N.0; 2 /. In this
case (and with D 0 imposed) we get
"
"
#"
#0
#
2 2
2
2
2
.x t2
/
.x t2
/x t x t 1
x t2
x t2
DE
S0 D E
2
.x t2
/x t x t 1
.x t x t 1 /2
xt xt 1
xt xt 1
"
# "
#
E x t4 2 2 E x t2 C 4
0
2 4 0
D
D
:
4
0
E x t2 x t2 1
0
To make the simplification in the second line we use the facts that E x t4 D 3 4 if x t
N.0; 2 /, and that the normality and the iid properties of x t together imply E x t2 x t2 1 D
79
0.3
0.3
0.2
0.2
0.1
0.1
0.1
3
lags (days)
0.1
3
lags (days)
0.3
0.2
0.1
0
0.1
3
lags (days)
#1
0
2
T O !d N.0; 1/.
80
3.2.3
Autoregressions
C a2 y t
C ::: C ap y t
(3.9)
C "t ;
and then test if all the slope coefficients are zero with a 2 test. This approach is somewhat
less general than the Box-Pierce test, but most stationary time series processes can be well
approximated by an AR of relatively low order. To account for heteroskedasticity and
other problems, it can make sense to estimate the covariance matrix of the parameters by
an estimator like Newey-West.
Return = a + b*lagged Return, R2
0.1
0
0.05
0.5
0
20
40
Return horizon (months)
60
20
40
Return horizon (months)
60
1
0
1
2
2
0
1
Lagged return
0.5
0.1
0
0.05
0.5
0
20
40
Return horizon (months)
60
20
40
Return horizon (months)
60
0/y t
1 Cb.y t 1
> 0/y t
1,
where .q/ D
1 if q is true
0 else.
(3.10)
It is straightforward to see the relation between autocorrelations and the AR model when
the AR model is the true process. This relation is given by the Yule-Walker equations.
For an AR(1), the autoregression coefficient is simply the first autocorrelation coeffi-
82
Variance Ratios
1/
Var.y t C y t 1 /
2 Var.y t /
D1C
1;
to 2 Var.y t /
(3.13)
(3.14)
where s is the sth autocorrelation. If y t is not serially correlated, then this variance ratio
is unity; a value above one indicates positive serial correlation and a value below one
indicates negative serial correlation.
Proof. (of (3.14)) Let y t have a zero mean (or be demeaned), so Cov.y t ; y t s / D
83
E y t y t s . We then have
E.y t C y t 1 /2
VR2 D
2 E y t2
Var.y t / C Var.y t 1 / C 2 Cov.y t ; y t
D
2 Var.y t /
1C1C2 1
;
D
2
1/
Pq 1
Var
y
sD0 t s
(3.15)
VRq D
q Var.y t /
q 1
X
jsj
D
1
(3.16)
s or
q
sD .q 1/
D1C2
q 1
X
sD1
s
q
(3.17)
s:
The third line exploits the fact that the autocorrelation (and autocovariance) function is
symmetric around zero, so s D s . (We could equally well let the summation in (3.16)
and (3.17) run from q to q since the weight 1 jsj =q, is zero for that lag.) It is immediate
that no autocorrelation means that VRq D 1 for all q. If all autocorrelations are nonpositive, s 0, then VRq 1, and vice versa.
Example 3.9 (VR3 ) For q D 3, (3.15)(3.17) are
Var .y t C y t 1 C y t 2 /
3 Var.y t /
1
2
2
1
D
2C
1C1C
1C
3
3
3
3
2
1
D1C2
1C
2 :
3
3
VR3 D
84
1.5
1.5
0.5
0.5
0
20
40
Return horizon (months)
60
20
40
Return horizon (months)
60
C : : : C yt
qC1 /
D q Var.y t / C 2.q
C 2 Cov.y t ; y t
1/ Cov.y t ; y t
1/
C 2.q
2/ Cov.y t ; y t
qC1 /:
C yt
2/
D Var.y t / C Var.y t
2 Cov.y t ; y t
1/
2 Cov.y t ; y t
2 /:
1/
C Var.y t
C 2 Cov.y t
2 /C
1 ; y t 2 /C
Assume that variances and covariances are constant over time. Divide by q Var.y t / to get
1
2
1
VRq D 1 C 2 1
1C2 1
2 C ::: C 2
q 1:
q
q
q
85
2/
C :::
Variance ratio
Autocorr of LR returns
0.8
a= 0.05
a= 0.25
a= 0.5
0.6
0.4
1
0.2
Process: yt = ayt1 + t
4
6
q (horizon)
10
4
6
q (horizon)
10
Figure 3.9: Variance ratio and long run autocorrelation of an AR(1) process
noise (and y t has a zero mean or is demeaned), then
VR2 D 1 C a and
2
4
VR3 D 1 C a C a2 :
3
3
See Figure 3.9 for a numerical example.
The estimation of VRq is done by replacing the population variances in (3.15) with
the sample variances, or the autocorrelations in (3.17) by the sample autocorrelations.
The sampling distribution of V Rq under the null hypothesis that there is no autocorrelation follows from the sampling distribution of the autocorrelation coefficient. Rewrite
(3.17) as
q 1
X
p
s p
T V Rq 1 D 2
1
T Os :
(3.18)
q
sD1
If the assumptions behind (3.6) are satisfied, then we have that, under the null hypothesis of no autocorrelation, (3.18) is a linear combination of (asymptotically) uncorrelated
p
N.0; 1/ variables (the T Os ). It then follows that
" q 1
2 #
X
p
s
T V Rq 1 !d N 0;
4 1
:
(3.19)
q
sD1
p
p
T V R2 1 !d N .0; 1/ and T V R3 1 !d N .0; 20=9/ :
86
These distributional results depend on the assumptions behind the results in (3.6). One
way of handling deviations from those assumptions is to estimate the autocorrelations and
their covariance matrix with GMM, alternatively, the results in Taylor (2005) can be used.
See Figure 3.8 for an illustration.
3.2.6
Long-Run Autoregressions
(3.20)
C y t / C " t C2 :
Notice that it is important that dependent variable and the regressor are non-overlapping
(dont include the return for the same period)otherwise we are likely to find spurious
autocorrelation. The least squares population regression coefficient is
Cov .y t C1 C y t C2 ; y t 1 C y t /
Var .y t 1 C y t /
1 1C2 2C 3
D
:
VR2
2
(3.21)
b2 D
(3.22)
2 Var .y t / Cov .y t C1 C y t C2 ; y t
Var .y t 1 C y t /
2 Var .y t /
C yt /
C y t / D Cov .y t C1 ; y t
2/ :
The general pattern that emerges from these expressions is that the slope coefficient
in an AR(1) of (non-overlapping) long-run returns
q
X
sD1
y t Cs D a C bq
q
X
sD1
y t Cs
C " t Cq
(3.23)
87
is
1
bq D
VRq
q 1
X
sD .q 1/
jsj
q
qCs :
(3.24)
Note that the autocorrelations are displaced by the amount q. As for the variance ratio,
the summation could run from q to q instead, since the weight, 1 jsj=q, is zero for that
lag.
Equation (3.24) shows that the variance ratio and the AR(1) coefficient of long-run
returns are closely related. A bit of manipulation (and using the fact that s D s ) shows
that
VR2q
1 C bq D
:
(3.25)
VRq
If the variance ratio increases with the horizon, then this means that the long-run returns
are positively autocorrelated.
Example 3.12 (Long-run autoregression of an AR(1)) When y t D ay t 1 C " t where " t
is iid white noise, then the variance ratios are as in Example (3.10), and we know that
qCs
. From (3.22) we then have
qCs D a
1 a C 2a2 C a3
VR2
2
1 a C 2a2 C a3
D
:
1Ca
2
b2 D
See Figure 3.9 for a numerical example. For future reference, note that we can simplify to
get b2 D .1 C a/ a=2.
Example 3.13 (Trying (3.25) on an AR(1)) From Example (3.10) we have that
VR4
VR2
1 C 32 a C a2 C 12 a3
1Ca
1
D .1 C a/ a;
2
1D
overlapping periods, there is still the issue of whether we should use all available data
points or not.
Suppose one-period returns actually are serially uncorrelated and have zero means (to
simplify)
y t D u t , where u t is iid with E uu D 0,
(3.26)
and that we are studying two-periods returns. One possibility is to use y t C1 C y t C2 as the
first observation and y t C3 C y t C4 as the second observation: no common period. This
clearly halves the sample size, but has an advantage when we do inference. To see that,
notice that two successive observations are then
y t C1 C y t C2 D a C b2 .y t
C y t / C " t C2
y t C3 C y t C4 D a C b2 .y t C1 C y t C2 / C " t C4 :
(3.27)
(3.28)
" t C4 D u t C3 C u t C4 ;
(3.29)
(3.30)
C y t / C " tC2
y t C2 C y t C3 D a C b2 .y t C y t C1 / C " t C3 :
(3.31)
(3.32)
As before, if (3.26) is true, then a D b2 D 0 (so there is no problem with the point
estimates), but the residuals are
" t C2 D u t C1 C u t C2
(3.33)
" t C3 D u t C2 Cu t C3 ;
(3.34)
which are correlated since u tC2 shows up in both. This demonstrates that overlapping
return data introduces autocorrelation of the residualswhich has to be handled in order
to make correct inference.
89
3.3
Multivariate (Auto-)correlations
3.3.1
A momentum strategy invests in assets that have performed well recentlyand often goes
short in those that have underperformed. See 3.10 for an empirical illustration.
Buy winners and sell losers
excess return
alpha
8
6
4
Buy (sell) the 5 assets with highest (lowest) return over the last month
4
6
8
Evalutation horizon, days
10
12
and
.k/ D E.R t
(3.35)
/.R t
/0 :
k/
k/
N
1 X
D
Ri t D 10 R t =N
N i D1
(3.36)
90
(3.37)
D 10 =N:
Rt
Rmt
N
(3.38)
which basically says that wi t .k/ is positive for assets with an above average return k periods back. Notice that the weights sum to zero, so this is a zero cost portfolio. However,
the weights differ from fixed weights (for instance, put 1=5 into the best 5 assets, and
1=5 into the 5 worst assets) since the overall size of the exposure (10 jw t j) changes over
time. A large dispersion of the past returns means large positions and vice versa. To
analyse a contrarian strategy, reverse the sign of (3.38).
The profit from this strategy is
t .k/
N
X
Ri t
i D1
Rmt
N
wit
Ri t D
where the last term uses the fact that iND1 Rmt
The expected value is
E
t .k/
1
10 .k/1
2
N
N
X
Ri t
i D1
k Ri t =N
k Ri t
N
D Rmt
Rmt
k Rmt ;
(3.39)
k Rmt .
N
1 X
N 1
tr .k/ C
tr .k/ C
.
N2
N i D1
2
m/ ;
(3.40)
where the 10 .k/1 sums all the elements of .k/ and tr .k/ sums the elements along
the main diagonal. (See below for a proof.) To analyse a contrarian strategy, reverse the
sign of (3.40).
With a random walk, .k/ D 0, then (3.40) shows that the momentum strategy wins
money: the first two terms are zero, while the third term contributes to a positive performance. The reason is that the momentum strategy (on average) invests in assets with high
average returns ( i > m ).
The first term of (3.40) sums all elements in the atocovariance matrix and then subtracts the sum of the diagonal elementsso it only depends on the sum of the cross-
91
covariances, that is, how a return is correlated with the lagged return of other assets. In
general, negative cross-covariances benefit a momentum strategy. To see why, suppose a
low lagged return on asset 1 predicts a high return on asset 2, but asset 2 cannot predict
asset 1 (Cov.R2;t ; R1;t k / < 0 and Cov.R1;t ; R2;t k / D 0). This helps the momentum
strategy since we are overweighting asset 2 as it performed relatively well in the previous
period.
Example 3.15 ((3.40) with 2 assets) Suppose we have
"
Cov.R1;t ; R1;t k / Cov.R1;t ; R2;t
.k/ D
Cov.R2;t ; R1;t k / Cov.R2;t ; R2;t
# "
#
/
0
0
k
D
:
0:1 0
k/
Then
1
1
0:1 0 D 0:025, and
10 .k/1 tr .k/ D
2
N
22
N 1
2 1
tr .k/ D
0 D 0;
2
N
2
so the sum of the first two terms of (3.40) is positive (good for a momentum strategy).
The second term of (3.40) depends only on own autocovariances, that is, how a return
is correlated with the lagged return of the same asset. If these own autocovariances are
(on average) positive, then a strongly performing asset in t k tends to perform well in
t , which helps a momentum strategy (as the strongly performing asset is overweighted).
See Figure 3.12 for an illustration based on Figure 3.11.
Example 3.16 Figure 3.12 shows that a momentum strategy works reasonably well on
daily data on the 25 FF portfolios. While the cross-covariances have a negative influence
(because they are mostly positive), they are dominated by the (on average) positive autocorrelations. The correlation matrix is illustrated in Figure 3.11. In short, the small firms
(asset 1-5) are correlated with the lagged returns of most assets, while large firms are not.
Example 3.17 ((3.40) with 2 assets) With
"
#
0:1 0
.k/ D
;
0 0:1
92
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
0.24
0.21
0.20
0.20
0.20
0.24
0.20
0.17
0.15
0.15
0.23
0.19
0.16
0.15
0.15
0.23
0.18
0.15
0.13
0.13
0.21
0.17
0.14
0.12
0.11
0.20
0.17
0.18
0.17
0.17
0.20
0.18
0.15
0.14
0.14
0.20
0.17
0.14
0.14
0.13
0.20
0.17
0.15
0.13
0.13
0.19
0.16
0.14
0.13
0.12
0.19
0.18
0.16
0.18
0.18
0.19
0.17
0.15
0.14
0.14
0.19
0.17
0.14
0.14
0.13
0.19
0.16
0.15
0.13
0.13
0.18
0.16
0.14
0.12
0.12
0.20
0.19
0.19
0.17
0.19
0.20
0.18
0.16
0.15
0.16
0.20
0.18
0.16
0.16
0.15
0.19
0.18
0.16
0.14
0.14
0.19
0.17
0.15
0.14
0.14
0.26
0.24
0.25
0.26
0.25
0.24
0.23
0.21
0.21
0.22
0.23
0.23
0.21
0.21
0.21
0.23
0.21
0.20
0.19
0.20
0.21
0.20
0.19
0.18
0.18
0.09
0.08
0.07
0.07
0.06
0.13
0.10
0.08
0.07
0.07
0.14
0.11
0.08
0.08
0.07
0.15
0.11
0.09
0.07
0.07
0.15
0.12
0.09
0.08
0.07
0.05
0.05
0.04
0.04
0.03
0.08
0.06
0.04
0.04
0.04
0.09
0.07
0.05
0.05
0.04
0.09
0.08
0.06
0.04
0.04
0.10
0.08
0.06
0.05
0.05
0.03
0.03
0.04
0.04
0.03
0.06
0.04
0.03
0.03
0.04
0.06
0.06
0.03
0.04
0.03
0.07
0.06
0.05
0.03
0.04
0.08
0.07
0.05
0.04
0.05
0.03
0.03
0.03
0.03
0.03
0.05
0.04
0.03
0.02
0.03
0.06
0.05
0.03
0.04
0.03
0.06
0.06
0.04
0.03
0.03
0.07
0.06
0.05
0.04
0.04
10
0.04
0.04
0.04
0.05
0.05
0.06
0.05
0.04
0.03
0.05
0.06
0.06
0.04
0.05
0.04
0.06
0.06
0.05
0.04
0.05
0.07
0.06
0.05
0.05
0.06
11
0.06
0.06
0.06
0.05
0.04
0.11
0.09
0.08
0.07
0.07
0.13
0.11
0.08
0.08
0.07
0.14
0.10
0.09
0.07
0.07
0.15
0.11
0.09
0.08
0.08
12
0.04
0.05
0.05
0.05
0.04
0.08
0.08
0.06
0.06
0.07
0.09
0.09
0.07
0.07
0.06
0.10
0.10
0.08
0.06
0.06
0.12
0.11
0.09
0.08
0.08
13
0.04
0.04
0.04
0.05
0.04
0.07
0.07
0.06
0.05
0.06
0.08
0.08
0.06
0.07
0.06
0.08
0.09
0.08
0.06
0.06
0.10
0.09
0.08
0.07
0.08
14
0.04
0.04
0.04
0.05
0.04
0.06
0.06
0.05
0.05
0.06
0.06
0.07
0.05
0.06
0.05
0.06
0.07
0.06
0.05
0.06
0.08
0.08
0.07
0.07
0.07
15
0.04
0.04
0.04
0.05
0.04
0.06
0.06
0.06
0.06
0.07
0.07
0.07
0.06
0.07
0.06
0.07
0.08
0.07
0.06
0.07
0.08
0.08
0.07
0.08
0.09
16
0.03
0.03
0.03
0.02
0.01
0.07
0.06
0.05
0.05
0.05
0.09
0.08
0.05
0.06
0.04
0.10
0.08
0.06
0.04
0.04
0.12
0.09
0.07
0.06
0.06
17
0.04
0.04
0.04
0.04
0.03
0.08
0.08
0.07
0.06
0.07
0.09
0.09
0.07
0.08
0.06
0.10
0.09
0.08
0.06
0.06
0.12
0.10
0.09
0.08
0.08
18
0.04
0.04
0.05
0.05
0.04
0.08
0.08
0.07
0.07
0.08
0.09
0.09
0.07
0.08
0.06
0.09
0.09
0.08
0.06
0.06
0.10
0.10
0.09
0.08
0.08
19
0.04
0.04
0.04
0.05
0.04
0.07
0.07
0.07
0.07
0.07
0.08
0.08
0.07
0.08
0.06
0.08
0.08
0.07
0.07
0.07
0.09
0.09
0.08
0.08
0.08
20
0.04
0.04
0.04
0.04
0.04
0.06
0.07
0.06
0.06
0.07
0.07
0.07
0.06
0.07
0.06
0.07
0.08
0.07
0.07
0.07
0.08
0.08
0.08
0.07
0.09
21
0.03
0.03
0.03
0.03
0.04
0.00
0.00
0.01
0.01
0.00
0.01
0.01
0.01
0.00
0.02
0.02
0.01
0.00
0.02
0.02
0.04
0.02
0.00
0.00
0.00
22
0.02
0.02
0.02
0.02
0.02
0.01
0.01
0.00
0.00
0.01
0.02
0.02
0.00
0.01
0.00
0.03
0.02
0.00
0.01
0.01
0.04
0.02
0.01
0.01
0.01
23
0.01
0.00
0.00
0.00
0.01
0.02
0.03
0.02
0.02
0.03
0.03
0.03
0.02
0.03
0.01
0.04
0.04
0.02
0.01
0.01
0.05
0.04
0.03
0.03
0.03
24
0.01
0.00
0.00
0.00
0.01
0.01
0.02
0.01
0.01
0.02
0.02
0.02
0.01
0.02
0.00
0.02
0.03
0.01
0.01
0.01
0.03
0.02
0.02
0.02
0.03
25
0.03
0.01
0.02
0.02
0.02
0.00
0.00
0.00
0.00
0.02
0.00
0.01
0.00
0.01
0.00
0.00
0.01
0.00
0.00
0.00
0.01
0.00
0.00
0.01
0.02
Notice that
1
N
PN
i D1
Cov.Ri t
k ; Ri t /
k ; Ri t /
2
i
Cov.Rmt
k ; Rmt /
2
m
24.40
10
0.12
0
10
20
20.78
Crosscov
Autocov
means
Figure 3.12: Decomposition of return from momentum strategy based on daily FF data
RQ D R
i
1 h 0Q
10 .k/1
1
0
0 Q
Cov.Rmt k ; Rmt / D E 2 1 R t 1 Ri t k D E 2 10 RQ t RQ i0 t k 1 D
:
N
N
N2
P
PN
1
2
2
2
Finally, we note that N1 N
m / . Together, these results
m D N
i D1 i
i D1 . i
give
N
10 .k/1
1
1 X
2
E t .k/ D
C tr .k/ C
. i
m/ ;
N2
N
N i D1
which can be rearranged as (3.40).
3.4
Other Predictors
There are many other, perhaps more economically plausible, possible predictors of future
stock returns. For instance, both the dividend-price ratio and nominal interest rates have
been used to predict long-run returns, and lagged short-run returns on other assets have
been used to predict short-run returns.
See Figure 3.13 for an illustration.
94
3.4.1
D t C1 C P t C1
D t C1 C P tC1
, so P t D
:
Pt
R tC1
(3.41)
Pt D
(3.42)
(3.43)
provided the discounted value of P t Cj goes to zero as j ! 1. This is simply an accounting identity. It is clear that a high price in t must lead to low future returns and/or
high future dividendswhich (by rational expectations) also carry over to expectations
of future returns and dividends.
It is sometimes more convenient to analyze the price-dividend ratio. Dividing (3.42)
and (3.43) by D t gives
1 D t C1
1
1
D t C2 D t C1
D t C3 D t C2 D t C1
Pt
D
C
C
C :::
Dt
R t C1 D t
R t C1 R tC2 D t C1 D t
R t C1 R t C2 R t C3 D t C2 D t C1 D t
(3.44)
j
1 Y
X
D t Ck =D tCk
D
R t Ck
j D1
(3.45)
kD1
As with (3.43) it is just an accounting identity. It must therefore also hold in expectations.
Since expectations are good (the best?) predictors of future values, we have the implication that the asset price should predict a discounted sum of future dividends, (3.43),
and that the price-dividend ratio should predict a discounted sum of future changes in
dividends.
95
dt
1
X
.d t C1Cs
d t Cs /
(3.46)
r t C1Cs ;
sD0
where
D 1=.1 C D=P / where D=P is a steady state dividend-price ration ( D
1=1:04 0:96 if D=P is 4%).
As before, a high price-dividend ratio must imply future dividend growth and/or low
future returns. In the exact solution (3.44), dividends and returns which are closer to the
present show up more times than dividends and returns far in the future. In the approximation (3.46), this is captured by giving a higher weight (higher s ).
Proof. (of (3.46)slow version) Rewrite (3.41) as
P tC1
D t C1
D t C1 C P tC1
D
1C
or in logs
R t C1 D
Pt
Pt
P t C1
r t C1 D p tC1
p t C ln 1 C exp.d t C1
p t C1 / :
Make a first order Taylor approximation of the last term around a steady state value of
d t C1 p t C1 , denoted d p,
ln 1 C exp.d t C1
p t C1 /
ln 1 C exp.d
i
p/ C
constant C .1
where D 1=1 C exp.d
The result is
r t C1
exp.d
p/
1 C exp.d
/ .d t C1
p/
d tC1
p tC1
p t C1 / ;
p t C .1
p t C .1
/ .d t C1
p t C1 /
/ d t C1 ;
96
where 0 <
< 1. Add and subtract d t from the right hand side and rearrange
pt
r t C1
.p t C1
d t C1 /
.p t
dt
.p t C1
d t C1 / C .d t C1
d t / C .d t C1
dt /
d t / , or
r t C1
This is a (forward looking, unstable) difference equation, which we can solve recursively
forward. Provided lims!1 s .p t Cs d t Cs / D 0, the solution is (3.46). (Trying to solve
for the log price level instead of the log price-dividend ratio is problematic since the
condition lims!1 s p t Cs D 0 may not be satisfied.)
Dividend-Price Ratio as a Predictor
One of the most successful attempts to forecast long-run return is by using the dividendprice ratio
q
X
r t Cs D C q .d t p t / C " t Cq :
(3.47)
sD1
For instance, CLM Table 7.1, report R2 values from this regression which are close to
zero for monthly returns, but they increase to 0.4 for 4-year returns (US, value weighted
index, mid 1920s to mid 1990s).
By comparing with (3.46), we see that the dividend-ratio in (3.47) is only asked to
predict a finite (unweighted) sum of future returnsdividend growth is disregarded. We
should therefore expect (3.47) to work particularly well if the horizon is long (high q) and
if dividends are stable over time.
From (3.46) we get (from using Cov.x; y z/ D Cov.x; y/ Cov.x; z/) that
!
!
1
1
X
X
s
s
.d tC1Cs d tCs /
Var.p t d t / Cov p t d t ;
Cov p t d t ;
r t C1Cs ;
sD0
sD0
(3.48)
which shows that the variance of the price-dividend ratio can be decomposed into the
covariance of price-dividend ratio with future dividend change minus the covariance of
price-dividend ratio with future returns. This expression highlights that if p t d t is not
constant, then it must forecast dividend growth and/or returns.
The evidence in Cochrane suggests that p t d t does not forecast future dividend
growth, so that forecastability of future returns explains the variability in the dividendprice ratio. This fits very well into the findings of the R2 of (3.47). To see that, recall the
97
following fact.
Remark 3.18 (R2 from a least squares regression) Let the least squares estimate of in
O The fitted values yO t D x 0 .
O If the regression equation includes a
y t D x t0 0 C u t be .
t
constant, then R2 D Corr .y t ; yO t /2 . In a simple regression where y t D a C bx t C u t ,
where x t is a scalar, R2 D Corr .y t ; x t /2 .
Return = a + b*log(E/P), R2
0.4
0.1
0.2
0.05
20
40
Return horizon (months)
60
20
40
Return horizon (months)
60
1
0
1
2
4
3
2
log(E/P), lagged
3.4.2
The evidence for US stock returns is that long-run returns may perhaps be predicted
by using dividend-price ratio or interest rates, but that the long-run autocorrelations are
weak (long run US stock returns appear to be weak-form efficient but not semi-strong
efficient). Both CLM 7.1.4 and Cochrane 20.1 use small models for discussing this
98
case. The key in these discussions is to make changes in dividends unforecastable, but
let the return be forecastable by some state variable (E t d t C1Cs
E t d t Cs D 0 and
E t r t C1 D r C x t ), but in such a way that there is little autocorrelation in returns. By
taking expectations of (3.46) we see that price-dividend will then reflect expected future
returns and therefore be useful for forecasting.
3.5
Security Analysts
Makridakis, Wheelwright, and Hyndman (1998) 10.1 shows that there is little evidence
that the average stock analyst beats (on average) the market (a passive index portfolio).
In fact, less than half of the analysts beat the market. However, there are analysts which
seem to outperform the market for some time, but the autocorrelation in over-performance
is weak.
The paper by Bondt and Thaler (1990) compares the (semi-annual) forecasts (oneand two-year time horizons) with actual changes in earnings per share (1976-1984) for
several hundred companies. The paper has regressions like
Actual change D C .forecasted change/ C residual,
and then studies the estimates of the and coefficients. With rational expectations (and
a long enough sample), we should have D 0 (no constant bias in forecasts) and D 1
(proportionality, for instance no exaggeration).
The main findings are as follows. The main result is that 0 < < 1, so that the
forecasted change tends to be too wild in a systematic way: a forecasted change of 1% is
(on average) followed by a less than 1% actual change in the same direction. This means
that analysts in this sample tended to be too extremeto exaggerate both positive and
negative news.
Barber, Lehavy, McNichols, and Trueman (2001) give a somewhat different picture.
They focus on the profitability of a trading strategy based on analysts recommendations.
They use a huge data set (some 360,000 recommendations, US stocks) for the period
1985-1996. They sort stocks in to five portfolios depending on the consensus (average)
recommendationand redo the sorting every day (if a new recommendation is published).
They find that such a daily trading strategy gives an annual 4% abnormal return on the
portfolio of the most highly recommended stocks, and an annual -5% abnormal return on
99
3.6
Further reading: Diebold (2001) 11; Stekler (1991); Diebold and Mariano (1995)
To do a solid evaluation of the forecast performance (of some forecaster/forecast
method/forecast institute), we need a sample (history) of the forecasts and the resulting
forecast errors. The reason is that the forecasting performance for a single period is likely
to be dominated by luck, so we can only expect to find systamtic patterns by looking at
results for several periods.
Let e t be the forecast error in period t
et D yt
yO t ;
(3.49)
where yO t is the forecast and y t the actual outcome. (Warning: some autors prefer to work
with yO t y t as the forecast error instead.)
Most statistical forecasting methods are based on the idea of minimizing the sum of
squared forecast errors, tTD1 e t2 . For instance, the least squares (LS) method picks the
100
regression coefficient in
y t D 0 C 1 x t C " t
(3.50)
to minimize the sum of squared residuals, tTD1 "2t . This will, among other things, give
a zero mean of the fitted residuals and also a zero correlation between the fitted residual
and the regressor.
Evaluation of a forecast often involve extending these ideas to the forecast method,
irrespective of whether a LS regression has been used or not. In practice, this means
studying if (i) the forecast error, e t , has a zero mean; (ii) the forecast error is uncorrelated
to the variables (information) used in constructing the forecast; and (iii) to compare the
sum (or mean) of squared forecasting errors of different forecast approaches. A non-zero
mean of the errors clearly indicates a bias, and a non-zero correlation suggests that the
information has not been used efficiently (a forecast error should not be predictable...)
Remark 3.19 (Autocorrelation of forecast errors ) Suppose we make one-step-ahead
forecasts, so we are forming a forecast of y t C1 based on what we know in period t.
Let e t C1 D y t C1 E t y t C1 , where E t y t C1 denotes our forecast. If the forecast error is unforecastable, then the forecast errors cannot be autocorrelated, for instance,
Corr.e tC1 ; e t / D 0. For two-step-ahead forecasts, the situation is a bit different. Let
e tC2;t D y tC2 E t y t C2 be the error of forecasting y t C2 using the information in period
t (notice: a two-step difference). If this forecast error is unforecastable using the information in period t, then the previously mentioned e t C2;t and e t;t 2 D y t E t 2 y t must
be uncorrelatedsince the latter is known when the forecast E t y t C2 is formed (assuming this forecast is efficient). However, there is nothing hat guarantees that e t C2;t and
e t C1;t 1 D y t C1 E t 1 y t C1 are uncorrectedsince the latter contains new information
compared to what was known when the forecast E t y t C2 was formed. This generalizes to
the following: an efficient h-step-ahead forecast error must have a zero correlation with
the forecast error h 1 (and more) periods earlier.
The comparison of forecast approaches/methods is not always a comparison of actual
forecasts. Quite often, it is a comparison of a forecast method (or forecasting institute)
with some kind of naive forecast like a no change or a random walk. The idea of such
a comparison is to study if the resources employed in creating the forecast really bring
value added compared to a very simple (and inexpensive) forecast.
101
Example 3.20 We want to compare the performance of the two forecast methods a and
b. We have the following forecast errors .e1a ; e2a ; e3a / D . 1; 1; 2/ and .e1b ; e2b ; e3b / D
. 1:9; 0; 1:9/. Both have zero means, so there is (in this very short sample) no constant
bias. The mean squared errors are
MSEa D . 1/2 C . 1/2 C 22 =3 D 2
MSEb D . 1:9/2 C 02 C 1:92 =3
2:41;
so forecast a is better according to the mean squared errors criterion. The mean absolute
errors are
MAEa D j 1j C j 1j C j2j=3
1:33
1:27;
so forecast b is better according to the mean absolute errors criterion. The reason for the
difference between these criteria is that forecast b has fewer but larger errorsand the
quadratic loss function punishes large errors very heavily. Counting the number of times
the absolute error (or the squared error) is smaller, we see that a is better one time (first
period), and b is better two times.
To perform formal tests of forecasting superiority a Diebold and Mariano (1995) test
is typically performed. For instance to compare the MSE of two methods (a and b), first
define
2
a 2
b
gt D et
et ;
(3.51)
where e ti is the forecasting error of model i. Treating this as a GMM problem, we then
test if
E g t D 0;
(3.52)
by applying a t-test on the same means
gN
Std.g/
N
(3.53)
and where the standard error is typically estimated using Newey-West (or similar) approach. Howecer, when models a and b are nested, then the asymptotic distribution is
non-normal so other critical values must be applied (see Clark and McCracken (2001)).
103
Other evaluation criteria can be used by changing (3.51). For instance, to test the mean
absolute errors, use g t D je ta j je tb j instead.
hp
i P
Remark 3.21 From GMM we typically have Cov
T g/
N D 1
sD 1 Cov .g t ; g t s /, so
P1
1=2
Std.g/
N D
.
sD 1 Cov .g t ; g t s / =T
3.7
As a way to calculate an upper bound on predictability, Lo and MacKinlay (1997) construct maximally predictable portfolios. The weights on the different assets in this portfolio can also help us to understand more about how the predictability works.
Let Z t be an n 1 vector of demeaned returns
E Rt ;
Zt D Rt
(3.54)
Z t C " t , where E t
" t D 0, Var t
0
1 ." t " t /
Z t such that
D :
(3.55)
Consider a portfolio 0 Z t . The R2 from predicting the return on this portfolio is (as
usual) the fraction of the variability of 0 Z t that is explained by 0 E t 1 Z t
Var. 0 " t /= Var. 0 Z t /
R2 . / D 1
D Var. 0 Z t /
D Var.
D
Et
Cov.E t
Z t /= Var. 0 Z t /
Zt / =
Cov.Z t / :
(3.56)
The covariance in the denominator can be calculated directly from data, but the covariance matrix in the numerator clearly depends on the forecasting model we use (to create
E t 1 Z t ).
The portfolio ( vector) that gives the highest R2 is the eigenvector (normalized to
sum to unity) associated with the largest eigenvalue (also the value of R2 ) of Cov.Z t / 1 Cov.E t
Example 3.22 (One forecasting variable) Suppose there is only one predictor, x t
Z t D x t
1,
C "t ;
104
Z t /.
1 /
Var.x t 1 / 0
:
0 Var.x
0 C 0
t 1 /
0
The first order conditions for maximum then gives (this is very similar to the calculations
of the minimum variance portfolio in mean-variance analysis)
D
=10
3.8
Technical Analysis
Main reference: Bodie, Kane, and Marcus (2002) 12.2; Neely (1997) (overview, foreign
exchange market)
Further reading: Murphy (1999) (practical, a believers view); The Economist (1993)
(overview, the perspective of the early 1990s); Brock, Lakonishok, and LeBaron (1992)
(empirical, stock market); Lo, Mamaysky, and Wang (2000) (academic article on return
distributions for technical portfolios)
3.8.1
Technical analysis is typically a data mining exercise which looks for local trends or
systematic non-linear patterns. The basic idea is that markets are not instantaneously efficient: prices react somewhat slowly and predictably to news. The logic is essentially that
an observed price move must be due to some news (exactly which is not very important)
and that old patterns can tell us where the price will move in the near future. This is an
attempt to gather more detailed information than that used by the market as a whole. In
practice, the technical analysis amounts to plotting different transformations (for instance,
105
a moving average) of pricesand to spot known patterns. This section summarizes some
simple trading rules that are used.
3.8.2
Many trading rules rely on some kind of local trend which can be thought of as positive
autocorrelation in price movements (also called momentum1 ).
A filter rule like buy after an increase of x% and sell after a decrease of y% is
clearly based on the perception that the current price movement will continue.
A moving average rule is to buy if a short moving average (equally weighted or exponentially weighted) goes above a long moving average. The idea is that event signals
a new upward trend. Let S (L)be the lag order of a short (long)moving average , with
S < L and let b be a bandwidth (perhaps 0.01). Then, a MA rule for period t could be
2
3
buy in t if MA t 1 .S/ > MA t 1 .L/.1 C b/
6
7
(3.57)
4 sell in t if MA t 1 .S/ < MA t 1 .L/.1 b/ 5 , where
no change
MA t
otherwise
1 .S/
D .p t
C : : : C pt
S /=S:
The difference between the two moving averages is called an oscillator (or sometimes,
moving average convergence divergence2 ). A version of the moving average oscillator is
the relative strength index3 , which is the ratio of average price level on up days to the
average price on down daysduring the last z (14 perhaps) days.
The trading range break-out rule typically amounts to buying when the price rises
above a previous peak (local maximum). The idea is that a previous peak is a resistance
level in the sense that some investors are willing to sell when the price reaches that value
(perhaps because they believe that prices cannot pass this level; clear risk of circular
reasoning or self-fulfilling prophecies; round numbers often play the role as resistance
levels). Once this artificial resistance level has been broken, the price can possibly rise
substantially. On the downside, a support level plays the same role: some investors are
willing to buy when the price reaches that value. To implement this, it is common to let
1
106
the resistance/support levels be proxied by minimum and maximum values over a data
window of length L. With a bandwidth b (perhaps 0.01), the rule for period t could be
2
3
buy in t if P t > M t 1 .1 C b/
6
7
(3.58)
4 sell in t if P t < m t 1 .1 b/ 5 , where
no change
Mt
mt
otherwise
D max.p t
D min.p t
1; : : : ; pt S /
1 ; : : : ; p t S /:
When the price is already trending up, then the trading range break-out rule may be
replaced by a channel rule, which works as follows. First, draw a trend line through
previous lows and a channel line through previous peaks. Extend these lines. If the price
moves above the channel (band) defined by these lines, then buy. A version of this is to
define the channel by a Bollinger band, which is 2 standard deviations from a moving
data window around a moving average.
A head and shoulder pattern is a sequence of three peaks (left shoulder, head, right
shoulder), where the middle one (the head) is the highest, with two local lows in between
on approximately the same level (neck line). (Easier to draw than to explain in a thousand
words.) If the price subsequently goes below the neckline, then it is thought that a negative
trend has been initiated. (An inverse head and shoulder has the inverse pattern.)
Clearly, we can replace buy in the previous rules with something more aggressive,
for instance, replace a short position with a long.
The trading volume is also often taken into account. If the trading volume of assets
with declining prices is high relative to the trading volume of assets with increasing prices
is high, then this is interpreted as a market with selling pressure. (The basic problem with
this interpretation is that there is a buyer for every seller, so we could equally well interpret
the situations as if there is a buying pressure.)
3.8.3
(using Pearsons 2 test and the Kolmogorov-Smirnov test). They reject that hypothesis
in many cases, using daily data (19621996) for around 50 (randomly selected) stocks.
See Figures 3.143.15 for an illustration.
3.8.4
If we instead believe in mean reversion of the prices, then we can essentially reverse the
previous trading rules: we would typically sell when the price is high.
Some investors argue that markets show periods of mean reversion and then periods
with trendsan that both can be exploited. Clearly, the concept of a support and resistance levels (or more generally, a channel) is based on mean reversion between these
points. A new trend is then supposed to be initiated when the price breaks out of this
band.
Inverted MA rule, S&P 500
1350
MA(3) and MA(25), bandwidth 0.01
1300
1250
1200
1150
Jan
Long MA ()
Long MA (+)
Short MA
Feb
Mar
Apr
1999
3.9
References: Ferson, Sarkissian, and Simin (2003), Goyal and Welch (2008), and Campbell and Thompson (2008)
108
0.6
Std
1.16
0.6
0.4
0.4
0.2
0.2
0
3
0
Return
0
3
Mean
0.06
Std
1.72
0
Return
0.6
Std
0.93
0.4
0.4
0.2
0.2
0
3
0
Return
0
3
Mean
0.01
Std
0.90
0
Return
Spurious Regressions
Ferson, Sarkissian, and Simin (2003) argue that many prediction equations suffer from
spurious regression featuresand that data mining tends to make things even worse.
Their simulation experiment is based on a simple model where the return predictions
are
r t C1 D C Z t C v t C1 ;
(3.59)
where Z t is a regressor (predictor). The true model is that returns follows the process
r t C1 D
C Z t C u t C1 ;
(3.60)
where the residual is white noise. In this equation, Z t represents movements in expected
109
SMI
Rule
2
1990
1995
2000
Year
2005
2010
1990
1995
2000
Year
2005
2010
6
4
2
1990
1995
2000
Year
2005
2010
Figure 3.16: Examples of trading rules applied to SMI. The rule portfolios are rebalanced
every Wednesday: if condition (see figure titles) is satisfied, then the index is held for the
next week, otherwise a government bill is held. The figures plot the portfolio values.
returns. The predictors follow a diagonal VAR(1)
" # "
#"
# " #
" #!
Zt
0
Zt 1
"t
"t
D
C
, with Cov
D :
Zt
0
Zt 1
"t
"t
(3.61)
In the case of a pure spurious regression, the innovations to the predictors are uncorrelated ( is diagonal). In this case, ought to be zeroand their simulations show that
the estimates are almost unbiased. Instead, there is a problem with the standard deviation
O If
of .
is high, then the returns will be autocorrelated.
Under the null hypothesis of D 0, this autocorrelated is loaded onto the residuals.
For that reason, the simulations use a Newey-West estimator of the covariance matrix
(with an automatic choice of lag order). This should, ideally, solve the problem with the
110
Autocorrelation of xt ut
Model: yt = 0.9xt + t ,
where t = t1 + ut , ut is iid N
xt = xt1 + t , t is iid N
= 0.9
=0
= 0.9
0.5
0
0.5
0.5
0.5
0.05
0
0.5
0.5
0.5
0.5
0.05
0
0.5
0.5
2.7 (to be compared with the nominal value of 1.96). Since the point estimates are almost
unbiased, the interpretation is that the standard deviations are underestimated. In contrast,
with low autocorrelation and/or low importance of Z t , the standard deviations are much
more in line with nominal values.
See Figures 3.173.18 for an illustration. They show that we need a combination of an
autocorrelated residuals and an autocorrelated regressor to create a problem for the usual
LS formula for the standard deviation of a slope coefficient. When the autocorrelation is
very high, even the Newey-West estimator is likely to underestimate the true uncertainty.
To study the interaction between spurious regressions and data mining, Ferson, Sarkissian,
and Simin (2003) let Z t be chosen from a vector of L possible predictorswhich all are
generated by a diagonal VAR(1) system as in (3.61) with uncorrelated errors. It is assumed that the researchers choose Z t by running L regressions, and then picks the one
with the highest R2 . When
D 0:15 and the researcher chooses between L D 10
predictors, the simulated 5% critical value is 3.5. Since this does not depend on the importance of Z t , it is interpreted as a typical feature of data mining, which is bad enough.
When the autocorrelation is 0.95, then the importance of Z t start to become important
spurious regressions interact with the data mining to create extremely high simulated
critical values. A possible explanation is that the data mining exercise is likely to pick out
the most autocorrelated predictor, and that a highly autocorrelated predictor exacerbates
the spurious regression problem.
3.9.2
Goyal and Welch (2008) find that the evidence of predictability of equity returns disappears when out-of-sample forecasts are considered. Campbell and Thompson (2008)
claim that there is still some out-of-sample predictability, provided we put restrictions on
the estimated models.
Campbell and Thompson (2008) first report that only few variables (earnings price
ratio, T-bill rate and the inflation rate) have significant predictive power for one-month
stock returns in the full sample (18712003 or early 1920s2003, depending on predictor).
To gauge the out-of-sample predictability, they estimate the prediction equation using
data up to and including t 1, and then make a forecast for period t. The forecasting
performance of the equation is then compared with using the historical average as the
predictor. Notice that this historical average is also estimated on data up to an including
112
t 1, so it changes over time. Effectively, they are comparing the forecast performance
of two models estimated in a recursive way (long and longer sample): one model has just
an intercept, the other has also a predictor. The comparison is done in terms of the RMSE
and an out-of-sample R2
2
ROS
D1
XT
t Ds
.r t
rOt /2 =
XT
tDs
.r t
rNt /2 ;
(3.62)
where s is the first period with an out-of-sample forecast, rOt is the forecast based on the
prediction model (estimated on data up to and including t 1) and rNt is the historical
average (also estimated on data up to and including t 1).
The evidence shows that the out-of-sample forecasting performance is very weakas
claimed by Goyal and Welch (2008).
It is argued that forecasting equations can easily give strange results when they are
estimated on a small data set (as they are early in the sample). They therefore try different
restrictions: setting the slope coefficient to zero whenever the sign is wrong, setting
the prediction (or the historical average) to zero whenever the value is negative. This
improves the results a bitalthough the predictive performance is still weak.
See Figure 3.19 for an illustration.
RMSE, E/P regression vs MA
0.2
0.2
0.19
0.19
0.18
0.18
0.17
0.17
0.16
0.16
113
Bibliography
Barber, B., R. Lehavy, M. McNichols, and B. Trueman, 2001, Can investors profit from
the prophets? Security analyst recommendations and stock returns, Journal of Finance, 56, 531563.
Bodie, Z., A. Kane, and A. J. Marcus, 2002, Investments, McGraw-Hill/Irwin, Boston,
5th edn.
Bondt, W. F. M. D., 1991, What do economists know about the stock market?, Journal
of Portfolio Management, 17, 8491.
Bondt, W. F. M. D., and R. H. Thaler, 1990, Do security analysts overreact?, American
Economic Review, 80, 5257.
Boni, L., and K. L. Womack, 2006, Analysts, industries, and price momentum, Journal
of Financial and Quantitative Analysis, 41, 85109.
Brock, W., J. Lakonishok, and B. LeBaron, 1992, Simple technical trading rules and the
stochastic properties of stock returns, Journal of Finance, 47, 17311764.
Brockwell, P. J., and R. A. Davis, 1991, Time series: theory and methods, Springer Verlag,
New York, second edn.
Campbell, J. Y., and J. H. Cochrane, 1999, By force of habit: a consumption-based
explanation of aggregate stock market behavior, Journal of Political Economy, 107,
205251.
Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial
markets, Princeton University Press, Princeton, New Jersey.
Campbell, J. Y., and S. B. Thompson, 2008, Predicting the equity premium out of sample: can anything beat the historical average, Review of Financial Studies, 21, 1509
1531.
Campbell, J. Y., and L. M. Viceira, 1999, Consumption and portfolio decisions when
expected returns are time varying, Quarterly Journal of Economics, 114, 433495.
114
Chance, D. M., and M. L. Hemler, 2001, The performance of professional market timers:
daily evidence from executed strategies, Journal of Financial Economics, 62, 377
411.
Clark, T. E., and M. W. McCracken, 2001, Tests of equal forecast accuracy and encompassing for nested models, Journal of Econometrics, 105, 85110.
Cochrane, J. H., 2005, Asset pricing, Princeton University Press, Princeton, New Jersey,
revised edn.
Diebold, F. X., 2001, Elements of forecasting, South-Western, 2nd edn.
Diebold, F. X., and R. S. Mariano, 1995, Comparing predcitve accuracy, Journal of
Business and Economic Statistics, 13, 253265.
Epstein, L. G., and S. E. Zin, 1991, Substitution, risk aversion, and the temporal behavior
of asset returns: an empirical analysis, Journal of Political Economy, 99, 263286.
Ferson, W. E., S. Sarkissian, and T. T. Simin, 2003, Spurious regressions in financial
economics, Journal of Finance, 57, 13931413.
Goyal, A., and I. Welch, 2008, A comprehensive look at the empirical performance of
equity premium prediction, Review of Financial Studies 2008, 21, 14551508.
Granger, C. W. J., 1992, Forecasting stock market prices: lessons for forecasters, International Journal of Forecasting, 8, 313.
Huberman, G., and S. Kandel, 1987, Mean-variance spanning, Journal of Finance, 42,
873888.
Leitch, G., and J. E. Tanner, 1991, Economic forecast evaluation: profit versus the conventional error measures, American Economic Review, 81, 580590.
Lo, A. W., and A. C. MacKinlay, 1997, Maximizing predictability in the stock and bond
markets, Macroeconomic Dynamics, 1, 102134.
Lo, A. W., H. Mamaysky, and J. Wang, 2000, Foundations of technical analysis: computational algorithms, statistical inference, and empirical implementation, Journal of
Finance, 55, 17051765.
115
116
4.1
4.1.1
Heteroskedasticity
Descriptive Statistics of Heteroskedasticity (Realized Volatility)
2
t
1X 2
u
D
q sD1 t
D .u2t
C u2t
(4.1)
C : : : C u2t q /=q;
where the latest q observations are used. Notice that t2 depends on lagged information,
and could therefore be thought of as the prediction (made in t 1) of the volatility in t.
Unfortunately, this method can produce quite abrupt changes in the estimate.
See Figures 4.14.2 for illustrations.
An alternative is to apply an exponential moving average (EMA) estimator of volatility, which uses all data points since the beginning of the samplebut where recent observations carry larger weights. The weight for lag s be .1
/ s where 0 < < 1,
so
2
t
D .1
1
X
sD1
s 1 2
ut s
D .1
/.u2t
C u2t
2 2
ut 3
C : : :/;
(4.2)
117
0.04
0.04
0.035
0.03
0.02
0.03
Fri
12
Hour
18
0.05
0.1
0.05
2000
2005
2010
0.05
0.05
2000
2005
2005
2010
0.1
2000
2010
2000
2005
2010
118
50
50
40
40
30
30
20
20
10
10
0
1990
1995
2000
2005
0
1990
2010
1995
2000
2005
2010
/u2t
D .1
(4.3)
2
t 1:
The initial value (before the sample) could be assumed to be zero or (better) the unconditional variance in a historical sample. The EMA is commonly used by practitioners. For
instance, the RISK Metrics (formerly part of JP Morgan) uses this method with D 0:94
for use on daily data. Alternatively, can be chosen to minimize some criterion function
2 2
like tTD1 .u2t
t / .
Remark 4.1 (VIX) Although VIX is based on option prices, it is calculated in a way
that makes it (an estimate of) the risk-neutral expected variance until expiration, not the
implied volatility, see Britten-Jones and Neuberger (2000) and Jiang and Tian (2005).
See Figure 4.3 for an example.
We can also estimate the realized covariance of two series (ui t and ujt ) by
q
ij;t
1X
ui;t s uj;t
D
q sD1
D .ui;t
1 uj;t 1
C ui;t
2 uj;t 2
C : : : C ui;t
q uj;t q /=q;
(4.4)
D .1
/ui;t
1 uj;t 1
ij;t 1 :
(4.5)
0.5
0.5
0.5
2000
2005
2010
0.5
2000
2005
2010
0.5
0
0.5
2000
2005
2010
0.5
Corr
Corr
1995
2000
2005
2010
44day window
0.5
1995
2000
2005
2010
4.1.2
(4.6)
where the variance swap rate (also called the strike or forward price for ) is agreed on at
inception (t) and the realized volatility is just the sample variance for the swap period.
Both rates are typically annualized, for instance, if data is daily and includes only trading
days, then the variance is multiplied by 252 or so (as a proxy for the number of trading
days per year).
A volatility swap is similar, except that the payoff it is expressed as the difference
between the standard deviations instead of the variances
Volatility swap payoff t Cm =
p
realized variance t Cm
(4.7)
If we use daily data to calculate the realized variance from t until the expiration(RV tCm ),
then
252 Pm
RV t Cm D
R2 ;
(4.8)
m sD1 tCs
where R t Cs is the net return on day t C s. (This formula assumes that the mean return is
zerowhich is typically a good approximation for high frequency data. In some cases,
the average is taken only over m 1 days.)
Notice that both variance and volatility swaps pays off if actual (realized) volatility
between t and t C m is higher than expected in t . In contrast, the futures on the VIX pays
off when the expected volatility (in t C m) is higher than what was thought in t. In a way,
we can think of the VIX futures as a futures on a volatility swap (between t C m and a
month later).
Since VIX2 is a good approximation of variance swap rate for a 30-day contract, the
return can be approximated as
Return of a variance swap t Cm D .RV t Cm
V IX t2 /=V IX t2 :
(4.9)
Figures 4.6 and 4.7 illustrate the properties for the VIX and realized volatility of the
S&P 500. It is clear that the mean return of a variance swap (with expiration of 30
121
1992
1994
1996
1998
2000
2002
2004
2006
2008
2010
2012
Implied volatility from options (iv) should contain information about future volatilityas
is therefore often used as a predictor. It is unclear, however, if the iv is more informative
than recent (actual) volatility, especially since they are so similarsee Figure 4.6.
Table 4.1 shows that the iv (here represented by VIX) is close to be an unbiased
predictor of future realized volatility since the slope coefficient is close to one. However,
the intercept is negative, which suggests that the iv overestimate future realized volatility.
122
1.4
1.2
1
0.8
0.6
0.4
0.2
0
1
0.5
0.5
1.5
2.5
123
(1)
lagged RV
R2
obs
(3)
0:91
.11:88/
2:77
. 2:06/
0:62
5344:00
0:28
.2:15/
0:62
.6:74/
1:25
. 1:55/
0:63
5324:00
0:76
.10:48/
lagged VIX
constant
(2)
3:77
.3:87/
0:58
5324:00
Table 4.1: Regression of 22-day realized S&P return volatility 1990:1-2011:5. All daily
observations are used, so the residuals are likely to be autocorrelated. Numbers in parentheses are t-stats, based on Newey-West with 30 lags.
Corr(EUR,GBP)
lagged Corr(EUR,GBP)
Corr(EUR,CHF)
Corr(EUR,JPY)
0:91
.28:20/
lagged Corr(EUR,CHF)
0:89
.23:40/
lagged Corr(EUR,JPY)
constant
0:05
.2:93/
0:85
151:00
R2
obs
0:08
.2:69/
0:81
151:00
0:82
.16:17/
0:05
.2:46/
0:67
151:00
Table 4.2: Regression of monthly realized correlations 1998:1-2010:8. All exchange rates
are against the USD. The monthly correlations are calculated from 5 minute data. Numbers in parentheses are t-stats, based on Newey-West with 1 lag.
4.1.4
(4.10)
In the standard case we assume that " t is iid (independently and identically distributed),
which rules out heteroskedasticity.
124
RV(EUR)
lagged RV(EUR)
RV(GBP)
RV(CHF)
0:62
.7:38/
lagged RV(GBP)
0:73
.10:63/
lagged RV(CHF)
0:47
.5:87/
lagged RV(JPY)
constant
D(Tue)
D(Wed)
D(Thu)
D(Fri)
R2
obs
RV(JPY)
0:11
.3:15/
0:04
.2:76/
0:07
.4:25/
0:08
.4:79/
0:09
.3:68/
0:39
3303:00
0:07
.2:47/
0:01
.0:98/
0:06
.3:65/
0:06
.3:05/
0:04
.1:96/
0:53
3303:00
0:20
.4:43/
0:05
.2:56/
0:06
.2:66/
0:10
.4:64/
0:10
.5:23/
0:23
3303:00
0:59
.4:75/
0:18
.2:39/
0:02
.0:72/
0:07
.2:21/
0:10
.2:16/
0:07
.1:94/
0:35
3303:00
Table 4.3: Regression of daily realized variance 1998:1-2010:8. All exchange rates are
against the USD. The daily variances are calculated from 5 minute data. Numbers in
parentheses are t-stats, based on Newey-West with 1 lag.
In case the residuals actually are heteroskedasticity, least squares (LS) is nevertheless
a useful estimator: it is still consistent (we get the correct values as the sample becomes
really large)and it is reasonably efficient (in terms of the variance of the estimates).
However, the standard expression for the standard errors (of the coefficients) is (except in
a special case, see below) not correct. This is illustrated in Figure 4.9.
There are two ways to handle this problem. First, we could use some other estimation
method than LS that incorporates the structure of the heteroskedasticity. For instance,
combining the regression model (4.10) with an ARCH structure of the residualsand
estimate the whole thing with maximum likelihood (MLE) is one way. As a by-product
we get the correct standard errors provided, of course, the assumed distribution is correct. Second, we could stick to OLS, but use another expression for the variance of the
125
20
20
10
10
10
10
20
20
10
0
x
10
10
0
x
10
y = 0.03 + 1.3x + u
Solid regression lines are based on all data,
dashed lines exclude the crossed out data point
(4.11)
are zero. (This can be done
Example 4.4 (Whites test) If the regressors include .1; x1t ; x2t / then w t in (4.11) is the
2
2
vector (1; x1t ; x2t ; x1t
; x1t x2t ; x2t
).
126
0.09
0.08
0.07
0.06
0.05
0.04
Model: yt = 0.9xt + t ,
where t N (0, ht ), with ht = 0.5exp(x2t )
0.03
0.02
0.01
0
0.5
0.4
0.3
0.2
0.1
0.1
0.2
0.3
model, and then test if the following additional (ARCH related) moment conditions are
satisfied at those parameters
2
3
u2t 1
6 :
7 2
7 .u
::
E6
(4.13)
4
5 t a0 / D 0 q 1 :
u2t q
An alternative test (see Harvey (1989) 259260), is to apply a Box-Ljung test on uO 2t , to
see if the squared fitted residuals are autocorrelated. We just have to adjust the degrees of
freedom in the asymptotic chi-square distribution by subtracting the number of parameters
estimated in the regression equation. These tests for ARCH effects will typically capture
GARCH (see below) effects as well.
4.2
ARCH Models
(4.14)
We will study different ways of modelling how the volatility of the residual is autocorrelated.
4.2.1
Properties of ARCH(1)
In the ARCH(1) model the residual in the regression equation (4.14) can be written
ut D vt
with v t
iid with E t
v t D 0 and E t
v t2 D 1;
(4.15)
(4.16)
50
50
40
40
30
30
20
20
10
10
0
1980
1990
2000
2010
0
1980
1990
2000
2010
129
distribution with the same variance (excess kurtosis). In fact, we have the kurtosis
(
2
3 11 3
if denominator is positive
E u4t
2 > 3
D
(4.18)
2 2
.E u t /
1
otherwise.
As a comparison, the kurtosis of a normal distribution is 3. This means that we can
expected u t to have fat tails, but that the standardized residuals u t = t perhaps look more
normally distributed. See Figure 4.12 for an illustration (although absed on a GARCH
model).
Example 4.5 (Kurtosis) With D 1=3, the kurtosis is 4, at D 0:5 it is 9 and at D 0:6
it is infinite.
Proof. (of (4.18)) Since v t and t are independent, we have E.u2t / D E.v t2 t2 / D
E t2 and E.u4t / D E.v t4 t4 / D E. t4 / E.v t4 / D E. t4 /3, where the last equality follows
from E.v t4 / D 3 for a standard normal variable. To find E. t4 /, square (4.16) and take
expectations (and use E t2 D !=.1 /)
E
4
t
D ! 2 C 2 E u4t
D ! C E.
2
4
t
C 2! E u2t
4
t /3
2
C 2! =.1
/, so
!
1C
:
1 3 2 .1 /
/2 gives (4.18).
T
ln .2 /
2
1X
ln
2 t D1
2
t
1 X u2t
if v t is iid N.0; 1/:
2 t D1 t2
(4.19)
130
By plugging in (4.14) for u t and (4.16) for t2 , the likelihood function is written in terms
of the data and model parameters. The likelihood function is then maximized with respect
to the parameters. Note that we need a starting value of 12 D ! C u20 . The most
convenient (and common) way is to maximize the likelihood function conditional on a y0
and x0 . That is, we actually have a sample from (t D) 0 to T , but observation 0 is only
used to construct a starting value of 12 . The optimization should preferably impose the
constraints in (4.16). The MLE is consistent.
N. ; 2 /) The pdf of an x t
1
1 .x t
/2
pdf .x t / D p
exp
;
2
2
2 2
N. ;
/ is
so the log-likelihood is
ln L t D
1
ln .2 /
2
1
ln
2
1 .x t
2
/2
2
If x t and xs are independent (uncorrelated if normally distributed), then the joint pdf
is the product of the marginal pdfsand the joint log-likelihood is the sum of the two
likelihoods.
Remark 4.7 (Coding the ARCH(1) ML estimation) A straightforward way of coding the
estimation problem (4.14)(4.16) and (4.19) is as follows.
First, guess values of the parameters b (a vector), and !, and . The guess of b can be
taken from an LS estimation of (4.14), and the guess of ! and from an LS estimation of
uO 2t D ! C uO 2t 1 C " t where uO t are the fitted residuals from the LS estimation of (4.14).
Second, loop over the sample (first t D 1, then t D 2, etc.) and calculate uO t from (4.14)
and t2 from (4.16). Plug in these numbers in (4.19) to find the likelihood value.
Third, make better guesses of the parameters and do the second step again. Repeat until
the likelihood value converges (at a maximum).
Remark 4.8 (Imposing parameter constraints on ARCH(1)) To impose the restrictions in
(4.16) when the previous remark is implemented, iterate over values of .b; !;
Q /
Q and let
2
! D !Q and D exp.a/=1
Q
C exp.a/.
Q
It is often found that the fitted normalized residuals, uO t = t , still have too fat tails
compared with N.0; 1/: Estimation using other likelihood functions, for instance, for a
131
t-distribution can then be used. Or the estimation can be interpreted as a quasi-ML (is
typically consistent, but requires different calculation of the covariance matrix of the parameters).
Another possibility is to estimate the model by GMM using, for instance, the following moment conditions
2
3
xt ut
6
7
2
E 4 u2t
(4.20)
5 D 0.kC2/ 1 ;
t
2
u2t 1 .u2t
t /
where u t and t2 are given by (4.14) and (4.16).
It is straightforward to add more lags to (4.16). For instance, an ARCH.p/ would be
2
t
D ! C 1 u2t
C : : : C p u2t
(4.21)
p:
We then have to add more moment conditions to (4.20), but the form of the likelihood
function is the same except that we now need p starting values and that the upper boundary constraint should now be jpD1 j 1.
4.3
GARCH Models
Instead of specifying an ARCH model with many lags, it is typically more convenient to
specify a low-order GARCH (Generalized ARCH) model. The GARCH(1,1) is a simple
and surprisingly general model where
2
t
D ! C u2t
2
t 1 ,with
! > 0; ;
0; and C < 1;
(4.22)
2
t Cs
D N 2 C . C /s
2
t C1
N 2 , with N 2 D
(4.23)
132
50
40
40
30
30
20
20
10
10
0
1980
1990
2000
0
1980
2010
1990
2000
2010
50
40
30
20
10
0
1980
1990
2000
2010
1
0
1
2
3
3
Standardized residuals: ut /
2
1
0
1
2
Quantiles from N(0,1), %
Empirical quantiles
2
1
0
1
2
3
3
Standardized residuals: ut /t
2
1
0
1
2
Quantiles from N(0,1), %
133
Std
40
Coef Std err
20
0
1995
2000
2005
2010
0.032
0.005
0.086
0.008
0.897
0.009
D .1
/u2t
2
t 1:
This methods is commonly used by practitioners. For instance, the RISK Metrics uses
this method with D 0:94. Clearly, plays the same type of role as in (4.22) and
1
as . The main differences are that the exponential moving average does not have
a constant and volatility is non-stationary (the coefficients sum to unity). See Figure 4.11
for a comparison.
The kurtosis of the process is
(
/2
3 1 12.C
if denominator is positive
E u4t
2 .C / > 3
D
.E u2t /2
1
otherwise.
(4.24)
134
Proof. (of (4.24)) Since v t and t are independent, we have E.u2t / D E.v t2 t2 / D E t2
and E.u4t / D E.v t4 t4 / D E. t4 / E.v t4 / D E. t4 /3, where the last equality follows from
E.v t4 / D 3 for a standard normal variable. We also have E.u2t t2 / D E t4
E
4
t
D E.! C u2t
D ! 2 C 2 E u4t
D ! 2 C 2 E.
D
C
1
2
2
t 1/
C 2 E
4
t /3
C 2 E
! 2 C 2!. C / E t2
:
1 2 2 . C 2 /2
4
t 1
4
t
C 2! E u2t
C 2! E
2
t
C 2! E
C 2! E
2
t
2
t 1
C 2 E.u2t
C 2 E
2
1 t 1/
4
t
2
t
t 1
D ! C u2t 1 C ! C u2t 2 C
D ! .1 C / C u2t
:
D ::
C u2t
2
t 2
C 2
2
t 2
VaR95%
136
GARCH std, %
5
10
4
3
2
1
0
1980
1990
2000
2010
0
1980
1990
2000
2010
0.4
0.3
0.2
0.1
0
4
4.4
Non-Linear Extensions
A very large number of extensions of the basic GARCH model have been suggested.
Estimation is straightforward since MLE is done as for any other GARCH modeljust
the specification of the variance equation differs.
An asymmetric GARCH (Glosten, Jagannathan, and Runkle (1993)) can be constructed as
(
1 if q is true
2
2
2
2
(4.27)
t D ! C u t 1 C t 1 C .u t 1 > 0/u t 1 , where .q/ D
0 else.
This means that the effect of the shock u2t 1 is if the shock was negative and C if
the shock was positive. With < 0, volatility increases more in response to a negative
u t 1 (bad news) than to a positive u t 1 .
137
Empirical Prob(R<VaR)
0.08
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
0.02
0.04
0.06
Theoretical Prob(R<VaR)
0.08
0.1
Figure 4.16: Backtesting VaR from a GARCH model, assuming normally distributed
shocks
The EGARCH (exponential GARCH, Nelson (1991)) sets
ln
2
t
D!C
ju t
1j
t 1
C ln
2
t 1
ut
(4.28)
t 1
Apart from being written in terms of the log (which is a smart trick to make t2 > 0 hold
without any restrictions on the parameters), this is an asymmetric model. The ju t 1 j term
is symmetric: both negative and positive values of u t 1 affect the volatility in the same
way. The linear term in u t 1 modifies this to make the effect asymmetric. In particular,
if < 0, then the volatility increases more in response to a negative u t 1 (bad news)
than to a positive u t 1 .
Hentschel (1995) estimates several models of this type, as well as a very general
formulation on daily stock index data for 1926 to 1990 (some 17,000 observations). Most
standard models are rejected in favour of a model where t depends on t 1 and ju t 1
bj3=2 .
138
4.5
A stochastic volatility model differs from GARCH models by making the volatility truly
stochastic. Recall that in a GARCH model, the volatility in period t ( t ) is know already
in t 1. This is not the case in a stochastic volatility model where the log volatility follows
an ARMA process. The simplest case is the AR(1) formulation
ln
2
t
D ! C ln
2
t 1
C t , with t
(4.29)
i idN.0; 1/;
2
t
C .ln v t2
E ln v t2 /:
(4.30)
We could use this as the measurement equation in a Kalman filter (pretending that ln v t2
E ln v t2 is normally distributed), and (4.29) as the state equation. (The Kalman filter is a
convenient way to calculate the likelihood function.) In essence, this is an AR(1) model
with noisy observations.
If ln v t2 is normally distributed , then this will give MLE, otherwise just a quasi-MLE.
For instance, if v t is i idN.0; 1/ (see Ruiz (1994)) then we have approximately E ln v t2
1:27 and Var.ln v t2 / D 2 =2 (with D 3:14:::) so we could write the measurement
equation as
ln u2t D
1:27 C ln
2
t
C w t with (assuming) w t
N.0;
=2/:
(4.31)
In this case, only the state equation contains parameters that we need to estimate: !; ; .
See Figure 4.17 for an example.
139
std, annualized
40
30
20
10
0
1980
1985
1990
1995
2000
2005
2010
4.6
(G)ARCH-M
It can make sense to let the conditional volatility enter the mean equationfor instance,
as a proxy for risk which may influence the expected return.
Example 4.12 (Mean-variance portfolio choice) A mean variance investor solves
max E Rp
2
p k=2;
Rp D Rm C .1
subject to
/Rf ;
where Rm is the return on the risky asset (the market index) and Rf is the riskfree return.
The solution is
1 E.Rm Rf /
D
:
2
k
m
In equilibrium, this weight is one (since the net supply of bonds is zero), so we get
E.Rm
Rf / D k
2
m;
which says that the expected excess return is increasing in both the market volatility and
risk aversion (k).
140
40
30
20
10
0
1980
1990
2000
2010
2
t
C u t ; E.u t jx t ;
t/
D 0:
(4.32)
4.7
4.7.1
Multivariate (G)ARCH
Different Multivariate Models
This section gives a brief summary of some multivariate models of heteroskedasticity. Let
the model (4.14) be a multivariate model where y t and u t are n 1 vectors. We define
141
1) covariance matrix of u t as
t D Et
(4.33)
u t u0t :
0
1ut 1/
C Bvech. t
1 /:
(4.34)
This typically gives too many parameters to handleand makes it difficult to impose
sufficient restrictions to make t is positive definite (compare the restrictions of positive
coefficients in (4.22)).
Example 4.14 (vech formulation, n D 2) For instance, with n D 2 we have
2 2
2
2
3
3
3
u1;t 1
11;t
11;t 1
6
7
6
7
6
7
4 21;t 5 D C C A 4 u1;t 1 u2;t 1 5 C B 4 21;t 1 5 ;
22;t
u22;t
22;t 1
11;t 1
21;t 1
3
7
5;
22;t 1
0
1ut 1A
C B 0t
1 B;
(4.35)
where C is symmetric and A and B are n n matrices. Notice that this equation is
specified in terms of t , not vech. t /. Recall that a quadratic form positive definite,
provided the matrices are of full rank.
Example 4.16 (BEKK model, n D 2) With n D 2 we have
#0 "
"
# "
# "
c
c
a
a
u21;t 1
u1;t 1 u2;t
11;t
12;t
11
12
11
12
D
C
c12 c22
a21 a22
u1;t 1 u2;t 1
u22;t 1
12;t
22;t
"
#0 "
#"
#
b11 b12
b11 b12
11;t 1
12;t 1
;
b21 b22
b21 b22
12;t 1
22;t 1
#"
#
a
a
1
11
12
C
a21 a22
Remark 4.19 (Estimating the constant correlation model) A quick (and dirty) method
for estimating is to first estimate the individual GARCH processes and then estimate the
p
p
correlation of the standardized residuals u1t = 11;t and u2t = 22;t .
The Dynamic Correlation Model
The dynamic correlation model (see Engle (2002) and Engle and Sheppard (2001)) allows
the correlation to change over time. In short, the model assumes that each conditional
variance follows a univariate GARCH process and the conditional correlation matrix is
(essentially) allowed to follow a univariate GARCH equation.
The conditional covariance matrix is (by definition)
p
t D D t R t D t , with D t D diag.
(4.36)
i i;t /;
/QN C v t
0
1vt 1
C Q t
1,
i i;t ;
(4.37)
where and are two scalars and QN is the unconditional correlation matrix of the normalized residuals (vi;t ). To guarantee that the conditional correlation matrix is indeed a
correlation matrix, Q t is treated as if it where a covariance matrix and R t is simply the
implied correlation matrix. That is,
R t D diag
qi i;t
Q t diag
qi i;t
(4.38)
144
The basic idea of this model is to estimate a conditional correlation matrix as in (4.38)
and then scale up with conditional variances (from univariate GARCH models) to get a
conditional covariance matrix as in (4.36).
See Figure 4.19 for an illustrationwhich also suggest that the correlation is close to
what an EWMA method delivers. The DCC model is used in a study of asset pricing in,
for instance, Duffee (2005).
Example 4.21 (Dynamic correlation model, n D 2) With n D 2 the covariance matrix
t is
"
# "p
#"
# "p
#
0
1
0
11;t
12;t
11;t
12;t
11;t
D
;
p
p
0
1
0
12;t
22;t
22;t
12;t
22;t
and each of 11t and 22t follows a GARCH process. To estimate the dynamic correlations, we first calculate (where and are two scalars)
"
#
"
#
"
#
"
#"
#0
q11;t 1 q12;t 1
q11;t q12;t
1 qN 12
v1;t 1 v1;t 1
C
;
D .1 /
C
q12;t 1 q22;t 1
q12;t q22;t
qN 12 1
v2;t 1 v2;t 1
p
where vi;t 1 D ui;t 1 = i i;t 1 and qN ij is the unconditional correlation of vi;t and vj;t
and we get the conditional correlations by
"
# "
#
p
q
q
1
1
q
=
11;t 22;t
12;t
12;t
D
:
p
q12;t = q11;t q22;t
1
1
12;t
Assuming a GARCH(1,1) as in (4.22) gives 9 parameters (2
.qN 12 ; ; /).
4.7.2
3 GARCH parameters,
In principle, it is straightforward to specify the likelihood function of the model and then
maximize it with respect to the model parameters. For instance, if u t is iid N.0; t /, then
the log likelihood function is
ln L D
Tn
ln.2 /
2
1X
ln j t j
2 t D1
1X 0
u 1ut :
2 t D1 t t
(4.39)
145
40
40
Std
60
Std
60
20
0
20
1995
2000
2005
2010
1995
2000
2005
2010
Corr
0.5
DCC
CC
0
1995
2000
2005
2010
4.8
The ARCH and GARCH models imply that volatility is random, so they are (strictly
speaking) not consistent with the B-S model. However, they are often combined with the
B-S model to provide an approximate option price. See Figure 4.20 for a comparison
of the actual distribution of the log asset price at different horizons when the returns
are generated by a GARCH modeland a normal distribution with the same mean and
variance. It is clear that the normal distribution is a good approximation unless the horizon
is short and the ARCH component (1 u2t 1 ) dominates the GARCH component (1 t2 1 ).
4.8.2
Over the period from t to t C the change of log asset price minus a riskfree rate (including dividends/accumulated interest), that is, the continuously compounded excess return,
follows a kind of GARCH(1,1)-M process
ln S t
ln S t
p
h t z t ; where z t is iid N.0; 1/
p
h t D ! C 1 .z t
/2 C 1 h t :
1 ht
r D ht C
(4.40)
(4.41)
( ) = (0.8 0.09)
N
Sim
( ) = (0.8 0.09)
0.02
0.05
0.01
0
20
10
0
Return
10
20
0
40
20
0
Return
20
40
( ) = (0.09 0.8)
( ) = (0.09 0.8)
0.03
0.1
0.02
0.05
0
15
0.01
10
0
Return
10
15
0
40
20
0
Return
20
40
D Et
fM t max S t
K; 0g :
(4.42)
De
Et
fmax S t
K; 0g ;
(4.43)
where E t is the expectations operator for the risk neutral distribution. See, for instance,
Huang and Litzenberger (1988).
148
0.502
105
5
0.132
10
0.589
421.390
0.1
0
0.1
0.2
0.3
0.4
0.5
10
0
s (lead of h)
10
lnS t and h t Cs
For parameter estimates on a more recent sample, see Table 4.4. These estimates
suggests that has the wrong sign (high volatility predicts low future returns) and the
persistence of volatility is much higher than in HN ( is much higher).
!
-2.5
1.22e-006
0.00259
0.903
6.06
Table 4.4: Estimate of the Heston-Nandi model on daily S&P500 excess returns, in %.
Sample: 1990:1-2011:5
4.8.3
T=5
T = 50
20
N
Sim
15
10
0
4.5
4.55
4.6
4.65
4.7
N
Sim
4.75
lnST
0
4.5
4.55
4.6
4.65
lnST
4.7
4.75
0.502
105
0.132
105
0.589
421.390
T = 50
8
char fun
Sim
6
4
2
0
4.5
4.55
4.6
4.65
lnST
4.7
4.75
150
HestonNandi model, T = 50
physical
riskneutral
8
7
6
5
4
3
2
1
0
4.5
4.55
4.6
4.65
4.7
4.75
lnST
Figure 4.23: Physical and riskneutral distribution of lnST in the Heston-Nandi model
derive any option price (for any time to expiry). For a European call option with strike
price K and expiry at date T , the result is
C t .S t ; r; K; T / D e
D S t P1
E t max ST
e
KP2 ;
K; 0
(4.44)
(4.45)
where P1 and P2 are two risk neutral probabilities (implied by the risk neutral version of
(4.40)(4.41), see above). It can be shown that P2 is the risk neutral probability that ST >
K, and that P1 is the delta, @C t .S t ; r; K; T /=@S t (just like in the Black-Scholes model).
In practice, HN calculate these probabilities by first finding the risk neutral characteristic
function of ST , f . / D E t exp.i ln ST /, where i 2 D 1, and then inverting to get the
probabilities.
Remark 4.24 (Characteristic function and the pdf) The characteristic function of a random variable x is
f . / D E exp.i x/
R
D x exp.i x/pdf.x/dx;
151
where pdf.x/ is the pdf. This is a Fourier transform of the pdf (if x is a continuous random
2 2
variable). For instance, the cf of a N. ; 2 / distribution is exp.i
=2/. The pdf
can therefore be recovered by the inverse Fourier transform as
pdf.x/ D
1 R1
1 exp. i x/f . /d :
2
In practice, we typically use a fast (discrete) Fourier transform to perform this calculation, since there are very quick computer algorithms for doing that (see the appendix).
Remark 4.25 (Characteristic function of ln ST in the HN model) First, define
A t D A tC1 C i r C B t C1 !
Bt D i . C
1/
1
2
2
1
1
ln.1
2
21 B t C1 /
2
1 .i
1/
C 1 B t C1 C
;
2 1 1 B t C1
1 2
A0 D T i r C .T 1/ ! i
2
1 2
B0 D i
:
2
We can then write the characteristic function as
f . / D exp .i ln S0 C A0 C B0 !/
D exp i ln S0 C T .r C ! /
T !=2 ;
152
Returns on the index are calculated by using official index plus dividends. The riskfree
rate is taken to be a synthetic T-bill rate created by interpolating different bills to match
the maturity of the option. Weekly data for 19921994 are used (created by using lots of
intraday quotes for all Wednesdays).
HN estimate the GARCH(1,1)-M process (4.40)(4.41) with ML on daily data on
the S&P500 index returns. It is found that the i parameter is large, i is small, and that
1 > 0 (as expected). The latter seems to be important for the estimated h t series (see
Figures 1 and 2).
Instead of using the GARCH(1,1)-M process estimated from the S&P500 index
returns, all the model parameters are subsequently estimated from option prices. Recall
that the probabilities P1 and P2 in (4.45) depend (nonlinearly) on the parameters of the
risk neutral version of (4.40)(4.41). The model parameters can therefore be estimated
by minimizing the sum (across option price observation) squared pricing errors.
In one of several different estimations, HN estimate the model on option data for
the first half 1992 and then evaluate the model by comparing implied and actual option
prices for the second half of 1992. These implied option prices use the model parameters
estimated on data for the first half of the year and an estimate of h t calculated using
these parameters and the latest S&P 500 index returns. The performance of this model is
compared with a Black-Scholes model (among other models), where the implied volatility
in week t 1 is used to price options in period t. This exercise is repeated for 1993 and
1994.
It is found that the GARCH model outperforms (in terms of MSE) the B-S model. In
particular, it seems as if the GARCH model gives much smaller errors for deep out-ofthe-money options (see Figures 2 and 3). HN argue that this is due to two aspects of the
model: the time-profile of volatility (somewhat persistent, but mean-reverting) and the
negative correlation of returns and volatility.
153
4.9
Fundamental Values and Asset Returns in Global Equity Markets, by Bansal and Lundblad
Basic Model
e
Riet D i Rmt
C "i t ;
e
where Rmt
is the world market index. As in CAPM, the market return is proportional to
its volatilityhere modelled as a GARCH(1,1) process. We there fore have a GARCH-M
(-in-Mean) process
e
Rmt
D
2
mt
2
mt
C "mt , E t
D C "2m;t
1 "mt
D 0 and Var t
2
m;t 1 :
1 ."mt /
2
mt ;
(4.47)
(4.48)
A gross return
Ri;t C1 D
Di;t C1 C Pi;t C1
;
Pi t
(4.49)
154
i .pi;t C1
ri;tC1
/
d
i;t C1
zit
zi;tC1
(4.50)
gi;tC1
di t D zi t
1
X
s
i
E t .gi;t CsC1
ri;t CsC1 / :
(4.51)
sD0
To calculate the right hand side of (4.51), notice the following things. First, the dividend growth (cash flow dynamics) is modelled as an ARMA(1,1)see below for details. Second, the riskfree rate (rf t ) is assumed to follow an AR(1). Third, the expected
return equals the riskfree rate plus the expected excess returnwhich follows (4.46)
(4.48).
Since all these three processes are modelled as univariate first-order time-series processes, the solution is
pi t
2
m;t C1
C Ai;3 rf t :
(4.52)
(BL use an expected dividend growth instead of the actual but that is just a matter of
convenience, and has another timing convention for the volatility.) This solution can be
thought of as the fundamental (log) price-dividend ratio. The main theme of the paper is
to study how well this fundamental log price-dividend ratio can explain the actual values.
The model is estimated by GMM (as a system), but most of the moment conditions
are conventional. In practice, this means that (i) the betas and the AR(1) for the riskfree
rate are estimated by OLS; (ii) the GARCH-M by MLE; (iii) the ARMA(1,1) process by
moment conditions that require the innovations to be orthogonal to the current levels; and
(iv) moment conditions for changes in pi t di t D zi t define3d in (4.52). This is the
overidentified part of the model.
155
4.9.3
As a benchmark for comparison, consider the case when the right hand side in (4.51)
equals a constant. This would happen when the growth rate of cash flows is unpredictable,
the riskfree rate is constant, and the market risk premium is too (which here requires that
the conditional variance of the market return is constant). In this case, the price-dividend
ratio is constant, so the log return equals the cash flow growth plus a constant.
This benchmark case would not be very successful in matching the observed volatility
and correlation (across markets) of returns: cash flow growth seems to be a lot less volatile
than returns and also a lot less correlated across markets.
What if we allowed for predictability of cash flow growth, but still kept the assumptions of constant real interest rate and market risk premium? Large movements in predictable cash flow growth could then generate large movements in returns, but hardly the
correlation across markets.
However, large movements in the market risk premium would contribute to both. It is
clear that both mechanisms are needed to get a correlation between zero and one. It can
also be noted that the returns will be more correlated during volatile periodssince this
drives up the market risk premium which is a common component in all returns.
4.9.4
The growth rate of cash flow, gi t , is modelled as an ARMA(1,1). The estimation results
show that the AR parameter is around 0:95 and that the MA parameter is around 0:85.
This means that the growth rate is almost an iid process with very low autocorrelation
but only almost. Since the MA parameter is not negative enough to make the sum of the
AR and MA parameters zero, a positive shock to the growth rate will have a long-lived
effect (even if small). See Figure 4.24.
Remark 4.27 (ARMA(1,1)) An ARMA(1,1) model is
y t D ay t
C " t C " t
1,
1
X
sD1
as 1 .a C /" t s :
156
1
0
0.5
= 0.8
=0
= 0.8
1
2
0
0
4
6
period
10
4
6
period
10
ARMA(1,1): yt = ayt1 + t + t1
.1 C a/.a C /
, and
1 C 2 C 2a
Da
s 1
for s D 2; 3; : : :
Results
1. The hypothesis that the CAPM regressions have zero intercepts (for all five country
indices) cannot be rejected.
2. Most of the parameters are precisely estimated, except
A.1
Characteristic Function
(A.1)
where f .x/ is the pdf. This is a Fourier transform of the pdf (if x is a continuous random
2 2
variable). For instance, the cf of a N. ; 2 / distribution is exp.i
=2/. The pdf
can therefore be recovered by the inverse Fourier transform as
1 R1
1 exp. i x/h. /d :
2
f .x/ D
(A.2)
In practice, we typically use a fast (discrete) Fourier transform to perform this calculation,
since there are very quick computer algorithms for doing that.
A.2
FFT in Matlab
A.3
PN
j D1 qj e
2 i
N
(A.3)
.j 1/.k 1/
1 PN
2 i
N .j
kD1 Qk e
N
1/.k 1/
(A.4)
Approximate the characteristic function (A.1) as the integral over xmin ; xmax (assuming
the pdf is zero outside)
R xmax i x
h. / D xmin
e f .x/dx:
(A.5)
Approximate this by a Riemann sum
h. /
PN
kD1 e
i xk
f .xk / x:
(A.6)
158
Split up xmin ; xmax into N intervals of equal size, so the step (and interval width) is
xD
xmax
xmin
(A.7)
1=2/ x;
(A.8)
x=2.
1/=3 D 2. The xj
PN
kD1 e
(A.9)
fk x;
2 j 1
;
N
x
so we can control the central location of . Use that in the Riemann sum
j
hj
PN
kD1 e
i.xmin C1=2 x/ 2N
qj
1
x
1
hj
N
(A.10)
DbC
1
x
e i xmin C.k
i.xmin C 1=2 x/ 2N
1 PN 2 i .j
eN
N kD1
j 1
x
1=2/ xb
fk x;
(A.11)
=N to get
Qk
fk x ;
(A.12)
which has the same for as the ifft (A.4). We should therefore be able to calculate Qk by
applying the fft (A.3) on qj . We can then recover the density function as
fk D e
Qk = x:
(A.13)
159
Bibliography
Andersen, T. G., T. Bollerslev, P. F. Christoffersen, and F. X. Diebold, 2005, Volatility
forecasting, Working Paper 11188, NBER.
Bansal, R., and C. Lundblad, 2002, Market efficiency, fundamental values, and the size
of the risk premium in global equity markets, Journal of Econometrics, 109, 195237.
Britten-Jones, M., and A. Neuberger, 2000, Option prices, implied price processes, and
stochastic volatility, Journal of Finance, 55, 839866.
Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial
markets, Princeton University Press, Princeton, New Jersey.
Duan, J., 1995, The GARCH option pricing model, Mathematical Finance, 5, 1332.
Duffee, G. R., 2005, Time variation in the covariance between stock returns and consumption growth, Journal of Finance, 60, 16731712.
Engle, R. F., 2002, Dynamic conditional correlation: a simple class of multivariate generalized autoregressive conditional heteroskedasticity models, Journal of Business and
Economic Statistics, 20, 339351.
Engle, R. F., and K. Sheppard, 2001, Theoretical and empirical properties of dynamic
conditional correlation multivariate GARCH, Discussion Paper 2001-15, University
of California, San Diego.
Franses, P. H., and D. van Dijk, 2000, Non-linear time series models in empirical finance,
Cambridge University Press.
Glosten, L. R., R. Jagannathan, and D. Runkle, 1993, On the relation between the expected value and the volatility of the nominal excess return on stocks, Journal of Finance, 48, 17791801.
Gourieroux, C., and J. Jasiak, 2001, Financial econometrics: problems, models, and
methods, Princeton University Press.
Hamilton, J. D., 1994, Time series analysis, Princeton University Press, Princeton.
160
Harvey, A. C., 1989, Forecasting, structural time series models and the Kalman filter,
Cambridge University Press.
Hentschel, L., 1995, All in the family: nesting symmetric and asymmetric GARCH
models, Journal of Financial Economics, 39, 71104.
Heston, S. L., and S. Nandi, 2000, A closed-form GARCH option valuation model,
Review of Financial Studies, 13, 585625.
Huang, C.-F., and R. H. Litzenberger, 1988, Foundations for financial economics, Elsevier
Science Publishing, New York.
Jiang, G. J., and Y. S. Tian, 2005, The model-free implied volatility and its information
content, Review of Financial Studies, 18, 13051342.
Nelson, D. B., 1991, Conditional heteroskedasticity in asset returns, Econometrica, 59,
347370.
Ruiz, E., 1994, Quasi-maximum likelihood estimation of stochastic volatility models,
Journal of Econometrics, 63, 289306.
Taylor, S. J., 2005, Asset price dynamics, volatility, and prediction, Princeton University
Press.
161
Factor Models
5.1
(5.1)
5.2
5.2.1
If the residuals in the CAPM regression are iid, then the traditional LS approach is just
fine: estimate (5.1) and form a t-test of the null hypothesis that the intercept is zero. If the
disturbance is iid normally distributed, then this approach is the ML approach.
162
0.1
Mean
Mean
0.1
0.05
0.05
0.1
0.15
0.05
0.05
Std
0.1
0.15
Std
The new asset has the abnormal return
compared to the market (of 2 assets)
Mean
0.1
0.05
Means
Cov
matrix
Tang
portf
0
0.05
0.1
0.15
N=2
0.47
0.53
NaN
=0
0.47
0.53
0.00
=0.05
0.31
0.34
0.34
=0.04
0.82
0.91
0.73
Std
.E f t /2
Var.O 0 / D 1 C
Var."i t /=T
Var .f t /
D .1 C SRf2 / Var."i t /=T;
(5.2)
(5.3)
where SRf2 is the squared Sharpe ratio of the market portfolio (recall: f t is the excess
return on market portfolio). We see that the uncertainty about the intercept is high when
the disturbance is volatile and when the sample is short, but also when the Sharpe ratio of
the market is high. Note that a large market Sharpe ratio means that the market asks for
a high compensation for taking on risk. A bit uncertainty about how risky asset i is then
gives a large uncertainty about what the risk-adjusted return should be.
Proof. (of (5.2)) Consider the regression equation y t D x t0 b0 C u t . With iid errors
163
that are independent of all regressors (also across observations), the LS estimator, bOLs , is
asymptotically distributed as
p
T .bOLs
b0 / ! N.0;
xx1 /, where
When the regressors are just a constant (equal to one) and one variable regressor, f t , so
x t D 1; f t 0 , then we have
"
# "
#
PT
P
1
f
1
E
f
1
t
t
T
xx D E t D1 x t x t0 =T D E
D
, so
t D1
2
T
ft ft
E f t E f t2
#
"
#
"
2
2
2
2
E
f
Var.f
/
C
.E
f
/
E
f
E
f
t
t
t
t
t
2
D
xx1 D
:
2
2
Var.f t /
E ft
.E f t /
E ft
1
E ft
1
(In the last line we use Var.f t / D E f t2 .E f t /2 :)
The t-test of the hypothesis that 0 D 0 is then
O
O
d
Dq
! N.0; 1/ under H0 : 0 D 0:
Std./
O
.1 C SRf2 / Var."i t /=T
(5.4)
Note that this is the distribution under the null hypothesis that the true value of the intercept is zero, that is, that CAPM is correct (in this respect, at least).
Remark 5.1 (Quadratic forms of normally distributed random variables) If the n 1
2
vector X
N.0; /, then Y D X 0 1 X
n . Therefore, if the n scalar random
variables Xi , i D 1; :::; n, are uncorrelated and have the distributions N.0; i2 /, i D
2
1; :::; n, then Y D inD1 Xi2 = i2
n.
Instead of a t-test, we can use the equivalent chi-square test
O 2
O 2
d
D
!
2
Var./
O
.1 C SRf / Var."i t /=T
2
1
under H0 : 0 D 0:
(5.5)
The chi-square test is equivalent to the t-test when we are testing only one restriction, but
it has the advantage that it also allows us to test several restrictions at the same time. Both
the t-test and the chisquare tests are Wald tests (estimate unrestricted model and then test
the restrictions).
It is quite straightforward to use the properties of minimum-variance frontiers (see
Gibbons, Ross, and Shanken (1989), and MacKinlay (1995)) to show that the test statistic
164
(5.6)
where SRf is the Sharpe ratio of the market portfolio and SRc is the Sharpe ratio of
the tangency portfolio when investment in both the market return and asset i is possible.
(Recall that the tangency portfolio is the portfolio with the highest possible Sharpe ratio.)
If the market portfolio has the same (squared) Sharpe ratio as the tangency portfolio of the
mean-variance frontier of Ri t and Rmt (so the market portfolio is mean-variance efficient
also when we take Ri t into account) then the test statistic, O i2 = Var.O i /, is zeroand
CAPM is not rejected.
Proof. (of (5.6)) From the CAPM regression (5.1) we have
"
# "
#
"
# "
#
e
e
Riet
i2 m2 C Var."i t / i m2
i
i
m
i
Cov
D
, and
D
:
e
e
2
e
Rmt
i m2
m
m
m
Suppose we use this information to construct a mean-variance frontier for both Ri t and
e
Rmt , and we find the tangency portfolio, with excess return Rct
. It is straightforward to
show that the square of the Sharpe ratio of the tangency portfolio is e0 1 e , where
e
is the vector of expected excess returns and is the covariance matrix. By using the
covariance matrix and mean vector above, we get that the squared Sharpe ratio for the
tangency portfolio, e0 1 e , (using both Ri t and Rmt ) is
e
c
c
i2
D
C
Var."i t /
e
m
2
;
i2
C .SRm /2 :
Var."i t /
Use the notation f t D Rmt Rf t and combine this with (5.3) and to get (5.6).
It is also possible to construct small sample test (that do not rely on any asymptotic results), which may be a better approximation of the correct distribution in real-life
samplesprovided the strong assumptions are (almost) satisfied. The most straightforward modification is to transform (5.5) into an F1;T 1 -test. This is the same as using a
t -test in (5.4) since it is only one restriction that is tested (recall that if Z
tn , then
2
Z
F .1; n/).
An alternative testing approach is to use an LR or LM approach: restrict the intercept
165
in the CAPM regression to be zero and estimate the model with ML (assuming that the
errors are normally distributed). For instance, for an LR test, the likelihood value (when
D 0) is then compared to the likelihood value without restrictions.
A common finding is that these tests tend to reject a true null hypothesis too often
when the critical values from the asymptotic distribution are used: the actual small sample size of the test is thus larger than the asymptotic (or nominal) size (see Campbell,
Lo, and MacKinlay (1997) Table 5.1). To study the power of the test (the frequency of
rejections of a false null hypothesis) we have to specify an alternative data generating
process (for instance, how much extra return in excess of that motivated by CAPM) and
the size of the test (the critical value to use). Once that is done, it is typically found that
these tests require a substantial deviation from CAPM and/or a long sample to get good
power.
5.2.2
To test the null hypothesis that all intercepts are zero, we then use the test statistic
T O 0 .1 C SR2 / 1
5.3
5.3.1
2
n,
(5.9)
To test n assets at the same time when the errors are non-iid we make use of the GMM
framework. A special case is when the residuals are iid. The results in this section will
then coincide with those in Section 5.2.
Write the n regressions in (5.7) on vector form as
Ret D C f t C " t , E " t D 0n
(5.10)
where and are n 1 vectors. Clearly, setting n D 1 gives the case of a single test
asset.
The 2n GMM moment conditions are that, at the true values of and ,
E g t .; / D 02n 1 , where
#
"
# "
"t
Ret f t
g t .; / D
D
:
ft "t
f t Ret f t
(5.11)
(5.12)
There are as many parameters as moment conditions, so the GMM estimator picks values
of and such that the sample analogues of (5.11) are satisfied exactly
"
#
T
T
e
O t
X
X
R
O
f
1
1
t
O D
O D
g.
N ;
O /
g t .;
O /
D 02n 1 ;
(5.13)
O t/
T t D1
T tD1 f t .Ret O f
which gives the LS estimator. For the inference, we allow for the possibility of non-iid
errors, but if the errors are actually iid, then we (asymptotically) get the same results as in
Section 5.2.
With point estimates and their sampling distribution it is straightforward to set up a
Wald test for the hypothesis that all elements in are zero
d
O 0 Var./
O 1 O !
2
n:
(5.14)
167
Remark 5.2 (Easy coding of the GMM Problem (5.13)) Estimate by LS, equation by
equation. Then, plug in the fitted residuals in (5.12) to generate time series of the moments
(will be important for the tests).
Remark 5.3 (Distribution of GMM) Let the parameter vector in the moment condition
have the true value b0 . Define
S0 D Cov
hp
i
@g.b
N 0/
T gN .b0 / and D0 D plim
:
@b 0
When the estimator solves min gN .b/0 S0 1 gN .b/ or when the model is exactly identified, the
distribution of the GMM estimator is
p
T .bO
D D0 1 S0 .D0 1 /0 :
D0 D
since .A B/
DA
1
ft
#"
1
ft
#0 !
In ;
(5.16)
(if conformable).
168
f
1
1
2
2
t
1
1
t
2t
1t
6
7
6
7D
;
6 gN .; / 7 D T
7 T
e
t D1 6 f .R e
tD1 f
R
f
/
2 2 f t
t
1
1 t 5
4 3
5
4 t 1t
2t
e
gN 4 .; /
f t .R2t
2 2 f t /
where gN 1 .; / denotes the sample average of the first moment condition. The Jacobian
is
2
3
@gN 1 =@1 @gN 1 =@2 @gN 1 =@1 @gN 1 =@2
6
7
6 @gN 2 =@1 @gN 2 =@2 @gN 2 =@1 @gN 2 =@2 7
@g.;
N
/
6
7
D6
7
@1 ; 2 ; 1 ; 2 0
@
g
N
=@
@
g
N
=@
@
g
N
=@
@
g
N
=@
1
3
2
3
1
3
2 5
4 3
@gN 4 =@1 @gN 4 =@2 @gN 4 =@1 @gN 4 =@2
2
3
1 0 ft 0
"
#"
#0 !
6
7
XT
7
0
1
0
f
1
1
1
1 XT 6
t
6
7D
I2 :
D
7
2
t D1 6 f
t D1
T
T
0
f
0
ft
ft
4 t
5
t
0 f t 0 f t2
p
The asymptotic covariance matrix of T times the sample moment conditions, evaluated at the true parameter values, that is at the true disturbances, is defined as
!
p T
1
X
T X
g t .; / D
S0 D Cov
R.s/, where
(5.17)
T tD1
sD 1
R.s/ D E g t .; /g t s .; /0 :
(5.18)
169
1 vector " t as
R.s/ D E g t .; /g t s .; /0
"
#"
#0
"t
"t s
DE
ft "t
ft s "t s
" "
#
! "
#
!0 #
1
1
DE
"t
"t s
:
ft
ft s
(5.19)
"
T
O
O
2
#!
"
D
1
E ft
E f t E f t2
Var."i t /;
5.3.2
We could also construct an LM test instead by imposing D 0 in the moment conditions (5.11) and (5.13). The moment conditions are then
#
"
Ret f t
D 02n 1 :
(5.21)
E g./ D E
f t .Ret f t /
Since there are q D 2n moment conditions, but only n parameters (the vector), this
model is overidentified.
We could either use a weighting matrix in the GMM loss function or combine the
moment conditions so the model becomes exactly identified.
With a weighting matrix, the estimator solves
minb g.b/
N 0 W g.b/;
N
(5.22)
where g.b/
N
is the sample average of the moments (evaluated at some parameter vector b),
and W is a positive definite (and symmetric) weighting matrix. Once we have estimated
the model, we can test the n overidentifying restrictions that all q D 2n moment condiO If not, the restriction (null hypothesis)
tions are satisfied at the estimated n parameters .
that D 0n 1 is rejected. The test is based on a quadratic form of the moment conditions,
T g.b/
N 0 1 g.b/
N
which has a chi-square distribution if the correct matrix is used.
Alternatively, to combine the moment conditions so the model becomes exactly identified, premultiply by a matrix A to get
An
2n
E g./ D 0n 1 :
(5.23)
The model is then tested by testing if all 2n moment conditions in (5.21) are satisfied at this vector of estimates of the betas. This is the GMM analogue to a classical
LM test. Once again, the test is based on a quadratic form of the moment conditions,
T g.b/
N 0 1 g.b/
N
which has a chi-square distribution if the correct matrix is used.
Details on hows to compute the estimates effectively are given in Appendix B.1.
For instance, to effectively use only the last n moment conditions in the estimation,
we specify
"
#
h
i
Ret f t
A E g./ D 0n n In E
D 0n 1 :
(5.24)
f t .Ret f t /
171
(5.25)
Example 5.8 (Combining moment conditions, CAPM on two assets) With two assets we
can combine the four moment conditions into only two by
2
3
e
R
1ft
1t
"
# 6
7
e
6 R2t
0 0 1 0
2 f t 7
6
7 D 02 1 :
A E g t .1 ; 2 / D
E6
7
e
0 0 0 1
4 f t .R1t 1 f t / 5
e
f t .R2t
2 f t /
Remark 5.9 (Test of overidentifying assumption in GMM) When the GMM estimator
solves the quadratic loss function g./
N 0 S0 1 g./
N
(or is exactly identified), then the J test
statistic is
O 0 S 1 g.
O d 2
T g.
N /
0 N / ! q k ;
where q is the number of moment conditions and k is the number of parameters.
Remark 5.10 (Distribution of GMM, more general results) When GMM solves minb g.b/
N 0 W g.b/
N
O D 0k 1 , the distribution of the GMM estimator and the test of overidentifying
or Ag.
N /
assumptions are different than in Remarks 5.3 and 5.9.
5.3.3
The size (using asymptotic critical values) and power in small samples is often found
to be disappointing. Typically, these tests tend to reject a true null hypothesis too often
(see Campbell, Lo, and MacKinlay (1997) Table 5.1) and the power to reject a false null
hypothesis is often fairly low. These features are especially pronounced when the sample
is small and the number of assets, n, is high. One useful rule of thumb is that a saturation
ratio (the number of observations per parameter) below 10 (or so) is likely to give poor
performance of the test. In the test here we have nT observations, 2n parameters in and
, and n.n C 1/=2 unique parameters in S0 , so the saturation ratio is T =.2 C .n C 1/=2/.
For instance, with T D 60 and n D 10 or at T D 100 and n D 20, we have a saturation
ratio of 8, which is very low (compare Table 5.1 in CLM).
172
One possible way of dealing with the wrong size of the test is to use critical values
from simulations of the small sample distributions (Monte Carlo simulations or bootstrap
simulations).
5.3.4
Choice of Portfolios
This type of test is typically done on portfolios of assets, rather than on the individual
assets themselves. There are several econometric and economic reasons for this. The
econometric techniques we apply need the returns to be (reasonably) stationary in the
sense that they have approximately the same means and covariance (with other returns)
throughout the sample (individual assets, especially stocks, can change character as the
company moves into another business). It might be more plausible that size or industry
portfolios are stationary in this sense. Individual portfolios are typically very volatile,
which makes it hard to obtain precise estimate and to be able to reject anything.
It sometimes makes economic sense to sort the assets according to a characteristic
(size or perhaps book/market)and then test if the model is true for these portfolios.
Rejection of the CAPM for such portfolios may have an interest in itself.
5.3.5
Empirical Evidence
See Campbell, Lo, and MacKinlay (1997) 6.5 (Table 6.1 in particular) and Cochrane
(2005) 20.2.
One of the more interesting studies is Fama and French (1993) (see also Fama and
French (1996)). They construct 25 stock portfolios according to two characteristics of the
firm: the size and the book value to market value ratio (BE/ME). In June each year, they
sort the stocks according to size and BE/ME. They then form a 5 5 matrix of portfolios,
where portfolio ij belongs to the i th size quantile and the j th BE/ME quantile. This is
illustrated in Table 5.1.
Tables 5.25.3 summarize some basic properties of these portfolios.
Fama and French run a traditional CAPM regression on each of the 25 portfolios
(monthly data 19631991)and then study if the expected excess returns are related
to the betas as they should according to CAPM (recall that CAPM implies E Riet D
e
i E Rmt
). However, there is little relation between E Riet and i (see Figure 5.4). This
lack of relation (a cloud in the i E Riet space) is due to the combination of two features
173
10
I
5
0
D
A
FH
GC
JB
0.5
1
beta (against the market)
alpha
NaN
3.50
0.67
0.83
4.36
1.72
1.53
1.17
1.86
2.61
0.47
all
A (NoDur)
B (Durbl)
C (Manuf)
D (Enrgy)
E (HiTec)
F (Telcm)
G (Shops)
H (Hlth )
I (Utils)
J (Other)
1.5
StdErr
NaN
8.88
13.39
6.36
14.63
12.16
11.37
9.89
11.70
11.77
7.20
pval
0.07
0.01
0.74
0.41
0.06
0.37
0.39
0.45
0.30
0.16
0.68
15
10
I
5
0
D
A
G
FH CJB E
Excess market return: 5.5%
0
5
10
15
Predicted mean excess return (with =0)
CAPM
Factor: US market
alpha and StdErr are in annualized %
5
0
D
A
FH
GC
JB
0.5
1
beta (against the market)
1.5
all
A (NoDur)
B (Durbl)
C (Manuf)
D (Enrgy)
E (HiTec)
F (Telcm)
G (Shops)
H (Hlth )
I (Utils)
J (Other)
alpha
NaN
3.50
0.67
0.83
4.36
1.72
1.53
1.17
1.86
2.61
0.47
tstat, LS tstat, NW
NaN
NaN
2.51
2.51
0.32
0.33
0.83
0.82
1.90
1.91
0.90
0.90
0.86
0.85
0.76
0.76
1.01
1.03
1.42
1.39
0.41
0.41
174
18
16
14
12
10
8
6
US data 1957:12010:12
25 FF portfolios (B/M and size)
6
8
10
12
14
Predicted mean excess return (CAPM), %
16
18
1
6
11
16
21
2
7
12
17
22
3
8
13
18
23
4
9
14
19
24
5
10
15
20
25
175
18
16
14
12
lines connect same size
10
8
1 (small)
2
3
4
5 (large)
6
4
4
6
8
10
12
14
Predicted mean excess return (CAPM), %
16
18
5:6
4:8
5:3
6:4
5:3
11:9
11:1
9:1
8:8
6:9
13:5
10:9
10:7
9:8
6:7
5
16:3
12:2
12:6
10:2
8:5
Table 5.2: Mean excess returns (annualised %), US data 1957:12010:12. Size 1: smallest
20% of the stocks, Size 5: largest 20% of the stocks. B/M 1: the 20% of the stocks with
the smallest ratio of book to market value (growth stocks). B/M 5: the 20% of the stocks
with the highest ratio of book to market value (value stocks).
5.4
Reference: Cochrane (2005) 12.1; Campbell, Lo, and MacKinlay (1997) 6.2.1
176
18
16
14
12
lines connect same B/M
10
8
1 (low)
2
3
4
5 (high)
6
4
4
6
8
10
12
14
Predicted mean excess return (CAPM), %
16
18
1:4
1:4
1:4
1:3
1:1
1:1
1:1
1:0
1:0
1:0
1:0
1:0
1:0
1:0
0:9
5
1:0
1:2
1:1
1:1
0:9
Table 5.3: Beta against the market portfolio, US data 1957:12010:12. Size 1: smallest
20% of the stocks, Size 5: largest 20% of the stocks. B/M 1: the 20% of the stocks with
the smallest ratio of book to market value (growth stocks). B/M 5: the 20% of the stocks
with the highest ratio of book to market value (value stocks).
5.4.1
A Multi-Factor Model
When the K factors, f t , are excess returns, the null hypothesis typically says that i D 0
in
(5.26)
Riet D i C i0 f t C "i t , where E "i t D 0 and Cov.f t ; "i t / D 0K 1 :
177
and i is now an K 1 vector. The CAPM regression is a special case when the market
excess return is the only factor. In other models like ICAPM (see Cochrane (2005) 9.2),
we typically have several factors. We stack the returns for n assets to get
3 2
3 2
32
3 2
3
2
e
1
11 : : : 1K
f1t
"1t
R1t
76 : 7 6 : 7
6 : 7 6 : 7 6 :
: : ::
7 6 :: 7 C 6 :: 7 , or
6 :: 7 D 6 :: 7 C 6 ::
:
:
5 4
5 4
54
5 4
5
4
e
n
n1 : : : nK
fKt
"nt
Rnt
Ret D C f t C " t ; where E " t D 0n
n;
(5.27)
where is n 1 and is n K. Notice that ij shows how the ith asset depends on the
j th factor.
This is, of course, very similar to the CAPM (one-factor) modeland both the LS and
GMM approaches are straightforward to extend.
5.4.2
The results from the LS approach of testing CAPM generalizes directly. In particular,
(5.9) still holdsbut where the residuals are from the multi-factor regressions (5.26) and
where the Sharpe ratio of the tangency portfolio (based on the factors) depends on the
means and covariance matrix of all factors
T O 0 .1 C SR2 / 1
2
n,
E f:
(5.28)
This result is well known, but some propertiers of SURE models are found in Appendix
A.
5.4.3
f t / D 0n.1CK/ 1 :
(5.29)
Note that this expression looks similar to (5.11)the only difference is that f t may now
be a vector (and we therefore need to use the Kronecker product). It is then intuitively
clear that the expressions for the asymptotic covariance matrix of O and O will look very
178
similar too.
When the system is exactly identified, the GMM estimator solves
g.;
N
/ D 0n.1CK/ 1 ;
(5.30)
which is the same as LS equation by equation. The model can be tested by testing if all
alphas are zeroas in (5.14).
Instead, when we restrict D 0n 1 (overidentified system), then we either specify a
weighting matrix W and solve
min g./
N 0 W g./;
N
(5.31)
N
n.1CK/ g./
D 0nK 1 :
(5.32)
(5.33)
More generally, details on hows to compute the estimates effectively are given in Appendix B.1.
Example 5.11 (Moment condition with two assets and two factors) The moment conditions for n D 2 and K D 2 are
2
3
e
R1t
1 11 f1t 12 f2t
6
7
e
6 R2t
7
f
2
21
1t
22
2t
6
7
6 f .Re
7
6 1t 1t 1 11 f1t 12 f2t / 7
E g t .; / D E 6
7 D 06 1 :
e
6 f1t .R2t
2 21 f1t 22 f2t / 7
6
7
6 f .Re
7
4 2t 1t 1 11 f1t 12 f2t / 5
e
f2t .R2t
2 21 f1t 22 f2t /
Restricting 1 D 2 D 0 gives the moment conditions for the overidentified case.
179
Empirical Evidence
Fama and French (1993) also try a multi-factor model. They find that a three-factor model
fits the 25 stock portfolios fairly well (two more factors are needed to also fit the seven
bond portfolios that they use). The three factors are: the market return, the return on a
portfolio of small stocks minus the return on a portfolio of big stocks (SMB), and the
return on a portfolio with high BE/ME minus the return on portfolio with low BE/ME
(HML). This three-factor model is rejected at traditional significance levels (see Campbell, Lo, and MacKinlay (1997) Table 6.1 or Fama and French (1993) Table 9c), but it can
still capture a fair amount of the variation of expected returnssee Figures 5.75.10.
5.5
Reference: Cochrane (2005) 12.2; Campbell, Lo, and MacKinlay (1997) 6.2.3 and 6.3
5.5.1
Linear factor models imply that all expected excess returns are linear functions of the
same vector of factor risk premia ( )
E Riet D i0 , where
is K
1, for i D 1; : : : n:
(5.34)
180
15
10
5
0
D
A
H
E F I
GC
0
5
10
15
Predicted mean excess return (with =0)
all
A (NoDur)
B (Durbl)
C (Manuf)
D (Enrgy)
E (HiTec)
F (Telcm)
G (Shops)
H (Hlth )
I (Utils)
J (Other)
pval
0.00
0.06
0.02
0.75
0.14
0.30
0.50
0.75
0.01
0.99
0.01
alpha
NaN
2.53
4.46
0.31
3.38
1.68
1.25
0.50
4.31
0.01
2.77
StdErr
NaN
8.63
12.13
6.07
14.15
10.14
11.12
9.78
10.92
10.53
6.17
FamaFrench model
Factors: US market, SMB (size), and HML (booktomarket)
alpha and StdErr are in annualized %
12
10
8
6
US data 1957:12010:12
25 FF portfolios (B/M and size)
8
10
12
14
Predicted mean excess return (FF), %
16
18
181
18
16
14
12
lines connect same size
10
8
1 (small)
2
3
4
5 (large)
6
4
4
8
10
12
14
Predicted mean excess return (FF), %
16
18
11 : : :
7 6 :
::
7 D 6 ::
:
5 4
n1 : : :
E Ret D ;
32
1K
1
76 : 7
::
7
6
: 7
:
5 4 : 5 , or
nK
K
(5.35)
where is n K.
When the factors are excess returns, then the factor risk premia must equal the expected excess returns of those factors. (To see this, let the factor also be one of the test
e
assets. It will then get a beta equal to unity on itself (for instance, regressing Rmt
on
e
itself must give a coefficient equal to unity). This shows that for factor k, k D E Rk t .
More generally, the factor risk premia can be interpreted as follows. Consider an asset
that has a beta of unity against factor k and zero betas against all other factors. This asset
will have an expected excess return equal to k . For instance, if a factor risk premium is
182
18
16
14
12
lines connect same B/M
10
8
1 (low)
2
3
4
5 (high)
6
4
4
8
10
12
14
Predicted mean excess return (FF), %
16
18
(5.36)
16
16
14
12
= 0.67
b = -13.3
pval: 0.00
10
8
6
4
Fit of CAPM
18
14
12
= 0.65 -17.36
b = -542 25364
pval: 0.00
10
8
6
4
5
10
15
Predicted mean excess return, %
5
10
15
Predicted mean excess return, %
US data 1957:1-2010:12
25 FF portfolios (B/M and size)
CAPM:
Ri = i + i Rm and ERi = i
SDF:
m = 1 + b (f Ef )
Coskewness model:
Ri = + 1 Rm + 2 R2m and
ERi = 1i 1 + 2i 2
16
16
14
12
= 0.47
b = -9.3
pval: 0.00
10
8
6
4
Fit of CAPM
18
14
12
= 0.47 -28.78
b = -888 42036
pval: 0.00
10
8
6
4
5
10
15
Predicted mean excess return, %
5
10
15
Predicted mean excess return, %
US data 1957:1-2010:12
25 FF portfolios (B/M and size)
CAPM:
Ri = i + i Rm and ERi = i , = ERm
SDF:
m = 1 + b (f Ef )
Coskewness model:
Rm
is + 1 Rm + 2 R2m and
i =
ERi = 1i 1 + 2i 2 , 1 = ERm
Figure 5.12: CAPM and quadratic model, market excess is exactly priced
against market
against (market)2
x 10
1.4
1.3
0
1.2
1.1
1
0
10
15
Portfolio
20
25
10
10
15
Portfolio
20
25
US data 1957:12010:12
25 FF portfolios (B/M and size)
a minimization problem like (5.31). The test is based on a quadratic form of the moment
conditions, T g.b/
N 0 1 g.b/
N
which has a chi-square distribution if the correct matrix
is used. In the special case of W D S0 1 , the distribution is given by Remark 5.3. For
other choices of the weighting matrix, the expression for the covariance matrix is more
complicated.
It is straightforward to show that the Jacobian of these moment conditions (with respect to vec.; ; /) is
3
2
"
#"
#0 !
P
1
1
6 T1 TtD1
In 0n.1CK/ K 7
7
f
f
D0 D 6
t
t
5
4
i
h
0
In
where the upper left block is similar to the expression for the case with excess return
factors (5.15), while the other blocks are new.
Example 5.12 (Two assets and one factor) we have the moment conditions
2
3
e
R1t
1 1 f t
6
7
e
6 R2t
7
f
2
2
t
6
7
6 f .Re
7
6 t 1t 1 1 f t / 7
E g t .1 ; 2 ; 1 ; 2 ; / D E 6
7 D 06 1 :
e
6 f t .R2t
7
f
/
2
2
t
6
7
6
7
e
R
1
4
5
1t
e
R2t
2
There are then 6 moment conditions and 5 parameters, so there is one overidentifying
restriction to test. Note that with one factor, then we need at least two assets for this
testing approach to work (n K D 2 1). In general, we need at least one more asset
186
@gN
D
@1 ; 2 ; 1 ; 2 ; 0
6
6
6
1 XT 6
6
6
tD1 6
T
6
6
4
2
5.5.2
1
T
6
4
PT
t D1
3
1 0 ft 0 0
7
0 1 0 ft 0 7
7
f t 0 f t2 0 0 7
7
7
2
0 7
0 ft 0 ft
7
0 0
0 1 7
5
0 0 0
2
"
#"
#0 !
3
1
1
I2 0 4 1 7
5:
ft
ft
0; I2
Instead of estimating the overidentified model (5.37) (by specifying a weighting matrix),
we could combine the moment equations so they become equal to the number of parameters. This can be done, by specifying a matrix A and combine as A E g t D 0. This does
not generate any overidentifying restrictions, but it still allows us to test hypotheses about
some moment conditions and about . One possibility is to let the upper left block of A
be an identity matrix and just combine the last n moment conditions, Ret , to just K
moment conditions
"
In.1CK/
0K
n.1CK/
0n.1CK/
K n
#
n
2 "
6
E4
1
ft
Ret
2 "
6
E4
1
ft
.Ret
A E g t D 0n.1CK/CK
3
#
.Ret
f t / 7
5D0
(5.38)
(5.39)
.Ret
f t / 7
5D0
(5.40)
Here A has n.1 C K/ C K rows (which equals the number of parameters (; ; /) and
n.1 C K C 1/ columns (which equals the number of moment conditions). (Notice also
that is K n, is n K and is K 1.)
Remark 5.13 (Calculation of the estimates based on (5.39)) In this case, we can estimate
187
/ D 0K
or
D ./
E Ret ;
f
/
t
1
1
t
1t
6
7
6
7
e
f t .R2t
2 2 f t /
4
5
e
e
1 .R1t
1 / C 2 .R2t
2 /
which has as many parameters as moment conditions. The test of the asset pricing model
is then to test if
2
3
e
R1t
1 1 f t
6
7
e
6 R2t
7
f
2
2
t
6
7
6 f .Re
7
6 t 1t 1 1 f t / 7
E g t .1 ; 2 ; 1 ; 2 ; / D E 6
7 D 06 1 ;
e
6 f t .R2t
7
f
/
2
2
t
6
7
6
7
e
R1t
1
4
5
e
R2t 2
188
02
3
e
" # "
#
E R1t
0
11 12 13 B6
e 7
D
@4E R2t
5
0
21 22 23
e
E R3t
5.5.3
2
3
11 12 "
6
7
421 22 5
31 32
1
#
1 C
A:
2
The test of the general multi-factor models is sometimes written on a slightly different
form (see, for instance, Campbell, Lo, and MacKinlay (1997) 6.2.3, but adjust for the
fact that they look at returns rather than excess returns). To illustrate this, note that the
regression equations (5.27) imply that
E Ret D C E f t :
(5.41)
E f t /;
(5.42)
which is another way of summarizing the restrictions that the linear factor model gives.
We can then rewrite the moment conditions (5.37) as (substitute for and skip the last set
of moments)
""
#
#
1
E g t .; / D E
.Ret .
E f t / f t / D 0n.1CK/ 1 :
(5.43)
ft
Note that there are n.1 C K/ moment conditions and nK C K parameters (nK in and
K in ), so there are n K overidentifying restrictions (as before).
Example 5.16 (Two assets and one factor) The moment conditions (5.43) are
2
3
e
R1t
1 .
E f t / 1 f t
6
7
e
6 R2t
2 .
E f t / 2 f t 7
6
7 D 04 1 :
E g t .1 ; 2 ; / D E 6
e
E f t / 1 f t 7
4 f t R1t 1 .
5
e
f t R2t 2 .
E f t / 2 f t
189
This gives 4 moment conditions, but only three parameters, so there is one overidentifying
restriction to testjust as with (5.39).
5.5.4
It would (perhaps) be natural if the tests discussed in this section coincided with those in
Section 5.4 when the factors are in fact excess returns. That is almost so. The difference is
that we here estimate the K 1 vector (factor risk premia) as a vector of free parameters,
while the tests in Section 5.4 impose D E f t . This can be done in (5.39)(5.40) by doing
two things. First, define a new set of test assets by stacking the original test assets and the
excess return factors
" #
Ret
;
(5.44)
RQ et D
ft
which is an .n C K/
(5.45)
(5.46)
It is also straightforward to show that this gives precisely the same test statistics as the
Wald test on the multifactor model (5.26).
Proof. (of (5.46)) The betas of the RQ et vector are
"
#
n K
Q D
:
IK
The expression corresponding to E.Ret / D 0 is then
" #
"
#
h
i
h
i
Ret
n K
D 0K n IK
, or
0K n IK E
ft
IK
E ft D :
Remark 5.17 (Two assets, one excess return factor) By including the factors among the
190
1 f t
2 f t
3 f t
1 f t /
2 f t /
3 f t /
2 / C 1.f t
3
7
7
7
7
7
7
7 D 07 1 :
7
7
7
7
5
3 /
f
t
3
3
t
6
7
6
7
e
6 f t .R1t
1 1 f t / 7
6
7
e
7 D 09 1 :
E g t .1 ; 2 ; 3 ; 1 ; 2 ; 3 ; / D E 6
f
.R
f
/
t
2
2
t
2t
6
7
6
7
6 f t .f t 3 3 f t / 7
6
7
e
6
7
R
1
1t
6
7
6
7
e
R2t
2
4
5
f t 3
In fact, this gives the same test statistic as when testing if 1 and 2 are zero in (5.14).
5.5.5
When Some (but Not All) of the Factors Are Excess Returns
" #
Zt
ft D
;
Ft
(5.47)
assets by stacking the original test assets and the excess return factors
" #
Ret
RQ et D
;
Zt
which is an .n C v/
where # is some w
(5.48)
(5.49)
#
Z/
(5.50)
where the Z and F are just betas of the original test assets on Z t and F t respectively
according to the partitioning
h
i
n K D nZ v nF w :
(5.51)
One possible choice of # is # D F 0 , since then F are the same as when running a
cross-sectional regression of the expected abnormal return (E Ret Z Z ) on the betas
( F ).
Proof. (of (5.50)) The betas of the RQ et vector are
"
#
Z
F
n v
n w
Q D
:
Iv 0 v w
The expression corresponding to E.Ret
"
0v
#w
n
n
/ D 0 is then
Q E RQ et D Q Q Q
#"
# "
#"
#"
Iv
E Ret
0v n
Iv
nZ v nF w
D
0w v
E Zt
#w n 0w v
Iv 0v w
"
# "
#" #
E Zt
Iv
0v w
Z
D
:
e
Z
F
#w n E R t
#w n n v #w n n w
F
#
Z
F
D E Zt :
192
C # F
F;
D .# F / 1 #.E Ret
so
Z /:
Example 5.18 (Structure of to identify for excess return factors) Continue Example
e
5.15 (where there are 2 factors and three test assets) and assume that Z t D R3t
so the
first factor is really an excess returnwhich we have appended last to set of test assets.
Then 31 D 1 and 32 D 0 (regressing Z t on Z t and F t gives the slope coefficidents 1
P If we set .11 ; 12 ; 13 / D .0; 0; 1/, then the moment conditions in Example 5.15
and 0.)
can be written
02
3 2
3
1
e
" # "
#
E R1t
11 12 " #
0
0
0
1 B6
6
7 Z C
e 7
D
@4E R2t
5 421 22 5
A:
0
21 22 23
F
E Zt
1
0
The first line reads
0 D E Zt
5.5.6
1 0
"
#
Z
, so
D E Zt :
Empirical Evidence
Chen, Roll, and Ross (1986) use a number of macro variables as factorsalong with
traditional market indices. They find that industrial production and inflation surprises are
priced factors, while the market index might not be. Breeden, Gibbons, and Litzenberger
(1989) and Lettau and Ludvigson (2001) estimate models where consumption growth is
the factorwith mixed results.
5.6
This section discusses how we can estimate and test the asset pricing equation
E pt
D E xt mt ;
(5.52)
193
where x t are the payoffs and p t 1 the prices of the assets. We can either interpret
p t 1 as actual asset prices and x t as the payoffs, or we can set p t 1 D 1 and let x t be
gross returns, or set p t 1 D 0 and x t be excess returns.
Assume that the SDF is linear in the factors
0
mt D
(5.53)
ft ;
where the .1 C K/ 1 vector f t contains a constant and the other factors. Combining
with (5.52) gives the sample moment conditions
g.
N /D
T
X
t D1
g t . /=T D 0n 1 , where
gt D xt mt
pt
D x t f t0
pt
(5.54)
1:
(5.55)
(5.56)
(5.57)
194
5.6.1
The model (5.52) does not put any restrictions on the riskfree rate, which may influence
the test. The approach above is also incapable of handling the case when all payoffs are
excess returns. The reason is that there is nothing to tie down the mean of the SDF. To
demonstrate this, the model of the SDF (5.52) is here rewritten as
mt D m
N C b 0 .f t
E f t /;
(5.58)
so m
N D E m.
Remark 5.19 (The SDF model (5.58) combined with excess returns) With excess returns,
x t D Ret and p t 1 D 0. The asset pricing equation is then
0 D E.m t Ret / D E Ret m
N C E Ret .f t
E f t /0 b;
pt
D xt m
N C x t .f t
E f t /0 b
pt
1;
(5.59)
where m
N is given (our restriction). See Appendix B.2 for details on how to calculate the
estimates.
Provided we choose m
N 0, this formulation works with payoffs, gross returns and
also excess returns. It is straightforward to show that the choice of m
N does not matter for
the test based on excess returns (p D 0, so p D 0).
5.6.2
Reference: Ferson (1995); Jagannathan and Wang (2002) (theoretical results); Cochrane
(2005) 15 (empirical comparison); Bekaert and Urias (1996); and Sderlind (1999)
The test of the linear factor model and the test of the linear SDF model are (generally)
not the same: they test the same implications of the models, but in slightly different ways.
195
The moment conditions look a bit differentand combined with non-parametric methods
for estimating the covariance matrix of the sample moment conditions, the two methods
can give different results (in small samples, at least). Asymptotically, they are always the
same, as showed by Jagannathan and Wang (2002).
There is one case where we know that the tests of the linear factor model and the
SDF model are identical: when the factors are excess returns and the SDF is constructed
e
to price these factors as well. To demonstrate this, let R1t
be a vector of excess returns
on some benchmarks assets. Construct a stochastic discount factor as in Hansen and
Jagannathan (1991):
e
e 0
mt D m
N C .R1t
RN 1t
/ ;
(5.60)
e
where m
N is a constant and is chosen to make m t price R1t
in the sample, that is, so
e
tTD1 E R1t
m t =T D 0:
(5.61)
e
Consider the test assets with excess returns R2t
, and SDF performance
gN 2t D
1 PT
e
t D1 R2t m t :
T
(5.62)
(5.63)
e
where E " t D 0 and Cov.R1t
; " t / D 0. Then, the SDF-performance (pricing error) is
proportional to a traditional alpha
gN 2t =m
N D :
O
(5.64)
e
e
e
test assets, R2t
D z t 1 Rpt
, where z t 1 are information variables and Rpt
are basic test
assets (mutual funds say). In this case, we are testing the performance of these dynamic
strategies (in terms of mutual funds, say). For instance, suppose R1t is a scalar and the
for z t 1 R1t is positive. This would mean that a strategy that goes long in R1t when z t 1
is high (and vice versa) has a positive performance.
Proof. (of (5.64)) (Here written in terms of population moments, to simplify the notae
e
m
N . Using this and the expression
tion.) It follows directly that D Var.R1t
/ 1 E R1t
for m t in (5.62) gives
e
E g2t D E R2t
m
N
e
e
e
Cov R2t
; R1t
Var.R1t
/
e
E R1t
m:
N
We now rewrite this equation in terms of the parameters in the factor portfolio model
e
e
(5.63). The latter implies E R2t
D C E R1t
, and the least squares estimator of the slope
1
e
e
e
coefficients is D Cov R2t ; R1t Var R1t . Using these two facts in the equation
aboveand replacing population moments with sample moments, gives (5.64).
5.7
e
1 Rmt :
(5.65)
Since the second factor is not an excess return, the test is done as in (5.37).
An alternative interpretation of this is that we have only one factor, but that the coefficient of the factor is time varying. This is easiest seen by plugging in the factors in the
197
D C .1 C 2 z t
e
1 Rmt
e
1 /Rmt
C "i t
C "i t :
(5.66)
The first line looks like a two factor model with constant coefficients, while the second
line looks like a one-factor model with a time-varying coefficient (1 C 2 z t 1 ). This
is clearly just a matter of interpretation, since it is the same model (and is tested in the
same way). This model can be estimated and tested as in the case of general factorsas
e
z t 1 Rmt
is not a traditional excess return.
See Figure 5.145.15 for an empirical illustration.
Remark 5.20 (Figures 5.145.15, equally weighted 25 FF portfolios) Figure 5.14 shows
the betas of the conditional model. It seems as if the small firms (portfolios with low numbers) have a somewhat higher exposure to the market in bull markets and vice versa,
while large firms have pretty constant exposures. However, the time-variation is not
marked. Therefore, the conditional (two-factor model) fits the cross-section of average
returns only slightly better than CAPMsee Figure 5.15.
Conditional models typically have more parameters than unconditional models, which
is likely to give small samples issues (in particular with respect to the inference). It is
important to remember some of the new factors (original factors times instruments) are
probably not an excess returns, so the test is done with an LM test as in (5.37).
5.8
against Rm
against xRm
0.1
1.4
0.05
1.2
1
0.8
0.05
0
10
15
Portfolio
20
25
0.1
10
15
Portfolio
20
25
US data 1957:12010:12
25 FF portfolios (B/M and size)
instrument: lagged market return
16
16
Fit of CAPM
18
14
12
10
8
6
4
14
12
10
8
6
4
5
10
15
Predicted mean excess return (Rm), %
5
10
15
Predicted mean excess return (Rm,xRm), %
US data 1957:12010:12
25 FF portfolios (B/M and size)
instrument: lagged market return
199
Coefficient on x, different 2
1
0.4
2 = 0.5
0.3
2 = 0.25
0.2
2 = 0
0.5
=1
=5
0
3
0
z
0.1
3
0
3
0
z
Figure 5.16: Logistic function and the effective slope coefficient in a Logistic smooth
transition regression
More ambitiously, we could use a smooth transition regression, which estimates both
the abruptness of the transition between regimes as well as the cutoff point. Let G.z/
be a logistic (increasing but S -shaped) function
G.z/ D
1
1 C exp
.z
c/
(5.67)
where the parameter c is the central location (where G.z/ D 1=2) and > 0 determines
the steepness of the function (a high implies that the function goes quickly from 0 to 1
around z D c.) See Figure 5.16 for an illustration. A logistic smooth transition regression
is
y t D 1
D 1
(5.68)
At low z t values, the regression coefficients are (almost) 1 and at high z t values they are
(almost) 2 . See Figure 5.16 for an illustration.
Remark 5.21 (NLS estimation) The parameter vector ( ; c; 1 ; 2 ) is easily estimated by
Non-Linear least squares (NLS) by concentrating the loss function: optimize (numerically) over . ; c/ and let (for each value of . ; c/) the parameters (1 ; 2 ) be the OLS
coefficients on the vector of regressors .1 G.z t / x t ; G.z t /x t /.
200
Slope on factor
2
low state
high state
1.5
0.5
10
15
FF portfolio no.
20
25
Figure 5.17: Betas on the market in the low and high regimes, 25 FF portfolios
The most common application of this model is to let x_t = y_{t−s}. This is the LSTAR model (logistic smooth transition autoregression); see Franses and van Dijk (2000).
For an empirical application to a factor model, see Figures 5.17–5.18.
Remark 5.22 (Figures 5.17–5.18, equally weighted 25 FF portfolios) Figure 5.17 shows the betas in the (extreme) low and the (extreme) high state. The extreme growth firms (portfolios no. 1, 6, 11, 16, and 21) have much higher betas in the low state (when the lagged return on a momentum portfolio is low) than in the high state, while value firms have practically constant betas. Figure 5.18 suggests that this regime model fits the average returns much better than CAPM.
5.9 Fama-MacBeth
Reference: Cochrane (2005) 12.3; Campbell, Lo, and MacKinlay (1997) 5.8; Fama and MacBeth (1973)
The Fama and MacBeth (1973) approach is a bit different from the regression approaches discussed so far, although it seems most related to the traditional cross-sectional approach.
[Figure 5.18: Fit of CAPM and of the LSTAR model: average vs. predicted mean excess returns (in %). US data 1957:1–2010:12, 25 FF portfolios (B/M and size), regime: lagged return on momentum portfolio.]
In the second step of the Fama-MacBeth approach, the cross-section of returns is regressed on the first-step betas, period by period,

R^e_it = λ_t' β̂_i + ε_it,   (5.69)

where β̂_i are the regressors. (Note the difference to the traditional cross-sectional approach discussed in (5.10), where the second-stage regression regressed E R^e_it on β̂_i, while the Fama-MacBeth approach runs one regression for every time period.)
The estimates of the risk premia and of the average pricing errors are the time averages

λ̂ = Σ_{t=1}^T λ̂_t / T,   (5.70)
ε̂_i = Σ_{t=1}^T ε̂_{it} / T,   (5.71)

and the variance of the period-by-period pricing errors is estimated as

Var(ε̂_{it}) = Σ_{t=1}^T (ε̂_{it} − ε̂_i)² / T.   (5.72)

Since ε̂_i is the sample average of ε̂_{it}, the variance of the former is the variance of the latter divided by T

Var(ε̂_i) = Σ_{t=1}^T (ε̂_{it} − ε̂_i)² / T²,   (5.73)

and similarly for the risk premia

Var(λ̂) = Σ_{t=1}^T (λ̂_t − λ̂)² / T².   (5.74)
Fama and MacBeth (1973) found, among other things, that the squared beta is not
significant in the second step regression, nor is a measure of non-systematic risk.
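A minimal Python sketch of the second Fama-MacBeth pass, assuming the first-pass betas have already been estimated (numpy assumed; names hypothetical):

import numpy as np

def fama_macbeth(Re, beta):
    """Second pass of Fama-MacBeth: for every period t, regress the cross-section
    of excess returns on the (pre-estimated) betas; then average the slopes over
    time and use the time-series dispersion for standard errors, as in (5.70)-(5.74).
    Re: T x n excess returns; beta: n x K first-pass betas."""
    T = Re.shape[0]
    lam_t = np.array([np.linalg.lstsq(beta, Re[t], rcond=None)[0] for t in range(T)])
    lam = lam_t.mean(axis=0)                            # (5.70)
    var_lam = ((lam_t - lam) ** 2).sum(axis=0) / T**2   # (5.74)
    return lam, np.sqrt(var_lam)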
Proof. (of (5.8)) Write each of the regression equations in (5.7) on the traditional form

R^e_it = x_t' θ_i + ε_it, where x_t = [1; f_t].

Define

Σ_xx = plim Σ_{t=1}^T x_t x_t' / T, and σ_ij = plim Σ_{t=1}^T ε_it ε_jt / T;

then the asymptotic covariance matrix of the vectors θ̂_i and θ̂_j (assets i and j) is σ_ij Σ_xx^{−1} / T (see below for a separate proof). In matrix form,

Cov(√T θ̂) = [σ_11 … σ_1n; … ; σ_n1 … σ_nn] ⊗ Σ_xx^{−1},

where θ̂ stacks θ̂_1, …, θ̂_n. As in (5.3), the upper left element of Σ_xx^{−1} equals 1 + SR², where SR is the Sharpe ratio of the market.
Proof. (of distribution of SUR coefficients, used in proof of (5.8)) To simplify, consider two regressions with a common (scalar) regressor

y_t = β x_t + u_t
z_t = θ x_t + v_t,

where y_t, z_t and x_t are zero mean variables. We then know (from basic properties of LS) that

β̂ = β + (Σ_{t=1}^T x_t x_t)^{−1} (x_1 u_1 + x_2 u_2 + … + x_T u_T)
θ̂ = θ + (Σ_{t=1}^T x_t x_t)^{−1} (x_1 v_1 + x_2 v_2 + … + x_T v_T).

In the traditional LS approach, we treat x_t as fixed numbers (constants) and also assume that the residuals are uncorrelated across time and have the same variances and covariances across time. The covariance of β̂ and θ̂ is therefore

Cov(β̂, θ̂) = (Σ_{t=1}^T x_t x_t)^{−2} [x_1² Cov(u_1, v_1) + x_2² Cov(u_2, v_2) + … + x_T² Cov(u_T, v_T)]
           = (Σ_{t=1}^T x_t x_t)^{−2} (Σ_{t=1}^T x_t x_t) σ_uv, where σ_uv = Cov(u_t, v_t),
           = (Σ_{t=1}^T x_t x_t)^{−1} σ_uv.

Divide and multiply by T to get the result in the proof of (5.8). (We get the same results if we relax the assumption that the x_t are fixed numbers, and instead derive the asymptotic distribution.)
Remark A.1 (General results on SURE distribution, same regressors) Let the regression equations be

y_it = x_t' β_i + ε_it, i = 1, …, n,

where x_t is a K×1 vector (the same in all n regressions). When the moment conditions are arranged so that the first n are x_1t ε_t, the next n are x_2t ε_t, etc., as in

E g_t = E(x_t ⊗ ε_t),

then the Jacobian (with respect to the coefficients on x_1t, then the coefficients on x_2t, etc.) and its inverse are

D_0 = −Σ_xx ⊗ I_n and D_0^{−1} = −Σ_xx^{−1} ⊗ I_n.

The covariance matrix of the moment conditions is as usual S_0 = Σ_{s=−∞}^{∞} E g_t g_{t−s}'. As an example, let n = 2, K = 2 with x_t' = (1, f_t) and let the coefficients of regression i be (α_i, β_i); then we have

[ḡ_1; ḡ_2; ḡ_3; ḡ_4] = (1/T) Σ_{t=1}^T [y_1t − α_1 − β_1 f_t; y_2t − α_2 − β_2 f_t; f_t (y_1t − α_1 − β_1 f_t); f_t (y_2t − α_2 − β_2 f_t)]

and

∂ḡ / ∂(α_1, α_2, β_1, β_2)' = −(1/T) Σ_{t=1}^T [1 0 f_t 0; 0 1 0 f_t; f_t 0 f_t² 0; 0 f_t 0 f_t²] = −(Σ_{t=1}^T x_t x_t' / T) ⊗ I_2.
Remark A.2 (General results on SURE distribution, same regressors, alternative ordering of moment conditions and parameters) If, instead, the moment conditions are arranged so that the first K are x_t ε_1t, the next K are x_t ε_2t, as in

E g_t = E(ε_t ⊗ x_t),

then the Jacobian (with respect to the coefficients in regression 1, then the coefficients in regression 2, etc.) and its inverse are

D_0 = −I_n ⊗ Σ_xx and D_0^{−1} = −I_n ⊗ Σ_xx^{−1}.

In the n = 2, K = 2 example, the moment conditions are then ordered as

[ḡ_1; ḡ_2; ḡ_3; ḡ_4] = (1/T) Σ_{t=1}^T [y_1t − α_1 − β_1 f_t; f_t (y_1t − α_1 − β_1 f_t); y_2t − α_2 − β_2 f_t; f_t (y_2t − α_2 − β_2 f_t)]

and

∂ḡ / ∂(α_1, β_1, α_2, β_2)' = −(1/T) Σ_{t=1}^T [1 f_t 0 0; f_t f_t² 0 0; 0 0 1 f_t; 0 0 f_t f_t²] = −I_2 ⊗ (Σ_{t=1}^T x_t x_t' / T).
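The two orderings can be checked numerically; the small Python sketch below (numpy assumed, simulated data, names hypothetical) shows that the Jacobians are just the two Kronecker arrangements of the same Σ_xx:

import numpy as np

T, n = 200, 2
rng = np.random.default_rng(0)
f = rng.standard_normal(T)
x = np.column_stack([np.ones(T), f])       # x_t' = (1, f_t)
Sxx = x.T @ x / T                          # sample Sigma_xx

D0_A1 = -np.kron(Sxx, np.eye(n))           # Remark A.1 ordering: E g_t = E(x_t ⊗ eps_t)
D0_A2 = -np.kron(np.eye(n), Sxx)           # Remark A.2 ordering: E g_t = E(eps_t ⊗ x_t)
# In both cases the inverse is the Kronecker product of the inverses.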
This section describes how the GMM problem can be programmed. We treat the case with n assets and K factors (which are all excess returns). The moments are of the form

g_t = [1; f_t] ⊗ (R^e_t − α − β f_t)
g_t = [1; f_t] ⊗ (R^e_t − β f_t)

for the exactly identified and overidentified case, respectively.
Suppose we could write the moments on the form

g_t = z_t (y_t − x_t' b),

to make it easy to use matrix algebra in the calculation of the estimate (see below for how to do that). These moment conditions are similar to those for the instrumental variable method. In that case we could let

Σ_zy = (1/T) Σ_{t=1}^T z_t y_t and Σ_zx = (1/T) Σ_{t=1}^T z_t x_t', so (1/T) Σ_{t=1}^T g_t = Σ_zy − Σ_zx b.

In the exactly identified case, the estimator solves Σ_zy − Σ_zx b̂ = 0, so b̂ = Σ_zx^{−1} Σ_zy. (It is straightforward to show that this can also be calculated equation by equation.) In the overidentified case with a weighting matrix, the loss function can be written

ḡ' W ḡ = (Σ_zy − Σ_zx b)' W (Σ_zy − Σ_zx b), so
Σ_zx' W Σ_zy − Σ_zx' W Σ_zx b̂ = 0 and b̂ = (Σ_zx' W Σ_zx)^{−1} Σ_zx' W Σ_zy.

(Equivalently, with a matrix A that picks linear combinations of the moment conditions, A ḡ = A Σ_zy − A Σ_zx b = 0 gives b̂ = (A Σ_zx)^{−1} A Σ_zy.) In practice, we never perform an explicit inversion: it is typically much better (in terms of both speed and precision) to let the software solve the system of linear equations instead.
To rewrite the moment conditions as g_t = z_t (y_t − x_t' b), notice that

g_t = ([1; f_t] ⊗ I_n) (R^e_t − ([1; f_t]' ⊗ I_n) b), with b = vec(α, β)
g_t = ([1; f_t] ⊗ I_n) (R^e_t − (f_t' ⊗ I_n) b), with b = vec(β)

for the exactly identified and overidentified case, respectively, where z_t = [1; f_t] ⊗ I_n and x_t' is the matrix multiplying b. Clearly, z_t and x_t are matrices, not vectors. (z_t is n(1+K) × n and x_t' is either of the same dimension or has n rows less, corresponding to the intercept.)
Example B.1 (Rewriting the moment conditions) For the moment conditions in Example 5.11 (n = 2 assets, K = 2 factors) we have

g_t(α, β) = [1 0; 0 1; f_1t 0; 0 f_1t; f_2t 0; 0 f_2t] ([R^e_1t; R^e_2t] − [1 0; 0 1; f_1t 0; 0 f_1t; f_2t 0; 0 f_2t]' [α_1; α_2; β_11; β_21; β_12; β_22]),

where the first matrix is z_t and the transpose of the second is x_t'.
Proof. (of rewriting the moment conditions) From the properties of Kronecker products, we know that (i) vec(ABC) = (C' ⊗ A) vec(B); and (ii) if a is m×1 and c is n×1, then a ⊗ c = (a ⊗ I_n) c. The first rule allows us to write

α + β f_t as ([1; f_t]' ⊗ I_n) vec([α β]),

which is x_t' b. The second rule allows us to write

[1; f_t] ⊗ (R^e_t − α − β f_t) as ([1; f_t] ⊗ I_n) (R^e_t − α − β f_t),

which is z_t (y_t − x_t' b). (For the exactly identified case, we could also use the fact (A ⊗ B)' = A' ⊗ B' to notice that z_t = x_t.)
Remark B.2 (Quick matrix calculations of Σ_zx and Σ_zy) Although a loop would not take too long to calculate Σ_zx and Σ_zy, there is a quicker way. Put (1, f_t') in row t of the matrix Z_{T×(1+K)} and R^e_t' in row t of the matrix R_{T×n}. For the exactly identified case, let X = Z. For the overidentified case, put f_t' in row t of the matrix X_{T×K}. Then, calculate

Σ_zx = (Z'X/T) ⊗ I_n and Σ_zy = vec(R'Z/T).
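A Python sketch of these matrix calculations, assuming numpy and that the factors are excess returns (function and variable names are hypothetical):

import numpy as np

def factor_gmm(Re, f, exactly_identified=True):
    """Sketch of Remark B.2: build Sigma_zx and Sigma_zy from the data matrices
    and solve the GMM first-order conditions with a linear solver (no explicit
    inverse). Re: T x n excess returns; f: T x K factors (excess returns)."""
    T, n = Re.shape
    Z = np.column_stack([np.ones(T), f])             # row t is (1, f_t')
    X = Z if exactly_identified else f               # drop the intercept column if alpha = 0
    Szx = np.kron(Z.T @ X / T, np.eye(n))
    Szy = (Re.T @ Z / T).flatten(order="F")          # vec(R'Z/T)
    if exactly_identified:
        b = np.linalg.solve(Szx, Szy)                # b = vec(alpha, beta)
    else:
        W = np.eye(Szx.shape[0])                     # first-step weighting matrix
        b = np.linalg.solve(Szx.T @ W @ Szx, Szx.T @ W @ Szy)  # b = vec(beta)
    return b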
B.2
B.2.1
Define the sample moments

Σ_xf = Σ_{t=1}^T x_t f_t' / T and Σ_p = Σ_{t=1}^T p_{t−1} / T,

so that the sample moment conditions can be written ḡ(λ) = Σ_xf λ − Σ_p. The loss function is

J = (Σ_xf λ − Σ_p)' W (Σ_xf λ − Σ_p).

Since ∂ḡ(λ̂)/∂λ' = Σ_xf, the first order conditions are

0 = ∂J/∂λ = Σ_xf' W (Σ_xf λ̂ − Σ_p), so λ̂ = (Σ_xf' W Σ_xf)^{−1} Σ_xf' W Σ_p.

In the exactly identified case (or with a matrix A that picks combinations of the moments), A(Σ_xf λ − Σ_p) = 0, so λ = (A Σ_xf)^{−1} A Σ_p.
B.2.2
With demeaned factors, define

Σ_x = Σ_{t=1}^T x_t / T, Σ_xf = Σ_{t=1}^T x_t (f_t − E f_t)' / T and Σ_p = Σ_{t=1}^T p_{t−1} / T.

The loss function is then

J = (Σ_x m̄ + Σ_xf b − Σ_p)' W (Σ_x m̄ + Σ_xf b − Σ_p),

with first order conditions

0_{K×1} = Σ_xf' W (Σ_x m̄ + Σ_xf b̂ − Σ_p), so b̂ = (Σ_xf' W Σ_xf)^{−1} Σ_xf' W (Σ_p − Σ_x m̄).

In the exactly identified case, A(Σ_x m̄ + Σ_xf b − Σ_p) = 0, so b = (A Σ_xf)^{−1} A (Σ_p − Σ_x m̄).
Bibliography
Bekaert, G., and M. S. Urias, 1996, "Diversification, integration and emerging market closed-end funds," Journal of Finance, 51, 835–869.
Breeden, D. T., M. R. Gibbons, and R. H. Litzenberger, 1989, "Empirical tests of the consumption-oriented CAPM," Journal of Finance, 44, 231–262.
Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial markets, Princeton University Press, Princeton, New Jersey.
Chen, N.-F., R. Roll, and S. A. Ross, 1986, "Economic forces and the stock market," Journal of Business, 59, 383–403.
Christiansen, C., A. Ranaldo, and P. Söderlind, 2010, "The time-varying systematic risk of carry trade strategies," Journal of Financial and Quantitative Analysis, forthcoming.
Cochrane, J. H., 2005, Asset pricing, Princeton University Press, Princeton, New Jersey, revised edn.
Dahlquist, M., and P. Söderlind, 1999, "Evaluating portfolio performance with stochastic discount factors," Journal of Business, 72, 347–383.
Fama, E., and J. MacBeth, 1973, "Risk, return, and equilibrium: empirical tests," Journal of Political Economy, 81, 607–636.
Fama, E. F., and K. R. French, 1993, "Common risk factors in the returns on stocks and bonds," Journal of Financial Economics, 33, 3–56.
Fama, E. F., and K. R. French, 1996, "Multifactor explanations of asset pricing anomalies," Journal of Finance, 51, 55–84.
Ferson, W. E., 1995, "Theory and empirical testing of asset pricing models," in Robert A. Jarrow, Vojislav Maksimovic, and William T. Ziemba (ed.), Handbooks in Operations Research and Management Science, pp. 145–200, North-Holland, Amsterdam.
Ferson, W. E., and R. Schadt, 1996, "Measuring fund strategy and performance in changing economic conditions," Journal of Finance, 51, 425–461.
Franses, P. H., and D. van Dijk, 2000, Non-linear time series models in empirical finance, Cambridge University Press.
Gibbons, M., S. Ross, and J. Shanken, 1989, "A test of the efficiency of a given portfolio," Econometrica, 57, 1121–1152.
Greene, W. H., 2003, Econometric analysis, Prentice-Hall, Upper Saddle River, New Jersey, 5th edn.
Hansen, L. P., and R. Jagannathan, 1991, "Implications of security market data for models of dynamic economies," Journal of Political Economy, 99, 225–262.
Jagannathan, R., and Z. Wang, 1996, "The conditional CAPM and the cross-section of expected returns," Journal of Finance, 51, 3–53.
Jagannathan, R., and Z. Wang, 1998, "A note on the asymptotic covariance in Fama-MacBeth regression," Journal of Finance, 53, 799–801.
Jagannathan, R., and Z. Wang, 2002, "Empirical evaluation of asset pricing models: a comparison of the SDF and beta methods," Journal of Finance, 57, 2337–2367.
Lettau, M., and S. Ludvigson, 2001, "Resurrecting the (C)CAPM: a cross-sectional test when risk premia are time-varying," Journal of Political Economy, 109, 1238–1287.
MacKinlay, C., 1995, "Multifactor models do not explain deviations from the CAPM," Journal of Financial Economics, 38, 3–28.
Söderlind, P., 1999, "An interpretation of SDF based performance measures," European Finance Review, 3, 233–237.
Treynor, J. L., and K. Mazuy, 1966, "Can mutual funds outguess the market?," Harvard Business Review, 44, 131–136.
Reference: Bossaerts (2002); Campbell (2003); Cochrane (2005); Smith and Wickens (2002)

6.1
6.1.1
The basic asset pricing equation says that, for any gross return R_t and stochastic discount factor (SDF) M_t,

E_{t−1}(R_t M_t) = 1.   (6.1)

With CRRA utility, the SDF is M_t = β(C_t/C_{t−1})^{−γ}, so

ln M_t = ln β − γ Δc_t,   (6.2)

where

Δc_t = ln(C_t/C_{t−1}).   (6.3)

The second line is only there to introduce the convenient notation Δc_t for the consumption growth rate.
The next few sections study if the pricing model consisting of (6.1) and (6.2) can fit
historical data. To be clear about what this entails, note the following. First, general
equilibrium considerations will not play any role in the analysis: the production side will
not be even mentioned. Instead, the focus is on one of the building blocks of an otherwise
unspecified model. Second, complete markets are not assumed. The key assumption is
rather that the basic asset pricing equation (6.1) holds for the assets I analyse. This means
that the representative investor can trade in these assets without transaction costs and taxes
(clearly an approximation). Third, the properties of historical (ex post) data are assumed
to be good approximations of what investors expected. In practice, this assumes both
rational expectations and that the sample is large enough for the estimators (of various
moments) to be precise.
To highlight the basic problem with the consumption-based model and to simplify the exposition, I assume that the excess return, R^e_t, and consumption growth, Δc_t, have a bivariate normal distribution. By using Stein's lemma, we can write the risk premium as

E_{t−1} R^e_t = γ Cov_{t−1}(R^e_t, Δc_t).   (6.4)
The intuition for this expression is that an asset that has a high payoff when consumption is high, that is, when marginal utility is low, is considered risky and will require a risk premium. This expression also holds in terms of unconditional moments. (To derive that, start by taking unconditional expectations of (6.1).)
We can relax the assumption that the excess return is normally distributed: (6.4) holds also if R^e_t and Δc_t have a bivariate mixture normal distribution, provided Δc_t has the same mean and variance in all the mixture components (see Section 6.1.1 below). This restricts consumption growth to have a normal distribution, but allows the excess return to have a distribution with fat tails and skewness.
Remark 6.1 (Stein's lemma) If x and y have a bivariate normal distribution and h(y) is a differentiable function such that E|h'(y)| < ∞, then Cov[x, h(y)] = Cov(x, y) E h'(y).
Proof. (of (6.4)) For an excess return R^e, (6.1) says E(R^e M) = 0, so

E R^e = −Cov(R^e, M) / E M.

Applying Stein's lemma to M as a function of Δc and using (6.2) then gives (6.4).
[Figure 6.1: Density functions of consumption growth and equity market excess returns (kernel estimate vs. fitted normal). The kernel density function of a variable x is estimated by using a N(0, σ²) kernel with σ = 1.06 Std(x) T^{−1/5}. The normal distribution is calculated from the estimated mean and variance of the same variable.]
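As an illustration of the kernel estimator described in the caption, here is a minimal Python sketch (numpy assumed; names hypothetical):

import numpy as np

def kernel_density(x, grid):
    """Gaussian kernel density with the bandwidth used in Figure 6.1:
    h = 1.06 * Std(x) * T^(-1/5)."""
    T = len(x)
    h = 1.06 * np.std(x) * T ** (-0.2)
    u = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (T * h * np.sqrt(2 * np.pi))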
6.2
6.2.1
This section studies if the consumption-based asset pricing model can explain the historical risk premium on the US stock market.
To discuss the historical average excess returns, it is convenient to work with the unconditional version of the pricing expression (6.4)

E R^e_t = γ Cov(R^e_t, Δc_t).   (6.5)
Table 6.1 shows the key statistics for quarterly US real returns and consumption growth.

              Mean     Std      Autocorr   Corr with Δc
Δc            1.984    0.944    0.362      1.000
R^e_m         5.369    16.899   0.061      0.211
Riskfree      1.213    2.429    0.642      0.196

Table 6.1: Key statistics (annualized %) for quarterly US real returns and consumption growth.

Combining (6.5) with the definition of a correlation gives

E R^e_t = γ Corr(R^e_t, Δc_t) Std(R^e_t) Std(Δc_t) × 0.01,   (6.6)

where the factor 0.01 converts one of the standard deviations from percent to decimals. Solving for the risk aversion coefficient and using the numbers in Table 6.1 gives

γ = E R^e_t / [Corr(R^e_t, Δc_t) Std(R^e_t) Std(Δc_t) × 0.01] ≈ 160,   (6.7)

so no reasonable level of risk aversion can motivate why investors require such a high equity premium. This is the equity premium puzzle stressed by Mehra and Prescott (1985) (although they approach the issue from another angle). Indeed, even if the correlation was one, (6.7) would require γ ≈ 35.
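The arithmetic behind (6.7) can be checked directly; the Python snippet below assumes the Table 6.1 moments are in annualized percent (so one Std is divided by 100):

mean_Re, std_Re, std_dc, corr = 5.369, 16.899, 0.944, 0.211
gamma = mean_Re / (corr * std_Re * std_dc / 100)        # approx 160
gamma_corr1 = mean_Re / (1.0 * std_Re * std_dc / 100)   # approx 34, even with corr = 1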
6.2.2
In contrast to the traditional interpretation of efficient markets, it has been found that excess returns might be somewhat predictable, at least in the long run (a couple of years). In particular, Fama and French (1988a) and Fama and French (1988b) have argued that future long-run returns can be predicted by the current dividend-price ratio and/or current returns.
Figure 6.3 illustrates this by showing results from the regressions

R^e_{t+k}(k) = a_0 + a_1 x_t + u_{t+k}, where x_t = E_t/P_t or R^e_t(k),   (6.8)

where R^e_t(k) is the annualized k-quarter excess return of the aggregate US stock market and E_t/P_t is the earnings-price ratio.
It seems as if the earnings-price ratio has some explanatory power for future returns, at least for long horizons. In contrast, the lagged return is a fairly weak predictor.
[Figure 6.3: R² from regressions of future excess returns on the earnings-price ratio and on lagged returns, across return horizons (months).]
According to the consumption-based model, (6.4) says that the conditional expected excess return should equal the conditional covariance times the risk aversion.
Figure 6.4.a shows recursive estimates of the mean return of the aggregate US stock market and the covariance with consumption growth (dated t+1). The recursive estimation means that the results for (say) 1965Q2 use data for 1955Q2–1965Q2, the results for 1965Q3 add one data point, etc. The second subfigure shows the same statistics, but estimated on a moving data window of 10 years. For instance, the results for 1980Q2 are for the sample 1971Q3–1980Q2. Finally, the third subfigure uses a moving data window of 5 years.
Together these figures give the impression that there are fairly long swings in the data. This fundamental uncertainty should serve as a warning against focusing on the fine details of the data. It could also be used as an argument for using longer data series, provided we are willing to assume that the economy has not undergone important regime changes.
It is clear from the earlier Figure 6.4 that the consumption-based model probably cannot generate plausible movements in risk premia. In that figure, the conditional moments are approximated by estimates on different data windows (that is, different subsamples). Although this is a crude approximation, the results are revealing: the actual average excess return and the covariance move in different directions on all frequencies.
6.2.3
The CRRA utility function has the special feature that the intertemporal elasticity of substitution is the inverse of the risk aversion, that is, 1/γ. Choosing the risk aversion parameter, for instance, to fit the equity premium will therefore have direct effects on the riskfree rate.
A key feature of any consumption-based asset pricing model, or any consumption/saving model for that matter, is that the riskfree rate governs the time slope of the consumption profile. From the asset pricing equation for a riskfree asset, (6.1) gives E_{t−1}(R_ft) E_{t−1}(M_t) = 1. Note that we must use the conditional asset pricing equation, at least as long as we believe that the riskfree asset is a random variable. A riskfree asset is defined by having a zero conditional covariance with the SDF, which means that it is regarded as riskfree at the time of investment (t−1). In practice, this means a real interest rate (perhaps approximated by the real return on a T-bill, since the innovations in inflation are small).
[Figure 6.4: Recursive and moving-window estimates of E(R^e_m) and Cov(R^e_m, Δc). Initialization: data for the first 10 years (not shown).]

With lognormality, the conditional riskfree rate is

R_ft = −ln β + γ E_{t−1} Δc_t − γ² Var_{t−1}(Δc_t)/2,   (6.9)

and in terms of unconditional moments

E R_ft = −ln β + γ E Δc_t − γ² E Var_{t−1}(Δc_t)/2.   (6.10)
Before we try to compare (6.10) with data, several things should be noted. First, the log discount factor and the mean consumption growth only affect the level of the riskfree rate. For instance, with γ = 25 and an average consumption growth of 0.02, the first terms of (6.10) are

−ln β + 25 × 0.02 = −ln β + 0.5,   (6.11)

so an implausibly high β (above one) is needed to fit the historically low average real interest rate; this is the riskfree rate puzzle (see Weil (1989)). Second, if the conditional variance of consumption growth is constant, then the volatility of the expected riskfree rate is

Std(E_{t−1} R_ft) = γ Std(E_{t−1} Δc_t).   (6.12)

(Footnote: Let E(y|x) and Var(y|x) be the expectation and variance of y conditional on x. The unconditional variance is then Var(y) = Var[E(y|x)] + E[Var(y|x)].)
This expression says that the standard deviation of the expected real interest rate is γ times the standard deviation of expected consumption growth. We cannot observe the conditional expectations directly, and can therefore not estimate their volatility. However, a simple example is enough to demonstrate that high values of γ are likely to imply counterfactually high volatility of the real interest rate.
As an approximation, suppose both the riskfree rate and consumption growth are AR(1) processes. Then (6.12) can be written

Corr(R_ft, R_f,t+1) Std(R_ft) = γ Corr(Δc_t, Δc_{t+1}) Std(Δc_t), so   (6.13)
0.75 × 0.02 ≈ γ × 0.3 × 0.01,   (6.14)

where the second line uses the results in Table 6.1. With γ = 25, (6.14) implies that the RHS is much too volatile. This shows that an intertemporal elasticity of substitution of 1/25 is not compatible with the relatively stable real return on T-bills.
Proof. (of (6.13)) If x_t = ρ x_{t−1} + ε_t, where ε_t is iid, then E_{t−1}(x_t) = ρ x_{t−1}, so Std(E_{t−1} x_t) = ρ Std(x_{t−1}).
6.3
The previous section demonstrated that the consumption-based model has a hard time explaining the risk premium on a broad equity portfolio, essentially because consumption growth is too smooth to make stocks look particularly risky. However, the model does predict a positive equity premium, even if it is not large enough. This suggests that the model may be able to explain the relative risk premia across assets, even if the scale is wrong. In that case, the model would still be useful for some issues. This section takes a closer look at that possibility by focusing on the relation between the average return and the covariance with consumption growth in a cross-section of asset returns.
The key equation is (6.5), which I repeat here for ease of reading

E R^e_t = γ Cov(R^e_t, Δc_t).   (6.5 again)

This can be tested with a GMM framework or with the traditional cross-sectional regressions of returns on factors with unknown factor risk premia (see, for instance, Cochrane (2005)).
Still, the basic asset pricing implication is the same: expected returns are linearly related
to the covariance.
[Figure 6.5: Mean excess returns (in %) against Cov(Δc, R^e) in basis points (unconditional C-CAPM: γ ≈ 166, t-stat of γ 1.8, R² 0.30) and against Cov(R^e_m, R^e) in percent (unconditional CAPM: slope 0.7, t-stat 0.4, R² 0.02). 25 FF portfolios, US data.]

[Figure 6.6: The same cross-sectional fits with the portfolios labelled by size (1 small to 5 large) and book-to-market (1 low to 5 high).]
              Size 1   2      3      4      5
B/M 1         6.6      3.4    4.1    1.8    3.1
B/M 2         1.2      0.1    0.7    1.3    0.5
B/M 3         1.0      2.6    1.0    0.2    0.7
B/M 4         3.0      2.6    1.7    1.1    1.3
B/M 5         4.1      2.2    4.1    0.7    0.3

Table 6.2: Historical minus fitted risk premia (annualised %) from the unconditional model. Results are shown for the 25 equally-weighted Fama-French portfolios, formed according to size and book-to-market ratios (B/M). Sample: 1957Q1–2008Q4.
              Size 1   2      3      4      5
B/M 1         5.8      4.7    4.8    6.0    4.7
B/M 2         11.0     8.4    8.4    6.4    6.1
B/M 3         11.9     10.7   8.6    8.2    6.3
B/M 4         13.8     11.0   10.2   9.3    6.1
B/M 5         16.6     12.0   12.0   9.6    8.0

Table 6.3: Historical risk premia (annualised %). Results are shown for the 25 equally-weighted Fama-French portfolios, formed according to size and book-to-market ratios (B/M). Sample: 1957Q1–2008Q4.
              Size 1   2      3      4      5
B/M 1         114.5    73.5   85.1   30.4   65.2
B/M 2         10.5     0.7    8.7    19.6   7.8
B/M 3         8.5      23.8   11.5   1.8    11.0
B/M 4         21.9     23.6   16.8   12.3   22.1
B/M 5         24.9     18.2   33.7   6.8    4.2

Table 6.4: Relative errors of risk premia (in %) of the unconditional model. The relative errors are defined as historical minus fitted risk premia, divided by historical risk premia. Results are shown for the 25 equally-weighted Fama-French portfolios, formed according to size and book-to-market ratios (B/M). Sample: 1957Q1–2008Q4.
6.4
The basic asset pricing model is about conditional moments and it can be summarized as in (6.4), which is given here again

E_{t−1} R^e_t = γ Cov_{t−1}(R^e_t, Δc_t).   (6.4 again)

Expressing this in terms of unconditional moments as in (6.5) shows only part of the story. It is, however, fair to say that if the model does not hold unconditionally, then that is enough to reject the model.
However, it can be shown (see Söderlind (2006)) that several refinements of the consumption based model (the habit persistence model of Campbell and Cochrane (1999) and also the model with idiosyncratic risk by Mankiw (1986) and Constantinides and Duffie (1996)) also imply that (6.4) holds, but with a time varying effective risk aversion coefficient (so γ should carry a time subscript).
6.4.1
Lettau and Ludvigson (2001a) construct the variable

cay_t = c_t − ω a_t − (1 − ω) y_t,   (6.15)

where c_t is log consumption, a_t log financial wealth and y_t is log income. The coefficient ω is estimated with LS to be around 0.3. Although (6.15) contains non-stationary variables, it is interpreted as a cointegrating relation, so LS is an appropriate estimation method. Lettau and Ludvigson (2001a) show that cay is able to forecast stock returns (at least, in-sample). Intuitively, cay should be a signal of investor expectations about future returns (or wage earnings...): a high value is probably driven by high expectations.
Lettau and Ludvigson (2001b) let the coefficients of the C-CAPM be driven by the lagged cay instrument

λ_t = λ_0 + λ_1 cay_{t−1}   (6.16)
and b_t = b_0 + b_1 cay_{t−1}.   (6.17)
This is a conditional C-CAPM. It is clearly the same as specifying a linear factor model

R^e_it = α_i + β_i1 cay_{t−1} + β_i2 Δc_t + β_i3 (Δc_t × cay_{t−1}) + ε_it,   (6.18)

where the coefficients are estimated in a time series regression (this is also called a scaled factor model, since the true factor, Δc, is scaled by the instrument, cay). Then, the cross-sectional pricing implications are tested by

E R^e_t = β λ.   (6.19)
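A minimal sketch of the time series regression (6.18), assuming numpy and aligned data series (names hypothetical):

import numpy as np

def scaled_factor_regression(Re_i, dc, cay_lag):
    """Time-series regression (6.18) for one asset: excess return on a constant,
    the lagged instrument, consumption growth, and the scaled factor dc*cay_lag."""
    X = np.column_stack([np.ones(len(dc)), cay_lag, dc, dc * cay_lag])
    b, *_ = np.linalg.lstsq(X, Re_i, rcond=None)
    return b   # (alpha_i, beta_i1, beta_i2, beta_i3)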
Duffee (2005) instead estimates the conditional moments directly, using the prediction equations

Δc_t = γ_c' Y_{c,t−1} + e_{c,t}   (6.20)
R^e_t = γ_R' Y_{R,t−1} + e_{R,t}.   (6.21)

In practice, only three lags of consumption growth are used to predict consumption growth, and only the cay variable is used to predict the asset return.
Then, the return is related to the covariance as

R^e_t = b_0 + (b_1 + b_2 p_{t−1}) e_{R,t} e_{c,t} + w_t,   (6.22)
where (b_1 + b_2 p_{t−1}) is a model of the effective risk aversion. In the CRRA model, b_2 = 0, so b_1 measures the relative risk aversion as in (6.4). In contrast, in Campbell and Cochrane (1999), p_{t−1} is an observable proxy of the surplus ratio, which measures how close consumption is to the habit level.
The model (6.20)–(6.22) is estimated with GMM, using a number of instruments (Z_{t−1}): lagged values of stock market value/consumption, stock market returns, cay and the product of demeaned consumption and returns. This can be thought of as first finding proxies for

E_{t−1} R^e_t = γ_R' Y_{R,t−1} and Cov_{t−1}(e_{R,t}, e_{c,t}) = v' Z_{t−1},   (6.23)

and then estimating the relation between them

E_{t−1} R^e_t = b_0 + (b_1 + b_2 p_{t−1}) Cov_{t−1}(e_{R,t}, e_{c,t}) + u_t.   (6.24)

The point of using a (GMM) system is that this allows handling the estimation uncertainty of the prediction equations in the testing of the relation between the predictions.
The empirical results (using monthly returns on the broad U.S. stock market and per capita expenditures on nondurables and services, 1959–2001) suggest that there is a strong negative relation between the conditional covariance and the conditional expected market return, which is clearly at odds with a CRRA utility function (compare (6.4)). In addition, typical proxies of the p_{t−1} variable do not seem to have any important (economic) effects.
In an extension, the paper also studies other return horizons and tries other ways to model volatility (including a DCC model).
(See also Söderlind (2006) for a related approach applied to a cross-section of returns.)
6.5 Ultimate Consumption
Parker and Julliard (2005) study the effect of using long-run ("ultimate") consumption growth in the pricing equation. The standard one-period condition for an excess return is

0 = E_{t−1}[R^e_t β (C_t/C_{t−1})^{−γ}].   (6.25)

The pricing of an n-period riskfree bond (with price P_{n,t}) gives

E_t[β^n (C_{t+n}/C_t)^{−γ}] = P_{n,t}, so   (6.26)
1 = E_t[β^n (C_{t+n}/C_t)^{−γ} / P_{n,t}].   (6.27)

Combining (6.25) and (6.27) gives

0 = E_{t−1}[R^e_t β^{n+1} (C_{t+n}/C_{t−1})^{−γ} / P_{n,t}].   (6.28)

This expression relates the one-period excess return to an n-period SDF, which involves the interest rate (1/P_{n,t}) and the ratio of marginal utilities n periods apart.
If we can apply Stein's lemma (possibly extended) and use y_{n,t} = ln(1/P_{n,t}) to denote the n-period log riskfree rate, then we get

E_{t−1} R^e_t = −Cov_{t−1}(R^e_t, ln M_{n,t})
             = γ Cov_{t−1}[R^e_t, ln(C_{t+n}/C_{t−1})] − Cov_{t−1}(R^e_t, y_{n,t}).   (6.29)

The first term is very similar to the traditional expression (6.2), except that we here have the (n+1)-period (instead of the 1-period) consumption growth. The second term captures the covariance between the excess return and the n-period interest rate in period t (both are random as seen from t−1). If we set n = 0, then this equation simplifies to the traditional expression (6.2). Clearly, the moments in (6.29) could be unconditional instead of conditional.
The empirical approach in Parker and Julliard (2005) is to estimate (using GMM) and test the cross-sectional implications of this model. (They do not use Stein's lemma.) They find that the model fits data much better with a high value of n (ultimate consumption) than with n = 0 (the traditional model). Possible reasons could be: (i) long-run changes in consumption are better measured in national accounts data; (ii) the CRRA model is a better approximation for long-run movements.
[Figure: Cross-sectional fit of the ultimate-consumption CCAPM: mean excess returns (in %) against Cov(Δc, R^e) in basis points. Unconditional CCAPM: γ ≈ 166, t-stat of γ 1.8, R² 0.30; ultimate consumption at longer horizons: γ ≈ 208, t-stat 1.8, R² 0.27, and γ ≈ 525, t-stat 1.8, R² 0.47. US quarterly data 1957Q1–2008Q4.]
Bibliography
Bossaerts, P., 2002, The paradox of asset pricing, Princeton University Press.
Campbell, J. Y., 2003, "Consumption-based asset pricing," in George Constantinides, Milton Harris, and Rene Stultz (ed.), Handbook of the Economics of Finance, chap. 13, pp. 803–887, North-Holland, Amsterdam.
Campbell, J. Y., and J. H. Cochrane, 1999, "By force of habit: a consumption-based explanation of aggregate stock market behavior," Journal of Political Economy, 107, 205–251.
Campbell, J. Y., A. W. Lo, and A. C. MacKinlay, 1997, The econometrics of financial markets, Princeton University Press, Princeton, New Jersey.
Campbell, J. Y., and S. B. Thompson, 2008, "Predicting the equity premium out of sample: can anything beat the historical average?," Review of Financial Studies, 21, 1509–1531.
Cochrane, J. H., 2005, Asset pricing, Princeton University Press, Princeton, New Jersey, revised edn.
Constantinides, G. M., and D. Duffie, 1996, "Asset pricing with heterogeneous consumers," Journal of Political Economy, 104, 219–240.
Duffee, G. R., 2005, "Time variation in the covariance between stock returns and consumption growth," Journal of Finance, 60, 1673–1712.
Engle, R. F., 2002, "Dynamic conditional correlation: a simple class of multivariate generalized autoregressive conditional heteroskedasticity models," Journal of Business and Economic Statistics, 20, 339–351.
Epstein, L. G., and S. E. Zin, 1991, "Substitution, risk aversion, and the temporal behavior of asset returns: an empirical analysis," Journal of Political Economy, 99, 263–286.
Fama, E. F., and K. R. French, 1988a, "Dividend yields and expected stock returns," Journal of Financial Economics, 22, 3–25.
Fama, E. F., and K. R. French, 1988b, "Permanent and temporary components of stock prices," Journal of Political Economy, 96, 246–273.
Fama, E. F., and K. R. French, 1993, "Common risk factors in the returns on stocks and bonds," Journal of Financial Economics, 33, 3–56.
Goyal, A., and I. Welch, 2008, "A comprehensive look at the empirical performance of equity premium prediction," Review of Financial Studies, 21, 1455–1508.
Lettau, M., and S. Ludvigson, 2001a, "Consumption, wealth, and expected stock returns," Journal of Finance, 56, 815–849.
Lettau, M., and S. Ludvigson, 2001b, "Resurrecting the (C)CAPM: a cross-sectional test when risk premia are time-varying," Journal of Political Economy, 109, 1238–1287.
Mankiw, G. N., 1986, "The equity premium and the concentration of aggregate shocks," Journal of Financial Economics, 17, 211–219.
Mehra, R., and E. Prescott, 1985, "The equity premium: a puzzle," Journal of Monetary Economics, 15, 145–161.
Mittelhammer, R. C., G. J. Judge, and D. J. Miller, 2000, Econometric foundations, Cambridge University Press, Cambridge.
Parker, J., and C. Julliard, 2005, "Consumption risk and the cross section of expected returns," Journal of Political Economy, 113, 185–222.
Smith, P. N., and M. R. Wickens, 2002, "Asset pricing with observable stochastic discount factors," Discussion Paper No. 2002/03, University of York.
Söderlind, P., 2006, "C-CAPM refinements and the cross-section of returns," Financial Markets and Portfolio Management, 20, 49–73.
Söderlind, P., 2009, "An extended Stein's lemma for asset pricing," Applied Economics Letters, 16, 1005–1008.
Weil, P., 1989, "The equity premium puzzle and the risk-free rate puzzle," Journal of Monetary Economics, 24, 401–421.
7
7.1
Term risk premia can be defined in several ways. All these premia are zero (or at least constant) under the expectations hypothesis.
A yield term premium is defined as the difference between a long (n-period) interest rate and the expected average future short (m-period) rates over the same period

φ^y_t(n, m) = y_nt − (1/k) Σ_{s=0}^{k−1} E_t y_{m,t+sm}, with k = n/m.   (7.1)

Example 7.1 (Yield term premium, 1-year rate vs. 3-month rates)

φ^y_t(1y, 3m) = y_{1y,t} − (1/4) E_t (y_{3m,t} + y_{3m,t+3m} + y_{3m,t+6m} + y_{3m,t+9m}).
[Figure 7.1: Timing for yield term premium: roll over new m-period bonds at 0, m, 2m and 3m, or hold the n = 4m bond.]
The (m-period) forward term premium is the difference between a forward rate for an m-period investment (starting k periods ahead) and the expected short interest rate

φ^f_t(k, m) = f_t(k, k + m) − E_t y_{m,t+k},   (7.2)

where f_t(k, k + m) is a forward rate that applies for the period t + k to t + k + m. Figure 7.2 illustrates the timing.

[Figure 7.2: Timing for forward term premium.]
The holding-period premium is the expected excess return from holding an n-period bond for m periods

φ^h_t(n, m) = (1/m) E_t ln(P_{n−m,t+m}/P_{n,t}) − y_mt.   (7.3)

Figure 7.3 illustrates the timing. This definition is perhaps most similar to the definition of risk premia of other assets (for instance, equity).
Example 7.2 (Holding-period premium, holding a 10-year bond for one year)

φ^h_t(10, 1) = E_t ln(P_{9,t+1}/P_{10,t}) − y_1t = 10 y_{10,t} − 9 E_t y_{9,t+1} − y_1t.

[Figure 7.3: Timing for holding-period premium: hold the bond for m periods.]
Interest rates may also contain other risk premia. For instance, a short nominal interest rate is likely to include an inflation risk premium, since inflation over the next period is risky. However, this is not the focus here.
The (pure) expectations hypothesis of interest rates says that all these risk premia should be constant (or zero, according to the pure version of the theory).
7.2
7.2.1
The basic tests of the expectations hypothesis (EH) are that the realized values of the term premia (replace the expected values by realized values) in (7.1)–(7.3) should be unpredictable. In this case, the regressions of the realized premia on variables that are known in t should have zero slopes (b_1 = 0, b_2 = 0, b_3 = 0)

y_nt − (1/k) Σ_{s=0}^{k−1} y_{m,t+sm} = a_1 + b_1' x_t + u_{t+n}   (7.4)
f_t(k, k + m) − y_{m,t+k} = a_2 + b_2' x_t + u_{t+k+m}   (7.5)
(1/m) ln(P_{n−m,t+m}/P_nt) − y_mt = a_3 + b_3' x_t + u_{t+n}.   (7.6)
These tests are based on the maintained hypothesis that the expectation errors (for instance, y_{m,t+sm} − E_t y_{m,t+sm}) are unpredictable, as they would be if expectations are rational.
The intercepts in these regressions pick out constant term premia. Non-zero slopes would indicate that the changes of the term premia are predictable, which is at odds with the expectations hypothesis.
Notice that we use realized (instead of expected) values on the left hand side of the tests (7.4)–(7.6). This is valid under the assumption that expectations can be well approximated by the properties of the sample data. To see that, consider the yield term premium in (7.1) and add/subtract the realized value of the average future short rate, Σ_{s=0}^{k−1} y_{m,t+sm}/k,

y_nt − (1/k) Σ_{s=0}^{k−1} E_t y_{m,t+sm} = y_nt − (1/k) Σ_{s=0}^{k−1} y_{m,t+sm} + ε_{t+n}, where   (7.7)
ε_{t+n} = (1/k) Σ_{s=0}^{k−1} y_{m,t+sm} − (1/k) Σ_{s=0}^{k−1} E_t y_{m,t+sm}.   (7.8)

Notice that ε_{t+n} is the surprise, so it should not be forecastable by any information available in period t, provided expectations are rational. By comparing with the test (7.4) we see that ε_{t+n} is loaded into the regression residual u_{t+n}. (This does not cause any trouble, since ε_{t+n} should be uncorrelated with all regressors, since they are known in t.)
7.2.2
Cochrane and Piazzesi (2005) show that a single forecasting factor (a linear combination of forward rates) captures much of the predictability of excess bond returns. If we estimate such a factor,

ff_t = b̂_3' x_t,   (7.9)

for one maturity (2-year bond, say), then the regressions

(1/m) ln(P_{n−m,t+m}/P_nt) − y_mt = a_n + b_n ff_t   (7.10)

work almost as well for the other maturities. See Table 7.1.

                 (1)       (2)       (3)       (4)
factor           1.00      1.88      2.70      3.48
                 (6.45)    (6.65)    (6.83)    (7.01)
constant         0.00      0.00      0.00      0.01
                 (−0.00)   (−0.58)   (−1.05)   (−1.48)
R²               0.14      0.15      0.16      0.17
obs              556       556       556       556

Table 7.1: Regression of different excess (1-year) holding period returns (in columns, indicating the maturity of the respective bond) on a single forecasting factor and a constant. Numbers in parentheses are t-statistics. U.S. data for 1964:1–2011:4.
239
Slope coeffients
10
Each line shows the regression of one excess
holding period return on several forward rates
8
6
4
2
0
2
4
2
3
4
5
6
8
10
1.5
2.5
3
3.5
Regressors: forward rates (years)
4.5
Figure 7.4: A single forecasting factor for bond excess hold period returns
7.2.3 Spread-Based Tests
Many classical tests of the expectations hypothesis have only used interest rates as predictors (x_t includes only interest rates). In addition, since interest rates have long swings (are close to being non-stationary), the regressions have been expressed in terms of spreads.
To test that the yield term premium is zero (or at least constant), add and subtract y_mt (the current short m-period rate) from (7.4) and rearrange to get

(1/k) Σ_{s=0}^{k−1} (y_{m,t+sm} − y_mt) = (y_nt − y_mt) + ε_{t+n},   (7.11)

which says that the term spread between a long and a short rate (y_nt − y_mt) equals the expected average future change of the short rate (relative to the current short rate).
Example 7.3 (Yield term premium, rolling over 3-month rates for a year)

(1/4)[(y_{3m,t} − y_{3m,t}) + (y_{3m,t+3m} − y_{3m,t}) + (y_{3m,t+6m} − y_{3m,t}) + (y_{3m,t+9m} − y_{3m,t})] = (y_{12m,t} − y_{3m,t}) + ε_{t+12m}.

The corresponding regression is

(1/k) Σ_{s=0}^{k−1} (y_{m,t+sm} − y_mt) = α + β (y_nt − y_mt) + ε_{t+n},   (7.12)

where the expectations hypothesis (zero yield term premium) implies α = 0 and β = 1. (Sometimes the intercept is disregarded.) See Figure 7.5 for an empirical example.
[Figure 7.5: Intercept and slope (with 90% confidence bands) across maturities (years), from regressions of average future changes in the 3-month rate on a constant and the current term spread (long minus 3m). US monthly data 1970:1–2011:4.]
Similarly, rearranging (7.5) gives

y_{m,t+k} − y_mt = α + β [f_t(k, k + m) − y_mt] + ε_{t+k+m},   (7.13)

where the expectations hypothesis (zero forward term premium) implies α = 0 and β = 1. This regression tests if the forward-spot spread is an unbiased predictor of the change of the spot rate.
Finally, use (7.3) to rearrange (7.6) as

y_{n−m,t+m} − y_nt = α + β [m/(n − m)] (y_nt − y_mt) + ε_{t+n}.   (7.14)
7.3
Following Froot (1989), the deviations from the expectations hypothesis can be decomposed into the effects of a time-varying term premium and of systematic expectation errors. Write the test regressions above compactly as

Δi_{t+1} = α + β s_t + u_{t+1},   (7.15)

and decompose the spread as

s_t = E^m_t Δi_{t+1} + φ_t,   (7.16)

where E^m_t Δi_{t+1} is the market's expectation of the interest rate change and φ_t is the risk premium. In this expression, Δi_{t+1} is short hand notation for the dependent variable (which in all three cases is a change of an interest rate) and s_t denotes the regressor (which in all three cases is a term spread).
The regression coefficient in (7.15) is

β = 1 − σ(ρ + σ)/(1 + σ² + 2ρσ) + ψ, where   (7.17)
σ = Std(φ)/Std(E^m_t Δi_{t+1}), ρ = Corr(E^m_t Δi_{t+1}, φ), and
ψ = Cov(E_t Δi_{t+1} − E^m_t Δi_{t+1}, s_t)/Var(s_t).

The second term in (7.17) captures the effect of the (time varying) risk premium and the third term (ψ) captures any systematic expectation errors (E_t Δi_{t+1} ≠ E^m_t Δi_{t+1}).
Figure 7.6 shows how the expectations corrected regression coefficient (β − ψ) depends on the relative volatility of the term premium and the expected interest change (σ) and their correlation (ρ). A regression coefficient of unity could be due to either a constant term premium (σ = 0), or to a particular combination of relative volatility and correlation (ρ = −σ), which makes the forward spread an unbiased predictor.
When the correlation is zero, the regression coefficient decreases monotonically with σ, since an increasing fraction of the movements in the forward rate are then due to the risk premium.

[Figure 7.6: The regression coefficient (7.17) as a function of σ, for different correlations ρ.]
Systematic expectation errors (ψ ≠ 0) would show up in ex post data. We cannot, of course, tell whether these expectation errors are due to a small sample (for instance, a peso problem) or to truly irrational expectations.
Proof. (of (7.17)) Define

Δi_{t+1} = E_t Δi_{t+1} + u_{t+1} and E_t Δi_{t+1} = E^m_t Δi_{t+1} + ζ_t,

so u_{t+1} is the (rational) forecast error and ζ_t is the systematic deviation of the rational expectation from the market expectation. The probability limit of the LS estimate of β in (7.15) is Cov(Δi_{t+1}, s_t)/Var(s_t), with s_t = E^m_t Δi_{t+1} + φ_t. Using the notation σ_mm = Var(E^m_t Δi_{t+1}), σ_φφ = Var(φ_t), σ_mφ = Cov(E^m_t Δi_{t+1}, φ_t), etc., we get

β = (σ_mm + σ_mφ + σ_ζm + σ_ζφ)/(σ_mm + σ_φφ + 2σ_mφ)
  = 1 − (σ_φφ + σ_mφ)/(σ_mm + σ_φφ + 2σ_mφ) + (σ_ζm + σ_ζφ)/(σ_mm + σ_φφ + 2σ_mφ).

Dividing the numerator and denominator of the middle term by σ_mm, and using σ² = σ_φφ/σ_mm and ρ = σ_mφ/(σ_m σ_φ), gives

β = 1 − σ(σ + ρ)/(1 + σ² + 2ρσ) + ψ,

which is (7.17).
Bibliography
Cochrane, J. H., and M. Piazzesi, 2005, "Bond risk premia," American Economic Review, 95, 138–160.
Froot, K. A., 1989, "New hope for the expectations hypothesis of the term structure of interest rates," The Journal of Finance, 44, 283–304.
Reference: Cochrane (2005) 19; Campbell, Lo, and MacKinlay (1997) 11; Backus, Foresi, and Telmer (1998); Singleton (2006) 12–13

8.1 Overview
On average, yield curves tend to be upward sloping (see Figure 8.1), but there is also considerable time variation in both the level and shape of the yield curves.

[Figure 8.1: Average yields across maturities (years). Sample: 1970:1–2011:4.]
A common way to summarize the yield curve is in terms of its level, slope and curvature

Level_t = y_10y
Slope_t = y_10y − y_3m
Curvature_t = (y_2y − y_3m) − (y_10y − y_2y).   (8.1)

This means that we measure the level by a long rate, the slope by the difference between a long and a short rate, and the curvature (or rather, concavity) by how much the medium/short spread exceeds the long/medium spread. For instance, if the yield curve is hump shaped (so y_2y is higher than both y_3m and y_10y), then the curvature measure is positive. In contrast, when the yield curve is U-shaped (so y_2y is lower than both y_3m and y_10y), then the curvature measure is negative. See Figure 8.2 for an example.
An alternative is to use principal component analysis. See Figure 8.3 for an example.
Remark 8.1 (Principal component analysis) The first (sample) principal component of the zero-mean (possibly demeaned) N×1 vector z_t is w_1' z_t, where w_1 is the eigenvector associated with the largest eigenvalue of Σ = Cov(z_t). This value of w_1 solves the problem max_w w'Σw subject to the normalization w'w = 1. The eigenvalue equals Var(w_1' z_t) = w_1' Σ w_1. The j-th principal component solves the same problem, but under the additional restriction that w_i' w_j = 0 for all i < j. The solution is the eigenvector associated with the j-th largest eigenvalue (which equals Var(w_j' z_t) = w_j' Σ w_j). This means that the first K principal components are those (normalized) linear combinations that account for as much of the variability as possible, and that the principal components are uncorrelated (Cov(w_i' z_t, w_j' z_t) = 0). Dividing an eigenvalue by the sum of eigenvalues gives a measure of the relative importance of that principal component (in terms of variance). If the rank of Σ is K, then only K eigenvalues are non-zero.
Remark 8.2 (Principal component analysis 2) Let W be the N×N matrix with w_i as column i. We can then calculate the N×1 vector of principal components as pc_t = W' z_t. Since W^{−1} = W' (the eigenvectors are orthogonal), we can invert as z_t = W pc_t. The w_i vector (column i of W) therefore shows how the different elements in z_t change as the i-th principal component changes.
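A minimal Python sketch of Remarks 8.1–8.2, assuming numpy (names hypothetical):

import numpy as np

def yield_pca(Y):
    """Principal components of yields. Y: T x N matrix of yields. Returns the
    eigenvectors W (columns), the variance shares, and the T x N matrix of
    principal components pc_t = W'z_t (for the demeaned data)."""
    Z = Y - Y.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False)
    eigval, W = np.linalg.eigh(Sigma)          # ascending order
    order = np.argsort(eigval)[::-1]           # sort to descending
    eigval, W = eigval[order], W[:, order]
    return W, eigval / eigval.sum(), Z @ W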
Interest rates are strongly related to business cycle conditions, so it often makes sense to include macroeconomic data in the modelling. See Figure 8.4 for how the term spreads move with the business cycle.

[Figure 8.2: Level factor (long rate), slope factor (long/short and long/medium spreads), and curvature factor for US interest rates, 1970–2010.]
8.2
There are many different types of risk premia on fixed income markets.
Nominal bonds are risky in real terms, and are therefore likely to carry inflation risk premia. Long bonds are risky because their market values fluctuate over time, so they probably have term premia. Corporate bonds and some government bonds (in particular, from developing countries) have default risk premia, depending on the risk of default. Interbank rates may be higher than T-bill rates of the same maturity for the same reason (see the TED spread, the spread between 3-month Libor and T-bill rates), and illiquid bonds may carry liquidity premia (see the spread between off-the-run and on-the-run bonds). Figures 8.5–8.8 provide some examples.
[Figure 8.5: US interest rate spreads, 1970–2010.]

[Figure 8.3: The first three principal components of US interest rates across maturities (3m to 10y); they account for 97.2, 2.6 and 0.1 percent of the variance, respectively.]
8.3
An affine yield curve model implies that the yield on an n-period discount bond can be written

y_nt = a_n + b_n' x_t, where a_n = A_n/n and b_n = B_n/n,   (8.2)

where x_t is a K×1 vector of state variables. The A_n (a scalar) and the B_n (a K×1 vector) are discussed below.
The price of an n-period bond equals the cross-moment between the pricing kernel (M_{t+1}) and the value of the same bond next period (then an n−1 period bond)

P_nt = E_t M_{t+1} P_{n−1,t+1}.   (8.3)
[Figure 8.4: US term spreads and business cycle conditions, 1970–2010.]

The Vasicek model assumes that the log SDF (m_{t+1}) is an affine function of a single AR(1) state variable

m_{t+1} = −x_t + λ ε_{t+1}   (8.4)
x_{t+1} = (1 − ρ)μ + ρ x_t + ε_{t+1}, with ε_{t+1} ~ N(0, σ²).   (8.5)

The multivariate version has a K×1 state vector

m_{t+1} = −1' x_t + λ' S ε_{t+1}   (8.6)
x_{t+1} = (I − ρ)μ + ρ x_t + S ε_{t+1}, with ε_{t+1} ~ N(0, I).   (8.7)
[Figure 8.6: Yields on Baa and Aaa corporate bonds and the 10-year Treasury, 1970–2010.]

[Figure 8.7: Corporate bond spreads, 1990–2010.]
With ln P_nt = −A_n − B_n x_t (as in Remark 8.6 below), matching coefficients in (8.3) gives the recursions

B_n = 1 + ρ B_{n−1} and   (8.8)
A_n = A_{n−1} + B_{n−1}(1 − ρ)μ − (λ + B_{n−1})² σ²/2,   (8.9)

starting from A_0 = B_0 = 0.
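A minimal Python sketch of the recursions as reconstructed in (8.8)–(8.9), assuming numpy (names hypothetical):

import numpy as np

def vasicek_loadings(n_max, rho, mu, lam, sigma2):
    """Recursions (8.8)-(8.9) for the single-factor Vasicek model, starting from
    A_0 = B_0 = 0; yields then follow y_n = A_n/n + (B_n/n) x_t as in (8.2)."""
    A = np.zeros(n_max + 1)
    B = np.zeros(n_max + 1)
    for n in range(1, n_max + 1):
        B[n] = 1 + rho * B[n - 1]
        A[n] = A[n - 1] + B[n - 1] * (1 - rho) * mu - (lam + B[n - 1])**2 * sigma2 / 2
    a = A[1:] / np.arange(1, n_max + 1)
    b = B[1:] / np.arange(1, n_max + 1)
    return a, b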
[Figure 8.8: TED spread (3-month Libor minus T-bill), 2007–2011, and the spread between off-the-run and on-the-run Treasuries, 1998–2010.]

For the multivariate Vasicek model (8.6)–(8.7), the corresponding recursions are

B_n' = 1' + B_{n−1}' ρ and   (8.10)
A_n = A_{n−1} + B_{n−1}'(I − ρ)μ − (λ + B_{n−1})' SS' (λ + B_{n−1})/2.   (8.11)
The CIR model instead lets the shocks be scaled by the square root of the state variable

m_{t+1} = −x_t + λ √x_t ε_{t+1}   (8.12)
x_{t+1} = (1 − ρ)μ + ρ x_t + √x_t ε_{t+1}, with ε_{t+1} ~ N(0, σ²),   (8.13)

and in the multivariate version

m_{t+1} = −1' x_t + λ' S diag(√x_t) ε_{t+1}   (8.14)
x_{t+1} = (I − ρ)μ + ρ x_t + S diag(√x_t) ε_{t+1}.   (8.15)

The recursions for the single-factor CIR model are

B_n = 1 + B_{n−1} ρ − (λ + B_{n−1})² σ²/2 and   (8.16)
A_n = A_{n−1} + B_{n−1}(1 − ρ)μ,   (8.17)

and for the multivariate version

B_n' = 1' + B_{n−1}' ρ − (λ'S + B_{n−1}'S) ⊙ (λ'S + B_{n−1}'S)/2 and   (8.18)
A_n = A_{n−1} + B_{n−1}'(I − ρ)μ,   (8.19)

where ⊙ denotes element-by-element multiplication.
(8.19)
m t C1 D y1t
t C1
The K
t C1 ;
t0 t =2
N.0; I /:
(8.20)
where 0 is a K
t D 0 C 1xt ;
(8.21)
the state vector dynamics is the same as in the multivariate Vasicek model (8.7). For this
model, the coefficients are
Bn0 D Bn0
An D An
1
1
C Bn0
(8.22)
S 1 C b10
1
.I
S 0
Bn0 1 SS 0 Bn 1 =2 C a1 :
(8.23)
8.4
The maximum likelihood approach typically backs out the unobservable factors from the yields, by either assuming that some of the yields are observed without any errors or by applying a filtering approach.

8.4.1
We assume that K yields (as many as there are factors) are observed without any errors; these can be used in place of the state vector. Put the perfectly observed yields in the vector y_ot and stack the factor model for these yields; do the same for the J yields (times maturity) observed with errors, y_ut,

y_ot = a_o + b_o' x_t, so x_t = b_o'^{−1}(y_ot − a_o), and   (8.24)
y_ut = a_u + b_u' x_t + ξ_t,   (8.25)

where ξ_t are the measurement errors. The vector a_o and matrix b_o stack the a_n and b_n for the perfectly observed yields; a_u and b_u for the yields that are observed with measurement errors (u for unobserved, although that is something of a misnomer). Clearly, the a vectors and b matrices depend on the parameters of the model, and need to be calculated (recursively) for the maturities included in the estimation.
The measurement errors are not easy to interpret: they may include a bit of pure measurement errors, but they are also likely to pick up model specification errors. It is therefore difficult to know which distribution they have, and whether they are correlated across maturities and time. The perhaps most common (ad hoc) approach is to assume that the errors are iid normally distributed with a diagonal covariance matrix. To the extent that this is a false assumption, the MLE approach should perhaps be better thought of as a quasi-MLE.
The estimation clearly relies on assuming rational expectations: the perceived dynamics (which govern how the market values different bonds) are estimated from the actual dynamics of the data. In a sense, the models themselves do not assume rational expectations: we could equally well think of the state dynamics as reflecting what the market participants believed in. However, in the econometrics we estimate this from the actual dynamics in the historical sample.
Remark 8.3 (Log likelihood based on normal distribution) The log pdf of a q×1 vector z_t ~ N(μ_t, Ω_t) is

ln pdf(z_t) = −(q/2) ln(2π) − (1/2) ln|Ω_t| − (1/2)(z_t − μ_t)' Ω_t^{−1} (z_t − μ_t).
Example 8.4 (Backing out factors) Suppose there are two factors, and that y_1t and y_12t are assumed to be observed without errors and y_6t with a measurement error. Then (8.24)–(8.25) are

[y_1t; y_12t] = [a_1; a_12] + [b_1'; b_12'] [x_1t; x_2t]
             = [a_1; a_12] + [b_1,1 b_1,2; b_12,1 b_12,2] [x_1t; x_2t], and
y_6t = a_6 + b_6' [x_1t; x_2t] + ξ_6t = a_6 + [b_6,1 b_6,2] [x_1t; x_2t] + ξ_6t,

where the first system defines (a_o, b_o') and the second (a_u, b_u').
Remark 8.5 (Discrete time models and how to quote interest rates) In a discrete time model, it is often convenient to define the period length according to which maturities we want to analyze. For instance, with data on 1-month, 3-month, and 4-year rates, it is convenient to let the period length be one month. The (continuously compounded) interest rate data are then scaled by 1/12.
Remark 8.6 (Data on coupon bonds) The estimation of yield curve models is typically done on data for spot interest rates (yields on zero coupon bonds). The reason is that coupon bond prices (and yields to maturity) are not exponentially affine in the state vector. To see that, notice that a bond that pays coupons in periods 1 and 2 has the price P^c_2 = cP_1 + (1 + c)P_2 = c exp(−A_1 − B_1' x_t) + (1 + c) exp(−A_2 − B_2' x_t). However, this is not difficult to handle. For instance, the likelihood function could be expressed in terms of the log bond prices divided by the maturity (a quick approximate yield), or perhaps in terms of the yield to maturity.
Remark 8.7 (Filtering out the state vector) If we are unwilling to assume that we have enough yields without observation errors, then the backing out approach does not work. Instead, the estimation problem is embedded into a Kalman filter that treats the states as unobservable. In this case, the state dynamics are combined with measurement equations (expressing the yields as affine functions of the states plus errors). The Kalman filter is a convenient way to construct the likelihood function (when errors are normally distributed). See de Jong (2000) for an example.
Remark 8.8 (GMM estimation) Instead of using MLE, the model can also be estimated by GMM. The moment conditions could be the unconditional volatilities, autocorrelations and covariances of the yields. Alternatively, they could be conditional moments (conditional on the current state vector), which are transformed into moment conditions by multiplying by some instruments (for instance, the current state vector). See, for instance, Chan, Karolyi, Longstaff, and Sanders (1992) for an early example, which is discussed in Section 8.5.4.
8.4.2
Assume that we have data on K_F factors, F_t. We then only have to assume that K_y = K − K_F yields are observed without errors. Instead of (8.24) we then have

ỹ_ot = [y_ot; F_t] = [a_o; 0_{K_F×1}] + [b_o'; (0_{K_F×K_y}, I_{K_F})] x_t, so x_t = b̃_o'^{−1}(ỹ_ot − ã_o).   (8.26)
Example 8.9 (Some explicit and some implicit factors) Suppose there are three factors, and that y_1t and y_12t are assumed to be observed without errors, while F_t is a (scalar) explicit factor. Then (8.26) is

[y_1t; y_12t; F_t] = [a_1; a_12; 0] + [b_1'; b_12'; (0, 0, 1)] [x_1t; x_2t; x_3t]
                  = [a_1; a_12; 0] + [b_1,1 b_1,2 b_1,3; b_12,1 b_12,2 b_12,3; 0 0 1] [x_1t; x_2t; x_3t].

Clearly, x_3t = F_t.
8.4.3
The log likelihood of the yields observed without errors is

ln L_o = Σ_{t=1}^T ln pdf(y_ot | y_o,t−1).   (8.27)

Notice that the relation between x_t and y_ot in (8.24) is continuous and invertible, so a density function of x_t immediately gives the density function of y_ot. In particular, with a multivariate normal distribution x_t | x_{t−1} ~ N[E_{t−1} x_t, Cov_{t−1}(x_t)] we have

y_ot | y_o,t−1 ~ N[a_o + b_o' E_{t−1} x_t, b_o' Cov_{t−1}(x_t) b_o], with x_{t−1} = b_o'^{−1}(y_o,t−1 − a_o).   (8.28)

To calculate this expression, we must use the relevant expressions for the conditional mean and covariance.
See Figure 8.9 for an illustration.
[Figure 8.9: Vasicek model, time series estimation: pricing errors for the 1-year and 7-year yields, and the average yield curve (at the average x_t) vs. data.]

Example 8.10 (Time series estimation of the Vasicek model) In the Vasicek model, the state dynamics are

x_{t+1} = (1 − ρ)μ + ρ x_t + ε_{t+1},

and the one-period yield is y_1t = a_1 + b_1 x_t with a_1 = −λ²σ²/2 and b_1 = 1. The distribution of y_1t conditional on y_1,t−1 is N[a_1 + b_1 E_{t−1} x_t, b_1 Var_{t−1}(x_t) b_1], with E_{t−1} x_t = (1 − ρ)μ + ρ x_{t−1} and Var_{t−1}(x_t) = σ². Combining gives

y_1t | y_1,t−1 ~ N[(1 − ρ)(μ − λ²σ²/2) + ρ y_1,t−1, σ²], that is,
y_1t = (1 − ρ)(μ − λ²σ²/2) + ρ y_1,t−1 + ε_t.

Clearly, we can estimate an intercept, ρ, and σ² from this relation (with LS or ML), so it is not possible to identify μ and λ separately. We can therefore set λ to an arbitrary value. For instance, we can use λ to fit the average of a long interest rate. The other parameters are estimated to fit the time series behaviour of the short rate only.
Remark 8.11 (Slope of yield curve when ρ = 1) When ρ = 1, the slope of the yield curve is

y_nt − y_1t = −[(1 − 3n + 2n²)/6 + (n − 1)λ] σ²/2.

(To show this, notice that b_n = 1 for all n when ρ = 1.) As a starting point for calibrating λ, we could therefore use

λ_guess = −(1/(n − 1)) [(ȳ_nt − ȳ_1t)/(σ²/2) + (1 − 3n + 2n²)/6],

where ȳ_nt and ȳ_1t are the sample means of a long and a short rate.
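This starting value is a one-liner in Python (assuming the formula as reconstructed above; names hypothetical):

def lambda_guess(ybar_n, ybar_1, n, sigma2):
    # Remark 8.11 (rho = 1 case): back out lambda from the average yield curve slope
    return -((ybar_n - ybar_1) / (sigma2 / 2) + (1 - 3*n + 2*n**2) / 6) / (n - 1)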
Example 8.12 (Time series estimation of the CIR model) In the CIR model (8.12)–(8.13), the one-period yield is y_1t = (1 − λ²σ²/2) x_t, so the conditional distribution of the short rate is

y_1t | y_1,t−1 ~ N[(1 − λ²σ²/2)(1 − ρ)μ + ρ y_1,t−1, (1 − λ²σ²/2) σ² y_1,t−1], that is,
y_1t = (1 − λ²σ²/2)(1 − ρ)μ + ρ y_1,t−1 + (1 − λ²σ²/2)^{1/2} √y_1,t−1 ε_{t+1},

so the conditional variance is proportional to the lagged short rate.
Example 8.13 (Empirical results from the Vasicek model, time series estimation) Figure 8.9 reports results from a time series estimation of the Vasicek model: only a (relatively) short interest rate is used. The estimation uses monthly observations of monthly interest rates (that is, the usual interest rates/1200). The restriction λ = −200 is imposed (as λ is not separately identified by the data), since this allows us to also fit the average 10-year interest rate. The upward sloping (average) yield curve illustrates the kind of risk premia that this model can generate.
Remark 8.14 (Likelihood function with explicit factors) In case we have some explicit factors like in (8.26), then (8.28) must be modified as

ỹ_ot | ỹ_o,t−1 ~ N[ã_o + b̃_o' E_{t−1} x_t, b̃_o' Cov_{t−1}(x_t) b̃_o], with x_{t−1} = b̃_o'^{−1}(ỹ_o,t−1 − ã_o).
8.4.4
The log likelihood of the yields observed with measurement errors is

ln L_u = Σ_{t=1}^T ln pdf(y_ut | y_ot).   (8.29)

It is common to assume that the measurement errors are iid normal with a zero mean and a diagonal covariance matrix with variances ω_i² (often pre-assigned, not estimated)

y_ut | y_ot ~ N[a_u + b_u' x_t, diag(ω_i²)], with x_t = b_o'^{−1}(y_ot − a_o).   (8.30)

Under this assumption (normal distribution with a diagonal covariance matrix), maximizing the likelihood function amounts to minimizing the weighted squared errors of the yields

arg max ln L_u = arg min Σ_{t=1}^T Σ_{n∈u} [(y_nt − ŷ_nt)/ω_n]²,   (8.31)

where ŷ_nt are the fitted yields, and the sum is over all yields observed with errors. In some applied work, the model is reestimated on every date. This is clearly not model consistent, since the model (and the expectations embedded in the long rates) is based on constant parameters.
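A minimal sketch of the criterion in (8.31), assuming numpy (names hypothetical):

import numpy as np

def cross_section_loss(y_u, y_u_fitted, omega):
    """Cross-sectional (quasi-)likelihood criterion (8.31): weighted sum of
    squared pricing errors over time and over the yields observed with errors.
    y_u, y_u_fitted: T x J arrays; omega: J-vector of error std deviations."""
    return (((y_u - y_u_fitted) / omega) ** 2).sum()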
Example 8.15 (Cross-sectional estimation of the Vasicek model) The two-period yield in the Vasicek model is

y_2t = a_2 + b_2 x_t, with b_2 = (1 + ρ)/2 and a_2 = (1 − ρ)μ/2 − [λ² + (1 + λ)²] σ²/4.

Clearly, with only one interest rate (y_2t) we can only estimate one parameter, so we need a larger cross section. However, even with a larger cross-section there are serious identification problems. The parameter ρ is well identified from how the entire yield curve typically moves in tandem with y_ot. However, μ, σ², and λ can all be tweaked to generate a sloping yield curve. For instance, a very high mean μ will make it look as if we are (even on average) below the mean, so the yield curve will be upward sloping. Similarly, both a very negative value of λ (essentially the negative of the price of risk) and a high volatility (risk) will give large risk premia, especially for longer maturities. In practice, it seems as if only one of the parameters μ, σ², and λ is well identified in the cross-sectional approach.
Example 8.16 (Empirical results from the Vasicek model, cross-sectional estimation) Figure 8.10 reports results from a cross-sectional estimation of the Vasicek model, where it is assumed that the variances of the observation errors (ω_i²) are the same across yields. The estimation uses monthly observations of monthly interest rates (that is, the usual interest rates/1200). The values of ρ and σ² are restricted to the values obtained in the time series estimation, so only μ and λ are estimated. Choosing other values of ρ and σ² gives different estimates of λ, but still the same yield curve (at least on average).
260
0.04
0.08
1 year
7 years
0.02
at average xt
data
0.07
0
0.06
0.02
0.04
1970
1980
1990
2000
0.05
2010
4
6
8
Maturity (years)
10
XT
t D1
1/
1/ :
(8.32)
can be split up as
1 /;
(8.33)
does not affect the distribution of yut conditional on yot . Taking logs gives
ln L t D ln pdf .yut jyot / C ln pdf.yot jyo;t
1 /:
(8.34)
The first term is the same as in the cross-sectional estimation and the second is the same
as in the time series estimation. The log likelihood (8.32) is therefore just the sum of the
log likelihoods from the pure cross-sectional and the pure time series estimations
ln L D
XT
t D1
ln Lut C ln Lot :
(8.35)
See Figures 8.11–8.15 for illustrations. Notice that the variances of the observation errors (ω_i²) are important for the relative weights of the contributions from the time series and cross-sectional parts.
Example 8.17 (MLE of the Vasicek model) Consider the Vasicek model, where we observe y_1t without errors and y_2t with measurement errors. The likelihood function is then the sum of the log pdfs in Examples 8.10 and 8.15, except that the cross-sectional part must include the variance of the observation errors (ω²), which is assumed to be equal across maturities.
Pricing errors, Vasicek (TS&CS)
0.04
1 year
7 years
0.02
at average xt
data
0.07
0
0.06
0.02
0.04
1970
1980
1990
2000
2010
0.05
4
6
8
Maturity (years)
10
Figure 8.11: Estimation of Vasicek model, combined time series and cross-sectional approach
Example 8.18 (Empirical results from the Vasicek model, combined time series and cross-sectional estimation) Figure 8.11 reports results from a combined time series and cross-sectional estimation of the Vasicek model. The estimation uses monthly observations of monthly interest rates (that is, the usual interest rates/1200). All model parameters (μ, ρ, λ, σ²) are estimated, along with the variance of the measurement errors. (All measurement errors are assumed to have the same variance, ω².) Figure 8.12 reports the loadings on the constant and the short rate according to the Vasicek model and (unrestricted) OLS. The patterns are fairly similar, suggesting that the cross-equation (and cross-maturity) restrictions imposed by the Vasicek model are not at great odds with the data.
[Figure 8.12: loadings on the constant and on the short rate, Vasicek model vs. unrestricted OLS, against maturity (years).]
Example 8.20 (Empirical results from a two-factor Vasicek model) Figure 8.13 reports results from a two-factor Vasicek model. The estimation uses monthly observations of monthly interest rates (that is, the usual interest rates/1200). We can only identify the mean of the SDF, not whether it is due to factor one or two. Hence, I restrict μ₂ = 0. The results indicate that there is one very persistent factor (affecting the yield curve level) and another, slightly less persistent, factor (affecting the yield curve slope). The price of risk is larger (λᵢ more negative) for the more persistent factor. This means that the risk premia will scale almost linearly with the maturity. As a practical matter, it turned out that a derivative-free method (fminsearch in MatLab) worked much better than standard optimization routines. The pricing errors are clearly smaller than in a one-factor Vasicek model. Figure 8.15 illustrates the forecasting performance of the model by showing scatter plots of predicted yields and future realized yields. An unbiased forecasting model would produce a scatter centered on the 45-degree line.
[Figure 8.13: Estimation of two-factor Vasicek model; pricing errors for 1-year and 7-year yields, 1970-2010, and average yield curve (at average x_t vs. data) against maturity (years).]
8.5 Summaries of Some Empirical Papers
[Figure 8.14: Two-factor Vasicek model; pricing errors for 1-year and 7-year yields, 1970-2010, and average yield curve (at average x_t vs. data) against maturity (years).]

[Figure 8.15: Forecasting performance of the two-factor Vasicek model; realized yields plotted against the forecast of y(36m) made 2 years ahead.]

8.5.1 A Joint Econometric Model of Macroeconomic and Term Structure Dynamics by Hördahl et al (2005)
Reference: Hördahl, Tristani, and Vestin (2006), Ang and Piazzesi (2003)
This paper estimates both an affine yield curve model and a macroeconomic model on monthly German data for 1975-1998.
To identify the model, the authors put a number of restrictions on the VAR coefficient matrix. In particular, the lagged variables in x_t are assumed to have no effect on the unobservable factors.
The key distinguishing feature of this paper is that a macro model (for inflation, output,
and the policy for the short interest rate) is estimated jointly with the yield curve model.
(In contrast, Ang and Piazzesi (2003) estimate the macro model separately.) In this case,
the unobservable factors include variables that affect both yields and the macro variables
(for instance, the time-varying inflation target). Conversely, the observable data includes
not only yields, but also macro variables (output, inflation). It is found, among other
things, that the time-varying inflation target has a crucial effect on yields and that bond
risk premia are affected both by policy shocks (to the short-run policy rule and to the inflation target) and by business cycle shocks.
8.5.3 Forecasting the Yield Curve by Diebold and Li (2006)
Diebold and Li (2006) use the Nelson-Siegel model for an m-period interest rate,

y(m) = β₀ + β₁ [1 − exp(−m/τ₁)]/(m/τ₁) + β₂ {[1 − exp(−m/τ₁)]/(m/τ₁) − exp(−m/τ₁)},   (8.36)

and set τ₁ = 1/(12 × 0.0609). Their approach is as follows. For a given trading date,
construct the factors (the terms multiplying the beta coefficients) for each bond. Then, run a regression of the cross-section of yields on these factors to estimate the beta coefficients. Repeat this for every trading day, and plot the three time series of the coefficients.
See Figure 8.16 for an example. The results are very similar to the factors calculated
directly from yields (cf. Figure 8.2).
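The date-by-date regressions are easy to implement. The following is a minimal sketch (with made-up yield data; all variable names are mine, not Diebold and Li's): it builds the three loadings in (8.36) for each maturity and then runs one cross-sectional OLS per date.

import numpy as np

def ns_loadings(m_years, tau1=1/(12*0.0609)):
    # Nelson-Siegel factor loadings in (8.36); maturities in years
    z = m_years / tau1
    level = np.ones_like(z)
    slope = (1 - np.exp(-z)) / z
    curvature = slope - np.exp(-z)
    return np.column_stack([level, slope, curvature])

# y is a T x M panel of yields for M maturities (made-up numbers)
m_years = np.array([0.25, 1, 2, 5, 7, 10])
T = 5
rng = np.random.default_rng(0)
y = 0.05 + 0.01 * rng.standard_normal((T, len(m_years)))

F = ns_loadings(m_years)              # M x 3 loadings, fixed across dates
betas = np.empty((T, 3))
for t in range(T):                    # one cross-sectional OLS per trading date
    betas[t], *_ = np.linalg.lstsq(F, y[t], rcond=None)
# betas[:, 0], betas[:, 1], betas[:, 2] are the level, slope and curvature series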
[Figure: US interest rates, Diebold-Li factors (level, slope, curvature), 1970-2010; level coefficient against maturity (in years).]
Figure 8.16: US yield curves: level, slope and curvature, Diebold-Li approach
8.5.4 An Empirical Comparison of Alternative Models of the Short-Term Interest Rate by Chan, Karolyi, Longstaff, and Sanders (1992)
Reference: Chan, Karolyi, Longstaff, and Sanders (1992) (CKLS), Dahlquist (1996)
This paper focuses on the dynamics of the short rate process. The models that CKLS study have the following dynamics (under the natural/physical distribution) of the one-period interest rate, y₁t:

y_{1,t+1} − y_{1t} = α + βy_{1t} + ε_{t+1}, with E_t ε_{t+1} = 0 and E_t ε²_{t+1} = σ²y_{1t}^{2γ}.   (8.37)
This formulation nests several well-known models: γ = 0 gives a Vasicek model and γ = 1/2 a CIR model (which are the only cases that will deliver a single-factor affine model). It is an approximation of the diffusion process

dr_t = (α + βr_t)dt + σr_t^γ dW_t,   (8.38)
where W_t is a Wiener process. (For an introduction to the issues involved in estimating a continuous time model on discrete data, see Campbell, Lo, and MacKinlay (1997) 9.3 and Harvey (1989) 9. In some cases, like the homoskedastic AR(1), there is no approximation error because of the discrete sampling; in other cases, there is.)
CKLS estimate the model (8.37) with GMM using the following moment conditions

g_t(α, β, σ², γ) = [ε_{t+1}, ε_{t+1}y_{1t}, ε²_{t+1} − σ²y_{1t}^{2γ}, (ε²_{t+1} − σ²y_{1t}^{2γ})y_{1t}]′,   (8.39)

where ε_{t+1} = y_{1,t+1} − y_{1t} − α − βy_{1t}, so there are four moment conditions and four parameters (α, β, σ², and γ). The choice of the instruments (1 and y_{1t}) is somewhat arbitrary, since any variables in the information set in t would do.
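Since the system is exactly identified, any weighting matrix gives the same estimates, so one can simply drive the sample moments to zero. A minimal sketch (the simulated short-rate series and the optimizer settings are illustrative assumptions):

import numpy as np
from scipy.optimize import minimize

def ckls_moments(theta, y):
    # sample averages of the four moment conditions in (8.39)
    alpha, beta, sigma2, gamma = theta
    y0, y1 = y[:-1], y[1:]
    e = y1 - y0 - alpha - beta * y0            # eps_{t+1}
    u = e**2 - sigma2 * y0**(2 * gamma)        # squared-error condition
    return np.array([e.mean(), (e * y0).mean(), u.mean(), (u * y0).mean()])

def loss(theta, y):
    g = ckls_moments(theta, y)
    return g @ g                               # exactly identified: minimize g'g

# made-up monthly short-rate data, scaled to stay positive
rng = np.random.default_rng(42)
x = rng.standard_normal(300).cumsum()
y = 0.05 + 0.02 * (x - x.min()) / (x.max() - x.min())

theta0 = np.array([0.0, 0.0, 1e-4, 0.5])       # (alpha, beta, sigma2, gamma)
res = minimize(loss, theta0, args=(y,), method="Nelder-Mead",
               options={"maxiter": 5000, "fatol": 1e-14})
print(res.x)

A derivative-free method is used here, in the spirit of the fminsearch remark in Example 8.20; the loss surface can be quite flat in (σ², γ).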
CKLS estimate this model in various forms (imposing different restrictions on the parameters) on monthly data on one-month T-bill rates for 1964-1989. They find that both α̂ and β̂ are close to zero (in the unrestricted model β̂ < 0 and almost significantly different from zero, indicating mean-reversion). They also find that γ̂ > 1, and significantly so. This is problematic for the affine one-factor models, since they require γ = 0 or γ = 1/2. A word of caution: the estimated parameter values suggest that the interest rate is nonstationary.
[Figure: CKLS estimation on US data. Panels: the short rate, 1960-2000; the GMM loss function (local) with correlations of the moments; drift vs. level (y_{t+1} − y_t against y_t); and volatility vs. level (100(y_{t+1} − y_t)² against y_t).]
Bibliography

Ang, A., and M. Piazzesi, 2003, "A no-arbitrage vector autoregression of term structure dynamics with macroeconomic and latent variables," Journal of Monetary Economics, 50, 745-787.

Backus, D., S. Foresi, and C. Telmer, 1998, "Discrete-time models of bond pricing," Working Paper 6736, NBER.

Brown, R. H., and S. M. Schaefer, 1994, "The term structure of real interest rates and the Cox, Ingersoll, and Ross model," Journal of Financial Economics, 35, 3-42.
9 Non-Parametric Estimation

9.1 Nonparametric Regression

Reference: Campbell, Lo, and MacKinlay (1997) 12.3; Härdle (1990); Pagan and Ullah (1999); Mittelhammer, Judge, and Miller (2000) 21
9.1.1
Introduction
Nonparametric regressions are used when we are unwilling to impose a parametric form on the regression equation, and we have a lot of data.
Let the scalars y_t and x_t be related as

y_t = b(x_t) + ε_t,   (9.1)

where ε_t is uncorrelated over time and where E ε_t = 0 and E(ε_t|x_t) = 0. The function b(·) is unknown and possibly non-linear.
One possibility of estimating such a function is to approximate b(x_t) by a polynomial (or some other basis). This will give quick estimates, but the results are global in the sense that the value of b(x_t) at a particular x_t value (x_t = 1.9, say) will depend on all the data points, and potentially very strongly so. The approach in this section is more local: it downweights information from data points where x_s is far from x_t.
Suppose the sample had 3 observations (say, t = 3, 27, and 99) with exactly the same value of x_t, say 1.9. A natural way of estimating b(x) at x = 1.9 would then be to average over these 3 observations, since we can expect the average of the error terms to be close to zero (iid and zero mean).
Unfortunately, we seldom have repeated observations of this type. Instead, we may try to approximate the value of b(x) (x is a single value, 1.9, say) by averaging over observations where x_t is close to x. The general form of this type of estimator is

b̂(x) = Σ_{t=1}^T w(x_t − x)y_t / Σ_{t=1}^T w(x_t − x),   (9.2)
where w(x_t − x)/Σ_{s=1}^T w(x_s − x) is the weight on observation t. Note that the denominator makes the weights sum to unity. The basic assumption behind (9.2) is that the b(x) function is smooth, so local (around x) averaging makes sense.
As an example of a w(·) function, it could give equal weight to the k values of x_t which are closest to x and zero weight to all other observations (this is the k-nearest neighbor estimator, see Härdle (1990) 3.2). As another example, the weight function could be defined so that it trades off the expected squared errors, E[y_t − b̂(x)]², and the expected squared acceleration, E[d²b̂(x)/dx²]². This defines a cubic spline (often used in macroeconomics when x_t = t; it is then called the Hodrick-Prescott filter).
Remark 9.1 (Easy way to calculate the nearest neighbor estimator, univariate case) Create a matrix Z where row t is (y_t, x_t). Sort the rows of Z according to the second column (x). Calculate an equally weighted centered moving average of the first column (y).
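A direct implementation of Remark 9.1 (a minimal sketch; the function and variable names are mine):

import numpy as np

def knn_regression(y, x, k):
    # k-nearest neighbor estimator via the sort-and-moving-average trick of
    # Remark 9.1: sort on x, then take a centered moving average of y (k odd)
    order = np.argsort(x)
    ys = y[order]
    half = k // 2
    fitted = np.full(ys.size, np.nan)          # ends are left as NaN
    for t in range(half, ys.size - half):
        fitted[t] = ys[t - half:t + half + 1].mean()  # equally weighted, centered
    return x[order], fitted

rng = np.random.default_rng(1)
x = rng.uniform(0, 3, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
xs, bhat = knn_regression(y, x, k=11)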
9.1.2
Kernel Regression
With the N(x, h²) kernel, we get the following estimator of b(x) at a point x:

b̂(x) = Σ_{t=1}^T K[(x_t − x)/h] y_t / Σ_{t=1}^T K[(x_t − x)/h], where K(u) = exp(−u²/2)/√(2π).   (9.4)
Remark 9.2 (Kernel as a pdf of N(x, h²)) If K(z) is the pdf of an N(0,1) variable, then K[(x_t − x)/h]/h is the same as using an N(x, h²) pdf of x_t. Clearly, the 1/h term would cancel in (9.4).
In practice, we have to estimate b̂(x) at a finite number of points x. This could, for instance, be 100 evenly spread points in the interval between the minimum and the maximum values observed in the sample. Special corrections might be needed if there are a lot of observations stacked close to the boundary of the support of x (see Härdle (1990) 4.4). See Figure 9.2 for an illustration.
Example 9.3 (Kernel regression) Suppose the sample has three data points [x₁, x₂, x₃] = [1.5, 2, 2.5] and [y₁, y₂, y₃] = [5, 4, 3.5]. Consider the estimation of b(x) at x = 1.9. With h = 1, the numerator in (9.4) is

Σ_{t=1}^T K(x_t − x)y_t = [e^{−(1.5−1.9)²/2} × 5 + e^{−(2−1.9)²/2} × 4 + e^{−(2.5−1.9)²/2} × 3.5]/√(2π)
≈ (0.92 × 5 + 1.0 × 4 + 0.84 × 3.5)/√(2π)
= 11.52/√(2π).

The denominator is

Σ_{t=1}^T K(x_t − x) = [e^{−(1.5−1.9)²/2} + e^{−(2−1.9)²/2} + e^{−(2.5−1.9)²/2}]/√(2π)
≈ 2.75/√(2π).

The estimate at x = 1.9 is therefore 11.52/2.75 ≈ 4.19.
Kernel regressions are typically consistent, provided longer samples are accompanied by smaller values of h, so the weighting function becomes more and more local as the sample size increases. It can be shown (see Härdle (1990) 3.1 and Pagan and Ullah (1999)) that the asymptotic bias and variance of the estimator involve the variance of the residuals, the marginal density of the regressor, and moments of the kernel such as

∫ K(u)u² du.   (9.5)

In these expressions, σ²(x) is the variance of the residuals in (9.1) and f(x) the marginal density of x. The remaining terms are functions of either the true regression function or the kernel.

[Figure 9.1: the weights used in a kernel regression, plotted against x_t.]

[Figure 9.2: kernel regression on data, with bandwidths h = 0.25 and h = 0.2.]

For the Gaussian kernel in (9.4),

∫ K(u)du = 1 and ∫ K(u)u² du = 1.   (9.6)

A common way of choosing the bandwidth h is cross-validation: pick the h that minimizes the sum of squared out-of-sample prediction errors

Σ_{t=1}^T [y_t − b̂₋t(x_t; h)]²/T,   (9.7)

where b̂₋t(x_t; h) is the fitted value at x_t when observation t is excluded from the estimation. At a point x with f(x) > 0, the estimator with a Gaussian kernel is asymptotically normally distributed,

√(Th)[b̂(x) − E b̂(x)] →d N(0, σ²(x)/(2√π f(x))).   (9.8)
[Figures: kernel regressions of the drift (y_{t+1}) and the volatility (Vol_{t+1}) of interest rates on the level y_t, for small and large bandwidths h, together with the cross-validation criterion plotted against h.]
9.1.3 Multivariate Kernel Regression

Suppose that y_t depends on two variables (x_t and z_t),

y_t = b(x_t, z_t) + ε_t,   (9.10)
where ε_t is uncorrelated over time and where E ε_t = 0 and E(ε_t|x_t, z_t) = 0. This makes the estimation problem much harder, since there are typically few observations in every bivariate bin (rectangle) of x and z. For instance, with as few as 20 intervals of each of x and z, we get 400 bins, so we need a large sample to have a reasonable number of observations in every bin.
In any case, the most common way to implement the kernel regressor is to let

b̂(x, z) = Σ_{t=1}^T K_x[(x_t − x)/h_x] K_z[(z_t − z)/h_z] y_t / Σ_{t=1}^T K_x[(x_t − x)/h_x] K_z[(z_t − z)/h_z],   (9.11)

where K_x(u) and K_z(v) are two kernels like in (9.4) and where we may allow h_x and h_z to be different (and to depend on the variances of x_t and z_t). In this case, the weight of the observation (x_t, z_t) is proportional to K_x[(x_t − x)/h_x] K_z[(z_t − z)/h_z], which is high if both x_t and z_t are close to x and z, respectively.
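The product-kernel estimator (9.11) is a one-line extension of the univariate case. A minimal sketch (h_x and h_z are assumed to be given):

import numpy as np

def gauss_kernel(u):
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)   # Gaussian kernel as in (9.4)

def kernel_regression_2d(x0, z0, x, z, y, hx, hz):
    # bivariate kernel regression (9.11) with a product of Gaussian kernels
    w = gauss_kernel((x - x0) / hx) * gauss_kernel((z - z0) / hz)
    return (w * y).sum() / w.sum()

# usage: bhat = kernel_regression_2d(1.9, 0.5, x, z, y, hx=0.25, hz=0.25)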
9.1.4 Nonparametric Estimation of State-Price Densities Implicit in Financial Asset Prices, by Ait-Sahalia and Lo (1998)

Ait-Sahalia and Lo (1998) apply this to daily data for Jan 1993 to Dec 1993 on S&P 500 index options (14,000 observations).
This paper estimates nonparametric option price functions and calculates the implicit
risk-neutral distribution as the second partial derivative of this function with respect to the
strike price.
1. First, the call option price, H_it, is estimated as a multivariate kernel regression

H_it = b(S_t, X, τ, r_t, δ_t) + ε_it,   (9.12)

where S_t is the price of the underlying asset, X is the strike price, τ is the time to expiry, r_t is the interest rate between t and t + τ, and δ_t is the dividend yield (if any) between t and t + τ. It is very hard to estimate a five-dimensional kernel regression, so various ways of reducing the dimensionality are tried. For instance, by making b(·) a function of the forward price, S_t exp[(r_t − δ_t)τ], instead of S_t, r_t, and δ_t separately.
2. Second, the implicit risk-neutral pdf of the future asset price is calculated as ∂²b(S_t, X, τ, r_t, δ_t)/∂X², properly scaled so it integrates to unity.
3. This approach is used on daily data for Jan 1993 to Dec 1993 on S&P 500 index options (14,000 observations). They find interesting patterns of the implied moments
(mean, volatility, skewness, and kurtosis) as the time to expiry changes. In particular, the nonparametric estimates suggest that distributions for longer horizons
have increasingly larger skewness and kurtosis: whereas the distributions for short
horizons are not too different from normal distributions, this is not true for longer
horizons. (See their Fig 7.)
4. They also argue that there is little evidence of instability in the implicit pdf over
their sample.
9.1.5 Testing Continuous-Time Models of the Spot Interest Rate, by Ait-Sahalia (1996)
Interest rate models are typically designed to describe the movements of the entire yield curve in terms of a small number of factors. For instance, the model

r_{t+1} − r_t = α + βr_t + ε_{t+1}, with E_t ε_{t+1} = 0 and E_t ε²_{t+1} = σ²r_t^{2γ},   (9.13)

is a discrete-time approximation of the diffusion

dr_t = (α + βr_t)dt + σr_t^γ dW_t,   (9.14)

where W_t is a Wiener process. Recall that affine one-factor models require γ = 0 (the Vasicek model) or γ = 0.5 (Cox-Ingersoll-Ross).
This paper tests several models of the short interest rate by using a nonparametric
technique.
1. The first step of the analysis is to estimate the unconditional distribution of the short interest rate by a kernel density estimator. The estimated pdf at the value r is denoted π̂₀(r).

2. The second step is to estimate the parameters in a short rate model (for instance, Vasicek's model) by making the unconditional distribution implied by the model parameters (denoted π(θ; r), where θ is a vector of the model parameters and r a value of the short rate) as close as possible to the nonparametric estimate obtained in step 1. This is done by choosing the model parameters as

θ̂ = arg min_θ (1/T) Σ_{t=1}^T [π(θ; r_t) − π̂₀(r_t)]².   (9.15)

(See the sketch after this list.)
3. The model is tested by using a scaled version of the minimized value of the right hand side of (9.15) as a test statistic (it has an asymptotic normal distribution).

4. It is found that most standard models are rejected (daily data on the 7-day Eurodollar deposit rate, June 1973 to February 1995, 5,500 observations), mostly because actual mean reversion is much more non-linear in the interest rate level than suggested by most models (the mean reversion seems to kick in only for extreme interest rates and to be virtually non-existent for moderate rates).
5. For a critique of this approach (biased estimator...), see Chapman and Pearson (2000).
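A minimal sketch of the minimum-distance step (9.15), specialized to an AR(1)/Vasicek short rate whose unconditional distribution is N(μ, σ²/(1−ρ²)). The data file, bandwidth rule, and starting values are illustrative assumptions; this is not the author's code.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def kernel_density(r_grid, r, h):
    # Gaussian kernel density estimate of the short-rate pdf at the points r_grid
    u = (r_grid[:, None] - r[None, :]) / h
    return norm.pdf(u).mean(axis=1) / h

def md_loss(theta, r, pi0_hat):
    # distance (9.15) between the model-implied unconditional pdf and the kernel
    # estimate; for an AR(1)/Vasicek model the former is N(mu, sigma^2/(1-rho^2))
    mu, rho, sigma = theta
    if not (0 < rho < 1) or sigma <= 0:
        return 1e10                               # crude parameter guard
    pi_model = norm.pdf(r, loc=mu, scale=sigma / np.sqrt(1 - rho**2))
    return np.mean((pi_model - pi0_hat)**2)

r = np.loadtxt("short_rate.txt")                  # hypothetical data file
h = 1.06 * r.std() * r.size**(-1/5)               # rule-of-thumb bandwidth
pi0_hat = kernel_density(r, r, h)                 # pdf estimate at observed points
res = minimize(md_loss, x0=[r.mean(), 0.9, r.std() / 10],
               args=(r, pi0_hat), method="Nelder-Mead")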
9.2 Semiparametric Methods

9.2.1 A Partially Linear Model

A possible way out of the curse of dimensionality of the multivariate kernel regression is to specify a partially linear model

y_t = z_t′β + b(x_t) + ε_t,   (9.16)

where ε_t is uncorrelated over time and where E ε_t = 0 and E(ε_t|x_t, z_t) = 0. This model is linear in z_t, but possibly non-linear in x_t, since the function b(x_t) is unknown.
To construct an estimator, start by taking expectations of (9.16) conditional on x_t:

E(y_t|x_t) = E(z_t|x_t)′β + b(x_t).   (9.17)

Subtract from (9.16) to get

y_t − E(y_t|x_t) = [z_t − E(z_t|x_t)]′β + ε_t.   (9.18)
The double residual method (see Pagan and Ullah (1999) 5.2) has several steps. First, estimate E(y_t|x_t) by a kernel regression of y_t on x_t, giving b̂_y(x), and E(z_t|x_t) by a similar kernel regression of z_t on x_t, giving b̂_z(x). Second, use these estimates in (9.18),

y_t − b̂_y(x_t) = [z_t − b̂_z(x_t)]′β + ε_t,   (9.19)

and estimate β by least squares. Third, use these estimates in (9.17) to estimate b(x_t) as

b̂(x_t) = b̂_y(x_t) − b̂_z(x_t)′β̂.   (9.20)
It can be shown that (under the assumption that y_t, z_t and x_t are iid)

√T(β̂ − β) →d N(0, Var(ε_t) Cov(z_t|x_t)⁻¹).   (9.21)

We can consistently estimate Var(ε_t) by the sample variance of the fitted residuals in (9.16), plugging in the estimated β and b(x_t); and we can also consistently estimate Cov(z_t|x_t) by the sample variance of z_t − b̂_z(x_t). Clearly, this result is based on the idea that we asymptotically know the non-parametric parts of the problem (which relies on the consistency of their estimators).
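A minimal sketch of the double residual method with a Gaussian kernel in the first stage (any consistent nonparametric fit would do; function names are mine):

import numpy as np

def kernel_fit(x, y, h):
    # fitted values from a kernel regression of y on x, at each observed x
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-u**2 / 2)
    return (k * y[None, :]).sum(axis=1) / k.sum(axis=1)

def partially_linear(y, z, x, h):
    # double residual estimator of beta in y = z'beta + b(x) + eps:
    # kernel-regress y and each column of z on x, then LS on the residuals (9.19)
    by = kernel_fit(x, y, h)
    bz = np.column_stack([kernel_fit(x, z[:, j], h) for j in range(z.shape[1])])
    beta, *_ = np.linalg.lstsq(z - bz, y - by, rcond=None)
    bhat = by - bz @ beta                      # estimate of b(x_t), as in (9.20)
    return beta, bhat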
9.2.2 Basis Expansion

Reference: Hastie, Tibshirani, and Friedman (2001); Ranaldo and Söderlind (2010) (for an application of the method to exchange rates)
The label "non-parametrics" is something of a misnomer, since these models typically have very many parameters. For instance, the kernel regression is an attempt to estimate a specific slope coefficient at almost every value of the regressor. Not surprisingly, this becomes virtually impossible if the data set is small and/or there are several regressors.
An alternative approach is to estimate an approximation of the function b(x_t) in

y_t = b(x_t) + ε_t.   (9.22)

This can be done by using piecewise polynomials or splines. In the simplest case, this amounts to just a piecewise linear (but continuous) function. For instance, if x_t is a scalar and we want three segments (pieces), then we could use the following building blocks

[x_t, max(x_t − ξ₁, 0), max(x_t − ξ₂, 0)]′   (9.23)

and approximate b(x_t) as

b(x_t) = β₁x_t + β₂ max(x_t − ξ₁, 0) + β₃ max(x_t − ξ₂, 0).   (9.24)

This can also be written

b(x_t) = β₁x_t                                   if x_t < ξ₁
       = β₁x_t + β₂(x_t − ξ₁)                    if ξ₁ ≤ x_t < ξ₂
       = β₁x_t + β₂(x_t − ξ₁) + β₃(x_t − ξ₂)     if ξ₂ ≤ x_t.   (9.25)

This function has the slope β₁ for x_t < ξ₁, the slope β₁ + β₂ between ξ₁ and ξ₂, and β₁ + β₂ + β₃ above ξ₂. It is no more sophisticated than using dummy variables (for the different segments), except that the current approach is a convenient way to guarantee that the resulting function is continuous.
[Figure: basis expansion building blocks x, max(x − ξ₁, 0), and max(x − ξ₂, 0).]
The fitted function can also be written with indicator functions,

b(x_t) = β₁x_t + β₂(x_t − ξ₁)I(x_t ≥ ξ₁) + β₃(x_t − ξ₂)I(x_t ≥ ξ₂),   (9.26)

where I(q) = 1 if q is true and 0 otherwise. The variance of b̂(x_t) is then

Var[b̂(x_t)] = [∂b(x_t)/∂θ′] V [∂b(x_t)/∂θ],   (9.27)

where V is the covariance matrix of the estimated parameters θ = (ξ₁, ξ₂, β₁, β₂, β₃).
Remark 9.7 (The derivatives of b(x_t)) From (9.25) we have the following derivatives for the three cases x_t < ξ₁, ξ₁ ≤ x_t < ξ₂, and ξ₂ ≤ x_t, respectively:

∂b(x_t)/∂ξ₁ :  0,    −β₂,       −β₂
∂b(x_t)/∂ξ₂ :  0,    0,         −β₃
∂b(x_t)/∂β₁ :  x_t,  x_t,       x_t
∂b(x_t)/∂β₂ :  0,    x_t − ξ₁,  x_t − ξ₁
∂b(x_t)/∂β₃ :  0,    0,         x_t − ξ₂
It is also straightforward to extend this to several regressors, at least as long as we assume additivity of the regressors. For instance, with two variables (x_t and z_t),

b(x_t, z_t) = b_x(x_t) + b_z(z_t),   (9.28)

where both b_x(x_t) and b_z(z_t) are piecewise functions of the sort discussed in (9.25). Estimation is just as before, except that we have different knots for different variables. Estimating Var[b̂_x(x_t)] and Var[b̂_z(z_t)] follows the same approach as in (9.27). See Figure 9.8 for an illustration.
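For given knots, the piecewise linear model (9.24) is just OLS on constructed regressors. A minimal sketch (the knot values and simulated data are illustrative; estimating the knots themselves would require nonlinear optimization):

import numpy as np

def linear_spline_basis(x, knots):
    # regressors as in (9.23): x and max(x - knot, 0) for each knot
    # (a constant can be appended as an extra column if needed)
    cols = [x] + [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

x = np.linspace(0, 10, 200)
y = 0.5 * x - 1.2 * np.maximum(x - 4, 0) \
    + 0.1 * np.random.default_rng(3).standard_normal(200)
X = linear_spline_basis(x, knots=[4.0, 7.0])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # slope beta1, extra slopes beta2, beta3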
Bibliography

Ait-Sahalia, Y., 1996, "Testing continuous-time models of the spot interest rate," Review of Financial Studies, 9, 385-426.

Ait-Sahalia, Y., and A. W. Lo, 1998, "Nonparametric estimation of state-price densities implicit in financial asset prices," Journal of Finance, 53, 499-547.
[Figure 9.8: piecewise linear regression of y_{t+1} on y_t, with point estimate and 90% confidence band. Estimates:]

                      coeff    Std
knot 1                 1.84    0.00
knot 2                19.49    0.30
slope seg 1            0.02    0.01
extra slope seg 2      0.02    0.01
extra slope seg 3      0.43    0.18
const                  0.05    0.01
9 Panel Regressions

9.1 Introduction
The task is to evaluate if the alphas or betas of individual investors (or funds) are related to investor (fund) characteristics, for instance, age or trading activity. The data set is a panel with observations for T periods and N investors. (In many settings the panel is unbalanced but, to keep things reasonably simple, that is disregarded in the discussion below.)
9.2 Calendar Time and Cross-Sectional Regressions
The calendar time (CalTime) approach is to first define M discrete investor groups (for instance, age 18-30, 31-40, etc) and calculate their respective average excess returns (ȳ_jt for group j),

ȳ_jt = (1/N_j) Σ_{i∈Group j} y_it,   (9.1)

where N_j is the number of individuals in group j.
Then, we run a factor model

ȳ_jt = x_t′β_j + v_jt, for j = 1, 2, ..., M,   (9.2)

where x_t typically includes a constant and various return factors (for instance, excess returns on equity and bonds). By estimating these M equations as a SURE system with White's (or Newey-West's) covariance estimator, it is straightforward to test various hypotheses, for instance, that the intercept (the "alpha") is higher for the Mth group than for the first group.
Example 9.1 (CalTime with two investor groups) With two investor groups, estimate the system

ȳ_1t = x_t′β₁ + v_1t
ȳ_2t = x_t′β₂ + v_2t.
The CalTime approach is straightforward and the cross-sectional correlations are fairly easy to handle (in the SURE approach). However, it forces us to construct discrete investor groups, which makes it hard to handle several different types of investor characteristics (for instance, age, trading activity and income) at the same time.
The cross-sectional regression (CrossReg) approach is to first estimate the factor model for each investor,

y_it = x_t′β_i + ε_it, for i = 1, 2, ..., N,   (9.3)

and to then regress the (estimated) betas for the pth factor (for instance, the intercept) on the investor characteristics,

β̂_pi = z_i′c_p + w_pi.   (9.4)
In this second-stage regression, the investor characteristics z_i could be a dummy variable (for an age group, say) or a continuous variable (age, say). Notice that using a continuous investor characteristic assumes that the relation between the characteristic and the beta is linear, something that is not assumed in the CalTime approach. (This saves degrees of freedom, but may sometimes be a very strong assumption.) However, a potential problem with the CrossReg approach is that it is often important to account for the cross-sectional correlation of the residuals.
9.3 Panel Regressions

OLS

Consider the pooled panel regression

y_it = x_it′β + ε_it,   (9.5)

where x_it is a K×1 vector. Notice that the coefficients are the same across individuals (and time). Define the matrices

Σ_xx = (1/TN) Σ_{t=1}^T Σ_{i=1}^N x_it x_it′ (a K×K matrix)   (9.6)

Σ_xy = (1/TN) Σ_{t=1}^T Σ_{i=1}^N x_it y_it (a K×1 vector).   (9.7)

The LS estimator is then

β̂ = Σ_xx⁻¹ Σ_xy.   (9.8)
GMM

The sample moment conditions corresponding to the LS estimator are

(1/TN) Σ_{t=1}^T Σ_{i=1}^N x_it(y_it − x_it′β) = 0_{K×1}.   (9.9)
Remark 9.2 (Distribution of GMM estimates) Under fairly weak assumptions, the exactly identified GMM estimator satisfies √(TN)(β̂ − β₀) →d N(0, D₀⁻¹S₀D₀⁻¹), where D₀ is the Jacobian of the average moment conditions and S₀ is the covariance matrix of √(TN) times the average moment conditions.
Remark 9.3 (Distribution of β̂) As long as TN is finite, we can (with some abuse of notation) consider the distribution of β̂ instead of √(TN)(β̂ − β₀), to write

β̂ ~ N(β₀, D₀⁻¹SD₀⁻¹),

where S = S₀/(TN), which is the same as the covariance matrix of the average moment conditions (9.9).
To apply these remarks, first notice that the Jacobian D₀ corresponds to (the probability limit of) the Σ_xx matrix in (9.6). Second, notice that

Cov(average moment conditions) = Cov[(1/T) Σ_{t=1}^T (1/N) Σ_{i=1}^N h_it],   (9.10)

where h_it = x_it ε_it are the individual moment conditions.
Driscoll-Kraay

The Driscoll and Kraay (1998) (DK) covariance matrix of the coefficients is

Cov(β̂) = Σ_xx⁻¹ S Σ_xx⁻¹,   (9.12)

where S is estimated from the cross-sectional averages of the moment conditions,

S = (1/T²) Σ_{t=1}^T h̄_t h̄_t′, with h̄_t = (1/N) Σ_{i=1}^N h_it and h_it = x_it ε_it.   (9.13)

To calculate this estimator, (9.13) uses the time dimension (and hence requires a reasonably long time series).
Remark 9.4 (Relation to the notation in Hoechle (2011)) Hoechle writes Cov(β̂) = (X′X)⁻¹ Ŝ_T (X′X)⁻¹, where Ŝ_T = Σ_{t=1}^T ĥ_t ĥ_t′ with ĥ_t = Σ_{i=1}^N h_it. Clearly, my Σ_xx = X′X/(TN) and my S = Ŝ_T/(T²N²). Combining gives Cov(β̂) = (Σ_xx TN)⁻¹ (S T²N²) (Σ_xx TN)⁻¹, which simplifies to (9.12).
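A minimal sketch of (9.12)-(9.13), assuming the data are stored as a T×N×K array of regressors and a T×N array of residuals (the storage layout and function name are my choices). Like (9.13), it ignores autocorrelation.

import numpy as np

def dk_cov(x, eps):
    # Driscoll-Kraay covariance matrix of the pooled OLS coefficients;
    # x: T x N x K regressors, eps: T x N fitted residuals
    T, N, K = x.shape
    h = x * eps[:, :, None]               # h_it = x_it * eps_it, a T x N x K array
    hbar = h.mean(axis=1)                 # cross-sectional averages h_t, T x K
    S = hbar.T @ hbar / T**2              # S as in (9.13)
    Sxx = np.einsum('tnk,tnl->kl', x, x) / (T * N)   # Sigma_xx as in (9.6)
    Sxx_inv = np.linalg.inv(Sxx)
    return Sxx_inv @ S @ Sxx_inv          # Cov(beta-hat) as in (9.12)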
Example 9.5 (DK method on N = 4) With four individuals (and a single regressor, so that h_it is a scalar), the cross-sectional average of the moment conditions is

h̄_t = (1/4)(h_1t + h_2t + h_3t + h_4t),

so

S = (1/T²) Σ_{t=1}^T (1/16)(h²_1t + h²_2t + h²_3t + h²_4t
    + 2h_1t h_2t + 2h_1t h_3t + 2h_1t h_4t + 2h_2t h_3t + 2h_2t h_4t + 2h_3t h_4t),

which we can write as

S = (1/(16T)) [Σ_{i=1}^4 V̂ar(h_it) + 2Ĉov(h_1t, h_2t) + 2Ĉov(h_1t, h_3t) + 2Ĉov(h_1t, h_4t)
    + 2Ĉov(h_2t, h_3t) + 2Ĉov(h_2t, h_4t) + 2Ĉov(h_3t, h_4t)].

Notice that S is the (estimate of the) variance of the cross-sectional average, Var(h̄_t) with h̄_t = (h_1t + h_2t + h_3t + h_4t)/4.
A cluster method puts restrictions on the covariance terms (of h_it) that are allowed to enter the estimate S. In practice, all terms across clusters are left out. This can be implemented by changing the S matrix. In particular, instead of interacting all i with each other, we only allow for interactions within each of the G clusters (g = 1,...,G),
S = (1/T²) Σ_{t=1}^T Σ_{g=1}^G h̄^g_t (h̄^g_t)′, where h̄^g_t = (1/N) Σ_{i∈cluster g} h_it.   (9.14)

(Remark: the cluster sums should be divided by N, not by the number of individuals in the cluster.)
Example 9.6 (Cluster method on N = 4, changing Example 9.5 directly) Reconsider Example 9.5, but assume that individuals 1 and 2 form cluster 1 and that individuals 3 and 4 form cluster 2, and disregard correlations across clusters. This means setting the covariances across clusters to zero,

S = (1/T²) Σ_{t=1}^T (1/16)(h²_1t + h²_2t + h²_3t + h²_4t + 2h_1t h_2t + 2h_3t h_4t),

so we can write

S = (1/(16T)) [Σ_{i=1}^4 V̂ar(h_it) + 2Ĉov(h_1t, h_2t) + 2Ĉov(h_3t, h_4t)].
Example 9.7 (Cluster method on N = 4) From (9.14) we have the cluster (group) averages

h̄¹_t = (1/4)(h_1t + h_2t) and h̄²_t = (1/4)(h_3t + h_4t).

Assuming only one regressor (to keep it simple), the time averages (1/T) Σ_{t=1}^T h̄^g_t (h̄^g_t)′ are

(1/T) Σ_{t=1}^T h̄¹_t (h̄¹_t)′ = (1/T) Σ_{t=1}^T (1/16)(h²_1t + h²_2t + 2h_1t h_2t), and
(1/T) Σ_{t=1}^T h̄²_t (h̄²_t)′ = (1/T) Σ_{t=1}^T (1/16)(h²_3t + h²_4t + 2h_3t h_4t).

Finally, summing across these time averages gives the same expression as in Example 9.6. The following 4×4 matrix illustrates which cells are included (the clusters are {1,2} and {3,4}, and correlations across clusters are disregarded):

i   1           2           3           4
1   h_1t h′_1t  h_1t h′_2t  0           0
2   h_2t h′_1t  h_2t h′_2t  0           0
3   0           0           h_3t h′_3t  h_3t h′_4t
4   0           0           h_4t h′_3t  h_4t h′_4t   (9.15)
Example 9.8 (White's method on N = 4) With only one regressor, (9.15) gives

S = (1/T²) Σ_{t=1}^T (1/16)(h²_1t + h²_2t + h²_3t + h²_4t) = (1/(16T)) Σ_{i=1}^4 V̂ar(h_it),   (9.16)

since White's method uses only the variances (the cells on the principal diagonal).
Remark 9.9 (Why the cluster method fails when there is a missing time fixed effect and one of the regressors indicates the cluster membership) To keep this remark short, assume y_it = γq_it + ε_it, where q_it indicates the cluster membership of individual i (constant over time). In addition, assume that all individual residuals are entirely due to an (excluded) time fixed effect, ε_it = w_t. Let N = 4, where i = (1,2) belong to the first cluster (q_i = 1) and i = (3,4) belong to the second cluster (q_i = −1). (Using the values q_i = ±1 gives q_i a zero mean, which is convenient.) It is straightforward to demonstrate that the estimated (OLS) coefficient in any sample must be zero: there is in fact no uncertainty about it. The individual moments in period t are then h_it = q_i w_t,

(h_1t, h_2t, h_3t, h_4t) = (w_t, w_t, −w_t, −w_t).

The matrix in Example 9.7 (showing only the cells included by the cluster method) is then

i   1      2      3      4
1   w²_t   w²_t   0      0
2   w²_t   w²_t   0      0
3   0      0      w²_t   w²_t
4   0      0      w²_t   w²_t

These elements sum up to a positive number, which is wrong since Σ_{i=1}^N h_it = 0 by definition, so its variance should also be zero. In contrast, the DK method adds the off-diagonal elements, which are all equal to −w²_t, so summing the whole matrix indeed gives zero. If we replace the q_it regressor with something else (e.g. a constant), then we do not get this result.
To see what happens if the q_i variable does not coincide with the definitions of the clusters, change the regressor to q_i = (−1, 1, −1, 1) for the four individuals. We then get (h_1t, h_2t, h_3t, h_4t) = (−w_t, w_t, −w_t, w_t). If the definitions of the clusters (for the covariance matrix) are unchanged, then the matrix in Example 9.7 becomes

i   1      2      3      4
1   w²_t   −w²_t  0      0
2   −w²_t  w²_t   0      0
3   0      0      w²_t   −w²_t
4   0      0      −w²_t  w²_t

which sums to zero: the cluster covariance estimator works fine. The DK method also adds the cells across clusters,

i   1      2      3      4
1   .      .      w²_t   −w²_t
2   .      .      −w²_t  w²_t
3   w²_t   −w²_t  .      .
4   −w²_t  w²_t   .      .

which also sum to zero. This suggests that the cluster covariance matrix goes wrong only when the cluster definition (for the covariance matrix) is strongly related to the q_i regressor.
9.4 From CalTime to a Panel Regression

The CalTime estimates can be replicated by using the individual data in the panel. For instance, with two investor groups we could estimate the following two regressions

y_it = x_t′β₁ + u_it^(1) for i ∈ group 1,   (9.17)
y_it = x_t′β₂ + u_it^(2) for i ∈ group 2.   (9.18)
More interestingly, these regression equations can be combined into one panel regression (and still give the same estimates) with the help of dummy variables. Let z_ji = 1 if individual i is a member of group j and zero otherwise. Stacking all the data, we have (still with two investor groups)

y_it = (z_1i x_t)′β₁ + (z_2i x_t)′β₂ + u_it
     = [z_1i x_t; z_2i x_t]′ [β₁; β₂] + u_it
     = (z_i ⊗ x_t)′β + u_it, where z_i = [z_1i; z_2i].   (9.19)
The residual in the group regression (9.2) is the average of the residuals of the individuals in the group,

v_jt = (1/N_j) Σ_{i∈Group j} u_it^(j).   (9.20)

We know that

Var(v_jt) = (1/N_j)(σ̄_ii − σ̄_ih) + σ̄_ih,   (9.21)

where σ̄_ii is the average Var(u_it^(j)) and σ̄_ih is the average Cov(u_it^(j), u_ht^(j)). With a large cross-section, only the covariance matters. A good covariance estimator for the panel approach will therefore have to handle the covariance within a group, and perhaps also the covariance across groups. This suggests that the panel regression needs to handle the cross-correlations (for instance, by using the cluster or DK covariance estimators).
9.5 The Results in Hoechle, Schmid and Zimmermann

Hoechle, Schmid, and Zimmermann (2009) (HSZ) suggest the following regression on all the data (t = 1,...,T and also i = 1,...,N),

y_it = (z_it ⊗ x_t)′d + v_it.   (9.22)

Proposition 9.10 (When z_it consists of dummies indicating exclusive and constant group membership) If z_it indicates exclusive and constant group membership (z_1it = 1 means that investor i belongs to group 1, so z_jit = 0 for j = 2,...,m), then the LS estimates and DK standard errors of (9.22) are the same as the LS estimates and Newey-West standard errors of the CalTime approach (9.2). (See HSZ for a proof.)
Proposition 9.11 (When z_it is a measure of investor characteristics, e.g. the number of fund switches) The LS estimates and DK standard errors of (9.22) are the same as the LS estimates of the CrossReg approach (9.4), but the standard errors account for the cross-sectional correlations, while those in the CrossReg approach do not. (See HSZ for a proof.)
Example 9.12 (One investor characteristic and one pricing factor) In this case (9.22) is

y_it = [1, x_1t, z_it, z_it x_1t]′ d + v_it
     = d₀ + d₁x_1t + d₂z_it + d₃z_it x_1t + v_it.

If we are interested in how the investor characteristics (z_it) affect the alpha (intercept), then d₂ is the key coefficient.
9.6 Monte Carlo Experiment

9.6.1 The Model and the Estimators
This section reports results from a simple Monte Carlo experiment. We use the model

y_it = α + βf_t + γg_i + ε_it,   (9.24)

where f_t is a common factor and g_i is the demeaned number of the cluster that individual i belongs to (the group). The benchmark return f_t is iid normally distributed with a zero mean and a standard deviation equal to 15/√250, while ε_it is also normally distributed with a zero mean and a standard deviation of one (different cross-sectional correlations are shown in the table). In generating the data, the true values of α and γ are zero, while β is one, and these are also the hypotheses tested below. To keep the simulations easy to interpret, there is no autocorrelation or heteroskedasticity.
Results for three different GMM-based methods are reported: Driscoll and Kraay (1998), a cluster method and White's method. To keep the notation short, let the regression model be y_it = x_it′b + ε_it, where x_it is a K×1 vector of regressors. The (least squares) moment conditions are

(1/TN) Σ_{t=1}^T Σ_{i=1}^N h_it = 0_{K×1}, where h_it = x_it ε_it.   (9.25)
Standard GMM results show that the variance-covariance matrix of the coefficients is

Cov(b̂) = Σ_xx⁻¹ S Σ_xx⁻¹, where Σ_xx = (1/TN) Σ_{t=1}^T Σ_{i=1}^N x_it x_it′.   (9.26)

The three methods differ with respect to the estimate of S:

S_DK = (1/(T²N²)) Σ_{t=1}^T (Σ_{i=1}^N h_it)(Σ_{i=1}^N h_it)′,
S_Cl = (1/(T²N²)) Σ_{t=1}^T Σ_{g=1}^G (Σ_{i∈cluster g} h_it)(Σ_{i∈cluster g} h_it)′,
S_Wh = (1/(T²N²)) Σ_{t=1}^T Σ_{i=1}^N h_it h_it′.   (9.27)
To see the difference, consider a simple example with N = 4 and where i = (1,2) belong to the first cluster and i = (3,4) belong to the second cluster. The following matrix shows the outer products of the moment conditions of all individuals. White's estimator sums up the cells on the principal diagonal, the cluster method adds the other cells within each cluster block, and the DK method adds all the cells,

⎡ h_1t h′_1t  h_1t h′_2t  h_1t h′_3t  h_1t h′_4t ⎤
⎢ h_2t h′_1t  h_2t h′_2t  h_2t h′_3t  h_2t h′_4t ⎥
⎢ h_3t h′_1t  h_3t h′_2t  h_3t h′_3t  h_3t h′_4t ⎥
⎣ h_4t h′_1t  h_4t h′_2t  h_4t h′_3t  h_4t h′_4t ⎦.   (9.28)
9.6.2 MC Covariance Structure

To generate data with correlated (in the cross-section) residuals, let the residual of individual i (belonging to group j) in period t be

ε_it = u_it + v_jt + w_t,   (9.29)

where u_it ~ N(0, σ²_u), v_jt ~ N(0, σ²_v) and w_t ~ N(0, σ²_w), and the three components are uncorrelated. This implies that

Var(ε_it) = σ²_u + σ²_v + σ²_w, and
Cov(ε_it, ε_kt) = σ²_v + σ²_w if i and k belong to the same group, and σ²_w otherwise.   (9.30)

Clearly, when σ²_w = 0, then the correlation across groups is zero, but there may be correlation within a group. If both σ²_v = 0 and σ²_w = 0, then there is no correlation at all across individuals. For CalTime portfolios (one per activity group), we expect the u_it to average out, so a group portfolio has the variance σ²_v + σ²_w and the covariance of two different group portfolios is σ²_w.

The Monte Carlo simulations consider different values of the variances, to illustrate the effect of the correlation structure.
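Generating residuals with the structure (9.29) is straightforward. A minimal sketch (group sizes and variances are illustrative choices):

import numpy as np

def make_residuals(T, N, groups, var_u, var_v, var_w, seed=0):
    # residuals eps_it = u_it + v_jt + w_t as in (9.29);
    # groups: length-N integer array with the group index j of each individual
    rng = np.random.default_rng(seed)
    u = np.sqrt(var_u) * rng.standard_normal((T, N))            # individual shocks
    v = np.sqrt(var_v) * rng.standard_normal((T, groups.max() + 1))  # group shocks
    w = np.sqrt(var_w) * rng.standard_normal((T, 1))            # common shock
    return u + v[:, groups] + w

groups = np.repeat(np.arange(5), 333)          # five clusters, as in Table 9.1
eps = make_residuals(2000, groups.size, groups, 0.67, 0.33, 0.0)  # Panel B setup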
9.6.3 Results
Table 9.1 reports the fraction of times the absolute value of a t-statistic for a true null hypothesis is higher than 1.96. The table has three panels for different correlation patterns of the residuals (ε_it): no correlation between individuals, correlation only within the pre-specified clusters, and correlation across all individuals.
                      α       β       γ
A. No cross-sectional correlation
White               0.049   0.044   0.050
Cluster             0.049   0.045   0.051
Driscoll-Kraay      0.050   0.045   0.050
B. Within-cluster correlations
White               0.853   0.850   0.859
Cluster             0.053   0.047   0.049
Driscoll-Kraay      0.054   0.048   0.050
C. Within- and between-cluster correlations
White               0.935   0.934   0.015
Cluster             0.377   0.364   0.000
Driscoll-Kraay      0.052   0.046   0.050

Table 9.1: Simulated size of different covariance estimators. This table presents the fraction of rejections of true null hypotheses for three different estimators of the covariance matrix: White's (1980) method, a cluster method, and Driscoll and Kraay's (1998) method. The model of individual i in period t, who belongs to cluster j, is r_it = α + βf_t + γg_i + ε_it, where f_t is a common regressor (iid normally distributed) and g_i is the demeaned number of the cluster that the individual belongs to. The simulations use 3000 repetitions of samples with t = 1,...,2000 and i = 1,...,1665. Each individual belongs to one of five different clusters. The error term is constructed as ε_it = u_it + v_jt + w_t, where u_it is an individual (iid) shock, v_jt is a shock common to all individuals who belong to cluster j, and w_t is a shock common to all individuals. All shocks are normally distributed. In Panel A the variances of (u_it, v_jt, w_t) are (1,0,0), so the shocks are iid; in Panel B the variances are (0.67,0.33,0), so there is a 33% correlation within a cluster but no correlation between different clusters; in Panel C the variances are (0.67,0,0.33), so there is no cluster-specific shock and all shocks are equally correlated, effectively having a 33% correlation within a cluster and between clusters.
In the upper panel, where the residuals are iid, all three methods have rejection rates around 5% (the nominal size).

In the middle panel, the residuals are correlated within each of the five clusters, but there is no correlation between individuals that belong to different clusters. In this case, both the DK and the cluster method have the right rejection rates, while White's method gives much too high rejection rates (around 85%). The reason is that White's method disregards the correlation between individuals, and in this way underestimates the uncertainty about the point estimates. It is also worth noticing that the good performance of the cluster method depends on pre-specifying the correct clustering. Further simulations (not tabulated) show that a completely random cluster specification (unknown to the econometrician) gives almost the same results as White's method.

The lower panel has no cluster-specific correlations, but all individuals are now equally correlated (similar to a fixed time effect). For the intercept (α) and the slope coefficient on the common factor (β), the DK method still performs well, while the cluster and White's methods give too many rejections: the latter two methods underestimate the uncertainty since some correlations across individuals are disregarded. Things are more complicated for the slope coefficient on the cluster number (γ). Once again, DK performs well, but both the cluster and White's methods lead to too few rejections. The reason is the interaction of the common component in the residual with the cross-sectional dispersion of the group number (g_i).
To understand this last result, consider a stylised case where y_it = γg_i + ε_it with γ = 0 and ε_it = w_t, so all residuals are due to an (excluded) time fixed effect. In this case, the matrix above becomes

i   1      2      3      4
1   w²_t   w²_t   −w²_t  −w²_t
2   w²_t   w²_t   −w²_t  −w²_t
3   −w²_t  −w²_t  w²_t   w²_t
4   −w²_t  −w²_t  w²_t   w²_t   (9.31)

(This follows from g_i = (−1, −1, 1, 1), since with h_it = g_i w_t we get (h_1t, h_2t, h_3t, h_4t) = (−w_t, −w_t, w_t, w_t).) Both White's and the cluster method sum up only positive cells, so S is a strictly positive number. (For the cluster method, this result relies on the assumption that the clusters used in estimating S correspond to the values of the regressor g_i.)
9.7 An Empirical Illustration

See Table 9.2 for results on a ten-year panel of some 60,000 Swedish pension savers (Dahlquist, Martinez and Söderlind, 2011).
Bibliography

Driscoll, J., and A. Kraay, 1998, "Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data," Review of Economics and Statistics, 80, 549-560.

Hoechle, D., 2011, "Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence," The Stata Journal, forthcoming.

Hoechle, D., M. M. Schmid, and H. Zimmermann, 2009, "A Generalization of the Calendar Time Portfolio Approach and the Performance of Private Investors," Working paper, University of Basel.
                        I         II        III       IV
Constant              0.828     1.384     0.651     1.274
                     (2.841)   (3.284)   (2.819)   (3.253)
Default fund          0.406     0.387     0.230     0.217
                     (1.347)   (1.348)   (1.316)   (1.320)
1 change              0.117               0.125
                     (0.463)             (0.468)
2-5 changes           0.962               0.965
                     (0.934)             (0.934)
6-20 changes          2.678               2.665
                     (1.621)             (1.623)
21-50 changes         4.265               4.215
                     (2.074)             (2.078)
51- changes           7.114               7.124
                     (2.529)             (2.535)
Number of changes               0.113               0.112
                               (0.048)             (0.048)
Age                                       0.008     0.008
                                         (0.011)   (0.011)
Gender                                    0.306     0.308
                                         (0.101)   (0.101)
Income                                    0.007     0.009
                                         (0.033)   (0.036)
R-squared (in %)       55.0      55.1      55.0      55.1
Table 9.2: The table presents the results of pooled regressions of an individual's daily excess return on return factors, and measures of the individual's fund changes and other characteristics. The return factors are the excess returns of the Swedish stock market, the Swedish bond market, and the world stock market, and they are allowed to vary across the individuals' characteristics. For brevity, the coefficients on these return factors are not presented in the table. The measure of fund changes is either a dummy variable for an activity category (see Table ??) or a variable counting the number of fund changes. Other characteristics are the individual's age in 2000, gender, or pension rights in 2000, which is a proxy for income. The constant term and coefficients on the dummy variables are expressed in % per year. The income variable is scaled down by 1,000. Standard errors, robust to conditional heteroskedasticity and spatial cross-sectional correlations as in Driscoll and Kraay (1998), are reported in parentheses. The sample consists of 62,640 individuals followed daily over the 2000 to 2010 period.