Preface

As with the earlier volumes in this series, the main purpose of this volume of the Handbook of Statistics is to serve as a source reference and teaching supplement to courses in empirical finance. Many graduate students and researchers in the finance area today use sophisticated statistical methods, but there is as yet no comprehensive reference volume on this subject. The present volume is intended to fill this gap. The first part of the volume covers the area of asset pricing. In the first paper, Ferson and Jagannathan present a comprehensive survey of the literature on econometric evaluation of asset pricing models. The next paper by Harvey and Kirby discusses the problems of instrumental variable estimation in latent variable models of asset pricing. The next paper by Lehmann reviews semiparametric methods in asset pricing models. Chapter 23 by Shanken also falls in the category of asset pricing. Part II of the volume, on the term structure of interest rates, consists of only one paper, by Pagan, Hall and Martin. The paper surveys both the econometric and finance literature in this area, and shows some similarities and divergences between the two approaches. The paper also documents several stylized facts in the data that prove useful in assessing the adequacy of the different models. Part III of the volume deals with different aspects of volatility. The first paper by Ghysels, Harvey and Renault presents a comprehensive survey on the important topic of stochastic volatility models. These models have their roots both in mathematical finance and financial econometrics and are an attractive alternative to the popular ARCH models. The next paper by LeRoy presents a critical review of the literature on variance-bounds tests for market efficiency. The third paper, by Palm, on GARCH models of stock price volatility surveys some more recent developments in this area. Several surveys on the ARCH models have appeared in the literature and these are cited in the paper.
The paper surveys developments since the appearance of these surveys. Part IV of the volume deals with prediction problems. The first paper by Diebold and Lopez deals with the statistical methods of evaluation of forecasts. The second paper, by Kaul, reviews the literature on the predictability of stock returns. This area has always fascinated those involved in making money in financial markets as well as academics who presumably are interested in studying whether one can, in fact, make money in the financial markets. The third paper by Lahiri reviews statistical evidence on interest rate spreads as predictors of business cycles. Since there is not much of a literature to survey in this area, Lahiri presents some new results. Part V of the volume deals with alternative probabilistic models in finance. The first paper by Brock and de Lima surveys several areas subsumed under the rubric "complexity theory." This includes chaos theory, nonlinear time series models, long memory models and models with asymmetric information. The next paper by Cameron and Trivedi surveys the area of count data models in finance. In some financial studies, the dependent variable is a count, taking nonnegative integer values. The next paper by McCulloch surveys the literature on stable distributions. This area was very active in finance in the early 1960s due to the work by Mandelbrot but then received little attention until recently, when interest in stable distributions revived. The last paper by McDonald reviews the variety of probability distributions which have been and can be used in the statistical analysis of financial data. Part VI deals with application of specialized statistical methods in finance. This part covers important statistical methods that are of general applicability (to all the models considered in the previous sections) and not covered adequately in the other chapters. The first paper by Maddala and Li covers the area of bootstrap methods. The second paper by Rao covers the area of principal component and factor analyses, which have, during recent years, been widely used in financial research, particularly in arbitrage pricing theory (APT). The third paper by Maddala and Nimalendran reviews the area of errors-in-variables models as applied to finance. Almost all variables in finance suffer from the errors-in-variables problem. The fourth paper by Qi surveys the applications of artificial neural networks in financial research. These are general nonparametric nonlinear models.
The final paper by Maddala reviews the applications of limited dependent variable models in financial research. Part VII of the volume contains surveys of miscellaneous other problems. The first paper by Bates surveys the literature on testing option pricing models. The next paper by Evans discusses what are known in the financial literature as "peso problems." The next paper by Hasbrouck covers market microstructure, which is an active area of research in finance. The paper discusses the time series work in this area. The final paper by Shanken gives a comprehensive survey of tests of portfolio efficiency. One important area left out has been the use of Bayesian methods in finance. In principle, all the problems discussed in the several chapters of the volume can be analyzed from the Bayesian point of view. Much of this work remains to be done. Finally, we would like to thank Ms. Jo Ducey for her invaluable help at several stages in the preparation of this volume and patient assistance in seeing the manuscript through to publication. G. S. Maddala C. R. Rao
Contributors
D. S. Bates, Department of Finance, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA (Ch. 20)
W. A. Brock, Department of Economics, University of Wisconsin, Madison, WI 53706, USA (Ch. 11)
A. C. Cameron, Department of Economics, University of California at Davis, Davis, CA 95616-8578, USA (Ch. 12)
P. J. F. de Lima, Department of Economics, The Johns Hopkins University, Baltimore, MD 21218, USA (Ch. 11)
F. X. Diebold, Department of Economics, University of Pennsylvania, Philadelphia, PA 19104, USA (Ch. 8)
M. D. D. Evans, Department of Economics, Georgetown University, Washington, DC 20057-1045, USA (Ch. 21)
W. E. Ferson, Department of Finance, University of Washington, Seattle, WA 98195, USA (Ch. 1)
E. Ghysels, Department of Economics, The Pennsylvania State University, University Park, PA 16802 and CIRANO (Centre interuniversitaire de recherche en analyse des organisations), Université de Montréal, Montréal, Quebec, Canada H3A 2A5 (Ch. 5)
A. D. Hall, School of Business, Bond University, Gold Coast, QLD 4229, Australia (Ch. 4)
A. C. Harvey, Department of Statistics, London School of Economics, Houghton Street, London WC2A 2AE, UK (Ch. 5)
C. R. Harvey, Department of Finance, Fuqua School of Business, Box 90120, Duke University, Durham, NC 27708-0120, USA (Ch. 2)
J. Hasbrouck, Department of Finance, Stern School of Business, 44 West 4th Street, New York, NY 10012-1126, USA (Ch. 22)
R. Jagannathan, Finance Department, School of Business and Management, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (Ch. 1)
G. Kaul, University of Michigan Business School, Ann Arbor, MI 48109-1234 (Ch. 9)
C. M. Kirby, Department of Finance, College of Business & Mgm., University of Maryland, College Park, MD 20742, USA (Ch. 2)
K. Lahiri, Department of Economics, State University of New York at Albany, Albany, NY 12222, USA (Ch. 10)
B. N. Lehmann, Graduate School of International Relations, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0519, USA (Ch. 3)
S. F. LeRoy, Department of Economics, University of California at Santa Barbara, Santa Barbara, CA 93106-9210 (Ch. 6)
H. Li, Department of Management Science, The Chinese University of Hong Kong, 302 Leung Kau Kui Building, Shatin, NT, Hong Kong (Ch. 15)
J. A. Lopez, Department of Economics, University of Pennsylvania, Philadelphia, PA 19104, USA (Ch. 8)
G. S. Maddala, Department of Economics, Ohio State University, 1945 N. High Street, Columbus, OH 43210-1172, USA (Chs. 15, 17, 19)
V. Martin, Department of Economics, University of Melbourne, Parkville, VIC 3052, Australia (Ch. 4)
J. H. McCulloch, Department of Economics and Finance, 410 Arps Hall, 1945 N. High Street, Columbus, OH 43210-1172, USA (Ch. 13)
J. B. McDonald, Department of Economics, Brigham Young University, Provo, UT 84602, USA (Ch. 14)
M. Nimalendran, Department of Finance, College of Business, University of Florida, Gainesville, FL 32611, USA (Ch. 17)
A. R. Pagan, Economics Program, RSSS, Australian National University, Canberra, ACT 0200, Australia (Ch. 4)
F. C. Palm, Department of Quantitative Economics, University of Limburg, P.O. Box 616, 6200 MD Maastricht, The Netherlands (Ch. 7)
M. Qi, Department of Economics, College of Business Administration, Kent State University, P.O. Box 5190, Kent, OH 44242 (Ch. 18)
C. R. Rao, The Pennsylvania State University, Center for Multivariate Analysis, Department of Statistics, 325 Classroom Bldg., University Park, PA 16802-6105, USA (Ch. 16)
E. Renault, Institut d'Economie Industrielle, Université des Sciences Sociales, Place Anatole France, F-31042 Toulouse Cedex, France (Ch. 5)
J. Shanken, Department of Finance, Simon School of Business, University of Rochester, Rochester, NY 14627, USA (Ch. 23)
P. K. Trivedi, Department of Economics, Indiana University, Bloomington, IN 47405-6620, USA (Ch. 12)
J. G. Wang, AT&T, Rm. N460WOS, 412 Mt. Kemble Avenue, Morristown, NJ 07960, USA (Ch. 10)
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14. © 1996 Elsevier Science B.V. All rights reserved.
We provide a brief review of the techniques that are based on the generalized method of moments (GMM) and used for evaluating capital asset pricing models. We first develop the CAPM and multi-beta models and discuss the classical two-stage regression method originally used to evaluate them. We then describe the pricing kernel representation of a generic asset pricing model; this representation facilitates use of the GMM in a natural way for evaluating the conditional and unconditional versions of most asset pricing models. We also discuss diagnostic methods that provide additional insights.
1. Introduction
A major part of the research effort in finance is directed toward understanding why we observe a variety of financial assets with different expected rates of return. For example, the U.S. stock market as a whole earned an average annual return of 11.94% during the period from January of 1926 to the end of 1991. U.S. Treasury bills, in contrast, earned only 3.64%. The inflation rate during the same period was 3.11% (see Ibbotson Associates 1992). To appreciate the magnitude of these differences, note that in 1926 a nice dinner for two in New York would have cost about $10. If the same $10 had been invested in Treasury bills, by the end of 1991 it would have grown to $110, still enough for a nice dinner for two. Yet $10 invested in stocks would have grown to $6,756. The point is that the average return differentials among financial assets are both substantial and economically important. A variety of asset pricing models have been proposed to explain this phenomenon. Asset pricing models describe how the price of a claim to a future payoff is determined in securities markets. Alternatively, we may view asset pricing models as describing the expected rates of return on financial assets, such as stocks, bonds, futures, options, and other securities. Differences among the various asset pricing models arise from differences in their assumptions that restrict investors' preferences, endowments, production, and information sets; the stochastic process governing the arrival of news in the financial markets; and the type of frictions allowed in the markets for real and financial assets.

While there are differences among asset pricing models, there are also important commonalities. All asset pricing models are based on one or more of three central concepts. The first is the law of one price, according to which the prices of any two claims which promise the same future payoff must be the same. The law of one price arises as an implication of the second concept, the no-arbitrage principle. The no-arbitrage principle states that market forces tend to align the prices of financial assets to eliminate arbitrage opportunities. Arbitrage opportunities arise when assets can be combined, by buying and selling, to form portfolios that have zero net cost, no chance of producing a loss, and a positive probability of gain. Arbitrage opportunities tend to be eliminated by trading in financial markets, because prices adjust as investors attempt to exploit them. For example, if there is an arbitrage opportunity because the price of security A is too low, then traders' efforts to purchase security A will tend to drive up its price. The law of one price follows from the no-arbitrage principle, when it is possible to buy or sell two claims to the same future payoff. If the two claims do not have the same price, and if transaction costs are smaller than the difference between their prices, then an arbitrage opportunity is created. The arbitrage pricing theory (APT, Ross 1976) is one of the best-known asset pricing models based on arbitrage principles.

* Ferson acknowledges financial support from the Pigott-PACCAR Professorship at the University of Washington. Jagannathan acknowledges financial support from the National Science Foundation, grant SBR-9409824. The views expressed herein are those of the authors and not necessarily those of the Federal Reserve Bank of Minneapolis or the Federal Reserve System.
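The link between the law of one price and arbitrage can be made concrete with a few lines of Python. This is only a sketch: the function name `arbitrage_profit`, the prices, and the assumption that a round trip costs exactly two trades at a fixed cost each are illustrative choices, not part of the chapter.

```python
def arbitrage_profit(price_a: float, price_b: float, cost_per_trade: float = 0.0) -> float:
    """Riskless profit from buying the cheaper claim and selling the dearer one.

    Both claims are assumed to promise the same future payoff, so the long
    and short positions cancel exactly at the payoff date. A round trip is
    assumed to involve two trades (one purchase, one sale). Returns 0.0 when
    the price gap does not cover the transaction costs.
    """
    gap = abs(price_a - price_b)
    profit = gap - 2.0 * cost_per_trade
    return max(profit, 0.0)


# Hypothetical prices for two claims to the same payoff.
wide_gap = arbitrage_profit(98.0, 100.0, cost_per_trade=0.5)    # gap 2.0, costs 1.0
narrow_gap = arbitrage_profit(99.5, 100.0, cost_per_trade=0.5)  # gap 0.5, costs 1.0
print(wide_gap, narrow_gap)
```

In the first case the price difference exceeds the transaction costs, so an arbitrage opportunity exists; in the second case it does not, and the two prices can differ without violating the no-arbitrage principle.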
The third central concept behind asset pricing models is financial market equilibrium. Investors' desired holdings of financial assets are derived from an optimization problem. A necessary condition for financial market equilibrium in a market with no frictions is that the first-order conditions of the investors' optimization problem be satisfied. This requires that investors be indifferent at the margin to small changes in their asset holdings. Equilibrium asset pricing models follow from the first-order conditions for the investors' portfolio choice problem and from a market-clearing condition. The market-clearing condition states that the aggregate of investors' desired asset holdings must equal the aggregate "market portfolio" of securities in supply. The earliest of the equilibrium asset pricing models is the Sharpe-Lintner-Mossin-Black capital asset pricing model (CAPM), developed in the early 1960s. The CAPM states that expected asset returns are given by a linear function of the assets' betas, which are their regression coefficients against the market portfolio. Merton (1973) extended the CAPM, which is a single-period model, to an economic environment where investors make consumption, savings, and investment decisions repetitively over time. Econometrically, Merton's model generalizes the CAPM from a model with a single beta to one with multiple betas. A multiple-beta model states that assets' expected returns are linear functions of a number of betas. The APT of Ross (1976) is another example of a multiple-beta asset pricing model, although in the APT the expected returns are only approximately a linear function of the relevant betas.

In this paper we emphasize (but not exclusively) the econometric evaluation of asset pricing models using the generalized method of moments (GMM, Hansen 1982). We focus on the GMM because, in our opinion, it is the most important innovation in empirical methods in finance within the past fifteen years. The approach is simple, flexible, valid under general statistical assumptions, and often powerful in financial applications. One reason the GMM is "general" is that many empirical methods used in finance and other areas can be viewed as special cases of the GMM.

The rest of this paper is organized as follows. In Section 2 we develop the CAPM and multiple-beta models and discuss the classical two-stage regression procedure that was originally used to evaluate these models. This material provides an introduction to the various statistical issues involved in the empirical study of the models; it also motivates the need for multivariate estimation methods. In Section 3 we describe an alternative representation of the asset pricing models which facilitates the use of the GMM. We show that most asset pricing models can be represented in this stochastic discount factor form. In Section 4 we describe the GMM procedure and illustrate how to use it to estimate and test conditional and unconditional versions of asset pricing models. In Section 5 we discuss model diagnostics that provide additional insight into the causes for statistical rejections and that help assess specification errors in the models. In order to avoid a proliferation of symbols, we sometimes use the same symbols to mean different things in different subsections. The definitions should be clear from the context. We conclude with a summary in Section 6.
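that the market return is an exact linear function of the return on an observable portfolio of common stocks.1 Then, according to the CAPM,

$$E(R_{it}) = \delta_0 + \delta_1 \beta_i, \quad \text{where } \beta_i = \operatorname{Cov}(R_{it}, R_{mt})/\operatorname{Var}(R_{mt}). \tag{2.1}$$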
(2.2)
where $R_{pt}$ denotes the date $t$ return on the optimally chosen portfolio and $E(\cdot \mid I)$ and $\operatorname{Var}(\cdot \mid I)$ denote the expectation and variance of return, conditional on the information set $I$ of the investor as of time $t-1$. We assume that the function $V[\cdot,\cdot]$ is increasing and concave in its first argument, decreasing in its second argument, and time-invariant. For the moment we assume that the information set $I$ includes only the unconditional moments of asset returns, and we drop the symbol $I$ to simplify the notation. The first-order conditions for the optimization problem given above can be manipulated to show that the following must hold:

$$E(R_{it}) = E(R_{0t}) + \beta_{ip} E(R_{pt} - R_{0t}) \tag{2.3}$$

for every asset $i = 1, 2, \ldots, N$, where $R_{pt}$ is the return on the optimally chosen portfolio, $R_{0t}$ is the return on the asset that has zero covariance with $R_{pt}$, and $\beta_{ip} = \operatorname{Cov}(R_{it}, R_{pt})/\operatorname{Var}(R_{pt})$.
To get from the first-order condition for an investor's optimization problem, as stated in equation (2.3), to the CAPM, it is useful to understand some of the properties of the minimum-variance frontier, that is, the set of portfolio returns with the minimum variance, given their expected returns. It can be readily verified that the optimally chosen portfolio of the investor is on the minimum-variance frontier. One property of the minimum-variance frontier is that it is closed to portfolio formation. That is, portfolios of frontier portfolios are also on the frontier.
1 When this assumption fails, it introduces market proxy error. This source of error is studied by Roll (1977), Stambaugh (1982), Kandel (1984), Kandel and Stambaugh (1987), Shanken (1987), Hansen and Jagannathan (1994), and Jagannathan and Wang (1996), among others. We will ignore proxy error in our discussion.
Suppose that all investors have the same beliefs. Then every investor's optimally chosen portfolio will be on the same frontier, and hence the market portfolio of all assets in the economy (which is a portfolio of every investor's optimally chosen portfolio) will also be on the frontier. It can be shown (Roll 1977) that equation (2.3) will hold if $R_{pt}$ is replaced by the return of any portfolio on the frontier and $R_{0t}$ is replaced by its corresponding zero-beta return. Hence we can replace an investor's optimal portfolio in equation (2.3) with the return on the market portfolio to get the CAPM, as given by equation (2.1).
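The claim that equation (2.3) holds for any frontier portfolio and its zero-beta companion can be checked numerically. The sketch below uses NumPy with a made-up mean vector `mu` and covariance matrix `sigma` (both hypothetical), builds a frontier portfolio with the textbook two-constraint minimum-variance solution, and verifies the linear beta relation. It assumes the chosen portfolio is not the global minimum-variance portfolio, for which no zero-beta companion exists.

```python
import numpy as np

# Hypothetical expected returns and covariance matrix for four assets.
mu = np.array([0.04, 0.08, 0.12, 0.15])
sigma = np.array([
    [0.020, 0.004, 0.006, 0.002],
    [0.004, 0.040, 0.010, 0.008],
    [0.006, 0.010, 0.090, 0.020],
    [0.002, 0.008, 0.020, 0.160],
])

ones = np.ones_like(mu)
c = np.column_stack([mu, ones])        # constraints: target mean, weights sum to 1
sinv_c = np.linalg.solve(sigma, c)     # Sigma^{-1} C
a = c.T @ sinv_c                       # 2 x 2 matrix C' Sigma^{-1} C


def frontier_weights(target_mean):
    """Minimum-variance weights for a target mean, plus the multipliers."""
    lam = np.linalg.solve(a, np.array([target_mean, 1.0]))
    return sinv_c @ lam, lam


mean_p = 0.10
w_p, lam = frontier_weights(mean_p)    # frontier portfolio p
cov_ip = sigma @ w_p                   # Cov(R_i, R_p) = lam[0] * mu_i + lam[1]
beta_ip = cov_ip / (w_p @ cov_ip)      # divide by Var(R_p)

# Expected return of the frontier portfolio with zero covariance with p.
mean_0 = -lam[1] / lam[0]
w_0, _ = frontier_weights(mean_0)

# Right-hand side of equation (2.3) for every asset.
implied_mu = mean_0 + beta_ip * (mean_p - mean_0)
print(np.allclose(implied_mu, mu), abs(w_0 @ sigma @ w_p) < 1e-10)
```

Changing `mean_p` to any other frontier target leaves the check intact, which is the property used to move from the investor's portfolio to the market portfolio.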
$$\bar{R}_i = \delta_0 + \delta_1 b_i + e_i, \quad i = 1, \ldots, N \tag{2.4}$$

which is a cross-sectional regression of $\bar{R}_i$ on $b_i$, with regression coefficients equal to $\delta_0$ and $\delta_1$. In equation (2.4), $\bar{R}_i$ denotes the sample average return of asset $i$, and $b_i$ is the (OLS) slope coefficient estimate from a regression of the return, $R_{it}$, over time on the market index return, $R_{mt}$, and a constant. Let $u_i = \bar{R}_i - E(R_{it})$ and $v_i = \beta_i - b_i$. Substituting these relations for $E(R_{it})$ and $\beta_i$ in (2.1) leads to (2.4) and specifies the composite error as $e_i = u_i + \delta_1 v_i$. This gives rise to a classic errors-in-variables problem, as the regressor $b_i$ in the cross-sectional regression model (2.4) is measured with error. When $b_i$ is estimated from a finite time-series sample, the regression (2.4) will deliver inconsistent estimates of $\delta_0$ and $\delta_1$, even with an infinite cross-sectional sample. However, the cross-sectional regression will provide consistent estimates of the coefficients as the time-series sample size $T$ (which is used in the first step to estimate the beta coefficient $\beta_i$) becomes very large. This is because the first-step estimate of $\beta_i$ is consistent, so as $T$ becomes large, the errors-in-variables problem of the second-stage regression vanishes.

The measurement error in beta may be large for individual securities, but it is smaller for portfolios. In view of this fact, early research focused on creating portfolios of securities in such a way that the betas of the portfolios could be estimated precisely. Hence one solution to the errors-in-variables problem is to work with portfolios instead of individual securities. This creates another problem. Arbitrarily chosen portfolios tend to exhibit little dispersion in their betas. If all the portfolios available to the econometrician have the same betas, then equation (2.1) has no empirical content as a cross-sectional relation. Black, Jensen, and Scholes (BJS, 1972) came up with an innovative solution to overcome this difficulty. At every point in time for which a cross-sectional regression is run, they estimate betas on individual securities based on past history, sort the securities based on the estimated values of beta, and assign individual securities to beta groups. This results in portfolios with a substantial dispersion in their betas. Similar portfolio formation techniques have become standard practice in the empirical finance literature.

Suppose that we can create portfolios in such a way that we can view the errors-in-variables problem as being of second-order importance. We still have to determine how to assess whether there is empirical support for the CAPM. A standard approach in the literature is to consider specific alternative hypotheses about the variables which determine expected asset returns. According to the CAPM, the expected return for any asset is a linear function of its beta only. Therefore, one natural test would be to examine if any other cross-sectional variable has the ability to explain the deviations from equation (2.1). This is the strategy that Fama and MacBeth (1973) followed by incorporating the square of beta and measures of nonmarket (or residual time-series) variance as additional variables in the cross-sectional regressions. More recent empirical studies have used the relative size of firms, measured by the market value of their equity, the ratio of book-to-market equity, and related variables.2 For example, the following model may be specified:

$$E(R_{it}) = \delta_0 + \delta_1 \beta_i + \delta_{size} LME_i \tag{2.5}$$

where $LME_i$ is the natural logarithm of the total market value of the equity capital of firm $i$. In what follows we will first show that these ideas extend easily to the general multiple-beta model. We will then develop a sampling theory for the cross-sectional regression estimators.
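The two-stage procedure behind equation (2.4) can be sketched on simulated data. Everything in the block below is an illustrative assumption rather than the chapter's empirical setup: the parameter values, the normal return-generating process, and the choice of a long sample (`n_periods = 5000`) so that the errors-in-variables problem is negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_assets, n_periods = 25, 5000
delta0, delta1 = 0.002, 0.010      # hypothetical zero-beta return and beta premium

beta = np.linspace(0.5, 1.5, n_assets)                       # true betas
r_m = delta0 + delta1 + 0.05 * rng.standard_normal(n_periods)  # market index return
eps = 0.02 * rng.standard_normal((n_periods, n_assets))

# Simulated returns satisfying E(R_it) = delta0 + delta1 * beta_i.
returns = delta0 + np.outer(r_m - delta0, beta) + eps        # T x N

# First stage: time-series OLS of each asset on a constant and the market.
x_ts = np.column_stack([np.ones(n_periods), r_m])
coef_ts, *_ = np.linalg.lstsq(x_ts, returns, rcond=None)     # 2 x N
b = coef_ts[1]                                               # estimated betas b_i

# Second stage: cross-sectional OLS of average returns on estimated betas.
r_bar = returns.mean(axis=0)
x_cs = np.column_stack([np.ones(n_assets), b])
(d0_hat, d1_hat), *_ = np.linalg.lstsq(x_cs, r_bar, rcond=None)

print(d0_hat, d1_hat)
```

With a much shorter time series the first-stage betas become noisy and the second-stage slope is pulled toward zero, which is the errors-in-variables effect that motivates working with beta-sorted portfolios.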
$$E(R_{it}) = \delta_0 + \sum_{k=1}^{K} \delta_k \beta_{ik} \tag{2.6}$$

where $\beta_{ik}$, $k = 1, \ldots, K$, are the multiple regression coefficients of the return of asset $i$ on $K$ economy-wide pervasive risk factors, $f_k$, $k = 1, \ldots, K$. The coefficient $\delta_0$ is the expected return on an asset that has $\beta_{0k} = 0$ for $k = 1, \ldots, K$; i.e., it is the expected return on a zero (multiple) beta asset. The coefficient $\delta_k$, corresponding to the $k$th factor, has the following interpretation: it is the expected return differential, or premium, for a portfolio that has $\beta_{ik} = 1$ and $\beta_{ij} = 0$ for all $j \neq k$, measured in excess of the zero-beta asset's expected return. In other words, it is the expected return premium per unit of beta risk for the risk factor, $k$. Ross (1976) showed that an approximate version of (2.6) will hold in an arbitrage-free economy. Connor (1984) provided sufficient conditions for (2.6) to hold exactly in an economy with an infinite number of assets in general equilibrium. This version of the multiple-beta model, the exact APT, has received wide attention in the finance literature. When the factors, $f_k$, are observed by the econometrician, the cross-sectional regression method can be used to empirically evaluate the multiple-beta model.3 For example, the alternative hypothesis that the size of the firm is related to expected returns, given the factor betas, may be examined by using cross-sectional regressions of returns on the $K$ factor betas and the $LME_i$, similar to equation (2.5), and by examining whether the coefficient $\delta_{size}$ is different from zero.

2 Fama and French (1992) is a prominent recent example of this approach. Berk (1995) provides a justification for using relative market value and book-to-price ratios as measures of expected returns.
2.4. Sampling distributions for coefficient estimators: The two-stage, cross-sectional regression method
In this section we follow Shanken (1992) and Jagannathan and Wang (1993, 1996) in deriving the asymptotic distribution of the coefficients that are estimated using the cross-sectional regression method. For the purposes of developing the sampling theory, we will work with the following generalization of equation (2.6):

$$E(R_{it}) = \sum_{k=0}^{K_1} \gamma_{1k} A_{ik} + \sum_{k=1}^{K_2} \gamma_{2k} \beta_{ik} \tag{2.7}$$

where $\{A_{ik}\}$ are observable characteristics of firm $i$, which are assumed to be measured without error (the first "characteristic," when $k = 0$, is the constant 1.0). One of the attributes may be the size variable $LME_i$. The $\beta_{ik}$ are regression betas on a set of $K_2$ economic risk factors, which may include the market index return. Equation (2.7) can be written more compactly using matrix notation as

$$\mu = X\gamma \tag{2.8}$$

where $R_t = [R_{1t}, \ldots, R_{Nt}]$, $\mu = E(R_t)$, $X = [A : \beta]$, and the definition of the matrices $A$ and $\beta$ and the vector $\gamma$ follow from (2.7). The cross-sectional method proceeds in two stages. First, $\beta$ is estimated by time-series regressions of $R_{it}$ on the risk factors and a constant. The estimates are denoted by $b$. Let $x = [A : b]$, and let $\bar{R}$ denote the time-series average of the return vector $R_t$. Let $g$ denote the estimator of the coefficient vector obtained from the following cross-sectional regression:

$$g = (x'x)^{-1} x' \bar{R} \tag{2.9}$$
3 See Chen (1983), Connor and Korajczyk (1986), Lehmann and Modest (1987), and McElroy and Burmeister (1988) for discussions on estimating and testing the model when the factor realizations are not observable under some additional auxiliary assumptions.
where we assume that $x$ is of rank $1 + K_1 + K_2$. If $b$ and $\bar{R}$ converge respectively to $\beta$ and $E(R_t)$ in probability, then $g$ will converge in probability to $\gamma$. Black, Jensen, and Scholes (1972) suggest estimating the sampling errors associated with the estimator, $g$, as follows. Regress $R_t$ on $x$ at each date $t$ to obtain $g_t$, where

$$g_t = (x'x)^{-1} x' R_t \tag{2.10}$$

The BJS estimate of the covariance matrix of $T^{1/2}(g - \gamma)$ is then

$$v = T^{-1} \sum_t (g_t - g)(g_t - g)' \tag{2.11}$$

which uses the fact that $g$ is the sample mean of the $g_t$'s. Substituting the expression for $g_t$ given in (2.10) into the expression for $v$ given in (2.11) gives

$$v = (x'x)^{-1} x' \left[ T^{-1} \sum_t (R_t - \bar{R})(R_t - \bar{R})' \right] x (x'x)^{-1} \tag{2.12}$$
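The period-by-period regressions in (2.10) and the two equivalent forms of the BJS covariance estimator, (2.11) and (2.12), can be checked with a short NumPy sketch. The regressor matrix `x` and the simulated returns are hypothetical; the only point is the algebraic equivalence, and the final line converts `v` into standard errors for `g` under the reading that `v` estimates the variance of $T^{1/2}(g - \gamma)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_assets, n_periods = 10, 120

# Hypothetical second-stage regressors x = [A : b]: a constant and one beta.
x = np.column_stack([np.ones(n_assets), np.linspace(0.5, 1.5, n_assets)])
returns = 0.01 + 0.05 * rng.standard_normal((n_periods, n_assets))  # T x N

proj = np.linalg.solve(x.T @ x, x.T)     # (x'x)^{-1} x', shape k x N

# Equation (2.10): one cross-sectional regression per date; row t holds g_t.
g_t = returns @ proj.T
g = g_t.mean(axis=0)                     # sample mean of the g_t's

# Equation (2.11): v = T^{-1} sum_t (g_t - g)(g_t - g)'.
dev = g_t - g
v_from_gt = dev.T @ dev / n_periods

# Equation (2.12): sandwich form built from the return covariance matrix.
r_bar = returns.mean(axis=0)
r_dev = returns - r_bar
v_sandwich = proj @ (r_dev.T @ r_dev / n_periods) @ proj.T

# Standard errors of g, ignoring the sampling error in the betas.
std_err = np.sqrt(np.diag(v_from_gt) / n_periods)
print(np.allclose(v_from_gt, v_sandwich), np.allclose(g, proj @ r_bar))
```

Because `x` is held fixed across dates, averaging the `g_t` reproduces the single regression (2.9) on average returns, which is why the two forms of `v` coincide.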
To analyze the BJS covariance matrix estimator, we write the average return vector, $\bar{R}$, as $\bar{R} = x\gamma + (\bar{R} - \mu) - (x - X)\gamma$. Substitute this expression for $\bar{R}$ into the expression for $g$ in (2.9) to obtain

$$g - \gamma = (x'x)^{-1} x' \left[ (\bar{R} - \mu) - (b - \beta)\gamma_2 \right] \tag{2.13}$$
(2.14)
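Assume that $b$ is a consistent estimate of $\beta$ and that $T^{1/2}(\bar{R} - \mu) \rightarrow_d u$ and $T^{1/2}(b - \beta) \rightarrow_d h$, where $u$ and $h$ are random variables with well-defined distributions and $\rightarrow_d$ indicates convergence in distribution. We then have

$$T^{1/2}(g - \gamma) \rightarrow_d (x'x)^{-1} x' u - (x'x)^{-1} x' h \gamma_2 \tag{2.15}$$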
In (2.15) the first term on the right side is that component of the sampling error that arises from replacing $\mu$ by the sample average $\bar{R}$. The second term is the component of the sampling error that arises due to replacing $\beta$ by its estimate $b$. The usual consistent estimate of the asymptotic variance of $u$ is given by

$$T^{-1} \sum_t (R_t - \bar{R})(R_t - \bar{R})' \tag{2.16}$$

Pre-multiplying (2.16) by $(x'x)^{-1}x'$ and post-multiplying by $x(x'x)^{-1}$ gives a consistent estimate of the asymptotic variance of the first term in (2.15), which is the same as the expression for the BJS estimate for the covariance matrix of the estimated coefficients $v$, given in (2.12). Hence if we ignore the sampling error that arises from using estimated betas, then the BJS covariance estimator provides a consistent estimate of the variance of the estimator $g$. However, if the sampling error associated with the betas is not small, then the BJS covariance estimator will have a bias. While it is not possible to determine the magnitude of the bias in general, Shanken (1992) provides a method to assess the bias under additional assumptions.4 Consider the following univariate time-series regression for the return of asset $i$ on a constant and the $k$th economic factor:

$$R_{it} = \alpha_{ik} + \beta_{ik} f_{kt} + \varepsilon_{ikt} \tag{2.17}$$
We make the following additional assumptions about the error terms in (2.17): (1) the error $\varepsilon_{ikt}$ is mean zero, conditional on the time series of the economic factors $f_k$; (2) the conditional covariance of $\varepsilon_{ikt}$ and $\varepsilon_{jlt}$, given the factors, is a fixed constant $\sigma_{ijkl}$. We denote the matrix of the $\{\sigma_{ijkl}\}_{ij}$ by $\Sigma_{kl}$. Finally, we assume that (3) the sample covariance matrix of the factors exists and converges in probability to a constant positive definite matrix $\Omega$, with the typical element $\Omega_{kl}$.

THEOREM 2.1. (Shanken, 1992; Jagannathan and Wang, 1996) $T^{1/2}(g - \gamma)$ converges in distribution to a normally distributed random variable with zero mean and covariance matrix $V + W$, where $V$ is the probability limit of the matrix $v$ given in (2.12) and
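$$W = \sum_{l,k=1}^{K_2} (x'x)^{-1} x' \left\{ \gamma_{2k} \gamma_{2l} \left( \Omega_{kk}^{-1} \Pi_{kl} \Omega_{ll}^{-1} \right) \right\} x (x'x)^{-1} \tag{2.18}$$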
where $\Pi_{kl}$ is defined in the appendix.

PROOF. See the appendix.

Theorem 2.1 shows that in order to obtain a consistent estimate of the covariance matrix of the BJS two-step estimator $g$, we first estimate $v$ (a consistent estimate of $V$) by using the BJS method. We then estimate $W$ by its sample analogue. Although the cross-sectional regression method is intuitively very appealing, the above discussion shows that in order to assess the sampling errors associated with the parameter estimators, we need to make rather strong assumptions. In addition, the econometrician must take a stand on a particular alternative hypothesis against which to reject the model. The general approach developed in Section 4 below has, among its advantages, weaker statistical assumptions and the ability to handle both unspecified as well as specific alternative hypotheses.
4 Shanken (1992) uses betas computed from multiple regressions. The derivation which follows uses betas computed from univariate regressions, for simplicity of exposition. The two sets of betas are related by an invertible linear transformation. Alternatively, the factors may be orthogonalized without loss of generality.
The notation $E_t\{\cdot\}$ will be used to denote the conditional expectation, given a market-wide information set. Sometimes it will be convenient to refer to expectations conditional on a subset $Z_t$ of the market information, which are denoted as $E(\cdot \mid Z_t)$. For example, $Z_t$ can represent a vector of instrumental variables for the public information set which are available to the econometrician. When $Z_t$ is the null information set, the unconditional expectation is denoted as $E(\cdot)$. If we take the expected values of equation (3.1), it follows that versions of the same equation must hold for the expectations $E(\cdot \mid Z_t)$ and $E(\cdot)$. The random variable $m_{t+1}$ has various names in the literature. It is known as a stochastic discount factor, an equivalent martingale measure, a Radon-Nikodym derivative, or an intertemporal marginal rate of substitution. We will refer to an $m_{t+1}$ which satisfies (3.1) as a valid stochastic discount factor. The motivation for use of this term arises from the following observation. Write equation (3.1) as $P_{it} = E_t\{m_{t+1} X_{i,t+1}\}$, where $X_{i,t+1}$ is the payoff of asset $i$ at time $t+1$ (the market value plus any cash payments) and $R_{i,t+1} = X_{i,t+1}/P_{it}$. Equation (3.1) says that if we multiply a future payoff $X_{i,t+1}$ by the stochastic discount factor $m_{t+1}$ and take the expected value, we obtain the present value of the future payoff. The existence of an $m_{t+1}$ that satisfies (3.1) says that all assets with the same payoffs have the same price (i.e., the law of one price). With the restriction that $m_{t+1}$ is a strictly positive random variable, equation (3.1) becomes equivalent to a no-arbitrage condition. The condition is that all portfolios of assets with payoffs that can never be negative, but are positive with positive probability, must have positive prices.
The no-arbitrage condition does not uniquely identify $m_{t+1}$ unless markets are complete, which means that there are as many linearly independent payoffs available in the securities markets as there are states of nature at date $t+1$. To obtain additional insights about the stochastic discount factor and the no-arbitrage condition, assume for the moment that the markets are complete. Given complete markets, positive state prices are required to rule out arbitrage opportunities.$^{5}$ Let $q_{ts}$ denote the time $t$ price of a security that pays one unit at date $t+1$ if, and only if, the state of nature at $t+1$ is $s$. Then the time $t$ price of a
5 See Debreu (1959) and Arrow (1970) for models of complete markets. See Beja (1971), Rubinstein (1976), Ross (1977), Harrison and Kreps (1979), and Hansen and Richard (1987) for further theoretical discussions.
security with state-contingent payoffs $\{X_{i,s,t+1}\}$ is
$$\sum_s q_{ts} X_{i,s,t+1} = \sum_s \pi_{ts}\,(q_{ts}/\pi_{ts})\,X_{i,s,t+1},$$
where $\pi_{ts}$ is the probability, as assessed at time $t$, that state $s$ occurs at time $t+1$. Comparing this expression with equation (3.1) shows that $m_{s,t+1} = q_{ts}/\pi_{ts}$ is the value of the stochastic discount factor in state $s$, under the assumption that the markets are complete. Since the probabilities are positive, the condition that the random variable defined by $\{m_{s,t+1}\}$ is strictly positive is equivalent to the condition that all state prices are positive.
Equation (3.1) is convenient for developing econometric tests of asset pricing models. Let $R_{t+1}$ denote the vector of gross returns on the $N$ assets on which the econometrician has observations. Then (3.1) can be written as
$$E\{R_{t+1} m_{t+1} - \mathbf{1}\} = \mathbf{0} \qquad (3.2)$$
where $\mathbf{1}$ denotes the $N$-vector of ones and $\mathbf{0}$ denotes the $N$-vector of zeros. The set of $N$ equations given in (3.2) will form the basis for tests using the generalized method of moments. It is the specific form of $m_{t+1}$ implied by a model that gives the equation empirical content.
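The complete-markets argument can be illustrated numerically. The following is a minimal sketch, assuming a hypothetical three-state economy in which every probability, state price, and payoff is made up for illustration; it checks that pricing with state prices and pricing with the stochastic discount factor $m_s = q_s/\pi_s$ agree, and that the moment condition (3.2) holds state by state in population.

```python
import numpy as np

# Hypothetical three-state economy; every number below is illustrative.
pi = np.array([0.3, 0.5, 0.2])     # state probabilities assessed at time t
q = np.array([0.28, 0.47, 0.21])   # positive state prices
X = np.array([[1.0, 1.0, 1.0],     # riskless payoff
              [0.8, 1.1, 1.4],     # risky payoff
              [0.0, 0.0, 1.0]])    # claim that pays only in the third state

m = q / pi                          # stochastic discount factor in each state
price_from_q = X @ q                # sum over states of state price times payoff
price_from_m = (X * m) @ pi         # expected value of m times payoff
R = X / price_from_q[:, None]       # gross returns, asset by state
pricing_error = (R * m) @ pi - 1.0  # E{R m} - 1 for each asset

print(price_from_q, price_from_m, pricing_error)
```

Because all state prices are positive here, the implied discount factor is strictly positive, which is the no-arbitrage case described above.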
3.1. Stochastic discount factor representations of the CAPM and multiple-beta asset pricing models
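Consider the CAPM, as given by equation (2.1):
$$E(R_{i,t+1}) = \delta_0 + \delta_1 \beta_i,$$
where $\beta_i = \mathrm{Cov}(R_{i,t+1}, R_{m,t+1})/\mathrm{Var}(R_{m,t+1})$. Expanding the expected product in the unconditional version of (3.1) into the product of the expectations plus the covariance, and rearranging, gives
$$E(R_{i,t+1}) = 1/E(m_{t+1}) + \mathrm{Cov}(R_{i,t+1}, -m_{t+1})/E(m_{t+1}). \qquad (3.3)$$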
Equating terms in equations (2.1) and (3.3) shows that the CAPM of equation (2.1) is equivalent to a version of equation (3.1), where
$$E(R_{i,t+1} m_{t+1}) = 1,$$
with
$$m_{t+1} = c_0 - c_1 R_{m,t+1}, \qquad c_0 = [1 + E(R_{m,t+1})\,\delta_1/\mathrm{Var}(R_{m,t+1})]/\delta_0, \qquad (3.4)$$
and $c_1 = \delta_1/[\delta_0 \mathrm{Var}(R_{m,t+1})]$. Equation (3.4) was originally derived by Dybvig and Ingersoll (1982). Now consider the following multiple-beta model which was given in equation (2.6):
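$$E(R_{i,t+1}) = \delta_0 + \sum_{k=1}^{K} \delta_k \beta_{ik}.$$
It can be readily verified by substitution that this model implies the following stochastic discount factor representation:
$$E(R_{i,t+1} m_{t+1}) = 1,$$
where
$$m_{t+1} = c_0 + c_1 f_{1,t+1} + \cdots + c_K f_{K,t+1},$$
with
$$c_0 = \Big[1 + \sum_k \{\delta_k E(f_k)/\mathrm{Var}(f_k)\}\Big]/\delta_0 \qquad (3.5)$$
and $c_j = -\delta_j/[\delta_0 \mathrm{Var}(f_j)]$, $j = 1, \ldots, K$.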
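The single-factor case can be checked by substitution on simulated data. The sketch below is illustrative only: the return-generating numbers, the zero-beta rate `delta0`, and the betas are assumptions, and the residuals are constructed to be exactly orthogonal to the market in the sample so that the CAPM relation holds exactly for the sample moments. Under those assumptions, the discount factor $m = c_0 - c_1 R_m$ with the coefficients of (3.4) prices every asset.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 500, 4
Rm = 1.01 + 0.05 * rng.standard_normal(T)   # simulated gross market return
delta0 = 1.002                               # assumed zero-beta rate
delta1 = Rm.mean() - delta0                  # premium that makes the market satisfy (2.1)
beta = np.array([0.5, 0.8, 1.2, 1.5])        # assumed betas

# Residuals forced to be mean-zero and uncorrelated with Rm in this sample
raw = 0.03 * rng.standard_normal((T, N))
X = np.column_stack([np.ones(T), Rm])
resid = raw - X @ np.linalg.lstsq(X, raw, rcond=None)[0]

# Returns that satisfy E(R_i) = delta0 + delta1 * beta_i in the sample
R = delta0 + delta1 * beta + np.outer(Rm - Rm.mean(), beta) + resid

var_m = Rm.var()                             # 1/T normalisation, matching the sample means
c0 = (1.0 + Rm.mean() * delta1 / var_m) / delta0
c1 = delta1 / (delta0 * var_m)
m = c0 - c1 * Rm                             # stochastic discount factor of (3.4)

pricing = (R * m[:, None]).mean(axis=0)      # sample E(R_i m), one entry per asset
print(pricing)
```

The same check applies to the market return itself, since it has a beta of one by construction.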
The preceding results apply to the CAPM and multiple-beta models, interpreted as statements about the unconditional expected returns of the assets. These models are also interpreted as statements about conditional expected returns in some tests where the expectations are conditioned on predetermined, publicly available information. All of the analysis of this section can be interpreted as applying to conditional expectations, with the appropriate changes in notation. In this case, the parameters $c_0$, $c_1$, $\delta_0$, $\delta_1$, etc., will be functions of the time $t$ information set.
3.2. Other examples of stochastic discount factors
In equilibrium asset pricing models, equation (3.1) arises as a first-order condition for a consumer-investor's optimization problem. The agent maximizes a lifetime utility function of consumption (including possibly a bequest to heirs). Denote this function by $V(\cdot)$. If the allocation of resources to consumption and to investment assets is optimal, it is not possible to obtain higher utility by changing the allocation. Suppose that an investor considers reducing consumption at time $t$ to purchase more of (any) asset. The utility cost at time $t$ of the forgone consumption is the marginal utility of consumption expenditures $C_t$, denoted by $(\partial V/\partial C_t) > 0$, multiplied by the price $P_{i,t}$ of the asset, measured in the same units as the consumption expenditures. The expected utility gain of selling the share and consuming the proceeds at time $t+1$ is
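$$E_t\{(P_{i,t+1} + D_{i,t+1})(\partial V/\partial C_{t+1})\},$$
where $D_{i,t+1}$ is the cash flow or dividend paid at time $t+1$. If the allocation maximizes expected utility, the utility cost and the expected utility gain must be equal: $P_{i,t}\,E_t\{(\partial V/\partial C_t)\} = E_t\{(P_{i,t+1} + D_{i,t+1})(\partial V/\partial C_{t+1})\}$. This condition is equivalent to equation (3.1), with
$$m_{t+1} = (\partial V/\partial C_{t+1})/E_t\{(\partial V/\partial C_t)\}. \qquad (3.6)$$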
The $m_{t+1}$ in equation (3.6) is the intertemporal marginal rate of substitution (IMRS) of the representative consumer. The rest of this section shows how many models in the asset pricing literature are special cases of (3.1), where $m_{t+1}$ is defined by equation (3.6).$^{6}$
If a representative consumer's lifetime utility function $V(\cdot)$ is time-separable, the marginal utility of consumption at time $t$, $(\partial V/\partial C_t)$, depends only on variables dated at time $t$. Lucas (1978) and Breeden (1979) derived consumption-based asset pricing models of the following type, assuming that the preferences are time-separable and additive:
$$V = \sum_t \beta^t u(C_t),$$
where $\beta$ is a time discount parameter and $u(\cdot)$ is increasing and concave in current consumption $C_t$. A convenient specification for $u(\cdot)$ is
$$u(C) = [C^{1-\alpha} - 1]/(1-\alpha). \qquad (3.7)$$
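As a small illustration of how a power-utility specification maps into a stochastic discount factor, the sketch below assumes the form $u(C) = (C^{1-\alpha} - 1)/(1-\alpha)$, for which substitution into (3.6) gives $m_{t+1} = \beta (C_{t+1}/C_t)^{-\alpha}$. The function name, the parameter values, and the consumption series are hypothetical.

```python
import numpy as np

def power_utility_sdf(consumption, beta=0.99, alpha=2.0):
    """Return m_{t+1} = beta * (C_{t+1} / C_t) ** (-alpha) for a consumption series."""
    c = np.asarray(consumption, dtype=float)
    growth = c[1:] / c[:-1]          # gross consumption growth C_{t+1} / C_t
    return beta * growth ** (-alpha)

c = np.array([100.0, 102.0, 101.0, 104.0])   # hypothetical consumption levels
m = power_utility_sdf(c)
print(m)
```

Higher consumption growth lowers the discount factor, and a larger `alpha` makes it more sensitive to growth; with `alpha=0` the discount factor is the constant `beta`.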
In equation (3.7), $\alpha > 0$ is the concavity parameter of the period utility function. This function displays constant relative risk aversion equal to $\alpha$.$^{7}$ Based on these assumptions and using aggregate consumption data, a number of empirical studies test the consumption-based asset pricing model.$^{8}$
6 Asset pricing models typically focus on the relation of security returns to aggregate quantities. It is therefore necessary to aggregate the Euler equations of individuals to obtain equilibrium expressions in terms of aggregate quantities. Theoretical conditions which justify the use of aggregate quantities are discussed by Gorman (1953), Wilson (1968), Rubinstein (1974), Constantinides (1982), Lewbel (1989), Luttmer (1993), and Constantinides and Duffie (1994).
7 Relative risk aversion in consumption is defined as $-Cu''(C)/u'(C)$. Absolute risk aversion is $-u''(C)/u'(C)$, where a prime denotes a derivative. Ferson (1983) studies a consumption-based asset pricing model with constant absolute risk aversion.
8 Substituting (3.7) into (3.6) shows that $m_{t+1} = \beta (C_{t+1}/C_t)^{-\alpha}$. Empirical studies of this model include Hansen and Singleton (1982, 1983), Ferson (1983), Brown and Gibbons (1985), Jagannathan (1985), Ferson and Merrick (1987), and Wheatley (1988).
Dunn and Singleton (1986) and Eichenbaum, Hansen, and Singleton (1988), among others, model consumption expenditures that may be durable in nature. Durability introduces nonseparability over time, since the flow of consumption services depends on the consumer's previous expenditures, and the utility is defined over the services. Current expenditures increase the consumer's future utility of services if the expenditures are durable. The consumer optimizes over the expenditures $C_t$; thus, durability implies that the marginal utility, $(\partial V/\partial C_t)$, depends on variables dated other than date $t$.
Another form of time-nonseparability arises if the utility function exhibits habit persistence. Habit persistence means that consumption at two points in time are complements. For example, the utility of current consumption is evaluated relative to what was consumed in the past. Such models are derived by Ryder and Heal (1973), Becker and Murphy (1988), Sundaresan (1989), Constantinides (1990), Detemple and Zapatero (1991), and Novales (1992), among others.
Ferson and Constantinides (1991) model both the durability of consumption expenditures and habit persistence in consumption services. They show that the two combine as opposing effects. In an example where the effect is truncated at a single lag, the derived utility of expenditures is
$$V = (1-\alpha)^{-1} \sum_t \beta^t (C_t + bC_{t-1})^{1-\alpha}. \qquad (3.8)$$
The coefficient $b$ is positive and measures the rate of depreciation if the good is durable and there is no habit persistence. If habit persistence is present and the good is nondurable, this implies that the lagged expenditures enter with a negative effect ($b < 0$).
Ferson and Harvey (1992) and Heaton (1995) consider a form of time-nonseparability which emphasizes seasonality. The utility function is
$$V = (1-\alpha)^{-1} \sum_t \beta^t (C_t + bC_{t-4})^{1-\alpha}, \qquad (3.9)$$
where the consumption expenditure decisions are assumed to be quarterly. The subsistence level (in the case of habit persistence) or the flow of services (in the case of durability) is assumed to depend only on the consumption expenditure in the same quarter of the previous year.
Abel (1990) studies a form of habit persistence in which the consumer evaluates current consumption relative to the aggregate consumption in the previous period, consumption that he or she takes as exogenous. The utility function is like equation (3.8), except that the "habit stock," $bC_{t-1}$, refers to the aggregate consumption. The idea is that people care about "keeping up with the Joneses." Campbell and Cochrane (1995) also develop a model in which the habit stock is taken as exogenous by the consumer. This approach results in a simpler and more tractable model, since the consumer's optimization does not have to take account of the effects of current decisions on the future habit stock.
Epstein and Zin (1989, 1991) consider a class of recursive preferences which can be written as $V_t = F(C_t, CEQ_t(V_{t+1}))$. $CEQ_t(\cdot)$ is a time $t$ "certainty equivalent" for the future lifetime utility $V_{t+1}$. The function $F(\cdot, CEQ_t(\cdot))$ generalizes the usual expected utility function of lifetime consumption and may be time-nonseparable. Epstein and Zin (1989) study a special case of the recursive preference model in which the preferences are
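$$V_t = [(1-\beta)\,C_t^{\rho} + \beta\,E_t(V_{t+1}^{1-\alpha})^{\rho/(1-\alpha)}]^{1/\rho}. \qquad (3.10)$$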
They show that when $\rho \neq 0$ and $1 - \alpha \neq 0$, the IMRS for a representative agent becomes
$$m_{t+1} = [\beta (C_{t+1}/C_t)^{\rho-1}]^{(1-\alpha)/\rho}\,\{R_{m,t+1}\}^{(1-\alpha-\rho)/\rho}, \qquad (3.11)$$
where $R_{m,t+1}$ is the gross market portfolio return. The coefficient of relative risk aversion for timeless consumption gambles is $\alpha$, and the elasticity of substitution for deterministic consumption is $(1-\rho)^{-1}$. If $\alpha = 1 - \rho$, the model reduces to the time-separable, power utility model. If $\alpha = 1$, the log utility model of Rubinstein (1976) is obtained.
In summary, many asset pricing models are special cases of the equation (3.1). Each model specifies that a particular function of the data and the model parameters is a valid stochastic discount factor. We now turn to the issue of estimating the models stated in this form.
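The two special cases of the recursive-preference model can be checked numerically. The sketch below assumes the commonly cited form of the Epstein-Zin IMRS, $m = [\beta g^{\rho-1}]^{\theta} R_m^{\theta-1}$ with $\theta = (1-\alpha)/\rho$ and $g$ the gross consumption growth rate; that functional form, the function name, and all input values are assumptions made for illustration.

```python
import numpy as np

def epstein_zin_sdf(cons_growth, Rm, beta, alpha, rho):
    """IMRS under the assumed Epstein-Zin form; theta = (1 - alpha) / rho."""
    g = np.asarray(cons_growth, dtype=float)
    Rm = np.asarray(Rm, dtype=float)
    theta = (1.0 - alpha) / rho
    return (beta * g ** (rho - 1.0)) ** theta * Rm ** (theta - 1.0)

g = np.array([1.01, 1.03, 0.99])     # hypothetical gross consumption growth
Rm = np.array([1.05, 0.97, 1.10])    # hypothetical gross market returns

# alpha = 1 - rho: the market return drops out and power utility remains
m_power = epstein_zin_sdf(g, Rm, beta=0.98, alpha=2.0, rho=-1.0)

# alpha = 1: the formula collapses to the reciprocal of the market return
m_log = epstein_zin_sdf(g, Rm, beta=0.98, alpha=1.0, rho=0.5)
print(m_power, m_log)
```

In the first case the result matches $\beta g^{-\alpha}$, and in the second it matches $1/R_m$, which is consistent with the power utility and log utility cases described in the text.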
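4. The generalized method of moments
In this section we provide an overview of the generalized method of moments (GMM) and a brief review of the associated asymptotic test statistics. We then show how the GMM is used to estimate and test various specifications of asset pricing models.
Let $m_{t+1} = m(\theta, x_{t+1})$ denote the stochastic discount factor implied by a model, where $\theta$ is a vector of parameters and $x_{t+1}$ is a vector of observable variables. Define the model error term for asset $i$ as
$$u_{i,t+1} = m_{t+1} R_{i,t+1} - 1. \qquad (4.1)$$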
Equation (3.1) implies that $E_t\{u_{i,t+1}\} = 0$ for all $i$. Given a sample of $N$ assets and $T$ time periods, combine the error terms from (4.1) into a $T \times N$ matrix $u$, with typical row $u_{t+1}'$. By the law of iterated expectations, the model implies that $E(u_{i,t+1} \mid Z_t) = 0$ for all $i$ and $t$ (for any $Z_t$ in the information set at time $t$), and therefore $E(u_{t+1} Z_t) = 0$ for all $t$. The condition $E(u_{t+1} Z_t) = 0$ says that $u_{t+1}$ is orthogonal to $Z_t$ and is therefore called an orthogonality condition. These orthogonality conditions are the basis of tests of asset pricing models using the GMM.
A few points deserve emphasis. First, GMM estimates and tests of asset pricing models are motivated by the implication that $E(u_{i,t+1} \mid Z_t) = 0$, for any $Z_t$ in the information set at time $t$. However, the weaker condition $E(u_{t+1} Z_t) = 0$, for a given set of instruments $Z_t$, is actually used in the estimation. Therefore, GMM tests of asset pricing models have not exploited all of the predictions of the theories. We believe that further refinements to exploit the implications of the theories more fully will be useful.
Empirical work on asset pricing models relies on rational expectations, interpreted as the assumption that the expectation terms in the model are mathematical conditional expectations. For example, the rational expectations assumption is used when the expected value in equation (3.1) is treated as a mathematical conditional expectation to obtain expressions for $E(\cdot \mid Z)$ and $E(\cdot)$. Rational expectations implies that the difference between observed realizations and the expectations in the model should be unrelated to the information that the expectations are conditioned on.
Equation (3.1) says that the conditional expectation of the product of $m_{t+1}$ and $R_{i,t+1}$ is the constant 1.0. Therefore, the error term in equation (4.1), $m_{t+1}R_{i,t+1} - 1$, should not be predictably different from zero when we use any information available at time $t$. If there is variation over time in a return $R_{i,t+1}$ that is predictable using instruments $Z_t$, the model implies that the predictability is removed when $R_{i,t+1}$ is multiplied by a valid stochastic discount factor, $m_{t+1}$. This is the sense in which conditional asset pricing models are asked to "explain" predictable variation in asset returns. This idea generalizes the "random walk" model of stock values, which implies that stock returns should be completely unpredictable. That model is a special case which can be motivated by risk neutrality.
Under risk neutrality the IMRS is a constant. In this case, equation (3.1) implies that the return $R_{i,t+1}$ should not differ predictably from a constant.
GMM estimation proceeds by defining an $N \times L$ matrix of sample mean orthogonality conditions, $G = (u'Z/T)$, and letting $g = \mathrm{vec}(G)$, where $Z$ is a $T \times L$ matrix of observed instruments with typical row $Z_t'$, a subset of the available information at time $t$.$^{9}$ The $\mathrm{vec}(\cdot)$ operator means to partition $G$ into row vectors, each of length $L$: $(h_1, h_2, \ldots, h_N)$. Then one stacks the $h$'s into a vector, $g$, with length equal to the number of orthogonality conditions, $NL$. Hansen's (1982) GMM estimates of $\theta$ are obtained by searching for parameter values that make $g$ close to zero by minimizing a quadratic form $g'Wg$, where $W$ is an $NL \times NL$ weighting matrix.
Somewhat more generally, let $u_{t+1}(\theta)$ denote the random $N$-vector $R_{t+1} m(\theta, x_{t+1}) - \mathbf{1}$, and define $g_T(\theta) = T^{-1}\sum_t (u_t(\theta) \otimes Z_{t-1})$. Let $\theta_T$ denote the parameter values that minimize the quadratic form $g_T' A_T g_T$, where $A_T$ is any positive definite $NL \times NL$ matrix that may depend on the sample, and let $J_T$ denote the minimized value of the quadratic form $g_T' A_T g_T$. Jagannathan and Wang (1993) show that $J_T$ will have a weighted chi-square distribution which can be used for testing the hypothesis that (3.1) holds.
9 This section assumes that the same instruments are used for each of the asset equations. In general, each asset equation could use a different set of instruments, which complicates the notation.
THEOREM 4.1. (Jagannathan and Wang, 1993). Suppose that the matrix $A_T$ converges in probability to a constant positive definite matrix $A$. Assume also that $\sqrt{T}\,g_T(\theta_0) \to_d N(0, S)$, where $N(\cdot, \cdot)$ denotes the multivariate normal distribution, $\theta_0$ are the true parameter values, and $S$ is a positive definite matrix. Let
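$$D = E[\partial g_T/\partial\theta]\,\big|_{\theta=\theta_0}$$
and let
$$Q = (S^{1/2})'(A^{1/2})\,[I - (A^{1/2})'D(D'AD)^{-1}D'(A^{1/2})]\,(A^{1/2})'(S^{1/2}),$$
where $A^{1/2}$ and $S^{1/2}$ are the upper triangular matrices from the Cholesky decompositions of $A$ and $S$. Then the matrix $Q$ has $NL - \dim(\theta)$ nonzero, positive eigenvalues. Denote these eigenvalues by $\lambda_i$, $i = 1, 2, \ldots, NL - \dim(\theta)$. Then $J_T$ converges to a weighted sum of independent chi-square random variables, each with one degree of freedom, with weights $\lambda_i$.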
where $\partial g/\partial\theta$ is an $NL \times \dim(\theta)$ matrix of derivatives. A consistent estimator for the asymptotic covariance of the sample mean of the orthogonality conditions is used in practice. That is, we replace $W$ in (4.2) with $\mathrm{Cov}(g)^{-1}$ and replace $E(\partial g/\partial\theta)$ with its sample analogue. An example of a consistent estimator for the optimal weighting matrix is given by Hansen (1982) as
where $\otimes$ denotes the Kronecker product. A special case that often proves useful arises when the orthogonality conditions are not serially correlated. In that special case, the optimal weighting matrix is the inverse of the matrix $\mathrm{Cov}(g)$, where
$$\mathrm{Cov}(g) = \Big[(1/T)\sum_t (u_{t+1}u_{t+1}') \otimes (Z_t Z_t')\Big]. \qquad (4.4)$$
The GMM weighting matrices originally proposed by Hansen (1982) have some drawbacks. The estimators are not guaranteed to be positive definite, and they may have poor finite sample properties in some applications. A number of studies have explored alternative estimators for the GMM weighting matrix. A prominent example by Newey and West (1987a) suggests weighting the autocovariance terms in (4.3) with Bartlett weights to achieve a positive semidefinite matrix. Additional refinements to improve the finite sample properties are proposed by Andrews (1991), Andrews and Monahan (1992), and Ferson and Foerster (1994).
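The covariance estimators discussed above can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reproduction of any particular paper's implementation: with `lags=0` it computes the serially uncorrelated estimator of (4.4), and with `lags>0` it adds autocovariance terms weighted by Bartlett weights in the spirit of Newey and West (1987a). The function names and the simulated inputs are hypothetical, and the moments are not demeaned because they have mean zero under the null.

```python
import numpy as np

def moment_matrix(u, Z):
    """Row t holds the Kronecker product of u_{t+1} (length N) and Z_t (length L)."""
    T = u.shape[0]
    return np.einsum("ti,tj->tij", u, Z).reshape(T, -1)

def cov_g(u, Z, lags=0):
    """Estimate Cov(g); Bartlett-weighted autocovariances are added when lags > 0."""
    h = moment_matrix(u, Z)
    T = h.shape[0]
    S = h.T @ h / T                          # contemporaneous term, equation (4.4)
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1.0)           # Bartlett weight for lag j
        gamma = h[j:].T @ h[:-j] / T         # j-th sample autocovariance of the moments
        S += w * (gamma + gamma.T)
    return S

rng = np.random.default_rng(0)
T, N, L = 200, 3, 2
u = rng.standard_normal((T, N))                              # hypothetical model errors
Z = np.column_stack([np.ones(T), rng.standard_normal(T)])    # hypothetical instruments
S0 = cov_g(u, Z)
S3 = cov_g(u, Z, lags=3)
W = np.linalg.inv(S3)                        # a candidate GMM weighting matrix
```

The Bartlett weights decline linearly with the lag, which is what keeps the estimate positive semidefinite in finite samples.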
4.2. Testing hypotheses with the GMM
As we noted above, the $J_T$-statistic provides a goodness-of-fit test for a model that is estimated by the GMM, when the model is overidentified. Hansen's $J_T$-statistic is the most commonly used test in the finance literature that has used the GMM. Other standard statistical tests based on the GMM are also used in the finance literature for testing asset pricing models. One is a generalization of the Wald test, and a second is analogous to a likelihood ratio test statistic. Additional test statistics based on the GMM are reviewed by Newey (1985) and Newey and West (1987b).
For the Wald test, consider the hypothesis to be tested as expressed in the $M$-vector valued function $H(\theta) = 0$, where $M < \dim(\theta)$. The GMM estimates of $\theta$ are asymptotically normal, with mean $\theta$ and variance matrix $\widehat{\mathrm{Cov}}(\theta)$. Given standard regularity conditions, it follows that the estimates of $H$ are asymptotically normal, with mean zero and variance matrix $H_\theta \widehat{\mathrm{Cov}}(\theta) H_\theta'$, where subscripts denote partial derivatives, and that the quadratic form
$$H'[H_\theta \widehat{\mathrm{Cov}}(\theta) H_\theta']^{-1} H$$
is asymptotically chi-square, providing a standard Wald test.
A likelihood ratio type test is described by Newey and West (1987b), Eichenbaum, Hansen, and Singleton (1988, appendix C), and Gallant (1987). Newey and West (1987b) call this the D test. Assume that the null hypothesis implies that the orthogonality conditions $E(g^*) = 0$ hold, while, under the alternative, a subset $E(g) = 0$ hold. For example, $g^* = (g, h)$. When we estimate the model under the null hypothesis, the quadratic form $g^{*\prime} W^* g^*$ is minimized. Let $W^*_{11}$ be the upper left block of $W^*$; that is, let it be the estimate of $\mathrm{Cov}(g)^{-1}$ under the null. When we hold this matrix fixed the model can be estimated under the alternative by minimizing $g' W^*_{11} g$. The difference of the two quadratic forms, $T[g^{*\prime} W^* g^* - g' W^*_{11} g]$, is asymptotically chi-square, with degrees of freedom equal to $M$ if the null hypothesis is true. Newey and West (1987b) describe additional variations on these tests.
4.3.1. Static or unconditional CAPMs
If we make the assumption that all the expectation terms in the CAPM refer to the unconditional expectations, we have an unconditional version of the CAPM. It is straightforward to estimate and then test an unconditional version of the CAPM, using equation (3.1) and the stochastic discount factor representation given in equation (3.4). The stochastic discount factor is
$$m_{t+1} = c_0 + c_1 R_{m,t+1},$$
where $c_0$ and $c_1$ are fixed parameters. Using only the unconditional expectations, the model implies that
$$E\{(c_0 + c_1 R_{m,t+1})R_{t+1} - \mathbf{1}\} = \mathbf{0},$$
where $R_{t+1}$ is the vector of gross asset returns. The vector of sample orthogonality conditions is
$$g_T = g_T(c_0, c_1) = (1/T)\sum_t \{(c_0 + c_1 R_{m,t+1})R_{t+1} - \mathbf{1}\}.$$
With $N > 2$ assets, the number of orthogonality conditions is $N$ and the number of parameters is 2, so the $J_T$-statistic has $N - 2$ degrees of freedom. Tests of the unconditional CAPM using the stochastic discount factor representation are conducted by Carhart et al. (1995) and Jagannathan and Wang (1996), who reject the model using monthly data for the postwar United States.
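Because the moment conditions are linear in $(c_0, c_1)$, the GMM minimization has a closed form at each step. The following sketch is one possible two-step implementation under simplifying assumptions: the only instrument is a constant, the weighting matrix in the second step is the inverse of the serially uncorrelated covariance estimator, and the statistic is scaled by $T$. The function name and the simulated data are hypothetical.

```python
import numpy as np

def gmm_linear_sdf(R, Rm, n_steps=2):
    """Two-step GMM for m = c0 + c1 * Rm using the N conditions E{m R - 1} = 0."""
    T, N = R.shape
    F = np.column_stack([np.ones(T), Rm])              # m_t = F_t @ c
    D = (R[:, :, None] * F[:, None, :]).mean(axis=0)   # N x 2 derivative of g in c
    ones = np.ones(N)
    W = np.eye(N)                                      # first-step weighting matrix
    for step in range(n_steps):
        c = np.linalg.solve(D.T @ W @ D, D.T @ W @ ones)
        u = R * (F @ c)[:, None] - 1.0                 # T x N matrix of model errors
        g = u.mean(axis=0)                             # sample orthogonality conditions
        J = T * g @ W @ g                              # scaled minimized quadratic form
        if step < n_steps - 1:
            W = np.linalg.inv(u.T @ u / T)             # inverse of Cov(g), no serial correlation
    return c, J, N - 2

rng = np.random.default_rng(0)
T, N = 600, 5
Rm = 1.01 + 0.05 * rng.standard_normal(T)              # simulated gross market return
betas = np.linspace(0.5, 1.5, N)
R = 1.003 + np.outer(Rm - 1.003, betas) + 0.03 * rng.standard_normal((T, N))
c_hat, J, df = gmm_linear_sdf(R, Rm)
print(c_hat, J, df)
```

The statistic would be compared with a chi-square distribution with `df` degrees of freedom; with exactly two assets the system is exactly identified and the statistic is zero up to rounding.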
Tests of the unconditional CAPM may also be conducted using the linear, return-beta formulation of equation (2.1) and the GMM. Let $r_t = R_t - R_{0t}\mathbf{1}$ be the vector of excess returns, where $R_{0t}$ is the gross return on some reference asset and $\mathbf{1}$ is an $N$-vector of ones; also let $u_t = r_t - \beta r_{mt}$, where $\beta$ is the $N$-vector of the betas of the excess returns, relative to the market, and $r_{mt} = R_{mt} - R_{0t}$ is the excess return on the market portfolio. The model implies that $E(u_t) = E(u_t r_{mt}) = 0$. Let the instruments be $Z_t = (1, r_{mt})'$. The sample orthogonality condition is then
$$g_T(\beta) = T^{-1}\sum_t (r_t - \beta r_{mt}) \otimes Z_t.$$
The number of orthogonality conditions is $2N$ and the number of parameters is $N$, so the model is overidentified and may be tested using the $J_T$-statistic.
An alternative approach to testing the model using the return-beta formulation is to estimate the model under the hypothesis that expected returns depart from the predictions of the CAPM by a vector of parameters $\alpha$, which are called Jensen's alphas. Redefining $u_t = r_t - \alpha - \beta r_{mt}$, the model has $2N$ parameters and $2N$ orthogonality conditions, so it is exactly identified. It is easy to show that the GMM estimators of $\alpha$ and $\beta$ are the same as the OLS estimators, and equation (4.4) delivers White's (1980) heteroskedasticity-consistent standard errors. The CAPM may be tested using a Wald test or the D-statistic, as described above. Tests of the unconditional CAPM using the linear return-beta formulation are conducted with the GMM by MacKinlay and Richardson (1991), who reject the model for monthly U.S. data.
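The exactly identified case can be sketched as follows: the alphas and betas are estimated by OLS, the covariance matrix of the estimates is built from the serially uncorrelated estimator of (4.4), and a Wald statistic tests the hypothesis that all alphas are zero. This is an illustrative sketch on simulated excess returns; the function name and every numerical input are assumptions.

```python
import numpy as np

def alpha_wald_test(r, rm):
    """OLS alphas and betas, their GMM covariance matrix, and a Wald test of alpha = 0."""
    T, N = r.shape
    Z = np.column_stack([np.ones(T), rm])               # instruments (1, r_mt)
    coef = np.linalg.lstsq(Z, r, rcond=None)[0]          # row 0: alphas, row 1: betas
    u = r - Z @ coef                                     # T x N residuals
    h = np.einsum("ti,tj->tij", u, Z).reshape(T, -1)     # rows are u_t kron Z_t
    S = h.T @ h / T                                      # Cov(g) as in equation (4.4)
    A = np.kron(np.eye(N), np.linalg.inv(Z.T @ Z / T))   # inverse derivative matrix, up to sign
    cov_theta = A @ S @ A.T / T                          # ordered (alpha_1, beta_1, alpha_2, ...)
    idx = np.arange(0, 2 * N, 2)                         # positions of the alphas
    alpha = coef[0]
    wald = alpha @ np.linalg.solve(cov_theta[np.ix_(idx, idx)], alpha)
    return alpha, coef[1], cov_theta, wald

rng = np.random.default_rng(0)
T, N = 400, 3
rm = 0.006 + 0.045 * rng.standard_normal(T)              # simulated market excess return
r = np.outer(rm, [0.7, 1.0, 1.3]) + 0.03 * rng.standard_normal((T, N))
alpha_hat, beta_hat, cov_theta, wald = alpha_wald_test(r, rm)
print(alpha_hat, wald)
```

Under the null the Wald statistic would be compared with a chi-square distribution with $N$ degrees of freedom. For each asset, the corresponding 2 x 2 diagonal block of `cov_theta` coincides with White's heteroskedasticity-consistent covariance matrix for that OLS regression.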
4.3.2. Conditional CAPMs
Empirical studies that rejected the unconditional CAPM, as well as mounting evidence of predictable variation in the distribution of security rates of return, led to empirical work on conditional versions of the CAPM starting in the early 1980s. In a conditional asset pricing model it is assumed that the expectation terms in the model are conditional expectations, given a public information set that is represented by a vector of predetermined instrumental variables $Z_t$. The multiple-beta models of Merton (1973) and Cox, Ingersoll, and Ross (1985) are intended to accommodate conditional expectations. Merton (1973, 1980) and Cox-Ingersoll-Ross also showed how a conditional version of the CAPM may be derived as a special case of their intertemporal models. Hansen and Richard (1987) describe theoretical relations between conditional and unconditional versions of mean-variance efficiency.
The earliest empirical formulations of conditional asset pricing models were the latent variable models developed by Hansen and Hodrick (1983) and Gibbons and Ferson (1985) and later refined by Campbell (1987) and Ferson, Foerster, and Keim (1993). These models allow time-varying expected returns, but maintain the assumption that the conditional betas are fixed parameters. Consider the linear, return-beta representation of the CAPM under these assumptions, writing $E(r_t \mid Z_{t-1}) = \beta E(r_{mt} \mid Z_{t-1})$. The returns are measured in excess of a risk-free asset. Let $r_{1t}$ be some reference asset with nonzero $\beta_1$, so that
$$E(r_{1t} \mid Z_{t-1}) = \beta_1 E(r_{mt} \mid Z_{t-1}).$$
Solving this expression for $E(r_{mt} \mid Z_{t-1})$ and substituting, we have
$$E(r_t \mid Z_{t-1}) = C\,E(r_{1t} \mid Z_{t-1}),$$
where $C = (\beta ./ \beta_1)$ and $./$ denotes element-by-element division. With this substitution, the expected market risk premium is the latent variable in the model, and $C$ is the $N$-vector of the model parameters. When we form the error term $u_t = r_t - C r_{1t}$, the model implies $E(u_t \mid Z_{t-1}) = 0$ and we can estimate and test the model by using the GMM.
Gibbons and Ferson (1985) argued that the latent variable model is attractive in view of the difficulties in measuring the true market portfolio, but Wheatley (1989) emphasized that it remains necessary to assume that ratios of the betas, measured with respect to the unobserved market portfolio, are constant parameters. Campbell (1987) and Ferson and Foerster (1995) show that a single-beta latent variable model is rejected in U.S. data. This finding rejects the hypothesis that there is a (conditional) minimum-variance portfolio such that the ratios of conditional betas on this portfolio are fixed parameters. Therefore, the empirical evidence suggests that conditional asset pricing models should be consistent with either (1) a time-varying beta or (2) more than one beta for each asset.$^{10}$
Conditional, multiple-beta models with constant betas are examined empirically by Ferson and Harvey (1991), Evans (1994), and Ferson and Korajczyk (1995). They reject such models with the usual statistical tests but find that they still capture a large fraction of the predictability of stock and bond returns over time. When allowing for time-varying betas, these studies find that the time-variation in betas contributes a relatively small amount to the time-variation in expected asset returns. Intuition for this finding can be obtained by considering the following approximation. Suppose that time-variation in expected excess returns is $E(r \mid Z) = \lambda\beta$, where $\lambda$ is a vector of time-varying expected risk premiums for the factors and $\beta$ is a matrix of time-varying betas.
Using a Taylor series, we can approximate
$$\mathrm{Var}[E(r \mid Z)] \approx E(\beta)'\mathrm{Var}[\lambda]E(\beta) + E(\lambda)'\mathrm{Var}[\beta]E(\lambda).$$
The first term in the decomposition reflects the contribution of the time-varying risk premiums; the second reflects the contribution of time-varying betas. Since the average beta $E(\beta)$ is on the order of 1.0 in monthly data, while the average risk premium $E(\lambda)$ is typically less than 0.01, the first term dominates the second term. This means that time-variation in conditional betas is less important than time-variation in expected risk premiums, from the perspective of modeling predictable variation in expected asset returns.
10 A model with more than one fixed beta, and with time-varying risk premiums, is generally consistent with a single, time-varying beta for each asset. For example, assume that there are two factors with constant betas and time-varying risk premiums, where a time-varying combination of the two factors is a minimum-variance portfolio.
Although time-variation in conditional betas matters less than time-variation in expected risk premiums for modeling predictable time-variation in asset returns, this does not imply that beta variation is empirically unimportant. From the perspective of modeling the cross-sectional variation in expected asset returns, beta variation over time may be very important. To see this, consider the unconditional expected excess return vector, obtained from the model as
$$E\{E(r \mid Z)\} = E\{\lambda\beta\} = E(\lambda)E(\beta) + \mathrm{Cov}(\lambda, \beta).$$
Viewed as a cross-sectional relation, the term $\mathrm{Cov}(\lambda, \beta)$ may vary significantly in a cross section of assets. Therefore, the implications of a conditional version of the CAPM for the cross section of unconditional expected returns may depend importantly on common time-variation in betas and expected market risk premiums. The empirical tests of Jagannathan and Wang (1996) suggest that this is the case.
Harvey (1989) replaced the constant beta assumption with the assumption that the ratio of the expected market premium to the conditional market variance is a fixed parameter, as in
$$E(r_{mt} \mid Z_{t-1})/\mathrm{Var}(r_{mt} \mid Z_{t-1}) = \gamma.$$
The conditional expected returns may then be written according to the conditional CAPM as
$$E(r_t \mid Z_{t-1}) = \gamma\,\mathrm{Cov}(r_t, r_{mt} \mid Z_{t-1}).$$
Harvey's version of the conditional CAPM is motivated by Merton's (1980) model in which the ratio $\gamma$, called the market price of risk, is equal to the relative risk aversion of a representative investor in equilibrium. Harvey also assumes that the conditional expected risk premium on the market (and the conditional market variance, given fixed $\gamma$) is a linear function of the instruments, as in
$$E(r_{mt} \mid Z_{t-1}) = \delta_m' Z_{t-1},$$
where $\delta_m$ is a coefficient vector. Define the error terms $v_t = r_{mt} - \delta_m' Z_{t-1}$ and $w_t = r_t(1 - v_t\gamma)$. The model implies that the stacked error term $u_t = (v_t, w_t)$ satisfies $E(u_t \mid Z_{t-1}) = 0$, so it is straightforward to estimate and then test the model using the GMM. Harvey (1989) rejects this version of the conditional CAPM for monthly data in the U.S. In Harvey (1991) the same formulation is rejected when applied using a world market portfolio and monthly data on the stock markets of 21 developed countries.
The conditional CAPM may be tested using the stochastic discount factor representation given by equation (3.4): $m_{t+1} = c_{0t} - c_{1t}R_{m,t+1}$. In this case the coefficients $c_{0t}$ and $c_{1t}$ are measurable functions of the information set $Z_t$. To implement the model empirically it is necessary to specify functional forms for $c_{0t}$ and $c_{1t}$. From the expression (3.4) it can be seen that these coefficients are nonlinear functions of the conditional expected market return and its conditional variance. As yet there is no theoretical guidance for specifying the functional forms. Cochrane (1996) suggests approximating the coefficients using linear functions, and this approach is followed by Carhart et al. (1995), who reject the conditional CAPM for monthly U.S. data.
Jagannathan and Wang (1993) show that the conditional CAPM implies an unconditional two-factor model. They show that
$$m_{t+1} = a_0 + a_1 E(r_{m,t+1} \mid I_t) + R_{m,t+1}$$
(where $I_t$ denotes the information set of investors and $a_0$ and $a_1$ are fixed parameters) is a valid stochastic discount factor in the sense that $E(R_{i,t+1}m_{t+1}) = 1$ for this choice of $m_{t+1}$. Using a set of observable instruments $Z_t$, and assuming that $E(r_{m,t+1} \mid Z_t)$ is a linear function of $Z_t$, they find that their version of the model explains the cross section of unconditional expected returns better than does an unconditional version of the CAPM.
Bansal and Viswanathan (1993) develop conditional versions of the CAPM and multiple-factor models in which the stochastic discount factor $m_{t+1}$ is a nonlinear function of the market or factor returns. Using nonparametric methods, they find evidence to support the nonlinear versions of the models. Bansal, Hsieh, and Viswanathan (1993) compare the performance of nonlinear models with linear models, using data on international stocks, bonds, and currency returns, and they find that the nonlinear models perform better. Additional empirical tests of the conditional CAPM and multiple-beta models, using stochastic discount factor representations, are beginning to appear in the literature. We expect that future studies will further refine the relations among the various empirical specifications.
5. Model diagnostics
We have discussed several examples of stochastic discount factors corresponding to particular theoretical asset pricing models, and we have shown how to test whether these models assign the right expected returns to financial assets. The stochastic discount factors corresponding to these models are particular parametric functions of the data observed by the econometrician. While empirical studies based on these parametric approaches have led to interesting insights, the parametric approach makes strong assumptions about the economic environment. In this section we discuss some alternative econometric approaches to the problem of asset pricing models.
5.1. Moment
Hansen and Jagannathan (1991) derive restrictions from asset pricing models while assuming as little structure as possible. In particular, they assume that the financial markets obey the law of one price and that there are no arbitrage opportunities. These assumptions are sufficient to imply that there exists a stochastic discount factor m t + l (which is almost surely positive, if there is no arbitrage) such that equation (3.1) is satisfied. Note that if the stochastic discount factor is a degenerate random variable (i.e., a constant), then equation (3.1) implies that all assets must earn the same expected return. If assets earn different expected returns, then the stochastic discount factor cannot be a constant. In other words, crosssectional differences in expected asset returns carry implications for the variance of any valid stochastic discount factor, which satisfies equation (3.1). Hansen and Jagannathan make use of this observation to derive a lower bound on the volatility of stochastic discount factors. Shiller (1979, 1981), Singleton (1980), and Leroy and Porter (1981) derive a related volatility bound in specific models, and their empirical work suggests that the stochastic discount factors implied by these simple models are not volatile enough to explain expected returns across assets. Hansen and Jagannathan (1991) show how to use the volatility bound as a general diagnostic device. In what follows we derive the Hansen and Jagannathan (1991) bound and discuss their empirical application. To simplify the exposition, we focus on an unconditional version of the bound using only the unconditional expectations. We posit a hypothetical, unconditional, riskfree asset with return R f = E(mt+0 1 . We take the value of Rf, or equivalently E(mt+ 1), as a parameter to be varied as we trace out the bound. The law of one price guarantees the existence of some stochastic discount factor which satisfies equation (3.1). 
Consider the following projection of any such $m_{t+1}$ on the vector of gross asset returns, $R_{t+1}$:

$$m_{t+1} = R_{t+1}'\beta + \varepsilon_{t+1}, \tag{5.1}$$

where $E(\varepsilon_{t+1}R_{t+1}) = 0$ and where $\beta$ is the projection coefficient vector. Multiply both sides of equation (5.1) by $R_{t+1}$ and take the expected value of both sides of the equation, using $E[R_{t+1}\varepsilon_{t+1}] = 0$, to arrive at an expression which may be solved for $\beta$. Substituting this expression back into (5.1) gives the "fitted values" of the projection as

$$m^*_{t+1} = R_{t+1}'\beta = R_{t+1}'\,E(R_{t+1}R_{t+1}')^{-1}\,\mathbf{1}. \tag{5.2}$$
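As a hedged numerical illustration of (5.2), the Python sketch below replaces the population second-moment matrix $E(R_{t+1}R_{t+1}')$ with its sample counterpart and checks that the resulting payoff prices the $N$ returns in sample. The function name `min_variance_sdf` and the simulated return series are assumptions made for illustration only; they are not taken from Hansen and Jagannathan (1991).

```python
import numpy as np

def min_variance_sdf(R):
    """Sample analogue of (5.2): m*_t = R_t' E(R R')^{-1} 1.

    R is a (T, N) array of gross returns (hypothetical input).
    Returns the (T,) series m*_t and the (N,) weight vector beta.
    """
    T, N = R.shape
    second_moment = R.T @ R / T                        # sample E(R R')
    beta = np.linalg.solve(second_moment, np.ones(N))  # E(R R')^{-1} 1
    return R @ beta, beta

# Simulated gross returns, purely illustrative.
rng = np.random.default_rng(0)
R = 1.0 + rng.normal(loc=[0.010, 0.005, 0.008],
                     scale=[0.05, 0.02, 0.04], size=(500, 3))
m_star, beta = min_variance_sdf(R)

# In-sample version of the pricing condition: mean of m*_t R_t is a vector of ones.
pricing = (m_star[:, None] * R).mean(axis=0)
```

Because `beta` solves the sample normal equations exactly, `pricing` equals a vector of ones up to floating-point error, which mirrors the statement that $m^*_{t+1}$ is itself a valid stochastic discount factor.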
By inspection, the $m^*_{t+1}$ given by equation (5.2) is a valid stochastic discount factor, in the sense that equation (3.1) is satisfied when $m^*_{t+1}$ is used in place of $m_{t+1}$. We have therefore constructed a stochastic discount factor $m^*_{t+1}$ that is also a payoff on an investment position in the $N$ given assets, where the vector $\mathbf{1}'E(R_{t+1}R_{t+1}')^{-1}$ provides the weights. This payoff is the unique linear least squares approximation of every admissible stochastic discount factor in the space of available asset payoffs. Substituting $m^*_{t+1}$ for $R_{t+1}'\beta$ in equation (5.1) shows that we may write any stochastic discount factor, $m_{t+1}$, as
$$m_{t+1} = m^*_{t+1} + \varepsilon_{t+1},$$

where $E(\varepsilon_{t+1}m^*_{t+1}) = 0$. It follows that $\mathrm{Var}(m_{t+1}) \ge \mathrm{Var}(m^*_{t+1})$. This expression is the basis of the Hansen-Jagannathan bound$^{11}$ on the variance of $m_{t+1}$. Since $m^*_{t+1}$ depends only on the second moment matrix of the $N$ returns, the lower bound depends only on the assets available to the econometrician and not on the particular asset pricing model that is being studied. To obtain an explicit expression for the variance bound in terms of the underlying asset-return moments, substitute from the previous expressions to obtain

$$\begin{aligned}
\mathrm{Var}(m_{t+1}) \ge \mathrm{Var}(m^*_{t+1}) &= \beta'\,\mathrm{Var}(R_{t+1})\,\beta \\
&= [\mathrm{Cov}(m, R')\,\mathrm{Var}(R)^{-1}]\,\mathrm{Var}(R)\,[\mathrm{Var}(R)^{-1}\,\mathrm{Cov}(R, m)] \\
&= [\mathbf{1}' - E(m)E(R')]\,\mathrm{Var}(R)^{-1}\,[\mathbf{1} - E(m)E(R)],
\end{aligned} \tag{5.3}$$
where the time subscripts are suppressed to conserve notation and the last line follows from $E(mR) = \mathbf{1} = E(m)E(R) + \mathrm{Cov}(m, R)$. As we vary the hypothetical values of $E(m) = R_f^{-1}$, equation (5.3) traces out a parabola in $E(m)$, $\sigma(m)$ space, where $\sigma(m)$ is the standard deviation of $m_{t+1}$. If we place $\sigma(m)$ on the y axis and $E(m)$ on the x axis, the Hansen-Jagannathan bounds resemble a cup, and the implication is that any valid stochastic discount factor $m_{t+1}$ must have a mean and standard deviation that place it within the cup.

The lower bound on the volatility of a stochastic discount factor, as given by equation (5.3), is closely related to the standard mean-variance analysis that has long been used in the financial economics literature. To see this, recall that if $r = R - R_f$ is the vector of excess returns, then (3.1) implies that $0 = E(mr) = E(m)E(r) + \rho\,\sigma(m)\sigma(r)$. Since $-1 \le \rho \le 1$, we have that

$$\sigma(m)/E(m) \ge E(r_i)/\sigma(r_i)$$
for all $i$. The right side of this expression is the Sharpe ratio for asset $i$. The Sharpe ratio is defined as the expected excess return on an asset, divided by the standard deviation of the excess return (see Sharpe 1994 for a recent discussion of this ratio). Consider plotting every portfolio that can be formed from the $N$ assets in the Standard Deviation (x axis) - Mean (y axis) plane. The set of such portfolios with the smallest possible standard deviation for a given mean return is the minimum-variance boundary. Consider the tangent to the minimum-variance boundary from the point $1/E(m)$ on the y axis. The tangent point is a portfolio of the asset returns, and the slope of this tangent line is the maximum Sharpe ratio that can be attained with a given set of $N$ assets and a given risk-free rate, $R_f = 1/E(m)$. The slope of this line is also equal to $R_f$ multiplied by the Hansen-Jagannathan lower bound on $\sigma(m)$ for a given $E(m) = R_f^{-1}$. That is, we have that

11 Related bounds were derived by Kandel and Stambaugh (1987), MacKinlay (1987, 1995), and Shanken (1987).
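$$\begin{aligned}
E(\alpha + R_t'\beta) &= v \\
E(R_t\alpha + R_tR_t'\beta) &= \mathbf{1}_N \\
E(y_t) &= v \\
E[(\alpha + R_t'\beta)^2] - E[y_t^2] &\le 0
\end{aligned} \tag{5.4}$$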
The first equation says that the expected value of $m_t = \alpha + R_t'\beta$ is $v$. The second equation says that the regression function for $m_t$ is a valid stochastic discount factor. The third equation says that $v$ is the expected value of the particular candidate discount factor that we wish to test. The fourth equation states that the Hansen-Jagannathan bound is satisfied by the particular candidate stochastic discount factor. We can estimate the parameters $v$, $\alpha$, and the $N$-vector $\beta$, using the $N + 3$ equations in (5.4), by treating the last inequality as an equality and using the GMM. Treating the last equation as an equality corresponds to the null hypothesis that the mean and variance of $y_t$ place it on the Hansen-Jagannathan boundary. Under the null hypothesis that the last equation of (5.4) holds as an equality, the minimized value of the GMM criterion function $J_T$, multiplied by $T$, has a chi-square distribution with one degree of freedom. Cochrane and Hansen (1992) suggest testing the inequality relation using the one-sided test.
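A minimal sketch, assuming simulated data, of how the bound in (5.3) can be traced out over hypothetical values of $E(m)$ and compared with the sample mean and standard deviation of a candidate $y_t$. It only evaluates point estimates; it is not the GMM test based on (5.4), and the names `hj_bound`, `grid`, and `inside_cup` are hypothetical.

```python
import numpy as np

def hj_bound(R, mean_m):
    """Lower bound on sigma(m) implied by (5.3) for a hypothesised E(m) = mean_m.

    R is a (T, N) array of gross returns; sample moments replace population ones.
    """
    mu = R.mean(axis=0)
    cov = np.cov(R, rowvar=False)               # sample Var(R), ddof=1
    gap = np.ones(R.shape[1]) - mean_m * mu     # 1 - E(m) E(R)
    return float(np.sqrt(gap @ np.linalg.solve(cov, gap)))

rng = np.random.default_rng(1)
R = 1.0 + rng.normal(loc=[0.010, 0.004], scale=[0.05, 0.02], size=(600, 2))

# Trace out the "cup": the bound as a function of the hypothesised mean of m.
grid = np.linspace(0.97, 1.01, 41)
frontier = np.array([hj_bound(R, v) for v in grid])

# A hypothetical candidate discount factor y_t (illustrative noise around 0.99).
y = 0.99 + 0.01 * rng.standard_normal(600)
inside_cup = y.std(ddof=1) >= hj_bound(R, y.mean())
```

Plotting `frontier` against `grid` gives the cup-shaped region described above. The flag `inside_cup` only reports whether the point estimate of the candidate lies in the admissible region and carries no statement about sampling error.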
5.3. Specification error bounds
The methods we have examined so far are developed, for the most part, under the null hypothesis that the asset pricing model under consideration by the econometrician assigns the right prices (or expected returns) to all assets. An alternative is to assume that the model is wrong and examine how wrong the model is. In this section we will follow Hansen and Jagannathan (1994) and discuss one possible way to examine what is missing in a model and assign a scalar measure of the model's misspecification.$^{12}$

12 GMM-based model specification tests are examined in a general setting by Newey (1985). Other related work includes that by Boudoukh, Richardson, and Smith (1993), who compute approximate bounds on the probabilities of the test statistics in the presence of inequality restrictions; Chen and Knez (1992), who develop nonparametric measures of market integration by using related methods; and Hansen, Heaton, and Luttmer (1995), who show how to compute specification error and volatility bounds when there are market frictions such as short-sale constraints and proportional transaction costs.

Let $y_t$ denote the candidate stochastic discount factor corresponding to a given asset pricing model, and let $m^*_t$ denote the unique stochastic discount factor that we constructed earlier, as a combination of asset payoffs. We assume that $E[y_tR_t]$ does not equal $\mathbf{1}_N$, the $N$-vector of ones; i.e., the model does not correctly price all of the gross returns. We can project $y_t$ on the $N$ asset returns to get $y_t = R_t'\alpha + u_t$, and project $m^*_t$ on the vector of asset returns to get $m^*_t = R_t'\beta + \varepsilon_t$. Since the candidate $y_t$ does not correctly price all of the assets, $\alpha$ and $\beta$ will not be the same. Define $p_t = (\beta - \alpha)'R_t$ as the modifying payoff to the candidate stochastic discount factor $y_t$. Clearly, $(y_t + p_t)$ is a valid stochastic discount factor, satisfying equation (3.1). Hansen and Jagannathan (1994) derive specification tests based on the size of the modifying payoff, which measures how far the model's candidate for a stochastic discount factor $y_t$ is from a valid stochastic discount factor. Hansen and Jagannathan (1994) show that a natural measure of this distance is $\delta = [E(p_t^2)]^{1/2}$, which provides an economic interpretation for the model's misspecification. Payoffs that are orthogonal to $p_t$ are correctly priced by the candidate $y_t$, and $[E(p_t^2)]^{1/2}$ is the maximum amount of mispricing by using $y_t$ for any payoff normalized to have a unit second moment. The modifying payoff $p_t$ is also the minimal modification that is sufficient to make $y_t$ a valid stochastic discount factor. Hansen and Jagannathan (1994) consider an estimator of the distance measure $\delta$ given as the solution to the following maximization problem:
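$$\delta_T = \left[\max_{\alpha}\; T^{-1}\sum_{t=1}^{T}\left\{y_t^2 - (y_t + \alpha'R_t)^2 + 2\alpha'\mathbf{1}_N\right\}\right]^{1/2}. \tag{5.5}$$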
If $\alpha_T$ is the solution to (5.5), then the estimate of the modifying payoff is $\alpha_T'R_t$. It can be readily verified that the first-order condition to (5.5) implies that $y_t + \alpha_T'R_t$ satisfies the sample counterpart to the asset pricing equation (3.1). To obtain an estimate of the sampling error associated with the estimated value $\delta_T$, consider

$$u_t = y_t^2 - (y_t + \alpha_T'R_t)^2 + 2\alpha_T'\mathbf{1}_N.$$

The sample mean of $u_t$ is $\delta_T^2$. We can obtain a consistent estimator of the variance of $\delta_T^2$ by the frequency zero spectral density estimators described in Newey and West (1987a) or Andrews (1991) and applied to the time series $\{u_t - \delta_T^2\}_{t=1,\ldots,T}$. Let $s_T$ denote the estimated standard deviation of $\delta_T^2$ obtained this way. Then, under standard assumptions, we have that $T^{1/2}(\delta_T^2 - \delta^2)/s_T$ converges to a normal (0,1) random variable. Hence, using the delta method, we obtain

$$T^{1/2}\,\frac{2\delta_T}{s_T}\,(\delta_T - \delta) \sim N(0, 1). \tag{5.6}$$
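The estimator can be sketched numerically as follows, assuming that (5.5) maximizes the sample mean of $y_t^2 - (y_t + \alpha'R_t)^2 + 2\alpha'\mathbf{1}_N$ over $\alpha$, so that the first-order condition gives $\alpha_T$ in closed form. The function name `hj_distance` and the simulated inputs are hypothetical. The standard error uses the plain sample standard deviation of $u_t$ in place of the Newey and West (1987a) or Andrews (1991) estimators, an assumption that is reasonable only if $u_t$ is serially uncorrelated.

```python
import numpy as np

def hj_distance(y, R):
    """Closed-form solution of the sample problem assumed for (5.5).

    y: (T,) candidate discount factor; R: (T, N) gross returns.
    Returns delta_T, alpha_T, and the series u_t.
    """
    T, N = R.shape
    S = R.T @ R / T                          # sample E(R R')
    g = np.ones(N) - R.T @ y / T             # sample pricing errors of y
    alpha = np.linalg.solve(S, g)            # first-order condition
    u = y**2 - (y + R @ alpha)**2 + 2.0 * alpha.sum()
    delta = float(np.sqrt(max(u.mean(), 0.0)))
    return delta, alpha, u

rng = np.random.default_rng(2)
R = 1.0 + rng.normal(loc=[0.010, 0.004, 0.007],
                     scale=[0.05, 0.02, 0.04], size=(800, 3))
y = 0.95 + 0.02 * rng.standard_normal(800)   # hypothetical, deliberately mispriced

delta_T, alpha_T, u = hj_distance(y, R)

# Delta-method standard error of delta_T, assuming u_t is serially uncorrelated.
s_T = u.std(ddof=1)
se_delta = s_T / (2.0 * delta_T * np.sqrt(len(u)))
```

In this sketch the sample mean of `u` coincides with the quadratic form `g @ solve(S, g)` in the sample pricing errors, and `y + R @ alpha_T` prices the returns exactly in sample, which matches the first-order condition described in the text.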
6. Conclusions
In this article we have reviewed econometric tests of a wide range of asset pricing models, where the models are based on the law of one price, the no-arbitrage principle, and models of market equilibrium with investor optimization. Our review included the earliest of the equilibrium asset pricing models, the CAPM, and also considered dynamic multiple-beta and arbitrage pricing models. We provided some results for the asymptotic distribution of traditional two-pass estimators for asset pricing models stated in the linear, return-beta formulation. We emphasized the econometric evaluation of asset pricing models by using Hansen's (1982) generalized method of moments. Our examples illustrate the simplicity and flexibility of the GMM approach. We showed that most asset pricing models can be represented in a stochastic discount factor form, which makes the application of the GMM straightforward. Finally, we discussed model diagnostics that provide additional insight into the causes of the statistical rejections in GMM tests and which help assess the specification errors in these models.
Appendix
PROOF OF THEOREM 2.1. The proof comes from Jagannathan and Wang (1996). We first introduce some additional notation. Let $I_N$ be the $N$-dimensional identity matrix and $\mathbf{1}_T$ be a $T$-dimensional vector of ones. It follows from equation (2.17) that
i ! R  ~ = r  ~ ( I u 1~) ~t,
k= 1,...,K2
where
~k =
)'.
wherefk is the vectordemeaned factor realizations, conformable to the vector ek. In view of the assumption that the conditional covariance of ei~ and Ejtt, given the time series of the factors (denoted by J~), is a fixed constant aqkl, we have that E[(bk  flt)(R1 /~l)[Jk]
= T  I [ I N ((fk t fk) 1 fk t )]E[~kellfk](IN t 1T)
(x'x)lx'[Var(u) + Var(h?2)lx(x'x) 1
Let r~qkl denote the limiting value of Cov(v~f~eik, v/f f / g t ) , as T ~ o0. Let the matrix with rcqtt being its ij th element be denoted by IItt. We assume that the sample covariance matrix of the factors exists and converges in probability to a constant positive definite matrix f~, with typical element ~kt. Since v ~ (bit  flit) converges in distribution to the random variable f~, 1 v ~ f/~eit, , we have
Var(hY2) and
= 2...al k 1
K2 ])2k~2l kk
W = (x'x)lx'Var(hy2)x(x'x) '
Z
1,k=l,...,k2
(x'x)lx'{y2kY2'(n2H~tOTtl)}x(x'x)I
where H~I is a matrix whose i,jth element is the limiting value of Cov(v@f~ Elk,
v~f/~jl) as T , ~ .
Q.E.D.
References
Abel, A. (1990). Asset prices under habit formation and catching up with the Joneses. Amer. Econom. Rev. Papers Proc. 80, 38-42.
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817-858.
Andrews, D. W. K. and J. C. Monahan (1992). An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator. Econometrica 60, 953-966.
Arrow, K. J. (1970). Essays in the Theory of Risk-Bearing. Amsterdam: North-Holland.
Bansal, R. and S. Viswanathan (1993). No arbitrage and arbitrage pricing: A new approach. J. Finance 48, 1231-1262.
Bansal, R., D. A. Hsieh and S. Viswanathan (1993). A new approach to international arbitrage pricing. J. Finance 48, 1719-1747.
Becker, G. S. and K. M. Murphy (1988). A theory of rational addiction. J. Politic. Econom. 96, 675-700.
Beja, A. (1971). The structure of the cost of capital under uncertainty. Rev. Econom. Stud. 38(8), 359-368.
Berk, J. B. (1995). A critique of size-related anomalies. Rev. Financ. Stud. 8, 275-286.
Black, F. (1972). Capital market equilibrium with restricted borrowing. J. Business 45, 444-455.
Black, F., M. C. Jensen and M. Scholes (1972). The capital asset pricing model: Some empirical tests. In: Studies in the Theory of Capital Markets, M. C. Jensen, ed., New York: Praeger, 79-121.
Boudoukh, J., M. Richardson and T. Smith (1993). Is the ex ante risk premium always positive? A new approach to testing conditional asset pricing models. J. Financ. Econom. 34, 387-408.
Breeden, D. T. (1979). An intertemporal asset pricing model with stochastic consumption and investment opportunities. J. Financ. Econom. 7, 265-296.
Brown, D. P. and M. R. Gibbons (1985). A simple econometric approach for utility-based asset pricing models. J. Finance 40, 359-381.
Burnside, C. (1994). Hansen-Jagannathan bounds as classical tests of asset-pricing models. J. Business Econom. Statist. 12, 57-79.
Campbell, J. Y. (1987). Stock returns and the term structure. J. Financ. Econom. 18, 373-399.
Campbell, J. Y. and J. Cochrane (1995). By force of habit. Manuscript, Harvard Institute of Economic Research, Harvard University.
Carhart, M., K. Welch, R. Stevens and R. Krail (1995). Testing the conditional CAPM. Working Paper, University of Chicago.
Cecchetti, S. G., P. Lam and N. C. Mark (1994). Testing volatility restrictions on intertemporal marginal rates of substitution implied by Euler equations and asset returns. J. Finance 49, 123-152.
Chen, N. (1983). Some empirical tests of the theory of arbitrage pricing. J. Finance 38, 1393-1414.
Chen, Z. and P. Knez (1992). A measurement framework of arbitrage and market integration. Working Paper, University of Wisconsin.
Cochrane, J. H. (1996). A cross-sectional test of a production based asset pricing model. Working Paper, University of Chicago.
Cochrane, J. H. and L. P. Hansen (1992). Asset pricing explorations for macroeconomics. In: NBER Macroeconomics Annual 1992, O. J. Blanchard and S. Fischer, eds., Cambridge, Mass.: MIT Press.
Connor, G. (1984). A unified beta pricing theory. J. Econom. Theory 34, 13-31.
Connor, G. and R. A. Korajczyk (1986). Performance measurement with the arbitrage pricing theory: A new framework for analysis. J. Financ. Econom. 15, 373-394.
Constantinides, G. M. (1982). Intertemporal asset pricing with heterogeneous consumers and without demand aggregation. J. Business 55, 253-267.
Constantinides, G. M. (1990). Habit formation: A resolution of the equity premium puzzle. J. Politic. Econom. 98, 519-543.
Constantinides, G. M. and D. Duffie (1994). Asset pricing with heterogeneous consumers. Working Paper, University of Chicago and Stanford University.
Cox, J. C., J. E. Ingersoll, Jr. and S. A. Ross (1985). A theory of the term structure of interest rates. Econometrica 53, 385-407.
Debreu, G. (1959). Theory of Value: An Axiomatic Analysis of Economic Equilibrium. New York: Wiley.
Detemple, J. B. and F. Zapatero (1991). Asset prices in an exchange economy with habit formation. Econometrica 59, 1633-1657.
Dunn, K. B. and K. J. Singleton (1986). Modeling the term structure of interest rates under nonseparable utility and durability of goods. J. Financ. Econom. 17, 27-55.
Dybvig, P. H. and J. E. Ingersoll, Jr. (1982). Mean-variance theory in complete markets. J. Business 55, 233-251.
Eichenbaum, M. S., L. P. Hansen and K. J. Singleton (1988). A time series analysis of representative agent models of consumption and leisure choice under uncertainty. Quart. J. Econom. 103, 51-78.
Epstein, L. G. and S. E. Zin (1989). Substitution, risk aversion and the temporal behavior of consumption and asset returns: A theoretical framework. Econometrica 57, 937-969.
Epstein, L. G. and S. E. Zin (1991). Substitution, risk aversion and the temporal behavior of consumption and asset returns. J. Politic. Econom. 99, 263-286.
Evans, M. D. D. (1994). Expected returns, time-varying risk, and risk premia. J. Finance 49, 655-679.
Fama, E. F. and K. R. French (1992). The cross-section of expected stock returns. J. Finance 47, 427-465.
Fama, E. F. and J. D. MacBeth (1973). Risk, return, and equilibrium: Empirical tests. J. Politic. Econom. 81, 607-636.
Ferson, W. E. (1983). Expectations of real interest rates and aggregate consumption: Empirical tests. J. Financ. Quant. Anal. 18, 477-497.
Ferson, W. E. and G. M. Constantinides (1991). Habit persistence and durability in aggregate consumption: Empirical tests. J. Financ. Econom. 29, 199-240.
Ferson, W. E. and S. R. Foerster (1994). Finite sample properties of the generalized method of moments tests of conditional asset pricing models. J. Financ. Econom. 36, 29-55.
Ferson, W. E. and S. R. Foerster (1995). Further results on the small-sample properties of the generalized method of moments: Tests of latent variable models. In: Res. Financ., Vol. 13. Greenwich, Conn.: JAI Press, pp. 91-114.
Ferson, W. E., S. R. Foerster and D. B. Keim (1993). General tests of latent variable models and mean-variance spanning. J. Finance 48, 131-156.
Ferson, W. E. and C. R. Harvey (1991). The variation of economic risk premiums. J. Politic. Econom. 99, 385-415.
Ferson, W. E. and C. R. Harvey (1992). Seasonality and consumption-based asset pricing. J. Finance 47, 511-552.
Ferson, W. E. and R. A. Korajczyk (1995). Do arbitrage pricing models explain the predictability of stock returns? J. Business 68, 309-349.
Ferson, W. E. and J. J. Merrick, Jr. (1987). Nonstationarity and stage-of-the-business-cycle effects in consumption-based asset pricing relations. J. Financ. Econom. 18, 127-146.
Gallant, R. (1987). Nonlinear Statistical Models. New York: Wiley.
Gibbons, M. R. and W. Ferson (1985). Testing asset pricing models with changing expectations and an unobservable market portfolio. J. Financ. Econom. 14, 217-236.
Gorman, W. M. (1953). Community preference fields. Econometrica 21, 63-80.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029-1054.
Hansen, L. P., J. Heaton and E. G. J. Luttmer (1995). Econometric evaluation of asset pricing models. Rev. Financ. Stud. 8, 237-274.
Hansen, L. P. and R. Hodrick (1983). Risk averse speculation in the forward foreign exchange market: An econometric analysis of linear models. In: Exchange Rates and International Macroeconomics, J. A. Frenkel, ed., Chicago: University of Chicago Press.
Hansen, L. P. and R. Jagannathan (1991). Implications of security market data for models of dynamic economies. J. Politic. Econom. 99, 225-262.
Hansen, L. P. and R. Jagannathan (1994). Assessing specification errors in stochastic discount factor models. NBER Technical Working Paper No. 153.
Hansen, L. P. and S. F. Richard (1987). The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica 55, 587-613.
Hansen, L. P. and K. J. Singleton (1982). Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 50, 1269-1286.
Hansen, L. P. and K. J. Singleton (1983). Stochastic consumption, risk aversion, and the temporal behavior of asset returns. J. Politic. Econom. 91, 249-265.
Harrison, M. and D. Kreps (1979). Martingales and arbitrage in multiperiod securities markets. J. Econom. Theory 20, 381-408.
Harvey, C. R. (1989). Time-varying conditional covariances in tests of asset pricing models. J. Financ. Econom. 24, 289-317.
Harvey, C. R. (1991). The world price of covariance risk. J. Finance 46, 111-157.
Heaton, J. (1995). An empirical investigation of asset pricing with temporally dependent preference specifications. Econometrica 63, 681-717.
Ibbotson Associates (1992). Stocks, bonds, bills, and inflation. 1992 Yearbook. Chicago: Ibbotson Associates.
Jagannathan, R. (1985). An investigation of commodity futures prices using the consumption-based intertemporal capital asset pricing model. J. Finance 40, 175-191.
Jagannathan, R. and Z. Wang (1993). The CAPM is alive and well. Federal Reserve Bank of Minneapolis Research Department Staff Report 165.
Jagannathan, R. and Z. Wang (1996). The conditional CAPM and the cross-section of expected returns. J. Finance 51, 3-53.
Kandel, S. (1984). On the exclusion of assets from tests of the mean-variance efficiency of the market portfolio. J. Finance 39, 63-75.
Kandel, S. and R. F. Stambaugh (1987). On correlations and inferences about mean-variance efficiency. J. Financ. Econom. 18, 61-90.
Lehmann, B. N. and D. M. Modest (1987). Mutual fund performance evaluation: A comparison of benchmarks and benchmark comparisons. J. Finance 42, 233-265.
Leroy, S. F. and R. D. Porter (1981). The present value relation: Tests based on implied variance bounds. Econometrica 49, 555-574.
Lewbel, A. (1989). Exact aggregation and a representative consumer. Quart. J. Econom. 104, 621-633.
Lintner, J. (1965). The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. Rev. Econom. Statist. 47, 13-37.
Lucas, R. E., Jr. (1978). Asset prices in an exchange economy. Econometrica 46, 1429-1445.
Luttmer, E. (1993). Asset pricing in economies with frictions. Working Paper, Northwestern University.
McElroy, M. B. and E. Burmeister (1988). Arbitrage pricing theory as a restricted nonlinear multivariate regression model. J. Business Econom. Statist. 6, 29-42.
MacKinlay, A. C. (1987). On multivariate tests of the CAPM. J. Financ. Econom. 18, 341-371.
MacKinlay, A. C. and M. P. Richardson (1991). Using generalized method of moments to test mean-variance efficiency. J. Finance 46, 511-527.
MacKinlay, A. C. (1995). Multifactor models do not explain deviations from the CAPM. J. Financ. Econom. 38, 3-28.
Merton, R. C. (1973). An intertemporal capital asset pricing model. Econometrica 41, 867-887.
Merton, R. C. (1980). On estimating the expected return on the market: An exploratory investigation. J. Financ. Econom. 8, 323-361.
Mossin, J. (1966). Equilibrium in a capital asset market. Econometrica 34, 768-783.
Newey, W. (1985). Generalized method of moments specification testing. J. Econometrics 29, 229-256.
Newey, W. K. and K. D. West (1987a). A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, 703-708.
Newey, W. K. and K. D. West (1987b). Hypothesis testing with efficient method of moments estimation. Internat. Econom. Rev. 28, 777-787.
Novales, A. (1992). Equilibrium interest-rate determination under adjustment costs. J. Econom. Dynamic Control 16, 1-25.
Roll, R. (1977). A critique of the asset pricing theory's tests: Part 1: On past and potential testability of the theory. J. Financ. Econom. 4, 129-176.
Ross, S. A. (1976). The arbitrage pricing theory of capital asset pricing. J. Econom. Theory 13, 341-360.
Ross, S. (1977). Risk, return and arbitrage. In: Risk and Return in Finance, I. Friend and J. L. Bicksler, eds. Cambridge, Mass.: Ballinger.
Rubinstein, M. (1974). An aggregation theorem for securities markets. J. Financ. Econom. 1, 225-244.
Rubinstein, M. (1976). The valuation of uncertain income streams and the pricing of options. Bell J. Econom. Mgmt. Sci. 7, 407-425.
Ryder, H. E., Jr. and G. M. Heal (1973). Optimum growth with intertemporally dependent preferences. Rev. Econom. Stud. 40, 1-33.
Shanken, J. (1987). Multivariate proxies and asset pricing relations: Living with the Roll critique. J. Financ. Econom. 18, 91-110.
Shanken, J. (1992). On the estimation of beta-pricing models. Rev. Financ. Stud. 5, 1-33.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. J. Finance 19, 425-442.
Sharpe, W. F. (1994). The Sharpe ratio. J. Port. Mgmt. 21, 49-58.
Shiller, R. J. (1979). The volatility of long-term interest rates and expectations models of the term structure. J. Politic. Econom. 87, 1190-1219.
Shiller, R. J. (1981). Do stock prices move too much to be justified by subsequent changes in dividends? Amer. Econom. Rev. 71, 421-436.
Singleton, K. J. (1980). Expectations models of the term structure and implied variance bounds. J. Politic. Econom. 88, 1159-1176.
Snow, K. N. (1991). Diagnosing asset pricing models using the distribution of asset returns. J. Finance 46, 955-983.
Stambaugh, R. F. (1982). On the exclusion of assets from tests of the two-parameter model: A sensitivity analysis. J. Financ. Econom. 10, 237-268.
Sundaresan, S. M. (1989). Intertemporally dependent preferences and the volatility of consumption and wealth. Rev. Financ. Stud. 2, 73-89.
Wheatley, S. (1988). Some tests of international equity integration. J. Financ. Econom. 21, 177-212.
Wheatley, S. M. (1989). A critique of latent variable tests of asset pricing models. J. Financ. Econom. 23, 325-338.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817-838.
Wilson, R. (1968). The theory of syndicates. Econometrica 36, 119-132.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14. © 1996 Elsevier Science B.V. All rights reserved.
A number of well-known asset pricing models imply that the expected return on an asset can be written as a linear function of one or more beta coefficients that measure the asset's sensitivity to sources of undiversifiable risk. This paper provides an overview of the econometric evaluation of such models using the method of instrumental variables. We present numerous examples that cover both single-beta and multi-beta models. These examples are designed to illustrate the various options available to researchers for estimating and testing beta pricing models. We also examine the implications of a variety of different assumptions concerning the time-series behavior of conditional betas, covariances, and reward-to-risk ratios. The techniques discussed in this paper have applications in other areas of asset pricing as well.

1. Introduction

Asset pricing models often imply that the expected return on an asset can be written as a linear combination of market-wide risk premia, where each risk premium is multiplied by a beta coefficient that measures the sensitivity of the return on the asset to a source of undiversifiable risk in the economy. Indeed, this type of trade-off between risk and expected return is implied by some of the most famous models in financial economics. The Sharpe (1964) - Lintner (1965) capital asset pricing model (CAPM), the Black (1972) CAPM, the Merton (1973) intertemporal CAPM, the arbitrage pricing theory (APT) of Ross (1976), and the Breeden (1979) consumption CAPM can all be classified under the general heading of beta pricing models. Although these models differ in terms of underlying structural assumptions, each implies a pricing relation that is linear in one or more betas. The fundamental difference between conditional and unconditional beta pricing models is the specification of the information environment that investors use to form expectations.
Unconditional models imply that investors set prices based on an unconditional assessment of the joint probability distribution of future returns. Under such a scenario we can construct an estimate of an investor's expected return on an asset by taking an average of past returns. Conditional models, on the other hand, imply that investors have time-varying expectations concerning the joint probability distribution of future returns. In order to construct an estimate of an investor's conditional expected return on an asset we have to use the information available to the investor at time $t-1$ to forecast the return for time $t$.

Both conditional and unconditional models attempt to explain the cross-sectional variation in expected returns. Unconditional models imply that differences in average risk across assets determine differences in average returns. There are no time-series predictions other than that expected returns are constant. Conditional models have similar cross-sectional implications: differences in conditional risk determine differences in conditional expected returns. But conditional models have implications concerning the time-series properties of expected returns as well. Conditional expected returns vary with changes in conditional risk and fluctuations in market-wide risk premiums. In theory, we can test a conditional beta pricing model using a single asset.

Empirical tests of beta pricing models can be interpreted within the familiar framework of mean-variance analysis. Unconditional tests seek to determine whether a certain portfolio is on the efficient portion of the unconditional mean-variance frontier. The unconditional frontier is determined by the unconditional means, variances and covariances of the asset returns. Conditional tests of beta pricing models are designed to answer a similar question: does a certain portfolio lie on the efficient portion of the mean-variance frontier at each point in time? In conditional tests, however, the mean-variance frontier is determined by the conditional means, conditional variances, and conditional covariances of asset returns.
As a general rule, the rejection of unconditional efficiency does not imply a rejection of conditional mean-variance efficiency. This is easily demonstrated using an example given by Dybvig and Ross (1985) and Hansen and Richard (1987). Suppose we are testing whether the 30-day Treasury bill is unconditionally efficient using monthly data. Unconditionally, the 30-day bill does not lie on the efficient frontier. It is a single risky asset (albeit low risk) whose return has nonzero variance. Thus it is surely dominated by an appropriately chosen portfolio. At the conditional level, however, the conclusion is much different. Conditionally, the 30-day bill is nominally risk free. At the end of each month we know precisely what the return will be over the next month. Because the conditional variance of the return on the T-bill is zero, it must be conditionally efficient.

A number of different methods have been proposed for testing beta pricing models. This paper focuses on one in particular: the method of instrumental variables. Instrumental variables are a set of data, specified by the econometrician, that proxy for the information that investors use to form expectations. The primary advantage of the instrumental variables approach is that it provides a highly tractable way of characterizing time-varying risk and expected returns. Our discussion of the instrumental variables methodology is organized along the following lines. Section 2 uses the conditional version of the Sharpe (1964) - Lintner (1965) CAPM to illustrate how the instrumental variables approach can be employed to estimate and test single beta models. Section 3 extends the analysis to multi-beta models. Section 4 introduces the technique of latent variables. Section 5 provides an overview of the estimation methodology. The final section offers some brief closing remarks.
2. Single beta models

A. The conditional CAPM

The conditional version of the Sharpe (1964) - Lintner (1965) CAPM is undoubtedly one of the most widely studied conditional beta pricing models. We can express the pricing relation associated with this model as:

$$E[r_{jt} \mid \Omega_{t-1}] = \frac{\mathrm{Cov}[r_{jt}, r_{mt} \mid \Omega_{t-1}]}{\mathrm{Var}[r_{mt} \mid \Omega_{t-1}]}\,E[r_{mt} \mid \Omega_{t-1}] \tag{1}$$
where tit is the return on portfolio j from time t  1 to time t measured in excess of the risk free rate, r,,t is the excess return on the market portfolio, and g2t1 represents the information set that investors use to form expectations. The ratio of the conditional covariance between the return on portfolio j and the return on the market, Cov[rjt,rmtlt~t_l], to the variance of the return on the market, Var[rmt[~~t_l] , is the conditional beta of portfolio j with respect to the market. Any crosssectional variation in expected returns can be attributed solely to differences in conditional beta coefficients. As it stands the pricing relation shown in (1) is untestable. To make it testable we have to impose additional structure on the model. In particular, we have to specify a model for conditional expectations. Thus any test of (1) will be a joint test of the conditional CAPM and the assumed specification for conditional expectations. In theory any functional form could be used. Let f ( Z t  1 ) denote the statistical model that generates conditional expectations where Z is a set of instrumental variables. The function f ( . ) could be a linear regression model, a Fourier flexible form [Gallant (1982)], a nonparametric kernel estimator [Silverman (1986), Harvey (1991), and Beneish and Harvey (1995)], a seminonparametric density [Gallant and Tauchen (1989)], a neural net [Gallant and White (1990)], an entropy encoder [Glodjo and Harvey (1995)], or a polynomial series expansion [Harvey and Kirby (1995)]. Once we take a stand on the functional form of the conditional expectations operator it is straightforward to construct a test of the conditional CAPM. First we use f ( . ) to obtain fitted values for the conditional mean ofrjt. This nails down the lefthand side of the pricing relation in (1). Then we apply f() again to get fitted values for the three components on the righthand side of (1). 
Combining the fitted values for the conditional mean of $r_{mt}$, those for the conditional covariance between $r_{jt}$ and $r_{mt}$, and those for the conditional variance of $r_{mt}$ yields fitted values for the right-hand side of (1). If the conditional CAPM is valid then the pricing errors (the differences between the fitted values for the left-hand and right-hand sides of (1)) should be small and unpredictable. This is the basic intuition behind all tests of conditional beta pricing models.

In the presentation that follows we focus on one particular specification for conditional expectations: the linear model. This model, though very simple, has distinct advantages over the many nonlinear alternatives. The linear model is exceedingly easy to implement, and Harvey (1991) shows that it performs well against nonlinear alternatives in out-of-sample forecasting of the market return. In addition, the linear specification is actually more general than it may seem. Recent work has shown that many nonlinear models can be consistently approximated via an expanding sequence of finite-dimensional linear models. Harvey and Kirby (1995) exploit this fact to develop a simple procedure for constructing analytic tests of both single beta and multibeta pricing models.
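Under the linear specification, the conditional mean of the excess return on portfolio $j$ is a linear function of the instruments:

$$r_{jt} = Z_{t-1}\delta_j + u_{jt} \tag{2}$$

where $u_{jt}$ is the error in forecasting the return on portfolio $j$ at time $t$, $Z_{t-1}$ is a row vector of $\ell$ instrumental variables, and $\delta_j$ is an $\ell \times 1$ vector of time-invariant weights. Substituting the expression shown in (2) into equation (1) yields the restriction: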
$$Z_{t-1}\delta_j = \frac{E[u_{jt}u_{mt} \mid Z_{t-1}]}{E[u_{mt}^2 \mid Z_{t-1}]}\, Z_{t-1}\delta_m \tag{3}$$
where $u_{mt}$ is the error in forecasting the return on the market portfolio. Note that both the variance term, $E[u_{mt}^2 \mid Z_{t-1}]$, and the covariance term, $E[u_{jt}u_{mt} \mid Z_{t-1}]$, are conditioned on $Z_{t-1}$. Therefore, the pricing relation in (3) should be regarded as an approximation. This is the case because the expectation of the true conditional covariance is not the covariance conditioned on $Z_{t-1}$. The two are connected via the relation:

$$E[\mathrm{Cov}(r_{jt}, r_{mt} \mid \Omega_{t-1}) \mid Z_{t-1}] = \mathrm{Cov}(r_{jt}, r_{mt} \mid Z_{t-1}) - \mathrm{Cov}(E[r_{jt} \mid \Omega_{t-1}], E[r_{mt} \mid \Omega_{t-1}] \mid Z_{t-1}).$$

An analogous relation holds for the true conditional variance of $r_{mt}$ and the variance conditioned on $Z_{t-1}$. There is no way to construct a test of the original version of the pricing restriction given that the true information set $\Omega$ is unobservable. If we multiply both sides of (3) by the conditional variance of the return on the market portfolio we obtain the restriction:
$$E[u_{mt}^2\, Z_{t-1}\delta_j \mid Z_{t-1}] = E[u_{jt}u_{mt}\, Z_{t-1}\delta_m \mid Z_{t-1}] \tag{4}$$
Notice that the conditional expected returns on both the market portfolio and portfolio $j$ have been moved inside the expectations operator. This can be done because both of these quantities are known conditional on $Z_{t-1}$. As a result, we do not need to specify an explicit model for the conditional variance and covariance terms. We simply note that, under the null hypothesis, the disturbance:
$$e_{jt} = u_{mt}^2\, Z_{t-1}\delta_j - u_{jt}u_{mt}\, Z_{t-1}\delta_m \tag{5}$$
should have mean zero and be uncorrelated with the instrumental variables. If we divide $e_{jt}$ by the conditional variance of the market return, then the resulting quantity can be interpreted as the deviation of the observed return from the return predicted by the model. Thus $e_{jt}$ is essentially just a pricing error. A negative pricing error implies the model is overpricing while a positive pricing error indicates that the model is underpricing.

The generalized method of moments (GMM), which is discussed in detail in Section 5, provides a direct way to test the above restriction. Suppose we have a total of $n$ assets. We can stack the disturbances in (2) and the pricing errors in (5) into the $(2n+1) \times 1$ vector:
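$$\varepsilon_t = \begin{pmatrix} u_t & u_{mt} & e_t \end{pmatrix}' = \begin{pmatrix} [r_t - Z_{t-1}\delta]' \\ [r_{mt} - Z_{t-1}\delta_m]' \\ [u_{mt}^2\, Z_{t-1}\delta - u_{mt}u_t\, Z_{t-1}\delta_m]' \end{pmatrix} \tag{6}$$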
where $u$ is the innovation in the $1 \times n$ vector of conditional means and $e$ is the $1 \times n$ vector of pricing errors. The conditional CAPM implies that $\varepsilon_t$ should be uncorrelated with $Z_{t-1}$. So if we form the Kronecker product of $\varepsilon_t$ with the vector of instrumental variables:

$$\varepsilon_t \otimes Z_{t-1}' \tag{7}$$

and take unconditional expectations, we obtain the vector of orthogonality conditions:

$$E[\varepsilon_t \otimes Z_{t-1}'] = 0. \tag{8}$$
With $n$ assets there are $n + 1$ columns of innovations for the conditional means and $n$ columns of pricing errors. Thus, with $\ell$ instrumental variables we have $\ell(2n+1)$ orthogonality conditions. Note, however, that there are $\ell(n+1)$ parameters to estimate. This leaves $n\ell$ overidentifying restrictions.$^3$ We can obtain consistent estimates of the $\ell \times n$ matrix of coefficients $\delta$ and the $\ell \times 1$ vector of coefficients $\delta_m$ by minimizing the quadratic objective function:

$$J_T = g_T' S_T^{-1} g_T, \tag{9}$$

where:

$$g_T = \frac{1}{T}\sum_{t=1}^{T} \varepsilon_t \otimes Z_{t-1}' \tag{10}$$

and $S_T$ is a consistent estimate of:

$$S_0 = \sum_{j=-\infty}^{\infty} E\left[(\varepsilon_t \otimes Z_{t-1}')(\varepsilon_{t-j} \otimes Z_{t-j-1}')'\right]. \tag{11}$$
If the conditional CAPM is true then $T$ times the minimized value of the objective function converges to a central chi-square random variable with $n\ell$ degrees of freedom, where $\ell$ is the number of instrumental variables. Thus we can use this criterion as a measure of the overall goodness-of-fit of the model.
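The stacking of forecast errors and pricing errors into orthogonality conditions can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes NumPy, simulated data, and a hypothetical helper `capm_moments` that evaluates the sample moment vector for the fully time-varying conditional CAPM system, with OLS used only to obtain starting coefficients.

```python
import numpy as np

def capm_moments(params, r, rm, Z):
    """Sample moment vector g_T for the conditional CAPM system (hypothetical helper).

    r  : (T, n) excess returns on the n portfolios
    rm : (T,)   excess return on the market
    Z  : (T, l) instruments, row t holding Z_{t-1}
    """
    T, n = r.shape
    l = Z.shape[1]
    delta = params[: l * n].reshape(l, n)    # l x n coefficients for the assets
    delta_m = params[l * n:]                 # l coefficients for the market
    fit, fit_m = Z @ delta, Z @ delta_m      # fitted conditional means
    u = r - fit                              # asset forecast errors
    um = rm - fit_m                          # market forecast error
    # pricing errors: u_m^2 * (Z delta) - u_m * u * (Z delta_m)
    e = (um ** 2)[:, None] * fit - (um[:, None] * u) * fit_m[:, None]
    eps = np.column_stack([u, um, e])        # (T, 2n + 1)
    # row-wise Kronecker product of eps_t with Z_{t-1}
    f = (eps[:, :, None] * Z[:, None, :]).reshape(T, -1)
    return f.mean(axis=0)

# Simulated data (purely illustrative)
rng = np.random.default_rng(0)
T, n, l = 200, 3, 4
Z = np.column_stack([np.ones(T), rng.normal(size=(T, l - 1))])
rm = rng.normal(size=T)
r = rng.normal(size=(T, n))

# OLS projections as starting values for the conditional-mean coefficients
delta_ols = np.linalg.lstsq(Z, r, rcond=None)[0]
delta_m_ols = np.linalg.lstsq(Z, rm, rcond=None)[0]
params = np.concatenate([delta_ols.ravel(), delta_m_ols])

g = capm_moments(params, r, rm, Z)
n_moments, n_params = g.size, params.size
```

With these dimensions the moment vector has l(2n + 1) entries against l(n + 1) parameters, so the difference equals nl, matching the count of overidentifying restrictions given in the text.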
3 An econometric specification of this form is explored for New York Stock Exchange returns in Harvey (1989) and Huang (1989), for 17 international equity returns in Harvey (1991), for international bond returns in Harvey, Solnik and Zhou (1995), and for emerging equity market returns in Harvey (1995).
The econometric specification shown in (6) assumes that all of the conditional moments (the means, variances and covariances) change through time. If some of these moments are constant then we can construct more powerful tests of the conditional CAPM by imposing this additional structure. Traditionally, tests of the CAPM have focused on whether expected returns are proportional to the expected return on a benchmark portfolio. We can construct the same type of test within our conditional pricing framework with a specification of the form:
$$e_t = (r_t - r_{mt}\beta)' \tag{12}$$
where $\beta$ is a row vector of $n$ beta coefficients. The coefficient $\beta_j$ represents the ratio of the conditional covariance between the return on portfolio $j$ and the return on the benchmark to the conditional variance of the benchmark return. Typically, we think of $r_{mt}$ as a proxy for the market portfolio. It is important to note, however, that the beta coefficients in (12) are left unrestricted. Thus (12) can also be interpreted as a test of a single factor latent variables model.$^4$ In the latent variables framework, $\beta_j$ represents the ratio of the conditional covariance between the return on portfolio $j$ and an unobserved factor to the conditional covariance between the return on the benchmark portfolio and this factor. The testable implication is that $E[e_t \mid Z_{t-1}] = 0$, where $e_t$ is the vector of pricing errors associated with the constant conditional beta model. With $\ell$ instrumental variables there are $n\ell$ orthogonality conditions and $n$ parameters to estimate, so we have $n(\ell - 1)$ overidentifying restrictions. Of course we can easily incorporate the restrictions on the conditional beta coefficients by changing the specification to:
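$$\varepsilon_t = \begin{pmatrix} u_t & u_{mt} & b_t & e_t \end{pmatrix}' = \begin{pmatrix} [r_t - Z_{t-1}\delta]' \\ [r_{mt} - Z_{t-1}\delta_m]' \\ [u_{mt}^2\beta - u_{mt}u_t]' \\ [r_t - r_{mt}\beta]' \end{pmatrix} \tag{13}$$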
where $b$ is the disturbance vector associated with the constant conditional beta assumption. Tests based on this specification may shed additional light on the plausibility of the assumption of constant conditional betas. With $n$ assets there are $n + 1$ columns of innovations in the conditional means, $n$ columns in $b$ and $n$ columns in $e$. Thus, with $\ell$ instrumental variables, there are $\ell(3n+1)$ orthogonality conditions, $\ell(n+1) + n$ parameters to estimate, and $n(2\ell - 1)$ overidentifying restrictions.
4 See, for example, Hansen and Hodrick (1983), Gibbons and Ferson (1985) and Ferson (1990).
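The ratio of the conditional expected return on the market to its conditional variance, $E[r_{mt} \mid \Omega_{t-1}]/\mathrm{Var}[r_{mt} \mid \Omega_{t-1}]$, is simply the price of covariance risk. A version of the conditional CAPM that holds this ratio constant is examined in Campbell (1987) and Harvey (1989). The vector of pricing errors for the model becomes:

$$e_t = r_t - \lambda u_t u_{mt} \tag{14}$$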
where $\lambda$ is the conditional expected return on the market divided by its conditional variance. To complete the econometric specification we have to include models for the conditional means. The overall system is:
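$$\varepsilon_t = \begin{pmatrix} u_t & u_{mt} & e_t \end{pmatrix}' = \begin{pmatrix} [r_t - Z_{t-1}\delta]' \\ [r_{mt} - Z_{t-1}\delta_m]' \\ [r_t - \lambda(u_{mt}u_t)]' \end{pmatrix} \tag{15}$$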
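With $n$ assets there are $n + 1$ columns of innovations in the conditional means and $n$ columns in $e$. Thus with $\ell$ instrumental variables there are $\ell(2n+1)$ orthogonality conditions and $1 + \ell(n+1)$ parameters. This leaves $n\ell - 1$ overidentifying restrictions.

One way to simplify the estimation in (15) is to note that $E[u_{mt}u_{jt} \mid Z_{t-1}] = E[u_{mt}r_{jt} \mid Z_{t-1}]$. This follows from the fact that:

$$E[u_{mt}u_{jt} \mid Z_{t-1}] = E[u_{mt}(r_{jt} - Z_{t-1}\delta_j) \mid Z_{t-1}] = E[u_{mt}r_{jt} \mid Z_{t-1}] - E[u_{mt} \mid Z_{t-1}]\, Z_{t-1}\delta_j = E[u_{mt}r_{jt} \mid Z_{t-1}],$$

where the last equality holds because $E[u_{mt} \mid Z_{t-1}] = 0$.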
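As a result, we can drop $n$ of the conditional mean equations. The more parsimonious system is:

$$\varepsilon_t = \begin{pmatrix} u_{mt} & e_t \end{pmatrix}' = \begin{pmatrix} [r_{mt} - Z_{t-1}\delta_m]' \\ [r_t - \lambda(u_{mt}r_t)]' \end{pmatrix} \tag{16}$$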
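The substitution of raw returns for forecast errors has a simple in-sample counterpart. The sketch below is an illustration under stated assumptions (NumPy, simulated data, OLS projections with a constant among the instruments); all variable names are hypothetical. Because OLS market residuals are exactly orthogonal to the instruments, the sample mean of the product of market and asset forecast errors equals the sample mean of the product of the market forecast error and the raw asset return.

```python
import numpy as np

rng = np.random.default_rng(1)
T, l = 500, 3

# Instruments (first column is a constant) and simulated excess returns
Z = np.column_stack([np.ones(T), rng.normal(size=(T, l - 1))])
rm = Z @ np.array([0.5, 0.2, -0.1]) + rng.normal(size=T)   # market
rj = Z @ np.array([0.3, 0.1, 0.4]) + rng.normal(size=T)    # portfolio j

# Linear conditional means estimated by OLS
delta_m = np.linalg.lstsq(Z, rm, rcond=None)[0]
delta_j = np.linalg.lstsq(Z, rj, rcond=None)[0]
um = rm - Z @ delta_m      # market forecast error
uj = rj - Z @ delta_j      # portfolio forecast error

# Sample analogues of E[u_m u_j] and E[u_m r_j]
mean_um_uj = np.mean(um * uj)
mean_um_rj = np.mean(um * rj)
```

The two sample means agree up to floating-point error, because their difference is the sample covariance between the market residual and the fitted value for portfolio j, which OLS sets to zero.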
Now we have $n + 1$ equations and, with $\ell$ instrumental variables, $\ell(n+1)$ orthogonality conditions. With $\ell + 1$ parameters there are $n\ell - 1$ overidentifying restrictions. The specifications shown in (15) and (16) are asymptotically equivalent. But (16) is more computationally manageable.

The specifications in (15) and (16) do not restrict $\lambda$ to be the conditional covariance to variance ratio. We can easily add this restriction:

$$\varepsilon_t = \begin{pmatrix} u_t & u_{mt} & m_t & e_t \end{pmatrix}' = \begin{pmatrix} [r_t - Z_{t-1}\delta]' \\ [r_{mt} - Z_{t-1}\delta_m]' \\ [u_{mt}^2\lambda - Z_{t-1}\delta_m]' \\ [r_t - \lambda(u_{mt}r_t)]' \end{pmatrix} \tag{17}$$
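where $m$ is the disturbance associated with the constant reward-to-risk assumption. Tests of this specification should shed additional light on the plausibility of the assumption of a constant price of covariance risk. With $n$ assets there are $n$ columns in $u$, one column in $u_m$, one column in $m$ and $n$ columns in $e$. Thus, with $\ell$ instrumental variables, there are $\ell(2n+2)$ orthogonality conditions and $\ell(n+1) + 1$ parameters, which leaves $\ell(n+1) - 1$ overidentifying restrictions.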
Ferson and Harvey (1994, 1995) explore specifications where the conditional betas are modelled as linear functions of the instrumental variables. We could, for example, specify an econometric system of the form:
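$$\begin{aligned}
u_{1it} &= r_{it} - Z^{i,w}_{t-1}\delta_i \\
u_{2t} &= r_{mt} - Z^{w}_{t-1}\delta_m \\
u_{3it} &= [u_{2t}^2 (Z^{i,w}_{t-1}\kappa_i) - r_{mt}u_{1it}]' \\
u_{4it} &= \mu_i - Z^{i,w}_{t-1}\delta_i \\
u_{5it} &= (-\alpha_i + \mu_i) - Z^{i,w}_{t-1}\kappa_i (Z^{w}_{t-1}\delta_m)
\end{aligned} \tag{18}$$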
where the elements of $Z^{i,w}_{t-1}\kappa_i$ are the fitted conditional betas for portfolio $i$, $\mu_i$ is the mean return on portfolio $i$, and $\alpha_i$ is the difference between the unrestricted mean return and the mean return that incorporates the pricing restriction of the conditional CAPM. Note that (18) uses two sets of instruments. The set used to estimate the conditional mean return on portfolio $i$ and the conditional beta for the portfolio, $Z^{i,w}$, includes both asset-specific ($i$) and marketwide ($w$) instruments. The conditional mean return on the market is estimated using only the marketwide instruments. This yields an exactly identified system of equations.$^5$

The intuition behind the system shown in (18) is straightforward. The first two equations follow from our assumption of linear conditional expectations. They represent statistical models for expected returns. The third equation follows from the definition of the conditional beta:
$$\beta_{it} = \left(E[u_{2t}^2 \mid Z^{w}_{t-1}]\right)^{-1} E[r_{mt}u_{1it} \mid Z^{i,w}_{t-1}] \tag{19}$$
In (18) the conditional beta is modelled as a linear function of both the asset-specific and marketwide information. The last two equations deliver the average pricing error for the conditional CAPM. Note that $\mu_i$ is the average fitted return from the statistical model. Thus $\alpha_i$ is the difference between the average fitted return from our statistical model and the fitted return implied by the pricing relation of the conditional CAPM. It is analogous to the Jensen $\alpha$. In the current analysis, however, both the betas and the risk premiums are changing through time.

5 For analysis of related systems see Ferson (1990), Shanken (1990), Ferson and Harvey (1991), Ferson and Harvey (1993), Ferson and Korajczyk (1995), Ferson (1995), Harvey (1995) and Jagannathan and Wang (1996).

Because of the complexity and size of the above system it is difficult to estimate for more than one asset at a time. Thus, in general, not all the cross-sectional restrictions of the conditional CAPM can be imposed, and it is not possible to report a multivariate test of whether the $\alpha_i$ are equal to zero. Note, however, that (18) does impose one important cross-sectional restriction. Because the system is exactly identified, the market risk premium, $Z^{w}_{t-1}\delta_m$, will be identical for every asset examined. There are no overidentifying restrictions, so tests of the model are based on whether the coefficient $\alpha_i$ is significantly different from zero. Additional insights might be gained by analyzing the time-series properties of the disturbance:
$$u_{6it} = r_{it} - Z^{i,w}_{t-1}\kappa_i (Z^{w}_{t-1}\delta_m) \tag{20}$$
Under the null hypothesis, $E[u_{6it} \mid Z_{t-1}]$ is equal to zero. Thus diagnostics can be conducted by regressing $u_{6it}$ on various information variables. We could also construct tests for time variation in the betas based on the coefficient estimates associated with $Z^{i,w}_{t-1}\kappa_i$.
$$E[r_t \mid Z_{t-1}] = E[f_t \mid Z_{t-1}] \left(E[u_{ft}'u_{ft} \mid Z_{t-1}]\right)^{-1} E[u_{ft}'u_t \mid Z_{t-1}] \tag{21}$$
where $r$ is a row vector of $n$ asset returns, $f$ is a $1 \times k$ vector of factor realizations, $u_f$ is a vector of innovations in the conditional means of the factors, and $u$ is a vector of innovations in the conditional means of the returns. The first term on the right-hand side of (21) represents the conditional expectation of the factor realizations. It has dimension $1 \times k$. The second term is the inverse of the $k \times k$ conditional variance-covariance matrix of the factors. The final term measures the conditional covariance of the asset returns with the factors. Its dimension is $k \times n$.

The multibeta pricing relation shown in (21) cannot be tested in the same manner as its single-beta counterpart. Recall that in our analysis of single-beta models it was possible to take the conditional variance of the market return to the left-hand side of the pricing relation. As a result, we could move the conditional means inside the expectations operator. This is not possible with a multibeta specification. We can, however, get around this problem by focusing on specializations of the multibeta model that parallel those discussed in the previous section. We begin by considering specifications that restrict the conditional betas to be linear functions of the instruments.

B. Linear conditional betas

The multibeta analogue of the linear conditional beta specification shown in (18) takes the form:
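$$\begin{aligned}
u_{1it} &= r_{it} - Z^{i,w}_{t-1}\delta_i \\
u_{2t} &= f_t - Z^{w}_{t-1}\delta_f \\
u_{3it} &= [u_{2t}'u_{2t} (Z^{i,w}_{t-1}\kappa_i)' - f_t'u_{1it}]' \\
u_{4it} &= \mu_i - Z^{i,w}_{t-1}\delta_i \\
u_{5it} &= (-\alpha_i + \mu_i) - Z^{i,w}_{t-1}\kappa_i (Z^{w}_{t-1}\delta_f)'
\end{aligned} \tag{22}$$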
where the elements of $Z^{i,w}_{t-1}\kappa_i$ are the fitted conditional betas associated with the $k$ sources of risk and $f$ is a row vector of factor realizations. Note that as before the system is exactly identified, and the vector of conditional betas:

$$\beta_{it} = \left(E[u_{2t}'u_{2t} \mid Z^{w}_{t-1}]\right)^{-1} E[f_t'u_{1it} \mid Z^{i,w}_{t-1}] \tag{23}$$
is modelled as a linear function, $Z^{i,w}_{t-1}\kappa_i$, of the instruments. This specification can be tested by assessing the statistical significance of the pricing errors and checking to see whether the disturbance:

$$u_{6it} = r_{it} - Z^{i,w}_{t-1}\kappa_i (Z^{w}_{t-1}\delta_f)' \tag{24}$$
is orthogonal to the instruments. The primary advantage of the above formulation is that fitted values are obtained for the risk premiums, the expected returns, and the conditional betas. Thus it is simple to conduct diagnostics that focus on the performance of the model. Its main disadvantage is that it requires a heavy parameterization.

$$\varepsilon_t = \begin{pmatrix} u_t & u_{ft} & e_t \end{pmatrix}' = \begin{pmatrix} [r_t - Z_{t-1}\delta]' \\ [f_t - Z_{t-1}\delta_f]' \\ [r_t - \lambda(u_{ft}'u_t)]' \end{pmatrix} \tag{25}$$
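where $\lambda$ is a row vector of $k$ time-invariant reward-to-risk measures. The above system can be simplified to:

$$\varepsilon_t = \begin{pmatrix} u_{ft} & e_t \end{pmatrix}' = \begin{pmatrix} [f_t - Z_{t-1}\delta_f]' \\ [r_t - \lambda(u_{ft}'r_t)]' \end{pmatrix} \tag{26}$$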
using the same approach that allowed us to simplify the single-beta specification discussed earlier.$^6$

6 Kan and Zhang (1995) generalize this formulation by modelling the conditional reward-to-risk ratios as linear functions of the instrumental variables. Their approach eliminates the need for asset-specific instruments and permits joint estimation of the pricing relation using multiple portfolios. But the type of diagnostics that fall out of the linear conditional beta model (fitted expected returns, betas, etc.) are no longer available.
4. Latent variables models

The latent variables technique introduced by Hansen and Hodrick (1983) and Gibbons and Ferson (1985) provides a rank restriction on the coefficients of the linear specifications that are assumed to describe expected returns. Suppose we assume that the ratio formed by taking the conditional beta for one asset and dividing it by the corresponding conditional beta for another asset is constant. Under these circumstances, the $k$-factor conditional beta pricing model implies that all of the variation in the expected returns is driven by changes in the $k$ conditional risk premiums. We can still form our estimates of the conditional means by projecting returns on the $\ell$-dimensional vector of instrumental variables. But if all the variation in expected returns is being driven by changes in the $k$ risk premiums then we should not need all $n\ell$ projection coefficients to characterize the time variation in the $n$ returns. Thus the basic idea of the latent variables technique is to test restrictions on the rank of the projection coefficient matrix.
A. Constant conditional beta ratios
First we take the vector of excess returns on our set of portfolios and partition it as:

$$r_t = (r_{1t} \;\vdots\; r_{2t}), \tag{34}$$
where $r_{1t}$ is a $1 \times k$ vector of returns on the reference assets and $r_{2t}$ is a $1 \times (n-k)$ vector of returns on the test assets. Then we partition the matrix of conditional beta coefficients associated with our multifactor pricing model accordingly:

$$\beta = (\beta_1 \;\vdots\; \beta_2), \tag{35}$$

where $\beta_1$ is $k \times k$ and $\beta_2$ is $k \times (n-k)$. The pricing relation for the multibeta model tells us that:

$$E[r_{1t} \mid Z_{t-1}] = \gamma_t\beta_1 \tag{36}$$

and

$$E[r_{2t} \mid Z_{t-1}] = \gamma_t\beta_2, \tag{37}$$

where $\gamma_t$ is a $1 \times k$ vector of time-varying marketwide risk premiums. We can manipulate (36) to obtain the relation $\gamma_t = E[r_{1t} \mid Z_{t-1}]\beta_1^{-1}$. Substituting this expression for $\gamma_t$ into (37) yields the pricing restriction:

$$E[r_{2t} \mid Z_{t-1}] = E[r_{1t} \mid Z_{t-1}]\,\beta_1^{-1}\beta_2. \tag{38}$$

This says that the conditional expected returns on the test assets are proportional to the conditional expected returns on the reference assets. The constants of proportionality are determined by ratios of conditional betas.
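The pricing relation in (38) can be tested in much the same manner as the models discussed earlier.$^7$ The only real difference is that we no longer have to identify the factors. One possible specification is:

$$\varepsilon_t = \begin{pmatrix} u_{1t} & u_{2t} & e_t \end{pmatrix}' = \begin{pmatrix} [r_{1t} - Z_{t-1}\delta_1]' \\ [r_{2t} - Z_{t-1}\delta_2]' \\ [Z_{t-1}\delta_2 - Z_{t-1}\delta_1\Phi]' \end{pmatrix} \tag{39}$$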
where $\Phi \equiv \beta_1^{-1}\beta_2$. There are $k$ columns in $u_{1t}$, $n-k$ columns in $u_{2t}$, and $n-k$ columns in $e_t$. Thus, with $\ell$ instrumental variables, we have $\ell(2n-k)$ orthogonality conditions and $\ell n + k(n-k)$ parameters. This leaves $(\ell-k)(n-k)$ overidentifying restrictions. Note that both the number of instrumental variables and the total number of assets must be greater than the number of factors.
B. Linear conditional covariance ratios
An important disadvantage of (39) is that the ratio of conditional betas, $\Phi = \beta_1^{-1}\beta_2$, is assumed to be constant. One way to generalize the latent variables model is to assume the elements of $\Phi$ are linear in the instrumental variables.$^8$ This assumption follows naturally from the previous specifications that imposed the assumption of linear conditional betas. The resulting latent variables system is:
(40)
where $\iota$ is a $k \times 1$ vector of ones. With the original set of $\ell$ instruments the dimension of $\Phi^*$ in the final set of moment conditions is $\ell(n-k)$ and the system is not identified. Thus the researcher must specify some subset of the original instruments, $Z^*$, with dimension $\ell^* < \ell$ to be used in the estimation. Finally, the parameterization in both (39) and (40) can be reduced by substituting the third equation block into the second block. For example,

$$\varepsilon_t = \begin{pmatrix} u_{1t} & e_t \end{pmatrix}' = \begin{pmatrix} [r_{1t} - Z_{t-1}\delta_1]' \\ [r_{2t} - Z_{t-1}\delta_1\Phi]' \end{pmatrix} \tag{41}$$
7 Harvey, Solnik and Zhou (1995) and Zhou (1995) show how to construct analytic tests of latent variables models.
8 See Ferson and Foerster (1994).

Contemporary empirical research in financial economics makes frequent use of a wide variety of econometric techniques. The generalized method of moments has proven to be particularly valuable, however, especially in the area of estimating and testing asset pricing models. This section provides an overview of the generalized method of moments (GMM) procedure. We begin by illustrating the intuition behind GMM using a simple example of classical method of moments (CMM) estimation. This is followed by a brief discussion of the assumptions underlying the GMM approach to estimation and testing along with a review of some of the key distributional results. For detailed proofs of the consistency and asymptotic normality of GMM estimators see Hansen (1982), Gallant and White (1988), and Potscher and Prucha (1991a,b).
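Suppose that the $j$th population moment of $x$ about zero:

$$m_j(\theta) \equiv E[x^j], \tag{42}$$

can be written as a known function of the parameter vector $\theta$. To implement the CMM procedure we first compute the $j$th sample moment of $x$ about zero: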
$$\hat{m}_j = \frac{1}{T}\sum_{t=1}^{T} x_t^j. \tag{43}$$
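Then we set the $j$th sample moment equal to the corresponding population moment for $j = 1, 2, \ldots, k$:

$$\begin{pmatrix} m_1(\theta) \\ m_2(\theta) \\ \vdots \\ m_k(\theta) \end{pmatrix} = \begin{pmatrix} \hat{m}_1 \\ \hat{m}_2 \\ \vdots \\ \hat{m}_k \end{pmatrix}. \tag{44}$$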
This yields a set of $k$ equations in $k$ unknowns that can be solved to obtain an estimator for the unknown vector $\theta$. Thus the basic idea behind the CMM procedure is to estimate $\theta$ by replacing population moments with their sample analogues.

Now let's take a more concrete version of the above example. Suppose that $x_1, x_2, \ldots, x_T$ is a random sample of size $T$ drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. To obtain the classical method of moments estimators of $\mu$ and $\sigma^2$ we note that $\sigma^2 = m_2 - (m_1)^2$. This implies that the system of moment equations takes the form:
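$$\frac{1}{T}\sum_{i=1}^{T} x_i = \hat{\mu}, \qquad \frac{1}{T}\sum_{i=1}^{T} x_i^2 = \hat{\sigma}^2 + \hat{\mu}^2. \tag{45}$$

Solving this system gives the estimators:

$$\hat{\mu} = \frac{1}{T}\sum_{i=1}^{T} x_i, \qquad \hat{\sigma}^2 = \frac{1}{T}\sum_{i=1}^{T} x_i^2 - \left(\frac{1}{T}\sum_{i=1}^{T} x_i\right)^2. \tag{46}$$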
Notice that these are also the maximum likelihood estimators of $\mu$ and $\sigma^2$.
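The normal-distribution example can be checked numerically. The sketch below is a minimal illustration using only the Python standard library; the helper name `cmm_normal` and the simulated sample are assumptions made for the example. It matches the first two sample moments to their population counterparts and compares the result with the sample mean and the population-style (divide by T) variance, which are the maximum likelihood estimators.

```python
import random
import statistics

def cmm_normal(xs):
    """Classical method of moments estimates of (mu, sigma^2) for a normal sample."""
    T = len(xs)
    m1 = sum(xs) / T                   # first sample moment about zero
    m2 = sum(x * x for x in xs) / T    # second sample moment about zero
    mu_hat = m1                        # mu = m1
    sigma2_hat = m2 - m1 ** 2          # sigma^2 = m2 - m1^2
    return mu_hat, sigma2_hat

# Simulated sample with true mean 1.0 and standard deviation 2.0 (illustrative)
random.seed(0)
sample = [random.gauss(1.0, 2.0) for _ in range(5000)]
mu_hat, sigma2_hat = cmm_normal(sample)

# Maximum likelihood counterparts for comparison
mle_mu = statistics.fmean(sample)
mle_sigma2 = statistics.pvariance(sample)
```

Up to floating-point rounding, `mu_hat` equals `mle_mu` and `sigma2_hat` equals `mle_sigma2`, which is the equivalence noted in the text.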
$$E_t[u_{t+\tau}] = 0, \tag{47}$$
where $E_t[\cdot]$ is the expectations operator conditional on the information set at time $t$, $u_{t+\tau} \equiv h(X_{t+\tau}, \theta_0)$ is an $n \times 1$ vector of disturbance terms, $X_{t+\tau}$ is an $s \times 1$ vector of observable random variables, and $\theta_0$ is an $m \times 1$ vector of unknown parameters. The basic idea behind the GMM procedure is to exploit the moment restrictions in (47) to construct a sample objective function whose minimizer is a consistent and asymptotically normal estimate of the unknown vector $\theta_0$. In order to construct such an objective function, however, we need to make some assumptions about the nature of the data generating process.

Let $Z_t$ denote the date $t$ realization of an $\ell \times 1$ vector of observable instrumental variables. We assume, following Hansen (1982), that the vector process $\{X_t, Z_t\}_{t=-\infty}^{\infty}$ is strictly stationary and ergodic. Note that this assumption rules out a number of features sometimes encountered in economic data such as deterministic trends, unit roots, and unconditional heteroskedasticity. It accommodates many common forms of conditional heterogeneity, however, and it does not appear to be overly restrictive in most applications.$^9$

With suitable restrictions on the data generating process in place we can proceed to construct the GMM objective function. First we form the Kronecker product:
$$f(X_{t+\tau}, Z_t, \theta_0) \equiv u_{t+\tau} \otimes Z_t. \tag{48}$$
Then we note that because $Z_t$ is in the information set at time $t$, the model in (47) implies that:

$$E_t[f(X_{t+\tau}, Z_t, \theta_0)] = 0. \tag{49}$$
9 Although it is possible to establish consistency and asymptotic normality of GMM estimators under weaker assumptions, the associated arguments are too complex for an introductory discussion. The interested reader can consult Potscher and Prucha (1991a,b) for an overview of recent advances in the asymptotic theory of dynamic nonlinear econometric models.
Applying the law of iterated expectations to equation (49) yields the unconditional restriction:
$$E[f(X_{t+\tau}, Z_t, \theta_0)] = 0. \tag{50}$$
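Equation (50) represents a set of $n\ell$ population orthogonality conditions, one for each pairing of the $n$ disturbances with the $\ell$ instruments. The sample analogue of $E[f(X_{t+\tau}, Z_t, \theta)]$:

$$g_T(\theta) \equiv \frac{1}{T}\sum_{t=1}^{T} f(X_{t+\tau}, Z_t, \theta) \tag{51}$$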
forms the basis for the GMM objective function. Note that for any given value of $\theta$ the vector $g_T(\theta)$ is just the sample mean of $T$ realizations of the random vector $f(X_{t+\tau}, Z_t, \theta)$. Given that $f(\cdot)$ is continuous and $\{X_t, Z_t\}_{t=-\infty}^{\infty}$ is strictly stationary and ergodic we have:

$$g_T(\theta) \xrightarrow{p} E[f(X_{t+\tau}, Z_t, \theta)] \tag{52}$$
by the law of large numbers. Thus if the economic model is valid the vector $g_T(\theta_0)$ should be close to zero when evaluated for a large number of observations. The GMM estimator of $\theta_0$ is obtained by choosing the value of $\theta$ that minimizes the overall deviation of $g_T(\theta)$ from zero. As long as $E[f(X_{t+\tau}, Z_t, \theta)]$ is continuous in $\theta$ it follows that this estimator is consistent under fairly general regularity conditions.

If the model is exactly identified ($m = n\ell$, where $\ell$ is the number of instruments), the GMM estimator is the value of $\theta$ that sets the sample moments equal to zero. For the more common situation where the model is overidentified ($m < n\ell$), finding a vector of parameters that sets all of the sample moments equal to zero is not feasible. It is possible, however, to find a value of $\theta$ that sets $m$ linear combinations of the $n\ell$ sample moment conditions equal to zero. We simply let $A_T$ be an $m \times n\ell$ matrix such that $A_T g_T(\theta) = 0$ has a well-defined solution. The value of $\theta$ that solves this system of equations is the GMM estimator. Although we have considerable leeway in choosing the weighting matrix $A_T$, Hansen (1982) shows that the variance-covariance matrix of the estimator is minimized by letting $A_T$ equal $D_T' S_T^{-1}$, where $D_T$ and $S_T$ are consistent estimates of:

$$D_0 \equiv E\left[\frac{\partial h(X_{t+\tau}, \theta_0)}{\partial \theta'} \otimes Z_t\right] \quad \text{and} \quad S_0 \equiv \sum_{j=-\infty}^{\infty} \Gamma_0(j), \tag{53}$$

with $\Gamma_0(j) \equiv E[f(X_{t+\tau}, Z_t, \theta_0) f(X_{t+\tau-j}, Z_{t-j}, \theta_0)']$. Before considering how to derive this result we first have to establish the asymptotic normality of GMM estimators.
$$\sqrt{T}\, g_T(\theta) = \frac{1}{\sqrt{T}}\sum_{t=1}^{T} f(X_{t+\tau}, Z_t, \theta). \tag{54}$$
The assumption that $\{X_t, Z_t\}_{t=-\infty}^{\infty}$ is stationary and ergodic, along with standard regularity conditions, implies that a version of the central limit theorem holds. In particular we have that:

$$\sqrt{T}\, g_T(\theta_0) \xrightarrow{d} N(0, S_0), \tag{55}$$
with $S_0$ given by (53). This result allows us to establish the limiting distribution of the GMM estimator $\theta_T$. First we make the following assumptions:

1. The estimator $\theta_T$ converges in probability to $\theta_0$.
2. The weighting matrix $A_T$ converges in probability to $A_0$, where $A_0$ has rank $m$.
3. Define:

$$D_T \equiv \frac{1}{T}\sum_{t=1}^{T} \frac{\partial h(X_{t+\tau}, \theta_T)}{\partial \theta'} \otimes Z_t. \tag{56}$$

For any $\theta_T$ such that $\theta_T \xrightarrow{p} \theta_0$, the matrix $D_T$ converges in probability to $D_0$, where $D_0$ has rank $m$.

Then we apply the mean value theorem to obtain:
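$$g_T(\theta_T) = g_T(\theta_0) + D_T^*(\theta_T - \theta_0), \tag{57}$$

where $D_T^*$ is given by (56) with $\theta_T$ replaced by a vector $\theta_T^*$ that lies somewhere within the interval whose endpoints are given by $\theta_T$ and $\theta_0$. Recall that $\theta_T$ is the solution to the system of equations $A_T g_T(\theta) = 0$. So if we premultiply equation (57) by $A_T$ we have:

$$A_T g_T(\theta_0) + A_T D_T^*(\theta_T - \theta_0) = 0. \tag{58}$$

Solving (58) for $(\theta_T - \theta_0)$ and multiplying by $\sqrt{T}$ gives:

$$\sqrt{T}(\theta_T - \theta_0) = -(A_T D_T^*)^{-1} A_T \sqrt{T}\, g_T(\theta_0). \tag{59}$$

Given the assumptions above and the central limit result in (55), it follows that:

$$\sqrt{T}(\theta_T - \theta_0) \xrightarrow{d} -(A_0 D_0)^{-1} A_0 \times N(0, S_0), \tag{60}$$

or equivalently:

$$\sqrt{T}(\theta_T - \theta_0) \xrightarrow{d} N\left(0,\; (A_0 D_0)^{-1} A_0 S_0 A_0' \left[(A_0 D_0)^{-1}\right]'\right). \tag{61}$$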
Now that we know the limiting distribution of the generic GMM estimator we can determine the best choice for the weighting matrix $A_T$. The natural metric by which to measure our choice is the variance-covariance matrix of the distribution shown in (61). We want, in other words, to choose the $A_T$ that minimizes the variance-covariance matrix of the limiting distribution of the GMM estimator.
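Let:

$$H \equiv (A_0 D_0)^{-1} A_0 - (D_0' S_0^{-1} D_0)^{-1} D_0' S_0^{-1}. \tag{62}$$

At first it may appear a bit odd to define $H$ in this manner, but it simplifies the problem of finding the efficient choice for $A_T$. To see why this is true note that:

$$(A_0 D_0)^{-1} A_0 S_0 A_0' \left[(A_0 D_0)^{-1}\right]' = \left[H + (D_0' S_0^{-1} D_0)^{-1} D_0' S_0^{-1}\right] S_0 \left[H + (D_0' S_0^{-1} D_0)^{-1} D_0' S_0^{-1}\right]' \tag{63}$$

$$= H S_0 H' + (D_0' S_0^{-1} D_0)^{-1}, \tag{64}$$

where the cross terms vanish because $H D_0 = 0$.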
Because $S_0$ is positive definite, the matrix $H S_0 H'$ is positive semidefinite. Thus $(D_0' S_0^{-1} D_0)^{-1}$ is the lower bound on the asymptotic variance-covariance matrix of the GMM estimator. It is easily verified by direct substitution that choosing $A_0 = D_0' S_0^{-1}$ achieves this lower bound. This completes our review of the distribution theory for GMM estimators. Next we want to consider some of the practical aspects of GMM estimation and see how we might go about testing the restrictions implied by economic models. We begin with a strategy for implementing the GMM procedure.
Substituting the optimal choice for the weighting matrix into this expression yields:

$$D_T' S_T^{-1} g_T(\theta) = 0, \tag{65}$$
where $S_T$ is a consistent estimate of the matrix $S_0$. But it is apparent that (65) is just the first-order condition for the problem:

$$\min_{\theta}\; J_T(\theta) \equiv g_T(\theta)' S_T^{-1} g_T(\theta). \tag{66}$$
So given a consistent estimate of $S_0$ we can obtain the GMM estimator for $\theta_0$ by minimizing the quadratic form shown in equation (66).

In order to estimate $\theta_0$ we need a consistent estimate of $S_0$. But, in general, $S_0$ is a function of $\theta_0$. The solution to this dilemma is to perform a two-step estimation procedure. Initially we set $S_T$ equal to the identity matrix and perform the minimization to get a first-stage estimate for $\theta_0$. Although this estimate is not asymptotically efficient it is still consistent. Thus we can use it to construct a consistent estimate of $S_0$. Once we have a consistent estimate of $S_0$ we obtain the second-stage estimate for $\theta_0$ by minimizing the quadratic form shown above.

Let's assume that we have performed the two-step estimation procedure and obtained the efficient GMM estimate of the vector of parameters $\theta_0$. Typically we would like to have some way of evaluating how well the model fits the observed data. One way of obtaining such a goodness-of-fit measure is to construct a test of the overidentifying restrictions.
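The two-step procedure can be sketched for a deliberately simple case. The example below is an illustration under stated assumptions, not the chapter's asset pricing application: it uses NumPy, a simulated linear model with one endogenous regressor and three instruments, moment conditions of the form instrument times residual, and a weighting matrix that assumes serially uncorrelated moments. The function names `gmm_linear_iv` and `two_step_gmm` are hypothetical.

```python
import numpy as np

def gmm_linear_iv(y, x, Z, W):
    """Minimize g_T' W g_T for g_T(theta) = Z'(y - x*theta)/T (scalar theta)."""
    T = len(y)
    zx = Z.T @ x / T
    zy = Z.T @ y / T
    # First-order condition: zx' W (zy - zx * theta) = 0
    return (zx @ W @ zy) / (zx @ W @ zx)

def two_step_gmm(y, x, Z):
    T, q = Z.shape
    # Step 1: identity weighting gives a consistent but inefficient estimate
    theta1 = gmm_linear_iv(y, x, Z, np.eye(q))
    # Estimate S_0 from first-step residuals (assumes no serial correlation)
    f = Z * (y - x * theta1)[:, None]
    S = f.T @ f / T
    # Step 2: efficient weighting with the inverse of the estimated S_0
    theta2 = gmm_linear_iv(y, x, Z, np.linalg.inv(S))
    # T times the minimized objective (overidentifying restrictions statistic)
    g = (Z * (y - x * theta2)[:, None]).mean(axis=0)
    J = T * g @ np.linalg.solve(S, g)
    return theta1, theta2, J

# Simulated data: true theta = 0.5, regressor correlated with the error
rng = np.random.default_rng(42)
T = 2000
Z = rng.normal(size=(T, 3))
v = rng.normal(size=T)
x = Z @ np.array([1.0, 0.5, -0.5]) + v
u = 0.5 * v + rng.normal(size=T)
y = 0.5 * x + u

theta1, theta2, J = two_step_gmm(y, x, Z)
```

With three moment conditions and one parameter, `J` would be compared with a chi-square distribution with two degrees of freedom. In nonlinear models the closed-form step would be replaced by a numerical minimizer.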
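$$g_T(\theta_T) = g_T(\theta_0) + D_T^*(\theta_T - \theta_0). \tag{67}$$

If we multiply equation (67) by $\sqrt{T}$ and substitute for $\sqrt{T}(\theta_T - \theta_0)$ from equation (59) we obtain:

$$\sqrt{T}\, g_T(\theta_T) = \left[I - D_T^*(A_T D_T^*)^{-1} A_T\right] \sqrt{T}\, g_T(\theta_0). \tag{68}$$

With the efficient weighting matrix $A_T = D_T' S_T^{-1}$ this becomes:

$$\sqrt{T}\, g_T(\theta_T) = \left[I - D_T^*(D_T' S_T^{-1} D_T^*)^{-1} D_T' S_T^{-1}\right] \sqrt{T}\, g_T(\theta_0), \tag{69}$$

so that:

$$\sqrt{T}\, g_T(\theta_T) \xrightarrow{d} \left[I - D_0 (D_0' S_0^{-1} D_0)^{-1} D_0' S_0^{-1}\right] \times N(0, S_0). \tag{70}$$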
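Because $S_0$ is symmetric and positive definite it can be factored as $S_0 = PP'$ where $P$ is nonsingular. Thus (70) can be written as:

$$P^{-1}\sqrt{T}\, g_T(\theta_T) \xrightarrow{d} \left[I - P^{-1} D_0 (D_0' S_0^{-1} D_0)^{-1} D_0' (P')^{-1}\right] \times N(0, I). \tag{71}$$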
The matrix premultiplying the normal distribution in (71) is idempotent with rank $n\ell - m$. It follows, therefore, that the overidentifying test statistic:

$$M_T \equiv T\, g_T(\theta_T)' S_0^{-1} g_T(\theta_T) \tag{72}$$
converges to a central chi-square random variable with $n\ell - m$ degrees of freedom. The limiting distribution of $M_T$ remains the same if we use a consistent estimate $S_T$ in place of $S_0$.

Note that in many respects the test for overidentifying restrictions is analogous to the Lagrange multiplier test in maximum likelihood estimation. The GMM estimator of $\theta_0$ is obtained by setting $m$ linear combinations of the $n\ell$ orthogonality conditions equal to zero. Thus there are $n\ell - m$ linearly independent combinations which have not been set equal to zero. Suppose we took these $n\ell - m$ linear combinations of the moment conditions and set them equal to an $(n\ell - m) \times 1$ vector of unknown parameters $\alpha$. The system would then be exactly identified and $M_T$ would be identically equal to zero. Imposing the restriction that $\alpha = 0$ yields the efficient GMM estimator along with a quantity $T g_T(\theta_T)' S_T^{-1} g_T(\theta_T)$ that can be viewed as the GMM analogue of the score form of the Lagrange multiplier test statistic.

The test for overidentifying restrictions is appealing because it provides a simple way to gauge how well the model fits the data. It would also be convenient, however, to be able to test restrictions on the vector of parameters for the model. As we shall see, such tests can be constructed in a straightforward manner.
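The degrees-of-freedom count rests on the projection-type matrix being idempotent with rank equal to the number of moments minus the number of parameters. The sketch below checks this numerically under stated assumptions: NumPy, arbitrary simulated matrices standing in for $D_0$ and $S_0$, a Cholesky factor as one admissible choice of $P$, and the matrix taken to have the form $I - P^{-1}D_0(D_0'S_0^{-1}D_0)^{-1}D_0'(P')^{-1}$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_moments, m = 6, 2   # number of orthogonality conditions and of parameters

# Arbitrary full-rank Jacobian and positive definite covariance (illustrative)
D = rng.normal(size=(n_moments, m))
B = rng.normal(size=(n_moments, n_moments))
S = B @ B.T + np.eye(n_moments)

# Factor S = P P' with P nonsingular (Cholesky is one such factorization)
P = np.linalg.cholesky(S)

# With A = P^{-1} D we have D' S^{-1} D = A'A, so the matrix is I - A (A'A)^{-1} A'
A = np.linalg.solve(P, D)
M = np.eye(n_moments) - A @ np.linalg.solve(A.T @ A, A.T)

is_idempotent = np.allclose(M @ M, M)
rank_M = int(round(np.trace(M)))   # trace equals rank for an idempotent matrix
```

Here `rank_M` equals `n_moments - m`, which is why a standard normal vector premultiplied by this matrix has a squared length distributed as chi-square with that many degrees of freedom.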
G. Hypothesis testing in GMM
Suppose that we are interested in testing restrictions on the vector of parameters of the form:

$$q(\theta_0) = 0, \tag{73}$$
where $q$ is a known $p \times 1$ vector of functions. Let the $p \times m$ matrix $Q_0 \equiv \partial q/\partial \theta'$ denote the Jacobian of $q(\theta)$ evaluated at $\theta_0$. By assumption $Q_0$ has rank $p$. We know that for the efficient choice of the weighting matrix the limiting distribution of the GMM estimator is:

$$\sqrt{T}(\theta_T - \theta_0) \xrightarrow{d} N\left(0, (D_0' S_0^{-1} D_0)^{-1}\right). \tag{74}$$
Thus under fairly general regularity conditions the standard large-sample test criteria are distributed asymptotically as central chi-square random variables with $p$ degrees of freedom when the restrictions hold. Let $\theta_T^u$ and $\theta_T^r$ denote the unrestricted estimator and the estimator obtained by minimizing $J_T(\theta)$ subject to $q(\theta) = 0$. The Wald test statistic is based on the unrestricted estimator. It takes the form:

$$W_T \equiv T\, q(\theta_T^u)' \left[Q_T (D_T' S_T^{-1} D_T)^{-1} Q_T'\right]^{-1} q(\theta_T^u), \tag{75}$$
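where $Q_T$, $D_T$ and $S_T$ are consistent estimates of $Q_0$, $D_0$ and $S_0$ computed using $\theta_T^u$. The Lagrange multiplier test statistic is constructed using the gradient of $J_T(\theta)$ evaluated at the restricted estimator. It is given by:

$$LM_T \equiv T\, g_T(\theta_T^r)' S_T^{-1} D_T (D_T' S_T^{-1} D_T)^{-1} D_T' S_T^{-1} g_T(\theta_T^r), \tag{76}$$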
where $D_T$ and $S_T$ are consistent estimates of $D_0$ and $S_0$ computed from $\theta_T^r$. The likelihood ratio type test statistic is equal to the difference between the overidentifying test statistic for the restricted and unrestricted estimations:

$$LR_T \equiv T\left[g_T(\theta_T^r)' S_T^{-1} g_T(\theta_T^r) - g_T(\theta_T^u)' S_T^{-1} g_T(\theta_T^u)\right]. \tag{77}$$
The same estimate $S_T$ must be used for both estimations. It should be clear from the foregoing discussion that a consistent estimate of $S_0$ is one of the key elements of the GMM approach to estimation and testing. In practice there are a number of different methods for estimating $S_0$, and the appropriate method often depends on the specific characteristics of the model under consideration. The discussion below provides an introduction to heteroskedasticity and autocorrelation consistent estimation of the variance-covariance matrix. A more detailed treatment can be found in Andrews (1991).
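The matrix $S_0$ is given by:
$$ S_0 = \sum_{j=-\infty}^{\infty} \Gamma_0(j) , \tag{78} $$
where $\Gamma_0(j) \equiv E[f(X_{t+\tau}, Z_t, \theta_0) f(X_{t+\tau-j}, Z_{t-j}, \theta_0)']$. Because we have assumed stationarity, this matrix can also be written as:
$$ S_0 = \Gamma_0(0) + \sum_{j=1}^{\infty} \left( \Gamma_0(j) + \Gamma_0(j)' \right) , \tag{79} $$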
using the relation $\Gamma_0(-j) = \Gamma_0(j)'$. Now we want to consider how we might go about estimating $S_0$ consistently. First take the scenario where the vector $f(X_{t+\tau}, Z_t, \theta_0)$ is serially uncorrelated. Under such circumstances the second term on the right-hand side of equation (79) drops out and
$$ \Gamma_T(0) = \frac{1}{T} \sum_{t=1}^{T} f(X_{t+\tau}, Z_t, \theta_T) f(X_{t+\tau}, Z_t, \theta_T)' $$
provides a consistent estimate for $S_0$. The case where $f(\cdot)$ exhibits serial correlation is more complicated. Note that the sum in equation (79) contains an infinite number of terms. It is obviously impossible to estimate each of these terms. One way to proceed would be to treat $f(\cdot)$ as if it were serially correlated for a finite number of lags $L$. Under such circumstances a natural estimator for $S_0$ would be:
$$ S_T = \Gamma_T(0) + \sum_{j=1}^{L} \left( \Gamma_T(j) + \Gamma_T(j)' \right) , \tag{80} $$
where $\Gamma_T(j) = \frac{1}{T} \sum_{t=1+j}^{T} f(X_{t+\tau}, Z_t, \theta_T) f(X_{t+\tau-j}, Z_{t-j}, \theta_T)'$. As long as the individual $\Gamma_T(j)$ in equation (80) are consistent, the estimator $S_T$ will be consistent provided that $L$ is allowed to increase at a suitable rate as the sample size $T$ increases. But the estimator of $S_0$ in (80) is not guaranteed to be positive semidefinite. This can lead to problems in empirical work. The solution to this difficulty is to calculate $S_T$ as a weighted sum of the $\Gamma_T(j)$ where the weights gradually decline to zero as $j$ increases. If these weights are chosen appropriately then $S_T$ will be both consistent and positive semidefinite. Suppose we begin by defining the $ng(L+1) \times ng(L+1)$ partitioned matrix:
$$ C_T(L) = \begin{bmatrix} \Gamma_T(0) & \Gamma_T(1)' & \cdots & \Gamma_T(L)' \\ \Gamma_T(1) & \Gamma_T(0) & \cdots & \Gamma_T(L-1)' \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_T(L) & \Gamma_T(L-1) & \cdots & \Gamma_T(0) \end{bmatrix} . \tag{81} $$
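The matrix $C_T(L)$ can always be written in the form $C_T(L) = Y'Y$ where $Y$ is a $(T + L) \times ng(L+1)$ partitioned matrix. Take $L = 2$ as an example. Writing $f_t$ for $f(X_{t+\tau}, Z_t, \theta_T)$, the matrix $Y$ is given by:
$$ Y = \frac{1}{\sqrt{T}} \begin{bmatrix} 0 & 0 & f_1' \\ 0 & f_1' & f_2' \\ f_1' & f_2' & f_3' \\ \vdots & \vdots & \vdots \\ f_{T-2}' & f_{T-1}' & f_T' \\ f_{T-1}' & f_T' & 0 \\ f_T' & 0 & 0 \end{bmatrix} . \tag{82} $$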
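A weighted combination of the blocks of $C_T(L)$ is
$$ S_T(L) = \begin{bmatrix} \alpha_0 I & \alpha_1 I & \cdots & \alpha_L I \end{bmatrix} C_T(L) \begin{bmatrix} \alpha_0 I \\ \alpha_1 I \\ \vdots \\ \alpha_L I \end{bmatrix} , \tag{83} $$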
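where the $\alpha_i$ are scalars. Because $S_T(L)$ is the partitioned-matrix equivalent of a quadratic form in a positive semidefinite matrix it must also be positive semidefinite. Equation (83) can be rearranged to show that:
$$ S_T(L) = \sum_{i=0}^{L} \alpha_i^2 \, \Gamma_T(0) + \sum_{j=1}^{L} \left( \sum_{i=0}^{L-j} \alpha_i \alpha_{i+j} \right) \left( \Gamma_T(j) + \Gamma_T(j)' \right) . \tag{84} $$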
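As a quick check of this rearrangement, consider the smallest nontrivial case $L = 1$, in which only the two scalars $\alpha_0$ and $\alpha_1$ appear. Expanding the quadratic form block by block (a worked special case added for illustration) gives:

```latex
\begin{aligned}
S_T(1) &= \begin{bmatrix} \alpha_0 I & \alpha_1 I \end{bmatrix}
          \begin{bmatrix} \Gamma_T(0) & \Gamma_T(1)' \\ \Gamma_T(1) & \Gamma_T(0) \end{bmatrix}
          \begin{bmatrix} \alpha_0 I \\ \alpha_1 I \end{bmatrix} \\
       &= \alpha_0^2 \, \Gamma_T(0) + \alpha_0 \alpha_1 \, \Gamma_T(1)'
          + \alpha_1 \alpha_0 \, \Gamma_T(1) + \alpha_1^2 \, \Gamma_T(0) \\
       &= (\alpha_0^2 + \alpha_1^2) \, \Gamma_T(0)
          + \alpha_0 \alpha_1 \left( \Gamma_T(1) + \Gamma_T(1)' \right) .
\end{aligned}
```

For instance, with $\alpha_0 = \alpha_1 = 1/\sqrt{2}$ the weight on $\Gamma_T(0)$ is one and the weight on the lag-one term is $1/2$.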
The weighted sum on the right-hand side of equation (84) has the general form of an estimator for the variance-covariance matrix $S_0$. Thus if we select the $\alpha_i$ so that the weights in (84) are a decreasing function of $j$ and we allow $L$ to increase with the sample size at an appropriately slow rate we obtain a consistent positive semidefinite estimator for $S_0$. The modified Bartlett weights proposed by Newey and West (1987) have been used extensively in empirical research. Let $w_j$ be the weight placed on $\Gamma_T(j)$ in the calculation of the variance-covariance matrix. The weighting function for modified Bartlett weights takes the form:
$$ w_j = \begin{cases} 1 - \dfrac{j}{L+1} , & j = 0, 1, 2, \ldots, L , \\ 0 , & j > L , \end{cases} \tag{85} $$
where $L$ is the lag truncation parameter. Note that these weights are obtained by setting $\alpha_i = 1/\sqrt{L+1}$ for $i = 0, 1, \ldots, L$. Newey and West (1987) show that if $L$ is allowed to increase at a rate proportional to $T^{1/3}$ then $S_T$ based on these weights will be a consistent estimator of $S_0$. Although the weighting scheme proposed by Newey and West (1987) is popular, recent research has shown that other schemes may be preferable. Andrews (1991) explores both the theoretical and empirical performance of a variety of different weighting functions. Based on his results Parzen weights seem to offer a good combination of analytic tractability and overall performance. The weighting function for Parzen weights is:
$$ w_j = \begin{cases} 1 - 6 \left( \dfrac{j}{L+1} \right)^2 + 6 \left( \dfrac{j}{L+1} \right)^3 , & 0 \le \dfrac{j}{L+1} \le \dfrac{1}{2} , \\ 2 \left( 1 - \dfrac{j}{L+1} \right)^3 , & \dfrac{1}{2} \le \dfrac{j}{L+1} \le 1 , \\ 0 , & \text{otherwise.} \end{cases} \tag{86} $$
The final question we need to address is how to choose the lag truncation parameter $L$ in (86). The simplest strategy is to follow the suggestions of Gallant (1987) and set $L$ equal to the integer closest to $T^{1/5}$. The main advantage of this plug-in approach is that it yields an estimator that depends only on the sample size for the data set in question. An alternative strategy developed by Andrews (1991), however, may lead to better performance in small samples. He suggests the following data-dependent approach: use the first-stage estimate of $\theta_0$ to construct the sample analogue of $f(X_{t+\tau}, Z_t, \theta_0)$. Then estimate a first-order autoregressive model for each element of this vector. The autocorrelation coefficients along with the residual variances can be used to estimate the value of $L$ that minimizes the asymptotic truncated mean-squared error of the estimator. Andrews (1991) presents Monte Carlo results that suggest that estimators of $S_0$ constructed in this manner perform well under most circumstances.
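The weighted estimator and the simple plug-in rule can be sketched in a few lines of Python. The function names below are hypothetical, the input `f` is assumed to be a $T \times n$ array whose rows are the moment vectors evaluated at a first-stage estimate, and the Parzen weights are scaled by $L + 1$ by analogy with (85), which is an assumption of this sketch. The data-dependent rule of Andrews (1991) is not implemented here.

```python
import numpy as np

def bartlett_weight(j, L):
    """Modified Bartlett weight: 1 - j/(L+1) for j <= L, zero beyond."""
    return 1.0 - j / (L + 1.0) if j <= L else 0.0

def parzen_weight(j, L):
    """Parzen weight, with the lag scaled by L + 1 (assumed normalisation)."""
    x = j / (L + 1.0)
    if x <= 0.5:
        return 1.0 - 6.0 * x**2 + 6.0 * x**3
    if x <= 1.0:
        return 2.0 * (1.0 - x) ** 3
    return 0.0

def sample_autocov(f, j):
    """Gamma_T(j) = (1/T) sum_{t=j+1}^{T} f_t f_{t-j}'."""
    T = f.shape[0]
    return f[j:].T @ f[: T - j] / T

def hac_covariance(f, L=None, weight=bartlett_weight):
    """Weighted sum Gamma_T(0) + sum_j w_j (Gamma_T(j) + Gamma_T(j)')."""
    T = f.shape[0]
    if L is None:
        L = int(round(T ** 0.2))       # plug-in rule: integer closest to T^(1/5)
    S = sample_autocov(f, 0)
    for j in range(1, L + 1):
        G = sample_autocov(f, j)
        S = S + weight(j, L) * (G + G.T)
    return S

# Illustrative serially correlated moments: a bivariate AR(1) with coefficient 0.5.
rng = np.random.default_rng(0)
T, n = 1000, 2
e = rng.normal(size=(T, n))
f = np.empty_like(e)
f[0] = e[0]
for t in range(1, T):
    f[t] = 0.5 * f[t - 1] + e[t]

S_bartlett = hac_covariance(f)
S_parzen = hac_covariance(f, weight=parzen_weight)
```

Setting `L=0` recovers the estimator for the serially uncorrelated case, and the Bartlett-weighted estimate is positive semidefinite by construction.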
6. Closing remarks
Asset pricing models often imply that the expected return on an asset can be written as a linear function of one or more beta coefficients that measure the asset's sensitivity to sources of undiversifiable risk in the economy. This linear tradeoff between risk and expected return makes such models both intuitively appealing and analytically tractable. A number of different methods have been proposed for estimating and testing beta pricing models, but the method of instrumental variables is the approach of choice in most situations. The primary advantage of the instrumental variables approach is that it provides a highly tractable way of characterizing time-varying risk and expected returns. This paper provides an introduction to the econometric evaluation of both conditional and unconditional beta pricing models. We present numerous examples of how the instrumental variable methodology can be applied to various models. We began with a discussion of the conditional version of the Sharpe (1964)-Lintner (1965) CAPM and used it to illustrate how the instrumental variables approach could be used to estimate and test single beta models. Then we extended the analysis to models with multiple betas and introduced the concept of latent variables. We also provided an overview of the generalized method of moments (GMM) approach to estimation and testing. All of the techniques developed in this paper have applications in other areas of asset pricing as well.
References
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–858.
Bansal, R. and C. R. Harvey (1995). Performance evaluation in the presence of dynamic trading strategies. Working Paper, Duke University, Durham, NC.
Beneish, M. D. and C. R. Harvey (1995). Measurement error and nonlinearity in the earnings-returns relation. Working Paper, Duke University, Durham, NC.
Black, F. (1972). Capital market equilibrium with restricted borrowing. J. Business 45, 444–454.
Blake, I. F. and J. B. Thomas (1968). On a class of processes arising in linear estimation theory. IEEE Transactions on Information Theory IT-14, 12–16.
Bollerslev, T., R. F. Engle and J. M. Wooldridge (1988). A capital asset pricing model with time varying covariances. J. Politic. Econom. 96, 116–131.
Breeden, D. (1979). An intertemporal asset pricing model with stochastic consumption and investment opportunities. J. Financ. Econom. 7, 265–296.
Campbell, J. Y. (1987). Stock returns and the term structure. J. Financ. Econom. 18, 373–400.
Carhart, M. and R. J. Krail (1994). Testing the conditional CAPM. Working Paper, University of Chicago.
Chu, K. C. (1973). Estimation and decision for linear systems with elliptically random processes. IEEE Transactions on Automatic Control AC-18, 499–505.
Cochrane, J. (1994). Discrete time empirical finance. Working Paper, University of Chicago.
Devlin, S. J., R. Gnanadesikan and J. R. Kettenring. Some multivariate applications of elliptical distributions. In: S. Ideka et al., eds., Essays in probability and statistics, Shinko Tsusho, Tokyo, 365–393.
Dybvig, P. H. and S. A. Ross (1985). Differential information and performance measurement using a security market line. J. Finance 40, 383–400.
Dumas, B. and B. Solnik (1995). The world price of exchange rate risk. J. Finance, 445–480.
Fama, E. F. and J. D. MacBeth (1973). Risk, return, and equilibrium: Empirical tests. J. Politic. Econom. 81, 607–636.
Ferson, W. E. (1990). Are the latent variables in time-varying expected returns compensation for consumption risk. J. Finance 45, 397–430.
Ferson, W. E. (1995). Theory and empirical testing of asset pricing models. In: Robert A. J. W. T. Ziemba and V. Maksimovic, eds. North Holland, 145–200.
Ferson, W. E., S. R. Foerster and D. B. Keim (1993). General tests of latent variables models and mean-variance spanning. J. Finance 48, 131–156.
Ferson, W. E. and C. R. Harvey (1991). The variation of economic risk premiums. J. Politic. Econom. 99, 285–315.
Ferson, W. E. and C. R. Harvey (1993). The risk and predictability of international equity returns. Rev. Financ. Stud. 6, 522–566.
Ferson, W. E. and C. R. Harvey (1994a). An exploratory investigation of the fundamental determinants of national equity market returns. In: Jeffrey Frankel, ed., The internationalization of equity markets, Chicago: University of Chicago Press, 59–138.
Ferson, W. E. and R. A. Korajczyk (1995). Do arbitrage pricing models explain the predictability of stock returns. J. Business, 309–350.
Ferson, W. E. and Stephen R. Foerster (1994). Finite sample properties of the Generalized Method of Moments in tests of conditional asset pricing models. J. Financ. Econom. 36, 29–56.
Gallant, A. R. (1981). On the bias in flexible functional forms and an essentially unbiased form: The Fourier flexible form. J. Econometrics 15, 211–224.
Gallant, A. R. (1987). Nonlinear statistical models. John Wiley and Sons, NY.
Gallant, A. R. and G. E. Tauchen (1989). Seminonparametric estimation of conditionally constrained heterogeneous processes. Econometrica 57, 1091–1120.
Gallant, A. R. and H. White (1988). A unified theory of estimation and inference for nonlinear dynamic models. Basil Blackwell, NY.
Gallant, A. R. and H. White (1990). On learning the derivatives of an unknown mapping with multilayer feedforward networks. University of California at San Diego.
Gibbons, M. R. and W. E. Ferson (1985). Tests of asset pricing models with changing expectations and an unobservable market portfolio. J. Financ. Econom. 14, 217–236.
Glodjo, A. and C. R. Harvey (1995). Forecasting foreign exchange market returns via entropy coding. Working Paper, Duke University, Durham, NC.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.
Hansen, L. P. and R. J. Hodrick (1983). Risk averse speculation in the forward foreign exchange market: An econometric analysis of linear models. In: Jacob A. Frenkel, ed., Exchange rates and international macroeconomics, University of Chicago Press, Chicago, IL.
Hansen, L. P. and R. Jagannathan (1991). Implications of security market data for models of dynamic economies. J. Politic. Econom. 99, 225–262.
Hansen, L. P. and R. Jagannathan (1994). Assessing specification errors in stochastic discount factor models. Unpublished working paper, University of Chicago, Chicago, IL.
Hansen, L. P. and S. F. Richard (1987). The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica 55, 587–613.
Hansen, L. P. and K. J. Singleton (1982). Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 50, 1269–1285.
Harvey, C. R. (1989). Time-varying conditional covariances in tests of asset pricing models. J. Financ. Econom. 24, 289–317.
Harvey, C. R. (1991a). The world price of covariance risk. J. Finance 46, 111–157.
Harvey, C. R. (1991b). The specification of conditional expectations. Working Paper, Duke University.
Harvey, C. R. (1995). Predictable risk and returns in emerging markets. Rev. Financ. Stud., 773–816.
Harvey, C. R. and C. Kirby (1995). Analytic tests of factor pricing models. Working Paper, Duke University, Durham, NC.
Harvey, C. R., B. H. Solnik and G. Zhou (1995). What determines expected international asset returns? Working Paper, Duke University, Durham, NC.
Huang, R. D. (1989). Tests of the conditional asset pricing model with changing expectations. Unpublished working paper, Vanderbilt University, Nashville, TN.
Jagannathan, R. and Z. Wang (1996). The CAPM is alive and well. J. Finance 51, 3–53.
Kan, R. and C. Zhang (1995). A test of conditional asset pricing models. Working Paper, University of Alberta, Edmonton, Canada.
Keim, D. B. and R. F. Stambaugh (1986). Predicting returns in the bond and stock market. J. Financ. Econom. 17, 357–390.
Kelker, D. (1970). Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā, series A, 419–430.
Kirby, C. (1995). Measuring the predictable variation in stock and bond returns. Working Paper, Rice University, Houston, TX.
Lintner, J. (1965). The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. Rev. Econom. Statist. 47, 13–37.
Merton, R. C. (1973). An intertemporal capital asset pricing model. Econometrica 41, 867–887.
Newey, W. K. and K. D. West (1987). A simple, positive semidefinite, heteroskedasticity-consistent covariance matrix. Econometrica 55, 703–708.
Pötscher, B. M. and I. R. Prucha (1991a). Basic structure of the asymptotic theory in dynamic nonlinear econometric models, part I: Consistency and approximation concepts. Econometric Rev. 10, 125–216.
Pötscher, B. M. and I. R. Prucha (1991b). Basic structure of the asymptotic theory in dynamic nonlinear econometric models, part II: Asymptotic normality. Econometric Rev. 10, 253–325.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. J. Econom. Theory 13, 341–360.
Shanken, J. (1990). Intertemporal asset pricing: An empirical investigation. J. Econometrics 45, 99–120.
Sharpe, W. (1964). Capital asset prices: A theory of market equilibrium under conditions of risk. J. Finance 19, 425–442.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Solnik, B. (1991). The economic significance of the predictability of international asset returns. Working Paper, HEC School of Management.
Vershik, A. M. (1964). Some characteristic properties of Gaussian stochastic processes. Theory Probab. Appl. 9, 353–356.
White, H. (1980). A heteroskedasticity consistent covariance matrix estimator and a direct test of heteroskedasticity. Econometrica 48, 817–838.
Zhou, G. (1995). Small sample rank tests with applications to asset pricing. J. Empirical Finance 2, 71–94.
G.S. Maddala and C.R. Rao, eds., Handbook of Statistics, Vol. 14 1996 Elsevier Science B.V. All rights reserved.
Bruce N. Lehmann
This paper discusses semiparametric estimation procedures for asset pricing models within the generalized method of moments (GMM) framework. GMM is widely applied in the asset pricing context in its unconditional form but the conditional mean restrictions implied by asset pricing theory are seldom fully exploited. The purpose of this paper is to take some modest steps toward removing these impediments. The nature of efficient GMM estimation is cast in a language familiar to financial economists: the language of maximum correlation or optimal hedge portfolios. Similarly, a family of beta pricing models provides a natural setting for identifying the sources of efficiency gains in asset pricing applications. My hope is that this modest contribution will facilitate more routine exploitation of attainable efficiency gains.
1. Introduction
Asset pricing relations in frictionless markets are inherently semiparametric. That is, it is commonplace for valuation models to be cast in terms of conditional moment restrictions without additional distributional assumptions. Accordingly, a natural estimation strategy replaces population conditional moments with their sample analogues. Put differently, the generalized method of moments (GMM) framework of Hansen (1982) tightly links the economics and econometrics of asset pricing relations. While applications of GMM abound in the asset pricing literature, empirical workers seldom make full use of the GMM apparatus. In particular, researchers generally employ the unconditional forms of the procedures which do not exploit all of the efficiency gains inherent in the moment conditions implied by asset pricing models. There are two plausible reasons for this: (1) the information requirements are often sufficiently daunting to make full exploitation seem infeasible and (2) the literature on efficient semiparametric estimation is somewhat dense. The purpose of this paper is to take some modest steps toward removing these impediments. The nature of efficient GMM estimation is cast in terms familiar to financial economists: the language of maximum correlation or optimal hedge portfolios. Similarly, a family of beta pricing models provides a natural setting for identifying the sources of efficiency gains in asset pricing applications. My hope is that this modest contribution will facilitate more routine exploitation of attainable efficiency gains.
The layout of the paper is as follows. The next section provides an outline of GMM basics with a view toward the subsequent application to asset pricing models. The third section lays out the links between the economics of asset prices when markets do not permit arbitrage opportunities and the econometrics of asset pricing model estimation given the conditional moment restrictions implied by the absence of arbitrage. The general efficiency gains discussed in these two sections are worked out in detail in the fourth section, which documents the sources of efficiency gains in beta pricing models. The final section provides some concluding remarks.
$$ E[g_t(\theta_0) \mid I_{t-1}] = E[g_t(\theta_0)] = 0 \tag{2.1} $$
where $g_t(\theta_0)$ is the conditional mean zero random $q \times 1$ vector in the model, $\theta_0$ is the associated $p \times 1$ vector of parameters in the model, and $I_{t-1}$ is some unspecified information set that at least includes lagged values of $g_t(\theta_0)$. The restriction to zero conditional mean random variables means that $g_t(\theta_0)$ follows a martingale difference sequence and, thus, is serially uncorrelated.$^1$ A variety of familiar econometric models take this form. Consider, for example, the linear regression model:
$$ y_t = x_t' \beta_0 + \varepsilon_t \tag{2.2} $$
where $y_t$ is the $t$th observation on the dependent variable, $x_t$ is a $p \times 1$ vector of explanatory variables, and $\varepsilon_t$ is a random disturbance term. In this model, suppose that the econometrician observes a vector $z_{t-1}$ for which it is known that $E[\varepsilon_t \mid z_{t-1}] = 0$. Then this model is characterized by the conditional moment condition:
1 The behavior of GMM estimators can be readily established when $g_t(\theta_0)$ is serially dependent so long as a law of large numbers and central limit theorem apply to its time series average.
$$ g_t(\theta_0) = \varepsilon_t z_{t-1} ; \tag{2.3} $$
When $z_{t-1} = x_t$ this is the linear regression model with possibly stochastic regressors; otherwise, it is an instrumental variables estimator. GMM involves setting sample analogues of these moment conditions as close to zero as possible. Of course, they cannot all be set to zero if the number of linearly independent moment conditions exceeds the number of unknown parameters. Instead, GMM takes $p$ linear combinations of these moment conditions and seeks values of $\theta$ for which these linear combinations are zero. First, consider the unconditional version of the moment condition, that is, $E[g_t(\theta_0)] = 0$. In order for the model to be identified, assume that $g_t(\theta_0)$ possesses a nonsingular population covariance matrix and that $E[\partial g_t(\theta_0)'/\partial\theta]$ has full row rank. The GMM estimator can be derived in two ways. Following Hansen (1982), the GMM estimator $\hat\theta_T$ minimizes the sample quadratic form based on a sample of $T$ observations on $g_t(\theta_0)$:
$$ \min_{\theta} \; \bar g_T(\theta)' W_T(\theta_0) \bar g_T(\theta) ; \qquad \bar g_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} g_t(\theta) \tag{2.4} $$
given a positive definite weighting matrix $W_T(\theta_0)$ converging in probability to a positive definite limit $W(\theta_0)$. In this variant, the econometrician chooses $W_T(\theta_0)$ to give the GMM estimator desirable asymptotic properties. Alternatively, we can simply define the estimator $\hat\theta_T$ as the solution to the equation system:
$$ A_T(\theta_0) \bar g_T(\hat\theta_T) = A_T(\theta_0) \frac{1}{T} \sum_{t=1}^{T} g_t(\hat\theta_T) = 0 \tag{2.5} $$
where $A_T(\theta_0)$ is a sequence of $p \times q$ $O_p(1)$ matrices converging to a limit $A(\theta_0)$ with row rank $p$. In this formulation, $A_T(\theta_0)$ is chosen to give the resulting estimator desirable asymptotic properties. The estimating equations for the two variants are, of course, identical in form since:
$$ A_T(\theta_0) \bar g_T(\hat\theta_T) = G_T(\hat\theta_T) W_T(\theta_0) \bar g_T(\hat\theta_T) = 0 ; \qquad G_T(\hat\theta_T) = \frac{\partial \bar g_T(\hat\theta_T)'}{\partial \theta} \tag{2.6} $$
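The equivalence of the two formulations can be checked numerically in the linear model (2.2) with moment conditions (2.3). The sketch below is a minimal illustration under stated assumptions: the function names `linear_gmm` and `objective`, the simulated data and the choice of weighting matrix are hypothetical and are not taken from the text.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """GMM for g_t(theta) = z_{t-1}(y_t - x_t' theta) with a fixed weighting matrix W."""
    T = len(y)
    ZX, Zy = Z.T @ X / T, Z.T @ y / T
    G = -ZX.T                   # p x q derivative of the sample moments
    A = G @ W                   # p x q matrix of linear combinations A_T = G_T W_T
    theta = np.linalg.solve(A @ ZX, A @ Zy)   # solves A_T g_bar(theta) = 0
    g_bar = Zy - ZX @ theta
    return theta, A, g_bar

def objective(theta, y, X, Z, W):
    """Sample quadratic form g_bar' W g_bar that the first formulation minimizes."""
    g_bar = Z.T @ (y - X @ theta) / len(y)
    return g_bar @ W @ g_bar

# Illustrative data: 3 instruments, 2 regressors.
rng = np.random.default_rng(2)
T = 400
Z = rng.normal(size=(T, 3))
X = Z[:, :2] + 0.5 * rng.normal(size=(T, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=T)
W = np.linalg.inv(Z.T @ Z / T)

theta, A, g_bar = linear_gmm(y, X, Z, W)
# With z_{t-1} = x_t the system is exactly identified and reproduces least squares.
theta_x, _, g_x = linear_gmm(y, X, X, np.eye(X.shape[1]))
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

In the overidentified case the individual sample moments in `g_bar` are not zero, but the $p$ linear combinations `A @ g_bar` are, which is exactly the content of the estimating equations.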
For my purpose, equation (2.5) is a more suggestive formulation. The large sample behavior of $\hat\theta_T$ is straightforward, particularly in this case where $g_t(\theta_0)$ is a martingale difference sequence.$^2$ An appropriate weak law of large numbers ensures that $\bar g_T(\theta_0) \xrightarrow{p} 0$, which, coupled with the identification conditions, implies that $\hat\theta_T \xrightarrow{p} \theta_0$. So long as the necessary time series averages converge:
2 The standard reference on estimation and inference in this framework is Hansen (1982).
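$$ S_T(\theta_0) = \frac{1}{T} \sum_{t=1}^{T} g_t(\theta_0) g_t(\theta_0)' \xrightarrow{p} S(\theta_0) ; \qquad |S(\theta_0)| > 0 \tag{2.7} $$
$$ G_T(\theta_0) \xrightarrow{p} G(\theta_0) \tag{2.8} $$
$$ D(\theta_0) = [G(\theta_0) W(\theta_0) G(\theta_0)']^{-1} G(\theta_0) W(\theta_0) \tag{2.9} $$
and an appropriate central limit theorem for martingales ensures that
$$ \sqrt{T}(\hat\theta_T - \theta_0) \to N[0, D(\theta_0) S(\theta_0) D(\theta_0)'] . $$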
Consistent standard error estimates that are robust to conditional heteroskedasticity can be calculated from this expression by replacing $\theta_0$ with $\hat\theta_T$.$^3$ What choice of $A_T(\theta_0)$ or, equivalently, of $W_T(\theta_0)$ is optimal? All fixed weight estimators, that is, those that apply the same matrix $A_T(\theta_0)$ to each $g_t(\theta_0)$ for fixed $T$, are consistent under the weak regularity conditions sketched above. Accordingly, it is natural to compare the asymptotic variances of estimators, a criterion that can, of course, be justified more formally by confining attention to the class of regular estimators that rules out superefficient estimators. The asymptotically optimal $A(\theta_0)$ is obtained by equating $W_T(\theta_0)$ with $S_T(\theta_0)^{-1}$, yielding an asymptotic covariance matrix of $[G(\theta_0) S(\theta_0)^{-1} G(\theta_0)']^{-1}$. Once again, $S_T(\theta_0)$ can be estimated consistently by replacing $\theta_0$ with $\hat\theta_T$.$^4$ The optimal unconditional GMM estimator has a clear connection with the maximum likelihood estimator (MLE), even though we do not know the probability law generating the data. Let $\mathcal{L}_t(\theta_0, \eta)$ denote the logarithm of the population conditional distribution of the data underlying $g_t(\theta_0)$ where $\eta$ is a possibly infinite dimensional set of nuisance parameters. Similarly, let $\mathcal{L}'_t(\theta_0, \eta)$ denote the true score function, the vector of derivatives of $\mathcal{L}_t(\theta_0, \eta)$ with respect to $\theta$. Consider the unconditional population projection of $\mathcal{L}'_t(\theta_0, \eta)$ on the moment conditions $g_t(\theta_0)$:
3 Autocorrelation is not present under the hypothesis that $g_t(\theta)$ has conditional mean zero and is sampled only once per period (that is, the data are not overlapping). If the data are overlapping, the moment conditions will have a moving average error structure. See Hansen and Hodrick (1980) for a discussion of covariance matrix estimation in this case and Hansen and Singleton (1982) and Newey and West (1987) for methods appropriate for more general autocorrelation.
4 The possible singularity of $S_T(\theta)$ is discussed indirectly in Section 4.3 as part of the justification for factor structure assumptions. While my focus is not on hypothesis testing, the quadratic form in the fitted value of the moment conditions and the optimal weighting matrix yields the test statistic $T \bar g_T(\hat\theta_T)' S_T(\hat\theta_T)^{-1} \bar g_T(\hat\theta_T) \sim \chi^2(q - p)$ since $p$ degrees of freedom are used in estimating $\theta$. This test of overidentifying conditions is known as Hansen's J test.
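$$ \mathcal{L}'_t(\theta_0, \eta) = \mathrm{Cov}[\mathcal{L}'_t(\theta_0, \eta), g_t(\theta_0)'] \, \mathrm{Var}[g_t(\theta_0)]^{-1} g_t(\theta_0) + v_{ut} = -\Phi \Psi^{-1} g_t(\theta_0) + v_{ut} ; $$
$$ \Phi = E\left[ \frac{\partial g_t(\theta_0)'}{\partial \theta} \right] ; \qquad \Psi = E[g_t(\theta_0) g_t(\theta_0)'] \tag{2.10} $$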
since $E[\mathcal{L}'_t(\theta_0, \eta) g_t(\theta_0)'] = -\Phi$ given sufficient regularity to allow differentiation of the moment condition $E[g_t(\theta_0)] = 0$ under the integral sign. In this notation, the asymptotic variance of the unconditional GMM estimator is
$$ [\Phi \Psi^{-1} \Phi']^{-1} . $$
Hence, the optimal fixed linear combination of moment conditions $A^0(\theta_0)$ has the largest unconditional correlation with the true, but unknown, conditional score in finite samples. This fact does not lead to finite sample efficiency statements for at least two reasons. First, the MLE itself has no obvious efficiency properties in finite samples outside the case where the score takes the linear form $I(\theta_0)(\theta - \theta_0)$ where $I(\theta_0)$ is the Fisher information matrix. Second, the feasible optimal estimator replaces $\theta_0$ with $\hat\theta_T$ in $A(\theta_0)$, yielding a consistent estimator with no obvious finite sample efficiency properties. Nevertheless, the optimal fixed weight GMM estimator retains this optimality property in large samples. Now consider the conditional version of the moment condition; that is, $E[g_t(\theta_0) \mid I_{t-1}] = 0$. The prior information available to the econometrician is that $g_t(\theta_0)$ is a martingale difference sequence. Hence, the econometrician knows only that linear combinations of the $g_t(\theta_0)$ with weights based on information available at time $t - 1$ have zero means; nonlinear functions of $g_t(\theta_0)$ have unknown moments given only the martingale difference assumption. Since the econometrician is free to use time varying weights, consider estimators of the form:$^5$
$$ \frac{1}{T} \sum_{t=1}^{T} A_{t-1} g_t(\hat\theta_T) = 0 \tag{2.11} $$
where $A_{t-1}$ is a sequence of $p \times q$ $O_p(1)$ matrices chosen by the econometrician. In order to identify the model, assume $g_t(\theta_0)$ has a nonsingular population conditional covariance matrix $E[g_t(\theta_0) g_t(\theta_0)' \mid I_{t-1}]$ and that $E[\partial g_t(\theta_0)'/\partial\theta \mid I_{t-1}]$ has full row rank. The basic principles of asymptotically optimal estimation and inference in the conditional and unconditional cases are surprisingly similar, ignoring the difficulties associated with the calculation of conditional expectations $E[\,\cdot \mid I_{t-1}]$.$^6$ Once again, under suitable conditional versions of the regularity conditions sketched above:
5 The estimators could, in principle, involve nonlinear functions of these time series averages but their asymptotic linearity means that their effect is absorbed in $A_{t-1}$.
6 Hansen (1985), Tauchen (1986), Chamberlain (1987), Hansen, Heaton, and Ogaki (1988), Newey (1990), Robinson (1991), Chamberlain (1992), and Newey (1993) discuss efficient GMM estimation in related circumstances.
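$$ D_c(\theta_0) = \operatorname{plim} \left[ \frac{1}{T} \sum_{t=1}^{T} A_{t-1} \Phi_{t-1}' \right]^{-1} ; \qquad \Phi_{t-1} = E\left[ \frac{\partial g_t(\theta_0)'}{\partial \theta} \,\Big|\, I_{t-1} \right] ; $$
$$ S_c(\theta_0) = \operatorname{plim} \frac{1}{T} \sum_{t=1}^{T} A_{t-1} E[g_t(\theta_0) g_t(\theta_0)' \mid I_{t-1}] A_{t-1}' \tag{2.12} $$
the sample moment condition (2.11) is asymptotically linear so that:
$$ \sqrt{T}(\hat\theta_T - \theta_0) \to N[0, D_c(\theta_0) S_c(\theta_0) D_c(\theta_0)'] . \tag{2.13} $$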
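The econometrician can choose the weighting matrices $A_{t-1}$ to minimize the asymptotic variance of this estimator. The weighting matrices $A^0_{t-1}$ which are optimal in this sense are given by:
$$ A^0_{t-1} = \Phi_{t-1} \Psi_{t-1}^{-1} ; \qquad \Psi_{t-1} = E[g_t(\theta_0) g_t(\theta_0)' \mid I_{t-1}] \tag{2.14} $$
and the resulting asymptotic variance is:
$$ \mathrm{Var}[\sqrt{T}(\hat\theta_T - \theta_0)] \to \left[ E(\Phi_{t-1} \Psi_{t-1}^{-1} \Phi_{t-1}') \right]^{-1} \tag{2.15} $$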
The evaluation of $A^0_{t-1}$ need not be straightforward and doing so in asset pricing applications is the main preoccupation of Section 4.$^7$ The relations between the optimal conditional GMM estimator and the MLE are similar to the relations arising in the unconditional case. The conditional population projection of $\mathcal{L}'_t(\theta_0, \eta)$ on the moment conditions $g_t(\theta_0)$ reveals that:
7 The implementation of this efficient estimator is straightforward given the ability to calculate the relevant conditional expectations. Under weak regularity conditions, the estimator can be implemented in two steps by first obtaining an initial consistent estimate (perhaps using the unconditional GMM estimator (2.5)), estimating the optimal weighting matrix $A^0_{t-1}$ using this preliminary estimate, and then solving (2.14) for the efficient conditional GMM estimator. Of course, equations (2.11) and (2.14) can be iterated until convergence, although the iterative and two step estimators are asymptotically equivalent to first order.
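$$ \mathcal{L}'_t(\theta_0, \eta) = \mathrm{Cov}[\mathcal{L}'_t(\theta_0, \eta), g_t(\theta_0)' \mid I_{t-1}] \, \mathrm{Var}[g_t(\theta_0) \mid I_{t-1}]^{-1} g_t(\theta_0) + v_{ct} = -\Phi_{t-1} \Psi_{t-1}^{-1} g_t(\theta_0) + v_{ct} \tag{2.16} $$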
since $E[\mathcal{L}'_t(\theta_0, \eta) g_t(\theta_0)' + \partial g_t(\theta_0)'/\partial\theta \mid I_{t-1}]$ is zero given sufficient regularity to interchange the order of differentiation and integration of the conditional moment condition $E[g_t(\theta_0) \mid I_{t-1}] = 0$. Hence, the optimal linear combination of moment conditions $A^0_{t-1}$ has the largest conditional correlation with the true, but unknown, conditional score in finite samples. While this observation does not translate into clear finite sample efficiency statements, the GMM estimator based on $A^0_{t-1}$ is that which is most highly correlated with the MLE asymptotically. It is easy to characterize the relative efficiency of the optimal conditional and unconditional GMM estimators. As is usual, the variance of the difference between the optimal unconditional and conditional GMM estimators is the difference in their variances since the latter is efficient relative to the former. The difference in the optimal weights given to the martingale increments $g_t(\theta_0)$ is:
$$ A^0_{t-1} - A^0(\theta_0) = [\Phi_{t-1} - \Phi] \Psi^{-1} + \Phi_{t-1} [\Psi_{t-1}^{-1} - \Psi^{-1}] \tag{2.17} $$
Note that the law of iterated expectations applies to both $\Phi_{t-1}$ and $\Psi_{t-1}$ separately but not to the composite $A^0_{t-1}$ so that $E[A^0_{t-1} - A^0(\theta_0)]$ does not generally converge to zero. In any event, the relative efficiency of the conditional estimator is higher when there is considerable time variation in both $\Phi_{t-1}$ and $\Psi_{t-1}$. Finally, the conventional application of the GMM procedure lies somewhere between the conditional and unconditional cases. It involves the observation that zero conditional mean random variables are also uncorrelated with the elements of the information set. Let $Z_{t-1} \in I_{t-1}$ denote an $r \times q$ ($r > p$) matrix of predetermined variables and consider the revised moment conditions
$$ E[Z_{t-1} g_t(\theta_0) \mid I_{t-1}] = 0 \;\Rightarrow\; E[Z_{t-1} g_t(\theta_0)] = 0 \quad \forall \, Z_{t-1} \in I_{t-1} . \tag{2.18} $$
In the unconditional GMM procedure discussed above, $Z_{t-1}$ is $I_q$, the $q \times q$ identity matrix. In many applications, the same predetermined variables $z_{t-1}$ multiply each element of $g_t(\theta_0)$ so that $Z_{t-1}$ takes the form $I_q \otimes z_{t-1}$. Finally, different subsets of the information available to the econometrician $z_{it-1} \in I_{t-1}$ can be applied to each element of $g_t(\theta_0)$ so that $Z_{t-1}$ is given by
$$ Z_{t-1} = \begin{pmatrix} z_{1t-1} & 0 & \cdots & 0 \\ 0 & z_{2t-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & z_{qt-1} \end{pmatrix} \tag{2.19} $$
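While optimal conditional GMM can be applied in this case, the main point of this procedure is to modify unconditional GMM. As before, the unconditional population projection of $\mathcal{L}'_t(\theta_0, \eta)$ on the moment conditions $Z_{t-1} g_t(\theta_0)$ yields
$$ \mathcal{L}'_t(\theta_0, \eta) = \mathrm{Cov}[\mathcal{L}'_t(\theta_0, \eta), g_t(\theta_0)' Z_{t-1}'] \, \mathrm{Var}[Z_{t-1} g_t(\theta_0)]^{-1} Z_{t-1} g_t(\theta_0) + v_{uZt} = -\Phi_z \Psi_z^{-1} Z_{t-1} g_t(\theta_0) + v_{uZt} ; $$
$$ \Phi_z = E\left[ \frac{\partial g_t(\theta_0)'}{\partial \theta} Z_{t-1}' \right] ; \qquad \Psi_z = E\{ Z_{t-1} g_t(\theta_0) g_t(\theta_0)' Z_{t-1}' \} \tag{2.20} $$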
since $E\{\mathcal{L}'_t(\theta_0, \eta) g_t(\theta_0)' Z_{t-1}'\} = -\Phi_z$ given sufficient regularity to allow differentiation under the integral sign. The weights $\Phi_z \Psi_z^{-1} Z_{t-1}$ can also be viewed as a linear approximation to the optimal conditional weights $A^0_{t-1} = \Phi_{t-1} \Psi_{t-1}^{-1}$. Put differently, $A^0_{t-1}$ would generally be a nonlinear function of $Z_{t-1}$ if $Z_{t-1}$ were the relevant conditioning information from the perspective of the econometrician.
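The efficiency comparison between conventional and optimal conditional weights can be made concrete with a scalar regression. The sketch below is a stylized example under strong assumptions that are not in the text: a single parameter, a single moment $g_t = y_t - x_t \beta$ with $x_t$ known at $t - 1$, and a conditional variance $\sigma_t^2 = x_t^2$ that is treated as known. Under these assumptions conventional GMM with $Z_{t-1} = x_t$ is least squares, while the optimal conditional weights are proportional to $x_t / \sigma_t^2$, which is weighted least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
beta = 1.5
x = rng.uniform(0.5, 2.0, size=T)            # x_t is assumed known at t-1
sigma2 = x ** 2                               # assumed (known) conditional variance of eps_t
y = beta * x + np.sqrt(sigma2) * rng.normal(size=T)

# Conventional GMM with Z_{t-1} = x_t: solves sum_t x_t (y_t - x_t b) = 0 (least squares).
b_conventional = (x @ y) / (x @ x)

# Optimal conditional weights: A_{t-1} proportional to Phi_{t-1} / Psi_{t-1} = x_t / sigma_t^2.
a_opt = x / sigma2
b_conditional = (a_opt @ y) / (a_opt @ x)

# Sample analogues of the asymptotic variances of sqrt(T)(b - beta).
avar_conventional = np.mean(x**2 * sigma2) / np.mean(x**2) ** 2
avar_conditional = 1.0 / np.mean(x**2 / sigma2)
```

By the Cauchy-Schwarz inequality the conditional variance can never exceed the conventional one, and the gap widens as the conditional variance becomes more variable over time, in line with the discussion of time variation in $\Phi_{t-1}$ and $\Psi_{t-1}$.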
3. Asset pricing relations and their econometric implications
Modern asset pricing theory follows from the restrictions on security prices that arise when markets do not permit arbitrage opportunities. That the absence of arbitrage implies substantive restrictions is somewhat surprising. Outside of international economics, it is not commonplace for the notion that two eggs should sell for the same price in the absence of transactions costs to yield meaningful economic restrictions on egg prices; after all, two eggs of equal grade and freshness are obviously perfect substitutes.$^8$ By contrast, the no-arbitrage assumption yields economically meaningful restrictions on asset prices because of the nature of close substitutes in financial markets. Different assets or, more generally, portfolios of assets may be perfect substitutes in terms of their random payoffs but this might not be obvious by inspection since the assets may represent claims on seemingly very different cash flows. The asset pricing implications of the absence of arbitrage have been elucidated in a number of papers including Rubinstein (1976), Ross (1978b), Harrison and Kreps (1979), Chamberlain and Rothschild (1983), and Hansen and Richard (1987).
Consider trade in a securities market on two dates: date $t - 1$ (i.e., today) and date $t$ (i.e., tomorrow). There are $N$ risky assets, indexed by $i = 1, \ldots, N$, which need not exhaust the asset menu available to investors. The nominal price of asset $i$ today is $P_{it-1}$. Its value tomorrow, that is, its price tomorrow plus any cash flow distribution between today and tomorrow, is uncertain from the perspective of today and takes on the random value $P_{it} + D_{it}$ tomorrow. Hence, its gross return (that is, one plus its percentage return) is given by $R_{it} = (P_{it} + D_{it})/P_{it-1}$. Finally, the one period riskless asset, if one exists, has the sure gross return $R_{ft} = 1/P_{ft-1}$ and $\iota$ always denotes a suitably conformable vector of ones.
8 This observation was translated into a lively diatribe by Summers (1985, 1986).
The market has two crucial elements: one environmental and one behavioral. First, the market is frictionless: trade takes place with no taxes, transactions costs, or other restrictions such as short sales constraints.$^{9}$ Second, investors vigorously exploit any arbitrage opportunities, behavior that is facilitated by the no-frictions assumption; that is, investors are delighted to make something for nothing and they can costlessly attempt to do so.

In order to illustrate the asset pricing implications of the absence of arbitrage, suppose that a finite number of possible states of nature $s = 1, \ldots, S$ can occur tomorrow and that the possible security values in these states are $P_{ist} + D_{ist}$.$^{10}$ Clearly, there can be at most $\min[N, S]$ portfolios with linearly independent payoffs. Hence, the prices of pure contingent claims (securities that pay one unit of account if state $s$ occurs and zero otherwise) are uniquely determined if $N \geq S$ and if there are at least $S$ assets with linearly independent payoffs. If $N < S$, the prices of such claims are not uniquely determined by arbitrage considerations alone, although they are restricted to lie in an $N$-dimensional subspace if the asset payoffs are linearly independent.

Let $\psi_{st-1}$ denote the price of a pure contingent claim that pays one unit of account if state $s$ occurs tomorrow and zero otherwise. These state prices are all positive so long as each state occurs with positive probability according to the beliefs of all investors. The price of any asset is the sum of the values of its payoffs state by state.$^{11}$ In particular:
$$P_{it-1} = \sum_{s=1}^{S}\psi_{st-1}(P_{ist} + D_{ist}); \qquad P_{ft-1} = \sum_{s=1}^{S}\psi_{st-1} \tag{3.1}$$
or, equivalently:
$$\sum_{s=1}^{S}\psi_{st-1}R_{ist} = 1; \qquad R_{ft}\sum_{s=1}^{S}\psi_{st-1} = 1. \tag{3.2}$$
Since they are nonnegative, scaling state prices so that they sum to one gives them all of the attributes of probabilities. Hence, these risk neutral probabilities:
9 Some frictions can be easily accommodated in the no-arbitrage framework but general frictions present nontrivial complications. For recent work that accommodates proportional transactions costs and short sales constraints, see Hansen, Heaton, and Luttmer (1993), He and Modest (1993), and Luttmer (1993).
10 The restriction to two dates involves little loss of generality as the abstract states of nature could just as easily index both different dates and states of nature. In addition, most of the results for finite $S$ carry over to the infinite dimensional case, although some technical issues arise in the limit of continuous trading. See Harrison and Kreps (1979) for a discussion.
11 The frictionless market assumption is implicit in this statement. In markets with frictions, the return of a portfolio of contingent claims would not be the weighted average of the returns on the component securities across states but would also depend on the trading costs or taxes incurred in this portfolio.
B. N. Lehmann
$$\pi^{*}_{st-1} = \frac{\psi_{st-1}}{\sum_{s=1}^{S}\psi_{st-1}} = \frac{\psi_{st-1}}{P_{ft-1}} = R_{ft}\psi_{st-1} \tag{3.3}$$
comprise the risk neutral martingale measure, so called because the price of any asset under these probability beliefs is given by:
$$P_{it-1} = \frac{1}{R_{ft}}\sum_{s=1}^{S}\pi^{*}_{st-1}(P_{ist} + D_{ist}) \tag{3.4}$$
that is, its expected present value. Risk neutral probabilities are one summary of the implications of the absence of arbitrage; they exist if and only if there is no arbitrage. This formulation of the state pricing problem is extremely convenient for pricing derivative claims. Under the risk neutral martingale measure, the riskless rate is the expected return of any asset or portfolio that does not change the span of the market and for which there is a deterministic mapping between its cash flows and states of nature.

However, it is not a convenient formulation for empirical purposes. Actual return data are generated according to the true (objective) probability measure. That is, actual returns are generated under rational expectations. Accordingly, let $\pi_{st-1}$ be the objective probability that state $s$ occurs at time $t$ given some arbitrary set of information available at time $t-1$ denoted by $I_{t-1}$. The reformulation of the pricing relations (3.1) and (3.2) in terms of state prices per unit probability $q_{st-1} = \psi_{st-1}/\pi_{st-1}$ reveals:
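$$P_{it-1} = \sum_{s=1}^{S}\pi_{st-1}q_{st-1}(P_{ist} + D_{ist}) = E[Q_t(P_{it} + D_{it}) \mid I_{t-1}]; \qquad P_{ft-1} = \sum_{s=1}^{S}\pi_{st-1}q_{st-1} = E[Q_t \mid I_{t-1}] \tag{3.5}$$

$$\sum_{s=1}^{S}\pi_{st-1}q_{st-1}R_{ist} = E[Q_tR_{it} \mid I_{t-1}] = 1; \qquad R_{ft}\sum_{s=1}^{S}\pi_{st-1}q_{st-1} = R_{ft}E[Q_t \mid I_{t-1}] = 1. \tag{3.6}$$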
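The chain from state prices to risk neutral probabilities and then to the pricing kernel can be checked numerically. The following is a minimal sketch, assuming a hypothetical market with three states and three assets; the state prices, objective probabilities, and payoffs are illustrative numbers that do not come from the text.

```python
# Hypothetical three-state, three-asset market; every number is an
# illustrative assumption, not data from the chapter.
state_prices = [0.30, 0.40, 0.25]      # psi_{s,t-1}, assumed strictly positive
objective_probs = [0.25, 0.50, 0.25]   # pi_{s,t-1}, assumed
payoffs = [                            # P_ist + D_ist, one row per asset
    [1.0, 1.0, 1.0],                   # riskless claim
    [0.5, 1.0, 2.0],
    [2.0, 1.0, 0.5],
]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# (3.1): each price is the state-price-weighted sum of payoffs.
prices = [dot(state_prices, row) for row in payoffs]
riskless_price = sum(state_prices)          # P_{f,t-1}
riskless_return = 1.0 / riskless_price      # R_ft

# (3.3): rescale state prices so that they sum to one.
risk_neutral_probs = [psi / riskless_price for psi in state_prices]

# State prices per unit probability, q = psi / pi.
kernel = [psi / pi for psi, pi in zip(state_prices, objective_probs)]

# Gross returns state by state.
gross_returns = [[x / p for x in row] for row, p in zip(payoffs, prices)]

# (3.4): discounted expected payoff under the risk neutral probabilities.
rn_prices = [dot(risk_neutral_probs, row) / riskless_return for row in payoffs]

# (3.6): E[Q R] under the objective probabilities, one value per asset.
kernel_moments = [
    sum(pi * q * r for pi, q, r in zip(objective_probs, kernel, row))
    for row in gross_returns
]

print(prices, rn_prices, kernel_moments)
```

Both routes recover the same prices, and every asset satisfies the moment condition with value one, which is the sense in which (3.4) and (3.6) are two descriptions of the same no-arbitrage restriction.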
At this level of generality, these conditional moment restrictions are the only implications of the hypothesis that markets are frictionless and that market prices are marked by the absence of arbitrage. Asset pricing theory endows these conditional moment conditions with empirical content through models for the pricing kernel $Q_t$ couched in terms of
potential observables.$^{12}$ Such models equate the state price per unit probability $q_{st-1}$, the cost per unit probability of receiving one unit of account in state $s$, with some corresponding measure of the marginal benefit of receiving one unit of account in state $s$.$^{13}$ Most equilibrium models equate $Q_t$, adjusted for inflation, with the intertemporal marginal rate of substitution of a hypothetical, representative optimizing investor.$^{14}$ The most common formulation is additively separable, constant relative risk aversion preferences for which $Q_t = \rho(c_t/c_{t-1})^{-\gamma}$ where $\rho$ is the rate of time preference, $c_t/c_{t-1}$ is the rate of consumption growth, and $\gamma$ is the coefficient of relative risk aversion, all for the representative agent.$^{15}$ Accordingly, let $\underline{x}_t$ denote the relevant observables that characterize these marginal benefits in some asset pricing model. Hence, pricing kernel models take the general form:
$$Q_t = Q(\underline{x}_t, \underline{\theta}_Q); \qquad Q_t > 0; \qquad \underline{x}_t \in I_t \tag{3.7}$$
where $\underline{\theta}_Q$ is a vector of unknown parameters. To be sure, the parametric component can be further weakened in settings where it is possible to estimate the function $Q(\cdot)$ nonparametrically given only observations on $\underline{R}_t$ and $\underline{x}_t$. However, the bulk of the literature involves models of the form (3.7).$^{16}$ Equations (3.5) through (3.7) are what make asset pricing theory inherently semiparametric.$^{17}$ The parametric component of these asset pricing relations is a
12 It is also possible to identify the pricing kernel nonparametrically with the returns of particular portfolios. For example, the return of the growth optimal portfolio, which solves $\max E\{\ln \underline{w}'_{t-1}\underline{R}_t \mid I_{t-1}\}$ with $\underline{w}_{t-1} \in I_{t-1}$, is equal to $Q_t^{-1}$. Of course, it is hard to solve this maximum problem without parametric distributional assumptions. See Bansal and Lehmann (1955) for an application to the term structure of interest rates. The addition of observables can serve to identify payoff relevant states, giving nonparametric estimation a somewhat semiparametric flavor. Put differently, the econometrician typically observes a sequence of returns without information on which states have been realized; the vector $\underline{x}_t$ provides an indicator of the payoff relevant state of nature realized at time $t$ that helps identify similar outcomes (i.e., states with similar state prices per unit probability). Bansal and Viswanathan (1993) estimate a model along these lines.
13 The marginal benefit side of this equation rationalizes the peculiar dating convention for $Q_t$ when it is equal to the time $t-1$ state price per unit probability.
14 Embedding inflation in $Q_t$ eliminates the need for separate notation for real and nominal pricing kernels. That is, $Q_t$ is equal to $Q_t^{real}P_{ct}/P_{ct-1}$ where $P_{ct}$ is an appropriate index for translating real cash flows and the real pricing kernel $Q_t^{real}$ into nominal cash flows and kernels.
15 More general models allow for multiple goods and nonseparability of preferences in consumption over time and states, as would arise from durability in consumption goods and from preferences marked by habit formation and nonexpected utility maximization. Constantinides and Ferson (1991) summarize much of the durability and habit formation literatures, both theoretically and empirically. See Epstein and Zin (1991a) and Epstein and Zin (1991b) for similar models for $Q_t$ which do not impose state separability. Cochrane (1991) exploits the corresponding marginal conditions for producers.
16 Exceptions include Bansal and Viswanathan (1993) and the linear model $Q_t = \underline{\omega}'_{t-1}\underline{x}_t$ with $\underline{\omega}_{t-1}$ unobserved, a model discussed in the next section.
17 To be sure, the econometrician could specify a complete parametric probability model for asset returns and such models figure prominently in asset pricing theory. Examples include the Capital Asset Pricing Model (CAPM) when it is based on normally distributed returns and the family of continuous time intertemporal asset pricing models when prices are assumed to follow Itô processes.
model for the pricing kernel $Q(\underline{x}_t, \underline{\theta}_Q)$. The conditional moment conditions (3.6) can then be used to identify any unknown parameters in the model for $Q_t$ and to test its overidentifying restrictions without additional distributional assumptions.

Note also that the structure of asset pricing theory confers an obvious econometric simplification. The constructed variables $Q_tR_{it} - 1$ constitute a martingale difference sequence and, hence, are serially uncorrelated. This fact greatly simplifies the calculation of the second moments of sample analogues of (3.6), which in turn simplifies estimation and inference.$^{18}$

Moreover, the economics of these relations constrains how these conditional moment restrictions can be used for estimation and inference. Ross (1978b) observed that portfolios are the only derivative assets that can be priced solely as a function of observables, time, and primary asset values given only the absence of arbitrage opportunities in frictionless markets. The same is true for econometricians: for a given asset menu, the econometrician knows only the prices and payoffs of portfolios with weights $\underline{w}_{t-1} \in I_{t-1}$. Hence, only linear combinations of the conditional moment conditions based on information available at time $t-1$ can be used to estimate the model. Accordingly, in the absence of distributional restrictions, the econometrician must base estimation and inference on estimators of the form:
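$$\frac{1}{T}\sum_{t=1}^{T}A_{t-1}[\underline{R}_tQ(\underline{x}_t, \hat{\underline{\theta}}_Q) - \underline{\iota}] = \underline{0}; \qquad A_{t-1} \in I_{t-1} \tag{3.8}$$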
where $A_{t-1}$ is a sequence of $p \times N$ $O_p(1)$ matrices chosen by the econometrician and $p$ is the number of elements in $\underline{\theta}_Q$. The matrices $A_{t-1}$ can be interpreted as the weights of $p$ portfolios with random payoffs $A_{t-1}\underline{R}_t$ that cost $A_{t-1}\underline{\iota}$ units of account.

How would a financial econometrician choose $A_{t-1}$? An econometrician who favors likelihood methods for their desirable asymptotic properties might prefer the $p$ portfolios with maximal conditional correlation with the true, but unknown, conditional score. In this application, the conditional projection of $\mathcal{L}'_t(\underline{\theta}_0, \underline{\eta})$ on $[\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota}]$ is given by:
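$$\hat{\mathcal{L}}'_t(\underline{\theta}_0, \underline{\eta}) = \mathrm{Cov}[\mathcal{L}'_t(\underline{\theta}_0, \underline{\eta}),\ \underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota} \mid I_{t-1}]\ \mathrm{Var}[\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]^{-1}[\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota}] = -\Phi_{t-1}\Psi_{t-1}^{-1}[\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota}];$$

$$\Phi_{t-1} = E\left[\frac{\partial Q(\underline{x}_t, \underline{\theta}_Q)}{\partial\underline{\theta}_Q}\underline{R}'_t \,\Big|\, I_{t-1}\right]; \qquad \Psi_{t-1} = \mathrm{Var}[\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}] \tag{3.9}$$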
18 This observation fails if returns and $Q_t$ are sampled more than once per period. For example, consider the two period total return (i.e., with full reinvestment of intermediate cash flows) $R_{it,t+1} = R_{it}R_{it+1}$, which satisfies the two period moment condition $E[Q_tQ_{t+1}R_{it,t+1} \mid I_{t-1}] = 1$. In this case, the constructed random variable $Q_tQ_{t+1}R_{it,t+1} - 1$ follows a first order moving average process. See Hansen and Hodrick (1980) and Hansen, Heaton, and Ogaki (1988) for more complete discussions.
since $E\{\mathcal{L}'_t(\underline{\theta}_0, \underline{\eta})[\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota}]' \mid I_{t-1}\} = -\Phi_{t-1}$ given sufficient regularity to permit differentiation under the integral sign. The $p$ portfolios with payoffs $\Phi_{t-1}\Psi_{t-1}^{-1}\underline{R}_t$ that cost $\Phi_{t-1}\Psi_{t-1}^{-1}\underline{\iota}$ units of account have no obvious optimality properties from the perspective of prospective investors. However, they are definitely optimal from the perspective of financial econometricians: they are the optimal hedge portfolios for the conditional score of the true, but unknown, log likelihood function. Put differently, the economics and the econometrics coincide here. The econometrician can only observe conditional linear combinations of the conditional moment conditions and seeks portfolios whose payoffs provide information about the parameters of the pricing kernel $Q(\underline{x}_t, \underline{\theta}_Q)$. The optimal portfolio weights are $\Phi_{t-1}\Psi_{t-1}^{-1}$ and the payoffs $\Phi_{t-1}\Psi_{t-1}^{-1}\underline{R}_t$ maximize the information content of each observation, resulting in an incremental contribution of $\Phi_{t-1}\Psi_{t-1}^{-1}\Phi'_{t-1}$ to the information about $\underline{\theta}_Q$. In other words, the Fisher information matrix of the true score is $\Phi_{t-1}\Psi_{t-1}^{-1}\Phi'_{t-1} + C$ and the positive semidefinite matrix $C$ is the smallest such matrix produced by linear combinations of the conditional moment conditions.

This development conceals a host of implementation problems associated with the evaluation of conditional expectations.$^{19}$ To be sure, $\Phi_{t-1}$ and $\Psi_{t-1}$ can be estimated with nonparametric methods when they are time invariant functions $\Phi(\underline{z}_{t-1})$ and $\Psi(\underline{z}_{t-1})$ for $\underline{z}_{t-1} \in I_{t-1}$. The extension of the methods of Robinson (1987), Newey (1990), Robinson (1991), and Newey (1993) to the present setting, in which $\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota}$ is serially uncorrelated but not independently distributed over time or homoskedastic, appears to be straightforward. However, the circumstances in which $A^0_{t-1}$ is a time invariant function of $\underline{z}_{t-1}$ would appear to be the exception rather than the rule.
Accordingly, the econometrician generally must place further restrictions on the no-arbitrage pricing model in order to proceed with efficient estimation based on conditional moment restrictions, a subject that occupies the next section.

Alternatively, the econometrician can work with weaker moment conditions like the unconditional moment restrictions. The analysis of this case parallels that of optimal conditional GMM. Once again, the fixed weight matrices $A_T(\underline{\theta}_0)$ from (2.10) are the weights of $p$ portfolios with random payoffs $A_T(\underline{\theta}_0)\underline{R}_t$ that cost $A_T(\underline{\theta}_0)\underline{\iota}$ units of account. As noted in the previous section, the price of these random payoffs is $\Phi\Psi^{-1}\underline{\iota}$, which generally differs from $E(A^0_{t-1})\underline{\iota}$. These portfolios produce the fixed weight moment condition that has maximum unconditional correlation with the derivatives of the true, but unknown, log likelihood function.
19 The nature of the information set itself is less of an issue. While investors might possess more information than econometricians, this is not a problem because the law of iterated expectations implies that $E[R_{it}Q_t \mid I^{E}_{t-1}] = 1\ \forall\ I^{E}_{t-1} \subseteq I_{t-1}$. Of course, the conditional probabilities $\pi^{E}_{st-1}$ implicit in this moment condition generally differ from those implicit in $E[R_{it}Q_t \mid I_{t-1}] = 1$, as will the associated values of the pricing kernel (i.e., $q^{E}_{st-1} = \psi_{st-1}/\pi^{E}_{st-1}$). The dependence of $Q^{E}_t$ on $\pi^{E}_{st-1}$ is broken in models for $Q_t$ that equate the state price per unit probability $q_{st-1}$ with the marginal benefit of receiving one unit of account in state $s$.
Of course, conventional GMM implementations use conditioning information within the optimal unconditional GMM procedure as discussed in the previous section. Let $Z_{t-1} \in I_{t-1}$ denote an $r \times N$ matrix of predetermined variables and consider the revised moment conditions:

$$E[Z_{t-1}(\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota})] = \underline{0} \tag{3.10}$$
In the preceding paragraph, $Z_{t-1}$ is $I_N$, the $N \times N$ identity matrix; otherwise, it could reflect identical or different elements of the information set available to investors (i.e., $\underline{z}_{t-1}$ in $I_N \otimes \underline{z}_{t-1}$ and $\underline{z}_{it-1}$ in (2.19), respectively) being applied to each element of $\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota}$ as given in the previous section. The introduction of $\underline{z}_{it-1}$ and $\underline{z}_{t-1}$ into the unconditional moment condition (3.10) is often described as invoking trading strategies in estimation and inference following Hansen and Jagannathan (1991) and Hansen and Jagannathan (1994). This characterization arises because security returns are given different weights temporally and, when $\underline{z}_{it-1} \neq \underline{z}_{t-1}$, cross-sectionally after the fashion of an active investor. In unconditional GMM, the returns weighted in this fashion are then aggregated into $p$ portfolios with weights that are refined as information is added to (3.10) in the form of additional components of $Z_{t-1}$.

Once again, there is an optimal fixed weight portfolio strategy for the revised moment conditions based on $Z_{t-1}(\underline{R}_tQ(\underline{x}_t, \underline{\theta}_Q) - \underline{\iota})$. From (2.20), the active portfolio strategy with portfolio weights $\Phi_Z\Psi_Z^{-1}Z_{t-1}$ has random payoffs $\Phi_Z\Psi_Z^{-1}Z_{t-1}\underline{R}_t$ and costs $\Phi_Z\Psi_Z^{-1}Z_{t-1}\underline{\iota}$ units of account. The resulting moment conditions have the largest unconditional correlation with the true, but unknown, unconditional score in finite samples within the class of time varying portfolios with weights that are fixed linear combinations of predetermined variables $Z_{t-1}$. Of course, optimal conditional weights can be obtained from the appropriate reformulation of (3.9) above, but the whole point of this approach is that the implementation of this linear approximation to the optimal procedure is straightforward.
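The sample analogue of moment conditions such as (3.8) and (3.10) can be illustrated with the constant relative risk aversion kernel $Q_t = \rho(c_t/c_{t-1})^{-\gamma}$. The sketch below rests on several assumptions that are not in the text: the consumption growth series is made up, $\rho$ is treated as known, there is a single asset whose return is set equal to $1/Q_t$ (the growth optimal return) so that the moment condition holds exactly at the true $\gamma$, and the instruments are a constant and lagged consumption growth. The helper names `sample_moments` and `solve_gamma` are hypothetical.

```python
RHO = 0.98          # rate of time preference, assumed known
GAMMA_TRUE = 2.0    # risk aversion used to construct the illustrative data

# Made-up gross consumption growth rates c_t / c_{t-1}, all above one.
growth = [1.010, 1.025, 1.005, 1.030, 1.015, 1.020, 1.012, 1.028]


def kernel(g, gamma, rho=RHO):
    """CRRA pricing kernel Q_t = rho * (c_t / c_{t-1}) ** (-gamma)."""
    return rho * g ** (-gamma)


# One asset whose return equals 1/Q_t at the true parameter, so that
# R_t * Q_t - 1 is identically zero there (an assumption for illustration).
returns = [1.0 / kernel(g, GAMMA_TRUE) for g in growth]


def sample_moments(gamma):
    """Sample average of z_{t-1} * (R_t Q_t - 1) for z_{t-1} = (1, lagged growth)."""
    n_obs = len(growth) - 1
    g_const, g_lag = 0.0, 0.0
    for t in range(1, len(growth)):
        err = returns[t] * kernel(growth[t], gamma) - 1.0
        g_const += err / n_obs
        g_lag += growth[t - 1] * err / n_obs
    return g_const, g_lag


def solve_gamma(lo=0.0, hi=10.0, tol=1e-10):
    """Bisection on the first moment, which is decreasing in gamma here."""
    f_lo = sample_moments(lo)[0]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = sample_moments(mid)[0]
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


gamma_hat = solve_gamma()
print(gamma_hat, sample_moments(gamma_hat), sample_moments(1.0))
```

With one parameter and two instruments the system is overidentified; the sketch solves only the first moment and then inspects the second, which is the simplest stand-in for choosing a weighting matrix. An efficient procedure would instead combine the moments with weights of the form discussed above.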
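Semiparametric methods for asset pricing models

$$\underline{R}_t = \underline{\alpha}_t + \underline{\beta}_tQ(\underline{x}_t, \underline{\theta}_Q) + \underline{\epsilon}_t; \qquad \underline{\beta}_t = \frac{\mathrm{Cov}[\underline{R}_t, Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]}{\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]}; \qquad E[\underline{\epsilon}_t \mid I_{t-1}] = \underline{0} \tag{4.1}$$

and $\mathrm{Var}[\cdot]$ and $\mathrm{Cov}[\cdot]$ denote the variance and covariance of their arguments, respectively. Asset pricing theory restricts the intercept vector $\underline{\alpha}_t$ in this projection, which is determined by substituting (4.1) into the moment condition (3.6):

$$\underline{\alpha}_t = \underline{\iota}\lambda_{0t} - \underline{\beta}_t\lambda_{Qt} \tag{4.2}$$

$$\underline{R}_t = \underline{\iota}\lambda_{0t} + \underline{\beta}_t[Q(\underline{x}_t, \underline{\theta}_Q) - \lambda_{Qt}] + \underline{\epsilon}_t; \qquad \lambda_{0t} = E[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]^{-1}; \qquad \lambda_{Qt} = \lambda_{0t}E[Q(\underline{x}_t, \underline{\theta}_Q)^2 \mid I_{t-1}] \tag{4.3}$$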
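The intercept restriction can be verified in a small finite-state example: with $\lambda_{0t} = 1/E[Q_t \mid I_{t-1}]$ and $\lambda_{Qt} = \lambda_{0t}E[Q_t^2 \mid I_{t-1}]$, the projection intercept of any return satisfying $E[Q_tR_{it} \mid I_{t-1}] = 1$ equals $\lambda_{0t} - \beta_{it}\lambda_{Qt}$. The following sketch assumes a hypothetical three-state world; the probabilities, kernel values, and payoffs are illustrative only.

```python
# Hypothetical three-state example; all numbers are illustrative assumptions.
probs = [0.25, 0.50, 0.25]     # objective probabilities
kernel = [1.2, 0.8, 1.0]       # pricing kernel Q in each state
payoffs = [
    [1.0, 1.0, 1.0],           # riskless claim
    [0.5, 1.0, 2.0],
    [2.0, 1.0, 0.5],
]


def mean(values):
    """Expectation under the objective probabilities."""
    return sum(p * v for p, v in zip(probs, values))


# Price each payoff as E[Q * payoff], so every gross return has E[Q R] = 1.
returns = []
for row in payoffs:
    price = mean([q * x for q, x in zip(kernel, row)])
    returns.append([x / price for x in row])

mean_q = mean(kernel)
mean_q2 = mean([q * q for q in kernel])
var_q = mean_q2 - mean_q ** 2
lam0 = 1.0 / mean_q            # return on the riskless (zero beta) asset
lam_q = lam0 * mean_q2         # premium parameter tied to the kernel

betas, alphas, expected = [], [], []
for r in returns:
    mean_r = mean(r)
    cov_rq = mean([ri * q for ri, q in zip(r, kernel)]) - mean_r * mean_q
    beta = cov_rq / var_q
    betas.append(beta)
    alphas.append(mean_r - beta * mean_q)   # intercept of the projection on Q
    expected.append(mean_r)

for a, b in zip(alphas, betas):
    print(a, lam0 - b * lam_q)              # the two columns should agree
```

The same numbers also show the linearity of expected returns in covariances with the kernel: each expected return equals `lam0 - lam0 * var_q * beta`, so assets that covary positively with $Q_t$ earn less than the riskless return.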
The riskless asset, if one exists, earns $\lambda_{0t}$; otherwise, $\lambda_{0t}$ is the expected return of all assets with returns uncorrelated with $Q_t$. As noted earlier, the lack of serial correlation in the residual vector $\underline{\epsilon}_t$ is econometrically convenient.

The bilinear form of (4.3) is a distinguishing characteristic of these beta pricing models. Put differently, the moment conditions (3.6) constrain expected returns to be linear in the covariances of returns with the pricing kernel. This linear structure is a central feature of all models based on the absence of arbitrage in frictionless markets; that is, the portfolio with returns that are maximally correlated with $Q_t$ is conditionally mean-variance efficient.$^{20}$ Hence, these asset pricing relations differ from semiparametric multivariate regression models in their restrictions on risk premiums like $\lambda_{Qt}$ and $\lambda_{0t}$.$^{21}$

The multivariate representation of these no-arbitrage models produces a somewhat different, though arithmetically equivalent, description of efficient GMM estimation. The estimator is based on the moment conditions:
$$\frac{1}{T}\sum_{t=1}^{T}A_{\beta t-1}\underline{\epsilon}_t = \underline{0}; \tag{4.4}$$
and, after solving in terms of the expressions for $\lambda_{0t}$ and $\lambda_{Qt}$ (in particular, that $E[Q(\underline{x}_t, \underline{\theta}_Q) - \lambda_{Qt} \mid I_{t-1}] = -\lambda_{0t}\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$) and given sufficient regularity to allow differentiation under the integral sign, the optimal choice of $A_{\beta t-1}$ is:
20 A portfolio is (conditionally) mean-variance efficient if it minimizes (conditional) variance for a given level of (conditional) mean return. A portfolio is (conditionally) mean-variance efficient for a given set of assets if and only if the (conditional) expected returns of all assets in the set are linear in their (conditional) covariances with the portfolio. See Merton (1972), Roll (1977), and Hansen and Richard (1987).
21 They differ in at least one other respect: most regression specifications with serially uncorrelated errors have $E[\underline{\epsilon}_t \mid Q_t] = \underline{0}$, which need not be satisfied by (4.3).
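$$A^0_{\beta t-1} = \Phi_{\beta t-1}\Psi_{\beta t-1}^{-1}; \qquad \Psi_{\beta t-1} = E[\underline{\epsilon}_t\underline{\epsilon}'_t \mid I_{t-1}];$$

$$\Phi_{\beta t-1} = E\left\{\frac{\partial\underline{\epsilon}'_t}{\partial\underline{\theta}_Q}\,\Big|\, I_{t-1}\right\} = \lambda_{0t}\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]\frac{\partial\underline{\beta}'_t}{\partial\underline{\theta}_Q} + \lambda_{0t}\frac{\partial\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]}{\partial\underline{\theta}_Q}\underline{\beta}'_t - \frac{\partial\lambda_{0t}}{\partial\underline{\theta}_Q}\left[\underline{\iota} - \underline{\beta}_t\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]\right]'$$

$$= \lambda_{0t}\frac{\partial\mathrm{Cov}[\underline{R}_t, Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]'}{\partial\underline{\theta}_Q} - \frac{\partial\lambda_{0t}}{\partial\underline{\theta}_Q}\left[\underline{\iota} - \mathrm{Cov}[\underline{R}_t, Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]\right]' \tag{4.5}$$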
The last line in the expression for $\Phi_{\beta t-1}$ illustrates the relations with (3.9) in the previous section. Note that the observation of the riskless rate eliminates the term involving $\partial\lambda_{0t}/\partial\underline{\theta}_Q$.$^{22}$

There is no generic advantage to casting no-arbitrage models in this beta pricing form unless the econometrician is willing to make additional assumptions about the stochastic processes followed by returns.$^{23}$ As is readily apparent, there are only three places where useful restrictions can be placed on beta pricing models: (1) constraints on the behavior of the conditional betas, (2) additional restrictions on the model $Q(\underline{x}_t, \underline{\theta}_Q)$, and (3) restrictions on the regression residuals. We discuss each of these in turn in Sections 4.1-4.3, and these ingredients are combined in Section 4.4.
22 In the case of risk neutral pricing, $\Phi_{\beta t-1}$ collapses to $-(\partial\lambda_{0t}/\partial\underline{\theta})\underline{\iota}'$ since $\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$ is zero, and to zero if, in addition, the econometrician measures the riskless rate.
23 The law of iterated expectations does not apply to the second moments in these multivariate regression models, so this representation alone does nothing to sharpen unconditional GMM estimation. Additional covariances are introduced in the passage from conditional to unconditional moments because of the bilinear form of beta pricing models. The unconditional moment condition for security $i$ is $E[\epsilon_{it}\underline{z}_{it-1} \mid I_{t-1}] = E[\epsilon_{it}\underline{z}_{it-1}] = \underline{0}\ \forall\ \underline{z}_{it-1} \in I_{t-1}$, and the sum of the two offending covariances $\mathrm{Cov}\{\beta_{it}E[Q(\underline{x}_t, \underline{\theta}) - \lambda_{Qt} \mid I_{t-1}], \underline{z}_{it-1}\} + \mathrm{Cov}\{\beta_{it}, E[Q(\underline{x}_t, \underline{\theta}) - \lambda_{Qt} \mid I_{t-1}]\}E[\underline{z}_{it-1}]$ cannot be separated without further restrictions.
Accordingly, suppose the econometrician observes a set of variables $\underline{z}_{t-1} \in I_{t-1}$, perhaps also contained in $\underline{x}_t$ (i.e., $\underline{z}_{t-1} \subset \underline{x}_t$), and specifies a model of the form:
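$$\underline{\beta}_t = \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta); \qquad \underline{z}_{t-1} \in I_{t-1} \tag{4.6}$$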
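where $\underline{\theta}_\beta$ is the vector of unknown parameters in the model for $\underline{\beta}_t$. In these circumstances, the beta pricing model becomes:

$$\underline{R}_t = \underline{\iota}\lambda_{0t} + \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)[Q(\underline{x}_t, \underline{\theta}_Q) - \lambda_{Qt}] + \underline{\epsilon}_t \tag{4.7}$$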
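In the most common form of this model, the conditional betas are constant, $\underline{z}_{t-1}$ is simply the scalar 1, and $\underline{\theta}_\beta$ is the corresponding vector of constant conditional betas $\underline{\beta}$. All serial correlation in returns is mediated through the risk premiums given constant conditional betas.$^{24}$

Models for conditional betas make efficient GMM estimation more feasible by refining the optimal weighting matrices since:

$$\Phi_{\beta t-1} = E\left\{\frac{\partial\underline{\epsilon}'_t}{\partial\underline{\theta}}\,\Big|\, I_{t-1}\right\} = \lambda_{0t}\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]\frac{\partial\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)'}{\partial\underline{\theta}}$$

$$+\ \lambda_{0t}\frac{\partial\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]}{\partial\underline{\theta}} \times \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)'$$

$$-\ \frac{\partial\lambda_{0t}}{\partial\underline{\theta}}\left[\underline{\iota} - \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]\right]' \tag{4.8}$$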
where, as before, an observed riskless rate eliminates the last line of (4.8). Since the parameter vector $\underline{\theta}$ is $(\underline{\theta}'_Q\ \underline{\theta}'_\beta)'$, $\Phi_{\beta t-1}$ in (4.8) and $\Phi_{\beta t-1}$ in (4.5) differ in two respects:
) ( OCov[Q(xt'OOQ )'Rtllti]t
ao
"~'
(4.9)
24 Linear models of the form $\beta_{it} = \underline{\theta}'_{i\beta}S_{i\beta}\underline{z}_{t-1}$ are also common, where $S_{i\beta}$ is a selection matrix that picks the elements of $\underline{z}_{t-1}$ relevant for $\beta_{it}$. Linear models for conditional betas naturally arise when the APT holds both conditionally and unconditionally (cf., Lehmann (1992)). Some commercial risk management models allow $\underline{\theta}_{i\beta}$ to vary both across securities and over time; see Rosenberg (1974) and Rosenberg and Marathe (1979) for early examples. Error terms can be added to these conditional beta models when their residuals are orthogonal to the instruments $\underline{z}_{t-1} \in I_{t-1}$. Nonlinear models can be thought of as specifications of the relevant components of $\Phi_{\beta t-1}$ by the econometrician.
A tedious calculation using partitioned matrix inversion verifies that the variance of the efficient GMM estimator of $\underline{\theta}_Q$ falls after the imposition of the conditional beta model, both because of the reduction in dimensionality in the transition from the derivatives of $\mathrm{Cov}[Q(\underline{x}_t, \underline{\theta}_Q), \underline{R}_t \mid I_{t-1}]$ to the derivatives of $\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$ in the first line of (4.9) and because of the additional moment conditions arising from the conditional beta model in the second line of (4.9). Hence, the problem of constructing estimates of the covariances between returns and the derivatives of the pricing kernel in (3.9) is replaced by the somewhat simpler problem of estimating the conditional variance of the pricing kernel along with its derivatives in these models. Both formulations require estimation of the conditional mean of $Q(\underline{x}_t, \underline{\theta}_Q)$ and its derivatives through $\lambda_{0t}$, a requirement eliminated by observation of a riskless asset. While stochastic process assumptions are required to compute $E[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$, $\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$, and their derivatives, a conditional beta model and, when possible, measurement of the riskless rate simplify efficient GMM estimation considerably.$^{25}$

Note also that the optimal conditional weighting matrix $\Phi_{\beta t-1}\Psi_{\beta t-1}^{-1}$ has a portfolio interpretation similar to that in the last section. The portfolio interpretation in this case has a long standing tradition in financial econometrics. Ignoring scale factors, the portfolio weights associated with the estimation of the premium $\lambda_{Qt}$ are proportional to $\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)$. Similarly, the portfolio weights associated with the estimation of $\lambda_{0t}$ are proportional to $\underline{\iota} - \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)$ after scaling $\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$ to equal one, as is appropriate when the econometrician observes the return of a portfolio perfectly correlated with $Q_t$ but not a model for $Q_t$ itself (a case discussed briefly below).
Such procedures have been used assuming returns are independently and identically distributed with constant betas beginning with Douglas (1968) and Lintner (1965) and maturing into a widespread tool in Black, Jensen, and Scholes (1972), Miller and Scholes (1972), and Fama and MacBeth (1973). Shanken (1992) provides a comprehensive and rigorous description of the current state of the art for the independently and identically distributed case.

Models for the determinants of conditional betas have another use: they make it possible to identify aspects of the no-arbitrage model without an explicit model for the pricing kernel $Q_t$. Given only $\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)$, expected returns are given by:

$$E[\underline{R}_t \mid I_{t-1}] = \underline{\iota}\lambda_{0t} + \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)[\lambda_{pt} - \lambda_{0t}] \tag{4.10}$$
The potentially estimable conditional risk premiums $\lambda_{0t}$ and $\lambda_{pt}$ are the expected returns of conditionally mean-variance efficient portfolios since the expected returns on the assets in this menu are linear in their conditional betas.$^{26}$ However,
25 The presence of $\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$ and its derivatives in (4.8) arises because (4.6) is a model for conditional betas, not for conditional covariances. In most applications, conditional beta models are more appropriate.
26 The CAPM is the best known model which takes this form, in which portfolio $p$ is the market portfolio of all risky assets. The market portfolio return is maximally correlated with aggregate wealth (which is proportional to $Q_t$ in this model) in the CAPM in general; it is perfectly correlated if markets are complete.
these parameters are also the expected returns of any assets or portfolios that cost one unit of account and have conditional betas of one and zero, respectively. Portfolios constructed to have given betas are often called mimicking or basis portfolios in the literature.$^{27}$ Mimicking portfolios arise in the portfolio interpretation of efficient conditional GMM estimation in this case and delimit what can be learned from conditional beta models alone. Given only the beta model (4.6):
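$$\underline{R}_t = \underline{\iota}\lambda_{0t} + \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)[\lambda_{pt} - \lambda_{0t}] + \underline{\epsilon}_{\beta pt}; \qquad \Psi_{\beta pt-1} = E[\underline{\epsilon}_{\beta pt}\underline{\epsilon}'_{\beta pt} \mid I_{t-1}];$$

$$\Phi_{\beta pt-1} = -\left\{(\lambda_{pt} - \lambda_{0t})\frac{\partial\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)'}{\partial\underline{\theta}} + \frac{\partial\lambda_{0t}}{\partial\underline{\theta}}[\underline{\iota} - \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)]' + \frac{\partial\lambda_{pt}}{\partial\underline{\theta}}\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)'\right\} \tag{4.11}$$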
Note that if we treat the risk premiums as unknown parameters in each period, the limiting parameter space is infinite dimensional. Ignoring this obvious problem, the optimal conditional moment restrictions are given by:
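$$\frac{1}{T}\sum_{t=1}^{T}\Phi_{\beta pt-1}\Psi_{\beta pt-1}^{-1}[\underline{R}_t - \underline{\iota}\lambda_{0t} - \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)(\lambda_{pt} - \lambda_{0t})] = \underline{0} \tag{4.12}$$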
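and the solution for each $\lambda_{0t}$ and $\lambda_{pt} - \lambda_{0t}$ is:

$$\begin{pmatrix}\hat{\lambda}_{0t} \\ \hat{\lambda}_{pt} - \hat{\lambda}_{0t}\end{pmatrix} = \left[(\underline{\iota}\ \ \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta))'\Psi_{\beta pt-1}^{-1}(\underline{\iota}\ \ \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta))\right]^{-1}(\underline{\iota}\ \ \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta))'\Psi_{\beta pt-1}^{-1}\underline{R}_t \tag{4.13}$$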
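27 See Grinblatt and Titman (1987), Huberman, Kandel, and Stambaugh (1987), Lehmann (1987), Lehmann and Modest (1988), Lehmann (1990), and Shanken (1992) for related discussions. In econometric terms, the portfolio weights that implicitly arise in cross-sectional regression models with arbitrary matrices $\Gamma$ solve the programming problems: $\min_{\underline{w}_{\Gamma pt-1}} \underline{w}'_{\Gamma pt-1}\Gamma\underline{w}_{\Gamma pt-1}$ and $\min_{\underline{w}_{\Gamma 0t-1}} \underline{w}'_{\Gamma 0t-1}\Gamma\underline{w}_{\Gamma 0t-1}$, subject to the corresponding constraints on portfolio cost and conditional beta.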
which are, in fact, the actual, not the expected, returns of portfolios that cost one and zero units of account and that have conditional betas of zero and one, respectively.

Hence, there are three related limitations on what can be measured from risky asset returns given only a conditional beta model. First, the conditional beta model is identified only up to scale: $\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)(\lambda_{pt} - \lambda_{0t})$ is observationally equivalent to $\varphi\underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)(\lambda_{pt} - \lambda_{0t})/\varphi$ for any $\varphi \neq 0$. Second, the portfolio returns $\hat{\lambda}_{0t}$ and $\hat{\lambda}_{pt} - \hat{\lambda}_{0t}$ have expected returns $\lambda_{0t}$ and $\lambda_{pt} - \lambda_{0t}$, respectively, but the expected returns can only be recovered with an explicit time series model for $E[\underline{R}_t \mid I_{t-1}]$.$^{28}$ Third, the pricing kernel $Q_t$ cannot be recovered from this model: only $R_{pt}$, the return of the portfolio of these $N$ risky assets that is maximally correlated with $Q_t$, can be identified from $\hat{\lambda}_{pt}$ in the limit (i.e., as $\underline{\beta}(\underline{z}_{t-1}, \hat{\underline{\theta}}_\beta) \to \underline{\beta}(\underline{z}_{t-1}, \underline{\theta}_\beta)$).
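The claim that the period-by-period estimates are realized returns on mimicking portfolios can be illustrated with a cross-sectional regression of returns on a constant and the betas. The sketch below assumes made-up betas and returns for five assets and uses ordinary least squares, that is, an identity matrix in place of the residual covariance matrix; that substitution is an assumption made for brevity, not the efficient weighting.

```python
# Made-up conditional betas and one period of gross returns for five assets.
betas = [0.5, 0.8, 1.0, 1.3, 1.6]
returns = [1.02, 1.05, 1.04, 1.09, 1.10]

n_assets = len(betas)
sum_b = sum(betas)
sum_bb = sum(b * b for b in betas)
det = n_assets * sum_bb - sum_b ** 2     # determinant of X'X for X = [1, beta]

# Rows of (X'X)^{-1} X' written out for the two-regressor case.
w_zero_beta = [(sum_bb - sum_b * b) / det for b in betas]      # intercept row
w_unit_beta = [(n_assets * b - sum_b) / det for b in betas]    # slope row

# The regression coefficients are the realized returns on these portfolios.
lam0_hat = sum(w * r for w, r in zip(w_zero_beta, returns))
premium_hat = sum(w * r for w, r in zip(w_unit_beta, returns))

# Cost and beta of each implied portfolio.
cost_zero = sum(w_zero_beta)
beta_zero = sum(w * b for w, b in zip(w_zero_beta, betas))
cost_unit = sum(w_unit_beta)
beta_unit = sum(w * b for w, b in zip(w_unit_beta, betas))

print(lam0_hat, premium_hat)
print(cost_zero, beta_zero, cost_unit, beta_unit)
```

The intercept row costs one unit of account and has a beta of zero, while the slope row costs nothing and has a beta of one. Rescaling the betas by any nonzero constant rescales the slope portfolio inversely and leaves fitted returns unchanged, which is the scale indeterminacy noted in the text.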
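$$Q_t = \underline{\omega}'_{xt-1}\underline{x}_t + \underline{\omega}'_{mt-1}\underline{R}_{mt} \tag{4.14}$$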
where $\underline{x}_t$ is a vector of variables that are not asset returns while $\underline{R}_{mt}$ is a vector of portfolio returns. These models typically place no restrictions on the (unobserved) weights $\underline{\omega}_{xt-1}$ and $\underline{\omega}_{mt-1}$ save for the requirement that they are based on information available at time $t-1$ and that they result in strictly positive values of
28 Moments of $\lambda_{0t}$ and $\lambda_{pt} - \lambda_{0t}$ can be estimated. For example, the projection of $\hat{\lambda}_{0t}$ and $\hat{\lambda}_{pt} - \hat{\lambda}_{0t}$ on $\underline{z}_{t-1} \in I_{t-1}$ recovers the unconditional projection of $\lambda_{0t}$ and $\lambda_{pt} - \lambda_{0t}$ on $\underline{z}_{t-1} \in I_{t-1}$ in large samples.
29 The APT as developed by Ross (1976) and Ross (1977) places insufficient restrictions on asset prices to identify $Q_t$. In order to obtain the formulation (4.14), sufficient restrictions must be placed on preferences and investment opportunities so that diversifiable risk commands no risk premium.
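$Q_t$.$^{30}$ Put differently, a model takes the more general form $Q(\underline{x}_t, \underline{\theta})$ when $\underline{\omega}_{xt-1}$ and $\underline{\omega}_{mt-1}$ are parameterized as $\underline{\omega}_x(\underline{z}_{t-1}, \underline{\theta})$ and $\underline{\omega}_m(\underline{z}_{t-1}, \underline{\theta})$. Accordingly, consider the linear conditional multifactor model: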
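$$\underline{R}_t = \underline{\alpha}_t + B_x(\underline{z}_{t-1}, \underline{\theta}_{Bx})\underline{x}_t + B_m(\underline{z}_{t-1}, \underline{\theta}_{Bm})\underline{R}_{mt} + \underline{\epsilon}_{Bt}. \tag{4.15}$$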
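The imposition of the moment conditions (3.6) yields the associated restriction on the intercept vector:

$$\underline{\alpha}_t = [\underline{\iota} - B_m(\underline{z}_{t-1}, \underline{\theta}_{Bm})\underline{\iota}]\lambda_{0t} - B_x(\underline{z}_{t-1}, \underline{\theta}_{Bx})\underline{\lambda}_{xt} \tag{4.16}$$

so that, in principle, $\underline{\omega}_{xt-1}$ and $\underline{\omega}_{mt-1}$ can be inverted from the expression for $\underline{\lambda}_{xt}$. Finally, insertion of this expected return relation into the multifactor model yields: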
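$$\underline{R}_t = \underline{\iota}\lambda_{0t} + B_x(\underline{z}_{t-1}, \underline{\theta}_{Bx})[\underline{x}_t - \underline{\lambda}_{xt}] + B_m(\underline{z}_{t-1}, \underline{\theta}_{Bm})[\underline{R}_{mt} - \underline{\iota}\lambda_{0t}] + \underline{\epsilon}_{Bt}; \qquad E[\underline{\epsilon}_{Bt} \mid I_{t-1}] = \underline{0}. \tag{4.17}$$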
Once again, the residual vector has conditional mean zero because expected returns are spanned by the factor loading matrix $B(\underline{z}_{t-1}, \underline{\theta}_B)$ and a vector of ones.$^{31}$ As is readily apparent, this model requires estimates of the conditional mean vector and covariance matrix of $(\underline{x}'_t\ \underline{R}'_{mt})'$.

Note that no restrictions are placed on $E[\underline{R}_{mt} \mid I_{t-1}]$ in (4.17). If the econometrician observes the returns $\underline{R}_{mt}$ and the variables $\underline{x}_t$ with no additional information on $Q_t$, the absence of a model linking $\underline{R}_{mt}$ with $Q_t$ eliminates the restrictions on $E[\underline{R}_{mt} \mid I_{t-1}]$ that arise from the moment condition $E[\underline{R}_{mt}Q_t \mid I_{t-1}] = \underline{\iota}$. The same observation would hold if the returns of portfolio $p$ were observed in (4.10)-(4.13). Put differently, a linear combination of the returns $\underline{R}_{mt}$ or of the return $R_{pt}$ provides a scale-free proxy for $Q_t$. In the absence of data on or of a model for $Q_t$, asset pricing relations explain relative asset prices and expected returns, not the levels of asset prices and risk premiums.

As with the imposition of conditional beta models, linear factor models simplify estimation and inference by weakening the information requirements. Linearity of the pricing kernel confers three modest advantages compared with the conditional beta models of the previous section: (1) the derivatives of the conditional mean and variance of $Q(\underline{x}_t, \underline{\theta}_Q)$ are no longer required; (2) the conditional covariance matrices involving $\underline{x}_t$ and $\underline{R}_{mt}$ contain no unknown model parameters (in contrast to $\mathrm{Var}[Q(\underline{x}_t, \underline{\theta}_Q) \mid I_{t-1}]$); and (3) the linear model permits $\underline{\omega}_{xt-1}$ and $\underline{\omega}_{mt-1}$ to remain unobservable. The third point comes at a cost: the
30 Imposing the positivity constraint in linear models is sometimes quite difficult.
31 Since the multifactor models described above are cast in terms of $Q_t$, $[\underline{\iota} - B_m(\underline{z}_{t-1}, \underline{\theta}_{Bm})\underline{\iota}]$ will not be identically zero. In multifactor models with no explicit link between $Q_t$ and the underlying common factors, this remains a possibility. See Huberman, Kandel, and Stambaugh (1987), Huberman and Kandel (1987), and Lehmann and Modest (1988) for a discussion of this issue.
B. N. Lehmann
model places no restrictions on the levels of asset prices and risk premiums. Once again, additional simplifications arise if there is an observed riskless rate. Multifactor models also take the form of prespecified beta models. The analysis of these models parallels that of the single beta case in (4.10)-(4.13). A conditional factor loading model $B(z_{t-1}, \theta_B)$ can only be identified up to scale and, at best, the econometrician can estimate the returns of the minimum variance basis portfolios, each with a loading of one on one factor and loadings of zero on the others. In terms of the single beta representation, a portfolio of these optimal basis portfolios with time-varying weights has returns that are maximally correlated with $Q_t$ or, equivalently, a linear combination of $B(z_{t-1}, \theta_B)$ is proportional to the conditional betas in this multifactor prespecified beta model.

4.3. Diversifiable residual models and estimation in large cross-sections

One other simplifying assumption is often made in these models: that the residual vectors are only weakly correlated cross-sectionally. This restriction is the principal assumption of the APT and it implies that residual risk can be eliminated in large, well-diversified portfolios. It is convenient econometrically for the same reason: the impact of residuals on estimation can be eliminated through diversification in large cross-sections. In terms of efficient estimation of beta pricing models, this assumption facilitates estimation of $\Psi_{\beta t-1}$, the remaining component of the efficient GMM weighting matrix. To be sure, efficient estimation could proceed by postulating a model for $\Psi_{\beta t-1}$ in (4.7) of the form $\Psi_\beta(z_{t-1})$. However, it is unlikely that an econometrician, particularly one using semiparametric methods, would possess reliable prior information of this form save for the factor models of Section 4.2. Accordingly, consider the addition of a linear factor model to the conditional beta models.
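Once again, consider the projection: 32

$$R_t = \alpha_t + \beta(z_{t-1}, \theta_\beta)\,Q(x_t, \theta_Q) + B_x(z_{t-1}, \theta_{Bx})\,x_t + B_m(z_{t-1}, \theta_{Bm})\,R_{mt} + \varepsilon_{\beta Bt} \tag{4.19}$$

and the application of the pricing relation to the intercept vector:

$$\alpha_t = [\iota - B_m(z_{t-1}, \theta_{Bm})\iota]\,\lambda_{0t} - \beta(z_{t-1}, \theta_\beta)\,\lambda_{Qt} - B_x(z_{t-1}, \theta_{Bx})\,\lambda_{xt} \tag{4.20}$$

which, after rearranging terms and insertion into (4.19), yields:

$$\begin{aligned}
R_t &= \iota\lambda_{0t} + \beta(z_{t-1}, \theta_\beta)\,[Q(x_t, \theta_Q) - \lambda_{Qt}] + B_x(z_{t-1}, \theta_{Bx})\,[x_t - \lambda_{xt}] + B_m(z_{t-1}, \theta_{Bm})\,[R_{mt} - \iota\lambda_{0t}] + \varepsilon_{\beta Bt} \\
\lambda_{Qt} &= \lambda_{0t}\,E[Q(x_t, \theta_Q)^2 \mid I_{t-1}]; \qquad \Psi_{\beta Bt-1} = E[\varepsilon_{\beta Bt}\,\varepsilon_{\beta Bt}' \mid I_{t-1}] \\
\lambda_{xt} &= \lambda_{0t}\,E[x_t\,Q(x_t, \theta_Q) \mid I_{t-1}].
\end{aligned} \tag{4.21}$$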
32 Of course, one element of $(x_t' \; R_{mt}')'$ must be dropped if $(x_t' \; R_{mt}')'$ and $Q(x_t, \theta_Q)$ are linearly dependent.
When all of these components are present in the model, assume that a vector of ones does not lie in the column span of either $B_x(z_{t-1}, \theta_{Bx})$ or $B_m(z_{t-1}, \theta_{Bm})$. This formulation nests all of the models in the preceding subsections. When $B_x(z_{t-1}, \theta_{Bx})$ and $B_m(z_{t-1}, \theta_{Bm})$ are identically zero, equations (4.21) yield the conditional beta model (4.7) or, in the absence of the pricing kernel model $Q(x_t, \theta_Q)$, the prespecified beta model (4.11). Similarly, when $\beta(z_{t-1}, \theta_\beta)$ is identically zero, equations (4.21) yield the observable linear factor model (4.17) or, without observations on $x_t$ and $R_{mt}$, the multifactor analogue of the prespecified beta model. When all components are included simultaneously, the conditional factor model places structure on the conditional covariance matrix of the residuals $\Psi_{\beta t-1}$ in the conditional beta model (4.7). This factor model represents more than mere elegant variation: it makes it plausible to place a priori restrictions on the conditional variance matrix $\Psi_{\beta Bt-1}$. In terms of the conditional beta model (4.7), the residual covariance matrix $\Psi_{\beta t-1}$ has an observable factor structure in this model given by: 33
(4.22)
Hence, the factor model provides the final input necessary for the efficient estimation of beta pricing models. Chamberlain and Rothschild (1983) provide a convenient characterization of diversifiability restrictions for residuals like $\varepsilon_{\beta Bt}$. They assume that the largest eigenvalue of the conditional residual covariance matrix $\Psi_{\beta Bt-1}$ remains bounded as the number of assets grows without bound. This condition is sufficient for a weak law of large numbers to apply because the residual variance of a portfolio with weights of order $1/N$ (i.e., one for which $w_{t-1}'w_{t-1} \to 0$ as $N \to \infty$) converges to zero since

$$w_{t-1}'\,\Psi_{\beta Bt-1}\,w_{t-1} \le \xi_{\max}(\Psi_{\beta Bt-1})\,w_{t-1}'w_{t-1} \to 0.$$
33 Unobservable factor models can be imposed as well as long as the associated conditional betas are constant. The methods developed for the iid case in Chamberlain and Rothschild (1983), Connor and Korajczyk (1988) and Lehmann and Modest (1988) apply since the residuals in this application are serially uncorrelated. Lehmann (1992) discusses the serially correlated case.
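The bounded-eigenvalue condition can be illustrated numerically. The sketch below is not taken from the chapter: it assumes a hypothetical banded residual covariance matrix (only neighbouring assets are correlated, with illustrative parameters `rho` and `sigma2`), whose largest eigenvalue stays below a fixed bound as N grows, and it checks that the variance of an equally weighted portfolio never exceeds the largest eigenvalue times w'w = 1/N.

```python
import numpy as np

def banded_cov(n, rho=0.4, sigma2=1.0):
    """Hypothetical residual covariance: only neighbouring assets are correlated."""
    psi = sigma2 * np.eye(n)
    idx = np.arange(n - 1)
    psi[idx, idx + 1] = rho * sigma2
    psi[idx + 1, idx] = rho * sigma2
    return psi

def portfolio_variance_and_bound(psi, w):
    """Return (w' psi w, xi_max(psi) * w'w); the first never exceeds the second."""
    xi_max = np.linalg.eigvalsh(psi)[-1]   # eigvalsh sorts eigenvalues ascending
    return w @ psi @ w, xi_max * (w @ w)

results = {}
for n in (10, 100, 1000):
    w = np.full(n, 1.0 / n)                # weights of order 1/N
    results[n] = portfolio_variance_and_bound(banded_cov(n), w)
    print(n, results[n])
```

Both the portfolio variance and its bound shrink at rate 1/N here because the largest eigenvalue of this particular matrix stays below 1 + 2·rho for every N. A covariance matrix with a pervasive common factor would not satisfy the condition.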
Unfortunately, there is no obvious way to estimate $\Psi_{\beta Bt-1}$ subject to this boundedness condition. 34 Hence, the imposition of diversification constraints in practice generally involves the stronger assumption of a strict factor structure: that is, that $\Psi_{\beta Bt-1}$ is diagonal. Of course, there is no guarantee that a diagonal specification leads to an estimator of higher efficiency than an identity matrix (that is, ordinary least squares) when generalized least squares is appropriate, as would be the case if $\Psi_{\beta Bt-1}$ is unrestricted save for the diversifiability condition $\lim_{N \to \infty} \xi_{\max}(\Psi_{\beta Bt-1}) < \infty$. While weighted least squares may in fact be superior in most applications, conservative inference can be conducted assuming that this specification is false. In any event, the econometrician can allow for a generous amount of dependence in the idiosyncratic variances in the diagonal specification. What is the large cross-section behavior of GMM estimators assuming that a weak law applies to the residuals? To facilitate large N analysis, append the subscript N to the residuals $\varepsilon_{\beta BNt}$ and to the associated parameter vectors and matrices $\beta_N(z_{t-1}, \theta_{\beta N})$, $B_{xN}(z_{t-1}, \theta_{BxN})$, $B_{mN}(z_{t-1}, \theta_{BmN})$, and $\Psi_{\beta BNt-1}$, and take all limits as N grows without bound by adding elements to vectors and rows to matrices as securities are added to the asset menu. An arbitrary conditional GMM estimator can be calculated from:

$$\frac{1}{NT}\sum_{t=1}^{T} A_{\beta BNt-1}\,\hat{\varepsilon}_{\beta BNt} = 0 \tag{4.24}$$
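where $A_{\beta BNt-1}$ is a sequence of $p \times N$ $O_p(1)$ matrices chosen by the econometrician having full row rank for which $\xi_{\min}(A_{\beta BNt-1}A_{\beta BNt-1}') \to \infty$ as $N \to \infty$, where $\xi_{\min}(\cdot)$ is the smallest eigenvalue of its argument. This latter condition ensures that the weights are diversified across securities and not concentrated on only a few assets. Examination of the estimating equations (4.24) reveals the benefits of large cross-sections when residuals are diversifiable. The sample and population residuals are related by:

$$\begin{aligned}
\hat{\varepsilon}_{\beta BNt} ={}& \varepsilon_{\beta BNt} + \iota(\lambda_{0t} - \hat{\lambda}_{0t}) \\
&+ \beta_N(z_{t-1}, \theta_{\beta N})[Q(x_t, \theta_Q) - \lambda_{Qt}] - \beta_N(z_{t-1}, \hat{\theta}_{\beta N})[Q(x_t, \hat{\theta}_Q) - \hat{\lambda}_{Qt}] \\
&+ [B_{xN}(z_{t-1}, \theta_{BxN}) - B_{xN}(z_{t-1}, \hat{\theta}_{BxN})]\,x_t + [B_{xN}(z_{t-1}, \hat{\theta}_{BxN})\hat{\lambda}_{xt} - B_{xN}(z_{t-1}, \theta_{BxN})\lambda_{xt}] \\
&+ [B_{mN}(z_{t-1}, \theta_{BmN}) - B_{mN}(z_{t-1}, \hat{\theta}_{BmN})]\,R_{mt} + \{B_{mN}(z_{t-1}, \hat{\theta}_{BmN})\,\iota\hat{\lambda}_{0t} - B_{mN}(z_{t-1}, \theta_{BmN})\,\iota\lambda_{0t}\}
\end{aligned} \tag{4.25}$$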
the first component of which is the population residual vector $\varepsilon_{\beta BNt}$ and the remaining components of which represent the difference between the population and fitted part of the model. Clearly, $\varepsilon_{\beta BNt}$ can be eliminated through diversification and, hence, the application of $A_{\beta BNt-1}$ to $\hat{\varepsilon}_{\beta BNt}$ will do so since it places implicit weights of order $1/N$ on each asset as the number of assets grows without bound. However, the benefits of diversification have a limit because of the difference between the population and fitted part of the model. For example, the sampling errors in $Q(x_t, \hat{\theta}_Q)$, $\hat{\lambda}_{0t}$, $\hat{\lambda}_{xt}$, $B_{xN}(z_{t-1}, \hat{\theta}_{BxN})$ and $B_{mN}(z_{t-1}, \hat{\theta}_{BmN})$ generally cannot be diversified away in a single cross-section. To be sure, some components of $\hat{\varepsilon}_{\beta BNt}$ are amenable to diversification in some models. For example, if $\beta(z_{t-1}, \theta_\beta)$ is identically zero (i.e., if the pricing kernel $Q_t$ is given by $\omega_{xt-1}'x_t + \omega_{mt-1}'R_{mt}$) and if the models for both $B_{xN}(z_{t-1}, \theta_{BxN})$ and $B_{mN}(z_{t-1}, \theta_{BmN})$ are linear, the sampling errors $B_{xN}(z_{t-1}, \hat{\theta}_{BxN}) - B_{xN}(z_{t-1}, \theta_{BxN})$ and $B_{mN}(z_{t-1}, \hat{\theta}_{BmN}) - B_{mN}(z_{t-1}, \theta_{BmN})$ can, in principle, be eliminated through diversification. In this case, the only risk premium that can be consistently estimated from a single cross-section is $\lambda_{0t}$ since the difference $\hat{\lambda}_{xt} - \lambda_{xt}$ can only be eliminated in large time series samples. 35

34 Recently, Ledoit (1994) has proposed estimating covariance matrices using shrinkage estimators of the eigenvalues, an approach that might work here.
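The limit to diversification in a single cross-section can be seen in a small simulation. This is only a sketch under assumptions that are not in the chapter: a one-factor model with known population betas, a hypothetical zero-beta return `zero_beta`, a factor premium `premium`, and a single-period factor surprise `factor_shock`. With many assets, the cross-sectional regression pins down the zero-beta return and the realized factor return, but it cannot separate the premium from the surprise.

```python
import numpy as np

def single_cross_section(n_assets, premium=0.5, factor_shock=2.0,
                         zero_beta=1.0, seed=0):
    """Cross-sectional OLS of one period's returns on known betas (illustrative)."""
    rng = np.random.default_rng(seed)
    betas = rng.uniform(0.5, 1.5, n_assets)          # population betas, assumed known
    resid = rng.normal(0.0, 5.0, n_assets)           # diversifiable idiosyncratic returns
    realized_factor = premium + factor_shock          # premium plus this period's surprise
    returns = zero_beta + betas * realized_factor + resid
    design = np.column_stack([np.ones(n_assets), betas])
    (intercept, slope), *_ = np.linalg.lstsq(design, returns, rcond=None)
    return intercept, slope, realized_factor

intercept, slope, realized = single_cross_section(200_000)
print(intercept, slope, realized)
```

The slope settles near the realized factor return rather than near the premium, however many assets are added. Separating the two requires averaging over time, which is the sense in which only the zero-beta premium is recoverable from one cross-section.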
~ flBt1 
02or O_B )  Bx(zt_l ' OBx) O0 { l  Var[Q(xt,  OQ)]It_1]3(z_t_l, X C o v [xt , Q (xt, OQ)[It_ 1]
 gm
(ztl,
OBm)I} t ~ E[Rmt 
+ 20,{Var[Q(xt, O_Q)1I,_1]
Ofl(Z_t1, O0
oB)'
OCov[xt, Q(x~, O_o)litl]' O0 (4.26)
+ OVar[Q(xt, OQ)]I,1]
00 fl(gt1 '0B)t }
35 This point has resulted in much confusion in the beta pricing literature. The literature abounds with inferences drawn from cross-sectional regressions of returns on the betas of individual assets computed with respect to particular portfolios. If the betas in these prespecified beta models are computed with respect to an efficient portfolio, the best one can do in a single cross-section (with a priori knowledge of the population betas and return covariance matrix) is to recover the returns of the efficient portfolio. Information on the risk premium of portfolios like p in Section 4.1 can only be recovered over time while the return of portfolio 0 converges to the riskless rate in a single cross-section if the residuals of the prespecified beta model are diversifiable given the population value of $\beta_{pt-1}$. Shanken (1992) shows that this is the case using the sample analogue of $\beta_{pt-1}$ in a model with constant conditional betas and independently and identically distributed idiosyncratic disturbances given appropriate corrections for biases induced by sampling error. See also Lehmann (1988) and Lehmann (1990).
In the original formulation (3.6)-(3.9), efficient estimation required $\Phi_{t-1}$, the derivatives of the conditional expectation of $Q(x_t, \theta_Q)R_t$, and $\Psi_{t-1}$, the conditional covariance matrix of $R_t Q(x_t, \theta_Q) - \iota$. Equations (4.21) reflect the kinds of assumptions that the econometrician can make to facilitate efficient estimation. The conditional beta model eases the evaluation of the beta pricing version of $\Phi_{t-1}$ and the factor model assumption places structure on the associated analogue of $\Psi_{t-1}$. Consistent estimation of $A_{\beta Bt-1}$ requires the evaluation of a number of conditional moments, namely $\lambda_{0t}$, $E[R_{mt} \mid I_{t-1}]$, $\mathrm{Var}[Q(x_t, \theta_Q) \mid I_{t-1}]$, and $\mathrm{Cov}[x_t, Q(x_t, \theta_Q) \mid I_{t-1}]$ and their derivatives, when necessary, along with $\Phi_{\beta Bt-1}$, $V_{\beta Bt-1}$, and $\Psi_{\beta Bt-1}$. The most common strategy by far is simply to assume that the relevant conditional moments are time invariant functions of available information. This strategy was taken throughout this section in the models for conditional betas and conditional factor loadings. For the evaluation of $A_{\beta Bt-1}$, this approach requires the econometrician to posit relations of the form:
$$\lambda_0(z_{t-1}, \theta) = E[Q(x_t, \theta_Q) \mid z_{t-1}]^{-1}, \quad \ldots, \quad \Psi_{\beta B}(z_{t-1}, \theta) = \mathrm{Var}[\varepsilon_{\beta Bt} \mid z_{t-1}] \tag{4.27}$$

which permit the consistent estimation of $A_{\beta Bt-1}$ using initial consistent estimates of $\theta$. It is far from obvious that a financial econometrician can be expected to have reliable prior information in this form. In most asset pricing applications, the possession of such information about the conditional second moments $\sigma_Q^2(z_{t-1}, \theta)$ and $\sigma_{Qx}(z_{t-1}, \theta)$ is somewhat more plausible than the existence of the corresponding conditional first moment specifications $\lambda_0(z_{t-1}, \theta)$ and $\lambda_m(z_{t-1}, \theta)$ in their conditional mean form. However, observation of the riskless rate eliminates the need to model $\lambda_0(z_{t-1}, \theta)$, and models for $\mathrm{Cov}[R_{mt}, Q(x_t, \theta_Q) \mid z_{t-1}]$ seem no more demanding than those for other conditional second moments. The conditional covariance matrix $V_{\beta B}(z_{t-1}, \theta)$ is somewhat less problematic as well, although the specification of multivariate conditional covariance models is in its infancy. The discussion in Section 4.3 left some ambiguity concerning the availability of plausible models of this sort for $\Psi_{\beta Bt-1}$ due to the inability to impose the general bounded eigenvalue condition. As noted there, the specification of idiosyncratic
variances is comparatively straightforward if $\Psi_{\beta Bt-1}$ is diagonal. Finally, conservative inference is always available through the use of the asymptotic covariance matrix in (2.13). Equations (4.27) can either represent parametric models for these conditional moments or functions that are estimable by semiparametric or nonparametric methods. Robinson (1987), Newey (1990), Robinson (1991), and Newey (1993) discuss bootstrap, nearest neighbor, and series estimation of functions such as those appearing in (4.27). All of these methods suffer from the curse of dimensionality so their invocation must be justified on a case by case basis. Neural network approximations promising somewhat less impairment from this source might be employed as well. 36
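As one concrete reading of the nearest neighbor approach, the sketch below estimates a conditional mean by averaging the outcomes of the k observations whose conditioning variables lie closest to a query point. It is an illustration only, not the estimator of any cited paper: the function name `knn_conditional_mean`, the simulated data, and the choice k = 100 are all assumptions made here.

```python
import numpy as np

def knn_conditional_mean(z_train, y_train, z_query, k):
    """Estimate E[y | z] at each query point by averaging y over the k nearest z."""
    y_train = np.asarray(y_train, dtype=float)
    z_train = np.asarray(z_train, dtype=float).reshape(len(y_train), -1)
    z_query = np.asarray(z_query, dtype=float).reshape(-1, z_train.shape[1])
    # Squared Euclidean distances, shape (n_query, n_train)
    d2 = ((z_query[:, None, :] - z_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=2000)                  # scalar conditioning variable
y = 1.0 + 0.5 * z + rng.normal(0.0, 0.2, size=2000)    # true E[y | z] = 1 + 0.5 z
estimate = knn_conditional_mean(z, y, np.array([0.0, 0.5]), k=100)
print(estimate)
```

A conditional second moment can be handled the same way by passing squared residuals as `y_train`. With a scalar conditioning variable the 100 neighbours sit in a narrow window around the query point; as the dimension of z grows, the same number of neighbours spans a much wider region, which is the curse of dimensionality noted above.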
5. Concluding remarks
This paper shows that efficient semiparametric estimation of asset pricing relations is straightforward in principle if not in practice. Efficiency follows from the maximum correlation property of the optimal GMM estimators described in the second section, a property that has analogues in the optimal hedge portfolios that arise in asset pricing theory. The semiparametric nature of asset pricing relations naturally leads to a search for efficiency gains in the context of beta pricing models. The structure of these models suggests that efficient estimation is made feasible by the imposition of conditional beta models and/or multifactor models with residuals that satisfy a law of large numbers in the cross-section, models that exist in various incarnations in the beta pricing literature. Hence, strategies that have proved useful in the iid environment have natural, albeit nonlinear and perhaps nonparametric, analogues in this more general setting, the details of which are worked out in the paper. While it has offered no evidence on the magnitude of possible efficiency gains, the paper has surely pointed to more straightforward interpretation and implementation than has been heretofore attainable. What remains is to extend these results in two dimensions. The analysis sidestepped the development of the most general approximations of the conditional moments that comprise the optimal conditional weighting matrices, the subtleties of which arise from the martingale difference nature of the residuals in no-arbitrage asset pricing models as opposed to the independence assumption frequently made in other applications. The second dimension involves examination of less parametric semiparametric estimators. In the asset pricing arena, this amounts to semiparametric estimation of pricing kernels and state price densities, a more ambitious and perhaps more interesting task.
36 Barron (1993) and Hornik et al. (1993) discuss the superior approximation properties of neural networks in the multidimensional case.
References
Bansal, R. and B. N. Lehmann (1995). Bond returns and the prices of state contingent claims. Graduate School of International Relations and Pacific Studies, University of California at San Diego.
Bansal, R. and S. Viswanathan (1993). No arbitrage and arbitrage pricing: A new approach. J. Finance 48, pp. 1231-1262.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39, pp. 930-945.
Black, F., M. C. Jensen and M. Scholes (1972). The capital asset pricing model: Some empirical tests. In: M. C. Jensen, ed., Studies in the Theory of Capital Markets, New York: Praeger.
Breeden, D. T. (1979). An intertemporal asset pricing model with stochastic consumption and investment opportunities. J. Financ. Econom. 7, pp. 265-299.
Cass, D. and J. E. Stiglitz (1970). The structure of investor preferences and asset returns and separability in portfolio allocation: A contribution to the pure theory of mutual funds. J. Econom. Theory 2, pp. 122-160.
Chamberlain, G. (1987). Asymptotic efficiency in estimation with conditional moment conditions. J. Econometrics 34, pp. 305-334.
Chamberlain, G. (1992). Efficiency bounds for semiparametric regression. Econometrica 60, pp. 567-596.
Chamberlain, G. and M. Rothschild (1983). Arbitrage and mean-variance analysis on large asset markets. Econometrica 51, pp. 1281-1304.
Cochrane, J. (1991). Production-based asset pricing and the link between stock returns and economic fluctuations. J. Finance 46, pp. 207-234.
Connor, G. and R. A. Korajczyk (1988). Risk and return in an equilibrium APT: Application of a new test methodology. J. Financ. Econom. 21, pp. 255-289.
Constantinides, G. and W. Ferson (1991). Habit persistence and durability in aggregate consumption: Empirical tests. J. Financ. Econom. 29, pp. 199-240.
Douglas, G. W. (1968). Risk in the Equity Markets: An Empirical Appraisal of Market Efficiency. Ann Arbor, Michigan: University Microfilms, Inc.
Epstein, L. G. and S. E. Zin (1991a). Substitution, risk aversion, and the temporal behavior of consumption and asset returns: A theoretical framework. Econometrica 57, pp. 937-969.
Epstein, L. G. and S. E. Zin (1991b). Substitution, risk aversion, and the temporal behavior of consumption and asset returns: An empirical analysis. J. Politic. Econom. 96, pp. 263-286.
Fama, E. F. and J. D. MacBeth (1973). Risk, return, and equilibrium: Empirical tests. J. Politic. Econom. 81, pp. 607-636.
Grinblatt, M. and S. Titman (1987). The relation between mean-variance efficiency and arbitrage pricing. J. Business 60, pp. 97-112.
Hall, A. (1993). Some aspects of generalized method of moments estimation. In: G. S. Maddala, C. R. Rao and H. D. Vinod, eds., Handbook of Statistics: Econometrics. Amsterdam, The Netherlands: Elsevier Science Publishers, pp. 393-418.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, pp. 1029-1054.
Hansen, L. P. (1985). A method for calculating bounds on the asymptotic covariance matrices of generalized method of moments estimators. J. Econometrics 30, pp. 203-238.
Hansen, L. P., J. Heaton and E. Luttmer (1995). Econometric evaluation of asset pricing models. Rev. Financ. Stud. 8, pp. 237-274.
Hansen, L. P., J. Heaton and M. Ogaki (1988). Efficiency bounds implied by multiperiod conditional moment conditions. J. Amer. Stat. Assoc. 83, pp. 863-871.
Hansen, L. P. and R. J. Hodrick (1980). Forward exchange rates as optimal predictors of future spot rates: An econometric analysis. J. Politic. Econom. 88, pp. 829-853.
Hansen, L. P. and R. Jagannathan (1991). Implications of security market data for models of dynamic economies. J. Politic. Econom. 99, pp. 225-262.
Hansen, L. P. and R. Jagannathan (1994). Assessing specification errors in stochastic discount factor models. Research Department, Federal Reserve Bank of Minneapolis, Staff Report 167.
Hansen, L. P. and S. F. Richard (1987). The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models. Econometrica 55, pp. 587-613.
Hansen, L. P. and K. J. Singleton (1982). Generalized instrumental variables estimation of nonlinear rational expectations models. Econometrica 50, pp. 1269-1286.
Harrison, M. J. and D. Kreps (1979). Martingales and arbitrage in multiperiod securities markets. J. Econom. Theory 20, pp. 381-408.
He, H. and D. Modest (1995). Market frictions and consumption-based asset pricing. J. Politic. Econom. 103, pp. 94-117.
Hornik, K., M. Stinchcombe, H. White and P. Auer (1993). Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Computation 6, pp. 1262-1275.
Huberman, G. and S. Kandel (1987). Mean-variance spanning. J. Finance 42, pp. 873-888.
Huberman, G., S. Kandel and R. F. Stambaugh (1987). Mimicking portfolios and exact asset pricing. J. Finance 42, pp. 1-9.
Ledoit, O. (1994). Portfolio selection: Improved covariance matrix estimation. Sloan School of Management, Massachusetts Institute of Technology.
Lehmann, B. N. (1987). Orthogonal portfolios and alternative mean-variance efficiency tests. J. Finance 42, pp. 601-619.
Lehmann, B. N. (1988). Mean-variance efficiency tests in large cross-sections. Graduate School of International Relations and Pacific Studies, University of California at San Diego.
Lehmann, B. N. (1990). Residual risk revisited. J. Econometrics 45, pp. 71-97.
Lehmann, B. N. (1992). Notes of dynamic factor pricing models. Rev. Quant. Finance Account. 2, pp. 69-87.
Lehmann, B. N. and D. M. Modest (1988). The empirical foundations of the arbitrage pricing theory. J. Financ. Econom. 21, pp. 213-254.
Lintner, J. (1965). Security prices and risk: The theory and a comparative analysis of A.T.&T. and leading industrials. Graduate School of Business, Harvard University.
Luttmer, E. (1993). Asset pricing in economies with frictions. Department of Finance, Northwestern University.
Merton, R. C. (1972). An analytical derivation of the efficient portfolio frontier. J. Financ. Quant. Anal. 7, pp. 1851-1872.
Merton, R. C. (1973). An intertemporal capital asset pricing model. Econometrica 41, pp. 867-887.
Miller, M. H. and M. Scholes (1972). Rates of return in relation to risk: A reexamination of some recent findings. In: M. C. Jensen, ed., Studies in the Theory of Capital Markets, New York: Praeger, pp. 79-121.
Newey, W. K. (1990). Efficient instrumental variables estimation of nonlinear models. Econometrica 58, pp. 809-837.
Newey, W. K. (1993). Efficient estimation of models with conditional moment restrictions. In: G. S. Maddala, C. R. Rao and H. D. Vinod, eds., Handbook of Statistics: Econometrics. Amsterdam, The Netherlands: Elsevier Science Publishers.
Newey, W. K. and K. D. West (1987). A simple, positive semidefinite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55, pp. 703-708.
Ogaki, M. (1993). Generalized method of moments: Econometric applications. In: G. S. Maddala, C. R. Rao and H. D. Vinod, eds., Handbook of Statistics: Econometrics. Amsterdam, The Netherlands: Elsevier Science Publishers, pp. 455-488.
Robinson, P. M. (1987). Asymptotically efficient estimation in the presence of heteroskedasticity of unknown form. Econometrica 55, pp. 875-891.
Robinson, P. M. (1991). Best nonlinear three-stage least squares estimation of certain econometric models. Econometrica 59, pp. 755-786.
Roll, R. W. (1977). A critique of the asset pricing theory's tests - Part I: On past and potential testability of the theory. J. Financ. Econom. 4, pp. 129-176.
Rosenberg, B. (1974). Extra-market components of covariance in security returns. J. Financ. Quant. Anal. 9, pp. 262-274.
Rosenberg, B. and V. Marathe (1979). Tests of capital asset pricing hypotheses. Research in Finance: A Research Annual 1, pp. 115-223.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. J. Econom. Theory 13, pp. 341-360.
Ross, S. A. (1977). Risk, return, and arbitrage. In: I. Friend and J. L. Bicksler, eds., Risk and Return in Finance. Cambridge, Mass.: Ballinger.
Ross, S. A. (1978a). Mutual fund separation and financial theory - the separating distributions. J. Econom. Theory 17, pp. 254-286.
Ross, S. A. (1978b). A simple approach to the valuation of risky streams. J. Business 51, pp. 1-40.
Rubinstein, M. (1976). The valuation of uncertain income streams and the pricing of options. Bell J. Econom. Mgmt. Sci. 7, pp. 407-425.
Shanken, J. (1992). On the estimation of beta pricing models. Rev. Financ. Stud. 5, pp. 1-33.
Summers, L. H. (1985). On economics and finance. J. Finance 40, pp. 633-636.
Summers, L. H. (1986). Does the stock market rationally reflect fundamental values? J. Finance 41, pp. 591-600.
Tauchen, G. (1986). Statistical properties of generalized method of moments estimators of structural parameters obtained from financial market data. J. Business Econom. Statist. 4, pp. 397-425.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14. 1996 Elsevier Science B.V. All rights reserved.
1. Introduction
Models of the term structure of interest rates have assumed increasing importance in recent years in line with the need to value interest rate derivative assets. Economists and econometricians have long held an interest in the subject, as an understanding of the determinants of the term structure has always been viewed as crucial to an understanding of the impact of monetary policy and its transmission mechanism. Most of the approaches taken to the question in the finance literature have revolved around the search for common factors that are thought to underlie the term structure, and little has been borrowed from the economic or econometrics literature on the subject. The converse can also be said about the small amount of attention paid in econometric research to the finance literature models. The aim of the present chapter is to look at the connections between the two literatures in order to show that a synthesis of the two may well provide some useful information for both camps. The paper begins with a description of a standard set of data on the term structure. This results in a set of stylized facts pertaining to the nature of the stochastic processes generating yields as well as their spreads. Such a set of facts is useful in forming an opinion of the likelihood of various approaches to term structure modelling being capable of replicating the data. Section 3 outlines the various models used in both the economics and finance literature, and assesses how well these models perform in matching the stylized facts. Section 4 presents a conclusion.
* We are grateful for comments on previous versions of this paper by John Robertson, Peter Phillips and Ken Singleton. All computations were performed with a beta version of MICROFIT 4 and GAUSS 3.2.
The data set examined involves monthly observations on 1, 3, 6 and 9 month and 10 year zero coupon bond yields over the period December 1946 to February 1991, constructed by McCulloch and Kwon (1993); this is an updated version of McCulloch (1989). Table 1 records the autocorrelation characteristics of the series, with $\rho_j$ being the jth autocorrelation coefficient, DF the Dickey-Fuller test, ADF(12) the Augmented Dickey-Fuller test with 12 lags, $r_t(\tau)$ the yield on zero-coupon bonds with maturity of $\tau$ months and $sp_t(\tau)$ the spread $r_t(\tau) - r_t(1)$. It shows that there is strong evidence of a unit root in all interest rate series. Because this would imply the possibility of negative interest rates, finance modellers have generally maintained that either there is no unit root and the series feature mean reversion or, in continuous time, that an appropriate model is given by the stochastic differential equation

$$dr_t = \alpha\,dt + \sigma r_t\,d\eta_t,$$
where, throughout the paper, $d\eta_t$ is a Wiener process. Because of the "levels effect" of $r_t$ upon the volatility of interest rate changes, we can think of this as an equation in $d \log r_t$ with constant volatility, and the logarithmic transformation ensures that $r_t$ remains positive. 1 In any case, the important point to be made here is that interest rates seem to behave as integrated processes, certainly over the samples of data we possess. It may be that the autoregressive root is close to unity, rather than identical to it, but such "near integrated" processes are best handled with the integrated process technology rather than that for stationary processes.
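The remark about $d \log r_t$ can be verified in one step with Itô's lemma. The following worked equation is a sketch added for illustration; it assumes the stochastic differential equation above is read as $dr_t = \alpha\,dt + \sigma r_t\,d\eta_t$ and that $r_t > 0$:

$$d \log r_t = \frac{1}{r_t}\,dr_t - \frac{1}{2 r_t^2}\,(dr_t)^2 = \left(\frac{\alpha}{r_t} - \frac{\sigma^2}{2}\right)dt + \sigma\,d\eta_t .$$

Under that reading the diffusion coefficient of $\log r_t$ is the constant $\sigma$, and $r_t = \exp(\log r_t)$ is positive by construction.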
Table 1

Series      DF
r(1)        2.41
r(3)        2.15
r(6)        2.12
r(9)        2.12
r(120)      1.41
sp(3)      15.32
sp(6)      11.67
sp(9)      10.38
sp(120)     5.60

ρ2  ρ6  ρ12  ρ1(Δr): .02 .11 .15 .15 .07

The 5% critical value for the DF and ADF tests is 2.87.

1 It is known that, if $r_t$ is replaced by $r_t^{\gamma}$, the restriction $\gamma > .5$ ensures a positive interest rate while, if $\gamma = .5$, $\sigma^2 < 2\alpha$ is needed.
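For readers who want to see how a Dickey-Fuller statistic of the kind reported in Table 1 is formed, the sketch below computes the t-ratio on the lagged level in a regression of the first difference on a constant and the lagged level. It is an illustration on simulated data, not the authors' MICROFIT or GAUSS code; the function name, the sample length of 530, and the AR(1) coefficient of 0.5 are assumptions. The statistic is negative under the stationary alternative, so the comparison in signed form is with -2.87.

```python
import numpy as np

def dickey_fuller_t(series):
    """t-ratio on rho in  delta y_t = a + rho * y_{t-1} + e_t  (no lag augmentation)."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    s2 = resid @ resid / (len(dy) - X.shape[1])      # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)                # OLS covariance matrix
    return coef[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(1)
shocks = rng.normal(size=530)
random_walk = np.cumsum(shocks)                      # series with a unit root
stationary = np.empty(530)
stationary[0] = shocks[0]
for t in range(1, 530):
    stationary[t] = 0.5 * stationary[t - 1] + shocks[t]   # AR(1), root well inside unity

print(dickey_fuller_t(random_walk), dickey_fuller_t(stationary))
```

The augmented version adds lagged differences of the series as extra regressors (12 of them for ADF(12)) before reading off the same t-ratio.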
Instead of the yields one might examine the time series characteristics of the forward rates. The forward rate $F_t^k(\tau)$ contracted at time t for a $\tau$ period bond to be bought at t + k is $F_t^k(\tau) = \frac{1}{\tau}[(\tau + k)\,r_t(\tau + k) - k\,r_t(k)]$. For a forward contract one period ahead this becomes $F_t^1(\tau) = F_t(\tau) = \frac{1}{\tau}[(\tau + 1)\,r_t(\tau + 1) - r_t(1)]$. For reasons that become apparent later it is also of interest to examine the properties of the forward "spreads" $Fp_t(\tau, \tau - 1) = F_t(\tau - 1) - F_{t-1}(\tau)$. These results are to be found in Table 2. Generally, the conclusions would be the same as for yields, except that the persistence in forward rate spreads is not as marked, particularly as the maturity gets longer. As Table 1 also shows, there is a lot of persistence in spreads between short-dated maturities; after fitting an AR(2) to $sp_t(3)$ the LM test for serial correlation over 12 lags is 80.71. This persistence shows up in other transformations of the yield series, e.g. the realized excess holding yield $h_{t+1}(\tau) = \tau r_t(\tau) - (\tau - 1)\,r_{t+1}(\tau - 1) - r_t(1)$, when $\tau = 3$, has serial correlation coefficients of .188 (lag 1), .144 (lag 8), and .111 (lag 10). Such processes are persistent, but not integrated, as the ADF(12) for $h_{t+1}(3)$ clearly shows with its value of 5.27. Papers have appeared concluding that the excess holding yield is a nonstationary process: Evans and Lewis (1994) and Hejazi (1994). That conclusion was reached by the authors performing a Phillips-Hansen (1990) regression of $h_{t+1}(\tau)$ on $F_{t-1}(\tau)$. Applying the same test to our data, with McCulloch's forward rate series, produces an estimated coefficient on $F_{t-1}(\tau)$ of .11 with a t ratio of 10, quite consistent with both Evans and Lewis' and Hejazi's results. However, it does not seem reasonable to interpret this as evidence of nonstationarity. Certainly the series is persistent, and an I(1) series like the forward rate exhibits extreme persistence, so that regressing one upon the other can be expected to lead to some "correlation", but to conclude, therefore, that the excess holding yield is nonstationary is quite incorrect. A fractionally integrated process that is stationary would also show such a relationship with an I(1) process. Indeed, the autocorrelation functions of the spreads and excess yields are reminiscent of those for the squares of yield changes, which have been modelled by fractionally integrated processes; see
Table 2 Autocorrelation features, forward rates, full sample

Series       DF      ADF(12)   ρ1
F(1)         2.28    1.92      .98
F(3)         2.18    1.97      .98
F(6)         2.39    1.91      .98
F(9)         2.14    1.77      .98
Fp(0, 1)    17.08    4.07      .29
Fp(2, 3)    19.52    5.17      .16
Fp(5, 6)    20.61    5.77      .11
Fp(8, 9)    19.73    4.69      .15

ρ2  ρ6  ρ12  ρ1(ΔF): .07 .04 .09 .07 .18 .06 .12 .00 .11 .01 .05 .08 .18 .02 .03 .02
Baillie et al. (1993). Nevertheless, the s t r o n g persistence in spreads is a characteristic which is a substantial challenge to t e r m structure models. 2 A s is well k n o w n , there was a switch in m o n e t a r y p o l i c y in the U S in O c t o b e r 1979 a w a y f r o m targeting interest rates, a n d this fact generally m e a n s t h a t a n y analyses have to be r e  d o n e to ensure t h a t the results d o n o t simply reflect outcomes f r o m 1979 to 1982. T a b l e 3 therefore presents the same statistics as in T a b l e 1 b u t using only p r e O c t o b e r 1979 data. It is a p p a r e n t t h a t the conclusions d r a w n a b o v e are quite robust. It is also well k n o w n t h a t there is a substantial d e p e n d e n c e o f the c o n d i t i o n a l volatility o f A r t ( z ) u p o n the past, b u t the exact n a t u r e o f this d e p e n d e n c e has been subject to m u c h less analysis. A s will b e c o m e clear, the m o s t i m p o r t a n t issue is w h e t h e r the c o n d i t i o n a l variance, a~t 2 , exhibits a levels effect a n d , if so, exactly w h a t r e l a t i o n s h i p is likely to hold. H e r e we e x a m i n e the evidence for a "levels 2 d e p e n d s on rt(z), c o n c e n t r a t i n g u p o n the five yields effect" in volatility, i.e. a~t m e n t i o n e d earlier. Evidence o f the effect can be m a r s h a l l e d in a n u m b e r o f ways. By far the simplest a p p r o a c h is to p l o t (Art(z)  #)2 a g a i n s t rt1 (z), (and this is d o n e in Fig. 1 for rt(1)). 3 The evidence o f a levels effect l o o k s very strong. A m o r e s t r u c t u r e d a p p r o a c h is to estimate the p a r a m e t e r s o f a diffusion process for yields o f the f o r m
    $dr_t = (\alpha_1 - \beta_1 r_t)\,dt + \sigma r_t^{\gamma_1}\,d\eta_t$ ,    (1)
and to examine the estimate of $\gamma_1$. To estimate this requires some approximation scheme.

Table 3. Autocorrelation features, pre October 1979

             DF       ADF(12)   $\rho_1$   $\rho_2$   $\rho_6$   $\rho_{12}$
r(1)         .76      .79       .97
r(3)         .64      1.03      .97
r(6)         .52      1.00      .98
r(9)         .55      .89       .98
r(120)       .14      .33       .99
sp(3)        11.99    2.64      .46
sp(6)        8.80     2.89      .70
sp(9)        8.20     3.10      .71
sp(120)      4.05     4.30      .90

Chan et al. (1992) consider a discretization based on the Euler scheme with $h = 1$ ($h$ being the discretized step), producing
2 Throughout the paper we will take the term structure data as corresponding to actual observations. In practice this may not be so, as a complete term structure is interpolated from observations on parts of the curve. This may introduce some biases of unknown magnitude into relationships between yields. McCulloch and Kwon (1993) interpolate with spline functions. Others, e.g. Gouriéroux and Scaillet (1994), actually utilize some of the factor models discussed later to suggest forms for the yield curve that may be used for interpolation.
3 Marsh and Rosenfeld (1983) also did this and commented on the relation.
Modeling the term structure
Fig. 1. Plot of squared changes in one month yield against lagged yield
    $\Delta r_t = \alpha_1 - \beta_1 r_{t-1} + \sigma r_{t-1}^{\gamma_1}\varepsilon_t$ ,    (2)
where here and in the remainder of the paper $\varepsilon_t$ is n.i.d.(0, 1). For a given value of $\gamma_1$, equation (2) can be estimated by OLS simply by defining the dependent variable as $\Delta r_t r_{t-1}^{-\gamma_1}$, while the regressors become $x_t = [r_{t-1}^{-\gamma_1} \;\; r_{t-1}^{1-\gamma_1}]$, as the error term is then $\sigma\varepsilon_t$, which is n.i.d.(0, $\sigma^2$). Because the conditional mean for $r_t$ depends only on $\alpha_1$, $\beta_1$, while the conditional variance of $u_t = r_t - E_{t-1}(r_t)$ is $\sigma^2 r_{t-1}^{2\gamma_1}$, which does not involve these parameters, we could estimate the parameters in the following way.

1. Regress $\Delta r_t$ on 1 and $r_{t-1}$ to get $\hat\alpha_1$ and $\hat\beta_1$.

2. Since

    $E_{t-1}[u_t^2] = \sigma^2 r_{t-1}^{2\gamma_1}$ ,    (3)

then

    $u_t^2 = \sigma^2 r_{t-1}^{2\gamma_1} + v_t$ ,    (4)

[...] linear regression program.

3. We can re-estimate $\alpha_1$, $\beta_1$ by then doing a weighted regression of $\Delta r_t r_{t-1}^{-\hat\gamma_1}$ against $r_{t-1}^{-\hat\gamma_1}$ and $r_{t-1}^{1-\hat\gamma_1}$.

The above steps would produce a maximum likelihood estimator if $\varepsilon_t$ was taken to be $\mathcal{N}(0, 1)$ and the estimation of $\gamma_1$ was done by a weighted nonlinear
regression on (3) using the conditional standard deviation of $v_t$ as weights.⁴ Chan et al. (1992) use a GMM estimator, which jointly estimates $\alpha_1$, $\beta_1$, $\gamma_1$ and $\sigma$ from the set of moments

    $E(\varepsilon_t) = 0$,   $E(r_{t-1}\varepsilon_t) = 0$,   $E(v_t) = 0$,   $E(r_{t-1}v_t) = 0$ .
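The three-step procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the data are simulated from (2), the parameter values are arbitrary, the names `simulate_yields` and `three_step` are hypothetical, and the nonlinear regression in step 2 is replaced by a simple grid search over $\gamma_1$ with $\sigma^2$ concentrated out.

```python
import numpy as np

def simulate_yields(alpha, beta, sigma, gamma, T, r0, seed=0):
    """Simulate equation (2), the Euler scheme with h = 1 (illustrative data only)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T)
    r = np.empty(T + 1)
    r[0] = r0
    for t in range(T):
        step = alpha - beta * r[t] + sigma * r[t] ** gamma * eps[t]
        r[t + 1] = max(r[t] + step, 1e-6)  # guard: keep the simulated yield positive
    return r

def three_step(r, gamma_grid=None):
    """Three-step estimator of (alpha_1, beta_1, gamma_1, sigma)."""
    if gamma_grid is None:
        gamma_grid = np.linspace(0.0, 2.0, 201)  # assumed search range for gamma_1
    dr, lag = np.diff(r), r[:-1]
    X = np.column_stack([np.ones_like(lag), lag])

    # Step 1: OLS of dr_t on (1, r_{t-1}); the slope estimates -beta_1.
    b, *_ = np.linalg.lstsq(X, dr, rcond=None)
    u2 = (dr - X @ b) ** 2

    # Step 2: fit u_t^2 = sigma^2 * r_{t-1}^(2*gamma_1) + v_t by least squares,
    # searching over gamma_1 and solving for sigma^2 in closed form.
    best = None
    for g in gamma_grid:
        x = lag ** (2.0 * g)
        s2 = (x @ u2) / (x @ x)
        ssr = np.sum((u2 - s2 * x) ** 2)
        if best is None or ssr < best[0]:
            best = (ssr, g, s2)
    _, gamma_hat, sigma2_hat = best

    # Step 3: weighted regression of dr_t * r_{t-1}^(-gamma) on
    # r_{t-1}^(-gamma) and r_{t-1}^(1-gamma).
    w = lag ** (-gamma_hat)
    bw, *_ = np.linalg.lstsq(X * w[:, None], dr * w, rcond=None)
    return {"alpha": bw[0], "beta": -bw[1],
            "gamma": gamma_hat, "sigma": float(np.sqrt(sigma2_hat))}

r = simulate_yields(alpha=0.1, beta=0.02, sigma=0.1, gamma=0.5, T=20000, r0=5.0, seed=1)
estimates = three_step(r)
print(estimates)
```

With a long simulated sample the estimate of $\gamma_1$ should settle near the value used to generate the data; in short samples, and when $\beta_1$ is close to zero, the estimates can be expected to be much noisier.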
Their estimator would coincide with the one described above if the last moment condition was replaced by $E(r_{t-1}^{2\gamma_1}v_t) = 0$. A potential problem with all the estimators is that, if $\beta_1$ is likely to be close to zero, the regressors in (2) and (4) will be close to I(1), and so non-standard distribution theory almost certainly applies to the GMM estimator.

Table 4 presents estimates of the parameters $\alpha_1$, $\beta_1$ and $\gamma_1$ found by using three estimation methods. The first one is based on estimating the diffusion with an Euler approximation,

    $\Delta r_{th} = \alpha_1 h - \beta_1 h\,r_{(t-1)h} + \sigma h^{1/2} r_{(t-1)h}^{\gamma_1}\varepsilon_t$ ,    (5)
with $h = 1$. It is the estimator described above as GMM.

Table 4. Estimates of diffusion process parameters (asymptotic t-ratios in parentheses)

                      r_t(1)          r_t(3)          r_t(6)          r_t(9)          r_t(120)
GMM      $\alpha_1$   .106 (2.19)     .090 (1.82)     .089 (1.80)     .091 (1.87)     .046 (1.77)
         $\beta_1$    .020 (1.52)     .015 (1.24)     .015 (1.25)     .015 (1.31)     .006 (.98)
         $\gamma_1$   1.351 (6.73)    1.424 (5.61)    1.532 (4.99)    1.516 (5.12)    1.178 (9.80)
MLE      $\alpha_1$   .071 (2.17)     .047 (2.48)     .041 (3.00)     .037 (3.82)     .015 (3.74)
         $\beta_1$    .012 (.74)      .007 (7.89)     .005 (2.08)     .004 (1.08)     .001 (2.35)
         $\gamma_1$   .583 (2.39)     .648 (1.92)     .694 (2.31)     .753 (3.34)     1.136 (19.30)
EGARCH   $\alpha_1$   .107 (1.63)     .044 (1.89)     .045 (4.34)     .043 (1.67)     .009 (3.30)
         $\beta_1$    .004 (2.15)     .010 (1.57)     .008 (2.24)     .008 (2.04)     .009 (2.36)
         $\gamma_1$   .838 (2.67)     .974 (5.73)     .947 (7.21)     .941 (3.09)     1.104 (4.88)

4 Frydman (1994) argues that the distribution of the MLE of $\beta_1$ is non-standard when $\gamma_1 = 1/2$ and there is no drift.

The others stem from the modern approach of indirect estimation proposed by Gouriéroux et al. (1993)
and Gallant and Tauchen (1992). In these methods one simulates $K$ multiple sets of observations $\tilde r_{th}^{k}$ ($k = 1, \ldots, K$) from (5), with given values of $h$ (we use 1/100) and $\theta' = (\alpha_1\ \beta_1\ \gamma_1\ \sigma^2)$, and then finds the estimates of $\theta$ that set $\sum_{t=1}^{T}\{K^{-1}\sum_{k=1}^{K} d_\phi(\tilde r_{th}^{k}; \hat\phi)\}$ to zero, where $\hat\phi$ is an estimator of the parameters of some auxiliary model found by solving $\sum_{t=1}^{T} d_\phi(r_t; \hat\phi) = 0$.⁵ The logic of the estimator is that, if the model (5) is true, then $\hat\phi \to \phi^*$, where $E[d_\phi(r_t; \phi^*)] = 0$, and the term in curly brackets estimates this expectation by simulation. Consistency and asymptotic normality of the indirect estimator follow from the properties of $\hat\phi$ under misspecification. It is important to note that the auxiliary model need not be correct, but it should be a good representation of the data, otherwise the indirect estimator will be very inefficient. We use two auxiliary models and, in each instance, $d_\phi$ are the scores for $\phi$ from those models. The first is (5) with $h = 1$ and $\varepsilon_t$ being assumed n.i.d.(0, 1) (MLE), while the second has $r_t$ being an AR(1) with EGARCH(1,1) errors.

The visual evidence of Figure 1 is strongly supported by the estimated parametric models, although there is considerable diversity in the estimates obtained. Perhaps the most interesting aspect of the table is the fact that $\gamma_1$ tends to increase with maturity. Based on the evidence from the indirect estimators, $\gamma_1 = 1/2$ seems a reasonable choice for the shortest maturity, which would correspond to the diffusion process used by Cox et al. (1985).

A problem in simply fitting a model with a "levels" effect is that the observed conditional heteroskedasticity in the data might be better accounted for by a GARCH process, and so the appropriate questions should either be whether there is evidence of a levels effect after removing a GARCH process, or whether a levels representation fits the data better than a GARCH model does.
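The simulation step of the indirect estimator can be illustrated with a short sketch. It draws $K$ paths from (5) on a grid with $h = 1/100$, keeps the values at unit intervals, and averages the summed scores of an auxiliary model over the simulated paths. The auxiliary model used here is a Gaussian AR(1) with constant variance, chosen only to keep the example short; it is not one of the two auxiliary models used for Table 4. The function names, the parameter ordering (with $\sigma$ rather than $\sigma^2$) and the parameter values are all assumptions for illustration.

```python
import numpy as np

def simulate_paths(theta, T, K, h=0.01, r0=5.0, seed=0):
    """Simulate K paths of (5) with step h; return values at t = 0, 1, ..., T."""
    alpha, beta, gamma, sigma = theta        # assumed ordering; sigma, not sigma^2
    rng = np.random.default_rng(seed)
    n_sub = int(round(1.0 / h))              # Euler sub-steps per unit interval
    r = np.full(K, float(r0))
    paths = np.empty((K, T + 1))
    paths[:, 0] = r
    for t in range(1, T + 1):
        for _ in range(n_sub):
            eps = rng.standard_normal(K)
            r = r + alpha * h - beta * h * r + sigma * np.sqrt(h) * r ** gamma * eps
            r = np.maximum(r, 1e-8)          # guard against non-positive yields
        paths[:, t] = r
    return paths

def ar1_scores(r, phi):
    """Summed scores of a Gaussian AR(1): r_t = c + rho * r_{t-1} + e_t, e_t ~ N(0, s2)."""
    c, rho, s2 = phi
    e = r[1:] - c - rho * r[:-1]
    return np.array([e.sum() / s2,
                     (e * r[:-1]).sum() / s2,
                     ((e ** 2 / s2 - 1.0) / (2.0 * s2)).sum()])

def fit_ar1(r):
    """Auxiliary estimator: the phi that sets the summed scores on the data to zero."""
    X = np.column_stack([np.ones(len(r) - 1), r[:-1]])
    b, *_ = np.linalg.lstsq(X, r[1:], rcond=None)
    e = r[1:] - X @ b
    return np.array([b[0], b[1], e @ e / len(e)])

def simulated_score(theta, phi_hat, T, K, seed=0):
    """Average over k of sum_t d(r~_t^k; phi_hat); the indirect estimator sets this to zero."""
    paths = simulate_paths(theta, T, K, seed=seed)
    return np.mean([ar1_scores(p, phi_hat) for p in paths], axis=0)

theta0 = (0.1, 0.02, 0.5, 0.1)
data = simulate_paths(theta0, T=200, K=1, seed=3)[0]   # stand-in for observed yields
phi_hat = fit_ar1(data)
print(simulated_score(theta0, phi_hat, T=200, K=5, seed=1))
```

In an actual application one would pass `simulated_score` to a numerical optimiser and search over $\theta$ for the value that brings a quadratic form in these averaged scores as close to zero as possible.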
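To shed some light on these questions, our strategy was to fit augmented EGARCH(1,1) models to $\Delta r_t(\tau) = \mu + \sigma_{\tau t}\varepsilon_{\tau t}$, $\varepsilon_{\tau t} \sim \mathcal{N}(0, 1)$, of the form

    $\log\sigma_{\tau t}^2 = a_{0\tau} + a_{1\tau}\log\sigma_{\tau t-1}^2 + \ldots$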
This specification is used to generate a diagnostic test for the presence of a levels effect, and is not intended to be a good representation of the actual volatility. Hence the t-statistic for testing if $\delta$ is zero can be regarded as a valid test for more general specifications, e.g. $\delta g(r_{t-1}(\tau))$, where $g(\cdot)$ is some function, provided $r_{t-1}(\tau)$ is correlated with $g(r_{t-1}(\tau))$. Table 5 gives the estimates of $\delta$ and the
Table 5. $\hat\delta$ and t-ratios for levels effect

                 r_t(1)   r_t(3)   r_t(6)   r_t(9)   r_t(120)
$\hat\delta$     .050     .025     .023     .021     .019
t-ratio          3.73     3.51     3.42     3.04     2.42
5 A Milshtein (1974) rather than Euler approximation of (5) was also tried, but there were very minor differences in the results.
associated t-ratios. Every yield displays a levels effect, although with the 10 year maturity it seems weaker.⁶ The same conclusion applies to the spreads between forward rates, $Fp_t(\tau, \tau - 1)$. Fitting EGARCH(1,1) models to these series for $\tau$ = 1, 3, 6 and 9 months maturity, and allowing the levels effect to be a function of $F_{t-1}(\tau)$, the t-ratios for the hypothesis that this coefficient was zero were 3.85, 3.72, 17.25 and 12.07 respectively.

A number of studies have appeared that look at this phenomenon. Apart from our own work, Chan et al. (1992), Broze et al. (1993), Koedijk et al. (1994), and Brenner et al. (1994) have all considered the question, while Vetzal (1992) and Kearns (1993) have tried to allow for stochastic volatility, i.e. $\sigma_t^2$ is not only a function of the past history of yields. To date no formal comparison of the different models is available, unlike the situation for stock returns, e.g. Gallant et al. (1994). All studies find strong evidence for a levels effect on volatility. Brenner et al. provide ML estimates of the parameters of a discretized joint GARCH/levels model in which the volatility function, $\sigma_t^2$, is the product of a GARCH(1,1) process and a levels effect, i.e. $\sigma_t^2 = (a_0 + a_1\sigma_{t-1}^2\varepsilon_{t-1}^2 + a_2\sigma_{t-1}^2)\,r_{t-1}^{2\gamma}$. The estimated value of $\gamma$ falls to around .5, but remains highly significant. Koedijk et al. (1993) have a similar formulation except that $\sigma_t^2$ is driven by $\varepsilon_{t-1}^2$ rather than $\sigma_{t-1}^2\varepsilon_{t-1}^2$. Again $\gamma$ is reduced but remains highly significant.

One might question the use of conventional significance levels for the "raw" t-ratios, owing to the fact that one of the regressors is a near-integrated process.
To examine the effects of this we simulated data from an estimated model, equation (6) for $r_t(1)$, treating the estimates obtained by MLE estimation as the true parameter values, and then found the distribution of the t-ratio for the hypothesis that $\delta = 0$ using the MLE, constructed by taking one step from the true values of the coefficients (this would be a simulation of the asymptotic distribution). The results indicate that the distribution of the t-ratio has fatter tails than the normal with critical values for two tailed tests of 2.90 (5%) and 2.41 (10%), but use of these would not change the decisions.
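The sensitivity of the Dickey-Fuller statistic to a levels effect, noted in footnote 6, can be explored with a small Monte Carlo experiment of the same kind. The sketch below simulates $\Delta r_t = .001 + .01\,r_{t-1}^{\gamma}\varepsilon_t$ and tabulates lower-tail quantiles of the DF t-ratio from a regression that includes an intercept. The sample size, starting value and number of replications are not reported in the text and are assumptions here, so the output should not be expected to reproduce the published critical values.

```python
import numpy as np

def df_tstat(r):
    """t-ratio on r_{t-1} in the regression of dr_t on a constant and r_{t-1}."""
    dr, lag = np.diff(r), r[:-1]
    X = np.column_stack([np.ones_like(lag), lag])
    b, *_ = np.linalg.lstsq(X, dr, rcond=None)
    resid = dr - X @ b
    s2 = resid @ resid / (len(dr) - 2)
    var_b = s2 * np.linalg.inv(X.T @ X)
    return b[1] / np.sqrt(var_b[1, 1])

def simulate_df_stats(gamma, T=200, n_rep=1000, r0=1.0, seed=0):
    """DF t-ratios when dr_t = .001 + .01 * r_{t-1}^gamma * e_t (T, r0, n_rep assumed)."""
    rng = np.random.default_rng(seed)
    r = np.empty((n_rep, T + 1))
    r[:, 0] = r0
    for t in range(T):
        eps = rng.standard_normal(n_rep)
        r[:, t + 1] = r[:, t] + 0.001 + 0.01 * np.abs(r[:, t]) ** gamma * eps
    return np.array([df_tstat(path) for path in r])

for g in (0.0, 1.0):
    stats = simulate_df_stats(g)
    print(g, np.round(np.quantile(stats, [0.01, 0.025, 0.05]), 2))
```

How far the two sets of quantiles differ depends on how much the level of the series varies within a sample, which is governed by the assumed starting value and sample length.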
2.2. Multivariate properties

2.2.1. The level of the yield curve

As was mentioned in the introduction a great deal of work on the term structure views yields as being driven by a set of $M$ factors

    $y_t = \sum_{j=1}^{M} b_j\xi_{jt}$ ,    (7)

6 It is interesting to observe that the distribution of the Dickey-Fuller test is very sensitive to whether there is a levels effect or not. To see this we simulated a model in which $\Delta r_t = .001 + .01\,r_{t-1}^{\gamma}\varepsilon_t$, where $\varepsilon_t$ is n.i.d.(0, 1) and $\gamma$ either took the value of zero or unity. A small drift was added, although its influence upon the distribution is likely to be small. The simulated critical values for 1%, 2.5% and 5% significance levels when $\gamma$ = 0, 1 are (3.14, 6.41), (2.71, 4.97) and (2.39, 4.03) respectively. Clearly, the presence of a levels effect in volatility means that the critical values are much larger (in absolute terms), strengthening the claim that Table 1 suggests a unit root in yields.
and it is important to investigate whether this is a reasonable characterization of the data. It is useful here to recognise that the modern econometrics literature on multivariate relations admits just such a parameterization. Suppose the yields are collected into an $(n \times 1)$ vector $y_t$ and that it is assumed that $y_t$ can be represented as a VAR. Then, if $y_t$ are I(1) and, in the $n$ yields there are $k$ cointegrating vectors, Stock and Watson (1988) showed this to mean that the yields can be described in the format

    $y_t = J\xi_t + u_t$
    $\xi_t = \xi_{t-1} + v_t$ ,    (8)
where $\xi_t$ are the $n - k$ common trends to the system, and $E_{t-1}v_t = 0$. The format (8) is commonly referred to as the Beveridge-Nelson-Stock-Watson (BNSW) representation. If there are $(n - 1)$ cointegrating vectors, there will be a single common factor, $\xi_{1t}$, that determines the level of the yields. How the yields relate to one another is governed by $y_t - J\xi_{1t} = u_t$, i.e. the yield curve is a function of $u_t$.

Johansen's (1988) tests for the number of cointegrating vectors may be applied to the data described earlier. Table 6 provides the two most commonly used, the maximum eigenvalue (Max) and trace (Tr) tests, for the five yields under investigation, and assuming a VAR of order one.⁷ From this table there appear to be four cointegrating vectors, i.e. a single common trend. Johnson (1994), Engsted and Tanggaard (1994) and Hall et al. (1992) reach the same conclusion. Zhang (1993) argues that there are three common trends but Johnson shows that this is due to Zhang's use of a mixture of yields from zero and non-zero coupon bonds.

What is the common trend? There is no unique answer to this. One solution is to find a yield that is determined outside of the system, as that will be the driving force. For a small country, that rate is likely to be the "world interest rate", which in practice either means a Euro-Dollar rate or some combination of the US, German and Japanese interest rates. Another candidate for the common trend is
Table 6. Tests for cointegration amongst yields

                   Max      Crit. value (.05)   Tr       Crit. value (.05)
5 vs 4 trends      273.4    33.5                586.9    68.5
4 vs 3 trends      184.7    27.1                313.5    47.2
3 vs 2 trends      95.6     21.0                128.8    29.6
2 vs 1 trends      30.9     14.1                33.3     15.4
1 vs 0 trends      2.4      3.8                 2.4      3.8
7 Changing this order to four does not affect any conclusions, but restricting it to unity fits in better with the theoretical discussion.
the simple average of the rates. In any case we will take this to be the first factor $\xi_{1t}$ in (7).
2.2.2. The shape of the yield curve

The existence of $k$ cointegrating vectors $\alpha$ ($\alpha$ is an $(n \times k)$ matrix), such that $\zeta_t = \alpha' y_t$ is I(0), means that any VAR in $y_t$ has the ECM format

    $\Delta y_t = \gamma\zeta_{t-1} + D(L)\Delta y_{t-1} + e_t$ ,    (9)

where $E_{t-1}(e_t) = 0$ and $D(L)$ is a polynomial in the lag operator. It is also possible to show that $u_t$ in (8) can be written as a function of the $k$ EC (error correction) terms $\zeta_t$ and this suggests that we might take these to be the remaining factors $\xi_{jt}$ ($j = 2, \ldots, K$) in (7).

To make the following discussion more concrete assume that the expectations theory of the term structure holds, i.e. a $\tau$ period yield is the weighted average of the expected one period yields into the future. In the case of discount bonds the weights are equal to $1/\tau$ so that the theory states

    $r_t(\tau) = \frac{1}{\tau}\sum_{k=0}^{\tau-1} E_t(r_{t+k}(1))$ .

Of course this is an hypothesis, albeit one that seems quite sensible. It implies that

    $r_t(\tau) - r_t(1) = \frac{1}{\tau}\sum_{k=0}^{\tau-1}\{E_t r_{t+k}(1) - E_t r_t(1)\} = \frac{1}{\tau}\sum_{k=1}^{\tau-1}\sum_{j=1}^{k} E_t\Delta r_{t+j}(1)$ ,

so that, if yields are I(1), the spreads are I(0) and the cointegrating vectors are

    $\alpha' = \begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 1 & 0 & 0 \\ -1 & 0 & 0 & 1 & 0 \\ -1 & 0 & 0 & 0 & 1 \end{bmatrix}$ .    (10)
Johansen's (1988) test for this gives a $\chi^2(4)$ of 36.8, leading to a very strong rejection of the hypothesis. Such an outcome has also been observed by Hall et al. (1992), Johnson (1994) and Engsted and Tanggaard (1994). A number of possible explanations for the rejection were canvassed in those papers, involving the size of the test statistic being incorrect etc. One's inclination is to examine the estimated matrix of cointegrating vectors given by Johansen's
procedure, $\hat\alpha$, and to see how closely these correspond to the hypothesized values but, unfortunately, the vectors are not unique and the estimated quantities will always be linear combinations of the true values. Some structural information is needed to recover the latter, and to this end we write $\alpha' = Aa$, where

    $a = \begin{bmatrix} -\beta_3 & 1 & 0 & 0 & 0 \\ -\beta_6 & 0 & 1 & 0 & 0 \\ -\beta_9 & 0 & 0 & 1 & 0 \\ -\beta_{120} & 0 & 0 & 0 & 1 \end{bmatrix}$ ,

and then proceed to solve the equations $\hat\alpha' = \hat A a$, where $\hat A$ is some nonsingular matrix. This produces $\hat\beta_3 = 1.038$, $\hat\beta_6 = 1.063$, $\hat\beta_9 = 1.075$ and $\hat\beta_{120} = 1.076$, which indicates that the point estimates are quite close to those predicted by the expectations theory. It is also possible to estimate the $\beta_j$ by "limited information" rather than "full-information" methods. To that end the Phillips-Hansen (1990) estimator was adopted with a Parzen kernel and eight lags being used to form the long-run covariance matrices, producing $\hat\beta_3 = 1.021$, $\hat\beta_6 = 1.034$, $\hat\beta_9 = 1.034$ and $\hat\beta_{120} = .91$. With the exception of the 10 year rate, neither set of estimates seems to be greatly divergent from that predicted.

Some insight into why the rejection occurs may be had from (9). Given that cointegration has been established, and working with a VAR(1), i.e. $D(L) = 0$ in (9), the change in each yield should be governed by
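    $\Delta r_t(\tau) = \sum_{j=2}^{5}\gamma_{\tau j}\,(r_{t-1}(j) - \beta_j r_{t-1}(1)) + e_{\tau t}$ ,    (11)

where $j = 2, \ldots, 5$ maps one to one into the elements $\tau$ = 3, 6, 9, 120. If the expectations theory is valid $\beta_j = 1$ and the system becomes

    $\Delta r_t(\tau) = \sum_{j=2}^{5}\gamma_{\tau j}\,(r_{t-1}(j) - r_{t-1}(1)) + e_{\tau t}$ ,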
and the hypothesis $H_0: \beta_j = 1$ can be tested by computing the likelihood ratio statistic. It is well known that such a test will be distributed as a $\chi^2(4)$ under the null hypothesis. If the yields were taken to be I(0), the simplest way to test if $\beta_j = 1$ would be to rewrite (11) as

    $\Delta r_t(\tau) = \sum_{j=2}^{5}\gamma_{\tau j}\,(r_{t-1}(j) - r_{t-1}(1)) + \Big[\sum_{j=2}^{5}\gamma_{\tau j}(1 - \beta_j)\Big] r_{t-1}(1) + e_{\tau t}$ ,    (12)

and to test if the coefficient of $r_{t-1}(1)$ in each of the equations for $\Delta r_t(\tau)$ was zero. For a number of reasons this does not reproduce the $\chi^2(4)$ test cited above: there are five coefficients being tested and $r_{t-1}(1)$ will be I(1), making the distribution non-standard. Nevertheless, the separate single equation tests might still be informative. In this case the t-values for the hypothesis that $r_{t-1}(1)$ has a zero coefficient in each
equation were 4.05, 1.77, .72, .24 and .55 respectively, suggesting that the rejection of (10) lies in the behaviour of the one month rate, i.e. the spreads are not capable of fully accounting for its movement. Engsted and Tanggaard (1994) also reach this conclusion. It may be that $r_{t-1}(1)$ is proxying for some omitted variable, and the literature has in fact canvassed the possibility of non-linear effects upon the short-term rate. Anderson (1994) makes the influence of spreads upon $\Delta r_t(1)$ non-linear, while Pfann et al. (1994) take the process driving $r_t(1)$ to be a non-linear autoregression; in particular, the latter allow for a number of regimes according to the magnitude of $r_t(1)$, with some of these regimes featuring I(1) behaviour of the rate while others do not. Another possibility, used in Conley et al. (1994), is that the "drift term" in a continuous time model has the form $\sum_{j=-m}^{m} a_j r_t^{j}$ and this would induce a nonlinearity into the relation between $\Delta r_t$ and $r_{t-1}$.

Instead of a misspecification in the mean, rejection of (10) may be due to levels effects in $e_{\tau t}$. As noted earlier, the Dickey-Fuller test critical values are very sensitive to this effect, and the test that $r_{t-1}(1)$ has a zero coefficient in the $\Delta r_t(1)$ equation in (12) is actually an ADF test, if the augmenting variables are taken to be the spreads. This led us to produce a small Monte Carlo simulation of Johansen's test for (10) under different assumptions about levels effects in the errors of the VAR. The example is a simplified version of the system above featuring only two variables $y_{1t}$ and $y_{2t}$ with cointegrating vector $[1 \;\; -1]$, and being generated from the vector ECM,

    $\Delta y_{1t} = -.8\,(y_{1t-1} - y_{2t-1}) + .1\,y_{1t-1}^{\gamma}\varepsilon_{1t}$
    $\Delta y_{2t} = .1\,(y_{1t-1} - y_{2t-1}) + .1\,y_{2t-1}^{\gamma}\varepsilon_{2t}$ .
The 95% critical value for Johansen's test that the cointegrating vector is the true one varies according to the value of $\gamma$: 3.95 ($\gamma = 0$), 4.86 ($\gamma = .5$), 5.87 ($\gamma = .6$), 11.20 ($\gamma = .8$), and 23.63 ($\gamma = 1$). Clearly, there is a major impact of the levels effect upon the sampling distribution of Johansen's test, and the phenomenon needs much closer investigation, but it is conceivable that rejection of (10) may just be due to the use of critical values that are too small.

Even if one rejects the cointegrating vectors predicted by the expectations theory, the evidence is still that there are $k = n - 1$ error correction terms. It is natural to equate the remaining $M - 1$ factors in (7) (after elimination of the common trend) with these EC terms, but this is not very helpful, as it would mean that $M = n$, i.e. the number of factors would equal the number of yields. Hall et al. (1992) provide an example of forecasting the term structure using the ECM relation (9), imposing the expectations theory cointegrating vectors to form $\zeta_t$, and then regressing $\Delta y_t$ on $\zeta_{t-1}$ and any lags needed in $\Delta y_t$. Hence their model is equivalent to using a single factor, the common trend, to forecast the level, and $(n - 1)$ factors to forecast the slope (the EC or spread terms). In practice however they impose the feature that some of the coefficients in $\gamma$ were zero, i.e. the number of factors determining the yield varies with the maturity being examined. It is interesting to note that their representation for $\Delta r_t(4)$ has no EC terms, i.e. it is
effectively determined outside the system and plays the role of the "world interest rate" mentioned earlier.

In an attempt to reduce the number of non-trend factors below $n - 1$, it is tempting to assume that (say) only $m = M - 1$ of the $(n - 1)$ terms in $\zeta_t$ appear as determinants of $\Delta r_t(\tau)$ and that these constitute the requisite common factors, but such a restriction would necessitate $m$ of the columns of $\gamma$ being zero, thereby violating the rank condition, $\rho(\gamma) = n - 1$. Consequently the factors will need to be combinations of the EC terms. Now, premultiplying (7) by $\alpha'$ gives

    $\alpha' y_t = \alpha'\sum_{j=1}^{M} b_j\xi_{jt}$ ,    (13)

where $b_j' = [\beta_{j1} \ldots \beta_{jn}]$ is a $1 \times n$ vector. If we designate the first factor as the common trend then it must be that $\alpha' b_1 = 0$ as the LHS is I(0) by construction, meaning that

    $\zeta_t = \alpha'\sum_{j=2}^{K} b_j\xi_{jt} = \alpha' B\,\Xi_t$ ,    (14)

where $\Xi_t$ is the $(K - 1) \times 1$ vector containing $\xi_{2t}, \ldots, \xi_{Kt}$, and $B$ is an $n \times (K - 1)$ matrix with $\rho(B) = K - 1$, where $\rho(\cdot)$ designates rank. Equation (14) enables us to draw a number of interesting conclusions. Firstly, $\rho[\mathrm{cov}(\zeta_t)] = \min[\rho(\alpha), \rho(B)]$, provided $\mathrm{cov}(\Xi_t)$ has rank $K - 1$. Since $K < n$ implies $K - 1 < n - 1$, it must be that $\rho(B) < \rho(\alpha)$, and therefore $\rho[\mathrm{cov}(\zeta_t)] = K - 1$, i.e. the number of factors in the term structure (other than the common trend) may be found by examining the rank of the covariance matrix of the cointegrating errors. Secondly, since $C = \alpha' B$ has $\rho(C) = K - 1$, $\Xi_t = (C'C)^{-1}C'\zeta_t$, and hence the factors will be linear combinations of the EC terms. Applying principal components to the data set composed of spreads $sp_t(3)$, $sp_t(6)$, $sp_t(9)$ and $sp_t(120)$, the eigenvalues of the covariance matrix are 4.1, .37, .02 and .002, pointing to the fact that these four spreads can be summarized very well by three components (at most).⁹ The three components are:
9 The principal components approach, or variants of it, has been used in a number of papers: Litterman and Scheinkman (1991), Dybvig (1989) and Egginton and Hall (1993). This technique finds linear combinations of the yields such that the variance of each combination is as large as possible. Thus the $i$'th principal component of $y_t$ will be $b_i' y_t$, where $b_i$ is a set of weights. Because one could always multiply through by a scale factor the $b_i$ are normalized, i.e. $b_i' b_i = 1$. With this restriction $b$ becomes the eigenvectors of $\mathrm{var}(y_t)$. Since $b_i$ is an eigenvector it is clear that $b'\mathrm{var}(y_t)b = \Lambda$, where $\Lambda$ is a diagonal matrix with the eigenvalues $(\lambda_1 \ldots \lambda_n)$ on it, and that $\mathrm{tr}[b'\mathrm{var}(y_t)b] = \sum_{i=1}^{n}\lambda_i$. It is conventional to order the components according to the magnitude of $\lambda_i$; the first principal component having the largest $\lambda_i$. There is a connection between principal components and common trends. Both seek linear combinations of $y_t$ and, in many cases, one of the components can be interpreted as the common trend, e.g. in Egginton and Hall (1993) the first component is effectively the average of the interest rates, which we have mentioned as a possible common trend earlier.
A. R. Pagan, A. D. Hall and V. Martin

    $\psi_{1t} = .32\,sp_t(3) - .86\,sp_t(6) - .37\,sp_t(9) + .17\,sp_t(120)$
    $\psi_{2t} = .78\,sp_t(3) + .00\,sp_t(6) - .55\,sp_t(9) + .29\,sp_t(120)$
    $\psi_{3t} = .54\,sp_t(3) + .52\,sp_t(6) - .58\,sp_t(9) + .37\,sp_t(120)$.
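As a quick check on how much a few components capture, the shares of total variance implied by the reported eigenvalues can be computed directly. The sketch below also shows, on artificial data, how principal components are obtained from the eigen-decomposition of a covariance matrix. The function names and the simulated data are illustrative assumptions; the spreads data themselves are not used.

```python
import numpy as np

def variance_shares(eigenvalues):
    """Share of total variance attributed to each principal component."""
    lam = np.asarray(eigenvalues, dtype=float)
    return lam / lam.sum()

def principal_components(data):
    """Eigenvalues (descending), unit-length weight vectors and component scores."""
    cov = np.cov(data, rowvar=False)
    lam, vecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, vecs = lam[order], vecs[:, order]
    scores = (data - data.mean(axis=0)) @ vecs
    return lam, vecs, scores

# Eigenvalues reported in the text for the four spreads.
shares = variance_shares([4.1, 0.37, 0.02, 0.002])
cumulative = np.cumsum(shares)
print(np.round(shares, 4))
print(np.round(cumulative, 4))

# Artificial data with four columns of very different scale (illustration only).
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 4)) @ np.diag([2.0, 1.0, 0.5, 0.1])
lam, vecs, scores = principal_components(data)
```

On the reported eigenvalues the first component accounts for roughly 91 per cent of the total variance and the first two together for more than 99 per cent.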
In this section we describe some popular ways of modelling the term structure. In order to assess whether these models are capable of replicating observed term structures, it is necessary to decide on some way to compare them to the data. There is a small literature wherein formal statistical tests have been performed on how well the models replicate the data in some designated dimension. Generally, however, the reasons for any rejection of the models remain unclear, as many characteristics are being tested at the one time. In contrast, this chapter uses the method of "stylized facts", i.e. it seeks to match up the predictions of the model with the nature of the data as summarized in Section 2. Thus, we look at whether the models predict that yields are near-integrated, have levels effects in volatility, exhibit specific cointegrating vectors, produce persistence in spreads, and would be compatible with two or (at most) three factors in the term structure.¹⁰
    $\max E_t\Big[\sum_{s=t}^{\infty} U(C_s)\beta^{s}\Big]$

where $\beta$ is a discount factor, and $C_s$ is consumption at time $s$. It is well known that a first order condition for this is

    $U'(C_t)v_t = E_t\{\beta^{s-t}U'(C_s)v_s\}$

where $v_t$ is the value of an asset (or portfolio) in terms of consumption goods. This can be rearranged to give

    $v_t = E_t\{\beta^{s-t}U'(C_s)v_s/U'(C_t)\}$ .    (15)

Assuming that the asset is a discount bond, and the general price level is fixed, consider setting $s = t + \tau$ giving $v_t = f_t(\tau)$. The solution of this equation will then
10 There are many other characteristics of these yields that we ignore in this paper but which are challenging to explain e.g. the extreme leptokurtosis in the density of the change in yields and in the spreads.
provide a complete set of discount bond prices for any maturity. It is useful to re-express (15) as

    $f_t(\tau) = E_t[\beta^{\tau}U'(C_{t+\tau})/U'(C_t)]$ ,    (16)

imposing the restriction that $f_t(t + \tau) = 1$, so as to find the price of a zero coupon bond paying \$1 at maturity. Hence the term structure would then be determined. If the price level is not fixed (16) needs to be modified to

    $f_t(\tau) = E_t[\beta^{\tau}P_tU'(C_{t+\tau})/(U'(C_t)P_{t+\tau})]$ ,    (17)

where $P_t$ is the price level at time $t$. There have been a few attempts to price bonds from (16) or (17). Canova and Marrinan (1993) and Boudoukh (1993) do this by assuming that $c_t = \log(C_{t+1}/C_t) - 1$ and $p_t = \log(P_{t+1}/P_t) - 1$ follow a VAR process with some volatility in the errors, and that the utility function has the CRRA form, $U(C_t) = C_t^{1-\gamma}/(1 - \gamma)$, where $\gamma$ is the coefficient of risk aversion.¹¹ It is necessary to evaluate (17) for the yield $r_t(\tau) = -\tau^{-1}\log f_t(\tau)$:

    $r_t(\tau) = -\frac{1}{\tau}\log E_t[\beta^{\tau}(C_{t+\tau}/C_t)^{-\gamma}(P_t/P_{t+\tau})]$
    $\phantom{r_t(\tau)} = -\log\beta - \frac{1}{\tau}\log E_t[(1 + c_{t\tau})^{-\gamma}(1 + p_{t\tau})^{-1}]$ ,

where

    $c_{t\tau} = C_{t+\tau}/C_t - 1 \approx \log C_{t+\tau} - \log C_t$
    $p_{t\tau} = P_{t+\tau}/P_t - 1 \approx \log P_{t+\tau} - \log P_t$ .

Expanding around $E_t(c_{t\tau})$ and $E_t(p_{t\tau})$, and ignoring all cross terms and terms of higher order than a quadratic,¹²

    $r_t(\tau) \approx -\log\beta - \frac{1}{\tau}\log\{(1 + E_t(c_{t\tau}))^{-\gamma}(1 + E_t(p_{t\tau}))^{-1} + a_{1\tau t}\,\mathrm{var}_t(c_{t\tau}) + a_{2\tau t}\,\mathrm{var}_t(p_{t\tau})\}$ ,    (18)

where

    $a_{1\tau t} = \tfrac{1}{2}\gamma(1 + \gamma)(1 + E_t(c_{t\tau}))^{-\gamma-2}(1 + E_t(p_{t\tau}))^{-1}$
    $a_{2\tau t} = (1 + E_t(p_{t\tau}))^{-3}(1 + E_t(c_{t\tau}))^{-\gamma}$
11 Canova and Marrinan actually use the Cambridge equation for the price level, $P_t = M_t/Y_t$, and so their VAR involves the growth in money, output and consumption.
12 The conditional covariance terms between $c_{t\tau}$ and $p_{t\tau}$ are ignored as one is a real and the other a nominal quantity and most general equilibrium models would make this zero. Boudoukh (1993) however argues that the conditional covariance is important for explaining the term structure.
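The quality of the second-order expansion behind (18) can be checked numerically. The sketch below compares $E[(1 + c)^{-\gamma}(1 + p)^{-1}]$, computed by Monte Carlo, with the approximation built from the weights $a_{1\tau t}$ and $a_{2\tau t}$. Independent normal draws for $c_{t\tau}$ and $p_{t\tau}$, and the particular means, standard deviations and risk aversion, are assumptions made purely for illustration.

```python
import numpy as np

def kernel(c, p, gamma):
    """(1 + c)^(-gamma) * (1 + p)^(-1), the term inside the expectation."""
    return (1.0 + c) ** (-gamma) / (1.0 + p)

def second_order_approx(mean_c, mean_p, var_c, var_p, gamma):
    """Quadratic expansion around the means, ignoring cross terms as in (18)."""
    a1 = 0.5 * gamma * (1.0 + gamma) * (1.0 + mean_c) ** (-gamma - 2.0) / (1.0 + mean_p)
    a2 = (1.0 + mean_p) ** (-3.0) * (1.0 + mean_c) ** (-gamma)
    return kernel(mean_c, mean_p, gamma) + a1 * var_c + a2 * var_p

# Assumed illustrative values (not taken from the chapter).
gamma = 2.0
mean_c, sd_c = 0.02, 0.02
mean_p, sd_p = 0.04, 0.03

rng = np.random.default_rng(0)
c = rng.normal(mean_c, sd_c, 1_000_000)
p = rng.normal(mean_p, sd_p, 1_000_000)

exact_mc = kernel(c, p, gamma).mean()              # simulated expectation
first_order = kernel(mean_c, mean_p, gamma)        # value at the means only
second_order = second_order_approx(mean_c, mean_p, sd_c ** 2, sd_p ** 2, gamma)
print(exact_mc, first_order, second_order)
```

For moderate variances the variance-weighted terms close most of the gap between the value at the conditional means and the simulated expectation, which is the sense in which volatility enters the yield in (18).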
(19)
where
Canova and Marrinan (1993) take $\sigma_{jt+1}^2 = \mathrm{var}_t(e_{jt+1})$ to be GARCH processes of the form

    $\sigma_{jt+1}^2 = a_{0j} + a_{1j}\sigma_{jt}^2 + a_{2j}e_{jt}^2$ ,

whereby the formulae in Baillie and Bollerslev (1992) can be used to evaluate $E_t(z_{jt\tau})$ and $\mathrm{var}_t(z_{jt\tau})$, while Boudoukh (1993) has $\sigma_{jt}^2$ as a stochastic volatility process. For GARCH models $\mathrm{var}_t(z_{jt\tau})$ is a linear function of $\sigma_{jt+1}^2$.

How well does this model perform in replicating the stylized facts of the term structure? To produce a near unit root in yields it is necessary that $\log(1 + E_t(p_{t\tau})) \approx E_t(p_{t\tau})$ be near integrated, i.e. inflation must be a near integrated process, as it is the only one of the two series that has such persistence in either mean or variance; see Boudoukh (1993) for a description of the time series properties of the two series. Then the inflation rate becomes the common trend in the term structure, and the spreads will depend upon consumption growth and the two volatilities. As there is rather weak evidence for much dependence in either inflation or consumption volatility (see the test statistics in Boudoukh) it is difficult to see the persistence in spreads being explained by these models.¹³ Whether a levels effect in $\Delta r_t(\tau)$ can be produced is unclear; the GARCH structures used by Canova and Marrinan will not produce it, but Boudoukh's stochastic volatility formulation does allow for a levels effect in $\mathrm{var}_t(p_t)$. Moreover, even if volatilities were constant, the conditional means enter the weights attached
13 Although Boudoukh finds much more in his estimated stochastic volatility specification than GARCH specifications.
to them, and this dependence might be used to induce a levels effect into $\Delta r_t(\tau)$. Whilst $E_t(c_{t\tau})$ is likely to be close to a constant due to the weak autocorrelation in consumption growth, there is strong serial correlation in inflation rates, and, with inflation as the common trend, it is conceivable that the requisite effect could be found in that way, although the question was not addressed by the authors.¹⁴

Another attempt at working within this framework is Constantinides (1992) who writes (17) as $f_t(\tau) = E_t[K_{t+\tau}/K_t]$, where $K_t = \beta^{t}U'(C_t)/P_t$ is referred to as a "pricing kernel". He then makes assumptions about the evolution of $K_t$, in particular that

    $K_t = \exp\Big\{-(g + \ldots)\,t + z_{0t} + \sum_{i=1}^{N}(z_{it} - \alpha_i)^2\Big\}$ .

He works in continuous time and makes $z_{0t}$ a Wiener process while the other $z_{it}$ are Ornstein-Uhlenbeck diffusion processes with parameters $\lambda_i$ and variances $\sigma_i^2$. Each of the $z_{it}$ is taken to be independent. Under these assumptions it turns out that

    $f_t(\tau) = \Big\{\prod_{i=1}^{N} H_i(\tau)\Big\}^{-1/2}\exp\{\ldots\}$ ,

where $H_i(\tau) = \sigma_i^2/\lambda_i + (1 - \sigma_i^2/\lambda_i)e^{2\lambda_i\tau}$. Consequently, $r_t(\tau)$ has the format

    $r_t(\tau) = \delta_{0\tau} + \sum_{i=1}^{N}\ldots$

Terms such as $(z_{it} - \alpha_i)^2$ reflect the fact that the "variance" of the change in $z_{it}$ of an Ornstein-Uhlenbeck process depends upon the level of the variable $z_{it}$.

Constantinides' model will have trouble producing the right outcomes. After converting to yields his model has no factor that would be I(1). The difficulty arises from his specification of the "pricing kernel". The pricing kernel used to evaluate (17) has an I(2) variable $P_t$ as it is the inflation rate which is I(1). Consequently it is the assumption implicitly made by Constantinides that the kernel is only I(1) through the presence of the term $z_{0t}$ which is the root of the problem with his model.
14 Essentially, these are "calibrated" models that emphasise the use of a highly specified theory to explain an observed phenomenon. Hence, one should really distinguish between the model prediction of yields, $r_t^{*}(\tau)$, and the observed outcomes, $r_t(\tau)$. The gap between the two variables is due to factors not captured within the model, or perhaps to specification errors. Examination of the characteristics of the gap may be very informative.
Finance theory has developed by working with factor models to determine the term structure. Common to the material just discussed is the use of models of an economy in which there is intertemporal optimization, but a notable difference is the introduction of a production sector and a concern with ensuring that the pricing formulae prohibit the possibility of arbitrage, i.e. the solution tends to be closer to a general rather than partial equilibrium solution. The basic work horse of the literature is the model due to Cox, Ingersoll and Ross (1985) (CIR). Essentially they propose an economy driven by a number of processes that affect the rate of return to assets, e.g. technological change and (possibly) an inflation factor. Dealing with the simplest case where there is just a single state vector, $\mu_t$, perhaps total factor productivity (TFP), it is assumed that this variable follows a diffusion process of the form

    $d\mu_t = (b - \mu_t)\,dt + \varphi\,\mu_t^{1/2}\,d\eta_t$ .

General equilibrium in asset markets for such an economy results in an expression for the instantaneous rate of interest of the form

    $dr_t = (\alpha - \beta r_t)\,dt + \sigma r_t^{1/2}\,d\eta_t$ .    (20)
Once one has the expression for the instantaneous rate the whole term structure $f_t(\tau)$ is priced according to a partial differential equation

    $\tfrac{1}{2}\sigma^2 r f_{rr} + (\alpha - \beta r)f_r + f_t - \lambda r f_r - r f = 0$ ,    (21)

where $f_{rr} = \partial^2 f/\partial r\partial r$, $f_r = \partial f/\partial r$, $f_t = \partial f/\partial t$ and the term $\lambda r f_r$, which depends upon the covariance of the change in the price of the factor with the percentage change in the optimal portfolio, is the "market price of risk" associated with that factor. This partial differential equation comes from the fact that a zero coupon riskless bond maturing at $t + \tau$ must be valued at

    $f_t(\tau) = E_t\Big[\exp\Big(-\int_t^{t+\tau} r(\psi)\,d\psi\Big)\Big]$ .    (22)
Since the expected rate of change of the price of the bond is given by $r + \lambda r f_r/f$, it also can be interpreted as a liquidity premium. It is clear that we could group together the terms $(\alpha - \beta r)f_r$ and $-\lambda r f_r$ and treat the problem as one of pricing an asset using a "hypothetical" instantaneous rate that is generated by

    $dr_t = (\alpha - \beta r_t - \lambda r_t)\,dt + \sigma r_t^{1/2}\,d\tilde\eta_t$
    $\phantom{dr_t} = (\alpha - \gamma r_t)\,dt + \sigma r_t^{1/2}\,d\tilde\eta_t$ .    (23)

The distinction is between the true probability measure in (20) and the "equivalent martingale measure" in (23).
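The analytic solution for the term structure in the CIR model is then (see Cox et al. (p. 393))
$$f_t(\tau) = A_1(\tau)\exp(-B_1(\tau) r_t),$$
where, with $\gamma = \beta + \lambda$,
$$A_1(\tau) = \left[\frac{2\delta\exp((\delta+\gamma)\tau/2)}{(\delta+\gamma)(\exp(\delta\tau)-1)+2\delta}\right]^{2\alpha/\sigma^2}, \qquad B_1(\tau) = \frac{2(\exp(\delta\tau)-1)}{(\delta+\gamma)(\exp(\delta\tau)-1)+2\delta},$$
and $\delta = (\gamma^2 + 2\sigma^2)^{1/2}$. Converting to a yield,
$$r_t(\tau) = \tau^{-1}\left[-\log A_1(\tau) + B_1(\tau) r_t\right]. \tag{24}$$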
This is a single factor model with the instantaneous rate or, more fundamentally, the "returns" factor, driving the whole term structure, i.e. the level of the term structure depends on the value of $r_t$ at any point in time. The slope of the yield curve depends upon the parameters of the diffusion equation as well as the market price of risk. Perhaps the biggest problem with this methodology is that it will never exactly reproduce an observed yield curve. This bothers practitioners a lot. One response has been to allow $\alpha$ to change according to $\tau$ and $t$. What this does is to add "fudge factors" to the model based yield curve so that the modified curve equals the observed yield structure. Then, after forecasting $r_{t+1}$ and finding the predicted term structure, the "fudge factors" from the previous period are added on. The need for "fudge factors" suggests that there is substantial misspecification in the CIR model as a description of the term structure, just as "intercept corrections" in macroeconometric models were given such an interpretation. Brown and Dybvig (1986) estimated the parameters of the CIR model by maximum likelihood and then computed the residuals defined by the gap between the observed bond prices ($f_t$) and the predictions of the model ($f_t^*$). Examination of the residuals pointed to specification errors in the model.$^{15}$ Looking at the CIR model in the light of stylized facts, the data should possess the characteristic that interest rates are near-integrated processes and possibly cointegrated, with cointegrating vectors between any pair of rates of $[1\ \ {-1}]$, i.e. the spreads should be I(0). The question that arises is whether the CIR model would deliver such a prediction. One problem to be overcome is quantifying the market price of risk, $\lambda$, in the CIR bond formulae. As CIR point out, $\lambda = 0$ if the factor had no effect on the real economy, e.g. if it was some nominal quantity such as the inflation rate.
Accordingly, we will adopt this interpretation, allowing us to set $\lambda = 0$. To induce a unit root we set $\beta = 0$, and we also put the drift term $\alpha = 0$. This makes
$$\delta = \sqrt{2}\,\sigma, \qquad A_1(\tau) = 1, \qquad B_1(\tau) = \frac{2(\exp(\delta\tau)-1)}{\delta(\exp(\delta\tau)+1)}.$$
15 Since there are $n$ yields but only one factor, they needed to add on a vector of errors to the model to produce a nonsingular covariance matrix for $f_t^*$, in order to be able to form a likelihood. It may be that the misspecification reflects the assumptions made in this step.
Now the spread $sp_t(\tau)$ will be
$$r_t(\tau) = d(\tau)\, r_t(1),$$
where $d(\tau) < 1$ and decreasing with $\tau$. As seen in Section 2, with one exception both the Johansen and Phillips-Hansen estimates of $d(\tau)$ have $d(\tau) > 1$ and
increasing in $\tau$. The predictions from CIR type models are therefore diametrically opposed to the data.$^{16}$
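The claim that $d(\tau) < 1$ and falls with maturity can be checked numerically. The sketch below assumes that, with $\alpha = \beta = \lambda = 0$, the CIR loading reduces to $B_1(\tau) = 2(e^{\delta\tau}-1)/(\delta(e^{\delta\tau}+1))$ with $\delta = \sqrt{2}\sigma$, so that the yield is $\tau^{-1}B_1(\tau) r_t$ and $d(\tau) = \tau^{-1}B_1(\tau)/B_1(1)$. The function name `d_ratio`, the maturity grid and the value of `sigma` are illustrative assumptions only.

```python
import numpy as np


def d_ratio(tau, sigma):
    """d(tau) = [B1(tau)/tau] / B1(1) in the polar CIR case alpha=beta=lambda=0."""
    delta = np.sqrt(2.0) * sigma

    def b1(m):
        e = np.exp(delta * m)
        return 2.0 * (e - 1.0) / (delta * (e + 1.0))

    return (b1(tau) / tau) / b1(1.0)


# Hypothetical maturities, measured in the same units as the one-period rate.
maturities = np.array([3.0, 6.0, 9.0, 60.0, 120.0])
ratios = d_ratio(maturities, sigma=0.05)
```

Since $B_1(\tau)$ is proportional to $\tanh(\delta\tau/2)$, which is concave, $B_1(\tau)/\tau$ must decline with $\tau$; the computed `ratios` therefore lie below one and decrease along the grid for any positive `sigma`.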
where $d\eta_{jt}$ are independent, thereby making each factor independent. Then the solution for the bond price is
$$f_t(\tau) = A_1(\tau) A_2(\tau) \exp\{-B_1(\tau)\xi_{1t} - B_2(\tau)\xi_{2t}\},$$
where $A_2$ and $B_2$ are defined analogously to $A_1$ and $B_1$. Obviously this framework could be extended to encompass any number of factors, provided they are assumed to be independent. Another method is that of Longstaff and Schwartz (1992), who also have two factors, but these are related to the underlying rate of return process $\mu_t$ rather than directly to the instantaneous rate. In particular they wish to have the two factors being linear combinations of the instantaneous rate and its conditional variance. The model is interesting because the second factor they use, $\xi_{2t}$, affects only the conditional variance of the $\mu_t$ process, whereas both factors affect the conditional mean. This is unlike Chen and Scott's model, which has $\xi_{1t}$ and $\xi_{2t}$ affecting both the mean and variance. Empirically, the two factors are regarded as the short term rate and its conditional volatility, where the latter is estimated by a GARCH
16 Brown and Schaefer (1994) find that the CIR model closely fits the term structure of real yields, where these are computed from British government index-linked bonds. Note in constructing the Johansen and Phillips-Hansen estimators that an intercept was allowed into the relations in order to correspond to $A(\tau)$.
process when assessing the quality of the model.$^{17}$ Tests of the model are limited to how well it replicates the unconditional standard deviations of yield changes. There are a number of other two factor models. Brennan and Schwartz (1979) and Edmister and Madan (1993) begin with the long and short rates following a joint diffusion process. After imposing the "no arbitrage condition" and assuming that the long rate is a traded instrument, Brennan and Schwartz find that the price of the instantaneous risk associated with the long rate can be eliminated, and the two factors then effectively become the instantaneous rate and the yield spread between that rate and the long rate. Eliminating the price of risk for the long rate makes the model nonlinear and they need to linearize to find a solution. Even then there is no analytical solution for the yield curve as with CIR. Another possibility for a two factor model might be to allow for stochastic volatility as a factor. Edmister and Madan find closed form solutions for the term structure in their formulation. Suppose that the first factor in Chen and Scott's model is a "near I(1)" process whereas the second factor is I(0). Then the instantaneous rate has the common trend format (compare (25) and (8), recognising that $J$ can be regarded as the unit column vector). Using the same parameter values for the first factor as the polar case discussed in the preceding subsection, i.e. $\beta_1 = 0$, $\lambda_1 = 0$, $\sigma_1 = 0$, the first factor disappears from the spreads, which now equal
$$r_t(\tau) - r_t(1) = \log(A_2(1)/A_2(\tau)) + [\tau^{-1}B_2(\tau) - B_2(1)]\,\xi_{2t}.$$
Hence, they are now stochastic and inherit the properties of the second factor. For them to be persistent, it is necessary that the second factor have that characteristic. Notice also that $r_t(\tau) - r_t(\tau - 1)$ will tend to zero as $\tau \to \infty$, and this may make it implausible to use this model with a large range of maturities. Consequently, this two factor model can be made to reproduce the standard results of the cointegration approach in the sense that the EC terms are decomposed into a smaller number of factors. Of course the model would predict that the coefficients on the factors would be negative, as $\tau^{-1}B_2(\tau) \le B_2(1)$. The conclusion of negative weights extends to any number of factors, provided they are independent, so it is interesting to look at the evidence upon the signs of the coefficients of the factors in our data set, where the non-trend factors are equated with the principal components. Although one cannot uniquely move from the principal components/spreads relation to a spreads/principal components relation, a simple way to get some information on the relationship between spreads and factors is to regress each of the spreads against the principal components. Doing so, the $R^2$ are .999, .999, .98 and .99 respectively, showing that the spreads are well explained by the three components. The results from the regressions are
17 Volatility affects the term structure here by its impact upon $r_t$ in (25). Shen and Starr (1992) raise the interesting question of why volatility should be priced; if one thinks of bonds as part of a larger portfolio only their covariances with the market portfolio would be relevant. To justify the observed importance of volatility they note that the bid/ask spread will be a function of volatility and that has an immediate effect upon yields.
$$r_{t+1}(\tau - 1) - r_t(\tau) = \alpha_0 + \frac{1}{\tau - 1}\,[r_t(\tau) - r_t(1)] \tag{26}$$
if the liquidity premium was a constant. They found that this restriction was strongly rejected by the data. With McCulloch and Kwon's data and $\tau = 3$, the regression of $r_{t+1}(2) - r_t(3)$ against $r_t(3) - r_t(1)$ yields an estimated coefficient of .09, well away from the predicted value of .5. Of course, the assumption of a constant premium is incorrect. Bond prices are determined by (22) which, when discretized, would be
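$$f_t(\tau) = E_t\left[\exp\left(-\sum_{j=t}^{t+\tau-1} r_j\right)\right] = \exp\left(-E_t\left(\sum_{j=t}^{t+\tau-1} r_j\right)\right) v_t = f_t^E(\tau)\, v_t, \tag{27}$$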
where $f_t^E(\tau)$ is the bond price predicted by the expectations theory. Thus $r_t(\tau)$ differs from that of the expectations theory by the term $-\tau^{-1}\log v_t$, and this in turn will be a function of the conditional moments of $\Delta r_t$. In the case where $\Delta r_t$ is conditionally normal it depends upon the conditional variance, and the equation corresponding to (26) will now feature a time varying $\alpha_0$ that depends on this moment. If the conditional variance relates to the spreads with a negative coefficient, then that could cause there to be a negative bias in the coefficient of $r_t(\tau) - r_t(1)$ in the Campbell and Shiller regressions. One scenario in which this happens is if the conditional variance depended upon $\Delta r_t$, as happens with an EGARCH model. Then, due to cointegration amongst yields, $\Delta r_t$ could also be replaced by the lagged spreads, and these will have negative coefficients. More generally, since we observed in Section 2 that the factors influencing the term structure, such as volatility, could be written as linear combinations of the
spreads, there is a possibility that term structure anomalies might be explained in this way.
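The Campbell and Shiller type regression discussed above, of $r_{t+1}(\tau-1) - r_t(\tau)$ on the spread $r_t(\tau) - r_t(1)$, can be sketched as follows. This is only an illustration on artificial data constructed so that the expectations-theory slope $1/(\tau-1)$ holds exactly; the function name `campbell_shiller_slope`, the array names and the synthetic series are assumptions of the sketch and have nothing to do with McCulloch and Kwon's data.

```python
import numpy as np


def campbell_shiller_slope(r_long, r_next_shorter, r_short):
    """OLS slope from regressing r_{t+1}(tau-1) - r_t(tau) on r_t(tau) - r_t(1).

    r_long[t]         : r_t(tau)
    r_next_shorter[t] : r_{t+1}(tau-1), already aligned with date t
    r_short[t]        : r_t(1)
    """
    y = r_next_shorter - r_long
    x = r_long - r_short
    design = np.column_stack([np.ones_like(x), x])  # intercept and spread
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


# Artificial data for tau = 3: the dependent variable is built to equal
# a constant plus 1/(tau - 1) = 0.5 times the spread.
rng = np.random.default_rng(0)
r_short = 0.05 + 0.01 * rng.standard_normal(500)
spread = 0.01 * rng.standard_normal(500)
r_long = r_short + spread
r_next_shorter = r_long + 0.001 + 0.5 * spread

slope = campbell_shiller_slope(r_long, r_next_shorter, r_short)
```

On real yields one would expect the estimated slope to move away from $1/(\tau-1)$ whenever the omitted, time varying premium term is correlated with the spread, which is the bias mechanism described in the text.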
In most cases only numerical solutions for $B(\tau)$ are available. Duffie and Kan consider some special cases, differing according to the evolution of $\xi_t$. When the $\xi_{it}$ are joint diffusions driven by Brownian motion with covariance matrix $\Omega$ that is not diagonal, there is the possibility that the weights attached to the factors can have different signs, and so the principal defect with the two factor models of the preceding subsection might be overcome. To date little empirical work seems to be available on these models, with the exception of El Karoui and Lacoste (1992), who make $\xi_t$ Gaussian with constant volatility.
interest of space only a simple Euler discretization of the HJM stochastic differential equations describing the evolution of the forward rate curve is used. Many variants of these equations have emerged, but they have the common format, Ft(z  1)  F t _ l ( Z ) = ct,i + at,let,.I , where et,i is n.i.d.(O, 1). Differences among the models reflect differences in the assumptions made about volatilities. Examples would be a constant volatility model in which ct,1 = a0 + a2z and o't,z_1 = o', or a proportional volatility model that has ct,~i =6Ft('c))~ + ffFt(z)(~nk=lFt(k)) and o't,z1 = riFt(z). The nature of ct,z1 reflects the noarbitrage assumption. After some manipulation it can be shown that Ft(z  1)  Ft, ('r) = + z+l
"c
spt(z)  T A r t ( z + 1) 1
so that the equation used by HJM for the evolution of the forward rate incorporates spreads and changes in yields. In turn, using cointegration ideas, $\Delta r_t(\tau + 1)$ depends upon spreads, and this shows quite clearly that the characteristics of $F_t(\tau - 1) - F_{t-1}(\tau)$ will be those of the spreads - see Table 2. Consequently, at least for small $\tau$, constant volatility models with martingale difference errors could not adequately describe the data. It is possible that proportional volatility models might do so due to the dependence of their $c_{t,\tau-1}$ upon $F_t(\tau)$, as the latter is near integrated. To check this out we regressed $F_t(2) - F_{t-1}(3)$ against $c_{t,2}$ and $sp_{t-1}(3)$ for $n = 9$ and a variety of values for the market price of risk $\lambda$. For $\lambda = 0$ the t ratio of the coefficient of $sp_{t-1}(3)$ was 4.37, while for very large $\lambda$ it was 4.70. Adopting other values for $\lambda$ resulted in t ratios between these extremes. Hence, the conditional mean for the forward rates is far more complex than that found in HJM models. Moreover, the rank of the covariance matrix of the errors $\varepsilon_{t,\tau-1}$ must reflect the number of factors in the term structure, which appears to be two or three, so that the common assumption of a single error to drive all forward spreads seems inaccurate. A number of formal investigations have been made into the compatibility of the HJM model with the data - Abken (1993) and Thurston (1994) fitted HJM models to forward rate data by GMM, whilst Amin and Morton (1994) used options prices to recover implied volatilities whose evolution was compared to those of the most popular variants of the HJM model. Abken and Thurston reach conflicting conclusions - the latter favours a constant volatility formulation and the former a proportional one, although his general conclusion was that all models were rejected by the data. Consequently, it seems interesting to look at the stylized facts regarding volatility and to compare them with model specifications.
Equation (28) is useful for this task. As it has been shown that there is a levels effect in $\Delta r_t(k)$, in order to have constant volatility it would be necessary that
there be some "colevels" effect, analogous to the copersistence phenomenon of the G A R C H literature  Bollerslev and Engle (1993)  i.e. even though Art(k) displays a levels effect the linear combination ~~!Art(z 1)  ~ A r t ( 1 ) does not. This contention is easily rejected  a plot of that variable squared against rtl (3) looks almost identical to Figure 1, and such an observation points to the proportional volatility model as being the appropriate one.
4. Conclusion

This chapter has described methods of modeling the term structure that are to be found in the econometrics and finance literatures. By utilizing a factor representation we have been able to show that there are many similarities in the two approaches. However, there were also some differences. Within the econometrics literature it is common to assume that yields are integrated processes and that spreads constitute the cointegrating relations. Although the finance literature takes the stance that yields are near integrated but stationary, it emerges that the models used in that literature would not predict that the spreads are cointegrating errors if we actually replaced the stationarity assumption by one of a unit root. The reason for this outcome is found to lie in the assumption that the conditional volatility of yields is a function of the level of the yields. Empirical work tends to support such a hypothesis, and we suggest that the consequences of such a relationship can be profound for testing propositions about the term structure. We also document a number of stylized facts about a set of data on yields that prove useful in assessing the likely adequacy of many of the models that are used in finance for capturing the term structure.
References
Abken, P. A. (1993). Generalized method of moments tests of forward rate processes. Working Paper, 937. Federal Reserve Bank of Atlanta.
Amin, K. I. and A. J. Morton (1994). Implied volatility functions in arbitrage-free term structure models. J. Financ. Econom. 35, 141-180.
Anderson, H. M. (1994). Transaction costs and nonlinear adjustment towards equilibrium in the US treasury bill market. Mimeo, University of Texas at Austin.
Baillie, R. T. and T. Bollerslev (1992). Prediction in dynamic models with time-dependent conditional variances. J. Econometrics 52, 91-113.
Baillie, R. T., T. Bollerslev and H. O. Mikkelsen (1993). Fractionally integrated autoregressive conditional heteroskedasticity. Mimeo, Michigan State University.
Bollerslev, T. and R. F. Engle (1993). Common persistence in conditional variances: Definition and representation. Econometrica 61, 167-186.
Boudoukh, J. (1993). An equilibrium model of nominal bond prices with inflation-output correlation and stochastic volatility. J. Money, Credit and Banking 25, 636-665.
Brennan, M. J. and E. S. Schwartz (1979). A continuous time approach to the pricing of bonds. J. Banking Finance 3, 133-155.
Brenner, R. J., R. H. Harjes and K. F. Kroner (1994). Another look at alternative models of the short-term interest rate. Mimeo, University of Arizona.
Brown, S. J. and P. H. Dybvig (1986). The empirical implications of the Cox-Ingersoll-Ross theory of the term structure of interest rates. J. Finance XLI, 617-632.
Brown, R. H. and S. M. Schaefer (1994). The term structure of real interest rates and the Cox, Ingersoll and Ross model. J. Financ. Econom. 35, 3-42.
Broze, L., O. Scaillet and J. M. Zakoian (1993). Testing for continuous-time models of the short-term interest rates. CORE Discussion Paper 9331.
Campbell, J. Y. and R. J. Shiller (1991). Yield spreads and interest rate movements: A bird's eye view. Rev. Econom. Stud. 58, 495-514.
Canova, F. and J. Marrinan (1993). Reconciling the term structure of interest rates with the consumption based ICAP model. Mimeo, Brown University.
Chan, K. C., G. A. Karolyi, F. A. Longstaff and A. B. Sanders (1992). An empirical comparison of alternative models of the short-term interest rate. J. Finance XLVII, 1209-1227.
Chen, R. R. and L. Scott (1992). Pricing interest rate options in a two factor Cox-Ingersoll-Ross model of the term structure. Rev. Financ. Stud. 5, 613-636.
Chen, R. R. and L. Scott (1993). Maximum likelihood estimation for a multifactor equilibrium model of the term structure of interest rates. J. Fixed Income 3, 14-31.
Conley, T., L. P. Hansen, E. Luttmer and J. Scheinkman (1994). Estimating subordinated diffusions from discrete time data. Mimeo, University of Chicago.
Constantinides, G. (1992). A theory of the nominal structure of interest rates. Rev. Financ. Stud. 5, 531-552.
Cox, J. C., J. E. Ingersoll and S. A. Ross (1985). A theory of the term structure of interest rates. Econometrica 53, 385-408.
Duffie, D. and R. Kan (1993). A yield-factor model of interest rates. Mimeo, Graduate School of Business, Stanford University.
Dybvig, P. H. (1989). Bonds and bond option pricing based on the current term structure. Working Paper, Washington University in St. Louis.
Edmister, R. O. and D. B. Madan (1993). Informational content in interest rate term structures. Rev. Econom. Statist. 75, 695-699.
Egginton, D. M. and S. G. Hall (1993). An investigation of the effect of funding on the slope of the yield curve. Working Paper No. 6, Bank of England.
El Karoui, N. and V. Lacoste (1992). Multifactor models of the term structure of interest rates. Working Paper. University of Paris VI.
Engsted, T. and C. Tanggaard (1994). Cointegration and the US term structure. J. Banking Finance 18, 167-181.
Evans, M. D. D. and K. L. Lewis (1994). Do stationary risk premia explain it all? Evidence from the term structure. J. Monetary Econom. 33, 285-318.
Frydman, H. (1994). Asymptotic inference for the parameters of a discrete-time square-root process. Math. Finance 4, 169-181.
Gallant, A. R. and G. Tauchen (1992). Which moments to match? Mimeo, Duke University.
Gallant, A. R., D. Hsieh and G. Tauchen (1994). Estimation of stochastic volatility models with diagnostics. Mimeo, Duke University.
Gonzalo, J. and C. W. J. Granger (1991). Estimation of common long-memory components in cointegrated systems. UCSD, Discussion Paper 9133.
Gouriéroux, C., A. Monfort and E. Renault (1993). Indirect inference. J. Appl. Econometrics 8, S85-S118.
Gouriéroux, C. and O. Scaillet (1994). Estimation of the term structure from bond data. Working Paper No. 9415, CEPREMAP.
Hall, A. D., H. M. Anderson and C. W. J. Granger (1992). A cointegration analysis of treasury bill yields. Rev. Econom. Statist. 74, 116-126.
Heath, D., R. Jarrow and A. Morton (1992). Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica 60, 77-105.
Hejazi, W. (1994). Are term premia stationary? Mimeo, University of Toronto.
Ho, T. S. and S.-B. Lee (1986). Term structure movements and pricing interest rate contingent claims. J. Finance 41, 1011-1029.
Johansen, S. (1988). Statistical analysis of cointegrating vectors. J. Econom. Dynamic Control 12, 231-254.
Johnson, P. A. (1994). On the number of common unit roots in the term structure of interest rates. Appl. Econom. 26, 815-820.
Kearns, P. (1993). Volatility and the pricing of interest rate derivative claims. Unpublished doctoral dissertation, University of Rochester.
Koedijk, K. G., F. G. J. A. Nissen, P. C. Schotman and C. C. P. Wolff (1993). The dynamics of short-term interest rate volatility reconsidered. Mimeo, Limburg Institute of Financial Economics.
Litterman, R. and J. Scheinkman (1991). Common factors affecting bond returns. J. Fixed Income 1, 54-61.
Longstaff, F. and E. S. Schwartz (1992). Interest rate volatility and the term structure: A two factor general equilibrium model. J. Finance XLVII, 1259-1282.
Marsh, T. A. and E. R. Rosenfeld (1983). Stochastic processes for interest rates and equilibrium bond prices. J. Finance XXXVIII, 635-650.
Milshtein, G. N. (1974). Approximate integration of stochastic differential equations. Theory Probab. Appl. 19, 557-562.
McCulloch, J. H. (1989). US term structure data, 1946-1987. Handbook of Monetary Economics 1, 672-715.
McCulloch, J. H. and H. C. Kwon (1993). US term structure data, 1947-1991. Ohio State University Working Paper 936.
Pearson, N. D. and T.-S. Sun (1994). Exploiting the conditional density in estimating the term structure: An application to the Cox, Ingersoll and Ross model. J. Finance XLIX, 1279-1304.
Pfann, G. A., P. C. Schotman and R. Tschernig (1994). Nonlinear interest rate dynamics and implications for the term structure. Mimeo, University of Limburg.
Phillips, P. C. B. and B. E. Hansen (1990). Statistical inference in instrumental variables regression with I(1) processes. Rev. Econom. Stud. 57, 99-125.
Shen, P. and R. M. Starr (1992). Liquidity of the treasury bill market and the term structure of interest rates. Discussion Paper 9232. University of California at San Diego.
Stock, J. H. and M. W. Watson (1988). Testing for common trends. J. Amer. Statist. Assoc. 83, 1097-1107.
Thurston, D. C. (1994). A generalized method of moments comparison of discrete Heath-Jarrow-Morton interest rate models. Asia Pac. J. Mgmt. 11, 1-19.
Vetzal, K. R. (1992). The impact of stochastic volatility on bond option prices. Working Paper 9208. University of Waterloo, Institute of Insurance and Pension Research, Waterloo, Ontario.
Zhang, Z. (1993). Treasury yield curves and cointegration. Appl. Econom. 25, 361-367.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14 © 1996 Elsevier Science B.V. All rights reserved.
Stochastic Volatility*
Eric Ghysels, Andrew C. Harvey and Eric Renault
1. Introduction

The class of stochastic volatility (SV) models has its roots both in mathematical finance and financial econometrics. In fact, several variations of SV models originated from research looking at very different issues. Clark (1973), for instance, suggested modeling asset returns as a function of a random process of information arrival. This so-called time deformation approach yielded a time-varying volatility model of asset returns. Later Tauchen and Pitts (1983) refined this work, proposing a mixture of distributions model of asset returns with temporal dependence in information arrivals. Hull and White (1987) were not directly concerned with linking asset returns to information arrival but rather were interested in pricing European options assuming continuous time SV models for the underlying asset. They suggested a diffusion for asset prices with volatility following a positive diffusion process. Yet another approach emerged from the work of Taylor (1986), who formulated a discrete time SV model as an alternative to Autoregressive Conditional Heteroskedasticity (ARCH) models. Until recently estimating Taylor's model, or any other SV model, remained almost infeasible. Recent advances in econometric theory have made estimation of SV models much easier. As a result, they have become an attractive class of models and an alternative to other classes such as ARCH. Contributions to the literature on SV models can be found both in mathematical finance and econometrics. Hence, we face quite a diverse set of topics. We say very little about ARCH models because several excellent surveys on the subject have appeared recently, including those by Bera and Higgins (1995), Bollerslev, Chou and Kroner (1992), Bollerslev, Engle and Nelson (1994) and
* We benefited from helpful comments from Torben Andersen, David Bates, Frank Diebold, René Garcia, Eric Jacquier and Neil Shephard on preliminary drafts of the paper. The first author would like to acknowledge the financial support of FCAR (Québec), SSHRC (Canada) as well as the hospitality and support of CORE (Louvain-la-Neuve, Belgium). The second author wishes to thank the ESRC for financial support. The third author would like to thank the Institut Universitaire de France, the Fédération Française des Sociétés d'Assurance as well as CIRANO and C.R.D.E. for financial support.
Diebold and Lopez (1995). Furthermore, since this chapter is written for the Handbook of Statistics, we keep the coverage of the mathematical finance literature to a minimum. Nevertheless, the subject of option pricing figures prominently out of necessity. Indeed, Section 2, which deals with definitions of volatility, has extensive coverage of Black-Scholes implied volatilities. It also summarizes empirical stylized facts and concludes with statistical modeling of volatility. The reader with a greater interest in statistical concepts may want to skip the first three subsections of Section 2, which are more finance oriented, and start with Section 2.4. Section 3 discusses discrete time models, while Section 4 reviews continuous time models. Statistical inference of SV models is the subject of Section 5. Section 6 concludes.
2.1.1. An instantaneous volatility concept

We consider a financial asset, say a stock, with today's (time $t$) market price denoted by $S_t$.$^2$ Let the information available at time $t$ be described by $I_t$ and consider the conditional distribution of the return $S_{t+h}/S_t$ of holding the asset over the period $[t, t+h]$ given $I_t$.$^3$ A maintained assumption throughout this chapter will be that asset returns have finite conditional expectation given $I_t$ or:
2 Here and in the remainder of the paper we will focus on options written on stocks or exchange rates. The large literature on the term structure of interest rates and related derivative securities will not be covered.
3 Section 2.3 will provide a more rigorous discussion of information sets. It should also be noted that we will indifferently be using conditional distributions of asset prices $S_{t+h}$ and of returns $S_{t+h}/S_t$ since $S_t$ belongs to $I_t$.
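$$E_t(S_{t+h}/S_t) < +\infty \tag{2.1.1}$$
and likewise finite conditional variance given $I_t$, namely
$$V_t(S_{t+h}/S_t) < +\infty. \tag{2.1.2}$$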
The continuously compounded expected rate of return will be characterized by $h^{-1}\log E_t(S_{t+h}/S_t)$. Then a first assumption can be stated as follows:

ASSUMPTION 2.1.1.A. The continuously compounded expected rate of return converges almost surely towards a finite value $\mu_S(I_t)$ when $h > 0$ goes to zero.

From this assumption one has $E_t S_{t+h} - S_t \sim h\,\mu_S(I_t) S_t$ or, in terms of its differential representation:
$$\frac{d}{d\tau} E_t(S_\tau)\Big|_{\tau = t} = \mu_S(I_t) S_t \quad \text{almost surely} \tag{2.1.3}$$
where the derivatives are taken from the right. Equation (2.1.3) is sometimes loosely defined as: $E_t(dS_t) = \mu_S(I_t) S_t\,dt$. The next assumption pertains to the conditional variance and can be stated as:

ASSUMPTION 2.1.1.B. The conditional variance of the return $h^{-1} V_t(S_{t+h}/S_t)$ converges almost surely towards a finite value $\sigma_S^2(I_t)$ when $h > 0$ goes to zero.

Again, in terms of its differential representation this amounts to:
$$\frac{d}{d\tau} V_t(S_\tau)\Big|_{\tau = t} = \sigma_S^2(I_t) S_t^2 \quad \text{almost surely} \tag{2.1.4}$$
and one loosely associates it with the expression $V_t(dS_t) = \sigma_S^2(I_t) S_t^2\,dt$. Both assumptions 2.1.1.A and B lead to a representation of the asset price dynamics by an equation of the following form:
$$dS_t = \mu_S(I_t) S_t\,dt + \sigma_S(I_t) S_t\,dW_t \tag{2.1.5}$$
where $W_t$ is a standard Brownian Motion. Hence, every time a diffusion equation is written for an asset price process we have automatically defined the so-called instantaneous volatility process $\sigma_S(I_t)$, which from the above representation can also be written as:
$$\sigma_S(I_t) = \left[\lim_{h \downarrow 0} h^{-1} V_t(S_{t+h}/S_t)\right]^{1/2} \tag{2.1.6}$$
Before turning to the next section we would like to provide a brief discussion of some of the foundations for the Assumptions 2.1.1.A and B. It was noted that Bachelier (1900) proposed the Brownian Motion process as a model of stock price movements. In modern terminology this amounts to the random walk theory of asset pricing, which claims that asset returns ought not to be predictable because of the informational efficiency of financial markets. Hence, it assumes that returns
on consecutive regularly sampled periods $[t+k, t+k+1]$, $k = 0, 1, \ldots, h-1$, are independently (identically) distributed. With such a benchmark in mind, it is natural to view the expectation and the variance of the continuously compounded rate of return $\log(S_{t+h}/S_t)$ as proportional to the maturity $h$ of the investment. Obviously we no longer use Brownian Motions as a process for asset prices, but it is nevertheless worth noting that Assumptions 2.1.1.A and B also imply that the expected rate of return and the associated squared risk (in terms of variance of the rate of return) of an investment over an infinitely short interval $[t, t+h]$ are proportional to $h$. Sims (1984) provided some rationale for both assumptions through the concept of "local unpredictability". To conclude, let us briefly discuss a particular special case of (2.1.5) predominantly used in theoretical developments and also highlight an implicit restriction we made. When $\mu_S(I_t) = \mu_S$ and $\sigma_S(I_t) = \sigma_S$ are constants for all $t$, the asset price is a Geometric Brownian Motion. This process was used by Black and Scholes (1973) to derive their well-known pricing formula for European options. Obviously, since $\sigma_S(I_t)$ is a constant we no longer have an instantaneous volatility process but rather a single parameter $\sigma_S$ - a situation which undoubtedly greatly simplifies many things, including the pricing of options. A second point which needs to be stressed is that Assumptions 2.1.1.A and B allow for the possibility of discrete jumps in the asset price process. Such jumps are typically represented by a Poisson process and have been prominent in the option pricing literature since the work of Merton (1976). Yet, while the assumptions allow in principle for jumps, they do not appear in (2.1.5). Indeed, throughout this chapter we will maintain the assumption of sample path continuity and exclude the possibility of jumps as we focus exclusively on SV models.
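The proportionality of the mean and variance of $\log(S_{t+h}/S_t)$ to the horizon $h$ in the Geometric Brownian Motion case can be illustrated by simulation. The sketch below relies on the standard result that, for constant $\mu_S$ and $\sigma_S$, the log return over $h$ is normal with mean $(\mu_S - \sigma_S^2/2)h$ and variance $\sigma_S^2 h$; the function name `gbm_log_returns`, the parameter values and the horizons are illustrative assumptions.

```python
import numpy as np


def gbm_log_returns(mu, sigma, h, n_draws, seed=0):
    """Exact draws of log(S_{t+h}/S_t) for a Geometric Brownian Motion."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_draws)
    return (mu - 0.5 * sigma**2) * h + sigma * np.sqrt(h) * z


mu, sigma = 0.08, 0.20  # hypothetical annualized drift and volatility
scaled = {}
for h in (1 / 252, 1 / 52, 1 / 12):  # roughly daily, weekly, monthly
    x = gbm_log_returns(mu, sigma, h, n_draws=200_000)
    # Dividing by h should give roughly the same numbers at every horizon.
    scaled[h] = (x.mean() / h, x.var() / h)
```

Each entry of `scaled` should be close to $(\mu_S - \sigma_S^2/2,\ \sigma_S^2)$, although the scaled mean is estimated much less precisely than the scaled variance at short horizons.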
2.1.2. Option prices and implied volatilities

It was noted in the introduction that SV models originated in part from the literature on the pricing of options. We have witnessed over the past two decades a spectacular growth in options and other derivative security markets. Such markets are sometimes characterized as places where "volatilities are traded". In this section we will provide the rationale for such statements and study the relationship between so-called options implied volatilities and the concepts of instantaneous and averaged volatilities of the underlying asset return process. The Black-Scholes option pricing model is based on a Log-Normal or Geometric Brownian Motion model for the underlying asset price:
$$dS_t = \mu_S S_t\,dt + \sigma_S S_t\,dW_t \tag{2.1.7}$$
where $\mu_S$ and $\sigma_S$ are fixed parameters. A European call option with strike price $K$ and maturity $t + h$ has a payoff:
$$[S_{t+h} - K]^+ = \begin{cases} S_{t+h} - K & \text{if } S_{t+h} \ge K \\ 0 & \text{otherwise} \end{cases} \tag{2.1.8}$$
Since the seminal Black and Scholes (1973) paper, there is now a well established literature proposing various ways to derive the pricing formula of such a contract. Obviously, it is beyond the scope of this paper to cover this literature in detail. 4 Instead, the bare minimum will be presented here allowing us to discuss the concepts of interest regarding volatility. With continuous costless trading assumed to be feasible, it is possible to form in the BlackScholes economy a portfolio using one call and a shortsale strategy for the underlying stock to eliminate all risk. This is why the option price can be characterized without ambiguity, using only arbitrage arguments, by equating the market rate of return of the riskless portfolio containing the call option with the riskfree rate. Moreover, such arbitragebased option pricing does not depend on individual preferences. 5 This is the reason why the easiest way to derive the BlackScholes option pricing formula is via a "riskneutral world", where asset price processes are specified through a modified probability measure, referred to as the risk neutral probability measure denoted Q (as discussed more explicitly in Section 4.2). This fictitious world where probabilities in general do not coincide with the Data Generating Process (DGP), is only used to derive the option price which remains valid in the objective probability setup. In the risk neutral world we have:
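$$dS_t / S_t = r_t\,dt + \sigma_S\,dW_t \tag{2.1.9}$$
$$C_t = B(t, t+h)\,E_t^Q (S_{t+h} - K)^+ \tag{2.1.10}$$
where $E_t^Q$ is the expectation under $Q$, $B(t, t+h)$ is the price at time $t$ of a pure discount bond paying one unit at time $t + h$, and
$$r_t = -\lim_{h \to 0} \frac{1}{h} \operatorname{Log} B(t, t+h) \tag{2.1.11}$$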
is the riskless instantaneous interest rate.[6] We have implicitly assumed that interest rates in this market are nonstochastic ($W_t$ is the only source of risk), so that:
$$B(t, t+h) = \exp\left[-\int_t^{t+h} r_\tau\,d\tau\right] \tag{2.1.12}$$
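To make (2.1.11) and (2.1.12) concrete, here is a small Python sketch that prices the discount bond from a deterministic short-rate path by numerical integration. The function names, the trapezoid rule and the example rate paths are assumptions made for illustration only.

```python
import math

def discount_bond_price(rate_fn, t, h, n_steps=1000):
    """B(t, t+h) = exp(-integral of r over [t, t+h]), as in (2.1.12).

    The integral is approximated with the trapezoid rule on n_steps intervals.
    """
    dt = h / n_steps
    total = 0.5 * (rate_fn(t) + rate_fn(t + h))
    for i in range(1, n_steps):
        total += rate_fn(t + i * dt)
    return math.exp(-total * dt)

def average_rate(bond_price, h):
    """-(1/h) Log B(t, t+h); its limit as h -> 0 is r_t in (2.1.11)."""
    return -math.log(bond_price) / h

# Hypothetical flat short rate of 5% over a two-period horizon.
flat_price = discount_bond_price(lambda s: 0.05, t=0.0, h=2.0)
flat_rate = average_rate(flat_price, 2.0)
```

With a constant rate the average rate over any horizon equals the instantaneous rate, so `flat_rate` recovers the 5% used as input.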
[4] See, however, Jarrow and Rudd (1983), Cox and Rubinstein (1985), Duffie (1989), Duffie (1992), Hull (1993) or Hull (1995), among others, for more elaborate coverage of options and other derivative securities.
[5] This is sometimes referred to as preference-free option pricing. This terminology may be somewhat misleading, since individual preferences are implicitly taken into account in the market prices of the stock and of the riskless bond. However, the option price depends on individual preferences only through the stock and bond market prices.
[6] For notational convenience we denote by the same symbol $W_t$ a Brownian motion under $P$ (in (2.1.7)) and under $Q$ (in (2.1.9)). Indeed, Girsanov's theorem establishes the link between these two processes (see e.g. Duffie (1992) and Section 4.2.1).
By definition, there are no risk premia in a risk-neutral context. Therefore $r_t$ coincides with the instantaneous expected rate of return of the stock, and hence the call option price $C_t$ is the discounted value of its terminal payoff $(S_{t+h} - K)^+$, as stated in (2.1.10). The log-normality of $S_{t+h}$ given $S_t$ allows one to compute the expectation in (2.1.10), yielding the call price formula at time $t$:
$$C_t = S_t\,\phi(d_t) - K B(t, t+h)\,\phi(d_t - \sigma_S\sqrt{h}) \tag{2.1.13}$$
where $\phi$ is the cumulative standard normal distribution function, while $d_t$ will be defined shortly. Formula (2.1.13) is the so-called Black-Scholes option pricing formula. Thus, the option price $C_t$ depends on the stock price $S_t$, the strike price $K$ and the discount factor $B(t, t+h)$. Let us now define:
$$x_t = \operatorname{Log} \frac{S_t}{K B(t, t+h)} \tag{2.1.14}$$
Then we have:
$$C_t / S_t = \phi(d_t) - e^{-x_t}\phi(d_t - \sigma_S\sqrt{h}) \tag{2.1.15}$$
with $d_t = x_t / (\sigma_S\sqrt{h}) + \sigma_S\sqrt{h}/2$. It is easy to see the critical role played by the quantity $x_t$, called the moneyness of the option.

- If $x_t = 0$, the current stock price $S_t$ coincides with the present value of the strike price $K$. In other words, the contract may appear to be fair to somebody who does not take into account the stochastic changes of the stock price between $t$ and $t + h$. We shall say that in this case we have an at the money option.
- If $x_t > 0$ (respectively $x_t < 0$) we shall say that the option is in the money (respectively out of the money).[7]
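The normalized formula (2.1.15) and the moneyness classification above can be sketched in a few lines of Python. The helper names (`norm_cdf`, `moneyness`, `bs_call_ratio`, `classify`) and the numerical inputs are hypothetical; the formulas themselves are (2.1.13) to (2.1.15).

```python
import math

def norm_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def moneyness(s_t, strike, bond_price):
    """x_t = Log(S_t / (K B(t, t+h))), equation (2.1.14)."""
    return math.log(s_t / (strike * bond_price))

def bs_call_ratio(x, sigma, h):
    """C_t / S_t from (2.1.15) for moneyness x, volatility sigma, maturity h."""
    v = sigma * math.sqrt(h)
    d = x / v + 0.5 * v
    return norm_cdf(d) - math.exp(-x) * norm_cdf(d - v)

def classify(x, tol=1e-12):
    """Terminology of the text: the sign of x_t decides the label."""
    if abs(x) <= tol:
        return "at the money"
    return "in the money" if x > 0 else "out of the money"

# Hypothetical contract: S_t = 105, K = 100, B(t, t+h) = 0.98, h = 0.5.
x_t = moneyness(105.0, 100.0, 0.98)
price = 105.0 * bs_call_ratio(x_t, 0.2, 0.5)
label = classify(x_t)
```

Working with the ratio $C_t / S_t$ makes explicit that, for a given maturity and volatility, the price depends on the stock price, the strike and the discount factor only through $x_t$.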
It was noted before that the Black-Scholes formula is widely used among practitioners, even when its assumptions are known to be violated. In particular, the assumption of a constant volatility $\sigma_S$ is unrealistic (see Section 2.2 for empirical evidence). This motivated Hull and White (1987) to introduce an option pricing model with stochastic volatility, assuming that the volatility itself is a state variable independent of $W_t$:[8]
$$dS_t / S_t = r_t\,dt + \sigma_{St}\,dW_t, \qquad (\sigma_{St})_{t \in [0,T]},\ (W_t)_{t \in [0,T]} \text{ independent Markovian} \tag{2.1.16}$$
[7] We use here a terminology slightly modified from the usual one. Indeed, it is more common to call options at the money, in the money or out of the money when $S_t = K$, $S_t > K$ or $S_t < K$ respectively. From an economic point of view, it is more appealing to compare $S_t$ with the present value of the strike price $K$.
[8] Other stochastic volatility models similar to Hull and White (1987) appear in Johnson and Shanno (1987), Scott (1987), Wiggins (1987), Chesney and Scott (1989), Stein and Stein (1991) and Heston (1993), among others.
It should be noted that (2.1.16) is still written in a risk-neutral context, since $r_t$ coincides with the instantaneous expected return of the stock. On the other hand, the exogenous volatility risk is not directly traded, which prevents us from defining unambiguously a risk-neutral probability measure, as discussed in more detail in Section 4.2. Nevertheless, the option pricing formula (2.1.10) remains valid provided the expectation is computed with respect to the joint probability distribution of the Markovian process $(S, \sigma_S)$, given $(S_t, \sigma_{St})$.[9] We can then rewrite (2.1.10) as follows:
$$C_t = B(t, t+h)\,E_t (S_{t+h} - K)^+ = B(t, t+h)\,E_t\left\{ E\left[(S_{t+h} - K)^+ \mid (\sigma_{S\tau})_{t \le \tau \le t+h}\right] \right\} \tag{2.1.17}$$
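where the expectation inside the brackets is taken with respect to the conditional probability distribution of $S_{t+h}$ given $I_t$ and a volatility path $\sigma_{S\tau}$, $t \le \tau \le t + h$. However, since the volatility process $\sigma_{S\tau}$ is independent of $W_t$, we obtain using (2.1.15):
$$B(t, t+h)\,E_t\left[(S_{t+h} - K)^+ \mid (\sigma_{S\tau})_{t \le \tau \le t+h}\right] = S_t E_t\left[\phi(d_{1t}) - e^{-x_t}\phi(d_{2t})\right] \tag{2.1.18}$$
where $d_{1t} = x_t / (\gamma(t, t+h)\sqrt{h}) + \gamma(t, t+h)\sqrt{h}/2$, $d_{2t} = d_{1t} - \gamma(t, t+h)\sqrt{h}$, and $\gamma(t, t+h) > 0$ is defined by
$$\gamma^2(t, t+h) = \frac{1}{h}\int_t^{t+h} \sigma_{S\tau}^2\,d\tau \tag{2.1.19}$$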
This yields the so-called Hull and White option pricing formula:
$$C_t = S_t E_t\left[\phi(d_{1t}) - e^{-x_t}\phi(d_{2t})\right] \tag{2.1.20}$$
where the expectation is taken with respect to the conditional probability distribution (under the risk-neutral probability measure) of $\gamma(t, t+h)$ given $\sigma_{St}$.[10] In the remainder of this section we will assume that observed option prices obey Hull and White's formula (2.1.20). Option prices then yield two types of implied volatility concepts: (1) an instantaneous implied volatility, and (2) an averaged implied volatility. To make this more precise, let us assume that the risk-neutral probability distribution belongs to a parametric family $P_\theta$, $\theta \in \Theta$. Then the Hull and White option pricing formula yields an expression for the option price as a function:
$$C_t = S_t F[\sigma_{St}, x_t, \theta_0] \tag{2.1.21}$$
[9] We implicitly assume here that the available information [...]. This assumption will be discussed in Section 4.2.
[10] The conditioning is with respect to $\sigma_{St}$ since it summarizes the relevant information taken from $I_t$ (the process $\sigma_S$ is assumed to be Markovian and independent of $W$).
where $\theta_0$ is the true unknown value of the parameters. Formula (2.1.21) reveals why it is often claimed that "option markets can be thought of as markets trading volatility" (see e.g. Stein (1989)). As a matter of fact, if for any given $(x_t, \theta)$, $F(\cdot, x_t, \theta)$ is one-to-one, then equation (2.1.21) can be inverted to yield an implied instantaneous volatility:[11]
$$\sigma_t^{imp}(\theta) = G[S_t, C_t, x_t, \theta] \tag{2.1.22}$$
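The inversion behind (2.1.22) can be sketched numerically. The Python example below uses a deliberately simple toy family that is an assumption of this sketch and not a model from the text: given $\sigma_{St}$, the averaged variance $\gamma^2(t, t+h)$ equals $\sigma_{St}^2(1 - \theta)$ or $\sigma_{St}^2(1 + \theta)$ with probability one half each. Under that assumption $F$ is increasing in $\sigma_{St}$, so bisection recovers the implied instantaneous volatility. All function names are hypothetical.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call_ratio(x, vol, h):
    """C_t / S_t from (2.1.15) for moneyness x, volatility vol, maturity h."""
    v = vol * math.sqrt(h)
    d1 = x / v + 0.5 * v
    return norm_cdf(d1) - math.exp(-x) * norm_cdf(d1 - v)

def hw_call_ratio(sigma_t, x, h, theta):
    """Toy version of F[sigma_t, x_t, theta] in (2.1.21).

    Assumption (not from the text): gamma^2 is sigma_t^2 * (1 - theta) or
    sigma_t^2 * (1 + theta), each with probability 1/2, with 0 <= theta < 1.
    The price is the average of Black-Scholes prices, as in (2.1.20).
    """
    gammas = (sigma_t * math.sqrt(1.0 - theta), sigma_t * math.sqrt(1.0 + theta))
    return sum(0.5 * bs_call_ratio(x, g, h) for g in gammas)

def implied_instantaneous_vol(ratio, x, h, theta, lo=1e-4, hi=5.0):
    """Invert F(., x, theta) by bisection, in the spirit of (2.1.22)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if hw_call_ratio(mid, x, h, theta) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip with hypothetical inputs: sigma_t = 0.2, x_t = 0.05, h = 0.25.
observed_ratio = hw_call_ratio(0.2, 0.05, 0.25, 0.5)
recovered = implied_instantaneous_vol(observed_ratio, 0.05, 0.25, 0.5)
```

The round trip only works because the same $\theta$ is used for pricing and for inversion, which is the practical difficulty with implied instantaneous volatilities: they require knowledge of, or a good estimate of, $\theta_0$.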
Bajeux and Rochet (1992), by showing that this one-to-one relationship between option prices and instantaneous volatility holds, in fact formalize the use of option markets as an appropriate instrument to hedge volatility risk. Obviously, implied instantaneous volatilities (2.1.22) can only be useful in practice for pricing or hedging derivative instruments when we know the true value $\theta_0$ or, at least, are able to compute a sufficiently accurate estimate of it. However, the difficulties involved in estimating SV models have long prevented their widespread use in empirical applications. This is the reason why practitioners often prefer another concept of implied volatility, namely the so-called Black-Scholes implied volatility introduced by Latane and Rendleman (1976). It is a process $\omega^{imp}(t, t+h)$ defined by:
$$\begin{aligned} C_t &= S_t\left[\phi(d_{1t}) - e^{-x_t}\phi(d_{2t})\right] \\ d_{1t} &= \frac{x_t}{\omega^{imp}(t, t+h)\sqrt{h}} + \frac{\omega^{imp}(t, t+h)\sqrt{h}}{2} \\ d_{2t} &= d_{1t} - \omega^{imp}(t, t+h)\sqrt{h} \end{aligned} \tag{2.1.23}$$
where $C_t$ is the observed option price.[12] The Hull and White option pricing model can indeed be seen as a theoretical foundation for this practice; the comparison between (2.1.23) and (2.1.20) allows us to interpret the Black-Scholes implied volatility $\omega^{imp}(t, t+h)$ as an implied averaged volatility, since $\omega^{imp}(t, t+h)$ is something like a conditional expectation of $\gamma(t, t+h)$ (assuming observed option prices coincide with the Hull and White pricing formula). To be more precise, let us consider the simplest case of at the money options (the general case will be studied in Section 4.2). Since $x_t = 0$, it follows that $d_{2t} = -d_{1t}$ and therefore $\phi(d_{1t}) - e^{-x_t}\phi(d_{2t}) = 2\phi(d_{1t}) - 1$. Hence $\omega_0^{imp}(t, t+h)$ (the index 0 is added to make explicit that we consider at the money options) is defined by:
$$\phi\left(\frac{\omega_0^{imp}(t, t+h)\sqrt{h}}{2}\right) = E_t\,\phi\left(\frac{\gamma(t, t+h)\sqrt{h}}{2}\right) \tag{2.1.24}$$
Since the cumulative standard normal distribution function is roughly linear in the neighborhood of zero, it follows that (for small maturities $h$):
$$\omega_0^{imp}(t, t+h) \approx E_t\,\gamma(t, t+h).$$
This yields an interpretation of the Black-Scholes implied volatility $\omega_0^{imp}(t, t+h)$ as an implied average volatility:
$$\omega_0^{imp}(t, t+h) \approx E_t\left[\frac{1}{h}\int_t^{t+h}\sigma_{S\tau}^2\,d\tau\right]^{1/2} \tag{2.1.25}$$
[11] The fact that $F(\cdot, x_t, \theta)$ is one-to-one is shown to be the case for any diffusion model on $\sigma_{St}$ under certain regularity conditions; see Bajeux and Rochet (1992).
[12] We do not explicitly study here the dependence between $\omega^{imp}(t, t+h)$ and the various related processes $C_t$, $S_t$, $x_t$. This is the reason why, for the sake of simplicity, this dependence is not apparent in the notation $\omega^{imp}(t, t+h)$.
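The approximation (2.1.25) can be checked numerically for at the money options. The Python sketch below computes a Hull and White price as an average of Black-Scholes prices over two equally likely values of $\gamma(t, t+h)$, backs out the Black-Scholes implied volatility by bisection as in (2.1.23), and compares it with $E_t\,\gamma(t, t+h)$. The two-point distribution, the one-month maturity and all function names are assumptions of the sketch.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call_ratio(x, vol, h):
    """C_t / S_t from (2.1.15) for moneyness x, volatility vol, maturity h."""
    v = vol * math.sqrt(h)
    d1 = x / v + 0.5 * v
    return norm_cdf(d1) - math.exp(-x) * norm_cdf(d1 - v)

def bs_implied_vol(ratio, x, h, lo=1e-6, hi=5.0):
    """Solve (2.1.23) for the implied volatility by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call_ratio(x, mid, h) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

h = 1.0 / 12.0                 # hypothetical one-month maturity
gammas = (0.15, 0.25)          # hypothetical values of gamma(t, t+h)
probs = (0.5, 0.5)             # hypothetical risk-neutral probabilities
hw_ratio = sum(p * bs_call_ratio(0.0, g, h) for g, p in zip(gammas, probs))
omega_0 = bs_implied_vol(hw_ratio, 0.0, h)
mean_gamma = sum(p * g for g, p in zip(gammas, probs))
```

Because the normal distribution function is concave on the positive half-line, (2.1.24) implies that `omega_0` lies slightly below `mean_gamma`; for short maturities the gap is very small, which is the content of the approximation.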
small to explain the empirical asymmetries one observes in stock prices. Others reporting empirical evidence regarding leverage effects include Nelson (1991), Gallant, Rossi and Tauchen (1992, 1993), Campbell and Kyle (1993) and Engle and Ng (1993).
(d) Information arrivals
Asset returns are typically measured and modeled with observations sampled at fixed frequencies, such as daily, weekly or monthly observations. Several authors, including Mandelbrot and Taylor (1967) and Clark (1973), suggested linking asset returns explicitly to the flow of information arrival. In fact, it was already noted that Clark proposed one of the early examples of SV models. Information arrival is nonuniform through time and quite often not directly observable. Conceptually, one can think of asset price movements as the realization of a process $Y_t = Y^*_{Z_t}$, where $Z_t$ is a so-called directing process. This positive nondecreasing stochastic process $Z_t$ can be thought of as being related to the arrival of information. This idea of time deformation or subordinated stochastic processes was used by Mandelbrot and Taylor (1967) to explain fat-tailed returns, by Clark (1973) to explain volatility, and was recently refined and further explored by Ghysels, Gouriéroux and Jasiak (1995a). Moreover, Easley and O'Hara (1992) provide a microstructure model involving time deformation. In practice, it suggests a direct link between market volatility and (1) trading volume, (2) quote arrivals, (3) forecastable events such as dividend announcements or macroeconomic data releases, and (4) market closures, among many other phenomena linked to information arrival. Regarding trading volume and volatility, several papers document stylized facts, notably linking high trading volume with market volatility; see for example Karpoff (1987) or Gallant, Rossi and Tauchen (1992).[13] The intraday patterns of volatility and market activity, measured for instance by quote arrivals, are also well known and documented. Wood, McInish and Ord (1985) and Harris (1986) studied this phenomenon for securities markets and found a U-shaped pattern, with volatility typically high at the open and close of the market.
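The time deformation idea described above can be illustrated with a short simulation. In the Python sketch below, a Brownian motion is read on a random clock whose increments are exponentially distributed; the choice of an exponential clock, the sample size and the function names are assumptions of the sketch, not the specification of any of the cited papers. The resulting calendar-time returns are a scale mixture of normals and display fatter tails than Gaussian returns.

```python
import math
import random

def subordinated_increments(n, rng):
    """Increments of Y_t = Y*_{Z_t}, with Y* a Brownian motion.

    Assumption: the directing process Z has i.i.d. exponential increments
    with mean one, so each return is normal with a random variance.
    """
    increments = []
    for _ in range(n):
        dz = rng.expovariate(1.0)  # operational time elapsed in one period
        increments.append(math.sqrt(dz) * rng.gauss(0.0, 1.0))
    return increments

def kurtosis(xs):
    """Sample kurtosis (equal to 3 for the normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

rng = random.Random(42)  # fixed seed for reproducibility
gaussian_returns = [rng.gauss(0.0, 1.0) for _ in range(20000)]
deformed_returns = subordinated_increments(20000, rng)
k_gaussian = kurtosis(gaussian_returns)
k_deformed = kurtosis(deformed_returns)
```

With this particular clock the theoretical kurtosis of the deformed returns is 6, twice the Gaussian value, even though each return is conditionally normal.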
The around-the-clock trading in foreign exchange markets also yields a distinct volatility pattern which is tied to the intensity of market activity and produces strong seasonal patterns. The intraday patterns for FX markets are analyzed, for instance, by Müller et al. (1990), Baillie and Bollerslev (1991), Harvey and Huang (1991), Dacorogna et al. (1993), Bollerslev and Ghysels (1994), Andersen and Bollerslev (1995) and Ghysels, Gouriéroux and Jasiak (1995b), among others. Another related empirical stylized fact is that of overnight and weekend market closures and their effect on volatility. Fama (1965) and French and Roll (1986) found that information accumulates more slowly when the NYSE and AMEX are closed, resulting in higher volatility on those markets after weekends and holidays. Similar evidence for FX markets has been reported by Baillie and Bollerslev (1989). Finally, numerous papers have documented increased volatility of financial markets around dividend announcements (Cornell (1978), Patell and Wolfson (1979, 1981)) and macroeconomic data releases (Harvey and Huang (1991, 1992), Ederington and Lee (1993)).
[13] There are numerous models, theoretical and empirical, linking trading volume and asset returns which we cannot discuss in detail. A partial list includes Foster and Viswanathan (1993a,b), Ghysels and Jasiak (1994a,b), Hausman and Lo (1991), Huffman (1987), Lamoureux and Lastrapes (1990, 1993), Wang (1993) and Andersen (1995).
equation of a specific model, namely the Black and Scholes model, as noted in Section 2.1.3. Since they are computed on a daily basis, there is obviously an internal inconsistency, since the model presumes constant volatility. Yet, since many option prices are in fact quoted through their implied volatilities, it is natural to study the time series behavior of the latter. Often one computes a composite measure, since synchronous option prices with different strike prices and maturities for the same underlying asset yield different implied volatilities. The composite measure is usually obtained from a weighting scheme putting more weight on the near-the-money options, which are the most heavily traded in organized markets.[15] The time series properties of implied volatilities obtained from stock, stock index and currency options are quite similar. They appear stationary and are well described by a first order autoregressive model (see Merville and Pieptea (1989) and Sheikh (1993) for stock options, Poterba and Summers (1986), Stein (1989), Harvey and Whaley (1992) and Diz and Finucane (1993) for the S&P100 contract, and Taylor and Xu (1994), Campa and Chang (1995) and Jorion (1995) for currency options). It was noted from equation (2.1.25) that implied (average) volatilities are expected to contain information regarding future volatility and therefore should predict the latter. One typically tests such hypotheses by regressing realized volatilities on past implied ones. The empirical evidence regarding the predictive content of implied volatilities is mixed. The time series study of Lamoureux and Lastrapes (1993) considered options on non-dividend-paying stocks and compared the forecasting performance of GARCH, implied volatility and historical volatility estimates. It found that implied volatility forecasts, although biased as one would expect from (2.1.25), outperform the others.
In sharp contrast, Canina and Figlewski (1993) studied S&P100 index call options, for which there is an extremely active market. They found that implied volatilities were virtually useless in forecasting future realized volatilities of the S&P100 index. In a different setting, using weekly sampling intervals for S&P100 option contracts and a different sample, Day and Lewis (1992) found not only that implied volatilities had predictive content but also that they were unbiased. Studies examining options on foreign currencies, such as Jorion (1995), also found that implied volatilities predicted future realizations and that GARCH as well as historical volatilities did not outperform the implied measures of volatility.
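The forecasting regressions discussed above have a simple generic form: realized volatility is regressed on the previously observed implied volatility, and unbiasedness corresponds to a zero intercept and a unit slope. The Python sketch below shows the ordinary least squares step only. The data are made up for illustration and do not come from any of the cited studies, and the function name `ols_fit` is hypothetical.

```python
def ols_fit(x, y):
    """Intercept and slope of the least squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical implied volatilities and the realized volatilities that follow.
# The realized series is constructed as 0.02 + 0.8 * implied, i.e. an
# informative but biased forecast (intercept != 0, slope != 1).
implied = [0.12, 0.15, 0.18, 0.22, 0.30]
realized = [0.02 + 0.8 * v for v in implied]
intercept, slope = ols_fit(implied, realized)
```

In empirical work the interest lies in testing the joint hypothesis of a zero intercept and a unit slope, and in comparing the fit with regressions on GARCH or historical volatility forecasts; those steps are omitted here.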
(h) The term structure of implied volatilities
The Black-Scholes model predicts a flat term structure of volatilities. In reality, the term structure of at-the-money implied volatilities is typically upward sloping when short term volatilities are low and the reverse when they are high (see Stein (1989)). Taylor and Xu (1994) found that the term structure of implied volatilities from foreign currency options reverses slope every few months. Stein (1989) also found that the actual sensitivity of medium to short term implied volatilities was greater than the estimated sensitivity from the forecast term structure and concluded that medium term implied volatilities overreacted to information. Diz and Finucane (1993) used different estimation techniques and rejected the overreaction hypothesis, and instead reported evidence suggesting underreaction.
[15] Different weighting schemes have been suggested; see for instance Latane and Rendleman (1976), Chiras and Manaster (1978), Beckers (1981), Whaley (1982), Day and Lewis (1988), Engle and Mustafa (1992) and Bates (1995b).
(i) Smiles
If option prices in the market were conformable with the Black-Scholes formula, all the Black-Scholes implied volatilities corresponding to various options written on the same asset would coincide with the volatility parameter $\sigma$ of the underlying asset. In reality this is not the case, and the Black-Scholes implied volatility $\omega^{imp}(t, t+h)$ defined by (2.1.23) depends heavily on the calendar time $t$, the time to maturity $h$ and the moneyness $x_t = \operatorname{Log} S_t / K B(t, t+h)$ of the option. This may produce various biases in option pricing or hedging when BS implied volatilities are used to evaluate new options with different strike prices $K$ and maturities $h$. These price distortions, well known to practitioners, are usually documented in the empirical literature under the terminology of the smile effect, where the so-called "smile" refers to the U-shaped pattern of implied volatilities across different strike prices. More precisely, the following stylized facts are extensively documented (see for instance Rubinstein (1985), Clewlow and Xu (1993), Taylor and Xu (1993)):

- The U-shaped pattern of $\omega^{imp}(t, t+h)$ as a function of $K$ (or $\log K$) has its minimum centered at near-the-money options (discounted $K$ close to $S_t$, i.e. $x_t$ close to zero).
- The volatility smile is often but not always symmetric as a function of $\log K$ (or of $x_t$). When the smile is asymmetric, the skewness effect can often be described as the addition of a monotonic curve to the standard symmetric smile: if a decreasing curve is added, implied volatilities tend to rise more for decreasing than for increasing strike prices and the implied volatility curve has its minimum out of the money. In the reverse case (addition of an increasing curve), implied volatilities tend to rise more with increasing strike prices and their minimum is in the money.
- The amplitude of the smile increases quickly when time to maturity decreases. Indeed, for short maturities the smile effect is very pronounced (BS implied volatilities for synchronous option prices may vary between 15% and 25%) while it almost completely disappears for longer maturities.
It is widely believed that volatility smiles have to be explained by a model of stochastic volatility. This is natural for several reasons. First, it is tempting to propose a model of stochastically time varying volatility to account for stochastically time varying BS implied volatilities. Moreover, the decreasing amplitude of the smile as a function of time to maturity is conformable with a formula like (2.1.25). Indeed, it shows that, when time to maturity is increased, temporal aggregation of volatilities erases conditional heteroskedasticity, which decreases the smile phenomenon. Finally, the skewness itself may also be attributed to the stochastic feature of the volatility process and, above all, to the correlation of this process with the price process (the so-called leverage effect). Indeed, this effect, while noticeable for stock price data, is small for interest rate and exchange rate series, which is why the skewness of the smile is more often observed for options written on stocks. Nevertheless, it is important to be cautious about tempting associations: stochastic implied volatility and stochastic volatility; asymmetry in stocks and skewness in the smile. As will be discussed in Section 4, such analogies are not always rigorously proven. Moreover, other arguments to explain the smile and its skewness (jumps, transaction costs, bid-ask spreads, nonsynchronous trading, liquidity problems, ...) also have to be taken into account, for both theoretical and empirical reasons. For instance, there exists empirical evidence suggesting that the most expensive options (the upper parts of the smile curve) are also the least liquid; skewness may therefore be attributed to specific configurations of liquidity in option markets.
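A symmetric smile of the kind described above can be reproduced with a very small experiment. The Python sketch below prices options as an average of Black-Scholes prices over two equally likely values of the averaged volatility (a Hull and White type price with volatility independent of the stock price) and then computes Black-Scholes implied volatilities across moneyness. The two-point distribution, the maturity and the function names are assumptions of the sketch.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bs_call_ratio(x, vol, h):
    """C_t / S_t from (2.1.15) for moneyness x, volatility vol, maturity h."""
    v = vol * math.sqrt(h)
    d1 = x / v + 0.5 * v
    return norm_cdf(d1) - math.exp(-x) * norm_cdf(d1 - v)

def bs_implied_vol(ratio, x, h, lo=1e-6, hi=5.0):
    """Black-Scholes implied volatility (2.1.23) by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call_ratio(x, mid, h) < ratio:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mixture_call_ratio(x, h, gammas=(0.1, 0.3), probs=(0.5, 0.5)):
    """Average of Black-Scholes prices over hypothetical volatility values."""
    return sum(p * bs_call_ratio(x, g, h) for g, p in zip(gammas, probs))

h = 0.25  # hypothetical three-month maturity
smile = {x: bs_implied_vol(mixture_call_ratio(x, h), x, h)
         for x in (-0.2, -0.1, 0.0, 0.1, 0.2)}
```

In this setting the implied volatility is lowest at the money and rises on both sides. Because volatility is independent of the price innovation there is no leverage effect and the curve is symmetric in $x_t$; producing skewness would require correlation or one of the other mechanisms mentioned in the text.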
2.3.1. State variables and information sets
The Hull and White (1987) model is a simple example of a derivative asset pricing model where the stock price dynamics are governed by some unobservable state variables, such as random volatility. More generally, it is convenient to assume that a multivariate diffusion process $U_t$ summarizes the relevant state variables in the sense that:
$$\begin{aligned} dS_t / S_t &= \mu_t\,dt + \sigma_t\,dW_t \\ dU_t &= \gamma_t\,dt + \delta_t\,dW_t^U \\ \operatorname{Cov}(dW_t, dW_t^U) &= \rho_t\,dt \end{aligned} \tag{2.3.1}$$
[16] The analysis in this section has some features in common with Andersen (1992) regarding the use of information sets to clarify the difference between SV and ARCH type models.
where the stochastic processes $\mu_t$, $\sigma_t$, $\gamma_t$, $\delta_t$ and $\rho_t$ are $I_t^U = [U_\tau, \tau \le t]$ adapted (Assumption 2.3.1). This means that the process $U$ summarizes the whole dynamics of the stock price process $S$ (which justifies the terminology "state" variable) since, for a given sample path $(U_\tau)_{0 \le \tau \le T}$ of state variables, consecutive returns $S_{t_{k+1}} / S_{t_k}$, $0 \le t_1 < t_2 < \dots < t_k \le T$, are stochastically independent and log-normal (as in the benchmark BS model). The arguments of Section 2.1.2 can be extended to the state variables framework discussed here (see Garcia and Renault (1995)). Indeed, such an extension provides a theoretical justification for the common use of the Black and Scholes model as a standard method of quoting option prices via their implied volatilities.[17] In fact, it is a way of introducing neglected heterogeneity in the BS option pricing model (see Renault (1995), who draws attention to the similarities with introducing heterogeneity in microeconometric models of labor markets, etc.). In continuous time models, the information available at time $t$ to traders (whose information determines option prices) is characterized by continuous time observations of both the state variable sample path and the stock price sample path, namely:
$$I_t = \sigma[U_\tau, S_\tau;\ \tau \le t] \tag{2.3.2}$$
2.3.2. Discrete sampling and Granger noncausality
In the next section we will treat discrete time models explicitly. This will necessitate formulating discrete time analogues of equation (2.3.1). The discrete sampling and Granger noncausality conditions discussed here will bring us a step closer to building a formal framework for statistical modeling using discrete time data. Clearly, a discrete time analogue of equation (2.3.1) is:
$$\log S_{t+1}/S_t = \mu(U_t) + \sigma(U_t)\,\varepsilon_{t+1} \tag{2.3.3}$$
provided we impose some restrictions on the process $\varepsilon_t$. The restrictions we want to impose must be flexible enough to accommodate phenomena such as leverage effects. A setup that does this is the following:
ASSUMPTION 2.3.2.A. The process $\varepsilon_t$ in (2.3.3) is i.i.d. and not Granger-caused by the state variable process $U_t$.
ASSUMPTION 2.3.2.B. The process $\varepsilon_t$ in (2.3.3) does not Granger-cause $U_t$.
Assumption 2.3.2.B is useful for the practical use of BS implied volatilities, as it is the discrete time analogue of Assumption 2.3.1, where it is stated that the coefficients of the process $U$ are $I_t^U$ adapted (for further details see Garcia and Renault (1995)).
[17] Garcia and Renault (1995) argued that Assumption 2.3.1 is essential to ensure the homogeneity of option prices with respect to the pair (stock price, strike price), which in turn ensures that BS implied volatilities do not depend on the stock price level but only on the moneyness $S/K$. This homogeneity property was first emphasized by Merton (1973).
Assumption 2.3.2.A is important for the statistical interpretation of the functions $\mu(U_t)$ and $\sigma(U_t)$ as trend and volatility coefficients respectively, namely,
$$E[\log S_{t+1}/S_t \mid (S_\tau/S_{\tau-1};\ \tau \le t)] = E[\mu(U_t) \mid (S_\tau/S_{\tau-1};\ \tau \le t)] \tag{2.3.4}$$
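since $E[\varepsilon_{t+1} \mid (U_\tau, \varepsilon_\tau;\ \tau \le t)] = E[\varepsilon_{t+1} \mid \varepsilon_\tau;\ \tau \le t] = 0$ due to the Granger noncausality from $U_t$ to $\varepsilon_t$ of Assumption 2.3.2.A. Likewise, one can easily show that
$$\operatorname{Var}[\log S_{t+1}/S_t - \mu(U_t) \mid (S_\tau/S_{\tau-1};\ \tau \le t)] = E[\sigma^2(U_t) \mid (S_\tau/S_{\tau-1};\ \tau \le t)] \tag{2.3.5}$$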
Implicitly we have introduced a new information set in (2.3.4) and (2.3.5) which, besides $I_t$ defined in (2.3.2), will also be useful for further analysis. Indeed, one often confines (statistical) analysis to the information conveyed by a discrete time sampling of the stock return series, which will be denoted by the information set
$$I_t^R = \sigma[S_\tau/S_{\tau-1} : \tau = 0, 1, \dots, t-1, t] \tag{2.3.6}$$
where the superscript $R$ stands for returns. By extending Andersen (1994), we shall adopt as the most general framework for univariate volatility modelling the setup given by Assumptions 2.3.2.A, 2.3.2.B and:
ASSUMPTION 2.3.2.C. $\mu(U_t)$ is $I_t^R$ measurable.
Therefore in (2.3.4) and (2.3.5) we have essentially shown that:
$$E[\log S_{t+1}/S_t \mid I_t^R] = \mu(U_t) \tag{2.3.7}$$
$$\operatorname{Var}[(\log S_{t+1}/S_t) \mid I_t^R] = E[\sigma^2(U_t) \mid I_t^R] \tag{2.3.8}$$
Financial time series are observed at discrete time intervals, while a majority of theoretical models are formulated in continuous time. Generally speaking, there are two statistical methodologies to resolve this tension. Either one considers, for the purpose of estimation, statistical discrete time models of the continuous time processes or, alternatively, the statistical model is specified in continuous time and inference is done via a discrete time approximation. In this section we will discuss the former approach in detail, while the latter will be introduced in Section 4. The class of discrete time statistical models discussed here is general. In Section 2.4.1 we introduce some notation and terminology. The next section discusses the so-called stochastic autoregressive volatility model introduced by Andersen (1994) as a rather general and flexible semiparametric framework to encompass various representations of stochastic volatility already available in the literature. Identification of parameters and the restrictions required for it are discussed in Section 2.4.3.
2.4.1. Notation and terminology
In Section 2.3, we left unspecified the functional forms which the trend $\mu(\cdot)$ and volatility $\sigma(\cdot)$ take. Indeed, in some sense we built a nonparametric framework recently proposed by Lezan, Renault and de Vitry (1995), which they introduced to discuss a notion of stochastic volatility of unknown form.[18] This nonparametric framework encompasses standard parametric models (see Section 2.4.2 for a more formal discussion). For the purpose of illustration, let us consider two extreme cases, assuming for simplicity that $\mu(U_t) = 0$: (i) the discrete time analogue of the Hull and White model (2.1.16) is obtained when $\sigma(U_t) = \sigma_t$ is a stochastic process independent of the standardized stock return innovation process $\varepsilon$, and (ii) $\sigma_t$ may be a deterministic function $h(\varepsilon_\tau, \tau \le t)$ of past innovations. The latter is the complete opposite of (i) and leads to a large variety of choices of parameterized functions for $h$, yielding X-ARCH models (GARCH, EGARCH, QTARCH, Periodic GARCH, etc.). Besides these two polar cases where Assumption 2.3.2.A is fulfilled in a trivial degenerate way, one can also accommodate leverage effects.[19] In particular, the contemporaneous correlation structure between innovations in $U$ and the return process can be nonzero, since the Granger noncausality assumptions deal with temporal causal links rather than contemporaneous ones. For instance, we may have $\sigma(U_t) = \sigma_t$ with:
$$\log S_{t+1}/S_t = \sigma_t\,\varepsilon_{t+1} \tag{2.4.1}$$
$$\operatorname{Cov}(\sigma_{t+1}, \varepsilon_{t+1} \mid I_t^R) \ne 0 \tag{2.4.2}$$
A negative covariance in (2.4.2) is a standard case of leverage effect, without violating the noncausality Assumptions 2.3.2.A and B. A few concluding observations are worth making to deal with the burgeoning variety of terminology in the literature. First, we have not considered the distinction due to Taylor (1994) between "lagged autoregressive random variance models" given by (2.4.1) and "contemporaneous autoregressive random variance models" defined by:
$$\log S_{t+1}/S_t = \sigma_{t+1}\,\varepsilon_{t+1} \tag{2.4.3}$$
[18] Lezan, Renault and de Vitry (1995) discuss in detail how to recover phenomena such as volatility clustering in this framework. As a nonparametric framework it also has certain advantages regarding (robust) estimation. They develop, for instance, methods that can be useful as a first estimation step for efficient algorithms assuming a specific parametric model (see Section 5).
[19] Assumption 2.3.2.B is fulfilled in case (i) but may fail in the GARCH case (ii). When it fails to hold in the latter case, it makes the GARCH framework not very well suited for option pricing.
Indeed, since the volatility process $\sigma_t$ is unobservable, the settings (2.4.1) and (2.4.3) are observationally equivalent as long as they are not completed by precise (non)causality assumptions. For instance: (i) (2.4.1) and Assumption 2.3.2.A together appear to be a correct and very general definition of an SV model, possibly completed by Assumption 2.3.2.B for option pricing and by (2.4.2) to introduce leverage effects; (ii) (2.4.3) associated with (2.4.2) would not be a correct definition of an SV model, since in this case in general $E[\log S_{t+1}/S_t \mid I_t^R] \ne 0$, and the model would introduce via the process $\sigma$ a forecast which is related not only to volatility but also to the expected return. For notational simplicity, the framework (2.4.3) will be used in Section 3, with the leverage effect captured by $\operatorname{Cov}(\sigma_{t+1}, \varepsilon_t) \ne 0$ instead of $\operatorname{Cov}(\sigma_{t+1}, \varepsilon_{t+1}) \ne 0$. Another terminology was introduced by Amin and Ng (1993) for option pricing. Their distinction between "predictable" and "unpredictable" volatility is very close to the leverage effect concept and can also be analyzed through causality concepts, as discussed in Garcia and Renault (1995). Finally, it will not be necessary to make a distinction between weak, semi-strong and strong definitions of SV models in analogy with their ARCH counterparts (see Drost and Nijman (1993)). Indeed, the class of SV models as defined here can accommodate parameterizations which are closed under temporal aggregation (see also Section 4.1 on the subject of temporal aggregation).
2.4.2. Stochastic autoregressive volatility
For simplicity, let us consider the following univariate volatility process:
$$y_{t+1} = \mu_t + \sigma_t\,\varepsilon_{t+1} \tag{2.4.4}$$
where $\mu_t$ is a measurable function of observables $y_\tau \in I_t^R$, $\tau \le t$. While our discussion will revolve around (2.4.4), we will discuss several issues which are general and not confined to that specific model; extensions will be covered more explicitly in Section 3.5. Following the result in (2.3.8) we know that:
$$\operatorname{Var}[y_{t+1} \mid I_t^R] = E[\sigma_t^2 \mid I_t^R] \tag{2.4.5}$$
suggesting (1) that volatility clustering can be captured via autoregressive dynamics in the conditional expectation (2.4.5) and (2) that thick tails can be obtained in any one of three ways, namely (a) via heavy tails of the white noise $\varepsilon_t$ distribution, (b) via the stochastic features of $E[\sigma_t^2 \mid I_t^R]$ and (c) via specific randomness of the volatility process $\sigma_t$ which makes it latent, i.e. $\sigma_t \notin I_t^R$.[20] The volatility dynamics that follow from (1) and (2) are usually an AR(1) model for some nonlinear function of $\sigma_t$. Hence, the volatility process is assumed to be stationary and Markovian of order one but not necessarily linear AR(1) in $\sigma_t$ itself.
[20] Kim and Shephard (1994), using data on weekly returns on the S&P500 Index, found that a t-GARCH model has almost the same likelihood as the normal based SV model. This example shows that a specific randomness in $\sigma_t$ may produce the same level of marginal kurtosis as a heavy-tailed Student distribution of the white noise $\varepsilon$.
This is precisely what motivated Andersen (1994) to introduce the Stochastic Autoregressive Variance or SARV class of models, where $\sigma_t$ (or $\sigma_t^2$) is a polynomial function $g(K_t)$ of a Markov process $K_t$ with the following dynamic specification:
$$K_t = w + \beta K_{t-1} + [\gamma + \alpha K_{t-1}]\,u_t \tag{2.4.6}$$
where fit = ut  1 is zeromean white noise with unit variance. Andersen (1994) discusses sufficient regularity conditions which ensure stationarity and ergodicity for Kt. Without entering into the details, let us note that the fundamental noncausality Assumption 2.3.2A implies that the ut process in (2.4.6) does not Grangercause et in (2.4.4). In fact, the noncausality condition suggests a slight modification of Andersen's (1994) definition. Namely, it suggests assuming et+l independent of ut_j, j >_ 0 for the conditional probability distribution, given etj, j _> 0 rather than for the unconditional distribution. This modification does not invalidate Andersen's SARV class of models as the most general parametric statistical model studied so far in the volatility literature. The GARCH(1,1) model is straightforwardly obtained from (2.4.6) by letting Kt = at2, ~ = 0 and ut = et 2. Note that the deterministic relationship ut = ct 2 between the stochastic components of (2.4.4) and (2.4.6) emphasizes that, in G A R C H models, there is no randomness specific to the volatility process. The Autoregressive Random Variance model popularized by Taylor (1986) also belongs to the SARV class. Here: log at+l = ~ + log at + r/t+1 (2.4.7)
where $\eta_{t+1}$ is a white noise disturbance such that $\mathrm{Cov}(\eta_{t+1}, \varepsilon_{t+1}) \neq 0$ to accommodate leverage effects. This is a SARV model with $K_t = \log \sigma_t$, $\alpha = 0$ and $\eta_{t+1} = \gamma u_{t+1}$.$^{21}$
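The way GARCH(1,1) sits inside the SARV recursion (2.4.6) can be checked numerically. The following is a minimal sketch, assuming the timing $y_{t+1} = \sigma_t \varepsilon_{t+1}$ so that the GARCH case corresponds to $u_t = \varepsilon_t^2$; the function names and parameter values are illustrative and not part of the original text.

```python
import random

def sarv_step(k_prev, u, w, beta, gamma, alpha):
    """One step of (2.4.6): K_t = w + beta*K_{t-1} + (gamma + alpha*K_{t-1})*u_t."""
    return w + beta * k_prev + (gamma + alpha * k_prev) * u

def garch_via_sarv(eps, w, alpha, beta, var0):
    """GARCH(1,1) as the SARV case K_t = sigma_t^2, gamma = 0, u_t = eps_t^2 (assumed timing)."""
    path = [var0]
    for e in eps:
        path.append(sarv_step(path[-1], e * e, w, beta, 0.0, alpha))
    return path

def garch_direct(eps, w, alpha, beta, var0):
    """Usual recursion: new variance = w + alpha*y^2 + beta*old variance, y = sqrt(old variance)*eps."""
    path = [var0]
    for e in eps:
        y = path[-1] ** 0.5 * e
        path.append(w + alpha * y * y + beta * path[-1])
    return path

rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(200)]
a = garch_via_sarv(eps, 0.05, 0.10, 0.85, 1.0)
b = garch_direct(eps, 0.05, 0.10, 0.85, 1.0)
# The two paths agree up to floating point rounding.
print(max(abs(x - y) for x, y in zip(a, b)))
```

Because the only random input to the variance path is $\varepsilon_t$ itself, the sketch also illustrates the remark that a GARCH model has no randomness specific to the volatility process.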
Introducing a general class of processes for volatility, like the SARV class discussed in the previous section prompts questions regarding identification. Suppose again that
$$y_{t+1} = \sigma_t \varepsilon_{t+1}, \qquad \sigma_t^q = g(K_t), \quad q \in \{1, 2\} \tag{2.4.8}$$
Andersen (1994) noted that the model is better interpreted by considering the zero-mean white noise process $\tilde u_t = u_t - 1$:
$$K_t = (w + \gamma) + (\alpha + \beta) K_{t-1} + (\gamma + \alpha K_{t-1})\,\tilde u_t \tag{2.4.9}$$
It is clear from the latter that it may be difficult to distinguish empirically the constant $w$ from the "stochastic" constant $\gamma u_t$. Similarly, the identification of the $\alpha$ and $\beta$ parameters separately is also problematic as $(\alpha + \beta)$ governs the persistence of shocks to volatility. These identification problems are usually resolved by imposing (arbitrary) restrictions on the pairs of parameters $(w, \gamma)$ and $(\alpha, \beta)$. The GARCH(1,1) and Autoregressive Random Variance specifications assume that $\gamma = 0$ and $\alpha = 0$ respectively. Identification of all parameters without such restrictions generally requires additional constraints, for instance via some distributional assumptions on $\varepsilon_{t+1}$ and $u_t$, which restrict the semiparametric framework of (2.4.6) into a parametric statistical model.
21 Andersen (1994) also shows that the SARV framework encompasses another type of random variance model that we have considered as ill-specified since it combines (2.4.2) and (2.4.3).
To address more rigorously the issue of identification, it is useful to consider, according to Andersen (1994), the following reparameterization (assuming for notational convenience that $\alpha \neq 0$):
$$K = (w + \gamma)/(1 - \alpha - \beta), \qquad \rho = \alpha + \beta, \qquad \delta = \gamma/\alpha,$$
$$K_t = K + \rho (K_{t-1} - K) + (\delta + K_{t-1})\,\bar u_t \tag{2.4.10}$$
where $\bar u_t = \alpha \tilde u_t$. It is clear from (2.4.10) that only three functions of the original parameters $\alpha, \beta, \gamma, w$ may be identified and that the three parameters $K, \rho, \delta$ are identified from the first three unconditional moments of the process $K_t$ for instance. To give these identification results an empirical content, it is essential to know: (1) how to go from the moments of the observable process $y_t$ to the moments of the volatility process $\sigma_t$, and (2) how to go from the moments of the volatility process $\sigma_t$ to the moments of the latent process $K_t$. The first point is easily solved by specifying the corresponding moments of the standardized innovation process $\varepsilon$. If we assume for instance a Gaussian probability distribution, we obtain:
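$$E|y_t| = \sqrt{2/\pi}\;E\sigma_t$$
$$E|y_t||y_{t-j}| = (2/\pi)\,E(\sigma_t \sigma_{t-j})$$
$$E|y_t^2||y_{t-j}| = \sqrt{2/\pi}\;E(\sigma_t^2 \sigma_{t-j}) \tag{2.4.11}$$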
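The algebra behind the reparameterization (2.4.10) can be verified step by step. The sketch below assumes the mapping $K = (w+\gamma)/(1-\alpha-\beta)$, $\rho = \alpha+\beta$, $\delta = \gamma/\alpha$ and $\bar u_t = \alpha(u_t - 1)$; the helper names are hypothetical.

```python
def reparam(w, beta, gamma, alpha):
    """Map (w, beta, gamma, alpha) to (K, rho, delta) as in (2.4.10); requires alpha != 0."""
    rho = alpha + beta
    return (w + gamma) / (1.0 - rho), rho, gamma / alpha

def step_original(k_prev, u, w, beta, gamma, alpha):
    """Recursion (2.4.6) in the original parameters."""
    return w + beta * k_prev + (gamma + alpha * k_prev) * u

def step_reparam(k_prev, u, kbar, rho, delta, alpha):
    """Recursion (2.4.10); alpha enters only through the scale of u_bar."""
    u_bar = alpha * (u - 1.0)
    return kbar + rho * (k_prev - kbar) + (delta + k_prev) * u_bar

w, beta, gamma, alpha = 0.02, 0.80, 0.05, 0.10
kbar, rho, delta = reparam(w, beta, gamma, alpha)
for k_prev, u in [(0.5, 0.3), (1.2, 2.5), (0.1, 1.0)]:
    lhs = step_original(k_prev, u, w, beta, gamma, alpha)
    rhs = step_reparam(k_prev, u, kbar, rho, delta, alpha)
    print(abs(lhs - rhs) < 1e-12)
```

In this form $\alpha$ appears only as the scale of the disturbance $\bar u_t$, which is one way to see why the four original parameters collapse to three identifiable functions unless a distributional assumption pins down that scale.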
The solution of the second point requires in general the specification of the mapping $g$ and of the probability distribution of $u_t$ in (2.4.6). For the so-called Lognormal SARV model, it is assumed that $\alpha = 0$ and $K_t = \log \sigma_t$ (Taylor's autoregressive random variance model) and that $u_t$ is normally distributed (Lognormality of the volatility process). In this case, it is easy to show that:
(2.4.12)
Without the normality assumption (i.e. QML, mixture of normal, Student distribution...) this model will be studied in much more detail in sections 3 and 5
from both probabilistic and statistical points of view. Moreover, this is a template for studying other specifications of the SARV class of models. In addition, various specifications will be considered in Section 4 as proxies of continuous time models.
The purpose of this section will be to discuss the statistical handling of discrete time SV models, using simple univariate cases. We start by defining the most basic SV model corresponding to the autoregressive random variance model discussed earlier in (2.4.7). We study its statistical properties in Section 3.2 and provide a comparison with ARCH models in Section 3.3. Section 3.4 is devoted to filtering, prediction and smoothing. Various extensions, including multivariate models, are covered in the last section. Estimation of the parameters governing the volatility process is discussed later in Section 5.
3.1.
(3.1.1)
where $y_t$ denotes the demeaned return process $y_t = \log(S_t/S_{t-1}) - \mu$ and $\log \sigma_t^2$ follows an AR(1) process. It will be assumed that $\varepsilon_t$ is a series of independent, identically distributed random disturbances. Usually $\varepsilon_t$ is specified to have a standard distribution so its variance $\sigma_\varepsilon^2$ is known. Thus for a normal distribution $\sigma_\varepsilon^2$ is unity while for a t-distribution with $\nu$ degrees of freedom it will be $\nu/(\nu - 2)$. Following a convention often adopted in the literature we write $h_t = \log \sigma_t^2$:
$$y_t = \sigma \varepsilon_t e^{0.5 h_t} \tag{3.1.2}$$
where $\sigma$ is a scale parameter, which removes the need for a constant term in the stationary first-order autoregressive process
$$h_{t+1} = \phi h_t + \eta_t, \qquad \eta_t \sim IID(0, \sigma_\eta^2), \qquad |\phi| < 1. \tag{3.1.3}$$
It was noted before that if $\varepsilon_t$ and $\eta_t$ are allowed to be correlated with each other, the model can pick up the kind of asymmetric behavior which is often found in stock prices. Indeed a negative correlation between $\varepsilon_t$ and $\eta_t$ induces a leverage effect. As in Section 2.4.1, the timing of the disturbance in (3.1.3) ensures that the observations are still a martingale difference, the equation being written in this way so as to tie in with the state space literature. It should be stressed that the above model is only an approximation to the continuous time models of Section 2 observed at discrete intervals. The accuracy of the approximation is examined in Dassios (1995) using Edgeworth expansions (see also Sections 4.1 and 4.3 for further discussion).
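The basic model (3.1.2)-(3.1.3) is easy to simulate once the two disturbance series are given. The following is a minimal sketch in which the shocks are passed in explicitly, so that the recursion itself is deterministic; the function name, the starting value $h_0 = 0$ and the parameter values are illustrative assumptions, and the shocks are drawn independently (no leverage effect).

```python
import math
import random

def sv_path(eps, eta, sigma, phi, h0=0.0):
    """Return (y, h) with y_t = sigma*eps_t*exp(0.5*h_t) and h_{t+1} = phi*h_t + eta_t."""
    y, h = [], [h0]
    for e, n in zip(eps, eta):
        y.append(sigma * e * math.exp(0.5 * h[-1]))
        h.append(phi * h[-1] + n)       # eta_t only affects next period's volatility
    return y, h[:-1]

rng = random.Random(1)
T = 1000
eps = [rng.gauss(0.0, 1.0) for _ in range(T)]
eta = [rng.gauss(0.0, 0.15) for _ in range(T)]   # standard deviation of eta_t, assumed value
y, h = sv_path(eps, eta, sigma=0.01, phi=0.98)
print(len(y), len(h))
```

Note how $\eta_t$ enters $h_{t+1}$ rather than $h_t$; this is the timing convention that keeps $y_t$ a martingale difference even when the two shocks are correlated.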
Similarly if the fourth moment of $\varepsilon_t$ exists, the kurtosis of $y_t$ is $\kappa \exp(\sigma_h^2)$, where $\kappa$ is the kurtosis of $\varepsilon_t$, so $y_t$ exhibits more kurtosis than $\varepsilon_t$. Finally all the odd moments are zero. For many purposes we need to consider the moments of powers of absolute values. Again, $\eta_t$ is assumed to be normally distributed. Then for $\varepsilon_t$ having a standard normal distribution, the following expressions are derived in Harvey (1993):
$$E|y_t|^c = \sigma^c\, 2^{c/2}\, \frac{\Gamma(c/2 + 1/2)}{\Gamma(1/2)} \exp\!\left(\frac{c^2}{8}\sigma_h^2\right), \qquad c > -1,\ c \neq 0 \tag{3.2.2}$$
and
$$\mathrm{Var}|y_t|^c = \sigma^{2c}\, 2^{c} \exp\!\left(\frac{c^2}{2}\sigma_h^2\right) \left\{ \frac{\Gamma(c + 1/2)}{\Gamma(1/2)} - \left[\frac{\Gamma(c/2 + 1/2)}{\Gamma(1/2)}\right]^2 \exp\!\left(-\frac{c^2}{4}\sigma_h^2\right) \right\}, \qquad c > -0.5,\ c \neq 0.$$
Note that $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(1) = 1$. Corresponding expressions may be computed for other distributions of $\varepsilon_t$ including Student's t and the General Error Distribution (see Nelson (1991)). Finally, the square of the coefficient of variation of $\sigma_t^2$ is often used as a measure of the relative strength of the SV process. This is $\mathrm{Var}(\sigma_t^2)/[E(\sigma_t^2)]^2 = \exp(\sigma_h^2) - 1$. Jacquier, Polson and Rossi (1994) argue that this is more easily interpretable than $\sigma_\eta^2$. In the empirical studies they quote it is rarely less than 0.1 or greater than 2.
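The moment expressions can be evaluated directly with the gamma function. The sketch below assumes Gaussian $\varepsilon_t$ and Gaussian $h_t$ with variance $\sigma_h^2$, and obtains the variance as $E|y_t|^{2c} - (E|y_t|^c)^2$ rather than from a separate closed form; the function names are hypothetical.

```python
import math

def abs_moment(c, sigma, var_h):
    """E|y_t|^c from (3.2.2) for Gaussian eps_t and Gaussian h_t; requires c > -1."""
    g = math.gamma(c / 2 + 0.5) / math.gamma(0.5)
    return sigma ** c * 2 ** (c / 2) * g * math.exp(c * c * var_h / 8)

def abs_variance(c, sigma, var_h):
    """Var|y_t|^c computed as E|y_t|^(2c) - (E|y_t|^c)^2; requires c > -0.5."""
    return abs_moment(2 * c, sigma, var_h) - abs_moment(c, sigma, var_h) ** 2

var_h = 0.5                                   # illustrative value of sigma_h^2
kurtosis = abs_moment(4, 1.0, var_h) / abs_moment(2, 1.0, var_h) ** 2
cv2 = math.exp(var_h) - 1                     # squared coefficient of variation of sigma_t^2
# The first two printed numbers coincide: kurtosis of y_t equals 3*exp(sigma_h^2).
print(kurtosis, 3 * math.exp(var_h), cv2)
```

With $\sigma_h^2 = 0.5$ the squared coefficient of variation is about 0.65, inside the range of 0.1 to 2 quoted from the empirical studies.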
where $\kappa_c$ is
$$\kappa_c = E(|y_t|^{2c})/\{E(|y_t|^c)\}^2, \tag{3.2.4}$$
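and $\rho_{h,\tau}$, $\tau = 0, 1, 2, \ldots$ denotes the ACF of $h_t$. Taylor (1986) gives this expression for $c$ equal to one and two and $\varepsilon_t$ normally distributed. When $c = 2$, $\kappa_c$ is the kurtosis and this is three for a normal distribution. More generally,
$$\kappa_c = \Gamma(c + 1/2)\,\Gamma(1/2)/\{\Gamma(c/2 + 1/2)\}^2, \qquad c \neq 0,$$
while for a t-distribution with $\nu$ degrees of freedom
$$\kappa_c = \frac{\Gamma(c + 1/2)\,\Gamma(-c + \nu/2)\,\Gamma(1/2)\,\Gamma(\nu/2)}{\{\Gamma(c/2 + 1/2)\,\Gamma(-c/2 + \nu/2)\}^2}, \qquad |c| < \nu/2,\ c \neq 0. \tag{3.2.5}$$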
Note that $\nu$ must be at least five if $c$ is two. The ACF, $\rho_\tau^{(c)}$, has the following features. First, if $\sigma_h^2$ is small and/or $\rho_{h,\tau}$ is close to one,
$$\rho_\tau^{(c)} \simeq \rho_{h,\tau}\, \frac{\exp\!\left(\frac{c^2}{4}\sigma_h^2\right) - 1}{\kappa_c \exp\!\left(\frac{c^2}{4}\sigma_h^2\right) - 1}, \qquad \tau \ge 1; \tag{3.2.6}$$
compare Taylor (1986, pp. 74-5). Thus the shape of the ACF of $h_t$ is approximately carried over to $\rho_\tau^{(c)}$ except that it is multiplied by a factor of proportionality, which must be less than one for $c$ positive as $\kappa_c$ is greater than one. Secondly, for the t-distribution, $\kappa_c$ declines as $\nu$ goes to infinity. Thus $\rho_\tau^{(c)}$ is a maximum for a normal distribution. On the other hand, a distribution with less kurtosis than the normal will give rise to higher values of $\rho_\tau^{(c)}$. Although (3.2.6) gives an explicit relationship between $\rho_\tau^{(c)}$ and $c$, it does not appear possible to make any general statements regarding $\rho_\tau^{(c)}$ being maximized for certain values of $c$. Indeed different values of $\sigma_h^2$ lead to different values of $c$ maximizing $\rho_\tau^{(c)}$. If $\sigma_h^2$ is chosen so as to give values of $\rho_\tau^{(c)}$ of a similar size to those reported in Ding, Granger and Engle (1993) then the maximum appears to be attained for $c$ slightly less than one. The shape of the curve relating $\rho_\tau^{(c)}$ to $c$ is similar to the empirical relationships reported in Ding, Granger and Engle, as noted by Harvey (1993).
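The factor $\kappa_c$ and the approximation (3.2.6) can be tabulated with a few lines of code. This is a sketch under stated assumptions: the Student-t expression for $\kappa_c$ is the one implied by writing $\varepsilon_t$ as a normal variate divided by the square root of a scaled chi-square, and the "exact" ACF below is derived from the lognormal moments of $\exp(c h_t/2)$ for Gaussian $h_t$, not quoted from the text. All function names are hypothetical.

```python
import math

def kappa_normal(c):
    """kappa_c for standard normal eps_t."""
    return math.gamma(c + 0.5) * math.gamma(0.5) / math.gamma(c / 2 + 0.5) ** 2

def kappa_student(c, nu):
    """kappa_c for Student-t eps_t with nu degrees of freedom; requires |c| < nu/2."""
    num = math.gamma(c + 0.5) * math.gamma(nu / 2 - c) * math.gamma(0.5) * math.gamma(nu / 2)
    den = (math.gamma(c / 2 + 0.5) * math.gamma(nu / 2 - c / 2)) ** 2
    return num / den

def acf_power_exact(rho_h, c, var_h, kappa_c):
    """ACF of |y_t|^c for Gaussian h_t (derived expression, an assumption of this sketch)."""
    x = c * c * var_h / 4
    return (math.exp(x * rho_h) - 1) / (kappa_c * math.exp(x) - 1)

def acf_power_approx(rho_h, c, var_h, kappa_c):
    """Approximation (3.2.6): rho_h times a factor of proportionality."""
    x = c * c * var_h / 4
    return rho_h * (math.exp(x) - 1) / (kappa_c * math.exp(x) - 1)

print(kappa_normal(2), kappa_student(2, 6))
print(acf_power_exact(0.98, 1, 0.5, kappa_normal(1)),
      acf_power_approx(0.98, 1, 0.5, kappa_normal(1)))
```

For $c = 2$ the Student-t value reduces to the familiar kurtosis $3(\nu - 2)/(\nu - 4)$, which is a convenient check on the gamma-function expression.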
3.2.2. Logarithmic transformation
Squaring the observations in (3.1.2) and taking logarithms gives
$$\log y_t^2 = \log \sigma^2 + h_t + \log \varepsilon_t^2. \tag{3.2.7}$$
Alternatively
$$\log y_t^2 = \omega + h_t + \xi_t \tag{3.2.8}$$
where $\omega = \log \sigma^2 + E \log \varepsilon_t^2$, so that the disturbance $\xi_t$ has zero mean by construction. The mean and variance of $\log \varepsilon_t^2$ are known to be $-1.27$ and $\pi^2/2 = 4.93$ when $\varepsilon_t$ has a standard normal distribution; see Abramowitz and Stegun (1970). However, the distribution of $\log \varepsilon_t^2$ is far from being normal, being heavily skewed with a long tail. More generally, if $\varepsilon_t$ has a t-distribution with $\nu$ degrees of freedom, it can be expressed as:
$$\varepsilon_t = \zeta_t \kappa_t^{-0.5}$$
where $\zeta_t$ is a standard normal variate and $\kappa_t$ is independently distributed such that $\nu \kappa_t$ is chi-square with $\nu$ degrees of freedom. Thus $\log \varepsilon_t^2 = \log \zeta_t^2 - \log \kappa_t$ and again using results in Abramowitz and Stegun (1970), it follows that the mean and variance of $\log \varepsilon_t^2$ are $-1.27 - \psi(\nu/2) + \log(\nu/2)$ and $4.93 + \psi'(\nu/2)$ respectively, where $\psi(\cdot)$ is the digamma function. Note that the moments of $\xi_t$ exist even if the model is formulated in such a way that the distribution of $\varepsilon_t$ is Cauchy, that is $\nu = 1$. In fact in this case $\xi_t$ is symmetric with excess kurtosis two, compared with excess kurtosis four when $\varepsilon_t$ is Gaussian. Since $\log \varepsilon_t^2$ is serially independent, it is straightforward to work out the ACF of $\log y_t^2$ for $h_t$ following any stationary process:
$$\rho_\tau^{(0)} = \rho_{h,\tau}/\{1 + \sigma_\xi^2/\sigma_h^2\}, \qquad \tau \ge 1. \tag{3.2.9}$$
The notation $\rho_\tau^{(0)}$ reflects the fact that the ACF of a power of an absolute value of the observation is the same as that of the Box-Cox transform, that is $\{|y_t|^c - 1\}/c$, and hence the logarithmic transform of an absolute value, raised to any (nonzero) power, corresponds to $c = 0$. (But note that one cannot simply set $c = 0$ in (3.2.3)). Note that even if $\eta_t$ and $\varepsilon_t$ are not mutually independent, the $\eta_t$ and $\xi_t$ disturbances are uncorrelated if the joint distribution of $\varepsilon_t$ and $\eta_t$ is symmetric, that is $f(\varepsilon_t, \eta_t) = f(-\varepsilon_t, \eta_t)$; see Harvey, Ruiz and Shephard (1994). Hence the expression for the ACF in (3.2.9) remains valid.
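A short calculation shows how strongly the noise $\xi_t$ dampens the ACF of $\log y_t^2$ in (3.2.9). The sketch below assumes Gaussian $\varepsilon_t$, for which the mean and variance of $\log \varepsilon_t^2$ are $\psi(1/2) + \log 2$ and $\psi'(1/2) = \pi^2/2$, and an AR(1) log-volatility; the parameter values are illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649015329

# For Gaussian eps_t: E log eps_t^2 = psi(1/2) + log 2 = -gamma_E - log 2.
mean_log_eps2 = -EULER_GAMMA - math.log(2.0)
var_xi = math.pi ** 2 / 2                    # psi'(1/2)

def acf_log_squares(rho_h, var_h, var_xi=var_xi):
    """(3.2.9): ACF of log y_t^2 at a lag where h_t has autocorrelation rho_h."""
    return rho_h / (1.0 + var_xi / var_h)

phi, var_eta = 0.98, 0.02                    # assumed AR(1) parameters for h_t
var_h = var_eta / (1 - phi ** 2)             # stationary variance of h_t
print(round(mean_log_eps2, 2), round(var_xi, 2), acf_log_squares(phi, var_h))
```

Even with $\phi = 0.98$ the first-order autocorrelation of $\log y_t^2$ is below 0.1 in this example, because $\sigma_\xi^2$ is roughly ten times $\sigma_h^2$.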
t = 1,...,T .
(3.3.1)
The GARCH model was proposed by Bollerslev (1986) and Taylor (1986), and is a generalization of the ARCH model formulated by Engle (1982). The
ARCH(1) model is a special case of GARCH(1,1) with $\beta = 0$. The motivation comes from forecasting; in an AR(1) model with independent disturbances, the optimal prediction of the next observation is a fraction of the current observation, and in ARCH(1) it is a fraction of the current squared observation (plus a constant). The reason is that the optimal forecast is constructed conditional on the current information and in an ARCH model the variance in the next period is assumed to be known. This construction leads directly to a likelihood function for the model once a distribution is assumed for $\varepsilon_t$. Thus estimation of the parameters upon which $\sigma_t^2$ depends is straightforward in principle. The GARCH formulation introduces terms analogous to moving average terms in an ARMA model, thereby making forecasts a function of a distributed lag of past squared observations. It is straightforward to show that $y_t$ is a martingale difference with (unconditional) variance $\gamma/(1 - \alpha - \beta)$. Thus $\alpha + \beta < 1$ is the condition for covariance stationarity. As shown in Bollerslev (1986), the condition under which the fourth moment exists in a Gaussian model is $2\alpha^2 + (\alpha + \beta)^2 < 1$. The model then exhibits excess kurtosis. However, the fourth moment condition may not always be satisfied in practice. Somewhat paradoxically, the conditions for strict stationarity are much weaker and, as shown by Nelson (1990), even include the case $\alpha + \beta = 1$. The specification of GARCH(1,1) means that we can write
$$y_t^2 = \gamma + \alpha y_{t-1}^2 + \beta \sigma_{t-1}^2 + v_t = \gamma + (\alpha + \beta) y_{t-1}^2 + v_t - \beta v_{t-1}$$
where $v_t = y_t^2 - \sigma_t^2$ is a martingale difference. Thus $y_t^2$ has the form of an ARMA(1,1) process and so its ACF can be evaluated in the same way. The ACF of the corresponding ARMA model seems to be indicative of the type of patterns likely to be observed in practice in correlograms of $y_t^2$. The GARCH model extends by adding more lags of $\sigma_t^2$ and $y_t^2$. However, GARCH(1,1) seems to be the most widely used.
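The ARMA(1,1) representation of $y_t^2$ gives its ACF immediately. The sketch below uses the standard ARMA(1,1) autocorrelation formula with AR coefficient $\alpha + \beta$ and MA coefficient $-\beta$, together with the stationarity and fourth-moment conditions quoted above; the function names and parameter values are illustrative, and the ACF is only meaningful when the fourth moment exists.

```python
def garch11_summary(gamma, alpha, beta):
    """Moment conditions and unconditional variance for a Gaussian GARCH(1,1)."""
    persistence = alpha + beta
    out = {
        "covariance_stationary": persistence < 1.0,
        "fourth_moment_exists": 2 * alpha ** 2 + persistence ** 2 < 1.0,
    }
    if out["covariance_stationary"]:
        out["variance"] = gamma / (1.0 - persistence)
    return out

def arma11_acf(ar, ma, max_lag):
    """ACF of x_t = ar*x_{t-1} + v_t + ma*v_{t-1} for lags 1..max_lag."""
    rho1 = (1 + ar * ma) * (ar + ma) / (1 + 2 * ar * ma + ma ** 2)
    return [rho1 * ar ** (k - 1) for k in range(1, max_lag + 1)]

def garch11_acf_squares(alpha, beta, max_lag):
    """ACF of y_t^2 via the ARMA(1,1) form: AR = alpha + beta, MA = -beta."""
    return arma11_acf(alpha + beta, -beta, max_lag)

print(garch11_summary(0.05, 0.10, 0.85))
print(garch11_acf_squares(0.10, 0.85, 3))
```

The autocorrelations start from a modest value at lag one and then decay slowly at rate $\alpha + \beta$, which is the pattern typically seen in correlograms of squared returns.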
It displays similar properties to the SV model, particularly if $\phi$ is close to one. This should be clear from (3.2.6) which has the pattern of an ARMA(1,1) process. Clearly $\phi$ plays a role similar to that of $\alpha + \beta$. The main difference in the ACFs seems to show up most at lag one. Jacquier et al. (1994, p. 373) present a graph of the correlogram of the squared weekly returns of a portfolio on the New York Stock Exchange together with the ACFs implied by fitting SV and GARCH(1,1) models. In this case the ACF implied by the SV model is closer to the sample values. The SV model displays excess kurtosis even if $\phi$ is zero since $y_t$ is a mixture of distributions. The $\sigma_\eta^2$ parameter governs the degree of mixing independently of the degree of smoothness of the variance evolution. This is not the case with a GARCH model where the degree of kurtosis is tied to the roots of the variance equation, $\alpha$ and $\beta$ in the case of GARCH(1,1). Hence, it is very often necessary to use a non-Gaussian GARCH model to capture the high kurtosis typically found in a financial time series. The basic GARCH model does not allow for the kind of asymmetry captured by a SV model with contemporaneously correlated disturbances, although it can
be modified as suggested in Engle and Ng (1993). The EGARCH model, proposed by Nelson (1991), handles asymmetry by taking $\log \sigma_t^2$ to be a function of past squares and absolute values of the observations.
3.4. Filtering, smoothing and prediction
For the purposes of pricing options, we need to be able to estimate and predict the variance, $\sigma_t^2$, which of course, is proportional to the exponential of $h_t$. An estimate based on all the observations up to, and possibly including, the one at time $t$ is called a filtered estimate. On the other hand an estimate based on all the observations in the sample, including those which came after time $t$, is called a smoothed estimate. Predictions are estimates of future values. As a matter of historical interest we may wish to examine the evolution of the variance over time by looking at the smoothed estimates. These might be compared with the volatilities implied by the corresponding options prices as discussed in Section 2.1.2. For pricing "at the money" options we may be able to simply use the filtered estimate at the end of the sample and the predictions of future values of the variance, as in the method suggested for ARCH models by Noh, Engle and Kane (1994). More generally, it may be necessary to base prices on the full distribution of future values of the variance, perhaps obtained by simulation techniques; for further discussion see Section 4.2.
One can think of constructing filtered and smoothed estimates in a very simple, but arbitrary way, by taking functions (involving estimated parameters) of moving averages of transformed observations. Thus:
$$\hat\sigma_t^2 = g\left\{ \sum_{j=r}^{t-1} w_{tj}\, f(y_{t-j}) \right\}, \qquad t = 1, \ldots, T, \tag{3.4.1}$$
where $r = 0$ or $1$ for a filtered estimate and $r = t - T$ for a smoothed estimate. Since we have formulated a stochastic volatility model, the natural course of action is to use this as the basis for filtering, smoothing and prediction. For a linear and Gaussian time series model, the state space form can be used as the basis for optimal filtering and smoothing algorithms. Unfortunately, the SV model is nonlinear. This leaves us with three possibilities:
a. compute inefficient estimates based on a linear state space model;
b. use computer intensive techniques to estimate the optimal filter to a desired level of accuracy;
c. use an (unspecified) ARCH model to approximate the optimal filter.
We now turn to examine each of these in some detail.
3.4.1. Linear state space form
The transformed observations, the $\log y_t^2$'s, can be used to construct a linear state space model as suggested by Nelson (1988) and Harvey, Ruiz and Shephard (1994). The measurement equation is (3.2.8) while (3.1.3) is the transition equation. The initial conditions for the state, $h_t$, are given by its unconditional mean and variance, that is zero and $\sigma_\eta^2/(1 - \phi^2)$ respectively.
While it may be reasonable to assume that $\eta_t$ is normal, $\xi_t$ would only be normal if the absolute value of $\varepsilon_t$ were lognormal. This is unlikely. Thus application of the Kalman filter and the associated smoothers yields estimators of the state, $h_t$, which are only optimal within the class of estimators based on linear combinations of the $\log y_t^2$'s. Furthermore, it is not the $h_t$'s which are required, but rather their exponents. Suppose $h_{t|T}$ denotes the smoothed estimator obtained from the linear state space form. Then $\exp(h_{t|T})$ is of the form (3.4.1), multiplied by an estimate of the scaling constant, $\sigma^2$. It can be written as a weighted geometric mean. This makes the estimates vulnerable to very small observations and is an indication of the limitations of this approach.
Working with the logarithmic transformation raises an important practical issue, namely how to handle observations which are zero. This is a reflection of the point raised in the previous paragraph, since obviously any weighted geometric mean involving a zero observation will be zero. More generally we wish to avoid very small observations. One possible solution is to remove the sample mean. A somewhat more satisfactory alternative, suggested by Fuller, and studied by Breidt and Carriquiry (1995), is to make the following transformation based on a Taylor series expansion:
$$\log y_t^2 \cong \log(y_t^2 + c s_y^2) - c s_y^2/(y_t^2 + c s_y^2), \qquad t = 1, \ldots, T, \tag{3.4.2}$$
where $s_y^2$ is the sample variance of the $y_t$'s and $c$ is a small number, the suggested value being 0.02. The effect of this transformation is to reduce the kurtosis in the transformed observations by cutting down the long tail made up of the negative values obtained by taking the logarithms of the "inliers". In other words it is a form of trimming. It might be more satisfactory to carry out this procedure after correcting the observations for heteroskedasticity by dividing by preliminary estimates, $\tilde\sigma_t^2$'s. The $\log \tilde\sigma_t^2$'s are then added to the transformed observations. The $\tilde\sigma_t^2$'s could be constructed from a first round or by using a totally different procedure, perhaps a nonparametric one.
The linear state space form can be modified so as to deal with asymmetric models. It was noted earlier that even if $\eta_t$ and $\varepsilon_t$ are not mutually independent, the disturbances in the state space form are uncorrelated if the joint distribution of $\varepsilon_t$ and $\eta_t$ is symmetric. Thus the above filtering and smoothing operations are still valid, but there is a loss of information stemming from the squaring of the observations. Harvey and Shephard (1993) show that this information may be recovered by conditioning on the signs of the observations denoted by $s_t$, a variable which takes the value $+1$ ($-1$) when $y_t$ is positive (negative). These signs are, of course, the same as the signs of the $\varepsilon_t$'s. Let $E_+$ ($E_-$) denote the expectation conditional on $\varepsilon_t$ being positive (negative), and assign a similar interpretation to variance and covariance operators. The distribution of $\xi_t$ is not affected by conditioning on the signs of the $\varepsilon_t$'s, but, remembering that $E(\eta_t \mid \varepsilon_t)$ is an odd function of $\varepsilon_t$,
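$= E_+(\eta_t \xi_t)$
because the expectation of $\xi_t$ is zero and
$$E_+(\eta_t \xi_t) = E_+[E(\eta_t \mid \varepsilon_t) \log \varepsilon_t^2] - \mu^* E(\log \varepsilon_t^2) = -E_-(\eta_t \xi_t).$$
Finally
$$\mathrm{Var}_+ \eta_t = E_+(\eta_t^2) - [E_+(\eta_t)]^2 = \sigma_\eta^2 - \mu^{*2}.$$
The linear state space form is now
$$\log y_t^2 = \omega + h_t + \xi_t$$
$$h_{t+1} = \phi h_t + s_t \mu^* + \eta_t^* \tag{3.4.3}$$
$$\begin{pmatrix} \xi_t \\ \eta_t^* \end{pmatrix} \Big|\, s_t \sim ID\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_\xi^2 & \gamma^* s_t \\ \gamma^* s_t & \sigma_\eta^2 - \mu^{*2} \end{pmatrix} \right)$$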
The Kalman filter may still be initialized by taking $h_0$ to have mean zero and variance $\sigma_\eta^2/(1 - \phi^2)$. The parameterization in (3.4.3) does not directly involve a parameter representing the correlation between $\varepsilon_t$ and $\eta_t$. The relationship between $\mu^*$ and $\gamma^*$ and the original parameters in the model can only be obtained by making a distributional assumption about $\varepsilon_t$ as well as $\eta_t$. When $\varepsilon_t$ and $\eta_t$ are bivariate normal with $\mathrm{Corr}(\varepsilon_t, \eta_t) = \rho$, $E(\eta_t \mid \varepsilon_t) = \rho \sigma_\eta \varepsilon_t$, and so
$$\mu^* = E_+(\eta_t) = \rho \sigma_\eta E_+(\varepsilon_t) = \rho \sigma_\eta \sqrt{2/\pi} = 0.7979\, \rho \sigma_\eta. \tag{3.4.4}$$
Furthermore,
(3.4.5)
When $\varepsilon_t$ has a t-distribution, it can be written as $\zeta_t \kappa_t^{-0.5}$, and $\zeta_t$ and $\eta_t$ can be regarded as having a bivariate normal distribution with correlation $\rho$, while $\kappa_t$ is independent of both. To evaluate $\mu^*$ and $\gamma^*$ one proceeds as before, except that the initial conditioning is on $\zeta_t$ rather than on $\varepsilon_t$, and the required expressions are found to be exactly as in the Gaussian case. The filtered estimate of the log volatility $h_t$, written as $h_{t+1|t}$, takes the form:
$$h_{t+1|t} = \phi h_{t|t-1} + \frac{\phi (P_{t|t-1} + \gamma^* s_t)}{P_{t|t-1} + 2\gamma^* s_t + \sigma_\xi^2}\, (\log y_t^2 - \omega - h_{t|t-1}) + s_t \mu^*,$$
where $P_{t|t-1}$ is the corresponding mean square error of the $h_{t|t-1}$. If $\rho < 0$, then $\gamma^* < 0$, and the filtered estimator will behave in a similar way to the EGARCH
model estimated by Nelson (1991), with negative observations causing bigger increases in the estimated log volatility than corresponding positive values.
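For the symmetric case, the linear state space form built from (3.2.8) and (3.1.3) can be filtered with a scalar Kalman recursion. The following is a minimal sketch, not the asymmetric filter above: it assumes uncorrelated disturbances, known parameters, the Gaussian value $\sigma_\xi^2 = \pi^2/2$, and a sample variance computed with divisor $T$ in the transformation (3.4.2). Function names and the toy data are hypothetical.

```python
import math

def fuller_log_squares(y, c=0.02):
    """Transformation (3.4.2): log(y^2 + c*s2) - c*s2/(y^2 + c*s2), s2 = sample variance."""
    m = sum(y) / len(y)
    s2 = sum((v - m) ** 2 for v in y) / len(y)   # divisor T is an assumption of this sketch
    return [math.log(v * v + c * s2) - c * s2 / (v * v + c * s2) for v in y]

def kalman_log_vol(x, omega, phi, var_eta, var_xi=math.pi ** 2 / 2):
    """One-step-ahead predictions h_{t|t-1} and their MSEs for
    x_t = omega + h_t + xi_t, h_{t+1} = phi*h_t + eta_t (uncorrelated disturbances)."""
    h, p = 0.0, var_eta / (1.0 - phi ** 2)       # unconditional mean and variance of h_t
    preds, mses = [], []
    for obs in x:
        preds.append(h)
        mses.append(p)
        f = p + var_xi                           # variance of the prediction error
        gain = phi * p / f
        h = phi * h + gain * (obs - omega - h)
        p = phi ** 2 * p - gain ** 2 * f + var_eta
    return preds, mses

y = [0.0, 0.01, -0.02, 0.015]                    # toy returns, including an exact zero
x = fuller_log_squares(y)
preds, mses = kalman_log_vol(x, omega=-10.0, phi=0.95, var_eta=0.05)
print(x[0], mses[:2])
```

The zero observation produces a finite transformed value, which is the practical point of (3.4.2); with a plain logarithm the first element would be undefined.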
the discreteness of the Kim and Shephard state space, not all states can be visited in the small number of draws mentioned, i.e. the so called inlier problem (see also Section 3.4.1 and Nelson (1994)) is still present. As a final point it should be noted that when the hyperparameters are unknown, the simulated distribution of the state produced by the Bayesian approach allows for their sampling variability.
3.4.3. ARCH models as approximate filters
The purpose here is to draw attention to a subject that will be discussed in greater detail in Section 4.3. In an ARCH model the conditional variance is assumed to be an exact function of past observations. As pointed out by Nelson and Foster (1994, p. 32) this assumption is ad hoc on both economic and statistical grounds. However, because ARCH models are relatively easy to estimate, Nelson (1992) and Nelson and Foster (1994) have argued that a useful strategy is to regard them as filters which produce estimates of the conditional variance. Thus even if we believe we have a continuous time or discrete time SV model, we may decide to estimate a GARCH(1,1) model and treat the $\sigma_t^2$'s as an approximate filter, as in (3.4.1). Thus the estimate is a weighted average of past squared observations. It delivers an estimate of the mean of the distribution of $\sigma_t^2$ conditional on the observations at time $t - 1$. As an alternative, the model suggested by Taylor (1986) and Schwert (1989), in which the conditional standard deviation is set up as a linear combination of the previous conditional standard deviation and the previous absolute value, could be used. This may be more robust to outliers as it is a linear combination of past absolute values. Nelson and Foster derive an ARCH model which will give the closest approximation to the continuous time SV formulation (see Section 4.3 for more details). This does not correspond to one of the standard models, although it is fairly close to EGARCH.
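That a GARCH(1,1) variance is a weighted average of past squared observations, in the spirit of (3.4.1), can be seen by unrolling the recursion. The sketch below assumes the recursion $\sigma_{t+1}^2 = \gamma + \alpha y_t^2 + \beta \sigma_t^2$ started from a given initial variance; names and numbers are illustrative.

```python
def garch_filter(y, gamma, alpha, beta, var1):
    """Recursive GARCH(1,1) variances, started at var1; returns len(y) + 1 values."""
    out = [var1]
    for obs in y:
        out.append(gamma + alpha * obs * obs + beta * out[-1])
    return out

def garch_weights_form(y, gamma, alpha, beta, var1):
    """The last filtered variance written as a constant plus a weighted sum of
    past squared observations with geometrically declining weights alpha*beta^j."""
    n = len(y)
    const = gamma * (1 - beta ** n) / (1 - beta) + beta ** n * var1
    return const + alpha * sum(beta ** j * y[n - 1 - j] ** 2 for j in range(n))

y = [0.5, -1.2, 0.3, 2.0]
recursive = garch_filter(y, 0.05, 0.10, 0.85, 1.0)[-1]
weighted = garch_weights_form(y, 0.05, 0.10, 0.85, 1.0)
print(recursive, weighted)
```

The weights depend only on $\alpha$ and $\beta$, so the filter can be applied to data generated by an SV model once those two numbers have been estimated, whatever one believes about the true process.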
For discrete time SV models the filtering theory is not as extensively developed. Indeed, Nelson and Foster point out that a change from stochastic differential equations to difference equations makes a considerable difference in the limit theorems and optimality theory. They study the case of near diffusions as an example to illustrate these differences.
3.5. Extensions of the model
3.5.1. Persistence and seasonality
The simplest nonstationary SV model has $h_t$ following a random walk. The dynamic properties of this model are easily obtained if we work in terms of the logarithmically transformed observations, $\log y_t^2$. All we have to do is first difference to give a stationary process. The untransformed observations are nonstationary but the dynamic structure of the model will appear in the ACF of $|y_t/y_{t-1}|^c$, provided that $c < 0.5$. The model is an alternative to IGARCH, that is (3.3.1) with $\alpha + \beta = 1$. The IGARCH model is such that the squared observations have some of the features of an integrated ARMA process and it is said to exhibit persistence; see Bollerslev
and Engle (1993). However, its properties are not straightforward. For example it must contain a constant, $\gamma$, otherwise, as Nelson (1990) has shown, $\sigma_t^2$ converges almost surely to zero and the model has the peculiar feature of being strictly stationary but not weakly stationary. The nonstationary SV model, on the other hand, can be analyzed on the basis that $h_t$ is a standard integrated process of order one. Filtering and smoothing can be carried out within the linear state space framework, since $\log y_t^2$ is just a random walk plus noise. The initial conditions are handled in the same way as is normally done with nonstationary structural time series models, with a proper prior for the state being effectively formed from the first observation; see Harvey (1989). The optimal filtered estimate of $h_t$ within the class of estimates which are linear in past $\log y_t^2$'s, that is $h_{t|t-1}$, is a constant plus an exponentially weighted moving average (EWMA) of past $\log y_t^2$'s. In IGARCH $\sigma_t^2$ is given exactly by a constant plus an EWMA of past squared observations.
The random walk volatility can be replaced by other nonstationary specifications. One possibility is the doubly integrated random walk in which $\Delta^2 h_t$ is white noise. When formulated in continuous time, this model is equivalent to a cubic spline and is known to give a relatively smooth trend when applied in levels models. It is attractive in the SV context if the aim is to find a weighting function which fits a smoothly evolving variance. However, it may be less stable for prediction.
Other nonstationary components can easily be brought into $h_t$. For example, a seasonal or intradaily component can be included; the specification is exactly as in the corresponding levels models discussed in Harvey (1989) and Harvey and Koopman (1993). Again the dynamic properties are given straightforwardly by the usual transformation applied to $\log y_t^2$, and it is not difficult to transform the absolute values suitably.
Thus if the volatility consists of a random walk plus a slowly changing, nonstationary seasonal as in Harvey (1989, p. 403), the appropriate transformations are $\Delta_s \log y_t^2$ and $|y_t/y_{t-s}|^c$ where $s$ is the number of seasons. The state space formulation follows along the lines of the corresponding structural time series models for levels. Handling such effects is not so easy within the GARCH framework. Different approaches to seasonality can also be incorporated in SV models using ideas of time deformation as discussed in a later subsection. Such approaches may be particularly relevant when dealing with the kind of abrupt changes in seasonality which seem to occur in high frequency, like five minute or tick-by-tick, foreign exchange data.
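For the random walk SV model, the link between the Kalman filter and an EWMA of past $\log y_t^2$'s can be made concrete. The sketch below assumes a random walk plus noise model with signal-to-noise ratio $q = \sigma_\eta^2/\sigma_\xi^2$ and uses the steady-state solution of the filter's variance recursion; the function names and the start-up value are hypothetical choices.

```python
import math

def steady_state_gain(q):
    """Steady-state Kalman gain for a random walk plus noise,
    with signal-to-noise ratio q = var_eta / var_xi."""
    p = (q + math.sqrt(q * q + 4 * q)) / 2       # prediction MSE in units of var_xi
    return p / (p + 1)

def ewma_log_vol(x, omega, q, h_init):
    """Predictions h_{t|t-1} as an EWMA of past (x_t - omega), x_t = log y_t^2."""
    lam = steady_state_gain(q)
    h = h_init
    preds = []
    for obs in x:
        preds.append(h)
        h = (1 - lam) * h + lam * (obs - omega)  # exponentially declining weights
    return preds

# Iterating the scaled variance recursion p <- p/(p+1) + q reaches the same gain.
q, p = 0.01, 1.0
for _ in range(500):
    p = p / (p + 1) + q
print(p / (p + 1), steady_state_gain(q))
```

A small $q$ gives a small smoothing constant, so the implied weights decline slowly and the estimated log volatility evolves smoothly.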
3.5.2. Interventions and other deterministic effects
Intervention variables are easily incorporated into SV models. For example, a sudden structural change in the volatility process can be captured by assuming that
$$\log \sigma_t^2 = \log \sigma^2 + h_t + \lambda w_t$$
where $w_t$ is zero before the break and one after, and $\lambda$ is an unknown parameter. The logarithmic transformation gives (3.2.8) but with $\lambda w_t$ added to the right hand side. Care needs to be taken when incorporating such effects into ARCH models. For example, in the GARCH(1,1) a sudden break has to be modelled as
2
with $\lambda$ constrained so that $\sigma_t^2$ is always positive. More generally observable explanatory variables, as opposed to intervention dummies, may enter into the model for the variance.
3.5.3. Multivariate models
The multivariate model corresponding to (3.1.2) assumes that each series is generated by a model of the form
$$y_{it} = \sigma_i \varepsilon_{it} e^{0.5 h_{it}}, \qquad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{3.5.1}$$
with the covariance (correlation) matrix of the vector $\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{Nt})'$ being denoted by $\Sigma_\varepsilon$. The vector of volatilities, $h_t$, follows a VAR(1) process, that is
$$h_{t+1} = \Phi h_t + \eta_t,$$
where $\eta_t \sim IID(0, \Sigma_\eta)$. This specification allows the movements in volatility to be correlated across different series via $\Sigma_\eta$. Interactions can be picked up by the off-diagonal elements of $\Phi$. The logarithmic transformation of squared observations leads to a multivariate linear state space model from which estimates of the volatilities can be computed as in Section 3.4.1. A simple nonstationary model is obtained by assuming that the volatilities follow a multivariate random walk, that is $\Phi = I$. If $\Sigma_\eta$ is singular, of rank $K < N$, there are only $K$ components in volatility, that is each $h_{it}$ in (3.5.1) is a linear combination of $K < N$ common trends, that is
$$h_t = \Theta h_t^{\dagger} + \bar h \tag{3.5.2}$$
where $h_t^{\dagger}$ is the $K \times 1$ vector of common random walk volatilities, $\bar h$ is a vector of constants and $\Theta$ is an $N \times K$ matrix of factor loadings. Certain restrictions are needed on $\Theta$ and $\bar h$ to ensure identifiability; see Harvey, Ruiz and Shephard (1994). The logarithms of the squared observations are "cointegrated" in the sense of Engle and Granger (1987) since there are $N - K$ linear combinations of them which are white noise and hence stationary. This implies, for example, that if two series of returns exhibit stochastic volatility, but this volatility is the same with $\Theta' = (1, 1)$, then the ratio of the series will have no stochastic volatility. The application of the related concept of "copersistence" can be found in Bollerslev and Engle (1993). However, as in the univariate case there is some ambiguity about what actually constitutes persistence.
There is no reason why the idea of common components in volatility should not extend to stationary models. The formulation of (3.5.2) would apply, without the need for $\bar h$, and with $h_t^{\dagger}$ modelled, for example, by a VAR(1). Bollerslev, Engle and Wooldridge (1988) show that a multivariate GARCH model can, in principle, be estimated by maximum likelihood, but because of the large number of parameters involved computational problems are often encountered unless restrictions are made. The multivariate SV model is much simpler than the general formulation of a multivariate GARCH. However, it is limited in that it does not model changing covariances. In this sense it is analogous to the restricted multivariate GARCH model of Bollerslev (1986) in which the conditional correlations are assumed to be constant. Harvey, Ruiz and Shephard (1994) apply the nonstationary model to four exchange rates and find just two common factors driving volatility. Another application is in Mahieu and Schotman (1994b). A completely different way of modelling exchange rate volatility is to be found in the latent factor ARCH model of Diebold and Nerlove (1989).
3.5.4. Observation intervals, aggregation and time deformation
Suppose that a SV model is observed every $\delta$ time periods. In this case, $h_\tau$, where $\tau$ denotes the new observation (sampling) interval, is still AR(1) but with parameter $\phi^\delta$. The variance of the disturbance, $\eta_t$, increases, but $\sigma_h^2$ remains the same. This property of the SV model makes it easy to make comparisons across different sampling intervals; for example it makes it clear why if $\phi$ is around 0.98 for daily observations, a value of around 0.9 can be expected if an observation is made every week (assuming a week has 5 days). If averages of observations are observed over the longer period, the comparison is more complicated, as $h_\tau$ will now follow an ARMA(1,1) process. However, the AR parameter is still $\phi^\delta$. Note that it is difficult to change the observation interval of ARCH processes unless the structure is weakened as in Drost and Nijman (1993); see also Section 4.4.1.
Since, as noted in Section 2.4, one typically uses a discrete time approximation to the continuous time model, it is quite straightforward to handle irregularly spaced observations by using the linear state space form as described, for example, in Harvey (1989). Indeed the approach originally proposed by Clark (1973) based on subordinated processes to describe asset prices and their volatility fits quite well into this framework. The techniques for handling irregularly spaced observations can be used as the basis for dealing with time deformed observations, as noted by Stock (1988). Ghysels and Jasiak (1994a,b) suggest a SV model in which the operational time for the continuous time volatility equation is determined by the flow of information. Such time deformed processes may be particularly suited to dealing with high frequency data. If $\tau = g(t)$ is the mapping between calendar time $t$ and operational time $\tau$, then
$$dS_t = \mu S_t\,dt + \sigma(g(t))\,S_t\,dW_{1t}$$
and
$$d\log \sigma(\tau) = a(b - \log \sigma(\tau))\,d\tau + c\,dW_{2\tau}$$
where $W_{1t}$ and $W_{2\tau}$ are standard, independent Wiener processes. The discrete time approximation generalizing (3.1.3), but including a term which in (3.1.2) is incorporated in the constant scale factor $\sigma$, is then
ht+l = [1  eaAo(t)]b + eag(t)ht + ~lt
where $\Delta g(t)$ is the change in operational time between two consecutive calendar time observations and $\eta_t$ is normally distributed with mean zero and variance $c^2(1 - e^{-2a\Delta g(t)})/2a$. Clearly if $\Delta g(t) = 1$, $\phi = e^{-a}$ in (3.1.3). Since the flow of information, and hence $\Delta g(t)$, is not directly observable, a mapping to calendar time must be specified to make the model operational. Ghysels and Jasiak (1994a) discuss several specifications revolving around a scaled exponential function relating $g(t)$ to observables such as past volume of trade and past price changes with asymmetric leverage effects. This approach was also used by Ghysels and Jasiak (1994b) to model return-volume co-movements and by Ghysels, Gouriéroux and Jasiak (1995b) for modeling intradaily high frequency data which exhibit strong seasonal patterns (cf. Section 3.5.1).
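The mapping from elapsed operational time to the AR(1) coefficients can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `time_deformed_ar1` and all numerical values are hypothetical, and the formulae are simply the intercept, slope and disturbance variance of the discrete time approximation quoted in this subsection. Setting the elapsed operational time to 5 reproduces the daily-to-weekly comparison ($\phi^5$), and the implied stationary variance of $h_t$ does not depend on the elapsed time.

```python
import math

def time_deformed_ar1(a, b, c, dg):
    """Coefficients of h_{t+1} = intercept + phi * h_t + eta_t for one calendar step.

    a, b, c: speed, level and diffusion scale of the log-volatility equation
    dg:      operational time elapsed over the step, Delta g(t)
    """
    phi = math.exp(-a * dg)
    intercept = (1.0 - phi) * b
    eta_var = c ** 2 * (1.0 - math.exp(-2.0 * a * dg)) / (2.0 * a)
    return intercept, phi, eta_var

# Hypothetical values: a is chosen so that one unit of operational time gives phi = 0.98.
a = -math.log(0.98)
_, phi_day, var_day = time_deformed_ar1(a, b=0.0, c=0.1, dg=1.0)
_, phi_week, var_week = time_deformed_ar1(a, b=0.0, c=0.1, dg=5.0)

# Stationary variance of h implied by each step length: eta_var / (1 - phi^2).
stat_day = var_day / (1.0 - phi_day ** 2)
stat_week = var_week / (1.0 - phi_week ** 2)
print(phi_day, phi_week, stat_day, stat_week)
```

The disturbance variance grows with the step length while the stationary variance stays at $c^2/2a$, which is the property used above to compare sampling intervals.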
3.5.5. Long memory

Baillie, Bollerslev and Mikkelsen (1993) propose a way of extending the GARCH class to account for long memory. They call their models Fractionally Integrated GARCH (FIGARCH), and the key feature is the inclusion of the fractional difference operator, $(1 - L)^d$, where $L$ is the lag operator, in the lag structure of past squared observations in the conditional variance equation. However, this model can only be stationary when $d = 0$, in which case it reduces to GARCH. In a later paper, Bollerslev and Mikkelsen (1995) consider a generalization of the EGARCH model of Nelson (1991) in which $\log\sigma_t^2$ is modelled as a distributed lag of past $\varepsilon_t$'s involving the fractional difference operator. This FIEGARCH model is stationary and invertible if $|d| < 0.5$. Breidt, Crato and de Lima (1993) and Harvey (1993) propose a SV model with $h_t$ generated by fractional noise

$$h_t = \eta_t/(1 - L)^d, \qquad 0 \le d \le 1. \qquad (3.5.1)$$
Like the AR(1) model in (3.1.3), this process reduces to white noise and a random walk at the boundary of the parameter space, that is $d = 0$ and 1 respectively. However, it is only stationary if $d < 0.5$. Thus the transition from stationarity to nonstationarity proceeds in a different way to the AR(1) model. As in the AR(1) case it is reasonable to constrain the autocorrelations in (3.5.1) to be positive. However, a negative value of $d$ is quite legitimate and indeed differencing $h_t$ when it is nonstationary gives a stationary "intermediate memory" process in which $-0.5 < d < 0$. The properties of the long memory SV model can be obtained from the formulae in Section 3.2. A comparison of the ACF for $h_t$ following a long
memory process with $d = 0.45$ and $\sigma_h^2 = 2$ with the corresponding ACF when $h_t$ is AR(1) with $\phi = 0.99$ can be found in Harvey (1993). Recall that a characteristic property of long memory is a hyperbolic rate of decay for the autocorrelations instead of an exponential rate, a feature observed in the data (see Section 2.2e). The slower decline in the long memory model is very clear and, in fact, for $\tau = 1000$, the long memory autocorrelation is still 0.14, whereas in the AR case it is only 0.000013. The long memory shape closely matches that in Ding, Granger and Engle (1993, p. 868). The model may be extended by letting $\eta_t$ be an ARMA process and/or by adding more components to the volatility equation. As regards smoothing and filtering, it has already been noted that the state space approach is approximate because of the truncation involved and is relatively cumbersome because of the length of the state vector. Exact smoothing and filtering, which is optimal within the class of estimators linear in the $\log y_t^2$'s, can be carried out by a direct approach if one is prepared to construct and invert the $T \times T$ covariance matrix of the $\log y_t^2$'s.
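The contrast between hyperbolic and exponential decay can be made concrete with a small sketch. It uses the standard closed-form autocorrelation of fractional noise, $\rho(k) = \Gamma(1-d)\Gamma(k+d)/[\Gamma(d)\Gamma(k+1-d)]$, applied to $h_t$ itself; the figures quoted above from Harvey (1993) refer to the ACF of a transformed observable, so the numbers printed here are not expected to match them. The function names are hypothetical.

```python
import math

def frac_noise_acf(d, lag):
    """Autocorrelation at a given lag of fractional noise (1 - L)^d h_t = eta_t, 0 < d < 0.5."""
    # rho(k) = Gamma(1-d) Gamma(k+d) / (Gamma(d) Gamma(k+1-d)), evaluated through log-gammas
    log_rho = (math.lgamma(1.0 - d) + math.lgamma(lag + d)
               - math.lgamma(d) - math.lgamma(lag + 1.0 - d))
    return math.exp(log_rho)

def ar1_acf(phi, lag):
    """Autocorrelation at a given lag of a stationary AR(1) process."""
    return phi ** lag

for lag in (1, 10, 100, 1000):
    print(lag, frac_noise_acf(0.45, lag), ar1_acf(0.99, lag))
```

Doubling the lag multiplies the long memory autocorrelation by roughly $2^{2d-1}$, whereas the AR(1) autocorrelation is squared, which is why the two curves separate so sharply at long lags.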
At the end of Section 2 we presented a framework for statistical modelling of SV in discrete time and devoted the entire Section 3 to specific discrete time SV models. To motivate the continuous time models we study first of all the exact relationship (i.e. without approximation error) between differential equations and SV models in discrete time. We examine this relationship in Section 4.1 via a class of statistical models which are closed under temporal aggregation and proceed (1) from high frequency discrete time to lower frequencies and (2) from continuous time to discrete time. Next, in Section 4.2, we study option pricing and hedging with continuous time models and elaborate on features such as the smile effect. The practical implementation of option pricing formulae with SV often requires discrete time SV and/or ARCH models as filters and forecasters of the continuous time volatility processes. Such filters, covered in Section 4.3, are in general discrete time approximations (and not exact discretizations as in Section 4.1) of continuous time SV models. Section 4.4 concludes with extensions of the basic model.
4.1. From discrete to continuous time
The purpose of this section is to provide a rigorous discussion of the relationship between discrete and continuous time SV models. The presentation will proceed first with a discussion of temporal aggregation in the context of the SARV class of models and focus on specific cases including GARCH models. This material is covered in Section 4.1.1. Next we turn our attention to the aggregation of continuous time SV models to yield discrete time representations. This is the subject matter of Section 4.1.2.
4.1.1. Temporal aggregation of discrete time models

Andersen's SARV class of models was presented in Section 2.4 as a general discrete time parametric SV statistical model. Let us consider the zero-mean case, namely:
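$$y_{t+1} = \sigma_t\,\varepsilon_{t+1} \qquad (4.1.1)$$

and $\sigma_t^q$ for $q = 1$ or 2 is a polynomial function $\varphi(K_t)$ of the Markov process $K_t$ with stationary autoregressive representation:

$$K_t = \omega + \beta K_{t-1} + \upsilon_t \qquad (4.1.2)$$

where $|\beta| < 1$ and

$$E[\varepsilon_{t+1} \mid \varepsilon_\tau, \upsilon_\tau, \tau \le t] = 0 \qquad (4.1.3a)$$

$$E[\varepsilon_{t+1}^2 \mid \varepsilon_\tau, \upsilon_\tau, \tau \le t] = 1 \qquad (4.1.3b)$$

$$E[\upsilon_{t+1} \mid \varepsilon_\tau, \upsilon_\tau, \tau \le t] = 0. \qquad (4.1.3c)$$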
The restrictions (4.1.3a-c) imply that $\upsilon$ is a martingale difference sequence with respect to the filtration $\mathcal{F}_t = \sigma[\varepsilon_\tau, \upsilon_\tau, \tau \le t]$.$^{22}$ Moreover, the conditional moment conditions in (4.1.3a-c) also imply that $\varepsilon$ in (4.1.1) is a white noise process in a semi-strong sense, i.e. $E[\varepsilon_{t+1} \mid \varepsilon_\tau, \tau \le t] = 0$ and $E[\varepsilon_{t+1}^2 \mid \varepsilon_\tau, \tau \le t] = 1$, and is not Granger-caused by $\upsilon$.$^{23}$ From the very beginning of Section 2 we choose the continuously compounded rate of return over a particular time horizon as the starting point for continuous time processes. Therefore, let $y_{t+1}$ in (4.1.1) be the continuously compounded rate of return for $[t, t+1]$ of the asset price process $S_t$, consequently:

$$y_{t+1} = \log S_{t+1}/S_t \qquad (4.1.4)$$
Since the unit of time of the sampling interval is to a large extent arbitrary, we would surely want the SV model defined by equations (4.1.1) through (4.1.3) (for given $q$ and function $\varphi$) to be closed under temporal aggregation. As rates of return are flow variables, closure under temporal aggregation means that for any integer $m$:

$$\sum_{k=0}^{m-1} y_{tm-k}$$

is again conformable to a model of the type (4.1.1) through (4.1.3) for the same choice of $q$ and $\varphi$ involving suitably adapted parameter values. The analysis in this section follows Meddahi and Renault (1995) who study temporal aggregation of SV models in detail, particularly the case $\sigma_t^2 = K_t$, i.e. $q = 2$ and $\varphi$ is the identity
22 Note that we do not use here the decomposition appearing in (2.4.9), namely $\upsilon_t = [\gamma + \alpha K_{t-1}]\tilde{u}_t$.
23 The Granger noncausality considered here for $\varepsilon_t$ is weaker than Assumption 2.3.2.A as it applies only to the first two conditional moments.
function. It is related to the so-called continuous time GARCH approach of Drost and Werker (1994). Hence, we have (4.1.1) with:

$$\sigma_t^2 = \omega + \beta\,\sigma_{t-1}^2 + \upsilon_t \qquad (4.1.5)$$
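With conditional moment restrictions (4.1.3a-c) this model is closed under aggregation. For instance, for $m = 2$:

$$\sigma_{t-1}^{(2)2} = \omega^{(2)} + \beta^{(2)}\,\sigma_{t-3}^{(2)2} + \upsilon_{t-1}^{(2)}$$

with

$$\omega^{(2)} = 2\omega(1+\beta), \qquad \beta^{(2)} = \beta^2, \qquad \upsilon_{t-1}^{(2)} = (\beta + 1)[\beta\,\upsilon_{t-2} + \upsilon_{t-1}].$$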
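The closure property for $m = 2$ can be checked numerically. The sketch below is an illustration, not part of the original argument: it takes the two-period conditional variance to be $\omega + (1+\beta)\sigma_t^2$ (the conditional variance of $y_{t+1} + y_{t+2}$ given information at time $t$, which follows from (4.1.1), (4.1.3) and (4.1.5)), and verifies that it obeys an AR(1) recursion over two periods with intercept $2\omega(1+\beta)$, slope $\beta^2$ and innovation $(\beta+1)[\beta\upsilon_{t-1} + \upsilon_t]$. Parameter values and the bounded uniform shocks are hypothetical.

```python
import random

omega, beta = 0.05, 0.9            # hypothetical parameter values
random.seed(0)

# Simulate sigma2[t] = omega + beta * sigma2[t-1] + v[t] with small bounded shocks.
n = 200
v = [random.uniform(-0.01, 0.01) for _ in range(n)]
sigma2 = [omega / (1.0 - beta)]
for t in range(1, n):
    sigma2.append(omega + beta * sigma2[t - 1] + v[t])

# Conditional variance of the two-period return y[t+1] + y[t+2] given time-t information.
agg = [omega + (1.0 + beta) * s for s in sigma2]

omega2 = 2.0 * omega * (1.0 + beta)   # aggregated intercept
beta2 = beta ** 2                     # aggregated autoregressive coefficient

def v2(t):
    """Aggregated innovation (beta + 1) * (beta * v[t-1] + v[t])."""
    return (beta + 1.0) * (beta * v[t - 1] + v[t])

# Largest deviation from the aggregated recursion; zero up to rounding error.
max_gap = max(abs(agg[t] - (omega2 + beta2 * agg[t - 2] + v2(t))) for t in range(2, n))
print(max_gap)
```

Because the aggregated innovation is a linear combination of the original innovations, it inherits the martingale difference property, which is what keeps the aggregated model inside the same class.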
Moreover, it is also worth noting that whenever a leverage effect is present at the aggregate level, i.e.:

$$Cov[\upsilon_{t-1}^{(2)}, \varepsilon_{t-1}^{(2)}] \ne 0$$

with $\varepsilon_{t-1}^{(2)} = (y_{t-1} + y_{t-2})/\sigma_{t-3}^{(2)}$, it necessarily appears at the disaggregate level, i.e.

$$Cov(\upsilon_t, \varepsilon_t) \ne 0.$$
For the general case Meddahi and Renault (1995) show that model (4.1.5) together with conditional moment restrictions (4.1.3a-c) is a class of processes closed under aggregation. Given this result, it is of interest to draw a comparison with the work of Drost and Nijman (1993) on temporal aggregation of GARCH. While establishing this link between Meddahi and Renault (1995) and Drost and Nijman (1993) we also uncover issues of leverage properties in GARCH models. Indeed, contrary to what is often believed, we find leverage effect restrictions in GARCH processes. Moreover, we also find from the results of Meddahi and Renault that the class of weak GARCH processes includes certain SV models. To find a class of GARCH processes which is closed under aggregation Drost and Nijman (1993) weakened the definition of GARCH, namely for a positive stationary process $\sigma_t$:
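$$\sigma_t^2 = \omega + a\,y_{t-1}^2 + b\,\sigma_{t-1}^2 \qquad (4.1.6)$$

semi-strong GARCH if $E[y_{t+1} \mid y_\tau, \tau \le t] = 0$ and $E[y_{t+1}^2 \mid y_\tau, \tau \le t] = \sigma_t^2$;
weak GARCH if $EL[y_{t+1} \mid y_\tau, y_\tau^2, \tau \le t] = 0$ and $EL[y_{t+1}^2 \mid y_\tau, y_\tau^2, \tau \le t] = \sigma_t^2$.$^{24}$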
Drost and Nijman show that weak GARCH processes temporally aggregate and provide explicit formulae for their coefficients. In Section 2.4 it was noted that the framework of SARV includes GARCH processes whenever there is no randomness specific to the volatility process. This property will allow us to show that the class of weak GARCH processes, as defined above, in fact includes more general SV processes which are strictly speaking not GARCH. The arguments, following Meddahi and Renault (1995), require a classification of the models defined by (4.1.3) and (4.1.5) according to the value of the correlation between $\upsilon_t$ and $y_t^2$, namely:

(a) Models with perfect correlation: This first class, henceforth denoted C1, is characterized by a linear correlation between $\upsilon_t$ and $y_t^2$ conditional on $(\varepsilon_\tau, \upsilon_\tau, \tau < t)$ which is either 1 or $-1$ for the model in (4.1.5).
(b) Models without perfect correlation: This second class, henceforth denoted C2, has the above conditional correlation less than one in absolute value.

The class C1 contains all semi-strong GARCH processes; indeed whenever $Var[y_t^2 \mid \varepsilon_\tau, \upsilon_\tau, \tau < t]$ is proportional to $Var[\upsilon_t \mid \varepsilon_\tau, \upsilon_\tau, \tau < t]$ in C1 we have a semi-strong GARCH. Consequently, a semi-strong GARCH process is a model (4.1.5) with (1) restrictions (4.1.3), (2) a perfect conditional correlation as in C1, and (3) restrictions on the conditional kurtosis dynamics.$^{25}$ Let us consider now the following assumption:

ASSUMPTION 4.1.1. The following two conditional expectations are zero:

$$E[\varepsilon_t\,\upsilon_t \mid \varepsilon_\tau, \upsilon_\tau, \tau < t] = 0 \qquad (4.1.7a)$$

$$E[\varepsilon_t^3 \mid \varepsilon_\tau, \upsilon_\tau, \tau < t] = 0 \qquad (4.1.7b)$$
This assumption amounts to an absence of leverage effects, where the latter is defined in a conditional covariance sense to capture the notion of instantaneous causality discussed in Section 2.4.1 and applied here in the context of weak white noise.$^{26}$ It should also be noted that (4.1.7a) and (4.1.7b) are in general not equivalent except for the processes of class C1. The class C2 allows for randomness proper to the volatility process due to the imperfect correlation. Yet, despite this volatility-specific randomness one can

24 For any Hilbert space $H$ of $L^2$, $EL[x_t \mid z, z \in H]$ is the best linear predictor of $x_t$ in terms of 1 and $z \in H$. It should be noted that a strong GARCH process is a fortiori semi-strong, which itself is also a weak GARCH process.
25 In fact, Nelson and Foster (1994) observed that the most commonly used ARCH models effectively assume that the variance of the variance rises linearly in $\sigma_t^4$, which is the main drawback of ARCH models in approximating SV models in continuous time (see also Section 4.3).
26 The conditional expectation (4.1.7b) can be viewed as a conditional covariance between $\varepsilon_t$ and $\varepsilon_t^2$. It is this conditional covariance which, if nonzero, produces leverage effects in GARCH.
show that under Assumption 4.1.1 processes of C2 satisfy the weak GARCH definition. A fortiori, any SV model conformable to (4.1.3a-c), (4.1.5), (4.1.7a-b) and Assumption 4.1.1 is a weak GARCH process. It is indeed the symmetry assumptions (4.1.7a-b), or restrictions on leverage in GARCH, that make $EL[y_{t+1}^2 \mid y_\tau, y_\tau^2, \tau \le t] = \sigma_t^2$ (together with the conditional moment restrictions (4.1.3a-c)) and yield the internal consistency for temporal aggregation found by Drost and Nijman (1993, example 2, p. 915) for the class of so-called symmetric weak GARCH(1,1). Hence, this class of weak GARCH(1,1) processes can be viewed as a subclass of processes satisfying (4.1.3) and (4.1.5).$^{27}$
$$d\sigma_t = \gamma_t\,dt + \delta_t\,dW_t^{\sigma}$$
where the stochastic processes $\mu_t$, $\gamma_t$, $\delta_t$ and $\rho_t$ are $I_t = \sigma[\sigma_\tau; \tau \le t]$ adapted. To ensure that $\sigma_t$ is a nonnegative process one typically follows either one of two strategies: (1) considering a diffusion for $\log\sigma_t^2$ or (2) describing $\sigma_t^2$ as a CEV process (or Constant Elasticity of Variance process following Cox (1975) and Cox and Ross (1976)).$^{28}$ The former is frequently encountered in the option pricing literature (see e.g. Wiggins (1987)) and is also clearly related to Nelson (1991), who introduced EGARCH, and to the log-Normal SV model of Taylor (1986). The second class of CEV processes can be written as

$$d\sigma_t^2 = k(\theta - \sigma_t^2)\,dt + \gamma\,(\sigma_t^2)^{\delta}\,dW_t^{\sigma} \qquad (4.1.9)$$

where $\delta \ge 1/2$ ensures that $\sigma_t^2$ is a stationary process with nonnegative values. Equation (4.1.9) can be viewed as the continuous time analogue of the discrete time SARV class of models presented in Section 2.4. This observation establishes links with the discussion of the previous Section 4.1.1 and yields exact discretization results of continuous time SV models. Here, as in the previous section, it will be tempting to draw comparisons with the GARCH class of models, in particular the diffusions proposed by Drost and Werker (1994) in line with the temporal aggregation of weak GARCH processes.
27 As noted before, the class of processes satisfying (4.1.3) and (4.1.5) is closed under temporal aggregation, including processes with leverage effects not satisfying Assumption 4.1.1.
28 Occasionally one encounters specifications which do not ensure nonnegativity of the $\sigma_t$ process. For the sake of computational simplicity some authors for instance have considered Ornstein-Uhlenbeck processes for $\sigma_t$ or $\sigma_t^2$ (see e.g. Stein and Stein (1991)).
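Firstly, one should note that the CEV process in (4.1.9) implies an autoregressive model in discrete time for $\sigma_t^2$, namely:

$$\sigma_{t+\Delta t}^2 = \theta(1 - e^{-k\Delta t}) + e^{-k\Delta t}\,\sigma_t^2 + e^{-k\Delta t}\int_t^{t+\Delta t} e^{k(u-t)}\,\gamma\,(\sigma_u^2)^{\delta}\,dW_u^{\sigma} \qquad (4.1.10)$$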
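The autoregressive structure implied by the mean-reverting drift can be illustrated as follows. This is a minimal sketch: the function name `cev_ar1_coefficients` and the numerical values of $k$, $\theta$ and the step length are hypothetical, and only the conditional-mean part of the discretization (intercept $\theta(1 - e^{-k\Delta t})$ and slope $e^{-k\Delta t}$) is computed. It checks that composing two steps of length $\Delta t$ gives the coefficients of a single step of length $2\Delta t$, which is the temporal aggregation property at the level of the conditional mean.

```python
import math

def cev_ar1_coefficients(k, theta, dt):
    """Intercept and slope of E[sigma2_{t+dt} | sigma2_t] implied by the drift k*(theta - sigma2)."""
    beta = math.exp(-k * dt)
    omega = theta * (1.0 - beta)
    return omega, beta

k, theta = 2.0, 0.04               # hypothetical annualised values
omega_1, beta_1 = cev_ar1_coefficients(k, theta, dt=1.0 / 252)
omega_2, beta_2 = cev_ar1_coefficients(k, theta, dt=2.0 / 252)

# Iterating the one-step recursion twice: slope beta_1^2, intercept omega_1 * (1 + beta_1).
gap_beta = abs(beta_2 - beta_1 ** 2)
gap_omega = abs(omega_2 - omega_1 * (1.0 + beta_1))
print(gap_beta, gap_omega)
```

In both parametrizations the unconditional mean $\omega/(1-\beta)$ equals $\theta$, so changing the sampling interval alters the persistence parameter but not the long-run variance level.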
Meddahi and Renault (1995) show that whenever (4.1.9) and its discretization (4.1.10) govern volatility, the discrete time process $\log S_{t+(k+1)\Delta t}/S_{t+k\Delta t}$, $k \in \mathbb{Z}$, is a SV process satisfying the model restrictions (4.1.3a-c) and (4.1.5). Hence, from the diffusion (4.1.9) we obtain the class of discrete time SV models which is closed under temporal aggregation, as discussed in the previous section. To be more specific, consider for instance $\Delta t = 1$, then from (4.1.10) it follows that:

$$y_{t+1} = \log S_{t+1}/S_t = \sigma_t^{(1)}\,\varepsilon_{t+1}$$

$$(\sigma_t^{(1)})^2 = \omega + \beta\,(\sigma_{t-1}^{(1)})^2 + \upsilon_t \qquad (4.1.11)$$

where from (4.1.10):

$$\beta = e^{-k}, \qquad \omega = \theta(1 - e^{-k}), \qquad \upsilon_{t+1} = \frac{1 - e^{-k}}{k}\,e^{-k}\int_t^{t+1} e^{k(u-t)}\,\gamma\,(\sigma_u^2)^{\delta}\,dW_u^{\sigma} \qquad (4.1.12)$$
It is important to note from (4.1.12) that absence of leverage effect in continuous time, i.e. $\rho_t = 0$ in (4.1.8c), means no such effect at low frequencies and the two symmetry conditions of Assumption 4.1.1 are fulfilled. This line of reasoning also explains the temporal aggregation result of Drost and Werker (1994), but one more generally can interpret discrete time SV models with leverage effects as exact discretizations of continuous time SV models with leverage.
4.2.1. The basic option pricing formula

Consider again formula (2.1.10) for a European option contract maturing at time $t + h = T$. As noted in Section 2.1.2, we assume continuous and frictionless trading. Moreover no arbitrage profits can be made from trading in the underlying asset and riskless bonds; interest rates are nonstochastic so that $B(t, T)$ defined by (2.1.12) denotes the time $t$ price of a unit discount bond maturing at time $T$. Consider now the probability space $(\Omega, \mathcal{F}, P)$, which is the fundamental space of the underlying asset price process $S$:
$$dS_t/S_t = \mu(t, S_t, U_t)\,dt + \sigma_t\,dW_t^S$$

$$\sigma_t^2 = f(U_t)$$

$$dU_t = a(t, U_t)\,dt + b(t, U_t)\,dW_t^{\sigma} \qquad (4.2.1)$$
where $W_t = (W_t^S, W_t^{\sigma})$ is a standard two dimensional Brownian Motion ($W^S$ and $W^{\sigma}$ are independent, zero-mean and unit variance) defined on $(\Omega, \mathcal{F}, P)$. The function $f$, called the volatility function, is assumed to be one-to-one. In this framework (under suitable regularity conditions) the no free lunch assumption is equivalent to the existence of a probability distribution $Q$ on $(\Omega, \mathcal{F})$, equivalent to $P$, under which discounted price processes are martingales (see Harrison and Kreps (1979)). Such a probability is called an equivalent martingale measure and is unique if and only if the markets are complete (see Harrison and Pliska (1981)).$^{29}$ From the integral form of martingale representations (see Karatzas and Shreve (1988), p. 184), the (positive) density process of any probability measure $Q$ equivalent to $P$ can be written as:

$$M_t = \exp\left[-\int_0^t \lambda_u^S\,dW_u^S - \frac{1}{2}\int_0^t (\lambda_u^S)^2\,du - \int_0^t \lambda_u^{\sigma}\,dW_u^{\sigma} - \frac{1}{2}\int_0^t (\lambda_u^{\sigma})^2\,du\right] \qquad (4.2.2)$$
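where the processes $\lambda^S$ and $\lambda^{\sigma}$ are adapted to the natural filtration $\mathcal{F}_t = \sigma[W_\tau, \tau \le t]$, $t \ge 0$, and satisfy the integrability conditions (almost surely): $\int_0^T (\lambda_u^S)^2\,du < +\infty$ and $\int_0^T (\lambda_u^{\sigma})^2\,du < +\infty$. By Girsanov's theorem, the process $\tilde{W} = (\tilde{W}^S, \tilde{W}^{\sigma})$ defined by:

$$\tilde{W}_t^S = W_t^S + \int_0^t \lambda_u^S\,du \quad \text{and} \quad \tilde{W}_t^{\sigma} = W_t^{\sigma} + \int_0^t \lambda_u^{\sigma}\,du \qquad (4.2.3)$$

is a two dimensional Brownian Motion under $Q$. The dynamics of the underlying asset price under $Q$ are obtained directly from (4.2.1) and (4.2.3). Moreover, the discounted asset price process $S_t B(0, t)$, $0 \le t \le T$, is a $Q$-martingale if and only if for $r_t$ defined in (2.1.11):

$$\lambda_t^S = \frac{\mu(t, S_t, U_t) - r_t}{\sigma_t} \qquad (4.2.4)$$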
Since $S$ is the only traded asset, the process $\lambda^{\sigma}$ is not fixed. The process $\lambda^S$ defined by (4.2.4) is called the asset risk premium. By analogy, any process $\lambda^{\sigma}$ satisfying the required integrability condition can be viewed as a volatility risk

29 Here, the market is seen as incomplete (before taking into account the market pricing of the option) so that we have to characterize a set of equivalent martingale measures.
premium and for any choice of $\lambda^{\sigma}$, the probability $Q(\lambda^{\sigma})$ defined by the density process $M$ in (4.2.2) is an equivalent martingale measure. Therefore, given the volatility risk premium process $\lambda^{\sigma}$:

$$C_t = B(t, T)\,E_t^{Q(\lambda^{\sigma})}[\mathrm{Max}[0, S_T - K]], \qquad 0 \le t \le T \qquad (4.2.5)$$
is an admissible price process of the European call option.$^{30}$ The Hull and White option pricing model relies on the following assumption, which restricts the set of equivalent martingale measures:

ASSUMPTION 4.2.1. The volatility risk premium $\lambda_t^{\sigma}$ only depends on the current value of the volatility process: $\lambda_t^{\sigma} = \lambda^{\sigma}(t, U_t)$, $\forall t \in [0, T]$.

This assumption is consistent with an intertemporal equilibrium model where the agent preferences are described by time separable isoelastic utility functions (see He (1993) and Pham and Touzi (1993)). It ensures that $\tilde{W}^S$ and $\tilde{W}^{\sigma}$ are independent, so that the $Q(\lambda^{\sigma})$ distribution of $\log S_T/S_t$, conditionally on $\mathcal{F}_t$ and the volatility path $(\sigma_t, 0 \le t \le T)$, is normal with mean $\int_t^T r_u\,du - \frac{1}{2}\gamma^2(t, T)$ and variance $\gamma^2(t, T) = \int_t^T \sigma_u^2\,du$. Under Assumption 4.2.1 one can compute the expectation in (4.2.5) conditionally on the volatility path, and obtain finally:

$$C_t = S_t\,E_t^{Q(\lambda^{\sigma})}[\phi(d_{1t}) - e^{-x_t}\phi(d_{2t})] \qquad (4.2.6)$$
with the same notation as in (2.1.20). To conclude it is worth noting that many option pricing formulae available in the literature have a feature common with (4.2.6) as they can be expressed as an expectation of the Black-Scholes price over a heterogeneous distribution of the volatility parameter (see Renault (1995) for an elaborate discussion on this subject).

4.2.2. Pricing and hedging with the Hull and White model

The Markov feature of the process $(S, \sigma)$ implies that the option price (4.2.6) only depends on the contemporaneous values of the underlying asset price and its volatility. Moreover, under mild regularity conditions, this function is differentiable. Therefore, a natural way to solve the hedging problem in this stochastic volatility context is to hedge a given option of price $C_t^1$ by $\Delta_t$ units of the underlying asset and $\Sigma_t$ units of any other option of price $C_t^2$ where the hedging ratios solve:

$$\frac{\partial C_t^1}{\partial S_t} - \Delta_t - \Sigma_t\,\frac{\partial C_t^2}{\partial S_t} = 0, \qquad \frac{\partial C_t^1}{\partial \sigma_t} - \Sigma_t\,\frac{\partial C_t^2}{\partial \sigma_t} = 0 \qquad (4.2.7)$$
Such a procedure, known as the delta-sigma hedging strategy, has been studied by Scott (1991). By showing that any European option completes the market, i.e. $\partial C_t^2/\partial \sigma_t \ne 0$, $0 \le t \le T$, Bajeux and Rochet (1992) justify the existence of a

30 Here and elsewhere $E_t^Q(\cdot) = E^Q(\cdot \mid \mathcal{F}_t)$ stands for the conditional expectation operator given $\mathcal{F}_t$ when the price dynamics are governed by $Q$.
unique solution to the delta-sigma hedging problem (4.2.7) and the implicit assumption in the previous sections that the available information $I_t$ contains the past values $(S_\tau, \sigma_\tau)$, $\tau \le t$. In practice, option traders often focus on the risk due to the underlying asset price variations and consider the imperfect hedging strategy $\Sigma_t = 0$ and $\Delta_t = \partial C_t^1/\partial S_t$. Then, the Hull and White option pricing formula (4.2.6) provides directly the theoretical value of $\Delta_t$:

$$\Delta_t = \partial C_t^1/\partial S_t = E_t^{Q(\lambda^{\sigma})}\phi(d_{1t}) \qquad (4.2.8)$$
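Formulae (4.2.6) and (4.2.8) express the price and the delta as expectations of their Black-Scholes counterparts over the distribution of the integrated variance, which suggests a simple Monte Carlo sketch. Everything specific below is an assumption made for illustration: a constant interest rate, the helper names `bs_call_and_delta` and `hull_white_mc`, and in particular the lognormal law used to draw the integrated variance, which merely stands in for whatever risk-neutral volatility dynamics one would actually specify.

```python
import math
import random

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_and_delta(s, strike, r, tau, total_var):
    """Black-Scholes call price and delta given the integrated variance over [t, T]."""
    g = math.sqrt(total_var)
    d1 = (math.log(s / strike) + r * tau) / g + 0.5 * g
    d2 = d1 - g
    price = s * norm_cdf(d1) - strike * math.exp(-r * tau) * norm_cdf(d2)
    return price, norm_cdf(d1)

def hull_white_mc(s, strike, r, tau, draw_total_var, n_paths=20000, seed=0):
    """Average the Black-Scholes price and delta over simulated integrated variances."""
    rng = random.Random(seed)
    price_sum = delta_sum = 0.0
    for _ in range(n_paths):
        p, d = bs_call_and_delta(s, strike, r, tau, draw_total_var(rng))
        price_sum += p
        delta_sum += d
    return price_sum / n_paths, delta_sum / n_paths

def lognormal_var(rng, tau=0.5):
    """Hypothetical law for the integrated variance: lognormal with mean 0.04 * tau."""
    return 0.04 * tau * math.exp(0.5 * rng.gauss(0.0, 1.0) - 0.125)

print(hull_white_mc(100.0, 100.0, 0.03, 0.5, lognormal_var))
```

When the integrated variance is degenerate the averages collapse to the ordinary Black-Scholes price and delta, which gives a convenient sanity check on any implementation of the mixing formula.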
This theoretical value is hard to use in practice since: (1) even if we knew the $Q(\lambda^{\sigma})$ conditional probability distribution of $d_{1t}$ given $I_t$ (summarized by $\sigma_t$), the derivation of the expectation (4.2.8) might be computationally demanding and (2) the conditional probability is directly related to the conditional probability distribution of $\gamma^2(t, T) = \int_t^T \sigma_u^2\,du$ given $\sigma_t$, which in turn may involve nontrivially the parameters of the latent process $\sigma_t$. Moreover, these parameters are those of the conditional probability distribution of $\gamma^2(t, T)$ given $\sigma_t$ under the risk-neutral probability $Q(\lambda^{\sigma})$ which is generally different from the Data Generating Process $P$. The statistical inference issues are therefore quite complicated. We will argue in Section 5 that only tools like simulation-based inference methods involving both asset and option prices (via an option pricing model) may provide some satisfactory solutions.

Nevertheless, a practical way to avoid these complications is to use the Black-Scholes option pricing model, even though it is known to be misspecified. Indeed, option traders know that they cannot generally obtain sufficiently accurate option prices and hedge ratios by using the BS formula with historical estimates of the volatility parameters based on time series of the underlying asset price. However, the concept of Black-Scholes implied volatility (2.1.23) is known to improve the pricing and hedging properties of the BS model. This raises two issues: (1) what is the internal consistency of the simultaneous use of the BS model (which assumes constant volatility) and of BS implied volatility which is clearly time-varying and stochastic and (2) how to exploit the panel structure of option pricing errors?$^{31}$ Concerning the first issue, we noted in Section 2 that the Hull and White option pricing model can indeed be seen as a theoretical foundation for this practice of pricing.
Hedging issues and the panel structure of option pricing errors are studied in detail in Renault and Touzi (1992) and Renault (1995).
4.2.3. Smile or smirk?

As noted in Section 2.2, the smile effect is now a well documented empirical stylized fact. Moreover the smile sometimes becomes a smirk since it appears more or less lopsided (the so-called skewness effect). We cautioned in Section 2 that some explanations of the smile/smirk effect are often founded on tempting analogies rather than rigorous proofs.
31 The value of $\sigma$ which equates the BS formula to the observed market price of the option heavily depends on the actual date $t$, the strike price $K$, the time to maturity $(T - t)$ and therefore creates a panel data structure.
To the best of our knowledge, the state of the art is the following:

(i) the first formal proof that a Hull and White option pricing formula implies a symmetric smile was provided by Renault and Touzi (1992),
(ii) the first complete proof that the smile/smirk effects can alternatively be explained by liquidity problems (the upper parts of the smile curve, i.e. the most expensive options are the least liquid) was provided by Platen and Schweizer (1994) using a microstructure model,
(iii) there is no formal proof that asymmetries of the probability distribution of the underlying asset price process (leverage effect, nonnormality, ...) are able to capture the observed skewness of the smile.

A different attempt to explain the observed skewness is provided by Renault (1995). He showed that a slight discrepancy between the underlying asset price $\tilde{S}_t$ used to infer BS implied volatilities and the stock price $S_t$ considered by option traders may generate an empirically plausible skewness in the smile. Such nonsynchronous $\tilde{S}_t$ and $S_t$ may be related to various issues: bid-ask spreads, nonsynchronous trading between the two markets, forecasting strategies based on the leverage effect, etc. Finally, to conclude it is also worth noting that a new approach initiated by Gouriéroux, Monfort, Tenreiro (1994) and followed also by Aït-Sahalia, Bickel, Stoker (1994) is to explain the BS implied volatility using a nonparametric function of some observed state variables. Gouriéroux, Monfort, Tenreiro (1995) obtain for example a good nonparametric fit of the following form:

$$\sigma_t(S_t, K) = a(K) + b(K)\,(\log S_t/S_{t-1})^2.$$

A classical smile effect is directly observed on the intercept $a(K)$ but an inverse smile effect appears for the path-dependent effect parameter $b(K)$. For American options a different nonparametric approach is pursued by Broadie, Detemple, Ghysels and Torrès (1995) where, besides volatility, exercise boundaries for the option contracts are also obtained.$^{32}$

4.3. Filtering and discrete time approximations

In Section 3.4.3 it was noted that the ARCH class of models could be viewed as filters to extract the (continuous time) conditional variance process from discrete time data. Several papers were devoted to the subject, namely Nelson (1990, 1992, 1995a,b) and Nelson and Foster (1994, 1995). It was one of Nelson's seminal contributions to bring together ARCH and continuous time SV. Nelson's first contribution in his 1990 paper was to show that ARCH models, which model volatility as functions of past (squared) returns, converge weakly to a diffusion process, either a diffusion for $\log\sigma_t^2$ or a CEV process as described in Section 4.1.2. In particular, it was shown that a GARCH(1,1) model observed at finer and finer time intervals $\Delta t = h$ with conditional variance parameters $\omega_h = h\omega$, $\alpha_h = \alpha(h/2)^{1/2}$ and $\beta_h = 1 - \alpha(h/2)^{1/2} - \theta h$ and conditional mean
32 See also Bossaerts and Hillion (1995) for the use of a nonparametric hedging procedure and the smile effect.
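$\mu_h = hc\sigma_t^2$ converges to a diffusion limit quite similar to equations (4.1.8a) combined with (4.1.9) with $\delta = 1$, namely

$$d\log S_t = c\,\sigma_t^2\,dt + \sigma_t\,dW_t$$

$$d\sigma_t^2 = (\omega - \theta\sigma_t^2)\,dt + \alpha\,\sigma_t^2\,dW_t^{\sigma}$$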
Similarly, it was also shown that a sequence of AR(1)-EGARCH(1,1) models converges weakly to an Ornstein-Uhlenbeck diffusion for $\ln\sigma_t^2$:

$$d\ln\sigma_t^2 = \alpha(\beta - \ln\sigma_t^2)\,dt + dW_t^{\sigma}$$

Hence, these basic insights showed that the continuous time stochastic differential equations emerging as diffusion limits of ARCH models were no longer ARCH but instead SV models. Moreover, following Nelson (1992), even when misspecified, ARCH models still kept desirable properties regarding extracting the continuous time volatility. The argument was that for a wide variety of misspecified ARCH models the difference between the ARCH filter volatility estimates and the true underlying diffusion volatilities converges to zero in probability as the length of the sampling time interval goes to zero at an appropriate rate. For instance the GARCH(1,1) model with $\omega_h$, $\alpha_h$ and $\beta_h$ described before estimates $\sigma_t^2$ as follows:

$$\hat{\sigma}_t^2 = \omega_h(1 - \beta_h)^{-1} + \sum_{i=0}^{\infty} \alpha_h\,\beta_h^i\,y_{t-h(i+1)}^2$$
where $y_t = \log S_t/S_{t-h}$. This filter can be viewed as a particular case of equation (3.4.1). The GARCH(1,1) and many other models effectively achieve consistent estimation of $\sigma_t$ via a lag polynomial function of past squared returns close to time $t$. The fact that a wide variety of misspecified ARCH models consistently extract $\sigma_t$ from high frequency data raises questions regarding efficiency of filters. The answers to such questions are provided in Nelson (1995a,b) and Nelson and Foster (1994, 1995). In Section 3.4 it was noted that the linear state space Kalman filter can also be viewed as a (suboptimal) extraction filter for $\sigma_t$. Nelson and Foster (1994) show that the asymptotically optimal linear Kalman filter has asymptotic variance for the normalized estimation error $h^{-1/4}[\ln(\hat{\sigma}_t^2) - \ln\sigma_t^2]$ equal to $\lambda\,\Upsilon(1/2)^{1/2}$ where $\Upsilon(x) = d[\ln\Gamma(x)]/dx$ and $\lambda$ is a scaling factor. A model closely related to EGARCH, of the following form:

$$\ln(\hat{\sigma}_{t+h}^2) = \ln(\hat{\sigma}_t^2) + \rho\lambda\,(S_{t+h} - S_t)\,\hat{\sigma}_t^{-1} + \lambda(1 - \rho^2)^{1/2}\left[\Gamma(1/2)^{1/2}\,\Gamma(3/2)^{-1/2}\,|S_{t+h} - S_t|\,\hat{\sigma}_t^{-1} - 2^{-1/2}\right]$$

yields the asymptotically optimal ARCH filter with asymptotic variance for the normalized estimation error equal to $\lambda[2(1 - \rho^2)]^{1/2}$, where the parameter $\rho$ measures the leverage effect. These results also show that the differences between
the most efficient suboptimal Kalman filter and the optimal ARCH filter can be quite substantial. Besides filtering one must also deal with smoothing and forecasting. Both of these issues were discussed in Section 3.4 for discrete time SV models. The prediction properties of (misspecified) ARCH models were studied extensively by Nelson and Foster (1995). Nelson (1995) takes ARCH models a step further by studying smoothing filters, i.e. ARCH models involving not only lagged squared returns but also future realizations, i.e. $\tau = t - T$ in equation (3.4.1).
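The GARCH(1,1) filter discussed in this section can be sketched in a few lines. The code below is illustrative only: the helper names, the numerical values of $\omega$, $\alpha$, $\theta$, $h$ and the short return series are hypothetical. It applies the scaling $\omega_h = h\omega$, $\alpha_h = \alpha(h/2)^{1/2}$, $\beta_h = 1 - \alpha_h - \theta h$ quoted above and shows that the recursive form of the filter, started at $\omega_h/(1-\beta_h)$, coincides with the geometric lag polynomial in past squared returns truncated at the start of the sample.

```python
import math

def nelson_garch_params(omega, alpha, theta, h):
    """GARCH(1,1) coefficients for sampling interval h under the scaling quoted in the text."""
    omega_h = h * omega
    alpha_h = alpha * math.sqrt(h / 2.0)
    beta_h = 1.0 - alpha_h - theta * h
    return omega_h, alpha_h, beta_h

def garch_filter(returns, omega_h, alpha_h, beta_h):
    """Recursive variance estimate after the last return, started at omega_h / (1 - beta_h)."""
    s2 = omega_h / (1.0 - beta_h)
    for y in returns:                       # oldest return first
        s2 = omega_h + alpha_h * y * y + beta_h * s2
    return s2

def garch_filter_sum(returns, omega_h, alpha_h, beta_h):
    """Same estimate written as a geometric lag polynomial in past squared returns."""
    lags = reversed(returns)                # most recent return first
    return omega_h / (1.0 - beta_h) + sum(
        alpha_h * beta_h ** i * y * y for i, y in enumerate(lags))

params = nelson_garch_params(omega=0.02, alpha=0.3, theta=1.0, h=1.0 / 252)  # hypothetical
ys = [0.01, -0.02, 0.015, -0.005, 0.03]                                      # hypothetical returns
print(garch_filter(ys, *params), garch_filter_sum(ys, *params))
```

As $h$ shrinks, $\alpha_h + \beta_h = 1 - \theta h$ approaches one, so the filter places slowly decaying weights on an increasing number of recent squared returns.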
4.4. Long memory
We conclude this section with a brief discussion of long memory in continuous time SV models. The purpose is to build continuous time long memory stochastic volatility models which are relevant for high frequency financial data and for (long term) option pricing. The reasons motivating the use of long memory models were discussed in Sections 2.2 and 3.5.5. The advantage of continuous time long memory models is their relative ability to provide a more structural interpretation of the parameters governing short term and long term dynamics. The first subsection defines fractional Brownian Motion. Next we will turn our attention to the fractional SV model followed by a section on filtering and discrete time approximations.
4.4.1. Stochastic integration with respect to fractional Brownian Motion
We recall in this subsection a few definitions and properties of fractional and long memory processes in continuous time, extensively studied for instance in Comte and Renault (1993). Consider the scalar process:
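$$x_t = \int_0^t a(t - s)\,dW_s. \qquad (4.4.1)$$

Such a process is asymptotically equivalent to the stationary process

$$\int_{-\infty}^t a(t - s)\,dW_s \qquad (4.4.2)$$

whenever $\int_0^{+\infty} a^2(x)\,dx < +\infty$. Such processes are called fractional processes if $a(x) = x^{\alpha}\tilde{a}(x)/\Gamma(1 + \alpha)$ for $|\alpha| < 1/2$, with $\tilde{a}$ continuously differentiable on $[0, T]$ and where $\Gamma(1 + \alpha)$ is a scaling factor useful for normalizing fractional derivative operators on $[0, T]$. Such processes admit several representations, and in particular they can also be written: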
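$$x_t = \int_0^t c(t - s)\,dW_{\alpha s}, \qquad W_{\alpha t} = \int_0^t \frac{(t - s)^{\alpha}}{\Gamma(1 + \alpha)}\,dW_s \qquad (4.4.3)$$

where $W_{\alpha}$ is the so-called fractional Brownian Motion of order $\alpha$ (see Mandelbrot and Van Ness (1968)).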
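The integral defining $W_{\alpha t}$ in (4.4.3) can be approximated on a grid by a Riemann sum, which gives a quick way to simulate paths. The sketch below is a crude illustration under stated assumptions: the function name `fractional_bm_path`, the left end point rule for the kernel, the grid size and the value $\alpha = 0.3$ are all choices made here, not taken from the text. With $\alpha = 0$ the kernel is identically one and the construction reduces to an ordinary discretized Brownian Motion.

```python
import math
import random

def fractional_bm_path(alpha, n_steps, horizon=1.0, seed=0):
    """Riemann-sum approximation of W_alpha(t) = int_0^t (t - s)^alpha / Gamma(1 + alpha) dW_s."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    dw = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]
    scale = 1.0 / math.gamma(1.0 + alpha)
    path = [0.0]
    for i in range(1, n_steps + 1):
        t = i * dt
        # Kernel evaluated at the left end point of each increment, s = (j - 1) * dt.
        value = sum((t - (j - 1) * dt) ** alpha * dw[j - 1] for j in range(1, i + 1))
        path.append(scale * value)
    return path, dw

path, dw = fractional_bm_path(alpha=0.3, n_steps=200)   # alpha = 0.3 is an arbitrary choice
print(path[-1])
```

Each point of the path reweights the whole history of Brownian increments, so the cost is quadratic in the number of grid points; this is acceptable for illustration but not for long high frequency samples.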
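The relation between the functions $a$ and $c$ is one-to-one. One can show that $W_{\alpha}$ is not a semimartingale (see e.g. Rogers (1995)) but stochastic integration with respect to $W_{\alpha}$ can be defined properly. The processes $x_t$ are long memory if:

$$\lim_{x \to +\infty} x\,\tilde{a}(x) = a_{\infty}, \qquad 0 < \alpha < 1/2 \quad \text{and} \quad 0 < a_{\infty} < +\infty, \qquad (4.4.4)$$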
for instance,
dxt =  k r t d t + crdW~t xt = O,k > O ,
0<~<
1/2
(4.4.5)
x}~) =
I'
e k(ts) a d Ws
(4.4.6b)
Note that, x}~) the derivative of order ~ of xt, is a solution of the usual SDE:
dzt :  k z t d t + a d W t .
To facilitate comparison with both the FIEGARCH model and the fractional extensions of the log-Normal SV model discussed in Section 3.5.5, let us consider the following fractional SV model (henceforth FSV):
$$dS_t/S_t = \sigma_t\,dW_t \qquad (4.4.7a)$$

$$d\log\sigma_t = -k\log\sigma_t\,dt + \gamma\,dW_{\alpha t} \qquad (4.4.7b)$$
where $k > 0$ and $0 \le \alpha < 1/2$. If nonzero, the fractional exponent $\alpha$ will provide some degree of freedom in the order of regularity of the volatility process, namely the greater $\alpha$ the smoother the path of the volatility process. If we denote the autocovariance function of $\sigma$ by $r_\sigma(\cdot)$ then:

$$\alpha > 0 \;\Rightarrow\; (r_\sigma(h) - r_\sigma(0))/h \to 0 \quad\text{as}\quad h \to 0 .$$

This would be incorrectly interpreted as near-integrated behavior, widely found in high frequency data for instance, when:

$$(r_\sigma(h) - r_\sigma(0))/h = (\rho^h - 1)/h \to \log\rho \quad\text{as}\quad h \to 0$$

and $\sigma_t$ is a continuous time AR(1) with correlation $\rho$ near 1. The long memory continuous time approach allows us to model persistence with the following features: (1) the volatility process itself (and not just its logarithm) has hyperbolic decay of the correlogram; (2) the persistence of volatility shocks yields leptokurtic features for returns which vanish with temporal aggregation at a slow hyperbolic rate.33 Indeed, for the rate of return on $[0,h]$:

$$\frac{E[\log S_{t+h}/S_t - E(\log S_{t+h}/S_t)]^4}{\left(E[\log S_{t+h}/S_t - E(\log S_{t+h}/S_t)]^2\right)^2} \to 3 \quad\text{as}\quad h \to \infty$$

at a rate $h^{2\alpha-1}$ if $\alpha \in\, ]0, 1/2[$ and a rate $\exp(-kh/2)$ if $\alpha = 0$.
where $\log\sigma^{(\alpha)}$ follows the OU process:

$$d\log\sigma_t^{(\alpha)} = -k\log\sigma_t^{(\alpha)}\,dt + \gamma\,dW_t \qquad (4.4.7)$$
To compute a discrete time approximation one must evaluate numerically the integral (4.4.6) using only values of the process $\log\sigma^{(\alpha)}$ on a discrete partition of $[0,t]$ at points $j/n$, $j = 0, 1, \ldots, [nt]$.34 A natural way to proceed is to use step functions, generating the following proxy process:

$$\log\hat\sigma_{n,t} = \sum_{j=1}^{[nt]} \frac{(t-(j-1)/n)^{\alpha}}{\Gamma(1+\alpha)}\,\Delta\log\sigma^{(\alpha)}_{j/n} \qquad (4.4.8)$$

where $\Delta\log\sigma^{(\alpha)}_{j/n} = \log\sigma^{(\alpha)}_{j/n} - \log\sigma^{(\alpha)}_{(j-1)/n}$. Comte and Renault (1995) show that $\log\hat\sigma_{n,t}$ converges to the $\log\sigma_t$ process for $n \to \infty$ uniformly on compact sets. Moreover, by rearranging (4.4.8) one obtains:

$$\log\hat\sigma_{n,j/n} = \left[\sum_{i=0}^{j-1} \frac{(i+1)^{\alpha} - i^{\alpha}}{n^{\alpha}\,\Gamma(1+\alpha)}\,L_n^i\right] \log\sigma^{(\alpha)}_{j/n} \qquad (4.4.9)$$

where $L_n$ is the lag operator corresponding to the sampling scheme $j/n$, i.e. $L_n Z_{j/n} = Z_{(j-1)/n}$. With this sampling scheme $\log\sigma^{(\alpha)}$ is a discrete time AR(1) deduced from the continuous time process with the following representation:

$$(1 - \rho_n L_n)\log\sigma^{(\alpha)}_{j/n} = u_{j/n} \qquad (4.4.10)$$

where $\rho_n = \exp(-k/n)$ and $u_{j/n}$ is the associated innovations process. Since the process is stationary we are allowed to write (assuming $\log\sigma^{(\alpha)}_{j/n} = u_{j/n} = 0$ for $j < 0$):
33 With usual GARCH or SV models, it vanishes at an exponential rate (see Drost and Nijman (1993) and Drost and Werker (1994) for these issues in the short memory case).
34 $[z]$ is the integer $k$ such that $k \le z < k + 1$.
$$\log\hat\sigma_{n,j/n} = \left[\sum_{i=0}^{j-1} \frac{(i+1)^{\alpha} - i^{\alpha}}{n^{\alpha}\,\Gamma(1+\alpha)}\,L_n^i\right] (1 - \rho_n L_n)^{-1} u_{j/n} \qquad (4.4.11)$$

which gives a parameterization of the volatility dynamics in two parts: (1) a long memory part which corresponds to the filter $\sum_{i=0}^{+\infty} a_i L_n^i / n^{\alpha}$ with $a_i = [(i+1)^{\alpha} - i^{\alpha}]/\Gamma(1+\alpha)$ and (2) a short memory part which is characterized by the AR(1) process $(1 - \rho_n L_n)^{-1} u_{j/n}$. Indeed, one can show that the long memory filter is "long-term equivalent" to the usual discrete time long memory filter $(1-L)^{-\alpha}$ in the sense that there is a long term relationship (a cointegration relation) between the two types of processes. However, this long-term equivalence between the long memory filter and the usual discrete time one $(1-L)^{-\alpha}$ does not imply that the standard parametrization FARIMA(1, $\alpha$, 0) is well-suited in our framework. Indeed, one can show that the usual discrete time filter $(1-L)^{-\alpha}$ introduces some mixing between long and short term characteristics whereas the parsimonious continuous time model does not.35 This feature clearly puts the continuous time FSV at an advantage with regard to the discrete time SV and GARCH long memory models.
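As a rough illustration of the two-part parameterization in (4.4.11), the sketch below applies the long memory weights $a_i/n^{\alpha}$ to an exactly discretized OU (AR(1)) path. It is only a sketch under stated assumptions: the function name `fsv_log_vol_path`, the seeding, and the convention of starting the OU process at zero are illustrative choices and are not taken from Comte and Renault (1995).

```python
import math
import random


def fsv_log_vol_path(alpha, k, gamma, n, n_steps, seed=0):
    """Approximate FSV log-volatility on the grid j/n, j = 1..n_steps.

    Short memory part: exact AR(1) discretization of the OU process,
    with rho_n = exp(-k/n). Long memory part: the filter with weights
    ((i+1)^alpha - i^alpha) / (n^alpha * Gamma(1+alpha)).
    All names here are hypothetical, chosen for illustration only.
    """
    rng = random.Random(seed)
    rho = math.exp(-k / n)
    # Innovation standard deviation of the exactly discretized OU process.
    sd_u = gamma * math.sqrt((1.0 - rho ** 2) / (2.0 * k))

    def power(i):
        # Convention 0^alpha = 0, so that alpha = 0 gives the identity filter.
        return float(i) ** alpha if i > 0 else 0.0

    scale = n ** alpha * math.gamma(1.0 + alpha)
    weights = [(power(i + 1) - power(i)) / scale for i in range(n_steps)]

    # ou[j-1] holds log sigma^(alpha) at time j/n; the value at time 0 is 0.
    ou, prev = [], 0.0
    for _ in range(n_steps):
        prev = rho * prev + rng.gauss(0.0, sd_u)
        ou.append(prev)

    # Apply the long memory filter: sum_{i=0}^{j-1} w_i * ou_{j-i}.
    path = [sum(weights[i] * ou[j - 1 - i] for i in range(j))
            for j in range(1, n_steps + 1)]
    return ou, path
```

With `alpha = 0` the filter reduces to the identity and the output coincides with the short memory AR(1) path, which gives a simple sanity check on the weights.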
5. Statistical inference
Evaluating the likelihood function of ARCH models is a relatively straightforward task. In sharp contrast, for SV models it is impossible to obtain explicit expressions for the likelihood function. This is a generic feature common to almost all nonlinear latent variable models. The lack of estimation procedures for SV models made them for a long time an unattractive class of models in comparison to ARCH. In recent years, however, remarkable progress has been made regarding the estimation of nonlinear latent variable models in general and SV models in particular. A flurry of methods are now available and are up and running on computers with ever increasing CPU performance. The early attempts to estimate SV models used a GMM procedure. A prominent example is Melino and Turnbull (1990). Section 5.1 is devoted to GMM estimation in the context of SV models. Obviously, GMM is not designed to handle continuous time diffusions as it requires discrete time processes satisfying certain regularity conditions. A continuous time GMM approach, developed by Hansen and Scheinkman (1994), involves moment conditions directly drawn from the continuous time representation of the process. This approach is discussed in Section 5.3. In between, namely in Section 5.2, we discuss the QML approach suggested by Harvey, Ruiz and Shephard (1994) and Nelson (1988). It relies on the fact that the nonlinear (Gaussian) SV model can be transformed into a linear non-Gaussian state space model as in Section 3, and from this a Gaussian quasi-likelihood can be computed.

None of the methods covered in Sections 5.1 through 5.3 involve simulation. However, increased computer power has made simulation-based estimation techniques increasingly popular. The simulated method of moments, or simulation-based GMM approach proposed by Duffie and Singleton (1993), is a first example which is covered in Section 5.4. Next we discuss the indirect inference approach of Gouriéroux, Monfort and Renault (1993) and the moment matching methods of Gallant and Tauchen (1994) in Section 5.5. Finally, Section 5.6 covers a very large class of estimators using computer intensive Markov Chain Monte Carlo methods applied in the context of SV models by Jacquier, Polson and Rossi (1994) and Kim and Shephard (1994), and simulation based ML estimation proposed in Danielsson (1994) and Danielsson and Richard (1993).

In each section we will try to limit our focus to the use of estimation procedures in the context of SV models and avoid details regarding econometric theory. Some useful references to complement the material which will be covered are (1) Hansen (1992), Gallant and White (1988), Hall (1993) and Ogaki (1993) for GMM estimation, (2) Gouriéroux and Monfort (1993b) and Wooldridge (1994) for QMLE, (3) Gouriéroux and Monfort (1995) and Tauchen (1995) for simulation based econometric methods including indirect inference and moment matching, and finally (4) Geweke (1995) and Shephard (1995) for Markov Chain Monte Carlo methods.

35 Namely, $(1 - L_n)^{\alpha}\log\hat\sigma_{n,j/n}$ is not an AR(1) process.

5.1. Generalized method of moments
Let us consider the simple version of the discrete time SV as presented in equations (3.1.2) and (3.1.3) with the additional assumption of normality for the probability distribution of the innovation process $(\varepsilon_t, \eta_t)$. This log-normal SV model has been the subject of at least two extensive Monte Carlo studies on GMM estimation of SV models. They were conducted by Andersen and Sorensen (1993) and Jacquier, Polson and Rossi (1994). The main idea is to exploit the stationary and ergodic properties of the SV model which yield the convergence of sample moments to their unconditional expectations.
For instance, the second and fourth moments are simple expressions of $\sigma^2$ and $\sigma_h^2$, namely $\sigma^2\exp(\sigma_h^2/2)$ and $3\sigma^4\exp(2\sigma_h^2)$ respectively. If these moments are computed in the sample, $\sigma_h^2$ can be estimated directly from the sample kurtosis, $\hat\kappa$, which is the ratio of the fourth moment to the second moment squared. The expression is just $\hat\sigma_h^2 = \log(\hat\kappa/3)$. The parameter $\sigma^2$ can then be estimated from the second moment by substituting in this estimate of $\sigma_h^2$. We might also compute the first-order autocovariance of $y_t^2$, or simply the sample mean of $y_t^2 y_{t-1}^2$, which has expectation $\sigma^4\exp(\{1+\phi\}\sigma_h^2)$ and from which, given the estimates of $\sigma^2$ and $\sigma_h^2$, it is straightforward to get an estimate of $\phi$.

The above procedure is an example of the application of the method of moments. In general terms, $m$ moments are computed. For a sample of size $T$, let $g_T(\beta)$ denote the $m \times 1$ vector of differences between each sample moment and its theoretical expression in terms of the model parameters $\beta$. The generalized method of moments (GMM) estimator is constructed by minimizing the criterion function

$$\hat\beta_T = \arg\min_{\beta}\; g_T(\beta)'\,W_T\,g_T(\beta)$$

where $W_T$ is an $m \times m$ weighting matrix reflecting the importance given to the moments. When $\varepsilon_t$ and $\eta_t$ are mutually independent, Jacquier, Polson and Rossi (1994) suggest using 24 moments. The first four are the absolute moments $E|y_t|^c$ for $c = 1, 2, 3, 4$, while the analytic expression for the others is:36

$$E|y_t^c\,y_{t-\tau}^c| = \frac{2^c}{\pi}\,\sigma^{2c}\left[\Gamma\!\left(\frac{c+1}{2}\right)\right]^2 \exp\!\left(\frac{c^2}{4}\,\sigma_h^2\,[1+\phi^{\tau}]\right), \quad c = 1, 2, \quad \tau = 1, 2, \ldots, 10 .$$

In the more general case when $\varepsilon_t$ and $\eta_t$ are correlated, Melino and Turnbull (1990) included estimates of $E[|y_t|\,y_{t-\tau}]$, $\tau = 0, \pm 1, \pm 2, \ldots, \pm 10$. They presented an explicit expression in the case of $\tau = 1$ and showed that its sign is entirely determined by $\rho$. The GMM method may also be extended to handle a non-normal distribution for $\varepsilon_t$. The required analytic expressions can be obtained as in Section 3.2. On the other hand, the analytic expression of unconditional moments presented in Section 2.4 for the general SARV model may provide the basis of GMM estimation in more general settings (see Andersen (1994)).

From the very start we expect the GMM estimator not to be efficient. The question is how much inefficiency should be tolerated in exchange for its relative simplicity. The generic setup of GMM leaves unspecified the number of moment conditions, except for the minimal number required for identification, as well as the explicit choice of moments. Moreover, the computation of the weighting matrix is also an issue since many options exist in practice. The extensive Monte Carlo studies of Andersen and Sorensen (1993) and Jacquier, Polson and Rossi (1994) attempted to answer these outstanding questions. In general they find that GMM is a fairly inefficient procedure. This stems primarily from the stylized fact, noted in Section 2.2, that $\phi$ in equation (3.1.3) is quite close to unity in most empirical findings because volatility is highly persistent. For values of $\phi$ close to unity, convergence to unconditional moments is extremely slow, suggesting that only large samples can rescue the situation. The Monte Carlo study of Andersen and Sorensen (1993) provides some guidance on how to control the extent of the inefficiency, notably by keeping the number of moment conditions small. They also provide specific recommendations for the choice of weighting matrix estimators with data-dependent bandwidth using the Bartlett kernel.
36 A simple way to derive these moment conditions is via a two-step approach similar in spirit to (2.4.8) and (2.4.9) or (3.2.3).
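The simple method-of-moments calculation described in this section for the log-normal SV model (kurtosis for $\sigma_h^2$, second moment for $\sigma^2$, and the mean of $y_t^2 y_{t-1}^2$ for $\phi$) can be sketched as follows. This is a minimal illustration, not the GMM procedure of the cited Monte Carlo studies; the function name and the error handling are assumptions made here.

```python
import math


def sv_method_of_moments(y):
    """Just-identified method of moments for the log-normal SV model.

    Uses  E[y^2] = sigma^2 exp(sigma_h^2 / 2),
          E[y^4] = 3 sigma^4 exp(2 sigma_h^2),
          E[y_t^2 y_{t-1}^2] = sigma^4 exp((1 + phi) sigma_h^2).
    Returns (sigma2, sigma_h2, phi). Hypothetical helper for illustration.
    """
    T = len(y)
    m2 = sum(v * v for v in y) / T
    m4 = sum(v ** 4 for v in y) / T
    kurtosis = m4 / m2 ** 2
    if kurtosis <= 3.0:
        # The model implies kurtosis above 3; otherwise log(k/3) <= 0.
        raise ValueError("sample kurtosis must exceed 3")
    sigma_h2 = math.log(kurtosis / 3.0)
    sigma2 = m2 / math.exp(sigma_h2 / 2.0)
    m22 = sum(y[t] ** 2 * y[t - 1] ** 2 for t in range(1, T)) / (T - 1)
    phi = math.log(m22 / sigma2 ** 2) / sigma_h2 - 1.0
    return sigma2, sigma_h2, phi
```

Because the three parameters are matched to exactly three moments, no weighting matrix is needed here; with more moments than parameters one would minimize the quadratic form given above instead.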
QML estimators of the parameters $\phi$, $\sigma_\eta^2$ and the variance of $\xi_t$, $\sigma_\xi^2$, are obtained by treating $\xi_t$ and $\eta_t$ as though they were normal and maximizing the prediction error decomposition form of the likelihood obtained via the Kalman filter. As noted in Harvey, Ruiz and Shephard (1994), the quasi maximum likelihood (QML) estimators are asymptotically normal with covariance matrix given by applying the theory in Dunsmuir (1979, p. 502). This assumes that $\eta_t$ and $\xi_t$ have finite fourth moments and that the parameters are not on the boundary of the parameter space. The parameter $\omega$ can be estimated at the same time as the other parameters. Alternatively, it can be estimated as the mean of the $\log y_t^2$'s, since this is asymptotically equivalent when $\phi$ is less than one in absolute value. Application of the QML method does not require the assumption of a specific distribution for $\varepsilon_t$. We will refer to this as unrestricted QML. However, if a distribution is assumed, it is no longer necessary to estimate $\sigma_\xi^2$, as it is known, and an estimate of the scale factor, $\sigma^2$, can be obtained from the estimate of $\omega$. Alternatively, it can be obtained as suggested in subsection 3.4.1. If unrestricted QML estimation is carried out, a value of the parameter determining a particular distribution within a class may be inferred from the estimated variance of $\xi_t$. For example in the case of the Student's t, $\nu$ may be determined from the knowledge that the theoretical value of the variance of $\xi_t$ is $4.93 + \psi'(\nu/2)$ (where $\psi(\cdot)$ is the digamma function introduced in Section 3.2.2).

5.2.2. Asymmetric model
In an asymmetric model, QML may be based on the modified state space form in (3.4.3). The parameters $\sigma_\xi^2$, $\sigma_\eta^2$, $\phi$, $\mu^*$, and $\gamma^*$ can be estimated via the Kalman filter without any distributional assumptions, apart from the existence of fourth moments of $\eta_t$ and $\xi_t$ and the joint symmetry of $\xi_t$ and $\eta_t$. However, if an estimate of $\rho$ is wanted it is necessary to make distributional assumptions about the disturbances, leading to formulae like (3.4.4) and (3.4.5). These formulae can be used to set up an optimization with respect to the original parameters $\sigma^2$, $\sigma_\eta^2$, $\phi$ and $\rho$. This has the advantage that the constraint $|\rho| < 1$ can be imposed. Note that any t-distribution gives the same relationship between the parameters, so within this class it is not necessary to specify the degrees of freedom. Using the QML method with both the original disturbances assumed to be Gaussian, Harvey and Shephard (1993) estimate a model for the CRSP daily returns on a value weighted US market index for 3rd July 1962 to 31st December 1987. These data were used in the paper by Nelson (1991) to illustrate his EGARCH model. The empirical results indicate a very high negative correlation.

5.2.3. QML in the frequency domain
For a long memory SV model, QML estimation in the time domain becomes relatively less attractive because the state space form (SSF) can only be used by expressing $h_t$ as an autoregressive or moving average process and truncating at a suitably high lag. Thus the approach is cumbersome, though the initial state covariance matrix is easily constructed, and the truncation does not affect the
asymptotic properties of the estimators. If the autoregressive approximation, and therefore the SSF, is not used, time domain QML requires the repeated construction and inversion of the $T \times T$ covariance matrix of the $\log y_t^2$'s; see Sowell (1992). On the other hand, QML estimation in the frequency domain is no more difficult than it is in the AR(1) case. Cheung and Diebold (1994) present simulation evidence which suggests that although time domain estimation is more efficient in small samples, the difference is less marked when a mean has to be estimated. The frequency domain (quasi) log-likelihood function is, neglecting constants,

$$\log L = -\frac{1}{2}\sum_{j=1}^{T-1}\log g_j - \pi\sum_{j=1}^{T-1}\frac{I(\lambda_j)}{g_j} \qquad (5.2.1)$$

where $I(\lambda_j)$ is the sample spectrum of the $\log y_t^2$'s and $g_j$ is the spectral generating function (SGF), which for (3.5.1) is

$$g_j = \sigma_\eta^2\,[2(1-\cos\lambda_j)]^{-d} + \sigma_\xi^2 .$$
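A frequency domain quasi log-likelihood of this Whittle type can be evaluated in a few lines. The sketch below assumes the normalization $I(\lambda_j) = |\sum_t x_t e^{-i\lambda_j t}|^2/(2\pi T)$ with $\lambda_j = 2\pi j/T$, and the sign and scaling conventions $-\tfrac12\sum\log g_j - \pi\sum I(\lambda_j)/g_j$; these conventions and the function names are assumptions made for illustration. The direct $O(T^2)$ periodogram is used only to keep the example dependency-free.

```python
import cmath
import math


def periodogram(x):
    """Sample spectrum at lambda_j = 2*pi*j/T for j = 1, ..., T-1."""
    T = len(x)
    values = []
    for j in range(1, T):
        lam = 2.0 * math.pi * j / T
        dft = sum(x[t] * cmath.exp(-1j * lam * t) for t in range(T))
        values.append(abs(dft) ** 2 / (2.0 * math.pi * T))
    return values


def whittle_loglik(x, sigma_eta2, d, sigma_xi2):
    """Frequency domain quasi log-likelihood, constants neglected.

    x plays the role of the log y_t^2 series. The sum starts at j = 1,
    so the zero frequency (and hence the mean of x) is left out.
    """
    T = len(x)
    spec = periodogram(x)
    loglik = 0.0
    for j in range(1, T):
        lam = 2.0 * math.pi * j / T
        g = sigma_eta2 * (2.0 * (1.0 - math.cos(lam))) ** (-d) + sigma_xi2
        loglik += -0.5 * math.log(g) - math.pi * spec[j - 1] / g
    return loglik
```

In practice one would maximize `whittle_loglik` over $(\sigma_\eta^2, d, \sigma_\xi^2)$ with a numerical optimizer and use a fast Fourier transform for the periodogram.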
Note that the summation in (5.2.1) is from $j = 1$ rather than $j = 0$. This is because $g_0$ cannot be evaluated for positive $d$. However, the omission of the zero frequency does remove the mean. The unknown parameters are $\sigma_\eta^2$, $\sigma_\xi^2$ and $d$, but $\sigma_\xi^2$ may be concentrated out of the likelihood function by a reparameterisation in which $\sigma_\eta^2$ is replaced by the signal-noise ratio $q = \sigma_\eta^2/\sigma_\xi^2$. On the other hand if a distribution is assumed for $\varepsilon_t$, then $\sigma_\xi^2$ is known. Breidt, Crato and de Lima (1993) show the consistency of the QML estimator. When $d$ lies between 0.5 and one, $h_t$ is nonstationary, but differencing the $\log y_t^2$'s yields a zero mean stationary process, the SGF of which is
other words the case for one of the more computer intensive methods outlined in Section 5.6 becomes stronger. Other things being equal, an AR coefficient, $\phi$, close to one tends to favor QML because the autocorrelations are slow to die out and are hence captured less well by the moments used in GMM. For the same reason, GMM is likely to be rather poor in estimating a long memory model. The attraction of QML is that it is very easy to implement and it extends easily to more general models, for example nonstationary and multivariate ones. At the same time, it provides filtered and smoothed estimates of the state, and predictions. The one-step ahead prediction errors can also be used to construct diagnostics, such as the Box-Ljung statistic, though in evaluating such tests it must be remembered that the observations are non-normal. Thus even if the hyperparameters are eventually estimated by another method, QML may have a valuable role to play in finding a suitable model specification.
$$dy_t = \mu(y_t;\theta)\,dt + \sigma(y_t;\theta)\,dW_t \qquad (5.3.1)$$
A comparison with the notation in Section 2 immediately draws attention to certain limitations of the setup. First, the functions $\mu_\theta(\cdot) = \mu(\cdot;\theta)$ and $\sigma_\theta(\cdot) = \sigma(\cdot;\theta)$ are parameterized by $y_t$ only, which restricts the state variable process $U_t$ in Section 2 to contemporaneous values of $y_t$. The diffusion in (5.3.1) involves a general vector process $y_t$, hence $y_t$ could include a volatility process to accommodate SV models. Yet, the $y_t$ vector is assumed observable. For the moment we will leave these issues aside, but return to them at the end of the section. Hansen and Scheinkman (1995) consider the infinitesimal operator $A_\theta$ defined for a class of square integrable functions $\varphi: \mathbb{R}^n \to \mathbb{R}$ as follows:

$$A_\theta\varphi(y) = \frac{d\varphi(y)}{dy'}\,\mu_\theta(y) + \frac{1}{2}\,\mathrm{tr}\!\left(\sigma_\theta(y)\sigma_\theta'(y)\,\frac{d^2\varphi(y)}{dy\,dy'}\right) \qquad (5.3.2)$$
it does not necessarily exist for all square integrable functions $\varphi$ but only for a restricted domain $D$. A set of moment conditions can now be obtained for this class of functions $\varphi \in D$. Indeed, as shown for instance by Revuz and Yor (1991), the following equalities hold:

$$E\,A_\theta\varphi(y_t) = 0 , \qquad (5.3.3)$$

$$E\left[A_\theta\varphi(y_{t+1})\,\psi(y_t) - \varphi(y_{t+1})\,A_\theta^*\psi(y_t)\right] = 0 , \qquad (5.3.4)$$

where $A_\theta^*$ is the adjoint infinitesimal operator of $A_\theta$ for the scalar product associated with the invariant measure of the process $y$.37 By choosing an appropriate set of functions, Hansen and Scheinkman exploit moment conditions (5.3.3) and (5.3.4) to construct a GMM estimator of $\theta$. The choice of the functions $\varphi \in D$ and $\psi \in D^*$ determines what moments of the data are used to estimate the parameters. This obviously raises questions regarding the choice of functions to enhance efficiency of the estimator but first and foremost also the identification of $\theta$ via the conditions (5.3.3) and (5.3.4). It was noted in the beginning of the section that the multivariate process $y_t$, in order to cover SV models, must somehow include the latent conditional variance process. Gouriéroux and Monfort (1994, 1995) point out that since the moment conditions based on $\varphi$ and $\psi$ cannot include any latent process it will often (but not always) be impossible to attain identification of all the parameters, particularly those governing the latent volatility process. A possible remedy is to augment the model with observations indirectly related to the latent volatility process, in a sense making it observable. One possible candidate would be to include in $y_t$ both the security price and the Black-Scholes implied volatilities obtained through option market quotations for the underlying asset. This approach is in fact suggested by Pastorello, Renault and Touzi (1993) although not in the context of continuous time GMM but instead using indirect inference methods which will be discussed in Section 5.5.38 Another possibility is to rely on the time deformation representation of SV models as discussed in the context of continuous time GMM by Conley et al. (1995).
37 Please note that $A_\theta^*$ is again associated with a domain $D^*$ so that $\varphi \in D$ and $\psi \in D^*$ in (5.3.4).
38 It was noted in Section 2.1.3 that implied volatilities are biased. The indirect inference procedures used by Pastorello, Renault and Touzi (1993) can cope with such biases, as will be explained in Section 5.5. The use of option price data is further discussed in Section 5.7.
39 SMM was originally proposed for cross-section applications, see Pakes and Pollard (1989) and McFadden (1989). See also Gouriéroux and Monfort (1993a).
we noted that GMM estimation of SV models is based on minimizing the distance between a set of chosen sample moments and unconditional population moments expressed as analytical functions of the model parameters. Suppose now that such analytical expressions are hard to obtain. This is particularly the case when such expressions involve marginalizations with respect to a latent process such as a stochastic volatility process. Could we then simulate data from the model for a particular value of the parameters and match moments from the simulated data with sample moments as a substitute? This strategy is precisely what SMM is all about. Indeed, quite often it is fairly straightforward to simulate processes and therefore take advantage of the SMM procedure. Let us consider again as point of reference and illustration the (multivariate) diffusion of the previous section (equation (5.3.1)) and conduct $H$ simulations $i = 1, \ldots, H$ using a discretization:

$$\Delta\hat y_t^i(\theta) = \mu(\hat y_t^i(\theta);\theta) + \sigma(\hat y_t^i(\theta);\theta)\,\varepsilon_t, \quad i = 1, \ldots, H \text{ and } t = 1, \ldots, T$$

where the $\hat y_t^i(\theta)$ are simulated given a parameter $\theta$ and $\varepsilon_t$ is i.i.d. Gaussian.40 Subject to identification and other regularity conditions one then considers

$$\hat\theta_{HT} = \arg\min_{\theta}\left\| f(y_1, \ldots, y_T) - \frac{1}{H}\sum_{i=1}^{H} f(\hat y_1^i(\theta), \ldots, \hat y_T^i(\theta)) \right\|$$

with a suitable choice of norm, i.e. weighting matrix for the quadratic form as in GMM, and function $f$ of the data, i.e. moment conditions. The asymptotic distribution theory is quite similar to that of GMM, except that simulation introduces an extra source of random error affecting the efficiency of the SMM estimator in comparison to its GMM counterpart. The efficiency loss can be controlled by the choice of $H$.41
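The SMM idea described above can be sketched for a discrete time log-normal SV model. The following is a toy illustration under several assumptions that are not in the text: the scale parameter is fixed at one, the moments are $E|y_t|$, $E y_t^2$ and $E|y_t y_{t-1}|$, the norm uses an identity weighting matrix, the latent process starts at zero, and minimization is a plain grid search. Re-using the same seeds for every candidate $\theta$ (common random numbers) keeps the objective function smooth in $\theta$.

```python
import math
import random


def simulate_sv(theta, T, seed):
    """Simulate y_t = exp(h_t / 2) eps_t with h_t = phi h_{t-1} + sigma_eta eta_t."""
    phi, sigma_eta = theta
    rng = random.Random(seed)
    h, y = 0.0, []
    for _ in range(T):
        h = phi * h + sigma_eta * rng.gauss(0.0, 1.0)
        y.append(math.exp(h / 2.0) * rng.gauss(0.0, 1.0))
    return y


def moments(y):
    """The function f of the data: three sample moments."""
    T = len(y)
    return [sum(abs(v) for v in y) / T,
            sum(v * v for v in y) / T,
            sum(abs(y[t] * y[t - 1]) for t in range(1, T)) / (T - 1)]


def smm_estimate(y, grid, H=5, seed=0):
    """Grid-search SMM with identity weighting and common random numbers."""
    target = moments(y)
    T = len(y)
    best, best_dist = None, float("inf")
    for theta in grid:
        sims = [moments(simulate_sv(theta, T, seed + i)) for i in range(H)]
        avg = [sum(m[k] for m in sims) / H for k in range(len(target))]
        dist = sum((a - b) ** 2 for a, b in zip(target, avg))
        if dist < best_dist:
            best, best_dist = theta, dist
    return best, best_dist
```

A larger `H` reduces the extra simulation noise in the estimator, at proportional computing cost.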
may be a possible candidate as an auxiliary model. An alternative strategy would be to try to summarize the features of the data via a SNP density as developed by Gallant and Tauchen (1989). This empirical SNP density, or more specifically its score, could also fulfill the role of auxiliary model. Other possibilities could be considered as well. The idea is then to use the auxiliary model to estimate $\beta$, so that:

$$\hat\beta_T = \arg\max_{\beta}\sum_{t=1}^{T}\log f^*(y_t \mid y_{t-1}, \beta) \qquad (5.5.1)$$

where we restrict our attention here to a simple dynamic model with one lag for the purpose of illustration. The objective function $f^*$ in (5.5.1) can be a pseudo-likelihood function when the auxiliary model is deliberately misspecified to facilitate estimation. As an alternative $f^*$ can be taken from the class of SNP densities.43 Gouriéroux, Monfort and Renault then propose to estimate the same parameter vector $\beta$ not using the actual sample data but instead using samples $\{\hat y_t^i(\theta)\}_{t=1}^{T}$, simulated $i = 1, \ldots, H$ times, drawn from the model of interest given $\theta$. This yields a new estimator of $\beta$, namely:

$$\hat\beta_{HT}(\theta) = \arg\max_{\beta}\,\frac{1}{H}\sum_{i=1}^{H}\sum_{t=1}^{T}\log f^*(\hat y_t^i(\theta) \mid \hat y_{t-1}^i(\theta), \beta) \qquad (5.5.2)$$

The next step is to minimize a quadratic distance using a weighting matrix $W_T$ to choose an indirect estimator of $\theta$ based on $H$ simulation replications and a sample of $T$ observations, namely:

$$\hat\theta_{HT} = \arg\min_{\theta}\,(\hat\beta_T - \hat\beta_{HT}(\theta))'\,W_T\,(\hat\beta_T - \hat\beta_{HT}(\theta)) \qquad (5.5.3)$$
The approach of Gallant and Tauchen (1994) avoids the step of estimating $\hat\beta_{HT}(\theta)$ by computing the score function of $f^*$ and minimizing a quadratic distance similar to (5.5.3) but involving the score function evaluated at $\hat\beta_T$ and replacing the sample data by simulated series generated by the model of interest. Under suitable regularity conditions the estimator $\hat\theta_{HT}$ is root $T$ consistent and asymptotically normal. As with GMM and SMM there is again an optimal weighting matrix. The resulting asymptotic covariance matrix depends on the number of simulations in the same way the SMM estimator depends on $H$. Gouriéroux, Monfort and Renault (1993) illustrated the use of the indirect inference estimator with a simple example that we would like to briefly discuss here. Typically AR models are easy to estimate while MA models require more elaborate procedures. Suppose the model of interest is a moving average model of order one with parameter $\theta$. Instead of estimating the MA parameter directly from the data they propose to estimate an AR(p) model involving the parameter
43 The discussion should not leave the impression that the auxiliary model can only be estimated via ML-type estimators. Any root $T$ consistent asymptotically normal estimation procedure may be used.
vector $\beta$. The next step then consists of simulating data using the MA model and proceeding further as described above.44 They found that the indirect inference estimator $\hat\theta_{HT}$ appeared to have better finite sample properties than the more traditional maximum likelihood estimators for the MA parameter. In fact the indirect inference estimator exhibited features similar to the median unbiased estimator proposed by Andrews (1993). These properties were confirmed and clarified by Gouriéroux, Renault and Touzi (1994) who studied the second order asymptotic expansion of indirect inference estimators and their ability to reduce finite sample bias.
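The MA(1) example can be sketched in code. To keep the illustration short it makes simplifying assumptions that are not part of the original example: the auxiliary model is an AR(1) without intercept (so $\beta$ is a scalar and $W_T$ plays no role), the auxiliary estimator is pooled OLS over the $H$ simulated paths in the spirit of (5.5.2), and the minimization in (5.5.3) is a grid search over invertible values of $\theta$. All function names are hypothetical.

```python
import random


def simulate_ma1(theta, T, seed):
    """Simulate y_t = e_t + theta * e_{t-1} with standard normal shocks."""
    rng = random.Random(seed)
    e_prev = rng.gauss(0.0, 1.0)
    y = []
    for _ in range(T):
        e = rng.gauss(0.0, 1.0)
        y.append(e + theta * e_prev)
        e_prev = e
    return y


def ar1_ols(paths):
    """Pooled OLS slope of y_t on y_{t-1} across a list of paths."""
    num = sum(y[t] * y[t - 1] for y in paths for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for y in paths for t in range(1, len(y)))
    return num / den


def indirect_inference_ma1(y, grid, H=10, seed=0):
    """Pick theta whose simulated AR(1) coefficient best matches the data's."""
    beta_hat = ar1_ols([y])
    T = len(y)
    best, best_dist = None, float("inf")
    for theta in grid:
        paths = [simulate_ma1(theta, T, seed + i) for i in range(H)]
        dist = (beta_hat - ar1_ols(paths)) ** 2
        if dist < best_dist:
            best, best_dist = theta, dist
    return best, best_dist
```

The same seeds are re-used for every candidate $\theta$, so differences in the simulated auxiliary estimate reflect the parameter rather than fresh simulation noise.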
$$dy_t = \mu(y_t;\theta)\,dt + \sigma(y_t;\theta)\,dW_t \qquad (5.5.4)$$
In Section 5.3 we noted that the above equation holds under certain restrictions such as the functions $\mu$ and $\sigma$ being restricted to $y_t$ as arguments. While these restrictions were binding for the setup of Section 5.3 this will not be the case for the estimation procedures discussed here. Indeed, equation (5.5.4) is only used as an illustrative example. The diffusion is then simulated either via exact discretizations or some type of approximate discretization (e.g. Euler or Mil'shtein, see Pardoux and Talay (1985) or Kloeden and Platen (1992) for further details). More precisely we define the process $y_t^{(\delta)}$ such that:

$$y^{(\delta)}_{(k+1)\delta} = y^{(\delta)}_{k\delta} + \mu(y^{(\delta)}_{k\delta};\theta)\,\delta + \sigma(y^{(\delta)}_{k\delta};\theta)\,\delta^{1/2}\,\varepsilon^{(\delta)}_{(k+1)\delta} \qquad (5.5.5)$$
Under suitable regularity conditions (see for instance Stroock and Varadhan (1979)) we know that the diffusion admits a unique solution (in distribution) and the process $y_t^{(\delta)}$ converges to $y_t$ as $\delta$ goes to zero. Therefore one can expect to simulate $y_t$ quite accurately for $\delta$ sufficiently small. The auxiliary model may be a discretization of (5.5.4) choosing $\delta = 1$. Hence, one formulates a ML estimator based on the nonlinear AR model appearing in (5.5.5) setting $\delta = 1$. To control for the discretization bias one can simulate the underlying diffusion with $\delta = 1/10$ or $1/20$, for instance, and aggregate the simulated data to correspond with the sampling frequency of the DGP. Broze, Scaillet and Zakoian (1994) discuss the effect of the simulation step size on the asymptotic distribution. The use of simulation-based inference methods becomes particularly appropriate and attractive when diffusions involve latent processes, such as is the case with SV models. Gouriéroux and Monfort (1994, 1995) discuss several examples and study their performance via Monte Carlo simulation. It should be noted that estimating the diffusion at a coarser discretization is not the only possible choice of auxiliary model. Indeed, Pastorello, Renault and Touzi (1993), Engle and Lee (1994) and Gallant and Tauchen (1994) suggest the use of ARCH-type models. There have been several successful applications of these methods to financial time series. They include Broze et al. (1995), Engle and Lee (1994), Gallant, Hsieh and Tauchen (1994), Gallant and Tauchen (1994, 1995), Ghysels, Gouriéroux and Jasiak (1995b), Ghysels and Jasiak (1994a,b), Pastorello et al. (1993), among others.

44 Again one could use a score principle here, following Gallant and Tauchen (1994). In fact in a linear Gaussian setting the SNP approach to fit data generated by a MA(1) model would be to estimate an AR(p) model. Ghysels, Khalaf and Vodounou (1994) provide a more detailed discussion of score-based and indirect inference estimators of MA models as well as their relation with more standard estimators.
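The simulation step described above, an Euler scheme run on a fine grid and then brought back to the observation frequency, can be sketched as follows. This is a minimal illustration for a scalar diffusion; the function name, the callable arguments `mu` and `sigma`, and the choice to "aggregate" by keeping every `steps_per_obs`-th point are assumptions made here (flow variables would instead be summed or averaged over the fine grid).

```python
import math
import random


def euler_path(mu, sigma, theta, y0, n_obs, steps_per_obs=10, seed=0):
    """Euler discretization with step delta = 1 / steps_per_obs.

    mu(y, theta) and sigma(y, theta) are user-supplied drift and
    diffusion functions. Returns y at integer times 0, 1, ..., n_obs.
    """
    rng = random.Random(seed)
    delta = 1.0 / steps_per_obs
    y = y0
    observed = [y0]
    for _ in range(n_obs):
        for _ in range(steps_per_obs):
            shock = rng.gauss(0.0, 1.0)
            y = y + mu(y, theta) * delta + sigma(y, theta) * math.sqrt(delta) * shock
        # Keep only the value at the observation date.
        observed.append(y)
    return observed
```

For example, `euler_path(lambda y, th: -th * y, lambda y, th: 0.3, 0.5, 1.0, 100)` simulates an OU process observed at unit intervals while stepping the scheme ten times per interval.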
$$\int p(y \mid h)\,p(h)\,dh$$
where $y$ and $h$ contain the $T$ elements of $y_t$ and $h_t$ respectively. This expression can be written in terms of the $\sigma_t^2$'s, rather than their logarithms, the $h_t$'s, but it makes little difference to what follows. Of course the problem is that the above likelihood has no closed form, so it must be calculated by some kind of simulation method. Excellent discussions can be found in Shephard (1995) and in Jacquier, Polson and Rossi (1994), including the comments. Conceptually, the simplest approach is to use Monte Carlo integration by drawing from the unconditional distribution of $h$ for given values of the parameters $(\phi, \sigma_\eta^2, \sigma^2)$, and estimating the likelihood as the average of the $p(y \mid h)$'s. This is then repeated, searching over the parameters until the maximum of the simulated likelihood is found. As it stands this procedure is not very satisfactory, but it may be improved by using ideas of importance sampling. This has been implemented for ML estimation of SV models by Danielsson and Richard (1993) and Danielsson (1994). However, the method becomes more difficult as the sample size increases. A more promising way of attacking likelihood estimation by simulation techniques is to use Markov Chain Monte Carlo (MCMC) to draw from the distribution of volatilities conditional on the observations. Ways in which this can be done were outlined in subsection 3.4.2 on nonlinear filters and smoothers. Kim and Shephard (1994) suggest a method of computing ML estimators by putting their multimove algorithm within a simulated EM algorithm. Jacquier, Polson and Rossi (1994) adopt a Bayesian approach in which the specification of the model has a hierarchical structure in which a prior distribution for the hyperparameters, $\omega = (\sigma_\eta, \phi, \sigma)'$, joins the conditional distributions, $y \mid h$ and $h \mid \omega$. (Actually the $\sigma_t$'s are used rather than the $h_t$'s.) The joint posterior of $h$ and $\omega$ is proportional to the product of these three distributions, that is $p(h, \omega \mid y) \propto p(y \mid h)\,p(h \mid \omega)\,p(\omega)$. The introduction of $h$ makes the statistical treatment tractable and is an example of what is called data augmentation; see Tanner and Wong (1987). From the joint posterior, $p(h, \omega \mid y)$, the marginal $p(h \mid y)$ solves the smoothing problem for the unobserved volatilities, taking account of the sampling variability in the hyperparameters. Conditional on $h$, the posterior of $\omega$, $p(\omega \mid h, y)$, is simple to compute from the standard Bayesian treatment of linear models. If it were also possible to sample directly from $p(h \mid \omega, y)$ at low cost, it would be straightforward to construct a Markov chain by alternating back and forth drawing from $p(\omega \mid h, y)$ and $p(h \mid \omega, y)$. This would produce a cyclic chain, a special case of which is the Gibbs sampler. However, as was noted in subsection 3.4.2, Jacquier, Polson and Rossi (1994) show that it is much better to decompose $p(h \mid \omega, y)$ into a set of univariate distributions in which each $h_t$, or rather $\sigma_t$, is conditioned on all the others.
The prior distribution for $\omega$, the parameters of the volatility process in JPR (1994), is the standard conjugate prior for the linear model, a (truncated) Normal-Gamma. The priors can be made extremely diffuse while remaining proper. JPR conduct an extensive sampling experiment to document the performance of this and more traditional approaches. Simulating stochastic volatility series, they compare the sampling performance of the posterior mean with that of the QML and GMM point estimates. The MCMC posterior mean exhibits root mean squared errors anywhere between half and a quarter of the size of those of the GMM and QML point estimates. Even more striking are the volatility smoothing performance results. The root mean squared error of the posterior mean of $h_t$ produced by the Bayesian filter is 10% smaller than that of the point estimate produced by an approximate Kalman filter supplied with the true parameters. Shephard and Kim in their comment on JPR (1994) point out that for very high $\phi$ and small $\sigma_\eta$, the rate of convergence of the JPR algorithm will slow down. More draws will then be required to obtain the same amount of information. They propose to approximate the volatility disturbance with a discrete mixture of normals. The benefit of the method is that a draw of the whole vector $h$ is then possible, which is faster than $T$ draws from each $h_t$. However this comes at the cost that the draws navigate in a much higher dimensional space due to the discretisation effected.
Also, the convergence of chains based upon discrete mixtures is sensitive to the number of components and their assigned probability weights. Mahieu and Schotman (1994) add some generality to the Shephard and Kim idea by letting the data produce estimates of the characteristics of the discretized state space (probabilities, mean and variance). The original implementation of the JPR algorithm was limited to a very basic model of stochastic volatility, AR(1) with uncorrelated mean and volatility disturbances. In a univariate setup, correlated disturbances are likely to be important for stock returns, i.e., the so-called leverage effect. The evidence in Gallant, Rossi, and Tauchen (1994) also points at non-normal conditional errors with both skewness and kurtosis. Jacquier, Polson, and Rossi (1995a) show how the hierarchical framework allows the convenient extension of the MCMC algorithm to more general models. Namely, they estimate univariate stochastic volatility models with correlated disturbances, and skewed and fat-tailed variance disturbances, as well as multivariate models. Alternatively, the MCMC algorithm can be extended to a factor structure. The factors exhibit stochastic volatility and can be observable or nonobservable.
5.7. Inference and option price data
Some of the continuous time SV models currently found in the literature were developed to answer questions regarding derivative security pricing. Given this rather explicit link between derivatives and SV diffusions, it is perhaps somewhat surprising that relatively little attention has been paid to the use of option price data to estimate continuous time diffusions. Melino (1994) in his survey in fact notes: "Clearly, information about the stochastic properties of an asset's price is contained both in the history of the asset's price and the price of any options written on it. Current strategies for combining these two sources of information, including implicit estimation, are uncomfortably ad hoc. Statistically speaking, we need to model the source of the prediction errors in option pricing and to relate the distribution of these errors to the stock price process". For example, implicit estimation, like computation of BS implied volatilities, is certainly uncomfortably ad hoc from a statistical point of view. In general, each observed option price introduces one source of prediction error when compared to a pricing model. The challenge is to model the joint nondegenerate probability distribution of options and asset prices via a number of unobserved state variables. This approach has been pursued in a number of recent papers, including Christensen (1992), Renault and Touzi (1992), Pastorello et al. (1993), Duan (1994) and Renault (1995). Christensen (1992) considers a pricing model for $n$ assets as a function of a state vector $x_t$ which is $(l + n)$-dimensional and divided into an $l$-dimensional observed ($z_t$) and an $n$-dimensional unobserved ($\omega_t$) component. Let $p_t$ be the price vector of the $n$ assets, then:
$p_t = m(z_t, \omega_t, \theta)$ .    (5.7.1)
Equation (5.7.1) provides a one-to-one relationship between the $n$ latent state variables $\omega_t$ and the $n$ observed prices $p_t$, for given $z_t$ and $\theta$. From a financial viewpoint, it implies that the $n$ assets are appropriate instruments to complete the markets if we assume that the observed state variables $z_t$ are already mimicked by the price dynamics of other (primitive) assets. Moreover, from a statistical viewpoint it allows full structural maximum likelihood estimation provided the log-likelihood function for observed prices can be deduced easily from a statistical model for $x_t$. For instance, in a Markovian setting where, conditionally on $x_0$, the joint distribution of $x_1^T = (x_t)_{1 \le t \le T}$ is given by the density:
$f_X(x_1^T \mid x_0, \theta) = \prod_{t=1}^{T} f(z_t, \omega_t \mid z_{t-1}, \omega_{t-1}, \theta)$    (5.7.2)
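the conditional distribution of the data $D_1^T = (p_t, z_t)_{1 \le t \le T}$ given $D_0 = (p_0, z_0)$ is obtained by the usual Jacobian formula:
$f_D(D_1^T \mid D_0, \theta) = \prod_{t=1}^{T} f\big(z_t, m_\theta^{-1}(z_t, p_t) \mid z_{t-1}, m_\theta^{-1}(z_{t-1}, p_{t-1}), \theta\big) \, \big|\det \nabla_\omega m\big(z_t, m_\theta^{-1}(z_t, p_t), \theta\big)\big|^{-1}$    (5.7.3)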
where $m_\theta^{-1}(z, \cdot)$ is the $\omega$-inverse of $m(z, \cdot, \theta)$, defined formally by $m_\theta^{-1}(z, m(z, \omega, \theta)) = \omega$, while $\nabla_\omega m(\cdot)$ represents the columns corresponding to $\omega$ of the Jacobian matrix. This MLE using price data of derivatives was proposed independently by Christensen (1992) and Duan (1994). Renault and Touzi (1992) were instead more specifically interested in the Hull and White option pricing formula with $z_t = S_t$, the observed underlying asset price, and $\omega_t = \sigma_t$, the unobserved stochastic volatility process. Then, with the joint process $x_t = (S_t, \sigma_t)$ being Markovian, we have a call price of the form:
$C_t = m(x_t, \theta, K)$
where $\theta = (\alpha', \gamma')'$ involves two types of parameters: (1) the vector $\alpha$ of parameters describing the dynamics of the joint process $x_t = (S_t, \sigma_t)$, which under the equivalent martingale measure allows one to compute the expectation with respect to the (risk-neutral) conditional probability distribution of $V^2(t, t + h)$ given $\sigma_t$; and (2) the vector $\gamma$ of parameters which characterize the risk premia determining the relation between the risk-neutral probability distribution of the $x$ process and the Data Generating Process. Structural MLE is often difficult to implement. This motivated Renault and Touzi (1992) and Pastorello, Renault and Touzi (1993) to consider less efficient but simpler and more robust procedures involving some proxies of the structural likelihood (5.7.3). To illustrate these procedures let us consider the standard lognormal SV model in continuous time:
$dS_t / S_t = \mu_t \, dt + \sigma_t \, dW_t^S$
$d \log \sigma_t = k(a - \log \sigma_t) \, dt + c \, dW_t^\sigma$    (5.7.4)
Standard option pricing arguments allow us to ignore misspecifications of the drift of the underlying asset price process. Hence, a first step towards simplicity and robustness is to isolate from the likelihood function the volatility dynamics, namely:
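$\prod_{i=1}^{n} (2\pi s^2)^{-1/2} \exp\left[ -(2 s^2)^{-1} \left( \log \sigma_{t_i} - e^{-k\Delta t} \log \sigma_{t_{i-1}} - a(1 - e^{-k\Delta t}) \right)^2 \right]$    (5.7.5)
where $s^2 = c^2 (1 - e^{-2k\Delta t}) / (2k)$ denotes the conditional variance of $\log \sigma_{t_i}$ given $\log \sigma_{t_{i-1}}$,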
associated with a sample $\sigma_{t_i}$, $i = 1, \ldots, n$, and $t_i - t_{i-1} = \Delta t$. To approximate this expression one can consider a direct method, as in Renault and Touzi (1992), or an indirect method, as in Pastorello et al. (1993). The former involves calculating implied volatilities from the Hull and White model to create pseudo samples $\sigma_{t_i}$ parameterized by $k$, $a$ and $c$, and computing the maximum of (5.7.5) with respect to those three parameters (see footnote 45). Pastorello et al. (1993) proposed several indirect inference methods, described in Section 5.5, in the context of (5.7.5). For instance, they propose to use an indirect inference strategy involving GARCH(1,1) volatility estimates obtained from the underlying asset (also independently suggested by Engle and Lee (1994)). This produces asymptotically unbiased but rather inefficient estimates. Pastorello et al. indeed find that an indirect inference simplification of the Renault and Touzi direct procedure involving option prices is far more efficient. It is a clear illustration of the intuition that the use of option price data paired with suitable statistical methods should largely improve the accuracy of estimating volatility diffusion parameters.
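As a rough illustration of the volatility likelihood (5.7.5), the following sketch evaluates its logarithm for a given series of log volatilities. It assumes the volatility sample is directly available (here it is simulated), whereas in the procedures above it would be a pseudo sample of implied volatilities. The conditional variance $c^2(1 - e^{-2k\Delta t})/(2k)$ is the exact transition variance of an Ornstein-Uhlenbeck process for $\log \sigma_t$; the function names and parameter values are hypothetical.

```python
import numpy as np

def ou_loglik(log_sigma, k, a, c, dt):
    """Log of the Gaussian transition likelihood of an equally spaced log-volatility sample."""
    rho = np.exp(-k * dt)                                   # autoregressive coefficient e^{-k dt}
    s2 = c**2 * (1.0 - np.exp(-2.0 * k * dt)) / (2.0 * k)   # conditional variance
    resid = log_sigma[1:] - rho * log_sigma[:-1] - a * (1.0 - rho)
    n = resid.size
    return -0.5 * n * np.log(2.0 * np.pi * s2) - 0.5 * np.sum(resid**2) / s2

def simulate_log_sigma(n, k, a, c, dt, rng):
    """Exact discretisation of the mean-reverting log-volatility process, started at a."""
    rho = np.exp(-k * dt)
    s = np.sqrt(c**2 * (1.0 - np.exp(-2.0 * k * dt)) / (2.0 * k))
    x = np.empty(n + 1)
    x[0] = a
    for i in range(1, n + 1):
        x[i] = rho * x[i - 1] + a * (1.0 - rho) + s * rng.standard_normal()
    return x

rng = np.random.default_rng(0)
x = simulate_log_sigma(2000, 2.0, -1.0, 0.5, 1.0 / 52.0, rng)  # illustrative weekly sample
ll_true = ou_loglik(x, 2.0, -1.0, 0.5, 1.0 / 52.0)             # at the simulating parameters
ll_wrong = ou_loglik(x, 2.0, 1.0, 0.5, 1.0 / 52.0)             # with a badly wrong mean level
```

Maximising `ou_loglik` over $(k, a, c)$ with a numerical optimiser would give the estimates; that step is left out to keep the sketch short.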
5.8. Regression models with stochastic volatility
A single equation regression model with stochastic volatility in the disturbance term may be written
$y_t = x_t'\beta + u_t$,  $t = 1, \ldots, T$,    (5.8.1)
where $y_t$ denotes the $t$-th observation, $x_t$ is a $k \times 1$ vector of explanatory variables, $\beta$ is a $k \times 1$ vector of coefficients and $u_t = \sigma \varepsilon_t \exp(0.5 h_t)$ as discussed in Section 3. As a special case, the observations may simply have a nonzero mean so that $x_t'\beta = \mu$ for all $t$.
Since $u_t$ is stationary, an OLS regression of $y_t$ on $x_t$ yields a consistent estimator of $\beta$. However, it is not efficient.
45 The direct maximization of (5.7.5) using BS implied volatilities has also been proposed, see e.g. Heynen, Kemna and Vorst (1994). Obviously the use of BS implied volatility induces a misspecification bias due to the BS model assumptions.
For given values of the SV parameters, $\phi$ and $\sigma_\eta^2$, a smoothed estimator of $h_t$, $h_{t|T}$, can be computed using one of the methods outlined in Section 3.4. Multiplying (5.8.1) through by $\exp(-0.5 h_{t|T})$ gives
$\tilde y_t = \tilde x_t'\beta + \tilde u_t$,  $t = 1, \ldots, T$    (5.8.2)
where $\tilde y_t = y_t \exp(-0.5 h_{t|T})$, $\tilde x_t = x_t \exp(-0.5 h_{t|T})$ and the $\tilde u_t$'s can be thought of as heteroskedasticity corrected disturbances. Harvey and Shephard (1993) show that these disturbances have zero mean, constant variance and are serially uncorrelated, and hence suggest the construction of a feasible GLS estimator
$\tilde\beta = \left[ \sum_{t=1}^{T} e^{-h_{t|T}} x_t x_t' \right]^{-1} \sum_{t=1}^{T} e^{-h_{t|T}} x_t y_t$ .    (5.8.3)
In the classical heteroskedastic regression model $h_t$ is deterministic and depends on a fixed number of unknown parameters. Because these parameters can be estimated consistently, the feasible GLS estimator has the same asymptotic distribution as the GLS estimator. Here $h_t$ is stochastic and the MSE of its estimator is of $O(1)$. The situation is therefore somewhat different. Harvey and Shephard (1993) show that, under standard regularity conditions on the sequence of $x_t$, $\tilde\beta$ is asymptotically normal with mean $\beta$ and a covariance matrix which can be consistently estimated by
$\widehat{\mathrm{avar}}(\tilde\beta) = \left[ \sum_{t=1}^{T} e^{-h_{t|T}} x_t x_t' \right]^{-1} \left[ \sum_{t=1}^{T} (y_t - x_t'\tilde\beta)^2 e^{-2h_{t|T}} x_t x_t' \right] \left[ \sum_{t=1}^{T} e^{-h_{t|T}} x_t x_t' \right]^{-1}$ .    (5.8.4)
When $h_{t|T}$ is the smoothed estimate given by the linear state space form, the analysis in Harvey and Shephard (1993) suggests that, asymptotically, the feasible GLS estimator is almost as efficient as the GLS estimator and considerably more efficient than the OLS estimator. It would be possible to replace $\exp(h_{t|T})$ by a better estimate computed from one of the methods described in Section 3.4, but this may not have much effect on the efficiency of the resulting feasible GLS estimator of $\beta$. When $h_t$ is nonstationary, or nearly nonstationary, Hansen (1995) shows that it is possible to construct a feasible adaptive least squares estimator which is asymptotically equivalent to GLS.
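The feasible GLS estimator (5.8.3) and the covariance estimator (5.8.4) amount to a weighted least squares fit with a sandwich covariance matrix. The sketch below assumes the smoothed log volatilities $h_{t|T}$ have already been computed by one of the methods of Section 3.4 and are passed in as an array; the function name `feasible_gls` is hypothetical.

```python
import numpy as np

def feasible_gls(y, X, h_smooth):
    """Feasible GLS with weights exp(-h_{t|T}) and a sandwich covariance estimate.

    y        : (T,) observations
    X        : (T, k) explanatory variables
    h_smooth : (T,) smoothed log volatilities h_{t|T}, assumed given
    """
    w = np.exp(-h_smooth)                        # weights e^{-h_{t|T}}
    A = (X * w[:, None]).T @ X                   # sum_t e^{-h} x_t x_t'
    b = (X * w[:, None]).T @ y                   # sum_t e^{-h} x_t y_t
    beta = np.linalg.solve(A, b)                 # estimator as in (5.8.3)
    resid = y - X @ beta
    B = (X * (resid**2 * w**2)[:, None]).T @ X   # sum_t resid^2 e^{-2h} x_t x_t'
    A_inv = np.linalg.inv(A)
    avar = A_inv @ B @ A_inv                     # covariance estimate as in (5.8.4)
    return beta, avar
```

With `h_smooth` set to zero the weights are all one, so the estimator reduces to OLS with a heteroskedasticity-robust covariance matrix, which gives a convenient sanity check.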
Conclusions
No survey is ever complete. There are two particular areas we expect will flourish in the years to come but which we were not able to cover. The first is the area of market microstructures, which is well surveyed in a recent review paper by Goodhart and O'Hara (1995). With the ever increasing availability of high frequency data series, we anticipate more work involving game theoretic models. These can now be estimated because of recent advances in econometric methods, similar to those enabling us to estimate diffusions. Another area where we expect interesting research to emerge is that involving nonparametric procedures to estimate SV continuous time and derivative securities models. Recent papers include Aït-Sahalia (1994), Aït-Sahalia et al. (1994), Bossaerts, Hafner and Härdle (1995), Broadie et al. (1995), Conley et al. (1995), Elsheimer et al. (1995), Gouriéroux, Monfort and Tenreiro (1994), Gouriéroux and Scaillet (1995), Hutchinson, Lo and Poggio (1994), Lezan et al. (1995), Lo (1995), Pagan and Schwert (1992). Research into the econometrics of Stochastic Volatility models is relatively new. As our survey has shown, there has been a burst of activity in recent years drawing on the latest statistical technology. As regards the relationship with ARCH, our view is that SV and ARCH are not necessarily direct competitors, but rather complement each other in certain respects. Recent advances such as the use of ARCH models as filters, the weakening of GARCH and temporal aggregation, and the introduction of nonparametric methods to fit conditional variances illustrate that a unified strategy for modelling volatility needs to draw on both ARCH and SV.
References
Abramowitz, M. and N. C. Stegun (1970). Handbook of Mathematical Functions. Dover Publications Inc., New York.
Aït-Sahalia, Y. (1994). Nonparametric pricing of interest rate derivative securities. Discussion Paper, Graduate School of Business, University of Chicago.
Aït-Sahalia, Y., S. J. Bickel and T. M. Stoker (1994). Goodness-of-fit tests for regression using kernel methods. Discussion Paper, University of Chicago.
Amin, K. L. and V. Ng (1993). Equilibrium option valuation with systematic stochastic volatility. J. Finance 48, 881-910.
Andersen, T. G. (1992). Volatility. Discussion Paper, Northwestern University.
Andersen, T. G. (1994). Stochastic autoregressive volatility: A framework for volatility modeling. Math. Finance 4, 75-102.
Andersen, T. G. (1996). Return volatility and trading volume: An information flow interpretation of stochastic volatility. J. Finance, to appear.
Andersen, T. G. and T. Bollerslev (1995). Intraday seasonality and volatility persistence in financial markets. J. Emp. Finance, to appear.
Andersen, T. G. and B. Sorensen (1993). GMM estimation of a stochastic volatility model: A Monte Carlo study. J. Business Econom. Statist., to appear.
Andersen, T. G. and B. Sorensen (1996). GMM and QML asymptotic standard deviations in stochastic volatility models: A response to Ruiz (1994). J. Econometrics, to appear.
Andrews, D. W. K. (1993). Exactly median-unbiased estimation of first order autoregressive unit root models. Econometrica 61, 139-165.
Bachelier, L. (1900). Théorie de la spéculation. Ann. Sci. École Norm. Sup. 17, 21-86. [On the Random Character of Stock Market Prices (Paul H. Cootner, ed.) The MIT Press, Cambridge, Mass. 1964].
Baillie, R. T. and T. Bollerslev (1989). The message in daily exchange rates: A conditional variance tale. J. Business Econom. Statist. 7, 297-305.
Baillie, R. T. and T. Bollerslev (1991). Intraday and interday volatility in foreign exchange rates. Rev. Econom. Stud. 58, 565-585.
Baillie, R. T., T. Bollerslev and H. O. Mikkelsen (1993). Fractionally integrated generalized autoregressive conditional heteroskedasticity. J. Econometrics, to appear.
Bajeux, I. and J. C. Rochet (1992). Dynamic spanning: Are options an appropriate instrument? Math. Finance, to appear.
Bates, D. S. (1995a). Testing option pricing models. In: G. S. Maddala ed., Handbook of Statistics, Vol. 14, Statistical Methods in Finance. North Holland, Amsterdam, in this volume.
Bates, D. S. (1995b). Jumps and stochastic volatility: Exchange rate processes implicit in PHLX Deutschemark options. Rev. Financ. Stud., to appear.
Beckers, S. (1981). Standard deviations implied in option prices as predictors of future stock price variability. J. Banking Finance 5, 363-381.
Bera, A. K. and M. L. Higgins (1995). On ARCH models: Properties, estimation and testing. In: L. Exley, D. A. R. George, C. J. Roberts and S. Sawyer eds., Surveys in Econometrics. Basil Blackwell, Oxford. Reprinted from J. Econom. Surveys.
Black, F. (1976). Studies in stock price volatility changes. Proceedings of the 1976 Business Meeting of the Business and Economic Statistics Section, Amer. Statist. Assoc., 177-181.
Black, F. and M. Scholes (1973). The pricing of options and corporate liabilities. J. Politic. Econom. 81, 637-654.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. J. Econometrics 31, 307-327.
Bollerslev, T., Y. C. Chou and K. Kroner (1992). ARCH modelling in finance: A selective review of the theory and empirical evidence. J. Econometrics 52, 201-224.
Bollerslev, T. and R. Engle (1993). Common persistence in conditional variances. Econometrica 61, 166-187.
Bollerslev, T., R. Engle and D. Nelson (1994). ARCH models. In: R. F. Engle and D. McFadden eds., Handbook of Econometrics, Volume IV. North-Holland, Amsterdam.
Bollerslev, T., R. Engle and J. Wooldridge (1988). A capital asset pricing model with time varying covariances. J. Politic. Econom. 96, 116-131.
Bollerslev, T. and E. Ghysels (1994). On periodic autoregression conditional heteroskedasticity. J. Business Econom. Statist., to appear.
Bollerslev, T. and H. O. Mikkelsen (1995). Modeling and pricing long-memory in stock market volatility. J. Econometrics, to appear.
Bossaerts, P., C. Hafner and W. Härdle (1995). Foreign exchange rates have surprising volatility. Discussion Paper, CentER, University of Tilburg.
Bossaerts, P. and P. Hillion (1995). Local parametric analysis of hedging in discrete time. J. Econometrics, to appear.
Breidt, F. J., N. Crato and P. de Lima (1993). Modeling long-memory stochastic volatility. Discussion Paper, Iowa State University.
Breidt, F. J. and A. L. Carriquiry (1995). Improved quasi-maximum likelihood estimation for stochastic volatility models. Mimeo, Department of Statistics, University of Iowa.
Broadie, M., J. Detemple, E. Ghysels and O. Torrès (1995). American options with stochastic volatility: A nonparametric approach. Discussion Paper, CIRANO.
Broze, L., O. Scaillet and J. M. Zakoian (1994). Quasi indirect inference for diffusion processes. Discussion Paper, CORE.
Broze, L., O. Scaillet and J. M. Zakoian (1995). Testing for continuous time models of the short term interest rate. J. Emp. Finance, 199-223.
Campa, J. M. and P. H. K. Chang (1995). Testing the expectations hypothesis on the term structure of implied volatilities in foreign exchange options. J. Finance 50, to appear.
Campbell, J. Y. and A. S. Kyle (1993). Smart money, noise trading and stock price behaviour. Rev. Econom. Stud. 60, 1-34.
Canina, L. and S. Figlewski (1993). The informational content of implied volatility. Rev. Financ. Stud. 6, 659-682.
Canova, F. (1992). Detrending and Business Cycle Facts. Discussion Paper, European University Institute, Florence.
Chesney, M. and L. Scott (1989). Pricing European currency options: A comparison of the modified Black-Scholes model and a random variance model. J. Financ. Quant. Anal. 24, 267-284.
Cheung, Y.-W. and F. X. Diebold (1994). On maximum likelihood estimation of the differencing parameter of fractionally-integrated noise with unknown mean. J. Econometrics 62, 301-316.
Chiras, D. P. and S. Manaster (1978). The information content of option prices and a test of market efficiency. J. Financ. Econom. 6, 213-234.
Christensen, B. J. (1992). Asset prices and the empirical martingale model. Discussion Paper, New York University.
Christie, A. A. (1982). The stochastic behavior of common stock variances: Value, leverage, and interest rate effects. J. Financ. Econom. 10, 407-432.
Clark, P. K. (1973). A subordinated stochastic process model with finite variance for speculative prices. Econometrica 41, 135-156.
Clewlow, L. and X. Xu (1993). The dynamics of stochastic volatility. Discussion Paper, University of Warwick.
Comte, F. and E. Renault (1993). Long memory continuous time models. J. Econometrics, to appear.
Comte, F. and E. Renault (1995). Long memory continuous time stochastic volatility models. Paper presented at the HFDF-I Conference, Zürich.
Conley, T., L. P. Hansen, E. Luttmer and J. Scheinkman (1995). Estimating subordinated diffusions from discrete time data. Discussion Paper, University of Chicago.
Cornell, B. (1978). Using the options pricing model to measure the uncertainty producing effect of major announcements. Financ. Mgmt. 7, 54-59.
Cox, J. C. (1975). Notes on option pricing I: Constant elasticity of variance diffusions. Discussion Paper, Stanford University.
Cox, J. C. and S. Ross (1976). The valuation of options for alternative stochastic processes. J. Financ. Econom. 3, 145-166.
Cox, J. C. and M. Rubinstein (1985). Options Markets. Englewood Cliffs, Prentice-Hall, New Jersey.
Dacorogna, M. M., U. A. Müller, R. J. Nagler, R. B. Olsen and O. V. Pictet (1993). A geographical model for the daily and weekly seasonal volatility in the foreign exchange market. J. Internat. Money Finance 12, 413-438.
Danielsson, J. (1994). Stochastic volatility in asset prices: Estimation with simulated maximum likelihood. J. Econometrics 61, 375-400.
Danielsson, J. and J. F. Richard (1993). Accelerated Gaussian importance sampler with application to dynamic latent variable models. J. Appl. Econometrics 3, S153-S174.
Dassios, A. (1995). Asymptotic expressions for approximations to stochastic variance models. Mimeo, London School of Economics.
Day, T. E. and C. M. Lewis (1988). The behavior of the volatility implicit in the prices of stock index options. J. Financ. Econom. 22, 103-122.
Day, T. E. and C. M. Lewis (1992). Stock market volatility and the information content of stock index options. J. Econometrics 52, 267-287.
Diebold, F. X. (1988). Empirical Modeling of Exchange Rate Dynamics. Springer Verlag, New York.
Diebold, F. X. and J. A. Lopez (1995). Modeling Volatility Dynamics. In: K. Hoover ed., Macroeconomics: Developments, Tensions and Prospects.
Diebold, F. X. and M. Nerlove (1989). The dynamics of exchange rate volatility: A multivariate latent factor ARCH Model. J. Appl. Econometrics 4, 1-22.
Ding, Z., C. W. J. Granger and R. F. Engle (1993). A long memory property of stock market returns and a new model. J. Emp. Finance 1, 83-108.
Diz, F. and T. J. Finucane (1993). Do the options markets really overreact? J. Futures Markets 13, 298-312.
Drost, F. C. and T. E. Nijman (1993). Temporal aggregation of GARCH processes. Econometrica 61, 909-927.
Drost, F. C. and B. J. M. Werker (1994). Closing the GARCH gap: Continuous time GARCH modelling. Discussion Paper, CentER, University of Tilburg.
Duan, J. C. (1994). Maximum likelihood estimation using price data of the derivative contract. Math. Finance 4, 155-167.
Duan, J. C. (1995). The GARCH option pricing model. Math. Finance 5, 13-32.
Duffie, D. (1989). Futures Markets. Prentice-Hall International Editions.
Duffie, D. (1992). Dynamic Asset Pricing Theory. Princeton University Press.
Duffie, D. and K. J. Singleton (1993). Simulated moments estimation of Markov models of asset prices. Econometrica 61, 929-952.
Dunsmuir, W. (1979). A central limit theorem for parameter estimation in stationary vector time series and its applications to models for a signal observed with noise. Ann. Statist. 7, 490-506.
Easley, D. and M. O'Hara (1992). Time and the process of security price adjustment. J. Finance 47, 577-605.
Ederington, L. H. and J. H. Lee (1993). How markets process information: News releases and volatility. J. Finance 48, 1161-1192.
Elsheimer, B., M. Fisher, D. Nychka and D. Zirvos (1995). Smoothing splines estimates of the discount function based on US bond prices. Discussion Paper, Federal Reserve, Washington, D.C.
Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987-1007.
Engle, R. F. and C. W. J. Granger (1987). Cointegration and error correction: Representation, estimation and testing. Econometrica 55, 251-276.
Engle, R. F. and S. Kozicki (1993). Testing for common features. J. Business Econom. Statist. 11, 369-379.
Engle, R. F. and G. G. J. Lee (1994). Estimating diffusion models of stochastic volatility. Discussion Paper, University of California at San Diego.
Engle, R. F. and C. Mustafa (1992). Implied ARCH models from option prices. J. Econometrics 52, 289-311.
Engle, R. F. and V. K. Ng (1993). Measuring and testing the impact of news on volatility. J. Finance 48, 1749-1801.
Fama, E. F. (1963). Mandelbrot and the stable Paretian distribution. J. Business 36, 420-429.
Fama, E. F. (1965). The behavior of stock market prices. J. Business 38, 34-105.
Foster, D. and S. Viswanathan (1993a). The effect of public information and competition on trading volume and price volatility. Rev. Financ. Stud. 6, 23-56.
Foster, D. and S. Viswanathan (1993b). Can speculative trading explain the volume volatility relation. Discussion Paper, Fuqua School of Business, Duke University.
French, K. and R. Roll (1986). Stock return variances: The arrival of information and the reaction of traders. J. Financ. Econom. 17, 5-26.
Gallant, A. R., D. A. Hsieh and G. Tauchen (1994). Estimation of stochastic volatility models with suggestive diagnostics. Discussion Paper, Duke University.
Gallant, A. R., P. E. Rossi and G. Tauchen (1992). Stock prices and volume. Rev. Financ. Stud. 5, 199-242.
Gallant, A. R., P. E. Rossi and G. Tauchen (1993). Nonlinear dynamic structures. Econometrica 61, 871-907.
Gallant, A. R. and G. Tauchen (1989). Semiparametric estimation of conditionally constrained heterogeneous processes: Asset pricing applications. Econometrica 57, 1091-1120.
Gallant, A. R. and G. Tauchen (1992). A nonparametric approach to nonlinear time series analysis: Estimation and simulation. In: E. Parzen, D. Brillinger, M. Rosenblatt, M. Taqqu, J. Geweke and P. Caines eds., New Dimensions in Time Series Analysis. Springer-Verlag, New York.
Gallant, A. R. and G. Tauchen (1994). Which moments to match. Econometric Theory, to appear.
Gallant, A. R. and G. Tauchen (1995). Estimation of continuous time models for stock returns and interest rates. Discussion Paper, Duke University.
Gallant, A. R. and H. White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Garcia, R. and E. Renault (1995). Risk aversion, intertemporal substitution and option pricing. Discussion Paper, CIRANO.
Geweke, J. (1994). Comment on Jacquier, Polson and Rossi. J. Business Econom. Statist. 12, 397-399.
Geweke, J. (1995). Monte Carlo simulation and numerical integration. In: H. Amman, D. Kendrick and J. Rust eds., Handbook of Computational Economics. North Holland.
Ghysels, E., C. Gouriéroux and J. Jasiak (1995a). Market time and asset price movements: Theory and estimation. Discussion Paper, CIRANO and C.R.D.E., Université de Montréal.
Ghysels, E., C. Gouriéroux and J. Jasiak (1995b). Trading patterns, time deformation and stochastic volatility in foreign exchange markets. Paper presented at the HFDF Conference, Zürich.
Ghysels, E. and J. Jasiak (1994a). Comments on Bayesian analysis of stochastic volatility models. J. Business Econom. Statist. 12, 399-401.
Ghysels, E. and J. Jasiak (1994b). Stochastic volatility and time deformation: An application of trading volume and leverage effects. Paper presented at the Western Finance Association Meetings, Santa Fe.
Ghysels, E., L. Khalaf and C. Vodounou (1994). Simulation based inference in moving average models. Discussion Paper, CIRANO and C.R.D.E.
Ghysels, E., H. S. Lee and P. Siklos (1993). On the (mis)specification of seasonality and its consequences: An empirical investigation with U.S. Data. Empirical Econom. 18, 747-760.
Goodhart, C. A. E. and M. O'Hara (1995). High frequency data in financial markets: Issues and applications. Paper presented at HFDF Conference, Zürich.
Gouriéroux, C. and A. Monfort (1993a). Simulation based inference: A survey with special reference to panel data models. J. Econometrics 59, 5-33.
Gouriéroux, C. and A. Monfort (1993b). Pseudo-likelihood methods. In: Maddala et al. eds., Handbook of Statistics, Vol. 11. North Holland, Amsterdam.
Gouriéroux, C. and A. Monfort (1994). Indirect inference for stochastic differential equations. Discussion Paper, CREST, Paris.
Gouriéroux, C. and A. Monfort (1995). Simulation-Based Econometric Methods. CORE Lecture Series, Louvain-la-Neuve.
Gouriéroux, C., A. Monfort and E. Renault (1993). Indirect inference. J. Appl. Econometrics 8, S85-S118.
Gouriéroux, C., A. Monfort and C. Tenreiro (1994). Kernel M-estimators: Nonparametric diagnostics for structural models. Discussion Paper, CEPREMAP.
Gouriéroux, C., A. Monfort and C. Tenreiro (1995). Kernel M-estimators and functional residual plots. Discussion Paper, CREST-ENSAE, Paris.
Gouriéroux, C., E. Renault and N. Touzi (1994). Calibration by simulation for small sample bias correction. Discussion Paper, CREST.
Gouriéroux, C. and O. Scaillet (1994). Estimation of the term structure from bond data. J. Emp. Finance, to appear.
Granger, C. W. J. and Z. Ding (1994). Stylized facts on the temporal and distributional properties of daily data for speculative markets. Discussion Paper, University of California, San Diego.
Hall, A. R. (1993). Some aspects of generalized method of moments estimation. In: Maddala et al. eds., Handbook of Statistics, Vol. 11. North Holland, Amsterdam.
Hamao, Y., R. W. Masulis and V. K. Ng (1990). Correlations in price changes and volatility across international stock markets. Rev. Financ. Stud. 3, 281-307.
Hansen, B. E. (1995). Regression with nonstationary volatility. Econometrica 63, 1113-1132.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029-1054.
Hansen, L. P. and J. A. Scheinkman (1995). Back to the future: Generating moment implications for continuous-time Markov processes. Econometrica 63, 767-804.
Harris, L. (1986). A transaction data study of weekly and intradaily patterns in stock returns. J. Financ. Econom. 16, 99-117.
Harrison, M. and D. Kreps (1979). Martingale and arbitrage in multiperiod securities markets. J. Econom. Theory 20, 381-408.
Harrison, J. M. and S. Pliska (1981). Martingales and stochastic integrals in the theory of continuous trading. Stochastic Processes and Their Applications 11, 215-260.
Harrison, P. J. and C. F. Stevens (1976). Bayesian forecasting (with discussion). J. Roy. Statist. Soc., Ser. B, 38, 205-247.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Harvey, A. C. and A. Jaeger (1993). Detrending, stylized facts and the business cycle. J. Appl. Econometrics 8, 231-247.
Harvey, A. C. (1993). Long memory in stochastic volatility. Discussion Paper, London School of Economics.
Harvey, A. C. and S. J. Koopman (1993). Forecasting hourly electricity demand using time-varying splines. J. Amer. Statist. Assoc. 88, 1228-1236.
Harvey, A. C., E. Ruiz and E. Sentana (1992). Unobserved component time series models with ARCH disturbances. J. Econometrics 52, 129-158.
Harvey, A. C., E. Ruiz and N. Shephard (1994). Multivariate stochastic variance models. Rev. Econom. Stud. 61, 247-264.
Harvey, A. C. and N. Shephard (1993). Estimation and testing of stochastic variance models. STICERD Econometrics Discussion Paper EM93/268, London School of Economics.
Harvey, A. C. and N. Shephard (1996). Estimation of an asymmetric stochastic volatility model for asset returns. J. Business Econom. Statist., to appear.
Harvey, C. R. and R. D. Huang (1991). Volatility in the foreign currency futures market. Rev. Financ. Stud. 4, 543-569.
Harvey, C. R. and R. D. Huang (1992). Information trading and fixed income volatility. Discussion Paper, Duke University.
Harvey, C. R. and R. E. Whaley (1992). Market volatility prediction and the efficiency of the S&P 100 index option market. J. Financ. Econom. 31, 43-74.
Hausman, J. A. and A. W. Lo (1991). An ordered probit analysis of transaction stock prices. Discussion Paper, Wharton School, University of Pennsylvania.
He, H. (1993). Option prices with stochastic volatilities: An equilibrium analysis. Discussion Paper, University of California, Berkeley.
Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. Rev. Financ. Stud. 6, 327-343.
Heynen, R., A. Kemna and T. Vorst (1994). Analysis of the term structure of implied volatility. J. Financ. Quant. Anal.
Hull, J. (1993). Options, Futures and Other Derivative Securities. 2nd ed. Prentice-Hall International Editions, New Jersey.
Hull, J. (1995). Introduction to Futures and Options Markets. 2nd ed. Prentice-Hall, Englewood Cliffs, New Jersey.
Hull, J. and A. White (1987). The pricing of options on assets with stochastic volatilities. J. Finance 42, 281-300.
Huffman, G. W. (1987). A dynamic equilibrium model of asset prices and transactions volume. J. Politic. Econom. 95, 138-159.
Hutchinson, J. M., A. W. Lo and T. Poggio (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. J. Finance 49, 851-890.
Jacquier, E., N. G. Polson and P. E. Rossi (1994). Bayesian analysis of stochastic volatility models (with discussion). J. Business Econom. Statist. 12, 371-417.
Jacquier, E., N. G. Polson and P. E. Rossi (1995a). Multivariate and prior distributions for stochastic volatility models. Discussion Paper, CIRANO.
Jacquier, E., N. G. Polson and P. E. Rossi (1995b). Stochastic volatility: Univariate and multivariate extensions. Rodney White Center for Financial Research, Working Paper 1995, The Wharton School, University of Pennsylvania.
Jacquier, E., N. G. Polson and P. E. Rossi (1995c). Efficient option pricing under stochastic volatility. Manuscript, The Wharton School, University of Pennsylvania.
Jarrow, R. and Rudd (1983). Option Pricing. Irwin, Homewood, Ill.
Johnson, H. and D. Shanno (1987). Option pricing when the variance is changing. J. Financ. Quant. Anal. 22, 143-152.
Stochastic volatility
189
Jorion, P. (1995). Predicting volatility in the foreign exchange market. J. Finance 50, to appear. Karatzas, l. and S. E. Shreve (1988). Brownian Motion and Stochastic Calculus. SpringerVerlag: New York, NY. Karpoff, J. (1987). The relation between price changes and trading volume: A survey. J. Financ. Quant. Anal. 22, 109126. Kim, S. and N. Shephard (1994). Stochastic volatility: Optimal likelihood inference and comparison with ARCH Model. Discussion Paper, Nuffield College, Oxford. King, M., E. Sentana and S. Wadhwani (1994). Volatility and links between national stock markets. Econometrica 62, 901934. Kitagawa, G. (1987). NonGaussian state space modeling of nonstationary time series (with discussion). J. Amer. Statist. Assoc. 79, 378389. Kloeden, P. E. and E. Platten (1992). Numerical Solutions of Stochastic Differential Equations. SpringerVerlag, Heidelberg. Lamoureux, C. and W. Lastrapes (1990). Heteroskedasticity in stock return data: Volume versus GARCH effect. J. Finance 45, 221229. Lamoureux, C. and W. Lastrapes (1993). Forecasting stockreturn variance: Towards an understanding of stochastic implied volatilities. Rev. Financ. Stud. 6, 293326. Latane, H. and R. Jr. Rendleman (1976). Standard deviations of stock price ratios implied in option prices. J. Finance 31, 369381. Lezan, G., E. Renault and T. deVitry (1995) Forecasting foreign exchange risk. Paper presented at 7th World Congres of the Econometric Society, Tokyo. Lin, W. L., R. F. Engle and T. Ito (1994). Do bulls and bears move across borders? International transmission of stock returns and volatility as the world turns. Rev. Financ. Stud., to appear. Lo, A. W. (1995). Statistical inference for technical analysis via nonparametric estimation. Discussion Paper, MIT. Mahieu, R. and P. Schotrnan (1994a). Stochastic volatility and the distribution of exchange rate news. Discussion Paper, University of Limburg. Mahieu, R. and P. Schotman (1994b). Neglected common factors in exchange rate volatility. J. 
Emp. Finance 1, 279 311. Mandelbrot, B. B. (1963). The variation of certain speculative prices. J. Business 36, 394416. Mandelbrot, B. and H. Taylor (1967). On the distribution of stock prices differences. Oper. Res. 15, 10571062. Mandelbrot, B. B. and J.W. Van Ness (1968). Fractal Brownian motions, fractional noises and applications. S l A M Rev. 1O, 422437. McFadden, D. (1989). A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica 57, 10271057. Meddahi, N. and E. Renault (1995). Aggregations and marginalisations of GARCH and stochastic volatility models. Discussion Paper, GREMAQ. Melino, A. and M. Turnbull (1990). Pricing foreign currency options with stochastic volatility. J. Econometrics 45, 239265. Melino, A. (1994). Estimation of continuous time models in finance. In: C.A. Sims ed., Advances in Econometrics (Cambridge University Press). Merton, R. C. (1973). Rational theory of option pricing. Bell J. Econom. Mgmt. Sci. 4, 141183. Merton, R. C. (1976). Option pricing when underlying stock returns are discontinuous. J. Pinanc. Econom. 3, 125144. Merton, R. C. (1990). Continuous Time Finance. Basil Blackwell, Oxford. Merville, L. J. and D. R. Pieptea (1989). Stockprice volatility, meanreverting diffusion, and noise. J. Financ. Econom. 242, 193214. Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller (1954). Equation of state calculations by fast computing machines. J. Chem. Physics 21, 10871092. Miiller, U. A., M. M. Dacorogna, R. B. Olsen, W. V. Pictet, M. Schwarz and C. Morgenegg (1990). Statistical study of foreign exchange rates. Empirical evidence of a price change scaling law and intraday analysis. J. Banking Finance 14, 11891208.
190
Nelson, D. B. (1988). Time series behavior of stock market volatility and returns. Ph.D. dissertation, MIT. Nelson, D. B. (1990). ARCH models as diffusion approximations. J. Econometrics 45, 739. Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: A new approach. Econometrica 59, 347370. Nelson, D. B. (1992). Filtering and forecasting with misspecified ARCH Models I: Getting the right variance with the wrong model. J. Econometrics 25, 6190. Nelson, D. B. (1994). Comment on Jacquier, Poison and Rossi. J. Business Eeonom. Statist. 12, 403 406. Nelson, D. B. (1995a). Asymptotic smoothing theory for ARCH Models. Econometrica, to appear. Nelson, D. B. (1995b). Asymptotic filtering theory for multivariate ARCH models. J. Econometrics, to appear. Nelson, D. B. and D. P. Foster (1994). Asymptotic filtering theory for univariate ARCH models. Econometrica 62, 141. Nelson, D. B. and D. P. Foster (1995). Filtering and forecasting with misspecified ARCH models II: Making the right forecast with the wrong model. J. Econometrics, to appear. Noh, J., R. F. Engle and A. Kane (1994). Forecasting volatility and option pricing of the S&P 500 index. J. Derivatives, 1730. Ogaki, M. (1993). Generalized method of moments: Econometric applications. In: Maddala et al. ed., Handbook of Statistics Vol. 11, North Holland, Amsterdam. Pagan, A. R. and G. W. Schwert (1990). Alternative models for conditional stock volatility. J. Econometrics 45, 267290. Pakes, A. and D. Pollard (1989). Simulation and the asymptotics of optimization estimators. Eeonometrica 57, 9951026. Pardoux, E. and D. Talay (1985). Discretization and simulation of stochastic differential equations. Acta AppL Math. 3, 2347. Pastorello, S., E. Renault and N. Touzi (1993). Statistical inference for random variance option pricing. Discussion Paper, CREST. Patell, J. M. and M. A. Wolfson (1981). The exante and expost price effects of quarterly earnings announcement reflected in option and stock price. J. Account. 
Res. 19, 434458. Patell, J. M. and M. A. Wolfson (1979). Anticipated information releases reflected in call option prices. J. Account. Econom. 1, 117140. Pham, H. and N. Touzi (1993). Intertemporal equilibrium risk premia in a stochastic volatility model. Math. Finance, to appear. Platten, E. and Schweizer (1995). On smile and skewness. Discussion Paper, Australian National University, Canberra. Poterba, J. and L. Summers (1986). The persistence of volatility and stock market fluctuations. Amer. Eeonom. Rev. 76, 11421151. Renault, E. (1995). Econometric models of option pricing errors. Invited Lecture presented at 7th W.C.E.S., Tokyo, August. Renault, E. and N. Touzi (1992). Option hedging and implicit volatility. Math. Finance, to appear. Revuz, A. and M. Yor (1991). Continuous Martingales and Brownian Motion. SpringerVerlag, Berlin. Robinson, P. (1993). Efficient tests of nonstationary hypotheses. Mimeo, London School of Economics. Rogers, L. C. G. (1995). Arbitrage with fractional Brownian motion. University of Bath, Discussion paper. Rubinstein, M. (1985). Nonparametric tests of alternative option pricing models using all reported trades and quotes on the 30 most active CBOE option classes from August 23, 1976 through August 31, 1978. J, Finance 40, 455480. Ruiz, E. (1994). Quasimaximum likelihood estimation of stochastic volatility models. J. Econometrics 63, 289306. Schwert, G. W. (1989). Business cycles, financial crises, and stock volatility. Carneg&Rochester Conference Series on Public Policy 39, 83126.
Stochastic volatility
191
Scott, L. O. (1987). Option pricing when the variance changes randomly: Theory, estimation and an application. J. Finane. Quant. Anal. 22, 419438. Scott, L. (1991). Random variance option pricing. Advances in Futures and Options Research, Vol. 5, 113135. Sheikh, A. M. (1993). The behavior of volatility expectations and their effects on expected returns. J. Business 66, 93116. Shephard, N. (1995). Statistical aspect of ARCH and stochastic volatility. Discussion Paper 1994, Nuffield College, Oxford University. Sims, A. (1984). Martingalelike behavior of prices. University of Minnesota. Sowell, F. (1992). Maximum likelihood estimation of stationary univariate fractionally integrated time series models. J. Econometrics 53, 165188. Stein, J. (1989): Overreactions in the options market. J. Finance 44, 10111023. Stein, E. M. and J. Stein (1991). Stock price distributions with stochastic volatility: An analytic approach. Rev. Financ. Stud. 4, 727752. Stock, J. H. (1988). Estimating continuous time processes subject to time deformation. J. Amer. Statist. Assoc. 83, 7784. Strook, D. W. and S. R. S. Varadhan (1979). Multidimensional Diffusion Processes. SpringerVerlag, Heidelberg. Tanner, T. and W. Wong (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82, 52~549. Tauchen, G. (1995). New minimum chisquare methods in empirical finance. Invited Paper presented at the 7th World Congress of the Econometric Society, Tokyo. Tauchen, G. and M. Pitts (1983). The price variabilityvolume relationship on speculative markets. Econometrica 51,485505. Taylor, S. J. (1986). Modeling Financial Time Series. John Wiley: Chichester. Taylor, S. J. (1994). Modeling stochastic volatility: A review and comparative study. Math. Finance 4, 183204. Taylor, S. J. and X. Xu (1994). The term structure of volatility implied by foreign exchange options. J. Finane. Quant Anal. 29, 5774. Taylor, S. J. and X. Xu (1993). 
The magnitude of implied volatility smiles: Theory and empirical evidence for exchange rates. Discussion Paper, University of Warwick. Von Furstenberg, G. M. and B. Nam Jeon (1989). International stock price movements: Links and messages. Brookings Papers on Economic Activity 1,125180. Wang, J. (1993). A model of competitive stock trading volume. Discussion Paper, MIT. Watanabe, T. (1993). The time series properties of returns, volatility and trading volume in financial markets. Ph.D. Thesis, Department of Economics, Yale University. West, M. and J. Harrison (1990). Bayesian Forecasting and Dynamic Models. SpringerVerlag, Berlin. Whaley, R. E. (1982). Valuation of American call options on dividendpaying stocks. J. Financ. Econom. 10, 2958. Wiggins, J. B. (1987). Option values under stochastic volatility: Theory and empirical estimates. J. Financ. Econom. 19, 351372. Wood, R. T. McInish and J. K. Ord (1985). An investigation of transaction data for NYSE Stocks. J. Finance 40, 723739. Wooldridge, J. M. (1994). Estimation and inference for dependent processes. In: R.F. Engle and D. McFadden eds., Handbook of Econometrics Vol. 4. North Holland, Amsterdam.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14. © 1996 Elsevier Science B.V. All rights reserved.
Stephen F. LeRoy
1. Introduction
In the early days of the efficient capital markets literature, discourse between finance academics and practitioners was characterized by mutual incomprehension. Academics held that security prices were governed exclusively by their prospective payoffs; in fact, the former equaled the discounted expected value of the latter. Practitioners, on the other hand, made no secret of their opinion that only naive academics could take the present value relation seriously as a theory of asset pricing: everyone knows that traders routinely ignore cash flows, and that large price changes often occur in the complete absence of news about future cash flows. Academics, at least since Samuelson's (1965) paper, responded that rejection of the present value relation implies the existence of profitable trading rules. Given that no one appeared to be identifying a trading rule that significantly outperforms buy-and-hold, academics saw no grounds for rejecting the present-value relation.
Prior to the 1980's, empirical tests of market efficiency were conducted on the home court of the academics: one searched for evidence of return predictability; failing to find it, one concluded in favor of market efficiency. The variance-bounds tests introduced by Shiller (1981) and LeRoy and Porter (1981), however, can be interpreted as shifting the locus of the debate from the home court of the academics to that of the practitioners: instead of looking for patterns in returns that are ruled out by market efficiency, one looked for the price patterns that are implied by market efficiency. Specifically, one asked whether security price changes are of about the magnitude one would expect if they were generated exclusively by fundamentals.
The implications of this shift from returns tests to price-level tests were at first difficult to sort out since finding a predictable pattern has opposite interpretations in the two cases: finding that fundamentals predict future security returns argues against market efficiency, whereas finding that fundamentals predict current prices supports market efficiency. In both cases the early evidence suggested that the correlation being sought was not in the data; hence the returns tests accepted market efficiency, whereas the variance-bounds tests rejected efficiency.
To understand the relation between returns and variance-bounds tests of market efficiency, note that the simplest specification of the efficient markets model (applied to stock prices) says that
$$E_t(r_{t+1}) = \rho, \tag{1.1}$$
where $r_t$ is the (gross) rate of return on stock, $\rho$ is a constant greater than one, and $E_t$ denotes mathematical expectation conditional on some information set $I_t$. Equation (1.1) says that no matter what agents' information is, the conditional expected rate of return on stock is $\rho$; past information, such as past realized stock returns, should not be correlated with future returns. Conventional efficiency tests directly investigated this implication. Variance-bounds tests, on the other hand, used the definition of the rate of return,
$$r_{t+1} = \frac{d_{t+1} + p_{t+1}}{p_t}, \tag{1.2}$$
to derive the present-value relation
$$p_t = \beta E_t(d_{t+1} + p_{t+1}), \tag{1.3}$$
where $\beta \equiv 1/\rho$ is the reciprocal of the gross expected return. After successive substitution and application of the law of iterated expectations, (1.3) may be written as
$$p_t = E_t(\beta d_{t+1} + \beta^2 d_{t+2} + \cdots + \beta^{n+1} d_{t+n+1} + \beta^{n+1} p_{t+n+1}). \tag{1.4}$$
Assuming that the limiting condition
$$\lim_{n \to \infty} \beta^{n+1} E_t(p_{t+n+1}) = 0 \tag{1.5}$$
is satisfied, letting $n$ go to infinity in (1.4) gives
$$p_t = E_t(p_t^*), \tag{1.6}$$
where $p_t^*$ is the ex-post rational stock price; i.e., the value stock would have if future dividends were perfectly forecastable:
$$p_t^* \equiv \sum_{n=1}^{\infty} \beta^n d_{t+n}. \tag{1.7}$$
Because the conditional expectation of any random variable is less volatile than that random variable itself, (1.6) implies the variance-bounds inequality
$$V(p_t) \le V(p_t^*). \tag{1.8}$$
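The step from (1.6) to (1.8) can be spelled out in one line of algebra. The sketch below introduces the symbol $u_t$ for the forecast error in the ex-post rational price; this symbol is an illustrative assumption and is not part of the original notation.

```latex
\begin{aligned}
p_t^* &= p_t + u_t, \qquad u_t \equiv p_t^* - E_t(p_t^*), \\
\operatorname{Cov}(p_t, u_t) &= 0
  \quad \text{(since } p_t \text{ is known given } I_t \text{ and } E_t(u_t) = 0\text{)}, \\
V(p_t^*) &= V(p_t) + V(u_t) \;\ge\; V(p_t).
\end{aligned}
```

The inequality is therefore slack by exactly the variance of the forecast error, a quantity on which market efficiency places no restriction.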
Both Shiller and LeRoy-Porter reported reversal of the empirical counterpart to inequality (1.8): prices appear to be more volatile than the upper bound implied by the volatility of dividends under market efficiency.
2. Statistical issues
Several statistical issues must be considered in interpreting the fact that the empirical counterpart of inequality (1.8) is apparently reversed. These are (1) bias in parameter estimation, (2) nuisance parameter problems, and (3) sample variation of parameter estimates. Of these, discussion in the variance-bounds literature has concentrated almost exclusively on bias. However, bias is not a serious problem in the absence of nuisance parameter or sample variability problems since the rejection region can always be modified to allow for bias. In contrast, nuisance parameter problems, which occur whenever the sample distribution of the test statistic is materially affected by a parameter which is unrestricted under the null hypothesis, make it difficult or impossible to set rejection regions so that rejection will occur with prespecified probability if the null hypothesis is true. Therefore they are much more serious. High sample variability in the test statistic is also a serious problem since it diminishes the ability of the test to distinguish between the null and the alternative, therefore reducing the power of the test for given size.
In testing (1.8) one immediately encounters the fact that $p_t^*$ cannot be directly constructed from any finite sample since dividends after the end of the sample are unobservable. The problems of bias, nuisance parameters and sample variability in testing (1.8) take different forms depending on how this problem is addressed. Two methods for estimating $V(p_t^*)$ are available, the model-free estimator used by Shiller and the model-based estimator used by LeRoy-Porter.
The model-free estimator simply replaces the unobservable $p_t^*$ with the expected value of $p_t^*$ conditional on the sample, which is observable. This is given by setting the terminal value $p^*_{T|T}$ of the observable proxy series $p^*_{t|T}$ equal to actual $p_T$:
$$p^*_{T|T} = p_T \tag{2.1}$$
and computing earlier values $p^*_{t|T}$ from the backward recursion
$$p^*_{t|T} = \beta(p^*_{t+1|T} + d_{t+1}), \tag{2.2}$$
which has the required property
$$E(p_t^* \mid p_1, d_1, \ldots, p_T, d_T) = p^*_{t|T} \tag{2.3}$$
(under the assumption that the population value of $\beta$ is used in the discounting). The estimated series is model-free in the sense that its construction requires no assumptions about how dividends are generated, an attractive property.
Using the model-free $p^*_{t|T}$ series to construct $\hat V(p_t^*)$ has several less attractive consequences. Most important, if the model-builder is unwilling to commit to a model for dividends, there is no prospect of evaluating the sample variability of $\hat V(p_t^*)$, rendering construction of confidence intervals impossible. Thus it was no accident that Shiller reported point estimates of $\hat V(p_t)$ and $\hat V(p_t^*)$, but no t-statistics.
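The backward recursion (2.1)-(2.2) is simple enough to sketch in a few lines. The following is a minimal illustration, not Shiller's actual implementation: the function name `ex_post_rational_proxy`, the toy price and dividend figures, and the value of the discount factor are all assumptions made for the example, and the discount factor is treated as known.

```python
from statistics import pvariance


def ex_post_rational_proxy(prices, dividends, beta):
    """Return the proxy series p*_{t|T} built from the recursion (2.1)-(2.2).

    prices[t] and dividends[t] hold p_t and d_t for consecutive dates
    (0-based), and beta is the discount factor, assumed known.
    """
    if not prices or len(prices) != len(dividends):
        raise ValueError("prices and dividends must be non-empty and equally long")
    proxy = [0.0] * len(prices)
    proxy[-1] = prices[-1]                       # (2.1): terminal value is the actual price
    for t in range(len(prices) - 2, -1, -1):     # (2.2): work backward from the end
        proxy[t] = beta * (proxy[t + 1] + dividends[t + 1])
    return proxy


# Hypothetical toy data, for illustration only.
prices = [10.0, 11.0, 9.5, 10.5]
dividends = [0.5, 0.6, 0.4, 0.5]
proxy = ex_post_rational_proxy(prices, dividends, beta=0.95)

# Sample variances that would enter the empirical counterpart of (1.8).
sample_var_p = pvariance(prices)
sample_var_p_star = pvariance(proxy)
```

Nothing in the function refers to a dividends model, which is the sense in which the resulting estimator is model-free; by the same token the code offers no way to attach a standard error to `sample_var_p_star`.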
One can, however, investigate the statistical properties of $\hat V(p_t^*)$ under particular models of dividends, and this has been done. As Flavin (1983) and Kleidon (1986) showed, because of the very high serial correlation of $p^*_{t|T}$, $\hat V(p_t^*)$ is severely biased downward as an estimator of $V(p_t^*)$; see Gilles and LeRoy (1991) for intuitive interpretation. As noted above, this by itself is not a problem since the rejection region can always be modified to offset the effect of bias. However, such modification cannot be implemented without committing to a dividends model, so if one takes this route the advantage of the model-free estimator is forgone. Also, it is known that the model-free estimator $\hat V(p_t^*)$ has higher sample variability than its model-based counterpart, discussed below.
A model-based estimator of $V(p_t^*)$ can be constructed if one is willing to specify a statistical model assumed to generate dividends. For example, suppose dividends are generated by a first-order autoregression:
$$d_{t+1} = \lambda d_t + \varepsilon_{t+1}. \tag{2.4}$$
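The downward bias of the model-free estimator noted above can be illustrated by simulation under a specification such as (2.4). The sketch below rests on several assumptions that are not in the original: dividends have zero mean, investors observe only current and past dividends (so that the price is proportional to the current dividend), the discount factor is known, and the parameter values and function names are arbitrary choices made for illustration.

```python
import random
from statistics import fmean, pvariance


def simulate_dividends(T, lam, sigma, rng, burn_in=200):
    """Draw T observations from the zero-mean AR(1) model (2.4)."""
    d, path = 0.0, []
    for t in range(T + burn_in):
        d = lam * d + rng.gauss(0.0, sigma)
        if t >= burn_in:                 # discard start-up values
            path.append(d)
    return path


def proxy_series(prices, dividends, beta):
    """Model-free proxy p*_{t|T} from the backward recursion (2.1)-(2.2)."""
    proxy = [0.0] * len(prices)
    proxy[-1] = prices[-1]
    for t in range(len(prices) - 2, -1, -1):
        proxy[t] = beta * (proxy[t + 1] + dividends[t + 1])
    return proxy


def population_var_p_star(lam, sigma, beta):
    """V(p*_t) implied by (1.7) and the AR(1) autocovariances of dividends."""
    var_d = sigma ** 2 / (1.0 - lam ** 2)
    return var_d * beta ** 2 / (1.0 - beta ** 2) * (1.0 + beta * lam) / (1.0 - beta * lam)


rng = random.Random(0)
lam, sigma, beta, T = 0.9, 1.0, 0.95, 100       # illustrative values only
a = beta * lam / (1.0 - beta * lam)             # p_t = a * d_t with dividends-only information

estimates = []
for _ in range(300):
    d = simulate_dividends(T, lam, sigma, rng)
    p = [a * x for x in d]
    estimates.append(pvariance(proxy_series(p, d, beta)))

mean_estimate = fmean(estimates)                # average model-free estimate across samples
target = population_var_p_star(lam, sigma, beta)
```

Comparing `mean_estimate` with `target` shows how far the average sample variance of the proxy series falls short of the population variance it is meant to estimate; the size of the shortfall depends on the assumed parameter values and sample length.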
Then an expression for the population value of $V(p_t^*)$ is readily computed as a function of $\lambda$, $\sigma_\varepsilon^2$ and $\beta$, and a model-based estimator $\hat V(p_t^*)$ can be constructed by substituting parameter estimates for their population counterparts. Assuming the dividends model is correctly specified, the model-based estimator has little bias (at least in some settings) and, more important, very low sample variability (LeRoy-Parke (1992)). In the setting of LeRoy-Parke the model-based point estimate of $V(p_t^*)$ is about three times greater than the estimate of $V(p_t)$, suggesting acceptance of (1.8). However, due to the nuisance-parameter problem to be discussed now, this result is not of much importance.
Besides the ambiguities resulting from the various methods of constructing $\hat V(p_t^*)$, an even more basic problem arises from the fact that (1.8) is an inequality rather than an equality. Assuming that the null hypothesis is true, the population value of $V(p_t)$ depends on the magnitude of the error in investors' estimates of future dividends. Therefore the same is true of the volatility parameter $V(p_t^*) - V(p_t)$, the sample counterpart of which constitutes the test statistic. This error variance is not restricted by the assumption of market efficiency, leading to its characterization as a nuisance parameter. In LeRoy-Parke it is argued that this problem is very serious quantitatively: there is no way to set a rejection region for the volatility statistic $\hat V(p_t^*) - \hat V(p_t)$. It is argued there that because of this nuisance parameter problem, directly testing (1.8) is essentially impossible. Since (1.8) is the best-known of the variance-bounds relations, this is not a minor conclusion.
There exist other variance-bounds tests that are better-behaved econometrically than inequality (1.8). To develop these, define $\varepsilon_{t+1}$ as the innovation in stock payoffs:
$$\varepsilon_{t+1} \equiv d_{t+1} + p_{t+1} - E_t(d_{t+1} + p_{t+1}), \tag{2.5}$$
so that the present-value relation (1.3) can be written as
$$p_t = \beta E_t(d_{t+1} + p_{t+1}) = \beta(d_{t+1} + p_{t+1} - \varepsilon_{t+1}). \tag{2.6}$$
Substituting recursively, using the definition (1.7) of $p_t^*$ and assuming convergence, (2.6) becomes
$$p_t^* = p_t + \sum_{i=1}^{\infty} \beta^i \varepsilon_{t+i}, \tag{2.7}$$
so that the difference between $p_t^*$ and $p_t$ is expressible as a weighted sum of payoff innovations. Equation (2.7) implies
$$V(p_t^*) = V(p_t) + \frac{\beta^2}{1 - \beta^2} V(\varepsilon_t). \tag{2.8}$$
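Relation (2.8) follows from (2.7) because payoff innovations are serially uncorrelated and uncorrelated with $p_t$, so the variance of the weighted sum is a geometric series in $\beta^2$. It also gives a convenient route to the population variances under a fully specified dividends model. The sketch below assumes zero-mean first-order autoregressive dividends as in (2.4) and investors who observe only current and past dividends; the function names and parameter values are hypothetical. As a cross-check, it compares the result with the variance obtained by summing (1.7) directly.

```python
def price_variances_ar1(lam, sigma, beta):
    """Population V(p_t), V(payoff innovation) and V(p*_t) via (2.8).

    Assumes d_{t+1} = lam * d_t + shock with shock variance sigma**2, and
    prices set on current and past dividends only, so p_t = a * d_t.
    """
    if not (abs(lam) < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("requires |lam| < 1 and 0 < beta < 1")
    a = beta * lam / (1.0 - beta * lam)
    var_d = sigma ** 2 / (1.0 - lam ** 2)
    var_p = a ** 2 * var_d
    # Payoff innovation (2.5): d + p - E(d + p) = (1 + a) * dividend shock.
    var_innovation = ((1.0 + a) * sigma) ** 2
    var_p_star = var_p + beta ** 2 / (1.0 - beta ** 2) * var_innovation   # (2.8)
    return var_p, var_innovation, var_p_star


def var_p_star_direct(lam, sigma, beta):
    """V(p*_t) summed directly from (1.7) using the AR(1) autocovariances."""
    var_d = sigma ** 2 / (1.0 - lam ** 2)
    return var_d * beta ** 2 / (1.0 - beta ** 2) * (1.0 + beta * lam) / (1.0 - beta * lam)


# Illustrative parameter values only.
var_p, var_innovation, var_p_star = price_variances_ar1(lam=0.9, sigma=1.0, beta=0.95)
check = var_p_star_direct(lam=0.9, sigma=1.0, beta=0.95)
```

The gap `var_p_star - var_p` is proportional to the innovation variance, which in turn depends on what investors are assumed to know; this is the nuisance parameter discussed above.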
Put this result aside for the moment.
The upper bound for price volatility is derived by considering the volatility of a hypothetical price series that would obtain if investors had perfect information about future dividends. LeRoy-Porter also showed that a lower bound on price volatility could be derived if one was willing to specify that investors have at least some minimal information about future dividends. Suppose that one assumes that investors know at least current and past dividends; they may or may not have access to other variables that predict future dividends. Let $\hat p_t$ denote the stock price that would prevail under this minimal information specification:
$$\hat p_t = E(p_t^* \mid d_t, d_{t-1}, d_{t-2}, \ldots). \tag{2.9}$$
Then because $I_t$ is a refinement of the information partition induced by $d_t, d_{t-1}, d_{t-2}, \ldots$, we have
$$\hat p_t = E([E(p_t^* \mid I_t)] \mid d_t, d_{t-1}, d_{t-2}, \ldots) \tag{2.10}$$
by the law of iterated expectations, or
$$\hat p_t = E(p_t \mid d_t, d_{t-1}, d_{t-2}, \ldots), \tag{2.11}$$
using (1.6). Therefore, by exactly the same reasoning used to derive (1.8), we obtain
$$V(\hat p_t) \le V(p_t), \tag{2.12}$$
so the variance of $\hat p_t$ is a lower bound for the variance of $p_t$. This lower bound is without direct empirical interest since no one has seriously suggested that stock prices are less volatile than is implied by the present-value model under the assumption that investors know current and past dividends. However, the lower bound may be put to a more interesting use. By defining $\hat\varepsilon_{t+1}$ as the payoff innovation under the information set generated by $d_t, d_{t-1}, d_{t-2}, \ldots$,
$$\hat\varepsilon_{t+1} \equiv d_{t+1} + \hat p_{t+1} - E(d_{t+1} + \hat p_{t+1} \mid d_t, d_{t-1}, d_{t-2}, \ldots), \tag{2.13}$$
we derive
$$p_t^* = \hat p_t + \sum_{i=1}^{\infty} \beta^i \hat\varepsilon_{t+i} \tag{2.14}$$
and
$$V(p_t^*) = V(\hat p_t) + \frac{\beta^2}{1 - \beta^2} V(\hat\varepsilon_t). \tag{2.15}$$
Equations (2.8) and (2.15) plus the lower bound inequality (2.12) imply
$$V(\hat\varepsilon_t) \ge V(\varepsilon_t). \tag{2.16}$$
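The algebra behind (2.16) is short. The sketch below takes the coefficient on the innovation variance in both (2.8) and (2.15) to be $\beta^2/(1-\beta^2)$, as follows from the weighted sums of innovations when those innovations are serially uncorrelated:

```latex
\begin{aligned}
V(p_t^*) &= V(p_t) + \frac{\beta^2}{1-\beta^2}\, V(\varepsilon_t)
  && \text{(2.8)} \\
V(p_t^*) &= V(\hat p_t) + \frac{\beta^2}{1-\beta^2}\, V(\hat\varepsilon_t)
  && \text{(2.15)} \\
\frac{\beta^2}{1-\beta^2}\,\bigl[V(\hat\varepsilon_t) - V(\varepsilon_t)\bigr]
  &= V(p_t) - V(\hat p_t) \;\ge\; 0
  && \text{by (2.12)}
\end{aligned}
```

Since $0 < \beta < 1$ the coefficient is positive, so the sign of the difference in innovation variances is the sign of the difference in price variances.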
Thus the present-value relation implies not just that prices are less volatile than they would be if investors had perfect information, but also that net one-period payoffs are less volatile than they would be if investors had less information than they (by assumption) do. To test (2.16), one simply fits a univariate time-series model to dividends and uses it to compute $\hat V(\hat\varepsilon_t)$, while $\hat V(\varepsilon_t)$ is just the estimated residual variance in the regression
$$d_t + p_t = \beta^{-1} p_{t-1} + \varepsilon_t. \tag{2.17}$$
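A stylized version of the comparison in (2.16) can be sketched on simulated data. This is an illustration under strong assumptions rather than the procedure used in the literature: dividends follow a zero-mean first-order autoregression, investors are assumed to learn each dividend shock one period before it is paid (so that they know more than the dividend history), and all names and parameter values are hypothetical.

```python
import random
from statistics import fmean


def ols_slope(y, x):
    """Least-squares slope of y on x with no intercept."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def market_innovation_variance(prices, dividends):
    """Fit (2.17) by OLS; return the residual variance and the implied beta."""
    payoff = [dividends[t] + prices[t] for t in range(1, len(prices))]
    lagged_price = prices[:-1]
    slope = ols_slope(payoff, lagged_price)            # estimates 1 / beta
    resid = [y - slope * x for y, x in zip(payoff, lagged_price)]
    return fmean([r * r for r in resid]), 1.0 / slope


def dividends_only_innovation_variance(dividends, beta):
    """Model-based variance of the payoff innovation under dividends-only information."""
    lam = ols_slope(dividends[1:], dividends[:-1])      # AR(1) coefficient
    shocks = [dividends[t] - lam * dividends[t - 1] for t in range(1, len(dividends))]
    sigma2 = fmean([s * s for s in shocks])
    # With only dividends known, the price is proportional to d_t and the
    # payoff innovation equals the dividend shock divided by (1 - beta * lam).
    return sigma2 / (1.0 - beta * lam) ** 2


# Hypothetical data-generating process with better-informed investors.
rng = random.Random(1)
lam_true, beta_true, T = 0.5, 0.9, 20000
a = beta_true * lam_true / (1.0 - beta_true * lam_true)
b = beta_true / (1.0 - beta_true * lam_true)
shocks = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]
d, dividends, prices = 0.0, [], []
for t in range(T):
    d = lam_true * d + shocks[t]
    dividends.append(d)
    prices.append(a * d + b * shocks[t + 1])            # next period's shock is already known

v_market, beta_hat = market_innovation_variance(prices, dividends)
v_dividends_only = dividends_only_innovation_variance(dividends, beta_hat)
```

Under the null hypothesis `v_dividends_only` should not fall below `v_market`; a formal test would also require standard errors for the two estimates, which this sketch does not compute.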
This adaptation of LeRoy-Porter's lower bound on price volatility to the formally equivalent (but much more interesting econometrically) upper bound on payoff volatility is due to West (1988). The West test, like Shiller's and LeRoy-Porter's upper bound tests on price volatility, resulted in rejection. West reported statistically significant rejection (as noted, Shiller did not compute confidence intervals, while LeRoy-Porter's rejections were only of borderline statistical significance). Generally, the West test is free of the most serious econometric problems that beset the price bounds tests. Most important, under the null hypothesis payoff innovations are serially uncorrelated, so sample means yield good estimates of population means (recall that model-free tests of price volatility are subject to the problem that $p_t$ and $p_t^*$ are highly serially correlated). Further, the associated t-statistics can be used to compute rejection regions. Finally, there is no need to specify investors' information since a model-free estimate of $V(\varepsilon_t)$ is used, implying that the nuisance parameter problem that occurs under model-based price bounds tests does not appear here.
3. Dividend-smoothing and nonstationarity
One objection sometimes raised against the variance-bounds tests is that corporate managers smooth dividends. That being the case, and because the ex-post rational stock price is in turn a highly smoothed average of dividends, it is argued that we should not be surprised that actual stock prices are choppier than ex-post rational prices. This point was raised most forcefully by Marsh and Merton (1983, 1986).1 Marsh-Merton asserted that the variance-bounds theorems require for their derivation the assumption that dividends are exogenous, and also that the resulting series is stationary. If these assumptions are not satisfied the variance-bounds theorems are reversed. To prove this, Marsh-Merton (1986) assumed that managers set dividends as a distributed lag on past stock prices:
$$d_t = \sum_{i=1}^{N} \lambda_i p_{t-i}. \tag{3.1}$$
1 This discussion is drawn from the 1988 version of Gilles-LeRoy (1991), available from the author. Discussion of Marsh-Merton was deleted from the published version of that paper in response to an editor's request.
Further, from (1.7) the ex-post rational stock price can be written as
$$p_t^* = \sum_{i=1}^{T-t} \beta^i d_{t+i} + \beta^{T-t} p_T^*. \tag{3.2}$$
Finally, Marsh-Merton took the terminal ex-post rational stock price to be given by the sample average stock price:
$$p_T^* = \frac{\sum_{t=1}^{T} p_t}{T}. \tag{3.3}$$
Substituting (3.1) and (3.3) into (3.2), it is seen that $p_t^*$ is expressible as a weighted average of the in-sample $p_t$'s. Using this result, Marsh-Merton proved that in every sample $p_t^*$ has lower variance than $p_t$, just the opposite of the variance-bounds theorem.
Questions emerge about Marsh-Merton's assertion that the variance-bounds inequality is reversed if managers smooth dividends. The most important question arises from the fact that none of the rigorous derivations of the variance-bounds theorems available in the literature make use, explicitly or implicitly, of any assumption of exogeneity or stationarity: instead, the theorems depend only on the fact that the conditional expectation of a random variable is less volatile than the random variable itself. How, then, does dividend smoothing reverse the variance-bounds theorem?
It turns out that Marsh-Merton are not in fact asserting that the variance-bounds theorems are incorrect, but only that in the setting they specify the sample counterparts of the variances of $p_t$ and $p_t^*$ reverse the population inequality; Marsh-Merton's failure to use notation that distinguishes population from sample moments renders careful reading of their paper needlessly difficult. Marsh-Merton's dividend specification implies that dividends and prices are necessarily nonstationary (this is proved explicitly in Shiller's (1986) comment on Marsh-Merton). Sample moments cannot be expected to satisfy the same inequalities as population moments if the latter are infinite (or time-varying, depending on the interpretation). In nonstationary populations, in fact, there is essentially no relation between population moments and the corresponding sample moments.2 Indeed, the very idea that there is a correspondence between
2 Gilles-LeRoy (1991) set out an example, adapted from Kleidon (1986), in which the martingale convergence theorem implies that the sample counterpart of the variance-bounds inequality is reversed with arbitrarily high probability in arbitrarily long samples despite being true at each date in the population. As with Marsh-Merton, nonstationarity is the culprit.
sample and population moments in time-series analysis derives its meaning from the analysis of stationary series. Thus there is no inconsistency whatever between the assertion that the population variance-bounds inequality is satisfied at every date, as it is in Marsh-Merton's model, and Marsh-Merton's demonstration that under their specification its sample counterpart is reversed for every possible sample.
What Marsh-Merton's example demonstrates is that if one uses analytical methods appropriate under stationarity when the data under investigation are nonstationary, one can be misled. Thus formulated, Marsh-Merton's conclusion is surely correct. The logical implication is that if one wishes to make progress with the analysis of stock price volatility, one should go on to formulate statistical procedures that are appropriate in the nonstationary setting they assume. Marsh-Merton did not do so, and no easy extension of their model would have allowed them to take this next step. The reason is that Marsh-Merton's model does not contain any specification of what exogenous variables drive their model; the only behavior they model is managers' response to stock prices, treated as exogenous, in setting dividends.
Marsh-Merton made two criticisms of the variance-bounds tests: (1) that they depend on the assumption that dividends are stationary, and (2) that they depend on the assumption that dividends are exogenous, as opposed to being smoothed by managers (this second criticism is especially prominent in Marsh-Merton's unpublished paper (1983) dealing with LeRoy-Porter (1981)). Marsh-Merton treated the two points as interchangeable, so that exogeneity was taken to imply stationarity, and dividend-smoothing nonstationarity. In fact dividend exogeneity neither implies nor is implied by stationarity, and the variance-bounds theorems require neither one, as we saw above.
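The sample reversal that Marsh-Merton describe can be reproduced in a simple special case. The sketch below is not their general specification: it assumes a single lag in (3.1), with dividends equal to a fixed fraction of last period's price, and it sets that fraction to $1/\beta - 1$ so that the weights on in-sample prices sum to one. The function name, the random-walk price series and the parameter values are hypothetical.

```python
import random
from statistics import fmean, pvariance


def marsh_merton_ex_post_price(prices, beta):
    """Ex-post rational prices from (3.2)-(3.3) when d_t = lam * p_{t-1}.

    The payout coefficient lam = 1/beta - 1 is an assumption of this sketch.
    """
    T = len(prices)
    lam = 1.0 / beta - 1.0
    p_star = [0.0] * T
    p_star[-1] = fmean(prices)                    # (3.3): terminal value is the sample mean
    for t in range(T - 2, -1, -1):
        next_dividend = lam * prices[t]           # one-lag case of (3.1)
        p_star[t] = beta * (next_dividend + p_star[t + 1])   # recursive form of (3.2)
    return p_star


# Hypothetical nonstationary price series: a driftless random walk.
rng = random.Random(7)
prices, level = [], 100.0
for _ in range(120):
    level += rng.gauss(0.0, 1.0)
    prices.append(level)

p_star = marsh_merton_ex_post_price(prices, beta=0.95)
sample_var_p = pvariance(prices)
sample_var_p_star = pvariance(p_star)
```

In this special case each value of `p_star` equals $(1-\beta)$ times the current price plus $\beta$ times the next value of `p_star`, so the series is a moving weighted average of in-sample prices and its sample variance cannot exceed that of the prices themselves, whatever the sample. Nothing in the calculation refers to population moments, which is the distinction drawn in the text.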
It is true that the specific empirical implementation adopted by Shiller has attractive econometric properties only when dividends are stationary in levels.3 However, whether the analyst chooses to model the dividend-payout decision, as Marsh-Merton did, or directly assigns dividends a probabilistic model, as LeRoy-Porter did, is immaterial: if the assumed dividends model under the latter coincides with the behavior implied for dividends in the former case, the two are equivalent. It follows that any implementation of the variance-bounds tests that accurately characterizes dividend behavior is acceptable, regardless of whether corporate managers are smoothing dividends and regardless of whether such behavior, if occurring, is modeled.
Whether or not Shiller's assumption of trend-stationarity is acceptable has been controversial: many analysts believe that major macroeconomic time series, such as GNP, have a unit root. The debate about trend-stationarity vs. unit roots in macroeconomic time series is not reviewed here, except to note that (1) of all
3 LeRoy-Porter used a trend correction based on reversing the effect of earnings retention that should have resulted in stationary data, but in fact produced series with a downward trend (which explained why their rejections of the variance-bounds theorems were of only marginal statistical significance). The reasons for the failure of LeRoy-Porter's trend correction are unclear.
the major macroeconomic time series, aggregate dividends appear closest to trend-stationarity, and (2) many econometricians believe that it is difficult to distinguish empirically between the trend-stationary and unit-root cases.
Kleidon (1986) showed that if dividends have a unit root, so that dividend shocks have a permanent component, then stock prices should be more volatile than they would be if dividends were stationary. Kleidon expressed the opinion that the evidence of excess volatility reflects nothing more than the nonstationarity of dividends. However, this opinion cannot be sustained. First, the West test is valid if dividends are generated by a linear time-series process with a unit root, so that, if the expected present-value model is correct, dividends and stock prices are cointegrated. West, it is recalled, found significant excess volatility. Other tests, of which Campbell and Shiller (1988) was the first to be published, dealt with dividend nonstationarity by working with the price-dividend ratio instead of price levels. Again the conclusion was that stock prices are excessively volatile. LeRoy-Parke (1992) showed that the variance equality that LeRoy-Porter had used,
$$V(p_t^*) = V(p_t) + \frac{\beta^2}{1 - \beta^2} V(\varepsilon_t), \tag{3.4}$$
has the counterpart
$$V(p_t^*/d_t) = V(p_t/d_t) + \delta V(r_t), \tag{3.5}$$
where $\delta$ is a function of various parameters, under the assumption that all variances of the intensive variables $p_t/d_t$, $p_t^*/d_t$ and $r_t$ remain constant over time (this is the counterpart of the assumption, required to derive (3.4), that variances of extensive variables like $p_t$, $p_t^*$ and $\varepsilon_t$ remain constant over time). LeRoy-Parke also found excess volatility (see also LeRoy and Steigerwald, 1993). Thus the debate about whether dividends are trend-stationary or have a unit root is, from the point of view of the variance-bounds tests, irrelevant: either way, volatility exceeds that predicted by the present-value model.
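Kleidon's point that a unit root in dividends raises the price volatility implied by the present-value model can be seen in a simple special case. The sketch below is an illustration and not Kleidon's own model: it assumes that dividends follow the zero-mean first-order autoregression (2.4) and that investors observe only current and past dividends.

```latex
\begin{aligned}
E_t(d_{t+n}) &= \lambda^n d_t
  \quad\Longrightarrow\quad
  p_t = \sum_{n=1}^{\infty} \beta^n \lambda^n d_t
      = \frac{\beta\lambda}{1-\beta\lambda}\, d_t , \\
\frac{\partial p_t}{\partial d_t} &= \frac{\beta\lambda}{1-\beta\lambda}
  \;\longrightarrow\; \frac{\beta}{1-\beta}
  \quad \text{as } \lambda \to 1 .
\end{aligned}
```

With $\beta = 0.95$, for example, the coefficient is about 5.9 when $\lambda = 0.9$ and 19 when $\lambda = 1$, so a given dividend shock moves prices roughly three times as much in the unit-root case. In that case $p_t - [\beta/(1-\beta)]\,d_t$ is identically zero in this special case, which is the simplest instance of prices and dividends being cointegrated.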
4. Bubbles
These results show that excess volatility occurs under at least some forms of dividend nonstationarity. However, they do not necessarily completely dispose of Marsh-Merton's criticisms; any model-based variance-bounds test requires some specification of the probability law, stationary or nonstationary, assumed to generate dividends, and critics can always question this specification. For example, LeRoy-Parke assumed that dividends follow a geometric random walk, a characterization that appears not to do great violence to the data. However, it may be that the dividend-smoothing behavior of managers results in a less parsimonious model for dividends, in which case LeRoy-Parke's results may reflect nothing more than misspecification of the dividends model.
Two sets of circumstances might invalidate variance-bounds tests based on a particular dividend specification such as the geometric random walk. First, it may be that even data sets as long as a century (the length of Shiller's 1981 data set, which was also used in several of the subsequent variance-bounds papers) are too short to allow accurate estimation of dividend volatility. Regime shift models, for example, require very long data sets for accurate estimation. Alternatively, the stock market may be subject to a "peso problem": investors might attach time-varying probabilities to an event which did not occur in the finite sample. The second circumstance that might invalidate variance-bounds tests is rational speculative bubbles. Thus consider an extreme case of Marsh-Merton's dividend-smoothing behavior: suppose that firms pay some positive (but low) level of dividends that is deterministic.⁴ Thus all fluctuations in earnings show up as additions to (or subtractions from) capital. In this setting the market value of the firm will reflect the value of its capital, which by assumption does not depend on past dividends. Price volatility will obviously exceed the volatility implied by dividends, since the latter is zero, so the variance-bounds theorem is violated. Theoretically, what is happening in this case is that the limiting condition (1.5) is not satisfied, so that stock prices do not equal the limit of the present value of dividends. Models in which (1.5) fails are defined as rational speculative bubbles: prices are higher than the present value of future dividends but, because they are expected to rise still higher, (1.3) is satisfied. Thus insofar as they are suggesting that dividend smoothing invalidates empirical tests of the variance-bounds relations even in infinite samples, Marsh-Merton are asserting the existence of rational speculative bubbles.
Bubbles have received much study in the recent economics literature, partly because of their potential role in resolving the excess volatility puzzle (for theoretical studies of rational bubbles, see Gilles and LeRoy (1992) and the sources cited there; for a summary of the empirical results as they apply to variance bounds, see Flood and Hodrick (1990)). This is not the place for a complete discussion of bubbles; we remark only that the widely held impression that bubbles cannot occur in models incorporating rationality is incorrect. This impression is fostered by the practice of referring incorrectly to (1.5) as a transversality condition (a transversality condition is associated with an optimization problem; no such problem has been specified here), suggesting that its satisfaction is somehow virtually automatic. In fact, (1) there exist well-posed optimization problems that do not have necessary transversality conditions, and (2) transversality conditions, even when necessary for optimization, do not always imply (1.5). Examples are found in Gilles-LeRoy (1992). These examples, it is true, appear recondite. However, recall that the goal here is to explain behavior,
4 This specification conflicts with limited liability, which in conjunction with random earnings implies that firm managers may not be able to commit to paying positive dividends with certainty into the infinite future. This objection, while valid, is extraneous to the present concern, and hence is set aside.
excess volatility, that is itself counterintuitive; given this, we should not readily dismiss out of hand counterintuitive specifications of preferences. If (1.3) is satisfied but (1.5) fails, then the price of stock differs from the expected present value of dividends by a bubble term that satisfies

$$b_{t+1} = (1 + \rho) b_t + \eta_{t+1}, \tag{4.1}$$
so that a bubble is a martingale with drift $\rho$. Since the bubble increases in value at average rate $\rho$, which exceeds the growth rate of dividends (otherwise stock prices would be infinite), stock prices rise more rapidly than dividends. Therefore the dividend-price ratio will decrease over time. Informal examination of a plot of the dividend-price ratio shows no clear downward trend, and the majority of the empirical studies surveyed by Flood-Hodrick (1990) do not find evidence of bubbles. This literature is under rapid development, however, from both the theoretical and empirical sides, and this conclusion may shortly be reversed. For now, however, it is difficult to find support for the contention that firms are smoothing dividends in such a way as to invalidate the stationarity presumed in the variance-bounds tests.
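The implication for the dividend-price ratio can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not in the text: dividends grow deterministically at a rate g below the discount rate, the bubble shock is Gaussian, and all parameter values and function names are hypothetical.

```python
import random


def dividend_price_path(rho=0.06, g=0.02, b0=1.0, shock_sd=0.05,
                        periods=100, seed=0):
    """Return the dividend-price ratio d_t / p_t for each period."""
    rng = random.Random(seed)
    dividend, bubble = 1.0, b0
    ratios = []
    for _ in range(periods):
        # Present value of future dividends growing at g, discounted at rho
        # (an assumed deterministic-growth dividend process).
        fundamental = dividend * (1.0 + g) / (rho - g)
        ratios.append(dividend / (fundamental + bubble))
        dividend *= 1.0 + g
        # Bubble recursion: grows at rate rho on average, plus a shock.
        bubble = (1.0 + rho) * bubble + rng.gauss(0.0, shock_sd)
    return ratios


with_bubble = dividend_price_path()
without_bubble = dividend_price_path(b0=0.0, shock_sd=0.0)
print(with_bubble[0], with_bubble[-1])
print(without_bubble[0], without_bubble[-1])
```

Without a bubble the ratio stays at (rho - g)/(1 + g); with a bubble it drifts down because the bubble term outgrows dividends, which is the pattern the informal examination of the data looks for.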
One possible explanation for the apparent excess volatility of securities prices is that conditionally expected rates of return depend on the values taken on by the conditioning variables, contradicting (1.1). There is no reason, other than a desire for simplification, to adopt the restriction that the conditional expected return on stock is constant over time, as implied by (1.1). If agents are risk averse, one would expect the conditions of equilibrium in asset markets to reflect a risk-return tradeoff, so that (1.1) would be replaced by a term involving the higher moments of return distributions as well as the conditional mean (consider CAPM, for example). Thus equilibrium conditions like (1.1) are best interpreted as obtaining in efficient markets under the additional assumption of risk-neutrality (LeRoy (1973), Lucas (1978)). Further, in simple models in which agents are risk averse, price volatility is likely to exceed that predicted by risk-neutrality. The intuition is simple: under risk aversion agents try to transfer consumption from dates when income and consumption are high to dates when they are low. Decreasing returns in production mean that this transfer is increasingly costly, so security prices must behave in such a way as to penalize agents who make this transfer. If stock prices are high (low) when income is high (low), then agents are motivated to adapt their saving or dissaving to the production technology, as they must in equilibrium. Thus the more risk averse agents are, the more choppy equilibrium stock prices will be (LaCivita and LeRoy (1981), Grossman and Shiller (1981)). This raises the possibility that the apparent volatility is nothing more than an artifact of the misspecification of risk neutrality implicit in (1.1).
A very simple modification of the efficient markets model is seen to be, in principle, sufficient to explain existing price volatility. Providing other explanations subsequently became a minor cottage industry, perhaps because it is so easy to modify the characterization of market efficiency so as to alter its volatility prediction (1.8) (see Eden and Jovanovic 1994, Romer 1993 or Allen and Gale 1994, for example, for recent contributions). For example, consider an overlapping generations model in which the aggregate endowment is deterministic, but some stochastic factor like a random wealth transfer or monetary shock affects individual agents. In general this random shock will affect equilibrium stock prices. This juxtaposition of deterministic aggregate dividends and stochastic prices contradicts the simplest formulation of market efficiency, since deterministic dividends mean that the right-hand side of (1.8) is zero, while the left-hand side is strictly positive. Evidently, however, such models are efficient in any reasonable sense of the word: transactions costs are excluded and agents are assumed to be rational and to have rational expectations. Models with asymmetric information can be shown to predict price volatility that exceeds that associated with the conventional market efficiency definition. These efforts have been instructive, but should not be viewed as disposing of the volatility puzzle. The variance-bounds literature was never properly interpreted as pointing to a puzzle for which potential theoretical explanations were in short supply. Rather, it consisted in showing that a simple model which had served well in some contexts did not appear to serve so well in another context. Resolving the puzzle would consist not in pointing out that other more general models do not generate the volatility implication that the data contradict (this was never in doubt) but in showing that these models actually explain the observed variations in security prices.
Such explanations have not been forthcoming. For example, attempts to incorporate the effects of risk aversion in security pricing have not succeeded (Hansen and Singleton (1983), Mehra and Prescott (1985)), nor have any of the other proposed explanations of excess volatility been successfully implemented empirically. The enduring interest of the variance-bounds controversy lies in the fact that it was here that it was first pointed out that we do not have good explanations, even ex post, for why security prices behave as they do. It is hard to imagine a more important conclusion, and nothing in the recent development of empirical finance has altered it.
6. Interpretation
Variance-bounds tests as currently formulated appear to be essentially free of major econometric problems: for example, LeRoy-Parke (1992) relied on Monte Carlo simulations to assess the behavior of test statistics, thus ensuring that any econometric biases in the real-world statistics appear equally in the simulated statistics. Therefore econometric problems are automatically accommodated in
setting the rejection region. These reformulated variance-bounds tests have continued to find excess price volatility. The debate about statistical problems with the variance-bounds tests has died out in recent years: it is no longer seriously argued that there does not exist excess price volatility relative to that implied by the simplest expected present-value relation. As important as the above-mentioned refinements of the variance-bounds tests were in leading to this outcome, another development was still more important: conventional market efficiency tests were themselves evolving at the same time as the variance-bounds tests were being developed. The most important modification of the conventional return market efficiency tests was that they investigated return autocorrelations over much longer time horizons than had the earlier tests. Fama and French (1988) found significant predictability in returns. These return autocorrelations are most significant when returns are averaged over five to ten years; earlier studies, such as those reported in Fama (1970), had investigated return autocorrelations over weeks or months rather than years. There are several general methodological lessons to be learned from comparison of conventional market efficiency tests and variance-bounds tests about econometric testing of economic theories. Since the same null hypothesis is tested, one would presume that there exist no grounds for a different interpretation of rejection in one case relative to the other. Yet it is extraordinarily difficult to keep this in mind: the existence of excess volatility suggests the conclusion that "we cannot explain security prices", whereas the return autocorrelation results suggest the more workaday conclusion that "average security returns are subject to gradual shifts over time".
To bring home the point that this difference in interpretation is unjustified, assume that security prices equal those predicted by the present-value model plus a random term independent of dividends which has low innovation variance, but is highly autocorrelated. One can interpret that random term either as representing an irrational fad or as capturing smooth shifts in security returns due to changes in investment opportunities, shifts in social conditions, or whatever. This modification will generate excess volatility, and will also generate return autocorrelations of the type observed. With the same alternative hypothesis generating both the excess volatility and the return autocorrelations by assumption, there can be no justification for attaching different verbal interpretations to the two rejections. The lesson to be learned is that rejection of a model is just that: rejection of a model. One must be careful about basing interpretations of the rejection on the particular test leading to the rejection, rather than on the model being rejected. Despite being generally aware of the possibility that excess price volatility is the same thing statistically as long-horizon return autocorrelation, many financial economists nonetheless dismiss the possibility that excess price volatility has anything to do with capital market efficiency. Fama (1991) is a good example. Fama began his 1991 update of his survey (1970) by reemphasizing the point (made also in his 1970 survey) that any test of market efficiency is necessarily a joint test with a particular returns model. He then surveyed the evidence (to which
he has been a major contributor) that there exists high negative autocorrelation in returns at long horizons, remarking that this is statistically equivalent to "long swings away from fundamental value" (p. 1581). However, in discussing the variance-bounds tests, Fama expressed the opinion that, despite the fact that they are "another useful way to show that expected returns vary through time", variance-bounds tests "are not informative about market efficiency". Contrary to this, it would seem that the joint-hypothesis problem applies no less or more to variance-bounds tests than to return autocorrelation tests: if one type of evidence is relevant to market efficiency, so is the other. Another lesson is that one must be careful about applying implicit psychological metrics that seem appropriate, but in fact are not. For example, it is easy to regard the apparently spectacular rejections of the variance-bounds tests as justifying a strong verbal characterization, whereas the extraneous random term that accounts for return autocorrelations appears too small to justify a similar interpretation. This too is incorrect: a random term that adds and subtracts two or three percentage points, on average, to real stock returns (which average some six or eight per cent) will, if it is highly autocorrelated, routinely translate into a large increase in price variance. The small change in real stock returns is the same thing arithmetically as the large increase in price volatility, so the two should be accorded a similar verbal characterization.
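The arithmetic behind this point can be made concrete with a first-order autoregression. The following is a minimal sketch, assuming the extraneous term follows an AR(1) process with Gaussian innovations; the AR(1) form, the parameter values and the function names are illustrative assumptions, not part of the original argument.

```python
import random
import statistics


def ar1_unconditional_variance(innovation_sd, phi):
    """Variance of u_t = phi * u_{t-1} + e_t when |phi| < 1."""
    return innovation_sd ** 2 / (1.0 - phi ** 2)


def simulate_ar1(innovation_sd, phi, n, seed=0):
    rng = random.Random(seed)
    # Start from the stationary distribution so no burn-in is needed.
    u = rng.gauss(0.0, ar1_unconditional_variance(innovation_sd, phi) ** 0.5)
    path = []
    for _ in range(n):
        u = phi * u + rng.gauss(0.0, innovation_sd)
        path.append(u)
    return path


innovation_sd, phi = 0.03, 0.98   # small shocks, high persistence (hypothetical)
theory_sd = ar1_unconditional_variance(innovation_sd, phi) ** 0.5
sample_sd = statistics.pstdev(simulate_ar1(innovation_sd, phi, 200_000))
print(innovation_sd, theory_sd, sample_sd)
```

With these illustrative values the level of the term has a standard deviation several times that of its innovations, so a term that looks small period by period can account for a large share of the variance of price levels.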
7. Conclusion
In the introduction it was noted that the early interchanges between academics and finance practitioners about capital market efficiency generated more heat than light. Models derived from market efficiency, such as CAPM-based portfolio management models, made some inroads among practitioners, but for the most part the debate between proponents and opponents of rationality in financial markets died down. Parties on both sides agreed to disagree. The evidence of excess price volatility reopened the debate, since it seemed at first to give unambiguous testimony to the existence of irrational elements in security price determination. Now it is clear that there exist other more conservative ways to interpret the evidence of excess volatility: for example, that we simply do not know what causes changes in the rates at which future expected dividends are discounted. The variance-bounds controversy, together with parallel developments in financial economics, permits a considerable narrowing of the gap separating proponents and opponents of market efficiency. The existence of excess volatility implies that there are profitable trading rules, but it is known that these generate only small utility gains to those employing them. In fact, this juxtaposition between large departures from present-value pricing and small gains to those who try to exploit these departures provides the key to finding some middle ground in the efficiency debate. Proponents of market efficiency are vindicated because no one has identified trading rules that are more than marginally profitable. Detractors of market efficiency are vindicated because a large proportion of the variation in security prices remains unexplained by market fundamentals. Both are correct; both are discussing the same sets of stylized facts. Some proponents of market efficiency go to great lengths to argue that it is unscientific to interpret excess volatility as evidence in favor of the importance of psychological elements in security price determination; see, for example, Cochrane's otherwise excellent review (1991) of Shiller's (1989) book. On this view, evidence is scientific only when it is incontrovertible and, presumably, not susceptible to interpretations other than that proposed. At best this is an unconventional use of the term "scientific". Indeed, if the term "unscientific" is to be applied at all, should it not be to those who feel no embarrassment about the continuing presence in their models of an uninterpreted residual that accounts for most of the variation in the data? Given the continuing failure of financial models based exclusively on received neoclassical economics to provide ex post explanations of security price behavior, why does being scientific rule out broadening the field of inquiry to include psychological considerations?
References
Allen, F. and D. Gale (1994). Limited market participation and volatility of asset prices. Amer. Econom. Rev. 84, 933-955.
Campbell, J. Y. and R. J. Shiller (1988). The dividend-price ratio and expectations of future dividends and discount factors. Rev. Financ. Stud. 1, 195-228.
Cochrane, J. (1991). Volatility tests and efficient markets: A review essay. J. Monetary Econom. 27, 463-485.
Eden, B. and B. Jovanovic (1994). Asymmetric information and the excess volatility of stock prices. Economic Inquiry 32, 228-235.
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. J. Finance 25, 383-417.
Fama, E. F. (1991). Efficient capital markets: II. J. Finance 46, 1575-1617.
Fama, E. F. and K. R. French (1988). Permanent and transitory components of stock prices. J. Politic. Econom. 96, 246-273.
Flavin, M. (1983). Excess volatility in the financial markets: A reassessment of the empirical evidence. J. Politic. Econom. 91, 929-956.
Flood, R. P. and R. J. Hodrick (1990). On testing for speculative bubbles. J. Econom. Perspectives 4, 85-101.
Gilles, C. and S. F. LeRoy (1992). Bubbles and charges. Internat. Econom. Rev. 33, 323-339.
Gilles, C. and S. F. LeRoy (1991). Econometric aspects of the variance-bounds tests: A survey. Rev. Financ. Stud. 4, 753-791.
Grossman, S. J. and R. J. Shiller (1981). The determinants of the variability of stock prices. Amer. Econom. Rev. Papers Proc. 71, 222-227.
Hansen, L. and K. J. Singleton (1983). Stochastic consumption, risk aversion, and the temporal behavior of asset returns. J. Politic. Econom. 91, 249-265.
Kleidon, A. W. (1986). Variance bounds tests and stock price valuation models. J. Politic. Econom. 94, 953-1001.
LaCivita, C. J. and S. F. LeRoy (1981). Risk aversion and the dispersion of asset prices. J. Business 54, 535-547.
LeRoy, S. F. (1973). Risk aversion and the martingale property of stock prices. Internat. Econom. Rev. 14, 436-446.
LeRoy, S. F. and W. R. Parke (1992). Stock price volatility: Tests based on the geometric random walk. Amer. Econom. Rev. 82, 981-992.
LeRoy, S. F. and R. D. Porter (1981). The present-value relation: Tests based on implied variance bounds. Econometrica 49, 555-574.
LeRoy, S. F. and D. G. Steigerwald (1993). Volatility. University of Minnesota.
Lucas, R. E. (1978). Asset prices in an exchange economy. Econometrica 46, 1429-1445.
Marsh, T. A. and R. C. Merton (1986). Dividend variability and variance bounds tests for the rationality of stock market prices. Amer. Econom. Rev. 76, 483-498.
Marsh, T. A. and R. C. Merton (1983). Earnings variability and variance bounds tests for stock market prices: A comment. Reproduced, MIT.
Mehra, R. and E. C. Prescott (1985). The equity premium: A puzzle. J. Monetary Econom. 15, 145-161.
Romer, D. (1993). Rational asset price movements without news. Amer. Econom. Rev. 83, 1112-1130.
Samuelson, P. A. (1965). Proof that properly anticipated prices fluctuate randomly. Indust. Mgmt. Rev. 6, 41-49.
Shiller, R. J. (1981). Do stock prices move too much to be justified by subsequent changes in dividends? Amer. Econom. Rev. 71, 421-436.
Shiller, R. J. (1989). Market Volatility. MIT Press, Cambridge, MA.
Shiller, R. J. (1986). The Marsh-Merton model of managers' smoothing of dividends. Amer. Econom. Rev. 76, 499-503.
West, K. (1988). Bubbles, fads and stock price volatility: A partial evaluation. J. Finance 43, 636-656.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14
© 1996 Elsevier Science B.V. All rights reserved.

7

F. C. Palm
1. Introduction

Until some fifteen years ago, the focus of statistical analysis of time series centered on the conditional first moment. The increased role played by risk and uncertainty in models of economic decision making and the finding that common measures of risk and volatility exhibit strong variation over time led to the development of new time series techniques for modeling time-variation in second moments. In line with Box-Jenkins type models for conditional first moments, Engle (1982) put forward the Autoregressive Conditional Heteroskedastic (ARCH) class of models for conditional variances which proved to be extremely useful for analyzing economic time series. Since then an extensive literature has been developed for modeling higher order conditional moments. Many applications can be found in the field of financial time series. This vast literature on the theory and empirical evidence from ARCH modeling has been surveyed in Bollerslev et al. (1992), Nijman and Palm (1993), Bollerslev et al. (1994), Diebold and Lopez (1994), Pagan (1995) and Bera and Higgins (1995). A detailed treatment of ARCH models at a textbook level is also given by Gouriéroux (1992). The purpose of this chapter is to provide a selective account of certain aspects of conditional volatility modeling in finance using ARCH and GARCH (generalized ARCH) models and to compare the ARCH approach to alternative lines of research. The emphasis will be on recent developments, for instance in multivariate modeling using factor-ARCH models. Finally, an evaluation of the state of the art will be given.

In Section 2, we introduce the univariate and multivariate GARCH models (including ARCH models), discuss their properties and the choice of the functional form and compare them with alternative volatility models. Section 3 will be devoted to problems of inference in these models.
In Section 4, the statistical properties of GARCH models, their relationships with continuous time diffusion models and the forecasting of volatility will be discussed. Finally, in Section 5 we conclude and comment on potentially fruitful directions of future research.

* The author acknowledges many helpful comments by G. S. Maddala on an earlier version of the paper.
2. GARCH models
2.1. Motivation
GARCH models have been developed to account for empirical regularities in financial data. As emphasized by Pagan (1995) and Bollerslev et al. (1994), many financial time series have a number of characteristics in common. First, asset prices are generally nonstationary and often have a unit root, whereas returns are usually stationary. There is increasing evidence that some financial series are fractionally integrated. Second, return series usually show no or little autocorrelation. Serial independence between the squared values of the series however is often rejected, pointing towards the existence of nonlinear relationships between subsequent observations. Volatility of the return series appears to be clustered. Heavy fluctuations occur for longer periods. Small values for returns tend to be followed by small values. These phenomena point towards time-varying conditional variances. Third, normality has to be rejected frequently in favor of some thick-tailed distribution. The presence of unconditional excess kurtosis in the series could be related to the time-variation in the conditional variance. Fourth, some series exhibit so-called leverage effects [see Black (1976)], that is, changes in stock prices tend to be negatively correlated with changes in volatility. Some series have skewed unconditional empirical distributions, pointing towards the inappropriateness of the normal distribution. Fifth, volatilities of different securities very often move together, indicating that there are linkages between markets and that some common factors may explain the temporal variation in conditional second moments. In the next subsection, we shall present several models which account for temporal dependence in conditional variances, for skewness and excess kurtosis.
2.2. Univariate GARCH models
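Consider a series $y_t$ generated by

$$y_t = \varepsilon_t h_t^{1/2}, \tag{2.1}$$

$$h_t = \alpha_0 + \sum_{i=1}^{p} \beta_i h_{t-i} + \sum_{i=1}^{q} \alpha_i y_{t-i}^2, \tag{2.2}$$

with $E\varepsilon_t = 0$, $\mathrm{Var}(\varepsilon_t) = 1$, $\alpha_0 > 0$, $\beta_i \ge 0$, $\alpha_i \ge 0$, and $\sum_{i=1}^{p} \beta_i + \sum_{i=1}^{q} \alpha_i < 1$. This is the (p,q)th order GARCH model introduced by Bollerslev (1986). When $\beta_i = 0$, $i = 1, 2, \ldots, p$, it specializes to the ARCH(q) model put forward in a seminal paper by Engle (1982). The nonnegativity conditions imply a nonnegative variance, while the condition on the sum of the $\alpha_i$'s and $\beta_i$'s is required for wide sense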
stationarity. These sufficient conditions for a nonnegative conditional variance can be substantially weakened as shown by Nelson and Cao (1992). The conditional variance of $y_t$ can become larger than the unconditional variance given by $\sigma^2 = \alpha_0 / (1 - \sum_{i=1}^{p} \beta_i - \sum_{i=1}^{q} \alpha_i)$ if past realizations of $y_t^2$ have been larger than $\sigma^2$. As shown by Anderson (1992), the GARCH model belongs to the class of deterministic conditional heteroskedasticity models in which the conditional variance is a function of variables that are in the information set available at time t. Adding the assumption of normality, the model can be written as

$$y_t \mid \Phi_{t-1} \sim N(0, h_t), \tag{2.3}$$
with $h_t$ being given by (2.2) and $\Phi_{t-1}$ being the set of information available at time $t-1$. Anderson (1994) distinguishes between deterministic, conditionally heteroskedastic, conditionally stochastic and contemporaneously stochastic volatility processes. Loosely speaking, the volatility process is deterministic if the information set ($\sigma$-field) is identical to the $\sigma$-field of all random vectors in the system up to and including time $t = 0$; the process is conditionally heteroskedastic if the information set contains information available and observable at time $t-1$; the process is conditionally stochastic if it contains all random vectors up to period $t-1$; whereas the volatility process is contemporaneously stochastic if the information set contains the random vectors up to period $t$. Notice the order imposed on the information structure of the various volatility representations.

When $\sum_{i=1}^{p} \beta_i + \sum_{i=1}^{q} \alpha_i = 1$, the integrated GARCH (IGARCH) model arises [see Engle and Bollerslev (1986)]. From the GARCH(p, q) model in (2.2), we obtain that $[1 - \alpha(L) - \beta(L)] y_t^2 = \alpha_0 + [1 - \beta(L)] v_t$, where $v_t = y_t^2 - h_t$ are the innovations in the conditional variance process and $\alpha(L) = \sum_{i=1}^{q} \alpha_i L^i$ and $\beta(L) = \sum_{i=1}^{p} \beta_i L^i$. The fractionally integrated GARCH model [FIGARCH(p, d, q)] proposed by Baillie, Bollerslev and Mikkelsen (1993) arises when the polynomial in the lag operator L, $1 - \alpha(L) - \beta(L)$, can be factorized as $\phi(L)(1 - L)^d$ where the roots of $\phi(z) = 0$ lie outside the unit circle and $0 < d < 1$. The FIGARCH model nests the GARCH(p, q) model for $d = 0$, and the IGARCH(p, q) model for $d = 1$. Allowing d to take a value in the interval between zero and one gives additional flexibility that may be important when modeling long-run dependence in the conditional variance. In the empirical analysis of financial data, GARCH(1,1) or GARCH(1,2) models have often been found to appropriately account for conditional heteroskedasticity.
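As an illustration of the GARCH(1,1) special case of (2.2), the following sketch simulates the process and compares a few sample moments with the properties described above. It assumes Gaussian innovations; the parameter values and helper names are hypothetical and chosen only so that the stationarity condition holds.

```python
import random


def simulate_garch11(alpha0, alpha1, beta1, n, seed=0):
    """Simulate y_t = eps_t * sqrt(h_t), h_t = alpha0 + beta1*h_{t-1} + alpha1*y_{t-1}^2."""
    if alpha0 <= 0 or alpha1 < 0 or beta1 < 0 or alpha1 + beta1 >= 1:
        raise ValueError("need alpha0 > 0, alpha1, beta1 >= 0, alpha1 + beta1 < 1")
    rng = random.Random(seed)
    h = alpha0 / (1.0 - alpha1 - beta1)   # start at the unconditional variance
    ys = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0) * h ** 0.5
        ys.append(y)
        h = alpha0 + beta1 * h + alpha1 * y * y
    return ys


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numer / denom


alpha0, alpha1, beta1 = 0.05, 0.10, 0.85
y = simulate_garch11(alpha0, alpha1, beta1, n=100_000)
sample_variance = sum(v * v for v in y) / len(y)
unconditional_variance = alpha0 / (1.0 - alpha1 - beta1)
acf_levels = autocorrelation(y, 1)
acf_squares = autocorrelation([v * v for v in y], 1)
print(sample_variance, unconditional_variance, acf_levels, acf_squares)
```

In a run of this kind the levels are close to serially uncorrelated while the squares are positively autocorrelated, which is the volatility clustering that motivates the model.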
The finding that low order GARCH models are usually adequate is similar to the finding that low order ARMA models usually describe the dynamics of the conditional mean of many economic time series quite well. It is important to notice that for the above models positive and negative past values have a symmetric effect on the conditional variance. Many financial series however are strongly asymmetric. Negative equity returns are followed by larger increases in volatility than equally large positive returns. Black (1976) interpreted this phenomenon as the leverage effect, according to which large declines in equity values would not be matched by a decrease in the value of debt and would raise the debt to equity ratio. Models such as the exponential GARCH (EGARCH)
put forward by Nelson (1991), the quadratic GARCH (QGARCH) model of Sentana (1991) and Engle (1990) and the threshold GARCH (TGARCH) of Zakoian (1994) allow for asymmetry. Nelson's EGARCH model reads as follows

$$\ln h_t = \alpha_0 + \sum_{i=1}^{p} \beta_i \ln h_{t-i} + \sum_{i=1}^{q} \alpha_i \left( \phi\, \varepsilon_{t-i} + \psi \left( |\varepsilon_{t-i}| - E|\varepsilon_{t-i}| \right) \right), \tag{2.4}$$

where the parameters $\alpha_0$, $\alpha_i$, $\beta_i$ are not restricted to be nonnegative. A negative shock to the returns which would increase the debt to equity ratio and therefore increase uncertainty of future returns could be accounted for when $\alpha_i > 0$ and $\phi < 0$. Similarly, when fractional integration is allowed for in an exponential GARCH model, the FIEGARCH model is obtained. The QGARCH model is written by Sentana (1991) as
$$h_t = \sigma^2 + \psi' x_{t-q} + x_{t-q}' A x_{t-q} + \sum_{i=1}^{p} \beta_i h_{t-i}, \tag{2.5}$$

where $x_{t-q} = (y_{t-1}, y_{t-2}, \ldots, y_{t-q})'$. The linear term allows for asymmetry. The off-diagonal elements of A account for interaction effects of lagged values of $x_t$ on the conditional variance. The various quadratic variance functions proposed in the literature are nested in (2.5). The augmented GARCH (GAARCH) model of Bera and Lee (1990) assumes $\psi = 0$. Engle's (1982) ARCH model restricts $\psi = 0$, $\beta_i = 0$ and A to be diagonal. The asymmetric GARCH model of Engle (1990) and Engle and Ng (1993) assumes A to be diagonal. The linear standard deviation model studied by Robinson (1991) restricts $\beta_i = 0$, $\sigma^2 = \rho^2$, $\psi = 2\rho\varphi$ and $A = \varphi\varphi'$, a matrix of rank 1. The conditional variance then becomes
$$h_t = (\rho + \varphi' x_{t-q})^2 .$$
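The reduction of the quadratic form to a squared linear function can be checked numerically. In the sketch below the symbol names are assumptions made for illustration: `psi` is the coefficient vector on the linear term, `phi` is the vector in Robinson's model, `rho` is its intercept, and the lagged conditional variances are dropped (all $\beta_i = 0$).

```python
def quadratic_variance(sigma2, psi, A, x):
    """sigma2 + psi'x + x'Ax, the quadratic variance function without lagged h terms."""
    k = len(x)
    linear = sum(psi[i] * x[i] for i in range(k))
    quadratic = sum(x[i] * A[i][j] * x[j] for i in range(k) for j in range(k))
    return sigma2 + linear + quadratic


def linear_sd_variance(rho, phi, x):
    """(rho + phi'x)^2, the linear standard deviation form."""
    return (rho + sum(f * v for f, v in zip(phi, x))) ** 2


rho = 0.5
phi = [0.3, -0.2]          # hypothetical coefficients
x = [1.2, -0.7]            # hypothetical lagged observations

# Restrictions that turn the quadratic form into a perfect square.
sigma2 = rho ** 2
psi = [2.0 * rho * f for f in phi]
A = [[fi * fj for fj in phi] for fi in phi]   # rank-one matrix phi phi'

general = quadratic_variance(sigma2, psi, A, x)
restricted = linear_sd_variance(rho, phi, x)
print(general, restricted)
```

The two values coincide because expanding the square gives exactly the constant, linear and rank-one quadratic terms imposed by the restrictions.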
The TGARCH model is written as

$$h_t = \alpha_0 + \sum_{i=1}^{p} \ldots + \sum_{i=1}^{q} \ldots, \tag{2.6}$$
where $y_t^+ = \max\{y_t, 0\}$ and $y_t^- = \min\{y_t, 0\}$. It accounts for asymmetries by allowing the coefficients $\alpha_i^+$ and $\alpha_i^-$ to differ. As shown by Hentschel (1994) many members of the family of GARCH models (taking p = q = 1) can be embedded in a Box-Cox transformation of the absolute GARCH (AGARCH) model

$$\frac{\sigma_t^{\lambda} - 1}{\lambda} = \alpha_0 + \alpha_1 \sigma_{t-1}^{\lambda} f^{\nu}(\varepsilon_{t-1}) + \beta \frac{\sigma_{t-1}^{\lambda} - 1}{\lambda}, \tag{2.7}$$

where $\sigma_t = h_t^{1/2}$ and $f(\varepsilon_t) = |\varepsilon_t - b| - c(\varepsilon_t - b)$ is the news impact curve introduced by Pagan and Schwert (1990). For $\lambda > 1$, the Box-Cox transformation is convex, for $\lambda < 1$, it is concave. For $\lambda = \nu = 1$ and $|c| \le 1$ expression (2.7) specializes to become the AGARCH model. The model for the conditional standard deviation suggested by Taylor (1986) and Schwert (1989) arises when $\lambda = \nu = 1$ and $b = c = 0$. The exponential GARCH model (2.4) for $p = q = 1$ arises from (2.7) when $\lambda = 0$, $\nu = 1$ and $b = 0$. The TGARCH model for the standard deviation is obtained from (2.7) when $\lambda = \nu = 1$, $b = 0$ and $|c| < 1$. The GARCH model (2.2) arises if $\lambda = \nu = 2$ and $b = c = 0$. Engle and Ng's (1993) nonlinear asymmetric GARCH corresponds to the values of $\lambda = \nu = 2$ and $c = 0$ whereas the GARCH model proposed by Glosten-Jagannathan-Runkle (1993) is obtained when $\lambda = \nu = 2$, $b = 0$. The nonlinear ARCH model of Higgins and Bera (1992) leaves $\lambda$ free and $\nu$ equal to $\lambda$ with $b = c = 0$. The asymmetric power ARCH (APARCH) of Ding, Granger and Engle (1993) leaves $\lambda$ free and $\nu$ equal to $\lambda$, $b = 0$ and $|c| \le 1$. Sentana's (1991) QGARCH is not nested in the specification (2.7). As shown by Hentschel (1994), nesting existing GARCH models in a general specification like (2.7) highlights the relations between these models and offers opportunities for testing sequences of nested hypotheses regarding the functional form for conditional second order moments. Crouhy and Rockinger (1994) put forward the general so-called hysteresis GARCH (HGARCH) model, in which, in addition to a threshold GARCH part, they include a short term, up to a few days, and a long term, up to a few weeks, impact of returns on volatility. Engle, Lilien and Robins (1987) introduce the ARCH in mean (ARCH-M) model in which the conditional mean is a function of the conditional variance of the process
(2.8)
where $z_{t-1}$ is a vector of predetermined variables, g is some function of $z_{t-1}$ and $h_t$ is generated by an ARCH(q) process. Of course, when $h_t$ follows a GARCH process, expression (2.8) will be a GARCH-in-mean equation. The simplest ARCH-M model has $g(z_{t-1}, h_t) = \delta h_t$. GARCH-in-mean models arise in a natural way in theories of finance where, for instance, $g(z_{t-1}, h_t)$ could denote the expected return on some asset, with $h_t$ being a measure of risk. The mean equation (2.8) would then reflect the trade-off between risk and expected return. Pagan and Ullah (1988) refer to these models as models with risk terms.
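To make the simplest ARCH-M case concrete, the following is a minimal simulation sketch. It assumes an additive disturbance $h_t^{1/2}\varepsilon_t$ with standard normal $\varepsilon_t$, an ARCH(1) variance and a zero start-up value for the lagged disturbance; the function name and parameter values are illustrative only and are not taken from the studies cited above.

```python
import numpy as np

def simulate_arch_m(T, alpha0, alpha1, delta, seed=0):
    """Simulate y_t = delta*h_t + sqrt(h_t)*eps_t with an ARCH(1) variance h_t.

    Assumptions (not from the text): Gaussian eps_t, additive disturbance,
    and a zero start-up value for the lagged disturbance.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T)
    y = np.empty(T)
    h = np.empty(T)
    u_prev = 0.0
    for t in range(T):
        h[t] = alpha0 + alpha1 * u_prev ** 2   # ARCH(1) conditional variance
        u = np.sqrt(h[t]) * eps[t]             # disturbance of the mean equation
        y[t] = delta * h[t] + u                # risk term delta*h_t enters the mean
        u_prev = u
    return y, h

y, h = simulate_arch_m(T=5000, alpha0=0.5, alpha1=0.3, delta=0.4)
print("sample mean of y:", y.mean(), " delta * mean(h):", 0.4 * h.mean())
```

In this sketch the sample mean of the simulated series tracks $\delta$ times the average conditional variance, which is the risk-return trade-off expressed by the mean equation.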
A related estimator for the volatility may be obtained from the interperiod highs and lows. As shown by Parkinson (1980), a high-low estimator for the variance in a random walk with constant variance and continuous time parameter is more efficient than the conventional sample variance based on the same number of end-of-interval observations. Along these lines, the relationship between volatility and the bid-ask spread for prices could be used to construct variance estimates for returns [see e.g. Bollerslev and Domowitz (1993)]. Similarly, the recent efforts to develop option pricing formulae in the presence of stochastic volatility [see e.g. Melino and Turnbull (1990)] have established a positive relationship between the value of an option and the variance of the underlying security, which could be used to assess the volatility of the security price. Finally, information on the distribution of price returns across assets at given points in time could also be used to quantify market volatility.

When deciding on the form of the specification for the conditional variance, one has to define the conditioning set of information and to select a functional form for the mapping between the conditioning set and the conditional variance. Usually, the conditioning set is restricted to include past values of the series itself. A simple two-step estimator of the conditional residual variance can be obtained from a regression of the squared residuals on their own lagged values [see Davidian and Carroll (1987)]. Pagan and Schwert (1990) show that the OLS estimator is consistent although not efficient. The role of this two-step estimator is that of a benchmark which can be computed in a straightforward way.

Jump or mixture models, possibly combined with a GARCH specification for the conditional variance, have been used to describe time variation in volatility measures, fat tails and skewness of financial series.
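As a concrete illustration of the high-low estimator mentioned above, the sketch below uses the standard form of Parkinson's estimator, the mean squared log range divided by $4\ln 2$, next to the conventional zero-drift estimator based on end-of-interval prices. The function names and the toy numbers are illustrative assumptions.

```python
import numpy as np

def parkinson_variance(high, low):
    """High-low variance estimate per period: mean(ln(H/L)^2) / (4 ln 2)."""
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    log_range = np.log(high / low)
    return np.mean(log_range ** 2) / (4.0 * np.log(2.0))

def close_to_close_variance(close):
    """Conventional estimate from end-of-interval prices (zero drift assumed)."""
    returns = np.diff(np.log(np.asarray(close, dtype=float)))
    return np.mean(returns ** 2)

# Toy inputs, purely for illustration.
highs = [101.5, 102.0, 100.8]
lows = [99.7, 100.4, 99.1]
closes = [100.0, 101.0, 100.5, 99.9]
print(parkinson_variance(highs, lows), close_to_close_variance(closes))
```

Both functions return a variance per observation interval, so the two estimates are directly comparable when computed over the same periods.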
In the Poisson jump model it is assumed that upon the arrival of abnormal information a jump occurs in the returns. The number of jumps occurring at time t, $n_t$, is generated by a Poisson distribution with parameter $\lambda$. Conditionally on the number of jumps $n_t$, returns are normally distributed with mean $n_t\theta$ and variance $\sigma_t^2 = \sigma_0^2 + n_t\sigma_1^2$. The parameter $\theta$ denotes the expected jump size. The conditional mean and variance of the returns depend on the number of jumps in period t. Additional time dependency could be introduced by assuming that $\sigma_0^2$ is generated by a GARCH-type process. In the finance literature, stochastic jumps have usually been modeled by means of a Poisson process [see e.g. Ball and Torous (1985), Jorion (1988), Hsieh (1989), Nieuwland et al. (1991) and Ball and Roma (1993)]. Vlaar and Palm (1993) compare the Poisson jump process with the Bernoulli jump model for weekly exchange rate data from the European Monetary System (EMS). The performance of both models is very similar in most instances. Using the Bernoulli process has the advantage that one avoids making a truncation error when cutting off the infinite sum in a Poisson process. The mixing parameter $\lambda$ could be allowed to vary over time. For instance, Vlaar and Palm (1994) assume that the mixing parameter $\lambda$ of a Bernoulli jump model for risk premia on European currencies depends on the inflation differential with respect to Germany.
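The truncation issue can be seen directly in the mixture densities. The sketch below evaluates the Poisson jump density with the infinite sum cut off at a hypothetical `n_max`, and the Bernoulli variant, which needs no truncation. The symbols `sigma0_sq` and `sigma1_sq` stand for the diffusion and jump variances; these names, the cut-off and the parameter values are assumptions for illustration.

```python
import math
import numpy as np

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def poisson_jump_pdf(y, lam, theta, sigma0_sq, sigma1_sq, n_max=20):
    """Poisson mixture of normals; the infinite sum is truncated at n_max."""
    y = np.asarray(y, dtype=float)
    dens = np.zeros_like(y)
    for n in range(n_max + 1):
        weight = math.exp(-lam) * lam ** n / math.factorial(n)  # P(n_t = n)
        dens += weight * normal_pdf(y, n * theta, sigma0_sq + n * sigma1_sq)
    return dens

def bernoulli_jump_pdf(y, lam, theta, sigma0_sq, sigma1_sq):
    """At most one jump per period, so the mixture has exactly two terms."""
    y = np.asarray(y, dtype=float)
    return ((1.0 - lam) * normal_pdf(y, 0.0, sigma0_sq)
            + lam * normal_pdf(y, theta, sigma0_sq + sigma1_sq))

grid = np.linspace(-40.0, 40.0, 80001)
dx = grid[1] - grid[0]
p_pois = poisson_jump_pdf(grid, 0.5, -0.3, 1.0, 2.0)
p_bern = bernoulli_jump_pdf(grid, 0.5, -0.3, 1.0, 2.0)
print("Poisson mass:", p_pois.sum() * dx, " Bernoulli mass:", p_bern.sum() * dx)
```

With a small jump intensity the mass lost by truncation is negligible, which is consistent with the two models performing similarly; both densities have mean $\lambda\theta$.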
Another way of allowing for time dependence is to assume that the probabilities of being in state 1 during period t differ, depending on whether the economy was in, say, state 1 or state 2 in period t - 1. Such a model has been put forward by Hamilton (1989) and applied to exchange rates [Engel and Hamilton (1990)], interest rates [Hamilton (1988)] and stock returns [Pagan and Schwert (1990)]. In Hamilton's basic model, an unobserved state variable $z_t$ can take the values 0 or 1. The transition probabilities $p_{ij}$ from state i in period t - 1 to state j in period t are constant and given by $p_{11} = p$, $p_{10} = 1 - p$, $p_{00} = q$ and $p_{01} = 1 - q$. As shown by Pagan (1995), $z_t$ evolves as an AR(1) process. Observed returns $y_t$ in Hamilton's model are assumed to be generated by
(2.9)
with $\varepsilon_t \sim \mathrm{NID}(0, \sigma^2)$. The expected values of $y_t$ in the two states are $\beta_0$ and $\beta_0 + \beta_1$ respectively. The variances are $\sigma^2$ and $\sigma^2 + \phi$. The model therefore generates states with high volatility and states with low volatility. Expected returns can also vary across these types of states. The variance of returns conditional on the state in period t - 1 can be expressed as
$$\mathrm{Var}(y_t \mid z_{t-1}) = [\sigma^2 + (1 - q)\phi](1 - z_{t-1}) + [p\phi + \sigma^2]z_{t-1} . \tag{2.10}$$
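A small numerical sketch of (2.10) and of the two-state chain behind it may help. The function names and parameter values are hypothetical; the long-run share of time spent in state 1, $(1 - q)/(2 - p - q)$, is the standard stationary probability of a two-state Markov chain with these transition probabilities.

```python
import numpy as np

def conditional_variance(z_prev, p, q, sigma_sq, phi):
    """Evaluate (2.10) for the lagged state z_prev in {0, 1}."""
    return ((sigma_sq + (1.0 - q) * phi) * (1 - z_prev)
            + (p * phi + sigma_sq) * z_prev)

def simulate_states(T, p, q, seed=0):
    """Simulate the chain with P(1 -> 1) = p and P(0 -> 0) = q, starting in state 0."""
    rng = np.random.default_rng(seed)
    z = np.zeros(T, dtype=int)
    for t in range(1, T):
        stay_prob = p if z[t - 1] == 1 else q
        z[t] = z[t - 1] if rng.random() < stay_prob else 1 - z[t - 1]
    return z

p, q, sigma_sq, phi = 0.9, 0.8, 1.0, 3.0
z = simulate_states(20000, p, q)
print("Var(y_t | z_{t-1}=0):", conditional_variance(0, p, q, sigma_sq, phi))
print("Var(y_t | z_{t-1}=1):", conditional_variance(1, p, q, sigma_sq, phi))
print("share of time in state 1:", z.mean(), " theory:", (1 - q) / (2 - p - q))
```

Because the two conditional variances differ whenever $p \neq 1 - q$, the variance implied by the model moves with the lagged state.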
Quite obviously, the conditional variance (2.10) exhibits time dependence. Hamilton and Susmel (1994) generalize the Markov switching regime model by allowing the disturbances to be ARCH. Their model is called the switching regime ARCH (SWARCH) model. As in equation (2.9), the conditional mean of the SWARCH model depends linearly on the state variable $z_t$. The disturbance term of $y_t$ is assumed to follow an autoregressive process of order p with an error $u_t = \sqrt{g_{z_t}}\,\tilde{u}_t$, where $\tilde{u}_t$ follows an ARCH(q) process with leverage effects as in the model of Glosten et al. (1993) and $g_{z_t}$ is a constant factor which differs across regimes. The innovation $\tilde{u}_t$ is assumed to have a conditional Student t-distribution with mean zero. Transitions between regimes are governed by an unobserved Markov chain. The authors use weekly returns on the value-weighted portfolio of stocks traded on the New York Stock Exchange for the period July 3, 1962 to December 29, 1987. Various ARCH models are compared to SWARCH models allowing for up to four regimes. The SWARCH specification with leverage terms, a conditional Student t-distribution with a low number of degrees of freedom and four regimes is found to perform best. Along similar lines, using a two-state SWARCH model, Cai (1994) examines the issue of volatility persistence in monthly returns of three-month Treasury bills in the period 1964,8 to 1991,11. The persistence in ARCH processes found in previous studies can be accounted for by discrete shifts in the intercept of the conditional variance of the process. Two periods during which a regime shift occurred are the period of the oil crisis, 1974,2-1974,8, and the period 1979,9-1982,8, associated with a policy change of the Federal Reserve Bank. Estimates of the conditional variance which do not depend on specific assumptions about the functional form can be obtained using nonparametric
methods. Pagan and Schwert (1990) and Pagan and Hong (1991) use a nonparametric kernel estimator and a nonparametric flexible Fourier form estimator. The kernel estimator of a conditional moment of $y_t$, denoted by $g(y_t)$, with a finite number of conditioning variables $x_t$ reads as
$$\hat{E}[g(y_t) \mid x_t] = \sum_{s=1}^{T} g(y_s)K(x_t - x_s) \Big/ \sum_{s=1}^{T} K(x_t - x_s) , \tag{2.11}$$
where K is a kernel function which smoothes the data. Various types of kernels might be employed. A popular one is the normal kernel which has also been used by Pagan and Schwert (1990)
$$K(x_t - x_s) = (2\pi)^{-1/2}\,|H|^{-1/2}\exp\left[-\tfrac{1}{2}(x_t - x_s)'H^{-1}(x_t - x_s)\right] . \tag{2.12}$$
H is a diagonal matrix with k-th diagonal element set equal to the bandwidth $\hat{\sigma}_kT^{-1/(4+q)}$, with $\hat{\sigma}_k$ being the standard deviation of $x_{kt}$, k = 1, ..., q, and q being the dimension of the conditioning set. An alternative nonparametric estimator involves a global approximation of the conditional variance using a series expansion. Among the many existing series expansions, the Flexible Fourier Form (FFF) proposed by Gallant (1981) has been used extensively in finance. The conditional variance is represented as the sum of a low-order polynomial and trigonometric terms constructed from past $\hat{\varepsilon}_t$'s (the residuals from a regression for $y_t$). Then, the specification for $\sigma_t^2$ becomes
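$$\sigma_t^2 = \sigma^2 + \sum_{j=1}^{L}\Big\{\alpha_j\hat{\varepsilon}_{t-j} + \beta_j\hat{\varepsilon}_{t-j}^2 + \sum_{k=1}^{2}\big[\gamma_{jk}\cos(k\hat{\varepsilon}_{t-j}) + \delta_{jk}\sin(k\hat{\varepsilon}_{t-j})\big]\Big\} . \tag{2.13}$$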
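The two nonparametric estimators can be sketched as follows for a single conditioning variable (q = 1, L = 1). This is an illustrative sketch and not the procedure of the cited papers. It assumes the conditioning variable is the lagged residual, scales the Gaussian kernel by the squared bandwidth as is usual in kernel regression, and fits (2.13) by ordinary least squares on simulated ARCH(1) data. All names and parameter values are hypothetical.

```python
import numpy as np

def kernel_conditional_variance(u, x_grid):
    """Kernel estimate of E[u_t^2 | u_{t-1} = x] with a Gaussian kernel."""
    u = np.asarray(u, dtype=float)
    target, cond = u[1:] ** 2, u[:-1]
    # Bandwidth sigma_hat * T^{-1/(4+q)} with q = 1.
    bandwidth = cond.std() * len(cond) ** (-1.0 / 5.0)
    d = (np.asarray(x_grid, dtype=float)[:, None] - cond[None, :]) / bandwidth
    weights = np.exp(-0.5 * d ** 2)
    return (weights @ target) / weights.sum(axis=1)

def fff_conditional_variance(u, n_trig=2):
    """OLS fit of a flexible Fourier form in the lagged residual (L = 1)."""
    u = np.asarray(u, dtype=float)
    target, lag = u[1:] ** 2, u[:-1]
    cols = [np.ones_like(lag), lag, lag ** 2]       # low-order polynomial part
    for k in range(1, n_trig + 1):                  # trigonometric terms
        cols += [np.cos(k * lag), np.sin(k * lag)]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return X @ coef, coef

# Simulated ARCH(1) residuals, used purely as example input.
rng = np.random.default_rng(1)
T = 5000
u = np.zeros(T)
for t in range(1, T):
    u[t] = np.sqrt(0.2 + 0.5 * u[t - 1] ** 2) * rng.standard_normal()

est = kernel_conditional_variance(u, [0.0, 1.0])
fitted, coef = fff_conditional_variance(u)
print("kernel estimates at 0 and 1:", est)
print("share of negative FFF fitted values:", np.mean(fitted < 0))
```

The last line checks for negative fitted values, since nothing in the least squares fit of the Fourier form constrains the estimated variance to be positive.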
In theory, the number of trigonometric terms should tend to infinity, but in practice, in terms of significance, it is often not worthwhile to go beyond an order of two. A drawback of (2.13) is the possibility that estimates of $\sigma_t^2$ can be negative. The estimator in (2.13) has been applied to stock returns by Pagan and Schwert (1990) for L = 1. The estimate of $\sigma_t^2$ is roughly constant and similar for the kernel, GARCH(1,2) and FFF estimation methods across most of the range of $\hat{\varepsilon}_{t-1}$. Only for large positive and negative values of $\hat{\varepsilon}_{t-1}$ do the estimators exhibit a different behavior. For negative values of $\hat{\varepsilon}_{t-1}$, the volatility estimates increase dramatically. Also, the trigonometric terms in (2.13) appear to be highly significant when tested jointly using an F-test. The nonparametric estimates of conditional volatility using kernels or Fourier series differ from the parametric estimates for the GARCH, EGARCH and Hamilton models in periods when stock prices fall. In particular, large negative unexpected returns lead to a large increase in volatility. Parametric estimates appear to adjust slowly to large shocks, and the effects of these shocks exhibit persistence. The parametric methods use the persistent aspects while the nonparametric methods use the highly nonlinear response to large negative shocks. While the nonparametric estimators of conditional volatility have a much higher
explanatory power than the parametric GARCH, EGARCH and Hamilton models, in particular in explaining asymmetries, they are inefficient compared with parametric methods. This suggests that improvements could be obtained by merging the two approaches to capture a richer set of specifications than are currently employed. Other nonparametric approaches have been put forward in the literature. Gouriéroux and Monfort (1992) propose to approximate the unknown relation between $y_t$ and $\varepsilon_t$ by a step function of the form
$$y_t = \sum_{j=1}^{J}\alpha_j 1_{A_j}(y_{t-1}) + \sum_{j=1}^{J}\beta_j 1_{A_j}(y_{t-1})\,\varepsilon_t , \tag{2.14}$$
where $A_j$, j = 1, 2, ..., J, is a partition of the set of values of $y_{t-1}$, $1_{A_j}(y_{t-1})$ is an indicator variable taking the value 1 when $y_{t-1}$ is in $A_j$ and zero otherwise, and $\varepsilon_t$ is white noise. This model is called the Qualitative Threshold Autoregressive Conditionally Heteroskedastic (QTARCH) model. If regime j applies to the variable $y_{t-1}$, the conditional mean and variance of $y_t$ are given by $\alpha_j$ and $\beta_j^2$ respectively. The process of $y_t$ is determined by the qualitative state variables $z_t = (1_{A_1}(y_t), \ldots, 1_{A_J}(y_t))'$, which are generated by a Markov chain. For instance, the partition $A_1, \ldots, A_J$ may correspond to the different stages of expansion and contraction of the financial market. By refining the partition $A_1, \ldots, A_J$ sufficiently, one can use (2.14) to approximate more complex specifications for the conditional mean and variance of $y_t$. Alternatively, the conditional variance specification could be refined by adding a GARCH term. The pseudo-maximum likelihood estimators of $\alpha_j$ and $\beta_j^2$ are the sample mean and variance computed for regime j. The QTARCH model approximates the conditional mean and variance by step functions, whereas the TARCH model of Zakoian (1994) relies on a piecewise linear approximation of the conditional variance function. The nonparametric kernel estimators smooth the conditional moments, and the FFF estimators approximate the conditional moments using functions which are smoother than piecewise linear or step functions. Along similar lines, Engle and Ng (1993) use linear splines to estimate the shape of the response to news. Their procedure is called partially nonparametric (PNP), as the long memory component is modeled parametrically and the relationship between news and volatility is treated nonparametrically.

Among semiparametric methods extensively used in analyzing dependencies in financial data, we should mention the seminonparametric (SNP) models based on a series expansion with a Gaussian VAR leading term, proposed by Gallant and Tauchen (1989).
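The regime-wise estimators of the QTARCH model discussed above are simple to compute. The sketch below assumes the partition consists of intervals defined by a list of cut points (`edges`, a hypothetical argument) and returns the sample mean and variance of $y_t$ within each regime of $y_{t-1}$.

```python
import numpy as np

def qtarch_pml(y, edges):
    """Sample mean and variance of y_t for each regime of y_{t-1}.

    Regime j collects the observations whose lagged value falls in the j-th
    interval defined by the cut points in `edges`.
    """
    y = np.asarray(y, dtype=float)
    current, lagged = y[1:], y[:-1]
    regime = np.digitize(lagged, edges)      # interval index 0 .. len(edges)
    n_regimes = len(edges) + 1
    means = np.full(n_regimes, np.nan)
    variances = np.full(n_regimes, np.nan)
    for j in range(n_regimes):
        obs = current[regime == j]
        if obs.size > 0:                     # empty regimes are left as NaN
            means[j] = obs.mean()
            variances[j] = obs.var()
    return means, variances

# Two regimes: lagged value below zero, and zero or above.
means, variances = qtarch_pml([-1.0, 2.0, 1.0, 4.0, -2.0, 0.0], edges=[0.0])
print(means, variances)
```

Adding cut points to `edges` refines the partition, which is how the step functions can approximate richer conditional mean and variance shapes.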
Assume that the conditional distribution of an N × 1 vector $y_t$ given the entire past depends only on a finite number L of lagged values of $y_t$, denoted by $x_{t-1} = (y_{t-L}', y_{t-L+1}', \ldots, y_{t-1}')'$, which is a vector of length L·N. The procedure consists of approximating the conditional density of $y_t$ given $x_{t-1}$ by a truncated Hermite expansion which has the form of a polynomial in $z_t$ times the standard normal density, where $z_t$ is the centered and scaled value of $y_t$, $z_t = R^{-1}(y_t - b_0 - Bx_{t-1})$.
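The truncated expansion is the semiparametric model. The conditional SNP density for $z_t$ given $x_{t-1}$ is approximated by
$$f(z_t \mid x_{t-1}) = \frac{\Big[\sum_{|\alpha|=0}^{K_z} a_\alpha(x_{t-1})\,z_t^{\alpha}\Big]^2\varphi(z_t)}{\int\Big[\sum_{|\alpha|=0}^{K_z} a_\alpha(x_{t-1})\,u^{\alpha}\Big]^2\varphi(u)\,du} , \tag{2.15}$$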
where $\varphi$ denotes the standard Gaussian density, $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)'$, $z^{\alpha} = \prod_{i=1}^{N}(z_i)^{\alpha_i}$, which is of degree $|\alpha| = \sum_{i=1}^{N}|\alpha_i|$, $a_\alpha(x) = \sum_{|\beta|=0}^{K_x} a_{\alpha\beta}x^{\beta}$, $\beta = (\beta_1, \beta_2, \ldots, \beta_{NL})'$, $|\beta| = \sum_{i=1}^{NL}|\beta_i|$, $x^{\beta} = \prod_{i=1}^{NL}(x_i)^{\beta_i}$, and $K_z$ and $K_x$ are positive integers. The conditional density of $y_t$ given $x_{t-1}$ is $h(y_t \mid x_{t-1}) = f[R^{-1}(y_t - b_0 - Bx_{t-1}) \mid x_{t-1}]/\det(R)$. As pointed out by Gallant and Tauchen (1989), by increasing $K_z$ and $K_x$ simultaneously, an SNP model will yield arbitrarily accurate approximations to a class of models which includes fat-tailed distributions (t-like distributions) and skewed distributions. As the stationary distribution of ARCH models is not known in closed form, one cannot say that the ARCH model belongs to the above class. However, the stationary distribution of the ARCH model has fat tails and only a finite number of moments, like the t-distribution. Conditionally, the variances of ARCH and SNP models are polynomials in a finite number of lags. One might therefore expect that the conditional density of an ARCH model could be approximated arbitrarily closely by SNP for large $K_z$ and $K_x$. For large L, this may also be true for GARCH models, whose conditional variance is a polynomial in an infinite number of lags. An alternative to using the ARCH framework is to assume that the changing variance follows some latent process. This leads to a stochastic variance or volatility (SV) model [see e.g. Ghysels et al. (1995)]. Assuming for the sake of simplicity of exposition that the drift parameter is zero, a simple SV model for returns $y_t$ has been proposed by Taylor (1986)

$$y_t = \varepsilon_t\exp(\alpha_t/2), \qquad \varepsilon_t \sim \mathrm{NID}(0, 1) , \tag{2.16}$$
(1993) compares the GARCH(1,1), EGARCH(1,0) and ARV(1) models when applied to daily exchange rates from 1/10/1981 to 28/6/1985 for the Pound sterling, Deutsche mark, Yen and Swiss franc vis-à-vis the U.S. dollar. The within-sample performance of the three models is very similar. When the models are used to forecast out-of-sample volatility, the ARCH models exhibit severe biases which do not occur for the SV volatilities. For daily and weekly returns on the S&P 500 index over the periods 7/3/1962 to 12/31/1987 and 7/11/1962 to 12/30/1992 respectively, Kim and Shephard (1994) conclude that a simple first-order SV model fits the data as well as the popular ARCH models. For daily data on the S&P 500 index for the years 1980 to 1987, Danielsson (1994) finds that the EGARCH(2,1) model performs better than the ARCH(5), GARCH(1,2) and IGARCH(1,1,0) models. It also outperforms a simple SV model estimated by simulated maximum likelihood. The difference between the log-likelihood values of a dynamic SV model and the EGARCH model is 25.5 in favor of the SV model, which has four parameters, whereas the EGARCH model has five parameters.
$$y_t = \Omega_t^{1/2}\varepsilon_t , \tag{2.17}$$
with $\varepsilon_t$ being an N × 1 i.i.d. vector with $E\varepsilon_t = 0$ and $\mathrm{Var}(\varepsilon_t) = I_N$, and $\Omega_t$ being the N × N covariance matrix of $y_t$ conditional on information available at time t. In a multivariate linear GARCH(p, q) model, Bollerslev, Engle and Wooldridge (1988) assume that $\Omega_t$ is given by a linear function of the lagged squares and cross-products of the errors and lagged values of $\Omega_t$
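$$\mathrm{vech}(\Omega_t) = \alpha_0 + \sum_{i=1}^{q} A_i\,\mathrm{vech}(y_{t-i}y_{t-i}') + \sum_{i=1}^{p} B_i\,\mathrm{vech}(\Omega_{t-i}) , \tag{2.18}$$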
where vech(·) denotes the operator that stacks the lower portion of an N × N matrix as an N(N + 1)/2 by 1 vector. In (2.18), $\alpha_0$ is an N(N + 1)/2 vector and the $A_i$'s and $B_i$'s are N(N + 1)/2 × N(N + 1)/2 matrices. The number of unknown parameters in (2.18) equals N(N + 1)[1 + N(N + 1)(p + q)/2]/2, and in practice some simplifying assumptions have to be imposed to achieve parsimony. For instance, Bollerslev et al. (1988) use the diagonal GARCH(p, q) model, assuming that the matrices $A_i$ and $B_i$ are diagonal. Other representations include the constant conditional correlation model used by Baillie and Bollerslev (1990) and Vlaar and Palm (1993), who assume the conditional variances to be GARCH processes. Conditions for the parametrization (2.18) to ensure that $\Omega_t$ is positive definite for all values of $\varepsilon_t$ are difficult to check in practice. Engle and Kroner (1995)
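propose a parametrization of the multivariate GARCH process which they refer to as the BEKK (Baba, Engle, Kraft and Kroner) representation
$$\Omega_t = C_0^{*\prime}C_0^{*} + \sum_{k=1}^{K}\sum_{i=1}^{q} A_{ik}^{*\prime}\,y_{t-i}y_{t-i}'\,A_{ik}^{*} + \sum_{k=1}^{K}\sum_{i=1}^{p} G_{ik}^{*\prime}\,\Omega_{t-i}\,G_{ik}^{*} , \tag{2.19}$$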
where $C_0^{*}$, $A_{ik}^{*}$ and $G_{ik}^{*}$ are N × N parameter matrices with $C_0^{*}$ being triangular, and the summation limit K determines the generality of the process. The covariance matrix in (2.19) will be positive definite under weak conditions. Also, this representation is sufficiently general that it includes all positive definite diagonal representations and most positive definite vec representations of the form (2.18). The representation (2.19) is usually more parsimonious in terms of the number of parameters than (2.18). Given that the two parametrizations are found to be equivalent under quite general circumstances, the BEKK parametrization might be preferred because positive definiteness is then ensured quite easily. Engle, Ng and Rothschild (1990) have proposed the factor-ARCH model as a parsimonious structure for the conditional covariance matrix of asset excess returns. These models incorporate the notion that risk on financial assets can be decomposed into a limited number of common factors $f_t$ and an asset-specific (idiosyncratic) disturbance term. A factor structure arises from the Arbitrage Pricing Theory (APT), although APT does not imply that the number of factors is finite. The factor-ARCH model is used by Engle, Ng and Rothschild (1990) to model interest rate risk, while in a companion paper, Ng et al. (1992) consider risk premia and anomalies to the capital asset pricing model (CAPM) on the U.S. stock market. Diebold and Nerlove (1989) apply a one-factor model to exchange rates, whereas King, Sentana and Wadhwani (1994) analyze the links between national stock markets using a factor model. The factor model reads as follows
$$y_t = \mu_t + Bf_t + \varepsilon_t , \tag{2.20}$$
with $y_t$ being an N × 1 vector of returns, $\mu_t$ an N × 1 vector of expected returns, B an N × k matrix of factor loadings, $f_t$ a k × 1 vector of factors with conditional covariance matrix $\Lambda_t$, and $\varepsilon_t$ an N × 1 vector of idiosyncratic shocks with conditional covariance matrix $\Psi_t$. The factors and the idiosyncratic shocks are uncorrelated. The conditional covariance matrix of $y_t$ is then given by
$$\Omega_t = B\Lambda_tB' + \Psi_t . \tag{2.21}$$
When $\Psi_t$ is constant and $\Lambda_t$ has constant (possibly zero) off-diagonal elements, the covariance matrix $\Omega_t$ can be expressed as
$$\Omega_t = \sum_{i=1}^{k} b_ib_i'\lambda_{it} + \Psi , \tag{2.22}$$
where $b_i$ denotes the i-th column of B and $\Psi$ groups the off-diagonal elements of $\Lambda_t$ with the constant elements of the covariance matrix of $\varepsilon_t$. As pointed out by
Engle et al. (1990), the model in (2.22) is observationally equivalent to a similar model with constant $\lambda$'s but time-varying b's. An implication of the factor model (2.22) is that if k < N, we can construct N - k portfolios of assets, i.e. linear combinations of $y_t$, which have constant variance. There are k portfolios which have $\lambda_{it}$ plus a constant as conditional variance. The factor model (2.20) has to be completed by specifying processes for the factor variances. One could, for instance, assume that $\lambda_{it}$ is generated by a univariate GARCH process. Applying a one-factor model to weekly data on the log differences of seven exchange rates vis-à-vis the US dollar for the period July 1973 to August 1985, Diebold and Nerlove (1989) assume that the single common factor has a variance $\lambda_t = \alpha_0 + \theta\sum_{i=1}^{12}(13 - i)f_{t-i}^2$. Notice that their covariance matrix is of dimension seven by seven but contains only nine unknown parameters, namely those of $\Psi$, $\alpha_0$ and $\theta$. By imposing a linearly decreasing pattern on the ARCH coefficients, they achieve a substantial reduction in the number of parameters to estimate. A GARCH(1, 1) specification would instead yield geometrically decreasing ARCH coefficients. An alternative proposed by Engle et al. (1990) consists in assuming that the returns of each of the k factor-representing portfolios follow a GARCH process. For i = 1, ..., k, the conditional variance of the i-th portfolio is then given by
$$\phi_i'\Omega_t\phi_i = \omega_i + \alpha_i(\phi_i'y_{t-1})^2 + \beta_i\,\phi_i'\Omega_{t-1}\phi_i , \tag{2.23}$$
where for simplicity a GARCH(1, 1) model is assumed and $\phi_i$ is an N × 1 vector of portfolio weights. The conditional variances of the portfolios differ from $\lambda_{it}$ by a constant term only, i.e. $\phi_i'\Omega_t\phi_i = \lambda_{it} + \phi_i'\Psi\phi_i$, which together with (2.23) can be substituted into (2.22) so as to express the conditional covariance matrix $\Omega_t$ in (2.22) in terms of the conditional portfolio variances. Notice that $\phi_i'b_i = 1$ and $\phi_i'b_j = 0$, $j \neq i$. While the factor-GARCH model has theoretically appealing features, its estimation requires highly nonlinear methods. Maximum likelihood estimation has been considered, among others, by Lin (1992). Also, when the factor portfolios are not directly observed, an identification issue has to be resolved before the model can be estimated [see Sentana (1992)]. In particular, the factor-representing portfolios have to be identified. In some instances, it is appropriate to assume that the factor-representing portfolios are known and observed. For example, Engle et al. (1990) explain the monthly returns on Treasury bills with maturities ranging from one to twelve months and the value-weighted index of NYSE-AMSE stocks for the period from August 1964 to November 1985. They select two factor-representing portfolios, one with equal weights on each of the bills and zero weight on the stock index, and the other with zero weights on the bills and all weight on the stock index. Models with observed factor-representing portfolios can be consistently estimated in two steps. One can first estimate the univariate models for the portfolios. Using the estimates obtained in the first step, the factor loadings can be estimated consistently up to a sign, as individual assets have a variance which is linear in the factor variances with coefficients that are equal to the squared factor loadings.
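The algebra of the factor structure can be checked numerically. The sketch below builds the covariance matrix of (2.22) for made-up loadings, constructs one possible set of factor-representing portfolios (the columns of $B(B'B)^{-1}$, an illustrative choice that satisfies $\phi_i'b_i = 1$ and $\phi_i'b_j = 0$), and finds the N - k portfolios orthogonal to the loadings. All numbers and function names are hypothetical.

```python
import numpy as np

def factor_covariance(B, lam_t, Psi):
    """Omega_t = sum_i b_i b_i' lam_it + Psi, cf. (2.22)."""
    return (B * lam_t) @ B.T + Psi

def factor_representing_portfolios(B):
    """Columns phi_i with phi_i'b_i = 1 and phi_i'b_j = 0 (one possible choice)."""
    return B @ np.linalg.inv(B.T @ B)

def constant_variance_portfolios(B):
    """Orthonormal basis of the N - k portfolios w with w'B = 0."""
    U, _, _ = np.linalg.svd(B, full_matrices=True)
    return U[:, B.shape[1]:]

B = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, -0.3], [1.0, 1.0]])   # N = 4, k = 2
Psi = np.diag([0.1, 0.2, 0.3, 0.4])
Phi = factor_representing_portfolios(B)
W = constant_variance_portfolios(B)

for lam_t in (np.array([1.0, 2.0]), np.array([3.0, 0.5])):
    Omega = factor_covariance(B, lam_t, Psi)
    port_var = np.diag(Phi.T @ Omega @ Phi)    # lam_it plus a constant
    null_var = np.diag(W.T @ Omega @ W)        # does not depend on lam_t
    print(port_var - lam_t, null_var)
```

The printed differences are the same for both values of the factor variances, illustrating that the factor-representing portfolios have $\lambda_{it}$ plus a constant as variance while the remaining N - k portfolios have constant variance.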
King et al. (1994) estimate a multivariate factor model as in (2.20) from monthly data on US dollar excess returns for 16 national stock markets for the period 1970,1 to 1988,10 using the maximum likelihood method. They assume that the risk premium $\mu_t$ can be expressed as $\mu_t = B\Lambda_t\tau$, with $\Lambda_t$ being a diagonal matrix and $\tau$ being a k × 1 vector of constant parameters representing the price of risk for each factor. King et al. (1994) consider the model for k = 6 with 4 observed and 2 unobserved factors. The observable factors represent the unanticipated shocks to asset returns. These shocks are estimated as the common factors extracted from a four-factor model applied to the residuals from a vector autoregression for $x_t$, a set of 10 observed macroeconomic variables. The variances of the common and idiosyncratic terms are assumed to follow univariate GARCH(1, 1) processes in which the past squared values of the factors are replaced by their linear projection given some available information set. Notice that when the covariance matrix of the factor-GARCH model depends on prior unobservables, the return components have a conditionally stochastic volatility representation [see Anderson (1992), Harvey et al. (1992)]. A major finding is that only a small proportion of the covariances between national stock markets and their time variation can be explained by observed factors. Conditional second moments are explained to a large extent by unobserved factors. This finding underlines the usefulness of models allowing for unobservable factors in explaining volatility within markets and volatility spillovers between markets. The application in King et al. (1994) also illustrates the appropriateness and feasibility of using factor models to explain the time dependence in second-order moments of a multivariate time series of dimension 16.
While it was possible to jointly estimate the factor model with some 200 parameters, the authors had to estimate the vector autoregression for $x_t$ separately in a first step. Given that the dimension of the parameter space of multivariate factor-GARCH models will usually be high, two-step estimation procedures will be a feasible alternative to fully joint estimation procedures based on the likelihood principle.
characteristic polynomial $\det[I - A(\lambda^{-1}) - B(\lambda^{-1})] = 0$ lie inside the unit circle. In that case, there will be no persistence in the variance. On the other hand, if some eigenvalues lie on the unit circle, shocks to the conditional covariance matrix remain important for forecasts at all horizons. If the eigenvalues are outside the unit circle, the effect of a shock to the covariance matrix will explode over time. Notice that the above conditions on the roots of the characteristic polynomial also apply to the BEKK model (2.19), as shown by Engle and Kroner (1995). In many empirical studies of financial data using univariate GARCH(p, q) models, the estimated parameters are found to have a sum close to one. A detailed survey of the literature can be found in Bollerslev, Chou and Kroner (1992). The multivariate k-factor model (2.20) with a GARCH(p, q) process of the form (2.23) for the factors will be covariance stationary if the portfolios and $\varepsilon_t$ are covariance stationary. In line with the concept of cointegration between a set of variables, Bollerslev and Engle (1993) put forward a definition of co-persistence in variance. The basic idea is that several time series may show persistence in the variance while at the same time some linear combinations of the variables may exhibit no persistence in the variance. Bollerslev and Engle (1993) derive necessary and sufficient conditions for co-persistence in the variances of a multivariate GARCH(p, q) process. In practice, co-persistence in the variances allows one to construct portfolios with stationary volatilities from assets which have nonstationary return volatilities. The finding of unit roots in multivariate GARCH models has led to new developments in factor-ARCH models. Engle and Lee (1993) formulate a factor model of the form used by King et al. (1994), within which they allow for permanent IGARCH(1,0,1) and transitory GARCH(1,1) components in the volatilities.
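The role of the sum of the estimated parameters can be illustrated in the univariate GARCH(1, 1) case, where the k-step-ahead forecast of the conditional variance follows the recursion $E_t h_{t+k} = \alpha_0 + (\alpha_1 + \beta_1)E_t h_{t+k-1}$. The sketch below, with hypothetical parameter values, compares a stationary case with an integrated one.

```python
import numpy as np

def garch11_variance_forecast(h_next, alpha0, alpha1, beta1, horizons):
    """Forecasts E_t[h_{t+k}], k = 1..horizons, starting from h_{t+1} = h_next."""
    persistence = alpha1 + beta1
    out = np.empty(horizons)
    out[0] = h_next
    for k in range(1, horizons):
        out[k] = alpha0 + persistence * out[k - 1]
    return out

# Stationary case: the effect of the initial shock dies out geometrically.
stationary = garch11_variance_forecast(3.0, 0.1, 0.1, 0.8, 200)
# Integrated case (alpha1 + beta1 = 1): a shock shifts forecasts at all horizons.
igarch_high = garch11_variance_forecast(3.0, 0.1, 0.2, 0.8, 200)
igarch_low = garch11_variance_forecast(1.0, 0.1, 0.2, 0.8, 200)
print(stationary[-1], (igarch_high - igarch_low)[[0, 50, 199]])
```

In the stationary case the forecast converges to the unconditional variance $\alpha_0/(1 - \alpha_1 - \beta_1)$, whereas in the integrated case the gap between the two forecast paths never closes.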
Engle and Lee (1993) apply several variants of the component model to daily returns on the CRSP value-weighted index and fourteen individual stocks of large U.S. companies for a sample period from July 1, 1962 to December 31, 1991. Their major empirical finding is that the persistence of individual return volatilities is due to the persistence of both market volatility (assumed to be a common factor) and idiosyncratic volatilities of individual stocks. These results imply that the hypothesis that stock return volatility is co-persistent with market volatility is rejected when market shocks are assumed not to affect idiosyncratic volatility. Using a factor-component-GARCH model with observed factors, Palm and Urbain (1995) also find significant persistence in the common and idiosyncratic factor volatilities using daily observations on returns of stock price indices for Europe, the Far East and North America for the period February 1982-August 1995. While the use of factor-component-GARCH models is still in its infancy, the empirical finding of persistence in return volatilities [see e.g. French, Schwert and Stambaugh (1987), Chou (1988), Pagan and Schwert (1990), Ding et al. (1993) and Engle and González-Rivera (1991)], common factor and/or idiosyncratic factor volatilities raises a number of important questions. For instance, is the finding of persistence in volatilities in agreement with the stationarity assumption for asset returns which has often been made in the literature? Would finance
theory not predict that a nonstationarity in the volatility leads to a nonstationarity in asset returns? What is the precise form of the persistence in volatilities and in the return series? Should it be modeled as a unit root in the permanent component of the conditional variances, should one allow for fractional integration, or should it be modeled as regime switches as e.g. in Cai (1994) or in Hamilton and Susmel (1994)? There is increasing evidence that return series exhibit fractional integration [see e.g. Baillie (1994)]. The difficulty of empirically distinguishing between persistence arising from unit roots and persistence arising from fractional differencing is due to the low power of many existing testing procedures.
3. Statistical inference
$$L(y \mid \theta) = \sum_{t=1}^{T} L_t , \tag{3.1}$$
where $L_t = c - \tfrac{1}{2}\ln h_t - \tfrac{1}{2}y_t^2/h_t$, with $\theta = (\alpha_0, \alpha_1, \beta_1)'$, $h_1 = \sigma^2 = \alpha_0/(1 - \alpha_1 - \beta_1)$ and $h_t$ given by (2.2) for t > 1. Given initial values for the parameter vector $\theta$, the log-likelihood function (3.1) can be evaluated by computing $h_t$, t = 1, 2, ..., T recursively and substituting the values in (3.1). Standard numerical algorithms can be used to compute the maximum of (3.1). As is well known, under regularity conditions given for instance in Crowder (1976), the value of $\theta$ which maximizes L, $\hat{\theta}_{ML}$, is consistent, asymptotically normally distributed and efficient
$$\sqrt{T}(\hat{\theta}_{ML} - \theta) \sim N(0, \mathrm{Var}(\hat{\theta}_{ML})) , \tag{3.2}$$
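where $\mathrm{Var}(\hat{\theta}_{ML}) = -\big[T^{-1}\sum_{t=1}^{T}E\,\partial^2L_t/\partial\theta\,\partial\theta'\big]^{-1}$. The asymptotic covariance matrix of $\hat{\theta}_{ML}$ can be consistently estimated by the inverse of the Hessian matrix associated with (3.1), evaluated at $\hat{\theta}_{ML}$. A proof of the consistency and asymptotic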
normality of the ML estimator in GARCH(1, 1) and IGARCH(1, 1) models is given by Lumsdaine (1992) under the condition that $E[\ln(\alpha_1\varepsilon_t^2 + \beta_1)] < 0$. The existence of finite fourth moments of $\varepsilon_t$ is not required. Unlike in models with a unit root in the conditional mean, the ML estimators in models with and without a unit root in the conditional variance have the same limiting distribution. As shown by Weiss (1986) for time series models with ARCH errors, and by Bollerslev and Wooldridge (1992) and Gouriéroux (1992) for GARCH processes, the quasi-ML or pseudo-ML estimator of $\theta$ is obtained by maximizing the normal log-likelihood function (3.1) although the true probability density function is non-normal. Under regularity conditions the QML estimator has the following asymptotic distribution
$$\sqrt{T}(\hat{\theta}_{QML} - \theta) \sim N(0, B^{-1}AB^{-1}) , \tag{3.3}$$
where $A = E_0[\partial L_t/\partial\theta\;\partial L_t/\partial\theta']$ is the covariance matrix of the score vector of L and $B = E_0[\partial^2L_t/\partial\theta\,\partial\theta']$, where $E_0$ denotes the expectation conditional on the true probability density function for the data. Of course, if the latter is the normal distribution, the asymptotic distributions in (3.2) and (3.3) will be identical. Lee and Hansen (1994) prove consistency and asymptotic normality of the QML estimator of the Gaussian GARCH(1, 1) model. The disturbance scaled by its conditional standard deviation need not be normally distributed nor independent over time. The GARCH process may be integrated ($\alpha_1 + \beta_1 = 1$) and even explosive ($\alpha_1 + \beta_1 > 1$), provided the conditional fourth moment of the scaled disturbance is bounded. In finite samples, for symmetric departures from conditional normality, the QML estimator has been found to be close to the exact ML estimator in a simulation study by Bollerslev and Wooldridge (1992). For nonsymmetric true conditional distributions, the loss of efficiency of QML compared to exact ML can be quite substantial, both in small and in large samples. Semiparametric density estimation as proposed by Engle and González-Rivera (1991), using a linear spline with smoothness priors, will then be an attractive alternative to QML. With respect to ML and QML methods to estimate GARCH models, some comments can be made. First, although GARCH generates fat tails in the unconditional distribution, when combined with conditional normality it does not fully account for the excess kurtosis present in many financial data. The Student t-distribution with the number of degrees of freedom to be estimated has been used by several authors. Other densities which have been used in the estimation of GARCH models are the normal-Poisson mixture [see e.g. Jorion (1988), Nieuwland et al. (1991)], the normal-lognormal mixture distribution [e.g. Hsieh (1989)], the generalized error distribution [see e.g. Nelson (1991)] and the Bernoulli-normal mixture [Vlaar and Palm (1993)].
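The recursive evaluation of the Gaussian log-likelihood (3.1) for a GARCH(1, 1) process can be sketched as follows. The sketch assumes a zero conditional mean and the variance recursion $h_t = \alpha_0 + \alpha_1y_{t-1}^2 + \beta_1h_{t-1}$, and it falls back on the sample variance as start-up value when $\alpha_1 + \beta_1 \ge 1$; that fallback and the function name are assumptions for illustration. The returned value could be passed, with its sign changed, to any standard numerical optimizer.

```python
import numpy as np

def garch11_loglik(theta, y):
    """Gaussian log-likelihood and conditional variances of a GARCH(1,1) model."""
    alpha0, alpha1, beta1 = theta
    y = np.asarray(y, dtype=float)
    T = y.size
    h = np.empty(T)
    if alpha1 + beta1 < 1.0:
        h[0] = alpha0 / (1.0 - alpha1 - beta1)   # unconditional variance as h_1
    else:
        h[0] = y.var()                           # assumed fallback for the IGARCH case
    for t in range(1, T):
        h[t] = alpha0 + alpha1 * y[t - 1] ** 2 + beta1 * h[t - 1]
    ll_t = -0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(h) - 0.5 * y ** 2 / h
    return ll_t.sum(), h

y_demo = np.array([1.0, -2.0, 0.5, 0.3, -1.2])
ll, h = garch11_loglik((0.2, 0.1, 0.7), y_demo)
print("log-likelihood:", ll)
print("conditional variances:", h)
```

Under the QML interpretation the same function is maximized even when the scaled disturbances are not normal; only the covariance matrix of the estimator changes, from (3.2) to the form in (3.3).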
De Vries (1991) proposes a GARCH-like process with a conditional stable distribution, which models the clustering of volatility, has fat tails and has an unconditional stable distribution. Second, for some models, such as the regression model with conditionally normal ARCH disturbances, the information matrix is block-diagonal [see e.g. Engle (1982)]. The implications are important in that the regression coefficients
and the ARCH parameters can be estimated separately without loss of asymptotic efficiency. Also, their variances can be obtained separately. These results have been generalized by Linton (1993), who shows that the parameters of the conditional mean are adaptive in the sense of Bickel when the errors follow a stationary ARCH(q) process with an unknown conditional density which is symmetric about zero. In other words, estimating the unknown score function using the kernel method based on the normal density function yields parameter estimates of the conditional mean which have the same asymptotic distribution as the ML estimator based on the true distribution. This block-diagonality does not hold for the ARCH-M model, as there the conditional mean of a series depends on parameters of the conditional variance process. For an EGARCH disturbance process, the block-diagonality of the information matrix also fails to hold.
Indirect inference, put forward by Gouriéroux and Monfort (1993), and the efficient method of moments of Gallant et al. (1994) will be attractive when it is difficult to apply QML or ML but it is possible to estimate some function of the parameters of interest from the data. The indirect estimator has been used by Engle and Lee (1994) to estimate diffusion models of stochastic volatility. As a starting point, they estimate GARCH(1,1) models from daily returns on the S&P 500 Index for the period 1991,11990,9. The resulting QML estimates for $\theta$ are used to estimate the parameters of the underlying diffusion model for the asset price $p_t$ and its conditional variance $\sigma_t^2$
(a) $y_t = \mu\,dt + \sigma_t\,dw_{yt}$
(b) $d\sigma_t^2 = \phi(m - \sigma_t^2)\,dt + \psi\sigma_t^{2\delta}\,dw_{\sigma t}$
(c) $\mathrm{correl}(dw_y, dw_\sigma) = \rho$   (3.4)
with $y_t = dp_t/p_t$, $dw_y$ and $dw_\sigma$ being Wiener processes, using the relationships which match the first- and second-order conditional moments of the GARCH model and the diffusion model (see Nelson (1990b)): $m = \alpha_0$, $\phi = (1 - \alpha_1 - \beta_1)\,dt$, $\psi = \alpha_1\sqrt{\kappa - 1}\,dt$, $\delta = 1$, with $\kappa$ being the conditional kurtosis of the shocks of the GARCH model. Indirect estimation based on estimates of a discrete time GARCH model appears to be an appropriate way to estimate the parameters of the underlying diffusion process.
To estimate stochastic volatility models, Gallant et al. (1994) use an indirect method based on the score of two auxiliary models. Both auxiliary models assume an SNP density as given in (2.15). When the SNP density is in the form of an ARCH model with conditionally homogeneous non-Gaussian innovations, it is termed the nonparametric ARCH model because it is similar to the nonparametric ARCH process considered by Engle and Gonzalez-Rivera (1991). In the second model, the homogeneity constraint is dropped and the model is called the fully nonparametric specification. The SNP models are estimated by QML. Gallant et al. (1994) use daily observations on the S&P Composite Index for the period 1928-1987 to estimate a univariate model and daily observations for the period 1977-1992 to estimate a trivariate model for the S&P NYSE Index, the
DM/$ exchange rate and the three-month Eurodollar interest rate. The stochastic volatility model is found to be able to match the ARCH part of the nonparametric ARCH score for stock prices and interest rates. However, it does not match the moments of the distribution of the innovations. For the exchange rate series, the stochastic volatility model fails to fit the ARCH part.
Testing for the presence of ARCH(q) has also been extensively considered in the literature. A simple and frequently used test of the hypothesis $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_q = 0$ against the alternative $H_1: \alpha_1 \ge 0, \dots, \alpha_q \ge 0$ with at least one strict inequality is the Lagrange multiplier (LM) test proposed by Engle
$$LM = \tfrac{1}{2}\, f^{0\prime} z (z'z)^{-1} z' f^{0}, \qquad (3.5)$$
where $z_t = (1, y_{t-1}^2, \dots, y_{t-q}^2)'$, $z = (z_1, \dots, z_T)'$ and $f^0$ is the column vector with elements $(y_t^2/\alpha_0 - 1)$.
An asymptotically equivalent statistic is $LM = TR^2$, where $R^2$ is the squared multiple correlation between $f^0$ and $z$ and $T$ is the sample size. This is also the $R^2$ of a regression of $y_t^2$ on an intercept and $q$ lagged values of $y_t^2$. As shown by Engle (1982), a two-sided LM test has an asymptotic $\chi^2$-distribution with $q$ degrees of freedom. Demos and Sentana (1991) report critical values for the one-sided LM test which are robust to nonnormality. A difficulty in constructing LM tests for GARCH disturbances is that the block of the information matrix whose inverse is required is singular, as pointed out by Bollerslev (1986). This is due to the fact that under the null hypothesis, $\beta_1$ in the GARCH(1,1) model is not identified. Lee (1991) has shown how this difficulty can be avoided and that the LM tests for ARCH and GARCH errors are identical. Lee and King (1993) derive a locally most mean powerful (LMMP)-based score (LBS) test for the presence of ARCH and GARCH disturbances. The test is based on the sum of the scores evaluated at the null hypothesis, with nuisance parameters replaced by their ML estimates. In the absence of nuisance parameters, the test is LMMP. The sum of the scores is then standardized by dividing it by its large sample standard error. The resulting test statistic has an asymptotic N(0,1) distribution. The test statistics used to test against an ARCH(q) process can also be used to test against a GARCH(p, q) process. In small samples, the LBS test appears to have better power than the LM test and its asymptotic critical values were found to be at least as accurate. Wald and likelihood ratio (LR) criteria could be used to test the hypothesis of conditional homoskedasticity, e.g. against a GARCH(1,1) alternative.
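The $TR^2$ form of the LM test lends itself to a short sketch. The version below is restricted to $q = 1$, so that $R^2$ reduces to the squared sample correlation between $y_t^2$ and $y_{t-1}^2$; the function name `arch_lm_stat_q1` and the convention of taking $T$ as the number of usable observations after lagging are assumptions of this example rather than something stated in the text.

```python
def arch_lm_stat_q1(y):
    """T * R^2 from regressing y_t^2 on an intercept and y_{t-1}^2.

    Illustrative ARCH(1) case only; T is taken to be the number of
    observations left after losing one to the lag (an assumption).
    """
    squares = [v * v for v in y]
    dep, lag = squares[1:], squares[:-1]
    n = len(dep)
    if n < 3:
        raise ValueError("need at least four observations")
    mean_dep = sum(dep) / n
    mean_lag = sum(lag) / n
    sxx = sum((x - mean_lag) ** 2 for x in lag)
    syy = sum((d - mean_dep) ** 2 for d in dep)
    sxy = sum((x - mean_lag) * (d - mean_dep) for x, d in zip(lag, dep))
    if sxx == 0.0 or syy == 0.0:
        # No variation in the squares: nothing for the regression to explain.
        return 0.0
    r_squared = sxy * sxy / (sxx * syy)
    return n * r_squared
```

For the two-sided test the statistic would be compared with a $\chi^2$ critical value with one degree of freedom (3.84 at the 5% level). A general $q$ would need a multiple regression, which is omitted here to keep the sketch free of external libraries.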
The statistics associated with $H_0: \alpha_1 = 0$ and $\beta_1 = 0$ against $H_1: \alpha_1 \ge 0,\ \beta_1 \ge 0$ with at least one strict inequality do not have a $\chi^2$-distribution with two degrees of freedom, as the standard assumption that the true parameter value under $H_0$ does not lie on the border of the parameter space does not hold. An LR test which uses a $\chi^2$-distribution with two degrees of freedom can be shown to be conservative [see e.g. Kodde and Palm (1986)]. Also, the problem of lack of identification of some parameters mentioned above can lead to a breakdown of standard Wald and LR testing procedures. These ARCH statistics test for specific forms of conditional
heteroskedasticity. Many tests, however, have been designed to test for general departures from independently, identically distributed random variables. For instance, the BDS test put forward by Brock, Dechert and Scheinkman (1987) tests for general nonlinear dependence. Its power against ARCH alternatives is similar to that of the LM-ARCH test [see e.g. Brock, Hsieh and LeBaron (1991)]. For other alternatives, the power of the BDS test may be higher. The application by Bera and Lee (1993) of the White Information Matrix (IM) criterion to the linear regression model with autoregressive disturbances leads to a generalization of Engle's LM test for ARCH in which ARCH processes are specified as random coefficient autoregressive models. Several authors have noted that ARCH can be given a random coefficient interpretation [see e.g. Tsay (1987)]. Bera, Lee and Higgins (1992) point out the dangers of tackling specification problems one at a time rather than considering them jointly and provide a framework for analyzing autocorrelation and ARCH simultaneously. That such a framework is needed has been illustrated convincingly by e.g. Diebold (1987), who shows that in the presence of ARCH, standard tests for serial correlation will lead to overrejection of the null hypothesis. Notice that the presence of ARCH could be interpreted in several ways, such as nonnormality (excess kurtosis, skewness for asymmetric ARCH) [see e.g. Engle (1982)] and nonlinearity [see e.g. Higgins and Bera (1992)].
Recently Bollerslev and Wooldridge (1992) have developed robust LM tests for the adequacy of the jointly parametrized mean and variance. Their test is based on the gradient of the log-likelihood function evaluated at the constrained QML estimator and can be computed from simple auxiliary regressions. Only first derivatives of the conditional mean and variance functions are required.
The authors present simulation results revealing that in most cases, the robust test statistics compare favorably to nonrobust (standard) Wald and LM tests. This conclusion is in line with findings by Lumsdaine (1995), who compares GARCH(1,1) and IGARCH(1,1) models in a simulation study of the finite-sample properties of the ML estimator and related test statistics. While the asymptotic distribution is found to approximate well that of the estimated t-statistics, parameter estimators are skewed in finite samples. Wald tests have the best size; the standard LM test is highly oversized, but versions that are robust to possible nonnormality perform better.
Various model diagnostics have been proposed in the literature. For instance, Li and Mak (1994) examine the asymptotic distribution of the squared standardized residual autocorrelations from a Gaussian process with time-dependent conditional mean and variance estimated by ML. The residuals are standardized by dividing them by their conditional standard deviation and subtracting their sample mean. The conditional mean and variance of the process can be nonlinear functions of the information available at time t. These functions are assumed to have continuous second-order derivatives. When the data generating process is ARCH(q), a Box-Pierce type portmanteau test based on autocorrelations of squared standardized residuals of order $r$ up to $M$ will have an asymptotic $\chi^2$-distribution with $M - r$ degrees of freedom when $r > q$. These types of diagnostics are very useful for checking the adequacy of the model.
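As a rough illustration of a portmanteau diagnostic on squared standardized residuals, the following sketch computes the plain Box-Pierce form $T\sum_k r_k^2$ of the sample autocorrelations of the squares. It is not the statistic of Li and Mak (1994), whose weighting accounts for parameter estimation; the function name `box_pierce_squared` and the arguments `first_lag` and `max_lag` are hypothetical names chosen for this example.

```python
def box_pierce_squared(std_resid, max_lag, first_lag=1):
    """Plain Box-Pierce statistic on squared standardized residuals.

    Sums the squared sample autocorrelations of e_t^2 over lags
    first_lag..max_lag and multiplies by the sample size. This is the
    textbook form only, used here for illustration.
    """
    squares = [e * e for e in std_resid]
    n = len(squares)
    mean = sum(squares) / n
    denom = sum((s - mean) ** 2 for s in squares)
    if denom == 0.0:
        return 0.0
    total = 0.0
    for k in range(first_lag, max_lag + 1):
        # Lag-k sample autocorrelation of the squared residuals.
        r_k = sum((squares[t] - mean) * (squares[t - k] - mean)
                  for t in range(k, n)) / denom
        total += r_k * r_k
    return n * total
```

The degrees of freedom to use with such a statistic should be taken from the result quoted above rather than from this sketch.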
Specific kinds of hypotheses can arise in multivariate GARCH models. For instance, GARCH can be a common feature to several time series. Engle and Kozicki (1993) define a feature that is present in a group of time series as common to those series if there exists a nonzero linear combination of the series that does not have the feature. As an example, consider the bivariate version of the factor-ARCH model in (2.20) with one factor and constant idiosyncratic factor covariance matrix. If the variance of $f_t$ follows a GARCH process, the series $y_{it}$ will also be GARCH, but the linear combination $y_{1t} - (b_1/b_2)y_{2t}$ will have a constant conditional variance. In this example, the series $y_{1t}$ and $y_{2t}$ share a common feature in the form of a common factor with a time-varying conditional variance. Engle and Kozicki (1993) put forward tests for common features. Engle and Susmel (1993) apply the procedure to test for ARCH as a common feature in international equity markets. The approach is as follows. First, test for the presence of ARCH in the individual time series. Second, if the ARCH effects are significant in both series, consider the linear combination $y_{1t} - \delta y_{2t}$, regress its squared value on lagged squared values and lagged cross products of the series $y_{it}$ up to lag $q$, and minimize $TR^2(\delta)$ over the coefficient $\delta$. If instead of two series a set of $k$ series is considered, $\delta$ becomes a $(k - 1) \times 1$ vector. As shown by Engle and Kozicki (1993), the test statistic which minimizes $TR^2(\delta)$ with respect to $\delta$ has a $\chi^2$-distribution with degrees of freedom given by the number of lagged squared values included in the regressions minus $(k - 1)$. Engle and Susmel (1993) applied the test to weekly returns on stock market indexes for 18 major stock markets in the world over the period January 1980 to January 1990. They found two groups of countries, one of European countries and one of Far East countries, which show similar time-varying volatility.
The common feature tests therefore confirm the existence of a common factor-ARCH structure for each group.
4. Statistical properties
In this section, we shall summarize the main results about the statistical properties of GARCH models and give appropriate references to the literature.
4.1. Moments
Bollerslev (1986) has shown that under conditional normality, the GARCH process (2.2) is wide sense stationary with $E y_t = 0$, $\mathrm{var}(y_t) = \alpha_0[1 - \alpha(1) - \beta(1)]^{-1}$ and $\mathrm{cov}(y_t, y_s) = 0$ for $t \ne s$ if and only if $\alpha(1) + \beta(1) < 1$. For the GARCH(1,1) model given in (2.2), a necessary and sufficient condition for the existence of the $2r$th moment is $\sum_{j=0}^{r} \binom{r}{j} a_j \alpha_1^j \beta_1^{r-j} < 1$, where $a_0 = 1$ and $a_j = \prod_{i=1}^{j}(2i - 1)$, $j = 1, 2, \dots$ Bollerslev (1986) also provides a recursive formula for even moments of $y_t$ when $p = q = 1$. The fourth moment of a conditionally normal GARCH(1,1) variable will be $E y_t^4 = 3(E y_t^2)^2[1 - (\beta_1 + \alpha_1)^2]/[1 - (\beta_1 + \alpha_1)^2 - 2\alpha_1^2]$ if it exists. As a result of the symmetry of the normal distribution, odd moments are zero if they exist. These results extend results for the ARCH(q) process given in Engle (1982). The condition given above is sufficient for strict stationarity but not necessary.
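The moment formulas for the conditionally normal GARCH(1,1) model can be checked numerically with a small sketch. The function names below are hypothetical; the expressions follow the variance formula, the $2r$th moment condition and the fourth-moment formula quoted above.

```python
import math


def garch11_unconditional_variance(alpha0, alpha1, beta1):
    """alpha0 / (1 - alpha1 - beta1), defined only if alpha1 + beta1 < 1."""
    persistence = alpha1 + beta1
    if persistence >= 1.0:
        raise ValueError("process is not wide sense stationary")
    return alpha0 / (1.0 - persistence)


def moment_condition(alpha1, beta1, r):
    """Sum over j of C(r, j) * a_j * alpha1^j * beta1^(r-j).

    The 2r-th moment exists if and only if the returned value is < 1,
    with a_0 = 1 and a_j = 1 * 3 * ... * (2j - 1).
    """
    total = 0.0
    a_j = 1.0
    for j in range(r + 1):
        if j > 0:
            a_j *= 2 * j - 1
        total += math.comb(r, j) * a_j * alpha1 ** j * beta1 ** (r - j)
    return total


def garch11_kurtosis(alpha1, beta1):
    """E y^4 / (E y^2)^2 under conditional normality, if it exists."""
    if moment_condition(alpha1, beta1, 2) >= 1.0:
        raise ValueError("fourth moment does not exist")
    p2 = (alpha1 + beta1) ** 2
    return 3.0 * (1.0 - p2) / (1.0 - p2 - 2.0 * alpha1 ** 2)
```

With $\alpha_1 = 0.1$ and $\beta_1 = 0.8$, for instance, the kurtosis evaluates to $3 \times 0.19/0.17 \approx 3.35$, which exceeds the normal value of three, in line with the fat tails discussed in the text.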
As shown in Krengel (1985), strict stationarity of a vector ARCH process $y_t$ is equivalent to the conditions that $\Omega_t = \Omega(y_{t-1}, y_{t-2}, \dots)$ is measurable and $\mathrm{trace}(\Omega_t\Omega_t') < \infty$ a.s. [see also Bollerslev et al. (1994)]. Moment boundedness, i.e. $E[\mathrm{trace}(\Omega_t\Omega_t')^r]$ being finite for some $r > 0$, implies $\mathrm{trace}(\Omega_t\Omega_t') < \infty$ a.s. Nelson (1990a) has shown that for the GARCH(1,1) model (2.2), $y_t$ is strictly stationary if and only if $E[\ln(\beta_1 + \alpha_1\varepsilon_t^2)] < 0$, with $\varepsilon_t$ being i.i.d. (not necessarily conditionally normal) and $\varepsilon_t^2$ nondegenerate. This requirement is much weaker than $\alpha_1 + \beta_1 < 1$. He has also shown that the IGARCH(1,1) model without drift converges almost surely to zero, while in the presence of a positive drift it is strictly stationary and ergodic. Extensions to general univariate GARCH(p, q) processes have been obtained by Bougerol and Picard (1992).
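Nelson's condition can be evaluated numerically once a distribution for $\varepsilon_t$ is fixed. The sketch below assumes standard normal $\varepsilon_t$ (the condition itself does not require normality) and approximates $E[\ln(\beta_1 + \alpha_1\varepsilon_t^2)]$ by a midpoint rule on a truncated grid; the function names, the truncation at ten standard deviations and the number of grid cells are choices made for this illustration only.

```python
import math


def nelson_log_moment(alpha1, beta1, half_width=10.0, cells=20000):
    """Midpoint-rule approximation of E[ln(beta1 + alpha1 * eps^2)].

    eps is assumed standard normal here; beta1 > 0 keeps the integrand
    finite at eps = 0.
    """
    if alpha1 < 0 or beta1 <= 0:
        raise ValueError("need alpha1 >= 0 and beta1 > 0")
    step = 2.0 * half_width / cells
    total = 0.0
    for i in range(cells):
        z = -half_width + (i + 0.5) * step
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += math.log(beta1 + alpha1 * z * z) * density * step
    return total


def is_strictly_stationary(alpha1, beta1):
    """True when the approximated log-moment is negative."""
    return nelson_log_moment(alpha1, beta1) < 0.0
```

Under these assumptions an IGARCH(1,1) parameterization such as $\alpha_1 = 0.2$, $\beta_1 = 0.8$ gives a negative value, since by Jensen's inequality the expected logarithm lies strictly below $\ln(\alpha_1 + \beta_1) = 0$, which illustrates why the condition is weaker than $\alpha_1 + \beta_1 < 1$.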
GARCH models are nonlinear stochastic difference equations which can be estimated more easily than the stochastic differential equations used in the theoretical finance literature to model time-varying volatility. In practice, observations are usually recorded at discrete points in time, so that a discrete time model or a discrete time approximation to a continuous model will have to be used in statistical inference. Nelson (1990b) derives conditions for the convergence of stochastic difference equations, among which ARCH processes, to stochastic differential equations as the length of the interval between observations $h$ goes to zero. He applies these results to the GARCH(1,1) and the EGARCH model. Nelson (1992) investigates the properties of estimates of the conditional covariance matrix generated by a misspecified ARCH model. When a diffusion process is observed at discrete time intervals of length $h$, the difference between an estimate of its conditional instantaneous covariance matrix based on a GARCH(1,1) model or on an EGARCH model and the true value converges to zero in probability as $h \to 0$. The required regularity conditions are that the distribution does not have fat tails and that the conditional covariance matrix moves smoothly over time. Using high-frequency data, misspecified ARCH models can yield accurate estimates of volatility. In a way, the GARCH model, which averages squared values of variables, can be interpreted as a nonparametric estimate of the conditional variance at time $t$. Discrete time models can also be approximated by continuous time diffusion models. Different ARCH models will in general have different diffusion limits. As shown by Nelson (1990b), the continuous limit may yield convenient approximations for forecasts and other moments when a discrete time model leads to intractable distributions.
Nelson and Foster (1994) examine the issue of selecting an ARCH process to consistently and efficiently estimate the conditional variance of the diffusion process generating the data. They obtain the approximate distribution of the measurement error resulting from the use of an approximate ARCH filter. Their result makes it possible to compare the efficiency of various ARCH filters and to characterize asymptotically optimal ARCH conditional variance estimates. They derive optimal ARCH filters for three diffusion models and examine the filtering
properties of several GARCH models. For instance, if the data generating process is given by the diffusion equations (3.4) with independent Brownian motions ($\rho = 0$) and $\delta = 1$, the asymptotically optimal filter for $\sigma_t^2$ sets the drift for $y_t$ to $\hat\mu = \mu$ and the conditional variance
$$\hat\sigma_{t+h}^2 = \omega h + (1 - \phi h - \alpha h^{1/2})\hat\sigma_t^2 + h^{1/2}\alpha\,\varepsilon_{y,t+h}^2 \qquad (4.1)$$
with $\varepsilon_{y,t+h} = h^{-1/2}[y_{t+h} - y_t - E_t(y_{t+h} - y_t)]$, $\omega = m\phi$ and $\alpha = \psi/\sqrt{2}$. The asymptotically optimal filter for (3.4) with independent Brownian motions therefore is the GARCH(1,1) model. When $w_y$ and $w_\sigma$ are correlated, the GARCH(1,1) model (4.1) is no longer optimal. Nelson and Foster (1994) show that the nonlinear asymmetric GARCH model proposed by Engle and Ng (1993) fulfills the optimality conditions in this case. Nelson and Foster (1994) also study the properties of various ARCH filters when the data are generated by a discrete time near-diffusion process. Their findings have important implications for the choice of a functional form for the ARCH filter in empirical research. The use of continuous record asymptotics has greatly enhanced our understanding of the relationship between continuous time stochastic differential equations and discrete time ARCH models as the sampling frequency increases.
Similarly, issues of temporal aggregation play an important role in modeling time-varying volatilities, in particular when an investigator has the choice between using data observed with a high frequency or using observations sampled less frequently. More efficient parameter estimates may be obtained from the high frequency data. On other occasions, an investigator may be interested in the parameters of the high frequency model while only low frequency observations are available. The temporal aggregation problem has been addressed by Diebold (1988), who has shown that the conditional heteroskedasticity disappears in the limit as the sampling frequency decreases and that in the case of flow variables the marginal distribution of the low frequency observations converges to the normal distribution. Drost and Nijman (1993) study the question whether the class of GARCH processes is closed under temporal aggregation when either stock or flow variables are modeled. The question can be answered if some qualifications are made. Three definitions of GARCH are adopted.
The sequence of variables $y_t$ in (2.2) is defined to be generated by a strong GARCH process if $\alpha_0$, $\alpha_i$, $i = 1, 2, \dots, q$ and $\beta_i$, $i = 1, 2, \dots, p$ can be chosen such that $\varepsilon_t = y_t h_t^{-1/2}$ is i.i.d. with mean zero and variance 1. The sequence $y_t$ is said to be semistrong GARCH if $E[y_t \mid y_{t-1}, y_{t-2}, \dots] = 0$ and $E[y_t^2 \mid y_{t-1}, y_{t-2}, \dots] = h_t$, whereas it is weakly GARCH(p, q) if $P[y_t \mid y_{t-1}, y_{t-2}, \dots] = 0$ and $P[y_t^2 \mid y_{t-1}, y_{t-2}, \dots] = h_t$, where $P$ denotes the best linear predictor in terms of a constant, $y_{t-1}, y_{t-2}, \dots, y_{t-1}^2, y_{t-2}^2, \dots$ The main finding of Drost and Nijman (1993) is that the class of symmetric weak GARCH processes for either stock or flow variables is closed under temporal aggregation. This means that if the high frequency process is symmetric (weak) GARCH, the low frequency process will also be symmetric weak GARCH. The parameters of the conditional variance of the low frequency process depend upon the mean, variance and kurtosis of the corresponding high frequency process. The conditional heteroskedasticity disappears as the sampling frequency decreases for GARCH processes with $\sum_{i=1}^{q}\alpha_i + \sum_{i=1}^{p}\beta_i < 1$. The class of strong or semistrong GARCH processes is generally not closed under temporal aggregation, suggesting that strong or semistrong GARCH processes will often be approximations only to the data generating process if the observation frequency does not exactly correspond with the frequency of the data generating process.
In a companion paper, Drost and Werker (1995) study the properties of a continuous time GARCH process, i.e. a process of which the increments $X_{t+h} - X_t$, $t \in h\mathbb{N}$, are weak GARCH for each fixed time interval $h > 0$. In the light of the results by Drost and Nijman (1993), a continuous time GARCH process cannot be strong or semistrong GARCH, as the classes of these processes are not closed under temporal aggregation. The assumption of an underlying continuous time GARCH process leads to a kurtosis in excess of three for the associated discrete GARCH models, implying thick tails. Drost and Werker (1995) show how the parameters of the continuous time diffusion process can be identified from the discrete time GARCH parameters. The relations between the parameters of the continuous and discrete time models can be used to estimate the diffusion model from discrete time observations in a fairly straightforward way. Nijman and Sentana (1993) complement the results of Drost and Nijman (1993) by showing that contemporaneous aggregation of independent univariate GARCH processes yields a weak GARCH process.
Then they generalize this finding by showing that a linear combination of variables generated by a multivariate GARCH process will also be weak GARCH. The marginal processes of multivariate GARCH models will be weak GARCH as well. Finally, from simulation experiments the authors conclude that in many instances, estimators which are ML under the assumption that the process is strong GARCH with conditional normal distribution converge to values close to the weak GARCH parameters as the sample size increases. The findings on temporal and contemporaneous aggregation of GARCH processes indicate that linear transformations of GARCH processes are generally only weak GARCH.
(1986), Granger, White and Kamstra (1989) are concerned with the construction of one-step-ahead forecast intervals with time-varying variances. Baillie and Bollerslev (1992) consider a single equation regression model with ARMA-GARCH disturbances, for which they derive the minimum MSE forecast. They also derive the moments of the forecast error distribution for the dynamic model with GARCH(1,1) disturbances. These moments are used in the construction of forecast intervals using the Cornish-Fisher asymptotic expansion. Geweke (1989) obtains the multistep-ahead forecast error density for linear models with ARCH disturbances by numerical integration within a Bayesian context. Nelson and Foster (1995) derive conditions under which, for data observed at high frequency, a misspecified ARCH model performs well in forecasting a time series process and its volatility. In line with the conditions for successful filtering obtained by Nelson and Foster (1994), the basic requirement is that the ARCH model correctly specifies the functional form of the first two conditional moments of all state variables. To illustrate the construction of estimates of the forecast error variance, consider a stationary AR(1) process
$$y_t = \phi y_{t-1} + u_t, \qquad (4.2)$$
where $u_t = \varepsilon_t h_t^{1/2}$ is a GARCH(1,1) process as in (2.2). The minimum MSE forecast of $y_{t+s}$ at period $t$ is $E_t(y_{t+s}) = \phi^s y_t$. The forecast error $w_{ts} = y_{t+s} - \phi^s y_t$ can be expressed as $w_{ts} = u_{t+s} + \phi u_{t+s-1} + \dots + \phi^{s-1}u_{t+1}$. Its conditional variance at time $t$
$$\mathrm{Var}_t(w_{ts}) = \sum_{i=0}^{s-1}\phi^{2i}E_t(u_{t+s-i}^2), \quad s > 0, \qquad (4.3)$$
can be computed recursively. The GARCH(1,1) process for $u_t$ leads to an ARMA representation for $u_t^2$ [see Bollerslev (1986)]
$$u_t^2 = \alpha_0 + (\alpha_1 + \beta_1)u_{t-1}^2 - \beta_1 v_{t-1} + v_t, \qquad (4.4)$$
with $v_t = u_t^2 - h_t$. The expectations on the r.h.s. of (4.3) can be readily obtained from expression (4.4)
$$E_t(h_{t+s}) = E_t(u_{t+s}^2) = \alpha_0 + (\alpha_1 + \beta_1)E_t(u_{t+s-1}^2), \quad s > 1, \qquad (4.5)$$
as shown by Engle and Bollerslev (1986). As the forecast horizon increases, the optimal forecast converges monotonically to the unconditional variance $\alpha_0/(1 - \alpha_1 - \beta_1)$. For the IGARCH(1,1) model, shocks to the conditional variance are persistent and $E_t(h_{t+s}) = \alpha_0(s - 1) + h_{t+1}$. The expression (4.5) can be used as a forecast of future volatility. Baillie and Bollerslev (1992) derive an expression for the conditional MSE of $E_t(h_{t+s})$ as a forecast of the conditional variance at period $t + s$.
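The recursion (4.5) and the variance sum (4.3) can be sketched in a few lines. The function names and the convention of passing in $h_{t+1}$, which is known at time $t$, as the starting value are assumptions of this illustration.

```python
def garch11_variance_forecasts(h_next, alpha0, alpha1, beta1, horizon):
    """Return [E_t h_{t+1}, ..., E_t h_{t+horizon}] using recursion (4.5).

    h_next is h_{t+1}, which is known at time t.
    """
    forecasts = [h_next]
    for _ in range(horizon - 1):
        forecasts.append(alpha0 + (alpha1 + beta1) * forecasts[-1])
    return forecasts


def ar1_forecast_error_variance(phi, variance_forecasts):
    """Conditional variance (4.3) of the s-step AR(1) forecast error.

    variance_forecasts[k] holds E_t(u_{t+k+1}^2), so the horizon s is
    the length of the list.
    """
    s = len(variance_forecasts)
    return sum(phi ** (2 * i) * variance_forecasts[s - 1 - i]
               for i in range(s))
```

With $\alpha_1 + \beta_1 < 1$ the forecasts returned by the first function approach $\alpha_0/(1 - \alpha_1 - \beta_1)$ as the horizon grows, while with $\alpha_1 + \beta_1 = 1$ they increase linearly by $\alpha_0$ per step, matching the two cases described above.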
5. Conclusions
In this paper, we have surveyed the literature on modeling time-varying volatility using GARCH processes. In reviewing the vast number of contributions we have put most emphasis on recent developments. In less than fifteen years since the pathbreaking publication of Engle (1982), much progress has been made in understanding GARCH models and in applying them to economic time series. This progress has drastically changed the way in which empirical time series research is carried out. At the same time, the statistical properties of time series, in particular of financial time series, which were not accounted for by existing models have led to new developments in the field of volatility modeling. The finding of skewness and skewed correlations, defined as $[(\sum_t y_t^2 y_{t+k})/(T\sigma^3\sqrt{\kappa})]$, fostered the development of asymmetric GARCH models. The presence of excess kurtosis in GARCH models with conditionally normally distributed innovations has led to the use of Student-GARCH models and GARCH-jump models. Persistence in conditional variances was modeled using variance component models with a stochastic trend component. The finding of time variation in conditional covariances and correlations resulted in the development of multivariate GARCH and factor-GARCH models. Factor-GARCH models have several attractive features. First, they can be easily interpreted in terms of economic theory (factor models like the arbitrage pricing theory have been used extensively in finance). Second, they allow for a parsimonious representation of time-varying variances and covariances for a high dimensional vector of variables. Third, they can account for both observed and unobserved factors. Fourth, they have interesting implications for common features of the variables. These common features can be tested in a straightforward way. Fifth, they have appeared to fit well in several instances.
As has become apparent in Section 2, the functional form of time-varying volatility has attracted a lot of attention from researchers, to an extent where one wonders whether the returns from designing new GARCH specifications are still positive. While some specifications are close if not perfect substitutes for others, the results by Nelson and Foster on the use of GARCH models as filters to estimate the conditional variance of an underlying diffusion model put the issue of choosing a functional form for the GARCH model in a new perspective. For a given diffusion process some GARCH model will be an optimal (efficient) filter whereas others with similar properties might not be optimal. The research by Nelson and Foster (1994) suggests that prior knowledge about the form of the underlying diffusion process will be useful when choosing the functional form for the GARCH model.
As shown by Anderson (1992, 1994), GARCH processes belong to the class of deterministic, conditionally heteroskedastic volatility processes. The ease of evaluating the GARCH likelihood function and the ability of the GARCH specification to accommodate time-varying volatility, in particular to yield a flexible, parsimonious representation of the correlation found for the squared values of many series (comparable to the parsimonious representation of conditional means using ARMA schemes), have led to the widespread use of GARCH models.
The history of the stochastic volatility model is brief. This model has been put forward as a parsimoniously parameterized alternative to GARCH models. While one of its attractive features is the low number of parameters needed to fit the time variation of volatility of many time series, likelihood-based inference for stochastic volatility models requires numerical integration or the use of the Kalman filter. As mentioned in Section 3, many of these problems have by now been resolved.
The statistical properties of GARCH models and stochastic volatility models differ. Comparisons of these models [see for instance Danielson (1994), Hsieh (1991), Jacquier et al. (1995) and Ruiz (1993)] on the basis of financial time series led to the conclusion that these models put different weights on various moment functions. The choice among these models will very often be an empirical question. In other instances, a GARCH model will be preferred because it yields an optimal filter of the variance of the underlying diffusion model. Factor-GARCH models with unobserved factors will lead to stochastic volatility components when one has to condition on the latent factors. The borders between the two classes of volatility models are expected to become less sharp.
Results on temporal aggregation of GARCH processes indicate that weak GARCH is the most common case. For reasons of aggregation, models relying on strong GARCH are at best approximations to the data generating process, a situation in which a pragmatic view of using data information to select the model might be the most appropriate.
Topics for future research are improving our understanding and the modeling of relationships between volatilities of different series and markets. Multivariate GARCH, factor-GARCH and stochastic volatility models will be used and extended.
Questions regarding the nature and the transmission of persistence in volatility from one series to another, and the transmission of persistence in volatility into the conditional expected return, will have to receive more attention in the future. Finally, statistical methods for testing and estimating volatility models and for forecasting volatility will be on the research agenda for a while. In particular, nonparametric and semiparametric methods appear to open up new perspectives for modeling time variation in conditional distributions of economic time series.
References
Anderson, T. G. (1992). Volatility. Department of Finance, Working Paper No. 144, Northwestern University.
Anderson, T. (1994). Stochastic autoregressive volatility: A framework for volatility modeling. Math. Finance 4, 75-102.
Baillie, R. T. and T. Bollerslev (1990). A multivariate generalized ARCH approach to modeling risk premia in forward foreign exchange rate markets. J. Internat. Money Finance 9, 309-324.
Baillie, R. T. and T. Bollerslev (1992). Prediction in dynamic models with time-dependent conditional variances. J. Econometrics 52, 91-113.
Baillie, R. T., T. Bollerslev, and H. O. Mikkelsen (1993). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Michigan State University, Working Paper.
Baillie, R. T. (1994). Long memory processes and fractional integration in econometrics. Michigan State University, Working Paper.
Ball, C. A. and A. Roma (1993). A jump diffusion model for the European Monetary System. J. Internat. Money Finance 12, 475-492.
Ball, C. A. and W. N. Torous (1985). On jumps in common stock prices and their impact on call option pricing. J. Finance 40, 155-173.
Bera, A. K. and S. Lee (1990). On the formulation of a general structure for conditional heteroskedasticity. University of Illinois at Urbana-Champaign, Working Paper.
Bera, A. K., S. Lee, and M. L. Higgins (1992). Interaction between autocorrelation and conditional heteroskedasticity: A random coefficient approach. J. Business Econom. Statist. 10, 133-142.
Bera, A. K. and S. Lee (1993). Information matrix test, parameter heterogeneity and ARCH. Rev. Econom. Stud. 60, 229-240.
Bera, A. K. and M. L. Higgins (1995). On ARCH models: Properties, estimation and testing. In: L. Oxley, D. A. R. George, C. J. Roberts, and S. Sayer, eds., Surveys in Econometrics, Oxford, Basil Blackwell, 215-272.
Black, F. (1976). Studies in stock price volatility changes. Proc. Amer. Statist. Assoc., Business and Economic Statistics Section, 177-181.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. J. Econometrics 31, 307-327.
Bollerslev, T., R. F. Engle, and J. M. Wooldridge (1988). A capital asset pricing model with time varying covariances. J. Politic. Econom. 96, 116-131.
Bollerslev, T., R. Y. Chou, and K. F. Kroner (1992). ARCH modeling in finance: A review of the theory and empirical evidence. J. Econometrics 52, 5-59.
Bollerslev, T. and J. M. Wooldridge (1992). Quasi maximum likelihood estimation and inference in dynamic models with time varying covariances. Econometric Rev. 11, 143-172.
Bollerslev, T. and I. Domowitz (1993). Trading patterns and the behavior of prices in the interbank foreign exchange market. J. Finance, to appear. Bollerslev, T. and R. F. Engle (1993). Common persistence in conditional variances. Econometrica 61, 166187. Bollerslev, T. and H. O. Mikkelsen (1993). Modeling and pricing longmemory in stock market volatility. Kellogg School of Management, Northwestern University, Working Paper No. 134. Bollerslev, T., R. F. Engle and D. B. Nelson (1994). ARCH models. Northwestern University, Working Paper, prepared for The Handbook o f Econometrics Vol. 4. Bougerol, Ph. and N. Picard (1992). Stationarity of GARCH processes and of some nonnegative time series. J. Econometrics 52, 115128. Brock, A. W., W. D. Dechert and J. A. Scheinkman (1987). A test for independence based on correlation dimension. Manuscript, Department of Economics, University of Wisconsin, Madison. Brock, A.W., D. A. Hsieh and B. LeBaron (1991). Nonlinear Dynamics, Chaos and Instability: Statistical Theory and Economic Evidence. MIT Press, Cambridge, MA. Cai, J. (1994). A Markov model of switchingregime ARCH. J. Business Econom. Statist. 12, 309 316. Chou, R. Y. (1988). Volatility persistence and stock valuations: Some empirical evidence using GARCH. J. Appl. Econometrics 3, 279294. Crouhy, M. and C. M. Rockinger (1994). Volatility clustering, asymmetry and hysteresis in stock returns : International evidence. Paris, HECSchool of Management, Working Paper. Crowder, M. J. (1976). Maximum likelihood estimation with dependent observations. J. Roy. Statist. Soc. Ser. B 38, 4553. Danielson, J. (1994). Stochastic volatility in asset prices : Estimation with simulated maximum likelihood. J. Econometrics 64, 375400. Davidian, M. and R. J. Carroll (1987). Variance function estimation. J. Amer. Statist. Assoc. 82, 10791091.
237
Demos, A. and E. Sentana (1991). Testing for GARCH effects: A onesided approach. London School of Economics, Working Paper. De Vries, C. G. (1991). On the relation between GARCH and stable processes. J. Econometrics 48, 313724. Diebold, F. X. (1987) Testing for correlation in the presence of ARCH. Proceedings from the ASA Business and Economic Statistics Section, 323328. Diebold, F. X. (1988). Empirical Modeling of Exchange Rates. Berlin, SpringerVerlag. Diebold, F. X. and M. Nerlove (1989). The dynamics of exchange rate volatility: A multivariate latent factor ARCH model. 9". Appl. Econometrics 4, 121. Diebold, F. X. and J. A. Lopez (1994). ARCH models. Paper prepared for Hoover K. ed., Macroeconometrics: Developments, Tensions and Prospects. Ding, Z., R. F. Engle, and C. W. J. Granger (1993). A long memory property of stock markets returns and a new model. J. Empirical Finance 1, 83106. Drost, F. C. and T. E. Nijman (1993). Temporal aggregation of GARCH processes. Econometrica 61, 909927. Drost, F. C. and B. J. M. Werker (1995). Closing the GARCH gap: Continuous time GARCH modeling. Tilburg University, paper to appear in J. Econometrics. Engel, C. and J . D. Hamilton (1990). Long swings in the exchange rate : Are they in the data and do markets know it ? Amer. Econom. Rev. 80, 689713. Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation. Econometrica 50, 9871008. Engle, R. F. and D . F. Kraft (1983). Multiperiod forecast error variances of inflation estimated from ARCH models. In: Zellner, A. ed., Applied Time Series Analysis of Economic Data, Bureau of the Census, Washington D.C., 293302. Engle, R. F. and T. Bollerslev (1986). Modeling the persistence of conditional variances. Econometric Rev. 5, 150. Engle, R. F., D . M. Lilien, and R. P. Robins (1987). Estimating time varying risk premia in the term structure : The ARCHM model, Econometrica 55, 391407. Engle, R. F. (1990). 
Discussion: Stock market volatility and the crash of 87. Rev. Financ. Stud. 3, 103106. Engle, R. F., V . K. Ng, and M. Rothschild (1990). Asset pricing with a factor ARCH covariance structure: Empirical estimates for treasury bills. J. Econometrics 45, 213238. Engle, R. F. and G. GonzalezRivera (1991). Semiparametric ARCH models. J. Business Econom. Statist. 9, 345359. Engle, R. F. and V . K. Ng (1993). Measuring and testing the impact of news on volatility. J. Finance 48, 1749 1778. Engle, R. F. and G. G. J. Lee (1993). Long run volatility forecasting for individual stocks in a one factor model. Unpublished manuscript, Department of Economics, UCSD. Engle, R. F. and S. Kozicki (1993). Testing for common features (with discussion). J. Business Econom. Statist. 11, 369380. Engle, R. F. and R. Susmel (1993). Common volatility and international equity markets. J. Business Econom. Statist. 11, 167176. Engle, R. F. and G. G. J. Lee (1994). Estimating diffusion models of stochastic volatility. Mimeo, University of California at San Diego. Engle, R. F. and K . F. Kroner (1995). Multivariate simultaneous generalized ARCH. Econometric Theory 11, 122150. French, K. R., G . W. Schwert and R . F. Stambaugh (1987). Expected stock returns and volatility. J. Financ. Econom. 19, 330. Gallant, A. R. (1981). On the bias in flexible functional forms and an essentially unbiased form : The Fourier flexible form. J. Econometrics 15, 211244. Gallant, A. R. and G. Tauchen (1989). Seminonparametric estimation of conditionally constrained heterogeneous processes : Asset pricing applications. Econometrica 57, 10911120.
238
F. C. Palm
Gallant, A. R., D. Hsieh and G. Tauchen (1994). Estimation of stochastic volatility models with suggestive diagnostics. Duke University, Working Paper. Geweke, J. (1989). Exact predictive densities for linear models with ARCH disturbances. J. Econometrics 40, 6386. Geweke, J. (1994). Bayesian comparison of econometric models. Federal Reserve Bank of Minneapolis, Working Paper. Ghysels, E., A. C, Harvey and E. Renault (1995). Stochastic volatility. Prepared for Handbook of Statistics, Vol.14. Glosten, L. R., R. Jagannathan, and D. Runkle (1993). Relationship between the expected value and the volatility of the nominal excess return on stocks. J. Finance 48, 17791801. Gouri+roux, C. and A. Monfort (1992). Qualitative threshold ARCH models. J. Econometrics 52, 159 199. Gouri6roux, C. (1992). ModOles A R C H et Application Financigres. Paris, Economica. Gouri6roux, C., A. Monfort and E. Renault (1993). Indirect inference. J. Appl. Econometrics 8, $85Sl18. Granger, C. W. J., H. White and M. Kamstra (1989). Interval forecasting: An analysis based upon ARCHquantile estimators. J. Econometrics 40, 87 96. Hamilton, J. D. (1988). Rationalexpectations econometric analysis of changes in regime: An investigation of the term structure of interest rates. J. Econom. Dynamic Control 12, 385423. Hamilton, J. D. (1989). Analysis of time series subject to changes in regime. J. Econometrics 64, 307333. Hamilton, J. D. and R. Susmel (1994). Autoregressive conditional heteroskedasticity and changes in regime. J. Econometrics 64, 307333. Harvey, A. C., E. Ruiz and E. Sentana (1992). Unobserved component time series models with ARCH disturbances. J. Econometrics 52, 129158. Hentschel, L. (1994). All in the family : Nesting symmetric and asymmetric GARCH models. Paper presented at the Econometric Society Winter Meeting, Washington D.C., to appear in J. Financ. Econom. 39, hr. 1. Higgins, M. L. and A. K. Bera (1992). A class of nonlinear ARCH models. Internat. Econom. Rev. 33, 137158. 
Hsieh, D. A. (1989). Modeling heteroskedasticity in daily foreign exchange rates. J. Business Econom. Statist. 7, 307317. Hsieh, D. (1991). Chaos and nonlinear dynamics: Applications to financial markets. J. Finance 46, 18391877. Hull, J. and A. White (1987). The pricing of options on assets with stochastic volatilities. J. Finance 42, 281300. Jacquier, E., N. G. Polson and P. E. Rossi (1994). Bayesian analysis of stochastic volatility models. J. Business. Econom. Statist. 12, 371389. Jorion, P. (1988). On jump processes in foreign exchange and stock markets. Rev. Finan. Stud. 1,427445. Kim, S. and N. Sheppard (1994). Stochastic volatility: Likelihood inference and comparison with ARCH models. Mimeo, Nuffield College, Oxford. King, M., E. Sentana and S. Wadhwani (1994). Volatility links between national stock markets. Econometrica 62, 901933. Kodde, D. A. and F. C. Palm (1986). Wald criteria for jointly testing equality and inequality restrictions. Econometrica 54, 12431248. Krengel, U. (1985). Ergodic Theorems. Walter de Gruyter, Berlin. Lee, J. H. H. (1991). A Lagrange multiplier test for GARCH models. Econom. Lett. 37, 265271. Lee, J. H. H. and M . L. King (1993). A locally most mean powerful based score test for ARCH and GARCH regression disturbances. J. Business Econom. Statist. 11, 1727. Lee, S. W. and B. E. Hansen (1994). Asymptotic theory for the GARCH(1,1) quasimaximum likelihood estimator. Econometric Theory 10, 2952. Li, W. K. and T. K. Mak (1994). On the squared residual autocorrelations in nonlinear time series with conditional heteroskedasticity. J. Time Series Analysis 15, 627636.
239
Lin, W.L. (1992). Alternative estimators for factor GARCH models  A Monte Carlo comparison. J. Appl. Econometrics 7, 259279. Linton, O. (1993). Adaptive estimation in ARCH models. Econometric Theory 9, 539569. Lumsdaine, R. L. (1992). Asymptotic properties of the quasimaximum likelihood estimator in GARCH(1,1) and IGARCH(1,1) models. Unpublished manuscript, Department of Economics, Princeton University. Lumsdaine, R. L. (1995). Finitesample properties of the maximum likelihood estimator in GARCH(1,1) and IGARCH(1,1) models: A Monte Carlo investigation. J. Business Econom. Statist. 13, 110. Melino, A. and S. Turnbull (1990). Pricing foreign currency options with stochastic volatility. J. Econometrics 45, 239266. Nelson, D. B. (1990a). Stationarity and persistence in the GARCH(1,1) model. Econometric Theory 6, 318334. Nelson, D. B. (1990b). ARCH models as diffusion approximations. J. Econometrics 45, 738. Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns : A new approach. Econometrica 59, 347370. Nelson, D. B. (1992). Filtering and forecasting with misspecified ARCH models I. J. Econometrics 52, 6190. Nelson, D. B. and C. Q. Cao (1992). Inequality constraints in univariate GARCH models. J. Business Econom. Statist. 10, 229235. Nelson, D. B. and D . P. Foster (1994). Asymptotic filtering theory for univariate ARCH models. Econometrica 62, 141. Nelson, D. B. and D. P. Foster (1995). Filtering and forecasting with misspecified ARCH models II Making the right forecast with the wrong model. J. Econometrics 67, 303335. Ng, V., R. F. Engle, and M. Rothschild (1992). A multidynamicfactor model for stock returns. J. Econometrics 52, 245266. Nieuwland, F. G. M. C., W. F. C. Verschoor, and C. C. P. Wolff (1991). EMS exchange rates. J. lnternat. Financial Markets, Institutions and Money 2, 2142. Nijman, T. E. and F. C. Palm (1993). GARCH modelling of volatility : An introduction to theory and applications. In: De Zeeuw, A . J. 
ed., Advanced Lectures in Quantitative Economics II, London, Academic Press, 153183. Nijman, T. E. and E. Sentana (1993). Marginalization and contemporaneous aggregation in multivariate GARCH processes. Tilburg University, CentER, Discussion Paper No. 9312, to appear in J. Econometrics. Pagan, A. R. and A. Ullah (1988). The econometric analysis of models with risk terms. J. Appl. Econometrics 3, 87105. Pagan, A. R. and G. W. Schwert (1990). Alternative models for conditional stock volatility. J. Econometrics 45, 267290. Pagan, A. R. and Y. S. Hong (1991). Nonparametric estimation and the risk premium. In: Barnet, W. A., J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics, Cambridge University Press, Cambridge. Pagan, A. R. (1995). The econometrics of financial markets. ANU and the University of Rochester, Working Paper, to appear in the J. Empirical Finance. Palm, F. C. and J. P. Urbain (1995). Common trends and transitory components of stock price volatility. University of Limburg, Working Paper. Parkinson, M. (1980). The extreme value method for estimating the variance of the rate of return. J. Business 53, 6165. Ruiz, E. (1993). Stochastic volatility versus autoregressive conditional heteroskedasticity. Universidad Carlos III de Madrid, Working Paper. Robinson, P. M. (1991). Testing for strong serial correlation and dynamic conditional heteroskedasticity in multiple regression. J. Econometrics 47, 6784. Schwert, G. W. (1989). Why does stock market volatility change over time? J. Finance 44, 11151153.
240
F. C. Palm
Sentana, E. (1991). Quadratic ARCH models: A potential reinterpretation of ARCH models. Unpublished manuscript, London School of Economics. Sentana, E. (1992). Identification of multivariate conditionally heteroskedastic factor models. London School of Economics, Working Paper. Taylor, S. (1986). Modeling Financial Time Series. J. Wiley & Sons, New York, NY. Taylor, S. J. (1994). Modeling stochastic volatility: A review and comparative study. Math. Finance 4, 183204. Tsay, R. S. (1987). Conditional heteroskedastic time series models. J. Amer. Statist. Assoc. 82, 590604. Vlaar, P. J. G. and F. C. Palm (1993). The message in weekly exchange rates in the European Monetary System : Mean reversion, conditional heteroskedasticity and jumps. J. Business. Econom. Statist. 11, 351360. Vlaar, P. J. G. and F. C. Palm (1994). Inflation differentials and excess returns in the European Monetary System. CEPR Working Paper Series of the Network in Financial Markets, London. Weiss, A. A. (1986), Asymptotic theory for ARCH models: Estimation and testing. Econometric Theory 2, 107131. Zakoian, J. M. (1994). Threshold heteroskedastic models. J. Econom. Dynamic Control 18, 931955.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14. © 1996 Elsevier Science B.V. All rights reserved.
It is obvious that forecasts are of great importance and widely used in economics and finance. Quite simply, good forecasts lead to good decisions. The importance of forecast evaluation and combination techniques follows immediately: forecast users naturally have a keen interest in monitoring and improving forecast performance. More generally, forecast evaluation figures prominently in many questions in empirical economics and finance, such as:

- Are expectations rational? (e.g., Keane and Runkle, 1990; Bonham and Cohen, 1995)
- Are financial markets efficient? (e.g., Fama, 1970, 1991)
- Do macroeconomic shocks cause agents to revise their forecasts at all horizons, or just at short- and medium-term horizons? (e.g., Campbell and Mankiw, 1987; Cochrane, 1988)
- Are observed asset returns "too volatile"? (e.g., Shiller, 1979; LeRoy and Porter, 1981)
- Are asset returns forecastable over long horizons? (e.g., Fama and French, 1988; Mark, 1995)
- Are forward exchange rates unbiased and/or accurate forecasts of future spot prices at various horizons? (e.g., Hansen and Hodrick, 1980)
- Are government budget projections systematically too optimistic, perhaps for strategic reasons? (e.g., Auerbach, 1994; Campbell and Ghysels, 1995)
- Are nominal interest rates good forecasts of future inflation? (e.g., Fama, 1975; Nelson and Schwert, 1977)

Here we provide a five-part selective account of forecast evaluation and combination methods. In the first, we discuss evaluation of a single forecast, and in particular, evaluation of whether and how it may be improved. In the second, we discuss the evaluation and comparison of the accuracy of competing forecasts. In the third, we discuss whether and how a set of forecasts may be combined to produce a superior composite forecast. In the fourth, we describe a number of forecast evaluation topics of particular relevance in economics and finance, including methods for evaluating direction-of-change forecasts, probability forecasts and volatility forecasts. In the fifth, we conclude.

In treating the subject of forecast evaluation, a tradeoff emerges between generality and tedium. Thus, we focus for the most part on linear least-squares forecasts of univariate covariance stationary processes, or we assume normality so that linear projections and conditional expectations coincide. We leave it to the reader to flesh out the remainder. However, in certain cases of particular interest, we do focus explicitly on nonlinearities that produce divergence between the linear projection and the conditional mean, as well as on nonstationarities that require special attention.

* We thank Clive Granger for useful comments, and we thank the National Science Foundation, the Sloan Foundation and the University of Pennsylvania Research Foundation for financial support.
The properties of optimal forecasts are well known; forecast evaluation essentially amounts to checking those properties. First, we establish some notation and recall some familiar results. Denote the covariance stationary time series of interest by $y_t$. Assuming that the only deterministic component is a possibly nonzero mean, $\mu$, the Wold representation is

$$y_t = \mu + \varepsilon_t + b_1 \varepsilon_{t-1} + b_2 \varepsilon_{t-2} + \cdots,$$

where $\varepsilon_t \sim WN(0, \sigma^2)$, and WN denotes serially uncorrelated (but not necessarily Gaussian, and hence not necessarily independent) white noise. We assume invertibility throughout, so that an equivalent one-sided autoregressive representation exists. The k-step-ahead linear least-squares forecast is

$$\hat{y}_{t+k,t} = \mu + b_k \varepsilon_t + b_{k+1} \varepsilon_{t-1} + \cdots,$$

and the corresponding k-step-ahead forecast error is

$$e_{t+k,t} = y_{t+k} - \hat{y}_{t+k,t} = \varepsilon_{t+k} + b_1 \varepsilon_{t+k-1} + \cdots + b_{k-1} \varepsilon_{t+1}, \qquad (1)$$

with variance

$$\sigma_k^2 = \mathrm{var}(e_{t+k,t}) = \sigma^2 \left( 1 + \sum_{i=1}^{k-1} b_i^2 \right). \qquad (2)$$
Four key properties of errors from optimal forecasts, which we discuss in greater detail below, follow immediately: (1) Optimal forecast errors have a zero mean (follows from (1)); (2) 1-step-ahead optimal forecast errors are white noise (special case of (1) corresponding to k = 1); (3) k-step-ahead optimal forecast errors are at most MA(k-1) (general case of (1)); (4) The k-step-ahead optimal forecast error variance is nondecreasing in k (follows from (2)).

Before proceeding, we now describe some exact distribution-free nonparametric tests for whether an independently (but not necessarily identically) distributed series has a zero median. The tests are useful in evaluating the properties of optimal forecast errors listed above, as well as other hypotheses that will concern us later. Many such tests exist; two of the most popular, which we use repeatedly, are the sign test and the Wilcoxon signed-rank test. Denote the series being examined by $x_t$, and assume that T observations are available.

The sign test proceeds under the null hypothesis that the observed series is independent with a zero median. The intuition and construction of the test statistic are straightforward: under the null, the number of positive observations in a sample of size T has the binomial distribution with parameters T and 1/2. The test statistic is therefore simply

$$S = \sum_{t=1}^{T} I_+(x_t),$$

where $I_+(x_t) = 1$ if $x_t > 0$ and 0 otherwise. In large samples, the studentized version of the statistic is standard normal,

$$\frac{S - T/2}{\sqrt{T/4}} \overset{a}{\sim} N(0, 1).$$
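The sign test just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `sign_test`, the two-sided form of the exact p-value, and the sample data are our own choices for illustration.

```python
import math

def sign_test(x):
    """Two-sided sign test of H0: the series is independent with zero median.

    Returns (S, exact_p, z): S counts the positive observations, exact_p is
    the two-sided Binomial(T, 1/2) p-value, and z is the studentized statistic.
    Assumption: exact zeros are simply counted as non-positive.
    """
    T = len(x)
    S = sum(1 for v in x if v > 0)
    # Two-sided exact p-value: double the smaller binomial tail, capped at 1.
    tail = min(S, T - S)
    exact_p = min(1.0, 2 * sum(math.comb(T, j) for j in range(tail + 1)) / 2 ** T)
    # Large-sample version: (S - T/2) / sqrt(T/4) is approximately N(0, 1).
    z = (S - T / 2) / math.sqrt(T / 4)
    return S, exact_p, z

# Hypothetical forecast errors, for illustration only.
errors = [0.4, 1.1, 0.3, -0.2, 0.8, 0.5, 1.4, 0.9, 0.1, 0.6]
S, exact_p, z = sign_test(errors)
print(S, exact_p, z)
```

With nine positive values out of ten, the exact p-value is small, which would cast doubt on a zero median for these made-up errors.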
Thus, significance may be assessed using standard tables of the binomial or normal distributions. Note that the sign test does not require distributional symmetry. The Wilcoxon signed-rank test, a related distribution-free procedure, does require distributional symmetry, but it can be more powerful than the sign test in that case. Apart from the additional assumption of symmetry, the null hypothesis is the same, and the test statistic is the sum of the ranks of the absolute values of the positive observations,

$$W = \sum_{t=1}^{T} I_+(x_t)\, \mathrm{Rank}(|x_t|),$$

where the ranking is in increasing order (e.g., the largest absolute observation is assigned a rank of T, and so on). The intuition of the test is simple: if the underlying distribution is symmetric about zero, a "very large" (or "very small") sum of the ranks of the absolute values of the positive observations is "very unlikely." The exact finite-sample null distribution of the signed-rank statistic is free from nuisance parameters and invariant to the true underlying distribution, and it has been tabulated. Moreover, in large samples, the studentized version of the statistic is standard normal,

$$\frac{W - T(T+1)/4}{\sqrt{T(T+1)(2T+1)/24}} \overset{a}{\sim} N(0, 1).$$
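The signed-rank statistic can be computed along the following lines. This is a sketch only: the function name `signed_rank`, the use of average ranks for tied absolute values, and the data are assumptions made for illustration. The studentization uses the standard null mean T(T+1)/4 and null variance T(T+1)(2T+1)/24 of the statistic.

```python
import math

def signed_rank(x):
    """Wilcoxon signed-rank statistic W and its studentized value.

    W is the sum of the ranks of |x_t| (increasing order) over positive x_t.
    Assumption: tied absolute values receive the average of their ranks.
    """
    T = len(x)
    order = sorted(range(T), key=lambda i: abs(x[i]))
    ranks = [0.0] * T
    i = 0
    while i < T:
        j = i
        # Extend j over the block of observations tied with position i.
        while j + 1 < T and abs(x[order[j + 1]]) == abs(x[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1, ..., j+1
        for m in range(i, j + 1):
            ranks[order[m]] = avg_rank
        i = j + 1
    W = sum(r for r, v in zip(ranks, x) if v > 0)
    mean = T * (T + 1) / 4
    var = T * (T + 1) * (2 * T + 1) / 24
    return W, (W - mean) / math.sqrt(var)

# Hypothetical series, for illustration only.
x = [1.2, -0.4, 2.5, 0.9, -3.1, 1.7]
W, z = signed_rank(x)
print(W, z)
```

For samples this small, the tabulated exact null distribution is preferable to the normal approximation.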
Given a track record of forecasts, $\hat{y}_{t+k,t}$, and corresponding realizations, $y_{t+k}$, forecast users will naturally want to assess forecast performance. The properties of optimal forecasts, cataloged above, can readily be checked.
a. Optimal forecast errors have a zero mean
A variety of standard tests of this hypothesis can be performed, depending on the assumptions one is willing to maintain. For example, if $e_{t+k,t}$ is Gaussian white noise (as might be the case for 1-step-ahead errors), then the standard t-test is the obvious choice because it is exact and uniformly most powerful. If the errors are non-Gaussian but remain independent and identically distributed (iid), then the t-test is still useful asymptotically. However, if more complicated dependence or heterogeneity structures are (or may be) operative, then alternative tests are required, such as those based on the generalized method of moments.

It would be unfortunate if non-normality or richer dependence/heterogeneity structures mandated the use of asymptotic tests, because sometimes only short track records are available. Such is not the case, however, because exact distribution-free nonparametric tests are often applicable, as pointed out by Campbell and Ghysels (1995). Although the distribution-free tests do require independence (sign test) and independence and symmetry (signed-rank test), they do not require normality or identical distributions over time. Thus, the tests are automatically robust to a variety of forecast error distributions, and to heteroskedasticity of the independent but not identically distributed type.

For k > 1, however, even optimal forecast errors are likely to display serial correlation, so the nonparametric tests must be modified. Under the assumption that the forecast errors are (k-1)-dependent, each of the following k series of forecast errors will be free of serial correlation: $\{e_{1+k,1}, e_{1+2k,1+k}, e_{1+3k,1+2k}, \ldots\}$, $\{e_{2+k,2}, e_{2+2k,2+k}, e_{2+3k,2+2k}, \ldots\}$, $\{e_{3+k,3}, e_{3+2k,3+k}, e_{3+3k,3+2k}, \ldots\}$, ..., $\{e_{2k,k}, e_{3k,2k}, e_{4k,3k}, \ldots\}$. Thus, a Bonferroni bounds test (with size bounded above by $\alpha$) is obtained by performing k tests, each of size $\alpha/k$, on each of the k error series, and rejecting the null hypothesis if the null is rejected for any of the series. This procedure is conservative, even asymptotically. Alternatively, one could use just one of the k error series and perform an exact test at level $\alpha$, at the cost of reduced power due to the discarded observations.

In concluding this section, let us stress that the nonparametric distribution-free tests are neither unambiguously "better" nor "worse" than the more common tests; rather, they are useful in different situations and are therefore complementary. To their credit, they are often exact finite-sample tests with good finite-sample power, and they are insensitive to deviations from the standard assumptions of normality and homoskedasticity required to justify more standard tests in small samples. Against them, however, is the fact that they require independence of the forecast errors, an assumption even stronger than conditional-mean independence, let alone linear-projection independence. Furthermore, although the nonparametric tests can be modified to allow for k-dependence, a possibly substantial price must be paid either in terms of inexact size or reduced power.
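The Bonferroni bounds procedure for k-step-ahead errors can be sketched as follows. The sketch assumes the errors are stored in order of forecast origin, so that taking every k-th element yields the k non-overlapping subseries, and it uses the exact two-sided sign test on each subseries. The names `sign_pvalue` and `bonferroni_zero_median` and the data are hypothetical.

```python
import math

def sign_pvalue(x):
    """Exact two-sided sign-test p-value under Binomial(T, 1/2).

    Assumption: exact zeros are counted as non-positive.
    """
    T = len(x)
    S = sum(1 for v in x if v > 0)
    tail = min(S, T - S)
    return min(1.0, 2 * sum(math.comb(T, j) for j in range(tail + 1)) / 2 ** T)

def bonferroni_zero_median(errors, k, alpha=0.05):
    """Bounds test of zero median for (k-1)-dependent k-step-ahead errors.

    errors[t] is the k-step-ahead error made at forecast origin t, with
    consecutive origins. Each subseries errors[j::k] uses origins k periods
    apart and is therefore serially independent under (k-1)-dependence.
    Rejects if any subseries test rejects at level alpha / k.
    """
    subseries = [errors[j::k] for j in range(k)]
    pvals = [sign_pvalue(s) for s in subseries]
    reject = any(p <= alpha / k for p in pvals)
    return reject, pvals

# Hypothetical 2-step-ahead errors, all positive, for illustration only.
errors = [0.5, 0.3, 0.8, 0.1, 0.9, 0.4, 0.7, 0.2,
          0.6, 1.1, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
reject, pvals = bonferroni_zero_median(errors, k=2)
print(reject, pvals)
```

Each subseries here has eight observations, which illustrates the cost noted in the text: splitting the sample shortens every individual test, so power falls even though size remains bounded by alpha.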
b. 1-step-ahead optimal forecast errors are white noise

More precisely, the errors from linear least squares forecasts are linear-projection independent, and the errors from least squares forecasts are conditional-mean independent. The errors never need be fully serially independent, because dependence can always enter through higher moments, as for example with the conditional-variance dependence of GARCH processes. Under various sets of maintained assumptions, standard asymptotic tests may be used to test the white noise hypothesis. For example, the sample autocorrelation and partial autocorrelation functions, together with Bartlett asymptotic standard errors, may be useful graphical diagnostics in that regard. Standard tests based on the serial correlation coefficient, as well as the Box-Pierce and related statistics, may be useful as well.

Dufour (1981) presents adaptations of the sign and Wilcoxon signed-rank tests that yield exact tests for serial dependence in 1-step-ahead forecast errors, without requiring normality or identical forecast error distributions. Consider, for example, the null hypothesis that the forecast errors are independent and symmetrically distributed with zero median. Then $\mathrm{median}(e_{t+1,t}\, e_{t+2,t+1}) = 0$; that is, the product of two symmetric independent random variables with zero median is itself symmetric with zero median. Under the alternative of positive serial dependence, $\mathrm{median}(e_{t+1,t}\, e_{t+2,t+1}) > 0$, and under the alternative of negative serial dependence, $\mathrm{median}(e_{t+1,t}\, e_{t+2,t+1}) < 0$. This suggests examining the cross-product series $z_t = e_{t+1,t}\, e_{t+2,t+1}$ for symmetry about zero, the obvious test for which is the signed-rank test, $W_D = \sum_{t=1}^{T} I_+(z_t)\, \mathrm{Rank}(|z_t|)$. Note that the $z_t$ sequence will be serially dependent even if the $e_{t+1,t}$ sequence is not, in apparent violation of the conditions required for validity of the signed-rank test (applied to $z_t$). Hence the importance of Dufour's contribution: Dufour shows that the serial correlation is of no consequence and that the distribution of $W_D$ is the same as that of W.

c. k-step-ahead optimal forecast errors are at most MA(k-1)

Cumby and Huizinga (1992) develop a useful asymptotic test for serial dependence of order greater than k-1. The null hypothesis is that the $e_{t+k,t}$ series is MA(q) ($0 \le q \le k-1$) against the alternative hypothesis that at least one autocorrelation is nonzero at a lag greater than k-1. Under the null, the sample autocorrelations of $e_{t+k,t}$, $\hat{\rho} = [\hat{\rho}_{q+1}, \ldots, \hat{\rho}_{q+s}]$, are asymptotically distributed $\sqrt{T}\, \hat{\rho} \sim N(0, V)$. Thus,

$$C = T\, \hat{\rho}' \hat{V}^{-1} \hat{\rho}$$

is asymptotically distributed as $\chi^2_s$ under the null, where $\hat{V}$ is a consistent estimator of V.

Dufour's (1981) distribution-free nonparametric tests may also be adapted to provide a finite-sample bounds test for serial dependence of order greater than k-1. As before, separate the forecast errors into k series, each of which is serially independent under the null of (k-1)-dependence. Then, for each series, take $z_{k,t} = e_{t+k,t}\, e_{t+2k,t+k}$ and reject at significance level bounded above by $\alpha$ if one or more of the subset test statistics rejects at the $\alpha/k$ level.
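Dufour's cross-product idea for 1-step-ahead errors can be illustrated with a short sketch: form the products of adjacent errors and apply the signed-rank statistic to them. The function name `dufour_signed_rank`, the no-ties assumption, and the data are ours; significance would be judged against the tabulated signed-rank distribution or, in large samples, the normal approximation computed here.

```python
import math

def dufour_signed_rank(e):
    """Signed-rank statistic W_D on cross-products of adjacent 1-step errors.

    e is the sequence of 1-step-ahead forecast errors in time order.
    Assumption: the absolute cross-products contain no ties.
    Returns (W_D, z), where z is the large-sample studentized statistic.
    """
    # Cross-product series z_t = e_t * e_{t+1}.
    z_series = [e[t] * e[t + 1] for t in range(len(e) - 1)]
    n = len(z_series)
    # Rank the absolute cross-products in increasing order (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(z_series[i]))
    rank = {idx: r + 1 for r, idx in enumerate(order)}
    W_D = sum(rank[i] for i in range(n) if z_series[i] > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return W_D, (W_D - mean) / math.sqrt(var)

# Hypothetical 1-step-ahead errors, for illustration only.
e = [1.0, 2.0, -1.5, -0.5, 3.0, 2.5]
W_D, z = dufour_signed_rank(e)
print(W_D, z)
```

A large value of W_D points toward positive serial dependence and a small value toward negative serial dependence. The k-step-ahead bounds version would apply the same function to each of the k subseries at level alpha/k.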
d. The k-step-ahead optimal forecast error variance is nondecreasing in k

The k-step-ahead forecast error variance, $\sigma_k^2 = \mathrm{var}(e_{t+k,t}) = \sigma^2 \left( 1 + \sum_{i=1}^{k-1} b_i^2 \right)$, is nondecreasing in k. Thus, it is often useful simply to examine the sample k-step-ahead forecast error variances as a function of k, both to be sure the condition appears satisfied and to see the pattern with which the forecast error variance grows with k, which often conveys useful information.³ Formal inference may also be done, so long as one takes care to allow for dependence of the sample variances across horizons.
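To see how the forecast error variance grows with the horizon, one can evaluate the variance implied by a given set of Wold coefficients, using the fact that the k-step-ahead error is the sum of the innovations dated t+1 through t+k weighted by 1, b_1, ..., b_{k-1}. The sketch below does this for a hypothetical AR(1) process, for which b_i equals phi to the power i; the function name and parameter values are assumptions for illustration.

```python
def forecast_error_variances(b, sigma2, max_k):
    """Theoretical k-step-ahead forecast error variances for k = 1..max_k.

    b = [b_1, b_2, ...] are the Wold (moving average) coefficients and
    sigma2 is the innovation variance. Requires len(b) >= max_k - 1.
    """
    variances = []
    running = 1.0  # coefficient on the innovation dated t+k is 1
    for k in range(1, max_k + 1):
        if k > 1:
            running += b[k - 2] ** 2  # add b_{k-1}^2
        variances.append(sigma2 * running)
    return variances

# Hypothetical AR(1) with phi = 0.8 and unit innovation variance.
phi, sigma2 = 0.8, 1.0
b = [phi ** i for i in range(1, 20)]
sig = forecast_error_variances(b, sigma2, 5)
print(sig)
```

The resulting sequence rises from the innovation variance toward the unconditional variance of the process, sigma2 / (1 - phi**2), which is the pattern one would compare against the sample variances of actual forecast errors at each horizon.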
Assessing optimality with respect to an information set
The key property of optimal forecast errors, from which all others follow (including those cataloged above), is unforecastability on the basis of information available at the time the forecast was made. This is true regardless of whether linear-projection optimality or conditional-mean optimality is of interest, regardless of whether the relevant loss function is quadratic, and regardless of whether the series being forecast is stationary.

Following Brown and Maital (1981), it is useful to distinguish between partial and full optimality. Partial optimality refers to unforecastability of forecast errors with respect to some subset, as opposed to all subsets, of available information, $\Omega_t$. Partial optimality, for example, characterizes a situation in which a forecast is optimal with respect to the information used to construct it, but the information used was not all that could have been used. Thus, each of a set of competing forecasts may have the partial optimality property if each is optimal with respect to its own information set. One may test partial optimality via regressions of the form $e_{t+k,t} = \alpha' x_t + u_t$, where $x_t \subset \Omega_t$. The particular case of testing partial optimality with respect to $\hat{y}_{t+k,t}$ has received a good deal of attention, as in Mincer and Zarnowitz (1969). The relevant regression is $e_{t+k,t} = \alpha_0 + \alpha_1 \hat{y}_{t+k,t} + u_t$ or $y_{t+k} = \beta_0 + \beta_1 \hat{y}_{t+k,t} + u_t$, where partial optimality corresponds to $(\alpha_0, \alpha_1) = (0, 0)$ or $(\beta_0, \beta_1) = (0, 1)$.⁴ One may also expand the regression to allow for various sorts of nonlinearity. For example, following Ramsey (1969), one may test whether all coefficients in the regression $e_{t+k,t} = \sum_{j=0}^{J} \alpha_j \hat{y}_{t+k,t}^{\,j} + u_t$ are zero.

Full optimality, in contrast, requires the forecast error to be unforecastable on the basis of all information available when the forecast was made (that is, the entirety of $\Omega_t$). Conceptually, one could test full rationality via regressions of the form $e_{t+k,t} = \alpha' x_t + u_t$. If $\alpha = 0$ for all $x_t \subset \Omega_t$, then the forecast is fully optimal. In practice, one can never test for full optimality, but rather only partial optimality with respect to increasing information sets.

Distribution-free nonparametric methods may also be used to test optimality with respect to various information sets. The sign and signed-rank tests, for example, are readily adapted to test orthogonality between forecast errors and available information, as proposed by Campbell and Dufour (1991, 1995). If, for example, $e_{t+1,t}$ is linear-projection independent of $x_t \in \Omega_t$, then $\mathrm{cov}(e_{t+1,t}, x_t) = 0$. Thus, in the symmetric case, one may use the signed-rank test for whether $E[z_t] = E[e_{t+1,t}\, x_t] = 0$, and more generally, one may use the sign test for whether $\mathrm{median}(z_t) = \mathrm{median}(e_{t+1,t}\, x_t) = 0$.⁵ The relevant sign and signed-rank statistics are $S_\perp = \sum_{t=1}^{T} I_+(z_t)$ and $W_\perp = \sum_{t=1}^{T} I_+(z_t)\, \mathrm{Rank}(|z_t|)$. Moreover, one may allow for nonlinear transformations of the elements of the information set, which is useful for assessing conditional-mean as opposed to simply linear-projection independence, by taking $z_t = e_{t+1,t}\, g(x_t)$, where $g(\cdot)$ is a nonlinear function of interest. Finally, the tests can be generalized to allow for k-step-ahead forecast errors as before. Simply take $z_t = e_{t+k,t}\, g(x_t)$, divide the $z_t$ series into the usual k subsets, and reject the orthogonality null at significance level bounded by $\alpha$ if any of the subset test statistics are significant at the $\alpha/k$ level.⁶

³ Extensions of this idea to nonstationary long-memory environments are developed in Diebold and Lindner (1995).
⁴ In such regressions, the disturbance should be white noise for 1-step-ahead forecasts but may be serially correlated for multi-step-ahead forecasts.
⁵ Again, it is not obvious that the conditions required for application of the sign or signed-rank test to $z_t$ are satisfied, but they are; see Campbell and Dufour (1995) for details.
⁶ Our discussion has implicitly assumed that both $e_{t+1,t}$ and $g(x_t)$ are centered at zero. This will hold for $e_{t+1,t}$ if the forecast is unbiased, but there is no reason why it should hold for $g(x_t)$. Thus, in general, the test is based on $g(x_t) - \mu_t$, where $\mu_t$ is a centering parameter such as the mean, median or trend of $g(x_t)$. See Campbell and Dufour (1995) for details.
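The regression of realizations on a constant and the forecast, in the spirit of Mincer and Zarnowitz (1969), can be sketched with ordinary least squares written out by hand. The function name, the data, and the use of conventional standard errors are assumptions for illustration; those standard errors presume white-noise disturbances and so are appropriate only for 1-step-ahead forecasts.

```python
import math

def mincer_zarnowitz(y, f):
    """OLS regression of realizations y on a constant and forecasts f.

    Returns (b0, b1, se0, se1). Under partial optimality the population
    coefficients are (0, 1). Standard errors assume white-noise disturbances.
    """
    T = len(y)
    ybar, fbar = sum(y) / T, sum(f) / T
    sff = sum((v - fbar) ** 2 for v in f)
    sfy = sum((v - fbar) * (w - ybar) for v, w in zip(f, y))
    b1 = sfy / sff
    b0 = ybar - b1 * fbar
    ssr = sum((w - b0 - b1 * v) ** 2 for v, w in zip(f, y))
    s2 = ssr / (T - 2)  # residual variance with two estimated coefficients
    se1 = math.sqrt(s2 / sff)
    se0 = math.sqrt(s2 * (1 / T + fbar ** 2 / sff))
    return b0, b1, se0, se1

# Hypothetical forecasts and realizations, for illustration only.
f = [1.1, 1.9, 3.0, 4.0, 5.1]
y = [1.0, 2.1, 2.9, 4.2, 5.0]
b0, b1, se0, se1 = mincer_zarnowitz(y, f)
# Informal look at each restriction separately via t-ratios.
t0 = b0 / se0
t1 = (b1 - 1) / se1
print(b0, b1, t0, t1)
```

The separate t-ratios are only an informal check. The hypothesis in the text is joint, so a Wald or F test of both restrictions together would be the natural formal procedure, with a serial-correlation-robust covariance estimator for multi-step-ahead forecasts.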
may be very different across different loss functions and/or different horizons. This result has led some to argue the virtues of various "universally applicable" accuracy measures. Clements and Hendry (1993), for example, argue for an accuracy measure under which forecast rankings are invariant to certain transformations.
Ultimately, however, the appropriate loss function depends on the situation at hand. As stressed by Diebold (1993) among many others, forecasts are usually constructed for use in particular decision environments; for example, policy decisions by government officials or trading decisions by market participants. Thus, the appropriate accuracy measure arises from the loss function faced by the forecast user. Economists, for example, may be interested in the profit streams (e.g., Leitch and Tanner, 1991, 1995; Engle et al., 1993) or utility streams (e.g., McCulloch and Rossi, 1990; West, Edison and Cho, 1993) flowing from various forecasts.
Nevertheless, let us discuss a few stylized statistical loss functions, because they are used widely and serve as popular benchmarks. Accuracy measures are usually defined on the forecast errors, $e_{t+k,t} = y_{t+k} - \hat{y}_{t+k,t}$, or percent errors, $p_{t+k,t} = (y_{t+k} - \hat{y}_{t+k,t})/y_{t+k}$. For example, the mean error, $ME = \frac{1}{T}\sum_{t=1}^{T} e_{t+k,t}$, and mean percent error, $MPE = \frac{1}{T}\sum_{t=1}^{T} p_{t+k,t}$, provide measures of bias, which is one component of accuracy. The most common overall accuracy measure, by far, is mean squared error, $MSE = \frac{1}{T}\sum_{t=1}^{T} e_{t+k,t}^2$, or mean squared percent error, $MSPE = \frac{1}{T}\sum_{t=1}^{T} p_{t+k,t}^2$. Often the square roots of these measures are used to preserve units, yielding the root mean squared error, $RMSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_{t+k,t}^2}$, and the root mean squared percent error, $RMSPE = \sqrt{\frac{1}{T}\sum_{t=1}^{T} p_{t+k,t}^2}$. Somewhat less popular, but nevertheless common, accuracy measures are mean absolute error, $MAE = \frac{1}{T}\sum_{t=1}^{T} |e_{t+k,t}|$, and mean absolute percent error, $MAPE = \frac{1}{T}\sum_{t=1}^{T} |p_{t+k,t}|$.
MSE admits an informative decomposition into the sum of the variance of the forecast error and its squared bias,
$$MSE = E[(y_{t+k} - \hat{y}_{t+k,t})^2] = \mathrm{var}(y_{t+k} - \hat{y}_{t+k,t}) + (E[y_{t+k}] - E[\hat{y}_{t+k,t}])^2 ,$$
or equivalently
$$MSE = \mathrm{var}(y_{t+k}) + \mathrm{var}(\hat{y}_{t+k,t}) - 2\,\mathrm{cov}(y_{t+k}, \hat{y}_{t+k,t}) + (E[y_{t+k}] - E[\hat{y}_{t+k,t}])^2 .$$
This result makes clear that MSE depends only on the second moment structure of the joint distribution of the actual and forecasted series. Thus, as noted in Murphy and Winkler (1987, 1992), although MSE is a useful summary statistic for the joint distribution of $y_{t+k}$ and $\hat{y}_{t+k,t}$, in general it contains substantially less information than the actual joint distribution itself. Other statistics highlighting different aspects of the joint distribution may therefore be useful as well. Ultimately, of course, one may want to focus directly on estimates of the joint distribution, which may be available if the sample size is large enough to permit relatively precise estimation.
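The accuracy measures above can be illustrated with a minimal sketch. The function name `accuracy_measures` and the toy data are assumptions introduced here for illustration, and the variance in the decomposition is the sample variance with divisor T, so that the identity MSE = error variance + squared bias holds exactly in sample.

```python
import math

def accuracy_measures(actual, forecast):
    """Common accuracy measures for paired realizations and forecasts.

    Percent errors divide by the realization, so every actual must be nonzero.
    """
    errors = [y - f for y, f in zip(actual, forecast)]
    pct = [e / y for e, y in zip(errors, actual)]
    T = len(errors)
    me = sum(errors) / T
    mse = sum(e * e for e in errors) / T
    mspe = sum(p * p for p in pct) / T
    return {
        "ME": me,
        "MPE": sum(pct) / T,
        "MSE": mse,
        "MSPE": mspe,
        "RMSE": math.sqrt(mse),
        "RMSPE": math.sqrt(mspe),
        "MAE": sum(abs(e) for e in errors) / T,
        "MAPE": sum(abs(p) for p in pct) / T,
        # Sample analogue of the decomposition: MSE = var(error) + bias^2
        "error_variance": mse - me * me,
        "squared_bias": me * me,
    }

# Hypothetical data, for illustration only
actual = [2.0, 4.0, 5.0, 8.0]
forecast = [1.0, 5.0, 5.0, 6.0]
m = accuracy_measures(actual, forecast)
print(m["ME"], m["MSE"], m["MAE"], m["MAPE"])
```

The last two entries show how much of the MSE comes from dispersion of the errors and how much from bias.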
Measuring forecastability
It is natural and informative to evaluate the accuracy of a forecast. We hasten to add, however, that actual and forecasted values may be dissimilar, even for very good forecasts. To take an extreme example, note that the linear least squares forecast for a zero-mean white noise process is simply zero: the paths of forecasts and realizations will look very different, yet there does not exist a better linear forecast under quadratic loss. This example highlights the inherent limits to forecastability, which depends on the process being forecast; some processes are inherently easy to forecast, while others are hard to forecast. In other words, sometimes the information on which the forecaster optimally conditions is very valuable, and sometimes it isn't. The issue of how to quantify forecastability arises at once. Granger and Newbold (1976) propose a natural definition of forecastability for covariance stationary series under squared-error loss, patterned after the familiar $R^2$ of linear regression,
$$G = \frac{\mathrm{var}(\hat{y}_{t+1,t})}{\mathrm{var}(y_{t+1})} = 1 - \frac{\mathrm{var}(e_{t+1,t})}{\mathrm{var}(y_{t+1})} ,$$
where both the forecast and forecast error refer to the optimal (that is, linear least squares or conditional mean) forecast. In closing this section, we note that although measures of forecastability are useful constructs, they are driven by the population properties of processes and their optimal forecasts, so they don't help one to evaluate the "goodness" of an actual reported forecast, which may be far from optimal. For example, if the variance of the forecast error $e_{t+1,t}$ is not much lower than the variance of the covariance stationary series $y_{t+1}$, it could be that either the forecast is poor, the series is inherently almost unforecastable, or both.
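As a worked illustration of the forecastability measure, consider a stationary first-order autoregression. This example is an assumption added here rather than one taken from the text: $u_t$ denotes white noise with variance $\sigma^2$ and $|\phi| < 1$.

```latex
\begin{align*}
y_{t+1} &= \phi\, y_t + u_{t+1}, \qquad
\hat{y}_{t+1,t} = \phi\, y_t, \qquad
e_{t+1,t} = u_{t+1}, \\
\mathrm{var}(y_{t+1}) &= \frac{\sigma^2}{1-\phi^2}, \qquad
\mathrm{var}(\hat{y}_{t+1,t}) = \frac{\phi^2 \sigma^2}{1-\phi^2}, \qquad
\mathrm{var}(e_{t+1,t}) = \sigma^2, \\
G &= \frac{\mathrm{var}(\hat{y}_{t+1,t})}{\mathrm{var}(y_{t+1})}
   = 1 - \frac{\mathrm{var}(e_{t+1,t})}{\mathrm{var}(y_{t+1})}
   = \phi^2 .
\end{align*}
```

Setting $\phi = 0$ recovers the white noise case, for which $G = 0$ even though the zero forecast is optimal.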
F. X. Diebold and J. A. Lopez
Stekler (1987) proposes a rank-based test of the hypothesis that each of a set of forecasts has equal expected loss. 8 Given N competing forecasts, assign to each forecast at each time a rank according to its accuracy (the best forecast receives a rank of N, the second-best receives a rank of N - 1, and so forth). Then aggregate the period-by-period ranks for each forecast, $H^i = \sum_{t=1}^{T} \mathrm{Rank}(L(y_{t+k}, \hat{y}^i_{t+k,t}))$, $i = 1, \ldots, N$, and form the statistic
$$H = \sum_{i=1}^{N} \frac{(H^i - NT/2)^2}{NT/2} .$$
Under the null, $H \sim \chi^2_{N-1}$. As described here, the test requires the rankings to be independent over space and time, but simple modifications along the lines of the Bonferroni bounds test may be made if the rankings are temporally (k - 1)-dependent. Moreover, exact versions of the test may be obtained by exploiting Fisher's randomization principle. 9
One limitation of Stekler's rank-based approach is that information on the magnitude of differences in expected loss across forecasters is discarded. In many applications, one wants to know not only whether the difference of expected losses differs from zero (or the ratio differs from 1), but also by how much it differs. Effectively, one wants to know the sampling distribution of the sample mean loss differential (or of the individual sample mean losses), which in addition to being directly informative would enable Wald tests of the hypothesis that the expected loss differential is zero. Diebold and Mariano (1995), building on earlier work by Granger and Newbold (1986) and Meese and Rogoff (1988), develop a test for a zero expected loss differential that allows for forecast errors that are nonzero mean, non-Gaussian, serially correlated and contemporaneously correlated.
In general, the loss function is $L(y_{t+k}, \hat{y}^i_{t+k,t})$. Because in many applications the loss function will be a direct function of the forecast error, $L(y_{t+k}, \hat{y}^i_{t+k,t}) = L(e^i_{t+k,t})$, we write $L(e^i_{t+k,t})$ from this point on to economize on notation, while recognizing that certain loss functions (such as direction-of-change) don't collapse to the $L(e^i_{t+k,t})$ form. 10 The null hypothesis of equal forecast accuracy for two forecasts is $E[L(e^i_{t+k,t})] = E[L(e^j_{t+k,t})]$, or $E[d_t] = 0$, where $d_t = L(e^i_{t+k,t}) - L(e^j_{t+k,t})$ is the loss differential. If $d_t$ is a covariance stationary, short-memory series, then standard results may be used to deduce the asymptotic distribution of the sample mean loss differential,
$$\sqrt{T}(\bar{d} - \mu) \overset{a}{\sim} N(0, 2\pi f_d(0)) ,$$
8 Stekler uses RMSE, but other loss functions may be used.
9 See, for example, Bradley (1968), Chapter 4.
10 In such cases, the $L(y_{t+k}, \hat{y}^i_{t+k,t})$ form should be used.
where $\bar{d} = \frac{1}{T}\sum_{t=1}^{T} [L(e^i_{t+k,t}) - L(e^j_{t+k,t})]$ is the sample mean loss differential, $f_d(0) = \frac{1}{2\pi}\sum_{\tau=-\infty}^{\infty} \gamma_d(\tau)$ is the spectral density of the loss differential at frequency zero, $\gamma_d(\tau) = E[(d_t - \mu)(d_{t-\tau} - \mu)]$ is the autocovariance of the loss differential at displacement $\tau$, and $\mu$ is the population mean loss differential. The formula for $f_d(0)$ shows that the correction for serial correlation can be substantial, even if the loss differential is only weakly serially correlated, due to the cumulation of the autocovariance terms. In large samples, the obvious statistic for testing the null hypothesis of equal forecast accuracy is the standardized sample mean loss differential,
$$B = \frac{\frac{1}{T}\sum_{t=1}^{T} [L(e^i_{t+k,t}) - L(e^j_{t+k,t})]}{\sqrt{2\pi \hat{f}_d(0)/T}} ,$$
where $\hat{f}_d(0)$ is a consistent estimate of $f_d(0)$.
It is useful to have available exact finite-sample tests of forecast accuracy to complement the asymptotic tests. As usual, variants of the sign and signed-rank tests are applicable. When using the sign test, the null hypothesis is that the median of the loss differential is zero, $\mathrm{median}(L(e^i_{t+k,t}) - L(e^j_{t+k,t})) = 0$. Note that the null of a zero median loss differential is not the same as the null of zero difference between median losses; that is, $\mathrm{median}(L(e^i_{t+k,t}) - L(e^j_{t+k,t})) \neq \mathrm{median}(L(e^i_{t+k,t})) - \mathrm{median}(L(e^j_{t+k,t}))$. For this reason, the null differs slightly in spirit from that associated with the asymptotic Diebold-Mariano test, but nevertheless, it has the intuitive and meaningful interpretation that
the estimation period grows in length relative to the forecast period, the effects of parameter uncertainty vanish, and the Diebold-Mariano and West statistics are identical. West's approach is both more general and less general than the Diebold-Mariano approach. It is more general in that it corrects for nonstationarities induced by the updating of parameter estimates. It is less general in that those corrections are made within the confines of a more rigid framework than that of Diebold and Mariano, in whose framework no assumptions need be made about the often unknown or incompletely known models that underlie forecasts.
In closing this section, we note that it is sometimes informative to compare the accuracy of a forecast to that of a "naive" competitor. A simple and popular such comparison is achieved by Theil's (1961) U statistic, which is the ratio of the 1-step-ahead MSE for a given forecast relative to that of a random walk forecast $\hat{y}_{t+1,t} = y_t$; that is,
$$U = \frac{\sum_{t=1}^{T} (y_{t+1} - \hat{y}_{t+1,t})^2}{\sum_{t=1}^{T} (y_{t+1} - y_t)^2} .$$
Generalization to other loss functions and other horizons is immediate. The statistical significance of the MSE comparison underlying the U statistic may be ascertained using the methods just described. One must remember, of course, that the random walk is not necessarily a naive competitor, particularly for many economic and financial variables, so that values of the U statistic near one are not necessarily "bad." Several authors, including Armstrong and Fildes (1995), have advocated using the U statistic and close relatives for comparing the accuracy of various forecasting methods across series.
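The two comparison tools of this section, the standardized mean loss differential and Theil's U, can be sketched as follows. The function names and toy data are assumptions made for illustration. The long-run variance $2\pi f_d(0)$ is estimated here with a truncated rectangular window using k - 1 sample autocovariances, which is one possible choice among many and can produce a non-positive estimate in small samples.

```python
import math

def loss_differential_statistic(loss_i, loss_j, k=1):
    """Standardized sample mean loss differential for two loss series.

    Uses gamma(0) + 2 * sum_{tau=1}^{k-1} gamma(tau) as the estimate of
    2*pi*f_d(0); this window choice is an assumption of the sketch.
    """
    d = [a - b for a, b in zip(loss_i, loss_j)]
    T = len(d)
    dbar = sum(d) / T

    def gamma(tau):
        # Sample autocovariance of the loss differential at displacement tau
        return sum((d[t] - dbar) * (d[t - tau] - dbar) for t in range(tau, T)) / T

    long_run_var = gamma(0) + 2.0 * sum(gamma(tau) for tau in range(1, k))
    if long_run_var <= 0:
        raise ValueError("non-positive long-run variance estimate")
    return dbar / math.sqrt(long_run_var / T)

def theil_u(levels, forecasts):
    """Theil's U: levels holds y_1..y_{T+1}, forecasts holds forecasts of y_2..y_{T+1}."""
    T = len(forecasts)
    num = sum((levels[t + 1] - forecasts[t]) ** 2 for t in range(T))
    den = sum((levels[t + 1] - levels[t]) ** 2 for t in range(T))
    return num / den

# Hypothetical data, for illustration only
levels = [1.0, 2.0, 4.0, 3.0]
forecasts = [1.5, 3.0, 3.5]
u = theil_u(levels, forecasts)

# Squared-error losses of the forecast and of the random walk benchmark
loss_model = [(levels[t + 1] - forecasts[t]) ** 2 for t in range(3)]
loss_rw = [(levels[t + 1] - levels[t]) ** 2 for t in range(3)]
b = loss_differential_statistic(loss_model, loss_rw, k=1)
print(u, b)
```

A negative statistic indicates lower average loss for the first forecast; with only three observations the asymptotic normal approximation is of course not reliable, and the data serve only to show the mechanics.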
3. Combining forecasts
In forecast accuracy comparison, one asks which forecast is best with respect to a particular loss function. Regardless of whether one forecast is "best," however, the question arises as to whether competing forecasts may be fruitfully combined, in similar fashion to the construction of an asset portfolio, to produce a composite forecast superior to all the original forecasts. Thus, forecast combination, although obviously related to forecast accuracy comparison, is logically distinct and of independent interest.
casts. The idea dates at least to Nelson (1972) and Cooper and Nelson (1975), and was formalized and extended by Chong and Hendry (1986). For simplicity, let us focus on the case of two forecasts, $\hat{y}^1_{t+k,t}$ and $\hat{y}^2_{t+k,t}$. Consider the regression
$$y_{t+k} = \beta_0 + \beta_1 \hat{y}^1_{t+k,t} + \beta_2 \hat{y}^2_{t+k,t} + \varepsilon_{t+k,t} .$$
If $(\beta_0, \beta_1, \beta_2) = (0, 1, 0)$, one says that model 1 forecast-encompasses model 2, and if $(\beta_0, \beta_1, \beta_2) = (0, 0, 1)$, then model 2 forecast-encompasses model 1. For any other $(\beta_0, \beta_1, \beta_2)$ values, neither model encompasses the other, and both forecasts contain useful information about $y_{t+k}$. Under certain conditions, the encompassing hypotheses can be tested using standard methods. 11 Moreover, although it does not yet seem to have appeared in the forecasting literature, it would be straightforward to develop exact finite-sample tests (or bounds tests when k > 1) of the hypothesis using simple generalizations of the distribution-free tests discussed earlier.
Fair and Shiller (1989, 1990) take a different but related approach based on the regression
$$(y_{t+k} - y_t) = \beta_0 + \beta_1 (\hat{y}^1_{t+k,t} - y_t) + \beta_2 (\hat{y}^2_{t+k,t} - y_t) + \varepsilon_{t+k,t} .$$
As before, forecast-encompassing corresponds to coefficient values of (0,1,0) or (0,0,1). Under the null of forecast encompassing, the Chong-Hendry and Fair-Shiller regressions are identical. When the variable being forecast is integrated, however, the Fair-Shiller framework may prove more convenient, because the specification in terms of changes facilitates the use of Gaussian asymptotic distribution theory.
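The point estimates in an encompassing regression of the realization on a constant and two forecasts can be obtained by ordinary least squares. The following is a minimal sketch: the helper names `solve` and `ols` and the data are hypothetical, and no standard errors are computed, so it shows only the coefficient pattern and not a formal test (which, for k > 1, would also have to allow for serially correlated disturbances).

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)

# Hypothetical forecasts; the realization is set equal to forecast 1,
# so forecast 1 encompasses forecast 2 by construction.
f1 = [1.0, 2.0, 0.5, 3.0, 2.5]
f2 = [0.8, 2.5, 1.0, 2.0, 3.5]
y = list(f1)

X = [[1.0, a, b] for a, b in zip(f1, f2)]
beta = ols(X, y)
print(beta)  # approximately (0, 1, 0) up to rounding error
```

The regression in changes is obtained by subtracting the current value of the series from the realization and from both forecasts before calling `ols`.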
Forecast combination
Failure of one model's forecasts to encompass other models' forecasts indicates that all the models examined are misspecified. It should come as no surprise that such situations are typical in practice, because all forecasting models are surely misspecified; they are intentional abstractions of a much more complex reality. What, then, is the role of forecast combination techniques? In a world in which information sets can be instantaneously and costlessly combined, there is no role; it is always optimal to combine information sets rather than forecasts. In the long run, the combination of information sets may sometimes be achieved by improved model specification. But in the short run, particularly when deadlines must be met and timely forecasts produced, pooling of information sets is typically either impossible or prohibitively costly. This simple insight motivates the pragmatic idea of forecast combination, in which forecasts rather than models are the basic object of analysis, due to an assumed inability to combine information sets. Thus, forecast combination can be viewed as a key link between the short-run, real-time forecast production process, and the longer-run, ongoing process of model development.
11 Note that MA(k - 1) serial correlation will typically be present in $\varepsilon_{t+k,t}$ if k > 1.
Many combining methods have been proposed, and they fall roughly into two groups, "variance-covariance" methods and "regression-based" methods. Let us consider first the variance-covariance method due to Bates and Granger (1969). Suppose one has two unbiased forecasts from which a composite is formed as 12
$$\hat{y}^c_{t+k,t} = \omega \hat{y}^1_{t+k,t} + (1 - \omega) \hat{y}^2_{t+k,t} .$$
Because the weights sum to unity, the composite forecast will necessarily be unbiased. Moreover, the combined forecast error will satisfy the same relation as the combined forecast; that is,
$$e^c_{t+k,t} = \omega e^1_{t+k,t} + (1 - \omega) e^2_{t+k,t} ,$$
with a variance $\sigma_c^2 = \omega^2 \sigma_{11}^2 + (1 - \omega)^2 \sigma_{22}^2 + 2\omega(1 - \omega)\sigma_{12}$, where $\sigma_{11}^2$ and $\sigma_{22}^2$ are unconditional forecast error variances and $\sigma_{12}$ is their covariance. The combining weight that minimizes the combined forecast error variance (and hence the combined forecast error MSE, by unbiasedness) is
$$\omega^* = \frac{\sigma_{22}^2 - \sigma_{12}}{\sigma_{22}^2 + \sigma_{11}^2 - 2\sigma_{12}} .$$
Note that the optimal weight is determined by both the underlying variances and covariances. Moreover, it is straightforward to show that, except in the case where one forecast encompasses the other, the forecast error variance from the optimal composite is less than $\min(\sigma_{11}^2, \sigma_{22}^2)$. Thus, in population, one has nothing to lose by combining forecasts and potentially much to gain. In practice, one replaces the unknown variances and covariances that underlie the optimal combining weights with consistent estimates; that is, one estimates $\omega^*$ by replacing $\sigma_{ij}$ with $\hat{\sigma}_{ij} = \frac{1}{T}\sum_{t=1}^{T} e^i_{t+k,t} e^j_{t+k,t}$, yielding
$$\hat{\omega}^* = \frac{\hat{\sigma}_{22}^2 - \hat{\sigma}_{12}}{\hat{\sigma}_{22}^2 + \hat{\sigma}_{11}^2 - 2\hat{\sigma}_{12}} .$$
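The estimated variance-covariance combining weight can be computed directly from two series of forecast errors. The sketch below uses uncentered sample moments, in line with the assumption of unbiased forecasts; the function names and the error series are hypothetical.

```python
def combining_weight(e1, e2):
    """Estimated variance-covariance weight on forecast 1, from two forecast error series."""
    T = len(e1)
    s11 = sum(a * a for a in e1) / T               # error variance of forecast 1
    s22 = sum(b * b for b in e2) / T               # error variance of forecast 2
    s12 = sum(a * b for a, b in zip(e1, e2)) / T   # error covariance
    return (s22 - s12) / (s22 + s11 - 2.0 * s12)

def combined_error_variance(e1, e2, w):
    """Sample variance of the composite error w*e1 + (1 - w)*e2 (uncentered)."""
    T = len(e1)
    return sum((w * a + (1.0 - w) * b) ** 2 for a, b in zip(e1, e2)) / T

# Hypothetical forecast errors, for illustration only
e1 = [1.0, -1.0, 1.0, -1.0]
e2 = [2.0, 2.0, -2.0, -2.0]
w = combining_weight(e1, e2)
v = combined_error_variance(e1, e2, w)
print(w, v)
```

In this example the errors are uncorrelated with variances 1 and 4, so the more accurate forecast receives most of the weight and the composite error variance falls below both individual variances.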
In finite samples of the size typically available, sampling error contaminates the combining weight estimates, and the problem of sampling error is exacerbated by the collinearity that typically exists among primary forecasts. Thus, while one hopes to reduce out-of-sample forecast MSE by combining, there is no guarantee. In practice, however, it turns out that forecast combination techniques often perform very well, as documented in Clemen's (1989) review of the vast literature on forecast combination.
12 The generalization to the case of M > 2 competing unbiased forecasts is straightforward, as shown in Newbold and Granger (1974).
Now consider the "regression method" of forecast combination. The form of the Chong-Hendry and Fair-Shiller encompassing regressions immediately suggests combining forecasts by simply regressing realizations on forecasts. Granger and Ramanathan (1984) showed that the optimal variance-covariance combining weight vector has a regression interpretation as the coefficient vector of a linear projection of $y_{t+k}$ onto the forecasts, subject to two constraints: the weights sum to unity, and no intercept is included. In practice, of course, one simply runs the regression on available data. In general, the regression method is simple and flexible. There are many variations and extensions, because any "regression tool" is potentially applicable. The key is to use generalizations with sound motivation. We shall give four examples: time-varying combining weights, dynamic combining regressions, Bayesian shrinkage of combining weights toward equality, and nonlinear combining regressions.
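The regression interpretation of the combining weight can be sketched for two forecasts. Imposing a zero intercept and weights that sum to unity turns the projection of the realization on the two forecasts into a one-regressor problem: regress $y_{t+k} - \hat{y}^2_{t+k,t}$ on $\hat{y}^1_{t+k,t} - \hat{y}^2_{t+k,t}$ without a constant. The function name and data below are hypothetical.

```python
def constrained_combining_weight(y, f1, f2):
    """Weight on forecast 1 in a no-intercept regression with weights summing to one.

    Under the constraints, y = w*f1 + (1 - w)*f2 + error is equivalent to
    (y - f2) = w*(f1 - f2) + error, a regression through the origin.
    """
    num = sum((yi - b) * (a - b) for yi, a, b in zip(y, f1, f2))
    den = sum((a - b) ** 2 for a, b in zip(f1, f2))
    return num / den

# Hypothetical data: forecast errors are (1,-1,1,-1) and (2,2,-2,-2)
y = [5.0, 6.0, 7.0, 8.0]
f1 = [4.0, 7.0, 6.0, 9.0]
f2 = [3.0, 4.0, 9.0, 10.0]
w_reg = constrained_combining_weight(y, f1, f2)
print(w_reg)
```

Since $y - \hat{y}^2$ is the error of forecast 2 and $\hat{y}^1 - \hat{y}^2$ is the difference of the two errors, the regression coefficient reduces algebraically to the variance-covariance weight computed from uncentered error moments. Dropping either constraint leads to an ordinary multiple regression instead.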
a. Time-varying combining weights
Time-varying combining weights were proposed in the variance-covariance context by Granger and Newbold (1973) and in the regression context by Diebold and Pauly (1987). In the regression framework, for example, one may undertake weighted or rolling estimation of combining regressions, or one may estimate combining regressions with explicitly time-varying parameters. The potential desirability of time-varying weights stems from a number of sources. First, different learning speeds may lead to a particular forecast improving over time relative to others. In such situations, one naturally wants to weight the improving forecast progressively more heavily. Second, the design of various forecasting models may make them relatively better forecasting tools in some situations than in others. For example, a structural model with a highly developed wage-price sector may substantially outperform a simpler model during times of high inflation. In such times, the more sophisticated model should receive higher weight. Third, the parameters in agents' decision rules may drift over time, and certain forecasting techniques may be relatively more vulnerable to such drift.
b. Dynamic combining regressions
Serially correlated errors arise naturally in combining regressions. Diebold (1988) considers the covariance stationary case and argues that serial correlation is likely to appear in unrestricted regression-based forecast combining regressions when $\beta_1 + \beta_2 \neq 1$. More generally, it may be a good idea to allow for serial correlation in combining regressions to capture any dynamics in the variable to be forecast not captured by the various forecasts. In that regard, Coulson and Robins (1993), following Hendry and Mizon (1978), point out that a combining regression with serially correlated disturbances is a special case of a combining regression that includes lagged dependent variables and lagged forecasts, which they advocate.
$$\beta_{posterior} = (Q + F'F)^{-1}(Q \beta_{prior} + F'F \hat{\beta}) ,$$
where $\beta_{prior}$ is the prior mean vector, Q is the prior precision matrix, F is the design matrix for the combining regression, and $\hat{\beta}$ is the vector of least squares combining weights. The obvious shrinkage direction is toward a measure of central tendency (e.g., the arithmetic mean). In this way, the combining weights are coaxed toward the arithmetic mean, but the data are still allowed to speak, when (and if) they have something to say.
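A minimal sketch of shrinkage toward equal weights follows. It assumes that the posterior mean takes the usual normal-linear form $(Q + F'F)^{-1}(Q\beta_{prior} + F'F\hat{\beta})$ and, as a further simplifying assumption made only for this illustration, that the prior precision is proportional to the data precision, $Q = qF'F$, in which case the posterior mean collapses to a scalar-weighted average of prior mean and least squares weights. The function name and numbers are hypothetical.

```python
def shrink_weights(beta_hat, beta_prior, q):
    """Posterior mean combining weights when Q = q * F'F (an assumed special case).

    q = 0 returns the least squares weights; large q returns the prior mean.
    """
    return [(q * b0 + b) / (1.0 + q) for b, b0 in zip(beta_hat, beta_prior)]

# Hypothetical least squares weights shrunk toward the arithmetic mean (0.5, 0.5)
beta_hat = [0.9, 0.1]
beta_prior = [0.5, 0.5]
posterior = shrink_weights(beta_hat, beta_prior, q=1.0)
print(posterior)
```

With q = 1 the data and the prior are given equal say, so each weight moves halfway toward 0.5.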
The states that govern the combining weights can depend on past forecast errors from one or both models or on various economic variables. Furthermore, the indicator weight need not be simply a binary variable; the transition between states can be made more gradual by allowing weights to be functions of the forecast errors or economic variables.
4. Special topics in evaluating economic and financial forecasts
The question of how to evaluate such forecasts immediately arises. Our earlier results on tests for forecast accuracy comparison remain valid, appropriately modified, so we shall not restate them here. Instead, we note that one frequently sees assessments of whether direction-of-change forecasts "have value," and we shall discuss that issue. The question as to whether a direction-of-change forecast has value by necessity involves comparison to a naive benchmark: the direction-of-change forecast is compared to a "naive" coin flip (with success probability equal to the relevant marginal). Consider a 2 x 2 contingency table. For ease of notation, call the two states into which forecasts and realizations fall "i" and "j". Commonly, for example, i = "up" and j = "down." Tables 1 and 2 make clear our notation regarding observed cell counts and unobserved cell probabilities. The null hypothesis that a direction-of-change forecast has no value is that the forecasts and realizations are independent, in which case $P_{ij} = P_{i.}P_{.j}$, $\forall i, j$. As always, one proceeds under the null. The true cell probabilities are of course unknown, so one uses the consistent estimates $\hat{P}_{i.} = O_{i.}/O$ and $\hat{P}_{.j} = O_{.j}/O$. Then one consistently estimates the expected cell counts under the null, $E_{ij} = P_{i.}P_{.j}O$, by $\hat{E}_{ij} = \hat{P}_{i.}\hat{P}_{.j}O = O_{i.}O_{.j}/O$. Finally, one constructs the statistic $C = \sum_{i,j=1}^{2} (O_{ij} - \hat{E}_{ij})^2/\hat{E}_{ij}$. Under the null, $C \overset{d}{\to} \chi^2_1$.
An intimately related test of forecast value was proposed by Merton (1981) and Henriksson and Merton (1981), who assert that a forecast has value if $P_{ii}/P_{i.} + P_{jj}/P_{j.} > 1$. They therefore develop an exact test of the null hypothesis that $P_{ii}/P_{i.} + P_{jj}/P_{j.} = 1$ against the inequality alternative.
A key insight, noted in varying degrees by Schnader and Stekler (1990) and Stekler (1994), and formalized by Pesaran and Timmermann (1992), is that the Henriksson-Merton null is equivalent to the contingency-table null if the marginal probabilities are fixed at the observed relative frequencies, $O_{i.}/O$ and $O_{.j}/O$. The same unpalatable assumption is necessary for deriving the exact finite-sample distribution of the Henriksson-Merton test statistic.
Table 1 Observed cell counts
              Actual i     Actual j     Marginal
Forecast i    $O_{ii}$     $O_{ij}$     $O_{i.}$
Forecast j    $O_{ji}$     $O_{jj}$     $O_{j.}$
Marginal      $O_{.i}$     $O_{.j}$     Total: O
Table 2 Unobserved cell probabilities
              Actual i     Actual j     Marginal
Forecast i    $P_{ii}$     $P_{ij}$     $P_{i.}$
Forecast j    $P_{ji}$     $P_{jj}$     $P_{j.}$
Marginal      $P_{.i}$     $P_{.j}$     Total: 1
Asymptotically, however, all is well; the square of the Henriksson-Merton statistic, appropriately normalized, is asymptotically equivalent to C, the chi-squared contingency table statistic. Moreover, the 2 x 2 contingency table test generalizes trivially to the N x N case, with
$$C_N = \sum_{i,j=1}^{N} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} .$$
Under the null, $C_N \overset{a}{\sim} \chi^2_{(N-1)(N-1)}$. A subtle point arises, however, as pointed out by Pesaran and Timmermann (1992). In the 2 x 2 case, one must base the test on the entire table, as the off-diagonal elements are determined by the diagonal elements, because the two elements of each row must sum to one. In the N x N case, in contrast, there is more latitude as to which cells to examine, and for purposes of forecast evaluation, it may be desirable to focus only on the diagonal cells.
In closing this section, we note that although the contingency table tests are often of interest in the direction-of-change context (for the same reason that tests based on Theil's U-statistic are often of interest in more standard contexts), forecast "value" in that sense is neither a necessary nor sufficient condition for forecast value in terms of a profitable trading strategy yielding significant excess returns. For example, one might beat the marginal forecast but still earn no excess returns after adjusting for transactions costs. Alternatively, one might do worse than the marginal but still make huge profits if the "hits" are "big," a point stressed by Cumby and Modest (1987).
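The contingency-table statistic for direction-of-change forecasts can be sketched as follows. The function names and the cell counts are hypothetical; rows are taken to index forecasts and columns realizations. The p-value helper covers only the one-degree-of-freedom (2 x 2) case, using the fact that a chi-squared variate with one degree of freedom is the square of a standard normal.

```python
import math

def contingency_statistic(counts):
    """Chi-squared statistic for an N x N table of observed cell counts."""
    n = len(counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(counts[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_totals)
    c = 0.0
    for i in range(n):
        for j in range(n):
            # Estimated expected count under independence
            expected = row_totals[i] * col_totals[j] / total
            c += (counts[i][j] - expected) ** 2 / expected
    return c

def chi2_one_df_pvalue(c):
    """Upper-tail probability of a chi-squared(1) variate: P(|Z| > sqrt(c))."""
    return math.erfc(math.sqrt(c / 2.0))

# Hypothetical counts: 30 correct "up", 30 correct "down", 10 misses each way
table = [[30, 10],
         [10, 30]]
c_stat = contingency_statistic(table)
p_value = chi2_one_df_pvalue(c_stat)
print(c_stat, p_value)
```

For larger tables the statistic is computed the same way, but the reference distribution has (N - 1)(N - 1) degrees of freedom and the p-value helper above no longer applies.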
14 The probability forecast assigned to the Nth event is implicitly determined by the restriction that the probabilities sum to 1.
$$QPS = \frac{1}{T}\sum_{t=1}^{T} 2(P_{t+k,t} - R_{t+k})^2 .$$
Clearly, $QPS \in [0, 2]$, and it has a negative orientation (smaller values indicate more accurate forecasts). 15 To understand the QPS, note that the accuracy of any forecast refers to the expected loss when using that forecast, and typically loss depends on the deviation between forecasts and realizations. It seems reasonable, then, in the context of probability forecasting under quadratic loss, to track the average squared divergence between $P_{t+k,t}$ and $R_{t+k}$, which is what the QPS does. Thus, the QPS is a rough probability-forecast analog of MSE.
The QPS is only a rough analog of MSE, however, because $P_{t+k,t}$ is in fact not a forecast of the outcome (which is 0-1), but rather a probability assigned to it. A more natural and direct way to evaluate probability forecasts is simply to compare the forecasted probabilities to observed relative frequencies; that is, to assess calibration. An overall measure of calibration is the global squared bias, $GSB = 2(\bar{P} - \bar{R})^2$, where $\bar{P} = \frac{1}{T}\sum_{t=1}^{T} P_{t+k,t}$ and $\bar{R} = \frac{1}{T}\sum_{t=1}^{T} R_{t+k}$. $GSB \in [0, 2]$ with a negative orientation.
Calibration may also be examined locally in any subset of the unit interval. For example, one might check whether the observed relative frequency corresponding to probability forecasts between 0.6 and 0.7 is also between 0.6 and 0.7. One may go farther to form a weighted average of local calibration across all cells of a partition of the unit interval into J subsets chosen according to the user's interest and the specifics of the situation. 16 This leads to the local squared bias measure,
$$LSB = \frac{1}{T}\sum_{j=1}^{J} 2T_j(\bar{P}_j - \bar{R}_j)^2 ,$$
where $T_j$ is the number of probability forecasts in set j, $\bar{P}_j$ is the average forecast in set j, and $\bar{R}_j$ is the average realization in set j, $j = 1, \ldots, J$. Note that $LSB \in [0, 2]$, and LSB = 0 implies that GSB = 0, but not conversely.
Testing for adequate calibration is a straightforward matter, at least under independence of the realizations. For a given event and a corresponding sequence of forecasted probabilities $\{P_{t+k,t}\}_{t=1}^{T}$, create J mutually exclusive and collectively exhaustive subsets of forecasts, and denote the midpoint of each range $\pi_j$, $j = 1, \ldots, J$. Let $R_j$ denote the number of observed events when the forecast was in set j, and define "range j" calibration statistics,
15 The "2" that appears in the QPS formula is an artifact from the full vector case. We could of course drop it without affecting the QPS rankings of competing forecasts, but we leave it to maintain comparability to other literature.
16 For example, Diebold and Rudebusch (1989) split the unit interval into ten equal parts.
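$$Z_j = \frac{R_j - T_j\pi_j}{(T_j\pi_j(1 - \pi_j))^{1/2}} \equiv \frac{R_j - e_j}{w_j^{1/2}} , \quad j = 1, \ldots, J ,$$
and an overall statistic,
$$Z_0 = \frac{R_+ - e_+}{w_+^{1/2}} ,$$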
where $R_+ = \sum_{j=1}^{J} R_j$, $e_+ = \sum_{j=1}^{J} T_j\pi_j$, and $w_+ = \sum_{j=1}^{J} T_j\pi_j(1 - \pi_j)$. $Z_0$ is a joint test of adequate local calibration across all cells, while the $Z_j$ statistics test cell-by-cell local calibration. 17 Under independence, the binomial structure would obviously imply that $Z_0 \overset{a}{\sim} N(0, 1)$, and $Z_j \overset{a}{\sim} N(0, 1)$, $\forall j = 1, \ldots, J$. In a fascinating development, Seillier-Moiseiwitsch and Dawid (1993) show that the asymptotic normality holds much more generally, including in the dependent situations of practical relevance.
One additional feature of probability forecasts (or more precisely, of the corresponding realizations), called resolution, is of interest: $RES = \frac{1}{T}\sum_{j=1}^{J} 2T_j(\bar{R}_j - \bar{R})^2$. RES is simply the weighted average squared divergence between $\bar{R}$ and the $\bar{R}_j$'s, a measure of how much the observed relative frequencies move across cells. $RES \geq 0$ and has a positive orientation. As shown by Murphy (1973), an informative decomposition of QPS exists,
$$QPS = QPS_{\bar{R}} + LSB - RES ,$$
where $QPS_{\bar{R}}$ is the QPS evaluated at $P_{t+k,t} = \bar{R}$. This decomposition highlights the tradeoffs between the various attributes of probability forecasts.
Just as with Theil's U-statistic for "standard" forecasts, it is sometimes informative to compare the performance of a particular probability forecast to that of a benchmark. Murphy (1974), for example, proposes the statistic
$$M = QPS - QPS_{\bar{R}} = LSB - RES ,$$
which measures the difference in accuracy between the forecast at hand and the benchmark forecast $\bar{R}$. Using the earlier-discussed Diebold-Mariano approach, one can also assess the significance of differences in QPS and $QPS_{\bar{R}}$, differences in QPS or various other measures of probability forecast accuracy across forecasters, or differences in local or global calibration across forecasters.
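The probability-forecast measures of this section can be computed together, as in the sketch below. The function name, the equal-width partition of the unit interval into J cells, and the data are assumptions made for illustration. The decomposition of QPS holds exactly in sample only when all forecasts within a cell are identical, which is why the example uses just two distinct forecast values.

```python
def probability_scores(P, R, J):
    """QPS, GSB, LSB, RES and benchmark QPS using J equal-width cells on [0, 1]."""
    T = len(P)
    pbar = sum(P) / T
    rbar = sum(R) / T
    qps = sum(2.0 * (p - r) ** 2 for p, r in zip(P, R)) / T
    gsb = 2.0 * (pbar - rbar) ** 2
    # Assign each forecast to a cell; a forecast of exactly 1.0 goes in the last cell
    cells = {}
    for p, r in zip(P, R):
        j = min(int(p * J), J - 1)
        cells.setdefault(j, []).append((p, r))
    lsb = 0.0
    res = 0.0
    for members in cells.values():
        Tj = len(members)
        pj = sum(p for p, _ in members) / Tj   # average forecast in the cell
        rj = sum(r for _, r in members) / Tj   # observed relative frequency in the cell
        lsb += 2.0 * Tj * (pj - rj) ** 2
        res += 2.0 * Tj * (rj - rbar) ** 2
    lsb /= T
    res /= T
    qps_rbar = sum(2.0 * (rbar - r) ** 2 for r in R) / T  # QPS of the constant forecast rbar
    return {"QPS": qps, "GSB": gsb, "LSB": lsb, "RES": res, "QPS_Rbar": qps_rbar}

# Hypothetical probability forecasts and 0-1 realizations
P = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
R = [0, 0, 0, 1, 1, 1, 1, 0]
s = probability_scores(P, R, J=2)
print(s)
```

The difference `s["QPS"] - s["QPS_Rbar"]` is the benchmark comparison statistic described above; a negative value indicates that the forecast beats the constant forecast equal to the sample relative frequency.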
17 One may of course test for adequate global calibration by using a trivial partition of the unit interval, namely the unit interval itself.
Many interesting questions in finance, such as options pricing, risk hedging and portfolio management, explicitly depend upon the variances of asset prices. Thus, a variety of methods have been proposed for generating volatility forecasts. As opposed to point or probability forecasts, evaluation of volatility forecasts is complicated by the fact that actual conditional variances are unobservable. A standard "solution" to this unobservability problem is to use the squared realization $\varepsilon_{t+k}^2$ as a proxy for the true conditional variance $h_{t+k}$, because $E[\varepsilon_{t+k}^2 | \Omega_{t+k-1}] = E[h_{t+k}v_{t+k}^2 | \Omega_{t+k-1}] = h_{t+k}$, where $v_{t+k} \sim WN(0, 1)$. 18 Thus, for example, $MSE = \frac{1}{T}\sum_{t=1}^{T} (\varepsilon_{t+k}^2 - \hat{h}_{t+k,t})^2$. Although MSE is often used to measure volatility forecast accuracy, Bollerslev, Engle and Nelson (1994) point out that MSE is inappropriate, because it penalizes positive volatility forecasts and negative volatility forecasts (which are meaningless) symmetrically. Two alternative loss functions that penalize volatility forecasts asymmetrically are the logarithmic loss function employed in Pagan and Schwert (1990),
$$LL = \frac{1}{T}\sum_{t=1}^{T} \left[\ln(\varepsilon_{t+k}^2) - \ln(\hat{h}_{t+k,t})\right]^2 ,$$
and the heteroskedasticity-adjusted MSE,
$$HMSE = \frac{1}{T}\sum_{t=1}^{T} \left[\frac{\varepsilon_{t+k}^2}{\hat{h}_{t+k,t}} - 1\right]^2 .$$
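The statistical loss functions for volatility forecasts can be sketched as follows, using squared realizations as the volatility proxy. The function name and data are hypothetical. The logarithmic loss is taken here to be the mean squared difference between the log squared realization and the log variance forecast, and the Gaussian quasi-likelihood loss to be the average of the log variance forecast plus the ratio of squared realization to variance forecast; both forms are stated as assumptions of this sketch.

```python
import math

def volatility_losses(innovations, variance_forecasts):
    """MSE, logarithmic loss, HMSE and Gaussian quasi-likelihood loss.

    Requires nonzero innovations (for the logarithm) and positive variance forecasts.
    """
    T = len(innovations)
    sq = [e * e for e in innovations]
    pairs = list(zip(sq, variance_forecasts))
    return {
        "MSE": sum((s - h) ** 2 for s, h in pairs) / T,
        "LL": sum((math.log(s) - math.log(h)) ** 2 for s, h in pairs) / T,
        "HMSE": sum((s / h - 1.0) ** 2 for s, h in pairs) / T,
        "GMLE": sum(math.log(h) + s / h for s, h in pairs) / T,
    }

# Hypothetical innovations and variance forecasts
eps = [1.0, 2.0]
h_hat = [1.0, 2.0]
losses = volatility_losses(eps, h_hat)
print(losses)
```

Because the four measures weight over- and under-prediction differently, rankings of competing volatility forecasts can change with the loss function chosen.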
Bollerslev, Engle and Nelson (1994) suggest the loss function implicit in the Gaussian quasi-maximum likelihood function often used in fitting volatility models; that is, $GMLE = \frac{1}{T}\sum_{t=1}^{T} \left[\ln(\hat{h}_{t+k,t}) + \frac{\varepsilon_{t+k}^2}{\hat{h}_{t+k,t}}\right]$. As with all forecast evaluations, the volatility forecast evaluations of most interest to forecast users are those conducted under the relevant loss function. West, Edison and Cho (1993) and Engle et al. (1993) make important contributions along those lines, proposing economic loss functions based on utility maximization and profit maximization, respectively.
18 Although $\varepsilon_{t+k}^2$ is an unbiased estimator of $h_{t+k}$, it is an imprecise or "noisy" estimator. For example, if $v_{t+k} \sim N(0, 1)$, $\varepsilon_{t+k}^2 = h_{t+k}v_{t+k}^2$ has a conditional mean of $h_{t+k}$ because $v_{t+k}^2 \sim \chi^2_1$. Yet, because the median of a $\chi^2_1$ distribution is 0.455, $\varepsilon_{t+k}^2 < \frac{1}{2}h_{t+k}$ more than fifty percent of the time.
Lopez (1995) proposes a framework for volatility forecast evaluation that allows for a variety of economic loss functions. The framework is based on transforming volatility forecasts into probability forecasts by integrating over the assumed or estimated distribution of $\varepsilon_t$. By selecting the range of integration corresponding to an event of interest, a forecast user can incorporate elements of her loss function into the probability forecasts. For example, given $\varepsilon_{t+k} | \Omega_t \sim D(0, h_{t+k,t})$ and a volatility forecast $\hat{h}_{t+k,t}$, an options trader interested in the event $\varepsilon_{t+k} \in [L_{\varepsilon,t+k}, U_{\varepsilon,t+k}]$ would generate the probability forecast
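$$P_{t+k,t} = \Pr(L_{\varepsilon,t+k} < \varepsilon_{t+k} < U_{\varepsilon,t+k}) = \Pr\left(\frac{L_{\varepsilon,t+k}}{\sqrt{\hat{h}_{t+k,t}}} < z_{t+k} < \frac{U_{\varepsilon,t+k}}{\sqrt{\hat{h}_{t+k,t}}}\right) = \int_{l_{\varepsilon,t+k}}^{u_{\varepsilon,t+k}} f(z_{t+k})\,dz_{t+k} ,$$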
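where $z_{t+k}$ is the standardized innovation, $f(z_{t+k})$ is the functional form of D(0, 1), and $[l_{\varepsilon,t+k}, u_{\varepsilon,t+k}]$ is the standardized range of integration. In contrast, a forecast user interested in the behavior of the underlying asset, $y_{t+k} = \mu_{t+k,t} + \varepsilon_{t+k}$ where $\mu_{t+k,t} = E[y_{t+k} | \Omega_t]$, might generate the probability forecast
$$P_{t+k,t} = \Pr(L_{y,t+k} < y_{t+k} < U_{y,t+k}) = \Pr\left(\frac{L_{y,t+k} - \hat{\mu}_{t+k,t}}{\sqrt{\hat{h}_{t+k,t}}} < z_{t+k} < \frac{U_{y,t+k} - \hat{\mu}_{t+k,t}}{\sqrt{\hat{h}_{t+k,t}}}\right) ,$$
where $\hat{\mu}_{t+k,t}$ is the forecast of the conditional mean.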
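Transforming a volatility forecast into a probability forecast for an event can be sketched under the assumption, made here purely for illustration, that the standardized distribution D(0, 1) is standard normal. The function names and numerical inputs are hypothetical.

```python
import math

def standard_normal_cdf(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def event_probability(lower, upper, mean_forecast, variance_forecast):
    """Probability that the outcome falls in [lower, upper], assuming normality.

    The bounds are standardized with the forecast mean and the square root of
    the variance forecast, and the standard normal density is integrated
    between the standardized bounds.
    """
    scale = math.sqrt(variance_forecast)
    l = (lower - mean_forecast) / scale
    u = (upper - mean_forecast) / scale
    return standard_normal_cdf(u) - standard_normal_cdf(l)

# Hypothetical event: innovation within one forecast standard deviation of zero
p_event = event_probability(lower=-2.0, upper=2.0, mean_forecast=0.0, variance_forecast=4.0)
print(p_event)
```

Setting `mean_forecast` to zero gives the probability forecast for an event defined on the innovation; a nonzero value gives the forecast for an event defined on the underlying series. The resulting probability forecasts can then be evaluated with the tools for probability forecasts.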
$$E[I_{t+k,t} | I_{t,t-k}, I_{t-1,t-k-1}, \ldots, I_{k+1,1}] = (1 - \alpha) ,$$
where $I_{t+k,t} = 1$ if $y_{t+k} \in [L_{y,t+k}, U_{y,t+k}]$, and $I_{t+k,t} = 0$ otherwise.
That is, Christoffersen suggests checking conditional coverage. 19 Standard evaluation methods for interval forecasts typically restrict attention to unconditional coverage, $E[I_{t+k,t}] = (1 - \alpha)$. But simply checking unconditional coverage is insufficient in general, because an interval forecast with correct unconditional coverage may nevertheless have incorrect conditional coverage at any particular time. For one-step-ahead interval forecasts (k = 1), the conditional coverage criterion becomes
$$E[I_{t+1,t} | I_{t,t-1}, I_{t-1,t-2}, \ldots, I_{2,1}] = (1 - \alpha) ,$$
or equivalently,
$$I_{t+1,t} \overset{iid}{\sim} \mathrm{Bern}(1 - \alpha) .$$
Given T values of the indicator variable for T interval forecasts, one can determine whether the forecast intervals display correct conditional coverage by testing the hypothesis that the indicator variable is an iid Bernoulli$(1 - \alpha)$ random variable. A likelihood ratio test of the iid Bernoulli hypothesis is readily constructed by comparing the log likelihoods of restricted and unrestricted Markov processes for the indicator series $\{I_{t+1,t}\}$. The unrestricted transition probability matrix is
$$\Pi = \begin{bmatrix} \pi_{11} & 1 - \pi_{11} \\ 1 - \pi_{00} & \pi_{00} \end{bmatrix} ,$$
where $\pi_{11} = P(I_{t+1,t} = 1 | I_{t,t-1} = 1)$, and so forth. The transition probability matrix under the null is
$$\begin{bmatrix} 1 - \alpha & \alpha \\ 1 - \alpha & \alpha \end{bmatrix} .$$
The corresponding approximate likelihood functions are
$$L(\Pi | I) = \pi_{11}^{n_{11}} (1 - \pi_{11})^{n_{10}} (1 - \pi_{00})^{n_{01}} \pi_{00}^{n_{00}}$$
and
$$L(\alpha | I) = (1 - \alpha)^{(n_{11} + n_{01})} \alpha^{(n_{10} + n_{00})} ,$$
where $n_{ij}$ is the number of observed transitions from i to j and I is the indicator sequence. 20 The likelihood ratio statistic for the conditional coverage hypothesis is
$$LR_{cc} = 2[\ln L(\hat{\Pi} | I) - \ln L(\alpha | I)] ,$$
where $\hat{\Pi}$ are the maximum likelihood estimates. Under the null hypothesis, $LR_{cc} \overset{a}{\sim} \chi^2_2$.
19 In general, one wants to test whether $E[I_{t+k,t} | \Omega_t] = (1 - \alpha)$, where $\Omega_t$ is all information available at time t. For present purposes, $\Omega_t$ is restricted to past values of the indicator sequence in order to construct general and easily applied tests.
20 The likelihoods are approximate because the initial terms are dropped. All the likelihood ratio tests presented are of course asymptotic, so the treatment of the initial terms is inconsequential.
The likelihood ratio test of conditional coverage can be decomposed into two separately interesting hypotheses, correct unconditional coverage, $E[I_{t+1,t}] = (1 - \alpha)$, and independence, $\pi_{11} = 1 - \pi_{00}$. The likelihood ratio test for correct unconditional coverage (given independence) is
$$LR_{uc} = 2[\ln L(\hat{\pi} | I) - \ln L(\alpha | I)] ,$$
where $L(\pi | I) = (1 - \pi)^{(n_{11} + n_{01})} \pi^{(n_{10} + n_{00})}$. Under the null hypothesis, $LR_{uc} \overset{a}{\sim} \chi^2_1$. The independence hypothesis is tested separately by
$$LR_{ind} = 2[\ln L(\hat{\Pi} | I) - \ln L(\hat{\pi} | I)] .$$
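Under the null hypothesis, $LR_{ind} \overset{a}{\sim} \chi^2_1$. It is apparent that $LR_{cc} = LR_{uc} + LR_{ind}$, in small as well as large samples. The independence property can also be checked in the case where $k = 1$ using the group test of David (1947), which is an exact and uniformly most powerful test against first-order dependence. Define a group as a string of consecutive zeros or ones, and let $r$ be the number of groups in the sequence $\{I_{t+1,t}\}$. Under the null that the sequence is iid, the distribution of $r$ given the total number of ones, $n_1$, and the total number of zeros, $n_0$, is $f_r$ for $r \ge 2$, where
$$f_{2s} = \frac{2\binom{n_1 - 1}{s - 1}\binom{n_0 - 1}{s - 1}}{\binom{n}{n_1}} \quad \text{for } r = 2s \text{ even},$$
$$f_{2s+1} = \frac{\binom{n_1 - 1}{s}\binom{n_0 - 1}{s - 1} + \binom{n_1 - 1}{s - 1}\binom{n_0 - 1}{s}}{\binom{n}{n_1}} \quad \text{for } r = 2s + 1 \text{ odd},$$
and $n = n_0 + n_1$.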
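The group (runs) distribution above can be tabulated directly for small samples. The following is a minimal sketch in Python rather than a definitive implementation: the function names `group_count`, `group_pmf` and `lower_tail_pvalue` are hypothetical, the indicator series is assumed to be a list of 0/1 values, and the use of a lower-tail p-value (too few groups, i.e. clustering of hits and misses) is an assumption about the alternative of interest.

```python
from math import comb


def group_count(indicators):
    """Number of groups (strings of consecutive equal values) in a 0/1 list."""
    return 1 + sum(a != b for a, b in zip(indicators, indicators[1:]))


def group_pmf(n1, n0):
    """Null distribution f_r of the number of groups given n1 ones and n0 zeros."""
    if n1 < 1 or n0 < 1:
        raise ValueError("need at least one 1 and one 0")
    total = comb(n1 + n0, n1)
    pmf = {}
    for r in range(2, n1 + n0 + 1):
        s, odd = divmod(r, 2)
        if odd:  # r = 2s + 1
            ways = (comb(n1 - 1, s) * comb(n0 - 1, s - 1)
                    + comb(n1 - 1, s - 1) * comb(n0 - 1, s))
        else:    # r = 2s
            ways = 2 * comb(n1 - 1, s - 1) * comb(n0 - 1, s - 1)
        if ways:
            pmf[r] = ways / total
    return pmf


def lower_tail_pvalue(indicators):
    """P(r <= observed r) under the iid null, conditional on n1 and n0."""
    n1 = sum(indicators)
    n0 = len(indicators) - n1
    r_obs = group_count(indicators)
    return sum(p for r, p in group_pmf(n1, n0).items() if r <= r_obs)


hits = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]  # illustrative data, not from the text
print(group_count(hits), lower_tail_pvalue(hits))
```

A small lower-tail probability suggests that zeros and ones cluster more than an iid sequence would, which is the pattern expected under positive first-order dependence.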
Finally, the generalization to $k > 1$ is simple in the likelihood ratio framework, in spite of the fact that $k$-step-ahead prediction errors are serially correlated in general. The basic framework remains intact but requires a $k$th-order Markov chain. A $k$th-order chain, however, can always be written as a first-order chain with an expanded state space, so that direct analogs of the results for the first-order case apply.
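For the one-step-ahead case, the three likelihood ratio statistics described in this section can be computed from the transition counts alone. The sketch below is illustrative only: the function name `coverage_lr_tests` is hypothetical, the indicator is assumed to be coded 1 when the realization falls inside the interval, the first observation is used only as an initial condition (the approximate likelihood), and the p-values use the closed-form survival functions of the chi-square distribution with one and two degrees of freedom.

```python
from math import erfc, exp, log, sqrt


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def coverage_lr_tests(hits, alpha):
    """LR_uc, LR_ind and LR_cc for a 0/1 list; nominal coverage is 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # n[(i, j)] counts observed transitions from state i to state j.
    n = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for prev, curr in zip(hits, hits[1:]):
        n[(prev, curr)] += 1
    n11, n10, n01, n00 = n[(1, 1)], n[(1, 0)], n[(0, 1)], n[(0, 0)]
    into_one, into_zero = n11 + n01, n10 + n00
    if into_one + into_zero == 0:
        raise ValueError("need at least two observations")

    # Null: iid Bernoulli(1 - alpha).
    ll_null = _xlogy(into_one, 1 - alpha) + _xlogy(into_zero, alpha)
    # Independence with an unrestricted probability pi of a zero.
    pi_hat = into_zero / (into_one + into_zero)
    ll_indep = _xlogy(into_one, 1 - pi_hat) + _xlogy(into_zero, pi_hat)
    # Unrestricted first-order Markov chain (MLEs are transition frequencies).
    p11 = n11 / (n11 + n10) if n11 + n10 else 0.0
    p00 = n00 / (n00 + n01) if n00 + n01 else 0.0
    ll_markov = (_xlogy(n11, p11) + _xlogy(n10, 1 - p11)
                 + _xlogy(n01, 1 - p00) + _xlogy(n00, p00))

    lr_uc = 2 * (ll_indep - ll_null)
    lr_ind = 2 * (ll_markov - ll_indep)
    lr_cc = 2 * (ll_markov - ll_null)
    return {
        "LR_uc": lr_uc, "p_uc": erfc(sqrt(max(lr_uc, 0.0) / 2)),     # chi2(1)
        "LR_ind": lr_ind, "p_ind": erfc(sqrt(max(lr_ind, 0.0) / 2)),  # chi2(1)
        "LR_cc": lr_cc, "p_cc": exp(-max(lr_cc, 0.0) / 2),            # chi2(2)
    }


hits = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0]  # illustrative data, not from the text
print(coverage_lr_tests(hits, alpha=0.2))
```

Because the three log likelihoods are nested, the returned statistics satisfy $LR_{cc} = LR_{uc} + LR_{ind}$ up to floating-point error, in line with the decomposition noted above.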
5. Concluding remarks
Three modern themes permeate this survey, so it is worth highlighting them explicitly. The first theme is that various types of forecasts, such as probability forecasts and volatility forecasts, are becoming more integrated into economic and financial decision making, leading to a derived demand for new types of forecast evaluation procedures.
The second theme is the use of exact finite-sample hypothesis tests, typically based on distribution-free nonparametrics. We explicitly sketched such tests in the context of forecast-error unbiasedness, k-dependence, orthogonality to available information, and when more than one forecast is available, in the context of testing equality of expected loss, testing whether a direction-of-change forecast has value, etc. The third theme is use of the relevant loss function. This idea arose in many places, such as in forecastability measures and forecast accuracy comparison tests, and may readily be introduced in others, such as orthogonality tests, encompassing tests and combining regressions. In fact, an integrated tool kit for estimation, forecasting, and forecast evaluation (and hence model selection and nonnested hypothesis testing) under the relevant loss function is rapidly becoming available; see Weiss and Andersen (1984), Weiss (1995), Diebold and Mariano (1995), Christoffersen and Diebold (1994, 1995), and Diebold, Ohanian and Berkowitz (1995).
References
Armstrong, J. S. and R. Fildes (1995). On the selection of error measures for comparisons among forecasting methods. J. Forecasting 14, 67-71.
Auerbach, A. (1994). The U.S. fiscal problem: Where we are, how we got here and where we're going. NBER Macroeconomics Annual, MIT Press, Cambridge, MA.
Bates, J. M. and C. W. J. Granger (1969). The combination of forecasts. Oper. Res. Quart. 20, 451-468.
Bollerslev, T., R. F. Engle and D. B. Nelson (1994). ARCH models. In: R. F. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4, North-Holland, Amsterdam.
Bollerslev, T. and E. Ghysels (1994). Periodic autoregressive conditional heteroskedasticity. Working Paper No. 178, Department of Finance, Kellogg School of Management, Northwestern University.
Bonham, C. and R. Cohen (1995). Testing the rationality of price forecasts: Comment. Amer. Econom. Rev. 85, 284-289.
Bradley, J. V. (1968). Distribution-free statistical tests. Prentice Hall, Englewood Cliffs, NJ.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review 75, 1-3.
Brown, B. W. and S. Maital (1981). What do economists know? An empirical study of experts' expectations. Econometrica 49, 491-504.
Campbell, B. and J.-M. Dufour (1991). Over-rejections in rational expectations models: A nonparametric approach to the Mankiw-Shapiro problem. Econom. Lett. 35, 285-290.
Campbell, B. and J.-M. Dufour (1995). Exact nonparametric orthogonality and random walk tests. Rev. Econom. Statist. 77, 1-16.
Campbell, B. and E. Ghysels (1995). Federal budget projections: A nonparametric assessment of bias and efficiency. Rev. Econom. Statist. 77, 17-31.
Campbell, J. Y. and N. G. Mankiw (1987). Are output fluctuations transitory? Quart. J. Econom. 102, 857-880.
Chong, Y. Y. and D. F. Hendry (1986). Econometric evaluation of linear macroeconomic models. Rev. Econom. Stud. 53, 671-690.
Christoffersen, P. F. (1995). Predicting uncertainty in the foreign exchange markets. Manuscript, Department of Economics, University of Pennsylvania.
Christoffersen, P. F. and F. X. Diebold (1994). Optimal prediction under asymmetric loss. Technical Working Paper No. 167, National Bureau of Economic Research, Cambridge, MA.
Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. Internat. J. Forecasting 5, 559-581.
Clemen, R. T. and R. L. Winkler (1986). Combining economic forecasts. J. Business Econom. Statist. 4, 39-46.
Clements, M. P. and D. F. Hendry (1993). On the limitations of comparing mean squared forecast errors. J. Forecasting 12, 617-638.
Cochrane, J. H. (1988). How big is the random walk in GNP? J. Politic. Econom. 96, 893-920.
Cooper, D. M. and C. R. Nelson (1975). The ex-ante prediction performance of the St. Louis and F.R.B.-M.I.T.-Penn econometric models and some results on composite predictors. J. Money, Credit and Banking 7, 1-32.
Coulson, N. E. and R. P. Robins (1993). Forecast combination in a dynamic setting. J. Forecasting 12, 63-67.
Cumby, R. E. and J. Huizinga (1992). Testing the autocorrelation structure of disturbances in ordinary least squares and instrumental variables regressions. Econometrica 60, 185-195.
Cumby, R. E. and D. M. Modest (1987). Testing for market timing ability: A framework for forecast evaluation. J. Financ. Econom. 19, 169-189.
David, F. N. (1947). A power function for tests of randomness in a sequence of alternatives. Biometrika 34, 335-339.
Deutsch, M., C. W. J. Granger and T. Teräsvirta (1994). The combination of forecasts using changing weights. Internat. J. Forecasting 10, 47-57.
Diebold, F. X. (1988). Serial correlation and the combination of forecasts. J. Business Econom. Statist. 6, 105-111.
Diebold, F. X. (1993). On the limitations of comparing mean square forecast errors: Comment. J. Forecasting 12, 641-642.
Diebold, F. X. and P. Lindner (1995). Fractional integration and interval prediction. Econom. Lett., to appear.
Diebold, F. X. and R. Mariano (1995). Comparing predictive accuracy. J. Business Econom. Statist. 13, 253-264.
Diebold, F. X., L. Ohanian and J. Berkowitz (1995). Dynamic equilibrium economies: A framework for comparing models and data. Technical Working Paper No. 174, National Bureau of Economic Research, Cambridge, MA.
Diebold, F. X. and P. Pauly (1987). Structural change and the combination of forecasts. J. Forecasting 6, 21-40.
Diebold, F. X. and P. Pauly (1990). The use of prior information in forecast combination. Internat. J. Forecasting 6, 503-508.
Diebold, F. X. and G. D. Rudebusch (1989). Scoring the leading indicators. J. Business 62, 369-391.
Dufour, J.-M. (1981). Rank tests for serial dependence. J. Time Ser. Anal. 2, 117-128.
Engle, R. F., C.-H. Hong, A. Kane and J. Noh (1993). Arbitrage valuation of variance forecasts with simulated options. In: D. Chance and R. Tripp, eds., Advances in Futures and Options Research, JAI Press, Greenwich, CT.
Engle, R. F. and S. Kozicki (1993). Testing for common features. J. Business Econom. Statist. 11, 369-395.
Fair, R. C. and R. J. Shiller (1989). The informational content of ex-ante forecasts. Rev. Econom. Statist. 71, 325-331.
Fair, R. C. and R. J. Shiller (1990). Comparing information in forecasts from econometric models. Amer. Econom. Rev. 80, 375-389.
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. J. Finance 25, 383-417.
Fama, E. F. (1975). Short-term interest rates as predictors of inflation. Amer. Econom. Rev. 65, 269-282.
Fama, E. F. (1991). Efficient markets II. J. Finance 46, 1575-1617.
Fama, E. F. and K. R. French (1988). Permanent and temporary components of stock prices. J. Politic. Econom. 96, 246-273.
Granger, C. W. J. and P. Newbold (1973). Some comments on the evaluation of economic forecasts. Appl. Econom. 5, 35-47.
Granger, C. W. J. and P. Newbold (1976). Forecasting transformed series. J. Roy. Statist. Soc. B 38, 189-203.
Granger, C. W. J. and P. Newbold (1986). Forecasting economic time series. 2nd ed., Academic Press, San Diego.
Granger, C. W. J. and R. Ramanathan (1984). Improved methods of forecasting. J. Forecasting 3, 197-204.
Hansen, L. P. and R. J. Hodrick (1980). Forward exchange rates as optimal predictors of future spot rates: An econometric investigation. J. Politic. Econom. 88, 829-853.
Hendry, D. F. and G. E. Mizon (1978). Serial correlation as a convenient simplification, not a nuisance: A comment on a study of the demand for money by the Bank of England. Econom. J. 88, 549-563.
Henriksson, R. D. and R. C. Merton (1981). On market timing and investment performance II: Statistical procedures for evaluating forecast skills. J. Business 54, 513-533.
Keane, M. P. and D. E. Runkle (1990). Testing the rationality of price forecasts: New evidence from panel data. Amer. Econom. Rev. 80, 714-735.
Leitch, G. and J. E. Tanner (1991). Economic forecast evaluation: Profits versus the conventional error measures. Amer. Econom. Rev. 81, 580-590.
Leitch, G. and J. E. Tanner (1995). Professional economic forecasts: Are they worth their costs? J. Forecasting 14, 143-157.
LeRoy, S. F. and R. D. Porter (1981). The present value relation: Tests based on implied variance bounds. Econometrica 49, 555-574.
Lopez, J. A. (1995). Evaluating the predictive accuracy of volatility models. Manuscript, Research and Market Analysis Group, Federal Reserve Bank of New York.
Mark, N. C. (1995). Exchange rates and fundamentals: Evidence on long-horizon predictability. Amer. Econom. Rev. 85, 201-218.
McCulloch, R. and P. E. Rossi (1990). Posterior, predictive and utility-based approaches to testing the arbitrage pricing theory. J. Financ. Econom. 28, 7-38.
Meese, R. A. and K. Rogoff (1988). Was it real? The exchange rate interest differential relation over the modern floating-rate period. J. Finance 43, 933-948.
Merton, R. C. (1981). On market timing and investment performance I: An equilibrium theory of value for market forecasts. J. Business 54, 513-533.
Mincer, J. and V. Zarnowitz (1969). The evaluation of economic forecasts. In: J. Mincer, ed., Economic forecasts and expectations, National Bureau of Economic Research, New York.
Murphy, A. H. (1973). A new vector partition of the probability score. J. Appl. Meteor. 12, 595-600.
Murphy, A. H. (1974). A sample skill score for probability forecasts. Monthly Weather Review 102, 48-55.
Murphy, A. H. and R. L. Winkler (1987). A general framework for forecast evaluation. Monthly Weather Review 115, 1330-1338.
Murphy, A. H. and R. L. Winkler (1992). Diagnostic verification of probability forecasts. Internat. J. Forecasting 7, 435-455.
Nelson, C. R. (1972). The prediction performance of the F.R.B.-M.I.T.-Penn model of the U.S. economy. Amer. Econom. Rev. 62, 902-917.
Nelson, C. R. and G. W. Schwert (1977). Short term interest rates as predictors of inflation: On testing the hypothesis that the real rate of interest is constant. Amer. Econom. Rev. 67, 478-486.
Newbold, P. and C. W. J. Granger (1974). Experience with forecasting univariate time series and the combination of forecasts. J. Roy. Statist. Soc. A 137, 131-146.
Pagan, A. R. and G. W. Schwert (1990). Alternative models for conditional stock volatility. J. Econometrics 45, 267-290.
Pesaran, M. H. (1974). On the general problem of model selection. Rev. Econom. Stud. 41, 153-171.
Pesaran, M. H. and A. Timmermann (1992). A simple nonparametric test of predictive performance. J. Business Econom. Statist. 10, 461-465.
Ramsey, J. B. (1969). Tests for specification errors in classical least-squares regression analysis. J. Roy. Statist. Soc. B 2, 350-371.
Satchell, S. and A. Timmermann (1992). An assessment of the economic value of nonlinear foreign exchange rate forecasts. Financial Economics Discussion Paper FE6/92, Birkbeck College, Cambridge University.
Schnader, M. H. and H. O. Stekler (1990). Evaluating predictions of change. J. Business 63, 99-107.
Seillier-Moiseiwitsch, F. and A. P. Dawid (1993). On testing the validity of sequential probability forecasts. J. Amer. Statist. Assoc. 88, 355-359.
Shiller, R. J. (1979). The volatility of long term interest rates and expectations models of the term structure. J. Politic. Econom. 87, 1190-1219.
Stekler, H. O. (1987). Who forecasts better? J. Business Econom. Statist. 5, 155-158.
Stekler, H. O. (1994). Are economic forecasts valuable? J. Forecasting 13, 495-505.
Theil, H. (1961). Economic Forecasts and Policy. North-Holland, Amsterdam.
Weiss, A. A. (1995). Estimating time series models using the relevant cost function. Manuscript, Department of Economics, University of Southern California.
Weiss, A. A. and A. P. Andersen (1984). Estimating forecasting models using the relevant forecast evaluation criterion. J. Roy. Statist. Soc. A 137, 484-487.
West, K. D. (1994). Asymptotic inference about predictive ability. Manuscript, Department of Economics, University of Wisconsin.
West, K. D., H. J. Edison and D. Cho (1993). A utility-based comparison of some models of exchange rate volatility. J. Internat. Econom. 35, 23-45.
Winkler, R. L. and S. Makridakis (1983). The combination of forecasts. J. Roy. Statist. Soc. A 146, 150-157.
G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 14 1996 Elsevier Science B.V. All rights reserved.
Predictable Components in Stock Returns
Gautam Kaul
1. Introduction
Predictability of stock returns has always fascinated practitioners (for obvious reasons) and academics (for not so obvious reasons). In this paper, I attempt to review empirical methods used in the financial economics literature to uncover predictable components in stock returns. Given the amazing growth in the recent literature on predictability, I cannot conceivably review all the papers in this area. I will therefore concentrate primarily on the empirical techniques introduced and/or adapted to gauge the extent of predictability in stock returns in the recent literature. Also, consistent with the emphasis in the empirical literature, I will concentrate on the predictability of the returns of large portfolios of stocks, as opposed to predictability in individual-security returns. With the exception of some studies that uncover interesting empirical regularities, I will not review papers that are primarily "results oriented." Also, this review concentrates on the commonly used statistical procedures implemented in the recent literature to determine the importance of predictable components in stock returns.¹ Given that predictability of stock returns is inextricably linked with the concept of "market efficiency," I will discuss some of the issues related to the behavior of asset prices in an informationally efficient market [see Fama (1970, 1991) for outstanding reviews of market efficiency]. To keep the scope of this review manageable, I do not review the rich and growing literature on market microstructure and its implications for return predictability. Finally, even for the papers reviewed in this article, I will concentrate virtually exclusively on the empirical methodology and minimize the discussion of the empirical results. To the extent that stylized facts themselves are inextricably linked to subsequent methodological developments, however, some discussion of the empirical evidence is imperative.

* I really appreciate the time and effort spent by John Campbell, Jennifer Conrad, Wayne Ferson, Tom George, Campbell Harvey, David Heike, David Hirshleifer, Bob Hodrick, Ravi Jagannathan, Charles Jones, Bob Korajczyk, G.S. Maddala, M. Nimalendran, Richard Roll, Nejat Seyhun, and Robert Shiller in providing valuable feedback on earlier drafts of this paper. Partial funding for the project is provided by the School of Business Administration, University of Michigan, Ann Arbor, MI.

¹ For example, I do not review frequency-domain-based procedures [see, for example, Granger and Morgenstern (1963)] or the relatively infrequently used tests of dependence in stock prices based on the rescaled range [see Goetzmann (1993), Lo (1991), and Mandelbrot (1972)]. Also, more recent applications of genetic algorithms to discover profitable trading rules [see Allen and Karjalainen (1993)] are not reviewed in this paper.
2. Why predictability?
Before discussing the economic importance of predictability and the recent advances made in empirical methodology, I need to explicitly define predictability. Let the return on a stock, $R_t$, follow a stationary and ergodic stochastic process with finite expectation $E(R_t) = \mu$ and finite autocovariances $E[(R_t - \mu)(R_{t-k} - \mu)] = \gamma_k$. Let $\Omega_{t-1}$ denote the information set that exists at time $t - 1$, of which $X_{t-1}$ (an $M \times 1$ vector) is the subset of information that is available to the econometrician. We then define predictability as specific restrictions on the parameters of the linear projection of $R_t$ on $X_{t-1}$:
$$R_t = \mu + \beta X_{t-1} + \varepsilon_t, \tag{1}$$
where $\beta_{(1 \times M)} \neq 0_{(1 \times M)}$. Therefore, for the purposes of this paper, predictability is defined strictly in terms of the predictability of returns. I do not review the rich and growing literature on the predictability of the second moment of asset returns [see Bollerslev, Chou, and Kroner (1992)]. Therefore, for convenience, and unless explicitly stated otherwise, I assume that the errors, $\varepsilon_t$, are conditionally normal, with mean zero and constant variance $\sigma^2$. From a conceptual standpoint we can, in fact, assume that returns follow a random walk process because we are not directly interested in predictability in the second (or higher) moments of returns. Consequently, the otherwise important difference between martingales and random walks becomes irrelevant [see Fama (1970)]. Clearly, statistical inferences based on estimates of (1) will depend on any departures from normality, homoskedasticity and/or autocorrelation in $\varepsilon_t$'s. Given that the use of statistical procedures to obtain heteroskedasticity and/or autocorrelation consistent standard errors has been widespread in economics and finance for over a decade, I will not discuss these procedures. The interested reader is referred to Hansen (1982), Hansen and Hodrick (1980), Newey and West (1987), and White (1980).²
² The assumption of homoskedasticity unfortunately precludes this review from covering the obviously important literature on the relation between conditional volatility and expected returns [see, for example, French, Schwert, and Stambaugh (1987) and Stambaugh (1993)]. It is also important to realize that the assumption of normality for stock returns is made for convenience so that the coverage of this review is limited to a finite set of papers. Nevertheless, to the extent that normality may be critical to some of the results reviewed in this paper, the readers are cautioned against generalizing these results.
Having defined predictability in statistical terms, it appears natural to wonder why it has received such overwhelming attention since the advent of trading in financial securities. Clearly, as so eloquently emphasized by Roll (1988) in his American Finance Association Presidential Address, the ability to predict important phenomena is the hallmark of any mature science.³ However, predictability takes on several different connotations, for practitioners, individual investors, and academics, when it comes to stock markets. Practitioners and individual investors have understandably been excited about predictability in asset returns because, more often than not, they equate predictability with "beating the market." Though some academics exhibit similar unabashed excitement over discovering predictability, the academic profession's preoccupation with predictability is also based on more complex implications of return-predictability. Consider the model for speculative prices presented by Samuelson (1965). Suppose that the world is populated by risk neutral agents, all of whom have common and constant time preferences and common beliefs about future states of nature. In this world, stock prices will follow submartingales and, consequently, stock returns are a fair game [see also Mandelbrot (1966)]. Specifically, let $p_t$, the logarithm of stock price, follow a submartingale, that is,
$$E(p_t \mid \Omega_{t-1}) = p_{t-1} + r, \tag{2}$$
where $r > 0$ is the exogenously given risk-free rate. Stock returns, $R_t$, will therefore be given by a fair game,⁴ or,
$$E(R_t \mid \Omega_{t-1}) = r. \tag{3}$$
In a risk-neutral world, therefore, it is clear that any predictability in stock returns as defined in (1) (that is, $\beta \neq 0$), would have very strong implications for financial economics: any predictability in stock returns would necessarily imply that the stock market is informationally inefficient. An important assumption for this result to hold is that the risk-free rate is exogenously determined and does not vary through time. In fact, Roll (1968) shows that expected returns on Treasury bills would vary if there is any time-variation in expected inflation. This is probably the first recognition in the financial economics literature of the fact that asset prices may be predictable even in efficient stock markets, without the predictability resulting from changes in risk premia (see discussion below). Of course, market efficiency could be defined on a finer grid [see, for example, Roberts (1959) and Fama (1970)] depending on the type of information used at time $t - 1$ to predict future returns. The stock market is weak-form, semistrong-form, or strong-form efficient if stock returns are unpredictable using past stock prices, past publicly available information, and past private information, respectively.

Until the early seventies, the critical role of risk neutrality in determining the martingale behavior of stock prices was not evident. Consequently, it is not surprising that predictability became synonymous with market inefficiency in the financial economics literature. In fact, the academic literature reinforced the "real world" belief that predictability of stock returns was obvious evidence of mispricing of financial assets. This occurred in spite of the fact that, as early as 1970, Fama (1970) provided a very clear and precise discussion of the critical role of expected returns in determining the time-series properties of asset returns, and the unavoidable link between the basic assumption about expected returns and tests of market efficiency. By the late seventies, however, the work of LeRoy (1973) and Lucas (1978) had demonstrated the critical role played by risk preferences in the martingale behavior of stock prices in efficient markets [see also Hirshleifer (1975)]. And today most academics realize that predictability is not immediately synonymous with market inefficiencies because in a risk-averse world rational time-varying risk premia could lead to return-predictability. Nevertheless, one cannot a priori rule out the possibility that predictability in stock returns arises due to the irrational "animal spirits" of agents. Today, therefore, the existence of predictability has complex implications for financial economics.

³ Roll's main focus is of course different from the focus of this paper. While we are interested in the predictability of future returns, he investigates our ability to explain movements in current stock returns using both past and current information.

⁴ It is important to note that the stock price $P_t$ itself will not generally be a martingale in a risk-neutral world. Technically $P_t$ should be understood as the "price" inclusive of reinvested dividends [see LeRoy (1989)]. Also, in this paper, the martingale behavior of stock prices is assumed to be an implication of risk-neutrality. It is important to note however that (a) risk neutrality does not ensure that stock prices will follow martingales [see Lucas (1978)] and (b) stock prices can follow martingales even if agents are risk-averse [see Ohlson (1977)].
Given the history of the economic implications of return-predictability, the past two decades have witnessed a fast-flowing stream of research on (a) whether stock returns are predictable, and (b) whether predictability reflects rational time-varying risk premia or irrational mispricing of securities [see Fama (1991)]. Fortunately, my task is limited to a review of the empirical methodology used to address issue (a) above; that is, to describe and evaluate the empirical techniques used to uncover any predictability in stock returns. One final thought on the importance of return-predictability for the financial economics literature. There has been a fascination with testing capital asset pricing model(s), which is understandable because without a theoretically sound and empirically verifiable model (or models) of relative expected returns of fundamental financial securities such as common stock, the foundations of modern finance would be shaky. Return-predictability plays a crucial part in at least a subset of these tests; specifically, without reliable predictability of stock returns, the important distinction between unconditional and conditional tests becomes irrelevant. [The distinction between conditional and unconditional tests of asset pricing models is well elucidated by Gibbons and Ferson (1985).]
3.1.1. The regression approach: Short-term

Let $X_{t-1}$ in (1) be limited to one variable: the past return on the stock, $R_{t-1}$. We can then rewrite (1) as:
$$R_t = \mu + \phi_1 R_{t-1} + \varepsilon_t, \tag{4}$$
where
$$\phi_1 = \frac{\mathrm{Cov}(R_t, R_{t-1})}{\mathrm{Var}(R_t)} = \frac{\gamma_1}{\gamma_0}.$$
We can similarly regress $R_t$ on returns from any past period, $t - k$, to gauge predictability, with the corresponding autocorrelation coefficient being denoted by $\phi_k$. The statistical significance of any predictability can be gauged, for example, by conducting a hypothesis test that any particular coefficient $\phi_j = 0$. Such a test can be implemented using the asymptotic distribution of the vector of $j$th-order autocorrelations [see Bartlett (1946)]
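$$\sqrt{T}\,\hat{\phi} = \sqrt{T}\,[\hat{\phi}_1, \ldots, \hat{\phi}_j]' \overset{a}{\sim} N(0, I), \tag{5a}$$
where
$$\hat{\phi}_k = \frac{\sum_{t=k+1}^{T} (R_t - \hat{\mu})(R_{t-k} - \hat{\mu})}{\sum_{t=1}^{T} (R_t - \hat{\mu})^2}, \tag{5b}$$
$\hat{\mu}$ is the sample mean of returns, and $T$ = total number of time-series observations in the sample. A joint test of the hypothesis $\phi_k = 0 \;\forall\; k$ can also be conducted under the null hypothesis of no predictability using the Q-statistic introduced by Box and Pierce (1970), where
$$Q = T \sum_{k=1}^{K} \hat{\phi}_k^2 \overset{a}{\sim} \chi^2_K \tag{6}$$
and $K$ is the number of autocorrelations tested.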
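The autocorrelation-based tests in (5a) through (6) are straightforward to compute. The following is a minimal sketch, not the procedure of any particular study: the function names are hypothetical, the sample autocorrelation uses the usual estimator with the full-sample mean and variance, and the individual test relies on the asymptotic result that $\sqrt{T}\hat{\phi}_k$ is standard normal under the null.

```python
from math import erfc, sqrt


def sample_autocorrelation(returns, k):
    """k-th order sample autocorrelation using the full-sample mean and variance."""
    T = len(returns)
    mean = sum(returns) / T
    denom = sum((x - mean) ** 2 for x in returns)
    num = sum((returns[t] - mean) * (returns[t - k] - mean) for t in range(k, T))
    return num / denom


def autocorrelation_z_test(returns, k):
    """z = sqrt(T) * phi_k and its two-sided asymptotic N(0, 1) p-value."""
    z = sqrt(len(returns)) * sample_autocorrelation(returns, k)
    return z, erfc(abs(z) / sqrt(2))


def box_pierce_q(returns, max_lag):
    """Q = T * sum of squared autocorrelations; chi-square(max_lag) under the null."""
    T = len(returns)
    return T * sum(sample_autocorrelation(returns, k) ** 2
                   for k in range(1, max_lag + 1))


# Illustrative, strongly negatively autocorrelated series (not real return data).
returns = [1.0, -1.0] * 50
print(sample_autocorrelation(returns, 1))
print(autocorrelation_z_test(returns, 1))
print(box_pierce_q(returns, 1))
```

The Q-statistic is compared with the critical value of a chi-square distribution whose degrees of freedom equal the number of lags included.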
Given the early preoccupation with random walks, and Working's (1934) claim that random walks characteristically develop patterns similar to those observed in stock prices, several of the earlier studies concentrated on autocorrelation-based tests of randomness in stock prices [see Kendall (1953) and Fama (1965, 1970)]. These early empirical studies concluded that stock prices either follow random walks or that the observed autocorrelations in returns, though occasionally statistically significant, are economically trivial.⁵ The economic implications of any small autocorrelations in returns were also suspect once Working (1960) and Fisher (1966) showed that temporal and/or cross-sectional aggregation of stock prices could induce spurious predictability in returns, both at the individual-security and portfolio levels. More recently, however, the short-term autocorrelation-based tests have taken different forms and have been motivated by different factors. Given that risk-aversion could lead to time-varying risk-premia in stock returns, Conrad and Kaul (1988) hypothesize a parsimonious AR(1) model for conditional expected returns and test whether realized returns follow the implied ARMA representation. Specifically, let
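$$R_t = E_{t-1}(R_t) + \varepsilon_t \tag{7a}$$
and
$$E_{t-1}(R_t) = \mu + \psi_1 E_{t-2}(R_{t-1}) + u_{t-1}, \tag{7b}$$
where $E_{t-1}(R_t)$ = conditional expectation of $R_t$ at time $t - 1$, $\varepsilon_t$ = unexpected stock return, $u_{t-1}$ = shock to expected returns, and $|\psi_1| < 1$. Given the model in (7a) and (7b), realized stock returns will follow an ARMA(1,1) model of the form:
$$R_t = \mu + \psi_1 R_{t-1} + a_t + \theta_1 a_{t-1}, \tag{8}$$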
where $|\theta_1| \le 1$. Note that the positive autocovariance in expected stock returns [see (7b)] will also induce positive autocovariance in realized returns. A positive shock to future expected returns, however, causes a contemporaneous capital loss which, in turn, leads to negative autocovariance in realized returns. Specifically, in (8) the autoregressive coefficient denotes the positive persistence parameter $\psi_1$, but the moving average parameter, $\theta_1$, is negative [see Conrad and Kaul (1988) and Campbell (1991)]. Some researchers therefore argue that it may be very difficult to uncover any predictability in stock returns due to the confounding effects of changes in expected returns on stock prices. Nevertheless, using weekly returns Conrad and Kaul (1988) find that: (a) estimates of the autoregressive coefficient, $\hat{\psi}_1$, are positive and range between 0.40 and 0.60, and (b) more importantly, predictability in stock returns can explain up to 25 percent of the variation in the returns to a portfolio of small NYSE/AMEX firms.

Given the rapidly mean-reverting component in weekly stock returns (recall the $\hat{\psi}_1$'s range between 0.40 and 0.60), Conrad and Kaul (1989) show that predictability of monthly returns can be substantial when decreasing weights are given to past intramonth information. This occurs because the most recent intramonth information is most informative about next month's expected returns; using monthly data to predict monthly returns effectively ignores intramonth information by assigning equal weights to all past intramonth information. Specifically, define monthly continuously compounded stock returns $R_t^m$ as
$$R_t^m = \sum_{k=0}^{3} R_{t-k}^w, \tag{9}$$
where $R_{t-k}^w$ = continuously compounded stock return in week $t - k$. From (7b) it follows that the monthly expected stock return for the current month is given by
$$E_{t-4}(R_t^m) = E_{t-4}\Big[\sum_{k=0}^{3} R_{t-k}^w\Big] = \pi_1 R_{t-4}^w + \pi_2 R_{t-5}^w + \cdots,$$
where $\pi_i = (-\theta_1)^{i-1}(\psi_1 + \theta_1)(1 + \psi_1 + \psi_1^2 + \psi_1^3) \;\forall\; i = 1, 2, 3, \ldots$. Therefore, the typical weights for past intramonth data would decline dramatically if we were interested in predicting monthly stock returns. Using geometrically declining weights on past weekly and daily returns, Conrad and Kaul (1989) show that up to 45 percent of the monthly returns of a portfolio of small firms can be explained based on ex ante information. On the other hand, studies using past monthly returns typically explain only 3 to 5 percent of variation in realized returns since they implicitly weigh all past intramonth information equally.

Although recent autoregression-based (and variance-ratio-based, see Section 3.3) tests conducted on short-term returns reveal statistically and economically significant return predictability, a caveat is in order. Most of the short-run studies use weekly portfolio returns, and at least some of the observed predictability may be spuriously induced by market microstructure effects. Specifically, nonsynchronous trading could lead to nontrivial positive autocovariance in portfolio returns [see, for example, Boudoukh, Richardson, and Whitelaw (1994), Fisher (1966), Lo and MacKinlay (1990b), Muthuswamy (1988) and Scholes and Williams (1977)].

⁵ Granger and Morgenstern (1963) used spectral analysis to reach similar conclusions.
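To see how quickly the weights $\pi_i$ on past weekly returns decline, they can be evaluated for given ARMA(1,1) parameters. The sketch below is purely illustrative: the function name `monthly_forecast_weights` is hypothetical, and the parameter values $\psi_1 = 0.5$ and $\theta_1 = -0.3$ are assumptions chosen only to respect the signs discussed in the text, not estimates from any study.

```python
def monthly_forecast_weights(psi1, theta1, n_lags, weeks_per_month=4):
    """Weights pi_i = (-theta1)^(i-1) * (psi1 + theta1) * (1 + psi1 + ... + psi1^3).

    pi_i multiplies the weekly return lagged i weeks before the month begins
    when forecasting the sum of the next `weeks_per_month` weekly returns.
    """
    horizon_sum = sum(psi1 ** j for j in range(weeks_per_month))
    return [(-theta1) ** (i - 1) * (psi1 + theta1) * horizon_sum
            for i in range(1, n_lags + 1)]


# Assumed, illustrative parameter values (positive AR, negative MA coefficient).
weights = monthly_forecast_weights(psi1=0.5, theta1=-0.3, n_lags=6)
for lag, w in enumerate(weights, start=1):
    print(f"lag {lag}: weight {w:.4f}")
```

With a negative moving average parameter, successive weights shrink by the constant factor $-\theta_1$, so the most recent weekly return receives by far the largest weight.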
Alternatively, it was claimed that the lack of reliable predictability of returns implied that stock prices are close to their intrinsic value. There are however two problems with this conclusion. First, recent research (see above) has revealed nontrivial predictability of short-horizon returns [Conrad and Kaul (1988, 1989) and Lo and MacKinlay (1988)]. Second, as shown by Campbell (1991), small but very persistent variation in expected returns can have a dramatic impact on a security's stock price. In fact, Shiller (1984) and Summers (1986) argue that stock prices contain an important irrational component which takes long swings away from the fundamental value. This slowly mean-reverting component, however, cannot be detected in short-term stock returns. Stambaugh (1986a), in a discussion of Summers (1986), argues that although these long swings away from intrinsic value will not be detectable in short-term data, long-term returns should be significantly negatively autocorrelated. Fama and French (1988) formalize this basic intuition by proposing a model for asset prices which now forms the alternative hypothesis for virtually all (long-run) tests of market efficiency. Let the logarithm of stock price, $p_t$, contain a random walk component, $q_t$, and a slowly decaying stationary component, $z_t$. Specifically,
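$$p_t = q_t + z_t , \tag{10}$$
where
$$q_t = \mu + q_{t-1} + \eta_t , \qquad \eta_t \sim \mathrm{iid}(0, \sigma_\eta^2) ,$$
$$z_t = \phi_1 z_{t-1} + \varepsilon_t , \qquad \varepsilon_t \sim \mathrm{iid}(0, \sigma_\varepsilon^2) ,$$
and $|\phi_1| < 1$ and $E(\eta_t \varepsilon_t) = 0$. The two components of stock prices, $q_t$ and $z_t$, are also labeled the permanent and temporary components. Given the model for stock prices in (10), stock returns can be written as: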
$$R_t = p_t - p_{t-1} = [q_t - q_{t-1}] + [z_t - z_{t-1}] = \mu + \eta_t + \varepsilon_t + (\phi_1 - 1) \sum_{i=1}^{\infty} \phi_1^{\,i-1} \varepsilon_{t-i} . \tag{11}$$
Fama and French (1988) suggest using the multiperiod autocorrelation coefficient to detect predictability by regressing a $k$-period return on its own value lagged one period (of length $k$). Specifically,
$$\sum_{i=1}^{k} R_{t+i} = \alpha(k) + \beta(k) \sum_{i=1}^{k} R_{t-i+1} + u_t(k) . \tag{12}$$
From (12) it is clear that $\beta(k)$ measures the multiperiod autocorrelation, and the ordinary least squares estimator of this parameter is given by
$$\hat\beta(k) = \frac{\mathrm{Cov}\left[\sum_{i=1}^{k} R_{t+i},\ \sum_{i=1}^{k} R_{t-i+1}\right]}{\mathrm{Var}\left[\sum_{i=1}^{k} R_{t-i+1}\right]} . \tag{13a}$$
Some algebraic manipulation shows that the probability limit of $\hat\beta(k)$ is given by [see, for example, Jegadeesh (1991)]
$$\mathrm{plim}[\hat\beta(k)] = \frac{-(1 - \phi_1^k)^2}{2\gamma k (1 - \phi_1) + 2(1 - \phi_1^k)} , \tag{13b}$$
where $\gamma = (1 + \phi_1)\sigma_\eta^2 / 2\sigma_\varepsilon^2$ = ratio of the unconditional variances of the returns attributable to the permanent versus temporary components, and the asymptotic variance of $\hat\beta(k)$ under the null hypothesis is given by
$$T\,\mathrm{Var}[\hat\beta(k)] = \frac{2k^2 + 1}{3k} . \tag{14}$$
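As a rough numerical illustration, the probability limit in (13b) can be tabulated across horizons to see how the slope behaves as $k$ grows. The sketch below assumes the form $-(1-\phi_1^k)^2 / [2\gamma k(1-\phi_1) + 2(1-\phi_1^k)]$ for (13b); the function names and the parameter values ($\phi_1 = 0.98$, $\gamma = 0.5$) are hypothetical choices made only for this example.

```python
def plim_beta_k(k, phi, gamma):
    """Probability limit of the k-period autoregression slope, as in (13b).

    phi   : persistence of the temporary component (|phi| < 1)
    gamma : variance share ratio of permanent vs. temporary components
    """
    numerator = -(1.0 - phi**k) ** 2
    denominator = 2.0 * gamma * k * (1.0 - phi) + 2.0 * (1.0 - phi**k)
    return numerator / denominator


def asy_var_beta_k(k):
    """T * Var[beta_hat(k)] under the null of no predictability, as in (14)."""
    return (2.0 * k**2 + 1.0) / (3.0 * k)


if __name__ == "__main__":
    phi, gamma = 0.98, 0.5  # hypothetical parameter values
    for k in (1, 12, 36, 60, 120, 600, 6000):
        print(k, round(plim_beta_k(k, phi, gamma), 4), round(asy_var_beta_k(k), 2))
```

With these illustrative parameters the slope is near zero at $k = 1$, clearly negative at intermediate horizons, and drifts back toward zero for very large $k$; setting `gamma = 0` removes the permanent component and the limit approaches $-1/2$.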
It is clear from (13) that the temporary component is entirely responsible for any predictability in stock returns [that is, if $\phi_1 = 1$, $\mathrm{plim}[\hat\beta(k)] = 0$]. More importantly, with $\phi_1$ close to unity, it follows that short-term returns [that is, small values of $k$ in (12)] will exhibit small autocorrelations, while the negative autocorrelation will be large at long horizons (that is, for large $k$). Specifically, Fama and French (1988) argue that the negative autocorrelations in returns may exhibit a U-shaped pattern: close to zero at very short and very long horizons, but significantly negative at reasonably long horizons. As the cumulation interval for returns $k \to \infty$, $\mathrm{plim}[\hat\beta(k)] \to -1/2$ due to the temporary component, but the variance of the permanent component of a $k$-period return will eventually dominate the variance of the temporary component since it increases linearly with $k$ (that is, $k\gamma \to \infty$ for very large $k$). This, in turn, will push $\mathrm{plim}[\hat\beta(k)]$ up toward zero for large $k$.
Jegadeesh (1991) provides an alternative estimator of long-term return predictability [see also Hodrick (1992)]. He argues that, if stock prices follow the process in (10), power considerations (see Section 4) dictate that a single-period return should be regressed on a multiperiod return. Specifically,
$$R_t = \alpha(1,k) + \beta(1,k) \sum_{i=1}^{k} R_{t-i} + u_t . \tag{15}$$
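The ordinary least squares estimator of $\beta(1,k)$ is
$$\hat\beta(1,k) = \frac{\mathrm{Cov}\left[R_t,\ \sum_{i=1}^{k} R_{t-i}\right]}{\mathrm{Var}\left[\sum_{i=1}^{k} R_{t-i}\right]} , \tag{16a}$$
with probability limit
$$\mathrm{plim}[\hat\beta(1,k)] = \frac{-(1 - \phi_1)(1 - \phi_1^k)}{2\gamma k (1 - \phi_1) + 2(1 - \phi_1^k)} , \tag{16b}$$
and the asymptotic variance of $\hat\beta(1,k)$ under the null hypothesis of no predictability is given by
$$T\,\mathrm{Var}[\hat\beta(1,k)] = 1/k . \tag{17}$$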
Comparing (16) with (13), we see that increasing the measurement interval of the dependent variable leads to a larger slope coefficient of the regression of long-term returns on lagged long-term returns if the alternative hypothesis is the model shown in equation (10). However, increasing the measurement interval of the dependent variable will also increase the standard error of the estimate [compare (17) with (14)]. Using Geweke's (1981) approximate-slope procedure to gauge the relative asymptotic power of $\hat\beta(k)$ versus $\hat\beta(1,k)$, Jegadeesh (1991) shows that the latter effect always dominates. Consequently, for reasonable parameter values, the optimal choice of $k$ for the dependent variable is always unity. The choice of the measurement interval for the independent variable, however, depends on plausible parameter specifications for the alternative hypothesis. Not surprisingly, for $\phi_1$ close to one, long measurement intervals are required to uncover predictability, while shorter measurement intervals are recommended if the share of the permanent component in the variance of returns, $\gamma$, is large. [A more detailed discussion of the power issues is presented in Section 4.]
The basic intuition for the variance-ratio statistic follows directly from the random walk model for asset prices. If stock prices follow random walks, then the variance of a $k$-period return should be $k$ times the variance of a single-period return. In other words, the variances of returns should increase in proportion to the measurement interval, $k$. The $k$-period variance ratio is defined as:
$$\hat V(k) = \frac{\mathrm{Var}\left(\sum_{i=1}^{k} R_{t+i}\right)}{k\,\mathrm{Var}(R_t)} - 1 , \tag{18}$$
where, for convenience, the factor $k$ is used in the denominator of the variance ratio and unity is subtracted from the ratio. The intuitively appealing aspect of the variance-ratio statistic, $\hat V(k)$, is that it will be equal to zero under the null hypothesis of no predictability. Moreover, as shown below, $\hat V(k) \gtrless 0$ depending on whether single-period returns are positively (negatively) autocorrelated (or equivalently, whether there is mean reversion in security returns or security prices). Under the null hypothesis of no predictability, the asymptotic variance of $\hat V(k)$ is given by [see Lo and MacKinlay (1988) and Richardson and Smith (1991)]:
$$T\,\mathrm{Var}[\hat V(k)] = \frac{2(2k - 1)(k - 1)}{3k} . \tag{19}$$
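The variance-ratio statistic in (18) and its null variance in (19) are simple enough to compute directly. The following is a minimal sketch, assuming overlapping $k$-period sums and plain (not bias-corrected) sample variances; the function names are hypothetical, and finite-sample corrections of the kind used in the literature are deliberately left out.

```python
import math
import random


def sample_variance(xs):
    """Population-style sample variance (divides by n); no bias correction."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def variance_ratio(returns, k):
    """V(k) = Var(k-period return) / (k * Var(1-period return)) - 1, as in (18).

    k-period returns are built from overlapping sums of single-period returns.
    """
    k_sums = [sum(returns[i:i + k]) for i in range(len(returns) - k + 1)]
    return sample_variance(k_sums) / (k * sample_variance(returns)) - 1.0


def asy_var_vr(k):
    """T * Var[V(k)] under the null of no predictability, as in (19)."""
    return 2.0 * (2 * k - 1) * (k - 1) / (3.0 * k)


if __name__ == "__main__":
    rng = random.Random(0)
    r = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # simulated i.i.d. returns
    k = 5
    v = variance_ratio(r, k)
    z = math.sqrt(len(r)) * v / math.sqrt(asy_var_vr(k))
    print(v, z)
```

A perfectly alternating return series such as `[1, -1, 1, -1, ...]` gives `variance_ratio(r, 2) == -1.0`, the extreme case of negative first-order autocorrelation.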
3.3. A synthesis
In this section, we present a synthesis of all the statistics presented to test for the existence of predictability in stock returns based on the information contained in past stock prices.⁷ All tests of return predictability discussed above are (approximately) linear combinations of autocorrelations in single-period returns. Under the null hypothesis of no predictability, all these statistics will therefore have zero expected values. However, the behavior of the various statistics could be substantially different under different alternative hypotheses because they place different weights on single-period autocorrelations of different lags. Recall from Section 3.1.1 that the asymptotic distribution of the vector of $j$th-order autocorrelations is given by
$$\sqrt{T}\,\hat\rho(k) \overset{a}{\sim} N(0, I) , \tag{20a}$$
where $\hat\rho(k) = [\hat\rho_1(k), \ldots, \hat\rho_j(k)]'$, $k$ = the length of the measurement interval, and $\hat\rho_j(k)$ = $j$th-order autocorrelation. For convenience, we redefine the $j$th-order autocorrelation coefficient such that:
7 The discussion in this section is based in large part on the analysis in Richardson and Smith (1994). See also Daniel and Torous (1993).
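$$\hat\rho_j(k) = \frac{\mathrm{Cov}(R_t, R_{t-j})}{\mathrm{Var}\left(\sum_{i=1}^{k} R_{t-i}\right) / k} . \tag{20b}$$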
Note that the $j$th-order autocorrelation coefficient in (20b) is different from the one in (5b) in that the autocovariance is not weighted by the single-period variance. Instead, since the independent variables in both the Fama and French (1988) multiperiod autoregression (12) and Jegadeesh's (1991) modified autoregression (15) are $k$-period returns, the autocovariance in (20b) is weighted by a $k$-period variance. Clearly, under the null hypothesis of no predictability this modification to the $j$th-order autocorrelation coefficient has no effect in large samples. However, under different alternative hypotheses, this seemingly minor modification could have nontrivial effects on inferences. As mentioned earlier, all the statistics discussed so far can be rewritten as weighted averages of the $j$th-order autocorrelations, albeit with different weights. We can define the entire set of test statistics as linear combinations of autocorrelations, such that
$$\lambda_s(k) = \sum_j \omega_{js}\,\hat\rho_j(k) , \tag{21}$$
where $\omega_{js}$ = weights assigned to the $j$th-order autocorrelation by a particular test statistic, $\lambda_s(k)$ [where $s$ is the index for the test statistic]. Under the null hypothesis of no predictability, from (20a) it follows that
$$\sqrt{T}\,\lambda_s(k) \overset{a}{\sim} N\Big(0, \sum_j \omega_{js}^2\Big) . \tag{22}$$
The normality of all the test statistics follows because each one is an (approximately) linear combination of $j$th-order autocorrelations which, in turn, have asymptotically normal distributions under the null hypothesis [see (20a)]. And using (21), the three estimators may be rewritten as [see Cochrane (1988), Jegadeesh (1990), Lo and MacKinlay (1988), and Richardson and Smith (1994)]:
$$\hat\beta(k) = \sum_{j=1}^{2k-1} \frac{\min(j,\, 2k - j)}{k}\,\hat\rho_j(k) , \tag{23a}$$
$$\hat\beta(1,k) = \frac{1}{k} \sum_{j=1}^{k} \hat\rho_j(k) , \tag{23b}$$
and
$$\hat V(k) = 2 \sum_{j=1}^{k-1} \frac{(k - j)}{k}\,\hat\rho_j(1) . \tag{23c}$$
8 A related stream of research measures the profitability of linear trading strategies of various horizons [see DeBondt and Thaler (1985) and Lehmann (1990)]. In these studies, the profits of trading strategies are functions of average autocovariances, both for individual securities and portfolios [see Ball, Kothari, and Shanken (1995), Conrad and Kaul (1994), Jegadeesh (1990), Jegadeesh and Titman (1993), and Lo and MacKinlay (1990a)].
Given the weights and the exact formulae in (23a)-(23c), it is simple to calculate the asymptotic variances of each of the estimators under the null hypothesis [or any other estimator of the form $\lambda_s(k) = \sum_j \omega_{js}\hat\rho_j(k)$]. Specifically, $T\,\mathrm{Var}[\lambda_s(k)] = \sum_j \omega_{js}^2$. Therefore, the asymptotic variances of the three estimators can be calculated as:
$$T\,\mathrm{Var}[\hat\beta(k)] = \frac{2k^2 + 1}{3k} , \tag{24a}$$
$$T\,\mathrm{Var}[\hat\beta(1,k)] = 1/k , \tag{24b}$$
and
$$T\,\mathrm{Var}[\hat V(k)] = \frac{2(2k - 1)(k - 1)}{3k} . \tag{24c}$$
The appropriateness of a particular test statistic $\lambda_s(k)$ will depend entirely on the alternative hypothesis under consideration. For example, suppose stock prices reflect "true" value but are recorded with well-behaved measurement errors caused by market microstructure effects, that is, observed price $\hat p_t = p_t + e_t$ (where $p_t$ = true price and $e_t$ = random measurement error). Then clearly the alternative model for stock returns will follow an MA(1) process, and the optimal weights to detect such predictability would be $\omega_j = 0\ \forall\, j > 1$. Any alternative weighting scheme would make the resulting test statistic inefficient [see Kaul and Nimalendran (1990)]. A more detailed examination of this important dependence of the choice of a particular test statistic $\lambda_s(k)$ on the alternative hypothesis is provided in Section 4.
An additional important point made by Richardson and Smith (1994) in the context of the alternative test statistics used in the literature is that if the null hypothesis is true, then the estimators will be strongly correlated with each other. This occurs because $\hat\beta(k)$, $\hat\beta(1,k)$, and $\hat V(k)$ will tend to capture common sampling errors. Specifically, the asymptotic variance-covariance matrix of the three estimators can be written as:⁹
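$$T\,\mathrm{Var}\begin{pmatrix} \hat\beta(k) \\ \hat\beta(1,2k) \\ \hat V(2k) \end{pmatrix} = \begin{pmatrix} \dfrac{2k^2 + 1}{3k} & \dfrac{1}{2} & k \\[2mm] \dfrac{1}{2} & \dfrac{1}{2k} & \dfrac{2k - 1}{2k} \\[2mm] k & \dfrac{2k - 1}{2k} & \dfrac{2(4k - 1)(2k - 1)}{6k} \end{pmatrix} . \tag{25}$$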
For large $k$, the correlations vary between 75% and 88%, and Richardson and Smith (1994) confirm the existence of high correlation between the three estimators in small samples. This issue is particularly important because Richardson (1993), for example, shows that the U-shaped patterns in autocorrelations predicted by the alternative fads model in (10) can obtain even if true prices are completely unpredictable. Given that we can falsely reject the null hypothesis based on $\hat\beta(k)$, it would not be very surprising if use of $\hat\beta(1,2k)$ and $\hat V(2k)$ also led to the same conclusion.
9 Note that for ease of comparison across the three estimators, the variance-covariance matrix is calculated for $\hat\beta(k)$, $\hat\beta(1,2k)$, and $\hat V(2k)$.
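Because each statistic is (approximately) a weighted sum of autocorrelations that are asymptotically independent with unit variance under the null, the variances and covariances of the estimators reduce to sums of products of weights. The sketch below illustrates this for $\hat\beta(k)$, $\hat\beta(1,2k)$, and $\hat V(2k)$; the weight functions follow the forms $\min(j, 2k-j)/k$, $1/k$, and $2(k-j)/k$, and all function names are hypothetical.

```python
import math


def weights_beta_k(k):
    """Weights of beta_hat(k) on lags 1..2k-1: min(j, 2k - j) / k."""
    return {j: min(j, 2 * k - j) / k for j in range(1, 2 * k)}


def weights_beta_1k(k):
    """Weights of beta_hat(1, k) on lags 1..k: flat weights 1/k."""
    return {j: 1.0 / k for j in range(1, k + 1)}


def weights_vr(k):
    """Weights of V_hat(k) on lags 1..k-1: 2 (k - j) / k."""
    return {j: 2.0 * (k - j) / k for j in range(1, k)}


def asy_cov(w1, w2):
    """T * Cov of two statistics under the null: sum over common lags of w1*w2."""
    return sum(w * w2[j] for j, w in w1.items() if j in w2)


def asy_corr(w1, w2):
    """Asymptotic correlation implied by the two weight vectors."""
    return asy_cov(w1, w2) / math.sqrt(asy_cov(w1, w1) * asy_cov(w2, w2))


if __name__ == "__main__":
    k = 60  # illustrative horizon
    b, j, v = weights_beta_k(k), weights_beta_1k(2 * k), weights_vr(2 * k)
    print(asy_cov(b, b), asy_cov(j, j), asy_cov(v, v))
    print(asy_corr(b, j), asy_corr(b, v), asy_corr(j, v))
```

For large $k$ the implied correlations settle near 0.87, 0.75, and 0.87, in line with the range quoted in the text.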
3.4. Predictability based on fundamental variables
Although predictability of stock returns based on past information in stock prices has received the overwhelming share of attention, several researchers gauge the predictability of stock returns using "fundamental" variables. In a seminal contribution to the predictability literature, Fama and Schwert (1977) use treasury bill rates to predict stock and bond returns [see also Fama (1981)]. Over the past decade, several new fundamental variables have been used to predict stock returns. For example, Campbell (1987), Campbell and Shiller (1988), Cutler, Poterba, and Summers (1991), Fama and French (1988, 1989), Flood, Hodrick, and Kaplan (1987), and Keim and Stambaugh (1986), among others, use financial variables such as dividend yield, price-earnings ratios, term structure variables, etc., to predict future stock returns. In a similar vein, Balvers, Cosimano, and MacDonald (1990), Fama (1990), and Schwert (1990) have used macroeconomic fundamentals, such as output and inflation, to predict stock returns [see also Chen (1991)], while Seyhun (1992) uses aggregate insider-trading patterns to uncover predictable components in stock returns. Some recent papers by Ferson and Harvey (1991), Evans (1994), and Ferson and Korajczyk (1995) focus on the relation between predictability of stock returns based on lagged variables and economic "factors" similar to those identified by Chen, Roll, and Ross (1986). Ferson and Schadt (1996) show that conditioning on predetermined public information removes biases in commonly used unconditional measures of the performance of mutual fund managers; mutual fund managers "look better" using conditional measures. Finally, Jagannathan and Wang (1996) show that models that allow for time-varying expected returns on the market portfolio also have the potential to explain the rich cross-sectional variation in average returns on different stocks.
The typical regression estimated to uncover predictable components in stock returns using fundamental variables is similar to regression (12):
$$\sum_{i=1}^{k} R_{t+i} = \alpha(k) + \beta(k) X_t + u_t(k) , \tag{26}$$
where $X_t$ = dividend yield, output, .... The only difference between (12) and (26) lies in the use of past fundamentals in the latter versus past returns in (12). Also, with the exception of Hodrick (1992), multiperiod returns are regressed on the fundamentals typically measured over a fixed interval.¹⁰ The most significant findings of the studies estimating regressions similar to (26) are: (1) several different variables predict stock returns; and (2) in virtually all cases, the $R^2$'s of the regressions increase dramatically as the length of the measurement interval for the dependent variable is increased. In effect, therefore, there is strong predictability in long-term stock returns.
10 Following Jegadeesh (1991), Hodrick (1992) regresses single-period returns on past dividends measured over multiple periods. See Section 4.1 for a discussion of the efficacy of this approach.
The more recent literature on return predictability based on fundamental variables has therefore concentrated on long-term stock returns. This is quite natural, especially given that the most commonly used alternative time-series model for returns [see (10)] also implies greater predictability of long-term returns. In fact, the "excess volatility" literature, pioneered by Shiller (1981) and LeRoy and Porter (1981), can be viewed as the precursor of the vast literature on long-term return predictability. This literature suggests that if stock prices are excessively volatile relative to subsequent movements in dividends, then long-term returns (or, more specifically, the "infinite-period log returns") are forecastable [see also Shiller (1989)]. [Also see the discussion below on the forecastability of long-term stock returns using past dividend yields.] It would also be fair to say that among all the potential variables that could be used to predict stock returns, dividend yields have received overwhelming attention [see, for example, Campbell and Shiller (1988a,b), Fama and French (1988b), Flood, Hodrick, and Kaplan (1987), Goetzmann and Jorion (1993), Hodrick (1992), and Rozeff (1984)]. The choice of the dividend yield variable again is no accident; fairly simple models of asset prices can be used to justify (a) the role of dividend yields in predicting stock returns, and (b) the stronger predictive power of dividend yields at long versus short horizons. Following Campbell and Shiller (1988a), consider the present value model of discounted dividends:
$$P_t = E_t\left[\sum_{i=1}^{\infty} \frac{D_{t+i}}{(1 + R)^i}\right] . \tag{27}$$
Given a constant growth rate of dividends, $G$, and constant expected returns, we obtain the Gordon (1962) model for stock prices (for $R > G$):
$$P_t = \left(\frac{1 + G}{R - G}\right) D_t . \tag{28}$$
Campbell and Shiller (1988a) show that with time-varying expected returns, it is useful to study the log-linear approximation of the relation between prices, dividends, and returns. Using this approximation, the "dynamic" version of the dividend-growth model in (28) may be written as:
$$p_t = \frac{k}{1 - \rho} + E_t\left[\sum_{j=0}^{\infty} \rho^j \left[(1 - \rho)\, d_{t+1+j} - r_{t+1+j}\right]\right] , \tag{29}$$
where $\rho = 1/[1 + \exp(\overline{d - p})]$, $k = -\log(\rho) - (1 - \rho)\log(1/\rho - 1)$, all lower case letters indicate logs of the respective variables, and $(\overline{d - p})$ is the fixed mean of the (log) dividend-price ratio, which follows a stationary process.
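To give a sense of magnitudes, the two linearization constants can be evaluated for a given average dividend-price ratio. This is a small sketch assuming the definitions $\rho = 1/[1 + \exp(\overline{d - p})]$ and $k = -\log(\rho) - (1 - \rho)\log(1/\rho - 1)$; the function name and the 4 percent dividend yield are hypothetical choices for illustration only.

```python
import math


def loglinear_constants(mean_log_dp):
    """Return (rho, k) of the log-linear approximation.

    mean_log_dp : average log dividend-price ratio, i.e. mean of (d - p)
    """
    rho = 1.0 / (1.0 + math.exp(mean_log_dp))
    k = -math.log(rho) - (1.0 - rho) * math.log(1.0 / rho - 1.0)
    return rho, k


if __name__ == "__main__":
    # Hypothetical example: an average dividend yield of 4 percent per period.
    rho, k = loglinear_constants(math.log(0.04))
    print(rho, k)
```

With a 4 percent average yield, $\rho$ equals $1/1.04$, so returns and dividend growth far in the future are discounted only slowly in the summations.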
To demonstrate the importance of the dividend yield variable for predicting future stock returns, equation (28) can be rewritten in terms of the (log) dividend yield [see also Campbell, Lo, and MacKinlay (1993)]:
$$d_t - p_t = -\frac{k}{1 - \rho} + E_t\left[\sum_{j=0}^{\infty} \rho^j \left[-\Delta d_{t+1+j} + r_{t+1+j}\right]\right] . \tag{30}$$
From (30) the potential predictive ability of dividend yields becomes obvious: the current dividend yield would proxy for future expectations of stock returns (the second term in brackets) as long as future dividend growth rates (the first term in brackets) are not too variable. Also, since we discount all future returns in (30), the current yield is likely to have greater predictive ability for long-term stock returns.¹¹ Given the economic justification for estimating regressions similar to (26), instead of comparatively ad hoc autoregressions similar to (12) or (15), until recently the startling evidence from the "fundamental regressions" was not viewed with suspicion. For example, Jegadeesh (1991), in investigating the power of autoregressions such as (12), reflects the general belief that "... the evidence that the returns at various horizons can be predicted using these [fundamental] variables does not seem to be controversial" (p. 1428). However, there are statistical problems associated with (long-run) regressions such as (26) caused by the unavoidable use of small sample sizes when $k$ is large. The first problem [analyzed by Nelson and Kim (1993) and Goetzmann and Jorion (1993)] deals with bias in the OLS estimator of $\beta(k)$ because dividend yields (or other fundamental variables) are lagged endogenous variables. The second statistical problem results from the fact that the OLS standard errors of $\hat\beta(k)$ are also biased [see Hodrick (1992), Kim, Nelson, and Startz (1991), Richardson and Smith (1991), and Richardson and Stock (1989)]. The analysis of Mankiw and Shapiro (1986) and Stambaugh (1986b) suggests that the small-sample bias in $\hat\beta(k)$ could be substantial. Consider, for example, the bivariate system [see also Nelson and Kim (1993)]:
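$$Y_t = \alpha + \beta X_{t-1} + \varepsilon_t , \qquad \varepsilon_t \sim \mathrm{iid}(0, \sigma_\varepsilon^2) , \tag{30a}$$
and
$$X_t = \mu + \phi X_{t-1} + \eta_t , \qquad \eta_t \sim \mathrm{iid}(0, \sigma_\eta^2) . \tag{30b}$$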
11 Campbell, Lo, and MacKinlay (1993) also demonstrate how a highly persistent expected return component [that is, $|\phi_1| \approx 1$ in (7b)] could also lead to increased predictive ability of dividend yield (and other fundamental variables) at long horizons.
It can be shown that although $\hat\beta_{OLS}$ in (30a) is consistent, it is biased in small samples, and the bias is proportional to the bias in the OLS estimator of $\phi$ [see Stambaugh (1986b)]:
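$$E[(\hat\beta - \beta)] = \frac{\mathrm{Cov}(\varepsilon_t, \eta_t)}{\mathrm{Var}(\eta_t)}\, E[(\hat\phi - \phi)] . \tag{31a}$$
And Kendall (1954) shows that the bias in $\hat\phi_{OLS}$ is approximately of the order of $-(1 + 3\phi)/T$, where $T$ is the sample size. Consequently,
$$E[(\hat\beta - \beta)] \simeq -\frac{\mathrm{Cov}(\varepsilon_t, \eta_t)}{\mathrm{Var}(\eta_t)}\, [(1 + 3\phi)/T] . \tag{31b}$$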
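The small-sample bias described by (31a) and (31b) can be checked with a short Monte Carlo experiment. The sketch below simulates the bivariate system (30a)-(30b) with a true slope of zero, unit-variance innovations, and a negative contemporaneous correlation between them, and compares the average OLS slope with the approximation $-[\mathrm{Cov}(\varepsilon_t,\eta_t)/\mathrm{Var}(\eta_t)](1+3\phi)/T$. All function names and parameter values are hypothetical, and the innovations are assumed normal purely for convenience.

```python
import random


def ols_slope(y, x):
    """OLS slope of y on x (with an intercept)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_mean_slope(T=60, phi=0.95, corr=-0.9, reps=2000, seed=0):
    """Average OLS slope of Y_t on X_{t-1} when the true slope is zero."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # Draw X_0 from the stationary distribution of the AR(1) process.
        x_prev = rng.gauss(0.0, 1.0) / (1.0 - phi**2) ** 0.5
        x_lagged, y = [], []
        for _ in range(T):
            eta = rng.gauss(0.0, 1.0)
            eps = corr * eta + (1.0 - corr**2) ** 0.5 * rng.gauss(0.0, 1.0)
            y.append(eps)            # Y_t = eps_t because alpha = beta = 0
            x_lagged.append(x_prev)  # regressor is X_{t-1}
            x_prev = phi * x_prev + eta
        total += ols_slope(y, x_lagged)
    return total / reps


def approx_bias(T, phi, cov_eps_eta, var_eta):
    """Approximate bias of the OLS slope implied by (31b)."""
    return -(cov_eps_eta / var_eta) * (1.0 + 3.0 * phi) / T


if __name__ == "__main__":
    print(simulate_mean_slope(), approx_bias(60, 0.95, -0.9, 1.0))
```

With negatively correlated innovations, as is typical for returns and dividend yields, both numbers are positive even though the regressor has no true predictive power.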
From (31a) and (31b), it follows that even if $X_{t-1}$ truly has no explanatory power in predicting $Y_t$, the small-sample bias in estimating $\phi$ results in spurious predictability. The spurious predictability will be stronger: (a) the higher the correlation coefficient between the innovations $\varepsilon_t$ and $\eta_t$; (b) the higher the autocorrelation in $X_t$; and (c) the smaller the sample size. The second problem with regression (26) is that due to small sample sizes, most researchers use overlapping observations for $k$-period returns (that is, the dependent variable) which, in turn, induces serial correlation in the errors. Traditional OLS standard errors are appropriate asymptotically only if there is no serial correlation in returns. Hansen and Hodrick (1980) provide autocorrelation-consistent asymptotic standard errors which can be modified for heteroskedasticity [see Hodrick (1992)]. Richardson and Smith (1991) use an innovative approach to derive asymptotic standard errors that replace the Hansen and Hodrick (1980) standard-error adjustments with a very simple form independent of the data. For example, the asymptotic variances of the three autocorrelation-based estimators take the same form as in (24a)-(24c). Hodrick (1992) provides heteroskedasticity-consistent counterparts to the Richardson and Smith (1991) standard errors within the context of regression (26).¹² [Section 4.1 contains a detailed analysis of the efficiency gains from using overlapping observations in estimating regressions similar to (26).] Nelson and Kim (1993) address both problems of biased OLS estimators of $\beta(k)$ and biases in their standard errors by jointly modeling stock returns and dividend yields as a first-order vector autoregressive (VAR) process [see also Hodrick (1992)]. Specifically, let
$$Z_t = A Z_{t-1} + u_t , \tag{32}$$
where $Z_t$ represents stock returns and lagged dividend yields. To assess the bias in $\hat\beta(k)$ and the properties of the asymptotic standard errors in small samples, both Hodrick (1992) and Nelson and Kim (1993) simulate the VAR model in (32) under the null that the slope coefficients in the return equation are zero. The VAR approach is attractive because it directly addresses the issue of persistence in dividend yields [see $\phi$ in (30b)] and the strong (negative) contemporaneous correlation between innovations in stock returns and dividend yields [proxied by $\varepsilon_t$ and $\eta_t$ in (30a) and (30b), respectively]. Both Hodrick (1992) and Nelson and Kim (1993) find that inferences could be substantially altered by correcting for (a) the small-sample bias in $\hat\beta(k)$ induced by the endogeneity of dividend yields; and (b) the small-sample bias in asymptotic standard errors suggested in the literature [see also Goetzmann and Jorion (1993)].¹³
12 See also Newey and West (1987) for autocorrelation- and heteroskedasticity-consistent variance estimators that are positive semidefinite.
On a more general level, however, all the tests of predictability will run into data-snooping problems. For example, Lo and MacKinlay (1990c) show how grouping stocks into portfolios based on an empirical regularity (such as the size effect) can bias statistical tests. Of more direct concern to us, however, is the work of Foster and Smith (1992) and Lo and MacKinlay (1992), who analyze the properties of the maximal $R^2$, a widely used measure of the extent of predictability in several scientific contexts [see, for example, Roll (1988)]. Foster and Smith (1992) derive the distribution of the maximal $R^2$ when a researcher chooses predictor variables from a set of available ones. Consider, for example, a multiple regression:
$$Y_t = \alpha + X_t \beta + \varepsilon_t , \qquad \varepsilon_t \sim N(0, \sigma^2) , \tag{33}$$
where $X_t$ is a matrix of $k$ regressors. Under the null hypothesis that the vector $\beta = 0$, the $R^2$ of regression (33) is distributed Beta$\left[\frac{k}{2}, \frac{T - (k+1)}{2}\right]$, where $T$ is the sample size. The distribution of the $R^2$ can then be used to assess the goodness-of-fit of regression (33). The assumption is that researchers choose $k$ predictors from a potential pool of $M$ regressors, and the cutoff $R^2$ needs to be adjusted for this choice. Using order statistic arguments, Foster and Smith (1992) show that for independent regressions the distribution function for the maximal $R^2$ is given by
$$\Pr\left[R^2_{\max} \le r\right] = \left[\mathrm{Beta}(r)\right]^{\binom{M}{k}} , \tag{34}$$
where Beta$(r)$ is the cumulative distribution function of the beta density function with $k/2$ and $[T - (k+1)]/2$ degrees of freedom. Given that nonindependent regressions are estimated in the literature, equation (34) provides a lower bound for the true distribution function of the maximal $R^2$. Foster and Smith (1992) show that we could generate reasonably high $R^2$'s that do not exceed the maximal $R^2$ under the assumption of $\beta = 0$ in (33), even if we "snoop" a few predictors from a limited set of potential regressors. Since the (independent and even overlapping) observations ($T$) in long-run studies are likely to be small, from (34) it follows that one can more easily produce spuriously high values of $R^2$'s in long-run versus short-run regressions.¹⁴
13 The regression of overlapping returns (even under the null hypothesis of no predictability) on highly autocorrelated dividend yields and/or prices potentially also suffers from the spurious regression phenomenon illustrated by Granger and Newbold (1974).
In a related paper, Lo and MacKinlay (1992) explicitly maximize the predictability of stock returns to, among other things, provide a gauge of whether the predictability uncovered in the literature is economically significant or not. They maximize predictability by varying the dependent variable (specifically, the composition or portfolio weights of the stock portfolios whose returns are being predicted), while holding fixed the regressors in (33). Foster and Smith (1992), on the other hand, maximize predictability across subsets of predictors while keeping fixed the asset returns being predicted. Nevertheless, both studies provide useful bounds on maximal $R^2$ values that can be achieved in empirical studies purely by chance.
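The effect of snooping on the cutoff $R^2$ can be illustrated by simulation. The sketch below assumes, as in (34), that the candidate regressions are independent and that each null $R^2$ is distributed Beta$[k/2, (T-k-1)/2]$; it draws the maximum over $\binom{M}{k}$ such regressions and reports an empirical 95 percent cutoff. The function name, the sample sizes, and the pool size are hypothetical.

```python
import math
import random


def simulated_max_r2_cutoff(k, T, M, level=0.95, draws=5000, seed=0):
    """Empirical quantile of the maximal R^2 across C(M, k) independent
    regressions, each with a null R^2 ~ Beta(k/2, (T - k - 1)/2)."""
    rng = random.Random(seed)
    a, b = k / 2.0, (T - k - 1) / 2.0
    n_regressions = math.comb(M, k)  # number of candidate predictor subsets
    maxima = sorted(
        max(rng.betavariate(a, b) for _ in range(n_regressions))
        for _ in range(draws)
    )
    return maxima[int(level * draws) - 1]


if __name__ == "__main__":
    # One predictor chosen from a pool of 1 (no snooping) versus a pool of 10.
    for T in (20, 50, 100):
        no_snoop = simulated_max_r2_cutoff(k=1, T=T, M=1)
        snoop = simulated_max_r2_cutoff(k=1, T=T, M=10)
        print(T, round(no_snoop, 3), round(snoop, 3))
```

The cutoff rises with the size of the pool and falls with the sample size, which is why small-$T$ long-run regressions are particularly exposed to spuriously high $R^2$'s.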
4. Power comparisons
Until now we have concentrated on the statistical properties of test statistics used in the literature to gauge predictability in stock returns under the null hypothesis of no predictability. However, critical to any statistic is its power in discerning departures from the null hypothesis. The power of a test statistic can be determined within the context of a specific alternative hypothesis. The most common approach for evaluating the power of a statistic is to use computer-intensive simulations under different alternative hypotheses [see, for example, Hodrick (1992), Lo and MacKinlay (1989), Kim and Nelson (1993), and Poterba and Summers (1988)]. A classic example of such power comparisons is the exhaustive investigation of the size and power (against several alternative hypotheses) of the variance-ratio statistic in finite samples by Lo and MacKinlay (1989). Although the small sample sizes that are characteristic of long-run studies may make a computer-intensive approach unavoidable for determining the finite-sample properties of any particular statistic, some recent studies suggest that asymptotic power comparisons can help us understand the reasons for the different (or similar) behavior of test statistics under alternative hypotheses. Specifically, Campbell, Lo, and MacKinlay (1993), Hansen and Hodrick (1980), Jegadeesh (1991), and Richardson and Smith (1991, 1994), among others, use the Bahadur (1960) and Geweke (1981) procedure to compare the relative asymptotic power of test statistics, which requires a comparison of their approximate slopes. The approximate slope of a test statistic, denoted by $c_s$, is defined as the rate at which the logarithm of the asymptotic marginal significance level of the statistic declines, under a given alternative hypothesis, as the sample size is increased.
Geweke (1981) shows that when the limiting distribution of a test statistic $\lambda_s(k)$ is $\chi^2$, its approximate slope is equal to the probability limit of $1/T$ times the test statistic under the alternative hypothesis.
14 The unreliability of $R^2$'s in long-run studies that use overlapping stock returns as dependent variables to increase $T$ is also emphasized in Granger and Newbold (1974).
As an illustration of power comparisons, let us assume that the alternative hypothesis is described by the temporary-permanent stock price model shown in (10). The choice of this alternative is attractive because of its widespread use in the literature. Also, following Jegadeesh (1991) and Richardson and Smith (1991, 1994), let us compare the relative asymptotic powers of the three main autocorrelation-based statistics, $\hat\beta(k)$, $\hat\beta(1,2k)$, and $\hat V(2k)$. Note that the choice of these statistics is also natural because, given that they are linear combinations of consistent autocorrelation estimators [see (21)], they have limiting $\chi^2$ distributions. This, in turn, enables us to directly use Geweke's (1981) procedure to conduct power comparisons. Noting that all the autocorrelation-based statistics are given by $\lambda_s(k) = \sum_j \omega_{js}\hat\rho_j(k)$, we need to choose $\omega$ and $k$ to maximize the approximate slope of a particular test statistic $\lambda_s(k)$ [see Richardson and Smith (1994)]:
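$$c_s = \frac{\left[\sum_j \omega_{js}\, \mathrm{plim}[\hat\rho_j(k)]\right]^2}{\sum_j \omega_{js}^2} . \tag{35}$$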
The only unknowns in (35) are the probability limits of $\hat\rho_j(k)$, which can be determined easily given the alternative model in (10). Specifically,
$$\mathrm{plim}[\hat\rho_j(k)] = \frac{-[1/(1 + \gamma)]\,\phi^{j-1}(1 - \phi)^2}{2[\gamma/(1 + \gamma)](1 - \phi) + 2[1/(1 + \gamma)](1 - \phi^k)/k} . \tag{36}$$
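Substituting the values of $\mathrm{plim}[\hat\rho_j(k)]$ from (36) into (35), we can find the test with the maximal approximate slope and use it as a benchmark to gauge the relative power of all existing test statistics. Specifically, maximizing $c_s$ in (35) with respect to $\omega$ and $k$, we obtain:
$$\max_{\omega, k}\ c_s = \max_{\omega, k} \left[\frac{[1/(1 + \gamma)](1 - \phi)^2}{2[\gamma/(1 + \gamma)](1 - \phi) + 2[1/(1 + \gamma)](1 - \phi^k)/k}\right]^2 \times \frac{\left(\sum_j \omega_j \phi^{j-1}\right)^2}{\sum_j \omega_j^2} . \tag{37}$$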
As Richardson and Smith (1994) note, there are two separate parts of this maximization problem in (37). The first part in brackets is clearly maximized as $k$ is increased, but the marginal gain from increasing $k$ decreases at a rate which is a function of the two unknowns, $\gamma$ (the share of the variance of the permanent versus the temporary component of stock prices) and $\phi$ (the persistence parameter of the temporary component). The second component involves a choice of the weights, $\omega$, which depend only on $\phi$ because it fully explains the autocorrelation pattern under the alternative model in (10). And given a fixed $\phi$, the optimal weights are $\omega_j = \phi^{j-1}\ \forall\, j$; that is, the optimal weights for the asymptotically most powerful statistic will decline geometrically. From the above discussion it would appear that the variance-ratio statistic, $\hat V(2k)$, which places declining weights on autocorrelations, should exhibit the maximum power compared to both the $\hat\beta(1,2k)$ statistic, which places equal weights on autocorrelations, and $\hat\beta(k)$, which places virtually no weight on the very informative low-order autocorrelations [see (23a)-(23c)]. However, Richardson and Smith's (1994) explicit approximate-slope comparisons reveal that the $\hat\beta(1,2k)$ statistic fares as well as the $\hat V(2k)$ statistic in detecting departures from the null when the alternative model is of the form in (10). The answer to this puzzling result lies in the use of multiple-period returns in $\hat\beta(1,2k)$ versus single-period returns in $\hat V(2k)$ for weighting the autocovariances [compare (16a) with (18)]. Thus, the choice of $k = 1$ for the variance ratio, $\hat V(2k)$, reduces its power because the first term in (37) is not maximized. Conversely, the choice of $k > 1$ for $\hat\beta(1,2k)$ increases its power; however, the flat (as opposed to geometrically declining) weights hurt its power. This useful insight, obtained from a theoretical power comparison of the tests, helps us understand the sources of the apparently similar power [given the alternative model in (10)] of two seemingly different test statistics.
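The approximate-slope comparison can be sketched numerically. The code below assumes the slope takes the form $[\sum_j \omega_j\,\mathrm{plim}\,\hat\rho_j(k)]^2 / \sum_j \omega_j^2$ and that, under the alternative in (10), $\mathrm{plim}\,\hat\rho_j(k) = -\phi^{j-1}(1-\phi)^2 / [2\gamma(1-\phi) + 2(1-\phi^k)/k]$ (the same expression as (36) after cancelling the common factor $1/(1+\gamma)$). Function names and parameter values are hypothetical, and the output is only as reliable as these assumed forms.

```python
def plim_rho(j, k, phi, gamma):
    """plim of the j-th order autocorrelation normalized by a k-period variance."""
    numerator = -(phi ** (j - 1)) * (1.0 - phi) ** 2
    denominator = 2.0 * gamma * (1.0 - phi) + 2.0 * (1.0 - phi**k) / k
    return numerator / denominator


def approximate_slope(weights, rhos):
    """(sum_j w_j * rho_j)^2 / sum_j w_j^2 for matching lists of weights and plims."""
    signal = sum(w * r for w, r in zip(weights, rhos))
    return signal**2 / sum(w * w for w in weights)


def three_statistic_slopes(k, phi, gamma):
    """Approximate slopes of beta(k), beta(1,2k), and V(2k) under model (10)."""
    lags = range(1, 2 * k)
    # beta(k): tent-shaped weights, k-period normalization
    w_b = [min(j, 2 * k - j) / k for j in lags]
    r_b = [plim_rho(j, k, phi, gamma) for j in lags]
    # beta(1,2k): flat weights on lags 1..2k, 2k-period normalization
    w_j = [1.0 / (2 * k)] * (2 * k)
    r_j = [plim_rho(j, 2 * k, phi, gamma) for j in range(1, 2 * k + 1)]
    # V(2k): linearly declining weights, single-period normalization
    w_v = [2.0 * (2 * k - j) / (2 * k) for j in lags]
    r_v = [plim_rho(j, 1, phi, gamma) for j in lags]
    return {
        "beta(k)": approximate_slope(w_b, r_b),
        "beta(1,2k)": approximate_slope(w_j, r_j),
        "V(2k)": approximate_slope(w_v, r_v),
    }


if __name__ == "__main__":
    print(three_statistic_slopes(k=30, phi=0.98, gamma=0.5))  # illustrative values
```

For a fixed normalization, geometrically declining weights proportional to $\phi^{j-1}$ attain the largest slope by the Cauchy-Schwarz inequality, which is the sense in which they are optimal in the discussion above.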
4.1. Overlapping observations
A large part of the literature on stock-return predictability has concentrated on long-run predictability, using both past returns and/or fundamental variables. However, since "theory" is silent about what constitutes the long run, empirical studies have used holding periods of five to 10 years in gauging the existence of predictability. A paucity of historical data, however, makes it difficult to obtain more than a handful of independent (that is, nonoverlapping) observations on long-term returns. For example, between 1926 (the starting date of the CRSP tapes) and 1994, there are only 14 nonoverlapping five-year intervals. Such small samples make inferences very unreliable, and it is not surprising that the past decade has witnessed several attempts to extract as much information as possible out of the limited historical data at hand. A natural solution to the small-sample problem is to use overlapping data, and this has been the choice of most empiricists. Hansen and Hodrick (1980) use the asymptotic slope procedure of Bahadur (1960) and Geweke (1981) to show that overlapping data lead to an increase in the asymptotic efficiency of estimators of long-run relations. Richardson and Smith (1991) quantify the efficiency gains from the use of overlapping data when past returns are used to predict future returns (see Section 3.1). They show that overlapping data provide approximately 50% more "observations" relative to the nonoverlapping data used for the same period. However, Boudoukh and Richardson (1994) demonstrate that the efficiency gains from the use of overlapping data may be severely diluted when long-term predictability is measured by estimating the information content in fundamental variables [see regression (26)].
Specifically, if the fundamental variables used to predict stock returns are highly autocorrelated, which they invariably are [see, for example, Keim and Stambaugh (1986) and Fama and French (1988b)], the efficiency gains from the use of overlapping data dwindle rapidly. Also, other commonly suggested procedures may actually be even more inefficient than using overlapping observations.
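The difference between the two sampling schemes can be made concrete with a minimal Python sketch. It is an illustration only, not code from the studies cited: the function name `k_period_returns` is hypothetical, and it assumes continuously compounded (log) single-period returns, so that a k-period return is the sum of k single-period returns.

```python
def k_period_returns(returns, k, overlapping=True):
    """Aggregate single-period log returns into k-period returns.

    Assumption (for illustration): returns are continuously compounded,
    so a k-period return is the sum of k consecutive single-period returns.
    overlapping=True  -> the window advances one period at a time.
    overlapping=False -> the window advances k periods at a time.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    step = 1 if overlapping else k
    last_start = len(returns) - k  # last index at which a full window fits
    return [sum(returns[i:i + k]) for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    monthly = [0.01] * 720  # placeholder data: 60 years of monthly returns
    print(len(k_period_returns(monthly, 60, overlapping=False)))
    print(len(k_period_returns(monthly, 60, overlapping=True)))
```

With 720 monthly observations and k = 60, the nonoverlapping scheme yields 12 five-year returns, while the overlapping scheme yields 661 sums that share most of their components with their neighbors. That sharing is why the overlapping observations are serially correlated by construction.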
G. Kaul
Consider, for example, regression (26) estimated using nonoverlapping data and a single predictor variable; that is, the data are sampled every k periods, leading to a sample size of T/k k-period observations. The asymptotic variance of $\hat{\beta}(k)$ is given by
$$T\,\mathrm{Var}[\hat{\beta}(k)] = k^2\,\frac{\sigma_R^2}{\sigma_X^2} \qquad (38)$$
where $\sigma_R^2$ and $\sigma_X^2$ are the variances of single-period returns and of the independent variable $X_t$, respectively. Suppose overlapping observations are used to estimate (26) instead, and let the predictor variable follow an autoregressive model of the form $X_t = \mu_x + \phi_x X_{t-1} + \epsilon_t$, with $0 < \phi_x < 1.0$ (footnote 15). Under these conditions, Boudoukh and Richardson (1994) show that the asymptotic variance of the overlapping estimator of $\beta(k)$, denoted by $\hat{\beta}_O(k)$, is given by
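$$T\,\mathrm{Var}[\hat{\beta}_O(k)] = \frac{\sigma_R^2}{\sigma_X^2}\left[k + \frac{2\phi_x}{1-\phi_x}\left((k-1) - \frac{\phi_x\,(1-\phi_x^{\,k-1})}{1-\phi_x}\right)\right] \qquad (39)$$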
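To see how (38) and (39) compare numerically, the following is a minimal Python sketch; it is not taken from Boudoukh and Richardson (1994), and the function names are hypothetical. It assumes the null of no predictability with serially uncorrelated single-period returns and a first-order autoregressive predictor, in which case the overlapping-data variance can be written in series form as $(\sigma_R^2/\sigma_X^2)\,[k + 2\sum_{j=1}^{k-1}(k-j)\phi_x^j]$.

```python
def nonoverlapping_avar(k, var_r=1.0, var_x=1.0):
    """T * Var of the nonoverlapping slope estimator, as in (38)."""
    return k ** 2 * var_r / var_x


def overlapping_avar(k, phi, var_r=1.0, var_x=1.0):
    """T * Var of the overlapping slope estimator, series form.

    Assumptions (illustrative): no predictability under the null,
    uncorrelated single-period returns, AR(1) predictor with parameter phi.
    """
    weighted = sum((k - j) * phi ** j for j in range(1, k))
    return (var_r / var_x) * (k + 2.0 * weighted)


def effective_nonoverlapping_obs(T, k, phi):
    """Number of nonoverlapping k-period observations that would give
    the same asymptotic precision as T overlapping single-period steps."""
    return (T / k) * nonoverlapping_avar(k) / overlapping_avar(k, phi)


if __name__ == "__main__":
    for phi in (0.0, 0.5, 0.9, 0.99):
        print(phi, round(effective_nonoverlapping_obs(720, 60, phi), 2))
```

Under these assumptions, T = 720, k = 60 and $\phi_x = 0.99$ give roughly 14.5 equivalent nonoverlapping observations, compared with the 12 five-year intervals actually available. This is in line with the figure of 14 five-year intervals attributed to Boudoukh and Richardson (1994) in this section. With $\phi_x = 0$ the function returns T, which reflects the factor-of-k gain from overlapping data.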
Note that while the asymptotic variances of both the nonoverlapping and overlapping estimators, $\hat{\beta}(k)$ and $\hat{\beta}_O(k)$, increase with the measurement interval of returns, k, the asymptotic variance of the latter also increases with $\phi_x$, the autoregressive parameter of the predictor variable process. In fact, Boudoukh and Richardson (1994) show that with 720 months of data and $\phi_x = 0.99$ (a sample size and autoregressive parameter common to several long-run studies), $\hat{\beta}_O(k)$ based on five-year overlapping intervals would be as efficient as the estimator $\hat{\beta}(k)$ based on only 14 five-year nonoverlapping intervals! The importance of the autoregressive parameter $\phi_x$ in reducing the efficiency gains from using overlapping data can be seen directly from a comparison of (38) and (39): with $\phi_x = 0$, the nonoverlapping estimator is less efficient by a factor of k, the length of the long-term interval.
Unfortunately, an intuitively appealing alternative approach to resolving this small-sample problem may actually be worse than using overlapping data, in spite of the fact that this approach has the advantage of avoiding the calculation of autocorrelation-consistent standard errors. Specifically, following Jegadeesh (1991), Hodrick (1992) suggests that $\beta(k)$ in (26) be estimated by using single-period returns as the dependent variable, while using the predictor variable aggregated over k periods [see also Cochrane (1991)]. Although the asymptotic efficiencies of this alternative estimator, $\hat{\beta}_A(k)$, and the overlapping estimator, $\hat{\beta}_O(k)$, are identical under the assumption that $\phi_x = 0$, Boudoukh and Richardson (1994) show that given the finite history of data available to us, the efficiency of $\hat{\beta}_A(k)$ is much lower than the efficiency of $\hat{\beta}_O(k)$, especially the larger the measurement interval, k, and the higher the autocorrelation in the predictor variable.
This lower efficiency is primarily due to the fact that the denominator of $\hat{\beta}_A(k)$ is a k-period variance of $X_t$, while the denominator of $\hat{\beta}_O(k)$ is only a single-period variance of $X_t$. In finite samples, the k-period variance of $X_t$ will be measured much more inefficiently than its single-period variance.
The above discussion therefore suggests that commonly used approaches to resolving the small-sample problem inherent in long-run studies may be unsatisfactory. Does this imply that long-run regressions have a bleak future? The answer clearly is no. From an economic standpoint, most rational or irrational sources of predictability may be discernible only in the long run (see Sections 3.1 and 3.4). And ongoing research suggests that even from a statistical standpoint long-run regressions may be informative, in spite of the small-sample-related efficiency problems associated with such regressions. For example, Stambaugh's (1993) recent work suggests that violations of OLS assumptions for regressions similar to (26) [for example, the well-documented heteroskedasticity in stock returns not directly dealt with in this review] may actually enhance the efficiency of long-run regressions relative to their short-run counterparts; and the relative efficiency gain is even greater for overlapping versus nonoverlapping long-run regressions. Also, the work of Campbell (1993) and Stambaugh (1993) shows that the efficiency gains from overlapping data are magnified for nonzero $\beta(k)$ alternatives in (26).
Footnote 15. A first-order autoregressive model for $X_t$ may be appropriate because, although most predictor variables have autocorrelations at lag 1 that are close to 1.0, higher-order autocorrelations typically decay fairly rapidly [see Keim and Stambaugh (1986)].
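The denominator argument for the alternative estimator can be made explicit with a short derivation. This is a sketch under stated assumptions rather than the derivation in the papers cited. It assumes that single-period returns $R_t$ and the predictor $X_t$ are jointly covariance stationary. The symbols $b_O(k)$ and $b_A(k)$ are introduced here for illustration only; they denote the population slopes of the overlapping and the aggregated-predictor regressions, up to whatever scaling convention is used for $\hat{\beta}_A(k)$, which this section does not spell out.

$$
\begin{aligned}
b_O(k) &= \frac{\mathrm{Cov}\left(\sum_{i=1}^{k} R_{t+i},\; X_t\right)}{\mathrm{Var}(X_t)}, \qquad
b_A(k) = \frac{\mathrm{Cov}\left(R_{t+1},\; \sum_{i=0}^{k-1} X_{t-i}\right)}{\mathrm{Var}\left(\sum_{i=0}^{k-1} X_{t-i}\right)}, \\
\mathrm{Cov}\left(\sum_{i=1}^{k} R_{t+i},\; X_t\right)
&= \sum_{i=1}^{k} \mathrm{Cov}(R_{t+i},\, X_t)
= \sum_{i=1}^{k} \mathrm{Cov}(R_{t+1},\, X_{t+1-i})
= \mathrm{Cov}\left(R_{t+1},\; \sum_{i=0}^{k-1} X_{t-i}\right).
\end{aligned}
$$

The middle equality uses stationarity to shift each covariance back by $i-1$ periods. The two slopes therefore share a numerator and differ only in the denominator: a single-period variance of $X_t$ in one case and the variance of a k-period sum in the other, the latter being the quantity that is hard to estimate precisely in short samples.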
5. Conclusion
In this paper, I attempt to provide a review of the broad spectrum of empirical methods commonly used to uncover predictable patterns in stock returns. I have made a conscious effort to limit discussion of empirical facts to the extent that they are relevant to (and perhaps motivate) the development and/or application of new statistical techniques. This review therefore concentrates on the statistical properties of the most widely used techniques. I have presented both the strengths and shortcomings of the statistical procedures because there is no substitute for robust empirical "facts." Robust facts become the basis for most subsequent theoretical and empirical research (footnote 16). Specifically, given that stock returns contain predictable components, it is then imperative to determine the economic significance of such predictability. Broadly speaking, two approaches have recently been used to evaluate the economic significance of stock-return predictability. The first approach attempts to assess whether the predictability is due to "animal spirits" or time-varying risk premia using different econometric and modeling techniques [see, for example, Bekaert and Hodrick (1992), Bollerslev and Hodrick (1995), Fama and French (1993), Ferson and Harvey (1991), Ferson and Korajczyk (1995), and Jones and Kaul (1996)]. The second approach involves a determination of the uses of predictability to investors making asset allocation decisions. For example, Breen, Glosten, and Jagannathan (1989) show that the predictability of stock returns using treasury bill rates has economic significance in the sense that the services of a portfolio manager who makes use of the forecasting model to shift funds between bills and stocks would be worth an annual management fee of 2% of the value of the managed assets [see also Pesaran and Timmerman (1995)]. In a more recent paper, Kandel and Stambaugh (1996) demonstrate that even statistically weak predictability of asset returns can materially affect a risk-averse Bayesian investor's portfolio decisions.
Footnote 16. Of course, given that most empirical studies in finance are based on historical data of surviving firms, any stylized fact has to outlive biases induced by the use of survived data [see Brown, Goetzmann, and Ross (1995)].
References
Allen, F. and R. Karjalainen (1993). Using genetic algorithms to find technical trading rules. Working Paper, University of Pennsylvania, Philadelphia, PA.
Bahadur, R. R. (1960). Stochastic comparison of tests. Ann. Math. Statist. 31, 276–297.
Balvers, R. J., T. F. Cosimano, and B. McDonald (1990). Predicting stock returns in an efficient market. J. Finance 45, 1109–1128.
Ball, R., S. P. Kothari, and J. Shanken (1995). Problems in measuring portfolio performance: An application to contrarian investment strategies. J. Financ. Econom. 38, 79–107.
Bartlett, M. S. (1946). On the theoretical specification of sampling properties of autocorrelated time series. J. Roy. Statist. Soc. 27, 1120–1135.
Bekaert, G. and R. J. Hodrick (1992). Characterizing predictable components in equity and foreign exchange rates of return. J. Finance 47, 467–509.
Bollerslev, T., R. Y. Chou, and K. F. Kroner (1992). ARCH modeling in finance: A review of theory and empirical evidence. J. Econometrics 52, 5–59.
Bollerslev, T. and R. J. Hodrick (1995). Financial market efficiency tests. In: M. Hashem Pesaran and Mike Wickens, eds., Handbook of Applied Econometrics. Basil Blackwell, Oxford, UK.
Boudoukh, J. and M. P. Richardson (1994). The statistics of long-horizon regressions revisited. Math. Finance 4, 103–119.
Boudoukh, J., M. P. Richardson, and R. F. Whitelaw (1994). A tale of three schools: Insights on autocorrelations of short-horizon security returns. Rev. Financ. Stud. 7, 539–573.
Box, G. E. P. and D. A. Pierce (1970). Distribution of the residual autocorrelations in autoregressive moving average time series models. J. Amer. Statist. Assoc. 65, 1509–1526.
Breen, W., L. R. Glosten, and R. Jagannathan (1989). Economic significance of predictable variations in stock returns. J. Finance 44, 1177–1189.
Brown, S. J., W. N. Goetzmann, and S. A. Ross (1995). Survival. J. Finance 50, 853–873.
Campbell, J. Y. (1987). Stock returns and the term structure. J. Financ. Econom. 18, 373–399.
Campbell, J. Y. (1991). A variance decomposition for stock returns. Econom. J. 101, 157–179.
Campbell, J. Y. (1993). Why long horizons? A study of power against persistent alternatives. Working Paper, Princeton University, Princeton, NJ.
Campbell, J. Y. and R. J. Shiller (1988a). The dividend-price ratio and expectations of future dividends and discount factors. Rev. Financ. Stud. 1, 195–227.
Campbell, J. Y. and R. J. Shiller (1988b). Stock prices, earnings, and expected dividends. J. Finance 43, 661–676.
Campbell, J. Y., A. W. Lo, and A. C. MacKinlay (1993). Present value relations. In: The Econom. of Financ. Markets. Massachusetts Institute of Technology, Cambridge, MA.
Chen, N. (1991). Financial investment opportunities and the macroeconomy. J. Finance 46, 529–554.
Chen, N., R. Roll, and S. A. Ross (1986). Economic forces and the stock market. J. Business 59, 383–403.
Cochrane, J. H. (1988). How big is the random walk in GNP? J. Politic. Econom. 96, 893–920.
Cochrane, J. H. (1991). Volatility tests and efficient markets: A review essay. J. Monetary Econom. 27, 463–485.
Conrad, J. and G. Kaul (1988). Time-variation in expected returns. J. Business 61, 409–425.
Conrad, J. and G. Kaul (1989). Mean reversion in short-horizon expected returns. Rev. Financ. Stud. 2, 225–240.
Conrad, J. and G. Kaul (1994). An anatomy of trading strategies. Working Paper, University of Michigan, Ann Arbor, MI.
Cutler, D. M., J. M. Poterba, and L. M. Summers (1991). Speculative dynamics. Rev. Econom. Stud. 58, 529–546.
Daniel, K. and W. Torous (1993). Common stock returns and the business cycle. Working Paper, University of Chicago, Chicago, IL.
DeBondt, W. and R. Thaler (1985). Does the stock market overreact? J. Finance 40, 793–805.
Evans, M. D. D. (1994). Expected returns, time-varying risk, and risk premia. J. Finance 49, 655–679.
Fama, E. F. (1965). The behavior of stock market prices. J. Business 38, 34–105.
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. J. Finance 25, 383–417.
Fama, E. F. (1990). Stock returns, expected returns, and real activity. J. Finance 45, 1089–1108.
Fama, E. F. (1991). Efficient capital markets: II. J. Finance 46, 1575–1617.
Fama, E. F. and K. R. French (1988a). Permanent and temporary components of stock prices. J. Politic. Econom. 96, 246–273.
Fama, E. F. and K. R. French (1988b). Dividend yields and expected stock returns. J. Financ. Econom. 22, 3–27.
Fama, E. F. and K. R. French (1989). Business conditions and expected returns on stocks and bonds. J. Financ. Econom. 25, 23–49.
Fama, E. F. and G. W. Schwert (1977). Asset returns and inflation. J. Financ. Econom. 5, 115–146.
Faust, J. (1992). When are variance ratio tests for serial dependence optimal? Econometrica 60, 1215–1226.
Ferson, W. E. and C. R. Harvey (1991). The variation of economic risk premiums. J. Politic. Econom. 99, 385–415.
Ferson, W. E. and R. A. Korajczyk (1995). Do arbitrage pricing models explain the predictability of stock returns? J. Business 68, 309–349.
Ferson, W. E. and R. W. Schadt (1995). Measuring fund strategy and performance in changing economic conditions. J. Finance, to appear.
Fisher, L. (1966). Some new stock-market indexes. J. Business 39, 191–225.
Flood, K., R. J. Hodrick, and P. Kaplan (1987). An evaluation of recent evidence on stock market bubbles. Working Paper 1971, National Bureau of Economic Research, Cambridge, MA.
Foster, F. D. and T. Smith (1992). Assessing goodness-of-fit of asset pricing models: The distribution of the maximal R². Working Paper, Duke University, Durham, NC.
French, K. R., G. W. Schwert, and R. F. Stambaugh (1987). Expected stock returns and volatility. J. Financ. Econom. 19, 3–29.
Fuller, W. (1976). Introduction to Statistical Time Series. Wiley & Sons, New York.
Geweke, J. (1981). The approximate slope of econometric tests. Econometrica 49, 1427–1442.
Gibbons, M. and W. E. Ferson (1985). Testing asset pricing models with changing expectations and an unobservable market portfolio. J. Financ. Econom. 14, 217–236.
Goetzmann, W. N. (1993). Patterns in three centuries of stock market prices. J. Business 66, 249–270.
Goetzmann, W. N. and P. Jorion (1993). Testing the predictive power of dividend yields. J. Finance 48, 663–679.
Gordon, M. J. (1962). The investment, financing, and valuation of the corporation. Irwin, Homewood, IL.
Granger, C. W. J. and O. Morgenstern (1963). Spectral analysis of New York stock market prices. Kyklos 16, 1–27.
Granger, C. W. J. and P. Newbold (1974). Spurious regressions in econometrics. J. Econometrics 2, 111–120.
Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1057.
Hansen, L. P. and R. J. Hodrick (1980). Forward exchange rates as optimal predictors of future spot rates: An econometric analysis. J. Politic. Econom. 88, 829–853.
Hirshleifer, J. (1975). Speculation and equilibrium: Information, risk, and markets. Quart. J. Econom. 89, 519–542.
Hodrick, R. J. (1992). Dividend yields and expected stock returns: Alternative procedures for inference and measurement. Rev. Financ. Stud. 5, 357–386.
Jagannathan, R. and Z. Wang (1996). The conditional CAPM and the cross-section of expected returns. J. Finance 51, 3–54.
Jegadeesh, N. (1990). Evidence of predictable behavior of security returns. J. Finance 45, 881–898.
Jegadeesh, N. (1991). Seasonality in stock price mean reversion: Evidence from the U.S. and the U.K. J. Finance 46, 1427–1444.
Jegadeesh, N. and S. Titman (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. J. Finance 48, 65–91.
Jones, C. M. and G. Kaul (1996). Oil and the stock markets. J. Finance 51, 463–492.
Kandel, S. and R. F. Stambaugh (1989). Modeling expected stock returns for short and long horizons. Working Paper, University of Chicago, Chicago, IL.
Kandel, S. and R. F. Stambaugh (1990). Expectations and volatility of consumption and asset returns. Rev. Financ. Stud. 3, 207–232.
Kandel, S. and R. F. Stambaugh (1996). On the predictability of stock returns: An asset-allocation perspective. J. Finance 51, 385–424.
Kaul, G. and M. Nimalendran (1990). Price reversals: Bid-ask errors or market overreaction? J. Financ. Econom. 28, 67–83.
Keim, D. and R. F. Stambaugh (1986). Predicting returns in the stock and bond markets. J. Financ. Econom. 17, 357–390.
Kendall, M. G. (1953). The analysis of economic time-series, Part I: Prices. J. Roy. Statist. Soc. 96, 11–25.
Kendall, M. G. and A. Stuart (1976). The Advanced Theory of Statistics. Vol. 1. Charles Griffin, London.
Kim, M. J., C. Nelson, and R. Startz (1991). Mean reversion in stock prices? A reappraisal of the empirical evidence. Rev. Econom. Stud. 58, 515–528.
Lehmann, B. N. (1990). Fads, martingales, and market efficiency. Quart. J. Econom. 105, 1–28.
LeRoy, S. F. (1973). Risk aversion and the martingale property of stock returns. Internat. Econom. Rev. 14, 436–446.
LeRoy, S. F. (1989). Efficient capital markets and martingales. J. Econom. Literature 27, 1583–1621.
LeRoy, S. F. and R. D. Porter (1981). Stock price volatility: Tests based on implied variance bounds. Econometrica 49, 97–113.
Lo, A. W. (1991). Long-term memory in stock prices. Econometrica 59, 1279–1314.
Lo, A. W. and A. C. MacKinlay (1988). Stock market prices do not follow random walks: Evidence from a simple specification test. Rev. Financ. Stud. 1, 41–66.
Lo, A. W. and A. C. MacKinlay (1989). The size and power of the variance ratio test in finite samples: A Monte Carlo investigation. J. Econometrics 40, 203–238.
Lo, A. W. and A. C. MacKinlay (1990a). When are contrarian profits due to market overreaction? Rev. Financ. Stud. 3, 175–205.
Lo, A. W. and A. C. MacKinlay (1990b). An econometric analysis of nonsynchronous trading. J. Econometrics 45, 181–211.
Lo, A. W. and A. C. MacKinlay (1990c). Data-snooping biases in tests of financial asset pricing models. Rev. Financ. Stud. 3, 431–467.
Lo, A. W. and A. C. MacKinlay (1992). Maximizing predictability in the stock and bond markets. Working Paper, Massachusetts Institute of Technology, Cambridge, MA.
Lucas, R. E. (1978). Asset prices in an exchange economy. Econometrica 46, 1429–1446.
Mandelbrot, B. (1966). Forecasts of future prices, unbiased markets, and 'martingale' models. J. Business 39, 394–419.
Mandelbrot, B. (1972). Statistical methodology for nonperiodic cycles: From the covariance to R/S analysis. Ann. Econom. Social Measurement 1, 259–290.
Mankiw, N. G., D. Romer, and M. D. Shapiro (1991). Stock market forecastability and volatility: A statistical appraisal. Rev. Econom. Stud. 58, 455–477.
Mankiw, N. G. and M. D. Shapiro (1986). Do we reject too often? Econom. Lett. 20, 139–145.
Marriott, F. H. C. and J. A. Pope (1954). Bias in estimation of autocorrelations. Biometrika 41, 390–402.
Muthuswamy, J. (1988). Asynchronous closing prices and spurious autocorrelations in portfolio re