Sunteți pe pagina 1din 98

-2-

INTRODUCnON
Artificial neural networks are a class of models developed by cognitive scientists

interested in understanding how computation is performed by the brain. These networks are
capable of learning through a process of trial and error that can be appropriately viewed as sta-

tistical estimation of model parameters. Although inspired by certain aspects of the way infonnation is processed in the brain, these network models and their associated learning paradigms are still far from anything
clo~e to a realistic description of how brains actually work. They nevertheless provide a rich,

powerful and interesting modeling framework with proven and potential application across the sciences. To mention just a handful of such applications, artificial neural networks have been successfully used to translate printed English text into speech (Sejnowski and Rosenberg, 1986), to recognize hand-printed characters (Fukushima and Miyake, 1984), to perform complex coordination tasks (Selfridge, Sutton and Barto, 1985), to play backgammon (Tesauro, 1989), to diagnose chest pain CEaxt, 1991), and to decode deterministic chaos (Lapedes and Farber, 1987; White, 1989; Ga11ant and White ,1991). Successesin these and other areas suggest that artificial neural network models may
serve as a useful addition to the tool-kits of economists and econometricians. Areas with particu-

lar potential for application include time-series modeling and forecasting, nonparametric estimation, and learning by economic agents. The purpose of this article is two-fold: first, to review the basic concepts and theory required to make artificial neural networks accessible to economists and econometricians, with particular focus on econometrically relevant methodology; and second, to develop theory for a
leading neural network learning paradigm to a point comparable to that of the modem theory of

estimation and inference for misspecified nonlinear dynamic models (e.g., Gallant and White, 1988a; Potscher and Prucha, 1991a,b). As we hope will become apparent from our development, not only do artificial neural networks have much to offer economics and econometrics, but there is also considerable

-3 -

potential for economics and econometrics to benefit the neural network field, arising to a considerable degree from economic and econometric experience in modeling and estimating dynamic
systems. Thus, a larger goal of this article is to provide an entry point and appropriate back-

ground for those wishi:tlg to engage in tl}e fascinating intellectual arbitrage required to fully realize the potential gains from trade between economics, econometrics and artificial neural networks.

PART I: OVERVmW 1.1. ARTIFICIAL

AND HEURISTICS MODEL~

NEURAL NETWORK

The simplest general artificial neural network (ANN) models draw primarily on three features of the way that biological neural networks process information: massive parallelism, nonlinear neural unit response to neural unit input, and processing by multiple layers of neural
units. Incorporation of a fourth feature, dynamic feedback among units, leads to even greater

generality and richness. In this section, we describe how these features are embodied in now standard approaches to ANN modeling, and some of the implications of these embodiments. Because of the very considerable breadth of ANN paradigms, we cannot do justice to the entire spectrum of such models; instead, we focus our attention on those most easily related to and with greatest relevance for econometrics. Although not usua1ly thought of in such terms, para1lelism is a familiar aspect of
econometric modeling. A schematic of a simple parallel processing network is shown in Figure

1. Here, input unit ("sensors") send real-valued signals (Xi, i = 1, ..., r) in parallel over connections to subsequent units, designate,d"output units" for now. The signal from input unit i to output unit j may be attenuated or amplified by a factor r ji E IR, so that signals Xi r ji reach output unit j, i = 1, ..., r. The factors r ji are known as "network weights" or "connection strengths."

In simple ANN models, the receiving units process parallel incoming signals in typically simple ways. The simplest is to add the signals seen by the receiver, in which case the output unit produces output

r L Xi r ji , i=l

1, ...,

v.

If, as is common, we permit an input, say Xo, to supply Xo = 1 to the network ( a "bias unit" in network jargon), output can be represented as

/j(x, r) =x'rj

j=l,...,v,

or

(x,

r)

= (1!&1

x)r

where f = (f1, ..., fv)~, x =(1, X1, ..., xr)~, r = <r~1, ..., r~v)~, and rj = <rjO, rj1,

..., rjr)~.

The

"out-

put function"

f is easily recognized

as the systematic part of the standard system of seemingly

unrelated (linear) equations; in the neural network literature, an electronic version of this network was introduced as the MADALINE
Hoff(1960).

(Multiple Adaptive Linear) network by Widrow and


network of Widrow and

When v = 1 (only a single output), we have the ADALlNE

Hoff (1960), easily recognized as the simple linear model, the workhorse of empirical econometrics.
In biological neural systems, the number of processing units can range into the mil-

lions or billions and beyond (hence the teml "massive" parallelism). While such numbers are not usually encountered in economic models, the essential feature of parallel processing is common to both. From the outset of their development, the behavior of artificial neural networks was fornlulated to include another stylized feature of biological systems. This is the tendency of certain types of neurons to be quiescent in the presence of modest levels of input activity, and to become active themselves only after input activity passes a particular threshold. Beyond this threshold, increases in input activity have little further effect This introduces the fundamental
feature of nonlinear response into the ANN paradigm.

For present purposes, it suffices to think either of neural units switching on or off, or to imagine a single dimension along which neural activity (e.g. neural firing rate) can smoothly vary from fully off to fully on. In their seminal article, McCulloch and Pitts (1943) considered

-5 -

the first possibility, proposing networks with output unit activity given by

h(x,

y)

G(x'yj)

j = 1,

, v,
the "Heaviside" or

where G(a) = 1 if a > 0 and G(a) = 0 if a $; 0. This choice for G implements

unit step function. Output unit j thus turns on when X'rj > 0, i.e. when input activity L;=l Xirji exceeds the threshold -rjo. For this reason the Heaviside function is said to implement a "threshold logic unit" (1LU). G is called the "activation function" of the (output) unit Networks with TLU's are appropriate for classification and recognition tasks: the study of such networks exclusively pre-occupied the ANN field through the 1950's and dominated the field through the 1960's. In retrospect, a major breakthrough in the ANN literature

occurred when it was proposed to replace the Heaviside activation function with a smooth sigmoid (s-shaped) function, 1967). Instead of switching in particular the logistic function, G(a) = 1/(1 + exp(-a (Cowan,

abruptly from off to on, "sigmoidal"

units turn on gradually as input from the ANN standpoint will however, binary logit we observe that model

activity increases. The reason why this constituted a breakthrough


be discussed in the next section.

With

this

modification, the familiar

/j(x, r) = G(x'rj)

= 1/(1 + exp(-x'rj

is precisely

probability

(e.g. Amemiya, 1981; 1985, p. 268). Other choices for G yield other models appropriate for classification or qualitative response modeling; for example, if G is the normal cumulative distribution function, we have the binary probit model, etc. As Amemiya (1981) documents in his
classic survey, such models have great utility in econometric applications where binary

classifications or decisions are involved. Although biological networks with direct connections from input to output units are well-known (e.g.. the knee-jerk reflex is mediated by direct connections from sensory receptors in the knee onto motoneurons in the spinal cord that then activate leg muscles), it is much more common to observe processing occurring in multiple layers of units. For example, six distinct processing layers are at work in the human cortex. Such multilayered structures were introduced into the ANN literature by Rosenblatt (1957, 1958) and by Gamba and his associates (palmieri

-6 -

and Sanna, 1960; Gamba, et. al., 1961). Figure 2 shows a schematic diagram of a network containing a single intemlediate layer of processing units separating input from output. Intermediate

layers of this sort are often caned "hidden" layers to distinguish them from the input and output layers. Processing in such networks is straightforward. Units in one layer treat the units in the preceding layer as input, and produce outputs to be processed by the succeeding layer. The output function for such a network with a single hidden (as in Figure 2) is thus of the form

fh(X, ()) = F ({3hO + LJ=l Here F: 1R ~

G(X"rj){3hj), function,

h =

1, ...,

(1.1.1)

1R is the output activation

and {3hj, j = 0, 1, ..., q, h = 1, ..., v are con.

nection strengths from hidden unit j ( j = O indexes a bias unit) to output unit h. The vector

8 = ({3'1,

,j3'v,r'l,

.."r'q)

(with

j3'h = (j3hO. ...,j3hq))

collects

together

all network

weights.

Note that we have q hidden units.


As originally introduced, the hidden layer network activation functions F and G

implemented

11..U's.

However,

modern practice permits F and G to be chosen quite freely. (the logistic) simplicity or F(a) = a (the identity) generality, and we

Leading choices are F(a) = G(a) = 1/(1 + exp(-a)) G(a) = 1/(1 + exp(-a)). Because of its notational

and considerable

adopt the latter choice, and for further simplicity

set v = 1. Thus we shall pay particular attention

to "single hidden layer" networks with output functions of the form

f(x,

q 0) = 130 + L j=l

G(x'rj)13j

(1.1.2)

Although we have seen econometrically familiar models emerge in our foregoing discussion of ANN models (e.g. seemingly unrelated regression systems and logit models), equation (1.1.2) is not so familiar. .It does bear a strong resemblance to the projection pursuit models of modem statistics (Friedman and Stuetzle, 1981; Huber, 1985) in which output response is given by

7q f(x, f) =/30 + L -a Gj(X'rj)fJj.

j=l

However, in projection pursuit models the functions Gj are unknown and must be estimated from data (perrnittingf3 j to be absorbed into Gj), whereas in the hidden layer network model (I.l.2), G is given. The hidden layer network model is thus somewhat simpler than the projection pursuit model.
A variant of the single hidden layer network that is particularly relevant for

econometric applications is depicted in Figure 3. This network has direct connections from the
Input to output layer as well as a single ruaaen layer. output tor this network can be expressed

as

f(x,

0) = F(x'a

q +f3o + L G(x'rj)f3j), j=l

(1.1.3)
weights, and () is now taken to be , {3q)' we nest as

where

is

r x 1

vector

of

input-output
choice

(} = a', f3o,

, {3q, r'l

, ..., r'q)'.

By suitable

of G, a and {3 = ({30, {31,

special cases a1l of the networks discussed so far.


In particular, with F(a) = a (the identity) we have a standard linear model aug-

mented by nonlinear terms. Given the popularity of linear models in econometrics, this form is
particularly appealing, as it suggests that ANN models can be viewed as extensions of, rather to, the familiar models. The hidden unit activations can then be viewed as

than as alternatives

latent variables whose inclusion enriches the linear model. We shall refer to an ANN model with
output of the form (1.1.3) as an "augmented" single hidden layer network. Such networks will

play an important role in the discussion of subsequent sections.


What originally commanded the attention and excitement of a diverse range of dis-

ciplines was the demonstrated successesthat models of the form (1.1.1) and (1.1.2) had in solving previously intractable classification, forecasting and control problems, or in producing superior solutions to difficult problems in orders of magnitude less time than traditional approaches. Until recently, a theoretical basis for such successes was unknown --artificial neural networks just

-8 -

seemedto work surprisingly well.


Motivated by a desire either to delineate the limitations of network models or to

understand their diverse successes, a number of researchers independently produced rigorous


results establishing that functions of the form (1.1.2) can be viewed as I'universal approx.imators,"

that is, as a flexible functional form that, provided with sufficiently many hidden units and properly adjusted 'parameters, can approximate an arbitrary function 9 : IR r -:; IR arbitrarily well in

useful spacesof functions. Results of this sort have been given by Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik, Stinchcombe and White (1989, 1990) (HSWa, HSWb) and Stinchcombe and White (1989), among others. The flavor of such results is conveyed by the following Theorem 2.4 of HSWa. paraphrase of part of

THEOREMI.1.1:
};'(G)={f: {30 E JR'~

For r E IN, let Ir(G)


JRlf(x)={30+LJ=IG(X'rj){3j,XE G: JR ~ [0,1]

be the class of hidden


JR';rjE distribution

layer
JR'+I,{3jE

network

functions

JR,j=I,...,q; Then};'(G) is

JR, q E 1]J\l}, where

is any cumulative

function.

unifonnly

dense on compacta in C( 1R~, Le. for every 9 in C( 1R~, every compact subset K of
such that Supx E K I f(x) -g(x) I < E.

/Rr, and every E > 0, there exists f E Ir(G)

Thus, the biologically inspired combination of parallelism, nonlinear response and multilayer processing leads us to a class of functions that can approximate members of the useful class C( 1R~ arbitrarily well. Similar results hold for network models with general (not necessarily sigmoid) activation functions approximating functions in Lp spaces with compactly supported measures, and, as HSWb and Hornik (1991) show, in general Sobolev spaces. Thus, functions of the form (1.1.2) can approximate a function and its derivatives arbitrarily well, and in this sense are as
flex.ible as Ga1lant's (1981) flex.ible Fourier form. Indeed, Ga1lant and White (1988b) construct a

sigmoid choice for G (the "cosine squasher") that nests Fourier series within (1.1.2), so that the flexible Fourier form is a special case of (1.1.3) even for sigmoid G.

-9 -

The econometric usefulness of the flexible form (1.1.2) has been further enhanced by
Hu and Joerding (1990) and Joerding and Meador (1990), who show how to impose constraints

ensuring monotonicity and concavity (or convexity) of the network output function. interested reader is referred to these papers for details.
An issue of both theoretical and practical importance is the "degree of approxima-

tion" problem: how rapidly does the approximation to an arbitrary function improve number of hidden units q increases? Classic results for Fourier series are provided by Edmunds and Moscatelli (1977). Similar results for ANN models are only beginning to appear, and so far are not as sharp as those for Fourier series. Barron (1991a) exploits results of Jones (1991) to
establish essentially that 11/- 9 112= O(l/q 1/2) ( 11.112 denotes an L2 norm) when I is an element

of };r(G)

having q hidden units and continuously

differentiable

sigmoid activation condition

function, and on its Fourier

9 belongs to a certain class of smooth functions satisfying a summability transform. An important

open area for further work is the extension and deepening of results of

this sort, especially as such results may provide key insight into advantages and disadvantages of ANN models compared to standard flexible function families. Degree of approximation results are also necessary for establishing rates of convergence for nonparametric estimation based on
ANN models.

Our focus so far on networks with a single hidden layer is justified by their relative simplicity and their approximative power. However, if nature is any guide, there are advantages to using networks of many hidden layers, as depicted in Figure 4. Output of an l-layer network
can be represented as

ahi = Gh(Ahi(ah-l))

i =

1, ...,

qh;

h =

1, ...,

1,

where ah is a qh X 1 vector with elements ahi, Ahi(.) is an affine function


Ahi(a) = tI'rhi for some (qh + 1) x 1 vector rhi, 2 = (1, a)), Oh is the

of its argument (i.e.


function for

activation

units of layer h, ao = x, qo = r, and ql = v. The single hidden layer networks discussed above correspond to 1 = 2 in this representation.

-10

An interesting

open question

is to what

extent

networks

with

1 ? 3 layers

may be

preferable to networks with 1 = 2 layers.

Specifically,

for what classes of functions

can a three

layer network achieve a given degree of accuracy with fewer connections (free parameters) than a two layer network? Examples are known in which a two layer network cannot exactly

represent a function exactly representable by a three layer network (Blum and Li, 1991), and it is

known that certain mappings containin2 discontinuities relevant in control theory ~~n hp. Imiformly approximated in three but not two layers (Sontag, 1990). HSWa (Corollary 2.7) have shown that additional layers cannot hurt, in the sensethat approximation properties of single hidden layer networks (I = 2) carry over to multi-hidden layer networks. Further research in this interesting area is needed.
A further generalization of the networks represented by (1.1.4) is obtained by replac-

ing the affine function Ahi(.) with a polynomial Phi(.) with degree possibly dependent on i and h.
This modification yields a class of networks containing as a special case the so-called "sigma-pi"

(Ell) networks (Maxwell, Ones, Lee and Chen. 1986: Williams. lCJRn) Stinrhrombe (1991) hM studied the approximation properties of networks for which an arbitrary "inner function" Ihi

replaces Am in (!.1.4)
The richness of this class of network models is now fairly apparent. However, we
still have not exploited a known feature of biological networks, that of internal feedback.

Returning to the relatively

simple single hidden layer networks, such feedbacks can be

represented schematically as in Figure 5. In Figure 5(a), network output feeds back into the hidden layer with a time delay, as proposed by Jordon (1986). In Figure 5(b), hidden layer output Thc outvul

feerl~ h~r.k into th~ hidd~n layer with a time del~y, as proposcd by Elma.n (1988). function of the Elman network can thus be represented as

q fr(xt, (}) = /30 + L atj /3j j=l


atj = G(Xt'rj + a 't-I 8j),

j=I,...,q;t=O,I,2,...,

(1.1.5)

11-

where

at

(atl'

...,

atq)'.

As a consequence of tl1is feedback, network output depends on the ini-

tial value ao, and the entire history of system inputs, xt = (XI' ..., xt).

Such networks are capable of rich dynamic behavior, exhibiting memory and context sensitivity. Because of the presence of internal feedbacks, these networks are referred to in the literature as "recurrent networks," while networks lacking feedback (e.g., with output functions
G.l.3)) are desi2oated "feedforwarrl In econometric dynamic latent variables netwnrk~ II as a nonlinear applications in

terms, a model of the form (1.1.5) can be viewed model. Such models have a great many potential

economics and finance. Their estimation would appear to present some serious computational challenges (see e.g. Hendry and Richard, 1990, and Duffle and Singleton, 1990), but in fact some straightforward recursive estimation procedures related to the Kalman filter can deliver consistent estimates of model parameters (Kuan, Homik and White, 1990; Kuan and White, 1991). We discuss this further in the next section.
Although we have covered a fair amount of grounrl in thi~ ~p~tion. we have only

scratched the surface of the modeling possibilities offered by artificial neural networks. To mention some additional models treated in the ANN literature, we note that fully interconnected networks have been much studied (with applications to such areas as associative memory and solu-

tion of problems like the traveling salesman problem; see e.g. Xu and Tsai, 1990, and Xu and Tsai, 1991), and that networks running in continuous rather than discrete time are also standard objects of investigation (e.g. Williams and Zipser, 1989). Although fascinating, these network models appear to be less relevant to econometrics than those discussed so far, and we shall not
treat them fiJrthp.r As rich as ANN models are, they still ignore a host of biologically relevant features.

Neural systems that have taken perhaps billions of years to evolve will take humans a little more time to model exhaustively than the five decades devoted so far! To mention just a few items,
biological neurons communicate over multiple pathways, chernical as well as electrical --the .l;in-

gle communication dimension ("activation") assumed in most ANN models is quite incomplete.

-12-

Also, biological neurons respond to input activity stochastically and in much more complicated ways than as modeled by the sigmoid activation function --neurons output complex spike trains through time, and are in fact not simple processing units. Of course, these and other lirnitations
of ANN models are daily being challenged by ANN modelers, and we may expect a continuing

increase in the richness of ANN models as the diverse interdisciplinary talents of the ANN community are broueht to bear on these issues.
Despite these limitations sufficiently as descriptions attractive of biological reality, ANN models are modeling.

rich as to present a potentially

set of tools for econometric

Given models, the econometrician wants estimators. We take up estimation in the next section, where we encounter additional interesting tools developed by the ANN community in their study of learning in artificial neural networks.

1.2. LEARNING IN ARTIFICIAL NEURAL NETWORKS


The discussion of the previous section establishes ANN models as flexible functional fomls, extending standard linear specifications. As such, they are potentially useful for econometric modeling. To fulfill this potential, we require methods for finding useful values for the free parameters of the model, the network weights.
TO any econometriCian verse a m the standard tools of the trade, a multitude of

relevant estimation procedures for finding useful parameter values present themselves, typically dependent on the behavior of the data generating process and the goals of the analysis.
For example, suppose we observe a realization of a random sequence of s x 1 vec-

tors {Zt = (yt, X't)'}

(assumed stationary for simplicity), and we wish to forecast yt on the basis

of Xt. The minimum mean-squared error forecast of yt given Xt is the conditional expectation
g(XI} = E(YI I XI}. Although the function 9 is unknown, we can attempt to approximate it using a

neural network with some sufficient number of hidden units. If we adopt (1.1.3) with F the iden-

tity, we obtain a regression model of the form

-13-

f(x,

8)

= x'a

+f3o

q + L j=l

G(x'rj)f3j,

where () = (a,/30'/31 , ...,/3 j, r'l,

..., r'q)'

and for simplicity

we choose q and G a priori. we must acknowledge

Because this model is only intended

as an approximation,

from the outset that it is misspecified. Nevertheless, the theory of least squares for ~sspecified nonlinear regression models (White, 1981; 1992, Ch. 5; Domowitz and White. 1982: Gallant :\nci White, 1988a) applies immediately to establish that a nonlinear least squares estimator 9 n solving the problem

n min n-l /I," A L [yt ~=1 -f(Xt. 8)]2

exists and converges almost surely under general conditions as n ~ ~ to 9., the solution to the
problem

where

a~ = E([Yt -g(xJf).

(See Sussman

(1991)

for

discussion

of

issues relating

to

identification.:
Further, under general conditions
a multivariate nonnal distribution with

{;; (0 n -() *: converges in distribution


estimable

as n ~ 00 to
matrix

mean zero and consistently

covariance

(White, 1981; 1992, Ch. 6; Domowitz and White, 1982). Although least squares is a leading case, the properties of the dependent variable yt will often suggest the appropriateness of a Qua."i-ma:ximllmlik~Jihood procedure different from
least squares. For example, if yt is a binary choice indicator taking values O or 1 only, it may be assumed to follow a conditional Bernoulli distribution, given Xto A network model to approxi-

mate g(X,) = P[Yt = 1 I Xt] = E(Yt I Xt) can be specified as

f(x, O) = F(x'a

+ .80 + L G(x'rj) j=l

.8j) ,

{1.2.1)

14 -

where F(.) is now some appropriate c.d.f. (e.g., the logistic or normal). The mean quasi-log likelihood function for a samDle of size n is then

Ln(Zn,

f)

= n-1

n L[Yt t=l

logf(Xt'

f)

+ (1-

yt) log(l-

f(Xt,

f))].

A quasi-maximum

likelihood

estimator

A 8 n solving

the problem

max. BE e

Ln(Zn,

f)

can be shown under general conditions to exist and converge to 0., the solution to the problem

max E[Yt Jog f(Xt. (1) + (1- YJ 1og(lBee .

f(Xt. (1))].

(See White, 1982; 1992, Ch. 3-5.) The solution ()* minimizes the Kullback-Leibler divergence of the approximate probabillly Inul1el f(Xt, 0.) [rUIIl 111t;Uut; g(Xt). fu inl11t; It;~l :)4U(1lt;:) \;;~t;, -I;; (0 n -0 .) I.;UIIVC;lgC;~

in distribution

as n ~ 00 to a multivariate

normal distribution

with mean zero and consistently

estimable covariance matrix (White, 1982; 1992, Ch. 6).


If Ye represents count data, then a Poisson quasi-maximum likelihood procedure is

natural (e.2. Gourieroux. Monfort and Tro2non. 1984a.b). where fis as in G.2.1) with F chosen to ensure non-negativity (e.g. F(a) = exp(a, so as to permit f(Xt, J) to plausibly approximate
g(Xt) = E(Yt I X,). If Yt represents a survival time, then a Cox proportional hazards model (e.g.

A:r:nemiya,1985, pp. 449-454) is a natural choice, with hazard rate of the form )..(t) f(Xt, 9).
From an econometric would ordinarily standpoint, then, ANN models can be used anywhere with estimation one

use a linear (or transformed linear) specification,

proceeding

via appropriate quasi-maximum likelihood (or, alternatively, generalized method of moments) techniques. The now rather well-developed theory of estimation of misspecified models (White, 1982, 1992; Gallant and "Whitc, 1988a; POt~chcJ: and P1-ucha,1991a,b) applic~ immcdiatcly to provide interpretations and inferential procedures.

15-

The natural instincts of econometricians are not the instincts of those concerned with
artificial neural network learning, however. This is a double blessing, because it means not only

that econometrics has much to offer those who study and apply artificial neural networks, but also
that econometrics may benefit from novel techniques developed by the ANN community. In considering how an artificially intelligent system must go about learning, ANN learning ~ thc; l1lU-

modelers from the outset viewed learnin2 as a ~eql1f'.ntiI11 proce~s. Viewins

cess by which knowledge is acquired, it follows that knowledge accumulates as learning experiences occur, Le. as new data are observed.

In ANN

models, knowledge

is embodied in the network

connection

strengths,(}.

" " Given knowledge (} t at time t, knowledge (} t+ 1 at time t + 1 is then

t+l=Ot+Lltt

where dt embodies incremental therefore


current

knowledge

(1eaming).

A successful learning

procedure must CI.11U which

specify
observables,
..

some appropriate
Zt = (yt, X't)"

way to fonn thf'. llpd~te A, from previous knowlcdgc Thus we seek an appropriate function Vlt for

~t

1fItCZt.

()

t).

Current leading ANN learning methods can trace their history from seminal work of Rosenblatt (1957, 1958, 1961) and Widrow and Hoff(1960). Rosenblatt's learning network, the a-perceptron, was concerned with pattern classification and utilized threshold logic units.
Widrow and Hoffs ADALINE networks do not require a nu, as they are not restricted to being

classifiers.

As a consequence, the Widrow-Hoff

(or "delta") learning law could be generalized in

just the right way to pccmit (tJ:JJ:Jlil.;auun tu nonlinear networks.

For their linear networks (with output for now given by f(x, 8) = x' 8) Widrow and Hoff proposed a version of recursive least squares (itself traceable back to Gauss, 1809 --see Young, 1984),

Ot+1

= Ot + a XtCft

-X't

Ot).

(1.2.2)

16 -A Et = yt -X't O t is the "network -A X't O t and the

Here

error"

between

computed

output

"target"

value

yt.

The scalar a > O is a "learning rate" to be adjusted by trial and error. This recursion

was motivated explicitly by consideration of minimizing expected squared error loss. For networks with nonlinear output f(x, 8) the direct generalization of the delta rule
is

" Ot+1

" = Ot + a Vf(Xt.

" t)

(yt -f(Xt.

" t

(1.2.3)

where V f(x, .)is the gradient of f(x, .)with respect to (J (a column vector).

In the ANN literature,

this recursion is called the "generalized delta rule" or the method of "backpropagation" (a term invented for a related procedure by Rosenblatt, 1961). Its discovery is attributable to many (Werbos, 1974; Parker, 1982,1985; Le Cun, 1985), but the influential work of Rumelhart, Hinton and Williams (1986) is perhaps most responsible for its widespread adoption.
This apparently straightforward generalization of (1.2.2) in fact caused a revolution

in the ANN field, spurring the explosive growth in ANN modeling resDonsible for its vi2or today
and the appearance of an article such as this in a journal devoted to econometrics. The reasons

for this revolution are essentially two. First, until its discovery, there were no methods known to ANN modelers for finding good weights for connections into the hidden units. The focus on threshold logic units in multilayer networks in the 1950's and 1960's led researchers away from
gradient methods, as the derivative of a TLU is zero almost everywhere, and does not obviously

lend itself to gradient methods. This is why the introduction of sigmoid activation functions by
Cowan (1967) amounted to such a significant breakthrough --straightforward gradient methods

become possible with such activation functions. Even so, it took over a decade to sink into the
collective consciousness of the ANN community that a solution to a problem long considered

intractable (even impossible, viz. Minsky and Papert, 1969) was now at hand. The second reason is that once feasible methods for training hidden layer networks were available, they were applied to a vast range of problems with some startling successes. That this should be so is all the more impressive given the considerable difficulties in obtaining convergence via (1.2.3). For

17 -

a period, ANN models coupled with the method ofbackpropagation

came to be viewed as magic,

with considerable accompanying hype and extravagant claims. In 1987 one of us (White, 1987a) pointed out that (!.2.3) is in fact an application of the method of stochastic approximation (Robbins and Monro, 1951; B1um, 1954) to the nonlinear least squares problem (as in Albert and Gardner, 1967). The least squares stochastic approximation recursions are in fact a little more ~eneral. havin~ the form " " " 9t+l = 9 t + at Vf(Xt, 9 t) (yt -f(Xt, " 9 t)),

t=

1, 2,

00

(!.2.4)

The difference is that here the learning rate at is indexed by t, whereas in (1.2.3) it is a constant.
This is quite an important difference With a constant learning rate, the recursion

(1.2.3) can converge only under extremely stringent conditions (there must exist eo such that
y = f(X, eo) almost surely, where Zt has the distribution of Z = (Y; X')' t = 1, 2,
). When this

condition fails, the recursion of (1.2.3) generally converges to a Brownian motion (see Kushner
and Huang. 19R1: Homik ~nr1 Kll~n, 1QQO),not an appealing behavior in this context. Howevcr,

whenever at depends on t appropriately

(e.g. at > 0, L~=l at = 00, L~=l a; < 00, for which it

suffices that at oc t-IC 1/2 < 1( ~ 1), standard results from the theory of stochastic approximation

can be applied (e.g., White, 1989a) to establish the almost sure convergence ofe t in (1.2.4) to ()*,
a local solution of the least squares problem

mill E([Y BE 8

-f(K,

8)]1

Repeated

initialization

of the

recursion

(1.2.4)

from

different

starting

values

A e

(e.g., following

the parameter space partitioning strategy of Morris and Wong, 1991) can lead to rather good local solutions.
This fact is significant. The recursion (!.2.4) provides a computationally very simple

algorithm for getting a consistent estimator for a locally mean square optimal parameter vector in a nonlinear model with just a single pass through the data. Multiple passes through the data

(which can be executed in parallel) permit exploration for a global optimum. Thus, in addition to

21

and Duffle and Singleton (1990). Duffle and Singleton derive consistency and asymptotic normality results for MSM estimators of correctly specified models of conditional distribution. The recursive estimator (1.2.6) is computationally simpler by several orders of magnitude and has useful approximation properties even with misspecified models. It is therefore an interesting estimator in its own right: it also appears promising as a generator of starting estimates for MSM esti-

mation. In all of the discussion so far, we have implicitly assumed that network complexity (indexed by the number of hidden units) is fixed. However, the universal approximation properties described in Section 1.1 suggest that ANN models may prove a useful vehicle for nonparametric estimation. This intuition is correct: using results of White and Wooldridge (1991), White (1990a) shows that nonparametric sieve estimators (Grenander, 1981; Geman and Hwang,
1982) based on ANN models can consistently estimate a square-integrable conditional expecta-

tion function, and White (199Ob) shows that nonparametric sieve estimators based on ANN models can consistently estimate conditional quantile functions. Using results of Gallant (1987), Gallant and White (1991) establish the consistency in Sobolev norm of nonparametric sieve estimators based on ANN models. Thus, ANN models can consistently estimate unknown functions and their derivatives in a manner analogous to the performance of the flexible Fourier function

form (Gallant, 1981; Elbadawi, Ga1lant and Souza, 1983). Given tile early stage or aevelopment ot oegree of approximation results for ANN models, rate of convergence results for nonparametric ANN estimators are only beginning to be obtained. However, Barron (1991b) has obtained rate of convergence results for nonparametric least squares estimators of conditional expectation functions. For i.i.d. samples, these rates are
slightly slower than n 1/2,

To gain some insight into the issues that arise in nonparametric estimation using ANN models, we briefly consider the problem treated by White (1990a). The estimation problem considered there has the standard sieve estimation form

n =

1,2,

...,

(1.2.7)

-22 -

where the sieve en(G) is given by en(G) = T(G, qn' An),

T(G,

q, 11) = {(} E E> I (}( .) = fl(

., ~)

fl(x,

q 8q) =f3o + L G(x'rj)f3 j=l

j ,

x E m.'

q L if3j i ~A, j=O


G is a given hidden layer activation function,

q r L L Irji I .S:qL\} j=li=O


{qn E IN} and {~n E JR+

are sequences tending

to infinity with n, e is the space of functions square integrable with respect to the distribution of
Xt,and now u

s:q -rR =\fJO,

/3
1,...,

/3

'
q,rl,r2,...,rq

) ' .

Given this setup, the estimation problem (1.2.7)is equivalent to the constrained non-

linear least squares problem

mill 8"' e D,

n-1

n L '=1

[f,

-f1'(X"

~')f

, n = 1,2,

""",

(1.2.8)

where Dn = {$1. :L;:o

l{3j I ~lln,

L;:l

L;=o

Irji

I ~qnlln}.

The idea is that for a sample of

size n, one performs a constrained nonlinear least squares estimation on a model with qn hidden units, satisfying certain sumrnability restrictions on the network weights. B y letting the number

of hidden units qn increase gradually with n, and by gradually relaxing the weight constraints, the
network model becomes increasingly inates overfitting asymptotically, flexible as n increases. Proper control of qn and l1n elimof 80, 80(Xt) = E(Yt I Xu, to

a1lowing consistent estimation

" p result, i.e. 118 n -80112 -7 0. White

(1990a) shows that for bounded i.i.d.

{ Zr } , consistency

is

achieved with !!J.n,qn ~ 0:1as n ~ 0:1,!!J.n = o(nl/4)

and qn!!J.~ Jog qn!!J.n= o(n).

For bounded mix-

ing processes of a specific size, ~n = o(n 1/4) and qn~; log qn~n = o(n 1/2) suffice for consistency.

In practice, determining appropriate network complexity is precisely analogous to determining how many terms to include in a nonparametric series regression. As in that case, either cross-validation or information-theoretic methods can be used to determine the number of
hidden units optimal for a given sarnple. Information-theoretic methods in which one optimizes a

complexity-penalized quasi-log likelihood (closely related to the Schwartz Information Criterion,

-23-

Sawa, 1978) have been shown to have desirable properties by Barron (1990). Extension of analysis by Li (1987) as applied by Andrews (1991a) to cross-validated selection of the number of terms in a standard series regression may deliver appropriate optimality results for crossvalidated selection of network complexity , and is an interesting area for further research.

Also an open question is that of the asymptotic distribution of nonparametric neural network estimators. Results of Andrews (199lb) for series estimators may also be extendable to treat nonparametric estimator of ANN models. Additional interesting insights should arise from this analysis.

13. SPECIFICATION

TESTING AND INFERENCE

Consider the nonlinear regression model based on (1.1.3) with F the identity

func-

tion,

The standard linear model ocl::urs as the special case in which {31 = {32 =

..{3q

0.

Thus,

correct specification of the linear model can be tested as

Hq

.fi

=0

v~

H.. .{I; = 0

where {3 = ({310 ...0 {3q)'.

A motnent's

reflection

reveals

an interesting

obstacle

to straightforward

application of the usual tools rf statistical inference: the "nuisance parameters" rj, j = 1, ..., q, are not identified under the nu1l hypothesis, but are identified only under the alternative.
tunately, there is now availabl~ a variety of tools that permits testing of Ho in this context.

The simplest, mo$t naive procedure is to avoid treating the rj as free parameters, instead choosing them a priori in some fashion (e.g., drawing them at random from some
appropriate distribution) and I then proceeding to test Ho using standard methods, e.g. via

Lagrange multiplier or Wald statistics, conditional on the values selected forrjo A procedure of

24precisely this sort was proposed by White (1989b), and the properties of the resulting "neural network test for neglected nonlinearity" were compared to a number of other recognized procedures

for testing linearity by Lee, White and Granger (1991). (See White, 1989b, and Lee, White and Granger, 1991, for implementation details.) The network test was found to perform well in comparison with other procedures. Though no one test dominated the others considered, the network test had good size, was often most powerful, and when not most powerful, was often one of the more powerful procedures. It thus appears to be a useful addition to the modem arsenal of specification testing procedures.
A more !;ophi!;ticated prncedllre i~ tn ~hnfil:p. "1 vfll11~S that optimize the direction in

which nonlinearity

is sought.

Bierens (1990) proposes a specification

test of precisely this sort.

First, the model is estimated under the null hypothesis (linearity),


Et = yt -i't ~n where ~n is an estimator of ([:Jo,a')'.

yielding

residuals

For given r one can show under general

conditions that

under the linearity

hypothesis, where with { Zt } i.i.d. we have

b*(r)

= E(G(X'tr)

X't)

A.

E(Xt

X't)

p where ~n ~ !!.,

Bierens (1990) specifies G( , ) = exp( .), but as we discuss below, this is not the

only possible choice.


It follows that

25 -

A W(r)

A = nM(r)/Gn(r)

d ~xi

under correct specification of the linear model, where a~(r) is a consistent estimator ofa2(r).
Under the alternative, A W(r)/n -717(r) > 0 Q.s. for essentially every choice ofr, as Bierens (1990,

Theorem 2) shows. " To avoid picking r at random, Bierens proposes maximizing W( r) with respect to
r E r Can appropriately specified compact set), yielding Wcr), say. As Bierens notes, this max-

imization renders the xi distribution inapplicable under Ho. However, a xi statistic can be constructed by the following device: choose c > 0, /l E (0, 1) andy n independently of the samDle

and put

r=ro
A =r

if

W(r)

-W(ro)

~ cn)..

if

Bierens
A W(r)/n ~

(1990,

Theorem

4)

shows

that

" d WCr) ~xi

under

correct

specification
essentially

while

SUpre r 7J(r)

> O a.s. under the alternative.

Bierens'

result hold')

regardless

of how r is chosen.
In recent related work, Stinchcombe and White (1991) show that Bierens' concluincluding

sions are preserved if G is chosen to belong to a certain wide class of functions G( .) = exp( .). Other members of this class are G(a) = 1/(1 + exp(-a The choice of c, }., and r o in Bierens' construction

and G(a) = tanh(a).

is problematic.

Two researchers

using the same data and models but using different values for c, ).. and r o can arrive at differing
conclusions in finite samples regarding correctness of a given specification. One way to avoid

such difficulties is to confront the problem head-on and determine the distribution of W( r). Some
useful inequalities are given by Davies (1977, 1987), but these are not terribly helpful when r ?; 3 variables). Recently, Hansen (1991) has proposed a com-

(recall r is the number of explanatory

putationally intensive procedure that permits computation of an asymptotic distribution for W( r) under Ho'

-26-

An interesting

area for further

research

is a comparison

of the relative

performance

and computational cost of the procedures discussed here: the naive procedure of picking rj'S at
random; Bierens' " W(r) procedure; and use of Hansen's (1991) asymptotic distribution " for W(r).

The specification testing procedures just described extend to testing correctness of


nonlinear models, as well as testing the specification models. For testing correct specification of likelihood or method of moment-based

of a nonlinear model, say yt = h(Xt, a) (which for con-

venience includes an intercept) one can test Ho: /3 = O vs. Ha: /3 * Oin the augmented model

v, = h(X"

a)

q 4- ~ j=l

G(X','Yj)f3j

4- I;,

(1.3.2)

p If an is the nonlinear least squares estimator under the null (with an ~ a* under Ha; see White, 1981), then with Et = ft -h(Xt, an) we have

where now

(J"2Cr)

var([GCXtr)

-b*Cr)

A *-1

VahCXt.

a*)]e;)

b*(r) = E(G(X'tr)

V'ah(Xt, a*))

A. = E(Vah(Xt. a *)
We again have " W(r) " = nM(r)/a-

V'ahCXt.

a*))

2 n(r)

d -7xi

under

Ho.

while

" W(rYn

-717(r)

> O a.s.

under

Ha

(mlsspecification) for essentially all r. A consistent specification test is therefore available. A Optimizing W( r) over choice of r leads to considerations regarding asymptotic testing identical to those arising in the linear case.
For testing correct specification of a likelihood-based model, a consistent m-test

(Newey, 1985; Tauchen, 1985; White, 1987b, 1992) can be performed. The starting point is the
fact that if 1 (Zt. 0) is a correctly specified conditionallo~-likelihood for y, 2iven X, Ci.e. for some

-27 -

() 0' exp 1 (Zt, () 0) is the conditional

density

of yt given Xt), then

E(s(ZtJ

(Jo) I Xt) = O ,

where s is the k x llog-likelihood

score function, s(Zt, 0) = V 8 1(Zt, 0). It follows from the law

of iterated expectations that with correct specification

E(s(Zt'

() 0) G(X't

r))

= 0

for all ye r.

A Under standard conditions (e.g. White, 1992, Ch. 9) it follows that with en the

(qua3i-) mll.Ximum likc;lihood c;~timator c;oroi~tc;nt undc;r mi~~pc;cifica.tiOll fOl (}*, wc 11(1VC

where

};(y)

= var([(G(Xt'y)

@ Ik) -b *(y)A *-l]S;)

b*(r)

= E([G(i'tr)

(8)Ik] V'9 S;)

A. = E(V'9 S;)

s: = s(Zt. (}*,
\7'6 S; = \7'6 S(Zt. e*).

Consequently, analogous

A A A W(r) = n M(r)'[Ln(r)]-l of Bierens (1990,

A M(r)

d ~xi

under

correct

specification. ~17(r)

Argument

to that

Theorem

4) delivers

A W(r)/n

> 0 a.s. under

misspecification

for essentially

all r, given an appropriate choice of G, e.g. G(a) = exp(a) as in G(a) = tanh(a), as in Stinchcombe and White 0991). " W(r) over choice ofr leads to considerations

Bierens (1990), or G(a) = 1/(1 + exp(-a, A consistent m-test is thus available.

Optimizing

regarding asymptotic testing identical to those arising in the linear model.


Because ANN models must be recognized from the outset as misspecified, one

-28-

cannot test hypotheses about estimated parameters of the ANN model in the same way that one would test hypotlleses about correctly specified nonlinear models (e.g. as in Gallant, 1973, 1975). Nevertheless, one can test interesting and useful hypotheses within the context of inference for misspecified models (White, 1982, 1992; Gallant and White, 1988a). In this context, two issues arise: the first concerns the interpretation of the hypothesis itself; and the second concerns construction of an appropriate test statistic. Both of these issues can be conveniently illustrated in
the context of nonlinear regression, as in White (1981). A The nonlinear least squares estimator () n solves

min n-l ee8

n L [yt -f(Xt. t=l

(J)r

where, for concreteness we take f(Xt, (}) to be of the fonn (!.1.3) with F the identity function.
White (1981) provides conditions ensuring A Q.S. that (} n ~ (}*, where (}* is the solution to

rnin E([E(Yt BE e

I Xt)

-f(Xt.

(J)J2)

Thus (). is a parameter vector of a minimum

mean squared error approximation

f(Xt. ().) "to

E(Yt I Xt). One can therefore test hypotheses about the parameters of the best approximation. A leading case is that in which a specified explanatory variable (say the rth variable, Xtr) is hypothesized to afford no improvement permitted by f in predicting yt, within the class of approximations

This hypothesis and its alternative are specified as

Ho:

S, e*

= 0

vs.

Ha: S, ()

;!: 0

where s r is a q + 1 x k selection

matrix

that picks out the appropriate

elements of () .(i.e.

ar,rlr,...,rqr.

Testing Ho against Ha in the context of a misspecified model can be conveniently done using either Lagrange multiplier (LM) or Wald-type test statistics, but not likelihood ratio statistics, for reasons described in Foutz and Srivastava (1977), White (1982, 1992) and Gallant

-29-

and White (1988a). The likelihood

ratio statistic requires for its convenient use as a X~+l statis-

tic the validity of the information matrix equality (White, 1982, 1992), which fails under misspecification. The classical LM or Wald statistics also require the validity of the information matrix equality, but can be modified by replacing classical estimators of the asymptotic covariA ance matrix of en with specification robust estimators (White, 1981, 1982, 1992; Gallant and White, 1988a). Thus, a test of Ho against Ha can be conducted using the Wald statistic
.." Wn = n O 'n S'r(Sr .. Sr O n ,

Cn S'r)-l

where

~ Cn

--1 = An

---1 En An

The covariance

estimator

" Cn given here is consistent

when {4}

is i.i.d.,

but modifications

preserving consistency are available in other contexts. Under the hypothesis that Xtr is irrelevant
" d (and with consistent Cn), one can show that Wn ~ X~+l' and that the test is consistent for the

alternative. Similar results hold for the LM test statistic. Details can be found in Gallant and Whit~ (19&&a,CQ 7) and White (19&2; 1992, Ch. 8).

1.4. CHAOS-MODELING EXAMPLES


In this section we illustrate methods for estimating ANN models by fitting single
hidden layer feedforward networks to time series generated by three deterministic chaos

processes. The generating equations for these time series are:

30 -

(a)

The logistic map (Thompson and Stewart, 1986, p. 162):

Yt+l

= 3.8

YtCl -Yt)

(b)

The circle map (Thompson and Stewart, 1986, pp. 164, 285-6):

Y, 11 = Y, + (22/1l')

~in(21l' Y, + ~~)

(c)

The Bier-Bountis map (Thompson and Stewart, 1986, po 171):

Yt+l = -2

+ 28.5 Yt/(l

+ yf)

Chaos (a) is by now a familiar example to economists and econometricians. Chaos (b) and chaos (c) are less familiar, but these three examples, representing polynomial, sinusoidal and !ational polynomial functions, provide a modest range of different functions with which to demonstrate
ANN capab1Unes. Time-series plots or me mree senes are given in Figures 6,7 and 8.

Because we shall not be adding observational error to the chaotic series, our exampIes will provide direct insight into the approximation abilities of single hidden layer feedforward networks. In each case, we fit ANN models of the form

f(Xt.

-q 0) = X't

+ /30 + L j=l

G(X't

rj)

/3 j

(1.4.1)
Several models are examined

to the target chaos, yt, where G(a) = 1/(1 + exp(-a)),

the logistic.

in p~('h in~ance- Specifically, the input X, iE:a E:inslelas of the torgct scrics yt, whilc thc numbcl of hidden units (q) varies from zero to eight. The best model is chosen from these alternatives using the Schwartz Information Criterion (SIC). For each network configuration, we estimate model parameters by a version of the method of nonlinear least squares,Le., we attempt to solve

-31-

Optimization proceeds in two stages. First, the parameter estimates an are obtained by ordinary
least squares, with parameters {3 constrained to zero. (Note that an contains an intercept.) Then

if q > 0, second stage parameter estimates fi n and r n are obtained in such a way as to exploit the

structure of (1.4.1); the an estimates are not subsequently modified, forcing the hidden layer to extract any available structure from the least-squaresresiduals. Inspecting (1.4.1), we see that for given rj..s, ordinary least squares gives fully optimal eGtimntosfor /3. Thus, wc choosc a largc numbcr of ralldol1l vi1luc:)fur tIle elementS or
rj, j = 1, ..., q, and compute the least squares estimates for /3. This implements a form of global

random search of the parameter space. The best fitting values of.8 and r are then used as starting

values for local steepest descent with respect to {:3and r. Within steepest descent, the step size is dynamica11yadjusted to increase when improvements to mean squared error occur, and otherwise to decrease until a mean squared error improvement is found. Convergence is judged to occur
when (mse(k) -mse(k -1)/(1 + mse(k -1)) is sufficiently small, where mse(k) denotes sample

mean squared error on the kth steepest descent iteration. Once a local minimum is reached, the procedure terminates. This algorithm has been found to be fast and reliable across a variety of applic~tions investigated by the authors. The re~lllt~ nf lp.~~t~'111~rp.~ p~tim~tion of a linear model are given in Table 1. Tho simple linear model explains only 12% of the target variance for the circle map, while explaining 84% of the target variance for the Bier-Bountis map. The logistic map is intermediate at 36%, Results for the single hidden layer feedforward network are given in Table 2. In each case the hidden layer network chooses to take as many hidden units as are offered (8), and with this number of hidden units, nearly perfect fits are obtained. Because the relationships studied here are noiseless, the SIC starts to limit the number of hidden units chosen essentially only when machine imprecision begins to corrupt the computations. This lirnit was not reached in these examples. Our examples show that single hidden layer feedforward networks do have

-32 -

appealing flexibility, and can be profitably used to extract approximations at least to some simple
chaos-generating functions. Experience in a wide variety of applications across a spectrum of

scientific disciplines suggests that the usefulness of this flexibility is likely to extend broadly to
econometric contexts.

ANN

models

thus appear to be worthy

additions

to the modem

econometrician's tool-kit.

PART II: RECURSIVE M-ESTlMATION

WIm

DEPENDENT OBSERVATIONS

lI.l.

INTRODUCnON

In Part I, we briefly discussed the method of stochastic approximation (Robbins and Monro,
1951). The Robbins-Monro function '!1(0), say 0" , by (RM) algorithm recursively approximates the zero of an unknown

A () t+1

A = () t + at 1fI(Zt,

A () J

t=

1,2,...

(II.l.l)

where at is a "learning
influcn(;cd

rate" tending to zero, and 1j/(Zt,8) is a measurement of '1'(8) at time t, When 'I'(IJ) = E~V(Zt, tJ)) truS methOd yields a recursive

by 1(111UUl11 v(Ui(1bl~:) Zt.

implementation of the method ofm-estimation of Huber (1964). In particular, the method can be
used to estimate recursively the parameters of nonlinear regression models, such as those arising

in neural network applications. The RM algorithm has two significant advantages: (1) its recursive nature places few demands on computer resources; and (2) in theory , just one pass through a sufficiently large data
set can yield a consistent estimate. The RM algorithm is therefore particularly appealing for

estimating parameters of nonlinear models in large data sets. Very general results relevant to the convergence properties of the RM algorithm have been given by Kushner and Clark (1978) (KC) and Kushner and Huang (1979) (KH). However, the conditions ofKC/KH are not primitive and require some effort to apply. In this part of the paper,

we bridge an existing gap between the results of KC/KH and some interesting and fairly broad

-35-

1/I(z.

())

:5 b(O) h 1(Z) + h 2 (Z); and

there

exist

functions

PI:

/R+

/R+

and

h3

1Rs ~

1R+

such that

p I (U) ~

0 as u ~

0, h3 is measurable-

D3$, and for each (z, (}I , (}2) in

IR$ x e x e

1jI(Z.(}I) -1jI(Z.(}z)

~Pl(

I (}1-(}2

)h3(z),

where

denotes the Euclidean norm.

ASSUMPTION

A.3:

E 1/f(Zt, () < 00 for each () in 0, and there exists a function 'l' : 0-7

IRk

continuous on e such that for each (} in e '(}) = limt -+ ~ E 1fI(Zt. (}).

ASSUMPTION
L;=oat ~ 00 as n

A.4:
~ 00,

{ at} is a sequence of positive real numbers such that at ~ O as t ~ 00 and

ASSUMPTION

A.5:

(a) (b)

For each () in e. L;=o at [1jf(Zt.()) -Evr(Zt. ())] converges a.s,-P; and


For j = 1,2,3, there exist bounded non-stochastic sequences {17jt} such that

L;=o at[hj(Zt)-1Jjt]

converges a.s.-P.

Assumption A.l introduces the data generating process, and Assumption A.2 imposes some suitable and relatively mild restrictions on the growth and smoothnessproperties of the measurement function 11/ . Assumption A.3 is a mild asymptotic mean stationarity requirement at -7 O ensures that the effect of error adjustment eventually
00 allows the adjustment to continue for an arbitrarily long

In
van-

Assumption A.4, the condition


ishes; the condition ~n ""'t=1 at-:;

time,

so that the eventual convergence of (lI.l.l)

is always plausible.

Assumption A.5 imposes mild convergence conditions on the processes depending on Z:. Below we consider more primitive mixingale conditions that ensure the validity of this assumption. Let 1C
IRk ~ e be a measurable projection function (for f) E e, 1t"(f) = f).

We then

A A have that for all RM estimates e t, 7r(e J E e.

In what follows,

A e t will also denote the projected

This result generalizes classical results (e.g., Blum, 1954) in several respects. First, Zr is not required to enter the function 1/1 additively. Second, the learning rate at is not required to be
square summable. Most importantly, general behavior for Zr is allowed, provided that Assump-

tion A.5 holds. As examples, KC consider martingale difference sequences and moving average
processes.

A general class of stochastic processes satisfying the convergence

conditions
denote the

of AssumpLp-norm,

tion A.5 is the class of mix.ingales (McLeish,

1975). Let

.llp

IIXllp=(E x IP)llp. WhenllXllp <~wewriteXe


whenever each element of X belongs to Lp(P).

Lp(P). If Xis a matrix or vector,X e Lp(P)


In this case
lip is as just defined, with

denoting the spectral norm induced by the Euclidean norm. We use the following definition.

DEFINITION II.2.2: Let {X_t} be a sequence of random variables belonging to L_2(P) and let {F_t} be a filtration of F. The sequence {X_t, F_t} is a mixingale process if for sequences of nonnegative real constants {c_t} and {ξ_m}, where ξ_m → 0 as m → ∞, we have

||E(X_t | F_{t-m})||_2 ≤ c_t ξ_m   and   ||X_t - E(X_t | F_{t+m})||_2 ≤ c_t ξ_{m+1}.

{X_t} is a mixingale of size -a if ξ_m = O(m^λ) for some λ < -a. (We drop explicit reference to the filtration when there is no risk of confusion.) When ξ_m satisfies this last condition, we also say that ξ_m is of size -a.

Our definition of size is convenient, but also stronger than that considered by McLeish (1975). As special cases, mixingale processes include independent sequences, martingale difference sequences, φ-, ρ- and α-mixing processes, finite and certain infinite order moving average processes, and sequences of near epoch dependent functions of infinite histories of mixing processes (discussed further in the next section). Mixingales thus constitute a rather broad class of dependent heterogeneous processes.

In our applications, we always assume that the relevant random variables are measurable-F_t, so that the second mixingale condition holds automatically. This avoids anticipativity of the RM algorithm.

The following conditions permit application of McLeish's mixingale convergence theorem (McLeish, 1975, Corollary 1.8) to verify the conditions of Assumption A.5.

ASSUMPTION A.4': {a_t} is a sequence of positive real numbers such that Σ_{t=1}^∞ a_t^2 < ∞ and Σ_{t=1}^n a_t → ∞ as n → ∞.

ASSUMPTION A.5':
(a) For each θ in Θ, sup_t ||ψ(Z_t, θ)||_2 ≤ Δ_θ < ∞ and {ψ(Z_t, θ) - Eψ(Z_t, θ), F_t} is a mixingale of size -1/2, where F_t = σ(Z_1, ..., Z_t);
(b) For j = 1, 2, 3, sup_t ||h_j(Z_t)||_2 ≤ Δ < ∞ and {h_j(Z_t) - Eh_j(Z_t), F_t} is a mixingale of size -1/2.

Assumption A.4' implies Assumption A.4. Note also that sup_t ||ψ(Z_t, θ)||_2 ≤ Δ_θ < ∞ is implied by Assumptions A.5'(b) and A.2(b.i), and that we may take η_{jt} = Eh_j(Z_t). We have the following result.
COROLLARY II.2.3: Given Assumptions A.1-A.3, A.4' and A.5', let {θ̂_t} be given by (II.1.1) with θ̂_0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold.

This provides general and fairly primitive conditions ensuring the convergence of θ̂_t. Only Assumption A.5' is a reasonable candidate for further specialization to achieve additional simplicity. This is most conveniently done by placing conditions on h_1, h_2, h_3 and {Z_t} sufficient to ensure that the mixingale property is valid. We give examples of this in the next section. The present result gives a very considerable generalization of a convergence result of White (1989a, Proposition 3.1). There Z_t is taken to be an i.i.d. uniformly bounded sequence. Corollary II.2.3 also generalizes results of Englund, Holst and Ruppert (1988), who assume that {Z_t} is a stationary mixing process and that ψ is a bounded function.

Asymptotic normality follows as a consequence of Theorem 2 of KH. As KH show, the fastest rate of convergence obtains with a_t = (t+1)^{-1}; we adopt this rate for the rest of this section.

For given θ* ∈ R^k we write u_t = (t+1)^{1/2}(θ̂_t - θ*). Straightforward manipulations allow us to write

u_{t+1} = [I_k + (t+1)^{-1} H_t] u_t + (t+1)^{-1/2} q_t* ,                       (II.2.4)

where

H_t = ∇_θ ψ_t* + [((t+2)/(t+1))^{1/2} - 1] ∇_θ ψ_t* + I_k/2 + O((t+1)^{-1}) I_k   (II.2.5)

and

q_t* = ((t+2)/(t+1))^{1/2} ψ_t* ,

with ψ_t* = ψ(Z_t, θ*) and ∇_θ ψ_t* = ∇_θ ψ(Z_t, θ*). The piecewise constant interpolation of u_t on [0, ∞) with interpolation intervals {a_t} is defined as u^0(τ) = u_t, τ ∈ [τ_t, τ_{t+1}), and the leftward shifts are defined as u^t(τ) = u^0(τ_t + τ), τ ≥ 0. The asymptotic distribution of θ̂_t is found by showing that u^t(·) converges to the solution of a stochastic differential equation (SDE) and then characterizing the weak limit of u^t(·).

We adopt the following conditions:

ASSUMPTION B.1: Assumption A.1 holds and {Z_t, t = 0, ±1, ±2, ...} is a stationary sequence on (Ω, F, P).

ASSUMPTION B.2:
(a) Assumption A.2(a) holds; and
(b) For each z ∈ R^s, ψ(z, ·) is continuously differentiable such that there exist functions ρ_2 : R+ → R+ and h_4 : R^s → R+ such that ρ_2(u) → 0 as u → 0, h_4 is measurable-B^s, and for some θ^0 interior to Θ and each (z, θ) in R^s × Θ^0, Θ^0 an open neighborhood in Θ of θ^0,

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| ≤ ρ_2(|θ - θ^0|) h_4(z).

ASSUMPTION B.3: There exists θ* ∈ int Θ such that θ* = θ^0 in Assumption B.2, Eψ_t* = 0, ψ_t* ∈ L_6(P), ∇_θ ψ_t* ∈ L_2(P), and the eigenvalues of H = H* + I_k/2 (with H* = E(∇_θ ψ_t*)) have negative real parts.

ASSUMPTION B.5: Let F_0 = σ(Z_t, t ≤ 0) and suppose:
(a) (i) Σ_{t=0}^∞ ||E(ψ_t* | F_0)||_2 < ∞; and (ii) Σ_{t=0}^∞ sup_{j≥0} ||E(ψ_t* ψ_{t+j}*' | F_0) - σ_j||_2 < ∞, where σ_j = E(ψ_t* ψ_{t+j}*');
(b) For some η_4 ∈ R, Σ_{t=0}^∞ (t+1)^{-1} [h_4(Z_t) - η_4] converges a.s.-P; and
(c) Σ_{t=0}^∞ (t+1)^{-1} [∇_θ ψ_t* - H*] and Σ_{t=0}^∞ (t+1)^{-1} [|∇_θ ψ_t*| - h*] converge a.s.-P, where h* = E|∇_θ ψ_t*|.

The stationarity imposed in Assumption B.1 is extremely convenient; without this, the analysis becomes exceedingly complicated. Assumption B.2(b) imposes a Lipschitz condition on ∇_θ ψ analogous to that of A.2(b.ii) for ψ. Assumption B.3 imposes additional moment conditions and identifies θ* as a candidate asymptotically stable equilibrium. As we take a_t = (t+1)^{-1}, there is no analog to Assumption A.4 or A.4'. Finally, Assumption B.5 imposes some further convergence conditions beyond those of A.5. Assumption B.5(a) restricts the local fluctuations (quadratic variation) induced by (t+1)^{-1/2} q_t* in (II.2.4) to be compatible with those of a Wiener process. Assumption B.5(b, c) (together with B.2) ensures that the effects of the second term and the last term in (II.2.5) eventually vanish.
The asymptotic normality result can be stated as follows.

THEOREM II.2.4: Suppose Assumptions B.1-B.3 and B.5 hold, and that θ̂_t → θ* a.s.-P, where {θ̂_t} is generated by (II.1.1) with θ̂_0 arbitrary, a_t = (t+1)^{-1}, and θ* is an isolated element of Θ*. Then:
(a) {u_t} is tight in R^k;
(b) Σ* ≡ Σ_{j=-∞}^∞ σ_j < ∞;
(c) {u^t(·)} converges weakly to the stationary solution of dU(τ) = H U(τ) dτ + Σ*^{1/2} dW(τ), where W(·) denotes the standard k-variate Wiener process. In particular, u_t converges in distribution to N(0, F*), where F* = ∫_0^∞ exp[Hc] Σ* exp[H'c] dc is the unique solution to the matrix equation H F* + F* H' = -Σ*;
(d) If H* is symmetric, then F* = M L M', where M is the orthogonal matrix such that M'(-H*)M = Λ, with Λ the diagonal matrix containing the eigenvalues (λ_1, ..., λ_k) of -H* in decreasing order, and L has (i, j) element (λ_i + λ_j - 1)^{-1} K_{ij}, where K = M'Σ*M.

If a_t is chosen to be (t+1)^{-1} A (for a finite nonsingular k×k matrix A), then the SDE in Theorem II.2.4(c) becomes dU(τ) = H̃ U(τ) dτ + A Σ*^{1/2} dW(τ), and the covariance matrix of the asymptotic distribution becomes A F* A'. Part (d) gives an alternative expression for the covariance matrix of the asymptotic distribution, analogous to that given by Fabian (1968). Despite the assumed stationarity, Theorem II.2.4 generalizes previous results in that the random variables can be unbounded and the measurements can be correlated (cf. Ljung and Söderström, 1983, Ch. 4, and Fabian, 1968). Again, the properties of mixingales can be exploited to verify the convergence conditions. We impose

ASSUMPTION B.5':
(a) (i) {ψ_t*, F_t} is a mixingale of size -2 with c_t ≤ K for some K < ∞, t = 1, 2, ...;
(ii) there exist a constant K < ∞ and a sequence of real numbers {b_t} such that ||E(ψ_t* ψ_{t+j}*' | F_0) - σ_j||_2 ≤ K b_t for all j, and {b_t} is of size -2.
(b) {h_4(Z_t) - E(h_4(Z_t)), F_t}, {∇_θ ψ_t* - H*, F_t} and {|∇_θ ψ_t*| - h*, F_t} are mixingales of size -1/2.

We have the following result.


A et

COROLLARY

II.2.5:

Suppose

Assumptions

B.I-B.3

and

B.5'

hold

and

that

-1e*

a.s.-P

where {et} is generated by (1.1) with eO arbitrary, at = (t+ 1)-] and ()* is an isolated element of
EJ+."men the conclusions of'lheorem ll.2.4 hold

0
4.1) from the

This considerably

generalizes

an analogous result of White (1989a, Proposition

i.i.d. unifornlIy bounded case to the stationary dependent case. EngIund, HoIst and Ruppert
(1988) also give a result for i.i.d. observations.
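As a numerical illustration (ours, not part of the original text) of how the asymptotic covariance F* of Theorem II.2.4 can be obtained once estimates of H* and Σ* are available, the following Python sketch solves the matrix equation H F* + F* H' = -Σ*. The particular matrices below are hypothetical.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    # Hypothetical estimates of H* = E(grad psi_t*) and Sigma* = sum_j sigma_j.
    H_star = np.array([[-1.5, 0.2], [0.0, -2.0]])
    Sigma_star = np.array([[0.8, 0.1], [0.1, 0.5]])

    H = H_star + 0.5 * np.eye(2)   # H = H* + I_k/2; its eigenvalues have negative
                                   # real parts here, as required by Assumption B.3
    # solve_continuous_lyapunov(A, Q) returns X solving A X + X A' = Q,
    # so F* solves H F* + F* H' = -Sigma*.
    F_star = solve_continuous_lyapunov(H, -Sigma_star)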

II.3. RECURSIVE NONLINEAR LEAST SQUARES ESTIMATION

Suppose the nonlinear model f(X_t, δ) (f : R^r × D → R, X_t a random r×1 vector, δ ∈ D ⊂ R^k) is to be used to forecast the random variable Y_t. It is common to seek δ*, a solution to the problem

min_{δ∈D} E([Y_t - f(X_t, δ)]^2),

and form a forecast f(X_t, δ*). The solution δ* is also a solution to the problem

E(∇_δ f(X_t, δ) [Y_t - f(X_t, δ)]) = 0,

where ∇_δ is the gradient operator with respect to δ yielding a k×1 column vector.

The simple RM algorithm for this problem in nonlinear least squares regression is the algorithm (II.1.1) with

ψ(Z_t, θ) = ∇_δ f(X_t, δ) [Y_t - f(X_t, δ)],

where Z_t = (Y_t, X_t')' and θ = δ. The updating equation is

δ̂_{t+1} = δ̂_t + a_t ∇_δ f̂_t [Y_t - f̂_t],                                      (II.3.1)

where we have written f̂_t = f(X_t, δ̂_t) and ∇_δ f̂_t = ∇_δ f(X_t, δ̂_t). This is known as a "stochastic gradient method." In this section we consider the properties of this algorithm and two useful variants, the "quick" and the "modified" RM algorithms.

A disadvantage of the simple RM algorithm is that it may converge very slowly (e.g. White, 1988). To improve the speed of convergence, a natural modification is to take an approximate Gauss-Newton step at each stage. This yields the modified RM algorithm, also known as the "stochastic Newton method." The algorithm is given by (II.1.1) with

ψ(Z_t, θ) = [ψ_1(Z_t, θ)', ψ_2(Z_t, θ)']',
ψ_1(Z_t, θ) = vec[∇_δ f(X_t, δ) ∇_δ f(X_t, δ)' - G],
ψ_2(Z_t, θ) = G^{-1} ∇_δ f(X_t, δ) [Y_t - f(X_t, δ)],

where θ = (vec(G)', δ')'. The updating equations are then

Ĝ_{t+1} = Ĝ_t + a_t [∇_δ f̂_t ∇_δ f̂_t' - Ĝ_t],                                   (II.3.2a)
δ̂_{t+1} = δ̂_t + a_t Ĝ_{t+1}^{-1} ∇_δ f̂_t [Y_t - f̂_t].                           (II.3.2b)

We take Ĝ_0 to be an arbitrary positive-definite symmetric matrix.

The difficulties of applying this algorithm are: (1) the inversion of Ĝ_{t+1} is computationally demanding, and (2) the updating estimates Ĝ_t need not be positive-definite, pointing the algorithm in the wrong direction. The first problem can be solved by use of the rank one updating formula for the matrix inverse. Let P̂_{t+1} = Ĝ_{t+1}^{-1} and λ_t = (1 - a_t)/a_t. The modified RM algorithm is algebraically equivalent to

P̂_{t+1} = (1 - a_t)^{-1} [P̂_t - P̂_t ∇_δ f̂_t ∇_δ f̂_t' P̂_t / (λ_t + ∇_δ f̂_t' P̂_t ∇_δ f̂_t)],   (II.3.3a)
δ̂_{t+1} = δ̂_t + a_t P̂_{t+1} ∇_δ f̂_t [Y_t - f̂_t],                               (II.3.3b)

cf. Ljung and Söderström (1983, Ch. 2 & 3). The choice of P̂_0 proportional to I_k is often convenient.
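The following sketch (again ours, reusing the hypothetical f and grad_f above) implements one step of the modified RM algorithm in the algebraically equivalent form (II.3.3), propagating P̂_t = Ĝ_t^{-1} directly via the rank one updating formula.

    import numpy as np

    def modified_rm_step(delta, P, x_t, y_t, a_t, f, grad_f):
        # Rank-one (Sherman-Morrison) update of P = G^{-1}, then a Gauss-Newton-type step:
        #   P_{t+1}     = (1/(1 - a_t)) * [P - P g g' P / (lam + g' P g)],  lam = (1 - a_t)/a_t
        #   delta_{t+1} = delta_t + a_t * P_{t+1} g [y_t - f(x_t, delta_t)]
        # Requires 0 < a_t < 1, e.g. a_t = 1/(t+1) for t >= 1.
        g = grad_f(x_t, delta)
        lam = (1.0 - a_t) / a_t
        Pg = P @ g
        P_new = (P - np.outer(Pg, Pg) / (lam + g @ Pg)) / (1.0 - a_t)
        delta_new = delta + a_t * P_new @ g * (y_t - f(x_t, delta))
        return delta_new, P_new

Starting from a positive-definite P̂_0 (say, a large multiple of the identity) and applying the step for t ≥ 1 keeps a_t strictly between 0 and 1, so the update is well defined.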

To ensure that Ĝ_t is positive-definite, we may use the following modification of (II.3.2a):

G̃_{t+1} = Ĝ_t + a_t [∇_δ f̂_t ∇_δ f̂_t' - Ĝ_t],                                  (II.3.4a)
Ĝ_{t+1} = G̃_{t+1} + M_{t+1}(ε),                                               (II.3.4b)

where ε is some predetermined positive number, and M_{t+1}(ε) is chosen so that Ĝ_{t+1} - εI is positive-semidefinite. Some practical implementations of this can be found in Ljung and Söderström (1983, Ch. 6). A similar device can be applied to P̂_t. Implementation of this algorithm will be understood to employ a projection device restricting δ̂_t to a compact set D and Ĝ_t to a compact convex set Γ such that the maximum and minimum eigenvalues of Ĝ_t lie in a bounded strictly positive interval.

A simplification of the modified RM algorithm is to choose G to be a diagonal matrix. In particular, we take G = c I_k, where c is a positive scalar, so that matrix inversion is avoided. This yields the quick RM algorithm, (II.1.1) with ψ = [ψ_1', ψ_2']', where now

ψ_1(Z_t, θ) = ∇_δ f(X_t, δ)' ∇_δ f(X_t, δ) - c,
ψ_2(Z_t, θ) = c^{-1} ∇_δ f(X_t, δ) [Y_t - f(X_t, δ)],

so that the updating equations become

ĉ_{t+1} = ĉ_t + a_t [∇_δ f̂_t' ∇_δ f̂_t - ĉ_t],                                  (II.3.5a)
δ̂_{t+1} = δ̂_t + a_t ĉ_{t+1}^{-1} ∇_δ f̂_t [Y_t - f̂_t].                           (II.3.5b)

The scalar ĉ_t can easily be modified to be positive in a manner analogous to (II.3.4); we also restrict ĉ_t to be bounded. The quick RM algorithm is a compromise between the other two algorithms in that it takes a negative gradient direction with a scaling factor utilizing some local curvature information. Consequently, the quick algorithm ought to converge more quickly than the simple algorithm but more slowly than the modified algorithm. When a_t = (t+1)^{-1}, the quick algorithm reduces to the "quick and dirty" algorithm of Albert and Gardner (1967, Ch. 7).
It is straightforward to impose conditions ensuring the validity of all assumptions required for the convergence results of the preceding section. Only the mixingale assumptions A.5' and B.5' require particular attention. We make use of a convenient and fairly general class of mixingales, near epoch dependent (NED) functions of mixing processes (Billingsley, 1968; McLeish, 1975; Gallant and White, 1988a).
Let {V_t} be a stochastic process on (Ω, F, P) and define the mixing coefficients

φ_m = sup_τ sup_{F ∈ F^τ_{-∞}, G ∈ F^∞_{τ+m} : P(F) > 0} |P(G | F) - P(G)|,
α_m = sup_τ sup_{F ∈ F^τ_{-∞}, G ∈ F^∞_{τ+m}} |P(G ∩ F) - P(G) P(F)|,

where F^t_τ = σ(V_τ, ..., V_t). When φ_m → 0 or α_m → 0 as m → ∞ we say that {V_t} is φ-mixing (uniform mixing) or α-mixing (strong mixing). When φ_m = O(m^λ) for some λ < -a we say that {V_t} is φ-mixing of size -a, and similarly for α_m. We use the following definition of near epoch dependence, where we adopt the notation E^{t+m}_{t-m}(·) ≡ E(· | F^{t+m}_{t-m}).

DEFINITION II.3.1: Let {Z_t} be a sequence of random variables belonging to L_2(P), and let {V_t} be a stochastic process on (Ω, F, P). Then {Z_t} is near epoch dependent (NED) on {V_t} of size -a if ν_m ≡ sup_t ||Z_t - E^{t+m}_{t-m}(Z_t)||_2 is of size -a.

The following three results make it straightforward to impose conditions sufficing for Assumptions A.5' and B.5'. The first is obtained by following the argument of Theorem 3.1 of McLeish (1975). The second simplifies a result of Andrews (1989). The third allows simple treatment of products of NED sequences.

PROPOSITION II.3.2: Let {Z_t ∈ L_p(P)}, p ≥ 2, be NED on {V_t} of size -a, where {V_t} is a mixing sequence with φ_m of size -ap/(p-1), or α_m of size -2ap/(p-2), p > 2. Then {Z_t - E(Z_t)} is a mixingale of size -a.

PROPOSITION II.3.3: Let {Z_t} satisfy the conditions of Proposition II.3.2. Let g : R^s → R satisfy a Lipschitz condition, |g(z_1) - g(z_2)| ≤ L |z_1 - z_2|, L < ∞, z_1, z_2 ∈ R^s. Then {g(Z_t) ∈ L_2(P)} is NED on {V_t} of size -a. If {V_t} satisfies the conditions of Proposition II.3.2, then {g(Z_t) - E(g(Z_t))} is a mixingale of size -a.

PROPOSITION II.3.4: Let {U_t} and {W_t} be two sequences NED on {V_t} of size -a.
(a) If sup_t |W_t| ≤ Δ < ∞ and sup_t ||U_t||_4 ≤ Δ < ∞, then sup_t ||U_t W_t||_4 ≤ Δ^2 and {U_t W_t} is NED on {V_t} of size -a/2.
(b) If sup_t ||W_t||_8 ≤ Δ < ∞ and sup_t ||U_t||_8 ≤ Δ < ∞, then sup_t ||U_t W_t||_4 ≤ Δ^2 and {U_t W_t} is NED on {V_t} of size -a/2.
(c) If sup_t ||U_t||_8 ≤ Δ < ∞ and {V_t} satisfies the conditions of Proposition II.3.2, then there exist K < ∞ and a sequence of real numbers {b_t} of size -a/2 such that sup_{j≥0} ||E(U_t U_{t+j} | F_0) - E(U_t U_{t+j})||_2 ≤ K b_t.
Our subsequent results will make use of Proposition II.3.4(a), requiring sup_t ||Y_t||_4 ≤ Δ and a bound on the elements of X_t. Part (b) illustrates use of the Cauchy-Schwartz inequality to relax the boundedness condition; the price for this is a corresponding strengthening of moment conditions on U_t (corresponding to Y_t). Here we shall adopt boundedness conditions on X_t to minimize moment conditions placed on Y_t and facilitate verification of the Lipschitz condition of Proposition II.3.3. Part (c) permits verification of Assumption B.5'(a.ii). We impose the following conditions.

ASSUMPTION C.1: Assumption A.1 holds, and {Z_t} is NED on {V_t} of size -1, where Z_t = (Y_t, X_t')' with X_t bounded and sup_t ||Y_t||_p ≤ Δ < ∞, and {V_t} is a mixing sequence on (Ω, F, P) with φ_m of size -p/2(p-1), or α_m of size -p/(p-2), p ≥ 4.

ASSUMPTION C.2: f : R^r × D → R is jointly measurable, where D is a compact subset of R^k. For each x ∈ R^r, f(x, ·) is continuously differentiable, and f(x, ·) and ∇_δ f(x, ·) each satisfy a Lipschitz condition with Lipschitz constants L_1(x) and L_2(x), where L_1 and L_2 are each Lipschitz continuous in x. For each δ ∈ D, f(·, δ) and ∇_δ f(·, δ) each satisfies a Lipschitz condition.

The limit points of the recursive and the nonlinear least squares methods thus coincide, so that the RM estimators tend to the same limit(s) as the nonlinear least squares estimator (cf. Ljung and Söderström, 1983). Corollary II.3.5 is more general than the i.i.d. case treated by White (1989a) and the examples given in KC (Ch. 2), as we allow the data to be moderately dependent and heterogeneous. This result differs from those of Metivier and Priouret (1984) in that we require neither "conditional independence" nor stationarity.

Corollary II.3.5 also generalizes a result of Ruppert (1983). Ruppert assumes that for some δ*, Y_t = f(X_t, δ*) + ε_t and that (X_t, ε_t) is strong mixing of size -p/(p-2), a condition that may fail when X_t contains lagged Y_t, because Y_t need not be mixing when it is generated in this manner, even when ε_t and other elements of X_t are mixing. Indeed, this fact partially motivates our usage of near epoch dependence. Also, we do not require that Y_t is generated in the manner assumed by Ruppert (i.e., we may be estimating a "misspecified" model). Compared to the result of Ljung and Söderström (1983), we allow more dependence in the data, as the data need not be generated by a linear filter.

The modified RM algorithm can be identified with the extended Kalman filter for the nonlinear signal model

Y_t = f(X_t, δ_t) + ε_t ,
δ_t = δ_0 for all t.

The Kalman gain is a_t P̂_{t+1} ∇_δ f̂_t. Corollary II.3.5 thus provides conditions more general than previously available ensuring consistency of the filter. In particular, the model can be misspecified and the data can be NED on some underlying mixing sequence. Because the quick RM algorithm includes Albert and Gardner's quick and dirty algorithm, Corollary II.3.5 directly generalizes their consistency result to the case of dependent observations.

To obtain asymptotic normality results for the case of nonlinear regression, we impose the following conditions.


For this we impose appropriate conditions. In particular, we adopt Assumption C.1. The assumption of uniformly bounded X_t causes no loss of generality in the present context. This is a consequence of the fact that E(Y_t | X_t) = E(Y_t | X̃_t), where X̃_{ti} = τ(X_{ti}), i = 1, ..., r, and τ : R → [0, 1] is a strictly increasing continuous function (for example, the logistic c.d.f.). If X_t is not uniformly bounded then X̃_t is, and we seek an approximation to g(X̃_t) = E(Y_t | X̃_t). We revert to our original notation in what follows, with the implicit understanding that X_t has been transformed so that Assumption C.1 holds. Note, however, that Y_t is not assumed bounded, providing the desired generality.
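A one-line illustration (ours) of such a transformation: applying a strictly increasing continuous map onto (0, 1), here the logistic c.d.f., elementwise to each regressor.

    import numpy as np

    def squash(x):
        # Strictly increasing continuous map from R onto (0, 1); elementwise logistic c.d.f.
        return 1.0 / (1.0 + np.exp(-np.asarray(x)))

    x_tilde = squash([[-3.2, 0.0, 14.7], [2.1, -0.5, 0.3]])   # bounded regressors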

ASSUMPTION E.1: f : R^r × D → R is given by (II.4.1), where D = A × B × Γ, with A, B and Γ compact subsets of R^{q+1}, R^{q(r+1)} and R^r, respectively, and with G : R → R a bounded function continuously differentiable of order 3.

The conditions on G are readily verified for the logistic c.d.f. and hyperbolic tangent "squashers" commonly used in neural network applications.
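To make the assumed structure concrete, here is a sketch (ours; the partition of the parameters into β and γ is an illustrative convention, not the paper's notation) of a single hidden layer network output with the logistic squasher G.

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def shl_output(x, beta, gamma):
        # Single hidden layer feedforward network:
        #   f(x, theta) = beta_0 + sum_{j=1}^{q} beta_j * G(x_tilde' gamma_j),
        # with x_tilde = (1, x')', G the logistic squasher, beta in R^{q+1},
        # and gamma a q x (r+1) matrix of hidden unit weights.
        x_tilde = np.concatenate(([1.0], np.asarray(x, dtype=float)))
        hidden = logistic(gamma @ x_tilde)      # q hidden unit activations
        return beta[0] + beta[1:] @ hidden

    # Hypothetical dimensions: r = 3 inputs, q = 2 hidden units.
    beta = np.array([0.1, 0.7, -0.4])
    gamma = np.array([[0.0, 1.0, -1.0, 0.5],
                      [0.2, -0.3, 0.8, 0.0]])
    y_hat = shl_output([0.4, -1.2, 2.0], beta, gamma)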


COROLLARY II.4.1: Given Assumptions C.1, E.1, C.3 and A.4', let {θ̂_t} be given by (II.3.1), (II.3.2) or (II.3.5) (the simple, modified and quick algorithms, respectively) with θ̂_0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold.

Thus the method of back-propagation and its generalizations converge to a parameter vector giving a locally mean square optimal approximation to the conditional expectation function E(Y_t | X_t) under general conditions on the stochastic process {Z_t}. This result considerably generalizes Theorem 3.2 of White (1989a). For the asymptotic distribution results, we impose the following condition.

ASSUMPTION F.1: Assumption E.1 holds with G continuously differentiable of order 4.

COROLLARY II.4.2: Suppose Assumptions D.1, D.2 and F.1 hold and that θ̂_t → θ* a.s.-P, where {θ̂_t} is generated by (II.3.1), (II.3.2) or (II.3.5) with θ̂_0 chosen arbitrarily, a_t = (t+1)^{-1}, and θ* is an isolated element of Θ*. Then the conclusions of Theorem II.2.4 hold.

For many choices of ψ, the analysis parallels that for the least squares case rather closely. These results are within relatively easy reach for such estimation procedures.

For neural network models, it is desirable to relax the assumption that q is fixed. Letting q → ∞ as the available sample becomes arbitrarily large permits use of neural network models for purposes of non-parametric estimation. Off-line non-parametric estimation methods for the case of mixing processes are treated by White (1990a) using results for the method of sieves (Grenander, 1981; White and Wooldridge, 1991). On-line non-parametric estimation methods appear possible, but will require convergence to a global optimum of the underlying least squares problem, not just the local optimum that the present methods deliver. Results of Kushner (1987) for the method of simulated annealing provide hope that convergence to the global optimum is achievable for the case of dependent observations with appropriate modifications to the RM procedure.

Finally, it is of interest to consider RM algorithms for neural network models that generalize the feedforward networks treated here by allowing certain internal feedbacks. Such "recurrent" network models have been considered by Jordan (1986), Elman (1988) and Williams and Zipser (1989). For example, in the Elman (1988) setup, hidden layer activations feed back, so that network output is O_t = F(A_t' β), A_{tj} = G(X̃_t' γ_j + A_{t-1}' δ_j), j = 1, ..., q, where A_t = (A_{t0}, A_{t1}, ..., A_{tq})', A_{t0} = 1. This allows for internal network memory and for rich dynamic behavior of network output. Learning in such models is complicated by the fact that at any stage of learning, network output depends not only on the entire past history of inputs X_t, but also on the entire past history of estimated parameters θ̂_t. Results of KC are relevant for treating such internal feedbacks. Convergence of RM estimates in recurrent networks is studied by Kuan (1989) and Kuan, Hornik and White (1990).
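A sketch (ours, with hypothetical weight matrices) of one step of the Elman-type recurrence just described; the hidden activations carry the internal memory that makes learning in recurrent networks more delicate.

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def elman_step(x_t, a_prev, gamma, delta, beta):
        # A_tj = G(x_tilde' gamma_j + A_{t-1}' delta_j), j = 1, ..., q;  O_t = F(A_t' beta).
        # Here G is the logistic squasher and F is taken to be the identity.
        # Shapes: gamma is q x (r+1), delta is q x (q+1), beta has length q+1.
        x_tilde = np.concatenate(([1.0], np.asarray(x_t, dtype=float)))
        a_hidden = logistic(gamma @ x_tilde + delta @ a_prev)   # new hidden activations
        a_t = np.concatenate(([1.0], a_hidden))                 # A_t0 = 1 (bias unit)
        o_t = a_t @ beta                                        # network output
        return o_t, a_t

    # Hypothetical dimensions: r = 2 inputs, q = 3 hidden units.
    rng = np.random.default_rng(0)
    gamma, delta, beta = rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=4)
    a = np.zeros(4); a[0] = 1.0
    for x in rng.normal(size=(5, 2)):
        o, a = elman_step(x, a, gamma, delta, beta)

Iterating this step over a sample makes the output O_t depend on the entire history of inputs, which is precisely the complication noted above.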

II.2.1(b) follows from Theorem II.2.1(c). Finally, we show that cycling between two asymptotically stable equilibria is impossible. It is easy to see that points in Θ* must be isolated. Let θ_1* and θ_2* be two isolated points in Θ*, and let N_{ε1} and N_{ε2} be neighborhoods of θ_1* and θ_2*, respectively, such that N_{ε1} ⊂ d(θ_1*), N_{ε2} ⊂ d(θ_2*), and N_{ε1} ∩ N_{ε2} = ∅. If the path of θ̂_t cycles between θ_1* and θ_2*, θ̂_t must move from, say, N_1 to N_2 infinitely often. Let {t_i} be an infinite subsequence of {t} such that θ̂_{t_i} ∈ N_{ε1}. Then θ̂^{t_i}(·) is a subsequence of θ̂^t(·) and has a limit θ̄(·) satisfying the limiting differential equation with θ̄(0) ∈ N_1. But because θ̂_t moves to N_2 infinitely often, for every T there is a τ > T such that the path leaves N_1 for N_2, so θ̄(τ) cannot converge to θ_1*. This violates the asymptotic stability of θ_1* and proves Theorem II.2.1(d).

PROOF OF COROLLARY II.2.3: The result follows from Theorem II.2.1 because the summability condition on a_t in Assumption A.4' implies a_t → 0 as t → ∞ and Assumption A.5' implies Assumption A.5 by the mixingale convergence theorem (McLeish, 1975, Corollary 1.8).
PROOF OF THEOREM II.2.4: We verify the conditions for Theorem 2 of KH. We first observe that the conditions [A1], [A4], [A7] and [A8] of KH are directly assumed, and that [A3] of KH is ensured by Assumption B.5(c) and Lemma A1.

Second, we show that the consequence of [A2] of KH holds under Assumptions B.2(b) and B.5(b, c). This amounts to showing that the second assertion in Lemma 1 of KH holds. By Assumption B.2(b) we have

Clearly, the integral on the RHS of (a.10) converges to zero a.s. because θ̂_t → θ* a.s. Let {ε_k} be a sequence of positive real numbers such that Σ_k ε_k < ∞, and let {N_k} be a sequence of integers tending to infinity as k → ∞. Define measurable sets A_k, B_k, C_k, D_k and F_k as:

PROOF OF COROLLARY II.2.5: Only Assumption B.5 needs to be verified. We observe that Assumption B.5'(b) is a mixingale condition ensuring Assumption B.5(b, c) by the mixingale convergence theorem. To establish Assumption B.5(a), we see that Assumption B.5'(a.i) ensures that for K < ∞

κ_t ≡ ||E(ψ_t* | F_0)||_2 ≤ K ξ_t ,                                              (a.15)

where ξ_t is the mixingale memory coefficient. The fact that ξ_t is of size -2 implies that Σ_{t=0}^∞ κ_t < ∞. This establishes Assumption B.5(a.i). Similarly, Assumption B.5'(a.ii) imposes

ζ_t ≡ sup_{j≥0} ||E(ψ_t* ψ_{t+j}*' - E(ψ_t* ψ_{t+j}*') | F_0)||_2 ≤ K b_t .

That b_t is of size -2 ensures that Σ_{t=0}^∞ ζ_t < ∞. This establishes Assumption B.5(a.ii).

PROOF OF PROPOSITION II.3.2: See Gallant and White (1988a, Lemma 3.14).

PROOF OF PROPOSITION II.3.3: See Andrews (1989, Lemma 1).

PROOF OF PROPOSITION II.3.4: (a) We first observe that

E|U_t W_t - E^{t+m}_{t-m}(U_t W_t)|^2 ≤ E|U_t W_t - U_{t,m} W_{t,m}|^2,

where U_{t,m} = E^{t+m}_{t-m}(U_t) and W_{t,m} = E^{t+m}_{t-m}(W_t). Here we employ the fact that E^{t+m}_{t-m}(U_t W_t) is the best L_2-predictor of U_t W_t among all F^{t+m}_{t-m}-measurable functions. Hence

||U_t W_t - E^{t+m}_{t-m}(U_t W_t)||_2 ≤ ||U_t W_t - U_{t,m} W_{t,m}||_2 .

Similarly,

||U_t W_t - U_{t,m} W_t||_2 ≤ Δ^{3/2} ||U_t - U_{t,m}||_2^{1/2} .

Consequently, ||U_t W_t - E^{t+m}_{t-m}(U_t W_t)||_2 ≤ Δ^{3/2}(ν^{1/2}_{U,m} + ν^{1/2}_{W,m}).

(c) Using the same argument as in (b) we can show

||U_t U_{t+j} - E^{t+j+m}_{t-m}(U_t U_{t+j})||_2 ≤ ||U_t U_{t+j} - E^{t+m}_{t-m}(U_t) E^{t+j+m}_{t+j-m}(U_{t+j})||_2
≤ ||U_t U_{t+j} - U_t E^{t+j+m}_{t+j-m}(U_{t+j})||_2 + ||U_t E^{t+j+m}_{t+j-m}(U_{t+j}) - E^{t+m}_{t-m}(U_t) E^{t+j+m}_{t+j-m}(U_{t+j})||_2 .

Hence, with E_0(·) = E(· | F_0), we have

||E_0(U_t U_{t+j}) - E(U_t U_{t+j})||_2 ≤ ||E_0 E^{t+j+s}_{t-s}(U_t U_{t+j}) - E(U_t U_{t+j})||_2 + ||E_0[U_t U_{t+j} - E^{t+j+s}_{t-s}(U_t U_{t+j})]||_2 ,   (a.17)

where s = [t/2] is the integer part of t/2. By Jensen's inequality, the second term in (a.17) is bounded by K b_t, where K is a constant. It follows from Lemma 2.1 of McLeish (1975) and Lemma 3.14 of Gallant and White (1988a) that the first term in (a.17) satisfies the same type of bound, which gives the result.
PROOF OF COROLLARY II.3.5: We verify the conditions of Corollary II.2.3. Because the other conditions obviously hold, for the simple RM estimates it suffices to show that Assumptions A.2(b) and A.5' hold. Given Assumption C.2, it is straightforward to verify that f and ∇_δ f are such that |f(x, δ)| ≤ Q_1(x) and |∇_δ f(x, δ)| ≤ Q_2(x) for all δ ∈ D (compact), where Q_1 and Q_2 are Lipschitz continuous in x. Therefore,

|ψ(z, θ)| = |∇_δ f(x, δ)[y - f(x, δ)]| ≤ Q_2(x)[|y| + Q_1(x)],

so that Assumption A.2(b.i) holds for b(θ) = 1, h_1(z) = 1, and h_2(z) = Q_2(x)[|y| + Q_1(x)]. Next,

|ψ(z, θ_1) - ψ(z, θ_2)| = |∇_δ f(x, δ_1)[y - f(x, δ_1)] - ∇_δ f(x, δ_2)[y - f(x, δ_2)]|
≤ |∇_δ f(x, δ_1) y - ∇_δ f(x, δ_2) y| + |∇_δ f(x, δ_2) f(x, δ_2) - ∇_δ f(x, δ_1) f(x, δ_1)|.   (a.18)

It follows from Assumption C.2 that

|∇_δ f(x, δ_1) y - ∇_δ f(x, δ_2) y| ≤ |y| L_2(x) |δ_1 - δ_2|,

|∇_δ f(x, δ_2) f(x, δ_2) - ∇_δ f(x, δ_1) f(x, δ_1)|
≤ |∇_δ f(x, δ_2) f(x, δ_2) - ∇_δ f(x, δ_2) f(x, δ_1)| + |∇_δ f(x, δ_2) f(x, δ_1) - ∇_δ f(x, δ_1) f(x, δ_1)|
≤ |∇_δ f(x, δ_2)| L_1(x) |δ_1 - δ_2| + |f(x, δ_1)| L_2(x) |δ_1 - δ_2|.

Hence (a.18) becomes

|ψ(z, θ_1) - ψ(z, θ_2)| ≤ [|y| L_2(x) + Q_2(x) L_1(x) + Q_1(x) L_2(x)] |δ_1 - δ_2|.

This establishes Assumption A.2(b.ii).

Because |y|, L_1(x), L_2(x), Q_1(x) and Q_2(x) satisfy Lipschitz conditions, Proposition II.3.3 ensures that |Y_t|, L_1(X_t), L_2(X_t), Q_1(X_t) and Q_2(X_t) are NED on {V_t} of size -1. Because X_t is bounded, Q_1(X_t), Q_2(X_t), L_1(X_t) and L_2(X_t) are bounded. Because ||Y_t||_4 ≤ Δ, it follows from Proposition II.3.4(a) and Corollary 4.3(a) of Gallant and White (1988a) (i.e., sums of random variables NED of size -a are also NED of size -a) that h_3(Z_t) is NED on {V_t} of size -1/2. The mixing conditions of Assumption C.1 then ensure that {h_3(Z_t) - Eh_3(Z_t)} is a mixingale of size -1/2 by Proposition II.3.2. Similarly, {h_2(Z_t) - Eh_2(Z_t)} is a mixingale of size -1/2, establishing Assumption A.5'(b).

We next verify that for each θ ∈ Θ, {ψ(Z_t, θ)} is a mixingale of size -1/2. Fix θ (= δ). Observe that the Lipschitz condition on f(·, δ) and the conditions on {Z_t} imply by Proposition II.3.3 that {f(X_t, δ)} is NED on {V_t} of size -1. The triangle inequality implies that {Y_t - f(X_t, δ)} is NED on {V_t} of size -1, and the continuity of f(·, δ), the boundedness of X_t, and the fact that ||Y_t||_4 ≤ Δ < ∞ imply that ||Y_t - f(X_t, δ)||_4 ≤ Δ̃ < ∞. The Lipschitz condition on ∇_δ f(·, δ) and the conditions on {Z_t} imply by Proposition II.3.3 that {∇_δ f(X_t, δ)} is also NED on {V_t} of size -1. Further, the elements of ∇_δ f(X_t, δ) are bounded, so that by Proposition II.3.4(a) {ψ(Z_t, θ) = ∇_δ f(X_t, δ)[Y_t - f(X_t, δ)]} is NED on {V_t} of size -1/2. It follows from Proposition II.3.2 that {∇_δ f(X_t, δ)[Y_t - f(X_t, δ)]} is a mixingale of size -1/2, given the mixing conditions imposed on {V_t} by Assumption C.1. Thus, Assumption A.5'(a) holds, and the result for the simple RM procedure follows.

For the modified RM estimates we first note that every element of G^{-1} is bounded above,
so that |G^{-1}| ≤ Δ for some Δ < ∞. Now

|G^{-1} ∇_δ f(x, δ)[y - f(x, δ)]| ≤ Δ Q_2(x)[|y| + Q_1(x)]

and

|ψ_1(z, θ)| = |vec(∇_δ f(x, δ) ∇_δ f(x, δ)' - G)| ≤ |vec(∇_δ f(x, δ) ∇_δ f(x, δ)')| + |vec G| = |∇_δ f(x, δ)|^2 + |vec G|,

where we use the fact that |vec A| = [tr(A'A)]^{1/2}. Hence Assumption A.2(b.i) holds.

We now establish a mean-value expansion result for G^{-1}. Recall that G is restricted to a convex compact set Γ, so the mean value theorem applies. A matrix differentiation result shows that when G is symmetric and nonsingular, dG^{-1}/dg_{ij} = -G^{-1} S_{ij} G^{-1}, where g_{ij} is the ij-th element of G and S_{ij} is a selection matrix whose every element is zero except that the ij-th and ji-th elements are one; see Graybill (1983, p. 358). Hence we can write

d vec(G^{-1})/dg_{ij} = -vec(G^{-1} S_{ij} G^{-1}).

Next,

|ψ_1(z, θ_1) - ψ_1(z, θ_2)| ≤ |vec(∇_δ f(x, δ_1) ∇_δ f(x, δ_1)') - vec(∇_δ f(x, δ_2) ∇_δ f(x, δ_2)')| + |vec(G_2 - G_1)|.   (a.19)

The first term of (a.19) is less than

|(I ⊗ ∇_δ f(x, δ_1)) [∇_δ f(x, δ_1) - ∇_δ f(x, δ_2)]| + |(∇_δ f(x, δ_2) ⊗ I) [∇_δ f(x, δ_1) - ∇_δ f(x, δ_2)]|,

where we used the fact that vec(ABC) = (C' ⊗ A) vec B. It can be verified that |I ⊗ ∇_δ f(x, δ_1)| ≤ K Q_2(x) and |∇_δ f(x, δ_2) ⊗ I| ≤ K Q_2(x), with K depending only on k, the dimension of δ. Thus, (a.19) becomes

|ψ_1(z, θ_1) - ψ_1(z, θ_2)| ≤ 2K Q_2(x) L_2(x) |δ_1 - δ_2| + |vec(G_2 - G_1)| ≤ h_3'(z) |θ_1 - θ_2|,

where h_3'(z) = 2K Q_2(x) L_2(x) + 1. Hence Assumption A.2(b.ii) holds, as

|ψ(z, θ_1) - ψ(z, θ_2)| ≤ |ψ_1(z, θ_1) - ψ_1(z, θ_2)| + |ψ_2(z, θ_1) - ψ_2(z, θ_2)| ≤ h_3(z) |θ_1 - θ_2|,

with h_3(z) = h_3'(z) + h_3''(z), where h_3''(z) is the corresponding Lipschitz function for ψ_2. Using the same arguments as before we have that {h_2(Z_t) - Eh_2(Z_t)}, {h_3(Z_t) - Eh_3(Z_t)}, and {ψ(Z_t, θ) - Eψ(Z_t, θ)} are mixingales of size -1/2. Hence Assumption A.5' also holds. This yields the desired results for the modified RM estimates. The conclusions for the quick RM estimates follow because the quick algorithm is a special case of the modified algorithm.

PROOF OF COROLLARY II.3.6: We verify the conditions of Corollary II.2.5. For the simple RM estimates we need to show that Assumptions B.2(b) and B.5' hold. In this case

∇_θ ψ(z, θ) = ∇_δ(∇_δ f(x, δ)[y - f(x, δ)]) = ∇_δδ f(x, δ)[y - f(x, δ)] - ∇_δ f(x, δ) ∇_δ f(x, δ)',

hence for θ in int Θ and θ^0 in Θ^0

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| = |∇_δδ f (y - f) - ∇_δ f ∇_δ f' - ∇_δδ f^0 (y - f^0) + ∇_δ f^0 ∇_δ f^0'|
≤ |∇_δδ f y - ∇_δδ f^0 y| + |(∇_δδ f^0) f^0 - (∇_δδ f) f| + |∇_δ f^0 ∇_δ f^0' - ∇_δ f ∇_δ f'|,

where we have written f = f(x, δ), f^0 = f(x, δ^0), etc. By Assumption D.2, θ^0 = δ^0 = δ*. Applying Assumption D.3 we get

|∇_δδ f y - ∇_δδ f^0 y| ≤ |y| L_3(x) |δ - δ^0|,

|(∇_δδ f^0) f^0 - (∇_δδ f) f| ≤ |(∇_δδ f^0) f^0 - (∇_δδ f^0) f| + |(∇_δδ f^0) f - (∇_δδ f) f|
≤ |∇_δδ f^0| L_1(x) |δ - δ^0| + Q_1(x) L_3(x) |δ - δ^0|,

since |∇_δδ f^0| ≤ Q_3(x), with Q_3 Lipschitz-continuous in x by straightforward arguments. Further,

|∇_δ f^0 ∇_δ f^0' - ∇_δ f ∇_δ f'| ≤ |∇_δ f^0 ∇_δ f^0' - ∇_δ f^0 ∇_δ f'| + |∇_δ f^0 ∇_δ f' - ∇_δ f ∇_δ f'| ≤ 2 Q_2(x) L_2(x) |δ - δ^0|,

so that

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| ≤ [|y| L_3(x) + Q_3(x) L_1(x) + Q_1(x) L_3(x) + 2 Q_2(x) L_2(x)] |δ - δ^0|.   (a.20)

For the modified RM estimates, we have

|G^{-1}[∇_δδ f (y - f) - ∇_δ f ∇_δ f'] - (G^0)^{-1}[∇_δδ f^0 (y - f^0) - ∇_δ f^0 ∇_δ f^0']|
≤ |G^{-1}[∇_δδ f (y - f) - ∇_δ f ∇_δ f' - ∇_δδ f^0 (y - f^0) + ∇_δ f^0 ∇_δ f^0']| + |(G^{-1} - (G^0)^{-1})[∇_δδ f^0 (y - f^0) - ∇_δ f^0 ∇_δ f^0']|.   (a.22)

It follows from (a.20) that the first term in (a.22) is less than

Δ [|y| L_3(x) + Q_3(x) L_1(x) + Q_1(x) L_3(x) + 2 Q_2(x) L_2(x)] |θ - θ^0|.

It can also be verified that the second term in (a.22) is less than

|G^{-1} - (G^0)^{-1}| [Q_3(x)(|y| + Q_1(x)) + (Q_2(x))^2] ≤ [Q_3(x)(|y| + Q_1(x)) + (Q_2(x))^2] |vec(G - G^0)|.

Thus (a.22) becomes

|G^{-1}[∇_δδ f (y - f) - ∇_δ f ∇_δ f'] - (G^0)^{-1}[∇_δδ f^0 (y - f^0) - ∇_δ f^0 ∇_δ f^0']| ≤ h_4''(z) |θ - θ^0|,

where h_4''(z) ≡ Δ [|y| L_3(x) + Q_3(x) L_1(x) + Q_1(x) L_3(x) + 2 Q_2(x) L_2(x)] + Q_3(x)(|y| + Q_1(x)) + (Q_2(x))^2.

We also note the fact that |A| ≤ |vec A| ≤ Σ_i Σ_j |a_{ij}|, where A is a square matrix and a_{ij} are its elements. Combining these results we immediately get

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| ≤ h_4(z) |θ - θ^0|,

where h_4(z) is the sum of the bounding functions just obtained. This establishes Assumption B.2(b). All other conditions can be verified as in the proof for the simple RM algorithm. Thus the asymptotic distribution of θ̂_t follows from Corollary II.2.5 with H* = E(∇_θ ψ_t*), where ∇_θ ψ(z, θ) is given by (a.21).

where the first equality follows from the fact that exp[(-I_k/2)c] = [exp(-c/2)] I_k. For the quick RM algorithm, H_3* is also block triangular, and the lower right k×k block of Σ_3 can be obtained in the same way. It follows from the lower right k×k block of F_3* that

(t+1)^{1/2}(δ̂_t - δ*) converges in distribution to N(0, F_3*).

We now show that F_2* - (G*)^{-1} Σ_1 (G*)^{-1} is a positive semidefinite matrix. From Theorem II.2.4(c) we get

-Σ_1 = H_1 F_1* + F_1* H_1' = (H_1* + I_k/2) F_1* + F_1* (H_1*' + I_k/2) = H_1* F_1* + F_1* H_1*' + F_1*.

Substituting this expression for Σ_1 and rearranging shows that F_2* - (G*)^{-1} Σ_1 (G*)^{-1} can be written as a quadratic form that is positive semidefinite, where (F_1*)^{1/2} is such that (F_1*)^{1/2}(F_1*)^{1/2} = F_1*. Since Σ̃ = Σ_1, the result holds.

PROOF OF COROLLARY II.4.1: Owing to the compactness of the relevant domains, the special structure of f in (II.4.1) and the continuous differentiability of G, it is straightforward to verify the domination and Lipschitz conditions required for application of Corollary II.3.5.

PROOF OF COROLLARY II.4.2: Direct application of Corollary II.3.6.

TABLE 1
DETERMINISTIC CHAOS APPROXIMATED BY LINEAR MODEL†

                 Logistic Map    Circle Map    Bier-Bountis Map
N                    250             250              250
σ̂                  .1967           .2957            1.787
R²                  .3595           .1202            .8472
SIC                 -1.60           -1.20             1.21

† N = number of observations; σ̂ = regression standard error; R² = squared multiple regression coefficient; SIC = Schwartz Information Criterion: SIC = log σ̂ + k (log N)/(2N), where k = number of estimated coefficients (= 2).
TABLE 2
DETERMINISTIC CHAOS APPROXIMATED BY SINGLE HIDDEN LAYER FEEDFORWARD NETWORK†

                 Logistic Map    Circle Map    Bier-Bountis Map
q*                     8               8                8
N                    250             250              250
σ̂              2.68 × 10⁻⁴     1.35 × 10⁻³      2.34 × 10⁻²
R²                  .9999           .9999            .9999
SIC                 -7.93           -6.32            -3.46

† q* = SIC-optimal number of hidden units; remaining symbols as in Table 1.


REFERENCES

Albert, A.E., and L.A. Gardner (1967): Stochastic Approximation and Nonlinear Regression, Cambridge: M.I. T. Press.

Amemiya, T. (1981): "Qualitative Response Models: A Survey," Journal of Economic Literature

19, 1483-1536.
Amemiya, T. (1985): Advanced Econometrics. Cambridge: Harvard University Press.

Andrews, D. W.K. (1989): "An Empirical Process Central Limit Theorem for Dependent NonIdentically Distributed Random Variables," Cowles Foundation Discussion Paper, Yale University.

Andrews, D.W.K. (1991a): "Asymptotic Optimality of Generalized CL, Cross-validation and Generalized Cross-validation in Regression with Heteroskedastic Errors," Journal of
Econometrics 47,359-378

Andrews, D.W.K. (1991b): "Asymptotic Normality of Series Estimators for Nonparametric and Semi-parametric Regression Models," Econometrica 59, 307-345.

Arnold, L. (1974): Stochastic Differential Equations: Theory and Applications. New York: John Wiley & Sons.

Barron, A. (1990): "Complexity Regularization with Application to Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 57.

Barron, A. (1991a): "Universal Approximation Bounds for Superpositions of a Sigmoidal Function," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 58.

Barron, A. (1991b): "Approximation and Estimation Bounds for Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 59.

Baxt, W.G. (1991): "The Optimization of the Training of an Artificial Neural Network Trained to Recognize the Presence of Myocardial Infarction by the Variance of Disease Likelihood," UC San Diego Medical Center Technical Report.
Bierens, H. (1990): "A Consistent Conditional Moment Test of Functional Form," Econometrica 58, 1443-1458.

Billingsley, P. (1968): Convergence of Probability Measures. New York: John Wiley & Sons.

Billingsley, P. (1979): Probability and Measure. New York: Wiley.

Blum, J.R. (1954): "Approximation Methods Which Converge with Probability One," Annals of
Mathematical Statistics 25,382-386.

Blum, E.K. and L.K. Li (1991): "Approximation Theory and Feedforward Networks," Neural Networks 4, 511-516.

Carroll, S.M. and B.W. Dickinson (1989): "Construction of Neural Nets Using the Radon Transform," in Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:607-611.

Cybenko, G. (1989): "Approximation by Superpositions of a Sigmoid Function," Mathematics of Control, Signals, and Systems 2, 303-314.

Cowan, J. (1967): "A Mathematical Theory of Central Nervous Activity," unpublished Ph.D. dissertation, University of London.

Davies, R.B. (1977): "Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative," Biometrika 64, 247-254.

Davies, R.B. (1987): "Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative," Biometrika 74, 33-43.


Domowitz, I. and H. White (1982): "Misspecified Models with Dependent Observations," Journal of Econometrics 20, 35-58.

Duffie, D. and K.J. Singleton (1990): "Simulated Moments Estimation of Markov Models of Asset Prices," NBER Technical Paper 87.

Elbadawi, I., A.R. Gallant and G. Souza (1983): "An Elasticity Can be Estimated Consistently Without A Priori Knowledge of Functional Form," Econometrica 51, 1731-1752.

Elman, J.L. (1988): "Finding Structure in Time," CRL Report 8801, Center for Research in Language, UC San Diego.

Englund, J.-E., U. Holst, and D. Ruppert (1988): "Recursive M-Estimators of Location and Scale for Dependent Sequences," Scandinavian Journal of Statistics 15, 147-159.

Fabian, V. (1968): "On Asymptotic Normality in Stochastic Approximation," Annals of Mathematical Statistics 39, 1327-1332.

Foutz, R.V. and R.C. Srivastava (1977): "The Performance of the Likelihood Ratio Test When the Model is Incorrect," Annals of Statistics 5, 1183-1194.

Friedman, J.H. and W. Stuetzle (1981): "Projection Pursuit Regression," Journal of the American Statistical Association 76, 817-823.

Fukushima, K. and S. Miyake (1984): "Neocognitron: A New Algorithm for Pattern Recognition Tolerant of Deformations and Shifts in Position," Pattern Recognition 15, 455-469.

Funahashi, K. (1989): "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks 2, 183-192.

Gallant, A.R. (1973): "Inference for Nonlinear Models," North Carolina State University, Institute of Statistics, Mimeograph Series No, 875.

Gallant, A.R. (1975):

"Testing a Subset of the Parameters of a Nonlinear

Regression Model,"

Journal of the American Statistical Association 70,927-932.

Gallant, A.R. (1981): "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form," Journal of Econometrics 15, 211-245.

Gallant, A.R. (1987): "Identification and Consistency in Seminonparametric Regression," in T. Bewley ed., Advances in Econometrics Fifth World Congress. New York: Cambridge University Press, pp. 145-170.

Gallant, A.R. and H. White (1988a): A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Oxford: Basil Blackwell.

Gallant, A.R. and H. White (1988b): "There Exists a Neural Network that Does Not Make Avoidable Mistakes," Proceedings of the Second Annual IEEE Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:657-664.

Gallant, A.R. and H. White (1991): "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks 4 (to appear).

Gamba, A., L. Gamberini, G. Palmieri and R. Sanna (1961): "Further Experiments with PAPA," Nuovo Cimento Suppl. 20, 221-231.

Gauss, K.F. (1809): Theoria Motus Corporum Coelestium. English translation (1963): Theory of the Motion of Heavenly Bodies. New York: Dover.

Geman, S. and C. Hwang (1982): "Nonparametric Maximum Likelihood Estimation by the Method of Sieves," Annals of Statistics 10, 401-414.

Gerencser, L. (1986): "Parameter Tracking of Time-Varying Continuous-Time Linear Stochastic Systems," in C.E. Byrnes and A. Lindquist eds., Modelling, Identification and Robust Control. New York: Elsevier, pp. 581-594.

Goldstein, L. (1988): "On the Choice of Step Size in the Robbins-Monro Procedure," Statistics and Probability Letters 6, 299-303.

Gourieroux, C., A. Monfort and A. Trognon (1984a): "Pseudo-Maximum Likelihood Methods: Theory," Econometrica 52, 681-700.

Gourieroux, C., A. Monfort and A. Trognon (1984b): "Pseudo-Maximum Likelihood Methods: Application to Poisson Models," Econometrica 52, 701-720.

Graybill, F.A. (1983): Matrices with Applications in Statistics, second edition. Belmont: Wadsworth.

Grenander, U. (1981): Abstract Inference. New York: Wiley.

Hansen, B. (1991):

"Inference

When a Nuisance Parameter is Not Identified

Under the Null

Hypothesis," University of Rochester Department of Economics Discussion Paper.

Hecht-Nielsen, R. (1989): "Theory of the Back-Propagation Neural Network," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:593-606.

Hendry, D.F. and J.-F. Richard

(1990):

"Likelihood Evaluation for Dynamic Latent Variable

Models," Duke Institute of Statistics and Decision Sciences Discussion Paper 90A15

Hornik, K. (1991): "Approximation Capabilities of Multilayer Feedforward Nets," Neural Networks 4, 231-242.

Hornik, K. and C.-M. Kuan (1990): "Convergence of Learning Algorithms with Constant Learning Rates," University of Illinois at Urbana-Champaign Department of Economics Discussion Paper.

Hornik, K., M. Stinchcombe, and H. White (1989): "Multi-Layer Feedforward Networks Are Universal Approximators," Neural Networks 2, 359-366.

Hornik, K., M. Stinchcombe and H. White (1990): "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks 3, 551-560.

Hu, S. and W. Joerding (1990): "Monotonicity and Concavity Restrictions for a Single Hidden Layer Feedforward Network," Washington State University Department of Economics Discussion Paper.

Huber, P.J. (1964):

"Robust Estimation

of a Location Parameter," Annals of Mathematical

Statis-

tics 35,73-101.

Huber, P.J. (1967): "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1, pp. 221-233.

Huber, P.J. (1985): "Projection Pursuit," Annals of Statistics, 13,435-475.

Joerding, W. and J. Meador (1990): "Encoding A Priori Information in Neural Networks," Washington State University Department of Economics Discussion Paper.

Jones, L.K. (1991): "A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training," Annals of Statistics (forthcoming).

Jordan, M.I. (1986): "Serial Order: A Parallel Distributed Processing Approach," UC San Diego, Institute for Cognitive Science Report 8604.

Kuan, C.-M. (1989): "Estimation of Neural Network Models," Ph.D. Dissertation, UC San Diego.

Kuan, C.-M., K. Hornik and H. White (1990): "Some Convergence Results for Learning in Recurrent Neural Networks," UCSD Department of Economics Discussion Paper.

Kuan, C.-M. and H. White (1991): "Strong Convergence of Recursive m-estimators for Models with Dynamic Latent Variables," UC San Diego Department of Economics Discussion Paper 91-05R.

Kushner, H.J. (1987): "Asymptotic Global Behavior for Stochastic Approximation and Diffusions with Slowly Decreasing Noise Effects: Global Minimization via Monte Carlo," SIAM Journal of Applied Mathematics 47, 169-185.

Kushner, H.J. and D.S. Clark (1978): Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

Kushner, H.J. and H. Huang (1979): "Rates of Convergence for Stochastic Approximation Type Algorithms," SIAM Journal of Control and Optimization 17, 607-617.

Kushner, H.J. and H. Huang (1981): "Asymptotic Properties of Stochastic Approximations with Constant Coefficients," SIAM Journal of Control and Optimization 19, 87-105.

Lapedes, A. and R. Farber (1987): "Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling," Los Alamos National Laboratory Technical Report.

Le Cun, Y. (1985): "Une Procedure d'Apprentissage pour Reseau a Seuil Assymetrique," Proceedings of Cognitiva 85, 599-604.

Lee, T.H., H. White and C.W.J. Granger (1991): "Testing for Neglected Nonlinearity in Time Series Models," Journal of Econometrics (forthcoming).

Li, K.-C. (1987): "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," Annals of Statistics 15, 958-975.

Ljung, L. (1977): "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control AC-22, 551-575.
Ljung, L. and T. Soderstrom (1983): Theory and Practice of Recursive Identification. Cambridge:
M.I.T. Press.

Lukacs, E. (1975): Stochastic Convergence. 2nd ed., New York: Academic Press.

Marcet, A. and T.J. Sargent (1989): "Convergence of Least Squares Learning Mechanisms in Self Referential, Linear Stochastic Models," Journal of Economic Theory 48, 337-368.

Maxwell, T., G.L. Giles, Y.C. Lee and H.H. Chen (1986): "Nonlinear Dynamics of Artificial Neural Systems," in J. Denker ed., Neural Networks for Computing. New York:
American Institute of Physics.

McLeish, D.L. (1975): "A Maximal Inequality and Dependent Strong Laws," Annals of Probability 3, 829-839.

Metivier, M. and P. Priouret (1984): "Applications of a Kushner and Clark Lemma to General Classes of Stochastic Algorithm," IEEE Transactions on Information Theory IT-30, 140-151.

McCulloch, W.S. and W. Pitts (1943): "A Logical Calculus of the Ideas Immanent in Nervous
Activity ," Bulletin of Mathematical Biophysics 5, 115-133.

Minsky, M. and S. Papert (1969): Perceptrons.

Cambridge:

MIT Press.

Morris, R. and W.-S. Wong (1991): "Systematic Choice of Initial Points in Local Search: Extensions and Application to Neural Networks," Information Processing Letters (forthcoming).

Newey, W. (1985): "Maximum Likelihood Specification Testing and Conditional Moment Tests," Econometrica 53, 1047-1070.

Palmieri, G. and R. Sanna (1960): Methodos 12, No. 48.

Parker, D.B. (1982): "Learning Logic," Invention Report 581-64 (File 1), Stanford University Office of Technology Licensing.

Parker, D.B. (1985): "Learning Logic," MIT Center for Computational Research in Economics and Management Science Technical Report TR-47.

Potscher, B. and I. Prucha (1991a): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part I: Consistency and Approximation Concepts," Econometric Reviews (forthcoming).

Potscher, B. and I. Prucha (1991b): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part II: Asymptotic Normality," Econometric Reviews (forthcoming).

Robbins, H. and S. Monro (1951): "A Stochastic Approximation Method," Annals of Mathematical Statistics 22, 400-407.

Rosenblatt, F. (1957): "The Perceptron: A Perceiving and Recognizing Automaton," Project PARA, Cornell Aeronautical Laboratory Report 85-460-1.

Rosenblatt,

F. (1958):

"The Percdptron:

A Probabilistic

Model

for Information

Storage and

Organization in the Brain," Psychological Reviews 62, 386-408.

Rosenblatt, F. (1961): Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington D.C.: Spartan Books.

Rumelhart, D.E., G.E. Hinton and R.J. Williams (1986): "Learning Internal Representations by Error Propagation," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: M.I.T. Press, 1, pp. 318-362.

Ruppert, D. (1983): "Convergence of Stochastic Approximation Algorithms with Non-Additive Dependent Disturbances and Applications," in U. Herkenrath, D. Kalin and W. Vogel eds., Mathematical Learning Models-Theory and Algorithms. New York: Springer-Verlag, pp. 182-190.

Sawa, T. (1978): "Information Criteria for Discriminating Among Alternative Regression Models," Econometrica 46, 1273-1292.

Sejnowski, T. and C. Rosenberg (1986): "NETtalk: A Parallel Network That Learns to Read Aloud," Johns Hopkins University Department of Electrical Engineering and Computer Science Technical Report 86/01.

Selfridge, O., R. Sutton and A. Barto (1985): "Training and Tracking in Robotics," Proceedings of the Ninth International Joint Conference on Artificial Intelligence. Los Angeles: Morgan Kaufman, 1, pp. 670-672.

Sontag, E. (1990): "Feedback Stabilization Using Two-Hidden-Layer Nets," Rutgers Center for Systems and Control Technical Report SYCON-90-11.

Stinchcombe, M. (1991): "Inner Functions and Universal Approximation Properties," UC San Diego Department of Economics Discussion Paper.

Stinchcombe, M. and H. White (1989): "Universal Approximation Using Feedforward Networks With Non-Sigmoid Hidden Layer Activation Functions," Proceedings of the International Joint Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:612-617.

Stinchcombe, M. and H. White (1991): "Consistent Specification Testing Using Duality," UC San Diego Department of Economics Discussion Paper.

Sussman, H. (1991): "Uniqueness of the Weights for Minimal Feedforward Nets with a Given Input-Output Map," Rutgers Center for Systems and Control Technical Report SYCON-91-06.

Sydsaeter, K. (1981): Topics in Mathematical Analysis for Economists. New York: Academic Press.

Tauchen, G. (1985): "Diagnostic Testing and Evaluation of Maximum Likelihood Models,"


Journal of Econometrics 30,415-444.

Tesauro, G. (1989): "Neurogammon Wins Computer Olympiad," Neural Computation 1, 321-323.

Thompson, J.M.T. and H.B. Stewart (1986): Nonlinear Dynamics and Chaos. New York: Wiley.

Walk, H. (1977): "An Invariance Principle for the Robbins-Monro Process in a Hilbert Space," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 30, 135-150.

Werbos, P. (1974):

"Beyond Regression: New Tools for Prediction and Analysis in the

Behavioral Sciences," unpublished Ph.D. Dissertation, Harvard University, Department of Applied Mathematics.

White, H. (1981): "Consequences and Detection of Misspecified Nonlinear Regression Models," Journal of the American Statistical Association 76, 419-433.

White, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica

50, 1-25.
White, H. (1987a): "Some Asymptotic Results for Back-Propagation," Proceedings of the IEEE First International Conference on Neural Networks, San Diego. New York: IEEE Press,pp. III:261-266.

White, H. (1987b):

"Specification Testing in Dynamic Models," in Truman Bewley ed.,

Advances in Econometrics Fifth World Congress. New York: Cambridge University

Press, pp. 1-58.


White, H. (1988): "Economic Prediction Using Neural Networks: The Case of IBM Stock Prices," Proceedings of the Second Annual IEEE Conference on Neural Networks. New York: IEEE Press, pp. II:451-458.

White, H. (1989a): "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models," Journal of the American Statistical Association 84, 1003-1013.

White, H. (1989b): "An Additional Hidden Unit Test for Neglected Nonlinearity," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. II:451-455.

White, H. (1990a): "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks 3,535-549.

White, H. (1990b): "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," UC San Diego Department of Economics Discussion Paper.

White, H. (1992): Estimation, Inference and Specification Analysis. New York: Cambridge University Press (forthcoming).

White, H. and J. Wooldridge (1991): "Some Results for Sieve Estimation with Dependent Observations," in W. Barnett, J. Powell and G. Tauchen eds., Nonparametric and Semiparametric Methods in Economics. New York: Cambridge University Press, pp. 459-493.

Widrow, B. and M.E. Hoff (1960): "Adaptive Switching Circuits," Institute of Radio Engineers WESCON Convention Record, Part 4, 96-104.

Williams, R. (1986): "The Logic of Activation Functions," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: MIT Press, 1, pp. 423-443.

Williams, R.J. and D. Zipser (1989): "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," Neural Computation 2, 270-280.

Woodford, M. (1990): "Learning to Believe in Sunspots," Econometrica 58, 277-308.

Xu, X. and W.T. Tsai (1990): "Constructing Associative Memories Using Neural Networks," Neural Networks 3, 301-310.

Xu, X. and W.T. Tsai (1991): "Effective Neural Algorithms for the Traveling Salesman Problem," Neural Networks 4, 193-206.

Young, P.C. (1984): Recursive Estimation and Time-Series Analysis. New York: Springer Verlag.
