Sunteți pe pagina 1din 98

-2-

INTRODUCnON
Artificial neural networks are a class of models developed by cognitive scientists

interested in understanding how computation is performed by the brain. These networks are
capable of learning through a process of trial and error that can be appropriately viewed as sta-

tistical estimation of model parameters. Although inspired by certain aspects of the way infonnation is processed in the brain, these network models and their associated learning paradigms are still far from anything
clo~e to a realistic description of how brains actually work. They nevertheless provide a rich,

powerful and interesting modeling framework with proven and potential application across the sciences. To mention just a handful of such applications, artificial neural networks have been successfully used to translate printed English text into speech (Sejnowski and Rosenberg, 1986), to recognize hand-printed characters (Fukushima and Miyake, 1984), to perform complex coordination tasks (Selfridge, Sutton and Barto, 1985), to play backgammon (Tesauro, 1989), to diagnose chest pain CEaxt, 1991), and to decode deterministic chaos (Lapedes and Farber, 1987; White, 1989; Ga11ant and White ,1991). Successesin these and other areas suggest that artificial neural network models may
serve as a useful addition to the tool-kits of economists and econometricians. Areas with particu-

lar potential for application include time-series modeling and forecasting, nonparametric estimation, and learning by economic agents. The purpose of this article is two-fold: first, to review the basic concepts and theory required to make artificial neural networks accessible to economists and econometricians, with particular focus on econometrically relevant methodology; and second, to develop theory for a
leading neural network learning paradigm to a point comparable to that of the modem theory of

estimation and inference for misspecified nonlinear dynamic models (e.g., Gallant and White, 1988a; Potscher and Prucha, 1991a,b). As we hope will become apparent from our development, not only do artificial neural networks have much to offer economics and econometrics, but there is also considerable

-3 -

potential for economics and econometrics to benefit the neural network field, arising to a considerable degree from economic and econometric experience in modeling and estimating dynamic
systems. Thus, a larger goal of this article is to provide an entry point and appropriate back-

ground for those wishi:tlg to engage in tl}e fascinating intellectual arbitrage required to fully realize the potential gains from trade between economics, econometrics and artificial neural networks.

PART I: OVERVmW 1.1. ARTIFICIAL

AND HEURISTICS MODEL~

NEURAL NETWORK

The simplest general artificial neural network (ANN) models draw primarily on three features of the way that biological neural networks process information: massive parallelism, nonlinear neural unit response to neural unit input, and processing by multiple layers of neural
units. Incorporation of a fourth feature, dynamic feedback among units, leads to even greater

generality and richness. In this section, we describe how these features are embodied in now standard approaches to ANN modeling, and some of the implications of these embodiments. Because of the very considerable breadth of ANN paradigms, we cannot do justice to the entire spectrum of such models; instead, we focus our attention on those most easily related to and with greatest relevance for econometrics. Although not usua1ly thought of in such terms, para1lelism is a familiar aspect of
econometric modeling. A schematic of a simple parallel processing network is shown in Figure

1. Here, input unit ("sensors") send real-valued signals (Xi, i = 1, ..., r) in parallel over connections to subsequent units, designate,d"output units" for now. The signal from input unit i to output unit j may be attenuated or amplified by a factor r ji E IR, so that signals Xi r ji reach output unit j, i = 1, ..., r. The factors r ji are known as "network weights" or "connection strengths."

In simple ANN models, the receiving units process parallel incoming signals in typically simple ways. The simplest is to add the signals seen by the receiver, in which case the output unit produces output

r L Xi r ji , i=l

1, ...,

v.

If, as is common, we permit an input, say Xo, to supply Xo = 1 to the network ( a "bias unit" in network jargon), output can be represented as

/j(x, r) =x'rj

j=l,...,v,

or

(x,

r)

= (1!&1

x)r

where f = (f1, ..., fv)~, x =(1, X1, ..., xr)~, r = <r~1, ..., r~v)~, and rj = <rjO, rj1,

..., rjr)~.

The

"out-

put function"

f is easily recognized

as the systematic part of the standard system of seemingly

unrelated (linear) equations; in the neural network literature, an electronic version of this network was introduced as the MADALINE
Hoff(1960).

(Multiple Adaptive Linear) network by Widrow and


network of Widrow and

When v = 1 (only a single output), we have the ADALlNE

Hoff (1960), easily recognized as the simple linear model, the workhorse of empirical econometrics.
In biological neural systems, the number of processing units can range into the mil-

lions or billions and beyond (hence the teml "massive" parallelism). While such numbers are not usually encountered in economic models, the essential feature of parallel processing is common to both. From the outset of their development, the behavior of artificial neural networks was fornlulated to include another stylized feature of biological systems. This is the tendency of certain types of neurons to be quiescent in the presence of modest levels of input activity, and to become active themselves only after input activity passes a particular threshold. Beyond this threshold, increases in input activity have little further effect This introduces the fundamental
feature of nonlinear response into the ANN paradigm.

For present purposes, it suffices to think either of neural units switching on or off, or to imagine a single dimension along which neural activity (e.g. neural firing rate) can smoothly vary from fully off to fully on. In their seminal article, McCulloch and Pitts (1943) considered

-5 -

the first possibility, proposing networks with output unit activity given by

h(x,

y)

G(x'yj)

j = 1,

, v,
the "Heaviside" or

where G(a) = 1 if a > 0 and G(a) = 0 if a $; 0. This choice for G implements

unit step function. Output unit j thus turns on when X'rj > 0, i.e. when input activity L;=l Xirji exceeds the threshold -rjo. For this reason the Heaviside function is said to implement a "threshold logic unit" (1LU). G is called the "activation function" of the (output) unit Networks with TLU's are appropriate for classification and recognition tasks: the study of such networks exclusively pre-occupied the ANN field through the 1950's and dominated the field through the 1960's. In retrospect, a major breakthrough in the ANN literature

occurred when it was proposed to replace the Heaviside activation function with a smooth sigmoid (s-shaped) function, 1967). Instead of switching in particular the logistic function, G(a) = 1/(1 + exp(-a (Cowan,

abruptly from off to on, "sigmoidal"

units turn on gradually as input from the ANN standpoint will however, binary logit we observe that model

activity increases. The reason why this constituted a breakthrough


be discussed in the next section.

With

this

modification, the familiar

/j(x, r) = G(x'rj)

= 1/(1 + exp(-x'rj

is precisely

probability

(e.g. Amemiya, 1981; 1985, p. 268). Other choices for G yield other models appropriate for classification or qualitative response modeling; for example, if G is the normal cumulative distribution function, we have the binary probit model, etc. As Amemiya (1981) documents in his
classic survey, such models have great utility in econometric applications where binary

classifications or decisions are involved. Although biological networks with direct connections from input to output units are well-known (e.g.. the knee-jerk reflex is mediated by direct connections from sensory receptors in the knee onto motoneurons in the spinal cord that then activate leg muscles), it is much more common to observe processing occurring in multiple layers of units. For example, six distinct processing layers are at work in the human cortex. Such multilayered structures were introduced into the ANN literature by Rosenblatt (1957, 1958) and by Gamba and his associates (palmieri

-6 -

and Sanna, 1960; Gamba, et. al., 1961). Figure 2 shows a schematic diagram of a network containing a single intemlediate layer of processing units separating input from output. Intermediate

layers of this sort are often caned "hidden" layers to distinguish them from the input and output layers. Processing in such networks is straightforward. Units in one layer treat the units in the preceding layer as input, and produce outputs to be processed by the succeeding layer. The output function for such a network with a single hidden (as in Figure 2) is thus of the form

fh(X, ()) = F ({3hO + LJ=l Here F: 1R ~

G(X"rj){3hj), function,

h =

1, ...,

(1.1.1)

1R is the output activation

and {3hj, j = 0, 1, ..., q, h = 1, ..., v are con.

nection strengths from hidden unit j ( j = O indexes a bias unit) to output unit h. The vector

8 = ({3'1,

,j3'v,r'l,

.."r'q)

(with

j3'h = (j3hO. ...,j3hq))

collects

together

all network

weights.

Note that we have q hidden units.


As originally introduced, the hidden layer network activation functions F and G

implemented

11..U's.

However,

modern practice permits F and G to be chosen quite freely. (the logistic) simplicity or F(a) = a (the identity) generality, and we

Leading choices are F(a) = G(a) = 1/(1 + exp(-a)) G(a) = 1/(1 + exp(-a)). Because of its notational

and considerable

adopt the latter choice, and for further simplicity

set v = 1. Thus we shall pay particular attention

to "single hidden layer" networks with output functions of the form

f(x,

q 0) = 130 + L j=l

G(x'rj)13j

(1.1.2)

Although we have seen econometrically familiar models emerge in our foregoing discussion of ANN models (e.g. seemingly unrelated regression systems and logit models), equation (1.1.2) is not so familiar. .It does bear a strong resemblance to the projection pursuit models of modem statistics (Friedman and Stuetzle, 1981; Huber, 1985) in which output response is given by

7q f(x, f) =/30 + L -a Gj(X'rj)fJj.

j=l

However, in projection pursuit models the functions Gj are unknown and must be estimated from data (perrnittingf3 j to be absorbed into Gj), whereas in the hidden layer network model (I.l.2), G is given. The hidden layer network model is thus somewhat simpler than the projection pursuit model.
A variant of the single hidden layer network that is particularly relevant for

econometric applications is depicted in Figure 3. This network has direct connections from the
Input to output layer as well as a single ruaaen layer. output tor this network can be expressed

as

f(x,

0) = F(x'a

q +f3o + L G(x'rj)f3j), j=l

(1.1.3)
weights, and () is now taken to be , {3q)' we nest as

where

is

r x 1

vector

of

input-output
choice

(} = a', f3o,

, {3q, r'l

, ..., r'q)'.

By suitable

of G, a and {3 = ({30, {31,

special cases a1l of the networks discussed so far.


In particular, with F(a) = a (the identity) we have a standard linear model aug-

mented by nonlinear terms. Given the popularity of linear models in econometrics, this form is
particularly appealing, as it suggests that ANN models can be viewed as extensions of, rather to, the familiar models. The hidden unit activations can then be viewed as

than as alternatives

latent variables whose inclusion enriches the linear model. We shall refer to an ANN model with
output of the form (1.1.3) as an "augmented" single hidden layer network. Such networks will

play an important role in the discussion of subsequent sections.


What originally commanded the attention and excitement of a diverse range of dis-

ciplines was the demonstrated successesthat models of the form (1.1.1) and (1.1.2) had in solving previously intractable classification, forecasting and control problems, or in producing superior solutions to difficult problems in orders of magnitude less time than traditional approaches. Until recently, a theoretical basis for such successes was unknown --artificial neural networks just

-8 -

seemedto work surprisingly well.


Motivated by a desire either to delineate the limitations of network models or to

understand their diverse successes, a number of researchers independently produced rigorous


results establishing that functions of the form (1.1.2) can be viewed as I'universal approx.imators,"

that is, as a flexible functional form that, provided with sufficiently many hidden units and properly adjusted 'parameters, can approximate an arbitrary function 9 : IR r -:; IR arbitrarily well in

useful spacesof functions. Results of this sort have been given by Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik, Stinchcombe and White (1989, 1990) (HSWa, HSWb) and Stinchcombe and White (1989), among others. The flavor of such results is conveyed by the following Theorem 2.4 of HSWa. paraphrase of part of

THEOREMI.1.1:
};'(G)={f: {30 E JR'~

For r E IN, let Ir(G)


JRlf(x)={30+LJ=IG(X'rj){3j,XE G: JR ~ [0,1]

be the class of hidden


JR';rjE distribution

layer
JR'+I,{3jE

network

functions

JR,j=I,...,q; Then};'(G) is

JR, q E 1]J\l}, where

is any cumulative

function.

unifonnly

dense on compacta in C( 1R~, Le. for every 9 in C( 1R~, every compact subset K of
such that Supx E K I f(x) -g(x) I < E.

/Rr, and every E > 0, there exists f E Ir(G)

Thus, the biologically inspired combination of parallelism, nonlinear response and multilayer processing leads us to a class of functions that can approximate members of the useful class C( 1R~ arbitrarily well. Similar results hold for network models with general (not necessarily sigmoid) activation functions approximating functions in Lp spaces with compactly supported measures, and, as HSWb and Hornik (1991) show, in general Sobolev spaces. Thus, functions of the form (1.1.2) can approximate a function and its derivatives arbitrarily well, and in this sense are as
flex.ible as Ga1lant's (1981) flex.ible Fourier form. Indeed, Ga1lant and White (1988b) construct a

sigmoid choice for G (the "cosine squasher") that nests Fourier series within (1.1.2), so that the flexible Fourier form is a special case of (1.1.3) even for sigmoid G.

-9 -

The econometric usefulness of the flexible form (1.1.2) has been further enhanced by
Hu and Joerding (1990) and Joerding and Meador (1990), who show how to impose constraints

ensuring monotonicity and concavity (or convexity) of the network output function. interested reader is referred to these papers for details.
An issue of both theoretical and practical importance is the "degree of approxima-

tion" problem: how rapidly does the approximation to an arbitrary function improve number of hidden units q increases? Classic results for Fourier series are provided by Edmunds and Moscatelli (1977). Similar results for ANN models are only beginning to appear, and so far are not as sharp as those for Fourier series. Barron (1991a) exploits results of Jones (1991) to
establish essentially that 11/- 9 112= O(l/q 1/2) ( 11.112 denotes an L2 norm) when I is an element

of };r(G)

having q hidden units and continuously

differentiable

sigmoid activation condition

function, and on its Fourier

9 belongs to a certain class of smooth functions satisfying a summability transform. An important

open area for further work is the extension and deepening of results of

this sort, especially as such results may provide key insight into advantages and disadvantages of ANN models compared to standard flexible function families. Degree of approximation results are also necessary for establishing rates of convergence for nonparametric estimation based on
ANN models.

Our focus so far on networks with a single hidden layer is justified by their relative simplicity and their approximative power. However, if nature is any guide, there are advantages to using networks of many hidden layers, as depicted in Figure 4. Output of an l-layer network
can be represented as

ahi = Gh(Ahi(ah-l))

i =

1, ...,

qh;

h =

1, ...,

1,

where ah is a qh X 1 vector with elements ahi, Ahi(.) is an affine function


Ahi(a) = tI'rhi for some (qh + 1) x 1 vector rhi, 2 = (1, a)), Oh is the

of its argument (i.e.


function for

activation

units of layer h, ao = x, qo = r, and ql = v. The single hidden layer networks discussed above correspond to 1 = 2 in this representation.

-10

An interesting

open question

is to what

extent

networks

with

1 ? 3 layers

may be

preferable to networks with 1 = 2 layers.

Specifically,

for what classes of functions

can a three

layer network achieve a given degree of accuracy with fewer connections (free parameters) than a two layer network? Examples are known in which a two layer network cannot exactly

represent a function exactly representable by a three layer network (Blum and Li, 1991), and it is

known that certain mappings containin2 discontinuities relevant in control theory ~~n hp. Imiformly approximated in three but not two layers (Sontag, 1990). HSWa (Corollary 2.7) have shown that additional layers cannot hurt, in the sensethat approximation properties of single hidden layer networks (I = 2) carry over to multi-hidden layer networks. Further research in this interesting area is needed.
A further generalization of the networks represented by (1.1.4) is obtained by replac-

ing the affine function Ahi(.) with a polynomial Phi(.) with degree possibly dependent on i and h.
This modification yields a class of networks containing as a special case the so-called "sigma-pi"

(Ell) networks (Maxwell, Ones, Lee and Chen. 1986: Williams. lCJRn) Stinrhrombe (1991) hM studied the approximation properties of networks for which an arbitrary "inner function" Ihi

replaces Am in (!.1.4)
The richness of this class of network models is now fairly apparent. However, we
still have not exploited a known feature of biological networks, that of internal feedback.

Returning to the relatively

simple single hidden layer networks, such feedbacks can be

represented schematically as in Figure 5. In Figure 5(a), network output feeds back into the hidden layer with a time delay, as proposed by Jordon (1986). In Figure 5(b), hidden layer output Thc outvul

feerl~ h~r.k into th~ hidd~n layer with a time del~y, as proposcd by Elma.n (1988). function of the Elman network can thus be represented as

q fr(xt, (}) = /30 + L atj /3j j=l


atj = G(Xt'rj + a 't-I 8j),

j=I,...,q;t=O,I,2,...,

(1.1.5)

11-

where

at

(atl'

...,

atq)'.

As a consequence of tl1is feedback, network output depends on the ini-

tial value ao, and the entire history of system inputs, xt = (XI' ..., xt).

Such networks are capable of rich dynamic behavior, exhibiting memory and context sensitivity. Because of the presence of internal feedbacks, these networks are referred to in the literature as "recurrent networks," while networks lacking feedback (e.g., with output functions
G.l.3)) are desi2oated "feedforwarrl In econometric dynamic latent variables netwnrk~ II as a nonlinear applications in

terms, a model of the form (1.1.5) can be viewed model. Such models have a great many potential

economics and finance. Their estimation would appear to present some serious computational challenges (see e.g. Hendry and Richard, 1990, and Duffle and Singleton, 1990), but in fact some straightforward recursive estimation procedures related to the Kalman filter can deliver consistent estimates of model parameters (Kuan, Homik and White, 1990; Kuan and White, 1991). We discuss this further in the next section.
Although we have covered a fair amount of grounrl in thi~ ~p~tion. we have only

scratched the surface of the modeling possibilities offered by artificial neural networks. To mention some additional models treated in the ANN literature, we note that fully interconnected networks have been much studied (with applications to such areas as associative memory and solu-

tion of problems like the traveling salesman problem; see e.g. Xu and Tsai, 1990, and Xu and Tsai, 1991), and that networks running in continuous rather than discrete time are also standard objects of investigation (e.g. Williams and Zipser, 1989). Although fascinating, these network models appear to be less relevant to econometrics than those discussed so far, and we shall not
treat them fiJrthp.r As rich as ANN models are, they still ignore a host of biologically relevant features.

Neural systems that have taken perhaps billions of years to evolve will take humans a little more time to model exhaustively than the five decades devoted so far! To mention just a few items,
biological neurons communicate over multiple pathways, chernical as well as electrical --the .l;in-

gle communication dimension ("activation") assumed in most ANN models is quite incomplete.

-12-

Also, biological neurons respond to input activity stochastically and in much more complicated ways than as modeled by the sigmoid activation function --neurons output complex spike trains through time, and are in fact not simple processing units. Of course, these and other lirnitations
of ANN models are daily being challenged by ANN modelers, and we may expect a continuing

increase in the richness of ANN models as the diverse interdisciplinary talents of the ANN community are broueht to bear on these issues.
Despite these limitations sufficiently as descriptions attractive of biological reality, ANN models are modeling.

rich as to present a potentially

set of tools for econometric

Given models, the econometrician wants estimators. We take up estimation in the next section, where we encounter additional interesting tools developed by the ANN community in their study of learning in artificial neural networks.

1.2. LEARNING IN ARTIFICIAL NEURAL NETWORKS


The discussion of the previous section establishes ANN models as flexible functional fomls, extending standard linear specifications. As such, they are potentially useful for econometric modeling. To fulfill this potential, we require methods for finding useful values for the free parameters of the model, the network weights.
TO any econometriCian verse a m the standard tools of the trade, a multitude of

relevant estimation procedures for finding useful parameter values present themselves, typically dependent on the behavior of the data generating process and the goals of the analysis.
For example, suppose we observe a realization of a random sequence of s x 1 vec-

tors {Zt = (yt, X't)'}

(assumed stationary for simplicity), and we wish to forecast yt on the basis

of Xt. The minimum mean-squared error forecast of yt given Xt is the conditional expectation
g(XI} = E(YI I XI}. Although the function 9 is unknown, we can attempt to approximate it using a

neural network with some sufficient number of hidden units. If we adopt (1.1.3) with F the iden-

tity, we obtain a regression model of the form

-13-

f(x,

8)

= x'a

+f3o

q + L j=l

G(x'rj)f3j,

where () = (a,/30'/31 , ...,/3 j, r'l,

..., r'q)'

and for simplicity

we choose q and G a priori. we must acknowledge

Because this model is only intended

as an approximation,

from the outset that it is misspecified. Nevertheless, the theory of least squares for ~sspecified nonlinear regression models (White, 1981; 1992, Ch. 5; Domowitz and White. 1982: Gallant :\nci White, 1988a) applies immediately to establish that a nonlinear least squares estimator 9 n solving the problem

n min n-l /I," A L [yt ~=1 -f(Xt. 8)]2

exists and converges almost surely under general conditions as n ~ ~ to 9., the solution to the
problem

where

a~ = E([Yt -g(xJf).

(See Sussman

(1991)

for

discussion

of

issues relating

to

identification.:
Further, under general conditions
a multivariate nonnal distribution with

{;; (0 n -() *: converges in distribution


estimable

as n ~ 00 to
matrix

mean zero and consistently

covariance

(White, 1981; 1992, Ch. 6; Domowitz and White, 1982). Although least squares is a leading case, the properties of the dependent variable yt will often suggest the appropriateness of a Qua."i-ma:ximllmlik~Jihood procedure different from
least squares. For example, if yt is a binary choice indicator taking values O or 1 only, it may be assumed to follow a conditional Bernoulli distribution, given Xto A network model to approxi-

mate g(X,) = P[Yt = 1 I Xt] = E(Yt I Xt) can be specified as

f(x, O) = F(x'a

+ .80 + L G(x'rj) j=l

.8j) ,

{1.2.1)

14 -

where F(.) is now some appropriate c.d.f. (e.g., the logistic or normal). The mean quasi-log likelihood function for a samDle of size n is then

Ln(Zn,

f)

= n-1

n L[Yt t=l

logf(Xt'

f)

+ (1-

yt) log(l-

f(Xt,

f))].

A quasi-maximum

likelihood

estimator

A 8 n solving

the problem

max. BE e

Ln(Zn,

f)

can be shown under general conditions to exist and converge to 0., the solution to the problem

max E[Yt Jog f(Xt. (1) + (1- YJ 1og(lBee .

f(Xt. (1))].

(See White, 1982; 1992, Ch. 3-5.) The solution ()* minimizes the Kullback-Leibler divergence of the approximate probabillly Inul1el f(Xt, 0.) [rUIIl 111t;Uut; g(Xt). fu inl11t; It;~l :)4U(1lt;:) \;;~t;, -I;; (0 n -0 .) I.;UIIVC;lgC;~

in distribution

as n ~ 00 to a multivariate

normal distribution

with mean zero and consistently

estimable covariance matrix (White, 1982; 1992, Ch. 6).


If Ye represents count data, then a Poisson quasi-maximum likelihood procedure is

natural (e.2. Gourieroux. Monfort and Tro2non. 1984a.b). where fis as in G.2.1) with F chosen to ensure non-negativity (e.g. F(a) = exp(a, so as to permit f(Xt, J) to plausibly approximate
g(Xt) = E(Yt I X,). If Yt represents a survival time, then a Cox proportional hazards model (e.g.

A:r:nemiya,1985, pp. 449-454) is a natural choice, with hazard rate of the form )..(t) f(Xt, 9).
From an econometric would ordinarily standpoint, then, ANN models can be used anywhere with estimation one

use a linear (or transformed linear) specification,

proceeding

via appropriate quasi-maximum likelihood (or, alternatively, generalized method of moments) techniques. The now rather well-developed theory of estimation of misspecified models (White, 1982, 1992; Gallant and "Whitc, 1988a; POt~chcJ: and P1-ucha,1991a,b) applic~ immcdiatcly to provide interpretations and inferential procedures.

15-

The natural instincts of econometricians are not the instincts of those concerned with
artificial neural network learning, however. This is a double blessing, because it means not only

that econometrics has much to offer those who study and apply artificial neural networks, but also
that econometrics may benefit from novel techniques developed by the ANN community. In considering how an artificially intelligent system must go about learning, ANN learning ~ thc; l1lU-

modelers from the outset viewed learnin2 as a ~eql1f'.ntiI11 proce~s. Viewins

cess by which knowledge is acquired, it follows that knowledge accumulates as learning experiences occur, Le. as new data are observed.

In ANN

models, knowledge

is embodied in the network

connection

strengths,(}.

" " Given knowledge (} t at time t, knowledge (} t+ 1 at time t + 1 is then

t+l=Ot+Lltt

where dt embodies incremental therefore


current

knowledge

(1eaming).

A successful learning

procedure must CI.11U which

specify
observables,
..

some appropriate
Zt = (yt, X't)"

way to fonn thf'. llpd~te A, from previous knowlcdgc Thus we seek an appropriate function Vlt for

~t

1fItCZt.

()

t).

Current leading ANN learning methods can trace their history from seminal work of Rosenblatt (1957, 1958, 1961) and Widrow and Hoff(1960). Rosenblatt's learning network, the a-perceptron, was concerned with pattern classification and utilized threshold logic units.
Widrow and Hoffs ADALINE networks do not require a nu, as they are not restricted to being

classifiers.

As a consequence, the Widrow-Hoff

(or "delta") learning law could be generalized in

just the right way to pccmit (tJ:JJ:Jlil.;auun tu nonlinear networks.

For their linear networks (with output for now given by f(x, 8) = x' 8) Widrow and Hoff proposed a version of recursive least squares (itself traceable back to Gauss, 1809 --see Young, 1984),

Ot+1

= Ot + a XtCft

-X't

Ot).

(1.2.2)

16 -A Et = yt -X't O t is the "network -A X't O t and the

Here

error"

between

computed

output

"target"

value

yt.

The scalar a > O is a "learning rate" to be adjusted by trial and error. This recursion

was motivated explicitly by consideration of minimizing expected squared error loss. For networks with nonlinear output f(x, 8) the direct generalization of the delta rule
is

" Ot+1

" = Ot + a Vf(Xt.

" t)

(yt -f(Xt.

" t

(1.2.3)

where V f(x, .)is the gradient of f(x, .)with respect to (J (a column vector).

In the ANN literature,

this recursion is called the "generalized delta rule" or the method of "backpropagation" (a term invented for a related procedure by Rosenblatt, 1961). Its discovery is attributable to many (Werbos, 1974; Parker, 1982,1985; Le Cun, 1985), but the influential work of Rumelhart, Hinton and Williams (1986) is perhaps most responsible for its widespread adoption.
This apparently straightforward generalization of (1.2.2) in fact caused a revolution

in the ANN field, spurring the explosive growth in ANN modeling resDonsible for its vi2or today
and the appearance of an article such as this in a journal devoted to econometrics. The reasons

for this revolution are essentially two. First, until its discovery, there were no methods known to ANN modelers for finding good weights for connections into the hidden units. The focus on threshold logic units in multilayer networks in the 1950's and 1960's led researchers away from
gradient methods, as the derivative of a TLU is zero almost everywhere, and does not obviously

lend itself to gradient methods. This is why the introduction of sigmoid activation functions by
Cowan (1967) amounted to such a significant breakthrough --straightforward gradient methods

become possible with such activation functions. Even so, it took over a decade to sink into the
collective consciousness of the ANN community that a solution to a problem long considered

intractable (even impossible, viz. Minsky and Papert, 1969) was now at hand. The second reason is that once feasible methods for training hidden layer networks were available, they were applied to a vast range of problems with some startling successes. That this should be so is all the more impressive given the considerable difficulties in obtaining convergence via (1.2.3). For

17 -

a period, ANN models coupled with the method ofbackpropagation

came to be viewed as magic,

with considerable accompanying hype and extravagant claims. In 1987 one of us (White, 1987a) pointed out that (!.2.3) is in fact an application of the method of stochastic approximation (Robbins and Monro, 1951; B1um, 1954) to the nonlinear least squares problem (as in Albert and Gardner, 1967). The least squares stochastic approximation recursions are in fact a little more ~eneral. havin~ the form " " " 9t+l = 9 t + at Vf(Xt, 9 t) (yt -f(Xt, " 9 t)),

t=

1, 2,

00

(!.2.4)

The difference is that here the learning rate at is indexed by t, whereas in (1.2.3) it is a constant.
This is quite an important difference With a constant learning rate, the recursion

(1.2.3) can converge only under extremely stringent conditions (there must exist eo such that
y = f(X, eo) almost surely, where Zt has the distribution of Z = (Y; X')' t = 1, 2,
). When this

condition fails, the recursion of (1.2.3) generally converges to a Brownian motion (see Kushner
and Huang. 19R1: Homik ~nr1 Kll~n, 1QQO),not an appealing behavior in this context. Howevcr,

whenever at depends on t appropriately

(e.g. at > 0, L~=l at = 00, L~=l a; < 00, for which it

suffices that at oc t-IC 1/2 < 1( ~ 1), standard results from the theory of stochastic approximation

can be applied (e.g., White, 1989a) to establish the almost sure convergence ofe t in (1.2.4) to ()*,
a local solution of the least squares problem

mill E([Y BE 8

-f(K,

8)]1

Repeated

initialization

of the

recursion

(1.2.4)

from

different

starting

values

A e

(e.g., following

the parameter space partitioning strategy of Morris and Wong, 1991) can lead to rather good local solutions.
This fact is significant. The recursion (!.2.4) provides a computationally very simple

algorithm for getting a consistent estimator for a locally mean square optimal parameter vector in a nonlinear model with just a single pass through the data. Multiple passes through the data

(which can be executed in parallel) permit exploration for a global optimum. Thus, in addition to

21

and Duffle and Singleton (1990). Duffle and Singleton derive consistency and asymptotic normality results for MSM estimators of correctly specified models of conditional distribution. The recursive estimator (1.2.6) is computationally simpler by several orders of magnitude and has useful approximation properties even with misspecified models. It is therefore an interesting estimator in its own right: it also appears promising as a generator of starting estimates for MSM esti-

mation. In all of the discussion so far, we have implicitly assumed that network complexity (indexed by the number of hidden units) is fixed. However, the universal approximation properties described in Section 1.1 suggest that ANN models may prove a useful vehicle for nonparametric estimation. This intuition is correct: using results of White and Wooldridge (1991), White (1990a) shows that nonparametric sieve estimators (Grenander, 1981; Geman and Hwang,
1982) based on ANN models can consistently estimate a square-integrable conditional expecta-

tion function, and White (199Ob) shows that nonparametric sieve estimators based on ANN models can consistently estimate conditional quantile functions. Using results of Gallant (1987), Gallant and White (1991) establish the consistency in Sobolev norm of nonparametric sieve estimators based on ANN models. Thus, ANN models can consistently estimate unknown functions and their derivatives in a manner analogous to the performance of the flexible Fourier function

form (Gallant, 1981; Elbadawi, Ga1lant and Souza, 1983). Given tile early stage or aevelopment ot oegree of approximation results for ANN models, rate of convergence results for nonparametric ANN estimators are only beginning to be obtained. However, Barron (1991b) has obtained rate of convergence results for nonparametric least squares estimators of conditional expectation functions. For i.i.d. samples, these rates are
slightly slower than n 1/2,

To gain some insight into the issues that arise in nonparametric estimation using ANN models, we briefly consider the problem treated by White (1990a). The estimation problem considered there has the standard sieve estimation form

n =

1,2,

...,

(1.2.7)

-22 -

where the sieve en(G) is given by en(G) = T(G, qn' An),

T(G,

q, 11) = {(} E E> I (}( .) = fl(

., ~)

fl(x,

q 8q) =f3o + L G(x'rj)f3 j=l

j ,

x E m.'

q L if3j i ~A, j=O


G is a given hidden layer activation function,

q r L L Irji I .S:qL\} j=li=O


{qn E IN} and {~n E JR+

are sequences tending

to infinity with n, e is the space of functions square integrable with respect to the distribution of
Xt,and now u

s:q -rR =\fJO,

/3
1,...,

/3

'
q,rl,r2,...,rq

) ' .

Given this setup, the estimation problem (1.2.7)is equivalent to the constrained non-

linear least squares problem

mill 8"' e D,

n-1

n L '=1

[f,

-f1'(X"

~')f

, n = 1,2,

""",

(1.2.8)

where Dn = {$1. :L;:o

l{3j I ~lln,

L;:l

L;=o

Irji

I ~qnlln}.

The idea is that for a sample of

size n, one performs a constrained nonlinear least squares estimation on a model with qn hidden units, satisfying certain sumrnability restrictions on the network weights. B y letting the number

of hidden units qn increase gradually with n, and by gradually relaxing the weight constraints, the
network model becomes increasingly inates overfitting asymptotically, flexible as n increases. Proper control of qn and l1n elimof 80, 80(Xt) = E(Yt I Xu, to

a1lowing consistent estimation

" p result, i.e. 118 n -80112 -7 0. White

(1990a) shows that for bounded i.i.d.

{ Zr } , consistency

is

achieved with !!J.n,qn ~ 0:1as n ~ 0:1,!!J.n = o(nl/4)

and qn!!J.~ Jog qn!!J.n= o(n).

For bounded mix-

ing processes of a specific size, ~n = o(n 1/4) and qn~; log qn~n = o(n 1/2) suffice for consistency.

In practice, determining appropriate network complexity is precisely analogous to determining how many terms to include in a nonparametric series regression. As in that case, either cross-validation or information-theoretic methods can be used to determine the number of
hidden units optimal for a given sarnple. Information-theoretic methods in which one optimizes a

complexity-penalized quasi-log likelihood (closely related to the Schwartz Information Criterion,

-23-

Sawa, 1978) have been shown to have desirable properties by Barron (1990). Extension of analysis by Li (1987) as applied by Andrews (1991a) to cross-validated selection of the number of terms in a standard series regression may deliver appropriate optimality results for crossvalidated selection of network complexity , and is an interesting area for further research.

Also an open question is that of the asymptotic distribution of nonparametric neural network estimators. Results of Andrews (199lb) for series estimators may also be extendable to treat nonparametric estimator of ANN models. Additional interesting insights should arise from this analysis.

13. SPECIFICATION

TESTING AND INFERENCE

Consider the nonlinear regression model based on (1.1.3) with F the identity

func-

tion,

The standard linear model ocl::urs as the special case in which {31 = {32 =

..{3q

0.

Thus,

correct specification of the linear model can be tested as

Hq

.fi

=0

v~

H.. .{I; = 0

where {3 = ({310 ...0 {3q)'.

A motnent's

reflection

reveals

an interesting

obstacle

to straightforward

application of the usual tools rf statistical inference: the "nuisance parameters" rj, j = 1, ..., q, are not identified under the nu1l hypothesis, but are identified only under the alternative.
tunately, there is now availabl~ a variety of tools that permits testing of Ho in this context.

The simplest, mo$t naive procedure is to avoid treating the rj as free parameters, instead choosing them a priori in some fashion (e.g., drawing them at random from some
appropriate distribution) and I then proceeding to test Ho using standard methods, e.g. via

Lagrange multiplier or Wald statistics, conditional on the values selected forrjo A procedure of

24precisely this sort was proposed by White (1989b), and the properties of the resulting "neural network test for neglected nonlinearity" were compared to a number of other recognized procedures

for testing linearity by Lee, White and Granger (1991). (See White, 1989b, and Lee, White and Granger, 1991, for implementation details.) The network test was found to perform well in comparison with other procedures. Though no one test dominated the others considered, the network test had good size, was often most powerful, and when not most powerful, was often one of the more powerful procedures. It thus appears to be a useful addition to the modem arsenal of specification testing procedures.
A more !;ophi!;ticated prncedllre i~ tn ~hnfil:p. "1 vfll11~S that optimize the direction in

which nonlinearity

is sought.

Bierens (1990) proposes a specification

test of precisely this sort.

First, the model is estimated under the null hypothesis (linearity),


Et = yt -i't ~n where ~n is an estimator of ([:Jo,a')'.

yielding

residuals

For given r one can show under general

conditions that

under the linearity

hypothesis, where with { Zt } i.i.d. we have

b*(r)

= E(G(X'tr)

X't)

A.

E(Xt

X't)

p where ~n ~ !!.,

Bierens (1990) specifies G( , ) = exp( .), but as we discuss below, this is not the

only possible choice.


It follows that

25 -

A W(r)

A = nM(r)/Gn(r)

d ~xi

under correct specification of the linear model, where a~(r) is a consistent estimator ofa2(r).
Under the alternative, A W(r)/n -717(r) > 0 Q.s. for essentially every choice ofr, as Bierens (1990,

Theorem 2) shows. " To avoid picking r at random, Bierens proposes maximizing W( r) with respect to
r E r Can appropriately specified compact set), yielding Wcr), say. As Bierens notes, this max-

imization renders the xi distribution inapplicable under Ho. However, a xi statistic can be constructed by the following device: choose c > 0, /l E (0, 1) andy n independently of the samDle

and put

r=ro
A =r

if

W(r)

-W(ro)

~ cn)..

if

Bierens
A W(r)/n ~

(1990,

Theorem

4)

shows

that

" d WCr) ~xi

under

correct

specification
essentially

while

SUpre r 7J(r)

> O a.s. under the alternative.

Bierens'

result hold')

regardless

of how r is chosen.
In recent related work, Stinchcombe and White (1991) show that Bierens' concluincluding

sions are preserved if G is chosen to belong to a certain wide class of functions G( .) = exp( .). Other members of this class are G(a) = 1/(1 + exp(-a The choice of c, }., and r o in Bierens' construction

and G(a) = tanh(a).

is problematic.

Two researchers

using the same data and models but using different values for c, ).. and r o can arrive at differing
conclusions in finite samples regarding correctness of a given specification. One way to avoid

such difficulties is to confront the problem head-on and determine the distribution of W( r). Some
useful inequalities are given by Davies (1977, 1987), but these are not terribly helpful when r ?; 3 variables). Recently, Hansen (1991) has proposed a com-

(recall r is the number of explanatory

putationally intensive procedure that permits computation of an asymptotic distribution for W( r) under Ho'

-26-

An interesting

area for further

research

is a comparison

of the relative

performance

and computational cost of the procedures discussed here: the naive procedure of picking rj'S at
random; Bierens' " W(r) procedure; and use of Hansen's (1991) asymptotic distribution " for W(r).

The specification testing procedures just described extend to testing correctness of


nonlinear models, as well as testing the specification models. For testing correct specification of likelihood or method of moment-based

of a nonlinear model, say yt = h(Xt, a) (which for con-

venience includes an intercept) one can test Ho: /3 = O vs. Ha: /3 * Oin the augmented model

v, = h(X"

a)

q 4- ~ j=l

G(X','Yj)f3j

4- I;,

(1.3.2)

p If an is the nonlinear least squares estimator under the null (with an ~ a* under Ha; see White, 1981), then with Et = ft -h(Xt, an) we have

where now

(J"2Cr)

var([GCXtr)

-b*Cr)

A *-1

VahCXt.

a*)]e;)

b*(r) = E(G(X'tr)

V'ah(Xt, a*))

A. = E(Vah(Xt. a *)
We again have " W(r) " = nM(r)/a-

V'ahCXt.

a*))

2 n(r)

d -7xi

under

Ho.

while

" W(rYn

-717(r)

> O a.s.

under

Ha

(mlsspecification) for essentially all r. A consistent specification test is therefore available. A Optimizing W( r) over choice of r leads to considerations regarding asymptotic testing identical to those arising in the linear case.
For testing correct specification of a likelihood-based model, a consistent m-test

(Newey, 1985; Tauchen, 1985; White, 1987b, 1992) can be performed. The starting point is the
fact that if 1 (Zt. 0) is a correctly specified conditionallo~-likelihood for y, 2iven X, Ci.e. for some

-27 -

() 0' exp 1 (Zt, () 0) is the conditional

density

of yt given Xt), then

E(s(ZtJ

(Jo) I Xt) = O ,

where s is the k x llog-likelihood

score function, s(Zt, 0) = V 8 1(Zt, 0). It follows from the law

of iterated expectations that with correct specification

E(s(Zt'

() 0) G(X't

r))

= 0

for all ye r.

A Under standard conditions (e.g. White, 1992, Ch. 9) it follows that with en the

(qua3i-) mll.Ximum likc;lihood c;~timator c;oroi~tc;nt undc;r mi~~pc;cifica.tiOll fOl (}*, wc 11(1VC

where

};(y)

= var([(G(Xt'y)

@ Ik) -b *(y)A *-l]S;)

b*(r)

= E([G(i'tr)

(8)Ik] V'9 S;)

A. = E(V'9 S;)

s: = s(Zt. (}*,
\7'6 S; = \7'6 S(Zt. e*).

Consequently, analogous

A A A W(r) = n M(r)'[Ln(r)]-l of Bierens (1990,

A M(r)

d ~xi

under

correct

specification. ~17(r)

Argument

to that

Theorem

4) delivers

A W(r)/n

> 0 a.s. under

misspecification

for essentially

all r, given an appropriate choice of G, e.g. G(a) = exp(a) as in G(a) = tanh(a), as in Stinchcombe and White 0991). " W(r) over choice ofr leads to considerations

Bierens (1990), or G(a) = 1/(1 + exp(-a, A consistent m-test is thus available.

Optimizing

regarding asymptotic testing identical to those arising in the linear model.


Because ANN models must be recognized from the outset as misspecified, one

-28-

cannot test hypotheses about estimated parameters of the ANN model in the same way that one would test hypotlleses about correctly specified nonlinear models (e.g. as in Gallant, 1973, 1975). Nevertheless, one can test interesting and useful hypotheses within the context of inference for misspecified models (White, 1982, 1992; Gallant and White, 1988a). In this context, two issues arise: the first concerns the interpretation of the hypothesis itself; and the second concerns construction of an appropriate test statistic. Both of these issues can be conveniently illustrated in
the context of nonlinear regression, as in White (1981). A The nonlinear least squares estimator () n solves

min n-l ee8

n L [yt -f(Xt. t=l

(J)r

where, for concreteness we take f(Xt, (}) to be of the fonn (!.1.3) with F the identity function.
White (1981) provides conditions ensuring A Q.S. that (} n ~ (}*, where (}* is the solution to

rnin E([E(Yt BE e

I Xt)

-f(Xt.

(J)J2)

Thus (). is a parameter vector of a minimum

mean squared error approximation

f(Xt. ().) "to

E(Yt I Xt). One can therefore test hypotheses about the parameters of the best approximation. A leading case is that in which a specified explanatory variable (say the rth variable, Xtr) is hypothesized to afford no improvement permitted by f in predicting yt, within the class of approximations

This hypothesis and its alternative are specified as

Ho:

S, e*

= 0

vs.

Ha: S, ()

;!: 0

where s r is a q + 1 x k selection

matrix

that picks out the appropriate

elements of () .(i.e.

ar,rlr,...,rqr.

Testing Ho against Ha in the context of a misspecified model can be conveniently done using either Lagrange multiplier (LM) or Wald-type test statistics, but not likelihood ratio statistics, for reasons described in Foutz and Srivastava (1977), White (1982, 1992) and Gallant

-29-

and White (1988a). The likelihood

ratio statistic requires for its convenient use as a X~+l statis-

tic the validity of the information matrix equality (White, 1982, 1992), which fails under misspecification. The classical LM or Wald statistics also require the validity of the information matrix equality, but can be modified by replacing classical estimators of the asymptotic covariA ance matrix of en with specification robust estimators (White, 1981, 1982, 1992; Gallant and White, 1988a). Thus, a test of Ho against Ha can be conducted using the Wald statistic
.." Wn = n O 'n S'r(Sr .. Sr O n ,

Cn S'r)-l

where

~ Cn

--1 = An

---1 En An

The covariance

estimator

" Cn given here is consistent

when {4}

is i.i.d.,

but modifications

preserving consistency are available in other contexts. Under the hypothesis that Xtr is irrelevant
" d (and with consistent Cn), one can show that Wn ~ X~+l' and that the test is consistent for the

alternative. Similar results hold for the LM test statistic. Details can be found in Gallant and Whit~ (19&&a,CQ 7) and White (19&2; 1992, Ch. 8).

1.4. CHAOS-MODELING EXAMPLES


In this section we illustrate methods for estimating ANN models by fitting single
hidden layer feedforward networks to time series generated by three deterministic chaos

processes. The generating equations for these time series are:

30 -

(a)

The logistic map (Thompson and Stewart, 1986, p. 162):

Yt+l

= 3.8

YtCl -Yt)

(b)

The circle map (Thompson and Stewart, 1986, pp. 164, 285-6):

Y, 11 = Y, + (22/1l')

~in(21l' Y, + ~~)

(c)

The Bier-Bountis map (Thompson and Stewart, 1986, po 171):

Yt+l = -2

+ 28.5 Yt/(l

+ yf)

Chaos (a) is by now a familiar example to economists and econometricians. Chaos (b) and chaos (c) are less familiar, but these three examples, representing polynomial, sinusoidal and !ational polynomial functions, provide a modest range of different functions with which to demonstrate
ANN capab1Unes. Time-series plots or me mree senes are given in Figures 6,7 and 8.

Because we shall not be adding observational error to the chaotic series, our exampIes will provide direct insight into the approximation abilities of single hidden layer feedforward networks. In each case, we fit ANN models of the form

f(Xt.

-q 0) = X't

+ /30 + L j=l

G(X't

rj)

/3 j

(1.4.1)
Several models are examined

to the target chaos, yt, where G(a) = 1/(1 + exp(-a)),

the logistic.

in p~('h in~ance- Specifically, the input X, iE:a E:inslelas of the torgct scrics yt, whilc thc numbcl of hidden units (q) varies from zero to eight. The best model is chosen from these alternatives using the Schwartz Information Criterion (SIC). For each network configuration, we estimate model parameters by a version of the method of nonlinear least squares,Le., we attempt to solve

-31-

Optimization proceeds in two stages. First, the parameter estimates an are obtained by ordinary
least squares, with parameters {3 constrained to zero. (Note that an contains an intercept.) Then

if q > 0, second stage parameter estimates fi n and r n are obtained in such a way as to exploit the

structure of (1.4.1); the an estimates are not subsequently modified, forcing the hidden layer to extract any available structure from the least-squaresresiduals. Inspecting (1.4.1), we see that for given rj..s, ordinary least squares gives fully optimal eGtimntosfor /3. Thus, wc choosc a largc numbcr of ralldol1l vi1luc:)fur tIle elementS or
rj, j = 1, ..., q, and compute the least squares estimates for /3. This implements a form of global

random search of the parameter space. The best fitting values of.8 and r are then used as starting

values for local steepest descent with respect to {:3and r. Within steepest descent, the step size is dynamica11yadjusted to increase when improvements to mean squared error occur, and otherwise to decrease until a mean squared error improvement is found. Convergence is judged to occur
when (mse(k) -mse(k -1)/(1 + mse(k -1)) is sufficiently small, where mse(k) denotes sample

mean squared error on the kth steepest descent iteration. Once a local minimum is reached, the procedure terminates. This algorithm has been found to be fast and reliable across a variety of applic~tions investigated by the authors. The re~lllt~ nf lp.~~t~'111~rp.~ p~tim~tion of a linear model are given in Table 1. Tho simple linear model explains only 12% of the target variance for the circle map, while explaining 84% of the target variance for the Bier-Bountis map. The logistic map is intermediate at 36%, Results for the single hidden layer feedforward network are given in Table 2. In each case the hidden layer network chooses to take as many hidden units as are offered (8), and with this number of hidden units, nearly perfect fits are obtained. Because the relationships studied here are noiseless, the SIC starts to limit the number of hidden units chosen essentially only when machine imprecision begins to corrupt the computations. This lirnit was not reached in these examples. Our examples show that single hidden layer feedforward networks do have

-32 -

appealing flexibility, and can be profitably used to extract approximations at least to some simple
chaos-generating functions. Experience in a wide variety of applications across a spectrum of

scientific disciplines suggests that the usefulness of this flexibility is likely to extend broadly to
econometric contexts.

ANN

models

thus appear to be worthy

additions

to the modem

econometrician's tool-kit.

PART II: RECURSIVE M-ESTlMATION

WIm

DEPENDENT OBSERVATIONS

lI.l.

INTRODUCnON

In Part I, we briefly discussed the method of stochastic approximation (Robbins and Monro,
1951). The Robbins-Monro function '!1(0), say 0" , by (RM) algorithm recursively approximates the zero of an unknown

A () t+1

A = () t + at 1fI(Zt,

A () J

t=

1,2,...

(II.l.l)

where at is a "learning
influcn(;cd

rate" tending to zero, and 1j/(Zt,8) is a measurement of '1'(8) at time t, When 'I'(IJ) = E~V(Zt, tJ)) truS methOd yields a recursive

by 1(111UUl11 v(Ui(1bl~:) Zt.

implementation of the method ofm-estimation of Huber (1964). In particular, the method can be
used to estimate recursively the parameters of nonlinear regression models, such as those arising

in neural network applications. The RM algorithm has two significant advantages: (1) its recursive nature places few demands on computer resources; and (2) in theory , just one pass through a sufficiently large data
set can yield a consistent estimate. The RM algorithm is therefore particularly appealing for

estimating parameters of nonlinear models in large data sets. Very general results relevant to the convergence properties of the RM algorithm have been given by Kushner and Clark (1978) (KC) and Kushner and Huang (1979) (KH). However, the conditions ofKC/KH are not primitive and require some effort to apply. In this part of the paper,

we bridge an existing gap between the results of KC/KH and some interesting and fairly broad

-35-

1/I(z.

())

:5 b(O) h 1(Z) + h 2 (Z); and

there

exist

functions

PI:

/R+

/R+

and

h3

1Rs ~

1R+

such that

p I (U) ~

0 as u ~

0, h3 is measurable-

D3$, and for each (z, (}I , (}2) in

IR$ x e x e

1jI(Z.(}I) -1jI(Z.(}z)

~Pl(

I (}1-(}2

)h3(z),

where

denotes the Euclidean norm.

ASSUMPTION

A.3:

E 1/f(Zt, () < 00 for each () in 0, and there exists a function 'l' : 0-7

IRk

continuous on e such that for each (} in e '(}) = limt -+ ~ E 1fI(Zt. (}).

ASSUMPTION
L;=oat ~ 00 as n

A.4:
~ 00,

{ at} is a sequence of positive real numbers such that at ~ O as t ~ 00 and

ASSUMPTION

A.5:

(a) (b)

For each () in e. L;=o at [1jf(Zt.()) -Evr(Zt. ())] converges a.s,-P; and


For j = 1,2,3, there exist bounded non-stochastic sequences {17jt} such that

L;=o at[hj(Zt)-1Jjt]

converges a.s.-P.

Assumption A.l introduces the data generating process, and Assumption A.2 imposes some suitable and relatively mild restrictions on the growth and smoothnessproperties of the measurement function 11/ . Assumption A.3 is a mild asymptotic mean stationarity requirement at -7 O ensures that the effect of error adjustment eventually
00 allows the adjustment to continue for an arbitrarily long

In
van-

Assumption A.4, the condition


ishes; the condition ~n ""'t=1 at-:;

time,

so that the eventual convergence of (lI.l.l)

is always plausible.

Assumption A.5 imposes mild convergence conditions on the processes depending on Z:. Below we consider more primitive mixingale conditions that ensure the validity of this assumption. Let 1C
IRk ~ e be a measurable projection function (for f) E e, 1t"(f) = f).

We then

A A have that for all RM estimates e t, 7r(e J E e.

In what follows,

A e t will also denote the projected

This result generalizes classical results (e.g., Blum, 1954) in several respects. First, Zr is not required to enter the function 1/1 additively. Second, the learning rate at is not required to be
square summable. Most importantly, general behavior for Zr is allowed, provided that Assump-

tion A.5 holds. As examples, KC consider martingale difference sequences and moving average
processes.

A general class of stochastic processes satisfying the convergence

conditions
denote the

of AssumpLp-norm,

tion A.5 is the class of mix.ingales (McLeish,

1975). Let

.llp

IIXllp=(E x IP)llp. WhenllXllp <~wewriteXe


whenever each element of X belongs to Lp(P).

Lp(P). If Xis a matrix or vector,X e Lp(P)


In this case
lip is as just defined, with

denoting the spectral norm induced by the Euclidean norm. We use the following definition.

DEFINITION II.2.2: Let {X_t} be a sequence of random variables belonging to L_2(P) and let {F_t} be a filtration of F. The sequence {X_t, F_t} is a mixingale process if for sequences of nonnegative real constants {c_t} and {ξ_m}, where ξ_m → 0 as m → ∞, we have

||E(X_t | F_{t-m})||_2 ≤ c_t ξ_m   and   ||X_t - E(X_t | F_{t+m})||_2 ≤ c_t ξ_{m+1}.

{X_t} is a mixingale of size -a if ξ_m = O(m^λ) for some λ < -a. (We drop explicit reference to the filtration when there is no risk of confusion.) When ξ_m satisfies this last condition, we also say that ξ_m is of size -a.

Our definition of size is convenient, but also stronger than that considered by McLeish (1975). As special cases, mixingale processes include independent sequences, martingale difference sequences, φ-, ρ- and α-mixing processes, finite and certain infinite order moving average processes, and sequences of near epoch dependent functions of infinite histories of mixing processes (discussed further in the next section). Mixingales thus constitute a rather broad class of dependent heterogeneous processes.

In our applications, we always assume that the relevant random variables are measurable-F_t, so that the second mixingale condition holds automatically. This avoids anticipativity of the RM algorithm.

The following conditions permit application of McLeish's mixingale convergence theorem (McLeish, 1975, Corollary 1.8) to verify the conditions of Assumption A.5.

ASSUMPTION A.4': {a_t} is a sequence of positive real numbers such that Σ_{t=1}^∞ a_t^2 < ∞ and Σ_{t=1}^n a_t → ∞ as n → ∞.

ASSUMPTION A.5':
(a) For each θ in Θ, sup_t ||ψ(Z_t, θ)||_2 ≤ Δ_θ < ∞ and {ψ(Z_t, θ) - Eψ(Z_t, θ), F_t} is a mixingale of size -1/2, where F_t = σ(Z_1, ..., Z_t);
(b) For j = 1, 2, 3, sup_t ||h_j(Z_t)||_2 ≤ Δ < ∞ and {h_j(Z_t) - Eh_j(Z_t), F_t} is a mixingale of size -1/2.

Assumption A.4' implies Assumption A.4. Note also that sup_t ||ψ(Z_t, θ)||_2 ≤ Δ_θ < ∞ is implied by Assumptions A.5'(b) and A.2(b.i), and that we may take η_{jt} = Eh_j(Z_t). We have the following result.
COROLLARY II.2.3: Given Assumptions A.1-A.3, A.4' and A.5', let {θ̂_t} be given by (II.1.1) with θ̂_0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold.

This provides general and fairly primitive conditions ensuring the convergence of θ̂_t. Only Assumption A.5' is a reasonable candidate for further specialization to achieve additional simplicity. This is most conveniently done by placing conditions on h_1, h_2, h_3 and {Z_t} sufficient to ensure that the mixingale property is valid. We give examples of this in the next section. The present result gives a very considerable generalization of a convergence result of White (1989a, Proposition 3.1). There Z_t is taken to be an i.i.d. uniformly bounded sequence. Corollary II.2.3 also generalizes results of Englund, Holst and Ruppert (1988), who assume that {Z_t} is a stationary mixing process and that ψ is a bounded function.

Asymptotic normality follows as a consequence of Theorem 2 of KH. As KH show, the fastest rate of convergence obtains with a_t = (t+1)^{-1}; we adopt this rate for the rest of this section.

For given θ* ∈ R^k we write u_t = (t+1)^{1/2}(θ̂_t - θ*). Straightforward manipulations allow us to write

u_{t+1} = [I_k + (t+1)^{-1} H_t] u_t + (t+1)^{-1/2} q_t* ,                       (II.2.4)

where

H_t = ∇_θ ψ_t* + [((t+2)/(t+1))^{1/2} - 1] ∇_θ ψ_t* + I_k/2 + O((t+1)^{-1}) I_k   (II.2.5)

and

q_t* = ((t+2)/(t+1))^{1/2} ψ_t* ,

with ψ_t* = ψ(Z_t, θ*) and ∇_θ ψ_t* = ∇_θ ψ(Z_t, θ*). The piecewise constant interpolation of u_t on [0, ∞) with interpolation intervals {a_t} is defined as u^0(τ) = u_t, τ ∈ [τ_t, τ_{t+1}), and the leftward shifts are defined as u^t(τ) = u^0(τ_t + τ), τ ≥ 0. The asymptotic distribution of θ̂_t is found by showing that u^t(·) converges to the solution of a stochastic differential equation (SDE) and then characterizing the weak limit of u^t(·).

We adopt the following conditions:

ASSUMPTION B.1: Assumption A.1 holds and {Z_t, t = 0, ±1, ±2, ...} is a stationary sequence on (Ω, F, P).

ASSUMPTION B.2:
(a) Assumption A.2(a) holds; and
(b) For each z ∈ R^s, ψ(z, ·) is continuously differentiable such that there exist functions ρ_2 : R+ → R+ and h_4 : R^s → R+ such that ρ_2(u) → 0 as u → 0, h_4 is measurable-B^s, and for some θ^0 interior to Θ and each (z, θ) in R^s × Θ^0, Θ^0 an open neighborhood in Θ of θ^0,

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| ≤ ρ_2(|θ - θ^0|) h_4(z).

ASSUMPTION B.3: There exists θ* ∈ int Θ such that θ* = θ^0 in Assumption B.2, Eψ_t* = 0, ψ_t* ∈ L_6(P), ∇_θ ψ_t* ∈ L_2(P), and the eigenvalues of H = H* + I_k/2 (with H* = E(∇_θ ψ_t*)) have negative real parts.

ASSUMPTION B.5: Let F_0 = σ(Z_t, t ≤ 0) and suppose:
(a) (i) Σ_{t=0}^∞ ||E(ψ_t* | F_0)||_2 < ∞; and (ii) Σ_{t=0}^∞ sup_{j≥0} ||E(ψ_t* ψ_{t+j}*' | F_0) - σ_j||_2 < ∞, where σ_j = E(ψ_t* ψ_{t+j}*');
(b) For some η_4 ∈ R, Σ_{t=0}^∞ (t+1)^{-1} [h_4(Z_t) - η_4] converges a.s.-P; and
(c) Σ_{t=0}^∞ (t+1)^{-1} [∇_θ ψ_t* - H*] and Σ_{t=0}^∞ (t+1)^{-1} [|∇_θ ψ_t*| - h*] converge a.s.-P, where h* = E|∇_θ ψ_t*|.

The stationarity imposed in Assumption B.1 is extremely convenient; without this, the analysis becomes exceedingly complicated. Assumption B.2(b) imposes a Lipschitz condition on ∇_θ ψ analogous to that of A.2(b.ii) for ψ. Assumption B.3 imposes additional moment conditions and identifies θ* as a candidate asymptotically stable equilibrium. As we take a_t = (t+1)^{-1}, there is no analog to Assumption A.4 or A.4'. Finally, Assumption B.5 imposes some further convergence conditions beyond those of A.5. Assumption B.5(a) restricts the local fluctuations (quadratic variation) induced by (t+1)^{-1/2} q_t* in (II.2.4) to be compatible with those of a Wiener process. Assumption B.5(b, c) (together with B.2) ensures that the effects of the second term and the last term in (II.2.5) eventually vanish.
The asymptotic normality result can be stated as follows.

THEOREM II.2.4: Suppose Assumptions B.1-B.3 and B.5 hold, and that θ̂_t → θ* a.s.-P, where {θ̂_t} is generated by (II.1.1) with θ̂_0 arbitrary, a_t = (t+1)^{-1}, and θ* is an isolated element of Θ*. Then:
(a) {u_t} is tight in R^k;
(b) Σ* ≡ Σ_{j=-∞}^∞ σ_j < ∞;
(c) {u^t(·)} converges weakly to the stationary solution of dU(τ) = H U(τ) dτ + Σ*^{1/2} dW(τ), where W(·) denotes the standard k-variate Wiener process. In particular, u_t converges in distribution to N(0, F*), where F* = ∫_0^∞ exp[Hc] Σ* exp[H'c] dc is the unique solution to the matrix equation H F* + F* H' = -Σ*;
(d) If H* is symmetric, then F* = M L M', where M is the orthogonal matrix such that M'(-H*)M = Λ, with Λ the diagonal matrix containing the eigenvalues (λ_1, ..., λ_k) of -H* in decreasing order, and L has (i, j) element (λ_i + λ_j - 1)^{-1} K_{ij}, where K = M'Σ*M.

If a_t is chosen to be (t+1)^{-1} A (for a finite nonsingular k×k matrix A), then the SDE in Theorem II.2.4(c) becomes dU(τ) = H̃ U(τ) dτ + A Σ*^{1/2} dW(τ), and the covariance matrix of the asymptotic distribution becomes A F* A'. Part (d) gives an alternative expression for the covariance matrix of the asymptotic distribution, analogous to that given by Fabian (1968). Despite the assumed stationarity, Theorem II.2.4 generalizes previous results in that the random variables can be unbounded and the measurements can be correlated (cf. Ljung and Söderström, 1983, Ch. 4, and Fabian, 1968). Again, the properties of mixingales can be exploited to verify the convergence conditions. We impose

ASSUMPTION B.5':
(a) (i) {ψ_t*, F_t} is a mixingale of size -2 with c_t ≤ K for some K < ∞, t = 1, 2, ...;
(ii) there exist a constant K < ∞ and a sequence of real numbers {b_t} such that ||E(ψ_t* ψ_{t+j}*' | F_0) - σ_j||_2 ≤ K b_t for all j, and {b_t} is of size -2.
(b) {h_4(Z_t) - E(h_4(Z_t)), F_t}, {∇_θ ψ_t* - H*, F_t} and {|∇_θ ψ_t*| - h*, F_t} are mixingales of size -1/2.

We have the following result.


A et

COROLLARY

II.2.5:

Suppose

Assumptions

B.I-B.3

and

B.5'

hold

and

that

-1e*

a.s.-P

where {et} is generated by (1.1) with eO arbitrary, at = (t+ 1)-] and ()* is an isolated element of
EJ+."men the conclusions of'lheorem ll.2.4 hold

0
4.1) from the

This considerably

generalizes

an analogous result of White (1989a, Proposition

i.i.d. unifornlIy bounded case to the stationary dependent case. EngIund, HoIst and Ruppert
(1988) also give a result for i.i.d. observations.
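As a numerical illustration (ours, not part of the original text) of how the asymptotic covariance F* of Theorem II.2.4 can be obtained once estimates of H* and Σ* are available, the following Python sketch solves the matrix equation H F* + F* H' = -Σ*. The particular matrices below are hypothetical.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    # Hypothetical estimates of H* = E(grad psi_t*) and Sigma* = sum_j sigma_j.
    H_star = np.array([[-1.5, 0.2], [0.0, -2.0]])
    Sigma_star = np.array([[0.8, 0.1], [0.1, 0.5]])

    H = H_star + 0.5 * np.eye(2)   # H = H* + I_k/2; its eigenvalues have negative
                                   # real parts here, as required by Assumption B.3
    # solve_continuous_lyapunov(A, Q) returns X solving A X + X A' = Q,
    # so F* solves H F* + F* H' = -Sigma*.
    F_star = solve_continuous_lyapunov(H, -Sigma_star)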

II.3. RECURSIVE NONLINEAR LEAST SQUARES ESTIMATION

Suppose the nonlinear model f(X_t, δ) (f : R^r × D → R, X_t a random r×1 vector, δ ∈ D ⊂ R^k) is to be used to forecast the random variable Y_t. It is common to seek δ*, a solution to the problem

min_{δ∈D} E([Y_t - f(X_t, δ)]^2),

and form a forecast f(X_t, δ*). The solution δ* is also a solution to the problem

E(∇_δ f(X_t, δ) [Y_t - f(X_t, δ)]) = 0,

where ∇_δ is the gradient operator with respect to δ yielding a k×1 column vector.

The simple RM algorithm for this problem in nonlinear least squares regression is the algorithm (II.1.1) with

ψ(Z_t, θ) = ∇_δ f(X_t, δ) [Y_t - f(X_t, δ)],

where Z_t = (Y_t, X_t')' and θ = δ. The updating equation is

δ̂_{t+1} = δ̂_t + a_t ∇_δ f̂_t [Y_t - f̂_t],                                      (II.3.1)

where we have written f̂_t = f(X_t, δ̂_t) and ∇_δ f̂_t = ∇_δ f(X_t, δ̂_t). This is known as a "stochastic gradient method." In this section we consider the properties of this algorithm and two useful variants, the "quick" and the "modified" RM algorithms.

A disadvantage of the simple RM algorithm is that it may converge very slowly (e.g. White, 1988). To improve the speed of convergence, a natural modification is to take an approximate Gauss-Newton step at each stage. This yields the modified RM algorithm, also known as the "stochastic Newton method." The algorithm is given by (II.1.1) with

ψ(Z_t, θ) = [ψ_1(Z_t, θ)', ψ_2(Z_t, θ)']',
ψ_1(Z_t, θ) = vec[∇_δ f(X_t, δ) ∇_δ f(X_t, δ)' - G],
ψ_2(Z_t, θ) = G^{-1} ∇_δ f(X_t, δ) [Y_t - f(X_t, δ)],

where θ = (vec(G)', δ')'. The updating equations are then

Ĝ_{t+1} = Ĝ_t + a_t [∇_δ f̂_t ∇_δ f̂_t' - Ĝ_t],                                   (II.3.2a)
δ̂_{t+1} = δ̂_t + a_t Ĝ_{t+1}^{-1} ∇_δ f̂_t [Y_t - f̂_t].                           (II.3.2b)

We take Ĝ_0 to be an arbitrary positive-definite symmetric matrix.

The difficulties of applying this algorithm are: (1) the inversion of Ĝ_{t+1} is computationally demanding, and (2) the updating estimates Ĝ_t need not be positive-definite, pointing the algorithm in the wrong direction. The first problem can be solved by use of the rank one updating formula for the matrix inverse. Let P̂_{t+1} = Ĝ_{t+1}^{-1} and λ_t = (1 - a_t)/a_t. The modified RM algorithm is algebraically equivalent to

P̂_{t+1} = (1 - a_t)^{-1} [P̂_t - P̂_t ∇_δ f̂_t ∇_δ f̂_t' P̂_t / (λ_t + ∇_δ f̂_t' P̂_t ∇_δ f̂_t)],   (II.3.3a)
δ̂_{t+1} = δ̂_t + a_t P̂_{t+1} ∇_δ f̂_t [Y_t - f̂_t],                               (II.3.3b)

cf. Ljung and Söderström (1983, Ch. 2 & 3). The choice of P̂_0 proportional to I_k is often convenient.
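The following sketch (again ours, reusing the hypothetical f and grad_f above) implements one step of the modified RM algorithm in the algebraically equivalent form (II.3.3), propagating P̂_t = Ĝ_t^{-1} directly via the rank one updating formula.

    import numpy as np

    def modified_rm_step(delta, P, x_t, y_t, a_t, f, grad_f):
        # Rank-one (Sherman-Morrison) update of P = G^{-1}, then a Gauss-Newton-type step:
        #   P_{t+1}     = (1/(1 - a_t)) * [P - P g g' P / (lam + g' P g)],  lam = (1 - a_t)/a_t
        #   delta_{t+1} = delta_t + a_t * P_{t+1} g [y_t - f(x_t, delta_t)]
        # Requires 0 < a_t < 1, e.g. a_t = 1/(t+1) for t >= 1.
        g = grad_f(x_t, delta)
        lam = (1.0 - a_t) / a_t
        Pg = P @ g
        P_new = (P - np.outer(Pg, Pg) / (lam + g @ Pg)) / (1.0 - a_t)
        delta_new = delta + a_t * P_new @ g * (y_t - f(x_t, delta))
        return delta_new, P_new

Starting from a positive-definite P̂_0 (say, a large multiple of the identity) and applying the step for t ≥ 1 keeps a_t strictly between 0 and 1, so the update is well defined.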

To ensure that Ĝ_t is positive-definite, we may use the following modification of (II.3.2a):

G̃_{t+1} = Ĝ_t + a_t [∇_δ f̂_t ∇_δ f̂_t' - Ĝ_t],                                  (II.3.4a)
Ĝ_{t+1} = G̃_{t+1} + M_{t+1}(ε),                                               (II.3.4b)

where ε is some predetermined positive number, and M_{t+1}(ε) is chosen so that Ĝ_{t+1} - εI is positive-semidefinite. Some practical implementations of this can be found in Ljung and Söderström (1983, Ch. 6). A similar device can be applied to P̂_t. Implementation of this algorithm will be understood to employ a projection device restricting δ̂_t to a compact set D and Ĝ_t to a compact convex set Γ such that the maximum and minimum eigenvalues of Ĝ_t lie in a bounded strictly positive interval.

A simplification of the modified RM algorithm is to choose G to be a diagonal matrix. In particular, we take G = c I_k, where c is a positive scalar, so that matrix inversion is avoided. This yields the quick RM algorithm, (II.1.1) with ψ = [ψ_1', ψ_2']', where now

ψ_1(Z_t, θ) = ∇_δ f(X_t, δ)' ∇_δ f(X_t, δ) - c,
ψ_2(Z_t, θ) = c^{-1} ∇_δ f(X_t, δ) [Y_t - f(X_t, δ)],

so that the updating equations become

ĉ_{t+1} = ĉ_t + a_t [∇_δ f̂_t' ∇_δ f̂_t - ĉ_t],                                  (II.3.5a)
δ̂_{t+1} = δ̂_t + a_t ĉ_{t+1}^{-1} ∇_δ f̂_t [Y_t - f̂_t].                           (II.3.5b)

The scalar ĉ_t can easily be modified to be positive in a manner analogous to (II.3.4); we also restrict ĉ_t to be bounded. The quick RM algorithm is a compromise between the other two algorithms in that it takes a negative gradient direction with a scaling factor utilizing some local curvature information. Consequently, the quick algorithm ought to converge more quickly than the simple algorithm but more slowly than the modified algorithm. When a_t = (t+1)^{-1}, the quick algorithm reduces to the "quick and dirty" algorithm of Albert and Gardner (1967, Ch. 7).
It is straightforward to impose conditions ensuring the validity of all assumptions required for the convergence results of the preceding section. Only the mixingale assumptions A.5' and B.5' require particular attention. We make use of a convenient and fairly general class of mixingales, near epoch dependent (NED) functions of mixing processes (Billingsley, 1968; McLeish, 1975; Gallant and White, 1988a).
Let {V_t} be a stochastic process on (Ω, F, P) and define the mixing coefficients

φ_m = sup_τ sup_{F ∈ F^τ_{-∞}, G ∈ F^∞_{τ+m} : P(F) > 0} |P(G | F) - P(G)|,
α_m = sup_τ sup_{F ∈ F^τ_{-∞}, G ∈ F^∞_{τ+m}} |P(G ∩ F) - P(G) P(F)|,

where F^t_τ = σ(V_τ, ..., V_t). When φ_m → 0 or α_m → 0 as m → ∞ we say that {V_t} is φ-mixing (uniform mixing) or α-mixing (strong mixing). When φ_m = O(m^λ) for some λ < -a we say that {V_t} is φ-mixing of size -a, and similarly for α_m. We use the following definition of near epoch dependence, where we adopt the notation E^{t+m}_{t-m}(·) ≡ E(· | F^{t+m}_{t-m}).

DEFINITION II.3.1: Let {Z_t} be a sequence of random variables belonging to L_2(P), and let {V_t} be a stochastic process on (Ω, F, P). Then {Z_t} is near epoch dependent (NED) on {V_t} of size -a if ν_m ≡ sup_t ||Z_t - E^{t+m}_{t-m}(Z_t)||_2 is of size -a.

The following three results make it straightforward to impose conditions sufficing for Assumptions A.5' and B.5'. The first is obtained by following the argument of Theorem 3.1 of McLeish (1975). The second simplifies a result of Andrews (1989). The third allows simple treatment of products of NED sequences.

PROPOSITION II.3.2: Let {Z_t ∈ L_p(P)}, p ≥ 2, be NED on {V_t} of size -a, where {V_t} is a mixing sequence with φ_m of size -ap/(p-1), or α_m of size -2ap/(p-2), p > 2. Then {Z_t - E(Z_t)} is a mixingale of size -a.

PROPOSITION II.3.3: Let {Z_t} satisfy the conditions of Proposition II.3.2. Let g : R^s → R satisfy a Lipschitz condition, |g(z_1) - g(z_2)| ≤ L |z_1 - z_2|, L < ∞, z_1, z_2 ∈ R^s. Then {g(Z_t) ∈ L_2(P)} is NED on {V_t} of size -a. If {V_t} satisfies the conditions of Proposition II.3.2, then {g(Z_t) - E(g(Z_t))} is a mixingale of size -a.

PROPOSITION II.3.4: Let {U_t} and {W_t} be two sequences NED on {V_t} of size -a.
(a) If sup_t |W_t| ≤ Δ < ∞ and sup_t ||U_t||_4 ≤ Δ < ∞, then sup_t ||U_t W_t||_4 ≤ Δ^2 and {U_t W_t} is NED on {V_t} of size -a/2.
(b) If sup_t ||W_t||_8 ≤ Δ < ∞ and sup_t ||U_t||_8 ≤ Δ < ∞, then sup_t ||U_t W_t||_4 ≤ Δ^2 and {U_t W_t} is NED on {V_t} of size -a/2.
(c) If sup_t ||U_t||_8 ≤ Δ < ∞ and {V_t} satisfies the conditions of Proposition II.3.2, then there exist K < ∞ and a sequence of real numbers {b_t} of size -a/2 such that sup_{j≥0} ||E(U_t U_{t+j} | F_0) - E(U_t U_{t+j})||_2 ≤ K b_t.
Our subsequent results will make use of Proposition II.3.4(a), requiring sup_t ||Y_t||_4 ≤ Δ and a bound on the elements of X_t. Part (b) illustrates use of the Cauchy-Schwartz inequality to relax the boundedness condition; the price for this is a corresponding strengthening of moment conditions on U_t (corresponding to Y_t). Here we shall adopt boundedness conditions on X_t to minimize moment conditions placed on Y_t and facilitate verification of the Lipschitz condition of Proposition II.3.3. Part (c) permits verification of Assumption B.5'(a.ii). We impose the following conditions.

ASSUMPTION C.1: Assumption A.1 holds, and {Z_t} is NED on {V_t} of size -1, where Z_t = (Y_t, X_t')' with X_t bounded and sup_t ||Y_t||_p ≤ Δ < ∞, and {V_t} is a mixing sequence on (Ω, F, P) with φ_m of size -p/2(p-1), or α_m of size -p/(p-2), p ≥ 4.

ASSUMPTION C.2: f : R^r × D → R is jointly measurable, where D is a compact subset of R^k. For each x ∈ R^r, f(x, ·) is continuously differentiable, and f(x, ·) and ∇_δ f(x, ·) each satisfy a Lipschitz condition with Lipschitz constants L_1(x) and L_2(x), where L_1 and L_2 are each Lipschitz continuous in x. For each δ ∈ D, f(·, δ) and ∇_δ f(·, δ) each satisfies a Lipschitz condition.

The limit points of the recursive and the nonlinear least squares methods thus coincide, so that the RM estimators tend to the same limit(s) as the nonlinear least squares estimator (cf. Ljung and Söderström, 1983). Corollary II.3.5 is more general than the i.i.d. case treated by White (1989a) and the examples given in KC (Ch. 2), as we allow the data to be moderately dependent and heterogeneous. This result differs from those of Metivier and Priouret (1984) in that we require neither "conditional independence" nor stationarity.

Corollary II.3.5 also generalizes a result of Ruppert (1983). Ruppert assumes that for some δ*, Y_t = f(X_t, δ*) + ε_t and that (X_t, ε_t) is strong mixing of size -p/(p-2), a condition that may fail when X_t contains lagged Y_t, because Y_t need not be mixing when it is generated in this manner, even when ε_t and other elements of X_t are mixing. Indeed, this fact partially motivates our usage of near epoch dependence. Also, we do not require that Y_t is generated in the manner assumed by Ruppert (i.e., we may be estimating a "misspecified" model). Compared to the result of Ljung and Söderström (1983), we allow more dependence in the data, as the data need not be generated by a linear filter.

The modified RM algorithm can be identified with the extended Kalman filter for the nonlinear signal model

Y_t = f(X_t, δ_t) + ε_t ,
δ_t = δ_0 for all t.

The Kalman gain is a_t P̂_{t+1} ∇_δ f̂_t. Corollary II.3.5 thus provides conditions more general than previously available ensuring consistency of the filter. In particular, the model can be misspecified and the data can be NED on some underlying mixing sequence. Because the quick RM algorithm includes Albert and Gardner's quick and dirty algorithm, Corollary II.3.5 directly generalizes their consistency result to the case of dependent observations.

To obtain asymptotic normality results for the case of nonlinear regression, we impose the following conditions.


For this we impose appropriate conditions. In particular, we adopt Assumption C.1. The assumption of uniformly bounded X_t causes no loss of generality in the present context. This is a consequence of the fact that E(Y_t | X_t) = E(Y_t | X̃_t), where X̃_{ti} = τ(X_{ti}), i = 1, ..., r, and τ : R → [0, 1] is a strictly increasing continuous function (for example, the logistic c.d.f.). If X_t is not uniformly bounded then X̃_t is, and we seek an approximation to g(X̃_t) = E(Y_t | X̃_t). We revert to our original notation in what follows, with the implicit understanding that X_t has been transformed so that Assumption C.1 holds. Note, however, that Y_t is not assumed bounded, providing the desired generality.
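A one-line illustration (ours) of such a transformation: applying a strictly increasing continuous map onto (0, 1), here the logistic c.d.f., elementwise to each regressor.

    import numpy as np

    def squash(x):
        # Strictly increasing continuous map from R onto (0, 1); elementwise logistic c.d.f.
        return 1.0 / (1.0 + np.exp(-np.asarray(x)))

    x_tilde = squash([[-3.2, 0.0, 14.7], [2.1, -0.5, 0.3]])   # bounded regressors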

ASSUMPTION E.1: f : R^r × D → R is given by (II.4.1), where D = A × B × Γ, with A, B and Γ compact subsets of R^{q+1}, R^{q(r+1)} and R^r, respectively, and with G : R → R a bounded function continuously differentiable of order 3.

The conditions on G are readily verified for the logistic c.d.f. and hyperbolic tangent "squashers" commonly used in neural network applications.
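To make the assumed structure concrete, here is a sketch (ours; the partition of the parameters into β and γ is an illustrative convention, not the paper's notation) of a single hidden layer network output with the logistic squasher G.

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def shl_output(x, beta, gamma):
        # Single hidden layer feedforward network:
        #   f(x, theta) = beta_0 + sum_{j=1}^{q} beta_j * G(x_tilde' gamma_j),
        # with x_tilde = (1, x')', G the logistic squasher, beta in R^{q+1},
        # and gamma a q x (r+1) matrix of hidden unit weights.
        x_tilde = np.concatenate(([1.0], np.asarray(x, dtype=float)))
        hidden = logistic(gamma @ x_tilde)      # q hidden unit activations
        return beta[0] + beta[1:] @ hidden

    # Hypothetical dimensions: r = 3 inputs, q = 2 hidden units.
    beta = np.array([0.1, 0.7, -0.4])
    gamma = np.array([[0.0, 1.0, -1.0, 0.5],
                      [0.2, -0.3, 0.8, 0.0]])
    y_hat = shl_output([0.4, -1.2, 2.0], beta, gamma)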


COROLLARY II.4.1: Given Assumptions C.1, E.1, C.3 and A.4', let {θ̂_t} be given by (II.3.1), (II.3.2) or (II.3.5) (the simple, modified and quick algorithms, respectively) with θ̂_0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold.

Thus the method of back-propagation and its generalizations converge to a parameter vector giving a locally mean square optimal approximation to the conditional expectation function E(Y_t | X_t) under general conditions on the stochastic process {Z_t}. This result considerably generalizes Theorem 3.2 of White (1989a). For the asymptotic distribution results, we impose the following condition.

ASSUMPTION F.1: Assumption E.1 holds with G continuously differentiable of order 4.

COROLLARY II.4.2: Suppose Assumptions D.1, D.2 and F.1 hold and that θ̂_t → θ* a.s.-P, where {θ̂_t} is generated by (II.3.1), (II.3.2) or (II.3.5) with θ̂_0 chosen arbitrarily, a_t = (t+1)^{-1}, and θ* is an isolated element of Θ*. Then the conclusions of Theorem II.2.4 hold.

For many choices of ψ, the analysis parallels that for the least squares case rather closely. These results are within relatively easy reach for such estimation procedures.

For neural network models, it is desirable to relax the assumption that q is fixed. Letting q → ∞ as the available sample becomes arbitrarily large permits use of neural network models for purposes of non-parametric estimation. Off-line non-parametric estimation methods for the case of mixing processes are treated by White (1990a) using results for the method of sieves (Grenander, 1981; White and Wooldridge, 1991). On-line non-parametric estimation methods appear possible, but will require convergence to a global optimum of the underlying least squares problem, not just the local optimum that the present methods deliver. Results of Kushner (1987) for the method of simulated annealing provide hope that convergence to the global optimum is achievable for the case of dependent observations with appropriate modifications to the RM procedure.

Finally, it is of interest to consider RM algorithms for neural network models that generalize the feedforward networks treated here by allowing certain internal feedbacks. Such "recurrent" network models have been considered by Jordan (1986), Elman (1988) and Williams and Zipser (1989). For example, in the Elman (1988) setup, hidden layer activations feed back, so that network output is O_t = F(A_t' β), A_{tj} = G(X̃_t' γ_j + A_{t-1}' δ_j), j = 1, ..., q, where A_t = (A_{t0}, A_{t1}, ..., A_{tq})', A_{t0} = 1. This allows for internal network memory and for rich dynamic behavior of network output. Learning in such models is complicated by the fact that at any stage of learning, network output depends not only on the entire past history of inputs X_t, but also on the entire past history of estimated parameters θ̂_t. Results of KC are relevant for treating such internal feedbacks. Convergence of RM estimates in recurrent networks is studied by Kuan (1989) and Kuan, Hornik and White (1990).
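A sketch (ours, with hypothetical weight matrices) of one step of the Elman-type recurrence just described; the hidden activations carry the internal memory that makes learning in recurrent networks more delicate.

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def elman_step(x_t, a_prev, gamma, delta, beta):
        # A_tj = G(x_tilde' gamma_j + A_{t-1}' delta_j), j = 1, ..., q;  O_t = F(A_t' beta).
        # Here G is the logistic squasher and F is taken to be the identity.
        # Shapes: gamma is q x (r+1), delta is q x (q+1), beta has length q+1.
        x_tilde = np.concatenate(([1.0], np.asarray(x_t, dtype=float)))
        a_hidden = logistic(gamma @ x_tilde + delta @ a_prev)   # new hidden activations
        a_t = np.concatenate(([1.0], a_hidden))                 # A_t0 = 1 (bias unit)
        o_t = a_t @ beta                                        # network output
        return o_t, a_t

    # Hypothetical dimensions: r = 2 inputs, q = 3 hidden units.
    rng = np.random.default_rng(0)
    gamma, delta, beta = rng.normal(size=(3, 3)), rng.normal(size=(3, 4)), rng.normal(size=4)
    a = np.zeros(4); a[0] = 1.0
    for x in rng.normal(size=(5, 2)):
        o, a = elman_step(x, a, gamma, delta, beta)

Iterating this step over a sample makes the output O_t depend on the entire history of inputs, which is precisely the complication noted above.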

II.2.1(b) follows from Theorem II.2.1(c). Finally, we show that cycling between two asymptotically stable equilibria is impossible. It is easy to see that points in Θ* must be isolated. Let θ_1* and θ_2* be two isolated points in Θ*, and let N_{ε1} and N_{ε2} be neighborhoods of θ_1* and θ_2*, respectively, such that N_{ε1} ⊂ d(θ_1*), N_{ε2} ⊂ d(θ_2*), and N_{ε1} ∩ N_{ε2} = ∅. If the path of θ̂_t cycles between θ_1* and θ_2*, θ̂_t must move from, say, N_1 to N_2 infinitely often. Let {t_i} be an infinite subsequence of {t} such that θ̂_{t_i} ∈ N_{ε1}. Then θ̂^{t_i}(·) is a subsequence of θ̂^t(·) and has a limit θ̄(·) satisfying the limiting differential equation with θ̄(0) ∈ N_1. But because θ̂_t moves to N_2 infinitely often, for every T there is a τ > T such that the path leaves N_1 for N_2, so θ̄(τ) cannot converge to θ_1*. This violates the asymptotic stability of θ_1* and proves Theorem II.2.1(d).

PROOF OF COROLLARY II.2.3: The result follows from Theorem II.2.1 because the summability condition on a_t in Assumption A.4' implies a_t → 0 as t → ∞ and Assumption A.5' implies Assumption A.5 by the mixingale convergence theorem (McLeish, 1975, Corollary 1.8).
PROOF OF THEOREM II.2.4: We verify the conditions for Theorem 2 of KH. We first observe that the conditions [A1], [A4], [A7] and [A8] of KH are directly assumed, and that [A3] of KH is ensured by Assumption B.5(c) and Lemma A1.

Second, we show that the consequence of [A2] of KH holds under Assumptions B.2(b) and B.5(b, c). This amounts to showing that the second assertion in Lemma 1 of KH holds. By Assumption B.2(b) we have

Clearly, the integral on the RHS of (a.10) converges to zero a.s. because θ̂_t → θ* a.s. Let {ε_k} be a sequence of positive real numbers such that Σ_k ε_k < ∞, and let {N_k} be a sequence of integers tending to infinity as k → ∞. Define measurable sets A_k, B_k, C_k, D_k and F_k as:

PROOF OF COROLLARY II.2.5: Only Assumption B.5 needs to be verified. We observe that Assumption B.5'(b) is a mixingale condition ensuring Assumption B.5(b, c) by the mixingale convergence theorem. To establish Assumption B.5(a), we see that Assumption B.5'(a.i) ensures that for K < ∞

κ_t ≡ ||E(ψ_t* | F_0)||_2 ≤ K ξ_t ,                                              (a.15)

where ξ_t is the mixingale memory coefficient. The fact that ξ_t is of size -2 implies that Σ_{t=0}^∞ κ_t < ∞. This establishes Assumption B.5(a.i). Similarly, Assumption B.5'(a.ii) imposes

ζ_t ≡ sup_{j≥0} ||E(ψ_t* ψ_{t+j}*' - E(ψ_t* ψ_{t+j}*') | F_0)||_2 ≤ K b_t .

That b_t is of size -2 ensures that Σ_{t=0}^∞ ζ_t < ∞. This establishes Assumption B.5(a.ii).

PROOF OF PROPOSITION II.3.2: See Gallant and White (1988a, Lemma 3.14).

PROOF OF PROPOSITION II.3.3: See Andrews (1989, Lemma 1).

PROOF OF PROPOSITION II.3.4: (a) We first observe that

E|U_t W_t - E^{t+m}_{t-m}(U_t W_t)|^2 ≤ E|U_t W_t - U_{t,m} W_{t,m}|^2,

where U_{t,m} = E^{t+m}_{t-m}(U_t) and W_{t,m} = E^{t+m}_{t-m}(W_t). Here we employ the fact that E^{t+m}_{t-m}(U_t W_t) is the best L_2-predictor of U_t W_t among all F^{t+m}_{t-m}-measurable functions. Hence

||U_t W_t - E^{t+m}_{t-m}(U_t W_t)||_2 ≤ ||U_t W_t - U_{t,m} W_{t,m}||_2 .

Similarly,

||U_t W_t - U_{t,m} W_t||_2 ≤ Δ^{3/2} ||U_t - U_{t,m}||_2^{1/2} .

Consequently, ||U_t W_t - E^{t+m}_{t-m}(U_t W_t)||_2 ≤ Δ^{3/2}(ν^{1/2}_{U,m} + ν^{1/2}_{W,m}).

(c) Using the same argument as in (b) we can show

||U_t U_{t+j} - E^{t+j+m}_{t-m}(U_t U_{t+j})||_2 ≤ ||U_t U_{t+j} - E^{t+m}_{t-m}(U_t) E^{t+j+m}_{t+j-m}(U_{t+j})||_2
≤ ||U_t U_{t+j} - U_t E^{t+j+m}_{t+j-m}(U_{t+j})||_2 + ||U_t E^{t+j+m}_{t+j-m}(U_{t+j}) - E^{t+m}_{t-m}(U_t) E^{t+j+m}_{t+j-m}(U_{t+j})||_2 .

Hence, with E_0(·) = E(· | F_0), we have

||E_0(U_t U_{t+j}) - E(U_t U_{t+j})||_2 ≤ ||E_0 E^{t+j+s}_{t-s}(U_t U_{t+j}) - E(U_t U_{t+j})||_2 + ||E_0[U_t U_{t+j} - E^{t+j+s}_{t-s}(U_t U_{t+j})]||_2 ,   (a.17)

where s = [t/2] is the integer part of t/2. By Jensen's inequality, the second term in (a.17) is bounded by K b_t, where K is a constant. It follows from Lemma 2.1 of McLeish (1975) and Lemma 3.14 of Gallant and White (1988a) that the first term in (a.17) satisfies the same type of bound, which gives the result.
PROOF OF COROLLARY II.3.5: We verify the conditions of Corollary II.2.3. Because the other conditions obviously hold, for the simple RM estimates it suffices to show that Assumptions A.2(b) and A.5' hold. Given Assumption C.2, it is straightforward to verify that f and ∇_δ f are such that |f(x, δ)| ≤ Q_1(x) and |∇_δ f(x, δ)| ≤ Q_2(x) for all δ ∈ D (compact), where Q_1 and Q_2 are Lipschitz continuous in x. Therefore,

|ψ(z, θ)| = |∇_δ f(x, δ)[y - f(x, δ)]| ≤ Q_2(x)[|y| + Q_1(x)],

so that Assumption A.2(b.i) holds for b(θ) = 1, h_1(z) = 1, and h_2(z) = Q_2(x)[|y| + Q_1(x)]. Next,

|ψ(z, θ_1) - ψ(z, θ_2)| = |∇_δ f(x, δ_1)[y - f(x, δ_1)] - ∇_δ f(x, δ_2)[y - f(x, δ_2)]|
≤ |∇_δ f(x, δ_1) y - ∇_δ f(x, δ_2) y| + |∇_δ f(x, δ_2) f(x, δ_2) - ∇_δ f(x, δ_1) f(x, δ_1)|.   (a.18)

It follows from Assumption C.2 that

|∇_δ f(x, δ_1) y - ∇_δ f(x, δ_2) y| ≤ |y| L_2(x) |δ_1 - δ_2|,

|∇_δ f(x, δ_2) f(x, δ_2) - ∇_δ f(x, δ_1) f(x, δ_1)|
≤ |∇_δ f(x, δ_2) f(x, δ_2) - ∇_δ f(x, δ_2) f(x, δ_1)| + |∇_δ f(x, δ_2) f(x, δ_1) - ∇_δ f(x, δ_1) f(x, δ_1)|
≤ |∇_δ f(x, δ_2)| L_1(x) |δ_1 - δ_2| + |f(x, δ_1)| L_2(x) |δ_1 - δ_2|.

Hence (a.18) becomes

|ψ(z, θ_1) - ψ(z, θ_2)| ≤ [|y| L_2(x) + Q_2(x) L_1(x) + Q_1(x) L_2(x)] |δ_1 - δ_2|.

This establishes Assumption A.2(b.ii).

Because |y|, L_1(x), L_2(x), Q_1(x) and Q_2(x) satisfy Lipschitz conditions, Proposition II.3.3 ensures that |Y_t|, L_1(X_t), L_2(X_t), Q_1(X_t) and Q_2(X_t) are NED on {V_t} of size -1. Because X_t is bounded, Q_1(X_t), Q_2(X_t), L_1(X_t) and L_2(X_t) are bounded. Because ||Y_t||_4 ≤ Δ, it follows from Proposition II.3.4(a) and Corollary 4.3(a) of Gallant and White (1988a) (i.e., sums of random variables NED of size -a are also NED of size -a) that h_3(Z_t) is NED on {V_t} of size -1/2. The mixing conditions of Assumption C.1 then ensure that {h_3(Z_t) - Eh_3(Z_t)} is a mixingale of size -1/2 by Proposition II.3.2. Similarly, {h_2(Z_t) - Eh_2(Z_t)} is a mixingale of size -1/2, establishing Assumption A.5'(b).

We next verify that for each θ ∈ Θ, {ψ(Z_t, θ)} is a mixingale of size -1/2. Fix θ (= δ). Observe that the Lipschitz condition on f(·, δ) and the conditions on {Z_t} imply by Proposition II.3.3 that {f(X_t, δ)} is NED on {V_t} of size -1. The triangle inequality implies that {Y_t - f(X_t, δ)} is NED on {V_t} of size -1, and the continuity of f(·, δ), the boundedness of X_t, and the fact that ||Y_t||_4 ≤ Δ < ∞ imply that ||Y_t - f(X_t, δ)||_4 ≤ Δ̃ < ∞. The Lipschitz condition on ∇_δ f(·, δ) and the conditions on {Z_t} imply by Proposition II.3.3 that {∇_δ f(X_t, δ)} is also NED on {V_t} of size -1. Further, the elements of ∇_δ f(X_t, δ) are bounded, so that by Proposition II.3.4(a) {ψ(Z_t, θ) = ∇_δ f(X_t, δ)[Y_t - f(X_t, δ)]} is NED on {V_t} of size -1/2. It follows from Proposition II.3.2 that {∇_δ f(X_t, δ)[Y_t - f(X_t, δ)]} is a mixingale of size -1/2, given the mixing conditions imposed on {V_t} by Assumption C.1. Thus, Assumption A.5'(a) holds, and the result for the simple RM procedure follows.

For the modified RM estimates we first note that every element of G^{-1} is bounded above,
so that |G^{-1}| ≤ Δ for some Δ < ∞. Now

|G^{-1} ∇_δ f(x, δ)[y - f(x, δ)]| ≤ Δ Q_2(x)[|y| + Q_1(x)]

and

|ψ_1(z, θ)| = |vec(∇_δ f(x, δ) ∇_δ f(x, δ)' - G)| ≤ |vec(∇_δ f(x, δ) ∇_δ f(x, δ)')| + |vec G| = |∇_δ f(x, δ)|^2 + |vec G|,

where we use the fact that |vec A| = [tr(A'A)]^{1/2}. Hence Assumption A.2(b.i) holds.

We now establish a mean-value expansion result for G^{-1}. Recall that G is restricted to a convex compact set Γ, so the mean value theorem applies. A matrix differentiation result shows that when G is symmetric and nonsingular, dG^{-1}/dg_{ij} = -G^{-1} S_{ij} G^{-1}, where g_{ij} is the ij-th element of G and S_{ij} is a selection matrix whose every element is zero except that the ij-th and ji-th elements are one; see Graybill (1983, p. 358). Hence we can write

d vec(G^{-1})/dg_{ij} = -vec(G^{-1} S_{ij} G^{-1}).

Next,

|ψ_1(z, θ_1) - ψ_1(z, θ_2)| ≤ |vec(∇_δ f(x, δ_1) ∇_δ f(x, δ_1)') - vec(∇_δ f(x, δ_2) ∇_δ f(x, δ_2)')| + |vec(G_2 - G_1)|.   (a.19)

The first term of (a.19) is less than

|(I ⊗ ∇_δ f(x, δ_1)) [∇_δ f(x, δ_1) - ∇_δ f(x, δ_2)]| + |(∇_δ f(x, δ_2) ⊗ I) [∇_δ f(x, δ_1) - ∇_δ f(x, δ_2)]|,

where we used the fact that vec(ABC) = (C' ⊗ A) vec B. It can be verified that |I ⊗ ∇_δ f(x, δ_1)| ≤ K Q_2(x) and |∇_δ f(x, δ_2) ⊗ I| ≤ K Q_2(x), with K depending only on k, the dimension of δ. Thus, (a.19) becomes

|ψ_1(z, θ_1) - ψ_1(z, θ_2)| ≤ 2K Q_2(x) L_2(x) |δ_1 - δ_2| + |vec(G_2 - G_1)| ≤ h_3'(z) |θ_1 - θ_2|,

where h_3'(z) = 2K Q_2(x) L_2(x) + 1. Hence Assumption A.2(b.ii) holds, as

|ψ(z, θ_1) - ψ(z, θ_2)| ≤ |ψ_1(z, θ_1) - ψ_1(z, θ_2)| + |ψ_2(z, θ_1) - ψ_2(z, θ_2)| ≤ h_3(z) |θ_1 - θ_2|,

with h_3(z) = h_3'(z) + h_3''(z), where h_3''(z) is the corresponding Lipschitz function for ψ_2. Using the same arguments as before we have that {h_2(Z_t) - Eh_2(Z_t)}, {h_3(Z_t) - Eh_3(Z_t)}, and {ψ(Z_t, θ) - Eψ(Z_t, θ)} are mixingales of size -1/2. Hence Assumption A.5' also holds. This yields the desired results for the modified RM estimates. The conclusions for the quick RM estimates follow because the quick algorithm is a special case of the modified algorithm.

PROOF OF COROLLARY II.3.6: We verify the conditions of Corollary II.2.5. For the simple RM estimates we need to show that Assumptions B.2(b) and B.5' hold. In this case

∇_θ ψ(z, θ) = ∇_δ(∇_δ f(x, δ)[y - f(x, δ)]) = ∇_δδ f(x, δ)[y - f(x, δ)] - ∇_δ f(x, δ) ∇_δ f(x, δ)',

hence for θ in int Θ and θ^0 in Θ^0

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| = |∇_δδ f (y - f) - ∇_δ f ∇_δ f' - ∇_δδ f^0 (y - f^0) + ∇_δ f^0 ∇_δ f^0'|
≤ |∇_δδ f y - ∇_δδ f^0 y| + |(∇_δδ f^0) f^0 - (∇_δδ f) f| + |∇_δ f^0 ∇_δ f^0' - ∇_δ f ∇_δ f'|,

where we have written f = f(x, δ), f^0 = f(x, δ^0), etc. By Assumption D.2, θ^0 = δ^0 = δ*. Applying Assumption D.3 we get

|∇_δδ f y - ∇_δδ f^0 y| ≤ |y| L_3(x) |δ - δ^0|,

|(∇_δδ f^0) f^0 - (∇_δδ f) f| ≤ |(∇_δδ f^0) f^0 - (∇_δδ f^0) f| + |(∇_δδ f^0) f - (∇_δδ f) f|
≤ |∇_δδ f^0| L_1(x) |δ - δ^0| + Q_1(x) L_3(x) |δ - δ^0|,

since |∇_δδ f^0| ≤ Q_3(x), with Q_3 Lipschitz-continuous in x by straightforward arguments. Further,

|∇_δ f^0 ∇_δ f^0' - ∇_δ f ∇_δ f'| ≤ |∇_δ f^0 ∇_δ f^0' - ∇_δ f^0 ∇_δ f'| + |∇_δ f^0 ∇_δ f' - ∇_δ f ∇_δ f'| ≤ 2 Q_2(x) L_2(x) |δ - δ^0|,

so that

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| ≤ [|y| L_3(x) + Q_3(x) L_1(x) + Q_1(x) L_3(x) + 2 Q_2(x) L_2(x)] |δ - δ^0|.   (a.20)

For the modified RM estimates, we have

|G^{-1}[∇_δδ f (y - f) - ∇_δ f ∇_δ f'] - (G^0)^{-1}[∇_δδ f^0 (y - f^0) - ∇_δ f^0 ∇_δ f^0']|
≤ |G^{-1}[∇_δδ f (y - f) - ∇_δ f ∇_δ f' - ∇_δδ f^0 (y - f^0) + ∇_δ f^0 ∇_δ f^0']| + |(G^{-1} - (G^0)^{-1})[∇_δδ f^0 (y - f^0) - ∇_δ f^0 ∇_δ f^0']|.   (a.22)

It follows from (a.20) that the first term in (a.22) is less than

Δ [|y| L_3(x) + Q_3(x) L_1(x) + Q_1(x) L_3(x) + 2 Q_2(x) L_2(x)] |θ - θ^0|.

It can also be verified that the second term in (a.22) is less than

|G^{-1} - (G^0)^{-1}| [Q_3(x)(|y| + Q_1(x)) + (Q_2(x))^2] ≤ [Q_3(x)(|y| + Q_1(x)) + (Q_2(x))^2] |vec(G - G^0)|.

Thus (a.22) becomes

|G^{-1}[∇_δδ f (y - f) - ∇_δ f ∇_δ f'] - (G^0)^{-1}[∇_δδ f^0 (y - f^0) - ∇_δ f^0 ∇_δ f^0']| ≤ h_4''(z) |θ - θ^0|,

where h_4''(z) ≡ Δ [|y| L_3(x) + Q_3(x) L_1(x) + Q_1(x) L_3(x) + 2 Q_2(x) L_2(x)] + Q_3(x)(|y| + Q_1(x)) + (Q_2(x))^2.

We also note the fact that |A| ≤ |vec A| ≤ Σ_i Σ_j |a_{ij}|, where A is a square matrix and a_{ij} are its elements. Combining these results we immediately get

|∇_θ ψ(z, θ) - ∇_θ ψ(z, θ^0)| ≤ h_4(z) |θ - θ^0|,

where h_4(z) is the sum of the bounding functions just obtained. This establishes Assumption B.2(b). All other conditions can be verified as in the proof for the simple RM algorithm. Thus the asymptotic distribution of θ̂_t follows from Corollary II.2.5 with H* = E(∇_θ ψ_t*), where ∇_θ ψ(z, θ) is given by (a.21).

where the first equality follows from the fact that exp[(-I_k/2)c] = [exp(-c/2)] I_k. For the quick RM algorithm, H_3* is also block triangular, and the lower right k×k block of Σ_3 can be obtained in the same way. It follows from the lower right k×k block of F_3* that

(t+1)^{1/2}(δ̂_t - δ*) converges in distribution to N(0, F_3*).

We now show that F_2* - (G*)^{-1} Σ_1 (G*)^{-1} is a positive semidefinite matrix. From Theorem II.2.4(c) we get

-Σ_1 = H_1 F_1* + F_1* H_1' = (H_1* + I_k/2) F_1* + F_1* (H_1*' + I_k/2) = H_1* F_1* + F_1* H_1*' + F_1*.

Substituting this expression for Σ_1 and rearranging shows that F_2* - (G*)^{-1} Σ_1 (G*)^{-1} can be written as a quadratic form that is positive semidefinite, where (F_1*)^{1/2} is such that (F_1*)^{1/2}(F_1*)^{1/2} = F_1*. Since Σ̃ = Σ_1, the result holds.

PROOF OF COROLLARY II.4.1: Owing to the compactness of the relevant domains, the special structure of f in (II.4.1) and the continuous differentiability of G, it is straightforward to verify the domination and Lipschitz conditions required for application of Corollary II.3.5.

PROOF OF COROLLARY II.4.2: Direct application of Corollary II.3.6.

TABLE 1
DETERMINISTIC CHAOS APPROXIMATED BY LINEAR MODEL†

                 Logistic Map    Circle Map    Bier-Bountis Map
N                    250             250              250
σ̂                  .1967           .2957            1.787
R²                  .3595           .1202            .8472
SIC                 -1.60           -1.20             1.21

† N = number of observations; σ̂ = regression standard error; R² = squared multiple regression coefficient; SIC = Schwartz Information Criterion: SIC = log σ̂ + k (log N)/(2N), where k = number of estimated coefficients (= 2).
TABLE 2
DETERMINISTIC CHAOS APPROXIMATED BY SINGLE HIDDEN LAYER FEEDFORWARD NETWORK†

                 Logistic Map    Circle Map    Bier-Bountis Map
q*                     8               8                8
N                    250             250              250
σ̂              2.68 × 10⁻⁴     1.35 × 10⁻³      2.34 × 10⁻²
R²                  .9999           .9999            .9999
SIC                 -7.93           -6.32            -3.46

† q* = SIC-optimal number of hidden units; remaining symbols as in Table 1.


REFERENCES

Albert, A.E., and L.A. Gardner (1967): Stochastic Approximation and Nonlinear Regression, Cambridge: M.I. T. Press.

Amemiya, T. (1981): "Qualitative Response Models: A Survey," Journal of Economic Literature

19, 1483-1536.
Amemiya, T. (1985): Advanced Econometrics. Cambridge: Harvard University Press.

Andrews, D. W.K. (1989): "An Empirical Process Central Limit Theorem for Dependent NonIdentically Distributed Random Variables," Cowles Foundation Discussion Paper, Yale University.

Andrews, D.W.K. (1991a): "Asymptotic Optimality of Generalized CL, Cross-validation and Generalized Cross-validation in Regression with Heteroskedastic Errors," Journal of
Econometrics 47,359-378

Andrews, D.W.K. (1991b): "Asymptotic Normality of Series Estimators for Nonparametric and Semi-parametric Regression Models," Econometrica 59, 307-345.

Arnold, L. (1974): Stochastic Differential Equations: Theory and Applications. New York: John Wiley & Sons.

Barron, A. (1990): "Complexity Regularization with Application to Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 57.

Barron, A. (1991a): "Universal Approximation Bounds for Superpositions of a Sigmoidal Function," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 58.

Barron, A. (1991b): "Approximation and Estimation Bounds for Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 59.

Baxt, W.G. (1991): "The Optimization of the Training of an Artificial Neural Network Trained to Recognize the Presence of Myocardial Infarction by the Variance of Disease Likelihood," UC San Diego Medical Center Technical Report.
Bierens, H. (1990): "A Consistent Conditional Moment Test of Functional Form," Econometrica 58, 1443-1458.

Billingsley, P. (1968): Convergence of Probability Measures. New York: John Wiley & Sons.

Billingsley, P. (1979): Probability and Measure. New York: Wiley.

Blum, J.R. (1954): "Approximation Methods Which Converge with Probability One," Annals of
Mathematical Statistics 25,382-386.

Blum, E.K. and L.K. Li (1991): "Approximation Theory and Feedforward Networks," Neural Networks 4, 511-516.

Carroll, S.M. and B.W. Dickinson (1989): "Construction of Neural Nets Using the Radon Transform," in Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:607-611.

Cybenko, G. (1989): "Approximation by Superpositions of a Sigmoid Function," Mathematics of Control, Signals, and Systems 2, 303-314.

Cowan, J. (1967): "A Mathematical Theory of Central Nervous Activity," unpublished Ph.D. dissertation, University of London.

Davies, R.B. (1977): "Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative," Biometrika 64, 247-254.

Davies, R.B. (1987): "Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative," Biometrika 74, 33-43.


Domowitz, I. and H. White (1982): "Misspecified Models with Dependent Observations," Journal of Econometrics 20, 35-58.

Duffie, D. and K.J. Singleton (1990): "Simulated Moments Estimation of Markov Models of Asset Prices," NBER Technical Paper 87.

Elbadawi, I., A.R. Gallant and G. Souza (1983): "An Elasticity Can be Estimated Consistently Without A Priori Knowledge of Functional Form," Econometrica 51, 1731-1752.

Elman, J.L. (1988): "Finding Structure in Time," CRL Report 8801, Center for Research in Language, UC San Diego.

Englund, J.-E., U. Holst, and D. Ruppert (1988): "Recursive M-Estimators of Location and Scale for Dependent Sequences," Scandinavian Journal of Statistics 15, 147-159.

Fabian, V. (1968): "On Asymptotic Normality in Stochastic Approximation," Annals of Mathematical Statistics 39, 1327-1332.

Foutz, R.V. and R.C. Srivastava (1977): "The Performance of the Likelihood Ratio Test When the Model is Incorrect," Annals of Statistics 5, 1183-1194.

Friedman, J.H. and W. Stuetzle (1981): "Projection Pursuit Regression," Journal of the American Statistical Association 76, 817-823.

Fukushima, K. and S. Miyake (1984): "Neocognitron: A New Algorithm for Pattern Recognition Tolerant of Deformations and Shifts in Position," Pattern Recognition 15, 455-469.

Funahashi, K. (1989): "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks 2, 183-192.

Gallant, A.R. (1973): "Inference for Nonlinear Models," North Carolina State University, Institute of Statistics, Mimeograph Series No, 875.

Gallant, A.R. (1975):

"Testing a Subset of the Parameters of a Nonlinear

Regression Model,"

Journal of the American Statistical Association 70,927-932.

Gallant, A.R. (1981): "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form," Journal of Econometrics 15, 211-245.

Gallant, A.R. (1987): "Identification and Consistency in Seminonparametric Regression," in T. Bewley ed., Advances in Econometrics Fifth World Congress. New York: Cambridge University Press, pp. 145-170.

Gallant, A.R. and H. White (1988a): A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Oxford: Basil Blackwell.

Gallant, A.R. and H. White (1988b): "There Exists a Neural Network that Does Not Make Avoidable Mistakes," Proceedings of the Second Annual IEEE Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:657-664.

Gallant, A.R. and H. White (1991): "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks 4 (to appear).

Gamba, A., L. Gamberini, G. Palmieri and R. Sanna (1961): "Further Experiments with PAPA," Nuovo Cimento Suppl. 20, 221-231.

Gauss, K.F. (1809): Theoria Motus Corporum Coelestium. English translation (1963): Theory of the Motion of Heavenly Bodies. New York: Dover.

Geman, S. and C. Hwang (1982): "Nonparametric Maximum Likelihood Estimation by the Method of Sieves," Annals of Statistics 10, 401-414.

Gerencser, L. (1986): "Parameter Tracking of Time-Varying Continuous-Time Linear Stochastic Systems," in C.E. Byrnes and A. Lindquist eds., Modelling, Identification and Robust Control. New York: Elsevier, pp. 581-594.

Goldstein, L. (1988): "On the Choice of Step Size in the Robbins-Monro Procedure," Statistics and Probability Letters 6, 299-303.

Gourieroux, C., A. Monfort and A. Trognon (1984a): "Pseudo-Maximum Likelihood Methods: Theory," Econometrica 52, 681-700.

Gourieroux, C., A. Monfort and A. Trognon (1984b): "Pseudo-Maximum Likelihood Methods: Application to Poisson Models," Econometrica 52, 701-720.

Graybill, F.A. (1983): Matrices with Applications in Statistics, second edition. Belmont: Wadsworth.

Grenander, U. (1981): Abstract Inference. New York: Wiley.

Hansen, B. (1991):

"Inference

When a Nuisance Parameter is Not Identified

Under the Null

Hypothesis," University of Rochester Department of Economics Discussion Paper.

Hecht-Nielsen, R. (1989): "Theory of the Back-Propagation Neural Network," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:593-606.

Hendry, D.F. and J.-F. Richard

(1990):

"Likelihood Evaluation for Dynamic Latent Variable

Models," Duke Institute of Statistics and Decision Sciences Discussion Paper 90A15

Hornik, K. (1991): "Approximation Capabilities of Multilayer Feedforward Nets," Neural Networks 4, 231-242.

Hornik, K. and C.-M. Kuan (1990): "Convergence of Learning Algorithms with Constant Learning Rates," University of Illinois at Urbana-Champaign Department of Economics Discussion Paper.

Hornik, K., M. Stinchcombe, and H. White (1989): "Multi-Layer Feedforward Networks Are Universal Approximators," Neural Networks 2, 359-366.

Hornik, K., M. Stinchcombe and H. White (1990): "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks 3, 551-560.

Hu, S. and W. Joerding (1990): "Monotonicity and Concavity Restrictions for a Single Hidden Layer Feedforward Network," Washington State University Department of Economics Discussion Paper.

Huber, P.J. (1964):

"Robust Estimation

of a Location Parameter," Annals of Mathematical

Statis-

tics 35,73-101.

Huber, P.J. (1967): "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1, pp. 221-233.

Huber, P.J. (1985): "Projection Pursuit," Annals of Statistics, 13,435-475.

Joerding, W. and J. Meador (1990): "Encoding A Priori Information in Neural Networks," Washington State University Department of Economics Discussion Paper.

Jones, L.K. (1991): "A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training," Annals of Statistics (forthcoming).

Jordan, M.I. (1986): "Serial Order: A Parallel Distributed Processing Approach," UC San Diego, Institute for Cognitive Science Report 8604.

Kuan, C.-M. (1989): "Estimation of Neural Network Models," Ph.D. Dissertation, UC San Diego.

Kuan, C.-M., K. Hornik and H. White (1990): "Some Convergence Results for Learning in Recurrent Neural Networks," UCSD Department of Economics Discussion Paper.

Kuan, C.-M. and H. White (1991): "Strong Convergence of Recursive m-estimators for Models with Dynamic Latent Variables," UC San Diego Department of Economics Discussion Paper 91-05R.

Kushner, H.J. (1987): "Asymptotic Global Behavior for Stochastic Approximation and Diffusions with Slowly Decreasing Noise Effects: Global Minimization via Monte Carlo," SIAM Journal of Applied Mathematics 47, 169-185.

Kushner, H.J. and D.S. Clark (1978): Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

Kushner, H.J. and H. Huang (1979): "Rates of Convergence for Stochastic Approximation Type Algorithms," SIAM Journal of Control and Optimization 17, 607-617.

Kushner, H.J. and H. Huang (1981): "Asymptotic Properties of Stochastic Approximations with Constant Coefficients," SIAM Journal of Control and Optimization 19, 87-105.

Lapedes, A. and R. Farber (1987): "Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling," Los Alamos National Laboratory Technical Report.

Le Cun, Y. (1985): "Une Procedure d'Apprentissage pour Reseau a Seuil Assymetrique," Proceedings of Cognitiva 85, 599-604.

Lee, T.H., H. White and C.W.J. Granger (1991): "Testing for Neglected Nonlinearity in Time Series Models," Journal of Econometrics (forthcoming).

Li, K.-C. (1987): "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," Annals of Statistics 15, 958-975.

Ljung, L. (1977): "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control AC-22, 551-575.
Ljung, L. and T. Soderstrom (1983): Theory and Practice of Recursive Identification. Cambridge:
M.I.T. Press.

Lukacs, E. (1975): Stochastic Convergence. 2nd ed., New York: Academic Press.

Marcet, A. and T.J. Sargent (1989): "Convergence of Least Squares Learning Mechanisms in Self Referential, Linear Stochastic Models," Journal of Economic Theory 48, 337-368.

Maxwell, T., G.L. Giles, Y.C. Lee and H.H. Chen (1986): "Nonlinear Dynamics of Artificial Neural Systems," in J. Denker ed., Neural Networks for Computing. New York:
American Institute of Physics.

McLeish, D.L. (1975): "A Maximal Inequality and Dependent Strong Laws," Annals of Probability 3, 829-839.

Metivier, M. and P. Priouret (1984): "Applications of a Kushner and Clark Lemma to General Classes of Stochastic Algorithm," IEEE Transactions on Information Theory IT-30, 140-151.

McCulloch, W.S. and W. Pitts (1943): "A Logical Calculus of the Ideas Immanent in Nervous
Activity ," Bulletin of Mathematical Biophysics 5, 115-133.

Minsky, M. and S. Papert (1969): Perceptrons.

Cambridge:

MIT Press.

Morris, R. and W.-S. Wong (1991): "Systematic Choice of Initial Points in Local Search: Extensions and Application to Neural Networks," Information Processing Letters (forthcoming).

Newey, W. (1985): "Maximum Likelihood Specification Testing and Conditional Moment Tests," Econometrica 53, 1047-1070.

Palmieri, G. and R. Sanna (1960): Methodos 12, No. 48.

Parker, D.B. (1982): "Learning Logic," Invention Report 581-64 (File 1), Stanford University Office of Technology Licensing.

Parker, D.B. (1985): "Learning Logic," MIT Center for Computational Research in Economics and Management Science Technical Report TR-47.

Potscher, B. and I. Prucha (1991a): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part I: Consistency and Approximation Concepts," Econometric Reviews (forthcoming).

Potscher, B. and I. Prucha (1991b): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part II: Asymptotic Normality," Econometric Reviews (forthcoming).

Robbins, H. and S. Monro (1951): "A Stochastic Approximation Method," Annals of Mathematical Statistics 22, 400-407.

Rosenblatt, F. (1957): "The Perceptron: A Perceiving and Recognizing Automaton," Project PARA, Cornell Aeronautical Laboratory Report 85-460-1.

Rosenblatt,

F. (1958):

"The Percdptron:

A Probabilistic

Model

for Information

Storage and

Organization in the Brain," Psychological Reviews 62, 386-408.

Rosenblatt, F. (1961): Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington D.C.: Spartan Books.

Rumelhart, D.E., G.E. Hinton and R.J. Williams (1986): "Learning Internal Representations by Error Propagation," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: M.I.T. Press, 1, pp. 318-362.

Ruppert, D. (1983): "Convergence of Stochastic Approximation Algorithms with Non-Additive Dependent Disturbances and Applications," in U. Herkenrath, D. Kalin and W. Vogel eds., Mathematical Learning Models-Theory and Algorithms. New York: Springer-Verlag, pp. 182-190.

Sawa, T. (1978): "Information Criteria for Discriminating Among Alternative Regression Models," Econometrica 46, 1273-1292.

Sejnowski, T. and C. Rosenberg (1986): "NETtalk: A Parallel Network That Learns to Read Aloud," Johns Hopkins University Department of Electrical Engineering and Computer Science Technical Report 86/01.

Selfridge, O., R. Sutton and A. Barto (1985): "Training and Tracking in Robotics," Proceedings of the Ninth International Joint Conference on Artificial Intelligence. Los Angeles: Morgan Kaufman, 1, pp. 670-672.

Sontag, E. (1990): "Feedback Stabilization Using Two-Hidden-Layer Nets," Rutgers Center for Systems and Control Technical Report SYCON-90-11.

Stinchcombe, M. (1991): "Inner Functions and Universal Approximation Properties," UC San Diego Department of Economics Discussion Paper.

Stinchcombe, M. and H. White (1989): "Universal Approximation Using Feedforward Networks With Non-Sigmoid Hidden Layer Activation Functions," Proceedings of the International Joint Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:612-617.

Stinchcombe, M. and H. White (1991): "Consistent Specification Testing Using Duality," UC San Diego Department of Economics Discussion Paper.

Sussman, H. (1991): "Uniqueness of the Weights for Minimal Feedforward Nets with a Given Input-Output Map," Rutgers Center for Systems and Control Technical Report SYCON-91-06.

Sydsaeter, K. (1981): Topics in Mathematical Analysis for Economists. New York: Academic Press.

Tauchen, G. (1985): "Diagnostic Testing and Evaluation of Maximum Likelihood Models,"


Journal of Econometrics 30,415-444.

Tesauro, G. (1989): "Neurogammon Wins Computer Olympiad," Neural Computation 1, 321-323.

Thompson, J.M.T. and H.B. Stewart (1986): Nonlinear Dynamics and Chaos. New York: Wiley.

Walk, H. (1977): "An Invariance Principle for the Robbins-Monro Process in a Hilbert Space," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 30, 135-150.

Werbos, P. (1974):

"Beyond Regression: New Tools for Prediction and Analysis in the

Behavioral Sciences," unpublished Ph.D. Dissertation, Harvard University, Department of Applied Mathematics.

White, H. (1981): "Consequences and Detection of Misspecified Nonlinear Regression Models," Journal of the American Statistical Association 76, 419-433.

White, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica

50, 1-25.
White, H. (1987a): "Some Asymptotic Results for Back-Propagation," Proceedings of the IEEE First International Conference on Neural Networks, San Diego. New York: IEEE Press,pp. III:261-266.

White, H. (1987b):

"Specification Testing in Dynamic Models," in Truman Bewley ed.,

Advances in Econometrics Fifth World Congress. New York: Cambridge University

Press, pp. 1-58.


White, H. (1988): "Economic Prediction Using Neural Networks: The Case of IBM Stock Prices," Proceedings of the Second Annual IEEE Conference on Neural Networks. New York: IEEE Press, pp. II:451-458.

White, H. (1989a): "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models," Journal of the American Statistical Association 84, 1003-1013.

White, H. (1989b): "An Additional Hidden Unit Test for Neglected Nonlinearity," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. II:451-455.

White, H. (1990a): "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks 3,535-549.

White, H. (1990b): "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," UC San Diego Department of Economics Discussion Paper.

White, H. (1992): Estimation, Inference and Specification Analysis. New York: Cambridge University Press (forthcoming).

White, H. and J. Wooldridge (1991): "Some Results for Sieve Estimation with Dependent Observations," in W. Barnett, J. Powell and G. Tauchen eds., Nonparametric and Semiparametric Methods in Economics. New York: Cambridge University Press, pp. 459-493.

Widrow, B. and M.E. Hoff (1960): "Adaptive Switching Circuits," Institute of Radio Engineers WESCON Convention Record, Part 4, 96-104.

Williams, R. (1986): "The Logic of Activation Functions," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: MIT Press, 1, pp. 423-443.

Williams, R.J. and D. Zipser (1989): "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," Neural Computation 2, 270-280.

Woodford, M. (1990): "Learning to Believe in Sunspots," Econometrica 58, 277-308.

Xu, X. and W.T. Tsai (1990): "Constructing Associative Memories Using Neural Networks," Neural Networks 3, 301-310.

Xu, X. and W.T. Tsai (1991): "Effective Neural Algorithms for the Traveling Salesman Problem," Neural Networks 4, 193-206.

Young, P.C. (1984): Recursive Estimation and Time-Series Analysis. New York: Springer Verlag.
