
Neural Networks, Vol. 2, pp. 183-192, 1989. 0893-6080/89 $3.00 + .00
Printed in the USA. All rights reserved. Copyright © 1989 Pergamon Press plc

ORIGINAL CONTRIBUTION

On the Approximate Realization of Continuous Mappings by Neural Networks

KEN-ICHI FUNAHASHI
ATR Auditory and Visual Perception Research Laboratories
(Received 6 May 1988; revised and accepted 14 September 1988)
Abstract: In this paper, we prove that any continuous mapping can be approximately realized by Rumelhart-Hinton-Williams' multilayer neural networks with at least one hidden layer whose output functions are sigmoid functions. The starting point of the proof for the one hidden layer case is an integral formula recently proposed by Irie-Miyake and from this, the general case (for any number of hidden layers) can be proved by induction. The two hidden layers case is proved also by using the Kolmogorov-Arnold-Sprecher theorem and this proof also gives non-trivial realizations.

Keywords: Neural network, Back propagation, Output function, Sigmoid function, Hidden layer, Unit, Realization, Continuous mapping.

The author wishes to thank Drs. Y. Tohkura, T. Inui and S. Miyake, and Mr. T. Okamoto for their valuable comments on the manuscript. The author also would like to thank anonymous reviewers whose constructive suggestions have improved the quality of this paper.
Requests for reprints should be sent to Ken-ichi Funahashi, ATR Auditory and Visual Perception Research Laboratories, Twin 21 Building, MID Tower 2-1-61 Shiromi, Higashi-ku, Osaka 540, Japan.

1. INTRODUCTION

Since McCulloch-Pitts (1943), there have been many studies of mathematical models of neural networks. Recently, Hopfield, Hinton, Rumelhart, Sejnowski and others have tried many concrete applications such as pattern recognition, and have shown that it is possible to clarify the mechanism of human information processing by the use of these models. In particular, the back propagation algorithm (generalized delta rule) proposed by Rumelhart, Hinton, and Williams (1986) provides a learning rule for multilayer networks. Many applications of this algorithm have been shown recently. However, there has been little theoretical research on the capability of the Rumelhart-Hinton-Williams multilayer network.

On the application to pattern recognition, Lippmann (1987) asserts that arbitrary complex decision regions, including concave regions, can be formed using four-layer networks, but this is only an intuitive assertion. Wieland and Leighton (1987) showed an example of a three-layer network with thresholding units which partitions a space into concave subspaces. Huang and Lippmann (1987) demonstrated by simulations that three-layer networks can form several complex decision regions in pattern recognition applications. However, it has been known that any piecewise-linear decision region (which is not necessarily convex) can be realized by a multilayer network (Duda & Fossum, 1966). Its learning algorithm was also proposed (Amari, 1967) based on the same principle as the generalized delta rule. There are also other applications of multilayer networks for forming mappings, such as NETtalk by Sejnowski and Rosenberg (1987).

Hecht-Nielsen (1987) pointed out that Kolmogorov's theorem (Kolmogorov, 1957) and Sprecher's refinement (Sprecher, 1965), which are both known as negative solutions of Hilbert's thirteenth problem, show that any continuous mapping can be represented by a form of four-layer neural network. Uesaka (1971) and Poggio (1983) have also pointed this out. However, the assertion has a problem in that the output function of each unit of this network is not a given sigmoid function.

Irie and Miyake (1988) obtained an integral formula which suggests the realization of functions of several variables by three-layer networks by analogy with the principle of computerized tomography (CT). But in this integral formula, the output function $\psi(x)$ must satisfy the condition of absolute integrability, so that it cannot be a sigmoid function. Moreover, the function to be realized is given by an integral representation and the formula does not directly give the realization theorem of functions by networks with finite units.

In neural networks of the feed-forward type by Rumelhart-Hinton-Williams, bounded and monotone increasing differentiable functions such as the sigmoid function $\phi(x) = 1/(1 + e^{-x})$ are used as output functions of units. This differs from the McCulloch-Pitts model and the perceptron, which use the Heaviside function as the output function of units, and is the reason why it is possible to derive a learning algorithm for multilayer networks.

In a feed-forward type network, its input-output relationship defines a mapping which is called an input-output mapping of the network. We study the problem of network capabilities from the point of view of input-output mappings.

In this paper, we start from an integral formula recently proposed by Irie and Miyake (1988) and prove the theorem which guarantees the approximate realization, in the sense of uniform topology, of continuous mappings by three-layer (one hidden layer) networks whose output functions for the hidden layer are sigmoid, and whose output functions for input and output layers are linear. It is easy to prove the theorem for $k$ ($\ge 3$)-layer networks by using the theorem for the three-layer case. But the proof of the theorem for the case $k > 3$ gives only trivial approximate realization of given mappings. Therefore we show another proof for the four-layer case by using the Kolmogorov-Arnold-Sprecher theorem (Kolmogorov, 1957; Sprecher, 1965).

McCulloch-Pitts showed that any logical circuit can be designed using their model. Correspondingly, our assertion shows that any continuous mapping can be approximately represented by the Rumelhart-Hinton-Williams multilayer network.

2. MULTILAYER NEURAL NETWORKS

The Rumelhart-Hinton-Williams multilayer network that we consider here is a feed-forward type network with connections between adjoining layers only. Networks generally have hidden layers between the input and output layers. Each layer consists of computational units. The input-output relationship of each unit is represented by inputs $x_i$, output $y$, connection weights $w_i$, threshold $\theta$, and differentiable function $\phi$ as follows:

\[ y = \phi\Big( \sum_i w_i x_i - \theta \Big). \]

The learning rule of this network is known as the back propagation algorithm (Rumelhart, Hinton, & Williams, 1986). The back propagation algorithm is an algorithm that uses a gradient descent method to modify weights and thresholds so that the error between the desired output and the output signal of the network is minimized. We generally use a bounded and monotone increasing differentiable function, which is called a sigmoid function, for each unit's output function.

If a multilayer network has $n$ input units and $m$ output units, then the input-output relationship defines a continuous mapping from $n$-dimensional Euclidean space to $m$-dimensional Euclidean space. We call this mapping the input-output mapping of the network. We study the problem of network capabilities from the point of view of input-output mappings. It is observed that for the study of mappings defined by multilayer networks it is sufficient to consider networks whose output functions for hidden layers are the above $\phi(x)$ and whose output functions for input and output layers are linear.

3. APPROXIMATE REALIZATION OF CONTINUOUS MAPPINGS BY NEURAL NETWORKS

We shall consider the possibility of representing continuous mappings by neural networks whose output functions in hidden layers are sigmoid, for example, $\phi(x) = 1/(1 + e^{-x})$. It is simply noted here that general continuous mappings cannot be exactly represented by Rumelhart-Hinton-Williams' networks. For example, if a real analytic output function such as the sigmoid function $\phi(x) = 1/(1 + e^{-x})$ is used, then an input-output mapping of this network is analytic and generally cannot represent all continuous mappings.

Let points of $n$-dimensional Euclidean space $R^n$ be denoted by $x = (x_1, \ldots, x_n)$ and the norm of $x$ defined by $|x| = \big( \sum_{i=1}^{n} x_i^2 \big)^{1/2}$.

We prove the following theorems and corollaries in this paper.

Theorem 1.
Let $\phi(x)$ be a nonconstant, bounded and monotone increasing continuous function. Let $K$ be a compact subset (bounded closed subset) of $R^n$ and $f(x_1, \ldots, x_n)$ be a real valued continuous function on $K$. Then for an arbitrary $\varepsilon > 0$, there exists an integer $N$ and real constants $c_i$, $\theta_i$ ($i = 1, \ldots, N$), $w_{ij}$ ($i = 1, \ldots, N$, $j = 1, \ldots, n$) such that

\[ \tilde f(x_1, \ldots, x_n) = \sum_{i=1}^{N} c_i \, \phi\Big( \sum_{j=1}^{n} w_{ij} x_j - \theta_i \Big) \]

satisfies $\max_{x \in K} |f(x_1, \ldots, x_n) - \tilde f(x_1, \ldots, x_n)| < \varepsilon$. In other words, for an arbitrary $\varepsilon > 0$, there exists a three-layer network whose output functions for the hidden layer are $\phi(x)$, whose output functions for input and output layers are linear and which has an input-output function $\tilde f(x_1, \ldots, x_n)$ such that
\[ \max_{x \in K} |f(x_1, \ldots, x_n) - \tilde f(x_1, \ldots, x_n)| < \varepsilon. \]

The above theorem easily leads to the following general theorem.

Theorem 2.
Let $\phi(x)$ be a nonconstant, bounded and monotone increasing continuous function. Let $K$ be a compact subset (bounded closed subset) of $R^n$ and fix an integer $k \ge 3$. Then any continuous mapping $f : K \to R^m$ defined by $x = (x_1, \ldots, x_n) \mapsto (f_1(x), \ldots, f_m(x))$ can be approximated in the sense of uniform topology on $K$ by input-output mappings of $k$-layer ($k-2$ hidden layers) networks whose output functions for hidden layers are $\phi(x)$, and whose output functions for input and output layers are linear. In other words, for any continuous mapping $f : K \to R^m$ and an arbitrary $\varepsilon > 0$, there exists a $k$-layer network whose input-output mapping is given by $\tilde f : K \to R^m$ such that $\max_{x \in K} d(f(x), \tilde f(x)) < \varepsilon$, where $d(\cdot,\cdot)$ is a metric which induces the usual topology of $R^m$.

Corollary 1.
Let $\phi(x)$, $K$ be as above and fix an integer $k \ge 3$. Then any mapping $f : x \in K \mapsto (f_1(x), \ldots, f_m(x)) \in R^m$, where $f_i(x)$ ($i = 1, \ldots, m$) are summable on $K$, can be approximated in the sense of $L^2$-topology on $K$ by input-output mappings of $k$-layer ($k-2$ hidden layers) networks whose output functions for hidden layers are $\phi(x)$ and whose output functions for input and output layers are linear. In other words, for an arbitrary $\varepsilon > 0$, there exists a $k$-layer network whose input-output mapping is given by $\tilde f : x \in K \mapsto (\tilde f_1(x), \ldots, \tilde f_m(x)) \in R^m$ such that

\[ d_{L^2(K)}(f, \tilde f) = \Big( \sum_{i=1}^{m} \int_K |f_i(x_1, \ldots, x_n) - \tilde f_i(x_1, \ldots, x_n)|^2 \, dx \Big)^{1/2} < \varepsilon. \]

Corollary 2.
Let $K$ be as above and fix an integer $k \ge 3$. Let $\phi(x)$ be a strictly increasing continuous function such that $\phi((-\infty, \infty)) = (0, 1)$. Then any continuous mapping $f : K \to (0, 1)^m$ can be approximated in the sense of uniform topology on $K$ by input-output mappings of $k$ ($\ge 3$)-layer neural networks whose output functions for hidden and output layers are $\phi(x)$.

Proof. Set $f(x) = (f_1(x), \ldots, f_m(x))$. As $\phi^{-1} : (0, 1) \to (-\infty, \infty)$ is continuous, Theorem 2 is applied to the mapping $x \mapsto \phi^{-1} f(x) = (\phi^{-1} f_1(x), \ldots, \phi^{-1} f_m(x))$ and the corollary is obtained easily. q.e.d.

Remark 1.
Usual output functions such as the sigmoid function $1/(1 + e^{-x})$ used for back-propagation neural networks satisfy the condition that $\phi(x)$ is a nonconstant, bounded and monotone increasing continuous function.

Remark 2.
Any continuous mapping is approximately realized by a three-layer (one hidden layer) network. However, whether networks with $k > 3$ layers can realize a given mapping within error $\varepsilon$ at less cost (number of units or connections) than three-layer networks remains to be studied theoretically.

For the application of neural networks to pattern recognition, if $m$ is the number of recognized categories, usually $m$ output units corresponding to these categories are used, and the system is allowed to learn to take values near 1 only for units corresponding to the input categories. The corollaries show that if one uses multilayer networks with hidden layers, any decision region can be formed by a neural network. In particular, a strictly increasing continuous function can be chosen as the output function of each unit.

In this paper, we call bounded and monotone increasing continuous functions sigmoid functions. In particular, a sigmoid function $\phi(x)$ having a weak derivative which is summable has the property that if we set $\phi_\varepsilon(x) = \phi(x/\varepsilon)$ ($\varepsilon > 0$), then the derivatives $\phi_\varepsilon'(x) = (1/\varepsilon)\phi'(x/\varepsilon)$ converge, in the sense of the generalized function (see, e.g., Gel'fand & Shilov, 1964), to the $\delta$ function as $\varepsilon \to 0$. That is to say, if $\phi(\infty) - \phi(-\infty) = 1$, then for any smooth function $g(x)$ with compact support,

\[ \lim_{\varepsilon \to +0} \int_{-\infty}^{\infty} \phi_\varepsilon'(x) \, g(x) \, dx = g(0). \]

The following examples are included in the class of sigmoid functions considered here.

Example 1. For $\phi(x) = 1/(1 + \exp(-x))$, $\phi_\varepsilon'(x) = (1/\varepsilon) \exp(-x/\varepsilon) / (1 + \exp(-x/\varepsilon))^2$ and $\phi(x)$ is a sigmoid function.

Example 2. For $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-t^2/2) \, dt$, $\Phi_\varepsilon'(x) = \frac{1}{\sqrt{2\pi}\,\varepsilon} \exp(-x^2/(2\varepsilon^2))$ and $\Phi(x)$ is a sigmoid function.

Example 3. For $\phi(x)$ where $\phi(x) = 0$ ($x < 0$), $\phi(x) = x$ ($0 \le x < 1$) and $\phi(x) = 1$ ($x \ge 1$), $\phi_\varepsilon'(x) = 0$ ($x < 0$ or $x \ge \varepsilon$), $\phi_\varepsilon'(x) = 1/\varepsilon$ ($0 \le x < \varepsilon$), and $\phi(x)$ is a sigmoid function.

In the McCulloch-Pitts neural model and perceptron, a threshold function $\phi(x) = 1$ ($x \ge 0$), $= 0$ ($x < 0$) is used as the output function.
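The delta-function property of scaled sigmoid derivatives stated above can be checked numerically. The following is a minimal sketch, not part of the original paper: it uses the logistic function of Example 1, and the test function `bump`, the grid size and the values of `eps` are illustrative assumptions.

```python
import math


def logistic(t: float) -> float:
    """Numerically stable logistic sigmoid 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_eps_derivative(x: float, eps: float) -> float:
    """Derivative of phi_eps(x) = phi(x / eps), i.e. (1 / eps) * phi'(x / eps)."""
    s = logistic(x / eps)
    return s * (1.0 - s) / eps


def bump(x: float) -> float:
    """Smooth test function g with compact support [-1, 1] (illustrative choice)."""
    if abs(x) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - x * x))


def smoothed_value(eps: float, n: int = 20000) -> float:
    """Midpoint-rule estimate of the integral of phi_eps'(x) * g(x) over [-1, 1]."""
    h = 2.0 / n
    total = 0.0
    for k in range(n):
        x = -1.0 + (k + 0.5) * h
        total += logistic_eps_derivative(x, eps) * bump(x)
    return total * h


if __name__ == "__main__":
    # The estimates should move toward g(0) as eps shrinks.
    for eps in (0.5, 0.1, 0.01):
        print(eps, smoothed_value(eps), bump(0.0))
```

Since `bump` vanishes outside $[-1, 1]$, restricting the integral to that interval loses nothing; the only approximation is the midpoint rule.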
Sigmoid functions $\phi(x)$ where $\phi(-\infty) = 0$ and $\phi(\infty) = 1$ are appropriate as output functions in the neural model because if we set $\phi_\varepsilon(x) = \phi(x/\varepsilon)$ ($\varepsilon > 0$) then these converge to the threshold function in the McCulloch-Pitts neural model and perceptron as $\varepsilon \to +0$.

McCulloch-Pitts showed that one can design any logical circuit using their model. Correspondingly, Theorem 2 above shows that any continuous mapping can be approximately represented by multilayer networks with sigmoid output functions.

4. PRELIMINARY 1 (MOLLIFIERS, FOURIER TRANSFORMS)

Fundamental matters used in this paper are reviewed here.

Let $L^p(R^n)$ ($p \ge 1$) denote the space of all measurable functions $f(x)$ on $R^n$ which satisfy

\[ \int_{R^n} |f(x)|^p \, dx < \infty. \]

The norm of $f \in L^p(R^n)$ is defined by

\[ \|f(x)\|_{L^p} = \Big( \int_{R^n} |f(x)|^p \, dx \Big)^{1/p}, \]

and the convergence $f_\nu(x) \to f(x)$ in $L^p(R^n)$ is defined by

\[ \lim_{\nu \to \infty} \|f_\nu(x) - f(x)\|_{L^p} = 0. \]

Generally, for any measurable set $K$, $L^p(K)$ ($p \ge 1$) is defined similarly.

Let $\rho(x)$ be a function on $R^n$ which satisfies the following conditions:

(i) $\rho(x) \ge 0$, $\rho(x)$ has continuous partial derivatives of all orders and the support is contained in the unit sphere $|x| \le 1$.
(ii) $\int_{R^n} \rho(x) \, dx = 1$.

Then, for $\varepsilon > 0$, set $\rho_\varepsilon(x) = (1/\varepsilon)^n \rho(x/\varepsilon)$.

If $u(x) \in L^1_{loc}$, that is, $u(x)$ is locally summable, consider

\[ \rho_\varepsilon * u(x) = \int_{R^n} \rho_\varepsilon(x - y) \, u(y) \, dy, \]

then the following assertions hold: (a) $\rho_\varepsilon * u(x) \in C^\infty$, that is, $\rho_\varepsilon * u(x)$ has continuous partial derivatives of all orders, and the support of $\rho_\varepsilon * u(x)$ is contained in the $\varepsilon$ neighborhood of the support of $u(x)$; (b) if $u(x)$ is a continuous function with compact support, then $\rho_\varepsilon * u(x) \to u(x)$ uniformly on $R^n$ as $\varepsilon \to +0$; and (c) if $u(x) \in L^p(R^n)$ ($p \ge 1$) then $\rho_\varepsilon * u(x) \to u(x)$ in $L^p$ as $\varepsilon \to +0$. The operator $\rho_\varepsilon *$ is called a mollifier.

For $f(x) \in L^1(R^n)$, the Fourier transform

\[ \mathcal{F} f(\xi) = \int_{R^n} e^{-i \langle x, \xi \rangle} f(x) \, dx, \qquad (1) \]

where $\langle x, \xi \rangle = \sum_{i=1}^{n} x_i \xi_i$, can be defined and we set $\hat f(\xi) = \mathcal{F} f(\xi)$.

If $f(x)$ satisfies an additional condition that $f(x)$ has continuous partial derivatives of order up to $n$, then $f(x)$ can be represented at each point by the inverse Fourier transform of $\hat f(\xi)$ as follows.

\[ f(x) = (2\pi)^{-n} \int_{R^n} e^{i \langle x, \xi \rangle} \hat f(\xi) \, d\xi. \]

The Plancherel theorem especially asserts that $\mathcal{F}$ can be extended to a one to one onto mapping $\mathcal{F} : L^2(R^n) \to L^2(R^n)$ and for $f(x) \in L^1 \cap L^2(R^n)$, $\mathcal{F} f(\xi)$ is equal to the one defined by (1). Furthermore, for $f(x) \in L^2(R^n)$,

\[ \Big\| \mathcal{F} f(\xi) - \int_{|x| \le A} e^{-i \langle x, \xi \rangle} f(x) \, dx \Big\|_{L^2} \to 0 \quad (A \to +\infty) \]

(see e.g., Yosida, 1968).

5. PRELIMINARY 2 (IRIE-MIYAKE'S INTEGRAL FORMULA)

The following theorem is a starting point for the proof of Theorem 1.

Theorem (Irie-Miyake)
Let $\psi(x) \in L^1(R)$, that is, let $\psi(x)$ be absolutely integrable, and $f(x_1, \ldots, x_n) \in L^2(R^n)$. Let $\Psi(\xi)$ and $F(w_1, \ldots, w_n)$ be Fourier transforms of $\psi(x)$ and $f(x_1, \ldots, x_n)$ respectively.
If $\Psi(1) \ne 0$, then

\[ f(x_1, \ldots, x_n) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \ldots, w_n) \exp(i w_0) \, dw_0 \, dw_1 \cdots dw_n. \]

Remark
This formula precisely asserts that if we set

\[ I_{\infty,A}(x_1, \ldots, x_n) = \int_{-A}^{A} \cdots \int_{-A}^{A} \Big[ \int_{-\infty}^{\infty} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \ldots, w_n) \exp(i w_0) \, dw_0 \Big] dw_1 \cdots dw_n, \]
then

\[ \lim_{A \to \infty} \| I_{\infty,A}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n) \|_{L^2} = 0. \]

Connecting this formula with three-layer networks, Irie and Miyake (1988) assert that arbitrary functions can be represented by a three-layer network with an infinite number of computational units. In this formula, $w_0$ corresponds to threshold, $w_i$ corresponds to connection weight and $\psi(x)$ corresponds to the output function of the units. However, the sigmoid function $1/(1 + e^{-x})$ does not satisfy the condition of this formula that $\psi(x)$ be absolutely integrable and so the formula does not directly give the realization theorem of functions by networks.

6. PRELIMINARY 3 (SEVERAL LEMMAS)

We prepare several lemmas for the proof of our Theorem 1.

Lemma 1.
Let $\phi(x)$ be a nonconstant, bounded and monotone increasing continuous function. For $\alpha > 0$, if we set

\[ g(x) = \phi(x + \alpha) - \phi(x - \alpha), \]

then $g(x) \in L^1(R)$, that is,

\[ \int_{-\infty}^{\infty} |g(x)| \, dx < \infty. \]

Furthermore, for some $\delta > 0$, if we set

\[ g_\delta(x) = \phi(x/\delta + \alpha) - \phi(x/\delta - \alpha), \]

then the value of the Fourier transform $G_\delta(\xi)$ of $g_\delta(x)$ at $\xi = 1$ is non-zero.

Proof. Let $|\phi(x)| \le M$. For $L > \alpha$,

\[ \int_{-L}^{L} |g(x)| \, dx = \int_{-L}^{L} g(x) \, dx = \int_{-L+\alpha}^{L+\alpha} \phi(x) \, dx - \int_{-L-\alpha}^{L-\alpha} \phi(x) \, dx = \int_{L-\alpha}^{L+\alpha} \phi(x) \, dx - \int_{-L-\alpha}^{-L+\alpha} \phi(x) \, dx \le 4 \alpha M. \]

Therefore,

\[ \lim_{L \to \infty} \int_{-L}^{L} |g(x)| \, dx < \infty. \]

We show that for some $\delta > 0$, $G_\delta(1) \ne 0$. If the assertion does not hold, then for any $\delta > 0$,

\[ \int_{-\infty}^{\infty} \big( \phi(x/\delta + \alpha) - \phi(x/\delta - \alpha) \big) e^{-ix} \, dx = 0. \]

By the change of the variable,

\[ \int_{-\infty}^{\infty} \big( \phi(x + \alpha) - \phi(x - \alpha) \big) e^{-i \delta x} \, dx = 0 \quad (\text{for any } \delta > 0). \qquad (1) \]

Taking the complex conjugate of the above equation (1),

\[ \int_{-\infty}^{\infty} \big( \phi(x + \alpha) - \phi(x - \alpha) \big) e^{i \delta x} \, dx = 0 \quad (\text{for any } \delta > 0). \qquad (2) \]

Since the Fourier transform $G_1(\xi)$ of $g_1(x) = \phi(x + \alpha) - \phi(x - \alpha) \in L^1(R)$ is continuous, so, from (1) and (2), $G_1(\xi)$ is identically zero. Therefore,

\[ \phi(x + \alpha) - \phi(x - \alpha) \equiv 0. \]

This is a contradiction, because $\phi(x)$ is not constant. q.e.d.

Remark
Lemma 1 holds for $\phi(x)$ which is locally summable.

Lemma 2
Let $A_i > 0$ ($i = 1, \ldots, m$), $K$ be a compact subset (bounded closed subset) of $R^n$ and $h(x_1, \ldots, x_m, t_1, \ldots, t_n)$ be a continuous function on $[-A_1, A_1] \times \cdots \times [-A_m, A_m] \times K$. Then the function defined by the integral

\[ H(t) = \int_{-A_1}^{A_1} \cdots \int_{-A_m}^{A_m} h(x_1, \ldots, x_m, t_1, \ldots, t_n) \, dx_1 \cdots dx_m \]

can be approximated uniformly on $K$ by the Riemann sum

\[ H_N(t) = \frac{2A_1 \cdots 2A_m}{N^m} \sum_{k_1, \ldots, k_m = 0}^{N-1} h\Big( -A_1 + \frac{k_1 \cdot 2A_1}{N}, \ldots, -A_m + \frac{k_m \cdot 2A_m}{N}, t_1, \ldots, t_n \Big). \]

In other words, for an arbitrary $\varepsilon > 0$, there exists a natural number $N_0$ such that for $N \ge N_0$,

\[ \max_{t \in K} |H(t) - H_N(t)| < \varepsilon. \]

Proof. The function $h(x, t)$ is continuous on the compact set $[-A_1, A_1] \times \cdots \times [-A_m, A_m] \times K$, so $h(x, t)$ is uniformly continuous. Therefore for any $\varepsilon > 0$, we can take the integer $N_0$ such that if $N \ge N_0$ and

\[ \Big| x_i - \Big( -A_i + \frac{k_i \cdot 2A_i}{N} \Big) \Big| \le \frac{2A_i}{N} \quad (i = 1, \ldots, m) \]

then

\[ \Big| h(x_1, \ldots, x_m, t_1, \ldots, t_n) - h\Big( -A_1 + \frac{k_1 \cdot 2A_1}{N}, \ldots, -A_m + \frac{k_m \cdot 2A_m}{N}, t_1, \ldots, t_n \Big) \Big| < \frac{\varepsilon}{2A_1 \cdots 2A_m}. \]

The assertion of the lemma is obvious from this inequality. q.e.d.

7. PROOF OF THEOREMS

We will prove our theorems in Section 3 under the above preliminaries.

Proof of Theorem 1

Step 1. Because $f(x)$ ($x = (x_1, \ldots, x_n)$) is a continuous function on a compact subset $K$ of $R^n$, $f(x)$ can be extended to be a continuous function on $R^n$ with compact support. We also denote this by $f(x)$.

If we operate the mollifier $\rho_\alpha *$ on $f(x)$, $\rho_\alpha * f(x)$ is a $C^\infty$-function with compact support. Furthermore, $\rho_\alpha * f(x) \to f(x)$ ($\alpha \to +0$) uniformly on $R^n$. Therefore we may suppose $f(x)$ is a $C^\infty$-function with compact support for proving Theorem 1. By the Paley-Wiener theorem (see, e.g., Yosida, 1968), the Fourier transform $F(w)$ ($w = (w_1, \ldots, w_n)$) of $f(x)$ is real analytic and, for any integer $N$, there exists a constant $C_N$ such that

\[ |F(w)| \le C_N (1 + |w|)^{-N}. \qquad (3) \]

In particular, $F(w) \in L^1 \cap L^2(R^n)$.

We define $I_A(x_1, \ldots, x_n)$, $I_{\infty,A}(x_1, \ldots, x_n)$ and $J_A(x_1, \ldots, x_n)$ as follows:

\[ I_A(x_1, \ldots, x_n) = \int_{-A}^{A} \cdots \int_{-A}^{A} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \ldots, w_n) \exp(i w_0) \, dw_0 \, dw_1 \cdots dw_n, \qquad (4) \]

\[ I_{\infty,A}(x_1, \ldots, x_n) = \int_{-A}^{A} \cdots \int_{-A}^{A} \Big[ \int_{-\infty}^{\infty} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \ldots, w_n) \exp(i w_0) \, dw_0 \Big] dw_1 \cdots dw_n, \qquad (5) \]

\[ J_A(x_1, \ldots, x_n) = \frac{1}{(2\pi)^n} \int_{-A}^{A} \cdots \int_{-A}^{A} F(w_1, \ldots, w_n) \exp\Big( i \sum_{i=1}^{n} x_i w_i \Big) dw_1 \cdots dw_n, \qquad (6) \]

where $\psi(x) \in L^1$ is defined by

\[ \psi(x) = \phi(x/\delta + \alpha) - \phi(x/\delta - \alpha) \]

for some $\alpha$ and $\delta$ so that $\psi(x)$ satisfies Lemma 1 in Section 6.

The essential part of the proof of Irie-Miyake's integral formula is the equality

\[ I_{\infty,A}(x_1, \ldots, x_n) = J_A(x_1, \ldots, x_n) \qquad (7) \]

and this is derived from

\[ \int_{-\infty}^{\infty} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \exp(i w_0) \, dw_0 = \exp\Big( i \sum_{i=1}^{n} x_i w_i \Big) \cdot \Psi(1). \qquad (8) \]

In our discussion, using the estimate of $F(w)$, we can prove

\[ \lim_{A \to \infty} J_A(x_1, \ldots, x_n) = f(x_1, \ldots, x_n) \]

uniformly on $R^n$. Therefore

\[ \lim_{A \to \infty} I_{\infty,A}(x_1, \ldots, x_n) = f(x_1, \ldots, x_n) \]

uniformly on $R^n$. That is to say, we can state that for any $\varepsilon > 0$ there exists $A > 0$ such that

\[ \max_{x \in R^n} |I_{\infty,A}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n)| < \varepsilon/2. \qquad (i) \]

Step 2. We will approximate $I_{\infty,A}$ by finite integrals on $K$. For $\varepsilon > 0$, fix $A$ which satisfies (i). For $A' > 0$, set

\[ I_{A',A}(x_1, \ldots, x_n) = \int_{-A}^{A} \cdots \int_{-A}^{A} \Big[ \int_{-A'}^{A'} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \frac{1}{(2\pi)^n \Psi(1)} F(w_1, \ldots, w_n) \exp(i w_0) \, dw_0 \Big] dw_1 \cdots dw_n. \]

We will show that, for $\varepsilon > 0$, we can take $A' > 0$ so that

\[ \max_{x \in K} |I_{A',A}(x_1, \ldots, x_n) - I_{\infty,A}(x_1, \ldots, x_n)| < \varepsilon/2. \qquad (ii) \]
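The function $\psi$ entering the integrals above is built from the sigmoid as in Lemma 1. As a rough numerical illustration (not taken from the paper), the sketch below uses the logistic sigmoid with the assumed values $\alpha = \delta = 1$ and checks that $\psi$ is absolutely integrable, with an $L^1$ norm within the bound $4\alpha M$ from the proof (here $M = 1$), and that its Fourier transform at $\xi = 1$ is non-zero. The truncation half-width and grid size are illustrative assumptions.

```python
import math
from typing import Callable


def logistic(t: float) -> float:
    """Numerically stable logistic sigmoid 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def psi(x: float, alpha: float = 1.0, delta: float = 1.0) -> float:
    """psi(x) = phi(x/delta + alpha) - phi(x/delta - alpha), as in Lemma 1."""
    return logistic(x / delta + alpha) - logistic(x / delta - alpha)


def integrate(func: Callable[[float], float], lo: float, hi: float,
              n: int = 100000) -> float:
    """Midpoint rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    return h * sum(func(lo + (k + 0.5) * h) for k in range(n))


def l1_norm(half_width: float = 60.0) -> float:
    """Truncated estimate of the integral of |psi| over the real line."""
    return integrate(lambda x: abs(psi(x)), -half_width, half_width)


def fourier_at_one(half_width: float = 60.0) -> complex:
    """Truncated estimate of Psi(1), the integral of psi(x) * exp(-i x)."""
    re = integrate(lambda x: psi(x) * math.cos(x), -half_width, half_width)
    im = -integrate(lambda x: psi(x) * math.sin(x), -half_width, half_width)
    return complex(re, im)


if __name__ == "__main__":
    print("L1 norm estimate:", l1_norm())
    print("Psi(1) estimate:", fourier_at_one())
```

For the logistic sigmoid the tails of $\psi$ decay exponentially, so truncating at a half-width of 60 changes the integrals only negligibly.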
Using the following equation

\[ \int_{-A'}^{A'} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \exp(i w_0) \, dw_0 = \int_{\sum_{i=1}^{n} x_i w_i - A'}^{\sum_{i=1}^{n} x_i w_i + A'} \psi(t) \exp(-it) \, dt \cdot \exp\Big( i \sum_{i=1}^{n} x_i w_i \Big), \]

the fact $F(x) \in L^1$ and compactness of $[-A, A]^n \times K$, we can take $A'$ so that

\[ \Big| \int_{-A'}^{A'} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \exp(i w_0) \, dw_0 - \int_{-\infty}^{\infty} \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) \exp(i w_0) \, dw_0 \Big| < \frac{\varepsilon (2\pi)^n |\Psi(1)|}{2 \big( \int_{-A}^{A} \cdots \int_{-A}^{A} |F(x)| \, dx + 1 \big)} \quad \text{on } K. \]

Therefore,

\[ \max_{x \in K} |I_{A',A}(x_1, \ldots, x_n) - I_{\infty,A}(x_1, \ldots, x_n)| \le \frac{\varepsilon}{2 \big( \int_{-A}^{A} \cdots \int_{-A}^{A} |F(x)| \, dx + 1 \big)} \times \int_{-A}^{A} \cdots \int_{-A}^{A} |F(x)| \, dx < \varepsilon/2. \]

Step 3. From (i) and (ii), we can say that for any $\varepsilon > 0$, there exist $A, A' > 0$ such that

\[ \max_{x \in K} |f(x_1, \ldots, x_n) - I_{A',A}(x_1, \ldots, x_n)| < \varepsilon. \qquad (iii) \]

That is to say, $f(x)$ can be approximated by the finite integral $I_{A',A}(x)$ uniformly on $K$. The integrand of $I_{A',A}(x)$ can be replaced by the real part and is continuous on $[-A', A'] \times [-A, A] \times \cdots \times [-A, A] \times K$, so by Lemma 2, $I_{A',A}(x)$ can be approximated by the Riemann sum uniformly on $K$.

Since

\[ \psi\Big( \sum_{i=1}^{n} x_i w_i - w_0 \Big) = \phi\Big( \sum_{i=1}^{n} x_i w_i/\delta - w_0/\delta + \alpha \Big) - \phi\Big( \sum_{i=1}^{n} x_i w_i/\delta - w_0/\delta - \alpha \Big), \]

the Riemann sum can be represented by a three-layer network. Therefore $f(x)$ can be represented approximately by three-layer networks. q.e.d.

Proof of Theorem 2
If $k = 3$, set $f : x = (x_1, \ldots, x_n) \mapsto (f_1(x), \ldots, f_m(x))$ and apply Theorem 1 to each $f_i(x)$.
For the general case, we first remark that a $k$ ($> 3$)-layer network can be represented by the composition of $k-2$ three-layer networks. Using the realization of the identity mapping by a three-layer network, the general case follows from the case $k = 3$. q.e.d.

Proof of Corollary 1
In the expression $f : x \mapsto (f_1(x), \ldots, f_m(x))$, we extend $f_i(x)$ to functions which take value zero on $R^n - K$. We also denote these by $f_i(x)$ ($i = 1, \ldots, m$). We can approximate $f_i(x)$ ($i = 1, \ldots, m$) by $C^\infty$-functions with compact support by operating the mollifier $\rho_\alpha *$ on $f_i$ and apply Theorem 2 to $\rho_\alpha * f_i$. q.e.d.

The above proof of Theorem 2 for the case $k > 3$ gives only trivial approximate realizations of given mappings by $k$-layer networks. Therefore, we shall give a different proof for the case $k = 4$, by using the Kolmogorov-Arnold-Sprecher theorem, which gives nontrivial realizations of continuous mappings.

8. KOLMOGOROV-ARNOLD-SPRECHER'S THEOREM

Let $I = [0, 1]$ denote the closed unit interval, $I^n = [0, 1]^n$ ($n \ge 2$) the Cartesian product of $I$.

In his famous thirteenth problem, Hilbert conjectured that there are analytic functions of three variables which cannot be represented as a finite superposition of continuous functions of only two arguments. Kolmogorov (1957) and Arnold refuted this conjecture and proved the following theorem.

Theorem (Kolmogorov)
Any continuous function $f(x_1, \ldots, x_n)$ of several variables defined on $I^n$ ($n \ge 2$) can be represented in the form

\[ f(x) = \sum_{j=1}^{2n+1} \chi_j \Big( \sum_{i=1}^{n} \psi_{ij}(x_i) \Big), \]

where $\chi_j$, $\psi_{ij}$ are continuous functions of one variable and $\psi_{ij}$ are monotone functions which are not dependent on $f$.

Sprecher (1965) refined the above theorem and obtained the following:

Theorem (Sprecher)
For each integer $n \ge 2$, there exists a real, monotone increasing function $\psi(x)$, $\psi([0, 1]) = [0, 1]$, dependent on $n$ and having the following property:
For each preassigned number $\delta > 0$ there is a rational number $\varepsilon$, $0 < \varepsilon < \delta$, such that every real continuous
function of $n$ variables, $f(x)$, defined on $I^n$, can be represented as

\[ f(x) = \sum_{j=1}^{2n+1} \chi \Big[ \sum_{i=1}^{n} \lambda^i \psi(x_i + \varepsilon(j-1)) + j - 1 \Big], \]

where the function $\chi$ is real and continuous and $\lambda$ is a constant independent of $f$.

Hecht-Nielsen (1987) pointed out that this theorem means that any continuous mapping $f : x \in I^n \mapsto (f_1(x), \ldots, f_m(x)) \in R^m$ is represented by a form of four-layer neural network with hidden units whose output functions are $\psi$, $\chi_i$ ($i = 1, \ldots, m$), where $\psi$ is used for the first hidden layer, $\chi_i$ is given by Sprecher's theorem for $f_i(x)$ and $\chi_i$ ($i = 1, \ldots, m$) are used for the second hidden layer.

9. ALTERNATIVE PROOF OF THEOREM 2 FOR THE CASE k = 4

In Section 8, we reviewed Kolmogorov's theorem and its refinement from the point of view of neural networks. The Kolmogorov-Arnold-Sprecher theorem and the following proposition are used to prove our Theorem 2 for the case $k = 4$. This proposition is a special case (one variable case) of Theorem 1 in Section 3.

Proposition
Let $g(x)$ be a continuous function on $R$ and $\phi(x)$ a bounded and monotone increasing continuous function. For an arbitrary compact subset (bounded closed subset) $K$ of $R$ and an arbitrary $\varepsilon > 0$, there are an integer $N$ and real constants $a_i$, $b_i$, $c_i$ ($i = 1, \ldots, N$) such that

\[ \Big| g(x) - \sum_{i=1}^{N} c_i \phi(a_i x + b_i) \Big| < \varepsilon \]

holds on $K$.

In the appendix, we shall state the direct proof of the above proposition by a different method without using Fourier transforms under the additional condition that $\phi(x)$ has a weak derivative which is summable.

Next we prove Theorem 2 for the case $k = 4$ by using the Kolmogorov-Arnold-Sprecher theorem and the above proposition.

Proof. We may suppose that $K = [0, 1]^n$, because $f_p(x)$ ($p = 1, \ldots, m$) can be extended to continuous functions with compact supports. We apply Sprecher's theorem to $f_p(x)$ ($p = 1, \ldots, m$) and represent $f_p(x)$ by the form

\[ f_p(x) = \sum_{j=1}^{2n+1} \chi_p \Big[ \sum_{i=1}^{n} \lambda^i \psi(x_i + \varepsilon(j-1)) + j - 1 \Big] \]

($p = 1, \ldots, m$), where $\lambda$ and $\varepsilon$ are constants. We apply our proposition to functions $\chi_p$, $\psi$, and approximate these functions using a sigmoid function $\phi$.

Let $K_j$ ($j = 1, \ldots, 2n+1$) be the images of $[0, 1]^n$ by mappings

\[ \tau_j : x \mapsto \sum_{i=1}^{n} \lambda^i \psi(x_i + \varepsilon(j-1)) + j - 1 \quad (j = 1, \ldots, 2n+1) \]

and set $K = \bigcup_j K_j$. Take $\delta > 0$ and the closure $K_\delta$ of the $\delta$ neighborhood of $K$. Continuous functions $\chi_p$ ($p = 1, \ldots, m$) are approximated by

\[ \chi_{p,N}(x) = \sum_{i=1}^{N} c_{i,N} \, \phi(a_{i,N} x + b_{i,N}) \qquad (9) \]

so that

\[ |\chi_p(x) - \chi_{p,N}(x)| < \varepsilon/(4n + 2) \quad (p = 1, \ldots, m) \qquad (10) \]

on $K_\delta$. As $\chi_{p,N}(x)$ are uniformly continuous on $K_\delta$, a sufficiently small $\eta$ can be taken so that if $|x - y| < \eta$ ($x, y \in K_\delta$) then $|\chi_{p,N}(x) - \chi_{p,N}(y)| < \varepsilon/(4n + 2)$ ($p = 1, \ldots, m$).

We apply our proposition to $\tau_j$ and approximate $\tau_j$ on $[0, 1]^n$ by $\tau_{j,N'}$ so that

\[ |\tau_j(x) - \tau_{j,N'}(x)| < \min(\eta, \delta), \qquad (11) \]

where $\tau_{j,N'}(x)$ ($j = 1, \ldots, 2n+1$) are defined as follows: We approximate $\psi(x)$ by

\[ \psi_{N'}(x) = \sum_{i=1}^{N'} \tilde c_i \, \phi(\tilde a_i x + \tilde b_i) \qquad (12) \]

on the $2n\varepsilon$ neighborhood of $[0, 1]$ and set

\[ \tau_{j,N'}(x) = \sum_{i=1}^{n} \lambda^i \psi_{N'}(x_i + \varepsilon(j-1)) + j - 1 \qquad (13) \]

so that the above inequality (11) is satisfied. Using a transformation

\[ \sum_{j=1}^{2n+1} \chi_p[\tau_j(x)] - \sum_{j=1}^{2n+1} \chi_{p,N}[\tau_{j,N'}(x)] = \sum_{j=1}^{2n+1} \big( \chi_p[\tau_j(x)] - \chi_{p,N}[\tau_j(x)] \big) + \sum_{j=1}^{2n+1} \big( \chi_{p,N}[\tau_j(x)] - \chi_{p,N}[\tau_{j,N'}(x)] \big), \]

it is seen that $f_p(x)$ ($p = 1, \ldots, m$) are approximated by

\[ \sum_{j=1}^{2n+1} \chi_{p,N}[\tau_{j,N'}(x)] \quad (p = 1, \ldots, m) \]

on $[0, 1]^n$ so that the errors are less than $\varepsilon$. Looking at the form of this approximation, the theorem is obtained. q.e.d.
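The approximation obtained in this proof has the shape of a four-layer network: a first hidden layer of sigmoid units acting on the inputs, a second hidden layer of sigmoid units acting on linear combinations of the first, and a linear output layer. The sketch below only illustrates that layered form; the function names and all weights are hypothetical and are not the constants produced by Sprecher's theorem.

```python
import math
from typing import List, Sequence, Tuple

# A sigmoid unit is described by (incoming weights, threshold); illustrative type.
Unit = Tuple[Sequence[float], float]


def logistic(t: float) -> float:
    """Numerically stable logistic sigmoid 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sigmoid_layer(inputs: Sequence[float], units: Sequence[Unit]) -> List[float]:
    """Each unit outputs phi(sum_j w_j * x_j - theta)."""
    return [
        logistic(sum(w * x for w, x in zip(weights, inputs)) - theta)
        for weights, theta in units
    ]


def linear_layer(inputs: Sequence[float],
                 rows: Sequence[Sequence[float]]) -> List[float]:
    """Linear output units: one weighted sum of the inputs per row."""
    return [sum(c * v for c, v in zip(row, inputs)) for row in rows]


def four_layer_network(x: Sequence[float],
                       hidden1: Sequence[Unit],
                       hidden2: Sequence[Unit],
                       output_rows: Sequence[Sequence[float]]) -> List[float]:
    """Input -> sigmoid hidden layer -> sigmoid hidden layer -> linear output."""
    h1 = sigmoid_layer(x, hidden1)
    h2 = sigmoid_layer(h1, hidden2)
    return linear_layer(h2, output_rows)


if __name__ == "__main__":
    # Hypothetical weights, chosen only to exercise the code.
    hidden1 = [([1.0, -1.0], 0.0), ([0.5, 0.5], 0.2)]
    hidden2 = [([2.0, -1.0], 0.1)]
    output_rows = [[3.0]]
    print(four_layer_network([0.3, 0.7], hidden1, hidden2, output_rows))
```

In the proof, the additive constants $j - 1$ and the coefficients $\lambda^i$ are absorbed into the weights and thresholds of the second hidden layer, which is why a plain layered evaluation of this kind suffices.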
10. NEURAL NETWORK AND INFORMATION PROCESSING IN THE BRAIN

In the Rumelhart-Hinton-Williams multilayer neural network, input and output values of each unit correspond to pulse-frequencies in a neuron and thus each unit, disregarding time characteristics, is a very simple model of the neuron. When a neural network is implemented for pattern recognition in engineering fields, output units correspond to gnostic cells in the brain.

The approximate realization of continuous mappings using neural networks, which are simple models of the neural system, suggests that there are several gnostic cells in the brain. It also shows the possibility of revealing information processing in the brain through neural network approaches.

11. SUMMARY

We proved the approximate realization theorem of continuous functions by three-layer networks. This theorem leads to the approximate realization theorem of continuous mappings by $k$ ($\ge 3$)-layer networks and we showed that any mapping whose components are summable on a compact subset can be approximately represented by $k$ ($\ge 3$)-layer networks in the sense of the $L^2$-norm. We also showed an alternative proof of the theorem for the case $k = 4$ by using the Kolmogorov-Arnold-Sprecher theorem and a proposition which is a special case of the three-layer case. We consider that one of the problems of analyzing neural network capabilities is solved in the form of the existence theorem of networks which are approximately capable of representing any given mapping.

Presently, for application of neural networks to pattern recognition or related engineering fields, up to four-layer networks are used (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1988; Tamura & Waibel, 1988). The theorems proved here provide the mathematical basis, and their use would be fundamental in further discussions of neural network system theory.

REFERENCES

Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299-307.
Duda, R. O., & Fossum, H. (1966). Pattern classification by iteratively determined linear and piecewise linear discriminant functions. IEEE Transactions on Electronic Computers, EC-15, 220-232.
Gel'fand, I. M., & Shilov, G. E. (1964). Generalized functions (Vol. 1, Chap. 1). New York: Academic Press.
Hecht-Nielsen, R. (1987). Kolmogorov mapping neural network existence theorem. IEEE First International Conference on Neural Networks, 3, 11-13.
Huang, W. Y., & Lippmann, R. P. (1987). Neural net and traditional classifiers. In D. Z. Anderson (Ed.), Neural information processing systems, Denver, Colorado, 1987 (pp. 387-396). New York: American Institute of Physics.
Irie, B., & Miyake, S. (1988). Capabilities of three-layered perceptrons. IEEE International Conference on Neural Networks, 1, 641-648.
Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 144, 679-681; American Mathematical Society Translation, 28, 55-59 [1963].
Lippmann, R. P. (1987, April). An introduction to computing with neural nets. IEEE ASSP Magazine, 4, pp. 4-22.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the idea immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Poggio, T. (1983). Visual algorithms. In O. J. Braddick & A. C. Sleigh (Eds.), Physical and biological processing of images (pp. 128-135). New York: Springer-Verlag.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by error propagation. In D. E. Rumelhart, J. L. McClelland and the PDP Research Group (Eds.), Parallel distributed processing (Vol. 1, pp. 318-362). Cambridge, MA: MIT Press.
Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.
Sprecher, D. A. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115, 340-355.
Tamura, S., & Waibel, A. (1988). Noise reduction using connectionist models. 1988 International Conference on Acoustic, Speech, and Signal Processing, pp. 553-556.
Uesaka, Y. (1971). Analog perceptrons: On additive representation of functions. Information and Control, 19, 41-65.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1988). Phoneme recognition: neural networks vs. hidden Markov models. 1988 International Conference on Acoustic, Speech, and Signal Processing, pp. 107-110.
Wieland, A., & Leighton, R. (1987). Geometric analysis of neural network capabilities. IEEE First International Conference on Neural Networks, 3, 385-392.
Yosida, K. (1968). Functional analysis. New York: Springer-Verlag.

APPENDIX (DIRECT PROOF OF THE PROPOSITION IN SECTION 9 BY A DIFFERENT METHOD)

Proof. There is a continuous function $\tilde g(x)$ on $R$ which has a compact support such that $\tilde g(x) = g(x)$ on $K$. We may prove the proposition for $\tilde g(x)$ and so we may initially suppose that $g(x)$ has a compact support. We may also suppose that $\phi(\infty) - \phi(-\infty) = 1$. For the arbitrary $\varepsilon > 0$, we will approximate $g(x)$ on $K$ by a summation of sigmoid functions whose variables are shifted and scaled. Initially, we can approximate $g(x)$ by a simple function (step function) $c(x)$ with compact support so that

\[ |g(x) - c(x)| < \varepsilon/2 \qquad (A.1) \]

on $R$ and whose step variances are less than $\varepsilon/4$. Here $c(x)$ is represented using the Heaviside function $H(x)$ as follows:

\[ c(x) = \sum_{i=1}^{N} c_i H(x - x_i). \]

For a sigmoid function $\phi(x)$, set $\phi_\alpha(x) = \phi(x/\alpha)$ ($\alpha > 0$). Then $\phi_\alpha'(x) = \frac{d}{dx} \phi_\alpha(x)$ converge to the delta function as $\alpha \to 0$. We consider the convolution $c * \phi_\alpha'(x)$ of $c(x)$ and $\phi_\alpha'(x)$. We set $2\varepsilon'$ = "minimum width of steps" and obtain

\[ c(x) - c * \phi_\alpha'(x) = \int_{-\infty}^{\infty} \phi_\alpha'(y) [c(x) - c(x - y)] \, dy. \]
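Divide the integral of the right term into $(-\infty, -\varepsilon')$, $[-\varepsilon', \varepsilon']$, $(\varepsilon', \infty)$ and estimate these using the properties of sigmoid functions. For example,

\[ \int_{-\varepsilon'}^{\varepsilon'} \phi_\alpha'(y) [c(x) - c(x - y)] \, dy < \varepsilon/4 \int_{-\infty}^{\infty} \phi_\alpha'(y) \, dy = \varepsilon/4 \]

and other terms will be arbitrarily small for a sufficiently small $\alpha$. Therefore we obtain

\[ |c(x) - c * \phi_\alpha'(x)| < \varepsilon/2. \]

As $c * \phi_\alpha'(x) = c' * \phi_\alpha(x)$ and $c'(x)$ is given by

\[ c'(x) = \sum_{i=1}^{N} c_i \delta(x - x_i) \]

and so, $c * \phi_\alpha'(x)$ is represented as follows:

\[ c * \phi_\alpha'(x) = \sum_{i=1}^{N} c_i \phi_\alpha(x - x_i). \]

That is to say,

\[ \Big| c(x) - \sum_{i=1}^{N} c_i \phi_\alpha(x - x_i) \Big| < \varepsilon/2. \qquad (A.2) \]

Using (A.1) and (A.2) we obtain

\[ \Big| g(x) - \sum_{i=1}^{N} c_i \phi_\alpha(x - x_i) \Big| < \varepsilon. \]

Here $\phi_\alpha(x - x_i) = \phi(x/\alpha - x_i/\alpha)$, so we set $a_i = 1/\alpha$, $b_i = -x_i/\alpha$ and the proposition is proved. q.e.d.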
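The construction in this appendix can be tried numerically. The sketch below is an illustration under stated assumptions rather than the paper's procedure: it takes the logistic sigmoid, a target $g$ with $g(a) = 0$ on $K = [a, b]$ (so that no constant term is needed), places the jumps $c_i$ of a step function at cell midpoints $x_i$, and uses the hypothetical smoothing width $\alpha = h/10$, where $h$ is the step width. The coefficients are then $a_i = 1/\alpha$ and $b_i = -x_i/\alpha$, as in the proof.

```python
import math
from typing import Callable, List, Tuple


def logistic(t: float) -> float:
    """Numerically stable logistic sigmoid 1 / (1 + exp(-t))."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def sigmoid_sum_coefficients(g: Callable[[float], float], a: float, b: float,
                             steps: int) -> List[Tuple[float, float, float]]:
    """Return triples (c_i, a_i, b_i) for the sum of c_i * phi(a_i * x + b_i).

    Assumes g(a) == 0. The jump c_i = g(right) - g(left) of each cell is
    placed at the cell midpoint x_i and smoothed with width alpha.
    """
    h = (b - a) / steps
    alpha = h / 10.0  # illustrative choice, small relative to the step width
    coeffs = []
    for i in range(1, steps + 1):
        left = a + (i - 1) * h
        right = a + i * h
        x_i = 0.5 * (left + right)
        coeffs.append((g(right) - g(left), 1.0 / alpha, -x_i / alpha))
    return coeffs


def evaluate(coeffs: List[Tuple[float, float, float]], x: float) -> float:
    """Evaluate the sum of c_i * phi(a_i * x + b_i) at x."""
    return sum(c * logistic(ai * x + bi) for c, ai, bi in coeffs)


def max_error(g: Callable[[float], float], a: float, b: float, steps: int,
              samples: int = 1001) -> float:
    """Largest sampled deviation between g and its sigmoid-sum approximation."""
    coeffs = sigmoid_sum_coefficients(g, a, b, steps)
    worst = 0.0
    for k in range(samples):
        x = a + (b - a) * k / (samples - 1)
        worst = max(worst, abs(g(x) - evaluate(coeffs, x)))
    return worst


if __name__ == "__main__":
    for steps in (20, 200):
        print(steps, max_error(math.sin, 0.0, math.pi, steps))
```

Refining the step function (more steps) and shrinking $\alpha$ with it reduces the sampled error, which mirrors the two approximation stages (A.1) and (A.2).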