
Deep Feedforward Networks:

Application to Pattern Recognition

Haroon A. Babri and Yin Tong


School of EEE
Nanyang Technological University
Nanyang Ave., Singapore 639798
Fax No.: (65) 7912687
email: eharoon@ntuvax.ntu.ac.sg
ABSTRACT
Issues pertaining to learning in deep feedforward neural networks are addressed. The sensitivity
of the network output is derived as a function of weight perturbations in different hidden layers.
For a given hidden layer l, the output sensitivity varies inversely with the current activation levels of
the neurons of the previous layer l - 1, and with the magnitude of the connection weights between layers
l and l - 1. Learning involves modifying weights. Relatively small connection weights (usually
during the initial learning phase of the BP algorithm) or small neuron activation levels (usually after
the initial learning period) can increase the sensitivity of the network and make learning unstable.
This problem is further aggravated as the depth of the network increases. A weight initialization
strategy and a modified activation (sigmoid) function are proposed to alleviate these problems.
Using this scheme, deep networks trained with the error backpropagation learning rule show
substantial improvement in error curve (trajectory) control and convergence speed when applied
to pattern recognition problems.

1. Introduction
Since the popular discovery of the Backpropagation Algorithm for training multilayer feedfor-
ward neural networks, they have become a powerful method in approximation and classification
applications ([8]). They are favored over traditional methods for their ability to 'learn' general
rules based purely on examples (training data), greatly simplifying the requirements for solving a
particular problem. Also, it has been shown that feedforward networks of depth two (one hidden
layer) can approximate, to any accuracy, arbitrary continuous functions (e.g., [4]). Therefore,
most applications of artificial neural networks to various problems have focused on the use of
shallow networks.
However, some problems are so 'hard' to learn ([6], [9]) that deeper networks are required, and
modifications have to be made to the standard feedforward architecture or learning procedure (short-
cut links, the cascade-correlation algorithm, etc.) to make the learning converge. Furthermore, when
implementing neural networks in hardware, one is restricted to using neuron activation values
and weights with limited precision. Under such constraints, depth-two networks can require an
unreasonably large (exponential) number of neurons in the hidden layer to compute certain
functions [7]. Depth-three networks can compute these functions with a polynomial number of
neurons, and in this sense are more powerful than those of depth two. Whether the same is
true for even deeper networks, or the hierarchy collapses at depth three, is an interesting open
problem.
The above reasons provide strong motivation for investigating the learning behavior of deep
networks more thoroughly. This paper presents some preliminary results of our research in this
direction. The organization of this paper is as follows. In Section 2, we describe one of the main



difficulties encountered in training deep feedforward networks, and propose a method to get
around this difficulty. In Section 3, we give some simulation results comparing the performance
of this scheme with that of the standard EBP algorithm. Section 4 ends this paper with some
discussion.

2. Sensitivity of Deep Feedforward Networks


The study of learning in deep networks has been hindered by the difficulty of analysing their behavior
mathematically, as well as of evaluating them experimentally. Recently there have been some
studies on the sensitivity of neural networks (e.g., [3]), with a view to enabling the selection of weights with
lower sensitivity, thereby producing more robust networks. In [10], the sensitivity is derived as a
function of the depth of the weight perturbation. In [2], a weight initialization strategy is proposed
to alleviate the large variability in sensitivity. In addition, the attenuation of the error signals in
deep networks as the error is propagated towards the input layer results in increasing isolation
of the input from the output [6]. All these factors contribute to the difficulty of training networks
of more than two hidden layers.
In this work, we identify another cause of difficulty of learning in multilayer networks and
propose a solution to the problem. Consider an L-layer network of sigmoidal neurons. The layers
are indexed by l (l = 0, 1, ..., L). Let X^(l) = {x_j^(l)} denote the activation vector of layer l, and
W^(l) = {w_ij^(l)} the weight matrix between layers l - 1 and l. We define the Euclidean norm of
the activation vectors in a network, and the Frobenius norm of a weight matrix in one layer, as
follows:

$$\|W^{(l)}\| = \sqrt{\mathrm{Trace}\left(W^{(l)T} W^{(l)}\right)}, \qquad \|X^{(l)}\| = \sqrt{X^{(l)T} X^{(l)}}.$$

The output vectors of adjacent layers are related by

$$X^{(l)} = \Phi\left(X^{(l-1)}, W^{(l)}\right),$$

where Φ denotes the nonlinear transformation associated with layer l. Then, using the linear
first-order Taylor series approximation to the function Φ, we get the following formula:

$$\Delta\left(\|X^{(l)}\|\right) = \left\|(X + \Delta X)^{(l)}\right\| - \left\|X^{(l)}\right\|. \qquad (1)$$

After some mathematical manipulation, we get the following expression:

$$\frac{\Delta\|X^{(L)}\|}{\|X^{(L)}\|} \approx \sum_{l=1}^{L} C^{(L)} C^{(L-1)} \cdots C^{(l+1)} D^{(l)} \, \frac{\Delta\|W^{(l)}\|}{\|W^{(l)}\|} \; + \; C^{(L)} C^{(L-1)} \cdots C^{(1)} \, \frac{\Delta\|X^{(0)}\|}{\|X^{(0)}\|}, \qquad (2)$$

where C^(l) is the activation perturbation coefficient and D^(l) = \sum_{i,j} f'(u_i^{(l)}) x_j^{(l-1)} / \sum_{i,j} w_{ij}^{(l)}
is the weight perturbation coefficient of the corresponding layer l.
Consider the case when only one weight layer, say K (K ≤ L), is perturbed. The output
perturbation ratio can be easily obtained by using (2), with Δ‖X^(0)‖ = 0:

$$\frac{\Delta\|X^{(L)}\|}{\|X^{(L)}\|} \approx M_{LK} \, \frac{\Delta\|W^{(K)}\|}{\|W^{(K)}\|}, \qquad (3)$$

where M_{LK} = C^(L) C^(L-1) ··· C^(K+1) D^(K) is the amplification factor due to the weight perturbation
at layer K only.
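To make the role of these quantities concrete, the sketch below computes per-layer perturbation coefficients and the amplification factors M_LK of (3) for a small randomly initialized sigmoid network (with the same 2-16-8-2 shape used later in Section 3). It is only an illustration: the exact formulas used for C^(l) and D^(l) inside perturbation_coefficients are our own stand-ins chosen to reproduce the qualitative behavior described here (C^(l) grows for small activations, D^(l) for small weights), and all function and variable names are ours.

```python
import numpy as np

def sigmoid(u):
    """Standard logistic activation."""
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, weights):
    """Return the activation vectors X^(0), ..., X^(L) of a sigmoid network."""
    acts = [x]
    for W in weights:
        acts.append(sigmoid(W @ acts[-1]))
    return acts

def perturbation_coefficients(acts, weights):
    """Per-layer activation (C) and weight (D) perturbation coefficients.

    NOTE: these formulas are illustrative stand-ins (assumptions), keyed only
    to the qualitative behavior described in the text: C^(l) grows when the
    previous layer's activations are small, D^(l) grows when weights are small.
    """
    C, D = [], []
    for l, W in enumerate(weights, start=1):
        x_prev, x_curr = acts[l - 1], acts[l]
        f_prime = x_curr * (1.0 - x_curr)          # f'(u) = x(1 - x) for the standard sigmoid
        C.append(f_prime.sum() * np.abs(W).sum() / x_prev.sum())
        D.append(np.outer(f_prime, x_prev).sum() / np.abs(W).sum())
    return C, D

def amplification_factors(C, D):
    """M_LK = C^(L) C^(L-1) ... C^(K+1) D^(K), one factor per perturbed weight layer K."""
    L = len(C)
    return {K: float(np.prod(C[K:]) * D[K - 1]) for K in range(1, L + 1)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = [2, 16, 8, 2]                          # same 2-16-8-2 shape as the network in Section 3
    weights = [rng.uniform(-0.1, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    acts = forward(rng.uniform(0.0, 1.0, sizes[0]), weights)
    C, D = perturbation_coefficients(acts, weights)
    print(amplification_factors(C, D))             # larger factors signal higher output sensitivity
```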

It is easy to see that the coefficients C^(l) and D^(l) can become large for small activation levels
and small weights, respectively. The farther the weight perturbation is from the output layer,
the greater is the number of communication paths between that layer and the output containing
small activations and small weights, hence resulting in large values of the amplification factor M_LK.
This can make learning unstable, since small changes in weights can produce large variations at
the output of the network.
To constrain the M_LK's to a reasonable range, both the coefficients C^(l) and D^(l) forming the
product should be adequately constrained. To bound C^(l), we propose the following modification
to the standard sigmoidal function:
$$f(u) = \frac{1 - 2\epsilon}{1 + e^{-u}} + \epsilon,$$

= z y ) ( l - zj‘))C(’)
where E is a small positive number. Since f’(ui”) , can be shown t o be loosely
upper bounded by E,,,
g(1- &)?.
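As a quick sanity check on this modification, the short sketch below (assuming the reconstructed form of f(u) given above, with the value ε = 0.03 used in Figure 2) verifies that the modified sigmoid keeps every activation strictly inside (ε, 1 - ε), so that activations can no longer approach 0 or 1.

```python
import numpy as np

def modified_sigmoid(u, eps=0.03):
    """Standard sigmoid rescaled into the range (eps, 1 - eps)."""
    return (1.0 - 2.0 * eps) / (1.0 + np.exp(-u)) + eps

u = np.linspace(-20.0, 20.0, 1001)
y = modified_sigmoid(u)
print(y.min(), y.max())   # stays strictly inside (0.03, 0.97), even for saturated inputs
```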
The weight perturbation coefficient

$$D^{(l)} = \frac{\sum_{i,j} f'(u_i^{(l)})\, x_j^{(l-1)}}{\sum_{i,j} w_{ij}^{(l)}}$$

can become large due to very small weights. To bound it, we propose a weight initialization scheme in which the connection
weights are assigned initial values distributed in a narrow range around two centers symmetrical
about zero, i.e., at -W_0 and W_0, where W_0 is a positive number. As learning proceeds, some
weights will tend to move towards zero. This trend can be arrested by including a constraint of
the form ‖W^(l) - W_0‖ in the backpropagation cost function, or by imposing a strong condition
to prohibit weights from advancing too close to zero [1]. In this paper, we will only consider the
scheme based on the abovementioned weight initialization and activation function modification.
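The initialization itself is easy to sketch. The helper below (our own name; it assumes, as defaults, the band width of 0.2 and centers ±0.5 quoted in Section 3) draws each weight uniformly from a narrow band around +W_0 or -W_0, so that no initial weight magnitude lies close to zero.

```python
import numpy as np

def two_center_init(shape, w0=0.5, width=0.2, rng=None):
    """Draw weights uniformly from narrow bands around +w0 and -w0,
    keeping every initial weight magnitude at least w0 - width/2."""
    rng = np.random.default_rng() if rng is None else rng
    magnitudes = rng.uniform(w0 - width / 2.0, w0 + width / 2.0, size=shape)
    signs = rng.choice([-1.0, 1.0], size=shape)   # pick one of the two centers for each weight
    return signs * magnitudes

W1 = two_center_init((16, 2))                     # first hidden layer of a 2-16-8-2 network
print(np.abs(W1).min())                           # >= 0.4: no weight starts near zero
```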

3. Simulation Results
To test the performance of the scheme proposed above, the Prostrate4 benchmark was selected.
It has been used as a hard benchmark to test many neural classifiers (e.g., [5]), because the
decision regions are both meshed and disjoint.
Following [5], we use a fixed feedforward network architecture with 2 neurons in the input
layer, 16 in the first hidden layer, 8 in the second hidden layer, and 2 in the output layer. The
learning rate and momentum parameter are fixed at 0.3 and 0.7 respectively, and a minimum
averaged squared error of 0.01 is used as the stopping criterion. The training and test sets consist
of 540 and 2160 points, respectively, uniformly distributed in the sample space. The desired
outputs for the two classes are (0.1, 0.9) and (0.9, 0.1). During testing, outputs (< 0.4, > 0.6)
and (> 0.6, < 0.4) are interpreted as representing the two classes, respectively. In the modified
learning procedure, weights are initially uniformly distributed in intervals of width 0.2, centered
at -0.5 and 0.5. A set of 50 experiments with different initial weights and values of ε (from 0.01
to 0.05) was conducted. The proposed scheme consistently performed better than the standard
error backpropagation (EBP) algorithm. The error rate on the test set varied from 2% to 4% for the
proposed algorithm, and from 2% to 6% for the standard backpropagation algorithm. The plots of a
typical set of experiments are shown in Figure 1 and Figure 2. In these figures, (a) shows the
sum-squared error plot when the network is trained with the corresponding scheme, and (b) shows
the averaged absolute values of the amplification factors M_LK. It is worth noting that the
perturbation amplification factors are reduced by several orders of magnitude using the proposed
scheme as compared to the standard EBP algorithm. With the proposed algorithm, the network
converged in 700-1300 epochs, as compared to 7000-10000 epochs for the standard case. We
also experimented with networks of depth four. A summary of the results is presented in Table 1.
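For reference, the set-up above can be collected into a small configuration sketch with the thresholded class-decision rule spelled out in code. This is a reconstruction for illustration only: the dictionary layout, variable names, and the decide helper are ours, and the 0.4/0.6 thresholds and target patterns are taken from the description above.

```python
# Experimental set-up of Section 3, collected for reference (names are ours).
CONFIG = {
    "layer_sizes": [2, 16, 8, 2],      # input, hidden 1, hidden 2, output
    "learning_rate": 0.3,
    "momentum": 0.7,
    "stop_mse": 0.01,                  # minimum averaged squared error (stopping criterion)
    "train_points": 540,
    "test_points": 2160,
    "targets": {"class_1": (0.1, 0.9), "class_2": (0.9, 0.1)},
}

def decide(outputs, low=0.4, high=0.6):
    """Map a pair of network outputs to a class using the thresholded rule
    described in the text; outputs between the thresholds are rejected."""
    o1, o2 = outputs
    if o1 < low and o2 > high:
        return "class_1"               # matches the (0.1, 0.9) target pattern
    if o1 > high and o2 < low:
        return "class_2"               # matches the (0.9, 0.1) target pattern
    return "rejected"

print(decide((0.12, 0.91)))            # -> class_1
print(decide((0.88, 0.07)))            # -> class_2
```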

[Plots: (a) sum-squared error vs. epoch; (b) amplification factors vs. epoch. The network converges at epoch 7971.]

Figure 1: Learning with standard EBP: (a) Error Curve; (b) Amplification Factors

[Plots: (a) sum-squared error vs. epoch; (b) amplification factors M31, M32, M33 vs. epoch. The network converges at epoch 638.]

Figure 2: Learning with Proposed Scheme, W_0 = 0.5, ε = 0.03: (a) Error Curve; (b) Amplification Factors

Depth   Convergence (Ave. No. of Epochs)
        Standard Algorithm   Modified Algorithm
3       8649                 972
4       13327                4182

Table 1: Comparison of training times of deep networks

Preliminary results with a depth-five network indicate that, whereas the standard algorithm fails
to converge, the network can be trained using the modified scheme.

4. Conclusion
There are several factors contributing to the difficulty of training multilayered feedforward net-
works. In this paper, we have identified the sensitivity of the output to perturbations of small
weights as one of the major factors making learning difficult. A simple weight initialization and
activation function modification scheme proposed to alleviate this problem substantially reduces
the time required to train the network, as compared to the standard backpropagation algorithm.
Since deeper networks can now be trained, we are evaluating the hypothesis that deeper networks
are more efficient in solving problems of greater complexity.

References
[1] Babri, H.A., et al. (1995) Causes and Remedy of Stability Problems in Backpropagation Networks. Neural, Scientific, and Parallel Computation, Vol. 3, pp. 357-368, 1995.
[2] Babri, H.A., Kot, A. (1994) On Learning in Deep Feedforward Neural Networks. Proc. 10th Int'l Conference on Systems Engineering, Vol. 1, pp. 60-67, Coventry, 1994.
[3] Choi, J.Y., Choi, C. (1992) Sensitivity Analysis of Multilayer Perceptron with Differentiable Activation Function. IEEE Trans. Neural Networks, Vol. 3, pp. 101-107, 1992.
[4] Cybenko, G. (1989) Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems, Vol. 2, pp. 303-314, 1989.
[5] Huang, W.Y., Lippmann, R.P. (1987) Comparisons Between Neural Net and Conventional Classifiers. Proc. IEEE Int'l Conf. on Neural Networks, Vol. 4, pp. 485-493, 1987.
[6] Lang, K.J., Witbrock, M.J. (1988) Learning to Tell Two Spirals Apart. Proc. 1988 Connectionist Models Summer School, pp. 52-58, 1988.
[7] Obradovic, Z., Yan, P. (1990) Small Depth Polynomial Size Neural Networks. Neural Computation, Vol. 2, pp. 402-404, 1990.
[8] Rumelhart, D.E., Hinton, G.E., Williams, R.J. (1986) Parallel Distributed Processing, Vol. 1. Cambridge, MA: MIT Press, 1986.
[9] Sontag, E.D. (1992) Feedback Stabilization Using Two-Hidden-Layer Nets. IEEE Trans. Neural Networks, Vol. 3, pp. 981-990, November 1992.
[10] Stevenson, M., Winter, R., Widrow, B. (1990) Sensitivity of Feedforward Neural Networks to Weight Errors. IEEE Trans. Neural Networks, Vol. 1, pp. 71-80, March 1990.

