MULTILAYER PERCEPTRONS
For ease of reference we have reproduced the diagram of the three-layer perceptron in
Figure 1. Note that we have simply cascaded three perceptron networks. The output of
the first network is the input to the second network, and the output of the second network
is the input to the third network. Each layer may have a different number of neurons, and
even a different transfer function. We use superscripts to identify the layer number.
Thus, the weight matrix for the first layer is written W^1 and the weight matrix for the
second layer is written W^2.
To identify the structure of a multilayer network, we will sometimes use the following
shorthand notation, where the number of inputs is followed by the number of neurons in
each layer:

R - S^1 - S^2 - S^3   (1)
First Boundary:

a^1_1 = hardlim( [-1  0] p + 0.5 )

Second Boundary:

a^1_2 = hardlim( [0  -1] p + 0.75 )

[Figure: First Subnetwork. Inputs p1 and p2 feed two Individual Decision neurons
(weights [-1 0] and [0 -1], biases 0.5 and 0.75), whose outputs a^1_1 and a^1_2 feed an
AND Operation neuron (weights [1 1], bias -1.5).]
Third Boundary:

a^1_3 = hardlim( [1  0] p - 1.5 )

Fourth Boundary:

a^1_4 = hardlim( [0  1] p - 0.25 )

[Figure: Second Subnetwork. Inputs p1 and p2 feed two Individual Decision neurons
(weights [1 0] and [0 1], biases -1.5 and -0.25), whose outputs a^1_3 and a^1_4 feed an
AND Operation neuron (weights [1 1], bias -1.5).]
Collecting the three layers, the weight matrices and bias vectors are

W^1 = [ -1   0          b^1 = [  0.5
         0  -1                   0.75
         1   0                  -1.5
         0   1 ],               -0.25 ]

W^2 = [ 1  1  0  0      b^2 = [ -1.5
        0  0  1  1 ],           -1.5 ]

W^3 = [ 1  1 ],         b^3 = [ -0.5 ]
[Figure: the complete network. The input p (2x1) feeds the Initial Decisions layer
W^1 (4x2), producing a^1 (4x1); the AND Operations layer W^2 (2x4) produces a^2 (2x1);
the OR Operation layer W^3 (1x2) produces the output a^3 (1x1).]
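As a check, the cascaded decision network can be evaluated directly. The sketch below uses the boundary weights and the AND/OR combinations described above, with bias signs as reconstructed from the boundary equations; treat the exact values as a reconstruction rather than the text's verbatim figures.

```python
import numpy as np

def hardlim(n):
    # hard-limit transfer function: 1 if n >= 0, else 0
    return (n >= 0).astype(float)

# Weights and biases as reconstructed from the boundary equations
W1 = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
b1 = np.array([0.5, 0.75, -1.5, -0.25])
W2 = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])  # AND operations
b2 = np.array([-1.5, -1.5])
W3 = np.array([[1.0, 1.0]])                                   # OR operation
b3 = np.array([-0.5])

def classify(p):
    a1 = hardlim(W1 @ p + b1)   # four individual decision boundaries
    a2 = hardlim(W2 @ a1 + b2)  # AND the two boundaries of each subnetwork
    a3 = hardlim(W3 @ a2 + b3)  # OR the two subnetwork decisions
    return a3[0]
```

For example, p = (0, 0) satisfies both boundaries of the first subnetwork and is classified 1, while p = (1, 0) satisfies neither subnetwork and is classified 0.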
Pattern Classification
To illustrate the capabilities of the multilayer perceptron for pattern classification,
consider the classic exclusive-or (XOR) problem. The input/target pairs for the XOR
gate are
{ p1 = [0; 0], t1 = 0 },  { p2 = [0; 1], t2 = 1 },  { p3 = [1; 0], t3 = 1 },  { p4 = [1; 1], t4 = 0 }
This problem, which is illustrated graphically in the figure below, was used by Minsky
and Papert in 1969 to demonstrate the limitations of the single-layer perceptron.
Because the two categories are not linearly separable, a single-layer perceptron cannot
perform the classification.
A two-layer network can solve the XOR problem. In fact, there are many different
multilayer solutions. One solution is to use two neurons in the first layer to create two
decision boundaries. The first boundary separates p1 from the other patterns, and the
second boundary separates p4. Then the second layer combines the two boundaries
using an AND operation. The decision boundaries for each first-layer neuron are shown
in Figure 2.
The resulting two-layer, 2-2-1 network is shown in Figure 3. The overall decision
regions for this network are shown in Figure 2a. The shaded region indicates those
inputs that will produce a network output of 1.
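One concrete set of weights that implements this two-boundary solution is sketched below. The specific values are illustrative, not the text's: any weights that produce the same two boundaries, combined by an AND, will work.

```python
import numpy as np

def hardlim(n):
    # hard-limit transfer function: 1 if n >= 0, else 0
    return (n >= 0).astype(float)

# First layer: two decision boundaries (illustrative values).
# Neuron 1 outputs 1 for every pattern except p1 = (0,0);
# neuron 2 outputs 1 for every pattern except p4 = (1,1).
W1 = np.array([[1.0, 1.0], [-1.0, -1.0]])
b1 = np.array([-0.5, 1.5])

# Second layer: AND of the two first-layer decisions.
W2 = np.array([[1.0, 1.0]])
b2 = np.array([-1.5])

def xor_net(p):
    a1 = hardlim(W1 @ p + b1)
    a2 = hardlim(W2 @ a1 + b2)
    return a2[0]
```

The shaded XOR region (output 1) is exactly the set of inputs lying on the "1" side of both boundaries.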
Function Approximation
Up to this point in the text we have viewed neural networks mainly in the context of
pattern classification. It is also instructive to view networks as function approximators.
In control systems, for example, the objective is to find an appropriate feedback function
that maps from measured outputs to control inputs. In adaptive filtering the objective is
to find a function that maps from delayed values of an input signal to an appropriate
output signal. The following example will illustrate the flexibility of the multilayer
perceptron for implementing functions.
Consider the two-layer, 1-2-1 network shown in Figure 4. For this example the transfer
function for the first layer is log-sigmoid and the transfer function for the second layer is
linear. In other words,
f^1(n) = 1 / (1 + e^{-n})   and   f^2(n) = n   (2)
Suppose that the nominal values of the weights and biases for this network are

w^1_{1,1} = 10,   w^1_{2,1} = 10,   b^1_1 = -10,   b^1_2 = 10,

w^2_{1,1} = 1,   w^2_{1,2} = 1,   b^2 = 0
The network response for these parameters is shown in Figure 5, which plots the
network output a^2 as the input p is varied over the range [-2, 2].
Notice that the response consists of two steps, one for each of the log-sigmoid neurons
in the first layer. By adjusting the network parameters we can change the shape and
location of each step, as we will see in the following discussion.
The centers of the steps occur where the net input to a neuron in the first layer is zero:

n^1_1 = w^1_{1,1} p + b^1_1 = 0   =>   p = -b^1_1 / w^1_{1,1} = -(-10)/10 = 1,   (3)

n^1_2 = w^1_{2,1} p + b^1_2 = 0   =>   p = -b^1_2 / w^1_{2,1} = -10/10 = -1   (4)
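The nominal response and its step centers can be verified numerically; the sketch below uses the weight signs as reconstructed above.

```python
import math

def logsig(n):
    # log-sigmoid transfer function
    return 1.0 / (1.0 + math.exp(-n))

# Nominal 1-2-1 parameters (signs as reconstructed above)
w1 = [10.0, 10.0]    # first-layer weights w^1_{1,1}, w^1_{2,1}
b1 = [-10.0, 10.0]   # first-layer biases  b^1_1,     b^1_2
w2 = [1.0, 1.0]      # second-layer weights
b2 = 0.0

def net(p):
    a1 = [logsig(w1[i] * p + b1[i]) for i in range(2)]  # hidden layer
    return w2[0] * a1[0] + w2[1] * a1[1] + b2           # linear output layer

# The steps are centered at p = 1 and p = -1, where one hidden neuron's
# net input is exactly zero, so that neuron outputs logsig(0) = 0.5.
```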
The steepness of each step can be adjusted by changing the network weights.
Figure 6 illustrates the effects of parameter changes on the network response. The blue
curve is the nominal response. The other curves correspond to the network response
when one parameter at a time is varied over the following ranges:

-1 <= w^2_{1,1} <= 1,   -1 <= w^2_{1,2} <= 1,   0 <= b^1_2 <= 20,   -1 <= b^2 <= 1   (5)
Figure 6(a) shows how the network biases in the first (hidden) layer can be used to locate
the position of the steps. Figure 6(b) illustrates how the weights determine the slope of
the steps. The bias in the second (output) layer shifts the entire network response up or
down, as can be seen in Figure 6(d).
From this example we can see how flexible the multilayer network is. It would appear
that we could use such networks to approximate almost any function, if we had a
sufficient number of neurons in the hidden layer. In fact, it has been shown that two-layer
networks, with sigmoid transfer functions in the hidden layer and linear transfer
functions in the output layer, can approximate virtually any function of interest to any
degree of accuracy, provided sufficiently many hidden units are available.
Figure 6. Effect of Parameter Changes on Network Response
a^0 = p,   (7)

which provides the starting point for Eq. (6). The outputs of the neurons in the last
layer are considered the network outputs:

a = a^M   (8)
Performance Index
The backpropagation algorithm for multilayer networks is a generalization of the LMS
algorithm, and both algorithms use the same performance index: mean square error.
The algorithm is provided with a set of examples of proper network behavior:

{p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q},   (9)

where p_q is an input to the network and t_q is the corresponding target output. The
algorithm should adjust the network parameters in order to minimize the mean square
error:

F(x) = E[e^2] = E[(t - a)^2],   (10)

where x is the vector of network weights and biases. If the network has multiple
outputs this generalizes to

F(x) = E[e^T e] = E[(t - a)^T (t - a)]   (11)

As with the LMS algorithm, we will approximate the mean square error by

F̂(x) = (t(k) - a(k))^T (t(k) - a(k)) = e^T(k) e(k),   (12)

where the expectation of the squared error has been replaced by the squared error at
iteration k.
The steepest descent algorithm for the approximate mean square error is

w^m_{i,j}(k+1) = w^m_{i,j}(k) - α ∂F̂/∂w^m_{i,j},   (13)

b^m_i(k+1) = b^m_i(k) - α ∂F̂/∂b^m_i,   (14)

where α is the learning rate.
Chain Rule
For a single-layer linear network (the ADALINE) these partial derivatives are
conveniently computed, as in Eq. 33 and Eq. 34 of Widrow-Hoff learning:

W(k+1) = W(k) + 2α e(k) p^T(k),

b(k+1) = b(k) + 2α e(k)

For the multilayer network the error is not an explicit function of the weights in the
hidden layers; therefore these derivatives are not computed so easily.
Because the error is an indirect function of the weights in the hidden layers, we will use
the chain rule of calculus to calculate the derivatives. To review the chain rule, suppose
that we have a function f that is an explicit function only of the variable n. We want to
take the derivative of f with respect to a third variable w. The chain rule is then:
d f(n(w))/dw = (df(n)/dn) × (dn(w)/dw)   (15)

For example, if

f(n) = e^n and n = 2w, so that f(n(w)) = e^{2w},   (16)

then

d f(n(w))/dw = (df(n)/dn)(dn(w)/dw) = (e^n)(2)   (17)
We will use this concept to find the derivatives in Eq. (13) and Eq. (14):

∂F̂/∂w^m_{i,j} = (∂F̂/∂n^m_i) × (∂n^m_i/∂w^m_{i,j}),   (18)

∂F̂/∂b^m_i = (∂F̂/∂n^m_i) × (∂n^m_i/∂b^m_i)   (19)
The second term in each of these equations can be easily computed, since the net input
to layer m is an explicit function of the weights and bias in that layer:
n^m_i = Σ_{j=1}^{S^{m-1}} w^m_{i,j} a^{m-1}_j + b^m_i   (20)

Therefore

∂n^m_i/∂w^m_{i,j} = a^{m-1}_j,   ∂n^m_i/∂b^m_i = 1   (21)
If we now define

s^m_i ≡ ∂F̂/∂n^m_i   (22)

(the sensitivity of F̂ to changes in the ith element of the net input at layer m), then
Eq. (18) and Eq. (19) can be simplified to

∂F̂/∂w^m_{i,j} = s^m_i a^{m-1}_j,   (23)

∂F̂/∂b^m_i = s^m_i   (24)
We can now express the approximate steepest descent algorithm as

w^m_{i,j}(k+1) = w^m_{i,j}(k) - α s^m_i a^{m-1}_j,   (25)

b^m_i(k+1) = b^m_i(k) - α s^m_i   (26)

In matrix form this becomes

W^m(k+1) = W^m(k) - α s^m (a^{m-1})^T,   (27)

b^m(k+1) = b^m(k) - α s^m,   (28)
where

s^m ≡ ∂F̂/∂n^m = [ ∂F̂/∂n^m_1 ;  ∂F̂/∂n^m_2 ;  ... ;  ∂F̂/∂n^m_{S^m} ]   (29)
To derive the recurrence relationship for the sensitivities, we will use the following
Jacobian matrix:

∂n^{m+1}/∂n^m =

[ ∂n^{m+1}_1/∂n^m_1          ∂n^{m+1}_1/∂n^m_2          ...   ∂n^{m+1}_1/∂n^m_{S^m}
  ∂n^{m+1}_2/∂n^m_1          ∂n^{m+1}_2/∂n^m_2          ...   ∂n^{m+1}_2/∂n^m_{S^m}
  ...                        ...                              ...
  ∂n^{m+1}_{S^{m+1}}/∂n^m_1  ∂n^{m+1}_{S^{m+1}}/∂n^m_2  ...   ∂n^{m+1}_{S^{m+1}}/∂n^m_{S^m} ]   (30)
Next we want to find an expression for this matrix. Consider the i, j element of the
matrix.
∂n^{m+1}_i/∂n^m_j = ∂( Σ_{l=1}^{S^m} w^{m+1}_{i,l} a^m_l + b^{m+1}_i )/∂n^m_j = w^{m+1}_{i,j} (∂a^m_j/∂n^m_j)

= w^{m+1}_{i,j} (∂f^m(n^m_j)/∂n^m_j) = w^{m+1}_{i,j} ḟ^m(n^m_j),   (31)
where

ḟ^m(n^m_j) = ∂f^m(n^m_j) / ∂n^m_j   (32)

Therefore the Jacobian matrix can be written

∂n^{m+1}/∂n^m = W^{m+1} Ḟ^m(n^m),   (33)
where

Ḟ^m(n^m) = [ ḟ^m(n^m_1)       0          ...       0
             0            ḟ^m(n^m_2)     ...       0
             ...              ...        ...      ...
             0                0          ...   ḟ^m(n^m_{S^m}) ]   (34)
We can now write out the recurrence relation for the sensitivity by using the chain rule
in matrix form:

s^m = ∂F̂/∂n^m = (∂n^{m+1}/∂n^m)^T (∂F̂/∂n^{m+1}) = Ḟ^m(n^m) (W^{m+1})^T (∂F̂/∂n^{m+1})

= Ḟ^m(n^m) (W^{m+1})^T s^{m+1}   (35)
Now we can see where the backpropagation algorithm derives its name. The
sensitivities are propagated backward through the network from the last layer to the
first layer:

s^M → s^{M-1} → ... → s^2 → s^1   (36)
At this point it is worth emphasizing that the backpropagation algorithm uses the same
approximate steepest descent technique that we used in the LMS algorithm. The only
complication is that in order to compute the gradient we need to first backpropagate the
sensitivities. The beauty of backpropagation is that it gives us a very efficient
implementation of the chain rule.
There is one last step to make in order to complete the backpropagation algorithm: the
starting point, s^M, for the recurrence relation of Eq. (35). It is obtained at the final
layer:

s^M_i = ∂F̂/∂n^M_i = ∂[(t - a)^T (t - a)]/∂n^M_i = ∂[ Σ_{j=1}^{S^M} (t_j - a_j)^2 ]/∂n^M_i = -2(t_i - a_i) ∂a_i/∂n^M_i   (37)

Then, since

∂a_i/∂n^M_i = ∂a^M_i/∂n^M_i = ∂f^M(n^M_i)/∂n^M_i = ḟ^M(n^M_i),   (38)

we can write

s^M_i = -2(t_i - a_i) ḟ^M(n^M_i)   (39)

In matrix form this becomes

s^M = -2 Ḟ^M(n^M)(t - a)   (40)
PROCEDURE FOR BACKPROPAGATION ALGORITHM
Step 1: Propagate the input forward through the network:

a^0 = p,   (41)

a^{m+1} = f^{m+1}(W^{m+1} a^m + b^{m+1}), for m = 0, 1, ..., M - 1,   (42)

a = a^M   (43)

Step 2: Propagate the sensitivities backward through the network:

s^M = -2 Ḟ^M(n^M)(t - a),   (44)

s^m = Ḟ^m(n^m)(W^{m+1})^T s^{m+1}, for m = M - 1, ..., 2, 1   (45)

Step 3: Update the weights and biases using the approximate steepest descent
algorithm:

W^m(k+1) = W^m(k) - α s^m (a^{m-1})^T,   (46)

b^m(k+1) = b^m(k) - α s^m   (47)
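The three steps can be sketched for a two-layer log-sigmoid/linear network. This is a minimal illustration of the procedure, not the text's own code; the function name and shapes are our own choices.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def backprop_step(W1, b1, W2, b2, p, t, alpha):
    """One iteration of backpropagation for a 1-S1-1 logsig/linear network."""
    # Step 1: propagate the input forward (Eqs. 41-43)
    a0 = p
    a1 = logsig(W1 @ a0 + b1)
    a2 = W2 @ a1 + b2                  # linear output layer
    # Step 2: propagate the sensitivities backward (Eqs. 44-45)
    s2 = -2.0 * (t - a2)               # Fdot = 1 for the linear layer
    F1 = np.diag((1.0 - a1) * a1)      # derivative of logsig: (1 - a)a
    s1 = F1 @ W2.T @ s2
    # Step 3: steepest descent update (Eqs. 46-47)
    W2 = W2 - alpha * np.outer(s2, a1)
    b2 = b2 - alpha * s2
    W1 = W1 - alpha * np.outer(s1, a0)
    b1 = b1 - alpha * s1
    return W1, b1, W2, b2
```

Repeated application of this step on a single input/target pair drives the network output toward the target.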
EXAMPLES
Consider the 1-2-1 multilayer network shown in Figure 8. The network will be used to
approximate the trigonometric function

g(p) = 1 + sin( (π/4) p ) for -2 <= p <= 2   (48)

Figure 8.
To obtain the training set, the function will be evaluated at several values of p. The
initial values of the weights and biases are chosen as follows:
W^1(0) = [ -0.27 ; -0.41 ],   b^1(0) = [ -0.48 ; -0.13 ],   W^2(0) = [ 0.09  -0.17 ],   b^2(0) = [ 0.48 ]
Figure 9. Initial network response compared with the sine wave to be approximated.
The next step is to start the algorithm. The initial input is chosen as p = 1, so a^0 = p = 1.
The output of the first layer is

a^1 = f^1(W^1 a^0 + b^1) = logsig( [ -0.27 ; -0.41 ][1] + [ -0.48 ; -0.13 ] ) = logsig( [ -0.75 ; -0.54 ] )

= [ 1/(1 + e^{0.75}) ; 1/(1 + e^{0.54}) ] = [ 0.321 ; 0.368 ]
The output of the second (linear) layer is

a^2 = f^2(W^2 a^1 + b^2) = [ 0.09  -0.17 ] [ 0.321 ; 0.368 ] + [ 0.48 ] = [ 0.446 ]
The error is then

e = t - a = ( 1 + sin( (π/4)(1) ) ) - 0.446 = 1.707 - 0.446 = 1.261
The next step of the algorithm is to backpropagate the sensitivities. To do so, the
derivatives of the transfer functions, ḟ^1(n) and ḟ^2(n), must be obtained. For
the first layer,
ḟ^1(n) = d/dn ( 1/(1 + e^{-n}) ) = e^{-n}/(1 + e^{-n})^2 = ( 1 - 1/(1 + e^{-n}) )( 1/(1 + e^{-n}) ) = (1 - a^1)(a^1)

For the second layer,

ḟ^2(n) = d/dn (n) = 1
Then the process of backpropagation can be done by considering the starting point at
the second layer, using Eq. (44):

s^2 = -2 Ḟ^2(n^2)(t - a) = -2 ḟ^2(n^2)(1.261) = -2[1](1.261) = -2.522
The sensitivity of the first layer is then calculated by backpropagating the sensitivity
from the second layer, following Eq. (45):

s^1 = Ḟ^1(n^1)(W^2)^T s^2 = [ (1 - a^1_1)(a^1_1)   0 ;  0   (1 - a^1_2)(a^1_2) ] [ 0.09 ; -0.17 ] [-2.522]

= [ (1 - 0.321)(0.321)   0 ;  0   (1 - 0.368)(0.368) ] [ 0.09 ; -0.17 ] [-2.522]

= [ 0.218   0 ;  0   0.233 ] [ -0.227 ; 0.429 ] = [ -0.0495 ; 0.0997 ]
The final stage of the algorithm is to update the weights. For simplicity, let the learning
rate α = 0.1. From Eq. (46) and Eq. (47), the weights and biases are calculated as
follows:
W^1(1) = W^1(0) - α s^1 (a^0)^T = [ -0.27 ; -0.41 ] - 0.1 [ -0.0495 ; 0.0997 ][1] = [ -0.265 ; -0.420 ],

b^1(1) = b^1(0) - α s^1 = [ -0.48 ; -0.13 ] - 0.1 [ -0.0495 ; 0.0997 ] = [ -0.475 ; -0.140 ],

W^2(1) = W^2(0) - α s^2 (a^1)^T = [ 0.09  -0.17 ] - 0.1 [-2.522][ 0.321  0.368 ] = [ 0.171  -0.0772 ],

b^2(1) = b^2(0) - α s^2 = [ 0.48 ] - 0.1 [-2.522] = [ 0.732 ]
This completes the first iteration of the backpropagation algorithm. The next step is to
choose another input p and perform another iteration of the algorithm. The iterations
are repeated until the difference between the network response and the target function
reaches some acceptable level.
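The numbers in this iteration can be reproduced with a short script, using the initial values as reconstructed above.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

# Initial parameters of the 1-2-1 network
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([-0.48, -0.13])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([0.48])

p = np.array([1.0])
t = 1.0 + np.sin(np.pi / 4 * p[0])   # target: g(1) = 1.707

# Forward propagation
a1 = logsig(W1 @ p + b1)             # [0.321, 0.368]
a2 = W2 @ a1 + b2                    # [0.446]
e = t - a2[0]                        # 1.261

# Backpropagate the sensitivities
s2 = -2.0 * e                        # -2.522
s1 = np.diag((1 - a1) * a1) @ W2.T.flatten() * s2   # [-0.0495, 0.0997]

# Update with learning rate alpha = 0.1
alpha = 0.1
W1_new = W1 - alpha * np.outer(s1, p)   # [[-0.265], [-0.420]]
b1_new = b1 - alpha * s1                # [-0.475, -0.140]
```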
USING BACKPROPAGATION
There are some issues to consider in the practical implementation of the
backpropagation algorithm, relating to the choice of network architecture,
convergence, and generalization.
Choice of Network Architecture
Multilayer networks can be used to approximate almost any function, if there are
enough neurons in the hidden layers. However, for adequate performance, some
experimentation is needed to identify a suitable number of layers and number of
neurons per layer.
Consider the following function:

g(p) = 1 + sin( (iπ/4) p ) for -2 <= p <= 2,   (49)

where i = 1, 2, 4 and 8. As i is increased, the function becomes more complex, because
more periods of the sine wave occur over the interval -2 <= p <= 2. It will be more
difficult for a neural network with a fixed number of neurons in the hidden layers to
approximate g(p) as i is increased.
For this first example, a 1-3-1 network is used, where the transfer function for the first
layer is log-sigmoid and the transfer function for the second layer is linear. This type of
two-layer network can produce a response that is a sum of three log-sigmoid functions
(or as many log-sigmoids as there are neurons in the hidden layer). Clearly there is a
limit to how complex a function this network can implement. The figure below shows
the response of the network after it has been trained to approximate g(p) for
i = 1, 2, 4, 8. The final network responses are shown.
Figure 10. Responses of the 1-3-1 network trained to approximate g(p) for i = 1, 2, 4 and 8.
For i = 4 the 1-3-1 network reaches its maximum capability. When i > 4 the network is
not capable of producing an accurate approximation of g(p). In the bottom right graph
of Figure 10 we can see how the 1-3-1 network attempts to approximate g(p) for i = 8.
The mean square error between the network response and g(p) is minimized, but the
network response is only able to match a small part of the function. In the next example
we will approach the problem from a slightly different perspective. This time we will
pick one function g(p) and then use larger and larger networks until we are able to
accurately represent the function. For g(p) we will use
g(p) = 1 + sin( (6π/4) p ) for -2 <= p <= 2.   (50)
To approximate this function we will use two-layer networks, where the transfer function
for the first layer is log-sigmoid and the transfer function for the second layer is linear
(1-S^1-1 networks).
Figure 11 illustrates the network response as the number of neurons in the first layer
(hidden layer) is increased. Unless there are at least five neurons in the hidden layer,
the network cannot accurately represent g(p).
Figure 11. Effect of increasing the number of hidden neurons: responses of 1-2-1, 1-3-1, 1-4-1 and 1-5-1 networks.
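A sketch of this experiment is given below. This is an illustrative incremental gradient descent loop, not the text's own training procedure; the learning rate, iteration count, and random initialization range are arbitrary choices.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def mse(W1, b1, W2, b2, p, t):
    # mean square error of the network over the training points
    preds = np.array([(W2 @ logsig(W1 @ np.array([pk]) + b1) + b2)[0]
                      for pk in p])
    return float(np.mean((t - preds) ** 2))

def train(S1, p, t, alpha=0.02, iters=400, seed=0):
    """Train a 1-S1-1 logsig/linear network with incremental steepest descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (S1, 1)); b1 = rng.uniform(-0.5, 0.5, S1)
    W2 = rng.uniform(-0.5, 0.5, (1, S1)); b2 = rng.uniform(-0.5, 0.5, 1)
    mse_before = mse(W1, b1, W2, b2, p, t)
    for _ in range(iters):
        for pk, tk in zip(p, t):
            a0 = np.array([pk])
            a1 = logsig(W1 @ a0 + b1)
            a2 = W2 @ a1 + b2
            s2 = -2.0 * (tk - a2)              # output-layer sensitivity
            s1 = (1 - a1) * a1 * (W2.T @ s2)   # hidden-layer sensitivity
            W2 = W2 - alpha * np.outer(s2, a1); b2 = b2 - alpha * s2
            W1 = W1 - alpha * np.outer(s1, a0); b1 = b1 - alpha * s1
    return mse_before, mse(W1, b1, W2, b2, p, t)
```

Comparing the final mean square error for S1 = 2, 3, 4, 5 reproduces the qualitative trend of Figure 11: larger hidden layers can track more of the sine wave.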
Convergence
In the previous section we presented some examples in which the network response did
not give an accurate approximation to the desired function, even though the
backpropagation algorithm produced network parameters that minimized mean square
error. This occurred because the capabilities of the network were inherently limited by
the number of hidden neurons it contained. In this section we will provide an example in
which the network is capable of approximating the function, but the learning algorithm
does not produce network parameters that give an accurate approximation.
The function that we want the network to approximate is

g(p) = 1 + sin(π p) for -2 <= p <= 2.   (51)

To approximate this function we will use a 1-3-1 network, where the transfer function
for the first layer is log-sigmoid and the transfer function for the second layer is linear.
Figure 12 illustrates a case where the learning algorithm converges to a solution that
minimizes mean square error. The thin blue lines represent intermediate iterations, and
the thick blue line represents the final solution, when the algorithm converged.
Generalization
In most cases the multilayer network is trained with a finite number of examples of
proper network behavior:

{p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}   (52)

This training set is normally representative of a much larger class of possible
input/output pairs. It is important that the network successfully generalize what it has
learned to the total population.
For example, suppose that the training set is obtained by sampling the function

g(p) = 1 + sin( (π/4) p )   (53)

at the points p = -2, -1.6, -1.2, ..., 1.6, 2. There are a total of 11 input/target pairs. In
Figure 14 we see the response of a 1-2-1