
CSE351 Artificial Intelligence

By Dr. Anwar M. Mirza Date: November 25, 2001

Lecture No. 20 Backpropagation Neural Network


VARIATIONS
Several modifications can be made to the backpropagation algorithm, which may improve its performance in some situations.

1. Alternate Weight Update Procedures


Momentum

In backpropagation with momentum, the weight change is in a direction that is a combination of the current gradient and the previous gradient. Convergence is sometimes faster if a momentum term is added to the weight update formulas. In order to use this strategy, weights from one or more previous training patterns must be saved. In the simplest form, the new weights for training step t + 1 are based on the weights at training steps t and t - 1. The weight update formulas for backpropagation with momentum are
w_jk(t+1) = w_jk(t) + α δ_k z_j + μ [w_jk(t) - w_jk(t-1)]

or, equivalently,

w_jk(t+1) = w_jk(t) + α δ_k z_j + μ Δw_jk(t)

and

v_ij(t+1) = v_ij(t) + α δ_j x_i + μ [v_ij(t) - v_ij(t-1)]

or, equivalently,

v_ij(t+1) = v_ij(t) + α δ_j x_i + μ Δv_ij(t)

where the momentum parameter μ is constrained to be in the range from 0 to 1, exclusive of the end points.
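As an illustration only (not part of the original notes), the following minimal Python sketch applies this momentum rule to one weight matrix; the names alpha, mu, grad and prev_delta are assumptions, with grad standing for the correction term δ_k z_j and prev_delta for the previous weight change Δw(t).

import numpy as np

def momentum_update(W, grad, prev_delta, alpha=0.2, mu=0.9):
    # grad: matrix of correction terms (e.g. outer(z, delta)), same shape as W
    # prev_delta: weight change applied at the previous step, W(t) - W(t-1)
    delta_W = alpha * grad + mu * prev_delta   # delta_W(t+1) = alpha*grad + mu*delta_W(t)
    return W + delta_W, delta_W                # new weights and the change to remember

The returned delta_W would be fed back in as prev_delta on the next training step.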

Batch Updating of Weights

In some cases it is advantageous to accumulate the weight correction terms for several patterns (or even an entire epoch, if there are not too many patterns) and then make a single weight adjustment (equal to the average) for each weight, rather than updating the weights after each pattern is presented. This procedure has a smoothing effect on the correction terms. In some cases, the smoothing may increase the chances of convergence.
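A minimal sketch of batch updating, assuming a hypothetical helper backward(x, t, W) that returns the weight-correction term for a single pattern (the helper and the names are illustrative, not from the notes):

import numpy as np

def batch_epoch_update(W, patterns, targets, backward, alpha=0.2):
    total = np.zeros_like(W)
    for x, t in zip(patterns, targets):
        total += backward(x, t, W)      # accumulate correction terms over the batch
    avg = total / len(patterns)         # average correction for the epoch/batch
    return W + alpha * avg              # single smoothed weight adjustment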

2. Alternate Activation Functions


Customized Sigmoid Function for Training Patterns
The binary sigmoid function
f(x) = 1 / (1 + exp(-x))

with

f'(x) = f(x)[1 - f(x)]

can be modified to cover any desired range, to be centered at any desired value of x, and to have any desired slope at its center. The binary sigmoid can have its range expanded and shifted so that it maps the real numbers into the interval [a, b] for any a and b. To do this, we define the parameters

γ = b - a
η = -a

Then the sigmoid function

g(x) = γ f(x) - η

has the desired property, i.e. its range is [a, b]. Furthermore, its derivative can also be expressed in terms of the function value, as

g'(x) = (1/γ) [η + g(x)] [γ - η - g(x)]

For example, for a problem with bipolar target output, the appropriate activation function would be
g(x) = 2 f(x) - 1

with

g'(x) = (1/2) [1 + g(x)] [1 - g(x)]
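For concreteness, here is a small Python sketch (an illustration, not the lecture's code) of the range-shifted sigmoid and of its derivative written in terms of the function value; the defaults a = -1, b = 1 reproduce the bipolar case above.

import numpy as np

def binary_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shifted_sigmoid(x, a=-1.0, b=1.0):
    gamma, eta = b - a, -a                     # gamma = b - a, eta = -a
    return gamma * binary_sigmoid(x) - eta     # g(x) = gamma*f(x) - eta, range (a, b)

def shifted_sigmoid_deriv(g, a=-1.0, b=1.0):
    gamma, eta = b - a, -a
    return (eta + g) * (gamma - eta - g) / gamma   # g'(x) from the value g = g(x)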
The steepness of the logistic sigmoid can be modified by a slope parameter σ. Thus we have the more general function

g(x) = γ f(σx) - η = γ / (1 + exp(-σx)) - η

and

g'(x) = (σ/γ) [η + g(x)] [γ - η - g(x)]

[Figure: Binary sigmoid with slope σ = 1 and σ = 2]
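The slope-parameter form can be sketched in the same way; the short illustrative check below (the defaults sigma = 2, a = 0, b = 1 are arbitrary choices) compares the derivative identity with a finite difference.

import numpy as np

def general_sigmoid(x, sigma=2.0, a=0.0, b=1.0):
    gamma, eta = b - a, -a
    return gamma / (1.0 + np.exp(-sigma * x)) - eta

def general_sigmoid_deriv(x, sigma=2.0, a=0.0, b=1.0):
    gamma, eta = b - a, -a
    g = general_sigmoid(x, sigma, a, b)
    return (sigma / gamma) * (eta + g) * (gamma - eta - g)

x, h = 0.3, 1e-6
finite_diff = (general_sigmoid(x + h) - general_sigmoid(x - h)) / (2 * h)
print(finite_diff, general_sigmoid_deriv(x))   # the two values agree closely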

Adaptive Slope for Sigmoid


The slope can be adjusted during training, in a manner similar to that used for adjusting the weights. We consider a general activation function f(x), where the net input to the output unit Y_k is taken as

x = σ_k y_in_k

and for the hidden unit Z_j as

x = σ_j z_in_j

Thus the activation function for an output unit depends both on the weights on the connections coming into the unit and on the slope σ_k for that unit; the same holds for the hidden units. Let us use the abbreviations

y_k = f(σ_k y_in_k)

and

z_j = f(σ_j z_in_j)

to simplify the notation. As in standard backpropagation, we define

δ_k = [t_k - y_k] f'(σ_k y_in_k)

and

δ_j = [Σ_k δ_k σ_k w_jk] f'(σ_j z_in_j)

The updates for the weights to the output units are

Δw_jk = α δ_k σ_k z_j

and for the weights to the hidden units are

Δv_ij = α δ_j σ_j x_i

Similarly, the updates for the slopes on the output units are

Δσ_k = α δ_k y_in_k

and for the slopes on the hidden units are

Δσ_j = α δ_j z_in_j
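Putting the pieces together, here is an illustrative Python sketch of a single training step with adaptive slopes for one hidden layer; the variable names (v, w, sigma_h, sigma_o, alpha) and the use of float NumPy arrays are assumptions made for the example.

import numpy as np

def f(x):                                   # binary sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    s = f(x)
    return s * (1.0 - s)

def train_step(x, t, v, w, sigma_h, sigma_o, alpha=0.2):
    # forward pass
    z_in = x @ v                            # net inputs to hidden units Z_j
    z = f(sigma_h * z_in)                   # z_j = f(sigma_j * z_in_j)
    y_in = z @ w                            # net inputs to output units Y_k
    y = f(sigma_o * y_in)                   # y_k = f(sigma_k * y_in_k)

    # error terms
    delta_k = (t - y) * f_prime(sigma_o * y_in)
    delta_j = (w @ (sigma_o * delta_k)) * f_prime(sigma_h * z_in)

    # weight updates: dw_jk = alpha*delta_k*sigma_k*z_j, dv_ij = alpha*delta_j*sigma_j*x_i
    w += alpha * sigma_o * np.outer(z, delta_k)
    v += alpha * sigma_h * np.outer(x, delta_j)

    # slope updates: dsigma_k = alpha*delta_k*y_in_k, dsigma_j = alpha*delta_j*z_in_j
    sigma_o += alpha * delta_k * y_in
    sigma_h += alpha * delta_j * z_in
    return v, w, sigma_h, sigma_o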

Another Sigmoid Function


The arctangent function is also used as an activation function for backpropagation nets. It saturates (approaches its asymptotic values) more slowly than the hyperbolic tangent function or bipolar sigmoid. Scaled so that the function values range between -1 and +1, the function is

f(x) = (2/π) arctan(x)

with derivative

f'(x) = (2/π) · 1 / (1 + x²)
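A minimal Python sketch of the scaled arctangent activation and its derivative (illustrative only):

import numpy as np

def arctan_activation(x):
    return (2.0 / np.pi) * np.arctan(x)        # range (-1, +1)

def arctan_activation_deriv(x):
    return (2.0 / np.pi) / (1.0 + x ** 2)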

Non-saturating Activation Functions


For some applications, where saturation is not especially beneficial, a non-saturating activation function may be used. One suitable example is:
f(x) = log(1 + x)   for x ≥ 0
f(x) = -log(1 - x)  for x < 0

Note that the derivative is continuous at x = 0, i.e.

f'(x) = 1 / (1 + x)  for x ≥ 0
f'(x) = 1 / (1 - x)  for x < 0

This function can be used in place of the sigmoid function in some applications.
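A minimal Python sketch of the logarithmic activation and its derivative (illustrative only); writing the two branches with the absolute value keeps both cases in one expression.

import numpy as np

def log_activation(x):
    # f(x) = log(1 + x) for x >= 0, -log(1 - x) for x < 0
    return np.sign(x) * np.log1p(np.abs(x))

def log_activation_deriv(x):
    # 1/(1 + x) for x >= 0 and 1/(1 - x) for x < 0, i.e. 1/(1 + |x|)
    return 1.0 / (1.0 + np.abs(x))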

Example
Fewer epochs of training are required for the XOR problem (with either the standard bipolar or the modified bipolar representation) when we use the logarithmic activation function in place of the bipolar sigmoid. The following table compares the two functions with respect to the number of epochs they require:

Problem                                            Logarithmic    Bipolar Sigmoid
Standard bipolar XOR                               144 epochs     387 epochs
Modified bipolar XOR (targets of +0.8 and -0.8)    77 epochs      267 epochs

Non-sigmoid Activation Functions

Radial basis functions, activation functions with a local field of response, are also used in backpropagation neural nets. The response of such a function is non-negative for all values of x; the response decreases to 0 as the distance of x from the center c increases. A common example is the Gaussian function:
f(x) = exp(-x²)

with derivative

f'(x) = -2x exp(-x²) = -2x f(x)

[Figure: Gaussian activation function f(x) plotted against x]
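A minimal Python sketch of the Gaussian activation and its derivative expressed through the function value (illustrative only):

import numpy as np

def gaussian(x):
    return np.exp(-x ** 2)

def gaussian_deriv(x):
    return -2.0 * x * gaussian(x)              # f'(x) = -2x*exp(-x^2) = -2x*f(x)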
