
Applicable Artificial Intelligence

Back Propagation

Academic session 2009/2010


The Delta Rule
The delta rule, also called the Least Mean Square (LMS) rule, states that:

 The output vector is compared to the desired answer. If the difference is zero, no adjustments are made; otherwise, the weights are adjusted to reduce this difference.

The adjustment, or change in the weight from unit i to unit j, is given by:

Δwij = η * oi * δj, where

η is the learning rate,
oi is the activation (output) of unit i, and
δj is the error, or difference between the expected and actual output
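For illustration (not from the original slides), here is a minimal NumPy sketch of a delta-rule update for a single linear output unit; the function and variable names are my own:

```python
import numpy as np

def delta_rule_update(weights, inputs, target, eta=0.1):
    """One delta-rule (LMS) update for a single linear output unit.

    weights : 1-D array of weights w_i
    inputs  : 1-D array of activations o_i
    target  : desired output t
    eta     : learning rate
    """
    output = np.dot(weights, inputs)          # actual output o_j
    delta = target - output                   # error signal δ_j
    weights = weights + eta * delta * inputs  # Δw_ij = η * o_i * δ_j
    return weights

# Example: one update step on a three-input unit
w = np.zeros(3)
w = delta_rule_update(w, np.array([1.0, 0.0, 1.0]), target=1.0)
```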
Limitations of Delta Rule
 It works for simple input-output networks (no hidden layers)
 However, if there are hidden layers, the delta rule does not guarantee convergence, as local minima might exist
The Generalised Delta Rule
 Patterns are repeatedly presented to the network, and the errors are used to modify the pattern of connectivity in such a way that the network's responses become more accurate
 The generalised delta rule requires that the activation function be monotonic (non-decreasing) and differentiable. As a result, most ANNs are based upon a sigmoid-shaped activation function.
Back Propagation:
Prime example of GDR
 Back propagation’s most commonly used activation function is the sigmoid:

oj = 1 / (1 + e^(-netj))
netj = ∑i wij oi + θj

where θj is the weight from a bias unit (output = 1, always on)

 Some important properties:
If netj = 0, oj = 0.5 (“undecided”)
netj must tend to ±∞ to push oj to 1 or 0
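A minimal Python sketch of this unit computation; the function names are illustrative, not from the slides:

```python
import math

def net_input(weights, activations, bias_weight):
    """net_j = sum_i w_ij * o_i + theta_j (the bias unit always outputs 1)."""
    return sum(w * o for w, o in zip(weights, activations)) + bias_weight

def sigmoid(net):
    """o_j = 1 / (1 + e^(-net_j))."""
    return 1.0 / (1.0 + math.exp(-net))

# Property check: net_j = 0 gives the "undecided" output 0.5
print(sigmoid(0.0))     # 0.5
print(sigmoid(10.0))    # close to 1 only when net_j is large and positive
print(sigmoid(-10.0))   # close to 0 only when net_j is large and negative
```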
XOR:
Trained Network Example

n1   n2   n4 (desired)   n4 (actual)
1    0    1              0.91
0    0    0              0.08
0    1    1              0.91
1    1    0              0.10
XOR Example:
Training Starts with Weights at 0
XOR (showing Sum and AF)
Computing the Weights
 Back propagation formulas (for an output unit k):
(1) δk = (tk – ok) f ’(netk)
(2) f ’(netk) = ok (1 – ok)
(3) Δwjk = η δk oj
where:
δk is the error signal
tk is the target value
ok is the actual activation value
f ’(netk) is the derivative of the activation function
Δwjk is the change in the weight from unit j (lower layer) to unit k
η is a small constant that controls the learning rate
XOR example:
Weights in the Output Layer
 Choosing η = 0.1, and recalling that with all weights initially at 0 every unit outputs sigmoid(0) = 0.5 (here for the input pattern (1, 0) with target 1), we have:
δk = (tk – ok) f ’(netk)
δn4 = (tn4 – on4) on4 (1 – on4)
δn4 = (1 – 0.5) x 0.5 x (1 – 0.5) = 0.125,
 And the weight changes:
Δwjk = ηδkoj
Δwn1n4 = 0.1 x 0.125 x 1 = 0.0125,
Δwb4n4 = 0.1 x 0.125 x 1 = 0.0125,
Δwn3n4 = 0.1 x 0.125 x 0.5 = 0.00625,
Δwn2n4 = 0.1 x 0.125 x 0 = 0
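These numbers can be checked with a few lines of Python (the variable names are mine, purely illustrative):

```python
eta = 0.1

# Activations for input pattern (1, 0) with all weights at 0:
o_n1, o_n2 = 1.0, 0.0          # input units
o_n3 = 0.5                     # hidden unit: sigmoid(0) = 0.5
o_n4 = 0.5                     # output unit: sigmoid(0) = 0.5
t_n4 = 1.0                     # desired output

# delta_k = (t_k - o_k) * o_k * (1 - o_k)
delta_n4 = (t_n4 - o_n4) * o_n4 * (1 - o_n4)   # 0.125

# Delta w_jk = eta * delta_k * o_j
dw_n1_n4 = eta * delta_n4 * o_n1               # 0.0125
dw_b4_n4 = eta * delta_n4 * 1.0                # bias unit outputs 1 -> 0.0125
dw_n3_n4 = eta * delta_n4 * o_n3               # 0.00625
dw_n2_n4 = eta * delta_n4 * o_n2               # 0.0
```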
Weights in Hidden Layers
 Now δk is the error at the layer above (the output layer in the XOR example); then for a hidden layer j:
(1) δj = f ’(netj) ∑k δk wjk
(2) f ’(netj) = oj (1 – oj) [thus δj = oj (1 – oj) ∑k δk wjk]
(3) Δwij = η δj oi

Note: the current layer is j, the layer above is k, and the layer below is i


XOR example:
Weights in the Hidden Layer
 Recall that the error for n4 is 0.125 and the weight between n3 and n4 is still 0; δn3 for the hidden unit n3 is then:
δj = oj (1 – oj) ∑k δk wjk
δn3 = on3 (1 – on3) δn4 wn3n4
δn3 = 0.5 x (1 – 0.5) x 0.125 x 0 = 0,
 And the weight changes will be:
Δwn1n3 = 0.1 x 0 x 1 = 0, (Δwij = η δj oi)
Δwn2n3 = 0.1 x 0 x 0 = 0,
Δwb3n3 = 0.1 x 0 x 1 = 0
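And the corresponding check for the hidden unit (again with purely illustrative variable names):

```python
eta = 0.1

o_n3 = 0.5                     # hidden unit output, sigmoid(0)
delta_n4 = 0.125               # error signal from the output layer
w_n3_n4 = 0.0                  # weight n3 -> n4, still at its initial value

# delta_j = o_j * (1 - o_j) * sum_k delta_k * w_jk
delta_n3 = o_n3 * (1 - o_n3) * delta_n4 * w_n3_n4   # 0.0

# Delta w_ij = eta * delta_j * o_i  (inputs were n1 = 1, n2 = 0, bias = 1)
dw_n1_n3 = eta * delta_n3 * 1.0    # 0.0
dw_n2_n3 = eta * delta_n3 * 0.0    # 0.0
dw_b3_n3 = eta * delta_n3 * 1.0    # 0.0
```

So on the very first presentation only the weights into the output unit move; the hidden-layer weights begin to change only once wn3n4 becomes non-zero.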
Changing the Weights
 All the network's weights can then be modified by these amounts at each iteration according to:

wjk ← wjk + Δwjk

 The above is called the “online” or continuous update method
 On large problems, a “batch” or periodic update method might be more appropriate (both are sketched below)
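A sketch of the difference between the two schemes, assuming a hypothetical helper delta_w(w, pattern) that returns the weight changes computed for one pattern:

```python
def online_update(w, patterns, delta_w):
    """Online (continuous) update: apply the changes after every pattern."""
    for p in patterns:
        w = w + delta_w(w, p)
    return w

def batch_update(w, patterns, delta_w):
    """Batch (periodic) update: accumulate the changes over all patterns, then apply once."""
    total = sum(delta_w(w, p) for p in patterns)
    return w + total
```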
Speeding Up Back Propagation
 The XOR problem (see the trained-network table above) requires:
 25,496 iterations for η = 0.1
 3,172 iterations for η = 0.5
 1,381 iterations for η = 1.0
 A trick: add a fraction of the previous weight change (a momentum term, sketched below):
Δwoptimal = Δw + α Δw(i-1)
where typically:
α is about 0.9
η is 1/N or 2/N for a problem with N patterns
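A minimal sketch of this momentum trick; the function name and the example Δw values are assumptions for illustration only:

```python
import numpy as np

def momentum_step(w, delta_w, prev_delta_w, alpha=0.9):
    """Apply the current weight change plus a fraction alpha of the previous one."""
    step = delta_w + alpha * prev_delta_w
    return w + step, step          # the returned step becomes prev_delta_w next time

# Usage over successive iterations:
w = np.zeros(4)
prev = np.zeros_like(w)
for delta_w in [np.array([0.0125, 0.0, 0.00625, 0.0125])] * 3:  # illustrative Δw values
    w, prev = momentum_step(w, delta_w, prev)
```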
Recap on Training Issues
Methodology for training on a particular problem
 Initialise weights in a random manner
 Feed the network with a set of numbers (e.g. between 0 and 1), and
estimate the error at the output layer
 Change the weights in order to obtain a better result (learning)
 Feed the input layer with other examples and continue adjusting
weights, until eventually the desired output is achieved for each
example
 The entire set of training examples must be shown to the network
many times in order to get a satisfactory result
 After all of this training, the network is hopefully able to solve our problem - its 'knowledge' about the problem domain is stored in all the different connection weights (a compact end-to-end training sketch follows below)
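Below is a compact, illustrative end-to-end sketch of this methodology for the XOR network used earlier (inputs n1 and n2, one hidden unit n3, output n4 with direct input-to-output connections). The random initialisation, learning rate and epoch count are my own assumptions, and, as noted earlier, convergence is not guaranteed: a run may settle in a local minimum, in which case a different seed or learning rate is needed.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR training set: (n1, n2) -> desired output of n4
patterns = [((1, 0), 1), ((0, 0), 0), ((0, 1), 1), ((1, 1), 0)]

# Weights: hidden unit n3, and output unit n4 (which also sees the inputs directly)
random.seed(0)
w = {k: random.uniform(-0.5, 0.5)
     for k in ["n1n3", "n2n3", "b3n3", "n1n4", "n2n4", "n3n4", "b4n4"]}
eta = 0.5

for epoch in range(10000):
    for (o1, o2), t in patterns:
        # Forward pass
        o3 = sigmoid(w["n1n3"] * o1 + w["n2n3"] * o2 + w["b3n3"])
        o4 = sigmoid(w["n1n4"] * o1 + w["n2n4"] * o2 + w["n3n4"] * o3 + w["b4n4"])
        # Backward pass: error signals
        d4 = (t - o4) * o4 * (1 - o4)
        d3 = o3 * (1 - o3) * d4 * w["n3n4"]
        # Online weight updates: delta_w = eta * delta * o_lower (bias units output 1)
        w["n1n4"] += eta * d4 * o1; w["n2n4"] += eta * d4 * o2
        w["n3n4"] += eta * d4 * o3; w["b4n4"] += eta * d4 * 1
        w["n1n3"] += eta * d3 * o1; w["n2n3"] += eta * d3 * o2
        w["b3n3"] += eta * d3 * 1

# Inspect the trained responses
for (o1, o2), t in patterns:
    o3 = sigmoid(w["n1n3"] * o1 + w["n2n3"] * o2 + w["b3n3"])
    o4 = sigmoid(w["n1n4"] * o1 + w["n2n4"] * o2 + w["n3n4"] * o3 + w["b4n4"])
    print((o1, o2), "desired:", t, "actual:", round(o4, 2))
```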
Recap on
Preparation of Training Data
 A neural network needs to see several examples of matching input/output pairs; it cannot infer the characteristics of the input data from only one example
 The training procedure must compile a wide range of examples
which exhibit all the different characteristics we are interested in.
The selected examples must contain features that are common to
all input data
 Prior to training, it may be advisable to add some noise or other
randomness. This tends to produce a more reliable network, as it
accounts for natural variability in the data
 With the standard sigmoid activation function, note that the desired output must never be set to exactly 0 or 1! Setting the output to, say, 0.9 allows faster convergence
 Poor training data inevitably leads to an unreliable and unpredictable network
Other Aspects of
Data Analysis/Preparation
Stages in pattern recognition
 Formulation of the problem: understanding and planning
 Data collection: measures and procedure details (ground truth)
 Initial examination of data: get a feel for the structure
 Feature selection/extraction: linear and non-linear transformation of
original data set
 Clustering or pattern classification: exploratory data analysis
 Apply discrimination or regression procedures as appropriate
 Assessment of results: may need an independent test set
 Interpretation
 The above is necessarily an iterative process --> the analysis may pose further hypotheses that require further data collection.
Initial Examination of Data
One of the most important parts of the data analysis cycle, it comprises three parts:

 Checking the quality of the data
 Data degradation: errors, outliers, missing observations
 Errors: recording equipment, transcription, or even deliberate
 Outlier: an observation that appears inconsistent with the rest of the data
 Missing values: it is important to know how and why they occur
 Calculating summary statistics (a small code sketch follows this list)
 The most widely used measures are the sample mean and standard deviation
 Producing plots of the data in order to get a feel for its structure
 Provide insight into the nature of multivariate data
 Useful to detect clusters in the data, suggest transformations, etc.
 E.g. weathervane plots, box plots, scatter plots, Andrews curves, Chernoff faces, individual variables and scatter plots of pair-wise combinations, etc.
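As a small, generic illustration of the summary-statistics step (a NumPy sketch with made-up data, not part of the original slides):

```python
import numpy as np

# Hypothetical data matrix: rows are observations, columns are variables
data = np.array([[1.2, 0.4], [1.1, 0.5], [0.9, 0.6], [5.0, 0.5]])

mean = data.mean(axis=0)          # sample mean of each variable
std = data.std(axis=0, ddof=1)    # sample standard deviation of each variable

# A crude outlier check: observations more than 2 standard deviations from the mean
z = np.abs(data - mean) / std
print(mean, std)
print(np.where(z > 2))            # indices of potentially inconsistent observations
```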
Data Analysis
Some Further Issues:

 In engineering, we seek a design that is optimal for the expected operating conditions
 But we are given a finite design set: the problems of over-fitting and incorrect modelling
 In practice, we do not know what is structure and what is noise in the data
 Performance is usually judged by the error rate, although this measure has limitations
 Finally, it is assumed that the training data is representative of the operating conditions. Is it really?
