Lecture 5-6
Dr. Robi Polikar
Feedforward Neural Networks
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Neural Networks
Neural Network, in computer science, highly interconnected network of information-
processing elements that mimics the connectivity and functioning of the human brain. Neural
networks address problems that are often difficult for traditional computers to solve, such as
speech and pattern recognition. They also provide some insight into the way the human brain
works. One of the most significant strengths of neural networks is their ability to learn from a
limited set of examples.
© Encarta, 1993-2007 Microsoft Corporation. All rights reserved.
In computer science and related fields, artificial neural networks are models inspired by
animal central nervous systems (in particular the brain) that are capable of machine
learning and pattern recognition. They are usually presented as systems of interconnected
"neurons" that can compute values from inputs by feeding information through the network.
For example, in a neural network for handwriting recognition, a set of input neurons may be
activated by the pixels of an input image representing a letter or digit. The activations of these
neurons are then passed on, weighted and transformed by some function determined by the
network's designer, to other neurons, etc., until finally an output neuron is activated that
determines which character was read.
Like other machine learning methods, neural networks have been used to solve a wide variety
of tasks that are hard to solve using ordinary rule-based programming, including computer
vision and speech recognition.
http://en.wikipedia.org/wiki/Artificial_neural_network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Neural Networks
Physiological Origins
[Figure: a feedforward network with d input nodes x1 … xd (i = 1,…,d), H hidden-layer nodes with outputs y_j (j = 1,…,H), and c output nodes z1 … zc (k = 1,…,c); w_ji are the input-to-hidden weights and w_kj the hidden-to-output weights.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
The Mathematical origins
PR 1940s
McCulloch & Pitts, 1943
Warren McCulloch (a psychiatrist and neuroanatomist) and
Walter Pitts (mathematician) devised the first computational
model of neurons
• A McCulloch-Pitts neuron fires if the sum of its excitatory inputs exceeds a threshold, provided that the neuron does not receive an inhibitory input. They showed that such a network of neurons can construct any logical function.

$w_{ij}^{new} = w_{ij}^{old} + \eta\, p_i\, a_j$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Origins
1950s
Frank Rosenblatt, 1958 (1928 – 1971)
Founder of the perceptron model – a single neuron with adjustable weights and a threshold activation.
• Proved that if two classes are linearly separable, the learning algorithm for the perceptron (the perceptron rule) will converge.

[Figure: a single perceptron J with inputs x1 … xd and weights w_Ji, computing $f\!\left(net_J\right)$ with $net_J = \sum_{i=1}^{d} w_{Ji}\, x_i$.]

Widrow (1929 – ) & Hoff (1937 – )
[Photos: Bernard Widrow and Ted Hoff (with Nik Kasabov). All rights reserved, Robi Polikar, IJCNN 2009, Atlanta, GA © 2009]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Origins
1960s and 70s
The perceptron has shown itself worthy of study despite (and even because of !) its severe limitations. It has
many features to attract attention: its linearity, its intriguing learning theorem; its clear paradigmatic
simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over
to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or
reject) our intuitive judgment that the extension to multilayer systems is sterile.
Minsky and Papert, 1969
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The origins
1980s
S. Grossberg and G. Carpenter, 1980
Grossberg and Carpenter (a couple, and among the very few still studying ANNs at the time) described a new unsupervised algorithm, Adaptive Resonance Theory (ART), which was later extended into a supervised algorithm, ARTMAP. Several variations of ARTMAP have been developed since then.
John Hopfield, 1982
He developed the Hopfield network, the first example
of recurrent networks, used as an associative memory.
Teuvo Kohonen, 1982
Credited with the development of the unsupervised training algorithm known as self-organizing maps (SOMs)
A. Barto, R.S. Sutton and C. Anderson, 1983
Popularized reinforcement learning
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feedforward
Networks
D. Rumelhart, Hinton and Williams, 1986
Developed a simple, yet elegant algorithm for training multilayer
networks, learning nonlinearly separable decision boundaries, through
back propagation of errors, a generalization of the LMS algorithm.
Paul Werbos, 1974
Widely recognized as the original inventor of the backpropagation
algorithm in his Ph.D. thesis (Harvard) in 1974. Currently at NSF
Dave Broomhead and David Lowe, 1988
John Moody and Christian Darken, 1989
Radial Basis Function (RBF)
Neural networks
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Kernel Methods
Support Vector Machines
Vladimir Vapnik and Alexey Chervonenkis, 1962
VC dimension of classifiers
Statistical learning theory
Linear SVMs (coming soon)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Bagging, Boosting,
Ensemble Systems
Leo Breiman (1928-2005), 1994
Bagging – one of the original ensemble of
classifiers algorithms based on random sampling
with replacement.
Later, random forests – a clever name for an ensemble-of-decision-trees algorithm.
Robert Schapire and Yoav Freund, 1995
Hedge, boosting, AdaBoost (coming soon!)
Arguably one of the most influential algorithms in
recent history of machine learning.
Tin Kam Ho, 1998
Random subspace methods
Ludmila Kuncheva, 2004
Classifier fusion
Gavin Brown, 2004
Feature selection, diversity in ensemble systems
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other notable People
Michael I. Jordan, Zoubin Ghahramani, Daphne Koller
Bayesian networks, mixture of experts
Expectation – maximization (EM) algorithm
Graphical methods
David Wolpert
No free lunch theorem, 1997
Stacked generalization, 1992
Nitesh Chawla
Learning from unbalanced data
SMOTE (Synthetic Minority Oversampling
TEchnique)
Geoffrey Hinton
Deep neural networks / deep learning 2000
Dropout method, 2012
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
The XOR Problem
[Figure: the four XOR patterns in the (x1, x2) plane; the two classes cannot be separated by a single line.]

Input x    Output
(0,0)      0
(0,1)      1
(1,0)      1
(1,1)      0

No linear function can separate the two classes of the XOR problem!
(no set of weights w0, w1, w2 will satisfy all of the following constraints)

$x_1 = 0,\ x_2 = 0,\ y = 0 \;\Rightarrow\; w_0 \le 0$
$x_1 = 0,\ x_2 = 1,\ y = 1 \;\Rightarrow\; w_2 + w_0 > 0$
$x_1 = 1,\ x_2 = 0,\ y = 1 \;\Rightarrow\; w_1 + w_0 > 0$
$x_1 = 1,\ x_2 = 1,\ y = 0 \;\Rightarrow\; w_1 + w_2 + w_0 \le 0$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
But a multilayered network can. Start with the AND function:

$y = \begin{cases} 1, & x_1 + x_2 - 1.5 > 0 \\ 0, & x_1 + x_2 - 1.5 < 0 \end{cases}$

[Figure: a single threshold unit implementing the AND function.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

[Figure: a two-layer network implementing XOR – two hidden AND units compute z1 = x1 AND ~x2 and z2 = ~x1 AND x2, and an output OR unit combines them.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The ANN Training Cycle
Stage 1: Network Training – present examples, indicate the desired outputs, and determine the synaptic weights (the network's "knowledge") of an ANN whose weights are initially undetermined.
Stage 2: Network Testing – apply the trained network to new inputs.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Universal Approximator
Classification can be thought of as a special case of function approximation:
For a three class problem:
[Figure: a classifier maps the input x = (x1, …, xd) to one of three outputs — Class 1: [1 0 0], Class 2: [0 1 0], Class 3: [0 0 1].]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Multilayer Perceptron
Architecture
• What truly separates an MLP from a regular simple perceptron is the nonlinear threshold function f, also known as the activation function. If a linear thresholding function is used, the MLP can be replaced with a series of simple perceptrons, which can then only solve linearly separable problems.

[Figure: a fully connected network with d input nodes (i = 1,…,d), H hidden-layer nodes (j = 1,…,H), and c output nodes (k = 1,…,c); w_ji are the input-to-hidden weights and w_kj the hidden-to-output weights.]

$y_j = f\!\left(net_j\right) = f\!\left(\sum_{i=1}^{d} w_{ji}\, x_i\right)$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Computing Node Values
[Figure: hidden node J receives the inputs x1 … xd through the weights w_Ji; output node K receives the hidden outputs y1 … yH through the weights w_Kj.]

$y_J = f\!\left(net_J\right) = f\!\left(\sum_{i=1}^{d} w_{Ji}\, x_i\right)$

$z_K = f\!\left(net_K\right) = f\!\left(\sum_{j=1}^{H} w_{Kj}\, y_j\right)$
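As a minimal sketch (not from the original slides), these two node-value computations can be written in a few lines of MATLAB; all variable names and sizes below are illustrative assumptions.

d = 4; H = 5; c = 3;                 % layer sizes (illustrative)
x   = rand(d, 1);                    % one input pattern (column vector)
Wih = 0.1 * randn(H, d);             % input-to-hidden weights w_ji
Who = 0.1 * randn(c, H);             % hidden-to-output weights w_kj
f   = @(net) 1 ./ (1 + exp(-net));   % logistic sigmoid activation

netj = Wih * x;                      % net_j = sum_i w_ji * x_i
y    = f(netj);                      % hidden-layer outputs y_j
netk = Who * y;                      % net_k = sum_j w_kj * y_j
z    = f(netk);                      % network outputs z_k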
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Backpropagation
Learning Rule
The weights are determined through gradient-descent minimization of the criterion function J(w), where t denotes the target (desired) outputs and z the actual network outputs:

Training error: $J(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{c}\left(t_k - z_k\right)^2 = \frac{1}{2}\|\mathbf{t}-\mathbf{z}\|^2, \qquad z_k = f\!\left(net_k\right) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$

$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta\mathbf{w}(t) \;\Rightarrow\; \Delta\mathbf{w} = -\eta\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$

We need to express J(w) in terms of w for both the output and hidden layer nodes. Output nodes are easy, since we know the functional dependence of J on w, through the chain rule:

$\frac{\partial J(\mathbf{w})}{\partial w_{kj}} = \frac{\partial J(\mathbf{w})}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = \frac{\partial J(\mathbf{w})}{\partial z_k}\cdot\frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = -\left(t_k - z_k\right) f'\!\left(net_k\right) y_j = -\delta_k\, y_j$

where $\delta_k$ is the output node sensitivity. The weight update is then

$\Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\left(t_k - z_k\right) f'\!\left(net_k\right) y_j$

For the logistic sigmoid, $f'(x) = f(x)\bigl(1 - f(x)\bigr)$, so

$\Delta w_{kj} = \eta\left(t_k - z_k\right) f\!\left(net_k\right)\bigl(1 - f\!\left(net_k\right)\bigr)\, y_j$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Backpropagation
Learning Rule
For the hidden layer, things are a little bit more complicated, since we do not know the
desired values of the hidden layer node outputs. However, by the appropriate use of the
chain rule, we obtain:

$\frac{\partial J(\mathbf{w})}{\partial w_{ji}} = \frac{\partial J(\mathbf{w})}{\partial y_j}\cdot\frac{\partial y_j}{\partial net_j}\cdot\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial J(\mathbf{w})}{\partial y_j}\cdot f'\!\left(net_j\right)\cdot x_i$

$\frac{\partial J(\mathbf{w})}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}\left(t_k - z_k\right)^2\right] = -\sum_{k=1}^{c}\left(t_k - z_k\right)\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}\left(t_k - z_k\right) f'\!\left(net_k\right) w_{kj}$

since $z_k = f\!\left(net_k\right) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$ implies $\frac{\partial z_k}{\partial y_j} = \frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial y_j} = f'\!\left(net_k\right) w_{kj}$.

Collecting terms gives the hidden-layer node sensitivity $\delta_j = f'\!\left(net_j\right)\sum_{k=1}^{c} w_{kj}\,\delta_k$.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP/BP
The weight update rule is then:

$\Delta w_{ji} = \eta\,\delta_j\, x_i \quad$ for the hidden-layer weights

In each case, the parameter δ represents the sensitivity of the criterion function (error) with respect to the activation of the hidden / output layer node. The sensitivity of a hidden-layer node is a weighted sum of the output sensitivities, scaled by $f'(net_j)$, where the weights are the hidden-to-output layer weights; the output sensitivities themselves are the errors at the output level, scaled by $f'(net_k)$.

The algorithm takes its name – backpropagation – from the fact that during training, the errors are propagated back, from the output to the hidden layer!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Credit Assignment Problem
Hidden nodes themselves do not make errors: they just contribute to the errors of the output nodes. The amount contributed is indicated by the sensitivities:

$\delta_k = \left(t_k - z_k\right) f'\!\left(net_k\right)$

$\delta_j = f'\!\left(net_j\right)\sum_{k=1}^{c} \delta_k\, w_{kj}$
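A minimal MATLAB sketch (not from the slides) of these sensitivity computations and the resulting weight updates, reusing the variables from the forward-pass sketch shown earlier; the target vector t and learning rate eta are illustrative.

t   = [1; 0; 0];                         % assumed target vector
eta = 0.1;                               % assumed learning rate
fp  = @(net) f(net) .* (1 - f(net));     % derivative of the logistic sigmoid

deltaK = (t - z) .* fp(netk);            % output node sensitivities delta_k
deltaJ = (Who' * deltaK) .* fp(netj);    % hidden node sensitivities delta_j

Who = Who + eta * deltaK * y';           % delta w_kj = eta * delta_k * y_j
Wih = Wih + eta * deltaJ * x';           % delta w_ji = eta * delta_j * x_i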
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Generalizations of MLP/BP
We have seen the BP for a specific two layer MLP. However, with some
notational and bookkeeping effort, the BP learning rule can be easily
generalized to the following cases…
Input units including bias units – just add one more input node with 𝑥0 = 1
Input units connected directly to the output units (as well as hidden nodes)
There are more than two layers → deep neural networks
There are different nonlinearities for each layer
Each unit has its own nonlinearity
Each unit has a different learning rate
The output is a continuous value (i.e., regression problem)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Protocols
Three major training protocols:
Stochastic Learning: Instances are drawn randomly from the training data, and
the weights are updated for each chosen instance.
Batch Learning: The entire training dataset is shown to the network before the weights are updated. Each such presentation of the entire training dataset to the network is called an epoch. In this case, the error for each pattern, J_p, is computed and summed before the weights are updated. This is the recommended mode of training an MLP (see the sketch after this list).
Online Learning: Instances are drawn consecutively from the training data, and
the weights are updated for each instance. This can be sensitive to the order in
which the data are presented.
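A sketch of one epoch of batch learning (assumptions: X is a d x N input matrix, T a c x N target matrix, and f, fp, Wih, Who, eta are as in the earlier sketches); the per-pattern gradients are accumulated and the weights are updated once per epoch.

N = size(X, 2);
dWih = zeros(size(Wih));  dWho = zeros(size(Who));
for n = 1:N
    x = X(:,n);  t = T(:,n);
    y = f(Wih * x);   z = f(Who * y);       % forward pass for one pattern
    dk = (t - z) .* fp(Who * y);            % output sensitivities
    dj = (Who' * dk) .* fp(Wih * x);        % hidden sensitivities
    dWho = dWho + dk * y';                  % accumulate the gradients
    dWih = dWih + dj * x';
end
Who = Who + eta * dWho;                     % a single update per epoch
Wih = Wih + eta * dWih;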
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP- Batch Learning
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Datasets used in MLP/BP
Typically, three sets of data are used in MLP training and testing, all of
which are accompanied by their corresponding correct class information
Training data: This is the data on which the gradient descent is
performed. That is, the training is done on this data.
Validation data: A second dataset, which is not used for training; rather, it is used to determine when the training should stop.
Test data: This is the dataset with which we assess the generalization performance of the network.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP Bayes
With a sufficient number of hidden-layer nodes, the MLP can approximate any function, and hence can solve any nonlinearly separable classification problem.
In fact, with a sufficient number of hidden nodes, along with plenty of data, it can be shown that the network outputs represent the posterior probabilities of the classes (Richard & Lippmann, 1991).
Outputs can, moreover, be forced to represent probabilities:
Use an exponential activation function at the output layer, $f(net) = e^{net}$
Use 0 – 1 target vectors
Normalize the outputs according to

$z_k = \dfrac{e^{net_k}}{\sum_{i=1}^{c} e^{net_i}}$
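As a small illustration (not part of the original slides), the softmax normalization above can be computed directly in MATLAB; netk is an assumed vector of output-layer activations.

netk = [2.0; 0.5; -1.0];               % assumed output-layer net activations
zk   = exp(netk) ./ sum(exp(netk));    % z_k = e^{net_k} / sum_i e^{net_i}, sums to 1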
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Improving the Backpropagation
Practical Considerations
Activation Function
Input Normalization
Target Values
Number of Hidden Units
Initializing Weights
Learning Rates
Momentum
Stopping Criteria
Regularization
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Practical Considerations
PR Activation Function
Desirable properties of an activation function
Nonlinearity – gives the multilayer networks the power of generating
nonlinear decision boundaries;
Saturation for classification problems – so that the outputs can be limited
between some minimum and maximum limits (-1 and 1, or 0 and 1). Not
necessary for regression (function approximation problems)
Continuity and smoothness – so that we can take its derivative
Monotonicity – so that the activation function itself does not introduce additional local minima
Linearity for small values of 𝑛𝑒𝑡, to preserve the properties of linear
discriminant functions
A function that satisfies all of the above is
….(drum roll)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Sigmoid
Activation Function
Logarithmic sigmoid: $f(net) = \dfrac{1}{1 + e^{-\beta\, net}}$

[Figure: logistic sigmoid curves for β = 0.75, 0.5, 0.25, and 0.1; larger β gives a steeper transition.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Sigmoid
Activation Function
Tangential sigmoid: $f(net) = \tanh(\beta\, net) = \dfrac{2}{1 + e^{-2\beta\, net}} - 1$

[Figure: hyperbolic-tangent sigmoid curves for β = 0.75, 0.5, 0.25, and 0.1; larger β gives a steeper transition.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Normalization
For stability reasons, individual features need to be in the same ballpark. If two features differ by orders of magnitude in their values, the network cannot learn effectively.
If the mean and variance of each feature is made zero and one, respectively, this is
called standardization. This makes sense, if the features are uncorrelated.
If the relative amplitudes of the features need to be conserved – necessary when the features are related – then a relative-to-maximum or relative-to-2-norm normalization should be used!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Preprocessing &
Normalization in Matlab
Matlab function mapminmax() scales inputs and targets into [-1 1] range
[Y,PS] = mapminmax(X, YMIN,YMAX) processes X by normalizing the minimum and maximum
values of each row to [YMIN, YMAX]. The default values for YMIN and YMAX are
-1 and +1, respectively. PS is a struct of processing settings that then allows using the exact same
normalization to some other input Z through Y=mapminmax ('apply',Z, PS).
X = mapminmax('reverse', Y, PS) returns X, given Y and settings PS.
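A brief usage sketch (assuming X is a training input matrix and Z is new data with the same number of rows):

[Xn, PS] = mapminmax(X);               % scale each row of X to [-1, 1]
Zn = mapminmax('apply', Z, PS);        % apply the identical scaling to Z
X2 = mapminmax('reverse', Xn, PS);     % invert the scaling, recovering X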
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Target Values
Typically, two output encoding protocols are used for classification problems
From a practical point of view, it is recommended that the asymptotic values not
be used. That is, use 0.05 instead of 0, and 0.95 instead of 1.
This is because the slope of the activation function (that is, the gradient, which is proportional to Δw) approaches zero at extreme values of the input, which significantly slows down the training.
For regression (function approximation) problems, actual values are used along
with the linear activation function (‘purelin’ in Matlab) .
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Number of Hidden Units
Although the numbers of input and output layer nodes are fixed (the number of features and the number of classes, respectively), the number of hidden layer nodes, H, is a user-selectable parameter.
H defines the expressive power of the network. Typically, larger H results in a network that can
solve more complicated problems. However,
An excessive number of hidden nodes causes overfitting. This is a phenomenon where the training error can be made arbitrarily small, but the network performance on the test data is poor → poor generalization performance… No good!
Too few hidden nodes may not be able to solve more complicated problems… No good!
There is no formal procedure to determine H. Typically, it is determined by a combination of previous expertise, the amount of data available, dimensionality, the complexity of the problem, and trial and error.
A common rule of thumb is to choose H such that the total number of weights remains less than N/10, where N is the total number of training data points available.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Initializing the Weights
To promote uniform learning where all classes are learned approximately at
the same time, the weights must be initialized carefully.
A typical rule of thumb is to randomly choose the weights from a uniform
distribution according to the following limits:
A proper way to select the learning rate involves computing the second derivative of the criterion function with respect to each weight, and taking the inverse of this derivative as the learning rate:

$\eta_{opt} = \left(\dfrac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2}\right)^{-1}$

This dynamic learning rate is, however, computationally expensive. A good starting point is η = 0.1.
MATLAB uses an alternative dynamic learning-rate update scheme:
If $error_{new} > k \cdot error_{old}$, discard the current weight update and set $\eta = a_1 \cdot \eta$
If $error_{new} < error_{old}$, keep the current weight update and set $\eta = a_2 \cdot \eta$
Typical values: $k = 1.04,\; a_1 = 0.7,\; a_2 = 1.05$
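A sketch of this adaptive scheme in MATLAB-like form (variable names such as errorNew, errorOld, W, Wold are illustrative assumptions):

k = 1.04; a1 = 0.7; a2 = 1.05;
if errorNew > k * errorOld
    W   = Wold;           % discard the current weight update
    eta = a1 * eta;       % shrink the learning rate
elseif errorNew < errorOld
    eta = a2 * eta;       % keep the update and grow the learning rate
end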
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Problem of Local minima
[Figure: an error surface J(w) with multiple local minima; gradient descent can get trapped in a local minimum.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Momentum
If there are small plateaus in the error surface, then backpropagation can take a long time,
or even get stuck in small local minima. In order to prevent this, a momentum term is
added, which incorporates the speed at which the weights are learned. This is loosely
related to the momentum in physics – a moving object keeps moving unless prevented by
outside forces.
The momentum term simply makes the following change to the weight update rule, where α is the momentum constant:

$\mathbf{w}^{t+1} = \mathbf{w}^{t} + (1-\alpha)\,\Delta\mathbf{w}^{t} + \alpha\,\Delta\mathbf{w}^{t-1}$

If α = 0, this is the same as regular backpropagation, where the weight update is determined purely by the gradient descent.
If α = 1, the gradient descent is completely ignored, and the update is based on the 'momentum', the previous weight update. The weight update continues along the direction in which it was moving previously.
A typical value for α is between 0.5 and 0.95.
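A short sketch of the update with momentum (dW is the current gradient-descent step and dWprev the previous update; names are illustrative):

alpha  = 0.9;                                 % momentum constant
dWnew  = (1 - alpha) * dW + alpha * dWprev;   % blended update
W      = W + dWnew;
dWprev = dWnew;                               % remember for the next step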
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Batch Learning with Momentum
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Stopping Criterion
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Regularization
Regularization is the smoothing of the error curve so that the optimum
solution can be found more effectively.
One such technique is weight decay, which prevents the weights from growing too large:

$\mathbf{w}^{t+1} = (1-\varepsilon)\,\mathbf{w}^{t}, \qquad 0 < \varepsilon < 1$

The weights that do not contribute to reducing the criterion function will eventually shrink to zero; they can then be eliminated altogether. Weights that do contribute to J will not decay, however, as they keep being updated.
This is effectively equivalent to using the following criterion function with no separate weight decay, where the second term is the regularization term:

$J_{ef} = J(\mathbf{w}) + \dfrac{2\varepsilon}{\eta}\,\mathbf{w}^{T}\mathbf{w}$
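A sketch of weight decay applied after an ordinary update (epsilon and the other names are illustrative):

epsilon = 1e-4;
W = W + eta * dW;         % the usual gradient-based update
W = (1 - epsilon) * W;    % decay step: w_new = (1 - eps) * w_old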
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feedforward
networks in Matlab
The process for training / testing / using neural networks in Matlab includes
the following steps:
Collect data (duh!)
Create the network (duh!)2
Configure the network
Initialize the weights and biases
Train the network
Validate and use the network
The main functions to use an MLP in Matlab are:
feedforwardnet, which creates the network architecture, 𝑛𝑒𝑡
configure, which sets up the parameters of the 𝑛𝑒𝑡
train, which configures and trains the network
sim, which simulates the network by computing its outputs for a given test dataset
perform, which computes the performance of the network on test data whose labels are known.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a network
Matlab uses the network object to store all of the information that defines a NN.
The network object includes the structure of the network (how many layers, how
many nodes in each layer), as well as many configurable parameters, such as the
weights and biases.
The fundamental building block is the neuron, represented as follows, where p is
input, w is the weight (vector) and b is the bias associated with an extra input of
fixed value 1. Matlab uses
a weight function to determine how the weights are applied (for MLPs this is the dot product, $\mathbf{w}^{T}\mathbf{x}$; for RBFs it can be a distance function, e.g. $\|\mathbf{w} - \mathbf{x}\|$),
a net input function, which for MLPs simply adds the bias to the weighted sum of the inputs; and
a transfer (activation) function to act as the nonlinear thresholding, which for MLPs is typically the logistic or tangential sigmoid.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a network
An abbreviated version of this model is shown as follows:
S: # of neurons
R: # of inputs
The transfer functions are typically indicated with diagrams indicating the
nature of the function being used, for example:
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
A single-layer network of S logsig neurons having R inputs is shown below
in full detail on the left and with a layer diagram on the right
Image Source: Matlab Neural Network Toolbox, User’s Guide
http://www.mathworks.com/products/neuralnet/
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
A two-layer (single hidden layer) network layer diagram
[Layer diagram annotations: number of inputs, number of bias terms, number of hidden (layer 1) nodes, number of output (layer 2) nodes, and the outputs of hidden layer 1 and layer 2. IW: input weights, LW: layer weights.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
Similarly, if you have three layers…
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR feedforwardnet
feedforwardnet Feedforward neural network (Neural Networks Toolbox)
net = feedforwardnet(hiddenSizes, trainFcn) creates a network net, where hiddenSizes is a row vector of one or more
hidden layer sizes (default = 10) and trainFcn indicates the training function (default trainlm) to be used with this network.
Specialized versions of the feedforward network include fitting (fitnet) and pattern recognition (patternnet) networks. A
variation on the feedforward network is the cascade forward network (cascadeforwardnet) which has additional
connections from the input to every layer, and from each layer to all following layers.
Examples
[x,t] = simplefit_dataset;     % load a simple curve-fitting dataset
net = feedforwardnet(10);      % one hidden layer with 10 nodes
net = train(net,x,t);          % train with the default trainlm
view(net)                      % display the network diagram
y = net(x);                    % network outputs for the inputs
perf = perform(net,y,t)        % performance (default: mse)
Note that feedforwardnet() automatically applies removeconstantrows() and mapminmax() to both the inputs and
outputs (target values).
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net = feedforwardnet(10)
[Screenshot: the network object created by feedforwardnet, listing its properties, including the total number of weights.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.inputs
This property holds structures of properties for each of the network's inputs. It is
always an 𝑁𝑖 x 1 cell array of input structures, where 𝑁𝑖 is the number of network
inputs (net.numinputs, which in our case will always be 1).
To access these properties, type net.inputs{1} which will return the following
properties (your numbers may be different depending on your network)
To change the transfer function to logsig, for example, you could execute the command:
net.layers{1}.transferFcn = 'logsig'
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.outputs
This property holds structures of properties for each of the network's outputs. It is always a
1 x 𝑁𝑙 cell array, where 𝑁𝑙 is the number of network outputs (net.numOutputs).
Also try:
To remove the default output processing functions: net.outputs{2}.processFcns={}
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.biases; net.inputWeights; net.layerWeights
These properties hold the structures of properties for each of the network’s biases, input or
layer weights.
For biases, there is one structure for each layer, which include the following properties:
initFcn, learn (whether the biases should be learned), learnFcn (which learning function to use to learn the bias values; default learngdm), learnParam (.lr: initial learning rate and .mc: momentum constant), and size.
For input and layer weights:
The numbers of input and output weights are read from the size of the input and target data in a subsequent configure() or train() call.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Weights & Biases
The actual values of the weights and biases can be obtained by calling
net.IW, net.LW and net.b.
Note that due to their cell structures, you actually need to call them with
the following cell array indices:
net.IW{1}; net.LW{2,1} and net.b{1} or net.b{2}
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Network Functions
A variety of functions control how the network is trained and how it behaves
Adaptation function, generally used for incremental (online, one-instance-at-a-time) learning
Type of derivative / gradient to be used
Type of data division to be used for validation: dividerand is random division (default)
Ratio of data to be used for training, validation and test partitions
Defines how the initialization of weights (net.IW, net.LW) and biases (net.b) are to be done
Objective /cost function; mse: mean square error (default)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Parameters
for traingdx
traingdx is a gradient descent back propagation algorithm with momentum
term and adaptive learning rate, with the following parameters and their
default values
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Parameters
for trainlm
trainlm is the Levenberg-Marquardt backpropagation. It is one of the fastest BP
algorithms, but also the one that requires the most memory. It is the default
training algorithm for feedforward networks, using the following parameters and
default values.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Network Methods
For any given network object net, the following methods are available
Finally, for any given set of test_data, you can obtain the outputs of the network
net by invoking outputs=net(test_data)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training & Testing The
Network
train Train neural network (Neural Network Toolbox)
[net,tr] = train(net,P,T,Pi,Ai) trains a network net according to net.trainFcn and net.trainParam of the network object net,
where P is the network input (matrix or cell array in column format), T is the network targets (in one-hot format), and Pi
and Ai are the input and layer delays (not used for MLPs). It returns the trained network net and training record tr, which
includes the performances.
[Y,Pf,Af,E,perf] = sim(net,P,Pi,Ai,T) simulates the network net, using input data P (typically test data, again in column format), target values T, as well as the delays (for TDNNs) Pi and Ai (not used for MLPs). The function returns the network outputs Y, network errors E and the network performance perf, along with the final input and layer delays Pf and Af (not used for MLPs).
If target values for the test data are not known, and we simply want to obtain the outputs of the network for the input P, use output=net(P);
If target values T are known for the test data, the performance of the network can also be obtained by the perform function, by calling perf = perform(net, T, net(P)).
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training MLP In Matlab
net = feedforwardnet(10, 'traingdx');
net.inputs{1}.processFcns={};   %Remove the default input processing functions of
                                %min/max normalization, fixing unknowns and removing repeat instances
net.outputs{2}.processFcns={};  %Remove the default output processing functions
%net.divideFcn='';  %This removes the default data partitioning (normally 60%, 20%, 20%)
net.divideParam.trainRatio=TR_ratio;
net.divideParam.valRatio=V_ratio;
net.divideParam.testRatio=T_ratio;
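A hedged continuation of the configuration above, showing how such a network might then be trained and evaluated; P (inputs, columns = instances) and T (targets) are assumed to exist, and the three ratios must sum to 1.

[net, tr] = train(net, P, T);    % tr holds the training record / performance curves
Y    = net(P);                   % network outputs for the inputs
perf = perform(net, T, Y);       % performance under net.performFcn (default: mse)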
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Test on OCR Data
Confusion Matrix
[Figure: 10-class confusion matrix for the OCR test data (output class vs. target class); per-class accuracies range from 84.1% to 100%, with an overall accuracy of 94.8%.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Some issues to consider
If the network inputs and outputs are preprocessed with the mapminmax function, all input and output values will be normalized to the [-1 1] range.
Then the logsig function is not suitable for the output layer. Why?
In this case, use one of the following
• Logsig at the hidden layer, tansig or purelin at the output layer
• Tansig at both hidden and output layers
• Tansig at the hidden and purelin at the output layer.
Always check your input arguments. Matlab expects the data to be in the
columns – don’t screw this up!
You can create your own partitioning using the dividerand() function
Recall: the network created by Matlab is a “neural network object” – similar
to a struct. It has a mindboggling set of parameters. Type “net” at the
command prompt and investigate its components.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nntool
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nntool
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Train / Validation / test
trainmlp_validation.m
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR If you set TR_error=0
Overfitting
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainmlp_validation.m Examples
Spiral data
[Figures: the spiral training data and the MLP-based classification of the test data. One hidden layer with N = 50 nodes, tansig (sigmoid) activation, trained with traingdx.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Second Order Methods
The gradient descent is based on the minimization of the criterion function
using the first order derivative
Methods that make use of second-order derivatives typically find the solution much faster than first-order methods.
Newton’s Methods
• Levenberg Marquardt Backpropagation
Quick prop
Conjugate Gradient Methods
• Fletcher-Reeves
• Polak-Ribiere
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Newton’s Method
The weight update rule uses the Hessian matrix H, which contains the second derivatives of the criterion function with respect to the weights, $H_{ij} = \partial^2 J(\mathbf{w})\,/\,\partial w_i\,\partial w_j$:
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Newton’s Method
The weight update rule in Newton’s method is then:
$\Delta\mathbf{w} = -\mathbf{H}^{-1}\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \qquad\Longrightarrow\qquad \mathbf{w}_{k+1} = \mathbf{w}_{k} - \mathbf{H}^{-1}\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Levenberg – Marquardt
Backpropagation
The LM method uses the 𝐇 = 𝐉 𝑇 𝐉 approximation for the Hessian, with the
gradient computed as 𝐠 = 𝐉 𝑇 𝐞 where 𝐉 is the Jacobian matrix containing
the first derivatives of the network errors (not to be confused with the
criterion function 𝐽), and e is the vector of network errors. The weight
update rule is then
$\mathbf{w}_{k+1} = \mathbf{w}_{k} - \left(\mathbf{J}^{T}\mathbf{J} + \mu\mathbf{I}\right)^{-1}\mathbf{J}^{T}\mathbf{e}$
where 𝜇 is a constant which switches the algorithm back and forth between
a regular gradient descent (when 𝜇 is large) and quasi-Newton’s method
(when 𝜇 is zero). Typically, 𝜇 is decreased as the network gets closer to the
solution, since in that region Newton’s algorithm is most efficient.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Quick-Prop
Simple and fast second order method that does not even require a
second order derivative calculation !!!
Assumptions:
Weights are independent → descent is optimized separately for each weight!
The error surface is quadratic.
The weight update rule is given by

$\Delta w_{k+1} = \Delta w_{k}\,\dfrac{\left.\frac{dJ}{dw}\right|_{k}}{\left.\frac{dJ}{dw}\right|_{k-1} - \left.\frac{dJ}{dw}\right|_{k}}$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
Another group of second order algorithms that do not require Hessian
computation are the family of Conjugate Gradient Descent methods.
Pairs of directions (vectors) that satisfy $\Delta\mathbf{w}_{k-1}^{T}\,\mathbf{H}\,\Delta\mathbf{w}_{k} = 0$ are called H-conjugate, meaning that the vectors $\Delta\mathbf{w}_{k-1}$ and $\Delta\mathbf{w}_{k}$ are non-interfering with respect to H. If H is proportional to the identity matrix, then conjugate directions are orthogonal to each other.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
Conjugate Gradient Descent:
1. Start with the steepest descent direction: $\Delta\mathbf{w}_0 = -\nabla J(\mathbf{w}_0)$
2. At the k-th update, perform a line search to determine the optimal distance to move along the current direction $\Delta\mathbf{w}_k$ (equivalent to determining the optimum learning rate). Call this amount $\alpha_k$.
3. Move along this direction by the amount $\alpha_k$: $\mathbf{w}_{k+1} = \mathbf{w}_{k} + \alpha_k\,\Delta\mathbf{w}_{k}$
4. The next search direction is then conjugate to the previous search direction. Compute the conjugate direction by $\Delta\mathbf{w}_{k} = -\nabla J(\mathbf{w}_{k}) + \beta_k\,\Delta\mathbf{w}_{k-1}$
5. The various versions of the conjugate gradient descent algorithm are distinguished by the way the constant $\beta_k$ is computed:

Fletcher-Reeves update – 'traincgf': $\beta_k = \dfrac{\nabla J(\mathbf{w}_k)^{T}\,\nabla J(\mathbf{w}_k)}{\nabla J(\mathbf{w}_{k-1})^{T}\,\nabla J(\mathbf{w}_{k-1})}$

Polak-Ribiere update – 'traincgp': $\beta_k = \dfrac{\nabla J(\mathbf{w}_k)^{T}\left[\nabla J(\mathbf{w}_k) - \nabla J(\mathbf{w}_{k-1})\right]}{\nabla J(\mathbf{w}_{k-1})^{T}\,\nabla J(\mathbf{w}_{k-1})}$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Gradient Descent vs.
Conjugate Gradient
[Figure: comparison of the convergence paths of gradient descent and conjugate gradient descent on an error surface.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Neural Network
structures
Radial Basis Function (RBF) Networks: Similar in architecture to the MLP, but with a different learning rule. Typically used for function approximation, though capable of solving classification problems as well.
Matlab: newrb (train_data, targets, error_goal, spread)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR
Dr. Robi Polikar
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Function Approximation
[Figure: scattered sample points (*) and a question mark – given the samples, what is the underlying function?]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Recall: Universal
Approximator
Classification can be thought of as a special case of function approximation:
For a three class problem:
[Figure: a classifier maps the input x = (x1, …, xd) to one of three outputs — Class 1: 1 or [1 0 0], Class 2: 2 or [0 1 0], Class 3: 3 or [0 0 1].]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Radial Basis Function
Neural Networks
The RBF networks, just like MLP networks, can therefore be used for classification and/or function approximation problems.
The RBFs, which have an architecture similar to that of MLPs, achieve this goal using a different strategy:
[Figure: an input layer, a nonlinear transformation layer that generates local receptive fields, and a linear output layer.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Nonlinear Receptive Fields
The hallmark of RBF networks is their use of nonlinear receptive fields
The receptive fields nonlinearly transform (map) the input feature space, in which the input patterns are not linearly separable, to the hidden-unit space, in which the mapped inputs may be linearly separable.
The hidden unit space often needs to be of a higher dimensionality
Cover’s Theorem (1965) on the separability of patterns: A complex pattern
classification problem that is nonlinearly separable in a low dimensional space, is
more likely to be linearly separable in a high dimensional space.
We will see this concept again with SVM soon.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The (you guessed it right) XOR Problem
Consider the nonlinear functions that map the input vector $\mathbf{x} = [x_1\ x_2]^T$ to the $\phi_1$-$\phi_2$ space:

$\phi_1(\mathbf{x}) = e^{-\|\mathbf{x}-\mathbf{t}_1\|^2}, \quad \mathbf{t}_1 = [1\ 1]^T$
$\phi_2(\mathbf{x}) = e^{-\|\mathbf{x}-\mathbf{t}_2\|^2}, \quad \mathbf{t}_2 = [0\ 0]^T$

Input x    φ1(x)    φ2(x)
(1,1)      1        0.1353
(0,1)      0.3678   0.3678
(1,0)      0.3678   0.3678
(0,0)      0.1353   1

[Figures: the XOR patterns in the (x1, x2) plane, and the same patterns mapped into the (φ1, φ2) plane, where a single line now separates the two classes.]

The nonlinear functions transformed a nonlinearly separable problem into a linearly separable one!
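A short MATLAB check (not from the slides) that reproduces the table above; it uses implicit expansion (R2016b or later).

X  = [0 0; 0 1; 1 0; 1 1];                 % the four XOR patterns (rows)
t1 = [1 1];  t2 = [0 0];                   % the two RBF centers
phi1 = exp(-sum((X - t1).^2, 2));          % phi_1(x) = exp(-||x - t1||^2)
phi2 = exp(-sum((X - t2).^2, 2));          % phi_2(x) = exp(-||x - t2||^2)
disp([X phi1 phi2])                        % the mapped patterns are linearly separable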
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Initial Assessment
Using nonlinear functions, we can convert a nonlinearly separable problem
into a linearly separable one.
From a function approximation perspective, this is equivalent to
implementing a complex function (corresponding to the nonlinearly
separable decision boundary) using simple functions (corresponding to the
linearly separable decision boundary)
Implementing this procedure using a network architecture, yields the RBF
networks, if the nonlinear mapping functions are radial basis functions.
Radial Basis Functions:
Radial: Symmetric around its center
Basis Functions: Also called kernels, a set of functions whose linear combination
can generate an arbitrary function in a given function space.
Hence, the radial basis functions are in fact functions of a distance: the RBF centered at μ attains its maximum when the data point x equals μ (zero distance), and its response becomes symmetrically smaller as x moves away from μ.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Networks
[Figure: an RBF network with d input nodes, H hidden-layer receptive fields φ1 … φH, and c output nodes z1 … zc; the first-layer weights u_Ji are the RBF centers (U = X^T) and w_kj are the hidden-to-output weights. Inset: an example radial basis (Gaussian) transfer function.]

$\phi\!\left(net_J\right) = \phi\!\left(\|\mathbf{x}-\mathbf{u}_J\|\right) = e^{-\left(\frac{\|\mathbf{x}-\mathbf{u}_J\|}{\sigma}\right)^{2}}$, where σ is the spread constant.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principle of Operation
[Figure: hidden node J computes the Euclidean norm of (x − u_J) and passes it through the radial basis function φ; output node K computes a weighted sum of the hidden outputs.]

$y_J = \phi\!\left(net_J\right) = e^{-\left(\frac{\|\mathbf{x}-\mathbf{u}_J\|}{\sigma}\right)^{2}}$, where σ is the spread constant.

$z_K = f\!\left(net_K\right) = f\!\left(\sum_{j=1}^{H} w_{Kj}\, y_j\right) = \sum_{j=1}^{H} w_{Kj}\, y_j$ (linear output layer)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principle of Operation
What do these parameters represent?
Physical meanings:
• 𝜙: The radial basis function for the hidden layer. This is a simple nonlinear mapping
function (typically Gaussian) that transforms the d- dimensional input patterns to a
(typically higher) H-dimensional space. The complex decision boundary will be
constructed from linear combinations (weighted sums) of these simple building
blocks.
• 𝑢𝑗𝑖 : The weights joining the first to hidden layer. These weights constitute the
center points of the radial basis functions. Also called prototypes of data.
• σ: The spread constant(s). These values determine the spread (extent) of each radial basis function.
• 𝑤𝑗𝑘: The weights joining hidden and output layers. These are the weights which are
used in obtaining the linear combination of the radial basis functions. They
determine the relative amplitudes of the RBFs when they are combined to form the
complex function.
• ‖𝐱 − 𝐮𝐽 ‖: the Euclidean distance between the input 𝐱 and the prototype vector 𝐮𝐉.
Activation of the hidden unit is determined according to this distance through 𝜙
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Principle of Operation
[Figure: the function implemented by the RBF network is a weighted sum of radial basis transfer functions φ_J, centered at the training data instances u_J, with w_J giving the relative weight of the J-th RBF.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR How to Train?
There are various approaches for training RBF networks.
Approach 1: Exact RBF – Guarantees correct classification of all training
data instances. Requires N hidden layer nodes, one for each training
instance. No iterative training is involved. RBF centers (u) are fixed as
training data points, spread as variance of the data, and w are obtained by
solving a set of linear equations (In Matlab: newrbe() )
Approach 2: Fixed centers selected at random. Uses 𝐻 < 𝑁 hidden layer
nodes. No iterative training is involved. Spread is based on Euclidean
metrics, w are obtained by solving a set of linear equations.
Approach 3: Centers are obtained from unsupervised learning
(clustering). Spreads are obtained as variances of clusters, w are obtained
through LMS algorithm. Clustering (k-means) and LMS are iterative. This
is the most commonly used procedure. Typically provides good results.
Approach 4: All unknowns are obtained from supervised learning.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
Exact RBF
The first layer weights u are set to the training data; 𝑼 = 𝑿𝑻. That is, the
Gaussians are centered at the training data instances.
The spread is chosen as $\sigma = \dfrac{d_{max}}{\sqrt{2N}}$, where $d_{max}$ is the maximum Euclidean distance between any two centers, and N is the number of training data points. Note that H = N for this case.
The output of the k-th RBF output neuron is then

$z_k = \sum_{j=1}^{N} w_{kj}\,\phi\!\left(\|\mathbf{x}-\mathbf{u}_j\|\right)$ (multiple outputs) $\qquad\qquad z = \sum_{j=1}^{N} w_{j}\,\phi\!\left(\|\mathbf{x}-\mathbf{u}_j\|\right)$ (single output)
During training, we want the outputs to be equal to our desired targets. Without
loss of any generality, assume that we are approximating a single dimensional
function, and let the unknown true function be 𝑓(𝒙). The desired (target) output
for each input is then 𝑡𝑖 = 𝑓(𝒙𝑖), 𝑖 = 1, 2, … , 𝑁.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)
Define: $\mathbf{d} = [t_1, t_2, \cdots, t_N]^T$, $\quad \mathbf{w} = [w_1, w_2, \cdots, w_N]^T$, $\quad \mathbf{\Phi} = \{\phi_{ij}\},\ i,j = 1,2,\ldots,N$. Then $\mathbf{\Phi}\cdot\mathbf{w} = \mathbf{d} \;\Rightarrow\; \mathbf{w} = \mathbf{\Phi}^{-1}\mathbf{d}$
Is this matrix always invertible?
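A minimal sketch of exact RBF interpolation for one-dimensional data (x and d are assumed N x 1 vectors of inputs and targets; sigma is an illustrative spread; implicit expansion requires R2016b or later):

N     = numel(x);
sigma = 1;                                  % illustrative spread
D     = abs(x - x');                        % pairwise distances ||x_i - x_j||
Phi   = exp(-(D / sigma).^2);               % phi_ij
w     = Phi \ d;                            % solve Phi * w = d for the output weights
yhat  = Phi * w;                            % reproduces the targets at the training points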
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)
• Multiquadrics: $\phi(r) = \left(r^2 + c^2\right)^{1/2}$, where $r = \|\mathbf{x}-\mathbf{x}_j\|$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach1
(Cont.)
Gaussian RBFs are localized functions – unlike the sigmoids used by MLPs!
[Figures: approximation using Gaussian radial basis functions vs. using sigmoidal basis functions.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Exact RBF Properties
Using localized functions typically makes RBF networks more suitable for
function approximation problems. Why?
Since first layer weights are set to input patterns, second layer weights are
obtained from solving linear equations, and spread is computed from the
data, no iterative training is involved !!!
Guaranteed to correctly classify all training data points!
However, since we are using as many receptive fields as the number of data points, the solution is overdetermined if the underlying physical process does not have as many degrees of freedom → overfitting!
The importance of σ: too small a σ will also cause overfitting; too large a σ will fail to characterize rapid changes in the signal.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Too many
Receptive Fields?
In order to reduce the artificial complexity of the RBF, we need to use a smaller number of receptive fields.
How about using a subset of the training data, say M < N of them?
These M data points will then constitute M receptive field centers.
How to choose these M points…? At random → Approach 2.
$y_j = \phi_{ij} = \phi\!\left(\|\mathbf{x}_i-\mathbf{x}_j\|^2\right) = e^{-\frac{M}{d_{max}^2}\|\mathbf{x}_i-\mathbf{x}_j\|^2}, \quad i = 1,2,\ldots,N,\ \ j = 1,2,\ldots,M, \qquad \sigma = \dfrac{d_{max}}{\sqrt{2M}}$
Output layer weights are determined as they were in Approach 1, through solving a
set of M linear equations!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 3
K-Means - Unsupervised
Clustering - Algorithm
Choose number of clusters, M
Initialize M cluster centers to the first M training data points: 𝐭𝑘 = 𝐱𝑘 , 𝑘 = 1,2, … , 𝑀.
Repeat
At iteration n, group all patterns with the cluster whose center is closest:

$C(\mathbf{x}) = \arg\min_{k}\|\mathbf{x}(n) - \mathbf{t}_k(n)\|, \quad k = 1,2,\ldots,M$, where $\mathbf{t}_k(n)$ is the center of the k-th RBF at the n-th iteration.

Compute the centers of all clusters after the regrouping:

$\mathbf{t}_k = \frac{1}{M_k}\sum_{j=1}^{M_k}\mathbf{x}_j$, where $\mathbf{t}_k$ is the new cluster center for the k-th RBF, $M_k$ is the number of instances in the k-th cluster, and the sum runs over the instances $\mathbf{x}_j$ grouped into the k-th cluster.
Until there is no change in cluster centers from one iteration to the next.
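A sketch of the clustering step using MATLAB's kmeans (Statistics and Machine Learning Toolbox); Xtr is an assumed N x d training matrix, and the per-cluster variances provide the spreads.

M = 10;                                      % number of receptive fields (illustrative)
[idx, centers] = kmeans(Xtr, M);             % centers: M x d RBF prototypes t_k
spreads = zeros(M, 1);
for k = 1:M
    spreads(k) = mean(var(Xtr(idx == k, :), 0, 1));   % average per-feature variance of cluster k
end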
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 4:
Supervised
RBF Training
This is the most general form.
All parameters, receptive field centers (first layer weights), output layer weights
and spread constants, are learned through iterative supervised training using LMS
/ gradient descent algorithm.
$\mathcal{E} = \sum_{j=1}^{N} e_j^{2}, \qquad e_j = t_j - \sum_{i=1}^{M} w_{i}\,\phi\!\left(\|\mathbf{x}_j-\mathbf{t}_i\|\right)$

and the update equations for the centers, spreads, and weights follow from gradient descent on this criterion.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF
Similarities
Both are universal approximators: they can approximate an arbitrary function of arbitrary
dimensionality and arbitrary complexity, provided that the number of hidden layer units are
sufficiently large, and there is sufficient training data.
Differences
MLP generates more global decision regions, as opposed to RBF generating more local
decision regions
MLPs partition the feature space into hyperplanes, whereas RBFs partition the space into hyperellipsoids.
MLPs are more likely to battle with local minima and flat valleys than RBFs, and hence in general have longer training times.
Since MLPs generate global decision regions, they do better at extrapolating, that is, classifying instances that are outside of the feature space represented by the training data. It should be noted, however, that extrapolating may mean dealing with outliers.
MLPs typically require fewer parameters than RBFs to approximate a given function with the same accuracy.
(From R. Gutierrez)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF (Cont.)
Differences (cont.)
All parameters of the MLP are trained simultaneously, whereas RBF parameters can be
trained separately in an efficient hybrid manner
RBFs have one and only one hidden layer, whereas MLPs can have multiple hidden layers.
The hidden neurons of an MLP compute the inner product between an input vector and the
weight vector. RBFs instead compute the Euclidean distance between the input vector and
the radial basis function centers.
The hidden layer of an RBF is nonlinear and its output layer is linear, whereas an MLP
typically has both layers as nonlinear. This really is more of a historic preference based on
empirical success. MLPs typically do better on classification type problems, and RBFs
typically do better on regression / function approximation type problems.
(From R. Gutierrez)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF
(From R. Gutierrez)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Networks
in Matlab
$radbas(x) = e^{-x^{2}}$

The RBF accepts a distance between the input p and the weight vector w. As the distance between w and p decreases, the RBF output increases (toward its maximum of 1). Hence

$a = radbas\!\left(\|\mathbf{w}-\mathbf{p}\|\, b\right)$

where the bias b allows the sensitivity of the neuron to be adjusted.
[Figure: the radbas transfer function, and the layer diagram of a two-layer RBF network – an S1-neuron radbas layer with weights IW{1,1} and bias b1, followed by an S2-neuron linear layer with weights LW{2,1} and bias b2, producing y = a2.]
Examples
Here you design a radial basis network, given inputs P and targets T.
P = [1 2 3]; T = [2.0 4.1 5.9]; net = newrb(P,T);
The network is simulated for a new input.
P = 1.5; Y = sim(net,P)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Variations
newgrnn – Create a generalized regression neural network. Consists of two
layers, the first is identical to that of RBF, whereas the second layer has a
slight variation of purelin layer. More suitable for function approximation
problems.
newpnn – Create a probabilistic neural network. Also has two layers, with
the first being a RBF, whereas the second normalizes the first layer outputs
and passes them through a competitive function that picks the largest
output. Most suited for classification problems. This function essentially creates a version of kNN, with the distances computed through the radial basis function.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainrbf.m RBF Matlab Demo
%trainrbf: Trains and simulates a RBF and GRNN network on a synthetic
%dataset - Originally written 2003, Updated 10/2013
%Robi Polikar
load arb_function3.csv;
X=arb_function3(:,1);
Y=arb_function3(:,2);
size(X); N=length(X);
X=X(1:4:N); Y=Y(1:4:N);
%Sub sample the data at a rate of 1:5 to create training data
P1=X(1:5:length(X))';
T1=Y(1:5:length(Y))';
subplot(411)
plot(X,Y); grid
title('Original function')
subplot(412)
plot(P1,T1); grid
title('Training Data');
subplot(413)
plot(X,out_rb); grid
title('RBF approximation of the original data')
subplot(414)
plot(X,out_grnn); grid
title('GRNN approximation of the original data')
%(the newrb/newgrnn creation and sim calls that produce out_rb and out_grnn were not shown on this slide)

[Figure: the resulting four subplots – original function, training data, RBF approximation, and GRNN approximation.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
[Figure: the original function, the training data, and the RBF approximation. Using an error goal of 1 and a spread of 5, the network required 1200 neurons, using 2000 points for training.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
The Peaks!
$z = 3(1-x)^{2}\, e^{-x^{2}-(y+1)^{2}} \;-\; 10\left(\tfrac{x}{5} - x^{3} - y^{5}\right) e^{-x^{2}-y^{2}} \;-\; \tfrac{1}{3}\, e^{-(x+1)^{2}-y^{2}}$

[Figure: 3-D mesh of the training data (the peaks surface).]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
The Peaks!
RBF parameters
Error goal: 0.1
Spread: 1
Training function: fully supervised
MLP parameters:
Number of hidden layers: 2
Number of nodes in each layer: 25
Error goal: 0.01; (does not reach in 1000 iterations)
Training function: traingdx or trainlm
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
rbf_mlp_peaks
The Peaks!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nnstart
For a very good introduction, play with nnstart
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Selecting Data
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Setting Up The Network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training
Check the dimensionality
Check out the error histogram, confusion matrix and ROC curve once training is completed (the corresponding plot commands are sketched below).
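The nnstart/nprtool GUI generates these plots, but they can also be produced from the command line; a minimal sketch, assuming a trained net, its targets, and the training record tr returned by train:

outputs = net(inputs);                    % network responses
errors  = gsubtract(targets, outputs);
figure, plotperform(tr)                   % train/validation/test error vs. epoch
figure, ploterrhist(errors)               % error histogram
figure, plotconfusion(targets, outputs)   % confusion matrix
figure, plotroc(targets, outputs)         % class-specific ROC curves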
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Performance Plots
Best Validation Performance is 0.01575 at epoch 84
[Figure: mean squared error (mse) vs. epoch for the Train, Validation and Test splits over 90 epochs, with the best-validation point marked]
[Figure: 'Error Histogram with 20 Bins' - instances of training, validation and test errors (Errors = Targets - Outputs), with the zero-error line marked]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR ROC Curves
(Class Specific)
[Figure: Training ROC and Validation ROC panels - true positive rate vs. false positive rate, one curve per class (Class 1, Class 2, Class 3, ...)]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Confusion Matrices - I
[Figure: two 10-class confusion matrix plots (output class vs. target class), showing per-cell counts and percentages with class-wise precision and recall in the margins; overall accuracies are 96.1% and 89.6%]
PR Confusion Matrices - II
[Figure: two more 10-class confusion matrix plots (output class vs. target class); overall accuracies are 89.4% and 91.5%]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Evaluate The Network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Evaluation Results
On Test Data
[Figure: confusion matrix on the test data (output class vs. target class, 10 classes, overall accuracy 91.9%) and the corresponding class-specific ROC curves (true positive rate vs. false positive rate, Class 1 through Class 10)]
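The same test-set figures can be reproduced programmatically; a minimal sketch, assuming held-out testInputs/testTargets matrices (illustrative names) and the trained net:

testOutputs = net(testInputs);
[err, cm] = confusion(testTargets, testOutputs);   % err = fraction misclassified, cm = count matrix
fprintf('Test accuracy: %.1f%%\n', 100*(1 - err));
figure, plotconfusion(testTargets, testOutputs)
figure, plotroc(testTargets, testOutputs)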
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR What Else Can it do,
you Ask…?
Prepare to be impressed,
and then click here
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR …and here is the script…
% Solve a Pattern Recognition Problem with a Neural Network
% Script generated by NPRTOOL - Created Thu Oct 10 02:48:04 EDT 2013
% This script assumes these variables are defined:
%   opt_train - input data
%   opt_class - target data
inputs  = opt_train;
targets = opt_class;

% Create a Pattern Recognition Network
hiddenLayerSize = 20;
net = patternnet(hiddenLayerSize);

% Choose Input and Output Pre/Post-Processing Functions
% For a list of all processing functions type: help nnprocess
net.inputs{1}.processFcns  = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};

% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideFcn  = 'dividerand';   % Divide data randomly
net.divideMode = 'sample';       % Divide up every sample
net.divideParam.trainRatio = 30/100;
net.divideParam.valRatio   = 35/100;
net.divideParam.testRatio  = 35/100;

% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
net.trainFcn = 'trainlm';   % Levenberg-Marquardt

% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
net.performFcn = 'mse';     % Mean squared error

% Choose Plot Functions
% For a list of all plot functions type: help nnplot
net.plotFcns = {'plotperform','plottrainstate','ploterrhist','plotregression','plotfit'};

% Train and test the Network
[net,tr] = train(net,inputs,targets);
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs)

% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets   = targets .* tr.valMask{1};
testTargets  = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs);
valPerformance   = perform(net,valTargets,outputs);
testPerformance  = perform(net,testTargets,outputs)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR …And the simulink model
[Figure: Simulink block diagram of the trained network, with input x1 and output y1]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Midterm Project II
Oct 24
Option 1 (easier, but close ended – intended for undergraduate students) - Pick 3 of the
more challenging datasets (from the UCI repository) that you used in Midterm Project I, and
design two function approximation (with noise) problems
Train and test using the following network structures: MLP, RBF, PNN, LVQ, GRNN (decide which network is suitable for which problem: classification vs. regression).
Investigate different parameters, learning algorithms, etc. and tabulate your results. Can you make any generalizations with respect to accuracy, speed, network size, etc.?
Use proper cross validation and statistical tests to compare the algorithms on the datasets. Provide the principles of operation for each of the networks, including PNN, LVQ and GRNN.
UG students can instead do option 2 for 20% additional credit.
Option 2 – intended for grad students (open ended problem – publication opportunity)
You will be given a dataset of EEG data with two classes: Alzheimer’s disease and normal. The dataset
includes raw data as well as 152 different feature sets obtained from 71 subjects. Your goal is to devise
and implement a rigorous approach to determine which feature sets provide the best diagnostic
accuracy, using appropriate cross-validation techniques. Provide a list of these feature sets in descending
order of accuracy, reporting sensitivity, specificity and positive predictive value, along with their
confidence intervals (a sketch for computing these metrics is given below).
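One way (a minimal sketch, not a prescribed method) to compute the requested metrics and 95% normal-approximation confidence intervals from a 2x2 confusion matrix in MATLAB; the counts are illustrative placeholders:

TP = 30; FN = 5; FP = 4; TN = 32;                 % cross-validated counts (illustrative)
sensitivity = TP / (TP + FN);                     % true positive rate
specificity = TN / (TN + FP);                     % true negative rate
ppv         = TP / (TP + FP);                     % positive predictive value
ci = @(p, n) p + [-1 1] * 1.96 * sqrt(p*(1-p)/n); % 95% Wald interval for a proportion
sens_ci = ci(sensitivity, TP + FN);
spec_ci = ci(specificity, TN + FP);
ppv_ci  = ci(ppv,         TP + FP);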
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ