Lecture 5-6
Dr. Robi Polikar
Feedforward Neural Networks
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Neural Networks
Neural Network, in computer science, highly interconnected network of information-
processing elements that mimics the connectivity and functioning of the human brain. Neural
networks address problems that are often difficult for traditional computers to solve, such as
speech and pattern recognition. They also provide some insight into the way the human brain
works. One of the most significant strengths of neural networks is their ability to learn from a
limited set of examples.
© Encarta, 1993-2007 Microsoft Corporation. All rights reserved.
In computer science and related fields, artificial neural networks are models inspired by
animal central nervous systems (in particular the brain) that are capable of machine
learning and pattern recognition. They are usually presented as systems of interconnected
"neurons" that can compute values from inputs by feeding information through the network.
For example, in a neural network for handwriting recognition, a set of input neurons may be
activated by the pixels of an input image representing a letter or digit. The activations of these
neurons are then passed on, weighted and transformed by some function determined by the
network's designer, to other neurons, etc., until finally an output neuron is activated that
determines which character was read.
Like other machine learning methods, neural networks have been used to solve a wide variety
of tasks that are hard to solve using ordinary rule-based programming, including computer
vision and speech recognition.
http://en.wikipedia.org/wiki/Artificial_neural_network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Neural Networks
Physiological Origins
[Figure: a feedforward network with d input nodes x1 … xd (i = 1,…,d), H hidden-layer nodes with outputs y_j (j = 1,…,H), and c output nodes z1 … zc (k = 1,…,c); w_ji are the input-to-hidden weights and w_kj the hidden-to-output weights.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
The Mathematical origins
PR 1940s
McCulloch & Pitts, 1943
Warren McCulloch (a psychiatrist and neuroanatomist) and
Walter Pitts (mathematician) devised the first computational
model of neurons
• A McCulloch-Pitts neuron fires if the sum of its excitatory inputs exceeds a threshold, provided that the neuron does not receive an inhibitory input. They showed that such a network of neurons can construct any logical function.

$w_{ij}^{new} = w_{ij}^{old} + \eta\, p_i\, a_j$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Origins
1950s
Frank Rosenblatt, 1958 (1928 – 1971)
Founder of the perceptron model – a single neuron with adjustable weights and a threshold activation.
• Proved that if two classes are linearly separable, the learning algorithm for the perceptron (the perceptron rule) will converge.

[Figure: a single perceptron J with inputs x1 … xd and weights w_Ji, computing $f\!\left(net_J\right)$ with $net_J = \sum_{i=1}^{d} w_{Ji}\, x_i$.]

Widrow (1929 – ) & Hoff (1937 – )
[Photos: Bernard Widrow and Ted Hoff (with Nik Kasabov). All rights reserved, Robi Polikar, IJCNN 2009, Atlanta, GA © 2009]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Origins
1960s and 70s
The perceptron has shown itself worthy of study despite (and even because of !) its severe limitations. It has
many features to attract attention: its linearity, its intriguing learning theorem; its clear paradigmatic
simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over
to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or
reject) our intuitive judgment that the extension to multilayer systems is sterile.
Minsky and Papert, 1969
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The origins
1980s
S. Grossberg and G. Carpenter, 1980
Grossberg and Carpenter (a couple, and among the very few still studying ANNs at the time) described a new unsupervised algorithm, Adaptive Resonance Theory (ART), which was later extended into a supervised algorithm, ARTMAP. Several variations of ARTMAP have been developed since then.
John Hopfield, 1982
He developed the Hopfield network, the first example
of recurrent networks, used as an associative memory.
Teuvo Kohonen, 1982
Credited with the development of the unsupervised training algorithm known as self-organizing maps (SOMs)
A. Barto, R.S. Sutton and C. Anderson, 1983
Popularized reinforcement learning
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feedforward
Networks
D. Rumelhart, Hinton and Williams, 1986
Developed a simple, yet elegant algorithm for training multilayer
networks, learning nonlinearly separable decision boundaries, through
back propagation of errors, a generalization of the LMS algorithm.
Paul Werbos, 1974
Widely recognized as the original inventor of the backpropagation
algorithm in his Ph.D. thesis (Harvard) in 1974. Currently at NSF
Dave Broomhead and David Lowe, 1988
John Moody and Christian Darken, 1989
Radial Basis Function (RBF)
Neural networks
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Kernel Methods
Support Vector Machines
Vladimir Vapnik and Alexey Chervonenkis, 1962
VC dimension of classifiers
Statistical learning theory
Linear SVMs (coming soon)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Bagging, Boosting,
Ensemble Systems
Leo Breiman (1928-2005), 1994
Bagging – one of the original ensemble of
classifiers algorithms based on random sampling
with replacement.
Later, random forests – a clever name for an ensemble-of-decision-trees algorithm.
Robert Schapire and Yoav Freund, 1995
Hedge, boosting, AdaBoost (coming soon!)
Arguably one of the most influential algorithms in
recent history of machine learning.
Tin Kam Ho, 1998
Random subspace methods
Ludmila Kuncheva, 2004
Classifier fusion
Gavin Brown, 2004
Feature selection, diversity in ensemble systems
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other notable People
Michael I. Jordan, Zoubin Ghahramani, Daphne Koller
Bayesian networks, mixture of experts
Expectation – maximization (EM) algorithm
Graphical methods
David Wolpert
No free lunch theorem, 1997
Stacked generalization, 1992
Nitesh Chawla
Learning from unbalanced data
SMOTE (Synthetic Minority Oversampling
TEchnique)
Geoffrey Hinton
Deep neural networks / deep learning 2000
Dropout method, 2012
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
The XOR Problem
[Figure: the four XOR patterns in the (x1, x2) plane; the two classes cannot be separated by a single line.]

Input x    Output
(0,0)      0
(0,1)      1
(1,0)      1
(1,1)      0

No linear function can separate the two classes of the XOR problem!
(no set of weights w0, w1, w2 will satisfy all of the following constraints)

$x_1 = 0,\ x_2 = 0,\ y = 0 \;\Rightarrow\; w_0 \le 0$
$x_1 = 0,\ x_2 = 1,\ y = 1 \;\Rightarrow\; w_2 + w_0 > 0$
$x_1 = 1,\ x_2 = 0,\ y = 1 \;\Rightarrow\; w_1 + w_0 > 0$
$x_1 = 1,\ x_2 = 1,\ y = 0 \;\Rightarrow\; w_1 + w_2 + w_0 \le 0$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
But a multilayered network can. Start with the AND function:

$y = \begin{cases} 1, & x_1 + x_2 - 1.5 > 0 \\ 0, & x_1 + x_2 - 1.5 < 0 \end{cases}$

[Figure: a single threshold unit implementing the AND function.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Need for
Multilayered Networks
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

[Figure: a two-layer network implementing XOR – two hidden AND units compute z1 = x1 AND ~x2 and z2 = ~x1 AND x2, and an output OR unit combines them.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The ANN Training Cycle
Stage 1: Network Training – present examples, indicate the desired outputs, and determine the synaptic weights (the network's "knowledge") of an ANN whose weights are initially undetermined.
Stage 2: Network Testing – apply the trained network to new inputs.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Universal Approximator
Classification can be thought of as a special case of function approximation:
For a three class problem:
[Figure: a classifier maps the input x = (x1, …, xd) to one of three outputs — Class 1: [1 0 0], Class 2: [0 1 0], Class 3: [0 0 1].]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Multilayer Perceptron
Architecture
• What truly separates an MLP from a regular simple perceptron is the nonlinear threshold function f, also known as the activation function. If a linear thresholding function is used, the MLP can be replaced with a series of simple perceptrons, which can then only solve linearly separable problems.

[Figure: a fully connected network with d input nodes (i = 1,…,d), H hidden-layer nodes (j = 1,…,H), and c output nodes (k = 1,…,c); w_ji are the input-to-hidden weights and w_kj the hidden-to-output weights.]

$y_j = f\!\left(net_j\right) = f\!\left(\sum_{i=1}^{d} w_{ji}\, x_i\right)$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Computing Node Values
[Figure: hidden node J receives the inputs x1 … xd through the weights w_Ji; output node K receives the hidden outputs y1 … yH through the weights w_Kj.]

$y_J = f\!\left(net_J\right) = f\!\left(\sum_{i=1}^{d} w_{Ji}\, x_i\right)$

$z_K = f\!\left(net_K\right) = f\!\left(\sum_{j=1}^{H} w_{Kj}\, y_j\right)$
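As a minimal sketch (not from the original slides), these two node-value computations can be written in a few lines of MATLAB; all variable names and sizes below are illustrative assumptions.

d = 4; H = 5; c = 3;                 % layer sizes (illustrative)
x   = rand(d, 1);                    % one input pattern (column vector)
Wih = 0.1 * randn(H, d);             % input-to-hidden weights w_ji
Who = 0.1 * randn(c, H);             % hidden-to-output weights w_kj
f   = @(net) 1 ./ (1 + exp(-net));   % logistic sigmoid activation

netj = Wih * x;                      % net_j = sum_i w_ji * x_i
y    = f(netj);                      % hidden-layer outputs y_j
netk = Who * y;                      % net_k = sum_j w_kj * y_j
z    = f(netk);                      % network outputs z_k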
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Backpropagation
Learning Rule
The weights are determined through gradient-descent minimization of the criterion function J(w), where t denotes the target (desired) outputs and z the actual network outputs:

Training error: $J(\mathbf{w}) = \frac{1}{2}\sum_{k=1}^{c}\left(t_k - z_k\right)^2 = \frac{1}{2}\|\mathbf{t}-\mathbf{z}\|^2, \qquad z_k = f\!\left(net_k\right) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$

$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta\mathbf{w}(t) \;\Rightarrow\; \Delta\mathbf{w} = -\eta\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$

We need to express J(w) in terms of w for both the output and hidden layer nodes. Output nodes are easy, since we know the functional dependence of J on w, through the chain rule:

$\frac{\partial J(\mathbf{w})}{\partial w_{kj}} = \frac{\partial J(\mathbf{w})}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = \frac{\partial J(\mathbf{w})}{\partial z_k}\cdot\frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{kj}} = -\left(t_k - z_k\right) f'\!\left(net_k\right) y_j = -\delta_k\, y_j$

where $\delta_k$ is the output node sensitivity. The weight update is then

$\Delta w_{kj} = \eta\,\delta_k\, y_j = \eta\left(t_k - z_k\right) f'\!\left(net_k\right) y_j$

For the logistic sigmoid, $f'(x) = f(x)\bigl(1 - f(x)\bigr)$, so

$\Delta w_{kj} = \eta\left(t_k - z_k\right) f\!\left(net_k\right)\bigl(1 - f\!\left(net_k\right)\bigr)\, y_j$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Backpropagation
Learning Rule
For the hidden layer, things are a little bit more complicated, since we do not know the
desired values of the hidden layer node outputs. However, by the appropriate use of the
chain rule, we obtain:

$\frac{\partial J(\mathbf{w})}{\partial w_{ji}} = \frac{\partial J(\mathbf{w})}{\partial y_j}\cdot\frac{\partial y_j}{\partial net_j}\cdot\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial J(\mathbf{w})}{\partial y_j}\cdot f'\!\left(net_j\right)\cdot x_i$

$\frac{\partial J(\mathbf{w})}{\partial y_j} = \frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{k=1}^{c}\left(t_k - z_k\right)^2\right] = -\sum_{k=1}^{c}\left(t_k - z_k\right)\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}\left(t_k - z_k\right) f'\!\left(net_k\right) w_{kj}$

since $z_k = f\!\left(net_k\right) = f\!\left(\sum_{j=1}^{H} w_{kj}\, y_j\right)$ implies $\frac{\partial z_k}{\partial y_j} = \frac{\partial z_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial y_j} = f'\!\left(net_k\right) w_{kj}$.

Collecting terms gives the hidden-layer node sensitivity $\delta_j = f'\!\left(net_j\right)\sum_{k=1}^{c} w_{kj}\,\delta_k$.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP/BP
The weight update rule is then:

$\Delta w_{ji} = \eta\,\delta_j\, x_i \quad$ for the hidden-layer weights

In each case, the parameter δ represents the sensitivity of the criterion function (error) with respect to the activation of the hidden / output layer node. The sensitivity of a hidden-layer node is a weighted sum of the output sensitivities, scaled by $f'(net_j)$, where the weights are the hidden-to-output layer weights; the output sensitivities themselves are the errors at the output level, scaled by $f'(net_k)$.

The algorithm takes its name – backpropagation – from the fact that during training, the errors are propagated back, from the output to the hidden layer!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Credit Assignment Problem
Hidden nodes themselves do not make errors: they just contribute to the errors of the output nodes. The amount contributed is indicated by the sensitivities:

$\delta_k = \left(t_k - z_k\right) f'\!\left(net_k\right)$

$\delta_j = f'\!\left(net_j\right)\sum_{k=1}^{c} \delta_k\, w_{kj}$
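A minimal MATLAB sketch (not from the slides) of these sensitivity computations and the resulting weight updates, reusing the variables from the forward-pass sketch shown earlier; the target vector t and learning rate eta are illustrative.

t   = [1; 0; 0];                         % assumed target vector
eta = 0.1;                               % assumed learning rate
fp  = @(net) f(net) .* (1 - f(net));     % derivative of the logistic sigmoid

deltaK = (t - z) .* fp(netk);            % output node sensitivities delta_k
deltaJ = (Who' * deltaK) .* fp(netj);    % hidden node sensitivities delta_j

Who = Who + eta * deltaK * y';           % delta w_kj = eta * delta_k * y_j
Wih = Wih + eta * deltaJ * x';           % delta w_ji = eta * delta_j * x_i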
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Generalizations of MLP/BP
We have seen the BP for a specific two layer MLP. However, with some
notational and bookkeeping effort, the BP learning rule can be easily
generalized to the following cases…
Input units including bias units – just add one more input node with 𝑥0 = 1
Input units connected directly to the output units (as well as hidden nodes)
There are more than two layers → deep neural networks
There are different nonlinearities for each layer
Each unit has its own nonlinearity
Each unit has a different learning rate
The output is a continuous value (i.e., regression problem)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Protocols
Three major training protocols:
Stochastic Learning: Instances are drawn randomly from the training data, and
the weights are updated for each chosen instance.
Batch Learning: The entire training dataset is shown to the network before the weights are updated. Each such presentation of the entire training dataset to the network is called an epoch. In this case, the error for each pattern, J_p, is computed and summed before the weights are updated. This is the recommended mode of training an MLP (see the sketch after this list).
Online Learning: Instances are drawn consecutively from the training data, and
the weights are updated for each instance. This can be sensitive to the order in
which the data are presented.
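A sketch of one epoch of batch learning (assumptions: X is a d x N input matrix, T a c x N target matrix, and f, fp, Wih, Who, eta are as in the earlier sketches); the per-pattern gradients are accumulated and the weights are updated once per epoch.

N = size(X, 2);
dWih = zeros(size(Wih));  dWho = zeros(size(Who));
for n = 1:N
    x = X(:,n);  t = T(:,n);
    y = f(Wih * x);   z = f(Who * y);       % forward pass for one pattern
    dk = (t - z) .* fp(Who * y);            % output sensitivities
    dj = (Who' * dk) .* fp(Wih * x);        % hidden sensitivities
    dWho = dWho + dk * y';                  % accumulate the gradients
    dWih = dWih + dj * x';
end
Who = Who + eta * dWho;                     % a single update per epoch
Wih = Wih + eta * dWih;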
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP- Batch Learning
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Datasets used in MLP/BP
Typically, three sets of data are used in MLP training and testing, all of
which are accompanied by their corresponding correct class information
Training data: This is the data on which the gradient descent is
performed. That is, the training is done on this data.
Validation data: A second dataset, which is not used for training; rather, it is used to determine when the training should stop.
Test data: This is the dataset with which we assess the generalization performance of the network.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP Bayes
With a sufficient number of hidden-layer nodes, the MLP can approximate any function, and hence can solve any nonlinearly separable classification problem.
In fact, with a sufficient number of hidden nodes, along with plenty of data, it can be shown that the network outputs represent the posterior probabilities of the classes (Richard & Lippmann, 1991).
Outputs can, moreover, be forced to represent probabilities:
Use an exponential activation function at the output layer, $f(net) = e^{net}$
Use 0 – 1 target vectors
Normalize the outputs according to

$z_k = \dfrac{e^{net_k}}{\sum_{i=1}^{c} e^{net_i}}$
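As a small illustration (not part of the original slides), the softmax normalization above can be computed directly in MATLAB; netk is an assumed vector of output-layer activations.

netk = [2.0; 0.5; -1.0];               % assumed output-layer net activations
zk   = exp(netk) ./ sum(exp(netk));    % z_k = e^{net_k} / sum_i e^{net_i}, sums to 1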
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Improving the Backpropagation
Practical Considerations
Activation Function
Input Normalization
Target Values
Number of Hidden Units
Initializing Weights
Learning Rates
Momentum
Stopping Criteria
Regularization
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
Practical Considerations
PR Activation Function
Desirable properties of an activation function
Nonlinearity – gives the multilayer networks the power of generating
nonlinear decision boundaries;
Saturation for classification problems – so that the outputs can be limited
between some minimum and maximum limits (-1 and 1, or 0 and 1). Not
necessary for regression (function approximation problems)
Continuity and smoothness – so that we can take its derivative
Monotonicity – so that the activation function itself does not introduce additional local minima
Linearity for small values of 𝑛𝑒𝑡, to preserve the properties of linear
discriminant functions
A function that satisfies all of the above is
….(drum roll)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Sigmoid
Activation Function
Logarithmic sigmoid: $f(net) = \dfrac{1}{1 + e^{-\beta\, net}}$

[Figure: logistic sigmoid curves for β = 0.75, 0.5, 0.25, and 0.1; larger β gives a steeper transition.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Sigmoid
Activation Function
Tangential sigmoid: $f(net) = \tanh(\beta\, net) = \dfrac{2}{1 + e^{-2\beta\, net}} - 1$

[Figure: hyperbolic-tangent sigmoid curves for β = 0.75, 0.5, 0.25, and 0.1; larger β gives a steeper transition.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Normalization
For stability reasons, individual features need to be in the same ballpark. If two features differ by orders of magnitude in their values, the network cannot learn effectively.
If the mean and variance of each feature is made zero and one, respectively, this is
called standardization. This makes sense, if the features are uncorrelated.
If the relative amplitudes of the features need to be conserved – necessary when the features are related – then a relative-to-maximum or relative-to-2-norm normalization should be used!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Preprocessing &
Normalization in Matlab
Matlab function mapminmax() scales inputs and targets into [-1 1] range
[Y,PS] = mapminmax(X, YMIN,YMAX) processes X by normalizing the minimum and maximum
values of each row to [YMIN, YMAX]. The default values for YMIN and YMAX are
-1 and +1, respectively. PS is a struct of processing settings that then allows using the exact same
normalization to some other input Z through Y=mapminmax ('apply',Z, PS).
X = mapminmax('reverse', Y, PS) returns X, given Y and settings PS.
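A brief usage sketch (assuming X is a training input matrix and Z is new data with the same number of rows):

[Xn, PS] = mapminmax(X);               % scale each row of X to [-1, 1]
Zn = mapminmax('apply', Z, PS);        % apply the identical scaling to Z
X2 = mapminmax('reverse', Xn, PS);     % invert the scaling, recovering X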
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Target Values
Typically, two output encoding protocols are used for classification problems
From a practical point of view, it is recommended that the asymptotic values not
be used. That is, use 0.05 instead of 0, and 0.95 instead of 1.
This is because the slope of the activation function (that is, the gradient, which is proportional to Δw) approaches zero at extreme values of the input, which significantly slows down the training.
For regression (function approximation) problems, actual values are used along
with the linear activation function (‘purelin’ in Matlab) .
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Number of Hidden Units
Although the numbers of input and output layer nodes are fixed (the number of features and the number of classes, respectively), the number of hidden layer nodes, H, is a user-selectable parameter.
H defines the expressive power of the network. Typically, larger H results in a network that can
solve more complicated problems. However,
An excessive number of hidden nodes causes overfitting. This is a phenomenon where the training error can be made arbitrarily small, but the network performance on the test data is poor → poor generalization performance… No good!
Too few hidden nodes may not be able to solve more complicated problems… No good!
There is no formal procedure to determine H. Typically, it is determined by a combination of previous expertise, the amount of data available, dimensionality, the complexity of the problem, and trial and error.
A common rule of thumb is to choose H such that the total number of weights remains less than N/10, where N is the total number of training data points available.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Initializing the Weights
To promote uniform learning where all classes are learned approximately at
the same time, the weights must be initialized carefully.
A typical rule of thumb is to randomly choose the weights from a uniform
distribution according to the following limits:
A proper way to select the learning rate involves computing the second derivative of the criterion function with respect to each weight, and taking the inverse of this derivative as the learning rate:

$\eta_{opt} = \left(\dfrac{\partial^2 J(\mathbf{w})}{\partial \mathbf{w}^2}\right)^{-1}$

This dynamic learning rate is, however, computationally expensive. A good starting point is η = 0.1.
MATLAB uses an alternative dynamic learning-rate update scheme:
If $error_{new} > k \cdot error_{old}$, discard the current weight update and set $\eta = a_1 \cdot \eta$
If $error_{new} < error_{old}$, keep the current weight update and set $\eta = a_2 \cdot \eta$
Typical values: $k = 1.04,\; a_1 = 0.7,\; a_2 = 1.05$
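A sketch of this adaptive scheme in MATLAB-like form (variable names such as errorNew, errorOld, W, Wold are illustrative assumptions):

k = 1.04; a1 = 0.7; a2 = 1.05;
if errorNew > k * errorOld
    W   = Wold;           % discard the current weight update
    eta = a1 * eta;       % shrink the learning rate
elseif errorNew < errorOld
    eta = a2 * eta;       % keep the update and grow the learning rate
end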
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Problem of Local minima
[Figure: an error surface J(w) with multiple local minima; gradient descent can get trapped in a local minimum.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Momentum
If there are small plateaus in the error surface, then backpropagation can take a long time,
or even get stuck in small local minima. In order to prevent this, a momentum term is
added, which incorporates the speed at which the weights are learned. This is loosely
related to the momentum in physics – a moving object keeps moving unless prevented by
outside forces.
The momentum term simply makes the following change to the weight update rule, where α is the momentum constant:

$\mathbf{w}^{t+1} = \mathbf{w}^{t} + (1-\alpha)\,\Delta\mathbf{w}^{t} + \alpha\,\Delta\mathbf{w}^{t-1}$

If α = 0, this is the same as regular backpropagation, where the weight update is determined purely by the gradient descent.
If α = 1, the gradient descent is completely ignored, and the update is based on the 'momentum', the previous weight update. The weight update continues along the direction in which it was moving previously.
A typical value for α is between 0.5 and 0.95.
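A short sketch of the update with momentum (dW is the current gradient-descent step and dWprev the previous update; names are illustrative):

alpha  = 0.9;                                 % momentum constant
dWnew  = (1 - alpha) * dW + alpha * dWprev;   % blended update
W      = W + dWnew;
dWprev = dWnew;                               % remember for the next step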
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Batch Learning with Momentum
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Stopping Criterion
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Regularization
Regularization is the smoothing of the error curve so that the optimum
solution can be found more effectively.
One such technique is weight decay, which prevents the weights from growing too large:

$\mathbf{w}^{t+1} = (1-\varepsilon)\,\mathbf{w}^{t}, \qquad 0 < \varepsilon < 1$

The weights that do not contribute to reducing the criterion function will eventually shrink to zero; they can then be eliminated altogether. Weights that do contribute to J will not decay, however, as they keep being updated.
This is effectively equivalent to using the following criterion function with no separate weight decay, where the second term is the regularization term:

$J_{ef} = J(\mathbf{w}) + \dfrac{2\varepsilon}{\eta}\,\mathbf{w}^{T}\mathbf{w}$
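A sketch of weight decay applied after an ordinary update (epsilon and the other names are illustrative):

epsilon = 1e-4;
W = W + eta * dW;         % the usual gradient-based update
W = (1 - epsilon) * W;    % decay step: w_new = (1 - eps) * w_old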
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Feedforward
networks in Matlab
The process for training / testing / using neural networks in Matlab includes
the following steps:
Collect data (duh!)
Create the network (duh!)2
Configure the network
Initialize the weights and biases
Train the network
Validate and use the network
The main functions to use an MLP in Matlab are:
feedforwardnet, which creates the network architecture, 𝑛𝑒𝑡
configure, which sets up the parameters of the 𝑛𝑒𝑡
train, which configures and trains the network
sim, which simulates the network by computing its outputs for a given test dataset
perform, which computes the performance of the network on test data whose labels are known.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a network
Matlab uses the network object to store all of the information that defines a NN.
The network object includes the structure of the network (how many layers, how
many nodes in each layer), as well as many configurable parameters, such as the
weights and biases.
The fundamental building block is the neuron, represented as follows, where p is
input, w is the weight (vector) and b is the bias associated with an extra input of
fixed value 1. Matlab uses
a weight function to determine how the weights are applied (for MLPs this is the dot product, $\mathbf{w}^{T}\mathbf{x}$; for RBFs it can be a distance function, e.g. $\|\mathbf{w} - \mathbf{x}\|$),
a net input function, which for MLPs simply adds the bias to the weighted sum of the inputs; and
a transfer (activation) function to act as the nonlinear thresholding, which for MLPs is typically the logistic or tangential sigmoid.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a network
An abbreviated version of this model is shown as follows:
S: # of neurons
R: # of inputs
The transfer functions are typically indicated with diagrams indicating the
nature of the function being used, for example:
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
A single-layer network of S logsig neurons having R inputs is shown below
in full detail on the left and with a layer diagram on the right
Image Source: Matlab Neural Network Toolbox, User’s Guide
http://www.mathworks.com/products/neuralnet/
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
A two-layer (single hidden layer) network layer diagram
[Layer diagram annotations: number of inputs, number of bias terms, number of hidden (layer 1) nodes, number of output (layer 2) nodes, and the outputs of hidden layer 1 and layer 2. IW: input weights, LW: layer weights.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Matlab’s Model
of a Feed forward network
Similarly, if you have three layers…
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR feedforwardnet
feedforwardnet Feedforward neural network (Neural Networks Toolbox)
net = feedforwardnet(hiddenSizes, trainFcn) creates a network net, where hiddenSizes is a row vector of one or more
hidden layer sizes (default = 10) and trainFcn indicates the training function (default trainlm) to be used with this network.
Specialized versions of the feedforward network include fitting (fitnet) and pattern recognition (patternnet) networks. A
variation on the feedforward network is the cascade forward network (cascadeforwardnet) which has additional
connections from the input to every layer, and from each layer to all following layers.
Examples
[x,t] = simplefit_dataset;     % load a simple curve-fitting dataset
net = feedforwardnet(10);      % one hidden layer with 10 nodes
net = train(net,x,t);          % train with the default trainlm
view(net)                      % display the network diagram
y = net(x);                    % network outputs for the inputs
perf = perform(net,y,t)        % performance (default: mse)
Note that feedforwardnet() automatically applies removeconstantrows() and mapminmax() to both the inputs and
outputs (target values).
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net = feedforwardnet(10)
[Screenshot: the network object created by feedforwardnet, listing its properties, including the total number of weights.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.inputs
This property holds structures of properties for each of the network's inputs. It is
always an 𝑁𝑖 x 1 cell array of input structures, where 𝑁𝑖 is the number of network
inputs (net.numinputs, which in our case will always be 1).
To access these properties, type net.inputs{1} which will return the following
properties (your numbers may be different depending on your network)
To change the transfer function to logsig, for example, you could execute the command:
net.layers{1}.transferFcn = 'logsig'
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.outputs
This property holds structures of properties for each of the network's outputs. It is always a
1 x 𝑁𝑙 cell array, where 𝑁𝑙 is the number of network outputs (net.numOutputs).
Also try:
To remove the default output processing functions: net.outputs{2}.processFcns={}
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR net.biases; net.inputWeights; net.layerWeights
These properties hold the structures of properties for each of the network’s biases, input or
layer weights.
For biases, there is one structure for each layer, which include the following properties:
initFcn, learn (whether the biases should be learned), learnFcn (which learning function to use to learn the bias values; default learngdm), learnParam (.lr: initial learning rate and .mc: momentum constant), and size.
For input and layer weights:
The numbers of input and output weights are read from the size of the input and target data in a subsequent configure() or train() call.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Weights & Biases
The actual values of the weights and biases can be obtained by calling
net.IW, net.LW and net.b.
Note that due to their cell structures, you actually need to call them with
the following cell array indices:
net.IW{1}; net.LW{2,1} and net.b{1} or net.b{2}
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Network Functions
A variety of functions control how the network is trained and how it behaves
Adaptation function, generally used for incremental (online, one-instance-at-a-time) learning
Type of derivative / gradient to be used
Type of data division to be used for validation: dividerand is random division (default)
Ratio of data to be used for training, validation and test partitions
Defines how the initialization of weights (net.IW, net.LW) and biases (net.b) are to be done
Objective /cost function; mse: mean square error (default)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Parameters
for traingdx
traingdx is a gradient descent back propagation algorithm with momentum
term and adaptive learning rate, with the following parameters and their
default values
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training Parameters
for trainlm
trainlm is the Levenberg-Marquardt backpropagation. It is one of the fastest BP
algorithms, but also the one that requires the most memory. It is the default
training algorithm for feedforward networks, using the following parameters and
default values.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Network Methods
For any given network object net, the following methods are available
Finally, for any given set of test_data, you can obtain the outputs of the network
net by invoking outputs=net(test_data)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training & Testing The
Network
train Train neural network (Neural Network Toolbox)
[net,tr] = train(net,P,T,Pi,Ai) trains a network net according to net.trainFcn and net.trainParam of the network object net,
where P is the network input (matrix or cell array in column format), T is the network targets (in one-hot format), and Pi
and Ai are the input and layer delays (not used for MLPs). It returns the trained network net and training record tr, which
includes the performances.
[Y,Pf,Af,E,perf] = sim(net,P,Pi,Ai,T) simulates the network net, using input data P (typically test data, again in column format), target values T, as well as the delays (for TDNNs) Pi and Ai (not used for MLPs). The function returns the network outputs Y, network errors E and the network performance perf, along with the final input and layer delays Pf and Af (not used for MLPs).
If target values for the test data are not known, and we simply want to obtain the outputs of the network for the input P, use output=net(P);
If target values T are known for the test data, the performance of the network can also be obtained by the perform function, by calling perf = perform(net, T, net(P)).
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training MLP In Matlab
net = feedforwardnet(10, 'traingdx');
net.inputs{1}.processFcns={};   %Remove the default input processing functions of
                                %min/max normalization, fixing unknowns and removing repeat instances
net.outputs{2}.processFcns={};  %Remove the default output processing functions
%net.divideFcn='';  %This removes the default data partitioning (normally 60%, 20%, 20%)
net.divideParam.trainRatio=TR_ratio;
net.divideParam.valRatio=V_ratio;
net.divideParam.testRatio=T_ratio;
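A hedged continuation of the configuration above, showing how such a network might then be trained and evaluated; P (inputs, columns = instances) and T (targets) are assumed to exist, and the three ratios must sum to 1.

[net, tr] = train(net, P, T);    % tr holds the training record / performance curves
Y    = net(P);                   % network outputs for the inputs
perf = perform(net, T, Y);       % performance under net.performFcn (default: mse)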
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Test on OCR Data
Confusion Matrix
[Figure: 10-class confusion matrix for the OCR test data (output class vs. target class); per-class accuracies range from 84.1% to 100%, with an overall accuracy of 94.8%.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Some issues to consider
If the network inputs and outputs are preprocessed with the mapminmax function, all input and output values will be normalized to the [-1 1] range.
Then the logsig function is not suitable for the output layer. Why?
In this case, use one of the following
• Logsig at the hidden layer, tansig or purelin at the output layer
• Tansig at both hidden and output layers
• Tansig at the hidden and purelin at the output layer.
Always check your input arguments. Matlab expects the data to be in the
columns – don’t screw this up!
You can create your own partitioning using the dividerand() function
Recall: the network created by Matlab is a “neural network object” – similar
to a struct. It has a mindboggling set of parameters. Type “net” at the
command prompt and investigate its components.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nntool
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nntool
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Train / Validation / test
trainmlp_validation.m
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR If you set TR_error=0
Overfitting
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainmlp_validation.m Examples
Spiral data
[Figures: the spiral training data and the MLP-based classification of the test data. One hidden layer with N = 50 nodes, tansig (sigmoid) activation, trained with traingdx.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Second Order Methods
The gradient descent is based on the minimization of the criterion function
using the first order derivative
Methods that make use of second-order derivatives typically find the solution much faster than first-order methods.
Newton’s Methods
• Levenberg Marquardt Backpropagation
Quick prop
Conjugate Gradient Methods
• Fletcher-Reeves
• Polak-Ribiere
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Newton’s Method
The weight update rule uses the Hessian matrix H, which contains the second derivatives of the criterion function with respect to the weights, $H_{ij} = \partial^2 J(\mathbf{w})\,/\,\partial w_i\,\partial w_j$:
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Newton’s Method
The weight update rule in Newton’s method is then:
$\Delta\mathbf{w} = -\mathbf{H}^{-1}\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} \qquad\Longrightarrow\qquad \mathbf{w}_{k+1} = \mathbf{w}_{k} - \mathbf{H}^{-1}\,\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The Levenberg – Marquardt
Backpropagation
The LM method uses the 𝐇 = 𝐉 𝑇 𝐉 approximation for the Hessian, with the
gradient computed as 𝐠 = 𝐉 𝑇 𝐞 where 𝐉 is the Jacobian matrix containing
the first derivatives of the network errors (not to be confused with the
criterion function 𝐽), and e is the vector of network errors. The weight
update rule is then
$\mathbf{w}_{k+1} = \mathbf{w}_{k} - \left(\mathbf{J}^{T}\mathbf{J} + \mu\mathbf{I}\right)^{-1}\mathbf{J}^{T}\mathbf{e}$
where 𝜇 is a constant which switches the algorithm back and forth between
a regular gradient descent (when 𝜇 is large) and quasi-Newton’s method
(when 𝜇 is zero). Typically, 𝜇 is decreased as the network gets closer to the
solution, since in that region Newton’s algorithm is most efficient.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Quick-Prop
Simple and fast second order method that does not even require a
second order derivative calculation !!!
Assumptions:
Weights are independent → descent is optimized separately for each weight!
The error surface is quadratic.
The weight update rule is given by

$\Delta w_{k+1} = \Delta w_{k}\,\dfrac{\left.\frac{dJ}{dw}\right|_{k}}{\left.\frac{dJ}{dw}\right|_{k-1} - \left.\frac{dJ}{dw}\right|_{k}}$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
Another group of second order algorithms that do not require Hessian
computation are the family of Conjugate Gradient Descent methods.
Pairs of directions (vectors) that satisfy $\Delta\mathbf{w}_{k-1}^{T}\,\mathbf{H}\,\Delta\mathbf{w}_{k} = 0$ are called H-conjugate, meaning that the vectors $\Delta\mathbf{w}_{k-1}$ and $\Delta\mathbf{w}_{k}$ are non-interfering with respect to H. If H is proportional to the identity matrix, then conjugate directions are orthogonal to each other.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
Conjugate Gradient Descent:
1. Start with the steepest descent direction: $\Delta\mathbf{w}_0 = -\nabla J(\mathbf{w}_0)$
2. At the k-th update, perform a line search to determine the optimal distance to move along the current direction $\Delta\mathbf{w}_k$ (equivalent to determining the optimum learning rate). Call this amount $\alpha_k$.
3. Move along this direction by the amount $\alpha_k$: $\mathbf{w}_{k+1} = \mathbf{w}_{k} + \alpha_k\,\Delta\mathbf{w}_{k}$
4. The next search direction is then conjugate to the previous search direction. Compute the conjugate direction by $\Delta\mathbf{w}_{k} = -\nabla J(\mathbf{w}_{k}) + \beta_k\,\Delta\mathbf{w}_{k-1}$
5. The various versions of the conjugate gradient descent algorithm are distinguished by the way the constant $\beta_k$ is computed:

Fletcher-Reeves update – 'traincgf': $\beta_k = \dfrac{\nabla J(\mathbf{w}_k)^{T}\,\nabla J(\mathbf{w}_k)}{\nabla J(\mathbf{w}_{k-1})^{T}\,\nabla J(\mathbf{w}_{k-1})}$

Polak-Ribiere update – 'traincgp': $\beta_k = \dfrac{\nabla J(\mathbf{w}_k)^{T}\left[\nabla J(\mathbf{w}_k) - \nabla J(\mathbf{w}_{k-1})\right]}{\nabla J(\mathbf{w}_{k-1})^{T}\,\nabla J(\mathbf{w}_{k-1})}$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Conjugate
Gradient Descent
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Gradient Descent vs.
Conjugate Gradient
[Figure: comparison of the convergence paths of gradient descent and conjugate gradient descent on an error surface.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Neural Network
structures
Radial Basis Function (RBF) Networks: Similar in architecture to the MLP, but with a different learning rule. Typically used for function approximation, though capable of solving classification problems as well.
Matlab: newrb (train_data, targets, error_goal, spread)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR
Dr. Robi Polikar
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Function Approximation
[Figure: scattered sample points (*) and a question mark – given the samples, what is the underlying function?]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Recall: Universal
Approximator
Classification can be thought of as a special case of function approximation:
For a three class problem:
[Figure: a classifier maps the input x = (x1, …, xd) to one of three outputs — Class 1: 1 or [1 0 0], Class 2: 2 or [0 1 0], Class 3: 3 or [0 0 1].]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Radial Basis Function
Neural Networks
The RBF networks, just like MLP networks, can therefore be used for classification and/or function approximation problems.
The RBFs, which have an architecture similar to that of MLPs, achieve this goal using a different strategy:
[Figure: an input layer, a nonlinear transformation layer that generates local receptive fields, and a linear output layer.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Nonlinear Receptive Fields
The hallmark of RBF networks is their use of nonlinear receptive fields
The receptive fields nonlinearly transform (map) the input feature space, in which the input patterns are not linearly separable, to the hidden-unit space, in which the mapped inputs may be linearly separable.
The hidden unit space often needs to be of a higher dimensionality
Cover’s Theorem (1965) on the separability of patterns: A complex pattern
classification problem that is nonlinearly separable in a low dimensional space, is
more likely to be linearly separable in a high dimensional space.
We will see this concept again with SVM soon.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR The (you guessed it right) XOR Problem
Consider the nonlinear functions that map the input vector $\mathbf{x} = [x_1\ x_2]^T$ to the $\phi_1$-$\phi_2$ space:

$\phi_1(\mathbf{x}) = e^{-\|\mathbf{x}-\mathbf{t}_1\|^2}, \quad \mathbf{t}_1 = [1\ 1]^T$
$\phi_2(\mathbf{x}) = e^{-\|\mathbf{x}-\mathbf{t}_2\|^2}, \quad \mathbf{t}_2 = [0\ 0]^T$

Input x    φ1(x)    φ2(x)
(1,1)      1        0.1353
(0,1)      0.3678   0.3678
(1,0)      0.3678   0.3678
(0,0)      0.1353   1

[Figures: the XOR patterns in the (x1, x2) plane, and the same patterns mapped into the (φ1, φ2) plane, where a single line now separates the two classes.]

The nonlinear functions transformed a nonlinearly separable problem into a linearly separable one!
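A short MATLAB check (not from the slides) that reproduces the table above; it uses implicit expansion (R2016b or later).

X  = [0 0; 0 1; 1 0; 1 1];                 % the four XOR patterns (rows)
t1 = [1 1];  t2 = [0 0];                   % the two RBF centers
phi1 = exp(-sum((X - t1).^2, 2));          % phi_1(x) = exp(-||x - t1||^2)
phi2 = exp(-sum((X - t2).^2, 2));          % phi_2(x) = exp(-||x - t2||^2)
disp([X phi1 phi2])                        % the mapped patterns are linearly separable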
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Initial Assessment
Using nonlinear functions, we can convert a nonlinearly separable problem
into a linearly separable one.
From a function approximation perspective, this is equivalent to
implementing a complex function (corresponding to the nonlinearly
separable decision boundary) using simple functions (corresponding to the
linearly separable decision boundary)
Implementing this procedure using a network architecture, yields the RBF
networks, if the nonlinear mapping functions are radial basis functions.
Radial Basis Functions:
Radial: Symmetric around its center
Basis Functions: Also called kernels, a set of functions whose linear combination
can generate an arbitrary function in a given function space.
Hence, the radial basis functions are in fact functions of a distance: the RBF centered at μ attains its maximum when the data point x equals μ (zero distance), and its response becomes symmetrically smaller as x moves away from μ.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Networks
[Figure: an RBF network with d input nodes, H hidden-layer receptive fields φ1 … φH, and c output nodes z1 … zc; the first-layer weights u_Ji are the RBF centers (U = X^T) and w_kj are the hidden-to-output weights. Inset: an example radial basis (Gaussian) transfer function.]

$\phi\!\left(net_J\right) = \phi\!\left(\|\mathbf{x}-\mathbf{u}_J\|\right) = e^{-\left(\frac{\|\mathbf{x}-\mathbf{u}_J\|}{\sigma}\right)^{2}}$, where σ is the spread constant.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principle of Operation
[Figure: hidden node J computes the Euclidean norm of (x − u_J) and passes it through the radial basis function φ; output node K computes a weighted sum of the hidden outputs.]

$y_J = \phi\!\left(net_J\right) = e^{-\left(\frac{\|\mathbf{x}-\mathbf{u}_J\|}{\sigma}\right)^{2}}$, where σ is the spread constant.

$z_K = f\!\left(net_K\right) = f\!\left(\sum_{j=1}^{H} w_{Kj}\, y_j\right) = \sum_{j=1}^{H} w_{Kj}\, y_j$ (linear output layer)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Principle of Operation
What do these parameters represent?
Physical meanings:
• 𝜙: The radial basis function for the hidden layer. This is a simple nonlinear mapping
function (typically Gaussian) that transforms the d- dimensional input patterns to a
(typically higher) H-dimensional space. The complex decision boundary will be
constructed from linear combinations (weighted sums) of these simple building
blocks.
• 𝑢𝑗𝑖 : The weights joining the first to hidden layer. These weights constitute the
center points of the radial basis functions. Also called prototypes of data.
• σ: The spread constant(s). These values determine the spread (extent) of each radial basis function.
• 𝑤𝑗𝑘: The weights joining hidden and output layers. These are the weights which are
used in obtaining the linear combination of the radial basis functions. They
determine the relative amplitudes of the RBFs when they are combined to form the
complex function.
• ‖𝐱 − 𝐮𝐽 ‖: the Euclidean distance between the input 𝐱 and the prototype vector 𝐮𝐉.
Activation of the hidden unit is determined according to this distance through 𝜙
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Principle of Operation
[Figure: the function implemented by the RBF network is a weighted sum of radial basis transfer functions φ_J, centered at the training data instances u_J, with w_J giving the relative weight of the J-th RBF.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR How to Train?
There are various approaches for training RBF networks.
Approach 1: Exact RBF – Guarantees correct classification of all training
data instances. Requires N hidden layer nodes, one for each training
instance. No iterative training is involved. RBF centers (u) are fixed as
training data points, spread as variance of the data, and w are obtained by
solving a set of linear equations (In Matlab: newrbe() )
Approach 2: Fixed centers selected at random. Uses 𝐻 < 𝑁 hidden layer
nodes. No iterative training is involved. Spread is based on Euclidean
metrics, w are obtained by solving a set of linear equations.
Approach 3: Centers are obtained from unsupervised learning
(clustering). Spreads are obtained as variances of clusters, w are obtained
through LMS algorithm. Clustering (k-means) and LMS are iterative. This
is the most commonly used procedure. Typically provides good results.
Approach 4: All unknowns are obtained from supervised learning.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
Exact RBF
The first layer weights u are set to the training data; 𝑼 = 𝑿𝑻. That is, the
Gaussians are centered at the training data instances.
The spread is chosen as $\sigma = \dfrac{d_{max}}{\sqrt{2N}}$, where $d_{max}$ is the maximum Euclidean distance between any two centers, and N is the number of training data points. Note that H = N for this case.
The output of the k-th RBF output neuron is then

$z_k = \sum_{j=1}^{N} w_{kj}\,\phi\!\left(\|\mathbf{x}-\mathbf{u}_j\|\right)$ (multiple outputs) $\qquad\qquad z = \sum_{j=1}^{N} w_{j}\,\phi\!\left(\|\mathbf{x}-\mathbf{u}_j\|\right)$ (single output)
During training, we want the outputs to be equal to our desired targets. Without
loss of any generality, assume that we are approximating a single dimensional
function, and let the unknown true function be 𝑓(𝒙). The desired (target) output
for each input is then 𝑡𝑖 = 𝑓(𝒙𝑖), 𝑖 = 1, 2, … , 𝑁.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)
Define: $\mathbf{d} = [t_1, t_2, \cdots, t_N]^T$, $\quad \mathbf{w} = [w_1, w_2, \cdots, w_N]^T$, $\quad \mathbf{\Phi} = \{\phi_{ij}\},\ i,j = 1,2,\ldots,N$. Then $\mathbf{\Phi}\cdot\mathbf{w} = \mathbf{d} \;\Rightarrow\; \mathbf{w} = \mathbf{\Phi}^{-1}\mathbf{d}$
Is this matrix always invertible?
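A minimal sketch of exact RBF interpolation for one-dimensional data (x and d are assumed N x 1 vectors of inputs and targets; sigma is an illustrative spread; implicit expansion requires R2016b or later):

N     = numel(x);
sigma = 1;                                  % illustrative spread
D     = abs(x - x');                        % pairwise distances ||x_i - x_j||
Phi   = exp(-(D / sigma).^2);               % phi_ij
w     = Phi \ d;                            % solve Phi * w = d for the output weights
yhat  = Phi * w;                            % reproduces the targets at the training points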
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 1
(Cont.)
• Multiquadrics: $\phi(r) = \left(r^2 + c^2\right)^{1/2}$, where $r = \|\mathbf{x}-\mathbf{x}_j\|$
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach1
(Cont.)
Gaussian RBFs are localized functions – unlike the sigmoids used by MLPs!
[Figures: approximation using Gaussian radial basis functions vs. using sigmoidal basis functions.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Exact RBF Properties
Using localized functions typically makes RBF networks more suitable for
function approximation problems. Why?
Since first layer weights are set to input patterns, second layer weights are
obtained from solving linear equations, and spread is computed from the
data, no iterative training is involved !!!
Guaranteed to correctly classify all training data points!
However, since we are using as many receptive fields as the number of data points, the solution is overdetermined if the underlying physical process does not have as many degrees of freedom → overfitting!
The importance of σ: too small a σ will also cause overfitting; too large a σ will fail to characterize rapid changes in the signal.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Too many
Receptive Fields?
In order to reduce the artificial complexity of the RBF, we need to use a smaller number of receptive fields.
How about using a subset of the training data, say M < N of them?
These M data points will then constitute M receptive field centers.
How to choose these M points…? At random → Approach 2.
$y_j = \phi_{ij} = \phi\!\left(\|\mathbf{x}_i-\mathbf{x}_j\|^2\right) = e^{-\frac{M}{d_{max}^2}\|\mathbf{x}_i-\mathbf{x}_j\|^2}, \quad i = 1,2,\ldots,N,\ \ j = 1,2,\ldots,M, \qquad \sigma = \dfrac{d_{max}}{\sqrt{2M}}$
Output layer weights are determined as they were in Approach 1, through solving a
set of M linear equations!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 3
K-Means - Unsupervised
Clustering - Algorithm
Choose number of clusters, M
Initialize M cluster centers to the first M training data points: 𝐭𝑘 = 𝐱𝑘 , 𝑘 = 1,2, … , 𝑀.
Repeat
At iteration n, group all patterns with the cluster whose center is closest:

$C(\mathbf{x}) = \arg\min_{k}\|\mathbf{x}(n) - \mathbf{t}_k(n)\|, \quad k = 1,2,\ldots,M$, where $\mathbf{t}_k(n)$ is the center of the k-th RBF at the n-th iteration.

Compute the centers of all clusters after the regrouping:

$\mathbf{t}_k = \frac{1}{M_k}\sum_{j=1}^{M_k}\mathbf{x}_j$, where $\mathbf{t}_k$ is the new cluster center for the k-th RBF, $M_k$ is the number of instances in the k-th cluster, and the sum runs over the instances $\mathbf{x}_j$ grouped into the k-th cluster.
Until there is no change in cluster centers from one iteration to the next.
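A sketch of the clustering step using MATLAB's kmeans (Statistics and Machine Learning Toolbox); Xtr is an assumed N x d training matrix, and the per-cluster variances provide the spreads.

M = 10;                                      % number of receptive fields (illustrative)
[idx, centers] = kmeans(Xtr, M);             % centers: M x d RBF prototypes t_k
spreads = zeros(M, 1);
for k = 1:M
    spreads(k) = mean(var(Xtr(idx == k, :), 0, 1));   % average per-feature variance of cluster k
end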
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Approach 4:
Supervised
RBF Training
This is the most general form.
All parameters, receptive field centers (first layer weights), output layer weights
and spread constants, are learned through iterative supervised training using LMS
/ gradient descent algorithm.
$\mathcal{E} = \sum_{j=1}^{N} e_j^{2}, \qquad e_j = t_j - \sum_{i=1}^{M} w_{i}\,\phi\!\left(\|\mathbf{x}_j-\mathbf{t}_i\|\right)$

and the update equations for the centers, spreads, and weights follow from gradient descent on this criterion.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF
Similarities
Both are universal approximators: they can approximate an arbitrary function of arbitrary
dimensionality and arbitrary complexity, provided that the number of hidden layer units are
sufficiently large, and there is sufficient training data.
Differences
MLP generates more global decision regions, as opposed to RBF generating more local
decision regions
MLPs partition the feature space into hyperplanes, whereas RBFs partition the space into hyperellipsoids.
MLPs are more likely to battle with local minima and flat valleys than RBFs, and hence in general have longer training times.
Since MLPs generate global decision regions, they do better at extrapolating, that is, classifying instances that are outside of the feature space represented by the training data. It should be noted, however, that extrapolating may mean dealing with outliers.
MLPs typically require fewer parameters than RBFs to approximate a given function with the same accuracy.
(From R. Gutierrez)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF (Cont.)
Differences (cont.)
All parameters of the MLP are trained simultaneously, whereas RBF parameters can be
trained separately in an efficient hybrid manner
RBFs have one and only one hidden layer, whereas MLPs can have multiple hidden layers.
The hidden neurons of an MLP compute the inner product between an input vector and the
weight vector. RBFs instead compute the Euclidean distance between the input vector and
the radial basis function centers.
The hidden layer of an RBF is nonlinear and its output layer is linear, whereas an MLP
typically has both layers as nonlinear. This really is more of a historic preference based on
empirical success. MLPs typically do better on classification type problems, and RBFs
typically do better on regression / function approximation type problems.
(From R. Gutierrez)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR MLP vs. RBF
(From R. Gutierrez)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF Networks
in Matlab
$radbas(x) = e^{-x^{2}}$

The RBF accepts a distance between the input p and the weight vector w. As the distance between w and p decreases, the RBF output increases (toward its maximum of 1). Hence

$a = radbas\!\left(\|\mathbf{w}-\mathbf{p}\|\, b\right)$

where the bias b allows the sensitivity of the neuron to be adjusted.
[Figure: the radbas transfer function, and the layer diagram of a two-layer RBF network – an S1-neuron radbas layer with weights IW{1,1} and bias b1, followed by an S2-neuron linear layer with weights LW{2,1} and bias b2, producing y = a2.]
Examples
Here you design a radial basis network, given inputs P and targets T.
P = [1 2 3]; T = [2.0 4.1 5.9]; net = newrb(P,T);
The network is simulated for a new input.
P = 1.5; Y = sim(net,P)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Other Variations
newgrnn – Create a generalized regression neural network. Consists of two
layers, the first is identical to that of RBF, whereas the second layer has a
slight variation of purelin layer. More suitable for function approximation
problems.
newpnn – Create a probabilistic neural network. Also has two layers, with
the first being a RBF, whereas the second normalizes the first layer outputs
and passes them through a competitive function that picks the largest
output. Most suited for classification problems. This function essentially creates a version of kNN, with the distances computed through the radial basis function.
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR trainrbf.m RBF Matlab Demo
%trainrbf: Trains and simulates a RBF and GRNN network on a synthetic
%dataset - Originally written 2003, Updated 10/2013
%Robi Polikar
load arb_function3.csv;
X=arb_function3(:,1);
Y=arb_function3(:,2);
size(X); N=length(X);
X=X(1:4:N); Y=Y(1:4:N);
%Sub sample the data at a rate of 1:5 to create training data
P1=X(1:5:length(X))';
T1=Y(1:5:length(Y))';
subplot(411)
plot(X,Y); grid
title('Original function')
subplot(412)
plot(P1,T1); grid
title('Training Data');
subplot(413)
plot(X,out_rb); grid
title('RBF approximation of the original data')
subplot(414)
plot(X,out_grnn); grid
title('GRNN approximation of the original data')
%(the newrb/newgrnn creation and sim calls that produce out_rb and out_grnn were not shown on this slide)

[Figure: the resulting four subplots – original function, training data, RBF approximation, and GRNN approximation.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
[Figure: the original function, the training data, and the RBF approximation. Using an error goal of 1 and a spread of 5, the network required 1200 neurons, using 2000 points for training.]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
The Peaks!
$z = 3(1-x)^{2}\, e^{-x^{2}-(y+1)^{2}} \;-\; 10\left(\tfrac{x}{5} - x^{3} - y^{5}\right) e^{-x^{2}-y^{2}} \;-\; \tfrac{1}{3}\, e^{-(x+1)^{2}-y^{2}}$

[Figure: 3-D mesh of the training data (the peaks surface).]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
The Peaks!
RBF parameters
Error goal: 0.1
Spread: 1
Training function: fully supervised
MLP parameters:
Number of hidden layers: 2
Number of nodes in each layer: 25
Error goal: 0.01; (does not reach in 1000 iterations)
Training function: traingdx or trainlm
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR RBF vs. MLP
rbf_mlp_peaks
The Peaks!
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR nnstart
For a very good introduction, play with nnstart
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Selecting Data
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Setting Up The Network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Training
Check the dimensionality
Check out the error histogram, confusion matrix and ROC curve once training is completed (the corresponding plot commands are sketched below).
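The nnstart/nprtool GUI generates these plots, but they can also be produced from the command line; a minimal sketch, assuming a trained net, its targets, and the training record tr returned by train:

outputs = net(inputs);                    % network responses
errors  = gsubtract(targets, outputs);
figure, plotperform(tr)                   % train/validation/test error vs. epoch
figure, ploterrhist(errors)               % error histogram
figure, plotconfusion(targets, outputs)   % confusion matrix
figure, plotroc(targets, outputs)         % class-specific ROC curves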
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Performance Plots
Best Validation Performance is 0.01575 at epoch 84
[Figure: mean squared error (mse) vs. epoch for the Train, Validation and Test splits over 90 epochs, with the best-validation point marked]
[Figure: 'Error Histogram with 20 Bins' - instances of training, validation and test errors (Errors = Targets - Outputs), with the zero-error line marked]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR ROC Curves
(Class Specific)
[Figure: Training ROC and Validation ROC panels - true positive rate vs. false positive rate, one curve per class (Class 1, Class 2, Class 3, ...)]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Confusion Matrices - I
[Figure: two 10-class confusion matrix plots (output class vs. target class), showing per-cell counts and percentages with class-wise precision and recall in the margins; overall accuracies are 96.1% and 89.6%]
PR Confusion Matrices - II
[Figure: two more 10-class confusion matrix plots (output class vs. target class); overall accuracies are 89.4% and 91.5%]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Evaluate The Network
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Evaluation Results
On Test Data
[Figure: confusion matrix on the test data (output class vs. target class, 10 classes, overall accuracy 91.9%) and the corresponding class-specific ROC curves (true positive rate vs. false positive rate, Class 1 through Class 10)]
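The same test-set figures can be reproduced programmatically; a minimal sketch, assuming held-out testInputs/testTargets matrices (illustrative names) and the trained net:

testOutputs = net(testInputs);
[err, cm] = confusion(testTargets, testOutputs);   % err = fraction misclassified, cm = count matrix
fprintf('Test accuracy: %.1f%%\n', 100*(1 - err));
figure, plotconfusion(testTargets, testOutputs)
figure, plotroc(testTargets, testOutputs)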
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR What Else Can it do,
you Ask…?
Prepare to be impressed,
and then click here
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR …and here is the script…
% Solve a Pattern Recognition Problem with a Neural Network
% Script generated by NPRTOOL - Created Thu Oct 10 02:48:04 EDT 2013
% This script assumes these variables are defined:
%   opt_train - input data
%   opt_class - target data
inputs  = opt_train;
targets = opt_class;

% Create a Pattern Recognition Network
hiddenLayerSize = 20;
net = patternnet(hiddenLayerSize);

% Choose Input and Output Pre/Post-Processing Functions
% For a list of all processing functions type: help nnprocess
net.inputs{1}.processFcns  = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};

% Setup Division of Data for Training, Validation, Testing
% For a list of all data division functions type: help nndivide
net.divideFcn  = 'dividerand';   % Divide data randomly
net.divideMode = 'sample';       % Divide up every sample
net.divideParam.trainRatio = 30/100;
net.divideParam.valRatio   = 35/100;
net.divideParam.testRatio  = 35/100;

% For help on training function 'trainlm' type: help trainlm
% For a list of all training functions type: help nntrain
net.trainFcn = 'trainlm';   % Levenberg-Marquardt

% Choose a Performance Function
% For a list of all performance functions type: help nnperformance
net.performFcn = 'mse';     % Mean squared error

% Choose Plot Functions
% For a list of all plot functions type: help nnplot
net.plotFcns = {'plotperform','plottrainstate','ploterrhist','plotregression','plotfit'};

% Train and test the Network
[net,tr] = train(net,inputs,targets);
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs)

% Recalculate Training, Validation and Test Performance
trainTargets = targets .* tr.trainMask{1};
valTargets   = targets .* tr.valMask{1};
testTargets  = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs);
valPerformance   = perform(net,valTargets,outputs);
testPerformance  = perform(net,testTargets,outputs)
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR …And the simulink model
[Figure: Simulink block diagram of the trained network, with input x1 and output y1]
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ
PR Midterm Project II
Oct 24
Option 1 (easier, but close ended – intended for undergraduate students) - Pick 3 of the
more challenging datasets (from the UCI repository) that you used in Midterm Project I, and
design two function approximation (with noise) problems
Train and test using the following network structures: MLP, RBF, PNN, LVQ, GRNN (decide which network is suitable for which problem: classification vs. regression).
Investigate different parameters, learning algorithms, etc. and tabulate your results. Can you make any generalizations with respect to accuracy, speed, network size, etc.?
Use proper cross validation and statistical tests to compare the algorithms on the datasets. Provide the principles of operation for each of the networks, including PNN, LVQ and GRNN.
UG students can instead do option 2 for 20% additional credit.
Option 2 – intended for grad students (open ended problem – publication opportunity)
You will be given a dataset of EEG data with two classes: Alzheimer’s disease and normal. The dataset
includes raw data as well as 152 different feature sets obtained from 71 subjects. Your goal is to devise
and implement a rigorous approach to determine which feature sets provide the best diagnostic
accuracy, using appropriate cross-validation techniques. Provide a list of these feature sets in descending
order of accuracy, reporting sensitivity, specificity and positive predictive value, along with their
confidence intervals (a sketch for computing these metrics is given below).
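One way (a minimal sketch, not a prescribed method) to compute the requested metrics and 95% normal-approximation confidence intervals from a 2x2 confusion matrix in MATLAB; the counts are illustrative placeholders:

TP = 30; FN = 5; FP = 4; TN = 32;                 % cross-validated counts (illustrative)
sensitivity = TP / (TP + FN);                     % true positive rate
specificity = TN / (TN + FP);                     % true negative rate
ppv         = TP / (TP + FP);                     % positive predictive value
ci = @(p, n) p + [-1 1] * 1.96 * sqrt(p*(1-p)/n); % 95% Wald interval for a proportion
sens_ci = ci(sensitivity, TP + FN);
spec_ci = ci(specificity, TN + FP);
ppv_ci  = ci(ppv,         TP + FP);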
Computational Intelligence & Pattern Recognition © 2001- 2013, Robi Polikar, Rowan University, Glassboro, NJ