Supervised Learning I: Perceptrons and LMS
Information is downloaded into LTMs for more permanent storage: days, months, and years.
Capacity of LTMs is very large.
Recall process of these memories is distinct from the memories themselves.
Important Aspects of Human
Memory
Recall of recent memories is more easily disrupted than that of
older memories.
Memories are dynamic and undergo continual change; with time, all memories fade.
STM results in:
physical changes in sensory receptors
simultaneous and cohesive reverberation of neuron circuits
Long-term memory involves plastic changes in the brain which take the form of:
strengthening or weakening of existing synapses
the formation of new synapses
The learning mechanism distributes the memory over different areas.
This makes it robust to damage.
It permits the brain to work easily from partially corrupted information.
Reflexive and declarative memories may actually involve different
neuronal circuits.
From Synapses to Behaviour: The
Case of Aplysia
Aplysia has a respiratory
gill that is housed in the
mantle cavity (which is a
respiratory chamber
covered by the mantle
shelf) on the dorsal side of
the mollusc.
The mantle forms a spout
at the dorsal end which
usually protrudes between
the parapodia which are
wing-like extensions of the
body wall.
Gill–Siphon Withdrawal (GSW)
Reflex
When a tactile stimulus is applied to the
siphon or the mantle shelf, two reflexes
are initiated:
the siphon contracts behind the parapodia
the gill is withdrawn into the mantle cavity.
The gill–siphon withdrawal (GSW) reflex
is an example of an innate defensive
reflex.
Cellular Mechanism of GSW
Habituation
Figure: neural circuit of the GSW reflex — siphon skin → SN → IN and MN → gill.
SN: Sensory Neuron; IN: Interneuron; MN: Motor Neuron.
With repeated presentation of the stimulus, the potentials produced by the interneuron and motor neuron cells gradually become weaker, and the strength of the gill withdrawal reflex is reduced.
A training set T = {(Xk, Dk)}, k = 1, …, Q, pairs an input vector Xk ∈ Rn with a desired output Dk. The pairs are samples of an unknown function f: Rn → Rp which is to be characterized.
Figure: scatter of data points around the function f(x) = 3x^5 − 1.2x^4 − 12.27x^3 + 3.288x^2 + 7.182x on the interval [−2, 2].
The Supervised Learning Procedure
Figure: the neural network maps the input Xk to an output signal Sk; the error between Sk and the desired output Dk is fed back for network adaptation.
The goal of pattern classification is to assign an input pattern to one of a finite number of classes.
Quantization vectors are called
codebook vectors.
Characteristics of Supervised and
Unsupervised Learning
General Philosophy of Learning:
Principle of Minimal Disturbance
Adapt to reduce the output error for the
current training pattern, with minimal
disturbance to responses already learned.
Error Correction and Gradient
Descent Rules
Error correction rules alter the weights of a
network using a linear error measure to reduce the
error in the output generated in response to the
present input pattern.
Objective: to design the weights of a TLN (threshold logic neuron) to correctly classify a given set of patterns.
Assumption: a training set of the following form is given:
T = {(Xk, dk)}, k = 1, …, Q,  Xk ∈ Rn+1, dk ∈ {0, 1}
The weight vector of the TLN is Wk = (w0k, w1k, …, wnk)T ∈ Rn+1.
Figure: TLN with augmented inputs X0, …, Xn, weights W0, …, Wn, summer Σ and thresholded output signal S.
Consider the pattern classes indicated: χ1 = {X1, X2} and χ0 = {X3, X4}.
Figure: the patterns X1, …, X4 together with the solution region of weight vectors that classify them correctly.
Linear Containment
Consider the set X0’, in which each element of X0 is negated.
Given a weight vector Wk, for any Xk ∈ X1 ∪ X0’,
XkT Wk > 0 implies correct classification and
XkT Wk < 0 implies incorrect classification.
X’ = X1 ∪ X0’ is called the adjusted training set.
The assumption of linear separability guarantees the existence of a solution weight vector WS, such that XkT WS > 0 ∀ Xk ∈ X’.
We say X’ is a linearly contained set.
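As a hedged illustration (not from the text), the adjusted training set can be formed in MATLAB by negating the class-0 patterns; here p and d are assumed to hold the augmented patterns and 0/1 labels, as in the Perceptron code later in this section:

% Illustrative sketch: form the adjusted training set X' by negating
% the patterns that belong to class 0 (p and d as in the Perceptron code).
Xadj = p;
Xadj(d == 0, :) = -p(d == 0, :);     % negate every class-0 pattern
% X' is linearly contained iff some w satisfies Xadj*w' > 0 for all rows.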
Recast of Perceptron Learning with
Linearly Contained Data
Figure: binary threshold neuron with inputs X1 and X2, weights W0 (bias), W1, W2, and output S.
Classroom Exercise
MATLAB Simulation
Figure: (a) hyperplane movement during Perceptron learning in the (x1, x2) plane for the patterns (0,0), (1,0), (0,1), (1,1), with the decision line shown at iterations k = 5, 15, 25, 35; (b) the corresponding trajectory in (w0, w1, w2) weight space.
Perceptron Learning Algorithm:
MATLAB Code
p = [1 0 0              % training patterns, each row augmented with a leading bias input of 1
     1 0 1
     1 1 0
     1 1 1];
d = [0 0 0 1];          % desired outputs (AND function)
w = [0 0 0];            % initial weights
eta = 1;                % learning rate
update = 1;
while update == 1
    for i = 1:4
        y = p(i,:)*w';
        if y >= 0 & d(i) == 0
            w = w - eta*p(i,:);     % misclassified as 1: subtract the pattern
            up(i) = 1;
        elseif y <= 0 & d(i) == 1
            w = w + eta*p(i,:);     % misclassified as 0: add the pattern
            up(i) = 1;
        else
            up(i) = 0;              % correctly classified: no update
        end
    end
    number_of_updates = up * up';
    if number_of_updates > 0
        update = 1;
    else
        update = 0;
    end
end
Perceptron Learning and Non-
separable Sets
Theorem:
Given a finite set of training patterns X, there exists a number M such that if we run the Perceptron learning algorithm beginning with any initial set of weights W1, then any weight vector Wk produced in the course of the algorithm will satisfy ||Wk|| ≤ ||W1|| + M.
Two Corollaries
If, in a finite set of training patterns X, each
pattern Xk has integer (or rational) components xik,
then the Perceptron learning algorithm will visit a
finite set of distinct weight vectors Wk.
For a finite set of training patterns X, with
individual patterns Xk having integer (or rational)
components xik the Perceptron learning algorithm
will, in finite time, produce a weight vector that
correctly classifies all training patterns iff X is
linearly separable, or leave and re-visit a specific
weight vector iff X is linearly non-separable.
Handling Linearly Non-separable
Sets: The Pocket Algorithm
Philosophy: Incorporate positive reinforcement in a way
to reward weights that yield a low error solution.
Pocket algorithm works by remembering the weight
vector that yields the largest number of correct
classifications on a consecutive run.
This weight vector is kept in the “pocket”, and we
denote it as Wpocket .
While updating the weights in accordance with
Perceptron learning, if a weight vector is discovered
that has a longer run of consecutively correct
classifications than the one in the pocket, it replaces
the weight vector in the pocket.
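As an illustrative sketch (not the text's own operational summary), pocket learning can be layered on Perceptron updates roughly as follows; p, d and eta are as in the earlier Perceptron code, while max_iter and the random pattern selection are assumptions of this sketch:

% Hedged sketch of the Pocket algorithm on top of Perceptron learning.
w = zeros(1, size(p,2));              % current Perceptron weight vector
w_pocket = w;                         % weight vector kept "in the pocket"
run_len = 0; best_run = 0;            % lengths of consecutive correct runs
for iter = 1:max_iter                 % max_iter: assumed iteration budget
    i = randi(size(p,1));             % pick a training pattern at random
    correct = (p(i,:)*w' > 0) == d(i);    % does w classify pattern i correctly?
    if correct
        run_len = run_len + 1;
        if run_len > best_run         % longer correct run: replace the pocket weights
            best_run = run_len;
            w_pocket = w;
        end
    else                              % misclassified: standard Perceptron update
        if d(i) == 1
            w = w + eta*p(i,:);
        else
            w = w - eta*p(i,:);
        end
        run_len = 0;
    end
end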
Pocket Algorithm:
Operational Summary
Pocket Convergence Theorem
Given a finite set of training examples X and a probability p < 1, there exists an integer k0 such that after any k > k0 iterations of the pocket algorithm, the probability that the pocket weight vector Wpocket is optimal exceeds p.
Linear Neurons and Linear Error
Consider a training set of the form T = {Xk, dk},
Xk ∈ Rn+1, dk ∈ R.
To allow the desired output to vary smoothly or
continuously over some interval consider a
linear signal function: sk = yk = XkTWk
The linear error ek due to a presented training pair (Xk, dk) is the difference between the desired output dk and the neuronal signal sk:
ek = dk − sk = dk − XkTWk
Operational Details of α–LMS
Figure: adaptive linear combiner with bias input +1 weighted by W0 and inputs X1k, …, Xnk weighted by W1k, …, Wnk.
α–LMS error correction is proportional to the error itself.
η controls the stability and speed of convergence.
Stability is ensured if 0 < η < 2.
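As a hedged illustration of the rule described above (the variable names are my own, not the text's), a single α–LMS update for one training pair can be written as:

% One α–LMS update step (illustrative sketch).
% Xk: augmented input column vector, dk: desired output,
% w: current weight column vector, eta: learning rate with 0 < eta < 2.
sk = Xk' * w;                          % linear neuron signal
ek = dk - sk;                          % linear error
w  = w + eta * ek * Xk / (Xk' * Xk);   % correction proportional to the error,
                                       % normalized by the squared pattern length

The division by Xk'*Xk is what makes the stability condition 0 < η < 2 independent of the magnitudes of the training patterns.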
α–LMS Works with Normalized
Training Patterns
Figure: geometry of an α–LMS weight update — the change from Wk to Wk+1 is parallel to the normalized training pattern xk.
α–LMS: Operational Summary
MATLAB Simulation Example
The synthetic data set shown in the figure is generated by artificially scattering 200 points around the fifth-order function f = 3x(x−1)(x−1.9)(x+0.7)(x+1.8), within a ±0.1 interval in the y direction: random values are drawn in [0, 1], then stretched and scaled to ±0.1.
Figure: the scattered data points around f(x) on the interval [−2, 2].
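A minimal sketch of this data generation, assuming the inputs are sampled uniformly on [−2, 2] (the interval shown in the figure) and using an illustrative name eps_amp for the scatter amplitude:

% Sketch: synthetic data scattered around the fifth-order function.
max_points = 200;                                    % number of data points
x = linspace(-2, 2, max_points);                     % assumed input range
f = 3*x.*(x-1).*(x-1.9).*(x+0.7).*(x+1.8);           % fifth-order function
eps_amp = 0.1;                                       % scatter amplitude
y = f + (2*rand(1,max_points) - 1)*eps_amp;          % [0,1] values stretched to [-1,1], scaled to +/-0.1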
Computer Simulation of α-LMS
Algorithm
Figure: result of the α–LMS simulation (snapshot at iteration 1).
εk = ½ (dk² − 2dk XkTWk + WkT Xk XkT Wk)   (2)
Assumption: the weights are held fixed at Wk while computing the expectation.
The mean-squared error can now be computed by taking the expectation on both sides of (2):
ε = E[εk] = ½ E[dk²] − E[dk XkTWk] + ½ WkT E[Xk XkT] Wk   (3)
Our problem…
To find the optimal weight vector that minimizes the mean-square error.
Cross Correlations
For convenience of expression we define the
pattern vector P as the cross-correlation
between the desired scalar output, dk, and the
input vector, Xk
PT ≜ E[dk XkT] = E[(dk, dk x1k, …, dk xnk)]   (4)
Defining the input correlation matrix R ≜ E[Xk XkT], the mean-squared error becomes
ε = E[εk] = ½ E[dk²] − PTWk + ½ WkT R Wk   (6)
At the Wiener solution Ŵ = R−1P, the error attains its minimum value
εmin = ½ E[dk²] − ½ PTŴ   (12)
Expanding ε about the Wiener solution with V = W − Ŵ gives
ε = ½ E[dk²] + ½ (VTRV + VTRŴ + ŴTRV + ŴTRŴ) − PTV − PTŴ   (14)
Since Ŵ = R−1P, we have VTRŴ + ŴTRV = 2ŴTRV = 2PTR−1RV = 2PTV; the PTV terms cancel, and ½ ŴTRŴ − PTŴ = −½ PTŴ, so that
ε = εmin + ½ VTRV   (15)
 = εmin + ½ (W − Ŵ)T R (W − Ŵ)   (16)
The data set is a scatter of 100 points that describes the fifth-order function f = 3x(x−1)(x−1.9)(x+0.7)(x+1.8) ± ε, scattered around the function in a ±1 interval.
Figure: the scattered data points around f(x) on the interval [−2, 2].
MATLAB Simulation:
Computing the MSE Surface
First calculate the correlation matrix R, the cross-correlation P, E[d²], and the Wiener solution. These are straightforward to compute since the data set is deterministic.
For the pattern set under consideration, E[d²] = 859.59, and the Wiener solution can be computed to be
R = [ 1   0
      0   1.36 ]   (24)
P = ( 0.3386, −1.8818 )T   (25)
Ŵ = ( 0.3386, −1.3834 )T   (26)
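These values can be cross-checked numerically; a small illustrative computation using the quantities in (24) and (25):

% Numerical check of the Wiener solution from Eqns. (24)-(25).
R = [1 0; 0 1.36];                  % correlation matrix from (24)
P = [0.3386; -1.8818];              % cross-correlation from (25)
W_hat = R\P;                        % approximately (0.3386, -1.384)', as in (26)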
MSE Surface in the Vicinity of the Wiener Solution
Steepest Descent Search with Exact
Gradient Information
Steepest descent search
uses exact gradient
information available
from the mean-square
error surface to direct
the search in weight
space.
The figure shows a
projection of the
square-error function on
the ε − wik plane.
Steepest Descent Procedure
Summary
Provide an appropriate weight increment to wik to push the
error towards the minimum which occurs at ŵi .
Perturb the weight in a direction that depends on which
side of the optimal weight ŵi the current weight value wik
lies.
If the weight component wik lies to the left of ŵi, say at wik1 ,
where the error gradient is negative (as indicated by the
tangent) we need to increase wik .
If wik is on the right of ŵi, say at wik2, where the error
gradient is positive, we need to decrease wik.
This rule is summarized in the following statement:
If ∂ε/∂wik > 0 (wik > ŵi), decrease wik
If ∂ε/∂wik < 0 (wik < ŵi), increase wik
Weight Update Procedure
It follows logically therefore, that the weight component should be
updated in proportion with the negative of the gradient:
wik+1 = wik + η(−∂ε/∂wik),   i = 0, 1, …, n   (27)
Vectorially we may write
Wk+1 = Wk + η(−∇ε)   (28)
where we have now re-introduced the iteration time dependence into the weight components:
∇ε = (∂ε/∂w0k, …, ∂ε/∂wnk)T   (29)
Equation (28) is the steepest descent update procedure. Note that
steepest descent uses exact gradient information at each step to
decide weight changes.
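As a hedged sketch (not the textbook's code), the update of Eqn. (28) can be realized directly once R and P are available, since the exact gradient of the quadratic MSE of Eqn. (6) is ∇ε = RW − P; R and P are assumed computed as in the MATLAB simulation example below, and the step size and iteration count here are purely illustrative:

% Sketch: steepest descent with the exact gradient of the MSE surface.
% Assumes R and P have been computed as in the simulation code below.
w = [-3.9; 6.27];                  % starting point used in the contour figure
eta = 0.01;                        % illustrative learning rate
for k = 1:500                      % illustrative iteration count
    grad = R*w - P;                % exact gradient of the quadratic MSE (6)
    w = w - eta*grad;              % Eqn. (28): step against the gradient
end
% w now approximates the Wiener solution inv(R)*P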
Convergence of Steepest Descent - 1
Question: What can one say about the stability of the
algorithm? Does it converge for all values of η ?
To answer this question consider the following series of substitutions and transformations. From Eqns. (28) and (21):
Wk+1 = Wk + η(−∇ε)   (30)
 = Wk − ηRVk   (31)
 = Wk + ηR(Ŵ − Wk)   (32)
 = (I − ηR)Wk + ηRŴ   (33)
Transforming Eqn. (33) into V-space (the principal coordinate system) by substitution of Vk + Ŵ for Wk yields:
Vk+1 = (I − ηR)Vk   (34)
Steepest Descent Convergence - 2
Rotation to the principal axes of the elliptic contours can be
effected by using V = QV ' :
QV′k+1 = (I − ηR)QV′k   (35)
or
V′k+1 = Q−1(I − ηR)QV′k   (36)
 = (I − ηD)V′k   (37)
where D is the diagonal eigenvalue matrix. Recursive application of Eqn. (37) yields:
V′k = (I − ηD)k V′0   (38)
It follows from this that for stability and convergence of the algorithm:
lim k→∞ (I − ηD)k = 0   (39)
Steepest Descent Convergence - 3
This requires that
lim k→∞ (1 − ηλmax)k = 0   (40)
or
0 < η < 2/λmax   (41)
λmax being the largest eigenvalue of R.
If this condition is satisfied then we have
lim k→∞ V′k = lim k→∞ Q−1(Wk − Ŵ) = 0   (42)
or
lim k→∞ Wk = Ŵ   (43)
Wk+1 = Wk + 2ηR(Ŵ − Wk)   (46)
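For the example correlation matrix of Eqn. (24), the bound (41) can be checked with a small illustrative computation:

% Stability bound (41) for the example R of Eqn. (24).
R = [1 0; 0 1.36];
lambda_max = max(eig(R));          % largest eigenvalue of R (= 1.36 here)
eta_max = 2/lambda_max;            % steepest descent converges for 0 < eta < eta_max (about 1.47)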
MATLAB Simulation Example
eta = .01; % Set learning rate
R=zeros(2,2); % Initialize correlation matrix
X = [ones(1,max_points);x]; % Augment input vectors
for k =1:max_points
R = R + X(:,k)*X(:,k)'; % Compute R
end
R = R/max_points;
weiner=inv(R)*P; % Compute the Weiner solution
errormin = D - P'*inv(R)*P; % Find the minimum error
MATLAB Simulation Example (contd.)
shift1 = linspace(-12,12, 21); % Generate a weight space matrix
shift2 = linspace(-9,9, 21);
for i = 1:21 % Compute a weight matrix about
shiftwts(1,i) = weiner(1)+shift1(i); % Weiner solution
shiftwts(2,i) = weiner(2)+shift2(i);
end
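A hedged sketch of how the MSE surface could then be evaluated over this grid of shifted weights, using the quadratic form of Eqn. (6); the mse array and the mesh call are illustrative names, not the textbook's:

% Sketch: evaluate the MSE of Eqn. (6) over the grid of shifted weights.
% Assumes R, P and D (the estimate of E[d^2] used above) are available.
for i = 1:21
    for j = 1:21
        w = [shiftwts(1,i); shiftwts(2,j)];          % candidate weight vector
        mse(i,j) = 0.5*D - P'*w + 0.5*w'*R*w;        % quadratic MSE of Eqn. (6)
    end
end
% mesh(shift1 + weiner(1), shift2 + weiner(2), mse')   % optional surface plot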
Steepest descent uses exact gradient information to search for the Wiener solution in weight space.
Figure: contour plot of the MSE surface in the (w0, w1) plane, showing the steepest descent trajectory from the initial point W0 = (−3.9, 6.27)T to the Wiener solution marked at (1.53, −1.38).
E[εk] → constant as k → ∞   (58)
E[Wk+1] = (I − ηE[Xk XkT]) E[Wk] + ηE[dk Xk]   (64)
 = (I − ηR) E[Wk] + ηP   (65)
where P and R are as already defined.
µ-LMS Algorithm:
Convergence in the Mean (3)
Appropriate substitution yields:
E[Wk+1] = (I − ηQDQT) E[Wk] + ηQDQT Ŵ
 = (QQT − ηQDQT) E[Wk] + ηQDQT Ŵ   (66)
 = Q(I − ηD)QT E[Wk] + ηQDQT Ŵ
Pre-multiplication throughout by QT gives a recursion in the rotated coordinates,
where Ṽk = E[Wk] − Ŵ and Ṽ′k = QT Ṽk. Eqn. (71) represents a set of n+1 decoupled difference equations, which converge in the mean provided
0 < η < 2/tr[R]   (76)
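Again using the example R of Eqn. (24), the bound (76) is easy to evaluate (a small illustrative computation):

% mu-LMS bound (76) for the example R of Eqn. (24).
R = [1 0; 0 1.36];
eta_bound = 2/trace(R);            % about 0.85; more conservative than 2/lambda_max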
Random Walk towards the Wiener Solution
Assume the familiar fifth-order function.
µ-LMS uses a local estimate of the gradient to search for the Wiener solution in weight space.
Figure: contour plot of the MSE surface in the (w0, w1) plane showing the random walk of the µ-LMS weight vector; the final µ-LMS solution (0.5373, −2.3311) lies close to the Wiener solution (0.339, −1.881).
Computer Simulation Example
µ-LMS learning assumes that the data stream is stochastic
(with stationary properties).
% Completion sketch: the loop header, the augmented input X and the noise-free
% target y are assumed here; they follow the data set described earlier.
for loop = 1:max_points
    X = [1; x(loop)];                              % augmented input vector
    y = 3*x(loop)*(x(loop)-1)*(x(loop)-1.9)*(x(loop)+0.7)*(x(loop)+1.8);
    scatter = (2*rand-1)*eps;                      % uniform scatter of amplitude eps
    d = y + scatter;                               % noisy desired output
    w = w + 2*eta*(d - X'*w)*X;                    % µ-LMS update with a local gradient estimate
    wts1(loop) = w(1);                             % record the weight trajectory
    wts2(loop) = w(2);
end
MATLAB Code Notes
The rest of the program is very similar to the code for steepest
descent search.