
WK5 Dynamic Networks


CS 476: Networks of Neural Computation

WK5 Dynamic Networks: Time Delayed & Recurrent Networks

Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete

Spring Semester, 2009

Contents

Sequence Learning
Time Delayed Networks I: Implicit Representation
Time Delayed Networks II: Explicit Representation
Recurrent Networks I: Elman + Jordan Networks
Recurrent Networks II: Back Propagation Through Time
Conclusions

Sequence Learning

MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of such pairs.

Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.

Typically the input signal to a dynamic network is an element of the sequence, and the network then produces as its response the rest of the sequence.

To learn sequences we need to include some form of memory (short-term memory) in the network.

Sequence Learning II

We can introduce memory effects in two principal ways:

Implicit: e.g. a time-lagged signal presented as input to a static network, or recurrent connections.

Explicit: e.g. the Temporal Backpropagation method.

In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. In the explicit form the environment can be non-stationary, i.e. the network can track changes in the structure of the signal.

Time Delayed Networks I

The time delayed approach includes two basic types of networks:

Implicit Representation of Time: we combine a memory structure in the input layer of the network with a static network model.

Explicit Representation of Time: we explicitly allow the network to code time, by generalising the network weights from scalars to vectors, as in TBP (Temporal Backpropagation).

Typical forms of memory that are used are the Tapped Delay Line and the Gamma Memory family.

Time Delayed Networks I

The Tapped Delay Line form of memory is shown below for an input signal x(n):

[Figure: a tapped delay line of p unit-delay elements applied to x(n)]

The Gamma form of memory is defined by:

g_p(n) = \binom{n-1}{p-1}\, \mu^p (1-\mu)^{n-p}, \quad n \ge p

Time Delayed Networks I

The Gamma Memory is shown below:

[Figure: the Gamma Memory structure]
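As a concrete illustration of the two memory structures above, here is a minimal Python sketch of the tapped-delay-line state vector and of the gamma kernel g_p(n). The function names, the zero-padding of samples before n = 0 and the example parameter values are assumptions made for the sketch, not part of the slides.

```python
import numpy as np
from math import comb

def tapped_delay_line(x, p):
    """Return the state vector [x(n), x(n-1), ..., x(n-p)] for every n.

    x : 1-D signal, p : memory depth. Samples before n = 0 are taken as 0.
    """
    x = np.asarray(x, dtype=float)
    padded = np.concatenate([np.zeros(p), x])
    # row n holds the delayed samples x(n), x(n-1), ..., x(n-p)
    return np.stack([padded[n + p - np.arange(p + 1)] for n in range(len(x))])

def gamma_kernel(p, mu, n_max):
    """Gamma memory kernel g_p(n) = C(n-1, p-1) mu^p (1-mu)^(n-p) for n >= p."""
    g = np.zeros(n_max)
    for i, n in enumerate(range(1, n_max + 1)):
        if n >= p:
            g[i] = comb(n - 1, p - 1) * mu**p * (1 - mu) ** (n - p)
    return g

# Example: depth-3 delay-line states and a gamma kernel with mu = 0.6
states = tapped_delay_line([1.0, 2.0, 3.0, 4.0], p=3)
kernel = gamma_kernel(p=2, mu=0.6, n_max=10)
```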

Time Delayed Networks I

In the implicit representation approach we combine a static network (e.g. MLP / RBF) with a memory structure (e.g. a tapped delay line). An example is shown below (the NETtalk network):

[Figure: the NETtalk network]

Time Delayed Networks I

We present the data in a sliding window (see the code sketch after this slide). For example, in NETtalk the middle group of input neurons presents the letter in focus; the remaining input groups, three before & three after, present context.

The purpose is to predict, for example, the next symbol in the sequence.

The NETtalk model (Sejnowski & Rosenberg, 1987) has:
203 input nodes
80 hidden neurons
26 output neurons
18629 weights
It used the BP method for training.
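The sliding-window presentation of a symbol sequence can be sketched as follows. The padding symbol, the function name and the fixed width of three context symbols on each side (as in NETtalk) are illustrative assumptions; the encoding of symbols into the 203 input nodes is not shown.

```python
def sliding_windows(sequence, before=3, after=3, pad="_"):
    """Build NETtalk-style sliding windows: for each position, the symbol in
    focus plus `before` symbols of left context and `after` of right context."""
    padded = [pad] * before + list(sequence) + [pad] * after
    return [padded[i : i + before + 1 + after] for i in range(len(sequence))]

# Example: 7-symbol windows over the word "network"; the focus letter is
# always the middle element of each window.
for window in sliding_windows("network"):
    print(window)
```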

Time Delayed Networks II

In the explicit time representation, neurons have a spatio-temporal structure: a synapse arriving at a neuron is not a scalar number but a vector of weights, which is convolved with the time-delayed input signal coming from the previous neuron.

A schematic representation of such a neuron is given below:

[Figure: a neuron with vector (FIR) synapses]

Time Delayed Networks II

The output of the neuron in this case is given by:

y_j(n) = \varphi\left( \sum_{l=0}^{p} w_j(l)\, x(n-l) + b_j \right)

In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by:

y(n) = \sum_{j=1}^{m} w_j y_j(n) + b_0 = \sum_{j=1}^{m} w_j\, \varphi\left( \sum_{l=0}^{p} w_j(l)\, x(n-l) + b_j \right) + b_0

Where p is the depth of the memory and b_0 is the bias of the output neuron.

Time Delayed Networks II

In the more general case, where we have multiple neurons at the output layer, we have for neuron j of any layer:

y_j(n) = \varphi\left( \sum_{i=1}^{m_0} \sum_{l=0}^{p} w_{ji}(l)\, x_i(n-l) + b_j \right)

The output of any synapse is given by the convolution sum:

s_{ji}(n) = w_{ji}^T x_i(n) = \sum_{l=0}^{p} w_{ji}(l)\, x_i(n-l)

Time Delayed Networks II

Where the state vector x_i(n) and weight vector w_ji for synapse i are defined as follows:

x_i(n) = [x_i(n), x_i(n-1), \ldots, x_i(n-p)]^T
w_{ji} = [w_{ji}(0), w_{ji}(1), \ldots, w_{ji}(p)]^T
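As an illustration of the convolution sum and of y_j(n) for a single neuron with vector synapses, here is a minimal numpy sketch. The choice of tanh for the activation φ, the random example values and all function names are assumptions made for the sketch.

```python
import numpy as np

def fir_synapse_output(w_ji, x_i_state):
    """Convolution sum of one vector synapse: s_ji(n) = w_ji^T x_i(n)."""
    return float(np.dot(w_ji, x_i_state))

def neuron_output(W_j, X_states, b_j, phi=np.tanh):
    """Output of neuron j with m0 vector synapses of depth p:
        y_j(n) = phi( sum_i w_ji^T x_i(n) + b_j )

    W_j      : array of shape (m0, p+1), row i is the weight vector w_ji
    X_states : array of shape (m0, p+1), row i is x_i(n) = [x_i(n),...,x_i(n-p)]
    """
    v_j = np.sum(W_j * X_states) + b_j   # induced local field v_j(n)
    return phi(v_j)

# Example with m0 = 2 input synapses of memory depth p = 3
rng = np.random.default_rng(0)
W_j = rng.normal(size=(2, 4))
X_states = rng.normal(size=(2, 4))
print(neuron_output(W_j, X_states, b_j=0.1))
```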

Time Delayed Networks II: Learning Law

To train such a network we use the Temporal BackPropagation algorithm. We present the algorithm below.

Assume that neuron j lies in the output layer and its response at time n is denoted by y_j(n), while its desired response is given by d_j(n).

We can define an instantaneous value for the sum of squared errors produced by the network as follows:

E(n) = \frac{1}{2} \sum_j e_j^2(n)

Time Delayed Networks II: Learning Law-1

The error signal at the output layer is defined by:

e_j(n) = d_j(n) - y_j(n)

The idea is to minimise an overall cost function, calculated over all time:

E_{total} = \sum_n E(n)

We could proceed as usual by calculating the gradient of the cost function with respect to the weights. This implies that we need to calculate the instantaneous gradient:

\frac{\partial E_{total}}{\partial w_{ji}} = \sum_n \frac{\partial E(n)}{\partial w_{ji}}

Time Delayed Networks II: Learning Law-2

However, for this approach to work we need to unfold the network in time (i.e. to convert it to an equivalent static network and then calculate the gradient). This option presents a number of disadvantages:

Loss of symmetry between the forward and backward passes for the calculation of the instantaneous gradient;

No nice recursive formula for the propagation of error terms;

Need for global bookkeeping to keep track of which static weights are actually the same in the equivalent network.

Time Delayed Networks II: Learning Law-3

For these reasons we prefer to calculate the gradient of the cost function as follows:

\frac{\partial E_{total}}{\partial w_{ji}} = \sum_n \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}}

Note that in general:

\frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}} \ne \frac{\partial E(n)}{\partial w_{ji}}

The equality is correct only when we take the sum over all time.

Time Delayed Networks II: Learning Law-4

To calculate the weight update we use the steepest descent method:

w_{ji}(n+1) = w_{ji}(n) - \eta\, \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}

Where \eta is the learning rate. We calculate the terms in the above relation as follows:

\frac{\partial v_j(n)}{\partial w_{ji}(n)} = x_i(n)

This follows directly from the definition of the induced local field v_j(n).

Time Delayed Networks II: Learning Law-5

We define the local gradient as:

\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)}

Thus we can write the weight update equation in the familiar form:

w_{ji}(n+1) = w_{ji}(n) + \eta\, \delta_j(n)\, x_i(n)

We need to calculate \delta_j(n) for the cases of output and hidden layers.

Time Delayed Networks II: Learning Law-6

For the output layer the local gradient is given by:

\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi'(v_j(n))

For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have:

\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)} = -\sum_{r \in A} \sum_k \frac{\partial E_{total}}{\partial v_r(k)} \frac{\partial v_r(k)}{\partial v_j(n)}

Time Delayed Networks II: Learning Law-7

By re-writing we get the following:

\delta_j(n) = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial v_j(n)} = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)}

Finally, putting it all together we get:

\delta_j(n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{k=n}^{n+p} \delta_r(k)\, w_{rj}(k-n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{l=0}^{p} \delta_r(n+l)\, w_{rj}(l)

Time Delayed Networks II: Learning Law-8

Where l indexes the delays (l = 0, ..., p) of the weight vector w_rj. A minimal code sketch of these local-gradient formulas is given below.
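The following is a minimal sketch of the two local-gradient formulas (output and hidden layer) of Temporal BackPropagation, under illustrative assumptions: tanh is used for φ, and the downstream local gradients and weight vectors are kept in plain dictionaries. None of these implementation choices come from the slides; only the formulas do.

```python
import numpy as np

def local_gradient_output(e_j_n, v_j_n, dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """Output-layer local gradient: delta_j(n) = e_j(n) * phi'(v_j(n))."""
    return e_j_n * dphi(v_j_n)

def local_gradient_hidden(v_j_n, deltas_next, w_next, n, p,
                          dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """Hidden-layer local gradient of temporal backpropagation:

        delta_j(n) = phi'(v_j(n)) * sum_{r in A} sum_{l=0..p} delta_r(n+l) * w_rj(l)

    deltas_next : dict r -> array of delta_r over time (assumed long enough)
    w_next      : dict r -> weight vector w_rj of length p+1
    """
    total = 0.0
    for r, w_rj in w_next.items():
        for l in range(p + 1):
            total += deltas_next[r][n + l] * w_rj[l]
    return dphi(v_j_n) * total

# Tiny usage: one downstream neuron r, memory depth p = 2
deltas_next = {"r": np.array([0.30, 0.10, -0.20, 0.05, 0.00])}
w_next = {"r": np.array([0.5, 0.2, -0.1])}
print(local_gradient_hidden(v_j_n=0.4, deltas_next=deltas_next,
                            w_next=w_next, n=1, p=2))
```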

Recurrent I

A network is called recurrent when there are connections which feed back to previous layers or neurons, including self-connections. An example is shown next.

Successful early models of recurrent networks are:
The Jordan Network
The Elman Network

Recurrent I

The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context units in a 1-1 fashion. The context units also feed back to themselves.

The network is trained by using the Backpropagation algorithm. A schematic is shown in the next figure:

[Figure: the Jordan Network]

Recurrent I

The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context units in a 1-1 fashion; these connections are fixed and equal to 1. It is also called the Simple Recurrent Network (SRN).

The network is trained by using the Backpropagation algorithm. A schematic is shown in the next figure, and a minimal forward-pass sketch follows below:

[Figure: the Elman (SRN) Network]
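A minimal forward-pass sketch of an Elman SRN, assuming tanh hidden units, small random initial weights and no training loop; the class name and layer sizes are illustrative choices, not taken from the slides.

```python
import numpy as np

class ElmanSRN:
    """Minimal Elman (Simple Recurrent Network) forward pass.

    Hidden activations are copied 1-1 into the context units (fixed weights
    of 1) and fed back as extra inputs at the next time step.
    """
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        self.W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))
        self.context = np.zeros(n_hidden)   # context units, initially zero

    def step(self, u):
        h = np.tanh(self.W_in @ u + self.W_ctx @ self.context)
        self.context = h.copy()             # 1-1 copy of the hidden layer
        return self.W_out @ h

# Run a short input sequence through the network
net = ElmanSRN(n_in=3, n_hidden=5, n_out=2)
outputs = [net.step(u) for u in np.random.default_rng(1).normal(size=(4, 3))]
```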

Recurrent II

More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.

Typical paradigms of complex recurrent models are:
The Nonlinear Autoregressive with Exogenous Inputs Network (NARX)
The State Space Model
The Recurrent Multilayer Perceptron (RMLP)

Schematic representations of the networks are given in the next slides.

Recurrent II-1

The structure of the NARX model includes:

An MLP static network;

The current input u(n) and its delayed versions up to a lag q;

A time-delayed version of the current output y(n), which feeds back to the input layer; the memory of the delayed output vector is in general p.

The output is calculated as (see the code sketch after the schematic on the next slide):

y(n+1) = F(y(n), \ldots, y(n-p+1), u(n), \ldots, u(n-q+1))

Recurrent II-2

A schematic of the NARX model is as follows:

[Figure: the NARX network]
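The one-step-ahead NARX recursion can be sketched as below. The zero initialisation of the delay lines, the toy choice of F (a fixed random linear map through tanh) and the function name are assumptions for illustration; in practice F would be a trained MLP.

```python
import numpy as np
from collections import deque

def narx_predict(F, u_seq, p, q, y0=0.0):
    """One-step-ahead NARX recursion:
        y(n+1) = F(y(n),...,y(n-p+1), u(n),...,u(n-q+1))

    F is any static map (e.g. a trained MLP) taking a (p+q)-vector.
    """
    y_delays = deque([y0] * p, maxlen=p)     # y(n), ..., y(n-p+1)
    u_delays = deque([0.0] * q, maxlen=q)    # u(n), ..., u(n-q+1)
    outputs = []
    for u_n in u_seq:
        u_delays.appendleft(u_n)
        y_next = F(np.array(list(y_delays) + list(u_delays)))
        outputs.append(y_next)
        y_delays.appendleft(y_next)          # feed the output back
    return np.array(outputs)

# Example with a toy "network" F and p = 2, q = 3
rng = np.random.default_rng(0)
w = rng.normal(size=5)                       # p + q = 5 inputs to F
y_hat = narx_predict(lambda z: np.tanh(w @ z), u_seq=rng.normal(size=10), p=2, q=3)
```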

Recurrent II-3

The structure of the State Space model includes:

An MLP network with a single hidden layer;

The hidden neurons define the state of the network;

A linear output layer;

Feedback of the hidden layer to the input layer, assuming a memory of q lags;

The output is determined by the coupled equations:

x(n+1) = f(x(n), u(n))
y(n+1) = C\, x(n+1)

Recurrent II-4

Where f is a suitable nonlinear function characterising the hidden layer. x is the state vector, as produced by the hidden layer, and has q components. y is the output vector and has p components. The input vector is given by u and has m components.

A schematic representation of the network is given below, followed by a one-step sketch in code:

[Figure: the state space model]
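A one-step sketch of the state-space model. The concrete form chosen for f (a single tanh layer) and all parameter names are assumptions, since the slides only give the coupled equations.

```python
import numpy as np

def state_space_step(x, u, W_x, W_u, b, C, phi=np.tanh):
    """One step of the state-space recurrent model:
        x(n+1) = f(x(n), u(n)) = phi(W_x x(n) + W_u u(n) + b)
        y(n+1) = C x(n+1)
    """
    x_next = phi(W_x @ x + W_u @ u + b)
    y_next = C @ x_next
    return x_next, y_next

# Example: q = 4 state units, m = 2 inputs, p = 1 output
rng = np.random.default_rng(0)
q_dim, m_dim, p_dim = 4, 2, 1
W_x, W_u = rng.normal(size=(q_dim, q_dim)), rng.normal(size=(q_dim, m_dim))
b, C = np.zeros(q_dim), rng.normal(size=(p_dim, q_dim))
x = np.zeros(q_dim)
for u in rng.normal(size=(5, m_dim)):
    x, y = state_space_step(x, u, W_x, W_u, b, C)
```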

Recurrent II-5

The structure of the RMLP includes:

One or more hidden layers;

Feedback around each layer;

The general structure of a static MLP network;

The output is calculated as follows (assuming that x_I, x_II and x_O are the first, second and output layer outputs):

x_I(n+1) = \varphi_I(x_I(n), u(n))
x_{II}(n+1) = \varphi_{II}(x_{II}(n), x_I(n+1))
x_O(n+1) = \varphi_O(x_O(n), x_{II}(n+1))

Recurrent II-6

Where the functions \varphi_I(\cdot), \varphi_{II}(\cdot) and \varphi_O(\cdot) denote the activation functions of the corresponding layers.

A schematic representation is given below, followed by a forward-step sketch in code:

[Figure: the RMLP network]
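A forward-step sketch of a two-hidden-layer RMLP with feedback around each layer. Each φ_* is sketched here as tanh of a linear combination of the layer's own previous output and its current input, which is an assumption; the slides only specify the functional dependencies.

```python
import numpy as np

def rmlp_step(x_I, x_II, x_O, u, params, phi=np.tanh):
    """One step of an RMLP with feedback around each layer:
        x_I(n+1)  = phi_I(x_I(n), u(n))
        x_II(n+1) = phi_II(x_II(n), x_I(n+1))
        x_O(n+1)  = phi_O(x_O(n), x_II(n+1))
    """
    (A1, B1), (A2, B2), (A3, B3) = params
    x_I_next = phi(A1 @ x_I + B1 @ u)
    x_II_next = phi(A2 @ x_II + B2 @ x_I_next)
    x_O_next = phi(A3 @ x_O + B3 @ x_II_next)
    return x_I_next, x_II_next, x_O_next

# Example sizes: 3 inputs, layers of 5, 4 and 2 units
rng = np.random.default_rng(0)
params = [(rng.normal(size=(5, 5)), rng.normal(size=(5, 3))),
          (rng.normal(size=(4, 4)), rng.normal(size=(4, 5))),
          (rng.normal(size=(2, 2)), rng.normal(size=(2, 4)))]
states = (np.zeros(5), np.zeros(4), np.zeros(2))
states = rmlp_step(*states, u=rng.normal(size=3), params=params)
```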

Recurrent II-7

Some theorems on the computational power of recurrent networks:

Thm 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.

Thm 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.

Recurrent II-8

Corollary: NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.

Recurrent II-9

The training of recurrent networks can be done with two methods:

BackPropagation Through Time
Real-Time Recurrent Learning

We can train a recurrent network with either epoch-based or continuous training operation. However, an epoch in recurrent networks does not mean the presentation of all learning patterns; rather, it denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern to the network. At the end of an epoch the network stabilises.

Recurrent II-10

Some useful heuristics for training are given below:

Lexicographic order of training samples should be followed, with the shortest strings of symbols being presented to the network first;

The training should begin with a small training sample, and its size should then be incrementally increased as the training proceeds;

The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed by the network is greater than some prescribed criterion;

Recurrent II-11

The use of weight decay during training is recommended; weight decay was discussed in WK3.

The BackPropagation Through Time algorithm proceeds by unfolding a network in time. To be more specific:

Assume that we have a recurrent network N which is required to learn a temporal task starting from time n0 and going all the way to time n.

Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.

Recurrent II-12

The network N* is related to the original network N as follows:

For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in network N;

In every layer of network N* there is a copy of each neuron in network N;

For every time step l in [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.

The following example explains the idea of unfolding:

Recurrent II-13

We assume that we have a network with two neurons, which is unfolded for a number of steps n:

[Figure: a two-neuron recurrent network unfolded in time]

Recurrent II-14

We present now the method of Epochwise BackPropagation Through Time.

Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 its end time.

We can define the following cost function:

E_{total}(n_0, n_1) = \frac{1}{2} \sum_{n=n_0}^{n_1} \sum_{j \in A} e_j^2(n)

Recurrent II-15

Where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and e_j(n) is the error signal at the output of such a neuron, measured with respect to some desired response.

Recurrent II-16

The algorithm proceeds as follows:

1. For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn't have to be the same for each epoch of training. Rather, what is important is that the initial state for the new epoch is different from the state reached by the network at the end of the previous epoch;

Recurrent II-17

2. First, a single forward pass of the data through the network over the interval (n0, n1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved;

3. A single backward pass over this past record is performed to compute the values of the local gradients:

\delta_j(n) = -\frac{\partial E_{total}(n_0, n_1)}{\partial v_j(n)}

for all j \in A and n_0 < n \le n_1. This computation is performed by the formula:

Recurrent II-18

\delta_j(n) = \varphi'(v_j(n))\, e_j(n), \quad \text{for } n = n_1
\delta_j(n) = \varphi'(v_j(n)) \left[ e_j(n) + \sum_{k \in A} w_{jk}\, \delta_k(n+1) \right], \quad \text{for } n_0 < n < n_1

Where \varphi'(\cdot) is the derivative of the activation function with respect to its argument, and v_j(n) is the induced local field of neuron j.

The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved here is equal to the number of time steps contained in the epoch.

Recurrent II-19

4. Once the computation of backpropagation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight w_ji of neuron j:

\Delta w_{ji} = -\eta \frac{\partial E_{total}}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_j(n)\, x_i(n-1)

Where \eta is the learning rate parameter and x_i(n-1) is the input applied to the i-th synapse of neuron j at time n-1.

A minimal code sketch of the whole epochwise procedure is given below.
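The following is a minimal sketch of epochwise BPTT for a small fully recurrent network, under illustrative assumptions: the state update uses tanh of a single weight matrix applied to the previous outputs, the current input and a bias; all K neurons are treated as output neurons (the set A); and all names and sizes are invented for the example. The slides specify only the forward pass, the two local-gradient cases and the summed weight update.

```python
import numpy as np

def epochwise_bptt(W, inputs, targets, eta=0.1, phi=np.tanh,
                   dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """One epoch of epochwise BackPropagation Through Time for K recurrent
    neurons.  Assumed state update: v(n) = W @ [y(n-1), u(n), 1], y(n) = phi(v(n)).
    Returns the updated weight matrix."""
    K = W.shape[0]
    T = len(inputs)                        # number of time steps in the epoch
    y = np.zeros(K)
    vs, zs, errs = [], [], []

    # Forward pass over the epoch; save the complete record
    for n in range(T):
        z = np.concatenate([y, inputs[n], [1.0]])
        v = W @ z
        y = phi(v)
        vs.append(v); zs.append(z); errs.append(targets[n] - y)

    # Backward pass: delta(n1) = phi'(v) e(n1);
    # delta(n) = phi'(v) [ e(n) + sum_k w_jk delta_k(n+1) ] for earlier n
    deltas = [None] * T
    deltas[T - 1] = dphi(vs[T - 1]) * errs[T - 1]
    for n in range(T - 2, -1, -1):
        deltas[n] = dphi(vs[n]) * (errs[n] + W[:, :K].T @ deltas[n + 1])

    # Weight update: dW_ji = eta * sum_n delta_j(n) x_i(n-1)
    dW = eta * sum(np.outer(deltas[n], zs[n]) for n in range(T))
    return W + dW

# Toy usage: K = 2 neurons, 1 external input, an epoch of 5 steps
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 2 + 1 + 1))     # [recurrent | input | bias]
W = epochwise_bptt(W, inputs=rng.normal(size=(5, 1)), targets=rng.normal(size=(5, 2)))
```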

Recurrent II-20

There is a potential problem with the method, called the Vanishing Gradients Problem: the corrections calculated for the weights are not large enough when methods based on steepest descent are used.

However, this is currently a research problem and one has to consult the literature for details.

Conclusions

Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks.

Time representation takes place explicitly or implicitly.

The implicit form includes time-delayed versions of the input vector followed by a static network model, or the use of recurrent networks.

The explicit form uses a generalisation of the MLP model where a synapse is modelled as a weight vector and not as a single number. The synapse activation is no longer the product of the synapse's weight with the output of a previous neuron, but rather the inner product of the weight vector with the state vector of delayed inputs.

Conclusions I

The extended MLP networks with explicit temporal structure are trained with the Temporal BackPropagation algorithm.

The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard BackPropagation algorithm.

In the more complex cases we first unfold the network in time and then train it using the BackPropagation Through Time algorithm.
