
WK5 Dynamic Networks


CS 476: Networks of Neural Computation

WK5 Dynamic Networks: Time Delayed & Recurrent Networks

Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete

Spring Semester, 2009

Contents

Sequence Learning
Time Delayed Networks I: Implicit Representation
Time Delayed Networks II: Explicit Representation
Recurrent Networks I: Elman + Jordan Networks
Recurrent Networks II: Back Propagation Through Time
Conclusions

Sequence Learning

MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of such pairs.

Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.

Typically the input signal to a dynamic network is an element of the sequence, and the network then produces as its response the rest of the sequence.

To learn sequences we need to include some form of memory (short-term memory) in the network.

Sequence Learning II

We can introduce memory effects in two principal ways:

Implicit: e.g. a time-lagged signal presented as input to a static network, or recurrent connections.

Explicit: e.g. the Temporal Backpropagation method.

In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. In the explicit form the environment can be non-stationary, i.e. the network can track changes in the structure of the signal.

Time Delayed Networks I

The time delayed approach includes two basic types of networks:

Implicit Representation of Time: we combine a memory structure in the input layer of the network with a static network model.

Explicit Representation of Time: we explicitly allow the network to code time, by generalising the network weights from scalars to vectors, as in TBP (Temporal Backpropagation).

Typical forms of memory that are used are the Tapped Delay Line and the Gamma Memory family.

Time Delayed Networks I

The Tapped Delay Line form of memory is shown below for an input signal x(n):

[Figure: a tapped delay line of p unit-delay elements applied to x(n)]

The Gamma form of memory is defined by:

g_p(n) = \binom{n-1}{p-1}\, \mu^p (1-\mu)^{n-p}, \quad n \ge p

Time Delayed Networks I

The Gamma Memory is shown below:

[Figure: the Gamma Memory structure]
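As a concrete illustration of the two memory structures above, here is a minimal Python sketch of the tapped-delay-line state vector and of the gamma kernel g_p(n). The function names, the zero-padding of samples before n = 0 and the example parameter values are assumptions made for the sketch, not part of the slides.

```python
import numpy as np
from math import comb

def tapped_delay_line(x, p):
    """Return the state vector [x(n), x(n-1), ..., x(n-p)] for every n.

    x : 1-D signal, p : memory depth. Samples before n = 0 are taken as 0.
    """
    x = np.asarray(x, dtype=float)
    padded = np.concatenate([np.zeros(p), x])
    # row n holds the delayed samples x(n), x(n-1), ..., x(n-p)
    return np.stack([padded[n + p - np.arange(p + 1)] for n in range(len(x))])

def gamma_kernel(p, mu, n_max):
    """Gamma memory kernel g_p(n) = C(n-1, p-1) mu^p (1-mu)^(n-p) for n >= p."""
    g = np.zeros(n_max)
    for i, n in enumerate(range(1, n_max + 1)):
        if n >= p:
            g[i] = comb(n - 1, p - 1) * mu**p * (1 - mu) ** (n - p)
    return g

# Example: depth-3 delay-line states and a gamma kernel with mu = 0.6
states = tapped_delay_line([1.0, 2.0, 3.0, 4.0], p=3)
kernel = gamma_kernel(p=2, mu=0.6, n_max=10)
```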

Time Delayed Networks I

In the implicit representation approach we combine a static network (e.g. MLP / RBF) with a memory structure (e.g. a tapped delay line). An example is shown below (the NETtalk network):

[Figure: the NETtalk network]

Time Delayed Networks I

We present the data in a sliding window (see the code sketch after this slide). For example, in NETtalk the middle group of input neurons presents the letter in focus; the remaining input groups, three before & three after, present context.

The purpose is to predict, for example, the next symbol in the sequence.

The NETtalk model (Sejnowski & Rosenberg, 1987) has:
203 input nodes
80 hidden neurons
26 output neurons
18629 weights
It used the BP method for training.
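The sliding-window presentation of a symbol sequence can be sketched as follows. The padding symbol, the function name and the fixed width of three context symbols on each side (as in NETtalk) are illustrative assumptions; the encoding of symbols into the 203 input nodes is not shown.

```python
def sliding_windows(sequence, before=3, after=3, pad="_"):
    """Build NETtalk-style sliding windows: for each position, the symbol in
    focus plus `before` symbols of left context and `after` of right context."""
    padded = [pad] * before + list(sequence) + [pad] * after
    return [padded[i : i + before + 1 + after] for i in range(len(sequence))]

# Example: 7-symbol windows over the word "network"; the focus letter is
# always the middle element of each window.
for window in sliding_windows("network"):
    print(window)
```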

Time Delayed Networks II

In the explicit time representation, neurons have a spatio-temporal structure: a synapse arriving at a neuron is not a scalar number but a vector of weights, which is convolved with the time-delayed input signal coming from the previous neuron.

A schematic representation of such a neuron is given below:

[Figure: a neuron with vector (FIR) synapses]

Time Delayed Networks II

The output of the neuron in this case is given by:

y_j(n) = \varphi\left( \sum_{l=0}^{p} w_j(l)\, x(n-l) + b_j \right)

In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by:

y(n) = \sum_{j=1}^{m} w_j y_j(n) + b_0 = \sum_{j=1}^{m} w_j\, \varphi\left( \sum_{l=0}^{p} w_j(l)\, x(n-l) + b_j \right) + b_0

Where p is the depth of the memory and b_0 is the bias of the output neuron.

Time Delayed Networks II

In the more general case, where we have multiple neurons at the output layer, we have for neuron j of any layer:

y_j(n) = \varphi\left( \sum_{i=1}^{m_0} \sum_{l=0}^{p} w_{ji}(l)\, x_i(n-l) + b_j \right)

The output of any synapse is given by the convolution sum:

s_{ji}(n) = w_{ji}^T x_i(n) = \sum_{l=0}^{p} w_{ji}(l)\, x_i(n-l)

Time Delayed Networks II

Where the state vector x_i(n) and weight vector w_ji for synapse i are defined as follows:

x_i(n) = [x_i(n), x_i(n-1), \ldots, x_i(n-p)]^T
w_{ji} = [w_{ji}(0), w_{ji}(1), \ldots, w_{ji}(p)]^T
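As an illustration of the convolution sum and of y_j(n) for a single neuron with vector synapses, here is a minimal numpy sketch. The choice of tanh for the activation φ, the random example values and all function names are assumptions made for the sketch.

```python
import numpy as np

def fir_synapse_output(w_ji, x_i_state):
    """Convolution sum of one vector synapse: s_ji(n) = w_ji^T x_i(n)."""
    return float(np.dot(w_ji, x_i_state))

def neuron_output(W_j, X_states, b_j, phi=np.tanh):
    """Output of neuron j with m0 vector synapses of depth p:
        y_j(n) = phi( sum_i w_ji^T x_i(n) + b_j )

    W_j      : array of shape (m0, p+1), row i is the weight vector w_ji
    X_states : array of shape (m0, p+1), row i is x_i(n) = [x_i(n),...,x_i(n-p)]
    """
    v_j = np.sum(W_j * X_states) + b_j   # induced local field v_j(n)
    return phi(v_j)

# Example with m0 = 2 input synapses of memory depth p = 3
rng = np.random.default_rng(0)
W_j = rng.normal(size=(2, 4))
X_states = rng.normal(size=(2, 4))
print(neuron_output(W_j, X_states, b_j=0.1))
```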

Time Delayed Networks II: Learning Law

To train such a network we use the Temporal BackPropagation algorithm. We present the algorithm below.

Assume that neuron j lies in the output layer and its response at time n is denoted by y_j(n), while its desired response is given by d_j(n).

We can define an instantaneous value for the sum of squared errors produced by the network as follows:

E(n) = \frac{1}{2} \sum_j e_j^2(n)

Time Delayed Networks II: Learning Law-1

The error signal at the output layer is defined by:

e_j(n) = d_j(n) - y_j(n)

The idea is to minimise an overall cost function, calculated over all time:

E_{total} = \sum_n E(n)

We could proceed as usual by calculating the gradient of the cost function with respect to the weights. This implies that we need to calculate the instantaneous gradient:

\frac{\partial E_{total}}{\partial w_{ji}} = \sum_n \frac{\partial E(n)}{\partial w_{ji}}

Time Delayed Networks II: Learning Law-2

However, for this approach to work we need to unfold the network in time (i.e. to convert it to an equivalent static network and then calculate the gradient). This option presents a number of disadvantages:

Loss of symmetry between the forward and backward passes for the calculation of the instantaneous gradient;

No nice recursive formula for the propagation of error terms;

Need for global bookkeeping to keep track of which static weights are actually the same in the equivalent network.

Time Delayed Networks II: Learning Law-3

For these reasons we prefer to calculate the gradient of the cost function as follows:

\frac{\partial E_{total}}{\partial w_{ji}} = \sum_n \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}}

Note that in general:

\frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}} \ne \frac{\partial E(n)}{\partial w_{ji}}

The equality is correct only when we take the sum over all time.

Time Delayed Networks II: Learning Law-4

To calculate the weight update we use the steepest descent method:

w_{ji}(n+1) = w_{ji}(n) - \eta\, \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}

Where \eta is the learning rate. We calculate the terms in the above relation as follows:

\frac{\partial v_j(n)}{\partial w_{ji}(n)} = x_i(n)

This follows directly from the definition of the induced local field v_j(n).

Time Delayed Networks II: Learning Law-5

We define the local gradient as:

\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)}

Thus we can write the weight update equation in the familiar form:

w_{ji}(n+1) = w_{ji}(n) + \eta\, \delta_j(n)\, x_i(n)

We need to calculate \delta_j(n) for the cases of output and hidden layers.

Time Delayed Networks II: Learning Law-6

For the output layer the local gradient is given by:

\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi'(v_j(n))

For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have:

\delta_j(n) = -\frac{\partial E_{total}}{\partial v_j(n)} = -\sum_{r \in A} \sum_k \frac{\partial E_{total}}{\partial v_r(k)} \frac{\partial v_r(k)}{\partial v_j(n)}

Time Delayed Networks II: Learning Law-7

By re-writing we get the following:

\delta_j(n) = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial v_j(n)} = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)}

Finally, putting it all together we get:

\delta_j(n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{k=n}^{n+p} \delta_r(k)\, w_{rj}(k-n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{l=0}^{p} \delta_r(n+l)\, w_{rj}(l)

Time Delayed Networks II: Learning Law-8

Where l indexes the delays (l = 0, ..., p) of the weight vector w_rj. A minimal code sketch of these local-gradient formulas is given below.
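The following is a minimal sketch of the two local-gradient formulas (output and hidden layer) of Temporal BackPropagation, under illustrative assumptions: tanh is used for φ, and the downstream local gradients and weight vectors are kept in plain dictionaries. None of these implementation choices come from the slides; only the formulas do.

```python
import numpy as np

def local_gradient_output(e_j_n, v_j_n, dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """Output-layer local gradient: delta_j(n) = e_j(n) * phi'(v_j(n))."""
    return e_j_n * dphi(v_j_n)

def local_gradient_hidden(v_j_n, deltas_next, w_next, n, p,
                          dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """Hidden-layer local gradient of temporal backpropagation:

        delta_j(n) = phi'(v_j(n)) * sum_{r in A} sum_{l=0..p} delta_r(n+l) * w_rj(l)

    deltas_next : dict r -> array of delta_r over time (assumed long enough)
    w_next      : dict r -> weight vector w_rj of length p+1
    """
    total = 0.0
    for r, w_rj in w_next.items():
        for l in range(p + 1):
            total += deltas_next[r][n + l] * w_rj[l]
    return dphi(v_j_n) * total

# Tiny usage: one downstream neuron r, memory depth p = 2
deltas_next = {"r": np.array([0.30, 0.10, -0.20, 0.05, 0.00])}
w_next = {"r": np.array([0.5, 0.2, -0.1])}
print(local_gradient_hidden(v_j_n=0.4, deltas_next=deltas_next,
                            w_next=w_next, n=1, p=2))
```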

Recurrent I

A network is called recurrent when there are connections which feed back to previous layers or neurons, including self-connections. An example is shown next.

Successful early models of recurrent networks are:
The Jordan Network
The Elman Network

Recurrent I

The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context units in a 1-1 fashion. The context units also feed back to themselves.

The network is trained by using the Backpropagation algorithm. A schematic is shown in the next figure:

[Figure: the Jordan Network]

Recurrent I

The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context units in a 1-1 fashion; these connections are fixed and equal to 1. It is also called the Simple Recurrent Network (SRN).

The network is trained by using the Backpropagation algorithm. A schematic is shown in the next figure, and a minimal forward-pass sketch follows below:

[Figure: the Elman (SRN) Network]
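A minimal forward-pass sketch of an Elman SRN, assuming tanh hidden units, small random initial weights and no training loop; the class name and layer sizes are illustrative choices, not taken from the slides.

```python
import numpy as np

class ElmanSRN:
    """Minimal Elman (Simple Recurrent Network) forward pass.

    Hidden activations are copied 1-1 into the context units (fixed weights
    of 1) and fed back as extra inputs at the next time step.
    """
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        self.W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))
        self.context = np.zeros(n_hidden)   # context units, initially zero

    def step(self, u):
        h = np.tanh(self.W_in @ u + self.W_ctx @ self.context)
        self.context = h.copy()             # 1-1 copy of the hidden layer
        return self.W_out @ h

# Run a short input sequence through the network
net = ElmanSRN(n_in=3, n_hidden=5, n_out=2)
outputs = [net.step(u) for u in np.random.default_rng(1).normal(size=(4, 3))]
```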

Recurrent II

More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.

Typical paradigms of complex recurrent models are:
The Nonlinear Autoregressive with Exogenous Inputs Network (NARX)
The State Space Model
The Recurrent Multilayer Perceptron (RMLP)

Schematic representations of the networks are given in the next slides.

Recurrent II-1

The structure of the NARX model includes:

An MLP static network;

The current input u(n) and its delayed versions up to a lag q;

A time-delayed version of the current output y(n), which feeds back to the input layer; the memory of the delayed output vector is in general p.

The output is calculated as (see the code sketch after the schematic on the next slide):

y(n+1) = F(y(n), \ldots, y(n-p+1), u(n), \ldots, u(n-q+1))

Recurrent II-2

A schematic of the NARX model is as follows:

[Figure: the NARX network]
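The one-step-ahead NARX recursion can be sketched as below. The zero initialisation of the delay lines, the toy choice of F (a fixed random linear map through tanh) and the function name are assumptions for illustration; in practice F would be a trained MLP.

```python
import numpy as np
from collections import deque

def narx_predict(F, u_seq, p, q, y0=0.0):
    """One-step-ahead NARX recursion:
        y(n+1) = F(y(n),...,y(n-p+1), u(n),...,u(n-q+1))

    F is any static map (e.g. a trained MLP) taking a (p+q)-vector.
    """
    y_delays = deque([y0] * p, maxlen=p)     # y(n), ..., y(n-p+1)
    u_delays = deque([0.0] * q, maxlen=q)    # u(n), ..., u(n-q+1)
    outputs = []
    for u_n in u_seq:
        u_delays.appendleft(u_n)
        y_next = F(np.array(list(y_delays) + list(u_delays)))
        outputs.append(y_next)
        y_delays.appendleft(y_next)          # feed the output back
    return np.array(outputs)

# Example with a toy "network" F and p = 2, q = 3
rng = np.random.default_rng(0)
w = rng.normal(size=5)                       # p + q = 5 inputs to F
y_hat = narx_predict(lambda z: np.tanh(w @ z), u_seq=rng.normal(size=10), p=2, q=3)
```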

Recurrent II-3

The structure of the State Space model includes:

An MLP network with a single hidden layer;

The hidden neurons define the state of the network;

A linear output layer;

Feedback of the hidden layer to the input layer, assuming a memory of q lags;

The output is determined by the coupled equations:

x(n+1) = f(x(n), u(n))
y(n+1) = C\, x(n+1)

Recurrent II-4

Where f is a suitable nonlinear function characterising the hidden layer. x is the state vector, as produced by the hidden layer, and has q components. y is the output vector and has p components. The input vector is given by u and has m components.

A schematic representation of the network is given below, followed by a one-step sketch in code:

[Figure: the state space model]
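A one-step sketch of the state-space model. The concrete form chosen for f (a single tanh layer) and all parameter names are assumptions, since the slides only give the coupled equations.

```python
import numpy as np

def state_space_step(x, u, W_x, W_u, b, C, phi=np.tanh):
    """One step of the state-space recurrent model:
        x(n+1) = f(x(n), u(n)) = phi(W_x x(n) + W_u u(n) + b)
        y(n+1) = C x(n+1)
    """
    x_next = phi(W_x @ x + W_u @ u + b)
    y_next = C @ x_next
    return x_next, y_next

# Example: q = 4 state units, m = 2 inputs, p = 1 output
rng = np.random.default_rng(0)
q_dim, m_dim, p_dim = 4, 2, 1
W_x, W_u = rng.normal(size=(q_dim, q_dim)), rng.normal(size=(q_dim, m_dim))
b, C = np.zeros(q_dim), rng.normal(size=(p_dim, q_dim))
x = np.zeros(q_dim)
for u in rng.normal(size=(5, m_dim)):
    x, y = state_space_step(x, u, W_x, W_u, b, C)
```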

Recurrent II-5

The structure of the RMLP includes:

One or more hidden layers;

Feedback around each layer;

The general structure of a static MLP network;

The output is calculated as follows (assuming that x_I, x_II and x_O are the first, second and output layer outputs):

x_I(n+1) = \varphi_I(x_I(n), u(n))
x_{II}(n+1) = \varphi_{II}(x_{II}(n), x_I(n+1))
x_O(n+1) = \varphi_O(x_O(n), x_{II}(n+1))

Recurrent II-6

Where the functions \varphi_I(\cdot), \varphi_{II}(\cdot) and \varphi_O(\cdot) denote the activation functions of the corresponding layers.

A schematic representation is given below, followed by a forward-step sketch in code:

[Figure: the RMLP network]
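A forward-step sketch of a two-hidden-layer RMLP with feedback around each layer. Each φ_* is sketched here as tanh of a linear combination of the layer's own previous output and its current input, which is an assumption; the slides only specify the functional dependencies.

```python
import numpy as np

def rmlp_step(x_I, x_II, x_O, u, params, phi=np.tanh):
    """One step of an RMLP with feedback around each layer:
        x_I(n+1)  = phi_I(x_I(n), u(n))
        x_II(n+1) = phi_II(x_II(n), x_I(n+1))
        x_O(n+1)  = phi_O(x_O(n), x_II(n+1))
    """
    (A1, B1), (A2, B2), (A3, B3) = params
    x_I_next = phi(A1 @ x_I + B1 @ u)
    x_II_next = phi(A2 @ x_II + B2 @ x_I_next)
    x_O_next = phi(A3 @ x_O + B3 @ x_II_next)
    return x_I_next, x_II_next, x_O_next

# Example sizes: 3 inputs, layers of 5, 4 and 2 units
rng = np.random.default_rng(0)
params = [(rng.normal(size=(5, 5)), rng.normal(size=(5, 3))),
          (rng.normal(size=(4, 4)), rng.normal(size=(4, 5))),
          (rng.normal(size=(2, 2)), rng.normal(size=(2, 4)))]
states = (np.zeros(5), np.zeros(4), np.zeros(2))
states = rmlp_step(*states, u=rng.normal(size=3), params=params)
```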

Recurrent II-7

Some theorems on the computational power of recurrent networks:

Thm 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.

Thm 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.

Recurrent II-8

Corollary: NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.

Recurrent II-9

The training of recurrent networks can be done with two methods:

BackPropagation Through Time
Real-Time Recurrent Learning

We can train a recurrent network with either epoch-based or continuous training operation. However, an epoch in recurrent networks does not mean the presentation of all learning patterns; rather, it denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern to the network. At the end of an epoch the network stabilises.

Recurrent II-10

Some useful heuristics for training are given below:

Lexicographic order of training samples should be followed, with the shortest strings of symbols being presented to the network first;

The training should begin with a small training sample, and its size should then be incrementally increased as the training proceeds;

The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed by the network is greater than some prescribed criterion;

Recurrent II-11

The use of weight decay during training is recommended; weight decay was discussed in WK3.

The BackPropagation Through Time algorithm proceeds by unfolding a network in time. To be more specific:

Assume that we have a recurrent network N which is required to learn a temporal task starting from time n0 and going all the way to time n.

Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.

Recurrent II-12

The network N* is related to the original network N as follows:

For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in network N;

In every layer of network N* there is a copy of each neuron in network N;

For every time step l in [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.

The following example explains the idea of unfolding:

Recurrent II-13

We assume that we have a network with two neurons, which is unfolded for a number of steps n:

[Figure: a two-neuron recurrent network unfolded in time]

Recurrent II-14

We present now the method of Epochwise BackPropagation Through Time.

Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 its end time.

We can define the following cost function:

E_{total}(n_0, n_1) = \frac{1}{2} \sum_{n=n_0}^{n_1} \sum_{j \in A} e_j^2(n)

Recurrent II-15

Where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and e_j(n) is the error signal at the output of such a neuron, measured with respect to some desired response.

Recurrent II-16

The algorithm proceeds as follows:

1. For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn't have to be the same for each epoch of training. Rather, what is important is that the initial state for the new epoch is different from the state reached by the network at the end of the previous epoch;

Recurrent II-17

2. First, a single forward pass of the data through the network over the interval (n0, n1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved;

3. A single backward pass over this past record is performed to compute the values of the local gradients:

\delta_j(n) = -\frac{\partial E_{total}(n_0, n_1)}{\partial v_j(n)}

for all j \in A and n_0 < n \le n_1. This computation is performed by the formula:

Recurrent II-18

\delta_j(n) = \varphi'(v_j(n))\, e_j(n), \quad \text{for } n = n_1
\delta_j(n) = \varphi'(v_j(n)) \left[ e_j(n) + \sum_{k \in A} w_{jk}\, \delta_k(n+1) \right], \quad \text{for } n_0 < n < n_1

Where \varphi'(\cdot) is the derivative of the activation function with respect to its argument, and v_j(n) is the induced local field of neuron j.

The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved here is equal to the number of time steps contained in the epoch.

Recurrent II-19

4. Once the computation of backpropagation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight w_ji of neuron j:

\Delta w_{ji} = -\eta \frac{\partial E_{total}}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_j(n)\, x_i(n-1)

Where \eta is the learning rate parameter and x_i(n-1) is the input applied to the i-th synapse of neuron j at time n-1.

A minimal code sketch of the whole epochwise procedure is given below.
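The following is a minimal sketch of epochwise BPTT for a small fully recurrent network, under illustrative assumptions: the state update uses tanh of a single weight matrix applied to the previous outputs, the current input and a bias; all K neurons are treated as output neurons (the set A); and all names and sizes are invented for the example. The slides specify only the forward pass, the two local-gradient cases and the summed weight update.

```python
import numpy as np

def epochwise_bptt(W, inputs, targets, eta=0.1, phi=np.tanh,
                   dphi=lambda v: 1.0 - np.tanh(v) ** 2):
    """One epoch of epochwise BackPropagation Through Time for K recurrent
    neurons.  Assumed state update: v(n) = W @ [y(n-1), u(n), 1], y(n) = phi(v(n)).
    Returns the updated weight matrix."""
    K = W.shape[0]
    T = len(inputs)                        # number of time steps in the epoch
    y = np.zeros(K)
    vs, zs, errs = [], [], []

    # Forward pass over the epoch; save the complete record
    for n in range(T):
        z = np.concatenate([y, inputs[n], [1.0]])
        v = W @ z
        y = phi(v)
        vs.append(v); zs.append(z); errs.append(targets[n] - y)

    # Backward pass: delta(n1) = phi'(v) e(n1);
    # delta(n) = phi'(v) [ e(n) + sum_k w_jk delta_k(n+1) ] for earlier n
    deltas = [None] * T
    deltas[T - 1] = dphi(vs[T - 1]) * errs[T - 1]
    for n in range(T - 2, -1, -1):
        deltas[n] = dphi(vs[n]) * (errs[n] + W[:, :K].T @ deltas[n + 1])

    # Weight update: dW_ji = eta * sum_n delta_j(n) x_i(n-1)
    dW = eta * sum(np.outer(deltas[n], zs[n]) for n in range(T))
    return W + dW

# Toy usage: K = 2 neurons, 1 external input, an epoch of 5 steps
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 2 + 1 + 1))     # [recurrent | input | bias]
W = epochwise_bptt(W, inputs=rng.normal(size=(5, 1)), targets=rng.normal(size=(5, 2)))
```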

Recurrent II-20

There is a potential problem with the method, called the Vanishing Gradients Problem: the corrections calculated for the weights are not large enough when methods based on steepest descent are used.

However, this is currently a research problem and one has to consult the literature for details.

Conclusions

Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks.

Time representation takes place explicitly or implicitly.

The implicit form includes time-delayed versions of the input vector followed by a static network model, or the use of recurrent networks.

The explicit form uses a generalisation of the MLP model where a synapse is modelled as a weight vector and not as a single number. The synapse activation is no longer the product of the synapse's weight with the output of a previous neuron, but rather the inner product of the weight vector with the state vector of delayed inputs.

Conclusions I

The extended MLP networks with explicit temporal structure are trained with the Temporal BackPropagation algorithm.

The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard BackPropagation algorithm.

In the more complex cases we first unfold the network in time and then train it using the BackPropagation Through Time algorithm.
