
Data-driven modelling in water-related problems. PART 2.

Neural Networks
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl

UNESCO-IHE Institute for Water Education


Department of Hydroinformatics and Knowledge management

Artificial neural networks


Artificial neural networks (ANN): main types

Artificial Neural Networks

- Feed-forward networks: linear, non-linear
- Feedback networks: Hopfield model, Boltzmann machine
- Self-organising networks: feature maps, ART

Learning paradigms: supervised and unsupervised.


Linear regression as a simple ANN


(Figure: the regression line y = a0 + a1 x fitted to the observed data points (x(t), y(t)); for a new input value x(v) the model predicts a new output value y(v).)

In the one-dimensional case (one input x), given T data vectors {x(t), y(t)}, t = 1,...,T, the coefficients of the equation

y = f(x) = a1 x + a0

can be found. Then, for V new vectors {x(v)}, v = 1,...,V, this equation can approximately reproduce the corresponding function values {y(v)}, v = 1,...,V.

How to measure the error?

The least-squares error is used since it gives the best estimation of the parameters, provided the errors of the individual measurements are independent and normally distributed. An optimization problem has to be solved: find such a0 and a1 that E is minimal:

$$ E = \sum_{t=1}^{T} \left( y^{(t)} - (a_0 + a_1 x^{(t)}) \right)^2 $$

In a similar fashion the problem can be posed for multiple regression (with many inputs), e.g.

y = a0 + a1 x1 + a2 x2

(Figure: a single linear node with inputs x1, x2, weights a1, a2 and bias a0.)
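As an illustration of the least-squares estimation above, a minimal NumPy sketch (synthetic data; the variable names a0, a1 follow the slide notation):

import numpy as np

# synthetic data: y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(50)

# design matrix [1, x]; least squares minimizes E = sum_t (y(t) - (a0 + a1*x(t)))^2
A = np.column_stack([np.ones_like(x), x])
(a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a0, a1)   # estimated intercept and slope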


Function approximation: linear regression and ANN

Linear regression: Y = a1 X + a0

Neural network approximation: Y = f(X, a1, ..., an)


ANN: multi-layer perceptron (MLP)


(Figure: inputs x1 ... xNinp feed, through weights aij, a hidden layer y1 ... yNhid, which feeds, through weights bjk, the outputs z1 ... zNout; the ANN f(X) approximates the modelled (real) system Z = F(X), and the error F(X) - f(X) is minimized.)

Output of hidden node j:
$$ y_j = F\Big(a_{0j} + \sum_{i=1}^{N_{inp}} a_{ij} x_i\Big), \quad j = 1, \ldots, N_{hid} $$

Output of output node k:
$$ z_k = F\Big(b_{0k} + \sum_{j=1}^{N_{hid}} b_{jk} y_j\Big), \quad k = 1, \ldots, N_{out} $$

Transfer function (binary sigmoid): F(u) = 1 / (1 + e^(-u))

There are (Ninp + 1) Nhid + (Nhid + 1) Nout weights (aij and bjk) to be identified by minimizing the mean squared error (F(X) - f(X))^2. The method used is a gradient-based steepest descent method, called error backpropagation.
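A small sketch of the forward pass of such an MLP (illustrative Python, not the original course code; A, a0, B, b0 hold the weights aij, a0j, bjk, b0k):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward(x, A, a0, B, b0):
    """x: inputs (Ninp,); A: (Nhid, Ninp) weights a_ij; a0: biases a_0j;
    B: (Nout, Nhid) weights b_jk; b0: biases b_0k."""
    y = sigmoid(a0 + A @ x)   # hidden-layer outputs y_j
    z = sigmoid(b0 + B @ y)   # network outputs z_k
    return y, z

# example: Ninp=3, Nhid=4, Nout=2 -> (3+1)*4 + (4+1)*2 = 26 weights in total
rng = np.random.default_rng(1)
A, a0 = rng.normal(size=(4, 3)), rng.normal(size=4)
B, b0 = rng.normal(size=(2, 4)), rng.normal(size=2)
y, z = mlp_forward(np.array([0.1, -0.5, 0.3]), A, a0, B, b0)
print(z)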

ANN: identification of weights by training (calibration)

ANN error in reproducing the observed output (OBSi) is:


$$ E = \sum_{i=1}^{N_{examp}} (OBS_i - ANN_i)^2 $$

Training of an ANN consists in solving a (multi-extremum) optimization problem: find such values of the weights that bring E to a minimum. A problem of the backpropagation algorithm is that it assumes single-extremality (a single minimum).

ANN as a universal model (function approximator)


- ANN is a model with multiple (hundreds of) parameters
- ANN is capable of reproducing complex non-linear relationships
- ANN calibration (training) requires large amounts of data
- the resulting model is fast
- ANN can be used where physically based models fail, or it may complement physically based models


Biological motivation
- signals are transmitted between neurons by electrical pulses
- the neuron sums up the effects of thousands of impulses
- if the integrated potential exceeds a threshold, the cell fires: it generates an impulse that travels further along the axon

(Figure: biological neurons, showing dendrites and cell bodies.)

Hidden node
(Figure: a hidden node with two inputs computes u = a0 + a1 x1 + a2 x2 and outputs y = g(u).)

Inputs to the network are xi, i = 1, ..., Ninp.

Output of the j-th node of the hidden layer is
$$ y_j = g\Big(a_{0j} + \sum_{i=1}^{N_{inp}} a_{ij} x_i\Big), \quad j = 1, \ldots, N_{hid} $$

Output node
(Figure: an output node with two inputs computes v = b0 + b1 y1 + b2 y2 and outputs z = g(v).)

Its inputs are the outputs of the hidden nodes y1 ... yNhid; the outputs are
$$ z_k = g\Big(b_{0k} + \sum_{j=1}^{N_{hid}} b_{jk} y_j\Big), \quad k = 1, \ldots, N_{out} $$

Transfer function g

The transfer function is usually non-linear, bounded and differentiable. Widely used is the logistic function:
$$ g(u) = \frac{1}{1 + e^{-u}} $$
(Figure: plot of the logistic function; the output value lies between 0 and 1, and the slope at the origin equals 1/4.)


Derivative of g

$$ \frac{\partial g(u)}{\partial u} = \frac{\partial (1 + e^{-u})^{-1}}{\partial u} = (-1)(1 + e^{-u})^{-2} (-e^{-u}) = e^{-u} (1 + e^{-u})^{-2} = e^{-u} g^2(u) $$

Note that
$$ e^{-u} = \frac{1 - g(u)}{g(u)} $$

Then
$$ \frac{\partial g(u)}{\partial u} = \frac{1 - g(u)}{g(u)} \, g^2(u) = (1 - g(u)) \, g(u) $$
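A quick numerical check of this identity, comparing the analytic derivative (1 - g(u)) g(u) with a central finite difference:

import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (g(u + h) - g(u - h)) / (2 * h)    # central difference
analytic = (1.0 - g(u)) * g(u)               # derived form
print(np.max(np.abs(numeric - analytic)))    # difference at round-off level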


ANN complexity and its approximating ability


The combination of the transfer functions of the hidden nodes produces a complex function; with many hidden nodes, any function can be approximated.


ANN complexity and its approximating ability: example of approximating a harmonic function


- one input x and two outputs y1 and y2
- the outputs are given by sin(x) and cos(x)
- the data are generated by running x from 0 to 6.28 with the step 0.02
Training set: 315 instances
x        y1 (sin x)   y2 (cos x)
0.0000    0.00000      1.00000
0.0200    0.02000      0.99980
0.0400    0.03999      0.99920
...
1.5400    0.99953      0.03079
1.5600    0.99994      0.01080
1.5800    0.99996     -0.00920
1.6000    0.99957     -0.02920
...
6.2400   -0.04318      0.99907
6.2600   -0.02319      0.99973
6.2800   -0.00319      0.99999

Test set: 12 instances


x        y1 (sin x)   y2 (cos x)
0.0000    0.00000      1.00000
0.3600    0.35227      0.93590
0.9800    0.83050      0.55702
1.8200    0.96911     -0.24663
3.6400   -0.47802     -0.87835
4.4600   -0.96832     -0.24972
5.6200   -0.61563      0.78803
5.8400   -0.42882      0.90339
6.1200   -0.16247      0.98671
6.2400   -0.04318      0.99907
6.2600   -0.02319      0.99973
6.2800   -0.00319      0.99999
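Both sets can be regenerated with a few lines of Python (a sketch; the step 0.02 over [0, 6.28] gives the 315 training instances):

import numpy as np

x_train = np.arange(0.0, 6.28 + 1e-9, 0.02)              # 315 values: 0.00, 0.02, ..., 6.28
train = np.column_stack([x_train, np.sin(x_train), np.cos(x_train)])

x_test = np.array([0.00, 0.36, 0.98, 1.82, 3.64, 4.46,
                   5.62, 5.84, 6.12, 6.24, 6.26, 6.28])  # 12 test instances
test = np.column_stack([x_test, np.sin(x_test), np.cos(x_test)])
print(train.shape, test.shape)   # (315, 3) (12, 3)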


Performance of an ANN as its complexity increases

(Figure: approximation of the harmonic function with a) 1 hidden node, b) 2 hidden nodes, c) 3 hidden nodes, d) 4 hidden nodes.)

ANN training as an optimization problem

If Nout functions, each of Ninp independent (input) variables, are given, together with T instances (vectors)
$$ \{\, x_1^{(t)}, x_2^{(t)}, \ldots, x_{N_{inp}}^{(t)},\; f_1^{(t)}, f_2^{(t)}, \ldots, f_{N_{out}}^{(t)} \,\}, \quad t = 1, \ldots, T $$
then, on the basis of these instances, the ANN can be trained so that the error is minimal. If it is then presented with V other instances (vectors)
$$ \{\, x_1^{(v)}, x_2^{(v)}, \ldots, x_{N_{inp}}^{(v)} \,\}, \quad v = 1, \ldots, V $$
it will approximately reproduce the corresponding function values
$$ \{\, f_1^{(v)}, f_2^{(v)}, \ldots, f_{N_{out}}^{(v)} \,\}, \quad v = 1, \ldots, V $$


Detailed description of ANN error to be minimized

For output k the error for the input pattern t is:
$$ E_k^{(t)} = (f_k^{(t)} - z_k^{(t)})^2 $$
The total error over all outputs for input pattern t is:
$$ E^{(t)} = \frac{1}{2} \sum_k (f_k^{(t)} - z_k^{(t)})^2 $$
The total error is the summation of the errors for all output nodes and all T instances:
$$ E_{tot} = \frac{1}{2} \sum_t \sum_k (f_k^{(t)} - z_k^{(t)})^2 \;\rightarrow\; \min $$
$$ = \frac{1}{2} \sum_t \sum_k \Big[ f_k^{(t)} - g_{out}\Big(b_{0k} + \sum_j b_{jk} y_j\Big) \Big]^2 $$
$$ = \frac{1}{2} \sum_t \sum_k \Big[ f_k^{(t)} - g_{out}\Big(b_{0k} + \sum_j b_{jk}\, g_{hid}\Big(a_{0j} + \sum_i a_{ij} x_i^{(t)}\Big)\Big) \Big]^2 $$

Error function w.r.t. weights (error surfaces) (1)


Error function w.r.t. weights (error surfaces) (2)

The error function is a complex multi-extremum function.


Error function w.r.t. weights (error surfaces) (3)


Decreasing error during training


Using steepest descent


Choosing the right step


(Figure: gradient descent with a step of the right size, and with a step that is too large.)


Backpropagation algorithm (instance based) (1/2)

1. Randomize the weights {ws} (denoted above as the matrices a and b) to small random values (both positive and negative), to ensure that the network is not saturated by large values of the weights.
2. Select an instance t, that is, the vector {xi(t)}, i = 1,...,Ninp (a pair of input and output patterns), from the training set.
3. Apply the input vector to the network input.
4. Calculate the network output vector {zk(t)}, k = 1,...,Nout.
5. Calculate the errors for each of the outputs k, k = 1,...,Nout, as the difference between the desired output and the network output:
$$ E = E_k^{(t)} = (f_k^{(t)} - z_k^{(t)})^2 $$

...

Backpropagation algorithm (instance based) (2/2)

...
6. Calculate the necessary updates Δws for the weights ws in a way that minimizes this error (discussed below).
7. Adjust the weights of the network by Δws.
8. Repeat steps 2-6 for each instance (pair of input-output vectors) in the training set until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.


Backpropagation algorithm (with cumulative updates)

1-6. As above.
7. Add the calculated weight updates {Δws} to the accumulated total updates {ΔWs}.
8. Repeat steps 2-7 for several instances comprising an epoch (this could be the whole training set).
9. Adjust the weights {ws} of the network by the updates {ΔWs}.
10. Repeat steps 2-9 until all instances in the training set are processed. This constitutes one iteration.
11. Repeat the iteration of steps 2-10 until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.
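A schematic sketch of this cumulative variant (grad_E stands for a hypothetical helper that returns the gradient of E with respect to the weights for one instance, computed by backpropagation as derived on the following slides):

import numpy as np

def train_epochs(w, X, F, grad_E, eta=0.1, n_epochs=100):
    """Batch (cumulative) updates: accumulate over an epoch, then adjust the weights."""
    for epoch in range(n_epochs):
        total_update = np.zeros_like(w)
        for x, f in zip(X, F):                       # loop over the instances of the epoch
            total_update += -eta * grad_E(w, x, f)   # step 7: accumulate the updates
        w = w + total_update                         # step 9: adjust the weights
    return w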

How to update weights

Optimization is done by the steepest descent algorithm: steps are made in the space of variables (the weights w) in the direction opposite to the gradient of the function E:
$$ \mathbf{w}(N+1) = \mathbf{w}(N) - \eta \, \nabla E(\mathbf{w}(N)) $$
In terms of individual weights the changes will be:
$$ w_s(N+1) = w_s(N) + \Delta w_s $$
and the update step for weight s is:
$$ \Delta w_s = -\eta \, \frac{\partial E}{\partial w_s} $$
(this is the delta rule of Widrow and Hoff (1960) for a single linear perceptron)

How to update weights

A typical chain of nodes and weights is considered:

input node i (value xi) → weight aij → hidden node j (uj, yj = gjhid(uj)) → weight bjk → output node k (vk, zk = gkout(vk)) → compare with the measured (target) value fk

The error to minimize during training is E = (fk - zk)^2.


Weights for the output layer

The update for an output weight is:
$$ \Delta b_{jk} = -\eta \, \frac{\partial E}{\partial b_{jk}} $$
The derivatives can be found using the chain rule:
$$ \frac{\partial E}{\partial b_{jk}} = \frac{\partial E}{\partial z_k} \frac{\partial z_k}{\partial v_k} \frac{\partial v_k}{\partial b_{jk}} $$
This gives:
$$ \Delta b_{jk} = \eta \, \delta_k \, y_j $$
where
$$ \delta_k = 2 (f_k - z_k) \, \frac{\partial g_k(v)}{\partial v} $$


Weights for the hidden layer (1)

Hidden nodes do not have explicit error values. Such errors are propagated from each node of the output layer back to each node of the hidden layer:
$$ E^{(t)} = \frac{1}{2} \sum_k (f_k^{(t)} - z_k^{(t)})^2 $$
$$ \Delta a_{ij} = -\eta \, \frac{\partial E^{(t)}}{\partial a_{ij}} = -\eta \sum_k \frac{\partial E_k^{(t)}}{\partial a_{ij}} $$
$$ \frac{\partial E^{(t)}}{\partial a_{ij}} = \sum_k \frac{\partial E_k^{(t)}}{\partial y_j} \frac{\partial y_j}{\partial u_j} \frac{\partial u_j}{\partial a_{ij}} = \sum_k \frac{\partial E_k^{(t)}}{\partial z_k} \frac{\partial z_k}{\partial v_k} \frac{\partial v_k}{\partial y_j} \frac{\partial y_j}{\partial u_j} \frac{\partial u_j}{\partial a_{ij}} $$


Weights for the hidden layer (2)

Finally, the update for the weights a is:
$$ \Delta a_{ij} = \eta \, x_i \, \frac{\partial g_j(u)}{\partial u} \sum_k \delta_k \, b_{jk} $$
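Putting the output-layer and hidden-layer rules together, a compact instance-based update sketch (A, a0, B, b0 are the weight matrices and bias vectors as in the forward-pass sketch earlier, eta stands for the learning rate η, and the factor 2 from the squared error is kept as in δk above):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(x, f, A, a0, B, b0, eta=0.1):
    """One steepest-descent update of all weights for a single pattern (x, f); arrays must be float."""
    # forward pass
    u = a0 + A @ x; y = sigmoid(u)              # hidden layer: y_j = g(u_j)
    v = b0 + B @ y; z = sigmoid(v)              # output layer: z_k = g(v_k)
    # output-layer deltas: delta_k = 2 (f_k - z_k) g'(v_k), with g'(v) = z (1 - z)
    delta = 2.0 * (f - z) * z * (1.0 - z)
    # hidden-layer term: g'(u_j) * sum_k delta_k b_jk (computed before B is changed)
    h = (B.T @ delta) * y * (1.0 - y)
    # apply the updates: Delta b_jk = eta delta_k y_j,  Delta a_ij = eta h_j x_i
    B += eta * np.outer(delta, y); b0 += eta * delta
    A += eta * np.outer(h, x);     a0 += eta * h
    return A, a0, B, b0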


Improving the learning rule

Momentum:
$$ \Delta w_s(N+1) = -\eta \, \frac{\partial E}{\partial w_s} + \alpha \, \Delta w_s(N) $$

Adaptive learning rate ηs(N):
$$ \Delta w_s(N+1) = -(1 - \alpha) \, \eta_s(N) \, \frac{\partial E}{\partial w_s} + \alpha \, \Delta w_s(N) $$

where ηs(N) is the learning rate, which is updated according to the following rule:
$$ \eta_s(N) = \eta_s(N-1) + \kappa, \quad \text{if } RAED(N-1) \, \frac{\partial E}{\partial w_s} > 0 $$
$$ \eta_s(N) = \eta_s(N-1) \, \varphi, \quad \text{otherwise} $$
Here RAED is the recent "average" of that derivative, calculated recursively:
$$ RAED(N) = (1 - \theta) \, \frac{\partial E}{\partial w_s} + \theta \, RAED(N-1) $$
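A sketch of the momentum rule (alpha denotes the momentum coefficient and dE_dw the gradient obtained from backpropagation; the per-weight adaptive rate is handled analogously by keeping an array of learning rates):

import numpy as np

def momentum_update(dE_dw, delta_w_prev, eta=0.1, alpha=0.5):
    """Delta w(N+1) = -eta * dE/dw + alpha * Delta w(N), applied element-wise."""
    return -eta * dE_dw + alpha * delta_w_prev

# usage inside a training loop (the gradient dE_dw comes from backpropagation):
delta_w = np.zeros(5)                        # previous updates, remembered between steps
dE_dw = np.array([0.2, -0.1, 0.05, 0.0, 0.3])
delta_w = momentum_update(dE_dw, delta_w)    # new updates; then weights w += delta_w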

Choice of the learning constants

The recommended values of the learning constants


- 0.1 (0.05 is often better)
- 0.5
- 0.7 (or 0.5)
- 0.9 (or 0.5)


Practical issues of training

Preparing data: scale the data to prevent network paralysis. The transfer function saturates quickly, e.g.

g(1.0) = 0.762, g(2.0) = 0.964, g(3.0) = 0.995, g(4.0) = 0.999

so scale the input data to [-3, +3] (a small sketch follows below).

Other practical choices:
- the number of hidden nodes Nhid (in relation to Ninp and Nout)
- the choice of the activation functions
- removing part of the connections ("optimal brain damage")
- dealing with local optima (re-randomize the weights)
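A minimal sketch of such a scaling step (one common option is to standardise each input and clip it to [-3, +3]):

import numpy as np

def scale_inputs(X):
    """Standardise each column to zero mean / unit std and clip to [-3, +3]."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-12          # avoid division by zero for constant columns
    return np.clip((X - mean) / std, -3.0, 3.0)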


Radial basis function networks


Function approximation by combining functions

- linear regression
- splines: cubic functions that pass through the points, with their 1st and 2nd derivatives equal at the boundaries
- orthogonal functions (e.g. Chebyshev polynomials)
- combining simple kernel functions


Radial basis functions

- use simple functions F(x) that approximate the given function in the proximity of some representative locations (centers)
- these F(x) depend only on the distance from the centers and drop to zero as the distance from the centers increases

(Figure: J centers placed in the input space.)


Radial basis functions

Consider a function z = f(x), where x is a vector {x1 ... xI} in I-dimensional space. Centers wj, j = 1,...,J are selected, and f(x) is approximated by
$$ z(x) = \sum_{j=1}^{J} F(\,|x - w_j|\,;\, b_j) $$
where |x - wj| is a distance (e.g. Euclidean) and bj are coefficients associated with the j-th center wj.


Radial basis functions

We can choose a linear combination of basis functions:
$$ z(x) = \sum_{j=1}^{J} b_j \, F(\,|x - w_j|\,) $$
It is common to choose the Gaussian function for F:
$$ F(r) = \exp(-r^2 / 2\sigma^2) $$
(σ is analogous to the standard deviation of a Gaussian normal distribution). The distance |x - wj| is usually understood in the Euclidean sense and denoted here as ρj:
$$ \rho_j = \sqrt{\sum_{i=1}^{I} (x_i - w_{ij})^2} $$
so the approximation becomes:
$$ z(x) = \sum_{j=1}^{J} b_j \exp(-\rho_j^2 / 2\sigma_j^2) $$
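A sketch of evaluating such an approximation (the centers, widths σj and amplitudes bj are assumed to be given):

import numpy as np

def rbf_predict(X, centers, sigma, b):
    """z(x) = sum_j b_j * exp(-rho_j^2 / (2 sigma_j^2)) for each row x of X."""
    # rho[n, j] = Euclidean distance between instance n and center j
    rho = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    G = np.exp(-rho**2 / (2.0 * sigma**2))   # Gaussian basis functions
    return G @ b

# example: 2-D input, 3 centers
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
sigma = np.array([0.5, 0.5, 0.8])
b = np.array([1.0, -2.0, 0.5])
print(rbf_predict(np.array([[0.2, 0.1]]), centers, sigma, b))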

Radial basis functions


$$ z(x) = \sum_{j=1}^{J} b_j \exp(-|x - w_j|^2 / 2\sigma_j^2) $$

The problem of approximation requires:
- the placement of the localized Gaussians to cover the space (positions of the centers wj);
- the control of the width of each Gaussian (parameters σj);
- the setting of the amplitude of each Gaussian (parameters bj).


Radial basis function problem viewed as a neural network

(Figure: RBF network structure: inputs xi are connected through the centers wij to hidden nodes yj with Gaussian functions, and through weights bjk to output nodes zk with linear functions.)


Training the RBF network (1)

1. Find the positions of centers {wj}:

- Choose randomly J instances xj and use them as the positions of the centers {wj}.
- Assign all other instances to the class j of the closest center wj, and calculate the location of each center again (using e.g. the k-nearest-neighbour method).
- Repeat the above steps until the locations of the centers stop changing.

2. Calculate the output z(x) from each hidden neuron ...


Training the RBF network (2)

...
3. The weights {bjk} for the output layer are calculated by solving a multiple linear regression problem, formulated as a system of linear equations. The output from output node k can be expressed as
$$ z_k = \frac{\sum_{j=1}^{J} b_{jk} \, y_j}{\sum_{j=1}^{J} y_j} $$
where bjk is the weight on the connection from hidden node j to output node k, and yj is the output from hidden node j.
4. If the total error is more than the desired limit, change the number of hidden units and repeat all the steps.
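A sketch of this two-step procedure (illustrative: the centers are simply taken as randomly selected instances, and the output weights are found by linear least squares; the normalisation by the sum of yj follows the formula above):

import numpy as np

def train_rbf(X, Z, J, sigma):
    """Step 1: choose J centers; steps 2-3: hidden outputs + least squares for the weights b."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=J, replace=False)]    # step 1 (random instances as centers)
    rho = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Y = np.exp(-rho**2 / (2.0 * sigma**2))                    # outputs of the hidden (Gaussian) nodes
    Yn = Y / Y.sum(axis=1, keepdims=True)                     # normalisation: z_k = sum_j b_jk y_j / sum_j y_j
    B, *_ = np.linalg.lstsq(Yn, Z, rcond=None)                # step 3: output weights b_jk by linear regression
    return centers, B

# usage: X is (N, Ninp), Z is (N, Nout); predictions for new data follow the same two steps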

Example: using RBF network to reproduce SIN and COS function


Input file: 1 input (X), 2 outputs (SIN, COS), 315 examples

315 1 2
0.0000  0.00000  1.00000
0.0200  0.02000  0.99980
0.0400  0.03999  0.99920
0.0600  0.05996  0.99820
0.0800  0.07991  0.99680
0.1000  0.09983  0.99500
0.1200  0.11971  0.99281
...
6.2400 -0.04318  0.99907
6.2600 -0.02319  0.99973
6.2800 -0.00319  0.99999

18 centers found

Example: using RBF network to reproduce behaviour of a 1-D modelling system (SOBEK)
Input file: 7 inputs (prev. rainfalls, flows), 1 output (flow), 1303 examples

21 centers found


Radial basis functions: comments


- RBF networks provide a global approximation to the target function, represented by a linear combination of many local kernel functions
- this can be viewed as a smooth linear combination of piecewise (local) non-linear functions: the best function is chosen for a particular range of the input data
- training is faster than for backpropagation networks, since it is done in two steps
- it is an eager method, but it uses the idea of local approximation, as in lazy methods such as k-NN


Other types of connectionist models (neural networks)


Recurrent neural networks

- developed to deal with time-varying or time-lagged patterns
- usable for problems where the dynamics of the considered process is complex and the measured data are noisy
- examples: Hopfield networks, regressive networks, Jordan-Elman networks, and Brain-State-in-a-Box (BSB) networks

(Figure: a recurrent network with inputs, outputs and context units.)


Hopfield network

- belongs to a class of devices with autoassociative memory: they store a set of patterns in such a way that, when a new pattern is presented, the network responds by producing whichever of the stored patterns most closely resembles the new one
- a Hopfield network has feedback from each node to every other node, but not to itself
- each node computes the weighted sum of inputs and outputs and, if it exceeds a fixed threshold, generates 1, otherwise -1
- the network stores N-dimensional vectors comprising the symbols ±1, and these vectors are used as generalizations over possible patterns that are presented to the network
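A minimal sketch of this behaviour (assuming Hebbian storage of the patterns and synchronous threshold updates):

import numpy as np

def hopfield_weights(patterns):
    """Hebbian storage: W = sum over patterns of p p^T, with zero self-connections."""
    P = np.array(patterns, dtype=float)     # each pattern is a vector of +1 / -1
    W = P.T @ P
    np.fill_diagonal(W, 0.0)                # no feedback from a node to itself
    return W

def recall(W, s, n_iter=10):
    """Iteratively update the state until it settles on a stored pattern."""
    for _ in range(n_iter):
        s = np.where(W @ s >= 0, 1, -1)     # threshold at zero: output +1 or -1
    return s

# store two 6-dimensional patterns and recall from a corrupted version of the first
p1 = np.array([1, 1, 1, 1, -1, -1]); p2 = np.array([1, -1, 1, -1, 1, -1])
W = hopfield_weights([p1, p2])
print(recall(W, np.array([-1, 1, 1, 1, -1, -1])))   # converges to p1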


Some references

Solomatine D.P., Torres L.A. (1996). Neural network approximation of a hydrodynamic model in optimizing reservoir operation. Proc. 2nd Intern. Conference on Hydroinformatics, Zurich, September 9-13, pp. 201-206.
Kuo-lin Hsu, H.V. Gupta, S. Sorooshian (1995). Artificial neural network modelling of the rainfall-runoff process. Water Resources Res., vol. 31, No. 10, pp. 2517-2530.
N. Gong, T. Denoeux, J.-L. Bertrand-Krajewski (1996). Neural networks for solid transport modelling in sewer systems during storm events. Water Sci. Tech., vol. 33, No. 9, pp. 85-92.
A.W. Minns, M.J. Hall (1996). Artificial neural networks as rainfall-runoff models. Hydrological Sci. J., vol. 41, No. 3, pp. 399-417.
Y. Shen, D.P. Solomatine, H. van den Boogaard (1998). Improving performance of chlorophyll concentration time series simulation with artificial neural networks. Annual Journal of Hydraulic Engineering, JSCE, vol. 42, February, pp. 751-756.
C.W. Dawson, R. Wilby (1998). An artificial neural network approach to rainfall-runoff modelling. Hydrological Sci. J., vol. 43, No. 1, pp. 47-66.
Dibike Y., Solomatine D.P., Abbott M.B. (1999). On the encapsulation of numerical-hydraulic models in artificial neural network. Journal of Hydraulic Research, No. 2.
Lobbrecht A.H., Solomatine D.P. (1999). Control of water levels in polder areas using neural networks and fuzzy adaptive systems. In: Water Industry Systems: Modelling and Optimization Applications, D. Savic, G. Walters (eds.), Research Studies Press Ltd., pp. 509-518.

End of Part 2
