
Data-driven modelling in water-related problems. PART 2.

Neural Networks
Dimitri P. Solomatine
www.ihe.nl/hi/sol sol@ihe.nl

UNESCO-IHE Institute for Water Education


Department of Hydroinformatics and Knowledge management

Artificial neural networks


Artificial neural networks (ANN): main types

Artificial Neural Networks

- Feed-forward networks: linear, non-linear
- Feedback networks: Hopfield model, Boltzmann machine
- Self-organising networks: feature maps, ART

Learning paradigms: supervised and unsupervised.


Linear regression as a simple ANN


(Figure: the regression line y = a0 + a1 x fitted to the observed data points (x(t), y(t)); for a new input value x(v) the model predicts a new output value y(v).)

In the one-dimensional case (one input x), given T data vectors {x(t), y(t)}, t = 1,...,T, the coefficients of the equation

y = f(x) = a1 x + a0

can be found. Then, for V new vectors {x(v)}, v = 1,...,V, this equation can approximately reproduce the corresponding function values {y(v)}, v = 1,...,V.

How to measure the error?

The least-squares error is used since it gives the best estimation of the parameters, provided the errors of the individual measurements are independent and normally distributed. An optimization problem has to be solved: find such a0 and a1 that E is minimal:

$$ E = \sum_{t=1}^{T} \left( y^{(t)} - (a_0 + a_1 x^{(t)}) \right)^2 $$

In a similar fashion the problem can be posed for multiple regression (with many inputs), e.g.

y = a0 + a1 x1 + a2 x2

(Figure: a single linear node with inputs x1, x2, weights a1, a2 and bias a0.)
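As an illustration of the least-squares estimation above, a minimal NumPy sketch (synthetic data; the variable names a0, a1 follow the slide notation):

import numpy as np

# synthetic data: y = 2 + 3x plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(50)

# design matrix [1, x]; least squares minimizes E = sum_t (y(t) - (a0 + a1*x(t)))^2
A = np.column_stack([np.ones_like(x), x])
(a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a0, a1)   # estimated intercept and slope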


Function approximation: linear regression and ANN

Linear regression: Y = a1 X + a0

Neural network approximation: Y = f(X, a1, ..., an)


ANN: multi-layer perceptron (MLP)


(Figure: inputs x1 ... xNinp feed, through weights aij, a hidden layer y1 ... yNhid, which feeds, through weights bjk, the outputs z1 ... zNout; the ANN f(X) approximates the modelled (real) system Z = F(X), and the error F(X) - f(X) is minimized.)

Output of hidden node j:
$$ y_j = F\Big(a_{0j} + \sum_{i=1}^{N_{inp}} a_{ij} x_i\Big), \quad j = 1, \ldots, N_{hid} $$

Output of output node k:
$$ z_k = F\Big(b_{0k} + \sum_{j=1}^{N_{hid}} b_{jk} y_j\Big), \quad k = 1, \ldots, N_{out} $$

Transfer function (binary sigmoid): F(u) = 1 / (1 + e^(-u))

There are (Ninp + 1) Nhid + (Nhid + 1) Nout weights (aij and bjk) to be identified by minimizing the mean squared error (F(X) - f(X))^2. The method used is a gradient-based steepest descent method, called error backpropagation.
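A small sketch of the forward pass of such an MLP (illustrative Python, not the original course code; A, a0, B, b0 hold the weights aij, a0j, bjk, b0k):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mlp_forward(x, A, a0, B, b0):
    """x: inputs (Ninp,); A: (Nhid, Ninp) weights a_ij; a0: biases a_0j;
    B: (Nout, Nhid) weights b_jk; b0: biases b_0k."""
    y = sigmoid(a0 + A @ x)   # hidden-layer outputs y_j
    z = sigmoid(b0 + B @ y)   # network outputs z_k
    return y, z

# example: Ninp=3, Nhid=4, Nout=2 -> (3+1)*4 + (4+1)*2 = 26 weights in total
rng = np.random.default_rng(1)
A, a0 = rng.normal(size=(4, 3)), rng.normal(size=4)
B, b0 = rng.normal(size=(2, 4)), rng.normal(size=2)
y, z = mlp_forward(np.array([0.1, -0.5, 0.3]), A, a0, B, b0)
print(z)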

ANN: identification of weights by training (calibration)

ANN error in reproducing the observed output (OBSi) is:


$$ E = \sum_{i=1}^{N_{examp}} (OBS_i - ANN_i)^2 $$

Training of an ANN consists in solving a (multi-extremum) optimization problem: find such values of the weights that bring E to a minimum. A problem of the backpropagation algorithm is that it assumes single-extremality (a single minimum).

ANN as a universal model (function approximator)


- ANN is a model with multiple (hundreds of) parameters
- ANN is capable of reproducing complex non-linear relationships
- ANN calibration (training) requires large amounts of data
- the resulting model is fast
- ANN can be used where physically based models fail, or it may complement physically based models


Biological motivation
- signals are transmitted between neurons by electrical pulses
- the neuron sums up the effects of thousands of impulses
- if the integrated potential exceeds a threshold, the cell fires: it generates an impulse that travels further along the axon

(Figure: biological neurons, showing dendrites and cell bodies.)

Hidden node
(Figure: a hidden node with two inputs computes u = a0 + a1 x1 + a2 x2 and outputs y = g(u).)

Inputs to the network are xi, i = 1, ..., Ninp.

Output of the j-th node of the hidden layer is
$$ y_j = g\Big(a_{0j} + \sum_{i=1}^{N_{inp}} a_{ij} x_i\Big), \quad j = 1, \ldots, N_{hid} $$

Output node
(Figure: an output node with two inputs computes v = b0 + b1 y1 + b2 y2 and outputs z = g(v).)

Its inputs are the outputs of the hidden nodes y1 ... yNhid; the outputs are
$$ z_k = g\Big(b_{0k} + \sum_{j=1}^{N_{hid}} b_{jk} y_j\Big), \quad k = 1, \ldots, N_{out} $$

Transfer function g

The transfer function is usually non-linear, bounded and differentiable. Widely used is the logistic function:
$$ g(u) = \frac{1}{1 + e^{-u}} $$
(Figure: plot of the logistic function; the output value lies between 0 and 1, and the slope at the origin equals 1/4.)


Derivative of g

$$ \frac{\partial g(u)}{\partial u} = \frac{\partial (1 + e^{-u})^{-1}}{\partial u} = (-1)(1 + e^{-u})^{-2} (-e^{-u}) = e^{-u} (1 + e^{-u})^{-2} = e^{-u} g^2(u) $$

Note that
$$ e^{-u} = \frac{1 - g(u)}{g(u)} $$

Then
$$ \frac{\partial g(u)}{\partial u} = \frac{1 - g(u)}{g(u)} \, g^2(u) = (1 - g(u)) \, g(u) $$
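A quick numerical check of this identity, comparing the analytic derivative (1 - g(u)) g(u) with a central finite difference:

import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (g(u + h) - g(u - h)) / (2 * h)    # central difference
analytic = (1.0 - g(u)) * g(u)               # derived form
print(np.max(np.abs(numeric - analytic)))    # difference at round-off level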


ANN complexity and its approximating ability


The combination of the transfer functions of the hidden nodes produces a complex function; with many hidden nodes, any function can be approximated.


ANN complexity and its approximating ability: example of approximating a harmonic function


- one input x and two outputs y1 and y2
- the outputs are given by sin(x) and cos(x)
- the data are generated by running x from 0 to 6.28 with the step 0.02
Training set: 315 instances
x        y1 (sin x)   y2 (cos x)
0.0000    0.00000      1.00000
0.0200    0.02000      0.99980
0.0400    0.03999      0.99920
...
1.5400    0.99953      0.03079
1.5600    0.99994      0.01080
1.5800    0.99996     -0.00920
1.6000    0.99957     -0.02920
...
6.2400   -0.04318      0.99907
6.2600   -0.02319      0.99973
6.2800   -0.00319      0.99999

Test set: 12 instances


x        y1 (sin x)   y2 (cos x)
0.0000    0.00000      1.00000
0.3600    0.35227      0.93590
0.9800    0.83050      0.55702
1.8200    0.96911     -0.24663
3.6400   -0.47802     -0.87835
4.4600   -0.96832     -0.24972
5.6200   -0.61563      0.78803
5.8400   -0.42882      0.90339
6.1200   -0.16247      0.98671
6.2400   -0.04318      0.99907
6.2600   -0.02319      0.99973
6.2800   -0.00319      0.99999
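Both sets can be regenerated with a few lines of Python (a sketch; the step 0.02 over [0, 6.28] gives the 315 training instances):

import numpy as np

x_train = np.arange(0.0, 6.28 + 1e-9, 0.02)              # 315 values: 0.00, 0.02, ..., 6.28
train = np.column_stack([x_train, np.sin(x_train), np.cos(x_train)])

x_test = np.array([0.00, 0.36, 0.98, 1.82, 3.64, 4.46,
                   5.62, 5.84, 6.12, 6.24, 6.26, 6.28])  # 12 test instances
test = np.column_stack([x_test, np.sin(x_test), np.cos(x_test)])
print(train.shape, test.shape)   # (315, 3) (12, 3)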


Performance of an ANN as its complexity increases

(Figure: approximation of the harmonic function with a) 1 hidden node, b) 2 hidden nodes, c) 3 hidden nodes, d) 4 hidden nodes.)

ANN training as an optimization problem

If Nout functions, each of Ninp independent (input) variables, are given, together with T instances (vectors)
$$ \{\, x_1^{(t)}, x_2^{(t)}, \ldots, x_{N_{inp}}^{(t)},\; f_1^{(t)}, f_2^{(t)}, \ldots, f_{N_{out}}^{(t)} \,\}, \quad t = 1, \ldots, T $$
then, on the basis of these instances, the ANN can be trained so that the error is minimal. If it is then presented with V other instances (vectors)
$$ \{\, x_1^{(v)}, x_2^{(v)}, \ldots, x_{N_{inp}}^{(v)} \,\}, \quad v = 1, \ldots, V $$
it will approximately reproduce the corresponding function values
$$ \{\, f_1^{(v)}, f_2^{(v)}, \ldots, f_{N_{out}}^{(v)} \,\}, \quad v = 1, \ldots, V $$


Detailed description of ANN error to be minimized

For output k the error for the input pattern t is:
$$ E_k^{(t)} = (f_k^{(t)} - z_k^{(t)})^2 $$
The total error over all outputs for input pattern t is:
$$ E^{(t)} = \frac{1}{2} \sum_k (f_k^{(t)} - z_k^{(t)})^2 $$
The total error is the summation of the errors for all output nodes and all T instances:
$$ E_{tot} = \frac{1}{2} \sum_t \sum_k (f_k^{(t)} - z_k^{(t)})^2 \;\rightarrow\; \min $$
$$ = \frac{1}{2} \sum_t \sum_k \Big[ f_k^{(t)} - g_{out}\Big(b_{0k} + \sum_j b_{jk} y_j\Big) \Big]^2 $$
$$ = \frac{1}{2} \sum_t \sum_k \Big[ f_k^{(t)} - g_{out}\Big(b_{0k} + \sum_j b_{jk}\, g_{hid}\Big(a_{0j} + \sum_i a_{ij} x_i^{(t)}\Big)\Big) \Big]^2 $$

Error function w.r.t. weights (error surfaces) (1)


Error function w.r.t. weights (error surfaces) (2)

The error function is a complex multi-extremum function.


Error function w.r.t. weights (error surfaces) (3)


Decreasing error during training


Using steepest descent


Choosing the right step


(Figure: gradient descent with a step of the right size, and with a step that is too large.)


Backpropagation algorithm (instance based) (1/2)

1. Randomize the weights {ws} (denoted above as the matrices a and b) to small random values (both positive and negative), to ensure that the network is not saturated by large values of the weights.
2. Select an instance t, that is, the vector {xi(t)}, i = 1,...,Ninp (a pair of input and output patterns), from the training set.
3. Apply the input vector to the network input.
4. Calculate the network output vector {zk(t)}, k = 1,...,Nout.
5. Calculate the errors for each of the outputs k, k = 1,...,Nout, as the difference between the desired output and the network output:
$$ E = E_k^{(t)} = (f_k^{(t)} - z_k^{(t)})^2 $$

...

Backpropagation algorithm (instance based) (2/2)

...
6. Calculate the necessary updates Δws for the weights ws in a way that minimizes this error (discussed below).
7. Adjust the weights of the network by Δws.
8. Repeat steps 2-6 for each instance (pair of input-output vectors) in the training set until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.


Backpropagation algorithm (with cumulative updates)

1-6. As above.
7. Add the calculated weight updates {Δws} to the accumulated total updates {ΔWs}.
8. Repeat steps 2-7 for several instances comprising an epoch (this could be the whole training set).
9. Adjust the weights {ws} of the network by the updates {ΔWs}.
10. Repeat steps 2-9 until all instances in the training set are processed. This constitutes one iteration.
11. Repeat the iteration of steps 2-10 until the error for the entire system (the error E defined above, or the error on a cross-validation set) is acceptably low, or the pre-defined number of iterations is reached.
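A schematic sketch of this cumulative variant (grad_E stands for a hypothetical helper that returns the gradient of E with respect to the weights for one instance, computed by backpropagation as derived on the following slides):

import numpy as np

def train_epochs(w, X, F, grad_E, eta=0.1, n_epochs=100):
    """Batch (cumulative) updates: accumulate over an epoch, then adjust the weights."""
    for epoch in range(n_epochs):
        total_update = np.zeros_like(w)
        for x, f in zip(X, F):                       # loop over the instances of the epoch
            total_update += -eta * grad_E(w, x, f)   # step 7: accumulate the updates
        w = w + total_update                         # step 9: adjust the weights
    return w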

How to update weights

Optimization is done by the steepest descent algorithm: steps are made in the space of variables (the weights w) in the direction opposite to the gradient of the function E:
$$ \mathbf{w}(N+1) = \mathbf{w}(N) - \eta \, \nabla E(\mathbf{w}(N)) $$
In terms of individual weights the changes will be:
$$ w_s(N+1) = w_s(N) + \Delta w_s $$
and the update step for weight s is:
$$ \Delta w_s = -\eta \, \frac{\partial E}{\partial w_s} $$
(this is the delta rule of Widrow and Hoff (1960) for a single linear perceptron)

How to update weights

A typical chain of nodes and weights is considered:

input node i (value xi) → weight aij → hidden node j (uj, yj = gjhid(uj)) → weight bjk → output node k (vk, zk = gkout(vk)) → compare with the measured (target) value fk

The error to minimize during training is E = (fk - zk)^2.


Weights for the output layer

The update for an output weight is:
$$ \Delta b_{jk} = -\eta \, \frac{\partial E}{\partial b_{jk}} $$
The derivatives can be found using the chain rule:
$$ \frac{\partial E}{\partial b_{jk}} = \frac{\partial E}{\partial z_k} \frac{\partial z_k}{\partial v_k} \frac{\partial v_k}{\partial b_{jk}} $$
This gives:
$$ \Delta b_{jk} = \eta \, \delta_k \, y_j $$
where
$$ \delta_k = 2 (f_k - z_k) \, \frac{\partial g_k(v)}{\partial v} $$


Weights for the hidden layer (1)

Hidden nodes do not have explicit error values. Such errors are propagated from each node of the output layer back to each node of the hidden layer:
$$ E^{(t)} = \frac{1}{2} \sum_k (f_k^{(t)} - z_k^{(t)})^2 $$
$$ \Delta a_{ij} = -\eta \, \frac{\partial E^{(t)}}{\partial a_{ij}} = -\eta \sum_k \frac{\partial E_k^{(t)}}{\partial a_{ij}} $$
$$ \frac{\partial E^{(t)}}{\partial a_{ij}} = \sum_k \frac{\partial E_k^{(t)}}{\partial y_j} \frac{\partial y_j}{\partial u_j} \frac{\partial u_j}{\partial a_{ij}} = \sum_k \frac{\partial E_k^{(t)}}{\partial z_k} \frac{\partial z_k}{\partial v_k} \frac{\partial v_k}{\partial y_j} \frac{\partial y_j}{\partial u_j} \frac{\partial u_j}{\partial a_{ij}} $$


Weights for the hidden layer (2)

Finally, the update for the weights a is:
$$ \Delta a_{ij} = \eta \, x_i \, \frac{\partial g_j(u)}{\partial u} \sum_k \delta_k \, b_{jk} $$
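Putting the output-layer and hidden-layer rules together, a compact instance-based update sketch (A, a0, B, b0 are the weight matrices and bias vectors as in the forward-pass sketch earlier, eta stands for the learning rate η, and the factor 2 from the squared error is kept as in δk above):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(x, f, A, a0, B, b0, eta=0.1):
    """One steepest-descent update of all weights for a single pattern (x, f); arrays must be float."""
    # forward pass
    u = a0 + A @ x; y = sigmoid(u)              # hidden layer: y_j = g(u_j)
    v = b0 + B @ y; z = sigmoid(v)              # output layer: z_k = g(v_k)
    # output-layer deltas: delta_k = 2 (f_k - z_k) g'(v_k), with g'(v) = z (1 - z)
    delta = 2.0 * (f - z) * z * (1.0 - z)
    # hidden-layer term: g'(u_j) * sum_k delta_k b_jk (computed before B is changed)
    h = (B.T @ delta) * y * (1.0 - y)
    # apply the updates: Delta b_jk = eta delta_k y_j,  Delta a_ij = eta h_j x_i
    B += eta * np.outer(delta, y); b0 += eta * delta
    A += eta * np.outer(h, x);     a0 += eta * h
    return A, a0, B, b0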


Improving the learning rule

Momentum:
$$ \Delta w_s(N+1) = -\eta \, \frac{\partial E}{\partial w_s} + \alpha \, \Delta w_s(N) $$

Adaptive learning rate ηs(N):
$$ \Delta w_s(N+1) = -(1 - \alpha) \, \eta_s(N) \, \frac{\partial E}{\partial w_s} + \alpha \, \Delta w_s(N) $$

where ηs(N) is the learning rate, which is updated according to the following rule:
$$ \eta_s(N) = \eta_s(N-1) + \kappa, \quad \text{if } RAED(N-1) \, \frac{\partial E}{\partial w_s} > 0 $$
$$ \eta_s(N) = \eta_s(N-1) \, \varphi, \quad \text{otherwise} $$
Here RAED is the recent "average" of that derivative, calculated recursively:
$$ RAED(N) = (1 - \theta) \, \frac{\partial E}{\partial w_s} + \theta \, RAED(N-1) $$
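A sketch of the momentum rule (alpha denotes the momentum coefficient and dE_dw the gradient obtained from backpropagation; the per-weight adaptive rate is handled analogously by keeping an array of learning rates):

import numpy as np

def momentum_update(dE_dw, delta_w_prev, eta=0.1, alpha=0.5):
    """Delta w(N+1) = -eta * dE/dw + alpha * Delta w(N), applied element-wise."""
    return -eta * dE_dw + alpha * delta_w_prev

# usage inside a training loop (the gradient dE_dw comes from backpropagation):
delta_w = np.zeros(5)                        # previous updates, remembered between steps
dE_dw = np.array([0.2, -0.1, 0.05, 0.0, 0.3])
delta_w = momentum_update(dE_dw, delta_w)    # new updates; then weights w += delta_w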

Choice of the learning constants

The recommended values of the learning constants


- 0.1 (0.05 is often better)
- 0.5
- 0.7 (or 0.5)
- 0.9 (or 0.5)


Practical issues of training

Preparing data: scale the data to prevent network paralysis. The transfer function saturates quickly, e.g.

g(1.0) = 0.762, g(2.0) = 0.964, g(3.0) = 0.995, g(4.0) = 0.999

so scale the input data to [-3, +3] (a small sketch follows below).

Other practical choices:
- the number of hidden nodes Nhid (in relation to Ninp and Nout)
- the choice of the activation functions
- removing part of the connections ("optimal brain damage")
- dealing with local optima (re-randomize the weights)
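A minimal sketch of such a scaling step (one common option is to standardise each input and clip it to [-3, +3]):

import numpy as np

def scale_inputs(X):
    """Standardise each column to zero mean / unit std and clip to [-3, +3]."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-12          # avoid division by zero for constant columns
    return np.clip((X - mean) / std, -3.0, 3.0)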


Radial basis function networks


Function approximation by combining functions

- linear regression
- splines: cubic functions that pass through the points, with their 1st and 2nd derivatives equal at the boundaries
- orthogonal functions (e.g. Chebyshev polynomials)
- combining simple kernel functions


Radial basis functions

- use simple functions F(x) that approximate the given function in the proximity of some representative locations (centers)
- these F(x) depend only on the distance from the centers and drop to zero as the distance from the centers increases

(Figure: J centers placed in the input space.)


Radial basis functions

Consider a function z = f(x), where x is a vector {x1 ... xI} in I-dimensional space. Centers wj, j = 1,...,J are selected, and f(x) is approximated by
$$ z(x) = \sum_{j=1}^{J} F(\,|x - w_j|\,;\, b_j) $$
where |x - wj| is a distance (e.g. Euclidean) and bj are coefficients associated with the j-th center wj.


Radial basis functions

We can choose a linear combination of basis functions:
$$ z(x) = \sum_{j=1}^{J} b_j \, F(\,|x - w_j|\,) $$
It is common to choose the Gaussian function for F:
$$ F(r) = \exp(-r^2 / 2\sigma^2) $$
(σ is analogous to the standard deviation of a Gaussian normal distribution). The distance |x - wj| is usually understood in the Euclidean sense and denoted here as ρj:
$$ \rho_j = \sqrt{\sum_{i=1}^{I} (x_i - w_{ij})^2} $$
so the approximation becomes:
$$ z(x) = \sum_{j=1}^{J} b_j \exp(-\rho_j^2 / 2\sigma_j^2) $$
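A sketch of evaluating such an approximation (the centers, widths σj and amplitudes bj are assumed to be given):

import numpy as np

def rbf_predict(X, centers, sigma, b):
    """z(x) = sum_j b_j * exp(-rho_j^2 / (2 sigma_j^2)) for each row x of X."""
    # rho[n, j] = Euclidean distance between instance n and center j
    rho = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    G = np.exp(-rho**2 / (2.0 * sigma**2))   # Gaussian basis functions
    return G @ b

# example: 2-D input, 3 centers
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
sigma = np.array([0.5, 0.5, 0.8])
b = np.array([1.0, -2.0, 0.5])
print(rbf_predict(np.array([[0.2, 0.1]]), centers, sigma, b))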

Radial basis functions


$$ z(x) = \sum_{j=1}^{J} b_j \exp(-|x - w_j|^2 / 2\sigma_j^2) $$

The problem of approximation requires:
- the placement of the localized Gaussians to cover the space (positions of the centers wj);
- the control of the width of each Gaussian (parameters σj);
- the setting of the amplitude of each Gaussian (parameters bj).


Radial basis function problem viewed as a neural network

(Figure: RBF network structure: inputs xi are connected through the centers wij to hidden nodes yj with Gaussian functions, and through weights bjk to output nodes zk with linear functions.)


Training the RBF network (1)

1. Find the positions of centers {wj}:

- Choose randomly J instances xj and use them as the positions of the centers {wj}.
- Assign all other instances to the class j of the closest center wj, and calculate the location of each center again (using e.g. the k-nearest-neighbour method).
- Repeat the above steps until the locations of the centers stop changing.

2. Calculate the output z(x) from each hidden neuron ...


Training the RBF network (2)

...
3. The weights {bjk} for the output layer are calculated by solving a multiple linear regression problem, formulated as a system of linear equations. The output from output node k can be expressed as
$$ z_k = \frac{\sum_{j=1}^{J} b_{jk} \, y_j}{\sum_{j=1}^{J} y_j} $$
where bjk is the weight on the connection from hidden node j to output node k, and yj is the output from hidden node j.
4. If the total error is more than the desired limit, change the number of hidden units and repeat all the steps.
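A sketch of this two-step procedure (illustrative: the centers are simply taken as randomly selected instances, and the output weights are found by linear least squares; the normalisation by the sum of yj follows the formula above):

import numpy as np

def train_rbf(X, Z, J, sigma):
    """Step 1: choose J centers; steps 2-3: hidden outputs + least squares for the weights b."""
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=J, replace=False)]    # step 1 (random instances as centers)
    rho = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    Y = np.exp(-rho**2 / (2.0 * sigma**2))                    # outputs of the hidden (Gaussian) nodes
    Yn = Y / Y.sum(axis=1, keepdims=True)                     # normalisation: z_k = sum_j b_jk y_j / sum_j y_j
    B, *_ = np.linalg.lstsq(Yn, Z, rcond=None)                # step 3: output weights b_jk by linear regression
    return centers, B

# usage: X is (N, Ninp), Z is (N, Nout); predictions for new data follow the same two steps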

Example: using RBF network to reproduce SIN and COS function


Input file: 1 input (X), 2 outputs (SIN, COS), 315 examples

315 1 2
0.0000  0.00000  1.00000
0.0200  0.02000  0.99980
0.0400  0.03999  0.99920
0.0600  0.05996  0.99820
0.0800  0.07991  0.99680
0.1000  0.09983  0.99500
0.1200  0.11971  0.99281
...
6.2400 -0.04318  0.99907
6.2600 -0.02319  0.99973
6.2800 -0.00319  0.99999

18 centers found

Example: using RBF network to reproduce behaviour of a 1-D modelling system (SOBEK)
Input file: 7 inputs (prev. rainfalls, flows), 1 output (flow), 1303 examples

21 centers found


Radial basis functions: comments


- RBF networks provide a global approximation to the target function, represented by a linear combination of many local kernel functions
- this can be viewed as a smooth linear combination of piecewise (local) non-linear functions: the best function is chosen for a particular range of the input data
- training is faster than for backpropagation networks, since it is done in two steps
- it is an eager method, but it uses the idea of local approximation, as in lazy methods such as k-NN


Other types of connectionist models (neural networks)


Recurrent neural networks

- developed to deal with time-varying or time-lagged patterns
- usable for problems where the dynamics of the considered process is complex and the measured data are noisy
- examples: Hopfield networks, regressive networks, Jordan-Elman networks, and Brain-State-in-a-Box (BSB) networks

(Figure: a recurrent network with inputs, outputs and context units.)


Hopfield network

- belongs to a class of devices with autoassociative memory: they store a set of patterns in such a way that, when a new pattern is presented, the network responds by producing whichever of the stored patterns most closely resembles the new one
- a Hopfield network has feedback from each node to every other node, but not to itself
- each node computes the weighted sum of inputs and outputs and, if it exceeds a fixed threshold, generates 1, otherwise -1
- the network stores N-dimensional vectors comprising the symbols ±1, and these vectors are used as generalizations over possible patterns that are presented to the network
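A minimal sketch of this behaviour (assuming Hebbian storage of the patterns and synchronous threshold updates):

import numpy as np

def hopfield_weights(patterns):
    """Hebbian storage: W = sum over patterns of p p^T, with zero self-connections."""
    P = np.array(patterns, dtype=float)     # each pattern is a vector of +1 / -1
    W = P.T @ P
    np.fill_diagonal(W, 0.0)                # no feedback from a node to itself
    return W

def recall(W, s, n_iter=10):
    """Iteratively update the state until it settles on a stored pattern."""
    for _ in range(n_iter):
        s = np.where(W @ s >= 0, 1, -1)     # threshold at zero: output +1 or -1
    return s

# store two 6-dimensional patterns and recall from a corrupted version of the first
p1 = np.array([1, 1, 1, 1, -1, -1]); p2 = np.array([1, -1, 1, -1, 1, -1])
W = hopfield_weights([p1, p2])
print(recall(W, np.array([-1, 1, 1, 1, -1, -1])))   # converges to p1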


Some references

Solomatine D.P., Torres L.A. (1996). Neural network approximation of a hydrodynamic model in optimizing reservoir operation. Proc. 2nd Intern. Conference on Hydroinformatics, Zurich, September 9-13, pp. 201-206.
Kuo-lin Hsu, H.V. Gupta, S. Sorooshian (1995). Artificial neural network modelling of the rainfall-runoff process. Water Resources Res., vol. 31, No. 10, pp. 2517-2530.
N. Gong, T. Denoeux, J.-L. Bertrand-Krajewski (1996). Neural networks for solid transport modelling in sewer systems during storm events. Water Sci. Tech., vol. 33, No. 9, pp. 85-92.
A.W. Minns, M.J. Hall (1996). Artificial neural networks as rainfall-runoff models. Hydrological Sci. J., vol. 41, No. 3, pp. 399-417.
Y. Shen, D.P. Solomatine, H. van den Boogaard (1998). Improving performance of chlorophyll concentration time series simulation with artificial neural networks. Annual Journal of Hydraulic Engineering, JSCE, vol. 42, February, pp. 751-756.
C.W. Dawson, R. Wilby (1998). An artificial neural network approach to rainfall-runoff modelling. Hydrological Sci. J., vol. 43, No. 1, pp. 47-66.
Dibike Y., Solomatine D.P., Abbott M.B. (1999). On the encapsulation of numerical-hydraulic models in artificial neural network. Journal of Hydraulic Research, No. 2.
Lobbrecht A.H., Solomatine D.P. (1999). Control of water levels in polder areas using neural networks and fuzzy adaptive systems. In: Water Industry Systems: Modelling and Optimization Applications, D. Savic, G. Walters (eds.), Research Studies Press Ltd., pp. 509-518.

End of Part 2
