
Lavinia Ferariu

EDITURA

CONSPRESS

2013

Copyright 2013, Editura Conspress and the author


EDITURA CONSPRESS
is recognized by the
National Council for Scientific Research in Higher Education

Work elaborated within the project: "National network of centers for the
development of study programs with flexible routes and of teaching
instruments for the bachelor and master specializations in the field of
Systems Engineering"

CIP Description of the National Library of Romania (Descrierea CIP a Bibliotecii Naționale a României)

FERARIU, LAVINIA
Lecture Notes for Hybrid Intelligent Systems / Lavinia Ferariu
București: Conspress, 2013
Bibliogr.
ISBN 978-973-100-286-6
004
University textbook

CONSPRESS
B-dul Lacul Tei nr. 124, sector 2,
cod 020396, București
Tel.: (021) 242 2719 / 300; Fax: (021) 242 0781

CONTENTS

Chapter 1. Introduction

Chapter 2. Genetic Algorithms
    Introduction
    Genetic algorithms - overview
    Evolutionary algorithms - an artificial intelligence technique
    Main research directions
    Genetic encoding
    Population initialization
    Genetic operators. Crossover and mutation
    Selection for recombination
    Insertion (selection for survival)
    GA convergence
    Parallel GA
    Benchmarks for GA evaluation
Chapter 3. Artificial Neural Networks
    Artificial neuron
    ANN architectures
    Multi-layer Perceptron (MLP)
    ANN with Radial Basis Functions (RBF)
Chapter 4. Neuro-genetic systems
    Supportive neuro-genetic systems
    Collaborative neuro-genetic systems


CHAPTER 1. INTRODUCTION
Intelligence =
the capacity to improve one's own behavior based on acquired experience (gained by
repeating the same action or similar actions)
[Bäck, 2000].

Intelligence = learning capacity + adaptability.


learning = creation and modification of knowledge representations
+
adaptation = improvement of system performances, as a response to environmental
changes

Human intelligence vs. Machine learning

Model of human cognitive development (Piaget)


Schema = inner model of an activity.
- built via repeated actions;
- improved (assimilation) or extended (accommodation)
as a response to events which trigger unstable states.

Human psychological development


- continuous assimilation and accommodation of the cognitive schemes.

Stages:
sensorimotor - focused on movement control and sensors
o information acquired by the sensors is organized and processed;
o the first cognitive schemas are constructed; they refer mainly to one's own
body/behavior and to neighboring objects;
preoperational
o symbolic thinking;
o capacity of generalization;
concrete operations
o deductive reasoning;
o higher interest in the surrounding environment;
formal operations
o use of abstract concepts,
o specification and verification of working assumptions, etc.

Learning strategies classification


- sorted in ascending order of inference complexity:
learning by heart
o memorizing, without inference.
learning with an instructor:
o the instructor provides the information;
o information is selected, rephrased and integrated with the available
knowledge.
deductive learning:
o new conclusions are deduced from the available knowledge.
learning by analogy:
o available useful knowledge is transformed to tackle a new (similar)
situation.

inductive learning:
 by examples (acquisition of concepts):
look for universal rules describing all positive and negative examples.
 by observations and discovery (unsupervised):
look for universal rules describing the observations
observations are obtained without a supervisor.

Short history:
Beginning: the 1950s.
Main research directions:
automatic proof of theorems, planning and prediction,
automatic programming, human language understanding
=> these set the requirements for building Machine Learning.

Artificial intelligence was successful only for well-delimited problems.

Knowledge representation
by symbols (classic):
a formal set of primitives and rules is employed for symbol handling:
o predicates,
o frames, semantic networks,
o fuzzy systems;

by numbers (sub-symbolic):
o Artificial neural networks,
o Evolutionary algorithms.

CHAPTER 2. GENETIC ALGORITHMS


2.1. Introduction
Short history. Main research areas

First ideas regarding the evolution of species - Charles Darwin (1859).


Darwinian theory: species go through a continuous development process.
Variations can occur during the evolution of any species, and these variations are
transmitted to the offspring.
The best adapted individuals and species have greater chances of survival and development.
Evolution represents a natural selection of inherited variations.

More recently, Neo-Darwinism has explained the mechanisms of inheritance based on the Darwinian theory.

Modern Genetics studies the way in which information is encoded by living organisms.

Evolutionary computation translates the natural selection and evolution theories
into numerical algorithms.

 The natural model is adopted in a simplified version


Evolutionary algorithms work on a population of structures which is
evolved for several generations.

The best adapted structures survive to the next generation and contribute to
the production of new, better adapted offspring.

First trials: the 1950s - Bremermann, Friedberg, Box.

The problem was revisited in 1960-1970: Holland, Rechenberg, Schwefel, Fogel.
The reputation of the approach increased significantly in 1980-1985.
Starting with 1985, several specialized conferences have been organized.
Since 1990, the involved research effort has increased exponentially.

Most common evolutionary algorithms:


Genetic algorithms
General adaptive process applicable to any optimization problem.
A structure contained by the population encodes a point in the space of decision variables.
Holland, De Jong, Goldberg, Davis, Eshelman, Forrest, Grefenstette, Koza, Mitchell, Riolo,
Schaffer.

Evolutionary programming
Goal: the design of finite state automata able to predict the changes occurring in the working
environment.
The environment is described by a string of symbols (according to a finite encoding alphabet).
The algorithm searches for the output symbol providing the fittest prediction.
Fogel, Burgin, Atmar.

Genetic programming
The algorithm searches the fittest program able to solve a certain problem.
Koza.

Evolutionary strategies
Meant to solve optimization problems with continuous parameters.
A structure encodes the values of the decision variables corresponding to a point of the search
space.
Unlike GA: other mechanisms are employed for enriching the genetic material throughout the
generations.
Rechenberg, Schwefel, Herdy, Kursawe, Ostermeier, Rudolph.
Classifier systems
Devoted to the design of classifiers by means of evolutionary techniques.
Holland, Reitman, Booker, De Jong.

Most of the research is targeted at applications:

- wide areas of application;
- good results.

The theoretical background is insufficiently developed.



2.2. Genetic algorithms - overview


Genetic algorithms = a search/optimization method.

o It uses strategies borrowed from Genetics and the Theory of Evolution (natural selection).

o It can approach complex optimizations:


o nonlinear optimizations,
o constrained optimizations,
o multiobjective optimizations.

Problem statement
Let us consider $f : S \subseteq \mathbb{R}^n \to \mathbb{R}$.
The elements $x \in S$ are called decision variables.

Find:
$\arg\min_{x \in S} f(x)$ or $\arg\max_{x \in S} f(x)$.

Terminology: the objective; the objective function $f$; the objective value $f(x)$.

General description of GA

At every iteration (generation), a set (population) of potential solutions (individuals,
chromosomes) $x \in S$ is considered.

The individuals are evaluated in terms of the objective and the best ones are encouraged to
survive and reproduce.

New potential solutions (offspring) are obtained by combining the genetic material of the
parents, similarly to the recombination of DNA chains in biological systems. This process
guides the exploration by using the most valuable genetic material of the current population.

Small variations of the offspring ensure an adequate preservation of population diversity, with
positive impact on avoiding the stagnation in local optima.

The offspring fight for survival with the old solutions. The best adapted solutions will have
greater chances to win this contest.

The process is repeated for an adequate number of generations. If no additional special


mechanisms are employed, the population converges toward a set including duplicates of the
best adapted individual found during exploration.

initialization:
t = 0;
generate N random points uniformly distributed within the search space, to form the initial
population P(t);

repeat while t < No_Generations
step 1: evaluate P(t);
step 2: selection - form the recombination pool with individuals selected from P(t);
step 3: recombination (crossover) - produce the offspring using the parents selected at step 2;
step 4: mutation - apply small variations on the offspring produced at step 3;
step 5: evaluate the offspring obtained at step 4;
step 6: insertion - create P(t+1), by selecting N individuals from the offspring obtained after
step 5 and the samples contained in P(t);
step 7: t = t + 1;
end of the loop

display the best individual of the population;
end of the algorithm
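For concreteness, the loop above can be sketched in Python as follows. This is a minimal illustrative sketch, not the book's reference implementation: it assumes binary encoding, maximization of a non-negative objective, roulette-wheel selection, one-point crossover, bit-flip mutation, and an elitist (N+λ)-style insertion; all names and parameter defaults are hypothetical.

```python
import random

def run_ga(evaluate, l=20, N=50, no_generations=100, pc=0.8, pm=0.02):
    """Minimal GA following the pseudocode above (binary chromosomes of length l)."""
    # initialization: N random points uniformly distributed in S* = {0,1}^l
    pop = [[random.randint(0, 1) for _ in range(l)] for _ in range(N)]
    for t in range(no_generations):
        fit = [evaluate(x) for x in pop]                 # step 1: evaluate P(t)
        total = sum(fit)
        # step 2: selection for recombination (roulette wheel)
        pool = [pop[roulette(fit, total)] for _ in range(N)]
        offspring = []
        for a, b in zip(pool[::2], pool[1::2]):
            c1, c2 = a[:], b[:]
            if random.random() < pc:                     # step 3: one-point crossover
                cut = random.randrange(1, l)
                c1, c2 = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for c in (c1, c2):                           # step 4: bit-flip mutation
                for i in range(l):
                    if random.random() < pm:
                        c[i] = 1 - c[i]
            offspring += [c1, c2]
        # steps 5-6: evaluate the offspring, keep the best N of parents + offspring
        pop = sorted(pop + offspring, key=evaluate, reverse=True)[:N]
    return max(pop, key=evaluate)

def roulette(fit, total):
    """Return the index of the sector hit by the roulette needle."""
    r, acc = random.uniform(0, total), 0.0
    for i, f in enumerate(fit):
        acc += f
        if acc >= r:
            return i
    return len(fit) - 1

# usage: maximize the number of 1s (the "one-max" toy objective)
best = run_ga(evaluate=sum)
```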

Encoding
Most common: binary encoding.

An individual of the population = a chain of characters
(from the employed alphabet);
for binary encoding, the alphabet is {0, 1}.

The encoding maps the exploration space S to S*. The genetic operators act in S*.
The chain (string) used for encoding an individual is called a chromosome.
A position (character) in this chain is called a gene or locus.
The values allowed for a certain gene are called alleles (e.g., for binary encoding, 0 and 1).

The genotype indicates the structure of the chromosome and the values of its genes (it is related to S*).

The phenotype indicates the behavior of an individual obtained due to its specific genotype (it is related
to S).

For the optimization problem $\min_{x \in S} f(x)$:

[Diagram: the phenotype x (space S) corresponds to the genotype, i.e. the chromosome = chain of genes (space S*).]

Genetic operators
- Crossover - it works on two operands: 2 parents → 2 offspring,
by interchanging some sub-chains.

Depending on the number of cutting points:

single cut point crossover

multiple cut point crossover

[Figure: a) single cut point crossover - parents A and B exchange their sub-chains after the cutting point, producing offspring C1 and C2; b) multiple cut point crossover - the parents exchange the sub-chains delimited by several cutting points.]

- Mutation
It works on a single operand: some randomly selected genes are modified.
E.g., for binary encoding, 0 → 1 and 1 → 0:

1 0 .................... 0 1 0 ........... 1 1
becomes (after mutating the second and the last gene)
1 1 .................... 0 1 0 ........... 1 0

Mutation for binary encoding


Remarks:

Genetic operators act according to stochastic rules (their probabilities are smaller than 1):

- not all the pairs of parents formed from the recombination pool are combined by means of
crossover;

- the cutting points and the mutated genes are stochastically selected.

Individual evaluation. Selection for recombination and survival

The adaptation capacity of an individual is assessed in comparison with its competitors from the set
(population).

Usually, the quality of an individual is indicated as follows:

- as an absolute value, by means of the objective function:
  - it indicates how much an individual fits the imposed objective;

- in comparison with the other individuals of the set, by means of the fitness:
  - it encapsulates a comparison between the performances of the individual and the
    performances of its cohabitants;
  - it permits choosing the parents and the survivors.


Generally, an individual better than average is encouraged to survive and to produce offspring, because
it contains genetic material better than the other solutions of the current population.

Potential downsides:

By excessively encouraging the selection of superior individuals, the exploration is guided
toward restricted regions of S* containing the best individuals. The convergence speed is
high; however, the exploration could stagnate around inconvenient solutions.

Even the worst solutions can generate well-fitted individuals by means of successive genetic changes
(performed via genetic operators).

To avoid stagnation in local optima and premature convergence, an adequate
balance between convergence speed and diversity preservation is required. This balance is
mainly tuned by means of parent selection and offspring creation/insertion.

Stop criteria
Because the algorithm works randomly and in an unsupervised manner, it is quite difficult to set
a proper stop condition a priori.

The most common stop test uses a maximum number of generations,
tuned by trial and error.

Another stop criterion verifies whether the differences between the individuals of the current
population have become smaller than a predefined threshold.
If the individuals are still different, the evolutionary loop continues; if they become
too similar, the loop is stopped.
The allowed difference is difficult to set a priori (more difficult than the number of
generations).
The encoding is very important here, as small genotypic differences can involve big phenotypic
differences and vice versa.


The properties of genetic algorithms


When compared with other optimization methods, the main characteristics of GA can
be summarized as follows:

- GA work in parallel on a population of solutions;
- GA use stochastic transition rules;
- GA use the objective values only; no other information is necessary (e.g., the derivatives of the
objective function);
- GA usually encode the set of decision variables (exception: GA based on float encoding).

Iterative optimization methods

Usually the algorithm starts from an initial (known) solution, $x_0$.
Iteratively, $x_k \to x_{k+1}$.
The goal is $\lim_{k \to \infty} x_k = x^*$ (the global optimum).

Zero-order methods use the objective values only:
- usually the objective values are computed in $x_k$ and some of its neighbors;
- examples: simulated annealing, hill climbing, Hooke-Jeeves, tabu search, GA, etc.

First-order methods use the 1st order derivatives of the objective function f:
- assumption: the 1st order derivatives exist;
- example: steepest descent - the algorithm moves in the direction opposite to the gradient.


Second-order methods use the 2nd order derivatives of the objective function:
- assumption: the 2nd order derivatives exist;
- the search direction is opposite to the gradient; the 2nd order derivatives impose
the search step at each iteration.

Steepest descent (gradient) method:

$$x_i^{k+1} = x_i^k - \alpha\,\frac{\partial f}{\partial x_i}(x^k), \quad \text{with } \alpha > 0.$$

Downsides: a differentiable objective function is required;
$x_0$ and $\alpha > 0$ are needed;
high risk of locking into local optima.
Advantages: simplicity, high convergence speed.
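A minimal sketch of this update rule (assuming f is differentiable and its gradient grad_f is supplied by the caller; the step size alpha and the quadratic test function are illustrative):

```python
def steepest_descent(grad_f, x0, alpha=0.01, iters=1000):
    """Iterate x_i^{k+1} = x_i^k - alpha * df/dx_i(x^k): move against the gradient."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)                                   # gradient at the current point
        x = [xi - alpha * gi for xi, gi in zip(x, g)]   # step in the inverse gradient direction
    return x

# usage on f(x) = x1^2 + x2^2, whose gradient is (2*x1, 2*x2);
# starting from (3, -4), the iterates approach the global optimum (0, 0)
print(steepest_descent(lambda x: [2 * x[0], 2 * x[1]], x0=(3.0, -4.0)))
```

Started near a poor basin of a multimodal function, the same iteration stops in a local optimum, which is the downside illustrated by the figure below.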

[Figure: plots of the objective function f(x) and of its derivative versus the decision variable x; depending on the start point, the gradient method can stop in the local points xA, xB, xC, xD, xE instead of the global optimum x*.]

The advantages of genetic algorithms

They use the objective values only → UNIVERSALITY: they can solve ANY optimization problem
(including those with discontinuous, non-differentiable objective functions).
These methods are called weak/soft, because they require scarce a priori information about the
targeted problem.
Available additional information can be integrated within GA in order to improve the
exploration capability and/or the convergence speed (e.g., start from a particular initial
population).
GA are efficient for complex (nonlinear, multiobjective, constrained, multimodal) optimizations:
they can converge toward the GLOBAL optima.
GA are EASY to implement and accept FLEXIBLE configuration.
GA are suitable for PARALLEL implementation.

The drawbacks of genetic algorithms

GA require huge computational resources (time + memory).

The performances are sensitive to several algorithm parameters, such as: the probabilities of crossover
and mutation, the population size, the number of generations, etc.
The random number generator strongly influences the algorithm performances.


2.3. Evolutionary algorithms - an artificial intelligence technique

Evolutionary algorithms involve unsupervised inductive learning based on observations:

The examples (individuals) are created without a supervisor.

The generation of new examples is based on inductive learning - the best available
knowledge is employed.
Good examples are kept in the population; bad ones are eliminated by means of
selection.

Evolutionary algorithms make use of a sub-symbolic chromosomal representation.

2.4. Main research directions

- as outlined by Fogel:
Improved theoretical background: theory explaining the behavior of evolutionary algorithms.
The available results refer to limited cases (incompliant with real applications).
Empirical research: comparative analyses meant to reveal the effect of various techniques/
mechanisms and the influence of the algorithm parameters.
Automatic setting of algorithm parameters: meta-algorithms and adaptive approaches.
Co-evolutionary systems: each individual is an agent which needs to cooperate with the others for
solving the problem; an agent also competes with all the other individuals for survival.
Studies in natural evolution: improved interdisciplinary research for finding valuable ideas, translated later into the numerical approaches.


2.5. Genetic encoding

[Diagram: two basic options - a) change the problem statement (encode the decision variables), so that a standard genetic algorithm can be applied to the modified problem; b) change the GA techniques (a modified genetic algorithm) to cope with the original decision variables.]

Two basic approaches for GA

A. Change the problem statement in compliance with the canonical GA.

The decision variables are encoded with a finite alphabet:

- the standard GA can be applied without any changes;
- the exploration space S is mapped to S*;
- the selections are applied in S;
- the genetic operators are applied in S*;
- the evaluation needs decoding (from S* to S).


Encoding (usually not a bijection!): S → S*; in S*, an individual is a string of genes (each
decision variable has a specific substring).
Decoding (the inverse mapping) is used for interpreting the significance of the genes at the
evaluation stage.

[Diagram: encoding maps $(x_1, \ldots, x_n) \in S$ to the concatenated substrings $v_{11}\ldots v_{1l}, \ldots, v_{n1}\ldots v_{nl} \in S^*$; decoding maps the substrings back to $(x_1, \ldots, x_n) \in S$.]

Genetic steps are carried out in different spaces:

- selection for reproduction and insertion: in S;
- crossover and mutation: in S*.
The population includes points from S*; their images in S are obtained by decoding.

[Diagram: the loop P(t) → decoding and selection in S → recombination pool G(t) → genetic operators in S* → offspring → insertion → P(t+1).]

Key issue - find a proper encoding.

Fogel: the size of the (finite) encoding alphabet has no huge influence (the resulting GAs are
equivalent)
→ use the most intuitive one.

Most popular: GA with binary encoding (canonical genetic algorithms).

Each decision variable is encoded by a substring of 0s and 1s.
For optimization problems involving continuous decision variables:


-

the designer must indicate the length of the chromosome, l:


vu
for xi [u , v] , the encoding step could be set q = l ; the same binary encoding could be
2
used for all the decision variables xi [u + j q, u + ( j + 1) q ) , with j = 0,2l 1 .

[Figure: binary encoding with l bits for a decision variable from [u,v] - successive binary codes correspond to the successive intervals [u, u+q), [u+q, u+2q), [u+2q, u+3q), ..., [v-q, v].]


Example:
Let us consider the encoding of $x_{1,2} \in [-2, 2]$ by means of 4 bits per decision variable:
$$q = \frac{v-u}{2^l} = \frac{2-(-2)}{2^4} = 4/16 = 1/4.$$

o Code 0000 is associated with $x_i \in [-2,\ -2 + 1/4)$; at the evaluation stage, the decoding leads to
$0000 \to x_i = -2 + \frac{1}{4}\cdot\frac{1}{2} = -2 + 1/8$ (the middle of the interval).

o Code 0001 is associated with $x_i \in [-2 + 1/4,\ -2 + 1/2)$; at the evaluation stage, the decoding
leads to $0001 \to x_i = -2 + \frac{1}{4}\cdot\frac{3}{2} = -2 + 3/8$ (the middle of the interval).

o Therefore, the chromosome 0000 0001 (the first substring encodes $x_1$, the second one $x_2$)
is interpreted as $x_1 = -2 + 1/8$ and $x_2 = -2 + 3/8$, and the objective value is computed accordingly.

o If the optimum point is $x_1 = x_2 = -2 + 1/16$, then the best result of the algorithm can be
$x_1 = x_2 = -2 + 2/16$, with an error of 1/16 introduced by the finite-length encoding. The error
can be decreased by using longer chromosomal strings (which lead to a smaller q).
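The decoding used in this example can be sketched as follows (a hypothetical helper, assuming each substring is read as an unsigned integer j and mapped to the middle of its interval):

```python
def decode(chrom, u, v, l):
    """Decode one l-bit substring into the middle of its interval of width
    q = (v - u) / 2**l, as in the example above."""
    q = (v - u) / 2 ** l
    j = int("".join(map(str, chrom)), 2)   # binary string -> integer index j
    return u + j * q + q / 2               # middle of [u + j*q, u + (j+1)*q)

# the 4-bit example with x in [-2, 2]: q = 1/4
assert decode([0, 0, 0, 0], -2, 2, 4) == -2 + 1/8
assert decode([0, 0, 0, 1], -2, 2, 4) == -2 + 3/8
```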

Disadvantages:
o The accuracy of the algorithm depends on the length of the chromosome → very
long chromosomes are required to explore large, highly dimensional search spaces.

o The encoding can increase the complexity of the problem, e.g., the problem
becomes multimodal (it admits multiple global optima).
This can happen whenever the ordering relationship between the distances in S is not
preserved for the distances in S* (a big distance between two individuals in S
does not mean a big distance between the same individuals in S*, and vice versa).
Solution: change the encoding, e.g., use Gray binary encoding.

Remark: for binary encoding, the similitude (in S*) can be analyzed with the Hamming
distance.

B. Modify the GA techniques
Modified genetic operators!
- the decision variables are not encoded: their values are directly memorized in the
chromosomes (S* is not used);
- new genetic operators are needed to work in S (assuming an infinite encoding
alphabet).

Advantages:
- The chromosomal representation is more natural;
- Additional knowledge can be more easily incorporated within the algorithm;
- No extra computational time is needed for decoding.

Float encoding (evolutionary program) - recommended for continuous decision
variables.
Advantages:
- The length of the chromosome = the number of decision variables (independent
of the requested accuracy and exploration range).
- Similar chromosomes correspond to neighboring points.
- The complexity of the optimization problem cannot be changed by the encoding.
Disadvantages:
- New genetic operators are needed.

Remarks:
Which approach is best? There is no general answer:
- B: intermediary results can be easily interpreted;
- A: good theoretical background.
The encoding (A or B) must be joined with compatible genetic operators.

2.6. Population initialization

Usually: randomly generated, according to a uniform distribution in S*.

Binary encoding:
For a population of N chromosomes (each one having l genes), one has to
generate l·N bits (with equal probability for the occurrence of 0 and 1).

Bramlette - extended random initialization:
for each individual, several trials are made and the best sample is used.

Additional knowledge can be used for creating an adequate initial population
(→ increased speed):
- expected localization of the optima;
- expected structural properties of the optimal chromosomes;
- for constrained optimization: the initial population can include feasible solutions
only.

2.7. Genetic operators. Crossover and mutation.

- They ensure the exploration and the creation of new solutions.
- They maintain the diversity of the population.
- Without genetic operators, the best solution of the initial population would be the
result of the algorithm.

Crossover acts on two parents in order to produce two offspring.
It interchanges sub-chains randomly selected from the parents.

Mutation is usually applied after crossover.
It changes some randomly selected genes.


Which operator is more suitable? What is the best probability?
- no general answer is available, no firm winner.
Notations: $p_c$ = crossover probability, $p_m$ = mutation probability.

Different genetic operators have been suggested - no rules are available for choosing
the most suitable one.
Kursawe: the genetic operators must be designed taking into account the
dimension of the search space.
Combine crossover and mutation:

- even for simple problems, the use of a single operator (crossover or
mutation) can lead to unsatisfactory results.
Some recommended values for the crossover and mutation probabilities have been obtained by
means of experimental research.
GA: $p_c \gg p_m$. Evolutionary strategies: use mutation only, or $p_m \gg p_c$.

A. Genetic operators for binary encoding

Crossover (for GA)

Values suggested for $p_c$ (via experimental research):
- some authors recommend $p_c \approx 0.6$;
- other authors recommend $p_c \in (0.75, 0.95)$.
Usually $p_c \gg p_m$.
(For evolutionary strategies: usually $p_c = 0$.)


Types of crossover:
- single cutting point crossover;
- multiple cutting point crossover:
  - more efficient for exploration;
  - random selection of the cutting points + other methods (e.g., avoid interchanging
    identical sub-chains);
- crossover using multiple parents: more than two parents participate in the production
  of an offspring;
- discrete crossover: 50% probability to select a gene from one parent, 50%
  from the other parent.

Most popular: multiple cutting point crossover and discrete crossover.

Mutation
- keeps the diversity of the offspring → avoids the stagnation of the algorithm.
$p_m$ must be correlated with the employed selection.
Usually:
GA: $p_m$ small (rare mutation).
A large $p_m$ can disturb the algorithm convergence.
Example: if all the offspring implicitly survive to the next generation, the use of
$p_m > 1/l$ (l = the length of the chromosome) can lead to instability.
Recommendation - Bäck (1996), for binary encoding: use Gray encoding with $p_m = 1/l$,
with l = the length of the chromosome.


Maintaining a constant $p_m$ throughout the evolutionary loop is not compulsory - e.g., decreasing $p_m$:
- a large $p_m$ in the first generations, in order to refresh the genetic material;
- a small $p_m$ in the last generations, in order to allow the algorithm to converge.

B. Genetic operators for float encoding

The encoding alphabet is infinite.
The length of the chromosome = the number of decision variables.
The most popular operators were proposed by Michalewicz (1996).

Crossover
- simple crossover: interchanges sub-chains randomly selected from the parents;
- heuristic crossover: if parent $x_2$ is better than parent $x_1$,
  $x_1' = a\,(x_2 - x_1) + x_2$, with $a \in (0,1)$ a random scalar;
- discrete crossover: see binary encoding.


- intermediary crossover: parents $x_1$ and $x_2$ produce the offspring $x_1'$ and $x_2'$:
$$x'_{1i} = a_i\,x_{1i} + (1 - a_i)\,x_{2i}$$
$$x'_{2i} = (1 - a_i)\,x_{1i} + a_i\,x_{2i}$$
where $a = [a_i]_i$ is a vector of random values having the same size as the chromosome; its
elements can be chosen from $(-0.25,\ 1.25)$; $x_{ji}$, $j \in \{1,2\}$, indicates the i-th element of the chromosome $x_j$.

[Figure: in the plane (gene 1, gene 2), the offspring are placed in a hypercube slightly larger than the one delimited by the parents $x_1$ and $x_2$.]

- linear crossover: it produces $x_1'$, $x_2'$ from $x_1$, $x_2$:
$$x_1' = a\,x_1 + (1-a)\,x_2, \qquad x_2' = (1-a)\,x_1 + a\,x_2, \qquad (*)$$
with a - a scalar chosen within $(-0.25,\ 1.25)$.

[Figure: the offspring are placed on a segment slightly larger than the one delimited by the parents $x_1$ and $x_2$.]

- simple arithmetic crossover: it changes a single gene according to (*).

Most popular: linear and intermediary crossover.
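A sketch of these two operators, directly following the formulas above (the range (-0.25, 1.25) is the one suggested in the text; the function names are illustrative):

```python
import random

def intermediary_crossover(x1, x2, low=-0.25, high=1.25):
    """One random coefficient a_i per gene; the offspring fall in a hypercube
    slightly larger than the one delimited by the parents."""
    a = [random.uniform(low, high) for _ in x1]
    c1 = [ai * g1 + (1 - ai) * g2 for ai, g1, g2 in zip(a, x1, x2)]
    c2 = [(1 - ai) * g1 + ai * g2 for ai, g1, g2 in zip(a, x1, x2)]
    return c1, c2

def linear_crossover(x1, x2, low=-0.25, high=1.25):
    """A single scalar a for all genes; the offspring fall on a segment
    slightly larger than the one delimited by the parents."""
    a = random.uniform(low, high)
    c1 = [a * g1 + (1 - a) * g2 for g1, g2 in zip(x1, x2)]
    c2 = [(1 - a) * g1 + a * g2 for g1, g2 in zip(x1, x2)]
    return c1, c2
```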

Mutation
- Uniform mutation changes the values of some randomly selected genes.
The chromosome $x = (v_1, v_2, \ldots, v_k, \ldots, v_n)$ is changed to $x' = (v_1, v_2, \ldots, v'_k, \ldots, v_n)$
when the mutation acts on $v_k$.
The new value $v'_k$ is randomly chosen: $v'_k \in (v_k - a,\ v_k + a)$, $a > 0$.

This operator is very useful for populations containing multiple duplicates of the
same individual.

- Non-uniform mutation acts differently at distinct generations:
$$v'_k = \begin{cases} v_k + \Delta(t,\ u - v_k), & \text{for } r = 0 \\ v_k - \Delta(t,\ v_k - l), & \text{for } r = 1 \end{cases}$$
with:
r - a random bit (uniform distribution of 0 and 1);
(l, u) - the range of $v_k$;
t - the current generation;
$\Delta(t, y)$ - decreased at subsequent generations according to:
$$\Delta(t, y) = y\,\big(1 - z^{(1 - t/T)^b}\big), \quad z \in (0,1) \text{ random};\ b \in \mathbb{N} \text{ (usually } b = 5\text{)};$$
T - the maximum number of generations.

This mutation has a larger impact in the first generations, when the genes can
be mutated within larger intervals ($\Delta(t, y)$ is bigger). During the last generations,
only small variations are allowed.
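A sketch of this operator, following the formulas above (the gene index k, its range (low, high) and b = 5 follow the notations in the text; the function name is illustrative):

```python
import random

def nonuniform_mutation(x, k, low, high, t, T, b=5):
    """Non-uniform mutation of gene k: large moves in the early generations,
    vanishing moves as t approaches T."""
    def delta(y):
        z = random.random()
        return y * (1 - z ** ((1 - t / T) ** b))
    if random.random() < 0.5:                 # random bit r
        x[k] = x[k] + delta(high - x[k])      # r = 0: move toward the upper bound
    else:
        x[k] = x[k] - delta(x[k] - low)       # r = 1: move toward the lower bound
    return x
```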


C. Genetic operators for chromosomes with variable length

Vector-based chromosomes with variable length are used in (Goldberg, 1994) for
solving scheduling problems.
Genetic operators recommended for vector-based chromosomes:
- concatenation: two parents are concatenated to form an offspring;
- splitting: it splits a chromosome into two offspring;
- mutation: it acts similarly to the previously described mutations.
Chromosomes of variable length can have more complex structures (e.g., trees,
graphs). In these cases, distinct specialized genetic operators are used.

D. Self-adaptive operators

The technique is imported from evolutionary strategies:
o The chromosome encodes some parameters of the operator.
Very useful for GA.

Mutation
- mutation with adaptive $p_m$ (Smith, Fogarty): $p_m$ is encoded in the chromosome.

Crossover
Usually, the self-adaptive crossovers are aimed at finding suitable cutting points.
- adaptive multiple cutting point crossover, introduced by Schaffer and Morishima: it
uses an adaptive distribution of the cutting points; the cutting points are encoded at the
end of the chromosome. Eliminating an individual also eliminates its encoded cutting points.
- segmented crossover: uses a variable number of cutting points; the chromosome
encodes the probability of cutting the chromosome at a specific locus.
- dual crossover, introduced by Spears: the chromosome encodes an indicator which
specifies the type of crossover to be applied (e.g., discrete or multiple-point).

2.8. Selection for recombination

- It selects the individuals of the current population which participate in the genetic
recombination; these individuals deserve to produce offspring, as their genetic material
is valuable.

Selection is also used for insertion. However, insertion will be analyzed separately, as
its specific mechanisms are sufficiently different.

Attention should be paid to the fact that a GA works on a finite population, for a finite
number of generations.

Selection includes two main stages:

1) compute the selection probabilities - usually equal to the fitness values;
2) choose the parents (sampling), according to the selection probabilities.
- Some authors consider selection = sampling.

[Diagram: Step I - starting from the objective values of all the individuals of the current population, compute the fitness values (selection probabilities); Step II - sample the parents accordingly, filling the recombination pool. Caption: Selection for recombination.]

Stage I. Fitness value computation

The objective function provides an absolute assessment of an individual.
Selection must evaluate each individual relative to its competitors; this evaluation is
provided by the fitness function.

There are two main alternatives for computing the fitness values:
- by explicitly using the objective values;
- by considering the rank assigned to the individual in a list sorted in terms of the
objective values.


1. Fitness assignment by means of an explicit use of the objective values (scaling)

$F(x) = g(f(x))$,
where
F is the fitness function,
f is the objective function,
g describes the mapping from objective values to fitness values.

The fitness values are equal to the selection probabilities:

$$F(x_i) = \frac{f(x_i)}{\sum_{j=1}^{N} f(x_j)}, \quad \text{with } \sum_{i=1}^{N} F(x_i) = 1 \text{ and } F(x_i) = p_i,$$

where
$x_i$ denotes the individual i of a population including N solutions,
f is the objective function.

Scaling-based fitness assignment encourages the individuals with objective values
better than the average.

Requirement: f must take only positive values.

Usually, a preliminary scaling of the objective function (f → f*) is considered, which aims to:
- meet the requirement above;
- change the influence of the individuals during the evolutionary loop:

$$F(x_i) = \frac{f^*(x_i)}{\sum_{j=1}^{N} f^*(x_j)},$$

where $x_i$ denotes the individual i in a population of size N
and f* is the scaled objective function.

A classification of the most common scaling methods is presented by Goldberg (1994)
and Michalewicz (1996):

- linear scaling provides a linear transformation from f to f*:
$f^*(x_i) = a\,f(x_i) + b$, $a, b \in \mathbb{R}$, where $x_i$ denotes an individual of the population.
The parameters a and b influence the relative quality of the individuals,
therefore they have an impact on the convergence speed and the exploration
capability of the algorithm.
Usually, f* does not significantly alter the average value, whilst increasing the
influence of the individuals better than average.
For maximization problems, a must be positive; for minimization problems, a
must be negative; b ensures that f* is non-negative (Michalewicz, 1996).
a and b are constant during the evolutionary loop.

Example:
If a = 1 and b is significantly bigger than the mean of f,
then the selection based on f* converges more slowly than the one based on f.

Let us consider 10 individuals, mean(f) = 5 (so $\sum_{i=1}^{10} f(x_i) = 50$) and b = 100; in a maximization
problem, for $f(x_1) = 25$ it results:

$$p_1 = \frac{f(x_1)}{\sum_{i=1}^{10} f(x_i)} = \frac{25}{50} = 0.5 \quad \text{(without scaling)}$$

$$p_1 = \frac{f^*(x_1)}{\sum_{i=1}^{10} f^*(x_i)} = \frac{125}{50 + 1000} \approx 0.12 \quad \text{(with scaling)}$$

since $f^*(x_1) = f(x_1) + b = 125$ and $\sum_{i=1}^{10} (f(x_i) + b) = \sum_{i=1}^{10} f(x_i) + 10b = 50 + 1000 = 1050$;

the mean becomes 105.


- power-law scaling:
$f^*(x_i) = (f(x_i))^k$, $k \in \mathbb{R}$, where $x_i$ represents an individual of the population.
The individuals having objective values bigger than 1 gain an increased impact,
whilst those with objective values smaller than 1 are disadvantaged.
k is set slightly bigger than 1 (e.g., 1.005).

Remark:
For some authors, f* is directly used as fitness, as the proportional scaling is implicitly provided
by the stochastic roulette-based sampling.

2. Ranking-based fitness assignment

o Scaling-based fitness assignment gives too much credit to the individuals
considerably better than the average. If a population includes an individual which is
significantly better adapted than the others, this individual will be selected with
many copies within the recombination pool, the offspring will be close to it, so
it has huge chances to conquer the whole population with its
duplicates. This impedes the exploration of larger areas and allows the algorithm
to remain locked in local optima.
This disadvantage is eliminated by ranking-based fitness assignment.

The population is sorted subject to the objective values.

The rank represents the position that an individual has in this sorted list (rank 1 for the
best chromosome, rank N for the worst one).


The fitness values can be computed as follows:

- linear method:
$$F(r_i) = q - (r_i - 1)\,r,$$
where $r_i$ is the rank of the individual $x_i$, and q, r are parameters.
The selection probabilities belong to an arithmetic series (with step r).
To ensure that the sum of all the selection probabilities is equal to 1, it results:
$$q = r\,\frac{N-1}{2} + \frac{1}{N},$$
where N is the size of the population.
When r = 0 (and q = 1/N), all the individuals get the same fitness, no matter what
performances they have.
When $r = \frac{2}{N(N-1)}$ (and q = 2/N), the biggest difference is made between the
individuals placed on consecutive ranks. The worst chromosome is assigned the
fitness value 0 and the best one the fitness value q = 2/N.

It results that the range of q is $(1/N,\ 2/N)$.

Rephrasing suggested by Baker:

$$q = \frac{SP}{N} \quad \text{and} \quad r = \frac{2\,(SP - 1)}{N\,(N-1)},$$

where $SP \in (1, 2)$ is the selection pressure and N is the size of the population.

For SP = 1, all the individuals gain the same fitness.
For SP = 2, the best individual receives the biggest selection probability.

Baker recommends SP = 1.1.

Advantage: a single parameter (SP) is used.

[Figure: selection probability versus rank - a line decreasing from q (rank 1) to q - (N-1)r (rank N).]

Remark:
For some authors, the fitness values are not equal to the selection probabilities,
but only proportional to them. In these cases, the requirement $\sum_{i=1}^{N} F(x_i) = 1$ is not
enforced, being solved at sampling.

- nonlinear method:
$$F(r_i) = q\,(1-q)^{r_i - 1},$$
where $r_i$ denotes the rank of $x_i$ and $q \in (0,1)$.
The selection probabilities belong to a geometric series of ratio $(1-q)$.
For any $q \in (0,1)$, the requirement $\sum_{i=1}^{N} F(x_i) = 1$ cannot be met exactly. However, this sum
is close to 1 for a large N:

$$\sum_{i=1}^{N} F(x_i) = 1 - (1-q)^N < 1; \qquad \sum_{i=1}^{N} F(x_i) \to 1 \text{ for } N \to \infty.$$

Usually, q is chosen close to 0 (e.g., Michalewicz recommends q = 0.04).


The advantages of rank-based fitness assignment:

- it does not need a preliminary scaling of f;
- the selection probabilities can be directly controlled by means of q and r;
- larger ranges of genetic algorithm behaviors can be obtained.

Disadvantages:
- it does not meet the requirements of the schema theory (which analyzes GA
convergence);
- it neglects the differences between the objective values of the individuals with
consecutive ranks;
- it requires setting a priori two parameters (q, r) or one parameter (SP) of huge
impact.

Stage II. Sampling the individuals for the recombination pool

The sampling methods can be analyzed in terms of three indicators introduced by Baker:

o Efficiency - usually related to the computational complexity of the method.

o Bias = the absolute difference between the expected selection probability (usually equal
to the fitness value) and the selection probability considered by the sampling
method.
- The bias defines the accuracy of the sampling.
Desired: bias = 0.

o The spread indicates the range of the number of selections allowed for an
individual: [min_no_samples, max_no_samples].
- It measures the consistency of the method.
- A small spread means that the real number of selected samples is close to
the expected number of samples.
- Please note that a finite number of selection trials is considered. For
infinite trials, the number of occurrences would be compliant with the
selection probabilities; however, for a small number of trials, huge
differences can appear.

Roulette-based method. Stochastic sampling with replacement

[Figure: a roulette wheel of total circumference SUM, on which the sectors are delimited by the cumulative fitness values $F(x_1)$, $F(x_1)+F(x_2)$, $F(x_1)+F(x_2)+F(x_3)$, ..., $F(x_1)+\ldots+F(x_{N-1})$.]

Notations: $x_i$ - the i-th individual of the population, i = 1, ..., N;
$F(x_i)$ - the fitness value of $x_i$.

Explanations:
Each individual gains a sector proportional to its fitness. The population forms a
circle of length SUM (usually SUM = 1, although this requirement is not
mandatory).
Nsel random numbers are generated within (0, SUM), where Nsel indicates the number
of individuals needed in the recombination pool. Each selection corresponds to one
turn of the roulette; the individual sent to the recombination pool is the one indicated
by the position of the needle.
Obviously, a higher fitness value (assigned to a well-adapted individual) leads to a
larger sector and consequently to higher selection chances.
The method ensures:
- null bias;
- large spread, (0, Nsel) - any individual with non-null fitness can be selected;
- computational complexity of order $N_{sel}\,\ln(N_{sel})$.
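A sketch of this sampling scheme (assuming non-negative fitness values; the cumulative sums play the role of the sector boundaries on the wheel):

```python
import bisect
import random
from itertools import accumulate

def stochastic_sampling_with_replacement(fitness, nsel):
    """Roulette-wheel sampling: each of the nsel needle throws lands in a
    sector proportional to F(x_i); returns the selected indices."""
    cumulative = list(accumulate(fitness))    # sector boundaries on the circle
    total = cumulative[-1]                    # SUM
    return [bisect.bisect_left(cumulative, random.uniform(0, total))
            for _ in range(nsel)]
```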

Stochastic sampling with partial replacement

The method is similar to the previous one, but once an individual is selected, its sector is
decreased:
o The decrease corresponds to SUM/Nsel (in fitness), which means reducing
by 1 the number of its future occurrences.
o Starting with the second trial, the needle is no longer allowed to rotate around the
whole circle.
o By successive reductions, a sector can end up with negative length - in this case it is
eliminated from the roulette.
The method ensures:
- null bias;
- smaller spread - the maximum number of samples is $[F(x_i)\,N_{sel}/SUM] + 1$,
where [x] indicates the integer part of x.

Remainder stochastic sampling

These methods include two main stages:
- First stage - deterministic: the selection is provided in terms of the integer part of the
expected number of selections.
- Second stage - stochastic: the rest of the samples are chosen by means of roulette-based sampling.
Two methods are presented below - they employ distinct techniques during the second stage:
o Remainder stochastic sampling with replacement uses stochastic sampling
with replacement for the second stage: the sectors remain unchanged.
The method ensures: null bias, minimum number of samples $[F(x_i)\,N_{sel}/SUM]$.

o Remainder stochastic sampling without replacement uses stochastic sampling
with partial replacement during the second stage: once an individual is selected,
its sector is eliminated.
The method ensures: bias close to 0, very small spread.

Stochastic universal sampling

A single random number p is generated within (0, SUM/Nsel).
The needle positions are then uniformly spaced, corresponding to
$p,\ p + SUM/N_{sel},\ \ldots,\ p + (N_{sel}-1)\,SUM/N_{sel}$.

The method ensures:
o small spread;
o null bias;
o small computational complexity - of order $N_{sel}$.
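A sketch of this method (same assumptions as above; note the single call to the random number generator):

```python
import random
from itertools import accumulate

def stochastic_universal_sampling(fitness, nsel):
    """One random offset, then nsel equally spaced pointers: null bias,
    small spread, O(nsel) complexity."""
    cumulative = list(accumulate(fitness))
    step = cumulative[-1] / nsel              # SUM / Nsel
    pointer = random.uniform(0, step)         # the single random number p
    selected, i = [], 0
    for _ in range(nsel):
        while cumulative[i] < pointer:        # advance to the sector hit by the pointer
            i += 1
        selected.append(i)
        pointer += step                       # next equally spaced pointer
    return selected
```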


Other selection methods:

K-tournament selection
A set of k individuals is randomly formed; the best of them is placed in the
recombination pool.
For Nsel samplings, the above procedure is repeated Nsel times.
A large k means a large selection pressure.
Usually k = 2.
Additionally, a Boltzmann mechanism can be used (Michalewicz, 1996):
If x competes with v and x is better than v, the selection of v is sometimes
also allowed:
- generate a random number between 0 and 1;
- if it is smaller than $e^{\frac{F(v) - F(x)}{T}}$, v wins the contest; otherwise x is the winner;
- T is a float value decreased during the evolutionary loop;
- F(v) and F(x) are the fitness values.
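A sketch of k-tournament selection with the optional Boltzmann rule described above (assuming k ≥ 2 when a temperature is given; the names are illustrative):

```python
import math
import random

def k_tournament(fitness, nsel, k=2, temperature=None):
    """Pick the best of k random competitors, nsel times. With a temperature,
    the Boltzmann rule occasionally lets a weaker competitor v win instead."""
    selected = []
    for _ in range(nsel):
        group = random.sample(range(len(fitness)), k)
        winner = max(group, key=lambda i: fitness[i])
        if temperature is not None and k > 1:
            v = random.choice([i for i in group if i != winner])
            # v wins if a uniform draw falls below e^((F(v) - F(winner)) / T)
            if random.random() < math.exp((fitness[v] - fitness[winner]) / temperature):
                winner = v
        selected.append(winner)
    return selected
```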

Takeover time

Let us consider a GA without genetic operators, which uses the selection for
recombination in order to directly form the population of the next generation. The size
of the population is kept constant.

The solutions better than the average of the initial population are progressively
sampled with more and more copies.
After a finite number of generations, the best individual conquers the whole
population with its duplicates and the algorithm stops.
This number of generations is called the takeover time.


Bäck illustrates two types of selection in relation to the takeover time:

Soft selection, based on a small selection pressure → large takeover time:

o The best solutions are not extremely encouraged.
o The algorithm ensures a very good exploration; the population stays diverse
for a large number of generations (Kuo and Hwang, 1996).

Hard selection, based on a big selection pressure → small takeover time:

o The best solutions are extremely encouraged.
o The population loses its diversity in the preliminary generations, so the
exploration is focused on certain small areas.
o The convergence speed is very high.

Bäck (1996) compared the scaling-based selection with the rank-based selection
(sampling is solved with a roulette-based method):
- Scaling-based selections with linear, polynomial or power-law scaling lead
to a takeover time of order $N \ln(N)$, where N is the size of the population.
The type of scaling has no huge influence.

- For rank-based selection, the takeover time is very sensitive to SP.
The takeover time is monotonically decreasing in terms of SP
(a small SP means soft selection).

Bäck recommends:
rank-based selection and k-tournament selection.


Classification of selections - Bäck and Hoffmeister; Michalewicz (1996).

Considering the dynamics of the selection probabilities:
- static selections - the same set of selection probabilities is used at every generation;
- dynamic selections - the set of selection probabilities changes from one generation to
another.

Considering the minimum selection probability:

- non-conservative selections - accept individuals with null selection probability:

  - with left elimination - no chance to select the best solutions (the goal is to
    avoid premature convergence);
  - with right elimination - no chance to select the worst solutions;
- conservative selections - null selection probabilities are not accepted.

2.9. Insertion (selection for survival)

Offspring insertion = a selection process:
one selects the surviving offspring and old solutions.

A. Insertion for fixed-length populations

- The size of the population (N) is maintained constant during the evolutionary loop.

o The selection which ensures the survival of the best solution at every generation
is called elitist.
- For this type of selection, Rudolph proved the convergence towards the
optimal point.


A.1. Methods specific to GA
- In order to reduce memory and computational time consumption, GA produce
fewer offspring than the population size.
- Some offspring (the best ones) are inserted into the population.
- Size of the recombination pool / size of the population = the generation gap;
it indicates the informational gain of the algorithm per generation.
The new information is achieved by exploiting the most valuable genetic material
inherited during the evolutionary loop.
- λ offspring are inserted at each generation, λ = constant.

o If λ ∈ {1, 2}, the selection is called steady-state.

o If N offspring replace all N current solutions, the selection is called pure
- each individual lives for a single generation.

Usually, the offspring replace the worst solutions of the current population.

The replacement can be deterministic or stochastic (roulette-based), using inverse-fitness selection:
- Inverse-fitness selection means that the worst solutions of the current
population are replaced; this is computationally advantageous, because λ < N.
- Fogarty proved that deterministic and stochastic insertions lead to similar
convergence speeds. Replacing the worst individuals means an elitist
selection.
- Therefore, for the sake of simplicity and better time performances, the
insertion is usually deterministic.


Other insertions consist in the replacement of the oldest solutions of the current
population:
- As the best individuals have numerous duplicates created at successive
generations, a well adapted individual can survive via its newest copies.

Insertion can act:

- once per generation (generational insertion) - after all the offspring are generated;
- on the fly - an offspring is introduced into the population immediately after its
generation; this means that an offspring can replace another offspring obtained at
the same generation.

Other insertions are based on similitude analysis:

Crowding insertion introduces an offspring by eliminating the most similar individual
of the current population. This insertion is useful for multimodal optimizations.
o preselection (proposed by Cavicchio): an offspring replaces the most similar parent:
  - multiple species can coexist within the same population;
  - increased diversity.
o overcrowding (proposed by De Jong): an offspring replaces a similar solution of the current
population.
A set of k individuals is randomly selected from the current population; the most similar one is
replaced by the offspring.

The similitude between two solutions is measured by means of the Euclidean distance
(float encoding) or the Hamming distance (binary encoding).


All the above described insertions are of type (λ, N), with λ ≤ N:

- This means that at every generation λ offspring are inserted in the population by
eliminating λ parents.
- N − λ individuals of the current population survive to the next generation.
- For λ = N, all the offspring are inserted and no old solution survives.

A.2. Insertions imported from evolutionary strategies

These insertions were first used in evolutionary strategies and then imported into
genetic algorithms.
Types:
(λ, N),
(N+λ).

Remark:
Schwefel recommends the (λ, N) insertion with λ >> N; however, the (N+λ) insertion
has also proved its efficiency in numerous applications.


A.2.1. (λ, N) insertion with λ >> N

A huge number of offspring is produced at each generation.
The next population is formed with the best N offspring.

o Usually, this insertion is deterministic.
o However, this insertion is not elitist, as the best offspring can be worse than the
best current solution.
o The insertion is static and non-conservative (some eliminations are accepted).
o This insertion is very useful for the optimization of time-variant, noisy
functions.
o Bäck proved that this selection ensures higher selection pressures than k-tournament or ranking selections.

A.2.2. (N+λ) insertion

λ offspring are generated. For evolutionary strategies λ >> N, although the method can
also be applied for λ ≤ N.
An intermediary population (of size N+λ) is formed by reuniting the old
solutions and the offspring. Then, its best N individuals deterministically survive to the
next generation (a minimal sketch is given below).

o The insertion is elitist - the performances of the best solution are monotonically
improved.
o The survivors can also be stochastically selected from the intermediary
population, using k-tournament selection. Usually, each selected solution is
extracted (eliminated from the population).
Small values of k are preferred.
For a large k, the selection is close to the deterministic one.
o The method ensures a high convergence speed. It can be used in combination
with techniques able to preserve high diversity within the population.
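A sketch of the deterministic variant (assuming maximization; the names are illustrative):

```python
def plus_insertion(population, offspring, fitness, n):
    """(N+lambda) insertion: reunite the parents and the offspring, keep the
    best N. Elitist - the best solution found so far can never be lost."""
    merged = population + offspring
    merged.sort(key=fitness, reverse=True)    # best first (maximization)
    return merged[:n]
```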

B. Insertion for algorithms evolving a variable-sized population

The size of the population influences the accuracy and the convergence speed of the
algorithm:

- Algorithms working on small populations do not ensure an intense
exploration of S and can lock into non-optimal points, although their
convergence speed can be good.
- Algorithms working on large populations have an explorative behavior, at
the cost of a high computational resource consumption.

These problems can be addressed by means of variable-sized populations.

When created, each individual is assigned a lifetime:

- At every generation the lifetime is decremented; when the lifetime becomes 0, the
individual is eliminated from the population (it dies).

o Insertion is implicitly solved.
o Should all the individuals be assigned lifetimes bigger than 1, the size
of the population increases exponentially.

The lifetime should be assigned by taking into account the performances of each individual
relative to the performances of the individuals included in the current population and/or
in the previous populations.

Better individuals should live for longer time intervals, thus having higher
chances to produce offspring inheriting their genetic material.


E.g., linear allocation:

$$\text{lifetime}(x_i) = m + (M - m)\,\frac{F(x_i) - F_{\min}^{abs}}{F_{\max}^{abs} - F_{\min}^{abs}},$$

where
m and M represent the minimum and the maximum lifetimes, respectively,
$F(x_i)$ denotes the fitness value of the individual $x_i$,
and $F_{\min}^{abs}$ and $F_{\max}^{abs}$ indicate the minimum and the maximum fitness values obtained from the beginning of
the evolutionary loop.
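A sketch of this allocation (m = 1 and M = 11 are illustrative choices, not taken from the text):

```python
def linear_lifetime(fitness_i, f_min_abs, f_max_abs, m=1, M=11):
    """Linear lifetime allocation: better individuals live longer, between
    the minimum lifetime m and the maximum lifetime M."""
    if f_max_abs == f_min_abs:                 # degenerate population: all equal
        return (m + M) // 2
    share = (fitness_i - f_min_abs) / (f_max_abs - f_min_abs)
    return round(m + (M - m) * share)
```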

Remark:
o It can be useful to use larger populations at the beginning of the algorithm (in
order to encourage exploration) and smaller populations during the last
generations (when the exploration is merely guided around some solutions).

2.10. GA convergence
The theory of GA is not yet capable of entirely explaining the involved mechanisms:
GAs are very good, but we do not exactly know why they are so good.

Optimization theory states that an algorithm converges toward the
global optimum if it generates a sequence of solutions having the
global optimum as its limit.

o The convergence has been proved for particular GAs, under unrealistic assumptions,
such as infinite populations or an infinite number of generations.


2.10.1. Schema theory

Assumptions: chromosomes of size l; a finite encoding alphabet of size k.
!!! For the sake of simplicity, binary encoding is considered, meaning that the
encoding alphabet is {0, 1} (size k = 2).

o Each gene/locus illustrates a distinct feature of the individual. The evolutionary
process determines the features of the best adapted solutions (e.g., almost all well-adapted
individuals have the first gene 0 and the last gene 1).

o This implicitly indicates a favorable search direction within the phenotypic
space and a schema (building block) in the genotypic space.

Schema = a structural template describing the genotypic similarities of the individuals.

o Each schema contains constant and variable genes. For the previous example, the
schema is 0###.....##1, where # indicates the variable genes, for which any allele is
permitted (in this case 0 or 1).

Therefore, the search can be viewed as the process which looks for the best adapted
schemata.
Holland stated that the fitness value of an individual gives partial
information about the adaptation capacity of the schemata instantiated by
the individual.
Rephrasing, the fitness of a schema H can be computed as the mean fitness
of the individuals containing instances of H.


A schema is characterized by two parameters:

The order, $\omega(H)$ = the number of constant genes
- e.g., the schema 01**10*1 has the order 5.

The (defining) length, $\delta(H)$ = the length of the chain delimited by the first and the last
constant genes, minus 1
- e.g., the schema 01**10*1 has the length 8 − 1 = 7.
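These two parameters (and the notion of an instance) can be sketched as follows, using '*' for the variable genes as in the example above:

```python
def order(schema):
    """omega(H): the number of constant (non-wildcard) genes."""
    return sum(1 for g in schema if g != '*')

def defining_length(schema):
    """delta(H): the distance between the first and the last constant genes."""
    fixed = [i for i, g in enumerate(schema) if g != '*']
    return fixed[-1] - fixed[0]

def matches(schema, chromosome):
    """True if the chromosome is an instance of the schema."""
    return all(g == '*' or g == c for g, c in zip(schema, chromosome))

assert order('01**10*1') == 5 and defining_length('01**10*1') == 7
assert matches('01**10*1', '01011011')
```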

Schema Theorem: a GA with linear scaling-based selection, simple crossover and rare
mutation encourages the multiplication of the schemata better adapted than the average, having
small lengths and small orders:

$$m(H, t+1) \ge m(H, t)\,\frac{f(H)}{\frac{1}{N}\sum_{i=1}^{N} f(x_i)}\,\Big[1 - p_c\,\frac{\delta(H)}{l-1} - \omega(H)\,p_m\Big],$$

with:
$m(H, t)$ - the number of instances of H at generation t;
$m(H, t+1)$ - the number of instances of H at generation t+1;
N - the size of the population;
$\omega(H)$ - the order of H;
$\delta(H)$ - the length of H;
$f(H)$ - the fitness of H, computed as the average fitness of all the individuals contained by
P(t) which comprise instances of H;
$\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ - the mean fitness of the individuals belonging to P(t);
$p_c$ and $p_m$ - the crossover and mutation probabilities.


Proof
o Let us consider the scaling-based selection applied on P(t), in order to fill the
recombination pool with N samples.

o The expected number of selected samples for the individual $x_i$ having the fitness
$f(x_i)$ is:
$$n_i = \frac{f(x_i)}{\frac{1}{N}\sum_{j=1}^{N} f(x_j)}.$$

o After selection, the number of instances of H within the recombination pool is:
$$m(H, t+1)_s = m(H, t)\,\frac{f(H)}{\frac{1}{N}\sum_{i=1}^{N} f(x_i)},$$
with
$f(H)$ - the fitness of H computed for P(t),
$\frac{1}{N}\sum_{i=1}^{N} f(x_i)$ - the mean fitness of P(t).
This equation indicates that the GA encourages the multiplication of the schemata which are
better adapted than the average.

o Afterwards, crossovers and mutations are applied with the probabilities $p_c$ and $p_m$,
respectively. The resulting offspring are all inserted in the population of the
next generation.

o The probability of destroying an instance of H (contained by a parent) by means
of crossover is $\frac{\delta(H)}{l-1}$, so the survival probability of the instance results:
$$p_{sc} \ge 1 - p_c\,\frac{\delta(H)}{l-1}.$$
Crossover encourages the survival of the shortest schemata.

o Mutation can also destroy some instances of H.
The survival probability for the instances of H is:
$$p_{sm} = (1 - p_m)^{\omega(H)} \approx 1 - p_m\,\omega(H).$$
Mutation encourages the schemata with lower orders.


o Considering an encoding alphabet of size k and chromosomes of length l,
$(1 + k)^l$ different schemata can be produced.
Proof: a schema represents a string of size l, formed with any character of the
encoding alphabet or the character #.

o A chromosome of size l instantiates $2^l$ schemata, because each gene can be
interpreted as a constant value or as #. Therefore, a chromosomal chain of size l
ensures the existence of $2^l$ schemata.

o For a population of N chromosomes (each one having l genes), the number of
instantiated schemata is between $2^l$ and $N \cdot 2^l$, because distinct
chromosomes can also contain some common schemata.
o For a finite population, some schemata can have no instances - the best ones tend
to conquer the population.

o Even a small population contains rich information concerning the similarities of
the individuals.

o Holland proved that the number of the schemata which are efficiently processed
by the GA at a certain generation is about $N^3$, where N denotes the size of the
population.
- Bertoni and Dorigo argued that Holland's estimator is valid for populations
having a size proportional to $2^l$.

o However, a GA can implicitly analyze significantly more schemata than the number
of its individuals; this behavior is called implicit parallelism.


2.10.2. Banach's theorem regarding GA convergence

o Using Banach's fixed-point theorem, a useful result concerning GA convergence has been
obtained:
A GA which is capable of improving the mean performances of its
population at any successive generations converges towards a fixed
population (fixed point).
Therefore, for any initial population, after an infinite number of
generations, a final fixed population is obtained; this population includes
only optimal solutions.

Remarks:
o The theorem does not give any result concerning the convergence speed of the
algorithm.
Obviously, the convergence speed is influenced by the algorithm parameters
(the size of the population, the genetic operator probabilities, etc.) and by the
content of the initial population.
In real implementations, the number of generations is finite, too.

o The theorem requires the improvement of the mean performances of the
population. Therefore, we can validate the GA going to the next generation only
if this requirement is met. This can also involve reiterating the evolutionary
process at certain generations, until a better population is found.
2.9.3. Other results concerning GA convergence

Rudolph proved that a GA which performs the scaling-based selection of N parents,
and uses crossover and mutation, does not necessarily converge toward the global
optimum.
  Even if the expected number of optimal solutions tends towards values greater than
  1, the convergence towards the global optimum is not assured. The explanation is
  related to the fact that the probability of losing these points is not zero.
  Therefore, the Schema theorem does not guarantee the convergence toward the
  global optimum.

Rudolph also proved that a GA with elitist selection (which keeps the best solution within
the population) converges towards the global optimum.
  The requirement does not refer to the improvement of the mean performances of the
  population, yet to the survival of the best adapted individual, only.

Insightful explanations concerning the influence of selection and of the genetic operators were
delivered by Qi and Palmieri. Let us consider a GA working on infinite populations for
optimizing bounded, positive, unimodal objective functions with a finite number of
discontinuities.

o If the initial population covers (continuously) the whole exploration space, the
  scaling-based selection will encourage the clustering of the individuals towards the
  regions characterized by the highest fitness values. The density of the solutions is
  increased around the optimum point.
o So, the use of selection without genetic operators guarantees the convergence
  towards the global optimum.
o This convergence is also proved for GAs working on infinite populations (which
  continuously cover the search space) with scaling-based selection and mutation of
  low magnitude or rare occurrence.
When working with finite populations, the initial population does not include all
the potential solutions of the exploration space, so the action of the genetic operators
is crucial for refreshing the genetic material.

Also note that GAs involve a finite number of generations, so the convergence
speed is vital for the algorithm performances.
  This convergence speed depends on all the algorithm parameters.

2.10. Parallel GA
o Because GAs are time consuming, they are usually employed for offline
  applications. The execution time depends on the size of the population, the
  selection pressure, etc.

o Using smaller populations can lead to smaller execution times, at the cost of
  reduced accuracy.

o A more valuable approach for reducing the execution time without altering the
  other algorithm performances is to consider parallel implementations.

Three main directions can be distinguished: global GAs, migration-based GAs and
diffused GAs.
2.10.1. Global GAs

These approaches exploit the fact that some GA stages can be carried out in parallel on
different individuals or pairs of individuals.

  MASTER
    |--- SLAVE 1
    |--- ...
    |--- SLAVE k

  Master-slave architecture for global GAs

Example of master-slave architecture:

o master - for population initialization, fitness computation, selection and the general
  control of the population;
o slaves - for crossover, mutation, offspring evaluation.

Other parallel implementations can be considered
  - e.g. using systolic approaches.
2.10.2. Migration-based GAs (GAs with migration)

The population is divided into several subpopulations of equal sizes, which evolve
independently for a certain number of generations.
Periodically, some individuals are exchanged between the subpopulations.

One must indicate:
- when migration is allowed;
- the ratio of individuals which migrate (r% individuals of the subpopulation);
- which individuals migrate;
- which subpopulations interchange information.

initialization:
t=0; choose N random individuals for each subpopulation SbP(t);
repeat while t < No_Generations
  for each subpopulation SbP(t) execute separately:
    step 1: evaluate SbP(t);
    step 2: selection - fill the recombination pool of the subpopulation;
    step 3: crossover - generate offspring using the parents selected at step 2;
    step 4: mutation - apply small variations on the individuals obtained at step 3;
    step 5: evaluate the offspring resulted at step 4;
    step 6: insertion - create SbP(t+1), choosing N individuals from SbP(t) and from the
            offspring obtained at step 5;
  if migration is allowed:
    step 1: choose r% individuals from each subpopulation (the best ones) - for migration;
    step 2: establish the content of the subpopulations for the next generation, eliminating
            the less adapted host individuals;
  t=t+1;
end of the loop
display the best individual of the entire population;
end of the algorithm
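A compact Python sketch of this loop is given below; the problem-specific functions
fitness, select, crossover, mutate and insert are assumed to be defined elsewhere, and
ring migration is used:

def island_ga(subpops, no_generations, migration_period, r):
    for t in range(no_generations):
        for sp in subpops:                        # independent evolution
            pool = select(sp)                     # selection for recombination
            offspring = [mutate(crossover(a, b))
                         for a, b in zip(pool[::2], pool[1::2])]
            sp[:] = insert(sp, offspring)         # selection for survival
        if (t + 1) % migration_period == 0:       # migration is allowed
            k = max(1, int(r * len(subpops[0])))  # r% emigrants (the best ones)
            for i, sp in enumerate(subpops):      # ring: subpopulation i -> i+1
                emigrants = sorted(sp, key=fitness, reverse=True)[:k]
                host = subpops[(i + 1) % len(subpops)]
                host.sort(key=fitness)            # the worst hosts are replaced
                host[:k] = emigrants
    return max((x for sp in subpops for x in sp), key=fitness)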
Types of communications between the subpopulations:
- neighborhood migration (individuals move to adjacent subpopulations, only);
- ring migration (the subpopulations are organized in a ring; each one sends its emigrants
  to the next one);
- unrestricted migration (any subpopulation can exchange individuals with any other one).

The implementations lead to good results if the best individuals of each subpopulation
are encouraged to migrate.

 Within each subpopulation, the best adapted individuals have multiple
  instances.
 During migration, the worst individuals of the host subpopulation are replaced
  by well adapted solutions coming from other subpopulations; therefore, each
  subpopulation benefits from the experience of the other ones.

The emigrants can also be chosen from the offspring - more offspring are produced at the
generations which involve migration.
 Some offspring migrate to other subpopulations. Because they combine the
  genetic material of well adapted individuals, their genotype can be valuable for
  the host subpopulation.
GAs with migration lead to reduced execution times.
GAs with migration lead to better accuracy.

!!!!! Usually, a GA working on a single population having the size equal to the sum of the
subpopulations' sizes has worse results than the migration-based GA.

GAs with migration are suitable for multimodal optimization - each subpopulation can
evolve towards a distinct optimum point.

2.10.3. Diffused GAs (neighborhood-based, with fine granularity)

Unlike the island model, which establishes rigid boundaries between the isles, here the
population is treated as a whole.

Some constraints concerning the recombination of the individuals are
imposed: the mate can be a neighbor, only.
  The recombination is carried out as follows: each node receives copies of its
  neighbors and sends copies to them. One of the parents is the individual
  encoded in the node. The second parent is chosen from the received
  duplicates. A single offspring is produced and it competes with the individual
  of the node. One can also allow the implicit survival of the offspring, regardless of
  its fitness value.

The initial population is random, uniformly distributed over the exploration space.
After several generations, some clusters can be observed, indicating regions where the
nodes contain similar individuals.
  Better adapted individuals tend to be spread over the population, thus
  conquering a larger surface.

- this GA uses a local selection, in compliance with the natural model.

2.11. Benchmarks for GA evaluation

Used for empirical analysis.
!!!!!!! There is no objective function which permits the generalization of the analysis
concerning its optimization.
Usually, the benchmarks are less complex than engineering/industrial applications; they
contain:
  unimodal and multimodal objective functions;
  non-differentiable functions.

The functions should be scalable: the complexity of the optimization problem should be
tunable via some parameters.

A benchmark for constrained optimization can be found in [Michalewicz].
A good benchmark has been proposed by Bäck:

 quadratic (sphere) function:

  f_1(x) = \sum_{i=1}^{n} x_i^2;  x = [x_1 ..... x_n];

   Unimodal function (admitting a single optimum point); usually n = 2.

 stair function, resulted by the discretisation of the continuous quadratic function in terms of
  its output values.
   Discontinuities, multiple local optimum points.

 Ackley:

  f_3(x) = -c_1 \exp\left(-c_2 \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\right) - \exp\left(\frac{1}{n}\sum_{i=1}^{n} \cos(c_3 x_i)\right) + c_1 + e;  x = [x_1 ..... x_n];

  usually: c_1 = 20; c_2 = 0.2; c_3 = 2\pi; n < 30; x_i \in [-20; 30]
   Multimodal function.

 Fletcher & Powell:

  f_4(x) = \sum_{i=1}^{n} (A_i - B_i)^2;  x = [x_1 ..... x_n];

  A_i = \sum_{j=1}^{n} (a_{ij}\sin\alpha_j + b_{ij}\cos\alpha_j),  B_i = \sum_{j=1}^{n} (a_{ij}\sin x_j + b_{ij}\cos x_j),

  a_{ij}, b_{ij} \in (-100, 100); \alpha_j \in (-\pi, \pi); x_i \in (-\pi, \pi); n < 30
   Multimodal, non-symmetric function. Very complex optimization.

 fractal function:

  f_5(x) = \sum_{i=1}^{n} (C'(x_i) + x_i^2 - 1);  x = [x_1 ..... x_n];

  C'(x_i) = \begin{cases} \dfrac{C(x_i)}{C(1)\,|x_i|^{2-D}}, & \text{for } x_i \neq 0 \\ 1, & \text{for } x_i = 0 \end{cases},  C(x_i) = \sum_{j=-\infty}^{\infty} \frac{1 - \cos(b^j x_i)}{b^{(2-D)j}},

  D = 1.85; b = 1.5; n < 20; x_i \in (-5, 5)
   Non-differentiable function. Very complex optimization.
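For reference, hedged Python versions of two of these benchmarks (the quadratic/sphere
function and Ackley, with the usual constants) are sketched below:

import numpy as np

def sphere(x):
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))                    # f1: unimodal, minimum 0 at the origin

def ackley(x, c1=20.0, c2=0.2, c3=2.0 * np.pi):
    x = np.asarray(x, dtype=float)
    n = x.size
    term1 = -c1 * np.exp(-c2 * np.sqrt(np.sum(x ** 2) / n))
    term2 = -np.exp(np.sum(np.cos(c3 * x)) / n)
    return float(term1 + term2 + c1 + np.e)         # f3: multimodal, minimum 0 at the origin

print(sphere([0.0, 0.0]), ackley(np.zeros(10)))     # both evaluate to ~0 at the optimum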
CHAPTER 3. ARTIFICIAL NEURAL NETWORKS

3.1. Artificial neuron
= the basic computational unit of artificial neural networks (ANN)

Notations:
  p_1, ...., p_R - neuron inputs;
  w_1, w_2, ..... w_R - weights of the incoming links;
  b - bias;
  f: R -> R, y = f(n) - activation function, usually nonlinear.

Components:
- synapses or links, characterized by weights (also called strengths);
- summing block and activation function (the activation function is usually nonlinear).

Input-output mapping: static model

The output of the model is computed as follows:

  y = f(n),

where the input of the activation function, called activation (n), is:

  n = w_1 p_1 + ..... + w_R p_R + b \cdot 1 = [w_1 ... w_R][p_1 ... p_R]^T + b = Wp + b,

or

  n = \sum_{i=1}^{R} w_i p_i + b,

with b \in R, W \in R^{1 x R}, p \in R^{R x 1}.
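A direct transcription of these equations in Python (illustrative values only):

import numpy as np

def neuron(p, W, b, f=np.tanh):
    """y = f(Wp + b) for a single neuron; W is 1 x R, p is R x 1, b is a scalar."""
    n = float(W @ p + b)      # activation
    return f(n)

y = neuron(p=np.array([0.5, -1.0]), W=np.array([0.8, 0.3]), b=0.1)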
b can be treated as a supplementary weight:
  w_0 = b, for the input p_0 = 1:

  n = \sum_{i=0}^{R} w_i p_i = \tilde{W}\tilde{p},

notations:
  \tilde{W} = [w_0 \; w_1 \; ... \; w_R] - extended weight vector,
  \tilde{p}^T = [1 \; p_1 \; ... \; p_R] - (transposed) extended input vector.

So, the diagram can be redrawn with the supplementary input p_0 = 1 weighted by w_0 = b.

Usually: y \in [0, 1] or y \in [-1, 1].

Another alternative for the extended weight vector:

  \tilde{W} = [w_1 \; ... \; w_R \; b],  \tilde{p}^T = [p_1 \; ... \; p_R \; 1],

  n = \sum_{i=1}^{R+1} w_i p_i = \tilde{W}\tilde{p}.
Comparison between the artificial neuron (AN) and the biological one (BN):

1) AN admits negative weights !!! (unlike BN):
   positive weights - for excitatory effect,
   negative or null weights - for inhibitory effect.

2) time constants: BN ~ 1 msec, AN ~ 1 nsec
   => fewer links and fewer neurons are needed for an ANN.

3) energetic efficiency: BN ~ 10^{-16} J/sec per operation, AN ~ 10^{-6} J/sec per operation.
4) BNN works asynchronously, without a master clock (continuous time domain).
5) BNN involves random connectivity; ANN uses a specified connectivity.
6) BNN are tolerant to errors.

Typical activation functions:

Deterministic functions

1) hard limiter:

   y = f(n) = \begin{cases} 1, & n \geq 0 \\ 0, & n < 0 \end{cases}

   In terms of nn = \sum_{i=1}^{R} w_i p_i, the threshold sits at nn = -b.
   The symmetric hard limiter outputs +1 for n \geq 0 and -1 for n < 0.

2) linear: y = f(n) = n.

3) sigmoid:

   y = f(n) = \frac{1}{1 + \exp(-cn)}, c > 0.

   For c = 1, the output grows smoothly from 0 to 1, crossing 0.5 at n = 0.

   [Figure: sigmoid responses of a single-input neuron for the four sign combinations
   w = \pm 2, b = \pm 3; at the inflection point a = 0.5, p = -b/w, and the tangent to the
   graph has slope w/4.]

4) hyperbolic tangent:

   y = f(n) = \frac{1 - \exp(-2cn)}{1 + \exp(-2cn)}, c > 0.

   For c = 1, the output grows smoothly from -1 to +1, crossing 0 at n = 0.

5) Gaussian function:

   y = f(n) = e^{-(n-c)^2} >> see RBF (other input-output mapping).
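The deterministic activation functions above can be transcribed in Python as follows
(c = 1 by default; a sketch, for illustration):

import numpy as np

def hard_limiter(n):    return np.where(n >= 0, 1.0, 0.0)
def linear(n):          return n
def sigmoid(n, c=1.0):  return 1.0 / (1.0 + np.exp(-c * n))
def tanh_act(n, c=1.0): return (1.0 - np.exp(-2*c*n)) / (1.0 + np.exp(-2*c*n))
def gaussian(n, c=0.0): return np.exp(-(n - c) ** 2)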

3.2. ANN architectures

The ANN structure allows a parallel and distributed processing.

Each ANN can be represented by a directed graph.

The nodes of the graph correspond to neurons.

The nodes are connected by links which ensure unidirectional and instant
communication.

A processing unit (neuron) admits any number of input links.

A processing unit has local memory.

A processing unit can be modeled in terms of an input-output formalism.

The neurons are organized in layers.

Within a layer, the neurons are considered to work in parallel.

[Figure: a generic ANN with an input layer (inputs u_1, ..., u_m), hidden layers and an
output layer (outputs y_1, ..., y_k).]

Legend:
- lateral links (between the nodes of the same layer);
- feedback links (from the output of a neuron to its input);
- backward links (to the neurons of the previous layers);
- feedforward links (to the neurons of the next layers).

Remark:
- the input layer does not perform any processing; it is not counted;
- the hidden layers and the output layer include neurons.

Types of ANN:
 feed-forward: with feed-forward links, only;
 dynamic/recurrent: contain at least one lateral, backward or feedback link.

Example 1: Feed-forward architectures with one layer

o Only feed-forward links!!!
o Input layer: m inputs, no processing.
o Output layer: k neurons, characterized by

  y_l = f_l(w_{l,1}u_1 + ... + w_{l,m}u_m + b_l), l = 1..k.

[Figure: each output neuron l receives all the inputs u_1, ..., u_m through the weights
w_{l,1}, ..., w_{l,m}, adds the bias b_l and applies the activation function f_l.]

Let us assume that all activation functions are identical.

One can write, for l = 1..k:

  y_l = f(w_{l,1}u_1 + ..... + w_{l,m}u_m + b_l),

so

  y = f(\tilde{W}\tilde{u}),

with

  \tilde{W} = \begin{bmatrix} w_{1,1} & ... & w_{1,m} & b_1 \\ ... & ... & ... & ... \\ w_{l,1} & ... & w_{l,m} & b_l \\ ... & ... & ... & ... \\ w_{k,1} & ... & w_{k,m} & b_k \end{bmatrix} = extended weights matrix
  (row l collects the extended weights of neuron l),

  \tilde{u} = [u_1 ... u_m \; 1]^T = extended input vector,

  y = [y_1 ... y_k]^T = output vector.

Remark:
  \tilde{W} = [w_{i,j}], i = 1..k, j = 1..m+1;
  for w_{i,j}: - the first index indicates the neuron,
               - the second index indicates the link.

Simplified diagram for feedforward ANNs with 1 layer:

[Figure: the inputs u_1, ..., u_m feed Neuron 1, ..., Neuron k of the output layer, which
produce y_1, ..., y_k.]

Remark: The layers can be
 fully connected - all feedforward links are used,
 partially connected - some feedforward connections are missing.

Example 2: Feed-forward architectures with two layers

[Figure: the inputs u_1, ..., u_m feed the hidden layer (layer 1, s neurons); the hidden
outputs y^1_1, ..., y^1_s feed the output layer (layer 2, k neurons), which produces
y^2_1, ..., y^2_k. The weight w^1_{i,j} links the input j to the hidden neuron i; the weight
w^2_{i,j} links the hidden neuron j to the output neuron i; b^1_i, b^2_i are the biases.]

Identical activation functions within each layer!!!!

- Layer 1:

  y^1 = f^1(\tilde{W}^1\tilde{u}), with

  y^1 = [y^1_1 ... y^1_s]^T \in R^{s x 1},

  \tilde{W}^1 = \begin{bmatrix} w^1_{1,1} & .. & w^1_{1,m} & b^1_1 \\ .. & .. & .. & .. \\ w^1_{s,1} & .. & w^1_{s,m} & b^1_s \end{bmatrix} \in R^{s x (m+1)},

  \tilde{u} = [u_1 ... u_m \; 1]^T \in R^{(m+1) x 1}.

- Layer 2:

  y^2 = f^2(\tilde{W}^2\tilde{y}^1), with

  y^2 = [y^2_1 ... y^2_k]^T \in R^{k x 1},

  \tilde{W}^2 = \begin{bmatrix} w^2_{1,1} & .. & w^2_{1,s} & b^2_1 \\ .. & .. & .. & .. \\ w^2_{k,1} & .. & w^2_{k,s} & b^2_k \end{bmatrix} \in R^{k x (s+1)},

  \tilde{y}^1 = [y^1_1 ... y^1_s \; 1]^T \in R^{(s+1) x 1}.

Remark:
  Upper index: the layer.
  Lower indexes: the first = the neuron; the second = the link of the neuron.
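The forward pass of the two-layer network can be written compactly with the extended
matrices defined above (a sketch; W1 is s x (m+1), W2 is k x (s+1), the last column of
each matrix holding the biases):

import numpy as np

def forward(W1, W2, u, f1=np.tanh, f2=lambda v: v):
    u_ext  = np.append(u, 1.0)       # extended input vector
    y1     = f1(W1 @ u_ext)          # hidden layer (layer 1)
    y1_ext = np.append(y1, 1.0)      # extended hidden output
    return f2(W2 @ y1_ext)           # output layer (layer 2)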

[Simplified diagram: inputs u_1, ..., u_m -> hidden layer (neurons 1..s, layer 1, outputs
y^1_1, ..., y^1_s) -> output layer (neurons 1..k, layer 2, outputs y^2_1, ..., y^2_k).]

Example 3: Feed-forward architectures with two layers - simplified diagram

[Figure: the same two-layer structure, drawn in compact form; the q^{-1} blocks denote
unit delays, used only in recurrent variants.]

ANN architecture =
  number of inputs and number of outputs,
  number of layers,
  number of neurons within each layer,
  map of links,
  type of the activation functions.

ANN parameters =
  for sigmoid/linear/step neurons: weights, biases;
  for Gaussian neurons: centers, spreads.

3.3. Multi-layer Perceptron (MLP)

MLP architecture

[Figure: the standard two-layer MLP; the hidden layer (layer 1, s neurons, with activations
v^1_i and outputs y^1_i) is followed by the output layer (layer 2, k neurons, with activations
v^2_i and outputs y^2_i).]

Characteristics:

o The layers are linked in series: the outputs of the neurons belonging to a layer are inputs for
  the neurons of the next layer.

o Within a layer, the neurons work in parallel.

o All the neurons have sigmoidal-type activation functions (linear, sigmoid, tanh).
o The MLP can have any number of hidden layers.

Criteria for learning algorithms based on error correction

Let us consider k neurons within the output layer.

1. On-line
The training samples (u(i), d(i)), i = 1..N are presented in sequence, one sample per iteration (the
number of iterations = multiple of N).

  I(n) = 0.5 \sum_{i=1}^{k} e_i^2(n), with e_i(n) = d_i(n) - y_i(n) = the error of the i-th output neuron.

2. Batch
All the training samples (u(i), d(i)), i = 1..N are presented at a single iteration.

  I(n) = \frac{1}{2N} \sum_{j=1}^{N} \sum_{i=1}^{k} e_i^2(n, j), with e_i(n, j) = the error of the i-th output neuron for the j-th training
  sample presented at the n-th epoch.

Backpropagation learning algorithm

= the steepest descent method (gradient):

  w^l_{ij}(n+1) = w^l_{ij}(n) - \eta \frac{\partial I}{\partial w^l_{ij}}(n) = w^l_{ij}(n) + \Delta w^l_{ij}(n),

  \eta > 0 - influences the convergence speed.

Overview - steps carried out at each epoch:

 Evaluate the ANN output and the error: feedforward IN -> OUT.

 Adapt the parameters: backward OUT -> IN (backpropagation).

- For online learning (I(n) = 0.5 \sum_{i=1}^{k} e_i^2(n)):

  Parameter variation
  = learning rate (\eta) x local gradient (\delta) x input (corresponding to the link).

- For batch learning (I(n) = \frac{1}{2N} \sum_{j=1}^{N} \sum_{i=1}^{k} e_i^2(n, j)),
  where e_i(n, j) = the error of the i-th output neuron for the j-th training sample presented at the n-th
  epoch:

  \Delta w^l_{ik}(n) = \frac{1}{N} \sum_{j=1}^{N} \Delta w^l_{ik}(n, j) - the mean of the variations separately computed for each sample.

Backpropagation adaptation equations

For the sake of simplicity, online learning is considered: I(n) = 0.5 \sum_{i=1}^{k} e_i^2(n).

1. For the output layer (denoted l)

Let us consider k output neurons and s neurons in the preceding layer.

  e_i(n) = d_i(n) - y^l_i(n) - the error produced by the output neuron i;

  v^l_i(n) = \sum_{j=0}^{s} w^l_{ij} y^{l-1}_j, with w^l_{i0} = b^l_i and y^{l-1}_0 = 1.

For a certain sample:

  \frac{\partial I}{\partial w^l_{ij}}(n) = \frac{\partial I}{\partial e_i}(n) \frac{\partial e_i}{\partial y^l_i}(n) \frac{\partial y^l_i}{\partial v^l_i}(n) \frac{\partial v^l_i}{\partial w^l_{ij}}(n)

  \frac{\partial I}{\partial w^l_{ij}}(n) = e_i(n) \cdot (-1) \cdot f'_i(v^l_i(n)) \cdot y^{l-1}_j(n) = -\delta^l_i(n) y^{l-1}_j(n)

  \Delta w^l_{ij}(n) = \eta \, \delta^l_i(n) \, y^{l-1}_j(n),

with

  \delta^l_i = e_i(n) f'_i(v^l_i(n)) = -\frac{\partial I}{\partial v^l_i}(n) = local gradient.

Parameter variation
= learning rate (\eta) x local gradient (\delta) x input (corresponding to the link).

2. For the hidden layers

Problem: find the contribution of a hidden neuron to the total error.

The parameters are adapted starting from the output layer towards the input layer.

Considering the hidden layer l, the local gradients within the layers l+1, l+2, ... etc. must be
available from previous computations.

The output of the neuron i belonging to layer l is input for the neurons belonging to layer l+1.

For simplicity: the layer l+1 is considered the output layer.

Layer l+1: with k neurons

  y^{l+1}_z(n) = f_{l+1}(v^{l+1}_z(n)), z = 1..k,

  v^{l+1}_z(n) = \sum_{j=0}^{s} w^{l+1}_{z,j} y^l_j(n),

  s = the number of input connections of a neuron of layer l+1 (the number of neurons within the
  previous layer, l), w^{l+1}_{z,0} = b^{l+1}_z, y^l_0 = 1.

  \delta^{l+1}_z - known for z = 1..k.

Layer l: with s neurons

  y^l_i(n) = f_l(v^l_i(n)), i = 1..s,

  v^l_i(n) = \sum_{j=0}^{q} w^l_{i,j} y^{l-1}_j(n),

  q = the number of input connections of neuron i (the number of neurons belonging to the previous
  layer), w^l_{i,0} = b^l_i, y^{l-1}_0 = 1.

If l is the first hidden layer (l = 1), then y^{l-1}_i(n) = u_i(n).

  \frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} \frac{\partial I}{\partial e^{l+1}_z}(n) \frac{\partial e^{l+1}_z}{\partial w^l_{i,j}}(n)

  \frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} e^{l+1}_z(n) \frac{\partial e^{l+1}_z}{\partial y^{l+1}_z}(n) \frac{\partial y^{l+1}_z}{\partial v^{l+1}_z}(n) \frac{\partial v^{l+1}_z}{\partial y^l_i}(n) \frac{\partial y^l_i}{\partial v^l_i}(n) \frac{\partial v^l_i}{\partial w^l_{i,j}}(n)

  \frac{\partial I}{\partial w^l_{i,j}}(n) = \sum_{z=1}^{k} e^{l+1}_z(n) \cdot (-1) \cdot f'_{l+1}(v^{l+1}_z(n)) \cdot w^{l+1}_{z,i} \cdot f'_l(v^l_i(n)) \cdot y^{l-1}_j(n)

  \frac{\partial I}{\partial w^l_{i,j}}(n) = -y^{l-1}_j(n) \, f'_l(v^l_i(n)) \sum_{z=1}^{k} \delta^{l+1}_z w^{l+1}_{z,i}.

Therefore:

  \Delta w^l_{i,j}(n) = \eta \, y^{l-1}_j(n) \, \delta^l_i(n),

with

  \delta^l_i(n) = \left( \sum_{z=1}^{k} \delta^{l+1}_z w^{l+1}_{z,i} \right) f'_l(v^l_i(n)) = local gradient.

Parameter variation
= learning rate (\eta) x local gradient (\delta) x input (corresponding to the link).
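For one hidden sigmoid layer and a linear output layer, the delta equations above give
the following minimal online learning step (a sketch, with illustrative learning rate):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(W1, W2, u, d, eta=0.1):
    u_ext  = np.append(u, 1.0)
    v1     = W1 @ u_ext
    y1     = sigmoid(v1)
    y1_ext = np.append(y1, 1.0)
    y2     = W2 @ y1_ext                      # linear output layer
    e      = d - y2
    delta2 = e                                # f' = 1 for linear output neurons
    # hidden local gradients: (sum_z delta2_z * w2_{z,i}) * f'(v1_i)
    delta1 = (W2[:, :-1].T @ delta2) * y1 * (1.0 - y1)
    W2 += eta * np.outer(delta2, y1_ext)      # delta w = eta * delta * input
    W1 += eta * np.outer(delta1, u_ext)
    return W1, W2, float(0.5 * e @ e)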

Remarks

1) The derivatives of the activation functions

o Sigmoid:

  f(v) = \frac{1}{1 + \exp(-av)}, a > 0  =>  f'(v) = \frac{a \exp(-av)}{[1 + \exp(-av)]^2} = a f(v) [1 - f(v)]

o Hyperbolic tangent:

  f(v) = a \tanh(bv), a, b > 0  =>  f'(v) = \frac{b}{a} [a - f(v)] [a + f(v)]

2) Learning rate \eta > 0

For small values: low convergence speed; a quite smooth trajectory is followed within the
search space.

For large values: risk of unstable behavior.

Improvements:

2a) use inertial back-propagation (with momentum);

2b) use a distinct learning rate for each link.

2a) use inertial back-propagation (with momentum) - explanations

Generalized delta rule:

  \Delta w^l_{ij}(n) = \alpha \Delta w^l_{ij}(n-1) + \eta \delta^l_i(n) y^{l-1}_j(n), \alpha > 0 - momentum constant

  \Delta w^l_{ij}(n) = \eta \sum_{t=0}^{n} \alpha^{n-t} \delta^l_i(t) y^{l-1}_j(t) = -\eta \sum_{t=0}^{n} \alpha^{n-t} \frac{\partial I}{\partial w^l_{ij}}(t),

  \Delta w^l_{ij}(n) = -\eta \left[ \alpha^n \frac{\partial I}{\partial w^l_{ij}}(0) + \alpha^{n-1} \frac{\partial I}{\partial w^l_{ij}}(1) + ... + \frac{\partial I}{\partial w^l_{ij}}(n) \right].

\alpha \in [0, 1] has a stabilizing effect:

 when \frac{\partial I}{\partial w^l_{ij}}(t) keeps its sign at successive iterations, the absolute value of \Delta w^l_{ij} increases;

 when \frac{\partial I}{\partial w^l_{ij}}(t) changes its sign at successive iterations, the absolute value of \Delta w^l_{ij} decreases.

3) Online or batch learning?

- For online learning: the training samples must be randomly presented, to avoid cycling.

Online learning:

 Reduced memory consumption.

 Faster learning for large training data sets.

 Convergence hard to analyze (the samples must be randomly presented, for avoiding the
  stagnation in local optima).

 Good results for training data sets containing similar samples.

4) The initialization of weights

- The result is dependent on the initial ANN parameters.

o Use uniformly distributed or normally distributed random values (mean 0, spread chosen
  to avoid the saturation of the neurons).

5) Stop criteria
- only some recommendations can be made:

Recommendations:
o The norm of the gradient becomes close to 0.
  Disadvantage: numerous epochs can be involved.
o The variation of the criterion I becomes insignificant.
  Disadvantage: premature stop.

6) Efficient exploitation of the training samples

- For online learning, the successive samples should be different.
  When the training samples are randomly presented, this condition is frequently
  met.
- Outliers can impede the convergence and can lead to bad generalization capabilities.

7) The learning is faster for antisymmetric activation functions

  f(-v) = -f(v)

Ex:

 Symmetric hard limiter.
 Hyperbolic tangent: f(v) = a \tanh(bv),
  recommended values (LeCun): a = 1.7159, b = 2/3, with \frac{\partial f}{\partial v}(0) \approx 1.14.

8) Learning rate

The neurons should learn at the same speed.

- Usually, the gradients in the output layer are bigger, so \eta should be smaller for the output
  neurons.

- The neurons having more links can work with a smaller \eta.

  LeCun suggests: \eta = \frac{1}{\sqrt{m}}, m = the number of input links for a certain neuron.

ATTENTION!!! - Generalization capacity

 Training = approximation in terms of the training data set.
 Generalization = approximation in terms of another data set (test / validation).
 If N is too large or the ANN architecture is too complex, the model results overfitted.

 => select the simplest function possible,
    if there is no information to invalidate this choice.

Applications of MLP - Function approximation

MLP = universal approximator

Theorem:
Any continuous bounded function can be approximated with any desired degree of accuracy
\varepsilon > 0, by means of an MLP containing:

o a hidden layer with m neurons, characterized by continuous, bounded, monotonic
  activation functions;
o an output layer with a linear neuron (or a sigmoidal neuron working within its linear
  region):

  F(u) = \sum_{i=1}^{m} \alpha_i f\left( \sum_{j=1}^{R} w_{ij} u_j + b_i \right),

  m = the number of hidden neurons,
  R = the number of inputs.

Remarks regarding the content of this theorem:

- the MLP existence is guaranteed;

- the theorem does not give any indication concerning the resulted generalization capacity of the
  model or the time requested for learning;

- the optimal structure is not given.

Remarks regarding the applicability of this theorem:

- the value of m:
  o if m is small, the empiric risk is lower (reduced risk to learn the noise captured by the
    training samples);
  o if m is large, a good accuracy can be obtained;
- when a single hidden layer is used:
  o the parameters of the neurons tend to interact: the approximation of some samples can
    be improved solely by accepting a worse approximation for other samples.
  For ANNs with 2 hidden layers: the hidden layer 1 extracts the local properties;
                                 the hidden layer 2 extracts the global properties.

3.4. ANN with Radial Basis Functions - RBF

The neuron of RBFs
The structure of the hidden neuron of an RBF network:

[Figure: the inputs p_1, ..., p_R are compared with the centers c_1, ..., c_R; the distance
n = ||p - c|| is passed through the activation function y = f(n).]
>> see the MATLAB demo demorb1

  y = f(||p - c||) = f\left(\sqrt{(p_1 - c_1)^2 + ... + (p_R - c_R)^2}\right),

  p = [p_1 ... p_R]^T, c = [c_1 ... c_R]^T,
  c = center vector (a center for each input connection).

Usually, the Gaussian activation function is used:

  y = \exp\left(-\frac{||p - c||^2}{2\sigma^2}\right) = \exp\left(-\frac{(p_1 - c_1)^2 + ... + (p_R - c_R)^2}{2\sigma^2}\right),

  c = vector of centers for a hidden neuron,
  \sigma = spread.
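The Gaussian radial-basis neuron, transcribed in Python (illustrative):

import numpy as np

def rbf_neuron(p, c, sigma):
    """y = exp(-||p - c||^2 / (2 sigma^2))."""
    dist2 = np.sum((np.asarray(p, float) - np.asarray(c, float)) ** 2)
    return float(np.exp(-dist2 / (2.0 * sigma ** 2)))

print(rbf_neuron([1, 1], [1, 1], 0.5))   # 1.0: the input equals the center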

Remarks:

 The neuron is activated only if the input (vector) is similar to the center (vector).
  o The accepted similitude level is given by \sigma.
  o If \sigma is large, the neuron is activated even for a reduced similitude between the inputs and the centers.

 For inputs which are very dissimilar to the centers, the neuron is inactive:
  y \approx 0, for ||p - c|| >> 0, i.e. p, c very different.

Comparison between the Gaussian neuron and the perceptron:

 For the perceptron, p_j has a more significant influence on the activation of the neuron
  if the absolute value of p_j w_j is larger.

RBF architecture
The standard architecture includes:
 an output linear neuron;
 a single hidden layer with s Gaussian neurons.
- Because a single hidden layer is considered, the upper index will be deleted for most of
  the notations (it was only kept for making the distinction between the linear and the radial
  basis activation functions).

[Figure: the inputs u_1, ..., u_m feed s hidden radial-basis neurons (with centers
c_{i1}, ..., c_{im} and outputs y_i = f_1(n_i)); the output neuron computes
y = f_2(w_1 y_1 + ... + w_s y_s + b).]

  y = f_2(w_1 y_1 + .. + w_s y_s + b) = w_1 y_1 + .. + w_s y_s + b = \sum_{i=1}^{s} w_i y_i + b,

  y_i = f_1(||u - c_i||) = f_1\left(\sqrt{(u_1 - c_{i1})^2 + ... + (u_m - c_{im})^2}\right),

  u = [u_1 ... u_m]^T, c_i = [c_{i1} ... c_{im}]^T,
  c_i = center vector for the hidden neuron i.

For Gaussian activation functions within the hidden layer:

  y = b + \sum_{i=1}^{s} w_i \exp\left(-\frac{||u - c_i||^2}{2\sigma_i^2}\right) = b + \sum_{i=1}^{s} w_i \exp\left(-\frac{(u_1 - c_{i1})^2 + ... + (u_m - c_{im})^2}{2\sigma_i^2}\right),

  c_i = center vector for the hidden neuron i,
  \sigma_i = spread for the hidden neuron i.

RBF = universal approximator

RBF for classification problems

Cover's Theorem for pattern classification:
A complex classification problem (nonlinearly separable) has great chances to become linearly
separable via a nonlinear mapping to a space of high dimension.

Let us consider the samples u(i) = [u_1(i) .. u_m(i)]^T belonging to R^m
(e.g. the training samples).

Let us consider f: R^m -> R^s, s large, with f(u) = [f_1(u) ... f_s(u)]^T, f_1, ..., f_s: R^m -> R
(e.g. f_1, ..., f_s indicate the mappings provided by s hidden neurons).

Definition
The classes C_1, C_2 are f-separable, if there exists w = [w_1 .. w_s]^T \in R^s with:

  w^T f(u) > 0, for u \in C_1,
  w^T f(u) \leq 0, for u \in C_2.

Remarks:
- according to Cover's theorem: choose a large s and non-linear f_1, ..., f_s;
- the hyper-plane delimitating the classes is given by w^T f(u) = 0;
- the functions f_i could be radial basis ones.

Example: XOR problem

Classify the samples: u(1) = [1 1]^T \in C_1, u(2) = [0 1]^T \in C_2, u(3) = [1 0]^T \in C_2, u(4) = [0 0]^T \in C_1.

Let us consider:

  f_1(u) = \exp(-||u - [1\;1]^T||^2) = \exp(-(u_1 - 1)^2 - (u_2 - 1)^2),

  f_2(u) = \exp(-||u - [0\;0]^T||^2) = \exp(-u_1^2 - u_2^2).

Knowing the input samples, it results:

  f_1(u(1)) = 1,    f_2(u(1)) = 0.13
  f_1(u(2)) = 0.36, f_2(u(2)) = 0.36
  f_1(u(3)) = 0.36, f_2(u(3)) = 0.36
  f_1(u(4)) = 0.13, f_2(u(4)) = 1

[Figure: in the plane (f_1(u), f_2(u)), the mapped samples become linearly separable.]
RBF for function approximation (interpolation)

Find F: R^m -> R accepting (u(i), d(i)), i = 1..N,
with u(i) = [u_1(i) .. u_m(i)]^T \in R^m and d(i) \in R,
d(i) = F(u(i)) = the desired output of the function corresponding to the input u(i)
(these samples could be used for training).

Find the interpolation:

  F(u) = \sum_{i=1}^{N} w_i f_i(||u - u(i)||):

 the number of radial basis functions = the number of the training samples;
 the functions f_i accept the centers c_i = u(i).

The radial basis functions could be chosen as follows:

a) f_i(u) = \sqrt{||u - c_i||^2 + q_i^2}, q_i > 0: non-local, unbounded;

b) f_i(u) = \frac{1}{\sqrt{||u - c_i||^2 + q_i^2}}, q_i > 0: local, bounded;

c) f_i(u) = \exp\left(-\frac{||u - c_i||^2}{2\sigma_i^2}\right): local, bounded.
Knowing that d(i) = F(u(i)), it results:

  \begin{bmatrix} f_1(u(1)) & .. & f_N(u(1)) \\ .. & .. & .. \\ f_1(u(N)) & ... & f_N(u(N)) \end{bmatrix} \begin{bmatrix} w_1 \\ .. \\ w_N \end{bmatrix} = \begin{bmatrix} d(1) \\ .. \\ d(N) \end{bmatrix}.

Let us consider:

  \Phi = \begin{bmatrix} f_1(u(1)) & .. & f_N(u(1)) \\ .. & .. & .. \\ f_1(u(N)) & ... & f_N(u(N)) \end{bmatrix} = interpolation matrix.

Using this notation, the equation can be rewritten as follows:

  \Phi [w_1 .. w_N]^T = [d(1) .. d(N)]^T  =>  [w_1 .. w_N]^T = \Phi^{-1} [d(1) .. d(N)]^T - if \Phi is nonsingular.

Michelli's Theorem (1986)

If the functions f_i are radial and the samples u(i) \in R^m are distinct,
then \Phi is nonsingular.

Remarks:

o For f_i of types b) and c), \Phi is positive definite.

o For f_i of type a), \Phi admits N-1 positive eigenvalues and one negative eigenvalue.

Remarks:

o large N (many samples) => many radial basis functions => complex model (over-fitting);
o large N (many samples) => risk of a poorly conditioned interpolation matrix and large execution
  times.
o It is desirable to use fewer radial basis functions than training samples:
  s < N, s = the number of hidden neurons.

Instead of

  F(u) = \sum_{i=1}^{N} w_i f_i(||u - u(i)||),

one has to consider

  F(u) = b + \sum_{i=1}^{s} w_i f_i(||u - c_i||):

 The centers of the radial basis functions and the input training samples are different.

 The output neuron accepts a nonzero bias.

Knowing that d(i) = F(u(i)), i = 1..N, it results:

  \begin{bmatrix} f_1(u(1)) & .. & f_s(u(1)) & 1 \\ .. & .. & .. & .. \\ f_1(u(N)) & ... & f_s(u(N)) & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ .. \\ w_s \\ b \end{bmatrix} = \begin{bmatrix} d(1) \\ .. \\ d(N) \end{bmatrix}.

Let us denote:

  G = \begin{bmatrix} f_1(u(1)) & .. & f_s(u(1)) & 1 \\ .. & .. & .. & .. \\ f_1(u(N)) & ... & f_s(u(N)) & 1 \end{bmatrix} \in R^{N x (s+1)}.

Therefore, it results:

  G [w_1 .. w_s \; b]^T = [d(1) .. d(N)]^T  =>  [w_1 .. w_s \; b]^T = G^+ [d(1) .. d(N)]^T,

  G^+ = (G^T G)^{-1} G^T - the pseudo-inverse.
Example: Revisit the XOR classification problem

Classify the samples: u(1) = [1 1]^T \in C_1, u(2) = [0 1]^T \in C_2, u(3) = [1 0]^T \in C_2, u(4) = [0 0]^T \in C_1.

Let us consider:

  f_1(u) = \exp(-||u - [1\;1]^T||^2) = \exp(-(u_1 - 1)^2 - (u_2 - 1)^2),

  f_2(u) = \exp(-||u - [0\;0]^T||^2) = \exp(-u_1^2 - u_2^2).

For the above mentioned input training samples it results:

  f_1(u(1)) = 1,    f_2(u(1)) = 0.13
  f_1(u(2)) = 0.36, f_2(u(2)) = 0.36
  f_1(u(3)) = 0.36, f_2(u(3)) = 0.36          G = \begin{bmatrix} 1 & 0.13 & 1 \\ 0.36 & 0.36 & 1 \\ 0.36 & 0.36 & 1 \\ 0.13 & 1 & 1 \end{bmatrix}.
  f_1(u(4)) = 0.13, f_2(u(4)) = 1

Let us define: d(1) = 0, d(2) = 1, d(3) = 1, d(4) = 0.

Therefore, it results:

- G^+ (given by the MATLAB function pinv):

  G^+ = \begin{bmatrix} 1.7942 & -1.2195 & -1.2195 & 0.6448 \\ 0.6448 & -1.2195 & -1.2195 & 1.7942 \\ -0.8780 & 1.3780 & 1.3780 & -0.8780 \end{bmatrix}

  [w_1 \; w_2 \; b]^T = G^+ [0 \; 1 \; 1 \; 0]^T = [-2.439 \; -2.439 \; 2.7561]^T.
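The weights above can be reproduced with numpy's pseudo-inverse (a quick check of the
numerical result):

import numpy as np

G = np.array([[1.00, 0.13, 1.0],
              [0.36, 0.36, 1.0],
              [0.36, 0.36, 1.0],
              [0.13, 1.00, 1.0]])
d = np.array([0.0, 1.0, 1.0, 0.0])
w1, w2, b = np.linalg.pinv(G) @ d
print(w1, w2, b)    # approximately -2.439, -2.439, 2.756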

Theorem:

Any continuous bounded function F: R^m -> R can be approximated with any desired degree of
accuracy by means of:

  F(u) = b + \sum_{i=1}^{s} w_i f\left(\frac{u - c_i}{\sigma}\right), \sigma > 0, if

f: R^m -> R is bounded and \int_{R^m} |f(u)| du < \infty.

 The requirements imposed by this theorem are met by the radial basis functions b), c).

 The radial basis function a) can be used for s = N, only.

 f is not necessarily symmetric !!!!!

For Gaussian functions:

  F(u) = b + \sum_{i=1}^{s} w_i \exp\left(-\frac{||u - c_i||^2}{2\sigma^2}\right), when the same spread is employed for all the hidden neurons.

Recommendation: choose s = \sqrt[3]{N}.

Remark:
the ANN with hidden radial basis activation functions and a linear output neuron is compliant
with the requirements of the previous theorem.

- training = optimization carried out in terms of the training data set:

  step 1: select the centers and the spread;

  step 2: assuming the centers and the spread are known, compute the output weights:

  [w_1 .. w_s \; b]^T = G^+ [d(1) .. d(N)]^T, with G = \begin{bmatrix} f_1(u(1)) & .. & f_s(u(1)) & 1 \\ .. & .. & .. & .. \\ f_1(u(N)) & ... & f_s(u(N)) & 1 \end{bmatrix} \in R^{N x (s+1)}.

Challenge: what centers to choose?

If the centers are known, the output weights can be computed in a single step.

- generalization = interpolation

Comparison between MLP and RBF:

  MLP                                         | RBF
  --------------------------------------------|----------------------------------------------
  Any number of hidden layers                 | One hidden layer
  Input operator = scalar product             | Input operator = Euclidean distance
  Nonlinearity in terms of all the neural     | Linearity in terms of the output parameters,
  parameters                                  | if fixed centers and spreads are assumed
  Large training time required                | Small training time if the centers are known
                                              | (useful for on-line training)
  Global action                               | Local action
  Fewer parameters for the same degree of     |
  accuracy (usually)                          |

Learning strategies

1. Random centers selection

Step 1. Choose the centers randomly (uniformly distributed over the input range).

Step 2. Compute the spread \sigma = \frac{d_{max}}{\sqrt{2s}},
with
  d_{max} = the maximum distance between the selected centers,
  s = the number of hidden neurons.

Step 3. Compute the weights and the bias.

2. Centers self-organization
Step 1. The centers are chosen via the clustering of the training input samples (e.g. K-means clustering).
K-means clustering (tip: learning via competition):
  Step 1-0: Choose random, distinct initial values for all the s centers, denoted c_i(n), with n = 0 and
            i = 1..s.
  Step 1-1: For the training sample u(n), compute ||u(n) - c_i||, i = 1..s and find the minimum
            distance, which indicates the nearest center for this sample. Consider i*, with i* \in 1..s, the
            nearest center.
  Step 1-2: Update the nearest center, moving it towards the sample:
            c_{i*}(n+1) = c_{i*}(n) + \eta [u(n) - c_{i*}(n)], with 1 > \eta > 0.
  Step 1-3: n <- n+1.
  Step 1-4: If some training samples have not been used yet, or the change made at step 1-2 is too
            large, go to step 1-1.

Drawback: the result depends on the initial values.

Step 2. Compute the spread \sigma = \frac{d_{max}}{\sqrt{2s}},
with
  d_{max} = the maximum distance between the selected centers,
  s = the number of hidden neurons.

Step 3. Compute the weights and the bias.
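A minimal numpy sketch of this strategy (competitive center updates, then the spread
formula, then the pseudo-inverse for the output weights); all rates are illustrative:

import numpy as np

def train_rbf(U, d, s, eta=0.1, sweeps=20, rng=np.random.default_rng(0)):
    C = U[rng.choice(len(U), s, replace=False)].astype(float)   # step 1-0
    for _ in range(sweeps):                                     # steps 1-1 .. 1-4
        for u in U:
            i = np.argmin(np.linalg.norm(u - C, axis=1))        # nearest center
            C[i] += eta * (u - C[i])                            # move it towards u
    d_max = max(np.linalg.norm(a - b) for a in C for b in C)    # step 2
    sigma = d_max / np.sqrt(2 * s)
    G = np.exp(-((U[:, None, :] - C[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    G = np.hstack([G, np.ones((len(U), 1))])                    # bias column
    w = np.linalg.pinv(G) @ d                                   # step 3
    return C, sigma, w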

3. Supervised centers selection

The parameters of the RBF are adapted by using error correction (an LMS-type algorithm).
Let us consider batch learning, with the criterion:

  I = \frac{1}{2} \sum_{j=1}^{N} e(j)^2 = \frac{1}{2} \sum_{j=1}^{N} \left[ d(j) - \sum_{i=1}^{s} w_i f\left(\frac{||u(j) - c_i||}{\sigma_i}\right) \right]^2:

 convex in terms of the weights;

 non-convex in terms of the centers (the centers optimization can lock in local optima).

For Gaussian activation functions:

  I = \frac{1}{2} \sum_{j=1}^{N} \left[ d(j) - \sum_{i=1}^{s} w_i \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right) \right]^2.

At each iteration, the parameters of the RBF (weights, centers, spreads) are updated according to the
following rules:

- for weights:

  w_i <- w_i - \eta_1 \frac{\partial I}{\partial w_i}, with

  \frac{\partial I}{\partial w_i} = -\sum_{j=1}^{N} e(j) \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right);

- for centers:

  c_i <- c_i - \eta_2 \frac{\partial I}{\partial c_i}, with

  \frac{\partial I}{\partial (c_i)_k} = -\frac{w_i}{\sigma_i^2} \sum_{j=1}^{N} e(j) \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right) [u(j)_k - (c_i)_k],

  with (c_i)_k, u(j)_k indicating the k-th component of the vectors c_i, i = 1..s and u(j), j = 1..N
  (having the length m);

- for spreads:

  \sigma_i <- \sigma_i - \eta_3 \frac{\partial I}{\partial \sigma_i}, with

  \frac{\partial I}{\partial \sigma_i} = -\frac{w_i}{\sigma_i^3} \sum_{j=1}^{N} e(j) \exp\left(-\frac{||u(j) - c_i||^2}{2\sigma_i^2}\right) ||u(j) - c_i||^2.
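One batch gradient step implementing the three rules above (Gaussian hidden neurons with
per-neuron spreads) can be sketched as follows; the learning rates are illustrative:

import numpy as np

def rbf_gradient_step(U, d, w, C, sig, etas=(0.01, 0.01, 0.01)):
    dist2 = ((U[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # N x s squared distances
    Phi   = np.exp(-dist2 / (2 * sig ** 2))                  # hidden outputs, N x s
    e     = d - Phi @ w                                      # output errors
    A     = Phi * e[:, None]                                 # e(j) * phi_i(u(j))
    w   += etas[0] * A.sum(axis=0)                           # weights rule
    C   += etas[1] * (w / sig ** 2)[:, None] * (A.T @ U - A.sum(0)[:, None] * C)
    sig += etas[2] * (w / sig ** 3) * (A * dist2).sum(axis=0)  # spreads rule
    return w, C, sig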

4. Constructive algorithm

- insert the hidden neurons sequentially: the center vector copies the input training sample that produces
the highest output squared error for the current architecture
>> see MATLAB

CHAPTER 4. NEURO-GENETIC SYSTEMS

= evolutionary artificial neural networks or neuro-genetic systems

ANN:
 robustness
 capacity of inductive learning (supervised or unsupervised)
 high computational capacity
 parallelism
+
GA:
 robustness, flexibility
 scarce a priori information required concerning the objective function

The symbiosis => higher adaptation capacity.

Classification in terms of the interaction provided between GA and ANN:

- supportive:
  reduced cooperation between GA and ANN.
  The methods are applied sequentially and separately, considering two distinct sub-problems, or are independently used for solving the same problem.

- collaborative:
  strong cooperation between GA and ANN.
  These combinations exploit more advantageously the merits of the involved
  techniques.
4.1. Supportive neuro-genetic systems

They involve weak cooperation between GA and ANN.
One technique assumes the leader role and the other one the secondary role,
or both techniques are used for solving the same problem.

A. GA and ANN used for solving the same problem

The solutions provided by GA and ANN are used in parallel - this redundancy can be useful for
diagnosis systems.

B. GA - primary role, ANN - secondary role

 The ANN helps in generating the initial population.
  The ANN delivers additional information concerning the feasible space (e.g. after a
  certain classification of feasible-unfeasible samples).
  50% of the initial population is generated randomly, 50% using the ANN.

C. ANN - primary role, GA - secondary role

C1. GA used for preparing the input data for the neural classifiers:

 feature (input) selection:
  Aim: improve the recognition rate and the execution times via the selection of a few
  relevant features.

  Assuming binary encoding, a locus can indicate the use/absence of a feature.
  The drawbacks result from the fact that the method involves a large computational time,
  as the evaluation of each chromosome demands training the corresponding classifier.

o Chang & Lippman obtained an 80% reduction of the features in a voice recognition problem.

o Guo & Uhrig designed a diagnosis system based on neural observers for a nuclear plant. The
  GA decided which are the inputs of each observer (given a large set of thousands of available
  variables).

  Aim: the ANN should have few inputs, and should be precise, so the objective function can be
  defined as follows:

  f(x) = \left( e^{(z-1)} \right)^{0.7\sqrt[3]{t+1}} \cdot \left( 1 - e^{-0.01\,err} \right)^{0.15(t+1)}, where

  x denotes the chromosome which has to be evaluated,

  z = \frac{\text{no. of variables}}{\text{no. of selected variables}},

  t \in (1, NR\_MAX\_GEN) denotes the number of the generation,

  err denotes the error of the ANN computed at the end of the training stage.

  Assuming binary encoding, 1/0 indicates the use/the absence of the corresponding
  plant variable.
  A similar problem was solved by Weller.

 input space transformation:
  The GA is used for selecting scaling and/or rotation parameters.
  These transformations are meant to ensure a better separation of the classes (smaller distances
  between the samples of the same class, larger distances between the samples belonging
  to distinct classes).

 training data set configuration:
  The training samples are chosen from a large database.
  A chromosome specifies the samples to be used and the sequence in which they have to
  be delivered during training.
  If too few measurements are accessible, Cho & Cha suggest the genetic production of
  virtual samples. In order to evaluate each resulted training data set, this set must be used
  for explicitly training the ANN.

C2. GA used for setting the parameters of the training rules.
  e.g.: the learning rate involved by the back-propagation algorithm, or the coefficients
  used in other adaptation rules (Chalmer, Bengio).

C3. GA used for analyzing the ANN behavior.
  Some explanations concerning the behavior of the ANN result by depicting the regions of
  the input space which correspond to the minimum, the maximum and the threshold
  values of the output neurons.
  To this end, a chromosome encodes an input vector and the objective function can be
  defined as:

  f(x) = ||y - y_d||^2, where

  x denotes a chromosome,
  y represents the neural output corresponding to x,
  y_d indicates the target output (minimum, maximum, threshold).

4.2. Collaborative neuro-genetic systems

o Strong cooperation between GA and ANN is considered.

o The symbiosis leads to a better adaptation capability.

o GA used for:
  training the ANN
  and/or
  selecting the ANN topology
  >> better accuracy and better generalization capabilities.

A. GA used for training

 Unlike gradient-based training, GA learning is robust and reduces the risk of stagnation
  in local optima.

 Genetic training can also be used for ANNs with non-differentiable activation functions or
  recurrent connections.

 Genetic training involves large computational times; however, a better convergence speed
  can be achieved via hybridization with local optimizations.

The neural topology is known:

 A chromosome encodes the whole set of parameters.

 One can use binary or float encoding.

 Usually, the objective function is the mean output error computed for the whole
  training data set.

Competing conventions
- The crossover produces offspring less adapted than their parents.

o If the genetic sub-chains corresponding to the hidden neurons are permutated, the
  functionality of the encoded ANN remains the same, yet the genotype is changed
  significantly (see the sketch below)
  (any hidden neuron is represented by a sub-chain encoding its parameters).
o If the parents encode similar ANNs in different genetic strings, the offspring can be
  significantly less adapted.
o If the offspring are implicitly inserted in the population of the next generation, then the
  convergence speed is dramatically altered.
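A tiny numpy demonstration of the first point: permuting the hidden neurons (the rows of
W1 together with the corresponding columns of W2) leaves the network function unchanged,
although the flattened genotype differs; all values are illustrative:

import numpy as np

W1 = np.array([[0.5, -0.2], [1.0, 0.3]])    # 2 hidden neurons (input + bias weights)
W2 = np.array([[0.7, -0.4]])                # linear output neuron
W1p, W2p = W1[[1, 0], :], W2[:, [1, 0]]     # swap the two hidden neurons

u = np.array([0.8, 1.0])                    # input extended with 1 for the bias
print(W2 @ np.tanh(W1 @ u), W2p @ np.tanh(W1p @ u))  # identical network outputs
print(np.array_equal(W1.ravel(), W1p.ravel()))        # False: different genotypes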

Ideas for reducing/avoiding the risk of competing conventions:

- use mutation only (no crossover, or a small p_c).
  Saravanan & Fogel use Gaussian mutation:
  each individual generates an offspring & tournament-based selection (k=10) & (N + N)
  insertion.

- use special crossovers (Hancock).
  Radcliffe uses similitude-based crossovers - similar blocks of the parents are sent
  unchanged to the offspring.
  The similitude can be evaluated by means of the Hamming distance (for binary encoding) or by
  using the rate of identical parameters. Before crossover, the neurons are re-sorted within the
  chromosome.
  Hancock analyzed empirically the performances of Radcliffe's crossover and of the multipoint
  crossover. The results indicated very good results for the multipoint crossover, too (even
  better results, if the selection pressure is high).

 Cascade correlation (CC) algorithm:

  o The algorithm starts with a simple structure including an input and an output layer.
  o The ANN is trained subject to the minimization of the output squared error.
  o If the accuracy is inappropriate, a new sigmoidal hidden neuron is introduced. Its
    input links are coming from all the neural inputs and from all the existing hidden
    neurons. Their weights are adapted via maximizing the covariance between the
    squared output error and the output of the new hidden neuron.
  o The output of the new hidden neuron becomes input for the output neuron(s). The
    weights of these new connections are computed via minimizing the output squared
    error.
  o As a single neuron is inserted at each stage, competing conventions cannot occur.

 Genetic version of CC:

  step 1: Initialize the minimal neural topology (2 layers: input and output).
  step 2: Adapt the weights, for N_ep epochs, by means of genetic learning.
  step 3: Test if the ANN accuracy is convenient, E < E_0 (E = output squared error):
          Yes - go to 8.
          No - continue with 4.
  step 4: Insert a new hidden neuron, denoted with N.
          C_1 = the set of N's input connections (coming from the neural inputs and the other hidden
          neurons),
          C_2 = the set of N's output connections.
          Initialize the weights of these links with random values close to 0.
  step 5: Adapt the weights of C_1 and the bias of N, by maximizing the covariance between
          the output of the new hidden neuron and the squared output error; the genetic
          algorithm is applied for N_ep_1 epochs (let C denote the best objective value).
  step 6: Adapt the weights of C_2, by minimizing the output squared error; the genetic
          procedure is applied for N_ep_2 epochs.
  step 7: Go to step 3.
  step 8: Stop.

Advantages:
- the parameters of a single neuron are trained at each stage;
- the algorithm constructs the neural topology too (without genetic techniques);
- the hybridization CC-GA allows the selection of simpler topologies, at the cost
  of an increased computational time.

Improvements of CC:
o Potter:
  The weights of C_2 are found by selecting the additive values belonging to (0, -C),
  which correct the best adapted individuals obtained at the precedent neuron
  insertion.

o Chen designs the RBF via the genetic selection of the spreads:

  f(x) = err + g^T g, where

  err denotes the output squared error,
  g is the vector of weights,
  [.]^T indicates transposition.

  Centers determined by means of Orthogonal Least Squares (OLS),
  Weights determined by means of Least Squares.

Other genetic training algorithms

 Hung & Adeli: two-stage training:
  - first stage: apply genetic training;
  - second stage: apply the conjugated gradient, using as initial point the solution delivered at the first stage.

 Topalov:
  - apply genetic training; whenever the GA stagnates, switch to back-propagation.

 Tsinas & Dachwald:
  - multiple sequences of training, each one consisting of genetic training followed by back-propagation; the maximum
    number of commutations (sequences) is preset.

 Ng:
  - apply back-propagation; if the output squared error is too big and its variation during the previous epochs is
    insignificant, then a GA is used to guide the search far away from the local optimal point.
    The GA aims at the minimization of the output squared error and uses Gaussian mutation.
    A large stagnation threshold permits too long stagnations, whilst a small one can generate false alarms.

 Ku:
  - train the recurrent neural networks by means of diffused GAs.
    The chromosomes are organized according to a matrix-based topology.
    Crossover acts between neighbors, only.

B. GA employed for neural topology selection

 The ANN structure influences the overall performances:
  o Too simple: low accuracy.
  o Too complex: longer training and evaluation; lower expected
    generalization capability.

 Two distinct standpoints can be considered:

  Improvement of the ANN learning abilities - shown by: accuracy, speed of training,
  generalization capacity.
  Main challenge: is a certain topology appropriate for learning a specific functionality?

  Better understanding of the neural representation - find ways for modeling symbolic
  knowledge.

 Most researches are focused on the first direction.

 GA can provide a more flexible selection of the neural topologies
  (it can consider any topology).
  GA can explore large and multimodal search spaces.
  No a priori information regarding the searching trajectory is required.

 Traditional algorithms (constructive or destructive)
  search within a limited area, only.

 To evaluate a neural architecture, a convenient set of parameters must be associated
  (e.g. by using a non-evolutionary training algorithm).
  The number of training epochs carried out for evaluating the individuals must be as
  low as possible.

 Expected noise:
  - the performances of the ANN will be influenced by the initial random values
    of the neural parameters;
  - the performances of the ANN will be influenced by the training algorithm;
    to eliminate this drawback, the GA can also work on the neural parameters.

Problem: find an appropriate encoding of the neural architecture.

The encoding must be compliant with:

 Correctness:
  allow a simple verification of the correctness of the encoded topology;
  the genetic operators must produce individuals encoding correct topologies.

 Sensitivity in terms of genetic operators:
  control the impact of the genetic operators (e.g. acting on the links, on specific inner
  structures, etc.).

The methods can be divided in direct and indirect encoding.

B1. Methods based on direct encoding of the neural topologies

 The strategy is appropriate for ANNs with few neurons and layers.
 The chromosomal encoding is devoted to a specific ANN.
 Most applications consider the MLP.
 Possible chromosomal encodings:
  matrix-based (e.g. the chromosomes encode the matrix of connections);
  vector-based (e.g. the chromosomes result by reshaping the matrix of connections);
  tree/graph-based (e.g. the chromosomes encode the tree of connections).
 Compliant genetic operators must be designed.

 Hybridization with non-evolutionary local optimizations:
  usually, in a Lamarckian manner (= compute/improve the
  neural parameters + store the new parameters).

Ideas for reducing/avoiding the risk of competing conventions:

- crossover rarely used or avoided.
  Maniezzo: recommends the use of crossover (for preserving the diversity inside
  the population).
  Angeline, Lee: use mutation only; mutation can act on the structure and on the
  parameters:
   structural mutation: introduce a neuron or a link, delete a neuron or a link.
    Immediately after insertion, the neurons are not connected with the rest of the
    ANN, the new connections being added by successive applications of mutation
    (small changes are allowed, only).

 Pujol & Poli:
  Use a dual chromosomal representation: matrix-based and vector-based.
  Produce offspring on both encodings >> good diversity.

Examples:
 Braun & Zagorski: consider a term describing the complexity of the encoded
  ANN within the objective function.
  Improved genetic operators: e.g. the neurons which are deleted are stored for
  further potential insertions.
 Dasgupta, Mann: hierarchical encoding - the higher levels include control genes,
  the leaves correspond to parametric genes.
  Changes performed within the upper levels correspond to significant alterations
  of the encoded neural architecture. Changes performed within the lower levels
  correspond to less significant alterations of the encoded neural architecture.
 Thierens: use a canonical representation which eliminates the effects produced
  by the symmetries of the activation functions, the permutations of neurons/links,
  etc. To this end, several transformations are made.
  Negative biases are changed to positive ones, and additionally the sign of all
  the corresponding incoming weights is also changed.
  Then, the neurons are sorted in terms of bias.

 Sato & Nagaya, Sato & Ochiai: use matrix-based encoding for evolving the
  neural architecture and the neural parameters, for ANNs with binary weights.
 Romaniuk: genetic CC for selecting the architecture of neural classifiers.
  o Recurrent ANNs with sigmoidal activation functions are considered.
  o The algorithm starts with a simple structure and adds new structures which
    are genetically configured. The blocks which were already inserted remain
    unchanged at the next steps.
  o Additionally, the resulted topologies are simplified by deleting the less
    important links.
  o The significance of a link is established in the following manner: sequentially,
    each connection is deleted and the response of the ANN is evaluated for all
    the input training samples, and the newly resulted faults are counted. Small
    counters indicate insignificant links.
 Liu & Yao: select the architecture and the parameters of generalized neural
  networks.
  o These ANNs include both sigmoidal and Gaussian neurons.
B2. Methods based on indirect encoding of the neural architectures

- may involve a more expensive evaluation,
- may lead to shorter chromosomes.
Recommended for ANNs with many neurons and layers, featuring structural
regularities.

Main categories of indirect encodings:

 parametric:
  Encode the parameters which describe the architecture, such as: the number of layers, the
  number of neurons within a layer, the type of accepted connections. Usually, this encoding
  refers to a limited number of possible topologies.

 developmental (grammar-based):
  The chromosome encodes a sequence of actions which allows the ANN generation, not the
  ANN itself. The actions are described via a predefined grammar.

 Gruau: cell-based encoding.

  o The algorithm was used for ANNs with binary and float weights. Good results were
    obtained for symmetric ANNs.

  o The neural architecture is built via a cellular division process. The algorithm starts with a root
    cell. Each cell possesses internal registers for storing the weights and the bias. The proposed
    language indicates the following actions: cellular division, the transformation of a cell into a
    neuron, choosing the values of the neural parameters, including delays, including recurrence.

  o A chromosome corresponds to the simplest architecture within a family of topologies.
    The members of the family can be obtained by recurrently adding the structural blocks.
    For the first member of the family, the recurrence is not activated. For the second member, the
    recurrence is activated only once, etc.

  o A chromosome is evaluated by considering the first k members of the family, for which the
    topology p+1 gives better results than the topology p, p = 1..k-1, and the topology k+1 is
    worse than the topology k. The objective function is equal to the sum of the output squared errors
    corresponding to the first k members.

  o The genetic operators act solely on the first member of the family. Both crossover and mutation
    can be used. Higher probabilities are assigned for changing a symbol to recurrence or vice versa.
    The individuals can be improved by Lamarckian local optimization.

Another classification of the genetic approaches devoted to neural architecture
selection
- how the chromosomes are used for generating the neural architecture:

A. Each chromosome encodes a single neural architecture.

 The result of the algorithm is usually considered the architecture of the best
  adapted individual found during the evolutionary loop or in the final population.

 Yao & Liu produce the delivered neural architecture by combining the genetic
  material of the individuals included in the last population.
  o Various types of combinations were suggested.
  o The resulted ANN has better performances, at higher computational costs.

B. The whole population forms a single neural architecture.

 In this case, the individuals are competitors, yet they must also cooperate.

 Smith & Cribbs: ANN with binary weights and hard limiter activation functions.
  - the population includes structural blocks which need to be aggregated in order to
    build the ANN (a chromosome encodes a structural block);
  - when the population contains multiple copies of an individual, a single copy is
    used within the neural structure;
  - the output weights are computed with Widrow-Hoff (non-genetically).

  [Figure: the blocks encoded by the chromosomes NNCrom1, NNCrom2, ..., NNCromN,
  improved with the GA, form the hidden part of the network; the output layer weights
  are set without the GA.]

 Fitness is computed as follows.
  For each training sample:
  - if the response of the ANN is correct, then all the n_1 chromosomes having the output 1 are awarded
    with the fitness 1/n_1; all the n_2 chromosomes having the output 0 are awarded with the fitness
    1/n_2;
  - if the response is incorrect, then all the n_3 chromosomes having the output 1 are awarded with the
    fitness -1/n_3; all the chromosomes having the output 0 cannot participate to the error correction,
    therefore their fitness will not be changed.

  A high fitness is assigned to a structural block which is useful for many samples or which is
  the main contributor for specific samples.

REFERENCES
Affenzeller, M., Winkler, S., Wagner, S., Beham, A. (2009). Genetic Algorithms and
Genetic Programming - Modern Concepts and Practical Application. Boca Raton, FL:
CRC Press, 157-207.
Angeline, P. J., Saunders, G. M., Pollack, J. B. (1994). An Evolutionary Algorithm that
Constructs Recurrent Neural Networks. IEEE Transactions on Neural Networks, 5 (1),
54-65.
Ashlock, D. (2006). Evolutionary Computation for Modeling and Optimization.
Springer, New York.
Baluja, S. (1996). Evolution of an Artificial Neural Network Based Autonomous Land
Vehicle Controller. IEEE Transactions on Systems, Man and Cybernetics - part B, 26
(3), 450-463.
Bäck, T., Fogel, D., Michalewicz, Z. (2000). Evolutionary Computation 2. Advanced
Algorithms and Operators. Institute of Physics Publishing, USA.
Barton, A. J., Valdés, J. J., Orchard, R. (2009). Neural networks with multiple general
neuron models: A hybrid computational intelligence approach using Genetic
Programming. Neural Networks, 22, 614-622.
Bengio, S., Bengio, Y., Cloutier, J. (1994). Use of Genetic Programming for the Search
of a New Learning Rule for Neural Networks. Proc. of the Conference on Evolutionary
Computation, USA, 324-327.
Benuskova, L., Kasabov, N. (2007). Computational Neurogenetic Modeling. Springer, New
Zealand.
Bonarini, A., Masulli, F., Pasi, G. (2003). Soft Computing Applications (Advances in Soft
Computing). Physica-Verlag, Heidelberg.
Braun, H., Zagorski, P. (1994). ENZO II - a Powerful Design Tool to Evolve Multi-layer
Feed Forward Networks. Proc. of the Conference on Evolutionary Computation, USA, 278-283.
Coello Coello, C. A., Lamont, G. B., Van Veldhuizen, D. A. (2007). Evolutionary
Algorithms for Solving Multiobjective Problems, 2nd Edition. New York, NY: Springer,
50-150.
Da Ruan (1997). Intelligent Hybrid Systems. Kluwer Academic Publishers, USA.
De Jong, K. A. (2006). Evolutionary Computation - A Unified Approach. Cambridge,
MA: MIT Press.
DiMattina, C. (2010). How to Modify a Neural Network Gradually Without Changing
Its Input-Output Functionality. Neural Computation, 22, 1-47.
Dumitrache, I., Buiu, C. (1995). Introduction to Genetic Algorithms. Ed. Politehnica,
Bucuresti, Romania.
Ferariu, L. (2005). Algoritmi evolutivi in identificarea si conducerea sistemelor.
Politehnium, Iasi, Romania.
Ferariu, L. (2010). Sisteme neurogenetice. Politehnium, Iasi, Romania.
Fleming, P. J., Purshouse, R. C. (2002). Evolutionary algorithms in control systems
engineering: a survey. Control Engineering Practice, 10, 1223-1241.
Fogel, D. (2006). Evolutionary Computation - Toward a New Philosophy of Machine
Intelligence, 3rd Ed. Piscataway, NJ: IEEE Press.
Gruau, F. (1993). Genetic Synthesis of Modular Neural Networks. Genetic Algorithms
Proc., 312-317.
Haykin, S. (2009). Neural Networks and Learning Machines, 3rd Edition. Prentice Hall,
USA.
Knowles, J., Corne, D., Deb, K. (Eds.) (2008). Multiobjective Problem Solving from
Nature - From Concepts to Applications. Pondicherry, India: Springer, 131-154.
Purshouse, R., Fleming, P. (2006). On the Evolutionary Optimization of Many
Conflicting Objectives. IEEE Transactions on Evolutionary Computation, 11 (6), 770-784.
Smith, R. E., Brown Cribbs III, H. (1997). Combined Biological Paradigms: A Neural,
Genetic-based Autonomous System Strategy. Robotics and Autonomous Systems, 22,
65-74.